# Regularized Regression Problem

We are given a set of $m = 300$ observations, each composed of 27 *features* $(a_i)$ describing a student's characteristics and a target value $(b_i)$ representing the student's final grade. In this lab, the *squared 2-norm* is used as the loss, which gives the following optimization problem
\begin{align*}
\min_{x\in\mathbb{R}^d } l(x) := \frac{1}{m}  \sum_{i=1}^m (\langle a_i,x \rangle - b_i)^2  = \frac{1}{m} \|Ax-b\|_2^2 .
\end{align*}
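
As a quick sanity check of the identity above, the following sketch evaluates the loss both as an average of squared residuals and as a scaled squared norm. The arrays `A_demo`, `b_demo`, and `x_demo` are random stand-ins introduced only for illustration; they are not the student data.

```python
import numpy as np

rng = np.random.default_rng(0)
m_demo, d_demo = 300, 27                  # same shape as the lab data, random values
A_demo = rng.standard_normal((m_demo, d_demo))
b_demo = rng.standard_normal(m_demo)
x_demo = rng.standard_normal(d_demo)

# Average of the squared residuals <a_i, x> - b_i
loss_sum = sum((A_demo[i] @ x_demo - b_demo[i]) ** 2 for i in range(m_demo)) / m_demo

# Same quantity written as (1/m) * ||Ax - b||_2^2
loss_norm = np.linalg.norm(A_demo @ x_demo - b_demo) ** 2 / m_demo

print(loss_sum, loss_norm)                # the two values agree up to rounding
```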


In previous labs, we considered:
- $\ell_2$ regularization to prevent overfitting. The whole function was smooth and thus gradient algorithms were efficient. 
- $\ell_1$ regularization to promote sparsity of the iterates. After separating the smooth and non-smooth parts of the objective, we used the proximal gradient algorithm.
- efficient libraries for solving linear and quadratic programs.

**Objective of this lab:** We introduce a new regularizer, the *group sparsity norm*, which has an explicit proximal operator but means the problem is no longer a QP. We will combine our previous knowledge of proximity operators and efficient QP solving with new knowledge about splitting algorithms (such as ADMM) to solve this problem efficiently.

#### Group Sparsity ($\ell_1/\ell_2$ norm)

As a regularizer, we introduce the group sparsity norm (also called the $\ell_1/\ell_2$ norm), defined as the sum of the (non-squared) 2-norms of groups of coordinates:
$$ g(x) = \sum_{g\in\mathcal{G} } \| x_g\|_2 $$
where, for a group of indices $g$, $x_g = [x_i]_{i\in g}$.
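
To make the definition concrete, here is a minimal sketch that evaluates the group sparsity norm on a small vector. The vector `x_demo` and the 0-based groups `groups_demo` are hypothetical and unrelated to the dataset.

```python
import numpy as np

x_demo = np.array([3.0, 4.0, 0.0, 0.0, 1.0])
groups_demo = [[0, 1], [2, 3], [4]]       # hypothetical 0-based groups

# Sum of the (non-squared) 2-norms of each group of coordinates
group_norm = sum(np.linalg.norm(x_demo[g]) for g in groups_demo)

print(group_norm)                         # 5 + 0 + 1 = 6.0
```

With singleton groups this quantity reduces to the $\ell_1$ norm, and with a single group containing all coordinates it is the plain 2-norm.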

For instance, in our *student performance* dataset, we can group the features as follows:
1. $(1,2)$ physical characteristics.
2. $(3,4,5,6,7,8)$ home environment.
3. $(9,10,11,12,13)$ studying habits.
4. $(14,15,16)$ misc.
5. $(17,18,19,20,21,22,23,24)$ social.
6. $(25,26,27)$ attendance and grades.




### Full problem 


The full learning problem for this lab reads:



\begin{align*}
\min_{x\in\mathbb{R}^d } F(x) := \underbrace{ \frac{1}{m} \|Ax-b\|_2^2 }_{f(x)} + \underbrace{\lambda_g \sum_{g\in\mathcal{G} } \| x_g\|_2}_{g(x)}.
\end{align*}


### Meaning of the features

The dataset comprises $27$ features, described below, and the goal is to predict whether a student will pass the year. It is therefore important to identify which features matter most for student success. We will see how regularization helps with this.

### Function definition 

In [1]:
import numpy as np
import csv

#### File reading
dat_file = np.load('student.npz')
A = dat_file['A_learn']
b = dat_file['b_learn']
m = b.size


A_test = dat_file['A_test']
b_test = dat_file['b_test']
m_test = b_test.size


d = 27 # features
n = d+1 # with the intercept

lamG = 0.5 # regularization weight (best value: 0.5)

### Groups

In [None]:
G = [[ 1,2],
     [3,4,5,6,7,8],
     [9,10,11,12,13],
     [14,15,16],
     [17,18,19,20,21,22,23,24],
     [25,26,27]] 

n_g = len(G)

## Oracles


> The prox operator for function $f$ has to be completed in `f_prox` using `cvxopt`'s QP solver.

> The prox operator for function $g$ is given.
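
One possible way to set up the QP is sketched here, assuming the usual convention $\mathrm{prox}_{\gamma f}(x) = \arg\min_u \{ f(u) + \frac{1}{2\gamma}\|u-x\|_2^2 \}$ (the lab may use a different scaling). Expanding the squares and dropping constants gives the form $\frac{1}{2}u^\top P u + q^\top u$ with $P = \frac{2}{m}A^\top A + \frac{1}{\gamma}I$ and $q = -\frac{2}{m}A^\top b - \frac{1}{\gamma}x$. The helper `f_prox_closed_form` and the random data below are hypothetical. Because the QP is unconstrained, the sketch solves $Pu = -q$ directly with NumPy rather than calling a QP solver.

```python
import numpy as np

def f_prox_closed_form(x, gamma, A_mat, b_vec):
    """Prox of gamma*f at x, for f(u) = ||A_mat u - b_vec||^2 / m (assumed convention)."""
    m_rows, n_cols = A_mat.shape
    # Quadratic and linear terms of (1/2) u^T P u + q^T u
    P = 2.0 / m_rows * A_mat.T @ A_mat + np.eye(n_cols) / gamma
    q = -2.0 / m_rows * A_mat.T @ b_vec - x / gamma
    # The unconstrained minimizer satisfies P u = -q
    return np.linalg.solve(P, -q)

rng = np.random.default_rng(1)
A_demo = rng.standard_normal((50, 5))     # random stand-in data
b_demo = rng.standard_normal(50)
x_demo = rng.standard_normal(5)
gamma_demo = 0.7

u_demo = f_prox_closed_form(x_demo, gamma_demo, A_demo, b_demo)

# Optimality check: gradient of f(u) + ||u - x||^2 / (2*gamma) should vanish at u_demo
grad = 2.0 / 50 * A_demo.T @ (A_demo @ u_demo - b_demo) + (u_demo - x_demo) / gamma_demo
print(np.linalg.norm(grad))
```

The same `P` and `q` could be handed to `cvxopt` after conversion with `matrix(P)` and `matrix(q)`, since `solvers.qp` expects `cvxopt` matrices rather than NumPy arrays.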

In [None]:
from cvxopt import matrix, solvers

def f(x):
    return np.linalg.norm(np.dot(A,x)-b)**2/float(m)

def f_prox(x,gamma):
    # solvers.qp minimizes (1/2) u^T P u + q^T u; P and q must be cvxopt `matrix` objects
    P = 0.0 # TODO
    q = 0.0 # TODO
    
    sol=solvers.qp(P,q) 
    p = sol['x']
    return np.squeeze(np.array(p))
    

In [None]:
def g(x):
    # Sum of the 2-norms of each group, weighted by lamG
    total = 0
    for i in range(n_g):
        total += np.linalg.norm(x[G[i]],2)
    return lamG*total

def g_prox(x,gamma):
    # Block soft-thresholding: shrink each group's norm by lamG*gamma, or set the group to zero.
    # (A norm is never negative, so no "norm < -lamG*gamma" case is needed.)
    p = np.copy(x)
    for i in range(n_g):
        norm_g = np.linalg.norm(x[G[i]],2)
        if norm_g > lamG*gamma:
            p[G[i]] = x[G[i]]*(1 - lamG*gamma/norm_g)
        else:
            p[G[i]] = np.zeros(len(G[i]))
    return p
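
The effect of this proximal operator (block soft-thresholding) can be seen on a small example. The sketch below is self-contained: `group_soft_threshold`, `groups_demo`, and the threshold `tau` (which plays the role of `lamG*gamma`) are hypothetical names used only for illustration.

```python
import numpy as np

def group_soft_threshold(x, groups, tau):
    """Shrink each group's 2-norm by tau; groups with norm <= tau are set to zero."""
    p = np.copy(x)
    for g in groups:
        norm_g = np.linalg.norm(x[g])
        if norm_g > tau:
            p[g] = x[g] * (1.0 - tau / norm_g)
        else:
            p[g] = 0.0
    return p

x_demo = np.array([3.0, 4.0, 0.1, 0.1, 2.0])
groups_demo = [[0, 1], [2, 3], [4]]       # hypothetical 0-based groups

p_demo = group_soft_threshold(x_demo, groups_demo, tau=1.0)
print(p_demo)                             # [2.4 3.2 0.  0.  1. ]
```

The first group has norm 5 and is scaled by $1 - 1/5$, the second has norm below the threshold and is zeroed as a whole, and the third is scaled by $1 - 1/2$. Zeroing entire groups at once is what makes this regularizer select groups of features rather than individual ones.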

In [None]:
def F(x):
    return f(x) + g(x)
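
To illustrate how the two proximal operators can be combined, here is a minimal sketch of ADMM in its standard scaled form, applied to the splitting $\min f(x) + g(z)$ subject to $x = z$. It runs on synthetic data and uses its own helper functions. All names, the penalty `rho`, and the iteration count are assumptions, and the variant intended in this lab may differ.

```python
import numpy as np

def prox_least_squares(v, gamma, A_mat, b_vec):
    """argmin_u ||A_mat u - b_vec||^2 / m + ||u - v||^2 / (2 * gamma)."""
    m_rows, n_cols = A_mat.shape
    P = 2.0 / m_rows * A_mat.T @ A_mat + np.eye(n_cols) / gamma
    return np.linalg.solve(P, 2.0 / m_rows * A_mat.T @ b_vec + v / gamma)

def prox_group_norm(v, tau, groups):
    """Block soft-thresholding with threshold tau on each group."""
    p = np.copy(v)
    for g in groups:
        norm_g = np.linalg.norm(v[g])
        p[g] = v[g] * (1.0 - tau / norm_g) if norm_g > tau else 0.0
    return p

def admm_sketch(A_mat, b_vec, groups, lam, rho=1.0, n_iter=500):
    n_cols = A_mat.shape[1]
    x, z, u = np.zeros(n_cols), np.zeros(n_cols), np.zeros(n_cols)
    for _ in range(n_iter):
        x = prox_least_squares(z - u, 1.0 / rho, A_mat, b_vec)  # step on f
        z = prox_group_norm(x + u, lam / rho, groups)           # step on g
        u = u + x - z                                           # scaled dual update
    return x, z

def objective(x, A_mat, b_vec, groups, lam):
    fit = np.linalg.norm(A_mat @ x - b_vec) ** 2 / A_mat.shape[0]
    return fit + lam * sum(np.linalg.norm(x[g]) for g in groups)

# Synthetic problem (not the student data): the middle group has no influence on b
rng = np.random.default_rng(0)
groups_demo = [[0, 1], [2, 3], [4, 5]]
x_true = np.array([2.0, -1.0, 0.0, 0.0, 1.5, 0.5])
A_demo = rng.standard_normal((60, 6))
b_demo = A_demo @ x_true + 0.1 * rng.standard_normal(60)

x_admm, z_admm = admm_sketch(A_demo, b_demo, groups_demo, lam=0.5)
print(z_admm)
print(objective(z_admm, A_demo, b_demo, groups_demo, 0.5))
```

At convergence `x_admm` and `z_admm` coincide. The iterate `z_admm` is the output of the group prox, so it is the one in which whole groups can be exactly zero.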


## Prediction Functions

They are given below:

In [None]:
import numpy as np 

def prediction_train(w,PRINT):

    pred = np.dot(A,w)
    perf = 0
    
    for i in range(A.shape[0]):
        if PRINT:
            print("True grade: {:d} \t-- Predicted: {:.1f} ".format(int(b[i]),pred[i]))
        perf += np.abs(pred[i]-b[i]) 

    return pred,float(perf)/A.shape[0]


def prediction_test(w,PRINT):

    pred = np.dot(A_test,w)
    perf = 0
    
    for i in range(A_test.shape[0]):
        if PRINT:
            print("True grade: {:d} \t-- Predicted: {:.1f} ".format(int(b_test[i]),pred[i]))
        perf += np.abs(pred[i]-b_test[i]) 

    return pred,float(perf)/A_test.shape[0]
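
Both functions return the predictions together with the mean absolute error between predicted and true grades. As a sketch, the same quantity can be computed without an explicit loop. The helper `mean_absolute_error` and the toy arrays are hypothetical and are not part of the lab code.

```python
import numpy as np

def mean_absolute_error(A_mat, b_vec, w):
    """Return the predictions A_mat @ w and their mean absolute error against b_vec."""
    pred = A_mat @ w
    return pred, float(np.mean(np.abs(pred - b_vec)))

A_toy = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
b_toy = np.array([1.0, 2.0, 4.0])
w_toy = np.array([1.0, 2.0])

pred_toy, err_toy = mean_absolute_error(A_toy, b_toy, w_toy)
print(pred_toy, err_toy)                  # predictions [1. 2. 3.], error 1/3
```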