<center><img src="Fig/UGA.png" width="30%" height="30%"></center>
<center><h3>Master of Science in Industrial and Applied Mathematics (MSIAM)  - 1st year</h3></center>
<hr>
<center><h1>Numerical Optimization</h1></center>
<center><h2>Lab 7: Variance-Reduced Stochastic Gradient</h2></center>

---

In [None]:
%load_ext autoreload
%autoreload 2

# Logistic Regression Problem
 
 
 
### Machine Learning as an Optimization problem
 
We have some *data* $\mathcal{D}$ consisting of $m$ *examples* $\{d_i\}$, each made of a *feature* vector $a_i\in\mathbb{R}^d$ and an *observation* $b_i\in \mathcal{O}$: $\mathcal{D} = \{[a_i,b_i]\}_{i=1..m}$. In this lab, we will consider the <a href="http://archive.ics.uci.edu/ml/datasets/Student+Performance">student performance</a> dataset.
 
 
The goal of *supervised learning* is to construct a predictor for the observations when given feature vectors.
 
 
A popular approach relies on *linear models*: we look for a *parameter* $x$ such that the real number $\langle a_i , x \rangle$ predicts the observation through a *predictor function* $g:\mathbb{R}\to \mathcal{O}$, that is, $g(\langle a_i , x \rangle)$ is the predicted value for $a_i$.
 
 
 In order to find such a parameter, we use the available data and a *loss* $\ell$ that penalizes the error made between the predicted $g(\langle a_i , x \rangle)$ and observed $b_i$ values. For each example $i$, the corresponding error function for a parameter $x$ is $f_i(x) =   \ell( g(\langle a_i , x \rangle) ; b_i )$. Using the whole data, the parameter that minimizes the total error is the solution of the minimization problem
 $$ \min_{x\in\mathbb{R}^d}  \frac{1}{m} \sum_{i=1}^m f_i(x) = \frac{1}{m} \sum_{i=1}^m  \ell( g(\langle a_i , x \rangle) ; b_i ). $$
 
 
 
### Regularized Problem 
 
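In this lab, we add an $\ell_1$ regularization to promote sparsity of the iterates: a sparse final solution selects the most important features. The resulting objective (below) is non-smooth, but it splits into a smooth part $f$, which also carries an $\ell_2$ term weighted by $\lambda_2$, and a non-smooth part $g$, which we will handle with proximal operations. Note that $g$ now denotes the regularizer, not the predictor function of the previous section.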
 
 \begin{align*}
     \min_{x\in\mathbb{R}^d } F(x) := \underbrace{\frac{1}{m}  \sum_{i=1}^m \overbrace{  \log( 1+\exp(-b_i \langle a_i,x \rangle) ) + \frac{\lambda_2}{2} \|x\|_2^2 }^{f_i(x)} }_{f(x)} + \underbrace{\lambda_1 \|x\|_1 }_{g(x)}.
 \end{align*}
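
To make the split between $f$ and $g$ concrete, here is a minimal sketch that evaluates $F$ on synthetic data. The names `smooth_part`, `nonsmooth_part` and `objective` are hypothetical and are not taken from `logistic_regression_student.py`; the sketch also assumes labels $b_i \in \{-1,+1\}$.

```python
import numpy as np

def smooth_part(x, A, b, lam2):
    """f(x) = (1/m) sum_i log(1 + exp(-b_i <a_i, x>)) + (lam2 / 2) ||x||_2^2."""
    margins = b * (A @ x)
    # np.logaddexp(0, -t) evaluates log(1 + exp(-t)) without overflow
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam2 * np.dot(x, x)

def nonsmooth_part(x, lam1):
    """g(x) = lam1 ||x||_1."""
    return lam1 * np.sum(np.abs(x))

def objective(x, A, b, lam1, lam2):
    """F(x) = f(x) + g(x)."""
    return smooth_part(x, A, b, lam2) + nonsmooth_part(x, lam1)

# Synthetic data (assumption: labels in {-1, +1})
rng = np.random.default_rng(0)
m, d = 50, 5
A = rng.standard_normal((m, d))
b = np.where(rng.standard_normal(m) >= 0, 1.0, -1.0)

F_at_zero = objective(np.zeros(d), A, b, lam1=0.1, lam2=0.01)
```

At $x = 0$ every margin is zero and both regularizers vanish, so `F_at_zero` equals $\log 2$ whatever the data.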
 
 




# Reminder: the proximal gradient algorithm
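
As a reminder, one proximal gradient step reads $x_{k+1} = \mathbf{prox}_{\gamma g}\left( x_k - \gamma \nabla f(x_k) \right)$, and for $g = \lambda_1 \|\cdot\|_1$ the proximal operator is the componentwise soft-thresholding. The sketch below illustrates this on a toy quadratic. The function names and the `g_prox(v, step)` signature are assumptions made for this example and may differ from those in `algoProx` and `logistic_regression_student.py`.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient_sketch(f_grad, g_prox, x0, step, n_iter):
    """Iterate x <- prox_{step * g}(x - step * grad f(x)) for n_iter steps."""
    x = np.array(x0, dtype=float)
    for _ in range(n_iter):
        x = g_prox(x - step * f_grad(x), step)
    return x

# Toy problem: f(x) = 0.5 * ||x - c||^2 (so L = 1) and g(x) = lam1 * ||x||_1
c = np.array([3.0, -0.5, 0.2, -2.0])
lam1 = 1.0
x_hat = proximal_gradient_sketch(
    f_grad=lambda x: x - c,
    g_prox=lambda v, step: soft_threshold(v, step * lam1),
    x0=np.zeros(4),
    step=1.0,
    n_iter=10,
)
```

On this toy problem the minimizer is `soft_threshold(c, lam1)`: the two coordinates of `c` with magnitude below `lam1` are set exactly to zero, which is the sparsity effect the $\ell_1$ term is meant to produce.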



In [None]:
from algoProx import *      # import all methods of the module into the current environment
import numpy as np
import logistic_regression_student as pb

#### Parameters passed to the algorithm (see algoGradient.ipynb)
PREC    = 1e-5                     # Sought precision
ITE_MAX = 1000                      # Max number of iterations
x0      = np.zeros(pb.n)              # Initial point
step    = 1.0/pb.L

##### proximal gradient algorithm
x,x_tab = proximal_gradient_algorithm(pb.F , pb.f_grad , pb.g_prox , x0 , step , PREC, ITE_MAX , True)


## Decrease of the objective along the iterations

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# Objective value at every stored iterate
F = [pb.F(x_tab[i]) for i in range(x_tab.shape[0])]

plt.figure()
plt.plot( F, color="black", linewidth=1.0, linestyle="-")
plt.grid(True)
plt.show()

### Support of the vector $x_k$ 


In [None]:
plt.figure()

# Sample about 40 iterates; max(1, ...) avoids a zero step when fewer than 40 are stored
for i in np.arange(0, x_tab.shape[0], max(1, x_tab.shape[0] // 40)):
    for j in range(pb.n):
        if np.abs(x_tab[i,j])>1e-14:
            plt.plot( i , j  , 'ko')

plt.grid(True)
plt.ylabel('Nonzero coordinates')
plt.xlabel('Iterations')
plt.ylim(-1,pb.d+1)
plt.yticks(np.arange(0,pb.d+1))
plt.show()

---

# Stochastic gradient 


In the following, instead of considering $f$ as a whole, we will use its structure 
$$ f(x) := \frac{1}{m}\sum_{i=1}^m f_i(x)$$

> Implement the gradient of $f_i$, the part of $f$ associated with a single example, in `logistic_regression_student.py`
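
For reference, differentiating $f_i(x) = \log( 1+\exp(-b_i \langle a_i,x \rangle) ) + \frac{\lambda_2}{2} \|x\|_2^2$ gives $\nabla f_i(x) = -\frac{b_i}{1+\exp(b_i \langle a_i,x \rangle)}\, a_i + \lambda_2 x$. The following sketch checks this formula against finite differences. The names `f_i` and `f_i_grad` and their signatures are hypothetical; the functions expected in `logistic_regression_student.py` may be organized differently.

```python
import numpy as np

def f_i(x, a_i, b_i, lam2):
    """Loss of one example: log(1 + exp(-b_i <a_i, x>)) + (lam2 / 2) ||x||^2."""
    return np.logaddexp(0.0, -b_i * np.dot(a_i, x)) + 0.5 * lam2 * np.dot(x, x)

def f_i_grad(x, a_i, b_i, lam2):
    """Gradient of f_i: -b_i * a_i / (1 + exp(b_i <a_i, x>)) + lam2 * x."""
    margin = b_i * np.dot(a_i, x)
    # 1 / (1 + exp(t)) written as exp(-log(1 + exp(t))) to avoid overflow
    weight = np.exp(-np.logaddexp(0.0, margin))
    return -b_i * weight * a_i + lam2 * x

# Finite-difference check on a random point (synthetic values)
rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal(d)
a_i = rng.standard_normal(d)
b_i, lam2 = -1.0, 0.1

grad = f_i_grad(x, a_i, b_i, lam2)
eps = 1e-6
fd = np.array([
    (f_i(x + eps * e, a_i, b_i, lam2) - f_i(x - eps * e, a_i, b_i, lam2)) / (2 * eps)
    for e in np.eye(d)
])
max_error = np.max(np.abs(grad - fd))
```

A small `max_error` indicates that the analytic gradient and the numerical one agree; this kind of check is a cheap way to validate a hand-written gradient before plugging it into an algorithm.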

With this structure, a popular minimization algorithm is the *stochastic gradient algorithm*, in which each iteration reads as follows:
* Select $i$ uniformly at random in $\{1,\dots,m\}$
* $x_{k+1} = \mathbf{prox}_{\gamma_k g}\left( x_k - \gamma_k \nabla f_i(x_k) \right) $

> Implement this algorithm with a stepsize $\gamma_k$ that vanishes as $1/k$
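
The two steps above can be sketched as follows on synthetic data, with $\gamma_k = \gamma_0 / k$. This is only one possible layout, not the expected solution: the function name, its arguments, the choice of $\gamma_0$ and the data are all assumptions made for the example.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def stochastic_prox_gradient(A, b, lam1, lam2, gamma0, n_iter, seed=0):
    """Proximal stochastic gradient with stepsize gamma_k = gamma0 / k."""
    rng = np.random.default_rng(seed)
    m, d = A.shape
    x = np.zeros(d)
    for k in range(1, n_iter + 1):
        i = rng.integers(m)                      # uniform index in {0, ..., m-1}
        margin = b[i] * (A[i] @ x)
        # gradient of f_i at x (logistic term + l2 term)
        grad_i = -b[i] * np.exp(-np.logaddexp(0.0, margin)) * A[i] + lam2 * x
        gamma = gamma0 / k                       # vanishing stepsize
        x = soft_threshold(x - gamma * grad_i, gamma * lam1)
    return x

# Synthetic data: only the first three features are informative
rng = np.random.default_rng(0)
m, d = 200, 10
A = rng.standard_normal((m, d))
x_true = np.zeros(d)
x_true[:3] = [2.0, -1.0, 1.5]
b = np.where(A @ x_true + 0.1 * rng.standard_normal(m) >= 0, 1.0, -1.0)

x_sgd = stochastic_prox_gradient(A, b, lam1=0.01, lam2=0.01, gamma0=1.0, n_iter=2000)
```

Each iteration touches a single row of `A`, so its cost does not depend on $m$; the price to pay is the noise in `grad_i`, which the decreasing stepsize has to damp.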

### Variance reduction

The poor performance of this algorithm is largely due to the variance of the stochastic gradients $\nabla f_i(x_k)$, which forces the stepsize to vanish. To overcome this, *variance-reduced* algorithms have been proposed.

We will consider here the popular **SAGA** algorithm (A. Defazio, F. Bach, S. Lacoste-Julien, *SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives*, NIPS 2014).

> Implement SAGA from the paper ( http://papers.nips.cc/paper/5258-saga-a-fast-incremental-gradient-method-with-support-for-non-strongly-convex-composite-objectives ) and compare with the stochastic gradient algorithm.
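
As a starting point for the comparison, here is a minimal sketch of a proximal SAGA iteration on synthetic data. It keeps a table of the last gradient computed for each example, and corrects the fresh gradient $\nabla f_j(x_k)$ by subtracting the stored one and adding the table average. The function names, the initialization of the table at $x_0$, and the constant stepsize $1/(3L)$ (with $L$ bounding the Lipschitz constants of the $\nabla f_i$) are choices made for this example; refer to the paper for the exact statement and the admissible stepsizes.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def grad_f_i(x, a_i, b_i, lam2):
    """Gradient of log(1 + exp(-b_i <a_i, x>)) + (lam2 / 2) ||x||^2."""
    margin = b_i * (a_i @ x)
    return -b_i * np.exp(-np.logaddexp(0.0, margin)) * a_i + lam2 * x

def saga_sketch(A, b, lam1, lam2, n_iter, seed=0):
    """Proximal SAGA with a constant stepsize (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m, d = A.shape
    # Upper bound on the Lipschitz constants of the grad f_i
    L = 0.25 * np.max(np.sum(A ** 2, axis=1)) + lam2
    gamma = 1.0 / (3.0 * L)
    x = np.zeros(d)
    # Table of stored gradients, initialized at x0 (costs one full pass)
    table = np.array([grad_f_i(x, A[i], b[i], lam2) for i in range(m)])
    table_mean = table.mean(axis=0)
    for _ in range(n_iter):
        j = rng.integers(m)
        g_new = grad_f_i(x, A[j], b[j], lam2)
        # Variance-reduced direction: fresh gradient - stored gradient + average
        direction = g_new - table[j] + table_mean
        x = soft_threshold(x - gamma * direction, gamma * lam1)
        # Update the running average, then the table entry
        table_mean += (g_new - table[j]) / m
        table[j] = g_new
    return x

# Synthetic data: only the first three features are informative
rng = np.random.default_rng(0)
m, d = 200, 10
A = rng.standard_normal((m, d))
x_true = np.zeros(d)
x_true[:3] = [2.0, -1.0, 1.5]
b = np.where(A @ x_true + 0.1 * rng.standard_normal(m) >= 0, 1.0, -1.0)

x_saga = saga_sketch(A, b, lam1=0.01, lam2=0.01, n_iter=5000)
```

Unlike the stochastic gradient sketch, the stepsize stays constant here; the trade-off is the memory needed for the $m \times d$ gradient table. For linear models such as this one, the table could be reduced to one scalar per example, which this sketch does not attempt.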
