# Lasso & Elastic Net

# Sparse models for variable selection

The goal of this lab is to become familiar with the concept of *sparsity* and to see how it connects with variable selection in high-dimensional contexts.

A machine learning model is said to be **sparse** when it only contains a small number of non-zero parameters, with respect to the number of features that can be measured on the objects this model represents.

Let's see how sparsity works in practice.

## 1. *Sparse* variable selection on toy problems

Using `scikit-learn`, let's make a toy regression problem with `n=100` samples and `d=30` variables, of which only `d_rel=7` are informative. Use the flag `coef=True` to get $w^*$, *i.e.* the **true** weights of the toy problem.

In [None]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, ElasticNet

In [None]:
n = 100
d = 30
d_rel = 7

X, y, coef = make_regression(n_samples=n, n_features=d, n_informative=d_rel, noise=20, coef=True, random_state=0)
print("Data shape: {}".format(X.shape))

Let's now perform sparse linear regression fitting a [`Lasso`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) model

$$\min_w \frac{1}{2 n} ||y - Xw||^2_2 + \alpha~||w||_1$$

for a fixed value of $\alpha=40$.

In [None]:
mdl = Lasso(alpha=40)
mdl.fit(X, y)

We can inspect the weights $w$ through the attribute `mdl.coef_`. Let's plot them against the feature index and compare them with $w^*$.
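One possible way to draw this comparison is sketched below. The helper `plot_weights` is a hypothetical name introduced here for illustration, and the toy problem is regenerated with the same call as in the cell above so that the snippet runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Same toy problem and model as in the cells above (regenerated to be self-contained).
X, y, coef = make_regression(n_samples=100, n_features=30, n_informative=7,
                             noise=20, coef=True, random_state=0)
mdl = Lasso(alpha=40).fit(X, y)


def plot_weights(w_true, w_est, ax=None):
    """Stem plot of the true and the estimated weights against the feature index."""
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 3))
    idx = np.arange(len(w_true))
    ax.stem(idx, w_true, linefmt="C0-", markerfmt="C0o", basefmt="k-", label="$w^*$")
    ax.stem(idx, w_est, linefmt="C1--", markerfmt="C1x", basefmt="k-", label="Lasso $w$")
    ax.set_xlabel("feature index")
    ax.set_ylabel("weight")
    ax.legend()
    return ax


ax = plot_weights(coef, mdl.coef_)
print("non-zero true weights:     ", np.count_nonzero(coef))
print("non-zero estimated weights:", np.count_nonzero(mdl.coef_))
```

Overlaying the two stem plots makes it easy to see which true variables are recovered and how much the surviving weights are shrunk.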

**Comment the results.**

### Try to play a bit with the value of $\alpha$.

**What do you expect?**

**What happens?**

**What can you say about the choice of setting $\alpha=40$ in the previous example?**
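As a starting point for this exploration, the sketch below counts the non-zero weights for a few values of $\alpha$. It also computes `alpha_max`, the threshold above which the Lasso solution is identically zero; for the objective above with an intercept, this is $\max_j |X_c^\top y_c|_j / n$ on centered data. The helper `count_selected` and the grid of values are illustrative choices, not part of the original lab.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Same toy problem as above (regenerated to be self-contained).
X, y, coef = make_regression(n_samples=100, n_features=30, n_informative=7,
                             noise=20, coef=True, random_state=0)

# Lasso fits an intercept by default, which amounts to centering X and y.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
alpha_max = np.max(np.abs(Xc.T @ yc)) / len(y)
print("alpha_max = {:.2f}".format(alpha_max))


def count_selected(X, y, alpha):
    """Number of non-zero Lasso weights for a given alpha."""
    w = Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_
    return int(np.count_nonzero(w))


for alpha in [0.1, 1.0, 10.0, 40.0, 1.01 * alpha_max]:
    print("alpha = {:8.2f} -> {} selected variables".format(
        alpha, count_selected(X, y, alpha)))
```

Comparing `alpha=40` with `alpha_max` gives a sense of how aggressive the regularization in the previous example is.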

## 2. Recap on the regularization path

We have seen that the regularization parameter $\alpha$ plays a crucial role. When performing variable selection in high-dimensional contexts, it may be helpful to observe the weight assigned to each variable for increasing values of $\alpha$.

An intuitive representation of this behavior is called the **`regularization path`**.

### 2.1 Hands on the Lasso path

**With the toy regression problem above, implement a function that estimates the `lasso path` and visualize it.**
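A minimal sketch of such a function is given below, assuming that refitting one `Lasso` per value of $\alpha$ is acceptable for a problem of this size. The names `estimate_lasso_path` and `plot_path` and the grid of alphas are hypothetical; `sklearn.linear_model.lasso_path` is a faster alternative, but note that it does not fit an intercept.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso


def estimate_lasso_path(X, y, alphas):
    """Fit one Lasso per alpha; returns an array of shape (n_alphas, n_features)."""
    return np.array([Lasso(alpha=a, max_iter=10000).fit(X, y).coef_
                     for a in alphas])


def plot_path(alphas, coefs, title, ax=None):
    """Plot each weight (one line per feature) against alpha on a log axis."""
    if ax is None:
        _, ax = plt.subplots(figsize=(7, 4))
    ax.semilogx(alphas, coefs)
    ax.set_xlabel(r"$\alpha$")
    ax.set_ylabel("weight")
    ax.set_title(title)
    return ax


# Same toy problem as above (regenerated to be self-contained).
X, y, coef = make_regression(n_samples=100, n_features=30, n_informative=7,
                             noise=20, coef=True, random_state=0)
alphas = np.logspace(-1, 2.5, 40)  # illustrative grid
coefs = estimate_lasso_path(X, y, alphas)
ax = plot_path(alphas, coefs, "Lasso path")
```

Each line in the plot is one variable; reading the figure from left to right shows the order in which weights are driven exactly to zero.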

**Implement a function that estimates the `ridge path`. Compare your results and comment on the differences.**
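The ridge counterpart can be sketched in the same way. Keep in mind that `Ridge` in `scikit-learn` minimizes $||y - Xw||^2_2 + \alpha~||w||^2_2$ without the $\frac{1}{2n}$ factor, so its $\alpha$ lives on a different scale from the Lasso one; the function name and the grid below are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge


def estimate_ridge_path(X, y, alphas):
    """Fit one Ridge per alpha; returns an array of shape (n_alphas, n_features)."""
    return np.array([Ridge(alpha=a).fit(X, y).coef_ for a in alphas])


# Same toy problem as above (regenerated to be self-contained).
X, y, coef = make_regression(n_samples=100, n_features=30, n_informative=7,
                             noise=20, coef=True, random_state=0)
ridge_alphas = np.logspace(-2, 5, 40)  # illustrative grid
ridge_coefs = estimate_ridge_path(X, y, ridge_alphas)

_, ax = plt.subplots(figsize=(7, 4))
ax.semilogx(ridge_alphas, ridge_coefs)  # one line per feature
ax.set_xlabel(r"$\alpha$")
ax.set_ylabel("weight")
ax.set_title("Ridge path")

# Ridge shrinks the weights but does not set them exactly to zero.
print("non-zero weights at the largest alpha:", np.count_nonzero(ridge_coefs[-1]))
```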

### 2.2 Lasso and correlations among features

In the presence of correlation between variables, the lasso penalty may not be very informative. Let's evaluate it on a toy dataset in which each informative variable appears twice: once as generated and once as a copy rescaled by a factor of $-2$.

Use `n=100` samples, `half_d_rel=5`, `d_dummy = 25`.

In [None]:
n = 100
half_d_rel = 5
d = 30
d_dummy = d - half_d_rel

X, y, coef = make_regression(n_samples=n, n_features=d-half_d_rel, n_informative=half_d_rel, coef=True)
relevant = np.nonzero(coef)[0]
X = np.hstack((X, -2*X[:,relevant]))

print("Data shape: {}".format(X.shape))

**The total number of relevant variables here is `2*half_d_rel = 10`.**

**Use the function implemented in Exercise #2.1 and visualize the lasso path for this dataset.**

**How many selected variables do you see? Why?**

### 2.3 The `Elastic-Net` model

The Elastic-Net estimator solves

$$\min_w \frac{1}{2n} ||y - Xw||^2_2 + \alpha \cdot l_{1_{ratio}} \cdot ||w||_1 + \frac{1}{2}~\alpha \cdot (1 - l_{1_{ratio}}) \cdot ||w||^2_2$$

or, equivalently,

$$\min_w \frac{1}{2n} ||y - Xw||^2_2 + \tau~||w||_1 + \mu~||w||^2_2$$

where $\tau = \alpha \cdot l_{1_{ratio}}$ and $\mu = \frac{1}{2}~\alpha \cdot (1 - l_{1_{ratio}})$.
Thanks to the combined influence of the $\ell_1$ and the $\ell_2$ norms, the Elastic-Net achieves a sparse and stable solution in which joint selection of collinear variables is promoted.

**Fix `l1_ratio=0.5` and evaluate the Elastic-Net path on the dataset of Exercise 2.2.**
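A possible sketch of this evaluation follows. It rebuilds a dataset with the same construction as in Exercise 2.2 (with `random_state=0` added here only for reproducibility) and refits one `ElasticNet` per value of $\alpha$; the function name `estimate_enet_path` and the grid of alphas are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet


def estimate_enet_path(X, y, alphas, l1_ratio=0.5):
    """Fit one ElasticNet per alpha; returns an array (n_alphas, n_features)."""
    return np.array([
        ElasticNet(alpha=a, l1_ratio=l1_ratio, max_iter=10000).fit(X, y).coef_
        for a in alphas
    ])


# Same construction as in Exercise 2.2: 25 base features, 5 informative,
# plus a rescaled copy (-2x) of each informative column.
X_base, y_corr, coef_corr = make_regression(n_samples=100, n_features=25,
                                            n_informative=5, coef=True,
                                            random_state=0)
relevant = np.nonzero(coef_corr)[0]
X_corr = np.hstack((X_base, -2 * X_base[:, relevant]))

enet_alphas = np.logspace(-1, 3, 40)  # illustrative grid
enet_coefs = estimate_enet_path(X_corr, y_corr, enet_alphas, l1_ratio=0.5)

_, ax = plt.subplots(figsize=(7, 4))
ax.semilogx(enet_alphas, enet_coefs)  # one line per feature
ax.set_xlabel(r"$\alpha$")
ax.set_ylabel("weight")
ax.set_title("Elastic-Net path (l1_ratio=0.5)")

# Weights of the original informative columns and of their rescaled copies
# at the smallest alpha of the grid.
print("originals:", np.round(enet_coefs[0, relevant], 2))
print("copies:   ", np.round(enet_coefs[0, 25:], 2))
```

Printing the weights of each original column next to those of its copy is a convenient way to check whether correlated variables enter the model together.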

**How many selected variables do you see? Why?**

**What is the asymptotic behavior of the weights corresponding to correlated features?**

## 3. Elastic-Net for variable selection in a microarray study

**We will train an ElasticNetClassifier (take a look at the file `enet_classifier.py`) on the [Golub dataset](http://portals.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?paper_id=43) which contains microarray data measured from two classes of leukemia: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL).**

In [None]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import os
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

from collections import Counter
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

from enet_classifier import ElasticNetClassifier

### 3.1. Load data

Let's load the dataset with `pandas` in the usual way (`pandas.read_csv`). Read both data and labels.

In [None]:
data = pd.read_csv("gedm.csv", header=0, index_col=0).T
print("n_samples = {} | n_variables = {}".format(*data.shape))

n_samples, n_variables = data.shape
data.head()

In [None]:
labels = pd.read_csv('labels.csv', header=0, index_col=0)
print("n_samples (AML) = {} | n_samples (ALL) = {}".format(
    np.sum(labels.values == 'AML'), np.sum(labels.values == 'ALL')))

labels.head()

**Encode the labels into the standard binary classification problem: associate ALL with class `1` and AML with class `-1`.**

In [None]:
binary_labels = np.where(labels.values == 'ALL', 1, -1)

**What is the accuracy score of a random classifier?**
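To reason about this question, it may help to compute the expected accuracy of a few label-agnostic baselines directly from the class proportions. The sketch below uses a small made-up label vector, not the Golub labels, and the function name `chance_levels` is hypothetical.

```python
import numpy as np


def chance_levels(y, positive=1):
    """Expected accuracy of three baselines that ignore the features.

    - uniform: predicts each of the two classes with probability 1/2
    - stratified: predicts each class with its empirical frequency
    - majority: always predicts the most frequent class
    """
    y = np.asarray(y).ravel()
    p = np.mean(y == positive)  # fraction of positive samples
    return {
        "uniform": 0.5,
        "stratified": p ** 2 + (1 - p) ** 2,
        "majority": max(p, 1 - p),
    }


# Made-up example: 6 positive and 4 negative labels.
y_example = np.array([1] * 6 + [-1] * 4)
levels = chance_levels(y_example)
print(levels)
```

On the real data, the same function can be called as `chance_levels(binary_labels)`; any trained model should be compared against these reference values rather than against 0.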

**Note that the names of some columns of `data` start with `AFFX`. These are not real features: they are control probes related to the microarray structure. For this reason, we can remove them before the actual analysis.**

In [None]:
relevant_features = data.columns[~data.columns.str.startswith("AFFX")]
data = data.loc[:, relevant_features]

### 3.2. Data visualization

**We can visualize the data by projecting them on a 2-dimensional space with `PCA`.**
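A minimal sketch of such a projection is shown below. Since the CSV files are not available here, it runs on random stand-in data with the same layout (samples on the rows, labels in $\{1, -1\}$); the helper `plot_pca_2d` is a hypothetical name.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA


def plot_pca_2d(X, y, ax=None):
    """Project X on its first two principal components and color by class."""
    Z = PCA(n_components=2).fit_transform(X)
    if ax is None:
        _, ax = plt.subplots(figsize=(5, 4))
    for cls, name in [(1, "ALL"), (-1, "AML")]:
        mask = (np.asarray(y).ravel() == cls)
        ax.scatter(Z[mask, 0], Z[mask, 1], label=name)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.legend()
    return Z


# Stand-in data (NOT the Golub dataset): 40 samples, 200 variables,
# with a mean shift on the first 10 variables of the positive class.
rng = np.random.RandomState(0)
X_demo = rng.randn(40, 200)
y_demo = np.repeat([1, -1], 20)
X_demo[y_demo == 1, :10] += 2.0

Z = plot_pca_2d(X_demo, y_demo)
```

With the notebook variables, the call would be `plot_pca_2d(data.values, binary_labels)`.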

### 3.3 Fitting the model

To control $\mu$ and $\tau$ separately in the ElasticNet model, you need to solve the appropriate linear system of equations:

$ \alpha l_{1_{ratio}} = \tau$

$ \frac{\alpha}{2} (1 - l_{1_{ratio}}) = \mu$

$\dots$

$ l_{1_{ratio}} = \frac{\tau}{2 \mu + \tau}$

$ \alpha = 2 \mu + \tau$



**Write a Python function which performs the conversion from `tau` and `mu` to `alpha` and `l1_ratio`.**
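The closed-form solution above translates almost literally into code. The sketch below also includes the inverse mapping, which is handy for checking the round trip; both function names are illustrative.

```python
def tau_mu_to_alpha_l1ratio(tau, mu):
    """Convert (tau, mu) to the (alpha, l1_ratio) parametrization.

    Uses alpha = 2 * mu + tau and l1_ratio = tau / (2 * mu + tau).
    """
    alpha = 2.0 * mu + tau
    if alpha <= 0:
        raise ValueError("tau and mu must be non-negative and not both zero")
    return alpha, tau / alpha


def alpha_l1ratio_to_tau_mu(alpha, l1_ratio):
    """Inverse mapping: tau = alpha * l1_ratio, mu = alpha * (1 - l1_ratio) / 2."""
    return alpha * l1_ratio, 0.5 * alpha * (1.0 - l1_ratio)


alpha, l1_ratio = tau_mu_to_alpha_l1ratio(tau=1.0, mu=1.0)
print(alpha, l1_ratio)
print(alpha_l1ratio_to_tau_mu(alpha, l1_ratio))
```

Setting `mu=0` gives `l1_ratio=1`, that is a pure Lasso, while `tau=0` gives `l1_ratio=0`, that is a pure ridge penalty.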

### 3.4 Nested variable selection

We will observe the effect of increasing the value of `mu`, which controls the amount of variable correlation tolerated in the solution. For example, with `tau` fixed, increasing `mu` should result in the selection of nested lists of variables, as in [DeMol09](http://online.liebertpub.com/doi/abs/10.1089/cmb.2008.0171).

**Implement a data analysis pipeline following the next steps:**

1. Fix a value for `mu_0`, *e.g.* `mu_0 = 1`.

2. Split the dataset into `K` folds (non-overlapping groups).

    a. For each iteration, keep $\frac{1}{K}$ of the samples aside and use them as the test set.
    
    b. Use the remaining $\frac{K-1}{K}$ of the samples as the training set to optimize the regularization parameter `tau` via an inner `GridSearchCV` cross-validation.
    
    c. The best model is achieved with the optimal `tau` fitted on the training set.

    d. Evaluate the best model on the test set and keep track of the accuracy score and the list of selected variables.
    
This would allow us to obtain a ranking of variables.

In [None]:
X, y = #training data, labels

# Find an appropriate number of folds into which to split the dataset
K = #?

kf = StratifiedKFold(n_splits=K, shuffle=True)

enet = ElasticNetClassifier()

Use the following ranges for the parameters:
- `tau_range` in logarithmic scale from `1e-1` to `1e5`
- `mu_range` as [`1e4`, `1e5`, `1e7`, `1e8`]
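The pipeline steps above can be sketched as follows. The interface of `ElasticNetClassifier` is not shown in this notebook, so the code uses a hypothetical stand-in, `SignElasticNet`, which regresses on the $\pm 1$ labels with `sklearn`'s `ElasticNet` and predicts the sign. The synthetic data, the small `tau`/`mu` values and `K=5` are illustrative assumptions as well; on the real data, the ranges listed above should be used.

```python
import numpy as np
from collections import Counter
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import ElasticNet
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold


class SignElasticNet(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in for ElasticNetClassifier (labels in {-1, 1})."""

    def __init__(self, tau=1.0, mu=1.0):
        self.tau = tau
        self.mu = mu

    def fit(self, X, y):
        alpha = 2.0 * self.mu + self.tau  # (tau, mu) -> (alpha, l1_ratio)
        self.model_ = ElasticNet(alpha=alpha, l1_ratio=self.tau / alpha,
                                 max_iter=10000).fit(X, y)
        self.coef_ = self.model_.coef_
        self.classes_ = np.array([-1, 1])
        return self

    def predict(self, X):
        return np.where(self.model_.predict(X) >= 0, 1, -1)


def nested_selection(X, y, mu, tau_range, K=5, seed=0):
    """Outer K-fold loop with an inner grid search on tau, for a fixed mu.

    Returns the mean test accuracy and a Counter mapping each variable
    index to the number of outer folds in which it was selected.
    """
    outer = StratifiedKFold(n_splits=K, shuffle=True, random_state=seed)
    scores, counts = [], Counter()
    for train, test in outer.split(X, y):
        search = GridSearchCV(SignElasticNet(mu=mu), {"tau": tau_range}, cv=3)
        search.fit(X[train], y[train])
        best = search.best_estimator_  # refit on the whole training split
        scores.append(accuracy_score(y[test], best.predict(X[test])))
        counts.update(np.flatnonzero(best.coef_).tolist())
    return float(np.mean(scores)), counts


# Synthetic stand-in data (NOT the Golub dataset).
X_demo, y01 = make_classification(n_samples=60, n_features=50, n_informative=5,
                                  n_redundant=0, random_state=0)
y_demo = np.where(y01 == 1, 1, -1)
acc, counts = nested_selection(X_demo, y_demo, mu=1e-2,
                               tau_range=np.logspace(-3, 0, 6))
print("mean accuracy: {:.2f}".format(acc))
print("most frequently selected:", counts.most_common(5))
```

Repeating the call for each value in `mu_range` and storing the resulting counters gives the raw material for the ranking of variables.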

### 3.5 Build a `pandas` DataFrame whose values correspond to the number of times each variable is selected for different values of $\mu$
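One way to assemble such a table is sketched below, assuming that the analysis produced, for each value of $\mu$, a `Counter` mapping variable names to the number of folds in which they were selected. The variable names and counts here are made up for illustration only.

```python
from collections import Counter
import pandas as pd

# Made-up bookkeeping: {mu: Counter({variable name: times selected})}.
selected = {
    1e4: Counter({"gene_a": 5, "gene_b": 3}),
    1e5: Counter({"gene_a": 5, "gene_b": 4, "gene_c": 2}),
}


def selection_frequency_table(selected):
    """Rows are variables, columns are mu values, entries are selection counts."""
    df = pd.DataFrame({mu: dict(c) for mu, c in selected.items()})
    df = df.fillna(0).astype(int)  # a variable never selected for a mu counts 0
    df.index.name = "variable"
    df.columns.name = "mu"
    return df


freq = selection_frequency_table(selected)
print(freq)
```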

### 3.6 Build an explanatory visualization of the selection frequency for each variable, at different choices of $\mu$
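A heatmap is one reasonable option for this visualization. The sketch below assumes a DataFrame laid out with variables on the rows and values of $\mu$ on the columns; the table used here is a made-up example and `plot_selection_heatmap` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_selection_heatmap(freq, ax=None):
    """Heatmap of selection counts: rows are variables, columns are mu values."""
    if ax is None:
        _, ax = plt.subplots(figsize=(4, 0.4 * len(freq) + 1.5))
    im = ax.imshow(freq.values, aspect="auto", cmap="viridis")
    ax.set_xticks(np.arange(freq.shape[1]))
    ax.set_xticklabels(["{:g}".format(m) for m in freq.columns])
    ax.set_yticks(np.arange(freq.shape[0]))
    ax.set_yticklabels(freq.index)
    ax.set_xlabel(r"$\mu$")
    ax.set_ylabel("variable")
    ax.figure.colorbar(im, ax=ax, label="times selected")
    return ax


# Made-up selection counts, for illustration only.
freq_demo = pd.DataFrame({1e4: [5, 3, 0], 1e5: [5, 4, 2]},
                         index=["gene_a", "gene_b", "gene_c"])
ax = plot_selection_heatmap(freq_demo)
```

If the lists of selected variables are nested, rows that light up for a small $\mu$ should stay lit for larger values.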

**What do you observe?**