# Machine Learning 2024-2025 - UMONS

# Regularization

In this lab, we will experiment with two regularization methods, **ridge** and **LASSO**. In the regression setting, given a training dataset $\mathcal{D} = \{(x_i,y_i)\}_{i=1}^n$, recall from the course that ridge regression finds the coefficients $\beta^\text{R}$ that solve the following optimization problem:

$$\beta^\text{R} = \underset{\beta}{\text{argmin}} \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^p \beta_j^2,$$

where $\lambda \geq 0$ is a hyperparameter controlling the amount of shrinkage applied to the coefficients $\beta_j$. On the other hand, the LASSO regression coefficients are obtained from the following optimization problem:

$$ \beta^\text{L} = \underset{\beta}{\text{argmin}} \sum_{i=1}^n \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^p |\beta_j|.$$ 

In the binary classification setting, where $\mathcal{Y} = \{0,1\}$, ridge logistic regression aims instead to solve the following optimization problem:

$$\beta^\text{R} = \underset{\beta}{\text{argmin}}\left[-\text{log}~\mathcal{L}(\boldsymbol{\beta};\mathcal{D}) + \lambda \sum_{j=1}^p \beta_j^2\right],$$

where $-\text{log}~\mathcal{L}(\boldsymbol{\beta};\mathcal{D})$ is the negative conditional log-likelihood, averaged over the $n$ observations, i.e.,

$$-\text{log}~\mathcal{L}(\boldsymbol{\beta};\mathcal{D}) = -\frac{1}{n} \sum_{i=1}^n \left[ y_i\text{log}~p(y_i=1\mid\boldsymbol{x}_i; \boldsymbol{\beta}) + (1-y_i)\text{log}~p(y_i=0\mid\boldsymbol{x}_i;\boldsymbol{\beta}) \right],$$

with ${p(y_i=1\mid\boldsymbol{x}_i;\boldsymbol{\beta}) = \frac{e^{\boldsymbol{\beta}^T \boldsymbol{x}_i}}{1 + e^{\boldsymbol{\beta}^T \boldsymbol{x}_i}}}$ being a logistic regression classifier where $\boldsymbol{\beta}, \boldsymbol{x}_i \in \mathbb{R}^p$. Alternatively, LASSO logistic regression aims to minimize the following objective:

$$\beta^\text{L} = \underset{\beta}{\text{argmin}}\left[-\text{log}~\mathcal{L}(\boldsymbol{\beta};\mathcal{D}) + \lambda \sum_{j=1}^p |\beta_j|\right].$$

The purpose of the lab is to better understand the influence of $\lambda$ on the fitted models, in both the regression and classification settings.
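To make the ridge objective above concrete, here is a minimal sketch that evaluates it directly. It assumes a small toy dataset (not the lab data) and a hypothetical helper `ridge_objective`. It checks that the coefficients returned by scikit-learn's `Ridge` give a lower objective value than randomly perturbed coefficients, and that a larger $\lambda$ shrinks the coefficient vector.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge


def ridge_objective(beta0, beta, X, y, lam):
    """Residual sum of squares plus lam times the squared L2 norm of beta.

    Hypothetical helper written for this illustration; the intercept beta0
    is not penalized, as in the formula above.
    """
    residuals = y - beta0 - X @ beta
    return np.sum(residuals**2) + lam * np.sum(beta**2)


# Toy data (an assumption for illustration, not the lab dataset).
X, y = make_regression(n_samples=50, n_features=4, noise=5.0, random_state=0)

lam = 10.0
model = Ridge(alpha=lam).fit(X, y)
objective_at_fit = ridge_objective(model.intercept_, model.coef_, X, y, lam)

# Perturbing the fitted coefficients should only increase the objective.
rng = np.random.default_rng(0)
perturbed = [
    ridge_objective(
        model.intercept_, model.coef_ + rng.normal(scale=0.5, size=4), X, y, lam
    )
    for _ in range(5)
]
print("objective at the fit:", objective_at_fit)
print("smallest perturbed objective:", min(perturbed))

# A larger lambda shrinks the coefficients more strongly.
norm_small = np.linalg.norm(Ridge(alpha=1.0).fit(X, y).coef_)
norm_large = np.linalg.norm(Ridge(alpha=100.0).fit(X, y).coef_)
print("coefficient norm for lambda=1:", norm_small)
print("coefficient norm for lambda=100:", norm_large)
```

The same kind of check could be written for LASSO by replacing the squared penalty with `np.sum(np.abs(beta))`, keeping in mind that scikit-learn's `Lasso` rescales the squared-error term.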

**Import the necessary libraries.**

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression, Ridge
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    confusion_matrix,
    mean_squared_error,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

### Regularization in regression

**Use the code snippet below to generate some data points.**

In [None]:
def generate(
    n_samples: int = 100,
    n_features: int = 5,
    n_informatives: int = 5,
    coef: bool = True,
):
    """Generate a synthetic regression dataset and its true coefficients."""
    # `coef` must stay True: with coef=False, make_regression returns only (X, y).
    X, y, true_coef = datasets.make_regression(
        n_samples=n_samples,  # number of samples
        n_features=n_features,  # number of features
        n_informative=n_informatives,  # number of useful features
        noise=10,  # standard deviation of the Gaussian noise
        coef=coef,  # also return the true coefficients used to generate the data
        random_state=0,  # always create the same dataset
    )
    return X, y, true_coef

**1) Fit a ridge regression to the data for $\lambda$ varying between $1$ and $2000$ by steps of $1$, and plot the evolution of the coefficients as a function of $\lambda$. What do you observe? Do the same operation for LASSO regression with $\lambda$ varying between $1$ and $100$.**

You can use the `Ridge` and `Lasso` classes from the scikit-learn library. Note that the hyperparameter we called $\lambda$ is called `alpha` in scikit-learn.
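One possible way to compute and plot the coefficient paths is sketched below. It is only a sketch under stated assumptions: the data are regenerated with the same settings as the default call to `generate()`, and the variable names are illustrative. Keep in mind that scikit-learn's `Lasso` divides the squared-error term by $2n$, so its `alpha` is not numerically identical to the $\lambda$ of the LASSO formula above, whereas `Ridge` uses the unscaled sum of squares.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Same settings as the default call generate() in this lab.
X, y = make_regression(
    n_samples=100, n_features=5, n_informative=5, noise=10, random_state=0
)

ridge_lambdas = np.arange(1, 2001, dtype=float)  # 1, 2, ..., 2000
lasso_lambdas = np.arange(1, 101, dtype=float)  # 1, 2, ..., 100

# One row of coefficients per value of lambda.
ridge_coefs = np.array([Ridge(alpha=a).fit(X, y).coef_ for a in ridge_lambdas])
lasso_coefs = np.array([Lasso(alpha=a).fit(X, y).coef_ for a in lasso_lambdas])

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(ridge_lambdas, ridge_coefs)
axes[0].set_title("Ridge")
axes[1].plot(lasso_lambdas, lasso_coefs)
axes[1].set_title("LASSO")
for ax in axes:
    ax.set_xlabel(r"$\lambda$")
    ax.set_ylabel("Coefficient value")
plt.show()
```

Each curve corresponds to one coefficient $\beta_j$, which makes it easy to see whether a coefficient is merely shrunk or set exactly to zero.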

In [None]:
# Create one model per value of lambda (called `alpha` in scikit-learn)
modelsR = [Ridge(alpha=a) for a in range(1, 2001)]  # ridge: lambda = 1, ..., 2000
modelsL = [Lasso(alpha=a) for a in range(1, 101)]  # LASSO: lambda = 1, ..., 100

**2) Reuse the code snippet above to generate $100$ samples with $90$ features, of which only $2$ are informative about the response $y$. Split your dataset into a training set and a test set using an $80/20$ partition. For increasing values of $\lambda$ between $0.1$ and $10$ (by steps of $0.1$), fit a ridge regression model on the training data and plot its MSE as a function of $\lambda$. What do you observe? Do the same for LASSO regression.**

In [None]:
X, y, coef = generate(n_samples=100, n_features=90, n_informatives=2)
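The loop over $\lambda$ could look like the sketch below. It assumes a fixed `random_state=0` for the split and a hypothetical helper `mse_curves`; the data are regenerated directly with `make_regression` so that the block runs on its own. Both the training and the test MSE are recorded, since comparing the two is what reveals overfitting.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# 100 samples, 90 features, only 2 of them informative (as in question 2).
X, y = make_regression(
    n_samples=100, n_features=90, n_informative=2, noise=10, random_state=0
)
# The random_state of the split is an assumption made for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

lambdas = np.linspace(0.1, 10, 100)  # 0.1, 0.2, ..., 10.0


def mse_curves(model_class, **kwargs):
    """Return the training and test MSE for every value in `lambdas`."""
    train_mse, test_mse = [], []
    for lam in lambdas:
        model = model_class(alpha=lam, **kwargs).fit(X_train, y_train)
        train_mse.append(mean_squared_error(y_train, model.predict(X_train)))
        test_mse.append(mean_squared_error(y_test, model.predict(X_test)))
    return np.array(train_mse), np.array(test_mse)


ridge_train, ridge_test = mse_curves(Ridge)
# A larger max_iter helps the coordinate descent solver converge for small lambda.
lasso_train, lasso_test = mse_curves(Lasso, max_iter=10_000)

print("ridge: lambda with lowest test MSE =", lambdas[np.argmin(ridge_test)])
print("LASSO: lambda with lowest test MSE =", lambdas[np.argmin(lasso_test)])
```

The four arrays can then be plotted against `lambdas` with `plt.plot`.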

**3) Use the code snippet below to simulate some data. Then, perform the following steps:**
- **Split your dataset into training and test sets following a $80/20$ partition.**
- **For values of $\lambda$ in $[0,10]$, perform a GridSearch cross-validation with `cv = 10` to identify the best value of $\lambda$ for both ridge and LASSO regression.**
    - **You will need the `GridSearchCV` class of scikit-learn for this step. Use the mean squared error as the selection metric (`scoring="neg_mean_squared_error"` in scikit-learn).**
    - **What are the best values $\lambda^\text{R}$ and $\lambda^\text{L}$ for ridge and LASSO, respectively?**
- **Using these best values of $\lambda^\text{R}$ and $\lambda^\text{L}$, refit the models to the full training set.**
- **Evaluate the training and test MSE of the fitted models, and compare them to the training and test MSE of a linear regression model fitted on all predictors.**
- **How many coefficients are non-zero for the LASSO regression?**

In [None]:
X, y, coef = datasets.make_regression(
    n_samples=500,  # number of samples
    n_features=350,  # number of features
    n_informative=2,  # number of useful features
    noise=5,  # standard deviation of the Gaussian noise
    coef=True,  # also return the true coefficients used to generate the data
    random_state=1,  # always create the same dataset
)
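The steps of question 3 can be sketched as follows. This is one possible approach, not a reference solution: the grid of 25 values, the `random_state` of the split and the variable names are assumptions. The value $\lambda = 0$ is left out of the grid because scikit-learn advises against `alpha=0` for `Lasso`. With `refit=True` (the default), `GridSearchCV` refits the best model on the full training set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(
    n_samples=500, n_features=350, n_informative=2, noise=5, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0  # random_state is an assumption
)

# Coarse grid over (0, 10]; an assumption chosen to keep the search fast.
param_grid = {"alpha": np.linspace(0.1, 10, 25)}

searches = {
    "ridge": GridSearchCV(
        Ridge(), param_grid, cv=10, scoring="neg_mean_squared_error"
    ),
    "lasso": GridSearchCV(
        Lasso(max_iter=10_000), param_grid, cv=10, scoring="neg_mean_squared_error"
    ),
}
for search in searches.values():
    search.fit(X_train, y_train)  # the best model is refitted on all training data

best_lambda = {name: s.best_params_["alpha"] for name, s in searches.items()}
models = {name: s.best_estimator_ for name, s in searches.items()}
models["ols"] = LinearRegression().fit(X_train, y_train)

train_mse = {
    name: mean_squared_error(y_train, m.predict(X_train)) for name, m in models.items()
}
test_mse = {
    name: mean_squared_error(y_test, m.predict(X_test)) for name, m in models.items()
}
n_nonzero = int(np.sum(models["lasso"].coef_ != 0))

print("best lambda:", best_lambda)
print("training MSE:", train_mse)
print("test MSE:", test_mse)
print("non-zero LASSO coefficients:", n_nonzero)
```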

### Regularization in classification

**Use the code snippet below to generate a classification dataset.**

In [None]:
X, y = datasets.make_classification(
    n_samples=200,  # number of samples
    n_features=30,  # number of features
    n_informative=5,  # number of useful features
    random_state=0,
    weights=[0.8, 0.2],  # Proportion of samples per class.
)

**4) Using the generated dataset above, apply the following steps:**
- **Split your dataset into training and test sets following a $80/20$ partition.**
- **For values of $\lambda$ in $[0,10]$, perform a GridSearch cross-validation with `cv = 10` to identify the best value of $\lambda$ for both the ridge and LASSO logistic regression classifiers.**
    - **Check the arguments `penalty` and `C` of the `LogisticRegression` class to define ridge and LASSO classifiers. Note that `C` is the inverse of the regularization strength, so a larger $\lambda$ corresponds to a smaller `C`. Select `saga` as solver for ridge and `liblinear` as solver for LASSO.**
    - **You will need the `GridSearchCV` class of scikit-learn for this step. Use the accuracy as the selection metric.**
    - **What are the best values $\lambda^\text{R}$ and $\lambda^\text{L}$ for ridge and LASSO, respectively?**
- **Using these best values of $\lambda^\text{R}$ and $\lambda^\text{L}$, refit the models to the full training set.**
- **Evaluate the training and test accuracy of the fitted models, and compare them to the training and test accuracy of a logistic regression model fitted on all predictors.**
    - **Use the `saga` solver for the logistic regression model.**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y
)
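A possible sketch of the grid search for the two penalized classifiers is given below. It uses $C = 1/\lambda$ as a simple convention; this mapping is an assumption, since the exact correspondence with the objectives above also depends on how scikit-learn scales the loss and the penalty. The grid of 20 values (excluding $\lambda = 0$, for which $C$ would be infinite) and the variable names are likewise illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(
    n_samples=200, n_features=30, n_informative=5, random_state=0, weights=[0.8, 0.2]
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y
)

lambdas = np.linspace(0.1, 10, 20)  # lambda = 0 is excluded (C would be infinite)
param_grid = {"C": 1.0 / lambdas}  # assumed convention: C = 1 / lambda

searches = {
    "ridge": GridSearchCV(
        LogisticRegression(penalty="l2", solver="saga", max_iter=10_000),
        param_grid,
        cv=10,
        scoring="accuracy",
    ),
    "lasso": GridSearchCV(
        LogisticRegression(penalty="l1", solver="liblinear"),
        param_grid,
        cv=10,
        scoring="accuracy",
    ),
}

best_lambda, train_acc, test_acc = {}, {}, {}
for name, search in searches.items():
    search.fit(X_train, y_train)  # the best model is refitted on all training data
    best_lambda[name] = 1.0 / search.best_params_["C"]
    train_acc[name] = accuracy_score(y_train, search.predict(X_train))
    test_acc[name] = accuracy_score(y_test, search.predict(X_test))

print("best lambda:", best_lambda)
print("training accuracy:", train_acc)
print("test accuracy:", test_acc)
```

An unregularized logistic regression fitted with the `saga` solver can be evaluated in the same way to serve as the baseline.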

As in the regression setting, ridge and LASSO logistic regression mitigate overfitting, which translates into a higher test accuracy than that of the unregularized logistic regression.

**5) For the logistic regression, ridge logistic regression and LASSO logistic regression, do:** 
- **Plot the confusion matrix of the test predictions. You may use the `confusion_matrix()` and `ConfusionMatrixDisplay()` methods.**
- **Compute the True Positive Rate, False Positive Rate, True Negative Rate, False Negative Rate, and Precision of the models on the test set.**
    - **You can access the necessary quantities out of the confusion matrix.**

We have an imbalanced dataset, with only $20\%$ of the observations in the test set belonging to the positive class. Suppose our classifier assigns every observation to the negative class. In this case, the accuracy would be $\text{Acc} = \frac{TP + TN}{P + N} = \frac{0 + 32}{40} = 0.8$. In other words, a model that only ever predicts class $0$ still reaches an accuracy of $80\%$, which is highly misleading about its true performance. The confusion matrix gives more insight into what the model is actually doing.

$TPR = \frac{TP}{TP + FN}$
* Among all the observations belonging to the positive class, the fraction that is correctly classified as positive.

$FPR = \frac{FP}{TN + FP}$
* Among all the observations belonging to the negative class, the fraction that is wrongly classified as positive.

$TNR = \frac{TN}{TN + FP}$
* Among all the observations belonging to the negative class, the fraction that is correctly classified as negative.

$FNR = \frac{FN}{FN + TP}$
* Among all the observations belonging to the positive class, the fraction that is wrongly classified as negative.

$\text{Precision} = \frac{TP}{TP + FP}$
* Among all the observations that the model classifies as positive, the fraction that is actually positive.
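The definitions above can be checked on a small worked example. The labels below mimic the test set of this lab ($32$ negatives and $8$ positives), but the predictions in `y_pred` are made up for illustration and do not come from any fitted model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Test labels with the same class balance as in the lab: 32 negatives, 8 positives.
y_true = np.array([0] * 32 + [1] * 8)

# Baseline that always predicts the negative class: accuracy is 32 / 40.
y_all_negative = np.zeros_like(y_true)
baseline_accuracy = accuracy_score(y_true, y_all_negative)

# Made-up predictions: 30 TN, 2 FP among the negatives; 3 FN, 5 TP among the positives.
y_pred = np.array([0] * 30 + [1] * 2 + [0] * 3 + [1] * 5)

# For binary labels {0, 1}, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

rates = {
    "TPR": tp / (tp + fn),
    "FPR": fp / (tn + fp),
    "TNR": tn / (tn + fp),
    "FNR": fn / (fn + tp),
    "Precision": tp / (tp + fp),
}

print("baseline accuracy:", baseline_accuracy)
print("accuracy:", accuracy_score(y_true, y_pred))
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Note that $TPR + FNR = 1$ and $TNR + FPR = 1$, so only two of the four rates carry independent information. For the all-negative baseline, the precision is undefined because $TP + FP = 0$.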


**6) For the logistic regression, ridge logistic regression, and LASSO logistic regression, do:**
- **Plot the ROC curve of the test predictions, and show the AUROC (Area Under the ROC curve) on the same figure.**
    - **Check the functions `roc_curve()` and `roc_auc_score()`.**
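A minimal sketch of the ROC plot for a single classifier is shown below. The model used here (an L2-penalized logistic regression with the default `C=1.0` and the `liblinear` solver) is only a placeholder; the same code would apply to each of the three models of the lab. Note that `roc_curve()` and `roc_auc_score()` expect scores or probabilities for the positive class, not hard class predictions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=200, n_features=30, n_informative=5, random_state=0, weights=[0.8, 0.2]
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y
)

# Placeholder model (an assumption); replace it with each fitted classifier in turn.
model = LogisticRegression(solver="liblinear").fit(X_train, y_train)

# Predicted probability of the positive class for each test observation.
scores = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)
auroc = roc_auc_score(y_test, scores)

plt.plot(fpr, tpr, label=f"Logistic regression (AUROC = {auroc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```

Plotting the curves of the three models on the same axes makes them directly comparable.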