# Machine learning model 1: linear regression (and SVM)

*May, 2022 - François HU*

*Master of Science - EPITA*

*This lecture is available here: https://curiousml.github.io/*

![image.png](attachment:image.png)

# Table of contents

- In this notebook we learn **machine learning** (or statistical learning) modeling through practice.


- See the courses *Machine Learning 1 and Machine Learning 2* for statistical learning theory.


- [Application 1: Linear regression](#1) (**lecture 2 or 3**)
    - question 1
    - question 2
    - question 3
    - question 4
    - question 5
    - question 6


- [Application 2: Ridge regression](#2) (see **lecture 4 or 5**)

    
- [Application 3: SVD and PCA](#3) (see **lecture 4**)


- [(optional) Application 4: SVM](#4)
    - question 1
    - question 2
    - question 3


- [Application 5: Titanic challenge](#5) (<font color ="red">**Exam**</font>)

## Application 1: Linear regression <a name="1"></a>

Let us implement a **linear regression** from scratch in Python. Linear regression is one of the most basic and commonly used types of predictive analysis.

### Simple linear regression

- The objective of a **simple linear regression** is establishing a linear relationship between a single input variable (denoted $\mathbf{x} = [x_1, x_2, \cdots, x_n]\in\mathbb{R}^n$) and a single output variable (denoted $\mathbf{y} = [y_1, y_2, \cdots, y_n]\in\mathbb{R}^n$). 


- For each observation $i$, we estimate $y_i$ by $\hat y_i$. The simplest form of the estimate is a linear function of $x_i$:
$$
\hat y_i = \beta_0 + x_i\beta_1 \iff \hat y_i = [1, x_i]\,\beta^T
$$
with $\beta = [\beta_0, \beta_1]\in\mathbb{R}^2$ the model parameters (or coefficients). In matrix notation:

$$
\begin{bmatrix} \hat y_1\\ \hat y_2\\ \vdots\\ \hat y_n  \end{bmatrix} = 
\begin{bmatrix}
\beta_0 + x_1\beta_1\\ 
\beta_0 + x_2\beta_1\\ 
\vdots\\ 
\beta_0 + x_n\beta_1
\end{bmatrix} \iff
\underbrace{
\quad
\begin{bmatrix} 
\hat y_1\\ \hat y_2\\ \vdots\\ \hat y_n 
\end{bmatrix} = 
\begin{bmatrix}
1 &x_1\\ 
1 &x_2\\ 
\vdots&\vdots\\
1 &x_n
\end{bmatrix}
\times
\begin{bmatrix}
\beta_0\\ 
\beta_1
\end{bmatrix}
\quad}_{\\\boxed{\hat y = X\beta^T}}
$$

### Multiple linear regression

In **multiple linear regression**,
$$
X = \begin{bmatrix}
\mathbb{1}, X_{\cdot 1}, X_{\cdot 2}, \cdots, X_{\cdot m}
\end{bmatrix} = \begin{bmatrix}
1 & x_{1,1} & x_{1,2} & \cdots & x_{1,m}\\
1 & x_{2,1} & x_{2,2} & \cdots & x_{2,m}\\
1 &  &  & \vdots & \\
1 & x_{n,1} & x_{n,2} & \cdots & x_{n,m}\\
\end{bmatrix} \text{ and }
\mathbf{y} = \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix}
$$

the estimation becomes:

$$
\hat y = \beta_0\times \mathbb{1} + \beta_1 X_{\cdot 1} + \beta_2 X_{\cdot 2} + \cdots + \beta_m X_{\cdot m} \iff \boxed{\hat y = X\beta^T}
$$

with $\beta = [\beta_0, \beta_1, \cdots, \beta_m]\in\mathbb{R}^{m+1}$ the model parameters (or coefficients).

We are trying to calibrate our parameter $\beta$ such that $\boxed{\hat y \approx \mathbf{y}}$
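As a minimal sketch of the matrix form above, the design matrix can be built by prepending a column of ones to the features, after which the prediction is a single matrix product. The toy values, the coefficients and the names `features`, `X_design` and `beta` are illustrative assumptions, not part of the exercise data.

```python
import numpy as np

# Toy inputs: n = 4 observations, m = 2 features (made-up values).
features = np.array([[0.5, 1.0],
                     [1.5, 0.0],
                     [2.0, 2.0],
                     [3.0, 1.0]])

# Prepend a column of ones so that beta_0 plays the role of the intercept.
X_design = np.column_stack([np.ones(len(features)), features])

# Hypothetical coefficients [beta_0, beta_1, beta_2].
beta = np.array([1.0, 2.0, -1.0])

# Matrix form of the estimate: one dot product per row of X_design.
y_hat = X_design @ beta
print(y_hat)  # [1. 4. 3. 6.]
```

With a one-dimensional NumPy array for `beta`, the product `X_design @ beta` does not need an explicit transpose.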

### Diabetes dataset

We consider the [diabetes](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset) dataset. More specifically, we study three variables:

- `bmi` (our $X_{\cdot 1}$) the body mass index of the population

- `bp` (our $X_{\cdot 2}$) the average blood pressure of the population

- and `target` (our $\mathbf{y}$) our output variable representing a quantitative measure of disease progression one year after baseline

```python
# Execute this cell
from sklearn.datasets import load_diabetes
import pandas as pd
import numpy as np

# Fix the seed so that the shuffle below is reproducible
np.random.seed(42)
diabete = load_diabetes(as_frame=True)

# Keep two features (bmi, bp) and the target, then shuffle the rows
data = pd.concat((diabete["data"][["bmi", "bp"]], diabete["target"]), axis=1)
data = data.sample(frac=1)

# First 300 rows: training set
X = data[["bmi", "bp"]][:300]
y = data["target"][:300]

# Remaining rows: test set
X_test = data[["bmi", "bp"]][300:]
y_test = data["target"][300:]

data.head()
```

|     | bmi       | bp        | target |
|----:|----------:|----------:|-------:|
| 287 | -0.006206 | -0.015999 | 219.0  |
| 211 | 0.036907  | 0.021872  | 70.0   |
| 72  | -0.00405  | -0.012556 | 202.0  |
| 321 | 0.051996  | 0.079254  | 230.0  |
| 73  | -0.020218 | -0.002228 | 111.0  |



#### Question 1:

Plot the following three graphs:

![image-4.png](attachment:image-4.png)

### Prediction

Given a parameter $\beta$ and an observation $x$ (with a leading $1$ for the intercept), the predicted value is:

$$
\hat y = x \beta^T
$$

#### Question 2:

Define the function `predict(X, beta)` that returns the prediction $\hat y$.

### Training / optimization

We want to find the best parameter $\hat\beta = [\hat\beta_0, \hat\beta_1, \hat\beta_2] \in\mathbb{R}^3$ that minimizes a **loss function**, for instance the **mean absolute error** (MAE),

$$
\min\limits_{\beta\in\mathbb{R}^3} \dfrac{1}{n}\sum\limits_{i=1}^{n}\left| y_i - \hat y_i \right|
$$

or a **mean squared error** (MSE) function,
$$
\min\limits_{\beta\in\mathbb{R}^3} \dfrac{1}{n}\sum\limits_{i=1}^{n}\left( y_i - \hat y_i \right)^2
\quad\text{ equivalently }\quad
\min\limits_{\beta\in\mathbb{R}^3} \left\lvert\left\lvert\ \mathbf{y} - \hat y\ \right\lvert\right\lvert^2
$$
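The two losses above can be compared on a small example. The sketch below is one possible NumPy implementation. The name `mae` and the toy arrays `y_true` and `y_pred` are assumptions made for illustration.

```python
import numpy as np

def mae(y_hat, y):
    """Mean absolute error between predictions y_hat and targets y."""
    return np.mean(np.abs(np.asarray(y) - np.asarray(y_hat)))

def mse(y_hat, y):
    """Mean squared error between predictions y_hat and targets y."""
    return np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)

# Made-up targets and predictions; the residuals are 1, 0, -2 and 3.
y_true = np.array([3.0, 5.0, 2.0, 8.0])
y_pred = np.array([2.0, 5.0, 4.0, 5.0])

print(mae(y_pred, y_true))  # 1.5
print(mse(y_pred, y_true))  # 3.5
```

The largest residual (3) contributes 9 of the 14 units in the sum of squares but only 3 of the 6 units in the sum of absolute values, which illustrates why the MSE is more sensitive to large errors than the MAE.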



#### Question 3:

Define the function `mse(y_hat, y)` that returns the MSE of $\hat y$ and $\mathbf{y}$.

Given a dataset $(X, y)$, optimization algorithms are used to find an optimal set of parameters (e.g. $\hat\beta_0$ and $\hat\beta_1$) that minimizes a loss function (e.g. MSE).

#### Question 4:

- Compute (theoretically) the partial derivatives $\dfrac{\partial \text{mse}}{\partial \beta_i}$ for $i\in\{0, 1, 2\}$. 


- Define (in python) the function `mse_gradient(y_hat, y)` that returns the gradient of the MSE.


- Define a function named `fit(X, y)` that minimizes the loss function MSE by gradient descent (see lecture 2) and returns the best parameter $\hat\beta$.
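One way the gradient descent loop could look is sketched below, under several assumptions: the design matrix already contains a column of ones, the gradient function takes `(X, y, beta)` rather than the `(y_hat, y)` signature asked for above, and the step size and iteration count are arbitrary choices that would likely need tuning on the diabetes data. The synthetic data and `true_beta` are made up for the check.

```python
import numpy as np

def predict(X, beta):
    """Linear prediction for a design matrix X and coefficient vector beta."""
    return X @ beta

def mse_gradient(X, y, beta):
    """Gradient of the MSE with respect to beta: (2/n) * X^T (X beta - y)."""
    n = len(y)
    return (2.0 / n) * X.T @ (predict(X, beta) - y)

def fit(X, y, learning_rate=0.1, n_iter=5000):
    """Plain gradient descent on the MSE, starting from beta = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta = beta - learning_rate * mse_gradient(X, y, beta)
    return beta

# Synthetic check: recover known coefficients from nearly noiseless data.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
X_demo = np.column_stack([np.ones(200), features])
true_beta = np.array([1.0, 2.0, -3.0])
y_demo = X_demo @ true_beta + 0.01 * rng.normal(size=200)

beta_hat = fit(X_demo, y_demo)
print(beta_hat)
```

On this synthetic data `beta_hat` should land close to `true_beta`. A learning rate that is too large makes the iterates diverge, and one that is too small makes convergence slow.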

#### Question 5:

- Fit your linear regression on the data $(X, \mathbf{y})$.


- Predict the response (call it `y_hat`) of the set `X` and evaluate the model's performance with MSE.


- Predict the response (call it `y_hat_test`) of the set `X_test` and evaluate the model's performance with MSE.


- Which performance should be taken into account: the one computed with `y_hat`, the one computed with `y_hat_test`, or both? Comment.

### A more direct method

The parameter $\hat\beta \in\mathbb{R}^d$ that minimizes the **mean squared error** (MSE),

$$
\min\limits_{\beta\in\mathbb{R}^d} \left\lvert\left\lvert\ \mathbf{y} - X\beta^T\ \right\lvert\right\lvert^2
$$

can be found analytically, provided $X^TX$ is invertible:
$$
\min\left\lvert\left\lvert\ \mathbf{y} - X\beta^T\ \right\lvert\right\lvert^2
\iff
{\displaystyle ( {X} ^{\mathsf {T}} {X} ){\hat { {\beta }}}= {X} ^{\mathsf {T}}\mathbf {y}}
\iff
\boxed{{\widehat {\beta }}=(X^{T}X)^{-1}X^{T}\mathbf{y}}
$$
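The closed form above can be sketched in NumPy by solving the linear system $(X^TX)\hat\beta = X^T\mathbf{y}$ rather than forming the inverse explicitly, which is usually the more stable choice numerically. The synthetic data and the names `X_demo`, `y_demo` and `true_beta` are illustrative assumptions. `np.linalg.lstsq` is used only as a cross-check.

```python
import numpy as np

# Synthetic data with known coefficients (made-up values).
rng = np.random.default_rng(1)
n = 100
features = rng.normal(size=(n, 2))
X_demo = np.column_stack([np.ones(n), features])  # column of ones for the intercept
true_beta = np.array([0.5, -1.0, 2.0])
y_demo = X_demo @ true_beta + 0.1 * rng.normal(size=n)

# Normal equations: solve (X^T X) beta = X^T y without inverting X^T X.
beta_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Cross-check with NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(beta_normal)
print(beta_lstsq)
```

Both estimates should agree up to floating-point error and lie close to `true_beta`.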

#### Question 6:

Implement this solution and compare it with the parameter found in question 5.

## Application 2: Ridge regression <a name="2"></a> (lecture 4 or 5)

Coming soon.

## Application 3: SVD and PCA <a name="3"></a> (lecture 4)

Coming soon.

## (optional) Application 4: Support Vector Machines (SVM) <a name="4"></a>

### SVM: Visualization and intuition

Notes:

- space of correct hyperplanes



- margin



- optimal hyperplane: maximizing the margin



- intuition about the following expression: $\sum\limits_{i=1}^{n}\max(0, 1-y_i(w^Tx_i+b))$

### Optimization problem (Tikhonov version)

The SVM prediction function is the solution of
$$
\min\limits_{w\in\mathbb{R}^d, b\in\mathbb{R}} \left\{ \frac{1}{2}\lvert\lvert w \lvert\lvert^2 + \frac{c}{n}\sum\limits_{i=1}^{n}\max(0, 1-y_i(w^Tx_i+b))  \right\}
$$

- Unconstrained optimization


- Not differentiable because of the max


- Let us reformulate into a differentiable problem.
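To build intuition for the hinge term $\max(0, 1-y_i(w^Tx_i+b))$, the sketch below evaluates it point by point on a toy dataset and then assembles the objective, using the $\frac{c}{n}$ weighting of the constrained formulation in this section. The function names `hinge_losses` and `svm_objective`, the data and the choice of $w$, $b$ and $c$ are all illustrative assumptions.

```python
import numpy as np

def hinge_losses(w, b, X, y):
    """Per-observation hinge loss max(0, 1 - y_i (w^T x_i + b))."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)

def svm_objective(w, b, X, y, c=1.0):
    """Regularized objective: 0.5 * ||w||^2 + (c / n) * sum of hinge losses."""
    n = len(y)
    return 0.5 * w @ w + (c / n) * hinge_losses(w, b, X, y).sum()

# Toy data with labels in {-1, +1}; only the first coordinate matters for this w.
X_toy = np.array([[2.0, 0.0],    # correct side, outside the margin
                  [0.5, 0.0],    # correct side, inside the margin
                  [-1.0, 0.0],   # correct side, exactly on the margin
                  [0.2, 0.0]])   # wrong side of the hyperplane
y_toy = np.array([1.0, 1.0, -1.0, -1.0])
w_toy = np.array([1.0, 0.0])
b_toy = 0.0

losses = hinge_losses(w_toy, b_toy, X_toy, y_toy)
objective = svm_objective(w_toy, b_toy, X_toy, y_toy, c=1.0)
print(losses)
print(objective)
```

Points that are correctly classified with a margin of at least 1 contribute nothing, points inside the margin contribute a small penalty, and misclassified points contribute a penalty larger than 1.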

### From unconstrained to constrained optimization

The SVM optimization is equivalent to:
\begin{align*}
\min \quad & \dfrac{1}{2}\lvert\lvert w \lvert\lvert^2 + \dfrac{c}{n}\sum\limits_{i=1}^{n} \xi_i \\
\text{subject to} \quad & \xi_i \geq \max(0, 1-y_i(w^Tx_i+b)) \quad \text{for all i}
\end{align*}

Equivalently
\begin{align*}
\min \quad & \dfrac{1}{2}\lvert\lvert w \lvert\lvert^2 + \dfrac{c}{n}\sum\limits_{i=1}^{n} \xi_i \\
\text{subject to} \quad & -\xi_i \leq 0\quad \text{for all i}\\
& 1-y_i(w^Tx_i+b)-\xi_i \leq 0 \quad \text{for all i}
\end{align*}

- Differentiable function



- Quadratic programming (can be solved numerically)
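The equivalence between the two formulations rests on one observation: for a fixed $(w, b)$, the smallest slack satisfying both constraints is exactly the hinge loss. The sketch below checks this numerically on made-up data. The helper names `is_feasible` and `constrained_objective` and all values are assumptions for illustration.

```python
import numpy as np

def is_feasible(w, b, xi, X, y, tol=1e-12):
    """Check -xi_i <= 0 and 1 - y_i (w^T x_i + b) - xi_i <= 0 for all i."""
    margin_ok = 1.0 - y * (X @ w + b) - xi <= tol
    return bool(np.all(xi >= -tol) and np.all(margin_ok))

def constrained_objective(w, xi, c):
    """Objective of the constrained problem: 0.5 * ||w||^2 + (c / n) * sum(xi)."""
    return 0.5 * w @ w + (c / len(xi)) * xi.sum()

X_toy = np.array([[2.0, 1.0], [0.5, -1.0], [-1.0, 0.0], [0.2, 0.3]])
y_toy = np.array([1.0, 1.0, -1.0, -1.0])
w, b, c = np.array([1.0, 0.0]), 0.0, 1.0

# Smallest feasible slack: the hinge loss of each observation.
xi_min = np.maximum(0.0, 1.0 - y_toy * (X_toy @ w + b))

xi_loose = xi_min + 0.3  # still feasible, but more expensive
xi_tight = xi_min - 0.1  # cheaper, but violates the constraints

print(is_feasible(w, b, xi_min, X_toy, y_toy), constrained_objective(w, xi_min, c))
print(is_feasible(w, b, xi_loose, X_toy, y_toy), constrained_objective(w, xi_loose, c))
print(is_feasible(w, b, xi_tight, X_toy, y_toy))
```

Since any feasible slack is at least the hinge loss, minimizing over $\xi$ recovers the unconstrained objective for each $(w, b)$.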

### Lagrange multipliers

The primal problem is therefore:

\begin{align*}
\min \quad & \dfrac{1}{2}\lvert\lvert w \lvert\lvert^2 + \dfrac{c}{n}\sum\limits_{i=1}^{n} \xi_i \\
\text{subject to} \quad & -\xi_i \leq 0\quad \text{for all i}\\
& 1-y_i(w^Tx_i+b)-\xi_i \leq 0 \quad \text{for all i}
\end{align*}

and the associated Lagrangian, with multipliers $\alpha_i \geq 0$ and $\lambda_i \geq 0$, is:
\begin{align*}
\mathcal{L}(w, b, \xi, \alpha, \lambda) &= \frac{1}{2}\lvert\lvert w \lvert\lvert^2 + \dfrac{c}{n}\sum\limits_{i=1}^{n}\xi_i  + \sum\limits_{i=1}^{n}\alpha_i (1-y_i(w^Tx_i+b)-\xi_i) - \sum\limits_{i=1}^{n} \lambda_i\xi_i\\
& = \frac{1}{2} w^Tw + \sum\limits_{i=1}^{n}\xi_i\left( \frac{c}{n}-\alpha_i-\lambda_i \right) + \sum\limits_{i=1}^{n}\alpha_i \left( 1-y_i(w^Tx_i+b) \right)
\end{align*}




## (Optional) Questions

- **Question 1:** show that **strong duality** holds.


- **Question 2:** given that fact, give the expression of the optimal points $w^*$, $b^*$ and $\xi^*$ in terms of the Lagrange multipliers and the data.


- **Question 3:** deduce the new (simplified) SVM dual problem.

## Application 5: Titanic challenge <a name="5"></a>

[Titanic challenge](https://www.kaggle.com/c/titanic)