# Supervised learning
```
TO-DO: 
1 - Complete and integrate the notes about linear regression (Data Driven System Modelling)
```

# Linear regression
Linear regression is a supervised learning problem: given a set of data points, find the line (or, in higher dimensions, the hyperplane) that best fits them.

We start by considering the simple case of $m$ points $(x_i, y_i)$ in 2D space, and we look for a linear regressor that fits the data well. That regressor has the form:
$$\hat{y} = \theta_1x + \theta_2$$
Our goal is to find the $\left[\theta_1, \theta_2\right]^T$ that minimizes the mean squared error:
$$\text{Error} = \frac{1}{2m}\sum_{i=1}^{m}{(y_i-\hat{y}_i)^2}$$
To minimize that quantity, we can use the gradient descent method, where $\alpha$ is the learning rate:
$$\theta_i \leftarrow \theta_i - \alpha\frac{\partial}{\partial \theta_i}\text{Error}$$
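The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `fit_line_gd`, the learning rate, the iteration count, and the noise-free data on the line $y = 2x + 1$ are all assumptions made for the example.

```python
import numpy as np

def fit_line_gd(x, y, alpha=0.5, n_iters=2000):
    """Fit y_hat = theta1 * x + theta2 by batch gradient descent (hypothetical helper)."""
    m = len(x)
    theta1, theta2 = 0.0, 0.0
    for _ in range(n_iters):
        residual = (theta1 * x + theta2) - y   # y_hat - y for every point
        grad1 = (residual * x).sum() / m       # d Error / d theta1
        grad2 = residual.sum() / m             # d Error / d theta2
        theta1 -= alpha * grad1
        theta2 -= alpha * grad2
    return theta1, theta2

# Assumed toy data: 50 noise-free points on the line y = 2x + 1
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0
theta1, theta2 = fit_line_gd(x, y)
print(theta1, theta2)
```

On this toy data the estimates should approach the slope and intercept used to generate the points. On real data the learning rate usually needs tuning.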

There are two basic ways of applying these updates, where the squared (or absolute) trick is the update step for the squared (or absolute) error:
- **Stochastic**: apply the squared (or absolute) trick to the points in our data one at a time, and repeat this process many times.
- **Batch**: apply the squared (or absolute) trick to all the points in our data at the same time, and repeat this process many times.

If our data is huge, both are computationally slow. In practice, the best way to do linear regression is to split the data into many small batches, each with roughly the same number of points, and then use each batch in turn to update the weights. This is called mini-batch gradient descent.
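As a rough sketch of the mini-batch idea, the loop below shuffles the data every epoch and updates the parameters once per batch. The function name `minibatch_gd`, the batch size, the learning rate, and the toy data are assumptions chosen for illustration only.

```python
import numpy as np

def minibatch_gd(x, y, alpha=0.5, batch_size=10, n_epochs=500, seed=0):
    """Fit y_hat = theta1 * x + theta2 with mini-batch gradient descent (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    m = len(x)
    theta1, theta2 = 0.0, 0.0
    for _ in range(n_epochs):
        order = rng.permutation(m)                   # reshuffle the points each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]    # indices of the current batch
            residual = (theta1 * x[idx] + theta2) - y[idx]
            grad1 = (residual * x[idx]).mean()       # gradient estimated on the batch only
            grad2 = residual.mean()
            theta1 -= alpha * grad1
            theta2 -= alpha * grad2
    return theta1, theta2

# Assumed toy data: 50 noise-free points on the line y = 2x + 1
xs = np.linspace(0.0, 1.0, 50)
ys = 2.0 * xs + 1.0
mb_theta1, mb_theta2 = minibatch_gd(xs, ys)
print(mb_theta1, mb_theta2)
```

Setting `batch_size=1` gives the stochastic variant, and `batch_size=len(x)` gives the batch variant.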

```python
# Load imports
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.linear_model import LinearRegression
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston
```

```python
bmi_life_data = pd.read_csv("datasets/bmi_and_life_expectancy.csv")

# Setup and fit the model with data
bmi_life_model = LinearRegression()
bmi_life_model.fit(bmi_life_data[['BMI']], bmi_life_data[['Life expectancy']])

# Predict life expectancy for a BMI value of 21.07931
laos_life_exp = bmi_life_model.predict([[21.07931]])
print('Life expectancy for a BMI of 21.07931 is equal to {}'.format(laos_life_exp[0][0]))
```

```
Life expectancy for a BMI of 21.07931 is equal to 60.315647163993056
```


## Higher dimensions
In an $n$-dimensional space, with input variables $x_1,x_2,\dots,x_{n-1}$, the prediction is an $(n-1)$-dimensional hyperplane:

$$\hat{y}=\theta_1x_1+\theta_2x_2+\dots+\theta_{n-1}x_{n-1}+\theta_n$$

```python
# Load the data from the boston house-prices dataset
# (load_boston was removed in scikit-learn 1.2, so this cell needs an older version)
boston_data = load_boston()
X = boston_data['data']
y = boston_data['target']

# Fit the model and assign it to the model variable
model = LinearRegression()
model.fit(X, y)

# Make a prediction using the model
sample_house = [[2.29690000e-01, 0.00000000e+00, 1.05900000e+01, 0.00000000e+00, 4.89000000e-01,
                6.32600000e+00, 5.25000000e+01, 4.35490000e+00, 4.00000000e+00, 2.77000000e+02,
                1.86000000e+01, 3.94870000e+02, 1.09700000e+01]]
# Predict housing price for the sample_house
prediction = model.predict(sample_house)
```


Why don't we simply compute the derivatives of the error function and set them to zero, obtaining a system of linear equations in the variables $\theta_i$? With $n$ variables, we would have to solve a linear system of $n$ equations in $n$ unknowns. With a direct solver this costs roughly $O(n^3)$ operations, which becomes very expensive when $n$ is large. For this reason, gradient descent is a good alternative.
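For small $n$ the closed-form route is perfectly workable. The sketch below sets the derivatives to zero, which gives the linear system $(X^TX)\theta = X^Ty$, and solves it with NumPy. The synthetic data, the variable names, and the choice of putting the intercept in the last column (to match $\theta_n$ above) are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_features = 200, 3

# Assumed synthetic inputs; a trailing column of ones carries the intercept theta_n
X_raw = rng.normal(size=(m, n_features))
X_aug = np.hstack([X_raw, np.ones((m, 1))])

# Assumed "true" parameters, with the last entry being the intercept
true_theta = np.array([1.5, -2.0, 0.5, 4.0])
y_exact = X_aug @ true_theta          # noise-free targets

# Solve the n x n system (X^T X) theta = X^T y directly
theta_closed = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_exact)
print(theta_closed)
```

Solving the system directly is the step whose cost grows quickly with $n$, which is where an iterative method such as gradient descent becomes attractive.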

## Regularization
Consider two models in the variables $x_1$ and $x_2$: a simpler, linear one, and a more complex, polynomial one:

$$3x_1+4x_2+5=0$$

$$2x_1^3-2x_1^2x_2-4x_2^3+3x_1^2+6x_1x_2+4x_2^2+5=0$$

### L1 Regularization
L1 regularization consists of taking the absolute values of the coefficients and adding them to the error of our model. For the two models above, the added terms are as follows (the constant term is not penalized):

$$\text{Error} = |3| + |4| = 7$$

$$\text{Error} = |2| + |-2| + |-4| + |3| + |6| + |4| = 21$$

- L1 regularization is computationally inefficient unless the data is sparse.
- It performs feature selection, because it tends to drive some coefficients to exactly zero.
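The feature-selection effect can be illustrated with scikit-learn's `Lasso`, which fits a linear model with an L1 penalty. This is only a sketch on assumed synthetic data: two of the five features are deliberately given a true coefficient of zero, and the penalty strength `alpha=0.5` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Assumed synthetic data: features 2 and 3 have no influence on the target
X_syn = rng.normal(size=(200, 5))
coef_true = np.array([3.0, -2.0, 0.0, 0.0, 4.0])
y_syn = X_syn @ coef_true + rng.normal(scale=0.5, size=200)

# alpha is scikit-learn's name for the strength of the L1 penalty
lasso = Lasso(alpha=0.5)
lasso.fit(X_syn, y_syn)

print(lasso.coef_)                                   # fitted coefficients
print("selected features:", np.flatnonzero(lasso.coef_))
```

The features whose fitted coefficient is exactly zero are effectively dropped from the model.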

### L2 Regularization
L2 regularization consists of taking the squares of the coefficients and adding them to the error of our model. For the same two models, the added terms are:

$$\text{Error} = 3^2 + 4^2 = 25$$

$$\text{Error} = 2^2 + (-2)^2 + (-4)^2 + 3^2 + 6^2 + 4^2 = 85$$
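The L1 and L2 sums for the two example models can be checked numerically. The helper names `l1_penalty` and `l2_penalty` are made up for this sketch, and the constant term 5 is left out, as in the formulas above.

```python
import numpy as np

# Coefficients of the two example models, without the constant term
simple_coefs = np.array([3, 4])
complex_coefs = np.array([2, -2, -4, 3, 6, 4])

def l1_penalty(coefs):
    """Sum of absolute values of the coefficients."""
    return int(np.abs(coefs).sum())

def l2_penalty(coefs):
    """Sum of squared coefficients."""
    return int((coefs ** 2).sum())

print(l1_penalty(simple_coefs), l1_penalty(complex_coefs))   # 7 21
print(l2_penalty(simple_coefs), l2_penalty(complex_coefs))   # 25 85
```

In both cases the more complex model receives the larger penalty, and the squared version widens the gap.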

Some applications require a small error in the model, so a complex model is acceptable and the *punishment* on complexity should be small. In other cases simplicity is required, so we can accept some error in our model and the *punishment* on complexity should be large. The strength of that *punishment* is set by the $\lambda$ parameter, which multiplies the regularization term before it is added to the error.

- L2 regularization is computationally efficient and works better for non-sparse data (values spread roughly uniformly across the columns).
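The effect of $\lambda$ can be sketched with scikit-learn's `Ridge`, which fits a linear model with an L2 penalty and calls the penalty strength `alpha`. The synthetic data and the three values tried below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Assumed synthetic regression data
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0, 4.0]) + rng.normal(scale=0.5, size=100)

# Fit the same model with a weak, medium, and strong penalty
lambdas = [0.01, 1.0, 100.0]
coef_norms = []
for lam in lambdas:
    ridge = Ridge(alpha=lam)          # alpha plays the role of lambda here
    ridge.fit(X_demo, y_demo)
    coef_norms.append(np.linalg.norm(ridge.coef_))

for lam, norm in zip(lambdas, coef_norms):
    print(f"lambda={lam}: coefficient norm = {norm:.3f}")
```

A larger `alpha` pulls the coefficients toward zero, trading a worse fit on the training data for a simpler model.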