# Regularization
<hr style="border:2px solid black">

## 1. Introduction

### 1.1 What is regularization?

* Regularization is a set of techniques that constrain the complexity of models. They thus reduce the risk of overfitting and improve the ability of a model to generalize. In other words, we can use models that are almost like Linear Regression but with a different loss function.

**GOAL OF USING REGULARIZATION MODELS**: Reduce overfitting and make our model more generalizable.

### 1.2 When to regularize:
* To reduce overfitting - check if you are overfitting via the usual methods:
    * cross validation
    * train and validation scores that differ noticeably
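
As a rough sketch of both checks at once, `cross_validate` can return the train score next to the validation score for every fold. The data (`X_demo`, `y_demo`) and the degree-10 pipeline below are assumptions made up for illustration, not part of the lesson's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# hypothetical data for illustration: a noisy square-root curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 60, size=(40, 1))
y_demo = np.sqrt(X_demo[:, 0]) + rng.normal(0, 0.5, size=40)

# a deliberately flexible model: degree-10 polynomial regression
flexible = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    LinearRegression(),
)

# 5-fold cross validation, keeping the train score of each fold as well
cv = cross_validate(flexible, X_demo, y_demo, cv=5, return_train_score=True)
mean_train = cv["train_score"].mean()
mean_val = cv["test_score"].mean()

print(f"mean train R^2:      {mean_train:.3f}")
print(f"mean validation R^2: {mean_val:.3f}")
```

A mean train score that sits well above the mean validation score is the signal to look for.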

### 1.3 How to regularize:
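* Perform the usual ML workflow
* Be sure to scale (standardize) your features before fitting (`sklearn.preprocessing.StandardScaler`), because the penalty depends on the size of each weight and therefore on the scale of each feature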
* Now use a **regularization** model instead of a normal linear regression model
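
One possible way to wire these steps together is a scikit-learn pipeline, so that scaling is always applied before the regularized model. The two features and their scales below are hypothetical and only serve as an illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical data: two features on very different scales
rng = np.random.default_rng(1)
X_demo = np.column_stack([
    rng.normal(0, 1, 50),     # feature with standard deviation ~1
    rng.normal(0, 1000, 50),  # feature with standard deviation ~1000
])
y_demo = 3.0 * X_demo[:, 0] + 0.002 * X_demo[:, 1] + rng.normal(0, 0.1, 50)

# scale first, then fit the regularized model
reg = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
reg.fit(X_demo, y_demo)

# coefficients are expressed in units of the standardized features
coefs = reg.named_steps["ridge"].coef_
print("coefficients on scaled features:", coefs)
print("R^2 on the training data:", reg.score(X_demo, y_demo))
```

Keeping the scaler inside the pipeline also means it is re-fitted on the training part only when the pipeline is used in cross validation.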

## 2. The maths 

### 2.1 LOSS FUNCTION 
#### Residual Sum of Squares (RSS) is just Mean Squared Error without the Mean!

$$
MSE = \frac{1}{N}\sum_{i=1}^N (y_i - \hat{y}_i)^2 
$$

$$
RSS = \sum_{i=1}^N (y_i - \hat{y}_i)^2 
$$


#### Which we can rewrite by substituting the linear regression equation in for yhat

$ \hat{y} = w_0 + [w_1x_1 + ... + w_Mx_M] $ 

$ w_1, ..., w_M = slopes (weights), w_0 = intercept $

$$
\hat{y}_i = (w_0 + \sum_{j=1}^M w_j x_{ij})
$$

$$
RSS = \sum_{i=1}^N (y_i - (w_0 + \sum_{j=1}^M w_j x_{ij}))^2 
$$

*`j=1 -> M` - no. of features (cols in the dataframe)*

*`i=1 -> N` - no. of data points (rows in the dataframe)*
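
The formulas can be checked by hand on a tiny example. The numbers below (4 data points, 2 features, and the weights) are made up purely for illustration.

```python
import numpy as np

# hypothetical data: N = 4 rows, M = 2 features
X_small = np.array([
    [1.0, 2.0],
    [2.0, 0.0],
    [3.0, 1.0],
    [0.0, 1.0],
])
y_small = np.array([2.0, 3.0, 4.0, 2.0])

# hypothetical weights: intercept w_0 and one slope per feature
w0 = 0.5
w = np.array([1.0, 0.5])

# y_hat_i = w_0 + sum_j w_j * x_ij, computed for all rows at once
y_hat = w0 + X_small @ w

residuals = y_small - y_hat
rss = np.sum(residuals ** 2)  # sum of squared residuals
mse = rss / len(y_small)      # same sum, divided by N

print("y_hat:", y_hat)
print("RSS:", rss)
print("MSE:", mse)
```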

## Regularisation means adding an extra penalty to RSS

### 2.2 Ridge regression 


* Add a penalising term based on the **square of the weights**, which shrinks the weights towards zero

* Controlled by a regularising term `alpha`

* `alpha` is a **hyperparameter** that we set when we instantiate the model - if alpha is zero, Ridge becomes vanilla Linear Regression

* Because the weights are squared, a feature with a large weight creates a MASSIVE Ridge penalty, so large weights are punished much more than small ones

* How does Ridge handle this - it reduces the coefficient for that X feature to a lower number

* Ridge is also called `L2` regularization


$$
Ridge = \sum_{i=1}^N (y_i - (w_0 + \sum_{j=1}^M w_j x_{ij}))^2  + \sum_{j=1}^M \alpha w_j^2
$$
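
A minimal sketch of the Ridge loss, written out with NumPy and compared against `sklearn.linear_model.Ridge`. The helper `ridge_loss` and the data are assumptions for illustration; the intercept is left unpenalized, as in the formula.

```python
import numpy as np
from sklearn.linear_model import Ridge


def ridge_loss(X, y, w0, w, alpha):
    """RSS plus alpha times the sum of squared weights (intercept not penalized)."""
    residuals = y - (w0 + X @ w)
    return np.sum(residuals ** 2) + alpha * np.sum(w ** 2)


# hypothetical data: 30 rows, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 3))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.3, size=30)

alpha = 10.0
ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)

# the fitted weights should give a lower loss than slightly nudged weights
loss_fitted = ridge_loss(X_demo, y_demo, ridge.intercept_, ridge.coef_, alpha)
loss_nudged = ridge_loss(X_demo, y_demo, ridge.intercept_, ridge.coef_ + 0.1, alpha)
print("loss at fitted weights:", loss_fitted)
print("loss at nudged weights:", loss_nudged)

# the overall size of the weights shrinks as alpha grows
alphas = (0.1, 1, 10, 100, 1000)
norms = [np.linalg.norm(Ridge(alpha=a).fit(X_demo, y_demo).coef_) for a in alphas]
for a, n in zip(alphas, norms):
    print(f"alpha={a:>6}: |w| = {n:.4f}")
```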

### 2.3 Lasso regression 

* Add a penalising term based on the **absolute value of the weights**, which shrinks the weights towards zero

* Controlled by a regularising term `alpha`

* Tends to result in the coefficients for many features becoming zero (in ridge they become close to zero, but tend not to be zero)

* Lasso is also called `L1` regularization

$$
Lasso = RSS + \sum_{j=1}^M \alpha \vert w_j \vert
$$
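
To see the difference in sparsity, here is a small sketch on made-up data where only the first 2 of 10 features carry signal. Note that scikit-learn's `Lasso` scales the RSS term by `1 / (2 * N)`, so its `alpha` is not on the same scale as in the formula above or as the `alpha` of `Ridge`; the value 0.5 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# hypothetical data: 10 features, only the first two are informative
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 0.5, size=100)

lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge = Ridge(alpha=0.5).fit(X_demo, y_demo)

# count coefficients that are exactly zero
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("lasso coefficients:", np.round(lasso.coef_, 3))
print("ridge coefficients:", np.round(ridge.coef_, 3))
print("exact zeros (lasso):", n_zero_lasso)
print("exact zeros (ridge):", n_zero_ridge)
```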

### 2.4 Mix Lasso and Ridge with `ElasticNet`
* Combine L1 (Lasso) and L2 (Ridge) by setting the `l1_ratio`
* `l1_ratio = lasso / (lasso + ridge)`, so `l1_ratio=1` is a pure Lasso penalty and `l1_ratio=0` is a pure L2 penalty
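
A short sketch of how `l1_ratio` mixes the two penalties in scikit-learn's `ElasticNet`, whose penalty is `alpha * l1_ratio * sum(|w|) + 0.5 * alpha * (1 - l1_ratio) * sum(w^2)`. The helper `elastic_net_penalty` and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso


def elastic_net_penalty(w, alpha, l1_ratio):
    """Penalty term in the form used by scikit-learn's ElasticNet."""
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2


# hypothetical data: 5 features, two of them informative
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 5))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(0, 0.3, size=80)

# l1_ratio=1.0 means "only the L1 part", which is the Lasso
enet_as_lasso = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)

# l1_ratio=0.5 mixes both penalties
enet_mixed = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_demo, y_demo)

print("ElasticNet (l1_ratio=1.0):", np.round(enet_as_lasso.coef_, 3))
print("Lasso:                    ", np.round(lasso.coef_, 3))
print("ElasticNet (l1_ratio=0.5):", np.round(enet_mixed.coef_, 3))
```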

### Additional Reading

[StatQuest on Lasso](https://www.youtube.com/watch?v=NGf0voTMlcs)  
[StatQuest on Ridge](https://www.youtube.com/watch?v=Q81RR3yKn30)  
[StatQuest on ElasticNet](https://www.youtube.com/watch?v=1dKRdX9bfIo)  

<hr style="border:2px solid black">

## 2. Let's implement!

In [None]:
# data analysis and visualization stack
import numpy as np
import matplotlib.pyplot as plt

# machine learning stack
from sklearn.linear_model import LinearRegression

Create data following $\sqrt{x}$ 

In [None]:
# specify a random state
np.random.seed(13)

In [None]:
# create a data set fluctuating around the square root of x
X=np.arange(1,60, 5) # from 1 to 56 in steps of 5
y=[np.sqrt(xi)+np.random.normal(0, 0.5) for xi in X]

In [None]:
X

In [None]:
y

In [None]:
plt.scatter(X,y)

In [None]:
X

In [None]:
X.shape

In [None]:
# reshape X to a 2D array for later use in sklearn models
X=X.reshape(-1,1)
X.shape

In [None]:
X

### Underfitting (high bias)

In [None]:
# fit linear regression on the data
model=LinearRegression()
model.fit(X,y)

In [None]:
# predict y by lr model
y_pred=model.predict(X)

In [None]:
# plot both linear regression line and the original data
plt.scatter(X,y)
plt.plot(X, y_pred)

### Underfitting:

To see if you have an underfitting model, compare the scores:       
`model.score(X_train, y_train)`    
`model.score(X_test, y_test)`
     
If both scores are weak, you probably have an underfit situation.
     
How could it happen?
 * Model too simple for the data
 * Small data sets 
 * Weak feature engineering
     * Too few features
     * Uninformative features
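
As an illustration of the score comparison, the sketch below fits a straight line to made-up quadratic data, a case where the model is too simple. The data and the split are assumptions, not the lesson's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# hypothetical data: y depends on x squared, which a straight line cannot capture
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

line = LinearRegression().fit(X_train, y_train)
train_score = line.score(X_train, y_train)
test_score = line.score(X_test, y_test)

# both scores are weak: a sign of underfitting
print(f"train R^2: {train_score:.3f}")
print(f"test R^2:  {test_score:.3f}")
```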

### Overfit

In [None]:
# additional packages from sklearn for adding more terms to the equation
from sklearn.preprocessing import PolynomialFeatures

In [None]:
# convert the feature matrix to a polynomial form with the degree of 10
poly=PolynomialFeatures(degree=10, include_bias=False)
X_poly=poly.fit_transform(X)

In [None]:
X_poly.shape

In [None]:
X_poly

In [None]:
# fit the linear regression model on X_poly
model=LinearRegression()
model.fit(X_poly,y)
y_pred_poly=model.predict(X_poly)

In [None]:
# plot the fitted line and original data
plt.scatter(X,y)
plt.plot(X,y_pred_poly)

### Overfitting: 
To see if you got an overfitting model, compare the scores:     
 `model.score(X_train, y_train)`    
 `model.score(X_test, y_test)`
     
If the train score is exceptionally good and the test score is weak, you probably have an overfit situation
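
The same comparison can be sketched for a very flexible model. The example assumes 20 made-up points around a square-root curve and a degree-10 polynomial; the exact scores depend on the random split, so none are quoted here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# hypothetical data: 20 noisy points around sqrt(x)
rng = np.random.default_rng(13)
X_demo = np.linspace(1, 60, 20).reshape(-1, 1)
y_demo = np.sqrt(X_demo[:, 0]) + rng.normal(0, 0.5, size=20)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# scaling x first keeps the high powers numerically manageable
flexible = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=10, include_bias=False),
    LinearRegression(),
)
flexible.fit(X_train, y_train)

train_score = flexible.score(X_train, y_train)
test_score = flexible.score(X_test, y_test)

# a large gap between the two scores points to overfitting
print(f"train R^2: {train_score:.3f}")
print(f"test R^2:  {test_score:.3f}")
```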

In [None]:
# importing new packages for lasso, ridge and elasticnet
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet

In [None]:
# naming new models
lasso=Lasso()
ridge=Ridge()
elast=ElasticNet()

In [None]:
# fitting new models on X_poly
lasso.fit(X_poly, y)
ridge.fit(X_poly, y)
elast.fit(X_poly, y)

If the model does not converge, the optimizer did not get below the set tolerance (`tol`) within the maximum number of iterations (`max_iter`). This can happen easily with regularization, especially on unscaled polynomial features. You can try the following:     
* increase `max_iter` (maybe some more steps help)
* increase `tol` (being more generous could help)   

Both measures should be taken carefully, since they can increase the optimization time or make the results worse.
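
One way to experiment with these settings is sketched below: the polynomial features are standardized inside a pipeline, `max_iter` is raised, and any `ConvergenceWarning` is counted. The specific values (`alpha=0.1`, `max_iter=100_000`, `tol=1e-4`) are arbitrary assumptions, and the data is re-created here so the block runs on its own.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# data similar to the notebook's: noisy square root of x
np.random.seed(13)
X_demo = np.arange(1, 60, 5).reshape(-1, 1)
y_demo = np.sqrt(X_demo[:, 0]) + np.random.normal(0, 0.5, size=X_demo.shape[0])

max_iter = 100_000
pipe = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    StandardScaler(),  # puts x, x^2, ..., x^10 on a comparable scale
    Lasso(alpha=0.1, max_iter=max_iter, tol=1e-4),
)

# record warnings instead of printing them, so they can be counted
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    pipe.fit(X_demo, y_demo)

n_conv_warnings = sum(issubclass(w.category, ConvergenceWarning) for w in caught)
lasso_step = pipe.named_steps["lasso"]

print("convergence warnings:", n_conv_warnings)
print("iterations used:", lasso_step.n_iter_)
```

If `n_iter_` equals `max_iter`, the optimizer stopped because it ran out of iterations rather than because it reached the tolerance.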

In [None]:
# calculate y_pred by new models
y_lasso=lasso.predict(X_poly)
y_ridge=ridge.predict(X_poly)
y_elast=elast.predict(X_poly)

In [None]:
# plot all the models and compare them (uncomment a line to add that model's curve)
plt.scatter(X,y, label='actual')
#plt.plot(X, y_pred_poly, label='poly')
#plt.plot(X, y_pred, label='Linearregression')
#plt.plot(X, y_lasso, label='lasso')
#plt.plot(X, y_ridge, label='ridge')
#plt.plot(X, y_elast, label='elast')
#plt.legend()

In [None]:
lasso.coef_

In [None]:
ridge.coef_

### Additional Reading
[Regularization in Machine Learning](https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a)

<hr style="border:2px solid black">

## 3. Your Task

In [None]:
# check the lasso and Ridge results by changing the hyperparameter
lasso_1=Lasso(alpha=1) # 1, 10, 100, 1000
ridge_1=Ridge(alpha=1) # 1, 10, 100, 1000