# Regularized Linear Models

In machine learning, a model often performs well on the training data but very poorly on the test data. This happens when the model follows the training data too closely, and it is called overfitting.
Regularization is a technique for reducing overfitting. The term itself means the act of bringing to uniformity.


Complex models can detect subtle patterns in the data, but if the data is noisy (contains irrelevant information) or the dataset is too small, the model will end up detecting patterns in the noise itself. When we use such a model to make predictions, the results will be inaccurate and the error will be larger than expected.

To improve the model, or to reduce the effect of the noise on it, we need to reduce the weights associated with the noise. The smaller the weight associated with a noisy feature, the less it contributes to the predicted output.

To regularize the model, the regularization term will be added to the cost function.

**Regularized Cost Function = MSE + Regularization term**
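As a rough illustration of the formula above, the cost can be written in a few lines of NumPy. This is a minimal sketch, not any library's implementation: the helper names (`mse`, `l2_penalty`, `regularized_cost`) and the toy data are hypothetical, and the bias `theta[0]` is assumed to be left out of the penalty.

```python
import numpy as np

def mse(theta, X, y):
    """Mean squared error of the linear model X @ theta."""
    residuals = X @ theta - y
    return float(np.mean(residuals ** 2))

def l2_penalty(weights):
    """Half the sum of squared weights (a ridge-style penalty)."""
    return 0.5 * float(np.sum(weights ** 2))

def regularized_cost(theta, X, y, alpha, penalty):
    """MSE plus alpha times a penalty on the non-bias weights."""
    return mse(theta, X, y) + alpha * penalty(theta[1:])

# Hypothetical toy data: a column of ones for the bias plus one feature.
X_demo = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y_demo = np.array([1.0, 1.5, 2.0])
theta_demo = np.array([1.0, 0.5])  # fits the toy data exactly

cost_plain = regularized_cost(theta_demo, X_demo, y_demo, alpha=0.0, penalty=l2_penalty)
cost_reg = regularized_cost(theta_demo, X_demo, y_demo, alpha=1.0, penalty=l2_penalty)
print(cost_plain, cost_reg)
```

Even though the weights fit the toy data perfectly, the regularized cost is larger than the plain MSE, because the non-zero weight itself is penalized.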

Here we will look at three different regularization terms that constrain the weights of the model, giving three different regularized linear (or logistic) regression algorithms:

##### 1. Ridge Regression: uses an L2 penalty
##### 2. Lasso Regression: uses an L1 penalty
##### 3. Elastic Net

In [1]:
import numpy as np

np.random.seed(42)
m = 20
X = 3 * np.random.rand(m, 1)
y = 1 + 0.5 * X + np.random.randn(m, 1) / 1.5
X_new = np.linspace(0, 3, 100).reshape(100, 1)

## Ridge Regression

$$\text{Cost function: } MSE(\theta) + \alpha \frac{1}{2}\sum\limits_{i=1}^{n}\theta_{i}^{2}$$

In ridge regression, the regularization term is the sum of the squares of the weights of the model. It forces the learning algorithm to keep the weights as small as possible.

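It is important to note that the regularization term is added to the cost function only during training. Once the model is trained, we evaluate it on the test set with the plain, unregularized performance measure, because we want the evaluation to stay as close to the final objective as possible.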
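The split between a penalized training objective and a plain evaluation metric might look like this in practice. This is only a sketch on synthetic data generated here for illustration; the variable names and the 80/20 split are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X_demo = 3 * rng.rand(100, 1)
y_demo = 1 + 0.5 * X_demo[:, 0] + rng.randn(100) / 1.5

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The L2 penalty only influences how the weights are fitted here.
model = Ridge(alpha=1.0).fit(X_train, y_train)

# Evaluation uses the plain MSE, with no penalty term added.
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"test MSE: {test_mse:.3f}")
```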

In the equation above, alpha is a hyperparameter that controls how much we regularize the regression model. If we choose a very large alpha, the learning algorithm keeps the weights as close to zero as possible, because large weights increase the cost function; the result is a flat line passing through the mean of the data. If alpha is 0, ridge regression is just linear regression. To choose the best value of alpha, we do hyperparameter tuning.
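To make the role of alpha concrete, the sketch below fits the same synthetic data with increasing alpha values and prints the fitted slope. It then picks alpha from a small grid with `RidgeCV`, which is one possible way to tune it. The grid values and variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV

rng = np.random.RandomState(42)
X_demo = 3 * rng.rand(20, 1)
y_demo = 1 + 0.5 * X_demo[:, 0] + rng.randn(20) / 1.5

# alpha = 0 corresponds to plain linear regression.
slopes = {0: LinearRegression().fit(X_demo, y_demo).coef_[0]}
for alpha in [1, 100, 1e6]:
    slopes[alpha] = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_[0]

for alpha, slope in slopes.items():
    print(f"alpha={alpha:g}: slope={slope:.6f}")

# With a huge alpha the fitted line is almost flat at the mean of y.
flat_pred = Ridge(alpha=1e6).fit(X_demo, y_demo).predict([[0.0], [3.0]])
print(flat_pred, y_demo.mean())

# One way to tune alpha: cross-validation over a grid of candidates.
candidate_alphas = [0.01, 0.1, 1, 10, 100]
best_alpha = RidgeCV(alphas=candidate_alphas).fit(X_demo, y_demo).alpha_
print("alpha chosen by cross-validation:", best_alpha)
```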

Another important point is that we should scale the data before applying this technique, because ridge regression is sensitive to the scale of the input features. This is true of most regularized models.
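The sensitivity to feature scale can be seen by expressing the same feature in two different units. The sketch below is an illustration under assumed data (the "metres" and "millimetres" labels are hypothetical): without scaling, the two fits disagree, while a `StandardScaler` placed in front of `Ridge` in a pipeline makes them agree.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
X_m = 3 * rng.rand(20, 1)   # feature in hypothetical units, e.g. metres
X_mm = 1000 * X_m           # the same feature in millimetres
y_demo = 1 + 0.5 * X_m[:, 0] + rng.randn(20) / 1.5

# Without scaling, the same alpha penalizes the two versions differently.
raw_m = Ridge(alpha=10).fit(X_m, y_demo).predict(X_m)
raw_mm = Ridge(alpha=10).fit(X_mm, y_demo).predict(X_mm)
print("max difference without scaling:", np.max(np.abs(raw_m - raw_mm)))

# With standardization inside a pipeline, the units no longer matter.
pipe_m = make_pipeline(StandardScaler(), Ridge(alpha=10)).fit(X_m, y_demo)
pipe_mm = make_pipeline(StandardScaler(), Ridge(alpha=10)).fit(X_mm, y_demo)
scaled_m = pipe_m.predict(X_m)
scaled_mm = pipe_mm.predict(X_mm)
print("max difference with scaling:", np.max(np.abs(scaled_m - scaled_mm)))
```

Keeping the scaler inside the pipeline also means it is fitted on the training data only, which matters once cross-validation or a test set is involved.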

In [2]:
from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=1, solver="cholesky", random_state=42)
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])

array([[1.55071465]])
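The `solver="cholesky"` option asks scikit-learn to solve the ridge problem in closed form rather than iteratively. The sketch below reproduces that solution with NumPy and compares it with `Ridge`. It assumes scikit-learn's documented objective (sum of squared errors plus alpha times the squared L2 norm of the weights, which differs from the cost function above only in how alpha is scaled) and an unpenalized intercept. The variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(42)
X_demo = 3 * rng.rand(20, 1)
y_demo = 1 + 0.5 * X_demo[:, 0] + rng.randn(20) / 1.5

alpha = 1.0

# Center the data so that the intercept is not penalized.
x_mean, y_mean = X_demo.mean(axis=0), y_demo.mean()
Xc, yc = X_demo - x_mean, y_demo - y_mean

# Closed-form minimizer of ||yc - Xc w||^2 + alpha * ||w||^2.
n_features = Xc.shape[1]
w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
b = y_mean - x_mean @ w

sk = Ridge(alpha=alpha, solver="cholesky").fit(X_demo, y_demo)
print("closed form :", w, b)
print("scikit-learn:", sk.coef_, sk.intercept_)
```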

Ridge regression can also be trained with stochastic gradient descent. For `SGDRegressor`, the `penalty` hyperparameter sets the type of regularization term to use. Specifying `"l2"` indicates that you want SGD to add a regularization term to the cost function equal to half the square of the L2 norm of the weight vector: this is simply ridge regression.

In [3]:
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(penalty="l2", max_iter=1000, tol=1e-3, random_state=42)
sgd_reg.fit(X, y.ravel())
sgd_reg.predict([[1.5]])

array([1.47012588])

## Lasso Regression

$$\text{Cost function: } MSE(\theta) + \alpha \frac{1}{2}\sum\limits_{i=1}^{n}|\theta_{i}|$$

Least Absolute Shrinkage and Selection Operator (Lasso) regression uses the L1 norm of the weight vector for regularization, that is, the sum of the absolute values of the weights of the model.

An important characteristic of lasso regression is that it tends to eliminate the least important features by shrinking their weights to zero. Because of this, it is also used for feature selection.
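The feature-selection effect is easy to see on data where most features are irrelevant. The following sketch uses a hypothetical synthetic target that depends only on the first two of ten features, and compares the coefficients found by `Lasso` and `Ridge`. The alpha values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X_demo = rng.randn(200, 10)
# Hypothetical target: only the first two features matter.
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 0.1 * rng.randn(200)

lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)

print("lasso:", np.round(lasso.coef_, 3))
print("ridge:", np.round(ridge.coef_, 3))

# Indices of the features whose lasso weight is not exactly zero.
kept = np.flatnonzero(lasso.coef_)
print("features kept by lasso:", kept)
```

Ridge shrinks the weights of the irrelevant features but leaves them non-zero, whereas lasso sets them to exactly zero.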

In [4]:
from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])

array([1.53788174])

## Elastic Net

$$\text{Cost function: } MSE(\theta) + r\alpha \frac{1}{2}\sum\limits_{i=1}^{n}|\theta_{i}|+\frac{1-r}{2}\alpha\sum\limits_{i=1}^{n}\theta_{i}^{2}$$

Elastic net is a mix of ridge and lasso regression, and the mix ratio is controlled by `r` (the `l1_ratio` parameter in scikit-learn). If r is 0, elastic net is equivalent to ridge regression, and if r is 1, it is equivalent to lasso regression.

The important question is how to decide which of these to use. It is almost always preferable to have at least some regularization, so ridge regression is a good default. If you suspect that only a few features are actually useful, prefer lasso or elastic net. When the dataset has a large number of features, prefer elastic net: it generally behaves better than lasso in that setting, for example when there are more features than training instances or when several features are strongly correlated.
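When the choice is not obvious, one option is to compare the candidates with cross-validation. The sketch below does this on hypothetical synthetic data with arbitrary, untuned alpha values, so the printed numbers only illustrate the procedure and say nothing about which model is better in general.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_demo = rng.randn(200, 10)
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 0.5 * rng.randn(200)

candidates = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

cv_mse = {}
for name, model in candidates.items():
    scores = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse[name] = -scores.mean()  # flip the sign back to a plain MSE
    print(f"{name}: cross-validated MSE = {cv_mse[name]:.3f}")
```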

In [5]:
from sklearn.linear_model import ElasticNet

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])

array([1.54333232])

In [6]:
# r = 0: pure L2 penalty, as in ridge regression

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0, random_state=42)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])

array([1.54818356])

In [7]:
# r = 1: pure L1 penalty, same result as Lasso(alpha=0.1) above

elastic_net = ElasticNet(alpha=0.1, l1_ratio=1, random_state=42)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])

array([1.53788174])