# Regularization 
Agenda today:
- Reviewing overfitting & underfitting, bias variance tradeoff
- Ridge regression 
- Lasso regression 
- AIC and BIC

## Part I. Regularizing a Model
Although Ridge and Lasso are specific to regression, regularizing a model is a common step in building machine learning models in general. It is an effective procedure for tackling the problem of overfitting. Generally speaking, applying a regularization technique introduces some **bias** into the model but reduces its **variance**, and therefore tends to give better performance on testing data. As you will see later in this module, models built with various classification algorithms often need to be tuned with regularization in order to overcome overfitting.

What is regularization in the context of regression? Recall that as the complexity of a model increases, the model starts to overfit and its performance on the testing set decreases. Regularization techniques *shrink* the regression coefficients so that they do not affect the outcome as much as they originally would have. In other words, regularization applies a *penalty* to the coefficients of your regression model. Let's see how exactly Ridge regression and Lasso regression reduce the variance of regression models and result in a better fit.

<img src="https://media.giphy.com/media/26ufdipQqU2lhNA4g/giphy.gif" >

## Part II. Ridge Regression (L2 Norm)
Before we dive into regularization, let's (re)visit a concept called the **cost function**. A cost function measures how well or poorly the model estimates the relationship between our $X$ and $y$ variables. Usually it is expressed in terms of the difference between actual values and predicted values. For simple linear regression, the cost function is:
$$ \text{cost_function}= \sum_{i=1}^n(y_i - \hat{y}_i)^2 = \sum_{i=1}^n(y_i - (mx_i + b))^2$$

For linear regression with multiple predictors, the cost function is expressed as:
$$ \text{cost_function}= \sum_{i=1}^n(y_i - \hat{y}_i)^2 = \sum_{i=1}^n\left(y_i - \left(\sum_{j=1}^k m_jx_{ij} + b\right)\right)^2$$

where $k$ is the number of predictors, $m_j$ is the coefficient of the $j$th predictor, and $b$ is the intercept.
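
To make the formula concrete, here is a minimal sketch that evaluates the cost function with NumPy. The function name `rss_cost` and the toy numbers are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def rss_cost(X, y, m, b):
    """Residual sum of squares for the linear model y_hat = X @ m + b."""
    y_hat = X @ m + b                      # one prediction per row of X
    return float(np.sum((y - y_hat) ** 2))

# Toy data: 3 observations, k = 2 predictors (made-up numbers)
X_demo = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
y_demo = np.array([5.0, 4.0, 8.0])
m_demo = np.array([2.0, 1.0])              # candidate slopes m_1, m_2
b_demo = 0.5                               # candidate intercept

cost_demo = rss_cost(X_demo, y_demo, m_demo, b_demo)
print(cost_demo)  # residuals are 0.5, -0.5, 0.5, so the cost is 0.75
```

Ordinary least squares picks the `m` and `b` that make this number as small as possible on the training data.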

Ridge regression adds a penalty term, $\lambda$ times the sum of the squared slopes, to the cost function. This introduces a small amount of bias into the model; how much depends on the value of $\lambda$, which is called a *hyperparameter*.

$ \text{cost_function_ridge}= \sum_{i=1}^n(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^k m_j^2 = \sum_{i=1}^n\left(y_i - \left(\sum_{j=1}^k m_jx_{ij} + b\right)\right)^2 + \lambda \sum_{j=1}^k m_j^2$

Applying this penalty to the cost function yields a different regression model, one that minimizes the residual sum of squares **and** the term $\lambda \sum_{j=1}^k m_j^2$ together.

Ridge regression improves the fit of the original regression line on unseen data by introducing some bias, which changes the slope and intercept of the original line. Recall how we interpret a regression model $y = mx + b$: with every unit increase in $x$, the outcome $y$ increases by $m$ units. Therefore, the bigger the coefficient $m$ is, the more sensitive the outcome is to changes in the predictor $x$. Ridge regression works by reducing the magnitude of the coefficient $m$, and therefore reduces the effect the predictors have on the outcome.

Note that the ridge penalty term contains the squares of all the coefficients from the original regression line except the intercept term.
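
The shrinkage can be seen directly from the closed-form ridge solution. The sketch below is an illustration on synthetic data, not the notebook's own method: `ridge_closed_form` is a hypothetical helper, and centering the data is one standard way to leave the intercept out of the penalty.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize sum((y - X @ m - b)**2) + lam * sum(m**2); the intercept b is not penalized."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean          # centering removes the intercept from the problem
    k = X.shape[1]
    m = np.linalg.solve(Xc.T @ Xc + lam * np.eye(k), Xc.T @ yc)
    b = y_mean - X_mean @ m                  # recover the unpenalized intercept
    return m, b

# Synthetic data (assumption): 40 rows, 3 predictors, known slopes plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + 1.0 + rng.normal(scale=0.5, size=40)

penalties = [0.0, 1.0, 10.0, 100.0]
sq_norms = []
for lam in penalties:
    m, b = ridge_closed_form(X_demo, y_demo, lam)
    sq_norms.append(float(np.sum(m ** 2)))
    print(f"lambda={lam:>5}: slopes={np.round(m, 3)}, sum of squares={sq_norms[-1]:.3f}")
```

With `lam=0.0` this is ordinary least squares. As `lam` grows, the sum of squared slopes falls, which is the penalty at work.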

## Part III. Lasso Regression (L1 Norm)
Lasso regression is very similar to Ridge regression except for one difference: the penalty is not the sum of the squared coefficients but the sum of their absolute values, multiplied by $\lambda$:

$ \text{cost_function_lasso}= \sum_{i=1}^n(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^k \mid m_j \mid = \sum_{i=1}^n\left(y_i - \left(\sum_{j=1}^k m_jx_{ij} + b\right)\right)^2 + \lambda \sum_{j=1}^k \mid m_j \mid$

The biggest difference between Ridge and Lasso is that Lasso simultaneously performs variable selection: some coefficients are shrunk to exactly 0, which removes the corresponding predictors from the regression model. Therefore, Lasso regression performs very well on a higher dimensional dataset where some predictors are useless, whereas Ridge works best when all the predictors are needed.
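
Why Lasso produces exact zeros while Ridge does not is easiest to see in a simplified setting. The sketch below assumes the predictors are orthonormal (an assumption made only for illustration). In that case each penalized coefficient has a closed form in terms of its unpenalized value `m_ols`. The helper names are hypothetical.

```python
import numpy as np

def lasso_shrink(m_ols, lam):
    """Minimizer of (m - m_ols)**2 + lam * |m|: soft-thresholding by lam / 2."""
    return np.sign(m_ols) * np.maximum(np.abs(m_ols) - lam / 2.0, 0.0)

def ridge_shrink(m_ols, lam):
    """Minimizer of (m - m_ols)**2 + lam * m**2: proportional shrinkage."""
    return m_ols / (1.0 + lam)

m_ols = np.array([4.0, -1.5, 0.3, -0.1])   # made-up unpenalized coefficients
lam = 1.0

lasso_m = lasso_shrink(m_ols, lam)
ridge_m = ridge_shrink(m_ols, lam)
print("lasso:", lasso_m)   # the two small coefficients become exactly zero
print("ridge:", ridge_m)   # every coefficient is halved, none reaches zero
```

Lasso subtracts a fixed amount from each coefficient and clips at zero, so small coefficients vanish. Ridge rescales every coefficient by the same factor, so they get smaller but stay nonzero.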

<img src="https://media.giphy.com/media/AWeYSE0qgpk76/giphy.gif" width= "400" />

In [1]:
# implementation 
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = sns.load_dataset('mpg')

#data = pd.read_csv("https://raw.githubusercontent.com/learn-co-curriculum/dsc-2-24-09-ridge-and-lasso-regression/master/auto-mpg.csv") 
data = data.sample(50)
y = data[["mpg"]]
X = data.drop(["mpg", "name", "origin"], axis=1)



In [2]:
data.isnull().sum()

mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model_year      0
origin          0
name            0
dtype: int64

### Perform a train test split

In [3]:
X_train , X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)


In [4]:
X_train.groupby('cylinders')['horsepower'].mean()

cylinders
3    100.000000
4     78.263158
5     77.000000
6    102.250000
8    170.636364
Name: horsepower, dtype: float64

### Handle missing values

In [46]:
cyl_hp = X_train.groupby('cylinders')['horsepower'].mean().round(1).to_dict()
print(cyl_hp)

# Work on copies (avoids SettingWithCopyWarning); fill gaps with training-set group means
X_train, X_test = X_train.copy(), X_test.copy()
X_train['horsepower'] = X_train['horsepower'].fillna(X_train['cylinders'].map(cyl_hp))
X_test['horsepower'] = X_test['horsepower'].fillna(X_test['cylinders'].map(cyl_hp))


{3.0: 100.0, 4.0: 78.3, 5.0: 77.0, 6.0: 102.2, 8.0: 170.6}


### Scale the data by fitting the scaler to the train set and then transforming the train and test set.  

In [6]:
scale = MinMaxScaler()
transformed = scale.fit_transform(X_train)
X_train = pd.DataFrame(transformed, columns = X_train.columns)

In [7]:
transformed = scale.transform(X_test)
X_test = pd.DataFrame(transformed, columns = X_train.columns)
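
The fit-on-train, transform-both pattern above can also be packaged in a scikit-learn pipeline, which makes it harder to fit the scaler on test data by accident. This is an optional sketch on synthetic data, not a change to the notebook's workflow. The variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the car data (assumption): 50 rows, 3 predictors
rng = np.random.default_rng(12)
X_demo = rng.uniform(0, 100, size=(50, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=50)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=12)

# fit() learns the min/max from the training rows only;
# predict() reuses those same values to scale the test rows.
model = make_pipeline(MinMaxScaler(), Ridge(alpha=0.1))
model.fit(X_tr, y_tr)
pred_te = model.predict(X_te)

scaler = model.named_steps["minmaxscaler"]
print(scaler.data_min_, scaler.data_max_)
```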

In [8]:
X_test.head()

| | cylinders | displacement | horsepower | weight | acceleration | model_year |
|---|---|---|---|---|---|---|
| 0 | 0.2 | 0.135065 | 0.186441 | 0.195721 | 0.511811 | 0.166667 |
| 1 | 0.2 | 0.109091 | 0.180791 | 0.25236 | 0.527559 | 1.0 |
| 2 | 0.6 | 0.467532 | 0.265537 | 0.474827 | 0.433071 | 0.083333 |
| 3 | 0.2 | 0.21039 | 0.175141 | 0.27124 | 0.543307 | 0.916667 |
| 4 | 0.2 | 0.127273 | 0.248588 | 0.198867 | 0.425197 | 0.666667 |


In [9]:
y_train

| | mpg |
|---|---|
| 293 | 31.9 |
| 95 | 12.0 |
| 200 | 18.0 |
| 58 | 25.0 |
| 181 | 33.0 |
| 307 | 26.8 |
| 234 | 24.5 |
| 278 | 31.5 |
| 190 | 14.5 |
| 43 | 13.0 |


In [10]:
X_test.isnull().sum()

cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model_year      0
dtype: int64

In [11]:
X_train.isnull().sum()

cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model_year      0
dtype: int64

### Build a Ridge, Lasso and regular linear regression model. 
***Note how in scikit learn, the regularization parameter is denoted by alpha (and not lambda)***


In [12]:
ridge_01 = Ridge(alpha=0.1)
ridge_01.fit(X_train, y_train)

lasso_01 = Lasso(alpha=0.1)
lasso_01.fit(X_train, y_train)

lin = LinearRegression()
lin.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
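
As a sanity check on the alpha/lambda correspondence, scikit-learn's `Ridge` minimizes the residual sum of squares plus `alpha` times the sum of squared coefficients, so its `alpha` plays exactly the role of $\lambda$ in the ridge cost function. `Lasso` in scikit-learn divides the residual sum of squares by `2 * n_samples` before adding the penalty, so its `alpha` is on a different scale from $\lambda$ in the lasso formula. The sketch below checks the Ridge case against the closed-form solution on synthetic data. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumption): 30 rows, 4 predictors in [0, 1]
rng = np.random.default_rng(1)
X_demo = rng.uniform(size=(30, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.0, 4.0]) + rng.normal(scale=0.1, size=30)

alpha = 0.5
sk_model = Ridge(alpha=alpha).fit(X_demo, y_demo)

# Closed form with lambda = alpha; centering keeps the intercept unpenalized
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()
manual_coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(4), Xc.T @ yc)

coefs_match = np.allclose(sk_model.coef_, manual_coef, atol=1e-6)
print(coefs_match)
```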

In [50]:
print("Unpenalized Linear Regression Coefficients are:{}".format(lin.coef_))
print(lin.coef_.sum())
print("Unpenalized Linear Regression Intercept:{}".format(lin.intercept_))

Unpenalized Linear Regression Coefficients are:[[-5.62592355e-17  1.44635115e-17 -5.11299072e-19 -9.73741646e-18
  -3.37901452e-18 -2.97921981e-19 -1.34064218e-19  8.14922419e-19
   1.61493592e-18 -2.18289438e-18  2.57182910e-18  6.64481724e-19
   8.99788651e-19  3.87409957e-17 -1.51081966e-17  4.20739928e-16
   1.39031685e-17  2.29448427e-17 -4.35722400e-17 -9.48240310e-16
  -5.17492884e-18 -3.00343553e-17 -7.33083083e-15  1.08628186e-16
  -1.62061580e-17  4.17782062e-18  1.05928947e-17  1.14399721e-17
   1.65255130e-19  3.48563330e-18 -9.85629934e-18  1.89704357e-17
   3.15440161e-18  4.10562445e-18 -8.43265453e-18 -1.05589511e-16
   7.93902335e-16  6.03272998e-17  8.69986553e-17 -2.02084667e-16
  -3.55528804e-15 -1.07569920e-17 -1.25941564e-16 -6.77720286e-15
   7.48907560e-16  7.75819895e-16  1.91596304e-17  5.42633893e-17
   7.51641091e-17 -7.73613964e-15  3.39520898e-16 -6.27728307e-15
   9.30654033e-16  2.61278349e-15 -1.59758318e-16 -1.31380248e-14
  -2.85276845e-17 -1.09383522

In [48]:
print("Lasso Regression Coefficients are:{}".format(lasso_01.coef_))
print(lasso_01.coef_.sum())
print("Lasso Linear Regression Intercept:{}".format(lasso_01.intercept_))

Lasso Regression Coefficients are:[ -0.          -0.          -0.         -18.18921903  -0.
   7.5499047 ]
-10.63931433311533
Lasso Linear Regression Intercept:[26.93430629]


In [15]:
print("Ridge Regression Coefficients are:{}".format(ridge_01.coef_))
print(ridge_01.coef_.sum())
print("Ridge Linear Regression Intercept:{}".format(ridge_01.intercept_))

Ridge Regression Coefficients are:[[ -3.46184805   7.61785535  -2.57288875 -20.83474939  -1.2874113
    8.78729399]]
-11.751748155308887
Ridge Linear Regression Intercept:[27.78501297]


### Fit models with a different lambda

In [16]:
ridge_05 = Ridge(alpha=0.5)
ridge_05.fit(X_train, y_train)

lasso_05 = Lasso(alpha=0.5)
lasso_05.fit(X_train, y_train)

Lasso(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

In [17]:
print("Lasso Regression Coefficients are:{}".format(lasso_05.coef_))
print(lasso_05.coef_.sum())
print("Lasso Linear Regression Intercept:{}".format(lasso_05.intercept_))


Lasso Regression Coefficients are:[ -2.04156457  -0.          -0.         -12.00068487   0.
   4.27196755]
-9.770281890336266
Lasso Linear Regression Intercept:[27.20924703]


In [18]:
print("Ridge Regression Coefficients are:{}".format(ridge_05.coef_))
print(ridge_05.coef_.sum())

print("Ridge Linear Regression Intercept:{}".format(ridge_05.intercept_))

Ridge Regression Coefficients are:[[ -3.32094826  -0.55565618  -3.52180559 -11.27927595  -1.64420668
    7.16333373]]
-13.158558941835798
Ridge Linear Regression Intercept:[28.16283947]


In [19]:
ridge_1 = Ridge(alpha=1)
ridge_1.fit(X_train, y_train)

lasso_1 = Lasso(alpha=1)
lasso_1.fit(X_train, y_train)

Lasso(alpha=1, copy_X=True, fit_intercept=True, max_iter=1000, normalize=False,
      positive=False, precompute=False, random_state=None, selection='cyclic',
      tol=0.0001, warm_start=False)

In [20]:
print("Lasso Regression Coefficients are:{}".format(lasso_1.coef_))
print(lasso_1.coef_.sum())
print("Lasso Linear Regression Intercept:{}".format(lasso_1.intercept_))


Lasso Regression Coefficients are:[-6.07926499 -0.         -0.         -2.66725576  0.          0.03606874]
-8.710452004956627
Lasso Linear Regression Intercept:[27.74342301]


In [21]:
print("Ridge Regression Coefficients are:{}".format(ridge_1.coef_))
print(ridge_1.coef_.sum())
print("Ridge Linear Regression Intercept:{}".format(ridge_1.intercept_))

Ridge Regression Coefficients are:[[-3.77256638 -2.33311642 -3.43268844 -8.26333409 -1.01592116  6.26733487]]
-12.550291613186777
Ridge Linear Regression Intercept:[27.97887295]
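
Trying 0.1, 0.5 and 1 by hand shows how the coefficients move, but it does not say which value to keep. One common approach is to let cross-validation choose. The sketch below uses scikit-learn's `RidgeCV` and `LassoCV` on synthetic data. The candidate grid and the data are assumptions, and the selected values are not claims about the mpg data.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

# Synthetic data (assumption): 6 predictors, three of which have no effect
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(60, 6))
true_coef = np.array([5.0, 0.0, 0.0, -8.0, 0.0, 3.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=60)

alphas = [0.01, 0.1, 0.5, 1.0]                        # candidate penalties to compare

ridge_cv = RidgeCV(alphas=alphas).fit(X_demo, y_demo)       # efficient leave-one-out CV by default
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X_demo, y_demo) # 5-fold CV

print("best ridge alpha:", ridge_cv.alpha_)
print("best lasso alpha:", lasso_cv.alpha_)
```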


In [22]:
# create predictions
y_h_ridge_train_01 = ridge_01.predict(X_train)
y_h_ridge_test_01 = ridge_01.predict(X_test)

# Lasso.predict returns a 1-D array here; reshape to a column so it lines up with y
y_h_lasso_train_01 = lasso_01.predict(X_train).reshape(-1, 1)
y_h_lasso_test_01 = lasso_01.predict(X_test).reshape(-1, 1)

y_h_lin_train = lin.predict(X_train)
y_h_lin_test = lin.predict(X_test)

In [23]:
print(y_h_ridge_train_01.shape)
print(y_h_ridge_test_01.shape)

(40, 1)
(10, 1)


In [24]:
print(type(y_h_lasso_train_01))
print(type(y_h_ridge_train_01))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
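
The shape checks matter because subtracting a 1-D array from a column vector silently broadcasts to a full matrix, which would inflate the residual sum of squares. A small sketch of the pitfall with made-up numbers, plus a hypothetical helper `rss` that flattens both inputs first:

```python
import numpy as np

def rss(y_true, y_pred):
    """Residual sum of squares that is safe for (n,) and (n, 1) inputs alike."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sum((y_true - y_pred) ** 2))

y_col = np.array([[3.0], [5.0], [7.0]])   # shape (3, 1), like a one-column target
y_flat = np.array([2.5, 5.5, 7.0])        # shape (3,), like a 1-D prediction

broadcast_shape = (y_col - y_flat).shape  # (3, 3): every row paired with every column
rss_value = rss(y_col, y_flat)            # 0.25 + 0.25 + 0.0

print(broadcast_shape, rss_value)
```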


#### Examining the Residual Sum of Squares for the Ridge, Lasso, and Unpenalized Models

In [25]:
# examine the residual sum of sq
print('Train Error Ridge Model', np.sum((y_train - y_h_ridge_train_01)**2))
print('Test Error Ridge Model', np.sum((y_test - y_h_ridge_test_01)**2))
print('\n')

print('Train Error Lasso Model', np.sum((y_train - y_h_lasso_train_01)**2))
print('Test Error Lasso Model', np.sum((y_test - y_h_lasso_test_01)**2))
print('\n')

print('Train Error Unpenalized Linear Model', np.sum((y_train - lin.predict(X_train))**2))
print('Test Error Unpenalized Linear Model', np.sum((y_test - lin.predict(X_test))**2))

Train Error Ridge Model mpg    401.411133
dtype: float64
Test Error Ridge Model mpg    84.718805
dtype: float64


Train Error Lasso Model mpg    445.707891
dtype: float64
Test Error Lasso Model mpg    85.322756
dtype: float64


Train Error Unpenalized Linear Model mpg    369.008387
dtype: float64
Test Error Unpenalized Linear Model mpg    89.768963
dtype: float64


## How Do Ridge and Lasso Perform on Higher Dimensional Data?

#### Degree-2 polynomials

In [26]:
data.shape

(50, 9)

In [27]:
## try polynomial features on the regression 
from sklearn.preprocessing import PolynomialFeatures

# instantiate the transformer
poly_2 = PolynomialFeatures(degree=2, interaction_only=False)
# fit and transform the data and create a new dataframe
# (get_feature_names_out replaces get_feature_names, which newer scikit-learn versions removed)
df_poly = pd.DataFrame(poly_2.fit_transform(X), columns=poly_2.get_feature_names_out(X.columns))


In [28]:
df_poly.shape

(50, 28)
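
The jump from 6 to 28 columns follows from counting: with $p$ original predictors, the number of polynomial terms of degree at most $d$ (including the constant term) is $\binom{p+d}{d}$. A quick check, using a hypothetical helper name:

```python
from math import comb

def n_poly_features(n_features, degree):
    """Number of polynomial terms up to `degree`, including the constant term."""
    return comb(n_features + degree, degree)

n_deg2 = n_poly_features(6, 2)
n_deg5 = n_poly_features(6, 5)
print(n_deg2)  # 28 columns for degree 2
print(n_deg5)  # 462 columns for degree 5
```

With only 50 rows, the number of columns quickly exceeds the number of observations, which is exactly the situation where an unpenalized model can fit the training data perfectly and generalize poorly.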

In [29]:
X_train , X_test, y_train, y_test = train_test_split(df_poly, y, test_size=0.2, random_state=12)


In [30]:
scale = MinMaxScaler()
transformed = scale.fit_transform(X_train)
X_train = pd.DataFrame(transformed, columns = X_train.columns)

transformed = scale.transform(X_test)
X_test = pd.DataFrame(transformed, columns = X_train.columns)

In [31]:

# Build a Ridge, Lasso and regular linear regression model. 
# Note how in scikit learn, the regularization parameter is denoted by alpha (and not lambda)
ridge = Ridge(alpha=0.3)
ridge.fit(X_train, y_train)

lasso = Lasso(alpha=0.3)
lasso.fit(X_train, y_train)

lin = LinearRegression()
lin.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [32]:
print("Unpenalized Linear Regression Coefficients are:{}".format(lin.coef_))
print("Unpenalized Linear Regression Intercept:{}".format(lin.intercept_))

Unpenalized Linear Regression Coefficients are:[[ 3.07270805e+14 -2.05157854e+02 -2.34073552e+01  4.90085996e+02
  -7.44762627e+01  4.39008592e+01 -3.97272813e+02 -1.16341699e+02
  -3.03721843e+02  5.77516235e+02  1.17142773e+02  2.46804556e+02
   1.85558012e+01  7.48704508e+01  1.10149552e+02 -8.89166163e+00
   9.17415224e+01  4.78770073e+01 -4.42904243e+02 -2.14170651e+02
  -1.40876100e+02 -1.72697541e+02  8.43023169e+01 -1.37920171e+02
   1.11668794e+02  7.09056717e+01 -6.38344531e+01  4.18090692e+02]]
Unpenalized Linear Regression Intercept:[3.54069162]


In [33]:
print("Lasso Regression Coefficients are:{}".format(lasso.coef_))
print("Lasso Linear Regression Intercept:{}".format(lasso.intercept_))

Lasso Regression Coefficients are:[  0.          -0.23419431  -0.          -0.         -14.19336848
   0.           0.          -0.          -0.          -0.
  -0.          -0.          -0.          -0.          -0.
  -0.          -0.          -0.          -0.          -0.
  -2.00615513  -0.          -0.          -0.          -0.
   0.           0.           6.0320165 ]
Lasso Linear Regression Intercept:[27.17383035]
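
To quantify the variable selection, we can count how many coefficients survived. The sketch below copies the Lasso coefficients from the printed output above into an array. In the notebook itself, the same check would be run on `lasso.coef_` directly.

```python
import numpy as np

# Lasso coefficients copied from the printed output above (degree-2 features, alpha=0.3)
lasso_coef = np.array([
    0.0, -0.23419431, -0.0, -0.0, -14.19336848,
    0.0, 0.0, -0.0, -0.0, -0.0,
    -0.0, -0.0, -0.0, -0.0, -0.0,
    -0.0, -0.0, -0.0, -0.0, -0.0,
    -2.00615513, -0.0, -0.0, -0.0, -0.0,
    0.0, 0.0, 6.0320165,
])

kept = np.flatnonzero(lasso_coef)          # positions of the nonzero coefficients
print(f"{kept.size} of {lasso_coef.size} coefficients are nonzero, at positions {kept.tolist()}")
```

Only 4 of the 28 polynomial terms are kept, whereas the Ridge model retains a nonzero weight on every term except the constant column.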


In [34]:
print("Ridge Regression Coefficients are:{}".format(ridge.coef_))
print("Ridge Linear Regression Intercept:{}".format(ridge.intercept_))

Ridge Regression Coefficients are:[[ 0.         -1.9299617  -0.33693847 -2.07157971 -6.19739562  1.76235262
   4.7718511   0.31013532  1.79617409  0.99502781 -0.45428522  0.27207857
  -0.62311374  3.09835257  2.99596956  2.09611067 -0.09183104  0.24716009
   1.73261028  0.13156468 -5.22515694 -2.45163458 -1.39166701 -6.09173226
  -5.98110719  2.3936552   2.47184317  5.33205519]]
Ridge Linear Regression Intercept:[24.94875673]


In [35]:
# create predictions
y_h_ridge_train = ridge.predict(X_train)
y_h_ridge_test = ridge.predict(X_test)

# reshape(-1, 1) infers the row count instead of hard-coding 40 and 10
y_h_lasso_train = lasso.predict(X_train).reshape(-1, 1)
y_h_lasso_test = lasso.predict(X_test).reshape(-1, 1)

y_h_lin_train = lin.predict(X_train)
y_h_lin_test = lin.predict(X_test)

In [36]:
# examine the residual sum of sq
print('Train Error Ridge Model', np.sum((y_train - y_h_ridge_train)**2))
print('Test Error Ridge Model', np.sum((y_test - y_h_ridge_test)**2))
print('\n')

print('Train Error Lasso Model', np.sum((y_train - y_h_lasso_train)**2))
print('Test Error Lasso Model', np.sum((y_test - y_h_lasso_test)**2))
print('\n')

print('Train Error Unpenalized Linear Model', np.sum((y_train - lin.predict(X_train))**2))
print('Test Error Unpenalized Linear Model', np.sum((y_test - lin.predict(X_test))**2))

Train Error Ridge Model mpg    333.532974
dtype: float64
Test Error Ridge Model mpg    60.046126
dtype: float64


Train Error Lasso Model mpg    490.901885
dtype: float64
Test Error Lasso Model mpg    80.147256
dtype: float64


Train Error Unpenalized Linear Model mpg    465.633188
dtype: float64
Test Error Unpenalized Linear Model mpg    491.194345
dtype: float64


#### Even higher degree polynomials

In [37]:
poly_5 = PolynomialFeatures(degree=5, interaction_only=False)
# fit and transform the data and create a new dataframe
# (get_feature_names_out replaces get_feature_names, which newer scikit-learn versions removed)
df_poly_5 = pd.DataFrame(poly_5.fit_transform(X), columns=poly_5.get_feature_names_out(X.columns))
df_poly_5.shape

(50, 462)

In [38]:
X_train , X_test, y_train, y_test = train_test_split(df_poly_5, y, test_size=0.2, random_state=12)

# Build a Ridge, Lasso and regular linear regression model. 
# Note how in scikit learn, the regularization parameter is denoted by alpha (and not lambda)
ridge = Ridge(alpha=1)
ridge.fit(X_train, y_train)

lasso = Lasso(alpha=1)
lasso.fit(X_train, y_train)

lin = LinearRegression()
lin.fit(X_train, y_train)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [39]:
# create predictions
y_h_ridge_train = ridge.predict(X_train)
y_h_ridge_test = ridge.predict(X_test)

# reshape(-1, 1) infers the row count instead of hard-coding 40 and 10
y_h_lasso_train = lasso.predict(X_train).reshape(-1, 1)
y_h_lasso_test = lasso.predict(X_test).reshape(-1, 1)

y_h_lin_train = lin.predict(X_train)
y_h_lin_test = lin.predict(X_test)

In [40]:
# examine the residual sum of sq
print('Train Error Ridge Model', np.sum((y_train - y_h_ridge_train)**2))
print('Test Error Ridge Model', np.sum((y_test - y_h_ridge_test)**2))
print('\n')

print('Train Error Lasso Model', np.sum((y_train - y_h_lasso_train)**2))
print('Test Error Lasso Model', np.sum((y_test - y_h_lasso_test)**2))
print('\n')

print('Train Error Unpenalized Linear Model', np.sum((y_train - lin.predict(X_train))**2))
print('Test Error Unpenalized Linear Model', np.sum((y_test - lin.predict(X_test))**2))

Train Error Ridge Model mpg    124.154515
dtype: float64
Test Error Ridge Model mpg    1424.952921
dtype: float64


Train Error Lasso Model mpg    110.67478
dtype: float64
Test Error Lasso Model mpg    150.580508
dtype: float64


Train Error Unpenalized Linear Model mpg    3.354251e-18
dtype: float64
Test Error Unpenalized Linear Model mpg    97234.888821
dtype: float64


## Calculating AIC and BIC 
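AIC and BIC are information criteria that evaluate a model by measuring both its parsimony and its goodness of fit. Lower values are better.

- AIC is defined as: $2k - 2\log(L)$
- BIC is defined as: $k\log(n) - 2\log(L)$

where $k$ is the number of estimated parameters, $n$ is the number of observations, and $L$ is the maximized value of the model's likelihood.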
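
BIC is defined above but is not computed in this notebook. A minimal sketch, assuming Gaussian errors so that $-2\log(L)$ equals $n\log(\text{SSE}/n)$ up to an additive constant that is the same for every model fit to the same data. The function name `gaussian_bic` and the numbers are illustrative assumptions.

```python
import numpy as np

def gaussian_bic(sse, n, k):
    """BIC under Gaussian errors, dropping constants shared by all models on the same data."""
    return k * np.log(n) + n * np.log(sse / n)

n_obs = 100

# Made-up comparison: the larger model fits slightly better but uses 7 more parameters
bic_small = gaussian_bic(sse=250.0, n=n_obs, k=3)
bic_large = gaussian_bic(sse=240.0, n=n_obs, k=10)

print(round(bic_small, 2), round(bic_large, 2))
print("prefer small model:", bic_small < bic_large)
```

Because each parameter costs $\log(n)$ rather than 2, BIC penalizes extra parameters more heavily than AIC once $n$ exceeds about 7.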

In [41]:
def aic(y, y_pred, k):
    # Assuming Gaussian errors, -2*log(L) equals n*log(SSE/n) up to an additive constant,
    # so a larger SSE raises the AIC (lower AIC is better).
    resid = y - y_pred
    sse = (resid**2).sum()
    n = len(y)
    AIC = 2*k + n*np.log(sse/n)
    
    return AIC

In [42]:
df_poly_5.shape[1]

462

In [43]:
aic(y_test, y_h_lasso_test, df_poly_5.shape[1])

mpg    951.119128
dtype: float64

In [44]:
aic(y_test, y_h_ridge_test, df_poly_5.shape[1])

mpg    973.59309
dtype: float64

In [45]:
aic(y_test, y_h_lin_test, df_poly_5.shape[1])

mpg    1015.822998
dtype: float64