## Starting off

Can you explain what the bias-variance tradeoff is and how it affects your modelling process?

Bias is the difference between the average prediction of our model and the correct value we are trying to predict. Variance is the variability of the model's prediction for a given data point: it tells us how much the predictions spread when the model is fit on different training sets.
$$\text{Err}(x) = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$
Irreducible error is the error that can't be reduced by creating good models. It is a measure of the amount of noise in our data.
* High bias = underfit, which means your errors are high on both the train and test sets
* High variance = overfit, which means the model does not generalize to new data
* Good balance = low bias and low variance
* Compare the RMSE between your train and test sets. Too large a difference could mean overfitting. Try removing some of the features that you think might be less relevant.
* Underfitting can show up as patterned residuals, such as heteroskedasticity. Try feature transformations.
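
The definitions above can be checked by simulation. The following is a minimal sketch, not part of the original lesson: the ground-truth function `true_f`, the noise level, and the helper `bias_variance_at` are all assumptions chosen for illustration. It refits polynomials of different degrees on many noisy training sets and estimates the squared bias and the variance of the prediction at a single point.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical ground-truth function, chosen only for illustration
    return np.sin(2 * np.pi * x)

def bias_variance_at(x0, degree, n_train=30, n_repeats=500, noise_sd=0.3):
    """Estimate bias^2 and variance of a polynomial fit at the point x0."""
    x = np.linspace(0, 1, n_train)  # fixed design; only the noise is redrawn
    preds = np.empty(n_repeats)
    for r in range(n_repeats):
        y = true_f(x) + rng.normal(0, noise_sd, n_train)
        coefs = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coefs, x0)
    bias_sq = (preds.mean() - true_f(x0)) ** 2  # (average prediction - truth)^2
    variance = preds.var()                      # spread of the predictions
    return bias_sq, variance

estimates = {}
for degree in (1, 3, 9):
    estimates[degree] = bias_variance_at(x0=0.25, degree=degree)
    b2, var = estimates[degree]
    print(f"degree={degree}: bias^2={b2:.4f}, variance={var:.4f}")
```

Under these assumptions the straight line should show the largest squared bias and the degree-9 polynomial the largest variance, which is the pattern the bullets describe.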

# Regularization 
Agenda today:
- Reviewing overfitting & underfitting, bias variance tradeoff
- Ridge regression 
- Lasso regression 
- AIC and BIC

# Background

## Bias vs Variance Tradeoff

<img src ="resources/rsme_poly_2.png" width = "500">

### Using the chart above, determine what is the optimal number of degrees for our polynomial features for this model? In general, how does increasing the polynomial degree relate to the Bias/Variance tradeoff?  (Note that this graph shows RMSE and not MSE.)

#### Your answer here
* As the degree of polynomial increases, our variance increases, leading to overfitting of the model to the training set. 
* From the chart, the best polynomial degree appears to be 3, where the combined bias and variance look smallest and the gap between the train and test RMSE is smallest.
(https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229)
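
A chart like this can be reproduced with a short loop. The sketch below uses synthetic data (a cubic signal plus noise, which is an assumption and not the data behind the figure) and compares train and test RMSE for each polynomial degree.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (an assumption for illustration): a cubic signal plus noise
x = rng.uniform(-1, 1, 80)
y = 1.0 + 2.0 * x - 3.0 * x**3 + rng.normal(0, 0.3, 80)

# Simple hold-out split: first 60 points for training, last 20 for testing
x_train, y_train = x[:60], y[:60]
x_test, y_test = x[60:], y[60:]

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

results = {}
for degree in range(1, 9):
    coefs = np.polyfit(x_train, y_train, degree)
    results[degree] = (
        rmse(y_train, np.polyval(coefs, x_train)),
        rmse(y_test, np.polyval(coefs, x_test)),
    )
    print(f"degree={degree}: train RMSE={results[degree][0]:.3f}, "
          f"test RMSE={results[degree][1]:.3f}")
```

The train RMSE can only go down as the degree grows, so the degree is chosen by looking at the test RMSE and at the gap between the two curves.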

## Cost Function

Before we dive into regularization, let's (re)visit a concept called **Cost Function**. A cost function is a measure of how good or bad the model is at estimating the relationship of our $X$ and $y$ variables. Usually, it is expressed in the difference between actual values and predicted values. For simple linear regression, the cost function is represented as:
<center> $$ \text{cost_function}= \sum_{i=1}^n(y_i - \hat{y}_i)^2 = \sum_{i=1}^n(y_i - (mx_i + b))^2$$


For linear regression with multiple predictors, the cost function is expressed as:
$$ \text{cost_function}= \sum_{i=1}^n(y_i - \hat{y}_i)^2 = \sum_{i=1}^n\left(y_i - \left(\sum_{j=1}^k m_jx_{ij} + b\right)\right)^2$$

Where $k$ is the number of predictors and $m_j$ is the coefficient of the $j$th predictor.

***Our goal in fitting the model is to find the terms of the model that minimize this cost_function.***



<center> Minimize(Loss(Data|Model))
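
As a small illustration of the cost function for multiple predictors, the sketch below computes it with NumPy and checks that the least-squares solution gives a lower cost than a slightly perturbed set of coefficients. The data and the helper name `cost_function` are assumptions made up for this example.

```python
import numpy as np

def cost_function(X, y, m, b):
    """Residual sum of squares for coefficients m and intercept b."""
    y_hat = X @ m + b
    return np.sum((y - y_hat) ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 3.0 + X @ np.array([2.0, -1.0]) + rng.normal(0, 0.5, 50)

# Least squares: append a column of ones so the intercept is estimated too
X1 = np.column_stack([X, np.ones(len(X))])
solution, *_ = np.linalg.lstsq(X1, y, rcond=None)
m_hat, b_hat = solution[:-1], solution[-1]

best = cost_function(X, y, m_hat, b_hat)
nudged = cost_function(X, y, m_hat + 0.1, b_hat)  # move away from the optimum
print(f"cost at least-squares fit: {best:.3f}, after nudging: {nudged:.3f}")
```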

## Part I. Regularizing a Model
Although Ridge and Lasso are specific to regression, regularizing a model is a common procedure throughout machine learning. It is an effective way to tackle overfitting. Generally speaking, applying a regularization technique introduces some **bias** to the model but reduces the **variance**, which often results in better performance on testing data. As you will see later in this module, models built from various classification algorithms often require tuning using regularization in order to overcome overfitting.

* *With all coefficients at zero, there is high bias but very little variance. As the coefficients grow, the variance increases but the bias decreases. Complexity in the model is represented by the size of the coefficients.*

What is regularization in the context of regression? As we recall, as the complexity of a model increases, the model overfits and performance on the testing set degrades. Regularization techniques *shrink* the regression coefficients so that they do not affect the outcome as much as they originally would have. In other words, regularization applies a *penalty* to the coefficients of your regression model. Let's see how exactly Ridge regression and Lasso regression work to reduce variance in regression models and result in a better fit.

***We will now minimize loss + complexity, which is called structural risk minimization:***
<center> Minimize(Loss(Data|Model) + complexity(Model))

<img src="https://media.giphy.com/media/26ufdipQqU2lhNA4g/giphy.gif" >

## Part II. Ridge Regression (L2 Norm)



The ridge regression applies a penalizing parameter $\lambda$, such that a small bias will be introduced to the entire model depending on the value of $\lambda$, which is called a *hyperparameter*. 

$$ \text{cost_function_ridge}= \sum_{i=1}^n(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^k \beta_j^2 = \sum_{i=1}^n\left(y_i - \left(\sum_{j=1}^k \beta_jx_{ij} + b\right)\right)^2 + \lambda \sum_{j=1}^k \beta_j^2$$

Applying this penalty to the cost function yields a different regression model, one that minimizes the residual sum of squares **and** the term $\lambda \sum_{j=1}^k \beta_j^2$.

**Ridge regression can improve how well the regression line generalizes by introducing some bias, that is, by changing the slope and intercept of the original line.** Recall the way we interpret a regression model Y = mx + b: with every unit increase in x, the outcome y increases by m units. Therefore, the bigger the coefficient m is, the more the outcome is subject to changes in predictor x. Ridge regression works by reducing the magnitude of the coefficient m and therefore reducing the effect the predictors have on the outcome.

The ridge regression penalty term contains all of the coefficients squared from the original regression line except for the intercept term. 
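
To make the shrinkage concrete, here is a minimal sketch of ridge regression using its closed-form solution, $\beta = (X^TX + \lambda I)^{-1}X^Ty$ on centered data. The helper `ridge_fit` and the synthetic data are assumptions for illustration. The intercept is recovered from the means, so it is not penalized.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit; the intercept is left unpenalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centering takes the intercept out
    k = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(k), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return beta, intercept

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 5.0 + X @ np.array([4.0, -2.0, 0.5]) + rng.normal(0, 1.0, 100)

for lam in (0.0, 10.0, 100.0, 1000.0):
    beta, intercept = ridge_fit(X, y, lam)
    print(f"lambda={lam:>6}: beta={np.round(beta, 3)}, intercept={intercept:.3f}")
```

With `lam=0.0` this is ordinary least squares. As `lam` grows, every coefficient moves toward 0, but none of them lands exactly on 0.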

## Part III. Lambda Amount

Performing L2 regularization has the following effect on a model

- Encourages weight values toward 0 (but not exactly 0)
- Encourages the mean of the weights toward 0, with a normal (bell-shaped or Gaussian) distribution.

Increasing the lambda value strengthens the regularization effect. For example, the histogram of weights for a high value of lambda might look as shown below.

Range of x axis on the graph is from -100 to 100 with 0 in the center


<img src ="resources/HighLambda.svg" width = "500">

Lowering the value of lambda tends to yield a flatter histogram, as shown below.

<img src ="resources/LowLambda.svg" width = "500">

When choosing a lambda value, the goal is to strike the right balance between simplicity and training-data fit:

- If your lambda value is too high, your model will be simple, but you run the risk of underfitting your data. Your model won't learn enough about the training data to make useful predictions.

- If your lambda value is too low, your model will be more complex, and you run the risk of overfitting your data. Your model will learn too much about the particularities of the training data, and won't be able to generalize to new data.
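
One common way to strike this balance is to try several lambda values and keep the one with the lowest error on held-out data. The sketch below does this with a closed-form ridge fit on synthetic data. The data, the lambda grid, and the helper names are assumptions for illustration, and a single validation split stands in for full cross-validation.

```python
import numpy as np

def ridge_coefs(X, y, lam):
    # Closed-form ridge solution; the synthetic data below has no intercept
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def mse(X, y, beta):
    return np.mean((y - X @ beta) ** 2)

rng = np.random.default_rng(3)
n, k = 60, 30
X = rng.normal(size=(n, k))
true_beta = np.zeros(k)
true_beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]  # only 5 predictors matter
y = X @ true_beta + rng.normal(0, 2.0, n)

X_tr, y_tr = X[:40], y[:40]    # training split
X_val, y_val = X[40:], y[40:]  # validation split

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
train_mse, val_mse = {}, {}
for lam in lambdas:
    beta = ridge_coefs(X_tr, y_tr, lam)
    train_mse[lam] = mse(X_tr, y_tr, beta)
    val_mse[lam] = mse(X_val, y_val, beta)
    print(f"lambda={lam:>7}: train MSE={train_mse[lam]:.3f}, "
          f"validation MSE={val_mse[lam]:.3f}")

best_lam = min(val_mse, key=val_mse.get)
print("lambda with the lowest validation MSE:", best_lam)
```

The training error never decreases as lambda increases, so it cannot be used to choose lambda; the validation error is what reveals where simplicity starts to cost too much fit.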

## Part IV. Lasso Regression (L1 Norm)
Lasso regression is very similar to Ridge regression except for one difference: the penalty term is not the sum of squared coefficients but the sum of their absolute values, multiplied by lambda:

$$ \text{cost_function_lasso}= \sum_{i=1}^n(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^k \mid \beta_j \mid = \sum_{i=1}^n\left(y_i - \left(\sum_{j=1}^k \beta_jx_{ij} + b\right)\right)^2 + \lambda \sum_{j=1}^k \mid \beta_j \mid$$

The biggest difference between Ridge and Lasso is that Lasso simultaneously performs variable selection: some coefficients are shrunk to exactly 0, which removes those predictors from the model. Therefore, Lasso regression performs very well when you have a higher-dimensional dataset where some predictors are useless, whereas Ridge works best when all the predictors are needed.
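
The reason Lasso produces exact zeros can be seen in a small from-scratch solver. The sketch below uses coordinate descent with soft-thresholding, which is one standard way to minimize the Lasso cost function; it is an illustration under stated assumptions (centered synthetic data, hypothetical helper names), not scikit-learn's implementation.

```python
import numpy as np

def soft_threshold(rho, t):
    # Shrinks rho toward 0 by t and returns exactly 0 when |rho| <= t
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for sum((y - X beta)^2) + lam * sum(|beta|).

    Assumes X and y are centered, so no intercept is fitted.
    """
    n, k = X.shape
    beta = np.zeros(k)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(k):
            # Residual with feature j's current contribution added back
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)
# Only the first and fourth predictors carry signal in this synthetic data
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + rng.normal(0, 0.5, 200)
y -= y.mean()

lasso_beta = lasso_cd(X, y, lam=150.0)
ridge_beta = np.linalg.solve(X.T @ X + 150.0 * np.eye(5), X.T @ y)
print("lasso:", np.round(lasso_beta, 3))
print("ridge:", np.round(ridge_beta, 3))
```

The soft-threshold step sets a coefficient to exactly 0 whenever its correlation with the remaining residual is below `lam / 2`, whereas the ridge solution only scales coefficients down.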

https://developers.google.com/machine-learning/crash-course/regularization-for-simplicity/playground-exercise-examining-l2-regularization

# Applied: Comparing different models

In [1]:
# implementation
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = sns.load_dataset('mpg')

#data = pd.read_csv("https://raw.githubusercontent.com/learn-co-curriculum/dsc-2-24-09-ridge-and-lasso-regression/master/auto-mpg.csv") 
# data = data.sample(50)
y = data[["mpg"]]
X = data.drop(["mpg", "name", "origin"], axis=1)


In [2]:
data.isnull().sum()

mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
name            0
dtype: int64

### Perform a train test split

In [3]:
X_train , X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)


In [4]:
X_train.groupby('cylinders')['horsepower'].mean()

cylinders
3    105.000000
4     77.561290
5     90.000000
6    100.116667
8    159.726027
Name: horsepower, dtype: float64

### Handle missing values

In [5]:
cyl_hp = X_train.groupby('cylinders')['horsepower'].mean().round(1).to_dict()
print(cyl_hp)

# Work on an explicit copy and assign the result back, which avoids the
# SettingWithCopyWarning raised by an in-place fillna on a slice
X_train = X_train.copy()
X_train['horsepower'] = X_train['horsepower'].fillna(X_train['cylinders'].map(cyl_hp))


{3: 105.0, 4: 77.6, 5: 90.0, 6: 100.1, 8: 159.7}


### Scale the data by fitting the scaler to the train set and then transforming the train and test set.  

In [6]:
X_train.describe()

| | cylinders | displacement | horsepower | weight | acceleration | model_year |
|---|---|---|---|---|---|---|
| count | 298.0 | 298.0 | 298.0 | 298.0 | 298.0 | 298.0 |
| mean | 5.389262 | 188.776846 | 102.574161 | 2925.04698 | 15.762416 | 75.902685 |
| std | 1.682596 | 103.828472 | 38.991373 | 841.656942 | 2.838563 | 3.665371 |
| min | 3.0 | 68.0 | 46.0 | 1613.0 | 8.0 | 70.0 |
| 25% | 4.0 | 98.0 | 75.0 | 2190.0 | 14.0 | 73.0 |
| 50% | 4.0 | 140.0 | 90.0 | 2772.0 | 15.5 | 76.0 |
| 75% | 6.0 | 258.0 | 115.75 | 3506.25 | 17.5 | 79.0 |
| max | 8.0 | 455.0 | 230.0 | 5140.0 | 24.8 | 82.0 |


In [7]:
# Apply a StandardScaler to the data. The scale of each feature changes the
# size of its coefficient, and the penalty acts on coefficient size, so
# features should be standardized before regularizing.
# If you add polynomial features, create them first and then scale.
scale = StandardScaler()
transformed = scale.fit_transform(X_train) #fit training
X_train = pd.DataFrame(transformed, columns = X_train.columns)

MinMaxScaler vs. StandardScaler
* You can try both and see which gives the best predictions.
* Some applications, especially in text analysis and neural nets, require non-negative data values, which calls for min-max scaling.
* In general, standard scaling is the default choice for regression and classification.

In [8]:
transformed = scale.transform(X_test) #transform the test
X_test = pd.DataFrame(transformed, columns = X_train.columns)
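
The fit-on-train, transform-on-test pattern used in the cells above can be written out by hand. The sketch below is an illustration with synthetic arrays (the data and variable names are assumptions); it applies both standard scaling and min-max scaling using statistics taken from the training set only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic features on very different scales
train = rng.normal(loc=[100.0, 3000.0], scale=[40.0, 800.0], size=(300, 2))
test = rng.normal(loc=[100.0, 3000.0], scale=[40.0, 800.0], size=(100, 2))

# Standard scaling: mean and standard deviation come from the training set only
mu, sigma = train.mean(axis=0), train.std(axis=0)
train_std, test_std = (train - mu) / sigma, (test - mu) / sigma

# Min-max scaling: minimum and maximum also come from the training set only
lo, hi = train.min(axis=0), train.max(axis=0)
train_mm, test_mm = (train - lo) / (hi - lo), (test - lo) / (hi - lo)

print("train mean after standard scaling:", np.round(train_std.mean(axis=0), 6))
print("test mean after standard scaling: ", np.round(test_std.mean(axis=0), 6))
print("train range after min-max scaling:", train_mm.min(), train_mm.max())
```

Because the test set is scaled with the training statistics, its mean and standard deviation come out close to, but not exactly, 0 and 1. That is expected and is the same effect visible in `X_test.describe()`.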

In [9]:
X_test.describe()

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model_year
count,100.0,100.0,100.0,100.0,100.0,100.0
mean,0.155223,0.178509,0.158143,0.214941,-0.272926,0.116778
std,1.041057,1.011772,0.925539,1.014045,0.855078,1.036493
min,-1.422374,-1.145896,-1.299242,-1.518663,-2.386345,-1.613101
25%,-0.827054,-0.67317,-0.477167,-0.621899,-0.851302,-0.793253
50%,0.363584,-0.113617,-0.143199,0.026722,-0.269044,0.026595
75%,1.554223,1.092317,0.942196,0.89522,0.233815,0.914763
max,1.554223,2.558732,3.016651,2.41591,1.848257,1.66629


In [39]:
X_test.isnull().sum()

cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model_year      0
dtype: int64

In [40]:
X_train.isnull().sum()

cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model_year      0
dtype: int64

### Build a Ridge, Lasso and regular linear regression model. 
***Note how in scikit learn, the regularization parameter is denoted by alpha (and not lambda)***
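
One detail may help when reading the coefficients in the rest of this section. As documented for scikit-learn, `Ridge` and `Lasso` scale the residual term differently, so the same `alpha` does not mean the same penalty strength in both models. Here $w$ denotes the coefficient vector and $n$ the number of training rows:

$$\text{Ridge:}\quad \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2$$

$$\text{Lasso:}\quad \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$$

Multiplying the Lasso objective by $2n$ gives $\lVert y - Xw \rVert_2^2 + 2n\alpha \lVert w \rVert_1$, so in the notation of Part IV, $\lambda = 2n\alpha$. With the 298 training rows used here, `Lasso(alpha=0.1)` corresponds to $\lambda \approx 59.6$, which is part of why it shrinks the coefficients far more aggressively than `Ridge(alpha=0.1)`.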


In [27]:
# linear regression model
lin = LinearRegression()
lin.fit(X_train, y_train)

#ridge model
ridge_01 = Ridge(alpha=0.1)  # increasing alpha (e.g. from 0.1 to 0.5) shrinks the coefficients further
ridge_01.fit(X_train, y_train)

#lasso model
lasso_01 = Lasso(alpha=0.1)
lasso_01.fit(X_train, y_train)

# for now, just compare the size of the coefficients

Lasso(alpha=0.1)

#### Unpenalized Regression

In [28]:
print("The sum of coefficients", abs(lin.coef_).sum()+ abs(lin.intercept_))


The sum of coefficients [34.72125377]


#### Lasso Regression

In [29]:
print("The sum of coefficients", abs(lasso_01.coef_).sum()+ abs(lasso_01.intercept_))


The sum of coefficients [32.1095765]


#### Ridge Regression

In [30]:
print("The sum of coefficients", abs(ridge_01.coef_).sum()+ abs(ridge_01.intercept_))


The sum of coefficients [34.65936473]


Create a dataframe with all of the coefficients for the models to compare the total value for coefficients


In [31]:
a = pd.DataFrame(data=lin.coef_ , columns=X_test.columns).T
b = pd.DataFrame(data=lasso_01.coef_ , index=X_test.columns)
c = pd.DataFrame(data=ridge_01.coef_ , columns=X_test.columns).T

In [32]:
all_coefs = pd.concat([a,b,c], axis=1)
all_coefs.columns = ['Unreguralized', 'Lasso','Ridge']

intercepts= [*lin.intercept_,*lasso_01.intercept_,*ridge_01.intercept_]

intercept_df = pd.DataFrame([intercepts], index=['intercept'], columns = ['Unreguralized', 'Lasso','Ridge'])

final = pd.concat([all_coefs, intercept_df])

In [33]:
final
#compare the penalties by each model

| | Unreguralized | Lasso | Ridge |
|---|---|---|---|
| cylinders | -0.136971 | -0.0 | -0.137331 |
| displacement | 0.738396 | -0.0 | 0.720956 |
| horsepower | 0.417693 | -0.0 | 0.406697 |
| weight | -6.374134 | -5.530301 | -6.349448 |
| acceleration | 0.387689 | 0.070772 | 0.381652 |
| model_year | 2.904962 | 2.747094 | 2.901871 |
| intercept | 23.761409 | 23.761409 | 23.761409 |


In [34]:
final.abs().sum()

Unreguralized    34.721254
Lasso            32.109576
Ridge            34.659365
dtype: float64

### Fit models with different penalties

**Alpha = 0.5**

In [35]:
ridge_05 = Ridge(alpha=0.5)
ridge_05.fit(X_train, y_train)

lasso_05 = Lasso(alpha=0.5)
lasso_05.fit(X_train, y_train)

Lasso(alpha=0.5)

**Alpha =1**

In [36]:
ridge_1 = Ridge(alpha=1)
ridge_1.fit(X_train, y_train)

lasso_1 = Lasso(alpha=1)
lasso_1.fit(X_train, y_train)

Lasso(alpha=1)

In [37]:
r_05 = pd.DataFrame(data=ridge_05.coef_ , columns=X_test.columns).T
r_1 = pd.DataFrame(data=ridge_1.coef_ , columns=X_test.columns).T

In [38]:
l_05 = pd.DataFrame(data=lasso_05.coef_ , index=X_test.columns)
l_1 = pd.DataFrame(data=lasso_1.coef_ , index=X_test.columns)

In [39]:
more_coefs = pd.concat([l_05,r_05,l_1,r_1], axis=1)
more_coefs.columns = ['Lasso_05', 'Ridge_05','Lasso_1','Ridge_1']

In [40]:
more_coefs

| | Lasso_05 | Ridge_05 | Lasso_1 | Ridge_1 |
|---|---|---|---|---|
| cylinders | -0.0 | -0.139485 | -0.0 | -0.143584 |
| displacement | -0.0 | 0.654002 | -0.0 | 0.576128 |
| horsepower | -0.0 | 0.363864 | -0.0 | 0.312799 |
| weight | -5.244371 | -6.253433 | -4.853253 | -6.139191 |
| acceleration | 0.0 | 0.358272 | 0.0 | 0.330674 |
| model_year | 2.446808 | 2.889766 | 2.055685 | 2.875185 |


In [41]:
intercepts_= [*lasso_05.intercept_,*ridge_05.intercept_, *lasso_1.intercept_,*ridge_1.intercept_ ]

intercepts_df = pd.DataFrame([intercepts_], index=['intercept'], columns = ['Lasso_05', 'Ridge_05','Lasso_1','Ridge_1'])



In [42]:
additional_df = pd.concat([more_coefs, intercepts_df])

In [43]:
final_coefs = pd.concat([final,additional_df], axis=1)

In [44]:
final_coefs

| | Unreguralized | Lasso | Ridge | Lasso_05 | Ridge_05 | Lasso_1 | Ridge_1 |
|---|---|---|---|---|---|---|---|
| cylinders | -0.136971 | -0.0 | -0.137331 | -0.0 | -0.139485 | -0.0 | -0.143584 |
| displacement | 0.738396 | -0.0 | 0.720956 | -0.0 | 0.654002 | -0.0 | 0.576128 |
| horsepower | 0.417693 | -0.0 | 0.406697 | -0.0 | 0.363864 | -0.0 | 0.312799 |
| weight | -6.374134 | -5.530301 | -6.349448 | -5.244371 | -6.253433 | -4.853253 | -6.139191 |
| acceleration | 0.387689 | 0.070772 | 0.381652 | 0.0 | 0.358272 | 0.0 | 0.330674 |
| model_year | 2.904962 | 2.747094 | 2.901871 | 2.446808 | 2.889766 | 2.055685 | 2.875185 |
| intercept | 23.761409 | 23.761409 | 23.761409 | 23.761409 | 23.761409 | 23.761409 | 23.761409 |


In [46]:
final_coefs.abs().sum()
# as alpha (lambda) increases, the total coefficient size keeps going down

Unreguralized    34.721254
Lasso            32.109576
Ridge            34.659365
Lasso_05         31.452588
Ridge_05         34.420230
Lasso_1          30.670348
Ridge_1          34.138971
dtype: float64

## Model Evaluation

In [47]:
X_train.shape

(298, 6)

In [48]:
# create predictions

y_h_lin_train = lin.predict(X_train)
y_h_lin_test = lin.predict(X_test)


y_h_ridge_train_01 = ridge_01.predict(X_train)
y_h_ridge_test_01 = ridge_01.predict(X_test)

# Lasso returns a 1-D array here; reshape(-1, 1) turns it into a column
# like the other predictions without hard-coding the row counts
y_h_lasso_train_01 = lasso_01.predict(X_train).reshape(-1, 1)
y_h_lasso_test_01 = lasso_01.predict(X_test).reshape(-1, 1)

y_h_ridge_train_05 = ridge_05.predict(X_train)
y_h_ridge_test_05 = ridge_05.predict(X_test)

y_h_lasso_train_05 = lasso_05.predict(X_train).reshape(-1, 1)
y_h_lasso_test_05 = lasso_05.predict(X_test).reshape(-1, 1)

y_h_ridge_train_1 = ridge_1.predict(X_train)
y_h_ridge_test_1 = ridge_1.predict(X_test)

y_h_lasso_train_1 = lasso_1.predict(X_train).reshape(-1, 1)
y_h_lasso_test_1 = lasso_1.predict(X_test).reshape(-1, 1)


#### Examining the train and test error for the Ridge, Lasso, and unpenalized regression models

In [49]:
# examine the mean squared error on the train and test sets


print('Train Error Unpenalized Linear Model', mean_squared_error(y_train, lin.predict(X_train)))
print('Test Error Unpenalized Linear Model', mean_squared_error(y_test, lin.predict(X_test)))
print('\n')
print('Train Error Ridge Model alpha=0.1:', mean_squared_error(y_train, y_h_ridge_train_01))
print('Test Error Ridge Model alpha=0.1:', mean_squared_error(y_test, y_h_ridge_test_01))
print('\n')

print('Train Error Ridge Model alpha=0.5:', mean_squared_error(y_train, y_h_ridge_train_05))
print('Test Error Ridge Model alpha=0.5:', mean_squared_error(y_test, y_h_ridge_test_05))
print('\n')
print('Train Error Ridge Model alpha=1:', mean_squared_error(y_train, y_h_ridge_train_1))
print('Test Error Ridge Model alpha=1:', mean_squared_error(y_test, y_h_ridge_test_1))
print('\n')

print('Train Error Lasso Model alpha=0.1:', mean_squared_error(y_train, y_h_lasso_train_01))
print('Test Error Lasso Model alpha=0.1:', mean_squared_error(y_test, y_h_lasso_test_01))
print('\n')

print('Train Error Lasso Model alpha=0.5:', mean_squared_error(y_train, y_h_lasso_train_05))
print('Test Error Lasso Model alpha=0.5:', mean_squared_error(y_test, y_h_lasso_test_05))
print('\n')
print('Train Error Lasso Model alpha=1:', mean_squared_error(y_train, y_h_lasso_train_1))
print('Test Error Lasso Model alpha=1:', mean_squared_error(y_test, y_h_lasso_test_1))
print('\n')

Train Error Unpenalized Linear Model 12.345012453050336
Test Error Unpenalized Linear Model 9.764004099854432


Train Error Ridge Model alpha=0.1: 12.34507453607205
Test Error Ridge Model alpha=0.1: 9.75380014065847


Train Error Ridge Model alpha=0.5: 12.34649512695093
Test Error Ridge Model alpha=0.5: 9.715024883595461


Train Error Ridge Model alpha=1: 12.350623768999672
Test Error Ridge Model alpha=1: 9.670795004580572


Train Error Lasso Model alpha=0.1: 12.439146496138045
Test Error Lasso Model alpha=0.1: 9.48894610609167


Train Error Lasso Model alpha=0.5: 12.824789517391716
Test Error Lasso Model alpha=0.5: 9.59586461409152


Train Error Lasso Model alpha=1: 13.99814589011848
Test Error Lasso Model alpha=1: 10.514110875815367


