> # LASSO and RIDGE Regression

## Conceptual Overview


Lasso and Ridge regression, also known as L1 and L2 regularization respectively, are **“regularization”** techniques.

![image.png](attachment:image.png)

The goal of regularization is to improve the overall fit by adjusting the balance between **“bias”** and **“variance”**, which is done by adding a penalty that scales with model complexity.

![image.png](attachment:image.png)

Applying this to linear regression, we start with a line through our data.

![image.png](attachment:image.png)

We calculate the residuals as usual.

![image.png](attachment:image.png)

Next, the penalty is calculated.  For Lasso, the penalty scales with the absolute value of the slope, and for Ridge it scales with the slope squared.

![image.png](attachment:image.png)

The penalty is added to the sum of squared residuals, and the algorithm then minimizes this total, just as the least-squares method minimizes the residuals alone.

![image.png](attachment:image.png)
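
To make the "residuals plus penalty" idea concrete, here is a minimal sketch on made-up toy data (not the startup dataset). The names `penalized_cost` and `lam` and the penalty strength of 5 are assumptions chosen for illustration; the code simply tries many candidate slopes and keeps the one with the lowest total cost.

```python
import numpy as np

# Toy, made-up data, centered so that no intercept is needed.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([-4.1, -1.9, 0.2, 2.1, 3.8])

def penalized_cost(slope, lam, kind):
    """Sum of squared residuals plus an L1 (lasso) or L2 (ridge) penalty on the slope."""
    ssr = np.sum((y - slope * x) ** 2)
    penalty = abs(slope) if kind == "lasso" else slope ** 2
    return ssr + lam * penalty  # lam = 0 reduces to plain least squares

slopes = np.linspace(0.0, 3.0, 3001)  # candidate slopes to try
lam = 5.0                             # hypothetical penalty strength

best = {}
for kind, strength in [("ols", 0.0), ("lasso", lam), ("ridge", lam)]:
    costs = [penalized_cost(s, strength, kind) for s in slopes]
    best[kind] = slopes[int(np.argmin(costs))]

print(best)
```

On this toy data the unpenalized slope should land near 1.98, while the Lasso and Ridge slopes come out smaller (near 1.73 and 1.32), which is the shrinkage described above.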

The result is a best-fit line with a smaller slope that will hopefully fit our test data better.

![image.png](attachment:image.png)

This is particularly useful when working with a small amount of training data.

![image.png](attachment:image.png)

This is also powerful with higher dimensional data, as the penalty is calculated using the coefficients of all predictive variables.

![image.png](attachment:image.png)
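
With several predictive variables, the two objectives can be written out as follows. The notation is an assumption made for this sketch: $\beta_j$ are the coefficients, $p$ is the number of variables, and $\lambda$ is the penalty strength. scikit-learn calls the penalty strength `alpha` and scales the squared-error term slightly differently for Lasso, so the numbers are not directly interchangeable.

$$
\text{Lasso (L1):}\quad \min_{\beta}\; \sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
$$

$$
\text{Ridge (L2):}\quad \min_{\beta}\; \sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2 + \lambda \sum_{j=1}^{p} \beta_j^{2}
$$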

With Ridge Regression, the influence of unnecessary variables is minimized, and with Lasso Regression their coefficients can actually drop to zero, removing them from the model altogether.

![image.png](attachment:image.png)
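
The difference can be seen in a small sketch on synthetic data (an assumption for illustration, not the startup dataset): only the first two of five features actually drive the target, and both models use the same hypothetical `alpha` of 0.5.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 5 features, but only the first two influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge = Ridge(alpha=0.5).fit(X_demo, y_demo)

print("Lasso:", np.round(lasso.coef_, 3))
print("Ridge:", np.round(ridge.coef_, 3))
```

The Lasso coefficients for the three unused features should come out exactly zero, while Ridge leaves them small but nonzero.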

Overall, both regularization techniques help reduce overfitting, especially with small datasets or those with many variables.

## Implementation

Below we will predict salary based on the current role someone is in, and then make some predictions on startups.  I will also include side-by-side examples showing how it performs compared to decision trees.

The first step is to start with "imports".

In [1]:
#Numpy is used so that we can deal with arrays, which are necessary for any linear algebra
import numpy as np

#Pandas is used so that we can create dataframes, which is particularly useful when
# reading or writing from a CSV.
import pandas as pd

#Matplotlib is used to generate graphs in just a few lines of code.
import matplotlib.pyplot as plt

#Import the classes we need to test linear, ridge, and lasso to compare
from sklearn.linear_model import LinearRegression, Ridge, Lasso


#These will be our main evaluation metrics 
from sklearn.metrics import r2_score, mean_squared_error

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split

# Will use this to "normalize" our data.
from sklearn.preprocessing import normalize

Next we will need to load our data.

I will use data on 50 startups, the same data I used for decision trees and random forests.

In [2]:
#read the data from csv
dataset = pd.read_csv('Startups_data.csv')

#take a look at our dataset.  head() gives the first 5 lines. 
dataset.head()

|   | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


Now to keep this simple, I am only going to look at the continuous variables, so we need to drop the State column.

In [3]:
#drop the column
dataset = dataset.drop(columns = ['State'])

#take a look again 
dataset.head()

|   | R&D Spend | Administration | Marketing Spend | Profit |
|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | 166187.94 |


OK, it's looking good.  Now we need to select the X variables and the y variable (independent and dependent).

In [4]:
#set the independent variables using all rows and every column except the last.
X = dataset.iloc[:, :-1].values

#set the dependent variable using all rows but only the last column. 
y = dataset.iloc[:, -1].values

#let's take a look at X right now.
X[0:10]

array([[165349.2 , 136897.8 , 471784.1 ],
       [162597.7 , 151377.59, 443898.53],
       [153441.51, 101145.55, 407934.54],
       [144372.41, 118671.85, 383199.62],
       [142107.34,  91391.77, 366168.42],
       [131876.9 ,  99814.71, 362861.36],
       [134615.46, 147198.87, 127716.82],
       [130298.13, 145530.06, 323876.68],
       [120542.52, 148718.95, 311613.29],
       [123334.88, 108679.17, 304981.62]])

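Next we need to scale down our X variables so that our "alphas" have an impact later when we introduce LASSO and RIDGE.

We have two options, standardize and normalize.  Since we don't know the distribution of our data, we will use normalize.  Note that scikit-learn's `normalize` rescales each row (sample) to unit length rather than rescaling each column, which is why every row of the output below has an L2 norm of 1.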

In [5]:
#rescale each row of X to unit L2 norm
X = normalize(X, norm='l2')

X[0:10]

array([[0.31900633, 0.26411537, 0.91020769],
       [0.32756301, 0.30495941, 0.89426072],
       [0.34294681, 0.22606362, 0.91174707],
       [0.33862972, 0.2783483 , 0.89880595],
       [0.35238805, 0.22662705, 0.90799936],
       [0.33070362, 0.25030226, 0.9099362 ],
       [0.56834481, 0.62147181, 0.53921884],
       [0.34450073, 0.38477307, 0.85631123],
       [0.32960386, 0.40664771, 0.85205571],
       [0.35598308, 0.31368211, 0.88027245]])
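
For comparison, here is a minimal sketch of the two scaling options on the first three rows of X shown earlier. `StandardScaler` is scikit-learn's standardizing transformer; it is not used in this notebook and appears here only to contrast column-wise scaling with the row-wise `normalize`.

```python
import numpy as np
from sklearn.preprocessing import normalize, StandardScaler

# First three rows of X from the notebook output.
X_demo = np.array([
    [165349.20, 136897.80, 471784.10],
    [162597.70, 151377.59, 443898.53],
    [153441.51, 101145.55, 407934.54],
])

# normalize: each ROW is divided by its own L2 norm.
X_rows = normalize(X_demo, norm="l2")
print(np.linalg.norm(X_rows, axis=1))  # every row now has length 1

# StandardScaler: each COLUMN is shifted to mean 0 and scaled to unit variance.
X_cols = StandardScaler().fit_transform(X_demo)
print(X_cols.mean(axis=0))             # column means are (numerically) 0
```

Which option suits a penalized model better depends on the data; the point of the sketch is only that the two functions operate along different axes.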

In order to understand the effectiveness of Lasso and Ridge regression we will need a test set, so let's split our data into training and test sets.

In [6]:
#split the dataset.  Take 40% to be our test set. 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 0)


With the dataset split, we can now load and fit the models.

I will start with basic Linear Regression first.

In [7]:
#this creates a LinearRegression model object from the sklearn library.
regressor = LinearRegression()

#this fits the model to our training data.
regressor.fit(X_train, y_train)

LinearRegression()

Now we can use the model to predict on the test set.

In [8]:
#Predict on our test set.
y_pred = regressor.predict(X_test)

Finally, we can evaluate the quality of the fit. 

In [9]:
#calculate the R^2 score and the mean squared error
score = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

#print out our score properly formatted as a percent.
print("R^2 score:", "{:.2%}".format(score))
print("MSE", round(mse,2))

R^2 score: 83.41%
MSE 209417522.03


We get an R^2 score of 83%, which is pretty good. The mean squared error looks very high, but keep in mind that it is measured in squared Profit units. Let's look at Lasso Regression and see if the error drops.
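
To put that MSE back on the same scale as Profit, one option is to take its square root (the RMSE). This is a small sketch using the MSE value printed by the linear regression cell; the variable name `rmse` is introduced here for illustration.

```python
import numpy as np

mse = 209417522.03   # MSE reported for the linear regression model
rmse = np.sqrt(mse)  # root mean squared error, in the same units as Profit

print(round(rmse, 2))
```

That works out to roughly 14,471, which is easier to judge against profits that run into the hundreds of thousands.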

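With Lasso regression, we have a new hyperparameter, 'alpha', which sets the strength of the penalty: at 0 the penalty vanishes, and larger values shrink the coefficients more.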
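
As a quick illustration of what alpha does, here is a sketch on synthetic data (an assumption for this example, not the startup data) that fits Lasso at three hypothetical alpha values and records the total size of the coefficients each time.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data with known coefficients 4, -2 and 0.5.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([4.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

l1_norms = []
for a in [0.01, 0.1, 1.0]:
    coef = Lasso(alpha=a).fit(X_demo, y_demo).coef_
    l1_norms.append(np.abs(coef).sum())  # total absolute size of the coefficients
    print(f"alpha={a:g}  coefficients={np.round(coef, 3)}")
```

The summed absolute coefficients should shrink as alpha grows, with the smallest true coefficient being the first to reach zero.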

In [10]:
#candidate penalty strengths.  Negative values are not real penalties (they
# reward larger coefficients) and recent scikit-learn releases may reject them;
# they are kept here only because the results discuss them.
alphas = [-5, -1, 1e-4, 1e-3, 1e-2, 1, 5]

def test_alpha(a):
    #fit a Lasso model with penalty strength a and score it on the test set
    model_lasso = Lasso(alpha=a)
    model_lasso.fit(X_train, y_train)
    pred_test_lasso = model_lasso.predict(X_test)
    new_score = r2_score(y_test, pred_test_lasso)
    new_mse = mean_squared_error(y_test, pred_test_lasso)
    print('ALPHA: {:g} R2 SCORE: {:.4f} MSE: {:.1f}'.format(a, new_score, new_mse))


for alpha in alphas:
    test_alpha(alpha)


ALPHA: -5 R2 SCORE: 0.8342 MSE: 209220870.4
ALPHA: -1 R2 SCORE: 0.8341 MSE: 209376097.3
ALPHA: 0.0001 R2 SCORE: 0.8341 MSE: 209417526.2
ALPHA: 0.001 R2 SCORE: 0.8341 MSE: 209417563.3
ALPHA: 0.01 R2 SCORE: 0.8341 MSE: 209417934.7
ALPHA: 1 R2 SCORE: 0.8340 MSE: 209459372.7
ALPHA: 5 R2 SCORE: 0.8339 MSE: 209637792.8




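Here we see that Lasso barely changes the accuracy of the model.  The only (very slight) improvement comes at the large negative alpha, and a negative alpha is not regularization at all: it rewards larger coefficients instead of penalizing them.  This suggests that slightly larger coefficients than our initial model found would fit this test set better, but for valid alphas (zero and above) Lasso gives no improvement here.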

Now let's move on to Ridge.

In [11]:
#same candidate alphas as before; the negative values are not valid penalties
# and recent scikit-learn releases may reject them.
alphas = [-5, -1, 1e-4, 1e-3, 1e-2, 1, 5]

def test_alpha_ridge(a):
    #fit a Ridge model with penalty strength a and score it on the test set
    model_ridge = Ridge(alpha=a)
    model_ridge.fit(X_train, y_train)
    pred_test_ridge = model_ridge.predict(X_test)
    new_score = r2_score(y_test, pred_test_ridge)
    new_mse = mean_squared_error(y_test, pred_test_ridge)
    print('ALPHA: {:g} R2 SCORE: {:.4f} MSE: {:.1f}'.format(a, new_score, new_mse))


for alpha in alphas:
    test_alpha_ridge(alpha)

ALPHA: -5 R2 SCORE: -9.2274 MSE: 12906885540.4
ALPHA: -1 R2 SCORE: -1.1315 MSE: 2689990172.3
ALPHA: 0.0001 R2 SCORE: 0.8340 MSE: 209430808.7
ALPHA: 0.001 R2 SCORE: 0.8339 MSE: 209556856.1
ALPHA: 0.01 R2 SCORE: 0.8325 MSE: 211341121.1
ALPHA: 1 R2 SCORE: 0.6424 MSE: 451293520.7
ALPHA: 5 R2 SCORE: 0.3771 MSE: 786067879.6


Here we see that Ridge is much more sensitive to the scale of alpha, so let's fine-tune it using only positive values.

In [12]:
new_alphas = [1e-15,1e-10,1e-8,1e-4, 1e-3, 1e-2, 1]

for alpha in new_alphas:
    test_alpha_ridge(alpha)

ALPHA: 1e-15 R2 SCORE: 0.8341 MSE: 209417522.0
ALPHA: 1e-10 R2 SCORE: 0.8341 MSE: 209417522.0
ALPHA: 1e-08 R2 SCORE: 0.8341 MSE: 209417523.3
ALPHA: 0.0001 R2 SCORE: 0.8340 MSE: 209430808.7
ALPHA: 0.001 R2 SCORE: 0.8339 MSE: 209556856.1
ALPHA: 0.01 R2 SCORE: 0.8325 MSE: 211341121.1
ALPHA: 1 R2 SCORE: 0.6424 MSE: 451293520.7


Here we see that Lasso and Ridge Regression in this case **DO NOT** significantly improve the overall fit of the model. 

This is to be expected in some situations, but it is always worth checking when trying to fine-tune your regression model.
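
When checking many alphas by hand gets tedious, scikit-learn's `RidgeCV` and `LassoCV` can pick one by cross-validation. The sketch below is only an illustration under assumptions: it uses synthetic data in place of the startup set, and the alpha grid and `cv=5` are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

# Synthetic stand-in data: 50 rows, 3 features, one of which is irrelevant.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([3.0, 0.0, 1.0]) + rng.normal(scale=1.0, size=50)

# Only non-negative penalty strengths are valid.
alphas = [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 5.0]

ridge_cv = RidgeCV(alphas=alphas).fit(X_demo, y_demo)
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X_demo, y_demo)

print("Ridge alpha:", ridge_cv.alpha_)
print("Lasso alpha:", lasso_cv.alpha_)
```

Each fitted object exposes the selected value as `alpha_`, which can then be compared against the plain linear regression on a held-out test set as done above.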