# Regularization and the Bias-Variance Tradeoff

In this lecture, we discuss ridge and lasso regression and see how these regularization techniques can improve a linear regression model's cross-validated error.

<b>Functions and attributes in this lecture:</b>
- `sklearn.linear_model` - Submodule for linear models
  - `Ridge` - Implements Ridge Regression
  - `Lasso` - Implements Lasso Regression

In [1]:
# Non-sklearn packages
import numpy as np
import pandas as pd

# Sklearn packages
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

## Importing the dataset and creating a base linear model

In [2]:
# Importing the cleaned tips dataset
cleaned_tips = pd.read_csv("cleaned_tips.csv")

In [3]:
# Checking out the dataset
cleaned_tips.head()

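| Unnamed: 0 | total_bill | tip | size | sex_Female | smoker_No | day_Fri | day_Sat | day_Sun | day_Thur | time_Dinner |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 2 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 1 | 10.34 | 1.66 | 3 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 2 | 21.01 | 3.5 | 3 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 3 | 23.68 | 3.31 | 2 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 4 | 24.59 | 3.61 | 4 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |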


In [4]:
# Splitting into features and targets
X = cleaned_tips.drop("tip", axis=1)
y = cleaned_tips["tip"]

In [5]:
# A baseline linear model, scored with 5-fold cross-validated mean squared error
# (sklearn reports the negated MSE, so we flip the sign)
linear_reg = LinearRegression()
linear_result = cross_validate(linear_reg, X, y, cv=5, scoring="neg_mean_squared_error")
print("Mean Squared Error: ", -np.mean(linear_result["test_score"]))

Mean Squared Error:  1.1252797795692773
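
The sign flip above follows from scikit-learn's convention that higher scores are better, so error metrics are exposed in negated form. The sketch below illustrates this on synthetic stand-in data (an assumption, since `cleaned_tips.csv` is not reproduced here) and also requests the mean absolute error, to show that the two metrics are selected by different `scoring` strings. All `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: this is NOT the tips dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# Request two scorers at once; both come back negated
demo_result = cross_validate(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=5,
    scoring=["neg_mean_squared_error", "neg_mean_absolute_error"],
)

# With several scorers, the result keys are "test_<scorer name>"
demo_mse = -np.mean(demo_result["test_neg_mean_squared_error"])
demo_mae = -np.mean(demo_result["test_neg_mean_absolute_error"])

# Square root of the mean MSE, in the same units as the target
demo_rmse = np.sqrt(demo_mse)

print("Mean Squared Error: ", demo_mse)
print("Mean Absolute Error: ", demo_mae)
print("Root of the mean MSE: ", demo_rmse)
```

Taking the square root of the MSE puts the error back in the units of the target, which can make it easier to compare against the mean absolute error.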


## Ridge Regression and Lasso Regression
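
For reference, the objectives that scikit-learn documents for these two estimators can be written as follows, where $X$ is the feature matrix, $y$ the target, $w$ the coefficient vector, and $n$ the number of samples (the intercept is omitted here for brevity):

$$\text{Ridge:}\quad \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2$$

$$\text{Lasso:}\quad \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$$

In both cases a larger $\alpha$ shrinks the coefficients more strongly, which typically adds bias and reduces variance. Note that the squared-error term is scaled differently in the two objectives, so the same numeric value of $\alpha$ does not mean the same amount of regularization for `Ridge` and `Lasso`.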

In [6]:
# Ridge Regression (L2 penalty), scored with cross-validated mean squared error
from sklearn.linear_model import Ridge

ridge_model = Ridge(alpha=30)
ridge_result = cross_validate(ridge_model, X, y, cv=5, scoring="neg_mean_squared_error")
print("Mean Squared Error: ", -np.mean(ridge_result["test_score"]))

Mean Squared Error:  1.085993827471052


In [7]:
# Lasso Regression (L1 penalty), scored with cross-validated mean squared error
from sklearn.linear_model import Lasso

lasso_model = Lasso(alpha=1)
lasso_result = cross_validate(lasso_model, X, y, cv=5, scoring="neg_mean_squared_error")
print("Mean Squared Error: ", -np.mean(lasso_result["test_score"]))

Mean Squared Error:  1.0691001605259944
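
The scores alone do not show how the two penalties act on the coefficients. As a rough illustration, the sketch below fits ordinary least squares, ridge, and lasso on synthetic data (an assumption, not the tips dataset) in which only 3 of 10 features carry signal, then compares the size of the coefficient vectors and the number of coefficients set exactly to zero. The `*_demo` names are hypothetical, and the `alpha` values are simply reused from the cells above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data (assumption): 10 features, only 3 of them informative
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

ols_demo = LinearRegression().fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=30).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=1).fit(X_demo, y_demo)

for name, model in [("OLS", ols_demo), ("Ridge", ridge_demo), ("Lasso", lasso_demo)]:
    coef_norm = np.linalg.norm(model.coef_)      # overall size of the coefficients
    n_zero = int(np.sum(model.coef_ == 0))       # coefficients removed entirely
    print(f"{name}: L2 norm = {coef_norm:.2f}, exact zeros = {n_zero}")
```

Ridge shrinks all coefficients toward zero without usually eliminating any, whereas lasso can set some coefficients exactly to zero, which acts as a form of feature selection.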


## Finding the best $\alpha$ for the Lasso Regression

In [8]:
# Searching for a good alpha value using cross-validated mean squared error
alphas = [0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 0.8, 1, 2, 5, 10, 50]
scores = []

for alpha in alphas:
    lasso_model = Lasso(alpha=alpha)
    lasso_result = cross_validate(lasso_model, X, y, cv=5, scoring="neg_mean_squared_error")
    score = -np.mean(lasso_result["test_score"])
    print("Mean Squared Error: ", score)
    scores.append(score)

Mean Squared Error:  1.1211365981262227
Mean Squared Error:  1.1062899113874012
Mean Squared Error:  1.0917016243989324
Mean Squared Error:  1.052339323765285
Mean Squared Error:  1.0605979804997434
Mean Squared Error:  1.0610918747082556
Mean Squared Error:  1.0620375541693847
Mean Squared Error:  1.0654696766627012
Mean Squared Error:  1.0691001605259944
Mean Squared Error:  1.103361406255443
Mean Squared Error:  1.367233407573625
Mean Squared Error:  1.9239219035570045
Mean Squared Error:  1.9239219035570045


In [9]:
# Picking the alpha with the lowest cross-validated error
best_alpha = alphas[np.argmin(scores)]

In [10]:
# Building the Lasso model with the best alpha and fitting it on the full dataset
best_model = Lasso(alpha=best_alpha).fit(X, y)
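
The manual loop over `alphas` can also be expressed with `GridSearchCV`, which runs the same cross-validated search and, with its default `refit=True`, refits the best model on all of the data. This is a minimal sketch on synthetic data (an assumption, not the tips dataset); the `*_demo` names are hypothetical, and `max_iter` is raised only to give the solver more room at very small `alpha` values.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: this is NOT the tips dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

alphas_demo = [0.001, 0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 0.8, 1, 2, 5, 10, 50]

# Same 5-fold, negated-MSE setup as the manual loop
search_demo = GridSearchCV(
    Lasso(max_iter=10_000),
    param_grid={"alpha": alphas_demo},
    cv=5,
    scoring="neg_mean_squared_error",
)
search_demo.fit(X_demo, y_demo)

best_alpha_demo = search_demo.best_params_["alpha"]
best_model_demo = search_demo.best_estimator_  # already refit on all the data

print("Best alpha: ", best_alpha_demo)
print("Mean Squared Error: ", -search_demo.best_score_)
print("Coefficients: ", best_model_demo.coef_)
```

Because the best alpha is chosen using the same folds that produce the reported score, that score tends to be slightly optimistic; a held-out test set gives a cleaner estimate.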