# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice your knowledge of Ridge and Lasso regression!

## Objectives

You will be able to:

* Use Lasso and Ridge regression in Python
* Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

First, select a subset of the data by removing the columns with `dtype = object`, so that our first model contains only **numeric features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [3]:
# Load necessary packages
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Create y
y = df['SalePrice'].copy()

# Remove `SalePrice` and all "object"-type features from `X`
X = df.drop(columns='SalePrice').select_dtypes(exclude='object')

# Impute missing values with the median of each column
# (assigning the result avoids chained in-place `fillna` on a column)
X = X.fillna(X.median())

Look at the information of `X` again

In [4]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to fit a first naive linear regression model

Compute the R squared and the MSE for both train and test set.

In [5]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error, r2_score

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=144)
# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression()
linreg.fit(X_train, y_train)
MSE = mean_squared_error(y_train, linreg.predict(X_train))
R2 = r2_score(y_train, linreg.predict(X_train))

# Evaluate on the held-out test set (the model is not refit)
MSE_test = mean_squared_error(y_test, linreg.predict(X_test))
R2_test = r2_score(y_test, linreg.predict(X_test))
print(f"MSE for training set: {round(MSE,2)}")
print(f"R-squared for training set: {round(R2,2)}")
print(f"MSE for test set: {round(MSE_test,2)}")
print(f"R-squared for test set: {round(R2_test,2)}")

MSE for training set: 1300136785.77
R-squared for training set: 0.8
MSE for test set: 864883615.27
R-squared for test set: 0.85
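
The train/test reporting pattern above repeats for every model in this lab, so it can help to wrap it in a small helper. The sketch below is one possible way to do that. `report_metrics` is a hypothetical name, not part of scikit-learn, and the synthetic data from `make_regression` only stands in for the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def report_metrics(model, X_train, X_test, y_train, y_test):
    """Return MSE and R-squared of an already fitted model on both splits."""
    results = {}
    for name, X_part, y_part in [("train", X_train, y_train),
                                 ("test", X_test, y_test)]:
        y_pred = model.predict(X_part)
        results[name] = {"mse": mean_squared_error(y_part, y_pred),
                         "r2": r2_score(y_part, y_pred)}
    return results


# Synthetic stand-in data (an assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0,
                                 random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

demo_model = LinearRegression().fit(Xtr, ytr)
demo_metrics = report_metrics(demo_model, Xtr, Xte, ytr, yte)
for split, values in demo_metrics.items():
    print(f"{split}: MSE={values['mse']:.2f}, R-squared={values['r2']:.2f}")
```

Because the helper only calls `predict`, the same function should work unchanged for `LinearRegression`, `Lasso`, and `Ridge`.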


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `MinMaxScaler` to scale our predictors!

In [6]:
from sklearn.preprocessing import MinMaxScaler

# Scale the data and perform train test split
scale = MinMaxScaler()
transformed_X = scale.fit_transform(X)

X_train_trans, X_test_trans, y_train, y_test = train_test_split(transformed_X, y, random_state=144)
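
The cell above fits the scaler on all rows before splitting, so the test rows influence the minima and maxima used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that ordering on synthetic data; the `_demo` names and the generated values are placeholders, not part of the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the numeric predictors (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from training rows
X_te_scaled = scaler.transform(X_te)      # reuse the training min/max

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
print(X_te_scaled.min(axis=0), X_te_scaled.max(axis=0))
```

With this ordering the training columns span exactly 0 to 1, while test values may fall slightly outside that range, which is expected.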

Perform the same linear regression on this data and print out R-squared and MSE.

In [7]:
linreg_trans = LinearRegression()
linreg_trans.fit(X_train_trans, y_train)
MSE = mean_squared_error(y_train, linreg_trans.predict(X_train_trans))
R2 = r2_score(y_train, linreg_trans.predict(X_train_trans))

# Evaluate on the held-out test set, using separate names so the
# training metrics are not overwritten
MSE_test = mean_squared_error(y_test, linreg_trans.predict(X_test_trans))
R2_test = r2_score(y_test, linreg_trans.predict(X_test_trans))
print(f"MSE for training set: {round(MSE,2)}")
print(f"R-squared for training set: {round(R2,2)}")
print(f"MSE for test set: {round(MSE_test,2)}")
print(f"R-squared for test set: {round(R2_test,2)}")

MSE for test set: 860911178.85


## Include dummy variables

Your model hasn't included dummy variables so far. Let's bring back the "object" variables and create dummies.

In [8]:
# Keep only the "object"-type (categorical) features
X_cat = df.drop(columns='SalePrice').select_dtypes(include='object')

In [10]:
# Make dummies, dropping the first level of each feature as the reference level
X_cat = pd.get_dummies(X_cat, drop_first=True)
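
To see what `drop_first=True` does, here is a tiny example on a made-up frame. The values are invented for illustration; only the column names echo the housing data. Each categorical column with k levels becomes k - 1 indicator columns, and the first level in sorted order acts as the reference level.

```python
import pandas as pd

# Toy frame with two categorical columns (made-up values)
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotShape": ["Reg", "IR1", "IR2"],
})

toy_dummies = pd.get_dummies(toy, drop_first=True)
print(list(toy_dummies.columns))
# ['Street_Pave', 'LotShape_IR2', 'LotShape_Reg']
```

`Street_Grvl` and `LotShape_IR1` are dropped, so a row with all indicators for a feature equal to zero belongs to that feature's reference level.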

Merge `X_cat` with our scaled `X` so you have one big predictor dataframe.

In [15]:
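# Keep the original column names so that all feature names are strings
# (recent scikit-learn versions reject mixed integer/string column names)
X_all = pd.concat([pd.DataFrame(transformed_X, columns=X.columns), X_cat], axis=1)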

Perform the same linear regression on this data and print out R-squared and MSE.

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X_all, y, random_state=144)
# Fit the model and print R2 and MSE for train and test
linreg_all = LinearRegression()
linreg_all.fit(X_train, y_train)
MSE = mean_squared_error(y_train, linreg_all.predict(X_train))
R2 = r2_score(y_train, linreg_all.predict(X_train))

# Evaluate on the held-out test set (the model is not refit)
MSE_test = mean_squared_error(y_test, linreg_all.predict(X_test))
R2_test = r2_score(y_test, linreg_all.predict(X_test))
print(f"MSE for training set: {round(MSE,2)}")
print(f"R-squared for training set: {round(R2,2)}")
print(f"MSE for test set: {round(MSE_test,2)}")
print(f"R-squared for test set: {round(R2_test,2)}")

MSE for training set: 429657488.77
R-squared for training set: 0.93
MSE for test set: 4.098809727345456e+28
R-squared for test set: -6.930829492395863e+18


Notice the severe overfitting above: the training R-squared is high (0.93), but the test R-squared is hugely negative, so the predictions on unseen data are far off. Likewise, the test MSE is many orders of magnitude larger than the training MSE.
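
A negative R-squared can look surprising. It simply means that the predictions fit the data worse than always predicting the mean of the target would. The small made-up example below shows this with `r2_score`.

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]
y_mean_pred = [2.0, 2.0, 2.0]  # always predict the mean of y_true
y_bad_pred = [3.0, 2.0, 1.0]   # predictions in reverse order

r2_mean = r2_score(y_true, y_mean_pred)
r2_bad = r2_score(y_true, y_bad_pred)
print(r2_mean)  # 0.0
print(r2_bad)   # -3.0
```

Here the total sum of squares is 2 and the reversed predictions have a residual sum of squares of 8, so R-squared is 1 - 8/2 = -3.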

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and fit both a Lasso and a Ridge regression model. Each time, look at R-squared and MSE.
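
For reference, these are the objective functions that scikit-learn's `Ridge` and `Lasso` minimize, as stated in the scikit-learn documentation. Here $n$ is the number of training samples, $w$ the coefficient vector, and $\alpha$ the regularization strength:

$$
\text{Ridge:}\quad \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Because the Lasso loss is scaled by $1/(2n)$ and the Ridge loss is not, the same numeric value of $\alpha$ does not represent the same amount of regularization in the two models.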

## Lasso

With default parameter (alpha = 1)

In [25]:
from sklearn.linear_model import Ridge, Lasso
lasso = Lasso(alpha=1.0)
lasso.fit(X_train, y_train)

MSE = mean_squared_error(y_train, lasso.predict(X_train))
R2 = r2_score(y_train, lasso.predict(X_train))

# Evaluate on the held-out test set
MSE_test = mean_squared_error(y_test, lasso.predict(X_test))
R2_test = r2_score(y_test, lasso.predict(X_test))
# print(lasso.coef_)
print(f"MSE for training set: {round(MSE,2)}")
print(f"R-squared for training set: {round(R2,2)}")
print(f"MSE for test set: {round(MSE_test,2)}")
print(f"R-squared for test set: {round(R2_test,2)}")

MSE for training set: 430136898.94
R-squared for training set: 0.93
MSE for test set: 2956638643.03
R-squared for test set: 0.5


With a higher regularization parameter (alpha = 10)

In [26]:
lasso = Lasso(alpha=10.0)
lasso.fit(X_train, y_train)

MSE = mean_squared_error(y_train, lasso.predict(X_train))
R2 = r2_score(y_train, lasso.predict(X_train))

# Evaluate on the held-out test set
MSE_test = mean_squared_error(y_test, lasso.predict(X_test))
R2_test = r2_score(y_test, lasso.predict(X_test))
# print(lasso.coef_)
print(f"MSE for training set: {round(MSE,2)}")
print(f"R-squared for training set: {round(R2,2)}")
print(f"MSE for test set: {round(MSE_test,2)}")
print(f"R-squared for test set: {round(R2_test,2)}")

MSE for training set: 451526183.73
R-squared for training set: 0.93
MSE for test set: 1592927031.77
R-squared for test set: 0.73


## Ridge

With default parameter (alpha = 1)

In [30]:

ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

MSE = mean_squared_error(y_train, ridge.predict(X_train))
R2 = r2_score(y_train, ridge.predict(X_train))

# Evaluate on the held-out test set
MSE_test = mean_squared_error(y_test, ridge.predict(X_test))
R2_test = r2_score(y_test, ridge.predict(X_test))
# print(ridge.coef_)
print(f"MSE for training set: {round(MSE,2)}")
print(f"R-squared for training set: {round(R2,2)}")
print(f"MSE for test set: {round(MSE_test,2)}")
print(f"R-squared for test set: {round(R2_test,2)}")

MSE for training set: 568187322.99
R-squared for training set: 0.91
MSE for test set: 731842756.36
R-squared for test set: 0.88


With a higher regularization parameter (alpha = 10)

In [31]:
ridge = Ridge(alpha=10.0)
ridge.fit(X_train, y_train)

MSE = mean_squared_error(y_train, ridge.predict(X_train))
R2 = r2_score(y_train, ridge.predict(X_train))

# Evaluate on the held-out test set
MSE_test = mean_squared_error(y_test, ridge.predict(X_test))
R2_test = r2_score(y_test, ridge.predict(X_test))
# print(ridge.coef_)
print(f"MSE for training set: {round(MSE,2)}")
print(f"R-squared for training set: {round(R2,2)}")
print(f"MSE for test set: {round(MSE_test,2)}")
print(f"R-squared for test set: {round(R2_test,2)}")

MSE for training set: 763232510.22
R-squared for training set: 0.88
MSE for test set: 721227528.36
R-squared for test set: 0.88
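
The two values of alpha tried above are only two points on a continuum. One way to explore the effect more systematically is to loop over a grid of values and record the test R-squared for each. The sketch below does this on synthetic data from `make_regression`; the grid and the `_demo` names are arbitrary choices for illustration, not part of the lab.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5 informative features out of 50 (illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=50,
                                 n_informative=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

alphas = [0.1, 1.0, 10.0, 100.0]
scores = {"Ridge": [], "Lasso": []}
for alpha in alphas:
    for name, model in [("Ridge", Ridge(alpha=alpha)),
                        ("Lasso", Lasso(alpha=alpha))]:
        model.fit(X_tr, y_tr)
        scores[name].append(r2_score(y_te, model.predict(X_te)))

for name, values in scores.items():
    for alpha, r2 in zip(alphas, values):
        print(f"{name} alpha={alpha}: test R-squared = {r2:.3f}")
```

If alpha is chosen this way in practice, it is usually better to compare values on a separate validation set or with cross-validation, so that the test set remains an unbiased check.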


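## Look at the metrics. What are your main conclusions?

The unregularized model with all dummy variables overfits badly: its training R-squared is 0.93, but its test R-squared is hugely negative. Both penalties reduce this. Lasso with alpha = 1 keeps the training fit (0.93) but reaches a test R-squared of only 0.5; raising alpha to 10 lifts the test R-squared to 0.73. Ridge generalizes best here, with a test R-squared of 0.88 for both alpha = 1 and alpha = 10. At alpha = 10 its training and test R-squared are equal (0.88), which suggests little remaining overfitting.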

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

Compare with the total number of parameters and draw conclusions!

In [48]:
# number of Ridge params almost zero
n = len(ridge.coef_)
num_zero = [x for x in ridge.coef_ if abs(x) < 0.0001]
print(f"Number of coefficients that became zero under Ridge Regression: {len(num_zero)}")
print(f"Proportion of coefficients that became zero under Ridge Regression: {len(num_zero)/n}")

Number of coefficients that became zero under Ridge Regression: 5
Proportion of coefficients that became zero under Ridge Regression: 0.02032520325203252


In [49]:
# number of Lasso params almost zero
n = len(lasso.coef_)
num_zero = [x for x in lasso.coef_ if abs(x) < 0.0001]
print(f"Number of coefficients that became zero under Lasso Regression: {len(num_zero)}")
print(f"Proportion of coefficients that became zero under Lasso Regression: {len(num_zero)/n}")


Number of coefficients that became zero under Lasso Regression: 49
Proportion of coefficients that became zero under Lasso Regression: 0.1991869918699187


Lasso was very effective at performing variable selection: it shrank 49 of the 246 coefficients (about 20%) to essentially zero, compared with only 5 (about 2%) for Ridge.
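
The difference in sparsity between the two penalties is easy to reproduce on synthetic data where only a few features carry signal. The sketch below uses `make_regression` with 5 informative features out of 50 (an assumption for illustration, unrelated to the housing data) and counts near-zero coefficients with NumPy.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only 5 of the 50 features influence the target
X_demo, y_demo = make_regression(n_samples=200, n_features=50,
                                 n_informative=5, noise=10.0, random_state=0)

ridge_demo = Ridge(alpha=5.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=5.0).fit(X_demo, y_demo)

tol = 1e-4  # same "close to zero" threshold as in the cells above
n_zero_ridge = int(np.sum(np.abs(ridge_demo.coef_) < tol))
n_zero_lasso = int(np.sum(np.abs(lasso_demo.coef_) < tol))

print(f"Ridge: {n_zero_ridge} of {ridge_demo.coef_.size} coefficients near zero")
print(f"Lasso: {n_zero_lasso} of {lasso_demo.coef_.size} coefficients near zero")
```

The L1 penalty sets coefficients exactly to zero once their contribution falls below a threshold, whereas the L2 penalty only shrinks them toward zero, so Ridge rarely produces coefficients this small.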

## Summary

Great! You now know how to perform Lasso and Ridge regression.