# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice your knowledge of Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

In [3]:
df["SalePrice"].mean()

180921.19589041095

First, select a subset of the data by removing the columns with `dtype = object`, so that our first model only contains **continuous features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.
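
As a rough illustration of these steps, here is a minimal sketch on a made-up three-column frame. The name `toy` and its values are assumptions for illustration only, and `select_dtypes(include="number")` is one possible way to keep the non-object columns, not necessarily the approach this lab expects.

```python
import pandas as pd

# Toy stand-in for the housing data (values are made up for illustration)
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, None, 11250.0],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Keep only the numeric columns, which drops the text column "Street"
numeric = toy.select_dtypes(include="number")

# Separate predictors and target, then fill gaps with each column's median
X_toy = numeric.drop(columns="SalePrice")
X_toy = X_toy.fillna(X_toy.median())
y_toy = numeric["SalePrice"]

print(X_toy)
```

The missing `LotArea` entry is replaced by the median of the three observed values.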

In [4]:
# Load necessary packages
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt

In [5]:
df["Fence"].dtype

dtype('O')

In [6]:
# Remove "object"-type features; SalePrice is dropped from `X` in a later cell
dropped_df = df.copy()
for column in df.columns:
    # dtype "O" marks object (typically string) columns
    if df[column].dtype == np.dtype("O"):
        dropped_df = dropped_df.drop(columns=[column])

In [7]:
dropped_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 38 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1452 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

In [8]:
# Impute null values
dropped_df = dropped_df.fillna(dropped_df.median())
dropped_df.isna().sum()

Id               0
MSSubClass       0
LotFrontage      0
LotArea          0
OverallQual      0
OverallCond      0
YearBuilt        0
YearRemodAdd     0
MasVnrArea       0
BsmtFinSF1       0
BsmtFinSF2       0
BsmtUnfSF        0
TotalBsmtSF      0
1stFlrSF         0
2ndFlrSF         0
LowQualFinSF     0
GrLivArea        0
BsmtFullBath     0
BsmtHalfBath     0
FullBath         0
HalfBath         0
BedroomAbvGr     0
KitchenAbvGr     0
TotRmsAbvGrd     0
Fireplaces       0
GarageYrBlt      0
GarageCars       0
GarageArea       0
WoodDeckSF       0
OpenPorchSF      0
EnclosedPorch    0
3SsnPorch        0
ScreenPorch      0
PoolArea         0
MiscVal          0
MoSold           0
YrSold           0
SalePrice        0
dtype: int64

In [9]:
X = dropped_df.drop(columns="SalePrice")

In [10]:
# Create y
y = df[["SalePrice"]]

Look at the information of `X` again

In [11]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
LotFrontage      1460 non-null float64
LotArea          1460 non-null int64
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null int64
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
TotRmsAbvGrd     1460 non-null int64
F

## Let's use this data to perform a first naive linear regression model

Compute the R squared and the MSE for both train and test set.

In [12]:
from sklearn.metrics import mean_squared_error

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the model and print R2 and MSE for train and test
lr = LinearRegression()
naive_model = lr.fit(X_train, y_train)

In [13]:
print("Training Score: {}".format(naive_model.score(X_train, y_train)))
print("Training MSE: {}".format(mean_squared_error(y_train, naive_model.predict(X_train))))
print("Test Score: {}".format(naive_model.score(X_test, y_test)))
print("Test MSE: {}".format(mean_squared_error(y_test, naive_model.predict(X_test))))

Training Score: 0.8046400702344446
Training MSE: 1186117094.297817
Test Score: 0.8240865405545399
Test MSE: 1232328149.8816075
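
The same four metrics are printed for every model in this lab, so they could be wrapped in a small helper. The sketch below is one possible way to do that. `report_metrics` is a hypothetical name, not part of scikit-learn, and the synthetic data from `make_regression` is there only so the example runs on its own.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def report_metrics(model, X_tr, X_te, y_tr, y_te):
    """Return R squared and MSE on the train and test sets (hypothetical helper)."""
    return {
        "train_r2": model.score(X_tr, y_tr),
        "train_mse": mean_squared_error(y_tr, model.predict(X_tr)),
        "test_r2": model.score(X_te, y_te),
        "test_mse": mean_squared_error(y_te, model.predict(X_te)),
    }


# Synthetic data, used only to make the sketch runnable
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)
metrics = report_metrics(model, X_tr, X_te, y_tr, y_te)
for name, value in metrics.items():
    print("{}: {}".format(name, value))
```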


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to scale our predictors!

In [14]:
from sklearn import preprocessing

# Scale the data and perform train test split
X_scaled = preprocessing.scale(X)

X_scaled_train, X_scaled_test, y_train, y_test = train_test_split(X_scaled, y, random_state=42)
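
To see what `preprocessing.scale` does to each column, the following sketch compares it with a manual standardization on a tiny made-up array (`X_demo` and its values are assumptions for illustration). Each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn import preprocessing

# Two made-up features on very different scales
X_demo = np.array([[1.0, 1000.0],
                   [2.0, 3000.0],
                   [3.0, 5000.0]])

X_demo_scaled = preprocessing.scale(X_demo)

# The same result by hand: subtract the column mean, divide by the column
# standard deviation (population version, ddof=0, which is NumPy's default)
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(X_demo_scaled)
print(np.allclose(X_demo_scaled, manual))
```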

Perform the same linear regression on this data and print out R-squared and MSE.

In [15]:
lr2 = LinearRegression()
scaled_model = lr2.fit(X_scaled_train, y_train)

In [16]:
print("Training Score: {}".format(scaled_model.score(X_scaled_train, y_train)))
print("Training MSE: {}".format(mean_squared_error(y_train, scaled_model.predict(X_scaled_train))))
print("Test Score: {}".format(scaled_model.score(X_scaled_test, y_test)))
print("Test MSE: {}".format(mean_squared_error(y_test, scaled_model.predict(X_scaled_test))))

Training Score: 0.8047337713515315
Training MSE: 1185548193.1067445
Test Score: 0.8241699406436402
Test MSE: 1231743906.4824674


## Include dummy variables

Your model hasn't included dummy variables so far. Let's use the "object" variables again and create dummies.

In [17]:
# Create cat_df, which contains only the categorical ("object") variables
cat_df = df.copy()
for column in df.columns:
    # Drop every column that is not of object dtype
    if df[column].dtype != np.dtype("O"):
        cat_df = cat_df.drop(columns=[column])

In [18]:
cat_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 43 columns):
MSZoning         1460 non-null object
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422

In [19]:
# Make dummies
X_cat = pd.get_dummies(cat_df)

In [20]:
X_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Columns: 252 entries, MSZoning_C (all) to SaleCondition_Partial
dtypes: uint8(252)
memory usage: 359.4 KB


Merge `X_cat` with our scaled `X` so you have one big predictor dataframe.

In [21]:
X_merged = pd.concat([pd.DataFrame(X_scaled, columns=X.columns), X_cat], axis=1)

In [22]:
X_merged.shape

(1460, 289)
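
On a small scale, the dummy-and-merge step looks like the sketch below. The frames `num` and `cat` and their values are made up for illustration. Note that `pd.concat` with `axis=1` aligns rows on the index, so both frames need to share the same index.

```python
import pandas as pd

# Made-up scaled numeric feature and one categorical column
num = pd.DataFrame({"LotArea": [-1.0, 0.0, 1.0]})
cat = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})

# One indicator column per category level
dummies = pd.get_dummies(cat)

# Column-wise concatenation; rows are matched on the shared index 0..2
merged = pd.concat([num, dummies], axis=1)

print(list(merged.columns))
print(merged.shape)
```

One categorical column with two levels becomes two indicator columns, which is why the number of predictors grows so quickly in the full data set.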

In [46]:
X_merged_train, X_merged_test, y_train, y_test = train_test_split(X_merged, y, random_state=4000)

In [47]:
print(X_merged_train)

            Id  MSSubClass  LotFrontage   LotArea  OverallQual  OverallCond  \
773   0.103211   -0.872563     0.006190 -0.036764    -0.795151    -0.517200   
1282  1.310902   -0.872563    -0.402527 -0.172064    -0.795151     1.280685   
836   0.252690   -0.636078     0.914450 -0.242219    -0.795151     0.381743   
314  -0.985846    0.309859    -0.447940 -0.091886     0.651479     1.280685   
1329  1.422417    0.073375    -0.311701 -0.143601     0.651479    -0.517200   
937   0.492330    0.073375     0.233255 -0.084370     0.651479    -0.517200   
1114  0.912293   -0.872563     0.914450 -0.512819    -0.795151     1.280685   
631  -0.233708    1.492282    -1.628678 -0.593999     1.374795    -0.517200   
367  -0.860094    0.546344     1.413992 -0.136986    -0.071836    -0.517200   
345  -0.912293   -0.163109    -0.220875 -0.409089    -0.071836    -0.517200   
784   0.129311    0.428102    -1.583265 -0.422619    -0.071836     0.381743   
916   0.442503   -0.872563    -0.902070 -0.152020   

Perform the same linear regression on this data and print out R-squared and MSE.

In [48]:
lr3 = LinearRegression()
merged_model = lr3.fit(X_merged_train, y_train)
print("Training Score: {}".format(merged_model.score(X_merged_train, y_train)))
print("Training MSE: {}".format(mean_squared_error(y_train, merged_model.predict(X_merged_train))))
print("Test Score: {}".format(merged_model.score(X_merged_test, y_test)))
print("Test MSE: {}".format(mean_squared_error(y_test, merged_model.predict(X_merged_test))))

Training Score: 0.9446746517280878
Training MSE: 300306217.9976598
Test Score: -6.180992882385147e+18
Test MSE: 5.4635634687973615e+28


In [49]:
RSS = ((y_test - merged_model.predict(X_merged_test)) ** 2).sum() 
print(RSS)
TSS = ((y_test - y_test.mean()) ** 2).sum()
print(TSS)
score = 1 - (RSS/TSS)
print(score)

SalePrice    1.994201e+31
dtype: float64
SalePrice    3.226344e+12
dtype: float64
SalePrice   -6.180993e+18
dtype: float64


Notice the severe overfitting above: the training R squared is quite high, but the test R squared is hugely negative, so the predictions on the test set are far off. Likewise, the test MSE is many orders of magnitude larger than the training MSE.
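
A negative R squared follows directly from the definition `1 - RSS/TSS`: whenever the residual sum of squares exceeds the total sum of squares, the model does worse than simply predicting the mean. A tiny worked example with made-up numbers (`y_true` and `y_bad` are illustrative only):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])
y_bad = np.array([10.0, -5.0, 8.0])  # made-up, wildly wrong predictions

rss = ((y_true - y_bad) ** 2).sum()          # residual sum of squares: 81 + 49 + 25
tss = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares: 1 + 0 + 1
r2 = 1 - rss / tss

print(rss, tss, r2)             # 155.0 2.0 -76.5
print(r2_score(y_true, y_bad))  # same value via scikit-learn
```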

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) to perform both Lasso and Ridge regression. Each time, look at R-squared and MSE.

## Lasso

With default parameter (alpha = 1)

In [50]:
lasso1 = Lasso(alpha=1)
lasso1.fit(X_merged_train, y_train)

print("Test Score: {}".format(lasso1.score(X_merged_test, y_test)))
print("Test MSE: {}".format(mean_squared_error(y_test, lasso1.predict(X_merged_test))))

Test Score: 0.6719683435027672
Test MSE: 2899569388.204465


With a higher regularization parameter (alpha = 10)

In [51]:
lasso2 = Lasso(alpha=10)
lasso2.fit(X_merged_train, y_train)

print("Test Score: {}".format(lasso2.score(X_merged_test, y_test)))
print("Test MSE: {}".format(mean_squared_error(y_test, lasso2.predict(X_merged_test))))

Test Score: 0.6925929521989798
Test MSE: 2717262336.934437


## Ridge

With default parameter (alpha = 1)

In [60]:
ridge1 = Ridge()
ridge1.fit(X_merged_train, y_train)

print("Test Score: {}".format(ridge1.score(X_merged_test, y_test)))
print("Test MSE: {}".format(mean_squared_error(y_test, ridge1.predict(X_merged_test))))

Test Score: 0.7571493300796457
Test MSE: 2146629309.881702


With a higher regularization parameter (alpha = 10)

In [55]:
ridge2 = Ridge(alpha=10)
ridge2.fit(X_merged_train, y_train)

print("Test Score: {}".format(ridge2.score(X_merged_test, y_test)))
print("Test MSE: {}".format(mean_squared_error(y_test, ridge2.predict(X_merged_test))))

Test Score: 0.7902884598200559
Test MSE: 1853702684.5276637
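
The four fits above follow the same pattern, so they could also be run in a loop. The sketch below shows one way to do this on synthetic data from `make_regression`, used only so the example is self-contained. The variable names are assumptions, and the numbers it prints will not match the housing results.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data, used only to make the sketch runnable
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=5,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

results = []
for model_cls in (Lasso, Ridge):
    for alpha in (1, 10):
        # Fit on the training split, evaluate on the held-out split
        model = model_cls(alpha=alpha).fit(X_tr, y_tr)
        results.append((model_cls.__name__, alpha,
                        model.score(X_te, y_te),
                        mean_squared_error(y_te, model.predict(X_te))))

for name, alpha, r2, mse in results:
    print("{} (alpha={}): test R2 = {:.3f}, test MSE = {:.1f}".format(name, alpha, r2, mse))
```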


## Look at the metrics, what are your main conclusions?   

Conclusions here

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

Compare with the total length of the parameter space and draw conclusions!

In [68]:
# number of Ridge params almost zero
print('Number of Ridge parameter coefficients for Alpha=1:', len(ridge1.coef_[0]))
print('Number of Ridge parameter coefficients for Alpha=10:', len(ridge2.coef_[0]))

print('Ridge parameter coefficients that are almost zero for Alpha=1:', sum(abs(ridge1.coef_[0]) < 10**(-10)))
print('Ridge parameter coefficients that are almost zero for Alpha=10:', sum(abs(ridge2.coef_[0]) < 10**(-10)))


Number of Ridge parameter coefficients for Alpha=1: 289
Number of Ridge parameter coefficients for Alpha=10: 289
Ridge parameter coefficients that are almost zero for Alpha=1: 4
Ridge parameter coefficients that are almost zero for Alpha=10: 4


In [56]:
# number of Lasso params almost zero
print('Number of Lasso parameter coefficients for Alpha=1:', len(lasso1.coef_))
print('Number of Lasso parameter coefficients for Alpha=10:', len(lasso2.coef_))

print('Lasso parameter coefficients that are almost zero for Alpha=1', sum(abs(lasso1.coef_) < 10**(-10)))
print('Lasso parameter coefficients that are almost zero for Alpha=10', sum(abs(lasso2.coef_) < 10**(-10)))

Number of Lasso parameter coefficients for Alpha=1: 289
Number of Lasso parameter coefficients for Alpha=10: 289
Lasso parameter coefficients that are almost zero for Alpha=1 29
Lasso parameter coefficients that are almost zero for Alpha=10 66


Lasso effectively performed variable selection: with alpha = 10 it shrank 66 of the 289 coefficients (about 23%) to essentially zero, and 29 with alpha = 1, compared with only 4 for Ridge at either setting.
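
The same contrast can be reproduced on synthetic data where only a few features carry signal. This is a minimal sketch, assuming data from `make_regression` with 5 informative features out of 30. The threshold `tol` mirrors the `10**(-10)` cutoff used above, and the other names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only 5 of 30 features carry signal
X_demo, y_demo = make_regression(n_samples=200, n_features=30, n_informative=5,
                                 noise=5.0, random_state=0)

lasso = Lasso(alpha=10).fit(X_demo, y_demo)
ridge = Ridge(alpha=10).fit(X_demo, y_demo)

# Count coefficients whose absolute value falls below the cutoff
tol = 1e-10
n_zero_lasso = int(np.sum(np.abs(lasso.coef_) < tol))
n_zero_ridge = int(np.sum(np.abs(ridge.coef_) < tol))

print("Lasso near-zero coefficients:", n_zero_lasso, "of", lasso.coef_.size)
print("Ridge near-zero coefficients:", n_zero_ridge, "of", ridge.coef_.size)
```

The L1 penalty in Lasso can set coefficients exactly to zero, while the L2 penalty in Ridge only shrinks them toward zero, which is why the two counts tend to differ.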

## Summary

Great! You now know how to perform Lasso and Ridge regression.