# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice applying Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

First, select a subset of the data by removing the columns with `dtype = object`. This way, our first model only contains **continuous features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing inputs with the median of each feature.

Store the target in `y`.

In [9]:
# Load necessary packages
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Keep only the numeric features and drop the target, SalePrice, from `X`
columns = [col for col in df.columns if df[col].dtype in [np.float64, np.int64]]
X = df[columns].drop('SalePrice', axis=1)

# Impute null values with the median of each feature
# (assigning the result avoids chained in-place assignment on a slice)
X = X.fillna(X.median())

# Create y
y = df.SalePrice
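
To see what the median imputation does, here is a minimal sketch on a tiny made-up DataFrame. The column names are borrowed from the housing data for readability, but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing entry per column
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 100.0],
    "GarageYrBlt": [1990.0, 2000.0, np.nan, 2010.0],
})

# DataFrame.median() skips NaN by default; fillna aligns on column names,
# so each column is filled with its own median
toy_imputed = toy.fillna(toy.median())

print(toy_imputed)
```

Each missing value is replaced by the median of the observed values in its own column, so no rows are lost.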

Look at the information of `X` again

In [None]:
X.info()

## Let's use this data to fit a first naive linear regression model

Compute the R squared and the MSE for both train and test set.

In [27]:
from sklearn.metrics import mean_squared_error

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X,y)


# Fit the model and print R2 and MSE for train and test
linreg = LinearRegression()


linreg.fit(X_train, y_train)

print(linreg.score(X_train, y_train)) #R2 for training set
print(linreg.score(X_test, y_test)) #R2 for testing set

print(mean_squared_error(y_train, linreg.predict(X_train))) #MSE for training set
print(mean_squared_error(y_test, linreg.predict(X_test))) #MSE for testing set

0.8150263031716833
0.7969891977997344
1107301266.1736867
1474366811.104531
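
As a reminder of what the two metrics measure, the sketch below computes MSE and R squared by hand and compares them with scikit-learn's `mean_squared_error` and `r2_score`. The target and prediction values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions, for illustration only
y_true = np.array([200000.0, 150000.0, 320000.0, 180000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 190000.0])

# MSE is the mean of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Because the residual sum of squares can exceed the total sum of squares on unseen data, R squared can be negative on a test set.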


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to scale our predictors!

In [28]:
from sklearn import preprocessing

# Scale the data and perform train test split
X_scaled = preprocessing.scale(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled,y)

Perform the same linear regression on this data and print out R-squared and MSE.

In [29]:
linreg = LinearRegression()


linreg.fit(X_train, y_train)

print(linreg.score(X_train, y_train)) #R2 for training set
print(linreg.score(X_test, y_test)) #R2 for testing set

print(mean_squared_error(y_train, linreg.predict(X_train))) #MSE for training set
print(mean_squared_error(y_test, linreg.predict(X_test))) #MSE for testing set

0.8017017507690365
0.8410975703781115
1278054811.650263
936130871.1556543
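
As a side note on the scaling step: `preprocessing.scale` standardizes each column to zero mean and unit variance using all rows. The sketch below checks this on synthetic data and then shows one possible alternative, which is not what this lab does: split first and learn the scaling from the training rows only with `StandardScaler`. The `X_demo` data and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales (illustrative only)
X_demo = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 250.0], size=(100, 2))

# preprocessing.scale standardizes each column: (x - mean) / std
X_demo_scaled = preprocessing.scale(X_demo)
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
print(np.allclose(X_demo_scaled, manual))

# Alternative: split first, then fit the scaler on the training rows only
X_tr, X_te = train_test_split(X_demo, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
```

Fitting the scaler on the training rows only keeps information about the test rows out of the preprocessing step.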


## Include dummy variables

Your model hasn't included dummy variables so far: let's use the "object" variables again and create dummies

In [30]:
# Create X_cat which contains only the categorical variables

# `np.object` was removed from recent NumPy releases; the builtin `object` works
features_cat = [col for col in df.columns if df[col].dtype == object]
X_cat = df[features_cat]

np.shape(X_cat)

(1460, 43)

In [31]:
# Make dummies
X_cat = pd.get_dummies(X_cat, drop_first = True)

np.shape(X_cat)

(1460, 209)
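
The effect of `drop_first = True` is easiest to see on a small example. The sketch below uses a made-up DataFrame with two categorical columns; the values are hypothetical.

```python
import pandas as pd

# Toy categorical data (hypothetical values)
toy_cat = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotShape": ["Reg", "IR1", "IR2"],
})

# drop_first=True drops the first (alphabetically sorted) level of each column,
# so a column with k levels yields k - 1 dummy columns
dummies = pd.get_dummies(toy_cat, drop_first=True)

print(dummies.columns.tolist())
print(dummies.shape)
```

Dropping one level per variable avoids perfectly redundant columns: the dropped level is implied when all of that variable's dummies are 0.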

Merge `X_cat` together with our scaled `X` so you have one big predictor dataframe.

In [32]:
# Keep the original column names so that all feature names are strings
X_all = pd.concat([pd.DataFrame(X_scaled, columns=X.columns), X_cat], axis=1)

Perform the same linear regression on this data and print out R-squared and MSE.

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X_all, y)

linreg = LinearRegression()
linreg.fit(X_train, y_train)


print(linreg.score(X_train, y_train)) #R2 for training set
print(linreg.score(X_test, y_test)) #R2 for testing set

print(mean_squared_error(y_train, linreg.predict(X_train))) #MSE for training set
print(mean_squared_error(y_test, linreg.predict(X_test))) #MSE for testing set

linreg.coef_

#OVERFITTED!!

0.9409586655434856
-1.008602859083722e+17
376229283.6022283
6.162807010933384e+26


array([ 1.04281490e+03, -2.08845380e+03,  9.58654353e+02,  7.03963443e+03,
        8.79122205e+03,  7.17475121e+03,  9.61498812e+03,  2.09103406e+03,
        3.92917012e+03,  5.95460170e+16,  2.10610839e+16,  5.76880671e+16,
       -5.72752994e+16,  1.69194291e+16,  1.91051375e+16,  2.12804157e+15,
       -2.29982154e+16,  1.51963401e+03,  6.02680993e+02,  1.93940491e+03,
       -4.52243190e+02, -1.60801559e+03, -2.66448356e+03, -2.64043616e+02,
        3.54344002e+03, -1.84379039e+03,  3.72711550e+03,  3.22733185e+03,
        3.09354357e+03,  2.80335357e+02,  1.01092422e+03,  9.56529929e+02,
        1.58979060e+03,  1.64489840e+04,  7.19106050e+02, -1.11631977e+03,
       -1.26968558e+03,  3.43105668e+04,  1.29869188e+04,  1.86813994e+04,
        1.85888229e+04,  3.72960026e+04, -2.58384283e+03,  1.38568297e+03,
        1.62140814e+03,  2.39674516e+03,  1.05082626e+04,  1.06692335e+03,
        1.10895591e+04, -4.92505059e+04,  1.09617640e+04, -9.40560152e+03,
       -4.32387099e+04, -

Notice the severe overfitting above: the training R squared is quite high, but the testing R squared is hugely negative, so the predictions on the test set are far off. Similarly, the testing MSE is many orders of magnitude higher than the training MSE. Several coefficients are on the order of 10^16, a sign that the fit is numerically unstable.

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and perform both Lasso and Ridge regression! Each time, look at R-squared and MSE.
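
For reference, these are the objectives that scikit-learn's `Lasso` and `Ridge` minimize, as given in its documentation. The symbols are chosen here for illustration: $w$ is the coefficient vector, $n$ the number of samples, and $\alpha$ the regularization parameter; the intercept is left out for brevity.

$$
\text{Lasso:}\quad \min_{w}\ \frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_1
$$

$$
\text{Ridge:}\quad \min_{w}\ \lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_2^2
$$

Because the Lasso loss is divided by $2n$ and the Ridge loss is not, the same value of $\alpha$ does not represent the same amount of regularization in the two models.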

## Lasso

With default parameter (alpha = 1)

In [42]:
# Lasso regression is very similar to Ridge regression,
# except that the penalty term uses the absolute values of the coefficients rather than their squares


from sklearn.linear_model import Lasso, Ridge

lasso = Lasso()

lasso.fit(X_train, y_train)


print(lasso.score(X_train, y_train)) #R2 for training set
print(lasso.score(X_test, y_test)) #R2 for testing set

print(mean_squared_error(y_train, lasso.predict(X_train))) #MSE for training set
print(mean_squared_error(y_test, lasso.predict(X_test))) #MSE for testing set

lasso.coef_

0.9408989961876238
0.5467639838199467
376609514.82863265
2769381499.3341107


array([ 1.00546925e+03, -2.34644012e+03,  1.02454547e+03,  6.81229123e+03,
        8.75526387e+03,  7.09439785e+03,  9.73430999e+03,  2.08925376e+03,
        3.80222806e+03,  9.46547069e+03,  3.81167014e+03,  3.03150502e+03,
        3.58369930e+03,  1.45386217e+04,  2.55484544e+04,  1.18860570e+03,
        5.06890586e+03,  1.56619278e+03,  6.29775181e+02,  1.97128111e+03,
       -4.34949135e+02, -1.75628166e+03, -2.57571049e+03, -4.72610370e+01,
        3.56486149e+03, -1.82622971e+03,  3.90599861e+03,  3.17929896e+03,
        3.08417454e+03,  3.36514914e+02,  9.88522755e+02,  9.91749494e+02,
        1.64475215e+03,  1.63054029e+04,  1.03252770e+03, -1.20091921e+03,
       -1.25226318e+03,  3.25602377e+04,  1.04761572e+04,  1.73678267e+04,
        1.70150367e+04,  3.69493251e+04, -2.69806456e+03,  1.86535527e+03,
        1.61666321e+03,  2.36387708e+03,  1.04013430e+04,  1.16079590e+03,
        1.09802987e+04, -4.73314588e+04,  1.11645166e+04, -8.94092611e+03,
       -4.20190255e+04, -

With a higher regularization parameter (alpha = 10)

In [37]:
lasso = Lasso(alpha=10)


lasso.fit(X_train, y_train)


print(lasso.score(X_train, y_train)) #R2 for training set
print(lasso.score(X_test, y_test)) #R2 for testing set

print(mean_squared_error(y_train, lasso.predict(X_train))) #MSE for training set
print(mean_squared_error(y_test, lasso.predict(X_test))) #MSE for testing set

0.9295600372615882
0.8013427483021793
457164716.08595544
1130514318.143185


## Ridge

With default parameter (alpha = 1)

In [40]:
# Ridge regression penalizes the sum of the squared coefficients,
# so large coefficients increase the value of the objective function being minimized



ridge = Ridge()

ridge.fit(X_train, y_train)


print(ridge.score(X_train, y_train)) #R2 for training set
print(ridge.score(X_test, y_test)) #R2 for testing set

print(mean_squared_error(y_train, ridge.predict(X_train))) #MSE for training set
print(mean_squared_error(y_test, ridge.predict(X_test))) #MSE for testing set

0.9121095201942101
0.883725452494536
570420890.1456664
661692637.2790899


With a higher regularization parameter (alpha = 10)

In [39]:
ridge = Ridge(alpha=10)

ridge.fit(X_train, y_train)


print(ridge.score(X_train, y_train)) #R2 for training set
print(ridge.score(X_test, y_test)) #R2 for testing set

print(mean_squared_error(y_train, ridge.predict(X_train))) #MSE for training set
print(mean_squared_error(y_test, ridge.predict(X_test))) #MSE for testing set

0.8908897245386075
0.8942918588976493
708140182.9897192
601561564.1471125
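
To build intuition for the role of `alpha`, the sketch below fits Ridge models over a range of `alpha` values on synthetic data and records the size of the coefficient vector. The data and the `alpha` grid are assumptions made for illustration, not part of the lab.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
# Synthetic regression problem (illustrative only)
X_demo = rng.normal(size=(80, 10))
true_coef = rng.normal(scale=5.0, size=10)
y_demo = X_demo @ true_coef + rng.normal(scale=2.0, size=80)

alphas = [0.1, 1, 10, 100, 1000]
norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # The L2 norm of the coefficient vector shrinks as alpha grows
    norms.append(np.linalg.norm(model.coef_))
    print(alpha, round(float(norms[-1]), 3))
```

A larger `alpha` pulls the coefficients closer to zero, trading some training fit for lower variance.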


## Look at the metrics. What are your main conclusions?

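- Penalizing the coefficients reduces overfitting: in these runs, the test R2 is far better than for the unregularized model with dummy variables, and Ridge gives the best test scores.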

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

Compare with the total length of the parameter space and draw conclusions!

In [43]:
# number of Ridge params almost zero
print(sum(abs(ridge.coef_) < 10**(-10)))

3


In [44]:
# number of Lasso params almost zero
print(sum(abs(lasso.coef_) < 10**(-10)))

9


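Lasso sets more coefficients to (essentially) zero than Ridge does (9 versus 3 here), which amounts to built-in variable selection. With `alpha = 1`, that is still only a small fraction of the 246 parameters; a larger `alpha` removes more variables from your model.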
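
The difference in sparsity is easier to see on a problem where most features are known to be irrelevant. The sketch below uses synthetic data in which only 5 of 20 features affect the target; the data, the `alpha` value, and the `_demo` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
# Synthetic data: only the first 5 of 20 features influence the target
X_demo = rng.normal(size=(200, 20))
true_coef = np.zeros(20)
true_coef[:5] = [50.0, -40.0, 30.0, 20.0, 10.0]
y_demo = X_demo @ true_coef + rng.normal(scale=5.0, size=200)

lasso_demo = Lasso(alpha=2.0).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=2.0).fit(X_demo, y_demo)

# Count coefficients that are numerically zero
n_zero_lasso = int(np.sum(np.isclose(lasso_demo.coef_, 0.0)))
n_zero_ridge = int(np.sum(np.isclose(ridge_demo.coef_, 0.0)))
print(n_zero_lasso, n_zero_ridge)
```

The L1 penalty can set coefficients exactly to zero, while the L2 penalty only shrinks them, so Ridge rarely produces exact zeros.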

## Summary

Great! You now know how to perform Lasso and Ridge regression.