# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice applying Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at `df.info()`.

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

First, select a subset of the data by removing the columns with `dtype = object`, so that our first model only contains **numeric features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [4]:
# Load necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

X = pd.DataFrame()

# Remove "object"-type features and SalePrice from `X`
for column_type, column in zip(df.dtypes, df.columns):
    if not column_type == object and not column == "SalePrice":
        X[column] = df[column]

# Impute null values
for column in X.columns:
    X[column] = X[column].fillna(X[column].median())

# Create y
y = df["SalePrice"]

Look at the information of `X` again

## Let's use this data to perform a first naive linear regression model

Compute the R-squared and the MSE for both the train and test sets.

In [6]:
from sklearn.metrics import mean_squared_error

# Split in train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


# Fit the model and print R2 and MSE for train and test
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train, y_train)

print(reg.score(X_train, y_train))
print(reg.score(X_test, y_test))

y_pred = reg.predict(X_train)
print(mean_squared_error(y_train, y_pred))

y_pred = reg.predict(X_test)
print(mean_squared_error(y_test, y_pred))

0.8118554543955105
0.784969688929444
1090644660.2119064
1578621667.95895


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to scale our predictors (and the target)!

In [8]:
from sklearn import preprocessing

# Scale the data and perform train test split
X_scaled = preprocessing.scale(X)
y_scaled = preprocessing.scale(y)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled, test_size=0.33, random_state=42)
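To make the scaling step concrete, here is a minimal sketch of what `preprocessing.scale` does to each column. The array `X_toy` and its values are made up for illustration and are not part of the housing data.

```python
import numpy as np
from sklearn import preprocessing

# Toy feature matrix (made-up values): two columns on very different scales
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 5000.0]])

# Standardize each column to zero mean and unit variance
X_toy_scaled = preprocessing.scale(X_toy)

# After scaling, every column has mean ~0 and standard deviation ~1
col_means = X_toy_scaled.mean(axis=0)
col_stds = X_toy_scaled.std(axis=0)
print(col_means)
print(col_stds)
```

Because the target is scaled as well, the MSE values of the scaled model are expressed in standardized units and are not directly comparable to the MSE of the unscaled model.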

Perform the same linear regression on this data and print out R-squared and MSE.

In [9]:
reg = LinearRegression()
reg.fit(X_train, y_train)

print(reg.score(X_train, y_train))
print(reg.score(X_test, y_test))

y_pred = reg.predict(X_train)
print(mean_squared_error(y_train, y_pred))

y_pred = reg.predict(X_test)
print(mean_squared_error(y_test, y_pred))

0.8118780879248153
0.7846447529160908
0.1729110532889931
0.2506833912385795


## Include dummy variables

Your model hasn't included dummy variables so far. Let's use the "object" variables again and create dummies.

In [41]:
df_cat = pd.DataFrame()
for column_type, column in zip(df.dtypes, df.columns):
    if column_type == object:
        df_cat[column] = df[column]

In [42]:
# Make dummies: X_cat contains only the one-hot encoded categorical variables
X_cat = pd.get_dummies(df_cat)

Merge `X_cat` together with our scaled `X` so you have one big predictor dataframe.

In [44]:
X_all = pd.concat([pd.DataFrame(X_scaled), X_cat], axis=1)
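The dummy-and-merge pattern can be sketched on a tiny example. The frame `toy` and its values are made up for illustration; only the column names are borrowed from the housing data.

```python
import pandas as pd

# Toy data (made-up values) with one numeric and one categorical column
toy = pd.DataFrame({"LotArea": [8450, 9600, 11250],
                    "Street": ["Pave", "Grvl", "Pave"]})

# One-hot encode the categorical column: one new column per category
toy_dummies = pd.get_dummies(toy[["Street"]])

# Concatenate column-wise; rows are matched on the index
toy_all = pd.concat([toy[["LotArea"]], toy_dummies], axis=1)
print(toy_all.columns.tolist())
```

Note that `pd.concat` with `axis=1` aligns rows on the index, so both pieces need the same index. That holds in the lab because `pd.DataFrame(X_scaled)` gets a fresh default index that matches the one `df` received from `read_csv`.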

Perform the same linear regression on this data and print out R-squared and MSE.

In [45]:
X_cat.shape

(1460, 252)

In [46]:
X_train, X_test, y_train, y_test = train_test_split(X_all, y, test_size=0.33, random_state=42)
reg = LinearRegression()
reg.fit(X_train, y_train)

print(reg.score(X_train, y_train))
print(reg.score(X_test, y_test))

y_pred = reg.predict(X_train)
print(mean_squared_error(y_train, y_pred))

y_pred = reg.predict(X_test)
print(mean_squared_error(y_test, y_pred))

0.9383764315887964
-3.485321417601015e+17
357222238.97028375
2.558710854406494e+27


Notice the severe overfitting above: our training R-squared is quite high, but the testing R-squared is hugely negative, so our test predictions are far off. Similarly, the testing MSE is many orders of magnitude higher than the training MSE.

## Perform Ridge and Lasso regression

Use all the data (normalized features and dummy categorical variables) and perform both Lasso and Ridge regression. Each time, look at R-squared and MSE.
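Since the same four metrics are printed for every model, a small helper can cut down on repetition. This is only a sketch: `report_metrics` is a hypothetical name, not part of scikit-learn, and the synthetic data from `make_regression` stands in for the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def report_metrics(model, X_train, X_test, y_train, y_test):
    """Return train/test R-squared and MSE for an already fitted model."""
    return {
        "train_r2": model.score(X_train, y_train),
        "test_r2": model.score(X_test, y_test),
        "train_mse": mean_squared_error(y_train, model.predict(X_train)),
        "test_mse": mean_squared_error(y_test, model.predict(X_test)),
    }


# Synthetic stand-in data, used only to show the helper in action
X_demo, y_demo = make_regression(n_samples=100, n_features=10,
                                 noise=5.0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

lasso_demo = Lasso(alpha=1).fit(Xd_train, yd_train)
metrics = report_metrics(lasso_demo, Xd_train, Xd_test, yd_train, yd_test)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The same call works for `LinearRegression` and `Ridge`, because all three models expose `score` and `predict`.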

## Lasso

With default parameter (alpha = 1)

In [47]:
from sklearn.linear_model import Lasso, Ridge

In [54]:
X_train, X_test, y_train, y_test = train_test_split(X_all, y, test_size=0.33, random_state=42)
lasso = Lasso(alpha=1)
lasso.fit(X_train, y_train)
print(lasso.score(X_train, y_train))
print(lasso.score(X_test, y_test))

y_pred = lasso.predict(X_train)
print(mean_squared_error(y_train, y_pred))

y_pred = lasso.predict(X_test)
print(mean_squared_error(y_test, y_pred))

0.9383429913678842
0.8799341162729882
357416086.7284369
881450641.4954165


With a higher regularization parameter (alpha = 10)

In [55]:
lasso = Lasso(alpha=10)
lasso.fit(X_train, y_train)
print(lasso.score(X_train, y_train))
print(lasso.score(X_test, y_test))

y_pred = lasso.predict(X_train)
print(mean_squared_error(y_train, y_pred))

y_pred = lasso.predict(X_test)
print(mean_squared_error(y_test, y_pred))

0.9363299390224195
0.8906232536247098
369085437.99373275
802977500.8892994


## Ridge

With default parameter (alpha = 1)

In [56]:
X_train, X_test, y_train, y_test = train_test_split(X_all, y, test_size=0.33, random_state=42)
ridge = Ridge(alpha=1)
ridge.fit(X_train, y_train)
print(ridge.score(X_train, y_train))
print(ridge.score(X_test, y_test))

y_pred = ridge.predict(X_train)
print(mean_squared_error(y_train, y_pred))

y_pred = ridge.predict(X_test)
print(mean_squared_error(y_test, y_pred))

0.9241379330026552
0.8643679188115277
439760600.114456
995728189.0562645


With a higher regularization parameter (alpha = 10)

In [57]:
X_train, X_test, y_train, y_test = train_test_split(X_all, y, test_size=0.33, random_state=42)
ridge = Ridge(alpha=10)
ridge.fit(X_train, y_train)
print(ridge.score(X_train, y_train))
print(ridge.score(X_test, y_test))

y_pred = ridge.predict(X_train)
print(mean_squared_error(y_train, y_pred))

y_pred = ridge.predict(X_test)
print(mean_squared_error(y_test, y_pred))

0.9037681391933872
0.8481665001951825
557841125.7885724
1114669143.7156472


## Look at the metrics: what are your main conclusions?

Conclusions here

## Compare number of parameter estimates that are (very close to) 0 for Ridge and Lasso

Compare with the total length of the parameter space and draw conclusions!

In [58]:
ridge.coef_

array([-7.70726244e+02, -5.18993725e+03, -3.55568450e+03,  4.40821458e+03,
        1.38533736e+04,  4.21651577e+03,  3.70389864e+03,  1.97774941e+03,
        3.50391651e+03, -2.80494406e+02,  1.00286655e+03,  1.51078599e+02,
        2.29323363e+02,  3.30325038e+03,  8.71953582e+03,  1.47325962e+03,
        9.80998723e+03,  4.62778339e+03, -3.11669495e+01,  4.42888861e+03,
        2.35982194e+03, -1.85607515e+03, -2.90198758e+03,  8.77187606e+03,
        3.31230956e+03, -1.30405535e+03,  9.22386780e+03,  1.50605294e+03,
        2.65702013e+03, -8.44713754e+02,  1.39652114e+03,  1.71057865e+03,
        3.74162621e+03, -3.62008740e+03, -1.42393922e+02, -1.15149470e+02,
       -3.45321484e+01, -4.37461584e+03,  5.93922687e+03, -9.00120552e+02,
        2.83464790e+03, -3.49913838e+03, -5.60064822e+03,  5.60064822e+03,
       -1.18811458e+03,  6.23200131e+03,  4.17784814e+02,  5.10293538e+03,
       -6.65940657e+03,  1.13868638e+03, -1.44332308e+04,  1.34940474e+04,
       -1.67877707e+03,  

In [None]:
# number of Lasso params almost zero

Lasso was very effective at performing variable selection: it removed about 25% of the variables from your model!
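One way to count the (near-)zero coefficients is to compare their absolute values against a small tolerance. The sketch below uses synthetic data from `make_regression` rather than the housing data, and the tolerance `tol` is an arbitrary choice, so the counts it prints will differ from the lab's.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in data: 50 features, of which only 5 are informative
X_demo, y_demo = make_regression(n_samples=200, n_features=50,
                                 n_informative=5, noise=10.0,
                                 random_state=42)

lasso_demo = Lasso(alpha=1).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1).fit(X_demo, y_demo)

tol = 1e-10  # assumed threshold for "very close to 0"
n_zero_lasso = int(np.sum(np.abs(lasso_demo.coef_) < tol))
n_zero_ridge = int(np.sum(np.abs(ridge_demo.coef_) < tol))

print(f"Lasso: {n_zero_lasso} of {len(lasso_demo.coef_)} coefficients near zero")
print(f"Ridge: {n_zero_ridge} of {len(ridge_demo.coef_)} coefficients near zero")
```

In the lab itself, the same comparison can be applied to `lasso.coef_` and `ridge.coef_`, with `len(lasso.coef_)` giving the total number of parameters. Ridge shrinks coefficients toward zero but rarely sets them exactly to zero, whereas the L1 penalty in Lasso can zero them out entirely.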

In [None]:
# your code here

## Summary

Great! You now know how to perform Lasso and Ridge regression.