# Ridge and Lasso Regression - Lab

## Introduction

In this lab, you'll practice your knowledge of Ridge and Lasso regression!

## Objectives

You will be able to:

- Use Lasso and Ridge regression in Python
- Compare Lasso and Ridge with standard regression

## Housing Prices Data

Let's look at yet another house pricing data set.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Housing_Prices/train.csv')

Look at the output of `df.info()`.

In [2]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

We'll make a first selection of the data by removing all columns with `dtype = object`, so that our first model only contains **numeric features**.

Make sure to remove the `SalePrice` column from the predictors (which you store in `X`), then replace missing values with the median of each feature.

Store the target in `y`.

In [3]:
# Load necessary packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
%matplotlib inline

# Remove "object"-type features from the data
df_con = df.select_dtypes(exclude=[object]).copy()

# Create y (the target) and X (the predictors, without SalePrice)
y = df_con['SalePrice']
X = df_con.drop(columns=['SalePrice'])

# Impute missing values with the median of each feature
X = X.fillna(X.median())


Look at the information of `X` again
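As a quick check of the selection and imputation steps, here is a minimal sketch on a tiny made-up frame. The name `toy` and its values are assumptions for illustration only, not rows from the lab data, and `include='number'` is used here as an equivalent way to keep only numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the housing frame (not the lab data)
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550],
    'LotFrontage': [65.0, np.nan, 68.0, 60.0],
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave'],
    'SalePrice': [208500, 181500, 223500, 140000],
})

# Keep numeric columns only and drop the target from the predictors
toy_X = toy.select_dtypes(include='number').drop(columns=['SalePrice'])

# Replace missing values with the median of each column
toy_X = toy_X.fillna(toy_X.median())

# Inspect the result: every column should now be fully non-null
toy_X.info()
n_missing = int(toy_X.isna().sum().sum())
print('missing values left:', n_missing)
```

On the real data, calling `X.info()` after imputation should likewise show no missing entries in any column.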

## Let's use this data to fit a first naive linear regression model

Compute the R-squared and the MSE for both the train and test sets.

In [4]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit the model and print R2 and MSE for train and test
lm = LinearRegression()
lm.fit(X_train, y_train)

print('r2_train', '-->', lm.score(X_train, y_train))
print('r2_test', '-->', lm.score(X_test, y_test))
print('MSE_train', '-->', mean_squared_error(y_train, lm.predict(X_train)))
print('MSE_test', '-->', mean_squared_error(y_test, lm.predict(X_test)))

r2_train --> 0.7936364910689578
r2_test --> 0.8533351298363874
MSE_train --> 1262377162.113852
MSE_test --> 1006435290.8101844


## Normalize your data

We haven't normalized our data yet. Let's create a new model that uses `preprocessing.scale` to scale our predictors!

In [5]:
from sklearn import preprocessing

# Scale the data and perform train test split
X_scaled = preprocessing.scale(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled,y)


Perform the same linear regression on this data and print out R-squared and MSE.

In [6]:
# Your code here
lm = LinearRegression()
lm.fit(X_train, y_train)

print('r2_train', '-->', lm.score(X_train, y_train))
print('r2_test', '-->', lm.score(X_test, y_test))
print('MSE_train', '-->', mean_squared_error(y_train, lm.predict(X_train)))
print('MSE_test', '-->', mean_squared_error(y_test, lm.predict(X_test)))

r2_train --> 0.8075579228773865
r2_test --> 0.8023034330432126
MSE_train --> 1143374419.6279728
MSE_test --> 1458100102.938741


## Include dummy variables

We haven't included dummy variables so far. Let's use our "object" variables again and create dummies.

In [7]:
# Create x_cat, which contains only the categorical variables
x_cat = df.select_dtypes(include=[object]).copy()

In [8]:
x_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 43 columns):
MSZoning         1460 non-null object
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422

In [9]:
# Make dummies
x_cat = pd.get_dummies(x_cat)
x_cat.shape

(1460, 252)

Merge `x_cat` together with our scaled `X` so you have one big predictor dataframe.

In [10]:
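# Your code
# Wrap the scaled array in a DataFrame so it can be merged with the dummies
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)
X = pd.concat([X_scaled_df, x_cat], axis=1)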

Perform the same linear regression on this data and print out R-squared and MSE.

In [11]:
# Your code here
X_train, X_test, y_train, y_test = train_test_split(X, y)

lm = LinearRegression()
lm.fit(X_train, y_train)

print('r2_train', '-->', lm.score(X_train, y_train))
print('r2_test', '-->', lm.score(X_test, y_test))
print('MSE_train', '-->', mean_squared_error(y_train, lm.predict(X_train)))
print('MSE_test', '-->', mean_squared_error(y_test, lm.predict(X_test)))

r2_train --> 0.9489219325732245
r2_test --> 0.555495975964357
MSE_train --> 319450101.3384243
MSE_test --> 2869007199.7836714


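## Perform Ridge and Lasso regression

Fit Ridge and Lasso models with the default regularization parameter (alpha = 1), and print out R-squared and MSE each time.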

In [None]:
# Your code here
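One possible way to approach this cell is sketched below. To keep it runnable on its own, it uses synthetic data from `make_regression` as a stand-in for the housing predictors. The names `X_demo`, `y_demo`, and `fit_and_score` are assumptions for illustration; in the lab you would use your own `X_train`, `X_test`, `y_train`, and `y_test`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data (an assumption, not the housing set)
X_demo, y_demo = make_regression(n_samples=300, n_features=50,
                                 n_informative=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)


def fit_and_score(model):
    """Fit `model` on the training split; return R-squared and MSE for both splits."""
    model.fit(X_tr, y_tr)
    return {
        'r2_train': model.score(X_tr, y_tr),
        'r2_test': model.score(X_te, y_te),
        'MSE_train': mean_squared_error(y_tr, model.predict(X_tr)),
        'MSE_test': mean_squared_error(y_te, model.predict(X_te)),
    }


# alpha=1 is the scikit-learn default for both estimators
ridge_scores = fit_and_score(Ridge(alpha=1))
lasso_scores = fit_and_score(Lasso(alpha=1))

for name, scores in [('Ridge', ridge_scores), ('Lasso', lasso_scores)]:
    for metric, value in scores.items():
        print(name, metric, '-->', value)
```

Wrapping the fit and the four metrics in one helper keeps the Ridge and Lasso runs directly comparable with the plain linear regression cells earlier in the lab.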

With a stronger penalty (alpha = 10)

In [None]:
# Your code here
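A minimal sketch of the stronger-penalty run, again on synthetic stand-in data rather than the housing set. The names `X_demo`, `y_demo`, and `results` are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data (an assumption, not the housing set)
X_demo, y_demo = make_regression(n_samples=300, n_features=50,
                                 n_informative=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Fit both penalized models with alpha=10 and collect the same four metrics
results = {}
for name, model in [('Ridge', Ridge(alpha=10)), ('Lasso', Lasso(alpha=10))]:
    model.fit(X_tr, y_tr)
    results[name] = {
        'r2_train': model.score(X_tr, y_tr),
        'r2_test': model.score(X_te, y_te),
        'MSE_train': mean_squared_error(y_tr, model.predict(X_tr)),
        'MSE_test': mean_squared_error(y_te, model.predict(X_te)),
    }
    print(name, results[name])
```

A larger `alpha` shrinks the coefficients more strongly, so the training fit can only get worse or stay the same; whether the test metrics improve depends on how much the unpenalized model was overfitting.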

## Look at the metrics. What are your main conclusions?

Conclusions here

## Compare the number of parameter estimates that are (very close to) 0 for Ridge and Lasso

In [None]:
# number of Ridge params almost zero

In [None]:
# number of Lasso params almost zero

Compare these counts with the total number of parameters and draw conclusions!

In [None]:
# your code here
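The counting itself can be sketched as follows, using synthetic stand-in data so the block runs on its own. The threshold `tol` and the names `X_demo`, `y_demo`, `n_ridge_zero`, `n_lasso_zero`, and `n_params` are assumptions for illustration; pick a threshold that suits your own definition of "very close to 0".

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in for the lab data (an assumption, not the housing set)
X_demo, y_demo = make_regression(n_samples=300, n_features=50,
                                 n_informative=10, noise=10.0, random_state=0)

ridge = Ridge(alpha=10).fit(X_demo, y_demo)
lasso = Lasso(alpha=10).fit(X_demo, y_demo)

# Hypothetical cutoff below which a coefficient counts as "almost zero"
tol = 1e-10

n_ridge_zero = int(np.sum(np.abs(ridge.coef_) < tol))
n_lasso_zero = int(np.sum(np.abs(lasso.coef_) < tol))
n_params = len(lasso.coef_)  # one coefficient per feature

print('Ridge params almost zero:', n_ridge_zero, 'of', n_params)
print('Lasso params almost zero:', n_lasso_zero, 'of', n_params)
```

Lasso's L1 penalty can set coefficients exactly to zero, which acts as a form of feature selection, while Ridge's L2 penalty shrinks coefficients toward zero without typically reaching it.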

## Summary

Great! You now know how to perform Lasso and Ridge regression.