## Day 28 Lecture 2 Assignment

In this assignment, we will learn about overfitting and regularization. We will use the King County housing dataset loaded below and analyze regression models fit to it.

In [8]:
%reload_ext nb_black
import warnings

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [4]:
def print_vif(x):
    """Utility for checking multicollinearity assumption
    
    :param x: input features to check using VIF. This is assumed to be a pandas.DataFrame
    :return: nothing is returned; the VIFs are printed as a pandas Series
    """
    # Silence numpy FutureWarning about .ptp
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        x = sm.add_constant(x)

    vifs = []
    for i in range(x.shape[1]):
        vif = variance_inflation_factor(x.values, i)
        vifs.append(vif)

    print("VIF results\n-------------------------------")
    print(pd.Series(vifs, index=x.columns))
    print("-------------------------------\n")

In [2]:
king_county = pd.read_csv(
    "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/kc_house_data.csv"
)

In [3]:
king_county.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


Perform the same transformations as in the previous assignment to meet the model assumptions:
1. Remove all columns except: price, bedrooms, bathrooms, sqft_living, floors, waterfront
1. Remove outliers
1. Split the data into train and test subsets, with 20% of the data in the test subset

In [5]:
# answer below:
king_county = king_county[
    ["price", "bedrooms", "bathrooms", "sqft_living", "floors", "waterfront"]
]
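# Remove outliers: keep only listings with fewer than 15 bedrooms,
# then drop the row with index label 12777 (treated as an outlier in this answer)
king_county = king_county[king_county["bedrooms"] < 15]
king_county = king_county.drop(12777)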


In [6]:
# Separate the features from the target
X = king_county.drop(columns=["price"])
y = king_county["price"]

# Hold out 20% of the rows as the test subset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


Apply a ridge regression model with lambda = 50 (the `alpha` parameter in scikit-learn) to the data, and evaluate it by comparing R^2 on the train and test subsets.
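For reference, the objective that scikit-learn's `Ridge` minimizes can be written as below. The symbols $X$ (feature matrix), $y$ (target vector), and $w$ (coefficient vector) are notation introduced here rather than taken from the assignment; $\alpha$ plays the role of lambda, and the intercept is not penalized.

```latex
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \, \lVert w \rVert_2^2
```

With $\alpha = 0$ this reduces to ordinary least squares; larger values shrink the coefficients toward zero.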

In [9]:
# answer below:
ridge = Ridge(alpha=50)
ridge.fit(X_train, y_train)
# score() returns R^2; this is the value on the training subset
ridge.score(X_train, y_train)

0.5375108729226652

In [10]:
ridge.score(X_test, y_test)

0.5470540168189424
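The ridge penalty depends on the scale of each feature, and `StandardScaler` is imported at the top of this notebook without being used. As a rough sketch of how scaling could be combined with ridge, the example below chains the two in a pipeline. It runs on synthetic data, not the King County data; the names `X_demo`, `y_demo`, and `scaled_ridge` are hypothetical, and the numbers it prints say nothing about the housing model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration only):
# one large-scale column and one 0/1 column, loosely mimicking
# sqft_living and waterfront.
rng = np.random.default_rng(0)
n = 500
X_demo = np.column_stack([rng.normal(2000, 800, n), rng.integers(0, 2, n)])
y_demo = 300 * X_demo[:, 0] + 500_000 * X_demo[:, 1] + rng.normal(0, 50_000, n)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fit on the training data only, then ridge sees
# features with zero mean and unit variance.
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=50))
scaled_ridge.fit(Xd_train, yd_train)

train_r2 = scaled_ridge.score(Xd_train, yd_train)
test_r2 = scaled_ridge.score(Xd_test, yd_test)
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

Because the scaler sits inside the pipeline, the same object could be passed to `GridSearchCV` so that scaling is refit within each fold.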

Perform a grid search over the following values of alpha: 0.001, 0.01, 0.1, 1, 10, 100, 1000 to find the best ridge regression model. Experiment with different scoring metrics in the grid search (R^2 is the default, but you can use root mean squared error or many others). The available metrics are listed in the scikit-learn model evaluation guide:
https://scikit-learn.org/stable/modules/model_evaluation.html

In [11]:
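# answer below:
# Note: these candidates differ from the alpha values listed in the prompt.
# The selected alpha reported below (0.1) is the largest candidate,
# so a wider grid may be worth trying.
grid = {"alpha": [0.0001, 0.001, 0.01, 0.1]}

# 5-fold cross-validated search; the default scoring for Ridge is R^2
ridge_cv = GridSearchCV(Ridge(), grid, verbose=1, cv=5)
ridge_cv.fit(X_train, y_train)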

# The best fit is in the best_estimator_ attribute
print(f"selected alpha: {ridge_cv.best_estimator_.alpha}")
ridge_cv.best_estimator_.coef_


Fitting 5 folds for each of 4 candidates, totalling 20 fits
selected alpha: 0.1


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    0.0s finished


array([-5.39884815e+04,  1.06463986e+04,  2.95924463e+02,  3.76955835e+03,
        7.69500782e+05])

In [12]:
ridge_cv.score(X_train, y_train)

0.5401672144006096

In [13]:
ridge_cv.score(X_test, y_test)

0.5515425671482483
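The prompt for this step also suggests trying other scoring metrics and lists alpha values from 0.001 to 1000. A minimal sketch of that variant is shown below, using the `scoring="neg_root_mean_squared_error"` option of `GridSearchCV`. It runs on synthetic data rather than the housing data, and the names `X_demo`, `y_demo`, `alpha_grid`, and `rmse_search` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (an assumption for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=1.0, size=300)

# Alpha values taken from the assignment prompt
alpha_grid = {"alpha": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# scikit-learn maximizes scores, so RMSE is exposed as a negated value
rmse_search = GridSearchCV(
    Ridge(), alpha_grid, scoring="neg_root_mean_squared_error", cv=5
)
rmse_search.fit(X_demo, y_demo)

best_alpha = rmse_search.best_params_["alpha"]
best_rmse = -rmse_search.best_score_  # flip the sign back to a plain RMSE
print(f"selected alpha: {best_alpha}, cross-validated RMSE: {best_rmse:.3f}")
```

Note that once `scoring` is set, `rmse_search.score(...)` also reports the negated RMSE instead of R^2, so values from searches with different metrics are not directly comparable.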

In [14]:
# Repeat the grid search with a log-transformed target
grid = {"alpha": [0.0001, 0.001, 0.01, 0.1]}

ridge_cv = GridSearchCV(Ridge(), grid, verbose=1, cv=5)
ridge_cv.fit(X_train, np.log(y_train))

# The best fit is in the best_estimator_ attribute
print(f"selected alpha: {ridge_cv.best_estimator_.alpha}")
ridge_cv.best_estimator_.coef_


Fitting 5 folds for each of 4 candidates, totalling 20 fits
selected alpha: 0.0001


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    0.0s finished


array([-5.10160740e-02,  3.30306782e-02,  3.91334705e-04,  5.98313388e-02,
        6.05747388e-01])

In [17]:
ridge_cv.score(X_train, np.log(y_train))

0.5029121081934966
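This R^2 is computed against `np.log(y_train)`, so it is measured on the log scale and is not directly comparable to the earlier scores on raw prices. One possible way to score a log-target model on the original scale is `TransformedTargetRegressor` from `sklearn.compose`, which applies the transform before fitting and inverts it when predicting. The sketch below uses synthetic data, not the housing data; `X_demo`, `y_demo`, and `log_ridge` are hypothetical names.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

# Synthetic positive, right-skewed target (an assumption for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
log_y = 12 + X_demo @ np.array([0.4, 0.2, -0.1]) + rng.normal(scale=0.2, size=400)
y_demo = np.exp(log_y)

# Fit ridge on log(y); predictions are mapped back with exp automatically
log_ridge = TransformedTargetRegressor(
    regressor=Ridge(alpha=0.0001), func=np.log, inverse_func=np.exp
)
log_ridge.fit(X_demo, y_demo)

# R^2 on the original scale (predictions are inverse-transformed first)
r2_original_scale = log_ridge.score(X_demo, y_demo)
# R^2 on the log scale, from the inner fitted regressor
r2_log_scale = log_ridge.regressor_.score(X_demo, np.log(y_demo))
print(f"R^2 original scale: {r2_original_scale:.3f}, log scale: {r2_log_scale:.3f}")
```

The two values generally differ, which is why a log-scale score should only be compared with other log-scale scores.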
