# Assignment - Regression


In this assignment, we will focus on housing. The data set for this exercise contains information on house sales in King County, WA, between May 2014 and May 2015. Each row pertains to one house, and there are 21,613 houses in total. You will use this data set to predict the sale price of a house (i.e., the `price` column) based on the characteristics of the house. Such predictions can be helpful for buyers, sellers, realtors, and lenders.

## Description of Variables

The description and type of each variable are provided in "KC house data - Data Dictionary.docx". Make sure to read this document to learn about the variables.

## Goal

Use the **kc_house_data.csv** data set and build a model to predict **price**.

## Submission

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable or undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


# Read and Prepare the Data

In [43]:
# Common imports

import numpy as np
import pandas as pd

np.random.seed(42)


# Get the data

In [44]:
#We will predict the "price" value in the data set:
house = pd.read_csv("kc_house_data.csv")

# Split data (train/test)

In [45]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(house, test_size=0.3)
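
The split above relies on the global `np.random.seed(42)` call, so its result depends on which cells ran before it. As a minimal sketch, assuming a small made-up frame (`toy_house` is hypothetical and only stands in for `house`), passing `random_state` directly makes the split repeatable regardless of execution order:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the real house data
toy_house = pd.DataFrame({
    "price": np.arange(10.0),
    "sqft_living": np.arange(10.0) * 100,
})

# The same random_state gives the same rows in each split, every time
first_train, first_test = train_test_split(toy_house, test_size=0.3, random_state=42)
second_train, second_test = train_test_split(toy_house, test_size=0.3, random_state=42)

print(first_test.index.equals(second_test.index))
```

The value 42 is arbitrary; any fixed integer works.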

In [46]:
train_set.isna().sum()

price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         1
floors           1
waterfront       0
view             0
condition        0
grade            1
sqft_above       0
sqft_basement    0
yr_built         1
yr_renovated     0
zipcode          2
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

# Data Prep

Perform your data prep here. You can use pipelines like we do in the tutorials. Otherwise, feel free to use your own data prep steps. Eventually, you should do the following at a minimum:
- Separate inputs from target
- Impute/remove missing values
- Standardize the continuous variables
- One-hot encode categorical variables
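
The four steps above can be sketched on a tiny made-up data set before applying them to the real one. This is an illustration only: the frame `toy` and its values are hypothetical, and the real notebook uses more columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing value per input column
toy = pd.DataFrame({
    "price": [300000.0, 450000.0, 520000.0, 610000.0],
    "sqft_living": [1200.0, np.nan, 2000.0, 2400.0],
    "zipcode": pd.Series(["98001", "98002", np.nan, "98001"], dtype=object),
})

# 1) Separate inputs from target
toy_y = toy["price"]
toy_inputs = toy.drop(columns=["price"])

# 2) and 3) Impute, then standardize the continuous column
toy_numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# 2) and 4) Impute, then one-hot encode the categorical column
toy_categorical = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

toy_preprocessor = ColumnTransformer([
    ("num", toy_numeric, ["sqft_living"]),
    ("cat", toy_categorical, ["zipcode"]),
])

toy_x = toy_preprocessor.fit_transform(toy_inputs)

# One scaled numeric column plus one column per zipcode category
# ("98001", "98002", and the imputed "unknown")
print(toy_x.shape)
```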

In [47]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

In [48]:
# Separate the target variable and input variables
train_y = train_set[['price']]
test_y = test_set[['price']]

train_inputs = train_set.drop(['price'], axis=1)
test_inputs = test_set.drop(['price'], axis=1)


In [49]:
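# Convert zipcode to string so it is treated as a categorical variable.
# Note: with astype(str), missing zipcodes can end up as the literal string 'nan'
# and then form their own category instead of being imputed.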
train_inputs['zipcode'] = train_inputs['zipcode'].astype(str)
test_inputs['zipcode'] = test_inputs['zipcode'].astype(str)


In [50]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()
# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()
# Identify the binary columns so we can pass them through without transforming
binary_columns = ['waterfront']

numeric_columns.remove('waterfront')


In [51]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='mean')),
                ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

In [52]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        ('binary', binary_transformer, binary_columns)],
        remainder='passthrough')

In [53]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

# train_x is a sparse matrix; call train_x.toarray() only if a dense copy is needed
train_x.shape

(15129, 88)

In [54]:
# Transform the test data (reuses the statistics learned from the train data)
test_x = preprocessor.transform(test_inputs)
test_x.shape

(6484, 88)

# Calculate the Baseline

In [55]:
#First find the average value of the target

mean_value = np.mean(train_y['price'])

mean_value

539148.6909247141

In [56]:
baseline_pred = np.repeat(mean_value, len(test_y))

baseline_pred

array([539148.69092471, 539148.69092471, 539148.69092471, ...,
       539148.69092471, 539148.69092471, 539148.69092471])

In [57]:
from sklearn.metrics import mean_squared_error
baseline_mse = mean_squared_error(test_y, baseline_pred)

baseline_rmse = np.sqrt(baseline_mse)

print('Baseline RMSE: {}'.format(baseline_rmse))

Baseline RMSE: 363575.99565640965
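
As a cross-check on the manual baseline, scikit-learn's `DummyRegressor` with `strategy="mean"` predicts the training mean for every row. The sketch below uses made-up targets (`toy_train_y` and `toy_test_y` are hypothetical) to show that both approaches give the same RMSE:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical toy targets standing in for train_y and test_y
toy_train_y = np.array([300000.0, 450000.0, 520000.0, 610000.0])
toy_test_y = np.array([400000.0, 500000.0])

# The mean strategy ignores the inputs, so a column of zeros is enough
dummy = DummyRegressor(strategy="mean")
dummy.fit(np.zeros((len(toy_train_y), 1)), toy_train_y)
dummy_pred = dummy.predict(np.zeros((len(toy_test_y), 1)))

# Manual baseline, as in the cells above
manual_pred = np.repeat(toy_train_y.mean(), len(toy_test_y))

dummy_rmse = np.sqrt(mean_squared_error(toy_test_y, dummy_pred))
manual_rmse = np.sqrt(mean_squared_error(toy_test_y, manual_pred))
print(dummy_rmse, manual_rmse)
```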


# Train an SGD model (with no regularization)

In [58]:
from sklearn.linear_model import SGDRegressor 

# tol = stopping criterion
# eta0 = initial learning rate
# penalty = regularization term (None means no regularization)
# max_iter = maximum number of passes over the training data (i.e., epochs)

sgd_reg = SGDRegressor(max_iter=100, penalty=None, eta0=0.1, tol=0.0001) 

# ravel() passes the target as a 1-D array, which avoids a DataConversionWarning
sgd_reg.fit(train_x, train_y.values.ravel())

SGDRegressor(eta0=0.1, max_iter=100, penalty=None, tol=0.0001)

### Generate the error metrics

In [59]:
#Train RMSE
reg_train_pred = sgd_reg.predict(train_x)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 161322.79727908064


In [60]:
#Test RMSE
reg_test_pred = sgd_reg.predict(test_x)

test_mse = mean_squared_error(test_y, reg_test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))


Test RMSE: 165492.9855450757
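
The same train/test RMSE cells are repeated for every model in this notebook. One way to cut the repetition is a small helper; the sketch below is only a suggestion. The name `rmse_report` and the synthetic data are hypothetical, not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error


def rmse_report(model, train_x, train_y, test_x, test_y):
    """Return (train RMSE, test RMSE) for an already fitted model."""
    train_rmse = np.sqrt(mean_squared_error(train_y, model.predict(train_x)))
    test_rmse = np.sqrt(mean_squared_error(test_y, model.predict(test_x)))
    return train_rmse, test_rmse


# Hypothetical synthetic data: a linear signal plus a little noise
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(200, 3))
toy_y = toy_x @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

toy_model = SGDRegressor(max_iter=1000, tol=1e-4, penalty=None,
                         eta0=0.01, random_state=0)
toy_model.fit(toy_x[:150], toy_y[:150])

toy_train_rmse, toy_test_rmse = rmse_report(
    toy_model, toy_x[:150], toy_y[:150], toy_x[150:], toy_y[150:])
print('Train RMSE: {}'.format(toy_train_rmse))
print('Test RMSE: {}'.format(toy_test_rmse))
```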


# Try L1 Regularization in SGD

In [61]:
#Stochastic Gradient:
sgd_reg_L1 = SGDRegressor(max_iter=100, penalty='l1', alpha = .1, eta0=0.1, tol=0.0001)

# ravel() passes the target as a 1-D array, which avoids a DataConversionWarning
sgd_reg_L1.fit(train_x, train_y.values.ravel())

SGDRegressor(alpha=0.1, eta0=0.1, max_iter=100, penalty='l1', tol=0.0001)

### Generate the error metrics

In [62]:
#Train RMSE
reg_train_pred = sgd_reg_L1.predict(train_x)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 170058.7148630164


In [63]:
#Test RMSE
reg_test_pred = sgd_reg_L1.predict(test_x)

test_mse = mean_squared_error(test_y, reg_test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 175030.33989290654
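
An L1 penalty tends to push the coefficients of uninformative inputs toward zero, which is one way to see what it does beyond the RMSE. The sketch below is illustrative only: the synthetic data and the names `plain` and `l1_model` are hypothetical, and the exact counts depend on the data and settings.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical synthetic data: only the first two of ten features matter
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(300, 10))
toy_y = 3.0 * toy_x[:, 0] - 2.0 * toy_x[:, 1] + rng.normal(scale=0.1, size=300)

plain = SGDRegressor(penalty=None, eta0=0.01, max_iter=1000,
                     tol=1e-4, random_state=0).fit(toy_x, toy_y)
l1_model = SGDRegressor(penalty='l1', alpha=0.1, eta0=0.01, max_iter=1000,
                        tol=1e-4, random_state=0).fit(toy_x, toy_y)

# Count how many coefficients each model keeps away from exactly zero
plain_nonzero = int(np.count_nonzero(plain.coef_))
l1_nonzero = int(np.count_nonzero(l1_model.coef_))
print('Non-zero coefficients without penalty: {}'.format(plain_nonzero))
print('Non-zero coefficients with L1 penalty: {}'.format(l1_nonzero))
```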


# Try L2 Regularization in SGD

In [64]:
#Stochastic Gradient:

sgd_reg_L2 = SGDRegressor(max_iter=100, penalty='l2', alpha = 0.1, eta0=0.1, tol=0.0001)

# ravel() passes the target as a 1-D array, which avoids a DataConversionWarning
sgd_reg_L2.fit(train_x, train_y.values.ravel())

SGDRegressor(alpha=0.1, eta0=0.1, max_iter=100, tol=0.0001)

### Generate the error metrics

In [65]:
#Train RMSE
reg_train_pred = sgd_reg_L2.predict(train_x)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 217733.58212180543


In [66]:
#Test RMSE
reg_test_pred = sgd_reg_L2.predict(test_x)

test_mse = mean_squared_error(test_y, reg_test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 214747.22881867253


# Try ElasticNet in SGD

In [67]:
#Stochastic Gradient:
sgd_reg_elastic = SGDRegressor(max_iter=100, penalty='elasticnet', l1_ratio=0.7, alpha = 0.1, 
                          eta0=0.1, tol=0.00001)
# ravel() passes the target as a 1-D array, which avoids a DataConversionWarning
sgd_reg_elastic.fit(train_x, train_y.values.ravel())

SGDRegressor(alpha=0.1, eta0=0.1, l1_ratio=0.7, max_iter=100,
             penalty='elasticnet', tol=1e-05)

### Generate the error metrics

In [68]:
#Train RMSE
reg_train_pred = sgd_reg_elastic.predict(train_x)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 196932.89270458964


In [69]:
#Test RMSE
reg_test_pred = sgd_reg_elastic.predict(test_x)

test_mse = mean_squared_error(test_y, reg_test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 194897.44073935173
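
To make the role of `l1_ratio` concrete, the sketch below computes the elastic net penalty term as described in the scikit-learn SGD user guide: `alpha` times a mix of the L1 norm and half the squared L2 norm of the weights. The helper `elastic_net_penalty` is a hypothetical name written for illustration; it is not a scikit-learn function.

```python
import numpy as np


def elastic_net_penalty(weights, alpha, l1_ratio):
    """Penalty added to the loss: alpha * (l1_ratio * L1 + (1 - l1_ratio) * L2 / 2)."""
    weights = np.asarray(weights, dtype=float)
    l1_term = np.sum(np.abs(weights))
    l2_term = 0.5 * np.sum(weights ** 2)
    return alpha * (l1_ratio * l1_term + (1.0 - l1_ratio) * l2_term)


toy_weights = [1.0, -2.0]

# l1_ratio=0.7 and alpha=0.1 match the settings used in the cell above
mixed = elastic_net_penalty(toy_weights, alpha=0.1, l1_ratio=0.7)
pure_l1 = elastic_net_penalty(toy_weights, alpha=0.1, l1_ratio=1.0)
pure_l2 = elastic_net_penalty(toy_weights, alpha=0.1, l1_ratio=0.0)
print(mixed, pure_l1, pure_l2)
```

With `l1_ratio=1` the penalty reduces to pure L1, and with `l1_ratio=0` to pure L2.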


# Create Polynomial Features

Create polynomial features with degree = 2. 

In [70]:
from sklearn.preprocessing import PolynomialFeatures

# Create second degree terms and interaction terms
poly_features = PolynomialFeatures(degree=2).fit(train_x)

train_x_poly = poly_features.transform(train_x)

test_x_poly = poly_features.transform(test_x)
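
Degree-2 expansion grows the feature count quickly: with a bias column, `n` inputs become `(n + 1)(n + 2) / 2` columns. The sketch below checks this on a three-column toy array and then applies the formula to the 88 columns of `train_x`. The helper `n_poly_features` is a hypothetical name used only here.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_features(n_inputs, degree=2):
    # Bias term plus all monomials up to the given degree
    return comb(n_inputs + degree, degree)


# Three inputs give 1 bias + 3 linear + 3 squared + 3 interaction terms
toy_poly = PolynomialFeatures(degree=2).fit(np.zeros((1, 3)))
print(toy_poly.n_output_features_)   # 10
print(n_poly_features(3))            # 10

# The 88 preprocessed columns above expand to 4005 columns
print(n_poly_features(88))           # 4005
```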

# Try L2 Regularization in SGD (with polynomial features)

In [71]:
from sklearn.linear_model import SGDRegressor 
#Stochastic Gradient:

sgd_reg_L2 = SGDRegressor(max_iter=50, penalty='l2', alpha = 0.1, eta0=0.1, tol=0.0001)

# ravel() passes the target as a 1-D array, which avoids a DataConversionWarning
sgd_reg_L2.fit(train_x_poly, train_y.values.ravel())

SGDRegressor(alpha=0.1, eta0=0.1, max_iter=50, tol=0.0001)

### Generate the error metrics

In [72]:
#Train RMSE
reg_train_pred = sgd_reg_L2.predict(train_x_poly)

train_mse = mean_squared_error(train_y, reg_train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}'.format(train_rmse))

Train RMSE: 12234291507437.07


In [73]:
#Test RMSE
reg_test_pred = sgd_reg_L2.predict(test_x_poly)

test_mse = mean_squared_error(test_y, reg_test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}'.format(test_rmse))

Test RMSE: 16938916148179.182
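
An RMSE in the trillions suggests that the SGD updates diverged rather than that the model merely fit poorly. One likely cause is that squares and products of standardized inputs are no longer on a common scale, which makes a learning rate of `eta0=0.1` too large. The sketch below is not a verified fix for this data set: it uses hypothetical synthetic data to show the usual remedy of rescaling after the polynomial expansion and lowering `eta0`.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical synthetic data with one interaction effect
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(500, 3))
toy_y = (1.0 + 2.0 * toy_x[:, 0] - toy_x[:, 1]
         + 0.5 * toy_x[:, 0] * toy_x[:, 1]
         + rng.normal(scale=0.1, size=500))

toy_train_x, toy_test_x = toy_x[:400], toy_x[400:]
toy_train_y, toy_test_y = toy_y[:400], toy_y[400:]

# Expand, rescale the expanded columns, then fit with a smaller learning rate
poly_sgd = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    SGDRegressor(penalty='l2', alpha=0.0001, eta0=0.01,
                 max_iter=1000, tol=1e-4, random_state=0),
)
poly_sgd.fit(toy_train_x, toy_train_y)

poly_test_rmse = np.sqrt(mean_squared_error(toy_test_y, poly_sgd.predict(toy_test_x)))
toy_baseline_rmse = np.sqrt(mean_squared_error(
    toy_test_y, np.repeat(toy_train_y.mean(), len(toy_test_y))))
print('Polynomial SGD test RMSE: {}'.format(poly_test_rmse))
print('Baseline test RMSE: {}'.format(toy_baseline_rmse))
```

The values `alpha=0.0001` and `eta0=0.01` are assumptions chosen for the toy data; the real data set would need its own tuning.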


# Discussion

Briefly answer the following questions: (2 points) 
1) Which model performs the best (and why)?<br>
2) Does the best model perform better than the baseline (and why)?<br>
3) Does the best model exhibit any overfitting; what did you do about it?
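
To support the discussion, the RMSE values printed above can be collected into one table and ranked. The numbers below are copied from the outputs in this notebook and rounded to two decimals. Because the SGD models were fit without a fixed `random_state`, a rerun may produce different values. The model labels are made up for readability.

```python
import pandas as pd

# RMSE values copied from the recorded outputs above (rounded to two decimals)
results = pd.DataFrame(
    [
        ("Baseline (mean)", float("nan"), 363575.99),
        ("SGD, no regularization", 161322.79, 165492.98),
        ("SGD, L1", 170058.71, 175030.33),
        ("SGD, L2", 217733.58, 214747.22),
        ("SGD, ElasticNet", 196932.89, 194897.44),
        ("SGD, L2 + polynomial features", 12234291507437.07, 16938916148179.18),
    ],
    columns=["model", "train_rmse", "test_rmse"],
)

# Rank by test RMSE (lower is better) and show the train/test gap
ranked = results.sort_values("test_rmse").reset_index(drop=True)
ranked["gap"] = ranked["test_rmse"] - ranked["train_rmse"]
print(ranked)
```

A small gap between train and test RMSE is one sign that a model is not overfitting badly.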