# Predicting Home Value and NOX in Boston Homes
CS 6140 Midterm
Author: Sid Nagaich

Problem 3: Using the Boston data set, build a linear regressor that predicts NOX and another one that predicts median home value.

# Loading the Data
Here we import the libraries we need, suppress warnings, and view the provided data.

In [1]:
# import libraries
import numpy as np
import pandas as pd

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

# read data
data = pd.read_csv("BostonHousing.csv")

# show data
data

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0
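
The `Unnamed: 0` column in the output above suggests that the CSV stores the row index as its first, unnamed column. Read this way, that column stays in the frame and would be passed to both models as a predictor. Below is a minimal sketch of one way to avoid this, assuming the file really has that layout. The inline sample reuses the first rows shown above rather than the real file.

```python
import io

import pandas as pd

# Stand-in for BostonHousing.csv: header and first rows copied from the output above,
# assuming the file begins with an unnamed index column.
sample_csv = io.StringIO(
    ",crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv\n"
    "0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0\n"
    "1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6\n"
    "2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7\n"
)

# index_col=0 uses the first column as the row index instead of keeping it as a feature
sample = pd.read_csv(sample_csv, index_col=0)

print(list(sample.columns))
```

Applying this to the notebook would change the model inputs, so the scores reported below would not carry over unchanged.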


# Preparing the Data
Here we create two data sets, one for each model: one to predict median home value (`medv`) and one to predict NOX. We drop the target column from each data set so that the model does not train on the value it is trying to predict. Since this is a small data set, we hold out 25% of the data for testing the median home value model and 20% for testing the NOX model.

In [2]:
# medv is what we want to predict in one model
medv_y = data['medv']

# predictors for medv: every column except the target
medv_x = data.drop(columns='medv')

# nox is what we want to predict in the other model
nox_y = data['nox']

# predictors for nox: every column except the target
nox_x = data.drop(columns='nox')

In [3]:
from sklearn.model_selection import train_test_split

# shuffle medv data before splitting 75/25
medv_train_x, medv_test_x, medv_train_y, medv_test_y = train_test_split(medv_x, medv_y, test_size=0.25, random_state=13)

# shuffle and split nox data
nox_train_x, nox_test_x, nox_train_y, nox_test_y = train_test_split(nox_x, nox_y, test_size=0.20, random_state=42)

# Using Gradient Boosted Trees
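We use gradient boosted trees as a regressor to predict median home value and NOX. Hyperparameters have been tuned.
We standardize the features with `StandardScaler`. Gradient boosting is susceptible to overfitting (see the discussion at the end of the notebook), although scaling alone does little to prevent it, since tree-based models are largely insensitive to monotonic rescaling of the features.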

Our first model predicts Median Home Value:

In [4]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import GradientBoostingRegressor

# Gradient Boosted Trees; early stopping holds out 20% of the training data and
# stops once the validation score has not improved for 20 consecutive iterations
gbt = make_pipeline(
    StandardScaler(),
    GradientBoostingRegressor(n_estimators=6500, learning_rate=0.05, max_depth=9, subsample=0.85,
                              validation_fraction=0.2, n_iter_no_change=20, max_features='log2',
                              random_state=13),
).fit(medv_train_x, medv_train_y)

gbt_predictions = gbt.predict(medv_test_x)

# show the first 11 predictions next to the actual values
for actual, predicted in zip(medv_test_y.iloc[:11], gbt_predictions[:11]):
    print(f"Actual Med Value: {actual} | Predicted Med Value: {predicted}")

# model score (R^2 on the held-out test set)
print(f"\nmodel score: {gbt.score(medv_test_x, medv_test_y)}")

Actual Med Value: 12.0 | Predicted Med Value: 14.473995517789293
Actual Med Value: 15.2 | Predicted Med Value: 15.490871528891724
Actual Med Value: 21.0 | Predicted Med Value: 20.278436040939003
Actual Med Value: 24.0 | Predicted Med Value: 27.998183576064935
Actual Med Value: 19.4 | Predicted Med Value: 20.158831577276732
Actual Med Value: 22.2 | Predicted Med Value: 22.58281887408838
Actual Med Value: 23.3 | Predicted Med Value: 22.978760152883776
Actual Med Value: 15.6 | Predicted Med Value: 15.703294825333597
Actual Med Value: 20.8 | Predicted Med Value: 20.368323535984253
Actual Med Value: 13.8 | Predicted Med Value: 21.578153209603638
Actual Med Value: 19.6 | Predicted Med Value: 18.31828689277949

model score: 0.8913935471647142
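
For scikit-learn regressors, `score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on the first five actual/predicted pairs printed above, with predictions rounded to two decimals. It only illustrates the formula on a handful of rows and does not reproduce the full test-set score.

```python
import numpy as np
from sklearn.metrics import r2_score

# first five rows of the output above; predictions rounded for readability
actual = np.array([12.0, 15.2, 21.0, 24.0, 19.4])
predicted = np.array([14.47, 15.49, 20.28, 28.00, 20.16])

# residual sum of squares: error left over after the model's predictions
ss_res = np.sum((actual - predicted) ** 2)
# total sum of squares: error of always predicting the mean of the actual values
ss_tot = np.sum((actual - actual.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot

print("manual R^2: ", r2_manual)
print("sklearn R^2:", r2_score(actual, predicted))
```

A score of 1.0 means perfect predictions, while 0.0 means the model does no better than predicting the mean.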


Our second model predicts NOX:

In [5]:
# Gradient Boosted Trees
# note: validation_fraction only takes effect when n_iter_no_change is set,
# so this model trains all 5000 trees without early stopping
gbt = make_pipeline(
    StandardScaler(),
    GradientBoostingRegressor(n_estimators=5000, learning_rate=0.05, max_depth=9, subsample=0.75,
                              validation_fraction=0.1, max_features=4,
                              random_state=13),
).fit(nox_train_x, nox_train_y)

gbt_predictions = gbt.predict(nox_test_x)

# show the first 11 predictions next to the actual values
for actual, predicted in zip(nox_test_y.iloc[:11], gbt_predictions[:11]):
    print(f"Actual Nox: {actual} | Predicted Nox: {predicted}")

# model score (R^2 on the held-out test set)
print(f"\nmodel score: {gbt.score(nox_test_x, nox_test_y)}")

Actual Nox: 0.51 | Predicted Nox: 0.5152136106645683
Actual Nox: 0.447 | Predicted Nox: 0.4530425422254208
Actual Nox: 0.609 | Predicted Nox: 0.6146453499443025
Actual Nox: 0.413 | Predicted Nox: 0.42647379675314706
Actual Nox: 0.713 | Predicted Nox: 0.7082267547123053
Actual Nox: 0.437 | Predicted Nox: 0.5016121001357734
Actual Nox: 0.544 | Predicted Nox: 0.5379573938611847
Actual Nox: 0.624 | Predicted Nox: 0.625316137214737
Actual Nox: 0.532 | Predicted Nox: 0.6723139543336586
Actual Nox: 0.585 | Predicted Nox: 0.5703054871263784
Actual Nox: 0.55 | Predicted Nox: 0.5377643163524639

model score: 0.9023741809370741
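
In scikit-learn's `GradientBoostingRegressor`, `validation_fraction` is only used when `n_iter_no_change` is set, so the median home value model uses early stopping and the NOX model does not. The sketch below uses synthetic data from `make_regression` as a stand-in for the Boston data, with illustrative hyperparameters rather than the tuned ones. It shows how the fitted `n_estimators_` attribute reports how many trees were actually built.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# synthetic stand-in data (an assumption for illustration, not the Boston data)
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# with n_iter_no_change set, a validation split is held out and training may stop early
early = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1,
                                  validation_fraction=0.2, n_iter_no_change=10,
                                  random_state=0).fit(X, y)

# without n_iter_no_change, validation_fraction is ignored and all trees are built
full = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1,
                                 validation_fraction=0.2,
                                 random_state=0).fit(X, y)

print("trees with early stopping:   ", early.n_estimators_)
print("trees without early stopping:", full.n_estimators_)
```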


# Using Stochastic Gradient Descent
I wanted to compare the GBT models with an SGD model, since GBT models are susceptible to overfitting. Even with standardized features, I suspect the GBT models are overfit and that the linear SGD models may generalize better.

Predicting Median Home Value with SGD:

In [6]:
from sklearn.linear_model import SGDRegressor

# Stochastic Gradient Descent
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3)).fit(medv_train_x, medv_train_y)

reg_predictions = reg.predict(medv_test_x)

# show the first 11 predictions next to the actual values
for actual, predicted in zip(medv_test_y.iloc[:11], reg_predictions[:11]):
    print(f"Actual Med Value: {actual} | Predicted Med Value: {predicted}")

# model score (R^2 on the held-out test set)
print(f"\nmodel score: {reg.score(medv_test_x, medv_test_y)}")

Actual Med Value: 12.0 | Predicted Med Value: 10.636018151929182
Actual Med Value: 15.2 | Predicted Med Value: 19.660237283477716
Actual Med Value: 21.0 | Predicted Med Value: 20.95375964748582
Actual Med Value: 24.0 | Predicted Med Value: 30.535320073523046
Actual Med Value: 19.4 | Predicted Med Value: 23.103766801291172
Actual Med Value: 22.2 | Predicted Med Value: 22.376591535957
Actual Med Value: 23.3 | Predicted Med Value: 21.87014734673475
Actual Med Value: 15.6 | Predicted Med Value: 22.304824629360272
Actual Med Value: 20.8 | Predicted Med Value: 19.390727996318866
Actual Med Value: 13.8 | Predicted Med Value: -0.3528533659774524
Actual Med Value: 19.6 | Predicted Med Value: 19.444555013979297

model score: 0.7133038492950874
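
By default `SGDRegressor` shuffles the training data after each epoch, so without a fixed `random_state` the score above can vary from run to run. The sketch below uses synthetic data, not the Boston data, to show that fixing the seed makes two fits agree. The `fit_sgd` helper and the seed value 13 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)


def fit_sgd(seed):
    """Fit the same scaler + SGD pipeline as above, but with a fixed seed."""
    return make_pipeline(
        StandardScaler(),
        SGDRegressor(max_iter=1000, tol=1e-3, random_state=seed),
    ).fit(X, y)


first = fit_sgd(13).predict(X)
second = fit_sgd(13).predict(X)

# identical seeds give identical shuffling, hence matching predictions
print("fits agree:", np.allclose(first, second))
```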


Predicting NOX with SGD:

In [7]:
# Stochastic Gradient Descent
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3)).fit(nox_train_x, nox_train_y)

reg_predictions = reg.predict(nox_test_x)

# show the first 11 predictions next to the actual values
for actual, predicted in zip(nox_test_y.iloc[:11], reg_predictions[:11]):
    print(f"Actual Nox: {actual} | Predicted Nox: {predicted}")

# model score (R^2 on the held-out test set)
print(f"\nmodel score: {reg.score(nox_test_x, nox_test_y)}")

Actual Nox: 0.51 | Predicted Nox: 0.5577290054655057
Actual Nox: 0.447 | Predicted Nox: 0.4813754046100445
Actual Nox: 0.609 | Predicted Nox: 0.7004422661708929
Actual Nox: 0.413 | Predicted Nox: 0.4487596178238911
Actual Nox: 0.713 | Predicted Nox: 0.6687792802829436
Actual Nox: 0.437 | Predicted Nox: 0.558224458014589
Actual Nox: 0.544 | Predicted Nox: 0.5548999635133647
Actual Nox: 0.624 | Predicted Nox: 0.6389056812855617
Actual Nox: 0.532 | Predicted Nox: 0.6478431411788487
Actual Nox: 0.585 | Predicted Nox: 0.57356113769577
Actual Nox: 0.55 | Predicted Nox: 0.6387377232981007

model score: 0.7473944621687161


# Discussion
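The gradient boosted trees are likely overfit to the training data, although the test scores alone do not show this: the GBT models score higher than the SGD models on the held-out data (about 0.89 vs. 0.71 for median home value and 0.90 vs. 0.75 for NOX). Having more data for validation could help confirm or rule out overfitting. While it *could* be the case that GBTs simply perform better than SGD here, I believe the SGD models would generalize better in real-world applications, despite their lower scores.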
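
One way to probe the overfitting concern is to compare each model's training score with its test score and to cross-validate. The sketch below runs on synthetic data, since the Boston CSV is not bundled here. The data generator, the hyperparameters, and the `report` helper are all illustrative assumptions rather than the notebook's setup.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (an assumption for illustration, not the Boston data)
X, y = make_regression(n_samples=400, n_features=12, noise=15.0, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "GBT": make_pipeline(StandardScaler(),
                         GradientBoostingRegressor(n_estimators=300, max_depth=4, random_state=0)),
    "SGD": make_pipeline(StandardScaler(),
                         SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)),
}


def report(name, model):
    """Return (train R^2, test R^2, mean 5-fold cross-validated R^2) for one model."""
    model.fit(train_x, train_y)
    train_r2 = model.score(train_x, train_y)
    test_r2 = model.score(test_x, test_y)
    # cross_val_score clones the pipeline and refits it on each fold
    cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: train R^2={train_r2:.3f} | test R^2={test_r2:.3f} | 5-fold CV R^2={cv_r2:.3f}")
    return train_r2, test_r2, cv_r2


results = {name: report(name, model) for name, model in models.items()}
```

A large gap between the training and test scores points to overfitting, whereas a small gap together with a high cross-validated score would suggest the model is simply a better fit for the data.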