---
title: GSB - S544 Practice Activites
author: Karissa Mohr
format:
  html:
    embed-resources: true
echo: true
theme: lux
---

In [9]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


In [5]:
ames = pd.read_csv("https://raw.githubusercontent.com/kevindavisross/data301/main/data/AmesHousing.txt", sep="\t")
ames

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2925,2926,923275080,80,RL,37.0,7937,Pave,,IR1,Lvl,...,0,,GdPrv,,0,3,2006,WD,Normal,142500
2926,2927,923276100,20,RL,,8885,Pave,,IR1,Low,...,0,,MnPrv,,0,6,2006,WD,Normal,131000
2927,2928,923400125,85,RL,62.0,10441,Pave,,Reg,Lvl,...,0,,MnPrv,Shed,700,7,2006,WD,Normal,132000
2928,2929,924100070,20,RL,77.0,10010,Pave,,Reg,Lvl,...,0,,,,0,4,2006,WD,Normal,170000


### 13.2.5
Consider four possible models for predicting house prices:

* Using only the size and number of rooms.
* Using size, number of rooms, and building type.
* Using size and building type, and their interaction.
* Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.
Set up a pipeline for each of these four models.

Then, get predictions on the test set for each of your pipelines, and compute the root mean squared error. Which model performed best?

Note: You should only use the function train_test_split() one time in your code; that is, we should be predicting on the same test set for all three models.

In [6]:
X = ames[["Gr Liv Area", "TotRms AbvGrd", "Bldg Type"]]
y = ames["SalePrice"]

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [16]:
from math import sqrt

In [10]:
# Model 1
ct1 = ColumnTransformer(
  [("standardize", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"])],
  remainder="drop"
)

pipe1 = Pipeline([
  ("preprocessing", ct1),
  ("linear_regression", LinearRegression())
])

In [19]:
# Fit Model 1 & RMSE
pipe1.fit(X_train, y_train)
y_pred1 = pipe1.predict(X_test)

rmse1 = sqrt(mean_squared_error(y_test, y_pred1))
rmse1

59261.71322786227

In [11]:
# Model 2
ct2 = ColumnTransformer(
  [
    ("standardize", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"]),
    ("dummify", OneHotEncoder(sparse_output=False), ["Bldg Type"])
  ],
  remainder="drop"
)

pipe2 = Pipeline([
  ("preprocessing", ct2),
  ("linear_regression", LinearRegression())
])

In [21]:
# Fit Model 2 & RMSE
pipe2.fit(X_train, y_train)
y_pred2 = pipe2.predict(X_test)

rmse2 = sqrt(mean_squared_error(y_test, y_pred2))
rmse2

57078.218094312484

In [13]:
# Model 3
ct3_dummies = ColumnTransformer(
  [("dummify", OneHotEncoder(sparse_output=False), ["Bldg Type"])],
  remainder="passthrough"
).set_output(transform="pandas")

X_train_dummified = ct3_dummies.fit_transform(X_train)

ct3_inter = ColumnTransformer(
  [
    ("interaction", PolynomialFeatures(interaction_only=True, include_bias=False),
     ["remainder__Gr Liv Area", "dummify__Bldg Type_1Fam"])
  ],
  remainder="drop"
)

pipe3 = Pipeline([
  ("preprocessing", ct3_dummies),
  ("interaction", ct3_inter),
  ("linear_regression", LinearRegression())
])


In [22]:
# Fit Model 3 & RMSE
pipe3.fit(X_train, y_train)
y_pred3 = pipe3.predict(X_test)

rmse3 = sqrt(mean_squared_error(y_test, y_pred3))
rmse3

58339.958266813366

In [14]:
# Model 4
ct4 = ColumnTransformer(
  [
    ("poly", PolynomialFeatures(degree=5, include_bias=False),
     ["Gr Liv Area", "TotRms AbvGrd"]),
    ("dummify", OneHotEncoder(sparse_output=False), ["Bldg Type"])
  ],
  remainder="drop"
)

pipe4 = Pipeline([
  ("preprocessing", ct4),
  ("linear_regression", LinearRegression())
])

In [23]:
# Fit Model 4 & RMSE
pipe4.fit(X_train, y_train)
y_pred4 = pipe4.predict(X_test)

rmse4 = sqrt(mean_squared_error(y_test, y_pred4))
rmse4

59494.4651581882

In [25]:
# RMSE Summary
print("Model 1:", rmse1)
print("Model 2:", rmse2)
print("Model 3:", rmse3)
print("Model 4:", rmse4)

Model 1: 59261.71322786227
Model 2: 57078.218094312484
Model 3: 58339.958266813366
Model 4: 59494.4651581882


Model 2 performed the best because it has the lowest RMSE.

### 13.3.2
Once again consider four modeling options for house price:

* Using only the size and number of rooms.
* Using size, number of rooms, and building type.
* Using size and building type, and their interaction.
* Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.
* Use cross_val_score with the pipelines you made earlier to find the cross-validated root mean squared error for each model.

Which do you prefer? Does this agree with your conclusion from earlier?

In [28]:
from sklearn.model_selection import cross_val_score

In [34]:
# Model 1
scores1 = cross_val_score(pipe1, X, y, cv=5, scoring='neg_root_mean_squared_error')
rmse1 = -scores1.mean()
rmse1

np.float64(55806.32634926364)

In [35]:
# Model 2
scores2 = cross_val_score(pipe2, X, y, cv=5, scoring='neg_root_mean_squared_error')
rmse2 = -scores2.mean()
rmse2

np.float64(54168.081429193844)

In [36]:
# Model 3
scores3 = cross_val_score(pipe3, X, y, cv=5, scoring='neg_root_mean_squared_error')
rmse3 = -scores3.mean()
rmse3

np.float64(55807.6373006867)

In [37]:
# Model 4
scores4 = cross_val_score(pipe4, X, y, cv=5, scoring='neg_root_mean_squared_error')
rmse4

59494.4651581882

In [38]:
# Cross Validated RMSE
# RMSE Summary
print("Model 1:", rmse1)
print("Model 2:", rmse2)
print("Model 3:", rmse3)
print("Model 4:", rmse4)

Model 1: 55806.32634926364
Model 2: 54168.081429193844
Model 3: 55807.6373006867
Model 4: 59494.4651581882


Model 2 performed the best because it has the lowest cross validated RMSE.

### 13.3.3
Consider one hundred modeling options for house price:

* House size, trying degrees 1 through 10
* Number of rooms, trying degrees 1 through 10
* Building Type

Hint: The dictionary of possible values that you make to give to GridSearchCV will have two elements instead of one.

Q1: Which model performed the best?

Q2: What downsides do you see of trying all possible model options? How might you go about choosing a smaller number of tuning values to try?



In [40]:
from sklearn.model_selection import GridSearchCV

In [41]:
ct_poly = ColumnTransformer(
  [
    ("dummify", OneHotEncoder(sparse_output = False), ["Bldg Type"]),
    ("poly_size", PolynomialFeatures(), ["Gr Liv Area"]),
    ("poly_rooms", PolynomialFeatures(), ["TotRms AbvGrd"])
  ],
  remainder = "drop"
)

lr_pipeline_poly = Pipeline(
  [("preprocessing", ct_poly),
   ("linear_regression", LinearRegression())]
).set_output(transform = "pandas")

degrees = {
  "preprocessing__poly_size__degree": np.arange(1, 11),
  "preprocessing__poly_rooms__degree": np.arange(1, 11)
}

gscv = GridSearchCV(lr_pipeline_poly, degrees, cv = 5, scoring = "neg_root_mean_squared_error")
gscv_fitted = gscv.fit(X, y)


In [42]:
gscv_fitted.cv_results_['mean_test_score']

array([ -54168.08142919,  -53925.41781725,  -52781.98419773,
        -56058.00606344,  -56255.73634349,  -59099.29729768,
        -67308.27760185,  -84600.65720046, -116661.14747448,
       -171423.44648757,  -54218.38403911,  -54152.70400286,
        -52837.44407633,  -55567.85407013,  -56255.73634349,
        -59099.29729768,  -67308.27760185,  -84600.65720046,
       -116661.14747448, -171423.4464819 ,  -53995.89644106,
        -54101.47839453,  -53003.26228233,  -54468.66412299,
        -56255.73634349,  -59099.29729768,  -67308.27760185,
        -84600.65720046, -116661.14747448, -171423.4464819 ,
        -53596.30184666,  -53951.82868647,  -53163.13168313,
        -54672.62593258,  -56255.7363435 ,  -59099.29729768,
        -67308.27760185,  -84600.65720046, -116661.14747448,
       -171423.4464819 ,  -53614.99780116,  -54253.59397689,
        -53386.54617201,  -54908.22937562,  -56255.7363447 ,
        -59099.29729768,  -67308.27760185,  -84600.65720046,
       -116661.14747448,

In [45]:
results = pd.DataFrame({
    "degree_size": gscv_fitted.cv_results_["param_preprocessing__poly_size__degree"],
    "degree_rooms": gscv_fitted.cv_results_["param_preprocessing__poly_rooms__degree"],
    "mean_test_score": gscv_fitted.cv_results_["mean_test_score"]
})

results.sort_values("mean_test_score", ascending=False)

Unnamed: 0,degree_size,degree_rooms,mean_test_score
2,3,1,-52781.984198
12,3,2,-52837.444076
22,3,3,-53003.262282
32,3,4,-53163.131683
62,3,7,-53168.850061
...,...,...,...
49,10,5,-171423.446482
69,10,7,-171423.446482
89,10,9,-171423.446488
99,10,10,-171423.446488


* The model using a 3rd degree polynomial on house size and a 1st degree polynomial on number of rooms performed the best because it has the lowest RMSE.
* Trying all 100 degree combinations risks overfitting the model so it's better to stick with smaller degrees (1–4), when testing which model is the best fit.