Consider four possible models for predicting house prices:

Using only the size and number of rooms.

Using size, number of rooms, and building type.

Using size and building type, and their interaction.

Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.

Set up a pipeline for each of these four models.

Then, get predictions on the test set for each of your pipelines, and compute the root mean squared error. Which model performed best?

Note: You should only use the function train_test_split() one time in your code; that is, we should be predicting on the same test set for all four models.
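For reference, the root mean squared error asked for above is the square root of the average squared prediction error. A minimal sketch on made-up numbers (the arrays `y_true` and `y_hat` are illustrative and are not taken from the housing data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only, not from AmesHousing.csv
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# RMSE by hand: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```

Because RMSE is in the same units as the target, the values reported in this notebook can be read as typical errors in dollars.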

In [37]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from math import sqrt
from sklearn.model_selection import GridSearchCV


In [38]:
lr = LinearRegression()
data = pd.read_csv("AmesHousing.csv")

In [39]:
X = data[['Gr Liv Area', 'TotRms AbvGrd', 'Bldg Type']]

y = data['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

ct1 = ColumnTransformer(transformers=[
    ("scale", StandardScaler(), ['Gr Liv Area', 'TotRms AbvGrd'])
], remainder='drop')

pipeline_1 = Pipeline(steps=[
    ('preprocessor', ct1),
    ('model', LinearRegression())
])
pipeline_1.fit(X_train, y_train)
y_pred = pipeline_1.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(sqrt(mse))

55372.453007850665


In [40]:
# Reuse X, y and the single train/test split created in the previous cell,
# so that train_test_split() is called only once and every model is
# evaluated on the same test set.
# Model 1
ct1 = ColumnTransformer(transformers=[
    ("scale", StandardScaler(), ['Gr Liv Area', 'TotRms AbvGrd'])
], remainder='drop')

pipeline_1 = Pipeline(steps=[
    ('preprocessor', ct1),
    ('model', LinearRegression())
])
# Model 2
ct2 = ColumnTransformer(transformers=[
    ("scale", StandardScaler(), ['Gr Liv Area', 'TotRms AbvGrd']),
    ("dummify", OneHotEncoder(sparse_output=False), ['Bldg Type'])
], remainder='passthrough')
pipeline_2 = Pipeline(steps=[
    ('preprocessor', ct2),
    ('model', LinearRegression())
])
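# Model 3
# Note: remainder='passthrough' keeps both 'Gr Liv Area' and 'TotRms AbvGrd',
# and PolynomialFeatures(interaction_only=True) then forms every pairwise
# product of the columns, so this pipeline is broader than
# "size and building type, and their interaction".
ct3 = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(), ['Bldg Type'])
], remainder='passthrough')
pipeline_3 = Pipeline([
    ('preprocessor', ct3),
    ('interactions', PolynomialFeatures(interaction_only=True, include_bias=False)),
    ('regressor', LinearRegression())
])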
# Model 4
ct4 = ColumnTransformer(transformers=[
    ("size_poly", PolynomialFeatures(degree=5), ['Gr Liv Area']),
    ("rooms_poly", PolynomialFeatures(degree=5), ['TotRms AbvGrd']),
    ("dummify", OneHotEncoder(sparse_output=False), ['Bldg Type'])
], remainder='passthrough')
pipeline_4 = Pipeline(steps=[
    ('preprocessor', ct4),
    ('model', LinearRegression())
])

In [41]:
models = [pipeline_1, pipeline_2, pipeline_3, pipeline_4]
rmse_results = {}

# Each pipeline's ColumnTransformer picks out the columns it uses
# (pipeline_1 drops 'Bldg Type'), so every model can be fit and
# scored on the same train/test split without special cases.
for i, model in enumerate(models, 1):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse_results[f'Model {i}'] = sqrt(mean_squared_error(y_test, y_pred))

print(rmse_results)

{'Model 1': 55372.453007850665, 'Model 2': 54083.12550273998, 'Model 3': 52903.489062684515, 'Model 4': 53218.50422621488}


Model 3 has the lowest test RMSE (about 52,903), so it performs best of the four models on this test set.


Practice Activity

Once again consider four modeling options for house price:

Using only the size and number of rooms.

Using size, number of rooms, and building type.

Using size and building type, and their interaction.

Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.

Use cross_val_score with the pipelines you made earlier to find the cross-validated root mean squared error for each model.

Which do you prefer? Does this agree with your conclusion from earlier?

In [42]:
ct2 = ColumnTransformer(transformers=[
    ("standardize", StandardScaler(), ['Gr Liv Area', 'TotRms AbvGrd']),
    ("dummify", OneHotEncoder(sparse_output=False), ['Bldg Type'])
])

pipeline_2 = Pipeline(
   [ ('preprocessor', ct2),
    ('model', LinearRegression())]
)
scores = cross_val_score(pipeline_2, X, y, cv=5, scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-scores)
rmse_scores.mean()

54168.081429193844

In [43]:
cross_val_rmse_results = {}

# cross_val_score clones and refits each pipeline on every fold;
# the pipelines select their own columns, so X can be passed unchanged.
for i, model in enumerate(models, 1):
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    rmse_scores = np.sqrt(-scores)
    cross_val_rmse_results[f'Model {i}'] = np.mean(rmse_scores)

print(cross_val_rmse_results)

{'Model 1': 55806.32634926364, 'Model 2': 54168.081429193844, 'Model 3': 53350.439449155354, 'Model 4': 56303.24517642801}


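Model 3 again has the lowest cross-validated RMSE (about 53,350), so it is the preferred model, which agrees with the earlier test-set comparison. Model 4, however, moves from second best on the single test split to worst under cross-validation, which suggests its degree-5 polynomials are less stable across folds.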
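The cross-validation code above negates the `neg_mean_squared_error` scores and takes a square root per fold. scikit-learn also provides the scoring string `neg_root_mean_squared_error`, which returns the negated per-fold RMSE directly. A small sketch on synthetic data (`X_toy` and `y_toy` are made up for illustration) showing that the two routes agree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (assumption: not the housing data)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(120, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=120)

# Route 1: negated MSE per fold, then square root
neg_mse = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5,
                          scoring="neg_mean_squared_error")
rmse_from_mse = np.sqrt(-neg_mse)

# Route 2: negated RMSE per fold, straight from the scorer
neg_rmse = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5,
                           scoring="neg_root_mean_squared_error")
rmse_direct = -neg_rmse

print(rmse_from_mse.mean(), rmse_direct.mean())
```

Both routes average the per-fold RMSE values, which is not the same as taking the square root of the average MSE across folds.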

Consider one hundred modeling options for house price:

House size, trying degrees 1 through 10

Number of rooms, trying degrees 1 through 10

Building Type

Hint: The dictionary of possible values that you make to give to GridSearchCV
will have two elements instead of one.
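To see where the "one hundred" comes from, the grid can be counted before any model is fit. A short sketch using scikit-learn's `ParameterGrid`; the parameter names below are illustrative and must match the step and transformer names of whatever pipeline is actually searched:

```python
from sklearn.model_selection import ParameterGrid

# Two tuning parameters, ten candidate degrees each
# (assumption: the key names are placeholders for the real pipeline's names)
param_grid = {
    "preprocessing__polynomial_area__degree": list(range(1, 11)),
    "preprocessing__polynomial_room__degree": list(range(1, 11)),
}

n_candidates = len(ParameterGrid(param_grid))  # every combination of the two lists
n_fits = n_candidates * 5                      # with 5-fold cross-validation

print(n_candidates, n_fits)
```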

Q1: Which model performed the best?

Q2: What downsides do you see of trying all possible model options? How might you go about choosing a smaller number of tuning values to try?

In [44]:
ct_poly = ColumnTransformer(
  [
    ("dummify", OneHotEncoder(sparse_output = False), ["Bldg Type"]),
    ("polynomial", PolynomialFeatures(), ["Gr Liv Area"])
  ],
  remainder = "drop"
)

lr_pipeline_poly = Pipeline(
  [("preprocessing", ct_poly),
  ("linear_regression", LinearRegression())]
).set_output(transform="pandas")

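# The key must follow <pipeline step>__<transformer name>__<parameter>;
# np.arange(1, 11) covers degrees 1 through 10.
degrees = {'preprocessing__polynomial__degree': np.arange(1, 11)}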

gscv = GridSearchCV(lr_pipeline_poly, degrees, cv = 5, scoring='r2')

In [45]:
ct_poly = ColumnTransformer(
    [
        ("dummify", OneHotEncoder(sparse_output=False), ["Bldg Type"]),
        ("polynomial_area", PolynomialFeatures(), ["Gr Liv Area"]),
        ("polynomial_room", PolynomialFeatures(), ["TotRms AbvGrd"])

    ],
    remainder="drop"
)

lr_pipeline_poly = Pipeline(
    [
        ("preprocessing", ct_poly),
        ("linear_regression", LinearRegression())
    ]
).set_output(transform="pandas")

# Define the degrees to try for polynomial features
degrees = {
    'preprocessing__polynomial_area__degree': np.arange(1, 11),
    'preprocessing__polynomial_room__degree': np.arange(1, 11),
}

gscv = GridSearchCV(lr_pipeline_poly, degrees, cv=5, scoring='r2')

gscv.fit(X, y)
print(gscv.best_params_)
print(gscv.best_score_)


{'preprocessing__polynomial_area__degree': 3, 'preprocessing__polynomial_room__degree': 1}
0.5576406065380459


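Q1: The best model uses a degree-3 polynomial on house size and a degree-1 (linear) term for number of rooms, together with building type; its mean cross-validated R² is about 0.558.

Q2: Trying every combination is time consuming, because the number of fits grows multiplicatively with each tuning parameter (here 10 × 10 = 100 models, each fit 5 times). Comparing that many candidates on the same folds also raises the risk of picking a model that overfits the validation data by chance. To try fewer tuning values, we can use exploratory data analysis or domain knowledge to narrow the plausible degrees and focus on the most relevant features, or use a method such as random search that samples only part of the grid.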
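Random search, mentioned above, is available in scikit-learn as `RandomizedSearchCV`. The sketch below samples 10 of the 100 degree combinations instead of fitting them all. It runs on synthetic data, and the column names `size` and `rooms`, the transformer names, and `n_iter=10` are assumptions chosen for illustration, not results from the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a cubic effect of "size" (assumption: made up)
rng = np.random.default_rng(2)
toy_X = pd.DataFrame({
    "size": rng.uniform(-1, 1, 300),
    "rooms": rng.uniform(-1, 1, 300),
})
toy_y = (1 + 2 * toy_X["size"] - toy_X["size"] ** 3
         + 0.5 * toy_X["rooms"] + rng.normal(scale=0.1, size=300))

ct = ColumnTransformer([
    ("poly_size", PolynomialFeatures(include_bias=False), ["size"]),
    ("poly_rooms", PolynomialFeatures(include_bias=False), ["rooms"]),
])
pipe = Pipeline([
    ("preprocessing", ct),
    ("linear_regression", LinearRegression()),
])

# Same 10 x 10 grid as an exhaustive search would use
param_distributions = {
    "preprocessing__poly_size__degree": list(range(1, 11)),
    "preprocessing__poly_rooms__degree": list(range(1, 11)),
}

# Fit only 10 randomly chosen combinations out of the 100
search = RandomizedSearchCV(pipe, param_distributions, n_iter=10, cv=5,
                            scoring="r2", random_state=0)
search.fit(toy_X, toy_y)

print(search.best_params_)
print(search.best_score_)
```

The trade-off is that random search may miss the single best combination, but it cuts the number of fits by a factor of ten here.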