### Consider four possible models for predicting house prices:

- Using only the size and number of rooms.
- Using size, number of rooms, and building type.
- Using size and building type, and their interaction.
- Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.

### Set up a pipeline for each of these four models.

### Then, get predictions on the test set for each of your pipelines, and compute the root mean squared error. Which model performed best?

### Note: You should only use the function train_test_split() one time in your code; that is, we should be predicting on the same test set for all four models.

In [28]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
import numpy as np

In [2]:
housing = pd.read_csv("/Users/nicholaseah/Downloads/AmesHousing.csv")
housing.head()

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


In [3]:
lr = LinearRegression()

X = housing.drop("SalePrice", axis = 1)
y = housing["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(X, y)
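
The call above shuffles differently on every run, so the RMSE values in the cells that follow will change if the notebook is re-executed. A minimal sketch, assuming an arbitrary seed of 42 and toy arrays (neither is in the original), of how `random_state` makes the single split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for X and y (assumption: not the real housing data).
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# random_state fixes the shuffle; test_size=0.25 matches the default split.
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# The second element of each result is the test feature matrix.
same_test_rows = np.array_equal(split_a[1], split_b[1])
print(same_test_rows)  # True
```

With a fixed seed, every pipeline is still scored on the same test rows, and the comparison can be repeated later with identical numbers.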

### size and number of rooms

In [10]:
# size and number of rooms
ct = ColumnTransformer(
  [
    ("standardize", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"])
  ],
  remainder = "drop"
)


lr_pipeline1 = Pipeline(
  [("preprocessing", ct),
  ("linear_regression", LinearRegression())]
)

lr_fitted = lr_pipeline1.fit(X_train,y_train)

In [11]:
y_preds = lr_fitted.predict(X_test)
mean_squared_error(y_test, y_preds, squared = False)

53827.345528802136

### size, number of rooms, and building type

In [12]:
# size, num of rooms, and building type
ct = ColumnTransformer(
  [
    ("dummify", OneHotEncoder(sparse_output = False), ["Bldg Type"]),
    ("standardize", StandardScaler(), ["Gr Liv Area", "TotRms AbvGrd"])
  ],
  remainder = "drop"
)


lr_pipeline2 = Pipeline(
  [("preprocessing", ct),
  ("linear_regression", LinearRegression())]
)

lr_fitted = lr_pipeline2.fit(X_train,y_train)
y_preds = lr_fitted.predict(X_test)
mean_squared_error(y_test, y_preds, squared = False)

51901.22722549979

### size and building type, and their interaction

In [13]:
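# Note: the task asks for size ("Gr Liv Area") x building type, but the
# interactions below use number of rooms ("TotRms AbvGrd"), so the RMSE
# reported for this model reflects rooms x building type.
ct_dummies = ColumnTransformer(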
  [("dummify", OneHotEncoder(sparse_output = False), ["Bldg Type"])],
  remainder = "passthrough"
).set_output(transform = "pandas")

ct_inter = ColumnTransformer(
  [
    ("interaction1", PolynomialFeatures(interaction_only=True), ["remainder__TotRms AbvGrd", "dummify__Bldg Type_1Fam"]),
    ("interaction2", PolynomialFeatures(interaction_only=True), ["remainder__TotRms AbvGrd", "dummify__Bldg Type_TwnhsE"]),
    ("interaction3", PolynomialFeatures(interaction_only=True), ["remainder__TotRms AbvGrd", "dummify__Bldg Type_Twnhs"]),
    ("interaction4", PolynomialFeatures(interaction_only=True), ["remainder__TotRms AbvGrd", "dummify__Bldg Type_Duplex"]),
    ("interaction5", PolynomialFeatures(interaction_only=True), ["remainder__TotRms AbvGrd", "dummify__Bldg Type_2fmCon"]),
  ],
  remainder = "drop"
).set_output(transform = "pandas")

lr_pipeline3 = Pipeline(
    [("dummify", ct_dummies),
     ("interactions", ct_inter),
     ("linear_regression", LinearRegression())
    ]
)

lr_fitted = lr_pipeline3.fit(X_train,y_train)
y_preds = lr_fitted.predict(X_test)
mean_squared_error(y_test, y_preds, squared = False)


60510.25264703423

### 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type

In [14]:
#5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type

ct = ColumnTransformer(
    [
      ("dummify", OneHotEncoder(sparse_output=False), ["Bldg Type"]),
      ("poly_size", PolynomialFeatures(degree=5, include_bias=False), ["Gr Liv Area"]),
      ("poly_rooms", PolynomialFeatures(degree=5, include_bias=False), ["TotRms AbvGrd"])

    ],
    remainder="drop")

lr_pipeline4 = Pipeline(
    [
        ("preprocessing", ct),
        ("ols", LinearRegression())
    ]
)


lr_fitted = lr_pipeline4.fit(X_train,y_train)
y_preds = lr_fitted.predict(X_test)
mean_squared_error(y_test, y_preds, squared = False)

54690.16361529879

Based on the test-set RMSE values, the model using size, number of rooms, and building type had the lowest error (about 51,901), so it performed best on this split.
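
The numbers compared here are root mean squared errors, because `squared = False` was passed to `mean_squared_error`. Recent scikit-learn releases deprecate that argument in favor of `root_mean_squared_error`, so a version-independent way to get the same quantity is to take the square root yourself. A small sketch with made-up prices (the `rmse` helper is hypothetical, not part of the notebook):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up prices, for illustration only: errors of 10,000 and 70,000.
actual = [200000, 300000]
predicted = [210000, 230000]

print(rmse(actual, predicted))  # 50000.0
```

The result is in dollars, the same unit as `SalePrice`, and it sits above the average absolute error (40,000 here) because squaring weights the larger miss more heavily.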

#### Once again consider four modeling options for house price:

- Using only the size and number of rooms.
- Using size, number of rooms, and building type.
- Using size and building type, and their interaction.
- Using a 5-degree polynomial on size, a 5-degree polynomial on number of rooms, and also building type.

#### Use cross_val_score with the pipelines you made earlier to find the cross-validated root mean squared error for each model.

#### Which do you prefer? Does this agree with your conclusion from earlier?

In [30]:
# cross_val_score returns negated MSE per fold; negate, average, then take the root.

# Model 1: size and number of rooms
neg_mse1 = cross_val_score(lr_pipeline1, X, y, cv=5, scoring='neg_mean_squared_error')
rmse1 = np.sqrt(-neg_mse1.mean())

# Model 2: size, number of rooms, and building type
neg_mse2 = cross_val_score(lr_pipeline2, X, y, cv=5, scoring='neg_mean_squared_error')
rmse2 = np.sqrt(-neg_mse2.mean())

# Model 3: interaction model
neg_mse3 = cross_val_score(lr_pipeline3, X, y, cv=5, scoring='neg_mean_squared_error')
rmse3 = np.sqrt(-neg_mse3.mean())

# Model 4: degree-5 polynomials plus building type
neg_mse4 = cross_val_score(lr_pipeline4, X, y, cv=5, scoring='neg_mean_squared_error')
rmse4 = np.sqrt(-neg_mse4.mean())

print(rmse1, rmse2, rmse3, rmse4)

56001.24023779208 54311.685543940046 63970.57235835044 56557.832467076514


Based on the cross-validated RMSE values, the second model (size, number of rooms, and building type) is still preferred, consistent with our results from the previous Practice Activity. The ranking of all four models is also unchanged.
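
One detail of the cross-validation cell: it averages the fold MSEs and then takes a single square root. Scoring with `'neg_root_mean_squared_error'` would instead give one RMSE per fold, and averaging those is a slightly different number. A sketch with made-up fold values showing the gap:

```python
import math

# Made-up per-fold MSE values, for illustration only.
fold_mse = [4.0, 16.0]

# Approach used in the notebook: average the MSEs, then take one square root.
rmse_of_mean = math.sqrt(sum(fold_mse) / len(fold_mse))

# Alternative: take the RMSE of each fold, then average.
mean_of_rmse = sum(math.sqrt(m) for m in fold_mse) / len(fold_mse)

print(round(rmse_of_mean, 4), mean_of_rmse)  # 3.1623 3.0
```

The first value is never smaller than the second, so either choice is fine for ranking models as long as the same one is used for all of them.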

#### Consider one hundred modeling options for house price:

- House size, trying degrees 1 through 10
- Number of rooms, trying degrees 1 through 10
- Building Type

#### Hint: The dictionary of possible values that you make to give to GridSearchCV will have two elements instead of one.

#### Q1: Which model performed the best?

#### Q2: What downsides do you see of trying all possible model options? How might you go about choosing a smaller number of tuning values to try?

#### House size, degrees 1 through 10

In [32]:
from sklearn.model_selection import GridSearchCV


ct_poly = ColumnTransformer(
  [
    ("polynomial", PolynomialFeatures(), ["Gr Liv Area"])
  ],
  remainder = "drop"
)

lr_pipeline_poly = Pipeline(
  [("preprocessing", ct_poly),
  ("linear_regression", LinearRegression())]
).set_output(transform="pandas")

degrees = {'preprocessing__polynomial__degree': np.arange(1, 11)}

gscv = GridSearchCV(lr_pipeline_poly, degrees, cv = 5, scoring='r2')

In [41]:
gscv_fitted = gscv.fit(X, y)
pd.DataFrame(data = {"degrees": np.arange(1, 11), "scores": gscv_fitted.cv_results_['mean_test_score']})

Unnamed: 0,degrees,scores
0,1,0.488503
1,2,0.490241
2,3,0.507396
3,4,0.499218
4,5,0.45186
5,6,0.333837
6,7,0.029322
7,8,-0.968096
8,9,-4.545597
9,10,-16.187917

#### Number of rooms, degrees 1 through 10

In [33]:
ct_poly = ColumnTransformer(
  [
    ("polynomial", PolynomialFeatures(), ["TotRms AbvGrd"])
  ],
  remainder = "drop"
)

lr_pipeline_poly = Pipeline(
  [("preprocessing", ct_poly),
  ("linear_regression", LinearRegression())]
).set_output(transform="pandas")

degrees = {'preprocessing__polynomial__degree': np.arange(1, 11)}

gscv = GridSearchCV(lr_pipeline_poly, degrees, cv = 5, scoring='r2')
gscv_fitted = gscv.fit(X, y)
pd.DataFrame(data = {"degrees": np.arange(1, 11), "scores": gscv_fitted.cv_results_['mean_test_score']})

Unnamed: 0,degrees,scores
0,1,0.230878
1,2,0.226764
2,3,0.23522
3,4,0.233668
4,5,0.221095
5,6,0.13603
6,7,0.2036
7,8,-0.431101
8,9,-0.318159
9,10,-630.55777


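Q1: For house size, the degree-3 model performed best, with the highest cross-validated R-squared (about 0.507). For number of rooms, the degree-3 model was also best (about 0.235). Note that these two searches tune each variable on its own; they do not cover the full 10 x 10 grid of degree combinations with building type included.

Q2: Trying all possible models is expensive (100 combinations with 5 folds each means 500 fits), and picking the single best score out of many candidates increases the risk of overfitting. The tables above also show that high degrees generalize badly: R-squared turns negative from degree 8 onward, reaching about -16.19 for size and -630.56 for number of rooms at degree 10. A smaller search could therefore be limited to low degrees, for example 1 through 4, where the scores are stable.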