# Training and Test Errors

In [6]:
import pandas as pd
import numpy as np

**Try to use scikit-learn whenever possible.**

## Ames Housing Data

In [7]:
df_ames = pd.read_csv("http://dlsun.github.io/pods/data/AmesHousing.txt", sep = "\t")
df_ames

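| | Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | RL | 141.0 | 31770 | Pave | | IR1 | Lvl | ... | 0 | | | | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 2 | 526350040 | 20 | RH | 80.0 | 11622 | Pave | | Reg | Lvl | ... | 0 | | MnPrv | | 0 | 6 | 2010 | WD | Normal | 105000 |
| 2 | 3 | 526351010 | 20 | RL | 81.0 | 14267 | Pave | | IR1 | Lvl | ... | 0 | | | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
| 3 | 4 | 526353030 | 20 | RL | 93.0 | 11160 | Pave | | Reg | Lvl | ... | 0 | | | | 0 | 4 | 2010 | WD | Normal | 244000 |
| 4 | 5 | 527105010 | 60 | RL | 74.0 | 13830 | Pave | | IR1 | Lvl | ... | 0 | | MnPrv | | 0 | 3 | 2010 | WD | Normal | 189900 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2925 | 2926 | 923275080 | 80 | RL | 37.0 | 7937 | Pave | | IR1 | Lvl | ... | 0 | | GdPrv | | 0 | 3 | 2006 | WD | Normal | 142500 |
| 2926 | 2927 | 923276100 | 20 | RL | | 8885 | Pave | | IR1 | Low | ... | 0 | | MnPrv | | 0 | 6 | 2006 | WD | Normal | 131000 |
| 2927 | 2928 | 923400125 | 85 | RL | 62.0 | 10441 | Pave | | Reg | Lvl | ... | 0 | | MnPrv | Shed | 700 | 7 | 2006 | WD | Normal | 132000 |
| 2928 | 2929 | 924100070 | 20 | RL | 77.0 | 10010 | Pave | | Reg | Lvl | ... | 0 | | | | 0 | 4 | 2006 | WD | Normal | 170000 |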


1\. Fit a $10$-nearest neighbors model to predict **SalePrice** using **Bldg Type** as the only feature.

In [8]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

ct = make_column_transformer(
    (OneHotEncoder(), ["Bldg Type"]),
    remainder="drop"  # all other columns in X will be dropped.
)
ct


In [9]:
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsRegressor

pipeline = make_pipeline(
    ct,
    KNeighborsRegressor(n_neighbors=10)
)

pipeline.fit(X=df_ames[["Bldg Type"]],
             y=df_ames["SalePrice"])

2\. Calculate the **training error** of this model. Try a few different performance metrics.

In [10]:
y_train_ = pipeline.predict(X=df_ames[["Bldg Type"]])
y_train_

array([185170., 185170., 185170., ..., 185170., 185170., 185170.])

In [11]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(df_ames["SalePrice"], y_train_)
rmse = np.sqrt(mse)
rmse

78699.30648123682
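The prompt asks for a few different metrics, while the cell above reports only RMSE. As a minimal sketch, the block below computes RMSE, MAE, and $R^2$ side by side. The `y_true` and `y_pred` arrays are made-up numbers for illustration, not Ames data; in the notebook they would be `df_ames["SalePrice"]` and `y_train_`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted prices (illustration only, not Ames data).
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_pred = np.array([210000.0, 140000.0, 280000.0, 250000.0])

mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # same units as the target (dollars)
mae = mean_absolute_error(y_true, y_pred)  # average absolute error, less sensitive to outliers
r2 = r2_score(y_true, y_pred)              # fraction of variance explained

print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")
print(f"R^2:  {r2:.3f}")
```

RMSE and MAE are both in dollars, so they can be compared directly with typical sale prices; $R^2$ is unitless.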

3\. Repeat the above process to calculate the training error for $k=1, 2, \ldots, 10$. Which value of $k$ gives the smallest training error? Does that necessarily mean this is the best value of $k$? Discuss with your partner.

In [12]:
errors = []

for k in [1,2,3,4,5,6,7,8,9,10]:
    pipeline = make_pipeline(
        ct,
        KNeighborsRegressor(n_neighbors=k)
    )

    # Double brackets keep X as a one-column DataFrame, which the column transformer expects
    X_train = df_ames[["Bldg Type"]]
    y_train = df_ames["SalePrice"]

    pipeline.fit(X=X_train, y=y_train)

    y_pred = pipeline.predict(X_train)
    mse = mean_squared_error(y_train, y_pred)
    rmse = np.sqrt(mse)

    print(f'training error for k = {k} is {rmse}')
    errors.append(rmse)

# Get the k value with the smallest error
min_error_k = errors.index(min(errors)) + 1
print(f"The smallest training error was with k = {min_error_k}")


training error for k = 1 is 83869.12582123668
training error for k = 2 is 82461.56991333194
training error for k = 3 is 81459.84807307809
training error for k = 4 is 79077.36953503109
training error for k = 5 is 78841.98384073898
training error for k = 6 is 78718.47565629905
training error for k = 7 is 78666.84969111855
training error for k = 8 is 78813.2596795962
training error for k = 9 is 78781.76304262652
training error for k = 10 is 78699.30648123682
The smallest training error was with k = 7
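The smallest training error does not identify the best $k$, because training error rewards memorizing the data. The sketch below illustrates this on synthetic data (an assumption, not the Ames data): when every observation has distinct feature values, $k=1$ reproduces each training target exactly and gets a training RMSE of zero, while its cross-validated RMSE is far from zero.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (illustration only): a noisy linear trend in one feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + rng.normal(0, 2, size=200)

results = {}
for k in range(1, 11):
    model = KNeighborsRegressor(n_neighbors=k)

    # Training error: fit and evaluate on the same observations.
    model.fit(X, y)
    train_rmse = np.sqrt(mean_squared_error(y, model.predict(X)))

    # Cross-validated error: each fold is scored on data held out from fitting.
    cv_mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_rmse = np.sqrt(cv_mse).mean()

    results[k] = (train_rmse, cv_rmse)
    print(f"k = {k:2d}  training RMSE = {train_rmse:.3f}  CV RMSE = {cv_rmse:.3f}")
```

In the `Bldg Type` model above, many houses share identical feature values, so even $k=1$ cannot memorize the training data; that is why the training errors there are not zero.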


4\. Return to the model in part 1. Now estimate the **test error** of the model using cross-validation. Try a few different performance metrics.

In [13]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

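# Sanity check: predict the price of a single-family home ("1Fam").
# `pipeline` is the last model fit in the loop above (k = 10), the same as in part 1.
x_test = pd.DataFrame({"Bldg Type": ["1Fam"]})
predicted_price = pipeline.predict(X=x_test)

print(f"Predicted SalePrice: {predicted_price[0]}")

# Estimate the test error using 5-fold cross-validation.
# scikit-learn reports negated MSE, so flip the sign before taking the square root.
cv_scores = cross_val_score(pipeline, df_ames[["Bldg Type"]], df_ames["SalePrice"],
                            cv=5, scoring='neg_mean_squared_error')
cv_rmse = np.sqrt(-cv_scores)
print(f"Cross-validated RMSE: {cv_rmse.mean()}")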

Predicted SalePrice: 185170.0
Cross-validated RMSE: 79374.27020857538


5\. Now, define a 10-nearest neighbors model to predict **SalePrice** using **Neighborhood** as the only feature. Try to estimate the test error of this model using cross-validation.

You will get an error. Can you figure out why this error occurs? Can you figure out how to fix it?

In [14]:
from sklearn.model_selection import cross_val_score

# Define a new column transformer for the "Neighborhood" feature
ct_neighborhood = make_column_transformer(
    (OneHotEncoder(handle_unknown = 'ignore'), ["Neighborhood"]),
    remainder="drop"
)

# Create a pipeline using the column transformer and a 10-nearest neighbors regressor
pipeline_neighborhood = make_pipeline(
    ct_neighborhood,
    KNeighborsRegressor(n_neighbors=10)
)

# Try to estimate the test error using cross validation
cv_scores_neighborhood = cross_val_score(pipeline_neighborhood, df_ames[["Neighborhood"]], df_ames["SalePrice"], cv=5, scoring='neg_mean_squared_error')
cv_scores_neighborhood

array([-4.11299122e+09, -2.51802089e+09, -2.79031387e+09, -3.63297657e+09,
       -2.56233932e+09])

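**The error occurs because some neighborhoods are rare: in a given fold, the validation split can contain a neighborhood that never appears in the training split, so the fitted `OneHotEncoder` has no column for it. The fix, already applied in the cell above, is to set the encoder's `handle_unknown` parameter to `'ignore'`. When the encoder then meets a category it did not see during training, it encodes it as all zeros instead of raising an error.**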
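The behavior of `handle_unknown` can be checked in isolation. The sketch below uses a tiny hand-made training set and a single test row whose category is absent from training; the three-row frame is an assumption for illustration, not a sample of the Ames data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (illustration only): "Landmrk" never appears in the training rows.
train = pd.DataFrame({"Neighborhood": ["NAmes", "NAmes", "Gilbert"]})
test = pd.DataFrame({"Neighborhood": ["Landmrk"]})

# Default behavior (handle_unknown="error"): transforming an unseen category fails.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError as err:
    raised = True
    print(f"Default encoder raised: {err}")

# With handle_unknown="ignore", the unseen category becomes a row of zeros.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = lenient.transform(test).toarray()
print(encoded)
```

A row of all zeros is equally distant from every known neighborhood's one-hot vector, so for that feature the $k$-nearest neighbors model treats the house as belonging to none of the neighborhoods seen in training.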

6\. Recall that in a previous notebook we fit a 10-nearest neighbors regression model that predicts the price (just **SalePrice**, not log) of a home using square footage (**Gr Liv Area**), number of bedrooms (**Bedroom AbvGr**), number of full bathrooms (**Full Bath**), number of half bathrooms (**Half Bath**), and **Neighborhood**. Fit this model and estimate its test error using cross-validation. Try a few different performance metrics.

In [15]:
from sklearn.preprocessing import StandardScaler

# Features to use
features = ["Gr Liv Area", "Bedroom AbvGr", "Full Bath", "Half Bath", "Neighborhood"]

# Column transformer setup
ct_full = make_column_transformer(
    (StandardScaler(), ["Gr Liv Area", "Bedroom AbvGr", "Full Bath", "Half Bath"]),
    (OneHotEncoder(handle_unknown='ignore'), ["Neighborhood"]),
    remainder="drop"
)

# Create a pipeline
pipeline_full = make_pipeline(
    ct_full,
    KNeighborsRegressor(n_neighbors=10)
)

# Mean squared error
mse_scores = cross_val_score(pipeline_full, df_ames[features], df_ames["SalePrice"], cv=5, scoring='neg_mean_squared_error')
mse_avg = -mse_scores.mean()

# Mean absolute error
mae_scores = cross_val_score(pipeline_full, df_ames[features], df_ames["SalePrice"], cv=5, scoring='neg_mean_absolute_error')
mae_avg = -mae_scores.mean()

# R^2 score
r2_scores = cross_val_score(pipeline_full, df_ames[features], df_ames["SalePrice"], cv=5, scoring='r2')
r2_avg = r2_scores.mean()

print(f"Average MSE: {mse_avg}")
print(f"Average MAE: {mae_avg}")
print(f"Average R^2 Score: {r2_avg}")



Average MSE: 1600499821.191833
Average MAE: 25584.599761092148
Average R^2 Score: 0.7470736050159748
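The cell above calls `cross_val_score` three times, which refits the same pipeline on the same folds for each metric. One alternative is `cross_validate`, which accepts a list of scorers and fits each fold once. The sketch below shows the pattern on synthetic data (an assumption, not the Ames data) with a simplified pipeline; in the notebook, `model`, `X`, and `y` would be `pipeline_full`, `df_ames[features]`, and `df_ames["SalePrice"]`.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10))

# One call scores every fold with all three metrics.
cv_results = cross_validate(
    model, X, y, cv=5,
    scoring=["neg_mean_squared_error", "neg_mean_absolute_error", "r2"],
)

# Results are keyed as "test_<scorer name>"; error scorers are negated.
mse_avg = -cv_results["test_neg_mean_squared_error"].mean()
mae_avg = -cv_results["test_neg_mean_absolute_error"].mean()
r2_avg = cv_results["test_r2"].mean()

print(f"Average MSE: {mse_avg}")
print(f"Average MAE: {mae_avg}")
print(f"Average R^2 Score: {r2_avg}")
```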


7\. Repeat the process in part 6 to fit $k$-nearest neighbors regression models for several values of $k$ (say $k=1, \ldots, 20$). Which value of $k$ produces the best test error? Try a few different performance metrics; does the best value of $k$ depend on the metric?

In [16]:
# YOUR CODE HERE. ADD CELLS AS NEEDED
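One possible approach to part 7 is a grid search over $k$ that records several metrics at once. The sketch below is not a worked solution on the Ames data: it uses synthetic data and a simplified pipeline (both assumptions) to show the pattern. The parameter name `kneighborsregressor__n_neighbors` follows from `make_pipeline`, which names each step after its lowercased class name.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=150)

pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())
scoring = ["neg_mean_squared_error", "neg_mean_absolute_error", "r2"]

search = GridSearchCV(
    pipeline,
    param_grid={"kneighborsregressor__n_neighbors": list(range(1, 21))},
    scoring=scoring,
    cv=5,
    refit=False,  # no single "best" model is refit when comparing several metrics
)
search.fit(X, y)

# All scorers follow "higher is better", so the best k maximizes each mean score.
ks = [p["kneighborsregressor__n_neighbors"] for p in search.cv_results_["params"]]
best_k = {}
for metric in scoring:
    mean_scores = search.cv_results_[f"mean_test_{metric}"]
    best_k[metric] = ks[int(np.argmax(mean_scores))]
    print(f"Best k by {metric}: {best_k[metric]}")
```

If the metrics disagree on the best $k$, that reflects what each one penalizes: MSE weights large errors more heavily than MAE does.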