# Preprocessing, Models and Results - Ames Housing Data

With the data cleaned and preliminary features selected during EDA, this notebook prepares the data for modeling, then fits and scores Linear Regression, Lasso and Ridge models.

## Import Data from EDA

In [500]:
# Import the usual suspects
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn import set_config
set_config(display='diagram')

In [443]:
train = pd.read_csv('../data/train_cleaned.csv')
train.drop(columns='Unnamed: 0', inplace=True)
test = pd.read_csv('../data/test_cleaned.csv')
test.drop(columns='Unnamed: 0', inplace=True)

## Linear Regression Numericals

Let's start with a simple Linear Regression on the numerical features selected in EDA. This is my baseline model, scored with Root Mean Squared Error (RMSE) to align with Kaggle.

In [444]:
X = train[['Overall Qual', 'Gr Liv Area', 'Garage Area', 'Year Built', 'Year Remod/Add']]
y = train['SalePrice']

In [445]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1331)

In [446]:
linreg = LinearRegression()
linreg.fit(X_train, y_train)

In [447]:
preds = linreg.predict(X_test)

### Baseline RMSE

In [448]:
mean_squared_error(y_test, preds, squared=False)

37279.67084087376
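
For reference, RMSE is the square root of the mean squared residual, so it is expressed in the same units as `SalePrice`. The sketch below computes it directly with NumPy on hypothetical toy prices (the `rmse` helper and the demo values are illustrative, not part of this notebook). It may also be useful in newer scikit-learn releases, where the `squared` argument of `mean_squared_error` has been deprecated in favor of `root_mean_squared_error`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical toy prices, not taken from the Ames data
y_true_demo = [200_000, 150_000, 310_000]
y_pred_demo = [210_000, 140_000, 300_000]

demo_rmse = rmse(y_true_demo, y_pred_demo)
print(demo_rmse)  # 10000.0 (every prediction is off by exactly 10,000)
```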

### Save Baseline for Kaggle submission

In [449]:
test['SalePrice'] = linreg.predict(test[['Overall Qual', 'Gr Liv Area', 'Garage Area', 'Year Built', 'Year Remod/Add']])

In [450]:
test[['Id', 'SalePrice']].to_csv('../data/submission_baseline.csv', index=False)

## Location, Location, Location model
Fit a Linear Regression using only the `Neighborhood` feature.

In [451]:
X = train[['Neighborhood']]
y = train['SalePrice']
kaggle = test[['Neighborhood']]

In [452]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1331)

In [453]:
ct = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=object)),
    remainder='passthrough',
    verbose_feature_names_out=False
)
ct

In [454]:
pipe = make_pipeline(ct, StandardScaler(with_mean=False), LinearRegression())
pipe
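
To make the encoding step concrete, here is a minimal sketch of what `OneHotEncoder(handle_unknown='ignore')` does when it meets a category it did not see during fitting. The two toy frames and the `Unseen` label are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frames; the labels are only illustrative
train_demo = pd.DataFrame({'Neighborhood': ['NAmes', 'OldTown', 'NAmes']})
new_demo = pd.DataFrame({'Neighborhood': ['OldTown', 'Unseen']})

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train_demo)

# The encoder returns a sparse matrix by default; densify it for display
encoded = enc.transform(new_demo).toarray()
print(enc.categories_[0])  # ['NAmes' 'OldTown']
print(encoded)             # 'OldTown' -> [0. 1.], 'Unseen' -> [0. 0.]
```

An unseen category becomes a row of zeros instead of raising an error, which keeps predictions on the Kaggle test set from failing. Because the encoded output can be sparse, the pipeline uses `StandardScaler(with_mean=False)`: centering would destroy sparsity, and scikit-learn refuses to center sparse input.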

In [455]:
pipe.fit(X_train, y_train)

In [456]:
pipe.score(X_test, y_test)

0.5985982825845704
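
Note that `score` on a regression pipeline reports the coefficient of determination (R²), not RMSE. A small sketch on synthetic data (all names and values below are illustrative assumptions) shows that it matches `r2_score` applied to the predictions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data, used only to illustrate the metric
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_demo, y_demo)

score_r2 = model.score(X_demo, y_demo)                # built-in regressor score
manual_r2 = r2_score(y_demo, model.predict(X_demo))   # same quantity, computed explicitly
print(score_r2, manual_r2)
```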

In [457]:
preds = pipe.predict(X_test)

In [458]:
mean_squared_error(y_test, preds, squared=False)

49726.203146518536

Using only neighborhood, our RMSE (about 49,726) is considerably worse than the baseline Linear Regression (about 37,280).

### Kaggle submission

In [459]:
test['SalePrice'] = pipe.predict(kaggle)

In [460]:
test[['Id', 'SalePrice']].to_csv('../data/submission_location_linreg.csv', index=False)

## Incorporate all location proxies identified in EDA

In addition to the `Neighborhood` feature, the following features were identified as location proxies in EDA:

- Lot Shape
- Lot Config
- Condition 1
- Condition 2

In [461]:
X = train[['Lot Shape', 'Lot Config', 'Neighborhood', 'Condition 1', 'Condition 2']]
y = train['SalePrice']
kaggle = test[['Lot Shape', 'Lot Config', 'Neighborhood', 'Condition 1', 'Condition 2']]

In [462]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1331)

In [463]:
ct2 = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=object)),
    remainder='passthrough',
    verbose_feature_names_out=False
)
ct2

In [464]:
pipe2 = make_pipeline(ct2, StandardScaler(with_mean=False), LinearRegression())
pipe2

In [465]:
pipe2.fit(X_train, y_train)

In [466]:
pipe2.score(X_test, y_test)

0.6262624740055753

In [467]:
preds = pipe2.predict(X_test)

In [468]:
mean_squared_error(y_test, preds, squared=False)

47982.07662062991

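Adding the other location proxies improves slightly on neighborhood alone (RMSE of about 47,982 vs. 49,726), but the result is still considerably worse than the baseline Linear Regression (about 37,280).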

### Kaggle submission

In [469]:
test['SalePrice'] = pipe2.predict(kaggle)

In [470]:
test[['Id', 'SalePrice']].to_csv('../data/submission_all_location_linreg.csv', index=False)

## Combine all selected features and fit on Linear Regression

### Train-Test split

In [471]:
X = train.drop(columns='SalePrice')
y = train['SalePrice']
kaggle = test[X.columns]

In [472]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1331)

### Column transformations

In [474]:
ct3 = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=object)),
    remainder='passthrough',
    verbose_feature_names_out=False
)
ct3

### Pipeline with Column transformer, Standard Scaler and Linear Regression

In [475]:
pipe3 = make_pipeline(ct3, StandardScaler(with_mean=False), LinearRegression())
pipe3

### Fit, Predict and Score

In [476]:
pipe3.fit(X_train, y_train)

In [477]:
pipe3.score(X_test, y_test)

0.8686755662913601

In [478]:
preds = pipe3.predict(X_test)

In [479]:
mean_squared_error(y_test, preds, squared=False)

28442.545078913496

Using all features selected in EDA, our RMSE (about 28,443) is considerably better than the baseline Linear Regression (about 37,280).

### Kaggle submission

In [480]:
test['SalePrice'] = pipe3.predict(kaggle)

In [481]:
test[['Id', 'SalePrice']].to_csv('../data/submission_all_EDA_linreg.csv', index=False)

## Lasso and Ridge using GridSearchCV

### Train-Test Split

In [482]:
X = train.drop(columns='SalePrice')
y = train['SalePrice']
kaggle = test[X.columns]

In [483]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1331)

### Column transformations

In [484]:
ct4 = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=object)),
    remainder='passthrough',
    verbose_feature_names_out=False
)
ct4

### Pipeline with Column transformer, Standard Scaler and Lasso Regression

In [485]:
pipe4 = make_pipeline(ct4, StandardScaler(with_mean=False), Lasso())
pipe4

In [486]:
params = {
    'lasso__alpha': [.01, .1, 1, 10, 100],
    'lasso__max_iter': [100_000]
}

### Grid Search

In [487]:
gs1 = GridSearchCV(pipe4, params, n_jobs=-1)
gs1
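
The grid keys follow the `<step name>__<parameter>` convention: `make_pipeline` names each step after its lowercased class name, so `lasso__alpha` reaches the `alpha` of the `Lasso` step. The sketch below illustrates this on synthetic data; the `demo_*` names, the data, and the explicit `cv=5` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, used only to illustrate the API
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5))
coef_demo = np.array([4.0, 0.0, -3.0, 0.0, 1.5])
y_demo = X_demo @ coef_demo + rng.normal(scale=0.5, size=120)

demo_pipe = make_pipeline(StandardScaler(), Lasso(max_iter=100_000))
step_names = [name for name, _ in demo_pipe.steps]
print(step_names)  # ['standardscaler', 'lasso']

# Keys are '<step name>__<parameter name>'
demo_grid = {'lasso__alpha': [0.01, 0.1, 1, 10, 100]}
demo_gs = GridSearchCV(demo_pipe, demo_grid, cv=5)
demo_gs.fit(X_demo, y_demo)

print(demo_gs.best_params_)  # the alpha with the best mean cross-validated score
print(demo_gs.best_score_)   # mean cross-validated R^2 for that alpha
```

After fitting, `best_params_` shows which `alpha` was selected. With the default `refit=True`, calls such as `predict` and `score` on the search object use the best estimator refit on the full training data.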

In [488]:
gs1.fit(X_train, y_train)
gs1.score(X_test, y_test)

0.8694410739351349

In [489]:
preds = gs1.predict(X_test)
mean_squared_error(y_test, preds, squared=False)

28359.526225239984

### Kaggle submission Lasso

In [490]:
test['SalePrice'] = gs1.predict(kaggle)

In [491]:
test[['Id', 'SalePrice']].to_csv('../data/submission_all_EDA_lasso.csv', index=False)

### Ridge Regression pipeline

In [492]:
pipe5 = make_pipeline(ct4, StandardScaler(with_mean=False), Ridge())
pipe5

In [493]:
pipe5.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'standardscaler', 'ridge', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__verbose_feature_names_out', 'columntransformer__onehotencoder', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'ridge__alpha', 'ridge__copy_X', 'ridge__fit_intercept', 'ridge__max_iter', 'ridge__normalize', 'ridge__positive', 'ridge__random_state', 'ridge__solver', 'ridge__tol'])

In [494]:
params = {
    'ridge__alpha': [.01, .1, 1, 10, 100],
    'ridge__max_iter': [100_000]
}

In [495]:
gs2 = GridSearchCV(pipe5, params, n_jobs=-1)
gs2

In [496]:
gs2.fit(X_train, y_train)
gs2.score(X_test, y_test)

0.8703680799485327

In [497]:
preds = gs2.predict(X_test)
mean_squared_error(y_test, preds, squared=False)

28258.666467040028
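
Lasso and Ridge end up within roughly 100 RMSE of each other here, but they regularize differently: the L1 penalty in Lasso can set coefficients exactly to zero, while the L2 penalty in Ridge only shrinks them. A minimal sketch on synthetic data (the data and the `*_demo` names are illustrative assumptions, not results from this notebook):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data in which two of the five features carry no signal
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([4.0, 0.0, -3.0, 0.0, 1.5])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

lasso_demo = Lasso(alpha=1.0).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
print(lasso_demo.coef_)
print(ridge_demo.coef_)
print(n_zero_lasso, n_zero_ridge)
```

Applying the same check to the `coef_` of the fitted Lasso step would show how many of the one-hot columns the model actually uses.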

### Kaggle submission

In [498]:
test['SalePrice'] = gs2.predict(kaggle)

In [499]:
test[['Id', 'SalePrice']].to_csv('../data/submission_all_EDA_ridge.csv', index=False)

## Bonus attempt for Kaggle: Polynomial features with Ridge

### Train-Test Split

In [508]:
X = train.drop(columns='SalePrice')
y = train['SalePrice']
kaggle = test[X.columns]

In [509]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1331)

In [501]:
pipe6 = make_pipeline(ct4, PolynomialFeatures(), StandardScaler(with_mean=False), Ridge())
pipe6
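
With its defaults (`degree=2`, `include_bias=True`), `PolynomialFeatures` adds a bias column, every squared term and every pairwise product, so `n` input columns become `(n + 1)(n + 2) / 2` output columns. A small sketch on a hypothetical 4-column array:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy input: 3 rows, 4 features
X_demo = np.arange(12, dtype=float).reshape(3, 4)

poly = PolynomialFeatures()  # defaults: degree=2, include_bias=True
X_poly = poly.fit_transform(X_demo)

n = X_demo.shape[1]
expected = (n + 1) * (n + 2) // 2  # 1 bias + 4 linear + 4 squares + 6 pairwise products
print(X_poly.shape[1], expected)   # 15 15
```

Because this step runs after one-hot encoding, the number of generated columns grows quadratically with the width of the encoded matrix, which makes the model much more flexible and may make it easier to overfit.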

In [505]:
pipe6.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'polynomialfeatures', 'standardscaler', 'ridge', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__verbose_feature_names_out', 'columntransformer__onehotencoder', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'polynomialfeatures__degree', 'polynomialfeatures__include_bias', 'polynomialfeatures__interaction_only', 'polynomialfeatures__order', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'ridge__alpha', 'ridge__copy_X', 'ridge__fit_intercept', 'ridge__max_iter', 'ridge__normalize', 'ridge__positive', 'ridge__random_state', 'ridge__solver'

In [506]:
params = {
    'ridge__alpha': [.01, .1, 1, 10, 100],
    'ridge__max_iter': [100_000]
}

In [507]:
gs3 = GridSearchCV(pipe6, params, n_jobs=-1)
gs3

In [510]:
gs3.fit(X_train, y_train)
gs3.score(X_test, y_test)

0.8576599948036657

In [511]:
preds = gs3.predict(X_test)
mean_squared_error(y_test, preds, squared=False)

29611.415986208664
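
To compare the models side by side, the sketch below collects the hold-out RMSE values printed in this notebook into a pandas Series and sorts them. The labels are shorthand chosen here; the numbers are copied from the cell outputs.

```python
import pandas as pd

# Hold-out RMSE values copied from the cell outputs in this notebook
rmse_results = pd.Series({
    'LinReg, numericals (baseline)': 37279.67084087376,
    'LinReg, neighborhood only': 49726.203146518536,
    'LinReg, all location proxies': 47982.07662062991,
    'LinReg, all EDA features': 28442.545078913496,
    'Lasso, all EDA features': 28359.526225239984,
    'Ridge, all EDA features': 28258.666467040028,
    'Poly + Ridge, all EDA features': 29611.415986208664,
})

ranked = rmse_results.sort_values()
best_model = ranked.index[0]
print(ranked.round(0))
print(best_model)  # Ridge, all EDA features
```

On this train-test split, the polynomial pipeline scores worse than plain Ridge, Lasso and Linear Regression on the same features.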

### Kaggle submission

In [512]:
test['SalePrice'] = gs3.predict(kaggle)

In [513]:
test[['Id', 'SalePrice']].to_csv('../data/submission_all_EDA_poly_ridge.csv', index=False)