# Preprocessing, Models and Results - Ames Housing Data

Having cleaned the data and selected preliminary features in EDA, in this notebook we prepare the data for modeling, then fit and score Linear Regression, Lasso and Ridge models.

## Import Data from EDA

In [406]:
# Import the usual suspects
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn import set_config
set_config(display='diagram')

In [360]:
train = pd.read_csv('../data/train_cleaned.csv')
train.drop(columns='Unnamed: 0', inplace=True)
test = pd.read_csv('../data/test_cleaned.csv')
test.drop(columns='Unnamed: 0', inplace=True)

## Linear Regression on Numerical Features

Let's start with a simple Linear Regression on the numerical features selected in EDA. I will use this as my baseline model, scored with Root Mean Squared Error (RMSE) to align with Kaggle.
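As a reminder of what the score measures, here is a minimal sketch of RMSE computed by hand. The `rmse` helper and the toy prices are hypothetical and only for illustration; the notebook itself relies on scikit-learn's `mean_squared_error`.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Square each residual, average the squares, then take the square root
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy sale prices (made up), not taken from the Ames data
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]
toy_rmse = rmse(actual, predicted)
```

Because the errors are squared before averaging, RMSE is in the same units as `SalePrice` (dollars) and penalizes large misses more heavily than small ones.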

In [361]:
X = train[['Overall Qual', 'Gr Liv Area', 'Garage Area', 'Year Built', 'Year Remod/Add']]
y = train['SalePrice']

In [362]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [363]:
linreg = LinearRegression()
linreg.fit(X_train, y_train)

In [364]:
preds = linreg.predict(X_test)

### Baseline RMSE

In [365]:
mean_squared_error(y_test, preds, squared=False)

35294.454074127374

### Save Baseline for Kaggle submission

In [366]:
test['SalePrice'] = linreg.predict(test[['Overall Qual', 'Gr Liv Area', 'Garage Area', 'Year Built', 'Year Remod/Add']])

In [367]:
test[['Id', 'SalePrice']].to_csv('../data/submission_baseline.csv', index=False)

## Location, Location, Location model
Fit a Linear Regression using only the Neighborhood feature.

In [368]:
X = train[['Neighborhood']]
y = train['SalePrice']
neigh_test = test[['Neighborhood']]

In [369]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [370]:
ct = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=object)),
    remainder='passthrough',
    verbose_feature_names_out=False
)
ct

In [371]:
pipe = make_pipeline(ct, StandardScaler(with_mean=False), LinearRegression())
pipe
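A short sketch of why the scaler in this pipeline is built with `with_mean=False`, assuming the encoder is left at its default sparse output. The neighborhood values below are a toy sample, and the unseen category is hypothetical.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy single-column categorical feature (illustrative values only)
neighborhoods = np.array([['NAmes'], ['CollgCr'], ['NAmes'], ['OldTown']], dtype=object)

encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(neighborhoods)
is_sparse = sparse.issparse(encoded)  # the encoder returns a sparse matrix by default

# Centering would turn the zeros into non-zeros, so StandardScaler refuses sparse input
try:
    StandardScaler().fit(encoded)
    centering_failed = False
except ValueError:
    centering_failed = True

# Scaling by standard deviation only keeps the matrix sparse
scaled = StandardScaler(with_mean=False).fit_transform(encoded)

# A category unseen at fit time is encoded as a row of zeros instead of raising
unseen_row = encoder.transform(np.array([['Somerset']], dtype=object))
```

The `handle_unknown='ignore'` setting matters for the Kaggle file: a category that appears there but not in the training split would otherwise raise an error at prediction time.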

In [372]:
pipe.fit(X_train, y_train)

In [373]:
pipe.score(X_test, y_test)

0.5467053704757524

In [374]:
preds = pipe.predict(X_test)

In [375]:
mean_squared_error(y_test, preds, squared=False)

51741.6741156098

Our RMSE using only neighborhood (about 51,742) is considerably worse than our baseline Linear Regression (about 35,294).

### Kaggle submission

In [376]:
test['SalePrice'] = pipe.predict(neigh_test)

In [377]:
test[['Id', 'SalePrice']].to_csv('../data/submission_location_linreg.csv', index=False)

## Incorporate all location proxies identified in EDA

In addition to the Neighborhood feature, the following features were identified as location proxies in EDA:

- Lot Shape
- Lot Config
- Condition 1
- Condition 2

In [378]:
X = train[['Lot Shape', 'Lot Config', 'Neighborhood', 'Condition 1', 'Condition 2']]
y = train['SalePrice']
neigh_test = test[['Lot Shape', 'Lot Config', 'Neighborhood', 'Condition 1', 'Condition 2']]

In [379]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [380]:
ct2 = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=object)),
    remainder='passthrough',
    verbose_feature_names_out=False
)
ct2

In [381]:
pipe2 = make_pipeline(ct2, StandardScaler(with_mean=False), LinearRegression())
pipe2

In [386]:
pipe2.fit(X_train, y_train)

In [387]:
pipe2.score(X_test, y_test)

0.610617647492115

In [388]:
preds = pipe2.predict(X_test)

In [389]:
mean_squared_error(y_test, preds, squared=False)

52098.59375019075

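Our RMSE using all location proxies (about 52,099) is still considerably worse than our baseline Linear Regression (about 35,294), and no better than Neighborhood alone (about 51,742) even though R² rose from 0.547 to 0.611. Each model was scored on a different random split, so small differences between scores should not be over-interpreted.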
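Each call to `train_test_split` in this notebook draws a fresh random split, so the models are not scored on the same held-out rows. One possible way to make scores comparable, sketched here on toy arrays and assuming a fixed seed is acceptable, is to pass `random_state`. The seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real feature matrix and target
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The same seed yields the same partition on every call
split_a = train_test_split(X_demo, y_demo, random_state=42)
split_b = train_test_split(X_demo, y_demo, random_state=42)

# Index 1 of the returned list is the held-out feature matrix
same_test_rows = np.array_equal(split_a[1], split_b[1])
```

With a shared seed, a change in RMSE between two models reflects the features or estimator rather than which houses happened to land in the test set.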

### Kaggle submission

In [390]:
test['SalePrice'] = pipe2.predict(neigh_test)

In [391]:
test[['Id', 'SalePrice']].to_csv('../data/submission_all_location_linreg.csv', index=False)

## Combine all selected features and fit on Linear Regression

### Train-Test split

In [394]:
X = train.drop(columns='SalePrice')
y = train['SalePrice']
kaggle = test[X.columns]

In [395]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

### Column transformations

In [396]:
ct3 = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=object)),
    remainder='passthrough',
    verbose_feature_names_out=False
)
ct3

### Pipeline with Column transformer, Standard Scaler and Linear Regression

In [398]:
pipe3 = make_pipeline(ct3, StandardScaler(with_mean=False), LinearRegression())
pipe3

### Fit, Predict and Score

In [399]:
pipe3.fit(X_train, y_train)

In [400]:
pipe3.score(X_test, y_test)

0.867614085009067

In [401]:
preds = pipe3.predict(X_test)

In [402]:
mean_squared_error(y_test, preds, squared=False)

29363.425624948224

Our RMSE using all features selected in EDA (about 29,363) is considerably better than our baseline Linear Regression (about 35,294).

### Kaggle submission

In [403]:
test['SalePrice'] = pipe3.predict(kaggle)

In [404]:
test[['Id', 'SalePrice']].to_csv('../data/submission_all_EDA_linreg.csv', index=False)

## Lasso and Ridge using GridSearchCV

### Train-Test Split

In [407]:
X = train.drop(columns='SalePrice')
y = train['SalePrice']
kaggle = test[X.columns]

In [408]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

### Column transformations

In [410]:
ct4 = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=object)),
    remainder='passthrough',
    verbose_feature_names_out=False
)
ct4

### Pipeline with Column transformer, Standard Scaler and Lasso Regression

In [414]:
pipe4 = make_pipeline(ct4, StandardScaler(with_mean=False), Lasso())
pipe4

In [428]:
params = {
    'lasso__alpha': [.01, .1, 1, 10, 100],
    'lasso__max_iter': [100_000]
}

### Grid Search

In [429]:
gs1 = GridSearchCV(pipe4, params, n_jobs=-1)
gs1

In [430]:
gs1.fit(X_train, y_train)
gs1.score(X_test, y_test)

0.8674996843166977
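The score above is the R² of the best estimator on the held-out split. To see which `alpha` the search settled on, the fitted object exposes `best_params_` and `best_score_`. The sketch below runs on synthetic data from `make_regression` rather than the Ames data, and it swaps in `scoring='neg_root_mean_squared_error'` (an assumption, not what the notebook uses) so that the cross-validated score is reported as RMSE.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the housing data
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

demo_pipe = make_pipeline(StandardScaler(), Lasso(max_iter=100_000))
demo_grid = {'lasso__alpha': [0.01, 0.1, 1, 10, 100]}

# 5-fold cross-validation; scikit-learn maximizes scores, hence the negated RMSE
demo_gs = GridSearchCV(demo_pipe, demo_grid, scoring='neg_root_mean_squared_error', cv=5)
demo_gs.fit(X_demo, y_demo)

best_alpha = demo_gs.best_params_['lasso__alpha']
best_cv_rmse = -demo_gs.best_score_  # flip the sign back to a positive RMSE
```

If the selected `alpha` sits at either end of the grid, that is usually a hint to extend the grid in that direction.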

In [431]:
preds = gs1.predict(X_test)
mean_squared_error(y_test, preds, squared=False)

32356.229975264556

### Kaggle submission Lasso

In [432]:
test['SalePrice'] = gs1.predict(kaggle)

In [433]:
test[['Id', 'SalePrice']].to_csv('../data/submission_all_EDA_lasso.csv', index=False)

### Ridge Regression pipeline

In [434]:
pipe5 = make_pipeline(ct4, StandardScaler(with_mean=False), Ridge())
pipe5

In [435]:
pipe5.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'standardscaler', 'ridge', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__verbose_feature_names_out', 'columntransformer__onehotencoder', 'columntransformer__onehotencoder__categories', 'columntransformer__onehotencoder__drop', 'columntransformer__onehotencoder__dtype', 'columntransformer__onehotencoder__handle_unknown', 'columntransformer__onehotencoder__sparse', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'ridge__alpha', 'ridge__copy_X', 'ridge__fit_intercept', 'ridge__max_iter', 'ridge__normalize', 'ridge__positive', 'ridge__random_state', 'ridge__solver', 'ridge__tol'])
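The keys listed above follow the `<step>__<parameter>` naming convention, where `make_pipeline` names each step after its lowercased class name. That convention is what makes `ridge__alpha` a valid grid key. A minimal sketch on a two-step pipeline (a simplified stand-in for `pipe5`, without the column transformer):

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = make_pipeline(StandardScaler(with_mean=False), Ridge())

# Step names are generated from the class names, lowercased
step_names = list(demo_pipe.named_steps)

# A double underscore routes the parameter to the named step
demo_pipe.set_params(ridge__alpha=10)
alpha = demo_pipe.named_steps['ridge'].alpha
```

`GridSearchCV` applies each grid entry through the same `set_params` mechanism, so a misspelled step name is rejected with an error rather than silently ignored.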

In [436]:
params = {
    'ridge__alpha': [.01, .1, 1, 10, 100],
    'ridge__max_iter': [100_000]
}

In [437]:
gs2 = GridSearchCV(pipe5, params, n_jobs=-1)
gs2

In [438]:
gs2.fit(X_train, y_train)
gs2.score(X_test, y_test)

0.8667845607423209

In [439]:
preds = gs2.predict(X_test)
mean_squared_error(y_test, preds, squared=False)

32443.428130008357

### Kaggle submission Ridge

In [440]:
test['SalePrice'] = gs2.predict(kaggle)

In [441]:
test[['Id', 'SalePrice']].to_csv('../data/submission_all_EDA_ridge.csv', index=False)