# House Prices: Advanced Regression Techniques (Kaggle Competition) - Predictions, 1st Pass

This notebook follows the process of my first attempt at making predictions from the dataset. For the sake of learning the process end to end, I brushed over some steps that could definitely improve my score, including:

* Accounting for multicollinearity
* Feature engineering
* Feature Selection / Dimensionality Reduction (based on evidence other than what I learned from EDA)
* Model Selection

I will give these more attention in my next attempt.

## Get the Data

In [1]:
from zipfile import ZipFile

# Having some trouble with Kaggle API at the moment, but in future try to download data programmatically if possible

ZIP_PATH = "data/house-prices-advanced-regression-techniques.zip"

with ZipFile(ZIP_PATH, 'r') as zip:
    zip.extractall('data')

In [2]:
import pandas as pd

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

## Data Preparation


### Feature Selection

We could do more here to select more statistically significant features, but for now I'm going to brush over it for now. I plan on taking the time to learn feature selection methods properly in the near future rather than doing a rushed job of it now. The features I have chosen are the attributes that seem to be promising predictors to me based on the exploratory analysis.

In [3]:
train_X = train[['GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt', 'Fireplaces', 'LotFrontage', 'Neighborhood', 'OverallQual', 'ExterQual', 'BsmtQual', 'KitchenQual']]
train_y = train[['SalePrice']]

### Data Preprocessing

In [4]:
train_X.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   GrLivArea     1460 non-null   int64  
 1   GarageCars    1460 non-null   int64  
 2   TotalBsmtSF   1460 non-null   int64  
 3   FullBath      1460 non-null   int64  
 4   TotRmsAbvGrd  1460 non-null   int64  
 5   YearBuilt     1460 non-null   int64  
 6   Fireplaces    1460 non-null   int64  
 7   LotFrontage   1201 non-null   float64
 8   Neighborhood  1460 non-null   object 
 9   OverallQual   1460 non-null   int64  
 10  ExterQual     1460 non-null   object 
 11  BsmtQual      1423 non-null   object 
 12  KitchenQual   1460 non-null   object 
dtypes: float64(1), int64(8), object(4)
memory usage: 148.4+ KB


There are some null values to deal with. 
* We will impute `LotFrontage` with the median
* The 37 values missing from `BsmtQual` are homes with no basement (there are 37 homes with `TotalBsmtSF` = 0). We will add a new category for these.

In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

num_features = ['GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt', 'Fireplaces', 'LotFrontage']
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_features = ['Neighborhood']
cat_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder())
])

ord_features = ['OverallQual', 'ExterQual', 'BsmtQual', 'KitchenQual']
ord_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='NoBsmt')),
    ('ordinal', OrdinalEncoder())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features),
        ('ord', ord_transformer, ord_features),
    ]
)

train_X = preprocessor.fit_transform(train_X)

## Model Selection

I'm still learning about what kind of models work best for different kinds of data, so I'm just going to pick a bunch of models from different categories of approaches for regression and throw the kitchen sink at this.

In [6]:
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR

models = [
    {'name': 'Linear Regression', 'obj': LinearRegression()},
    {'name': 'Stochastic Gradient Descent', 'obj': SGDRegressor()},
    {'name': 'K Neighbors Regressor', 'obj': KNeighborsRegressor()},
    {'name': 'Decision Tree Regressor', 'obj': DecisionTreeRegressor()},
    {'name': 'Random Forest Regressor', 'obj': RandomForestRegressor()},
    {'name': 'Kernel Ridge', 'obj': KernelRidge()},
    {'name': 'Support Vector Regressor', 'obj': SVR()}
]

In [7]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_log_error

for mdl in models:  
    cv_mae = cross_val_score(mdl['obj'], train_X, train_y, cv=5, scoring='neg_mean_absolute_error')
    print()
    print(mdl['name'])
    print("MAE mean:", -(cv_mae.mean()))
    print("MAE Standard deviation:", cv_mae.std())


Linear Regression
MAE mean: 21696.961362291164
MAE Standard deviation: 1142.9194033326394

Stochastic Gradient Descent
MAE mean: 25152.581616050975
MAE Standard deviation: 1372.4404521678796

K Neighbors Regressor
MAE mean: 21399.802054794523
MAE Standard deviation: 1746.853116148743

Decision Tree Regressor
MAE mean: 27881.964155251142
MAE Standard deviation: 2503.2013973197954

Random Forest Regressor
MAE mean: 19346.336112720157
MAE Standard deviation: 1101.763148260912

Kernel Ridge
MAE mean: 23597.89819824807
MAE Standard deviation: 1279.337005615513

Support Vector Regressor
MAE mean: 55473.54456463078
MAE Standard deviation: 3359.595186863918


## Model Tuning

In [8]:
from pprint import pprint
rfr = RandomForestRegressor(random_state=33)
pprint(rfr.get_params())

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 33,
 'verbose': 0,
 'warm_start': False}


In [9]:
from sklearn.model_selection import GridSearchCV
# I tried multiple different grids to achieve a better score, refining the parameters with each iteration. These are the last set I tried
param_grid = {   
        'bootstrap': [True],
        'n_estimators': [250, 300, 350, 400], 
        'max_features': [5, 6], 
        'max_depth': [25, 30, 60, 120],
}

rfr = RandomForestRegressor(random_state=33)
grid_search = GridSearchCV(rfr, param_grid, cv=5, scoring='neg_mean_absolute_error')
grid_search.fit(train_X, train_y)

GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=33),
             param_grid={'bootstrap': [True], 'max_depth': [25, 30, 60, 120],
                         'max_features': [5, 6],
                         'n_estimators': [250, 300, 350, 400]},
             scoring='neg_mean_absolute_error')

In [10]:
print(grid_search.best_params_)
print(grid_search.best_score_)

{'bootstrap': True, 'max_depth': 30, 'max_features': 5, 'n_estimators': 300}
-18193.76072973703


## Make Predictions

In [11]:
mdl_final = grid_search.best_estimator_

test_X = test # the target variable has already been removed from the test set provided by kaggle

test_X = preprocessor.fit_transform(test_X)
predictions = mdl_final.predict(test_X)
predictions

array([119257.28111111, 152035.03      , 177325.24166667, ...,
       160348.49333333, 115330.23333333, 222651.78      ])

In [12]:
submission_dict = {'Id': test['Id'], 'SalePrice': predictions}
submission_df = pd.DataFrame(data=submission_dict)
submission_df.to_csv('predictions_1.csv', index=False)