# Model Fitting

This notebook will use the following models on the Ames, Iowa dataset: Ridge Regression, Lasso Regression, KNN, Decision Tree, and Support Vector Machines. For learning purposes, everything will be fitted using its default arguments.

## Data Preparation

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from tqdm import tqdm
from time import time

In [2]:
train = pd.read_csv('train.csv') #Load data
train.drop('Id', axis=1, inplace=True) #Drop ID column

# Change categorical variables from object type to category type
for column in train.select_dtypes(['object']).columns: 
    train[column] = train[column].astype('category')

# Change certain numeric variables into categorical variables
to_be_category = ['MSSubClass', 'OverallQual', 'OverallCond', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 
                 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'MoSold']
for column in to_be_category:
    train[column] = train[column].astype('category')

# Replace NA's in numeric variables with the mean
train.LotFrontage.fillna(train.LotFrontage.mean(), inplace=True)
train.MasVnrArea.fillna(train.MasVnrArea.mean(), inplace=True)
train.GarageYrBlt.fillna(train.GarageYrBlt.mean(), inplace=True)

# These NA's indicate that the house just doesn't have it
empty_means_without = ['Alley','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
                       'GarageType','GarageFinish','GarageQual','GarageCond','PoolQC','Fence','MiscFeature']
for feature in empty_means_without:
    train[feature].cat.add_categories(['None'], inplace=True)
    train[feature].fillna('None', inplace=True)

train.dropna(inplace=True) #Drop any remaining NA's
train.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
1,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
2,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
3,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
4,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000


In [3]:
train = pd.get_dummies(train) #One-hot encode
train = np.log(train + 1) #Deskew
train = train - train.mean()/(2*train.std()) #Scaling using Gelman's method of 2 SD
train.head()

Unnamed: 0,LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,-2.506367,0.236613,-238.982507,-356.525471,4.877854,5.854147,-0.17823,3.498545,3.815125,-4.305591,...,-0.029391,-0.149397,-0.022751,-0.594512,-0.136714,-0.026279,-0.043685,-0.05909,-0.38166,-0.151439
1,-2.301573,0.364195,-238.996072,-356.539036,-0.405349,6.179648,-0.17823,4.133754,4.202932,-3.917784,...,-0.029391,-0.149397,-0.022751,-0.594512,-0.136714,-0.026279,-0.043685,-0.05909,-0.38166,-0.151439
2,-2.461915,0.522785,-238.983506,-356.52597,4.688401,5.481381,-0.17823,4.556611,3.887147,-4.233569,...,-0.029391,-0.149397,-0.022751,-0.594512,-0.136714,-0.026279,-0.043685,-0.05909,-0.38166,-0.151439
3,-2.585148,0.358974,-239.027413,-356.542075,-0.405349,4.673014,-0.17823,4.774684,3.69105,-4.190015,...,-0.029391,-0.149397,-0.022751,-0.594512,0.556433,-0.026279,-0.043685,-0.05909,-1.074807,-0.151439
4,-2.25337,0.759856,-238.984005,-356.526969,5.455437,5.779278,-0.17823,4.677709,4.10572,-4.014996,...,-0.029391,-0.149397,-0.022751,-0.594512,-0.136714,-0.026279,-0.043685,-0.05909,-0.38166,-0.151439


The target feature of this dataset is the SalePrice and all the other variables are predictor features.

In [4]:
target = train['SalePrice']
features = train.drop(['SalePrice'], axis = 1)

## Model Fitting

The following functions will be used to run the models and to see results.

In [5]:
# Getting the Function Call Time
def time_function_call(function_call):
    start = time()
    result = function_call
    execution_time = time() - start
    return result, execution_time

# Running the Model
def run_model(model, model_name, features, target):
    
    x_train, x_test, y_train, y_test = train_test_split(features, target, random_state = 100)
    
    _, fit_time = time_function_call(model.fit(x_train, y_train))
    _, train_pred_time = time_function_call(model.predict(x_train))
    _, test_pred_time = time_function_call(model.predict(x_test)) 
    
    return {
            'Model' : model,
            'Model Name' : model_name,
            'Train Score' : model.score(x_train, y_train),
            'Test Score' : model.score(x_test, y_test),
            #'Fit Time' : fit_time,
            #'Train Prediction Time' : train_pred_time,
            #'Test Prediction Time' : test_pred_time
    }

model_fit = []

### Ridge Regression

Ridge regression is a common type of regularized linear regression algorithm. Regularization is a technique used to prevent overfitting by artificially penalizing model coefficients. It dampens large coefficients or removes features entirely. Ridge performs a linear regression with an L1-regularization.

Ridge regression penalizes the squared size of coefficients, leading to smaller coefficients without forcing them to 0. Stronger penalty leads to coefficients pushed closer to zero. It offers feature shrinkage. 

In [6]:
model_fit.append(run_model(Ridge(), 'Ridge', features, target))

### Lasso Regression

Lasso stands for Least Absolute Shrinkage and Selection Operator and is another common type of regularized linear regression algorithm. Lasso performs a linear regression with an L2-regularization. 

Lasso regression penalizes the absolute size of coefficients, leading coefficients to become exactly 0. Stronger penalty leads to more coefficients pushed to zero. It offers automatic feature selection by completely remove some features. 

In [7]:
model_fit.append(run_model(Lasso(alpha = 0.1), 'Lasso', features, target))

### K Nearest Neighbors Regressor

KNN is a non-parametric, instance-based algorithm that stores all available cases and predict the numerical target based on a similarity measure. It searches the memorized training observations for the K instances that most closely resemble the new instance and assigns to it the their most common class.

In [8]:
model_fit.append(run_model(KNeighborsRegressor(), 'KNN', features, target))

### Decision Tree

Decision trees model data as a "tree" of hierarchical branches. They make branches until they reach "leaves" that represent predictions. Due to their branching structure, decision trees can easily model nonlinear relationships. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Beware that individual unconstrained decision trees are very prone to overfitting. 

In [9]:
model_fit.append(run_model(DecisionTreeRegressor(), 'Decision Tree', features, target))

### Support Vector Machines

SVM is a discriminative classifier formally defined by a separating hyperplane. The algorithm outputs an optimal hyperplane which categorizes new examples. In two dimentional space, this hyperplane is a line dividing a plane in two parts where in each class lay in either side.

In [10]:
model_fit.append(run_model(SVR(), 'SVR', features, target))

## Results

In [11]:
model_fit = pd.DataFrame(model_fit)
cols = ['Model Name','Model', 'Train Score', 'Test Score']
#cols = ['Model Name','Model', 'Train Score', 'Test Score', 'Fit Time', 'Train Prediction Time', 'Test Prediction Time']
model_fit = model_fit[cols]
model_fit

Unnamed: 0,Model Name,Model,Train Score,Test Score
0,Ridge,"Ridge(alpha=1.0, copy_X=True, fit_intercept=Tr...",0.949318,0.891713
1,Lasso,"Lasso(alpha=0.1, copy_X=True, fit_intercept=Tr...",0.443075,0.442686
2,KNN,"KNeighborsRegressor(algorithm='auto', leaf_siz...",0.76437,0.639967
3,Decision Tree,"DecisionTreeRegressor(criterion='mse', max_dep...",1.0,0.682992
4,SVR,"SVR(C=1.0, cache_size=200, coef0=0.0, degree=3...",0.899767,0.868764


All these models need to be tuned. However, looking at the results, the following conclusions can be made:
- Ridge performed better than the other models. 
- As expected, unregulated Decision Tree is overfit and the difference between train and test score indicates this.
- Lasso performed significantly worse than the other models, which was suprising and needs to be tuned.