# Advanced Modelling

Now that I have a base model to work with, let's see whether I can improve on it or build a better competing model. In this part of the project, I will attempt to optimize the Ridge Regression base model. In addition, I will build Random Forest, Light Gradient Boosting Machine (LGBM), and Extreme Gradient Boosting (XGBoost) models and compare their performance against the optimized Ridge Regression.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import pickle

In [2]:
df = pd.read_csv('../capstone2-housing/documents/final_housing_df.csv', index_col=0)
f_pairs = pd.read_csv('../capstone2-housing/documents/feature_pairs_df.csv', index_col=0)
ft_pairs = pd.read_csv('../capstone2-housing/documents/feature_target_df.csv', index_col=0)

In [3]:
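def load_pickle(path):
    # Open with a context manager so the file handle is always closed
    with open(path, 'rb') as f:
        return pickle.load(f)

# Train/test splits saved earlier in the project
X_train = load_pickle('X_train')
X_test = load_pickle('X_test')
y_train = load_pickle('y_train')
y_test = load_pickle('y_test')

# Ridge Regression base model
model1 = load_pickle('RR_base')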

Okay, now that everything has been imported, the fun can begin. First, I want to try to improve on my Ridge Regression base model. In the EDA part of this project, I created two data sets: ***Feature - Target*** and ***Feature - Feature*** pairs. Let's take a look at them below.

In [4]:
print(ft_pairs.info())
print(ft_pairs)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25 entries, 0 to 24
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Feature    25 non-null     object 
 1   Corr Coef  25 non-null     float64
dtypes: float64(1), object(1)
memory usage: 600.0+ bytes
None
                 Feature  Corr Coef
0              GrLivArea   0.708624
1             GarageCars   0.640409
2             GarageArea   0.623431
3            TotalBsmtSF   0.613581
4               1stFlrSF   0.605852
5           ExterQual_TA   0.589044
6               FullBath   0.560664
7            BsmtQual_Ex   0.553105
8           TotRmsAbvGrd   0.533723
9         KitchenQual_TA   0.519298
10        KitchenQual_Ex   0.504094
11      Foundation_PConc   0.497734
12            MasVnrArea   0.472614
13        FireplaceQu_Na   0.471908
14            Fireplaces   0.466929
15          ExterQual_Gd   0.452466
16           BsmtQual_TA   0.452394
17          ExterQual_Ex   0

In [5]:
print(f_pairs.info())
print(f_pairs)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 138 entries, 0 to 137
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Feature1   138 non-null    object 
 1   Feature2   138 non-null    object 
 2   Corr Coef  138 non-null    float64
dtypes: float64(1), object(2)
memory usage: 4.3+ KB
None
                 Feature1          Feature2  Corr Coef
0            ExterQual_Gd      ExterQual_TA   0.906121
1              Fireplaces    FireplaceQu_Na   0.900457
2              GarageCars        GarageArea   0.882475
3               GrLivArea      TotRmsAbvGrd   0.825489
4          KitchenQual_Gd    KitchenQual_TA   0.824457
..                    ...               ...        ...
133  Neighborhood_NridgHt    KitchenQual_Ex   0.409159
134           BsmtQual_TA  GarageFinish_Unf   0.408701
135          ExterQual_TA      HeatingQC_TA   0.407461
136              FullBath        GarageArea   0.405656
137           TotalBsmtSF       Bs

I have 25 features that are relatively highly correlated with the target, and 138 feature-feature pairs whose two members are highly correlated with each other. To begin with, I will create a filter from the 25 features in the feature-target set. I will then build a new X_train with this filter and refit my model to see how it performs.

In [6]:
ft_pair_list = list(ft_pairs['Feature'])
print(ft_pair_list)

['GrLivArea', 'GarageCars', 'GarageArea', 'TotalBsmtSF', '1stFlrSF', 'ExterQual_TA', 'FullBath', 'BsmtQual_Ex', 'TotRmsAbvGrd', 'KitchenQual_TA', 'KitchenQual_Ex', 'Foundation_PConc', 'MasVnrArea', 'FireplaceQu_Na', 'Fireplaces', 'ExterQual_Gd', 'BsmtQual_TA', 'ExterQual_Ex', 'BsmtFinType1_GLQ', 'HeatingQC_Ex', 'OverallQual_8', 'GarageFinish_Fin', 'GarageFinish_Unf', 'OverallQual_9', 'Neighborhood_NridgHt']
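The feature-feature pairs are not used in this first filter. As a side note, here is a minimal sketch of how they could later be used to thin the list: for each strongly correlated pair, drop the member that is less correlated with the target. The function name `prune_correlated` and the 0.8 threshold are assumptions made for illustration, not something used elsewhere in this project; the numbers are the first four rows of the feature-feature table and the matching feature-target coefficients printed above.

```python
def prune_correlated(pairs, target_corr, threshold=0.8):
    """Return the set of features to drop (hypothetical helper).

    pairs: iterable of (feature1, feature2, corr) tuples.
    target_corr: dict mapping feature name -> correlation with the target.
    threshold: assumed cut-off; pairs at or below it are left alone.
    """
    dropped = set()
    # Visit the most strongly correlated pairs first.
    for f1, f2, corr in sorted(pairs, key=lambda p: abs(p[2]), reverse=True):
        if abs(corr) <= threshold:
            break  # remaining pairs are below the cut-off
        if f1 in dropped or f2 in dropped:
            continue  # this pair has already been broken up
        if f1 not in target_corr or f2 not in target_corr:
            continue  # only prune among the selected features
        # Drop whichever member is less correlated with the target.
        weaker = f1 if abs(target_corr[f1]) < abs(target_corr[f2]) else f2
        dropped.add(weaker)
    return dropped


# First four rows of the feature-feature table shown above
pairs = [
    ('ExterQual_Gd', 'ExterQual_TA', 0.906121),
    ('Fireplaces', 'FireplaceQu_Na', 0.900457),
    ('GarageCars', 'GarageArea', 0.882475),
    ('GrLivArea', 'TotRmsAbvGrd', 0.825489),
]
# Matching rows of the feature-target table shown above
target_corr = {
    'GrLivArea': 0.708624, 'GarageCars': 0.640409, 'GarageArea': 0.623431,
    'ExterQual_TA': 0.589044, 'TotRmsAbvGrd': 0.533723,
    'FireplaceQu_Na': 0.471908, 'Fireplaces': 0.466929,
    'ExterQual_Gd': 0.452466,
}

to_drop = prune_correlated(pairs, target_corr)
print(sorted(to_drop))
# ['ExterQual_Gd', 'Fireplaces', 'GarageArea', 'TotRmsAbvGrd']
```

With the real DataFrames, the inputs could be built with `f_pairs.itertuples(index=False, name=None)` and `dict(zip(ft_pairs['Feature'], ft_pairs['Corr Coef']))`.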


In [7]:
X_train_new = X_train[ft_pair_list]
X_train_new.head()

| | GrLivArea | GarageCars | GarageArea | TotalBsmtSF | 1stFlrSF | ExterQual_TA | FullBath | BsmtQual_Ex | TotRmsAbvGrd | KitchenQual_TA | ... | ExterQual_Gd | BsmtQual_TA | ExterQual_Ex | BsmtFinType1_GLQ | HeatingQC_Ex | OverallQual_8 | GarageFinish_Fin | GarageFinish_Unf | OverallQual_9 | Neighborhood_NridgHt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.613684 | -0.729569 | 1.276231 | 0.297379 | 0.063291 | 0.0 | -1.027466 | 0.0 | -0.952983 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | -1.520937 | -2.201913 | -0.753917 | -0.791134 | -1.172651 | 0.0 | -1.027466 | 0.0 | -0.952983 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | -0.559989 | 0.143455 | 0.552845 | 0.297379 | 0.136438 | 0.0 | 0.781337 | 0.0 | -0.952983 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | -0.71737 | -0.073621 | 1.206226 | 0.172977 | -0.07796 | 0.0 | -1.027466 | 0.0 | -0.952983 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | -0.074887 | 0.431317 | 0.366165 | 0.886064 | 0.797289 | 1.0 | 0.781337 | 0.0 | -0.952983 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |


Now that I have a new training data set, let's refit my model.

In [8]:
model1.fit(X_train_new, y_train)
mod1_y_train_pred = model1.predict(X_train_new)
mod1_r2_train = model1.score(X_train_new, y_train)
mod1_mae_train = mean_absolute_error(y_train, mod1_y_train_pred)
print('Ridge Regression R2 score:', mod1_r2_train, ', Ridge Regression MAE:', mod1_mae_train)

Ridge Regression R2 score: 0.7446356405249538 , Ridge Regression MAE: 25526.80880596362


Reminder: the base model training score was:  
*Ridge R2 score: 0.9541983341629067 , Ridge MAE: 11022.373245216*  
Does the lower training score mean that feature selection has made the model less overfitted? The training score alone cannot answer that; it has to be compared with the test score.  
  
I will apply the same feature selection filter on my test set and see how the model performs.

In [9]:
X_test_new = X_test[ft_pair_list]

In [10]:
mod1_y_test_pred = model1.predict(X_test_new)
mod1_r2_test = model1.score(X_test_new, y_test)
mod1_mae_test = mean_absolute_error(y_test, mod1_y_test_pred)
print('Ridge Regression R2 score:', mod1_r2_test, ', Ridge Regression MAE:', mod1_mae_test)

Ridge Regression R2 score: 0.7387245223232938 , Ridge Regression MAE: 26260.3975642835


Indeed, compared to the base model test scores:  
*Ridge R2 score: 0.8353466836793504 , Ridge MAE: 20668.859316633*  
this model performs worse after feature selection. However, the gap between the train and test scores is now very small, which suggests that the filtered model overfits far less than the base model.
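To put numbers on that comparison, the short sketch below computes the train-test gap for both models from the scores reported in this notebook (rounded). It is only arithmetic on the printed results; the dictionary layout and names are my own choice for illustration.

```python
# Scores reported in this notebook, rounded (R2 to 4 decimals, MAE to 2)
scores = {
    'base Ridge (all features)': {
        'train_r2': 0.9542, 'test_r2': 0.8353,
        'train_mae': 11022.37, 'test_mae': 20668.86,
    },
    'Ridge (25 selected features)': {
        'train_r2': 0.7446, 'test_r2': 0.7387,
        'train_mae': 25526.81, 'test_mae': 26260.40,
    },
}

gaps = {}
for name, s in scores.items():
    gaps[name] = {
        # How much R2 is lost when moving from train to test
        'r2_gap': round(s['train_r2'] - s['test_r2'], 4),
        # How much MAE grows when moving from train to test
        'mae_gap': round(s['test_mae'] - s['train_mae'], 2),
    }
    print(f"{name}: R2 gap = {gaps[name]['r2_gap']:.4f}, "
          f"MAE gap = {gaps[name]['mae_gap']:.2f}")
```

The R2 gap shrinks from roughly 0.119 to roughly 0.006, but the test R2 itself falls from about 0.835 to about 0.739, so the smaller gap comes at the cost of a weaker fit overall.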

# TESTS BELOW

In [11]:
# LINEARREGRESSION TEST BELOW - DISREGARD
from sklearn.linear_model import LinearRegression
LinReg = LinearRegression()
LinReg.fit(X_train_new, y_train)
y_train_pred = LinReg.predict(X_train_new)
r2_train = LinReg.score(X_train_new, y_train)
mae_train = mean_absolute_error(y_train, y_train_pred)
print('LinReg Train R2 score:', r2_train, ', LinReg Train MAE:', mae_train)

LinReg Train R2 score: 0.7446359555925534 , LinReg Train MAE: 25529.09335324988


In [12]:
y_test_pred = LinReg.predict(X_test_new)
r2_test = LinReg.score(X_test_new, y_test)
mae_test = mean_absolute_error(y_test, y_test_pred)
print('LinReg Test R2 score:', r2_test, ', Linreg Test MAE:', mae_test)

LinReg Test R2 score: 0.738668996375156 , Linreg Test MAE: 26264.13492918781


In [13]:
# LASSO TEST BELOW - DISREGARD
from sklearn.linear_model import Lasso
model3 = Lasso(alpha=0.4)
model3.fit(X_train_new, y_train)
mod3_y_train_pred = model3.predict(X_train_new)
mod3_r2_train = model3.score(X_train_new, y_train)
mod3_mae_train = mean_absolute_error(y_train, mod3_y_train_pred)
print('Lasso Train R2 score:', mod3_r2_train, ', Lasso Train MAE:', mod3_mae_train)

Lasso Train R2 score: 0.7446359189820538 , Lasso Train MAE: 25528.797812527744


In [14]:
mod3_y_test_pred = model3.predict(X_test_new)
mod3_r2_test = model3.score(X_test_new, y_test)
mod3_mae_test = mean_absolute_error(y_test, mod3_y_test_pred)
print('Lasso Test R2 score:', mod3_r2_test, ', Lasso Test MAE:', mod3_mae_test)

Lasso Test R2 score: 0.7386783119095597 , Lasso Test MAE: 26263.96939710893


# TESTS ABOVE
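The three test cells above repeat the same fit, predict, score pattern. A minimal sketch of how that could be folded into one helper is shown below. The `evaluate` function is hypothetical, the data are synthetic (generated with `make_regression`, not the housing data), and `Ridge(alpha=1.0)` is only scikit-learn's default, since the alpha of the pickled base model is not shown here.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Fit a model and return its train/test R2 and MAE (hypothetical helper)."""
    model.fit(X_tr, y_tr)
    return {
        'train_r2': model.score(X_tr, y_tr),
        'train_mae': mean_absolute_error(y_tr, model.predict(X_tr)),
        'test_r2': model.score(X_te, y_te),
        'test_mae': mean_absolute_error(y_te, model.predict(X_te)),
    }


# Synthetic stand-in for the 25 selected features
X, y = make_regression(n_samples=300, n_features=25, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    'Ridge': Ridge(alpha=1.0),   # assumed alpha (scikit-learn default)
    'LinReg': LinearRegression(),
    'Lasso': Lasso(alpha=0.4),   # same alpha as the Lasso test above
}

results = {name: evaluate(m, X_tr, y_tr, X_te, y_te) for name, m in models.items()}
for name, res in results.items():
    print(name, {k: round(v, 4) for k, v in res.items()})
```

With the real data, the same call would look like `evaluate(model1, X_train_new, y_train, X_test_new, y_test)`.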