# Advanced Modelling

Now that I have a base model to work with, let's see whether I can improve on it or build a better competing model. In this part of the project, I will attempt to optimize the Ridge Regression base model. In addition, I will build Random Forest, Light Gradient Boosting Machine (LGBM), and Extreme Gradient Boosting (XGBoost) models and compare their performance against the optimized Ridge Regression.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.feature_selection import SelectFromModel
import pickle

In [2]:
df = pd.read_csv('../capstone2-housing/documents/final_housing_df.csv', index_col=0)
def load_pickle(path):
    # Open with a context manager so the file handle is closed after loading
    with open(path, 'rb') as f:
        return pickle.load(f)

X_train = load_pickle('X_train')
X_test = load_pickle('X_test')
y_train = load_pickle('y_train')
y_test = load_pickle('y_test')

model1 = load_pickle('RR_base')

Okay, now that everything has been imported, the fun can begin. First, I want to try to improve on my Ridge Regression base model. In my base modelling, I identified the three features with the largest positive coefficients and the three with the largest negative coefficients. Let's retrace those steps and get a list of the top 10 of each.

##### Ridge Regression

In [3]:
coefs = model1.coef_
feature_dict = {}
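# Map each rounded coefficient to its feature name.
# Note: features whose coefficients round to the same integer share a key, so the later one overwrites the earlier.
for coef, feat in zip(coefs, X_train.columns):
    feature_dict[round(coef)] = feat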

In [4]:
positive_coefs = sorted([round(coef) for coef in coefs if coef >=0], reverse=True)
negative_coefs = sorted([round(coef) for coef in coefs if coef < 0], reverse=True, key=abs)
top_pos_feat = positive_coefs[:10]
top_neg_feat = negative_coefs[:10]

print(top_pos_feat)
print(top_neg_feat)

[145606, 124965, 92548, 72769, 66604, 62588, 51464, 47327, 43907, 43710]
[-404345, -170360, -110046, -102164, -58605, -56044, -52028, -50921, -47994, -40665]


In [5]:
top_features = positive_coefs[:10] + negative_coefs[:10]
for i in top_features:
    print(feature_dict.get(i), ":", i)

Fence_GdPrv : 145606
Exterior1st_AsbShng : 124965
RoofMatl_Metal : 92548
YearBuilt_1934 : 72769
PoolQC_Fa : 66604
RoofMatl_Membran : 62588
RoofMatl_ClyTile : 51464
OverallCond_1 : 47327
RoofMatl_WdShngl : 43907
RoofStyle_Flat : 43710
RoofMatl_CompShg : -404345
Condition2_RRAe : -170360
PoolQC_Na : -110046
PoolQC_Gd : -102164
YearBuilt_1893 : -58605
GarageYrBlt_1906.0 : -56044
YearBuilt_1965 : -52028
GarageYrBlt_1933.0 : -50921
ExterCond_TA : -47994
GarageYrBlt_1920.0 : -40665

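Because `feature_dict` is keyed by the rounded coefficient, two features whose coefficients round to the same integer would collide. As a minimal sketch of one alternative, assuming a fitted linear model and a list of column names, the coefficients can be kept in a pandas Series indexed by feature name instead. The toy data and the helper name `rank_coefficients` are hypothetical and only there to make the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

def rank_coefficients(model, feature_names, k=10):
    """Return the k largest positive and k most negative coefficients, indexed by name."""
    coefs = pd.Series(model.coef_, index=feature_names)
    top_pos = coefs[coefs >= 0].sort_values(ascending=False).head(k)
    top_neg = coefs[coefs < 0].sort_values(ascending=True).head(k)
    return top_pos, top_neg

# Toy data (hypothetical): y depends strongly on "a" (positively) and "b" (negatively).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
y_demo = 5 * X_demo["a"] - 3 * X_demo["b"] + rng.normal(scale=0.1, size=200)

demo_model = Ridge(alpha=1.0).fit(X_demo, y_demo)
top_pos, top_neg = rank_coefficients(demo_model, X_demo.columns, k=2)
print(top_pos)
print(top_neg)
```

With this layout, `list(top_pos.index) + list(top_neg.index)` gives the selected column names directly, without a lookup through rounded values.
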

There are many ways to approach this. I will try the following three subsets for feature selection: the top 10 positive/10 negative, top 5 positive/5 negative, and top 3 positive/3 negative features. I will retrain the model for each of them and compare the results to see what difference it makes.

To do so, I will create a function that extracts the top k positive and top k negative features based on the model coefficients, trains and evaluates the model with those features, and returns the results.

In [6]:
def k_feature_score(k):
    """Refit model1 on the top k positive and top k negative features; return test R2 and MAE."""
    # Look up feature names for the k largest positive and k most negative coefficients
    top_k = positive_coefs[:k] + negative_coefs[:k]
    selected_features = [feature_dict.get(coef) for coef in top_k]
    X_train_k = X_train[selected_features]
    X_test_k = X_test[selected_features]
    # Note: this refits model1 in place on the reduced feature set
    model1.fit(X_train_k, y_train)
    mod1_y_test_pred = model1.predict(X_test_k)
    mod1_r2_test = model1.score(X_test_k, y_test)
    mod1_mae_test = mean_absolute_error(y_test, mod1_y_test_pred)
    return mod1_r2_test, mod1_mae_test

Time to find the best k value. I will use my function to evaluate every k from 1 up to (but not including) the length of the list of negative coefficients, as that list is shorter than the list of positive coefficients. I will gather the results in two separate lists: R2 scores and MAE scores.

In [7]:
iterations = len(negative_coefs)
r2s = []
maes = []

for num in range(1, iterations):
    r2_score, mae_score = k_feature_score(num)
    r2s.append(r2_score)
    maes.append(mae_score)

r2_index = r2s.index(max(r2s))
mae_index = maes.index(min(maes))
print('K with best R2 score:', r2_index+1, ', R2 Score:', max(r2s), ', MAE:', maes[r2_index])
print('K with smallest MAE:', mae_index+1, ', R2 Score:', r2s[mae_index], ', MAE:', min(maes))

K with best R2 score: 190 , R2 Score: 0.8458229177969665 , MAE: 20716.53215373754
K with smallest MAE: 200 , R2 Score: 0.8394433917505123 , MAE: 20691.08981478877

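The loop above scores each candidate k on the test set, so the test score doubles as the selection criterion. One possible alternative, sketched here under stated assumptions, is to score each k with cross-validation on the training data only. This is not the method used in this notebook: the data is synthetic, the features are ranked once by absolute coefficient size for brevity (rather than as k positive plus k negative), and all names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (hypothetical).
X_arr, y_demo = make_regression(n_samples=300, n_features=20, n_informative=5,
                                noise=10.0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(20)])

# Rank features once by the absolute size of their Ridge coefficients.
ranking = (pd.Series(Ridge().fit(X_demo, y_demo).coef_, index=X_demo.columns)
           .abs().sort_values(ascending=False).index)

# Cross-validated MAE for the top-k features, for every k.
cv_mae = {}
for k in range(1, len(ranking) + 1):
    cols = list(ranking[:k])
    scores = cross_val_score(Ridge(), X_demo[cols], y_demo, cv=5,
                             scoring="neg_mean_absolute_error")
    cv_mae[k] = -scores.mean()  # scorer returns negative MAE, so flip the sign

best_k = min(cv_mae, key=cv_mae.get)
print("best k:", best_k, "CV MAE:", round(cv_mae[best_k], 2))
```

The test set would then be used once, at the end, to report the score of the chosen k.
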

The MAE at k=190 and k=200 differs by only about 25, while k=190 gives the higher R2 score, so I will use k=190 (the top 190 positive and top 190 negative coefficients). I will retrain the model with those features and summarize the scores below.

In [8]:
r2_score, mae_score = k_feature_score(190)
print('Ridge Regression R2 score:', r2_score, ', Ridge Regression MAE:', mae_score)

Ridge Regression R2 score: 0.8458229177969665 , Ridge Regression MAE: 20716.53215373754


##### Random Forest Regression

Now that I have fine-tuned the Ridge Regression, let's see if there are other models that could perform better. The first one I want to try is Random Forest Regression. I will start by establishing a base model.

In [9]:
rfr = RandomForestRegressor(random_state=123)
rfr.fit(X_train, y_train)
rfr_y_train_pred = rfr.predict(X_train)
rfr_r2_train = rfr.score(X_train, y_train)
rfr_mae_train = mean_absolute_error(y_train, rfr_y_train_pred)
print('Random Forest train R2 score:', rfr_r2_train, ', Random Forest train MAE:', rfr_mae_train)

Random Forest train R2 score: 0.9778821466705006 , Random Forest train MAE: 6956.267291585127


In [10]:
rfr_y_test_pred = rfr.predict(X_test)
rfr_r2_test = rfr.score(X_test, y_test)
rfr_mae_test = mean_absolute_error(y_test, rfr_y_test_pred)
print('Random Forest test R2 score:', rfr_r2_test, ', Random Forest test MAE:', rfr_mae_test)

Random Forest test R2 score: 0.8525919728615938 , Random Forest test MAE: 18405.52422700587

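The two cells above compute the same metrics first on the training split and then on the test split. As a small sketch, the comparison can be wrapped in one helper that also reports the gaps between the splits. The helper name `train_test_report` and the synthetic data are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def train_test_report(model, X_tr, y_tr, X_te, y_te):
    """Fit the model and return R2 and MAE for both splits plus their gaps."""
    model.fit(X_tr, y_tr)
    report = {
        "r2_train": model.score(X_tr, y_tr),
        "r2_test": model.score(X_te, y_te),
        "mae_train": mean_absolute_error(y_tr, model.predict(X_tr)),
        "mae_test": mean_absolute_error(y_te, model.predict(X_te)),
    }
    # A large positive gap in either metric suggests overfitting.
    report["r2_gap"] = report["r2_train"] - report["r2_test"]
    report["mae_gap"] = report["mae_test"] - report["mae_train"]
    return report

# Synthetic stand-in for the housing data (hypothetical).
X_demo, y_demo = make_regression(n_samples=400, n_features=15, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

report = train_test_report(RandomForestRegressor(n_estimators=50, random_state=123),
                           X_tr, y_tr, X_te, y_te)
print(pd.Series(report).round(3))
```
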

Looks like the Random Forest Regressor is overfitted (train R2 of 0.978 against test R2 of 0.853) and needs some refining. Let's start with hyperparameter tuning. I've been using this guide to identify which parameters to tune: https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/  
I will build a loop to see which n_estimators value gives the smallest difference between train and test scores, for both R2 and MAE. After a preliminary run-through, I decided to start at n_estimators=50 and go up to 300 in increments of 10 (anything below 50 or above 300 yielded clearly poorer results).

In [11]:
estimators = []
r2_diff = []
mae_diff = []

for i in range(50, 310, 10):
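    # max_features=1.0 (use all features) replaces "auto", which newer scikit-learn versions no longer accept
    rfr = RandomForestRegressor(n_estimators=i, oob_score=True, n_jobs=-1, random_state=123,
                                max_features=1.0, min_samples_leaf=50)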
    rfr.fit(X_train, y_train)
    rfr_y_train_pred = rfr.predict(X_train)
    rfr_r2_train = rfr.score(X_train, y_train)
    rfr_mae_train = mean_absolute_error(y_train, rfr_y_train_pred)
    #print({i}, 'TRAIN R2:', rfr_r2_train, ', MAE:', rfr_mae_train)
    rfr_y_test_pred = rfr.predict(X_test)
    rfr_r2_test = rfr.score(X_test, y_test)
    rfr_mae_test = mean_absolute_error(y_test, rfr_y_test_pred)
    estimators.append(i)
    r2_diff.append(rfr_r2_train - rfr_r2_test)
    mae_diff.append(rfr_mae_test-rfr_mae_train) 
    #print({i}, 'TEST R2:', rfr_r2_test, ', MAE:', rfr_mae_test)

In [12]:
print(min(r2_diff), min(mae_diff))
print(r2_diff.index(min(r2_diff)), mae_diff.index(min(mae_diff)))
print(estimators[r2_diff.index(min(r2_diff))])

0.04483304726304305 1046.5459769366753
14 14
190

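The forests in the loop above are built with `oob_score=True`, but the out-of-bag estimate is never read. As a hedged sketch on synthetic data, the fitted attribute `oob_score_` (an R2 score for regressors) can be compared with the training R2 for each `n_estimators` value without touching the test set. The data and variable names here are hypothetical, and the sweep is shortened to three values to keep it quick.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the training data (hypothetical).
X_demo, y_demo = make_regression(n_samples=400, n_features=15, noise=20.0, random_state=0)

rows = []
for n in range(50, 160, 50):  # 50, 100, 150
    forest = RandomForestRegressor(n_estimators=n, oob_score=True, random_state=123)
    forest.fit(X_demo, y_demo)
    rows.append({
        "n_estimators": n,
        "train_r2": forest.score(X_demo, y_demo),
        "oob_r2": forest.oob_score_,  # R2 on samples each tree did not see
    })

oob_table = pd.DataFrame(rows)
print(oob_table)
```
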

Above, I iterated over n_estimators from 50 to 300 (in increments of 10) and found that n_estimators=190 gives the smallest gap between train and test scores. Below I will rebuild the model with that hyperparameter.

In [13]:
rfr = RandomForestRegressor(n_estimators=190, oob_score=True, n_jobs=-1, random_state=123, min_samples_leaf=50)
rfr.fit(X_train, y_train)
rfr_y_train_pred = rfr.predict(X_train)
rfr_r2_train = rfr.score(X_train, y_train)
rfr_mae_train = mean_absolute_error(y_train, rfr_y_train_pred)
print('TRAIN R2:', rfr_r2_train, ', MAE:', rfr_mae_train)

TRAIN R2: 0.6952763823835019 , MAE: 26615.647422751725


In [14]:
rfr_y_test_pred = rfr.predict(X_test)
rfr_r2_test = rfr.score(X_test, y_test)
rfr_mae_test = mean_absolute_error(y_test, rfr_y_test_pred)
print('TEST R2:', rfr_r2_test, ', MAE:', rfr_mae_test)

TEST R2: 0.6504433351204588 , MAE: 27662.193399688404


Above, I have fine-tuned the parameters for the Random Forest Regressor. The test R2 score is 0.6504 and the MAE is 27662.19, which is worse than my Ridge Regression model (R2 0.8458, MAE 20716.53). Let's see if feature selection can improve the performance.  
I will be using the SelectFromModel method (discussed in this article: https://towardsdatascience.com/feature-selection-using-random-forest-26d7b747597f) to do my feature selection. First, I want to check out the feature importances from the model.

In [15]:
rfr.feature_importances_

array([4.91696617e-04, 6.15152100e-03, 0.00000000e+00, 1.04074069e-02,
       0.00000000e+00, 5.79236407e-05, 6.26694189e-02, 2.07832884e-02,
       1.77886236e-03, 0.00000000e+00, 2.08626125e-01, 2.81567576e-05,
       0.00000000e+00, 2.85556694e-02, 1.29848428e-03, 2.94683931e-04,
       1.84578403e-05, 4.84919465e-04, 4.73323521e-01, 2.05015209e-02,
       1.06138678e-04, 3.68739385e-04, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 2.98746153e-05, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 1.62199510e-03, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.05447112e-04,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
      

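The raw array is hard to read without the column names. A minimal sketch, assuming a fitted forest and a DataFrame of features, pairs each importance with its feature name and sorts the result. The synthetic data and the `feat_*` column names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the housing features (hypothetical).
X_arr, y_demo = make_regression(n_samples=300, n_features=8, n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

forest = RandomForestRegressor(n_estimators=100, random_state=123).fit(X_demo, y_demo)

# Index the importances by column name and sort from most to least important.
importances = (pd.Series(forest.feature_importances_, index=X_demo.columns)
               .sort_values(ascending=False))
print(importances.head(5))
print("features with zero importance:", int((importances == 0).sum()))
```
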
Looks like a lot of features have 0 importance. I will begin by selecting every feature whose importance is above a small threshold of 0.00001, which drops the zero-importance features.

In [16]:
sel = SelectFromModel(rfr, threshold=0.00001)
sel.fit(X_train, y_train)
selected_feat = X_train.columns[sel.get_support()]
print(len(selected_feat))
print(selected_feat)

43
Index(['LotFrontage', 'LotArea', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF',
       '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'MSSubClass_30',
       'MSSubClass_70', 'MSZoning_RM', 'LandContour_Bnk', 'HouseStyle_2.5Fin',
       'OverallQual_6', 'YearRemodAdd_1951', 'ExterQual_TA', 'ExterCond_Ex',
       'Foundation_Slab', 'BsmtQual_Fa', 'BsmtQual_Na', 'BsmtCond_Fa',
       'BsmtFinType1_ALQ', 'BsmtFinType1_LwQ', 'HeatingQC_Fa', 'CentralAir_N',
       'KitchenQual_TA', 'Functional_Maj1', 'FireplaceQu_Po',
       'GarageType_Basment', 'GarageType_Na', 'GarageQual_Ex', 'GarageCond_Ex',
       'PavedDrive_N'],
      dtype='object')


I will use the 43 features printed above, which are the ones with importance above the threshold, to create new X_train/X_test sets and retrain the model.

In [17]:
X_train_rfr = X_train[selected_feat]
X_test_rfr = X_test[selected_feat]

rfr.fit(X_train_rfr, y_train)
rfr_y_test_pred = rfr.predict(X_test_rfr)
rfr_r2_test = rfr.score(X_test_rfr, y_test)
rfr_mae_test = mean_absolute_error(y_test, rfr_y_test_pred)
print('TEST R2:', rfr_r2_test, ', MAE:', rfr_mae_test)

TEST R2: 0.6504433351204588 , MAE: 27662.193399688404


I also tried smaller subsets of the most important features (not shown above); each performed slightly worse than using all features. The best scores I found for the tuned Random Forest Regressor in this part of the project are R2: 0.6504, MAE: 27662.19.

##### Light Gradient Boosted Machine Algorithm (LGBM)

So far, I have built a Ridge Regression model and a Random Forest model. The Ridge Regression model is currently the best performing one. 