# Advanced Modelling

Now that I have a base model to work with, let's see whether I can improve on it or build a better competing model. In this part of the project, I will try to optimize the Ridge Regression base model. I will also build Random Forest, Light Gradient Boosting Machine (LGBM), and Extreme Gradient Boosting (XGBoost) models and compare their performance against the optimized Ridge Regression.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import pickle

In [2]:
def load_pickle(path):
    # Open in a context manager so the file handle is closed after loading
    with open(path, 'rb') as f:
        return pickle.load(f)

df = pd.read_csv('../capstone2-housing/documents/final_housing_df.csv', index_col=0)
X_train = load_pickle('X_train')
X_test = load_pickle('X_test')
y_train = load_pickle('y_train')
y_test = load_pickle('y_test')

# Ridge Regression base model saved during base modelling
model1 = load_pickle('RR_base')

Okay, now that everything has been imported, the fun can begin. First, I want to try to improve on my Ridge Regression base model. In my base modelling, I identified the top three positive and top three negative features by coefficient. Let's retrace those steps and get a list of the top 10 coefficients in each direction.

##### Ridge Regression

In [3]:
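coefs = model1.coef_
# Map each rounded coefficient to its feature name.
# Caveat: features whose coefficients round to the same integer overwrite one another.
feature_dict = {}
for coef, feat in zip(coefs, X_train.columns):
    feature_dict[round(coef)] = feat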

In [4]:
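# Rounded coefficients, largest first
positive_coefs = sorted([round(coef) for coef in coefs if coef >= 0], reverse=True)
# Rounded coefficients, most negative first (sorted by absolute value)
negative_coefs = sorted([round(coef) for coef in coefs if coef < 0], reverse=True, key=abs)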
top_pos_feat = positive_coefs[:10]
top_neg_feat = negative_coefs[:10]

print(top_pos_feat)
print(top_neg_feat)

[145606, 124965, 92548, 72769, 66604, 62588, 51464, 47327, 43907, 43710]
[-404345, -170360, -110046, -102164, -58605, -56044, -52028, -50921, -47994, -40665]


In [5]:
top_features = positive_coefs[:10] + negative_coefs[:10]
for i in top_features:
    print(feature_dict.get(i), ":", i)

Fence_GdPrv : 145606
Exterior1st_AsbShng : 124965
RoofMatl_Metal : 92548
YearBuilt_1934 : 72769
PoolQC_Fa : 66604
RoofMatl_Membran : 62588
RoofMatl_ClyTile : 51464
OverallCond_1 : 47327
RoofMatl_WdShngl : 43907
RoofStyle_Flat : 43710
RoofMatl_CompShg : -404345
Condition2_RRAe : -170360
PoolQC_Na : -110046
PoolQC_Gd : -102164
YearBuilt_1893 : -58605
GarageYrBlt_1906.0 : -56044
YearBuilt_1965 : -52028
GarageYrBlt_1933.0 : -50921
ExterCond_TA : -47994
GarageYrBlt_1920.0 : -40665
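
Because the lookup above goes from a rounded coefficient back to a feature name, it only works cleanly when every rounded value is unique. As a minimal sketch of an alternative that keeps names and coefficients together, the example below ranks a pandas Series indexed by feature name. The values in `toy_coefs` are made up for illustration (they are not from the housing model), and `top_k_features` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Illustrative coefficients only; these are NOT values from the housing model.
# feat_c and feat_d would both round to 0 and collide in a dict keyed by rounded value.
toy_coefs = pd.Series(
    {"feat_a": 3.2, "feat_b": -7.9, "feat_c": 0.4, "feat_d": 0.3, "feat_e": -1.1, "feat_f": 5.0}
)

def top_k_features(coefs, k):
    """Return the names of the k largest positive and k most negative coefficients."""
    positive = coefs[coefs >= 0].sort_values(ascending=False).head(k)
    negative = coefs[coefs < 0].sort_values(ascending=True).head(k)
    return list(positive.index) + list(negative.index)

selected = top_k_features(toy_coefs, 2)
print(selected)
```

With the real model, the Series could be built as `pd.Series(model1.coef_, index=X_train.columns)`, assuming `coef_` is one-dimensional.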


There are many ways to approach this. I will try the following three subsets for feature selection: top 10 positive/10 negative, top 5 positive/5 negative, and top 3 positive/3 negative features. I will retrain the model on each of them and compare the results to see what difference it makes.

To do so, I will create a function that extracts the top k features based on the model coefficients, trains and evaluates the model with those features, and returns the results.

In [6]:
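def k_feature_score(k):
    """Refit model1 on the top k positive and top k negative features; return (R2, MAE) on the test set."""
    selected_features = []
    # Rounded coefficients of the k largest positive and k most negative features
    top_k = positive_coefs[:k] + negative_coefs[:k]
    for coef in top_k:
        selected_features.append(feature_dict.get(coef))
    X_train_k = X_train[selected_features]
    X_test_k = X_test[selected_features]
    # Note: this refits model1 in place, so it no longer holds the base-model fit afterwards
    model1.fit(X_train_k, y_train)
    mod1_y_test_pred = model1.predict(X_test_k)
    mod1_r2_test = model1.score(X_test_k, y_test)
    mod1_mae_test = mean_absolute_error(y_test, mod1_y_test_pred)
    return mod1_r2_test, mod1_mae_test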
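
Since the function above refits `model1` itself, the loaded base model is overwritten each time it runs. If the original fit needs to be preserved, one option is to fit a fresh copy instead. The sketch below shows that idea with `sklearn.base.clone` on synthetic stand-in data; `score_feature_subset` and the toy columns `a` to `d` are hypothetical names introduced only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

def score_feature_subset(estimator, features, X_tr, y_tr, X_te, y_te):
    """Fit a fresh copy of `estimator` on `features`; return (R2, MAE) on the test split."""
    model = clone(estimator)  # unfitted copy with the same hyperparameters
    model.fit(X_tr[features], y_tr)
    preds = model.predict(X_te[features])
    return model.score(X_te[features], y_te), mean_absolute_error(y_te, preds)

# Synthetic stand-in data (assumption: not the housing data)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
y = 3 * X["a"] - 2 * X["b"] + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = X.iloc[:150], X.iloc[150:], y.iloc[:150], y.iloc[150:]

base = Ridge().fit(X_tr, y_tr)   # "base model" fitted on all four columns
coef_before = base.coef_.copy()

r2, mae = score_feature_subset(base, ["a", "b"], X_tr, y_tr, X_te, y_te)
print(r2, mae)
```

After the call, `base` still holds its original four coefficients, because only the clone was fitted on the two-column subset.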

Time to find the best k value. I will call my function for a range of k values bounded by the length of the list of negative coefficients (that list is shorter than the list of positive coefficients). I will gather the results in two separate lists: R2 scores and MAE scores.

In [8]:
iterations = len(negative_coefs)
r2s = []
maes = []

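# range(1, iterations) evaluates k = 1 .. iterations - 1, so list index i holds the result for k = i + 1
for num in range(1, iterations):
    r2_score, mae_score = k_feature_score(num)
    r2s.append(r2_score)
    maes.append(mae_score)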

r2_index = r2s.index(max(r2s))
mae_index = maes.index(min(maes))
print('K with best R2 score:', r2_index+1, ', R2 Score:', max(r2s), ', MAE:', maes[r2_index])
print('K with smallest MAE:', mae_index+1, ', R2 Score:', r2s[mae_index], ', MAE:', min(maes))

K with best R2 score: 190 , R2 Score: 0.8458229177969665 , MAE: 20716.53215373754
K with smallest MAE: 200 , R2 Score: 0.8394433917505123 , MAE: 20691.08981478877


The two candidates differ very little in MAE (about 25), while k = 190 gives the higher R2 score, so I will go with k = 190, i.e. the top 190 positive and top 190 negative coefficients. I will train the model and summarize the scores below.

In [9]:
r2_score, mae_score = k_feature_score(190)
print('Ridge Regression R2 score:', r2_score, ', Ridge Regression MAE:', mae_score)

Ridge Regression R2 score: 0.8458229177969665 , Ridge Regression MAE: 20716.53215373754
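
The k above was chosen by comparing scores on the test set. A common alternative is to pick k with cross-validation on the training data only, so the test set stays untouched until the final evaluation. The following is a rough sketch of that idea on synthetic data, not the procedure used in this notebook: it ranks features by absolute Ridge coefficient (rather than the top k positive/negative split used above), does the ranking once on the full training data for brevity, and all names (`X_demo`, `y_demo`, `cv_mae`) are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic training data (assumption): only f0, f1 and f2 carry signal
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(300, 6)), columns=[f"f{i}" for i in range(6)])
y_demo = 4 * X_demo["f0"] - 3 * X_demo["f1"] + 2 * X_demo["f2"] + rng.normal(scale=0.5, size=300)

# Rank features by the absolute value of their Ridge coefficient, largest first
ranked = (
    pd.Series(Ridge().fit(X_demo, y_demo).coef_, index=X_demo.columns)
    .abs()
    .sort_values(ascending=False)
    .index
)

# Mean 5-fold cross-validated MAE for each k (scikit-learn reports MAE negated)
cv_mae = {}
for k in range(1, len(ranked) + 1):
    scores = cross_val_score(
        Ridge(), X_demo[list(ranked[:k])], y_demo, cv=5, scoring="neg_mean_absolute_error"
    )
    cv_mae[k] = -scores.mean()

best_k = min(cv_mae, key=cv_mae.get)
print(best_k, cv_mae[best_k])
```

The k chosen this way would then be evaluated once on the held-out test set.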


##### Random Forest Regression

Now that I have fine-tuned the Ridge Regression, let's see whether other models could perform better. The first one I want to try is Random Forest Regression. I will start by establishing a base model.

In [10]:
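# Base Random Forest with default hyperparameters; fixed seed for reproducibility
rfr = RandomForestRegressor(random_state=123)
rfr.fit(X_train, y_train)
rfr_y_test_pred = rfr.predict(X_test)
# R2 and MAE on the test set
rfr_r2_test = rfr.score(X_test, y_test)
rfr_mae_test = mean_absolute_error(y_test, rfr_y_test_pred)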