# Advanced Modelling

Now that I have found a base model to work with, let's see whether I can improve on it or build a better competing model. In this part of the project, I will attempt to optimize the Ridge Regression base model. In addition, I will build Random Forest and Light Gradient Boosting Machine (LGBM) models and compare their performance against the optimized Ridge Regression.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm
import pickle
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.feature_selection import SelectFromModel
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, RepeatedKFold
from lightgbm import LGBMRegressor

In [2]:
df = pd.read_csv('../capstone2-housing/documents/final_housing_df.csv', index_col=0)
X_train = pickle.load(open('X_train', 'rb'))
X_test = pickle.load(open('X_test', 'rb'))
y_train = pickle.load(open('y_train', 'rb'))
y_test = pickle.load(open('y_test', 'rb'))

model1 = pickle.load(open('RR_base', 'rb'))

Okay, now that everything has been imported, the fun can begin. First, I want to try to improve on my Ridge Regression base model. In my base modelling, I identified the three most positive and three most negative coefficients. Let's retrace those steps and get a list of the top 10 coefficients in each direction.

##### Ridge Regression

In [3]:
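coefs = model1.coef_
# Map each rounded coefficient back to its feature name.
# Caveat: features whose coefficients round to the same integer overwrite each other.
feature_dict = {}
for coef, feat in zip(coefs, X_train.columns):
    feature_dict[round(coef)] = feat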

In [4]:
# Rounded coefficients, sorted so that the largest magnitude comes first
positive_coefs = sorted([round(coef) for coef in coefs if coef >= 0], reverse=True)
negative_coefs = sorted([round(coef) for coef in coefs if coef < 0], reverse=True, key=abs)
top_pos_feat = positive_coefs[:10]
top_neg_feat = negative_coefs[:10]

print(top_pos_feat)
print(top_neg_feat)

[145606, 124965, 92548, 72769, 66604, 62588, 51464, 47327, 43907, 43710]
[-404345, -170360, -110046, -102164, -58605, -56044, -52028, -50921, -47994, -40665]


In [5]:
top_features = positive_coefs[:10] + negative_coefs[:10]
for i in top_features:
    print(feature_dict.get(i), ":", i)

Fence_GdPrv : 145606
Exterior1st_AsbShng : 124965
RoofMatl_Metal : 92548
YearBuilt_1934 : 72769
PoolQC_Fa : 66604
RoofMatl_Membran : 62588
RoofMatl_ClyTile : 51464
OverallCond_1 : 47327
RoofMatl_WdShngl : 43907
RoofStyle_Flat : 43710
RoofMatl_CompShg : -404345
Condition2_RRAe : -170360
PoolQC_Na : -110046
PoolQC_Gd : -102164
YearBuilt_1893 : -58605
GarageYrBlt_1906.0 : -56044
YearBuilt_1965 : -52028
GarageYrBlt_1933.0 : -50921
ExterCond_TA : -47994
GarageYrBlt_1920.0 : -40665
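
The lookup above keys each feature by its rounded coefficient, so two features whose coefficients round to the same integer would collide. A minimal sketch of one alternative is to keep the coefficients in a pandas Series indexed by feature name. The synthetic data and the names `X_demo`, `ridge_demo` and `coef_series` are illustrative assumptions, not part of the project.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic stand-in for X_train / y_train (illustrative only)
X_arr, y_demo = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

ridge_demo = Ridge().fit(X_demo, y_demo)

# One coefficient per feature name, so equal values never overwrite each other
coef_series = pd.Series(ridge_demo.coef_, index=X_demo.columns)

# Largest positive coefficients first, most negative coefficients first
top_pos = coef_series[coef_series >= 0].sort_values(ascending=False).head(3)
top_neg = coef_series[coef_series < 0].sort_values().head(3)

print(top_pos.round())
print(top_neg.round())
```

Because the index carries the feature names, `list(top_pos.index)` can be passed straight to `X_train[...]` without a reverse lookup.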


As there are many ways to approach this, I will try the following three subsets for feature selection: the top 10 positive/10 negative, top 5 positive/5 negative, and top 3 positive/3 negative features. I will retrain the model for each of them and compare the results to see what difference it makes.

To do so, I will create a function that extracts the top k features based on the model coefficients, trains and evaluates the model with those features, and returns the results.

In [6]:
def k_feature_score(k):
    """Refit model1 on the top k positive and top k negative features; return test (R2, MAE)."""
    top_k = positive_coefs[:k] + negative_coefs[:k]
    selected_features = [feature_dict.get(coef) for coef in top_k]
    X_train_k = X_train[selected_features]
    X_test_k = X_test[selected_features]
    # Note: this refits model1 in place, replacing the loaded base model's coefficients
    model1.fit(X_train_k, y_train)
    mod1_y_test_pred = model1.predict(X_test_k)
    mod1_r2_test = model1.score(X_test_k, y_test)
    mod1_mae_test = mean_absolute_error(y_test, mod1_y_test_pred)
    return mod1_r2_test, mod1_mae_test
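
Since the function above refits `model1` on every call, the loaded base model does not survive the first call. A sketch of one way around this, assuming it is acceptable to fit a fresh copy each time, uses `sklearn.base.clone`. The helper name `score_feature_subset` and the synthetic data are hypothetical.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def score_feature_subset(base_model, features, X_tr, y_tr, X_te, y_te):
    """Fit an unfitted copy of base_model on the given columns; return (R2, MAE) on X_te."""
    model = clone(base_model)  # same hyperparameters, no fitted state
    model.fit(X_tr[features], y_tr)
    preds = model.predict(X_te[features])
    return model.score(X_te[features], y_te), mean_absolute_error(y_te, preds)


# Synthetic stand-in data (illustrative only)
X_arr, y_all = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_all = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(10)])
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, random_state=0)

base = Ridge().fit(X_tr, y_tr)
r2, mae = score_feature_subset(base, ["feat_0", "feat_1", "feat_2"], X_tr, y_tr, X_te, y_te)

# The base model still holds one coefficient per original column
print(r2, mae, base.coef_.shape[0])
```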

Time to find the best k value. I will use my function to iterate over k values up to the length of the list of negative coefficients (as that list is shorter than the list of positive coefficients). I will gather the results in two separate lists: R2 scores and MAE scores.

In [7]:
iterations = len(negative_coefs)
r2s = []
maes = []

for num in range(1, iterations):
    r2_score, mae_score = k_feature_score(num)
    r2s.append(r2_score)
    maes.append(mae_score)

r2_index = r2s.index(max(r2s))
mae_index = maes.index(min(maes))
print('K with best R2 score:', r2_index+1, ', R2 Score:', max(r2s), ', MAE:', maes[r2_index])
print('K with smallest MAE:', mae_index+1, ', R2 Score:', r2s[mae_index], ', MAE:', min(maes))

K with best R2 score: 190 , R2 Score: 0.8458229177969665 , MAE: 20716.53215373754
K with smallest MAE: 200 , R2 Score: 0.8394433917505123 , MAE: 20691.08981478877
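
The loop above compares k values on the test set, which can make the winning score look better than it would on new data. One possible alternative, sketched here with synthetic data, is to score each k with repeated k-fold cross-validation on the training data only. Ranking by absolute coefficient is a simplification of the positive/negative split used above, and `ranked`, `cv_mae` and `best_k` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for the training data (illustrative only)
X_arr, y_demo = make_regression(
    n_samples=300, n_features=12, n_informative=4, noise=10.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(12)])

# Rank features once by absolute Ridge coefficient, largest first
coefs = pd.Series(Ridge().fit(X_demo, y_demo).coef_, index=X_demo.columns)
ranked = coefs.abs().sort_values(ascending=False).index

cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=123)
cv_mae = {}
for k in range(1, len(ranked) + 1):
    scores = cross_val_score(
        Ridge(), X_demo[ranked[:k]], y_demo,
        scoring="neg_mean_absolute_error", cv=cv,
    )
    cv_mae[k] = -scores.mean()  # flip the sign back to a positive MAE

best_k = min(cv_mae, key=cv_mae.get)
print(best_k, cv_mae[best_k])
```

The test set is then only touched once, to report the final score for the chosen k.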


Because the MAE difference between the two candidates is very small, I will go with the k that gives the best R2 score: k=190 (the top 190 positive and top 190 negative coefficients). I will train the model and summarize the scores below.

In [8]:
r2_score, mae_score = k_feature_score(190)
print('Ridge Regression R2 score:', r2_score, ', Ridge Regression MAE:', mae_score)

Ridge Regression R2 score: 0.8458229177969665 , Ridge Regression MAE: 20716.53215373754


##### Random Forest Regression

Now that I have fine-tuned the Ridge Regression, let's see if there are other models that could perform better. The first one I want to try is Random Forest Regression. I will start by establishing a base model.

In [9]:
rfr = RandomForestRegressor(random_state=123)
rfr.fit(X_train, y_train)
rfr_y_train_pred = rfr.predict(X_train)
rfr_r2_train = rfr.score(X_train, y_train)
rfr_mae_train = mean_absolute_error(y_train, rfr_y_train_pred)
print('Random Forest R2 score:', rfr_r2_train, ', Random Forest 2 MAE:', rfr_mae_train)

Random Forest R2 score: 0.9778821466705006 , Random Forest 2 MAE: 6956.267291585127


In [10]:
rfr_y_test_pred = rfr.predict(X_test)
rfr_r2_test = rfr.score(X_test, y_test)
rfr_mae_test = mean_absolute_error(y_test, rfr_y_test_pred)
print('Random Forest R2 score:', rfr_r2_test, ', Random Forest 2 MAE:', rfr_mae_test)

Random Forest R2 score: 0.8525919728615938 , Random Forest 2 MAE: 18405.52422700587


It looks like the Random Forest Regressor is overfitted and needs some refining. Let's start with hyperparameter tuning. I've been using this guide to identify which parameters to tune: https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/  
I will build a loop to see which n_estimators value gives the smallest difference between train and test R2 score and MAE. After a preliminary run-through, I decided to start the iterations at n_estimators=50 and go up to 300 in increments of 10 (anything below 50 or above 300 yielded clearly poorer results).

In [11]:
estimators = []
r2_diff = []
mae_diff = []

for i in range(50, 310, 10):
    rfr = RandomForestRegressor(n_estimators=i, oob_score=True, n_jobs=-1, random_state=123)
    rfr.fit(X_train, y_train)
    rfr_y_train_pred = rfr.predict(X_train)
    rfr_r2_train = rfr.score(X_train, y_train)
    rfr_mae_train = mean_absolute_error(y_train, rfr_y_train_pred)
    #print({i}, 'TRAIN R2:', rfr_r2_train, ', MAE:', rfr_mae_train)
    rfr_y_test_pred = rfr.predict(X_test)
    rfr_r2_test = rfr.score(X_test, y_test)
    rfr_mae_test = mean_absolute_error(y_test, rfr_y_test_pred)
    estimators.append(i)
    r2_diff.append(rfr_r2_train - rfr_r2_test)
    mae_diff.append(rfr_mae_test-rfr_mae_train) 
    print({i}, 'TEST R2:', rfr_r2_test, ', MAE:', rfr_mae_test)

{50} TEST R2: 0.8549814645733403 , MAE: 18467.25210958904
{60} TEST R2: 0.8558269976647231 , MAE: 18369.911187214613
{70} TEST R2: 0.8566206110076304 , MAE: 18312.671506849318
{80} TEST R2: 0.855871805812004 , MAE: 18351.67640410959
{90} TEST R2: 0.8534162343685465 , MAE: 18407.209689062838
{100} TEST R2: 0.8525919728615938 , MAE: 18405.52422700587
{110} TEST R2: 0.8503965283754009 , MAE: 18478.167404376443
{120} TEST R2: 0.8527321069656002 , MAE: 18355.790668623613
{130} TEST R2: 0.8543758426100986 , MAE: 18254.03743489387
{140} TEST R2: 0.854020987167779 , MAE: 18251.272471344702
{150} TEST R2: 0.8530815367691508 , MAE: 18303.59262622309
{160} TEST R2: 0.8523882443819647 , MAE: 18305.535689823875
{170} TEST R2: 0.8525508598281965 , MAE: 18265.950673420055
{180} TEST R2: 0.8525823462323945 , MAE: 18315.47362687541
{190} TEST R2: 0.8525362941622763 , MAE: 18322.809521062933
{200} TEST R2: 0.852599778140666 , MAE: 18354.16245596869
{210} TEST R2: 0.8518568099339394 , MAE: 18363.67205852

In [12]:
print(min(r2_diff), min(mae_diff))
print(r2_diff.index(min(r2_diff)), mae_diff.index(min(mae_diff)))
print(estimators[r2_diff.index(min(r2_diff))])

0.12073705906933141 11256.259947814746
2 2
70
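
The loop above passes `oob_score=True` but never reads the result. As a sketch of how that setting could be used, the fitted forest exposes an out-of-bag R2 as `oob_score_`, which gives a held-out-style estimate without touching the test set. The synthetic data and the shorter range of values are assumptions made to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the training data (illustrative only)
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)

oob_r2 = {}
for n in range(50, 110, 10):
    forest = RandomForestRegressor(n_estimators=n, oob_score=True, random_state=123)
    forest.fit(X, y)
    # R2 computed only from trees that did not see each sample during training
    oob_r2[n] = forest.oob_score_

best_n = max(oob_r2, key=oob_r2.get)
print(best_n, oob_r2[best_n])
```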


Above, I iterated over n_estimators values from 50 to 300 (in increments of 10) and found that n_estimators=70 gives the smallest gap between train and test scores. Below I will rebuild the model with that parameter.

In [13]:
rfr = RandomForestRegressor(n_estimators=70, oob_score=True, n_jobs=-1, random_state=123)
rfr.fit(X_train, y_train)
rfr_y_train_pred = rfr.predict(X_train)
rfr_r2_train = rfr.score(X_train, y_train)
rfr_mae_train = mean_absolute_error(y_train, rfr_y_train_pred)
print('TRAIN R2:', rfr_r2_train, ', MAE:', rfr_mae_train)

TRAIN R2: 0.9773576700769618 , MAE: 7056.411559034573


In [14]:
rfr_y_test_pred = rfr.predict(X_test)
rfr_r2_test = rfr.score(X_test, y_test)
rfr_mae_test = mean_absolute_error(y_test, rfr_y_test_pred)
print('TEST R2:', rfr_r2_test, ', MAE:', rfr_mae_test)

TEST R2: 0.8566206110076304 , MAE: 18312.671506849318


The model still appears overfitted. Would it help to add a min_samples_leaf argument?  
To find the best min_samples_leaf value, I will do the same iteration as above - this time iterating over leaf values 5 through 60 at increments of 5.

In [15]:
leaves = []
r2_diff = []
mae_diff = []

for i in range(5, 65, 5):
    rfr = RandomForestRegressor(n_estimators=70, oob_score=True, n_jobs=-1, random_state=123, min_samples_leaf=i)
    rfr.fit(X_train, y_train)
    rfr_y_train_pred = rfr.predict(X_train)
    rfr_r2_train = rfr.score(X_train, y_train)
    rfr_mae_train = mean_absolute_error(y_train, rfr_y_train_pred)
    #print({i}, 'TRAIN R2:', rfr_r2_train, ', MAE:', rfr_mae_train)
    rfr_y_test_pred = rfr.predict(X_test)
    rfr_r2_test = rfr.score(X_test, y_test)
    rfr_mae_test = mean_absolute_error(y_test, rfr_y_test_pred)
    leaves.append(i)
    r2_diff.append(rfr_r2_train - rfr_r2_test)
    mae_diff.append(rfr_mae_test-rfr_mae_train) 
    print({i}, 'TEST R2:', rfr_r2_test, ', MAE:', rfr_mae_test)

{5} TEST R2: 0.8325684112450087 , MAE: 18991.70562192042
{10} TEST R2: 0.8144302996547779 , MAE: 20028.647163941594
{15} TEST R2: 0.7915916884157694 , MAE: 20915.483482605265
{20} TEST R2: 0.7859444658831446 , MAE: 21289.720817489295
{25} TEST R2: 0.7791961432734009 , MAE: 21982.218220841012
{30} TEST R2: 0.7593840324796688 , MAE: 22928.173430763185
{35} TEST R2: 0.7440867936380864 , MAE: 23775.1702358942
{40} TEST R2: 0.7232216924814665 , MAE: 25011.702586415664
{45} TEST R2: 0.6628129884756686 , MAE: 27024.72955344304
{50} TEST R2: 0.6438298770284498 , MAE: 27928.9107132763
{55} TEST R2: 0.6343858868236527 , MAE: 28309.927586214213
{60} TEST R2: 0.6256813540648787 , MAE: 28783.477292694988


In [16]:
print(min(r2_diff), min(mae_diff))
print(r2_diff.index(min(r2_diff)), mae_diff.index(min(mae_diff)))
print(leaves[r2_diff.index(min(r2_diff))], leaves[mae_diff.index(min(mae_diff))])

0.033337536897174225 939.78566070546
5 11
30 60
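
A manual sweep like the one above can also be expressed with `GridSearchCV`, which scores every candidate by cross-validation on the training data and can report train scores alongside. This is a sketch under assumptions: the grid values, `cv=5` and the synthetic data are illustrative, not the settings used in this project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (illustrative only)
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=70, random_state=123),
    param_grid={"min_samples_leaf": [1, 5, 10, 20, 30]},
    scoring="neg_mean_absolute_error",
    cv=5,
    return_train_score=True,
)
search.fit(X, y)

# Scores are negative MAE, so train minus validation equals
# validation MAE minus train MAE for each candidate
gap = search.cv_results_["mean_train_score"] - search.cv_results_["mean_test_score"]

print(search.best_params_)
print(gap)
```

Here `best_params_` picks the value with the lowest cross-validated MAE, while `gap` shows how far each candidate's training error sits below its validation error.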


After trying both leaf values, I found that min_samples_leaf=30 is the best choice, as it reduces the overfitting without lowering the performance too much (using 60 significantly lowers both train and test performance). I will rebuild the random forest regressor below with n_estimators=70 and min_samples_leaf=30.

In [17]:
rfr = RandomForestRegressor(n_estimators=70, oob_score=True, n_jobs=-1, random_state=123, min_samples_leaf=30)
rfr.fit(X_train, y_train)
rfr_y_train_pred = rfr.predict(X_train)
rfr_r2_train = rfr.score(X_train, y_train)
rfr_mae_train = mean_absolute_error(y_train, rfr_y_train_pred)
print('TRAIN R2:', rfr_r2_train, ', MAE:', rfr_mae_train)
rfr_y_test_pred = rfr.predict(X_test)
rfr_r2_test = rfr.score(X_test, y_test)
rfr_mae_test = mean_absolute_error(y_test, rfr_y_test_pred)
print('TEST R2:', rfr_r2_test, ', MAE:', rfr_mae_test)

TRAIN R2: 0.792721569376843 , MAE: 21876.880455825136
TEST R2: 0.7593840324796687 , MAE: 22928.173430763185


Above, I have fine-tuned the parameters for the Random Forest Regressor. The tuned model has a test R2 score of 0.7594 and an MAE of 22928.1734, which is not a great performance compared to my Ridge Regression model. Could feature selection improve the performance?  
For feature selection, I will use SelectFromModel (with the parameters discussed in this article: https://towardsdatascience.com/feature-selection-using-random-forest-26d7b747597f).  
  
First, I want to check out the feature importances from the model.

In [18]:
rfr.feature_importances_

array([1.14464785e-03, 7.52869939e-03, 4.19805745e-03, 2.41211833e-02,
       0.00000000e+00, 2.88929691e-04, 5.79135539e-02, 3.25553627e-02,
       1.39117400e-03, 0.00000000e+00, 2.17015120e-01, 1.32015925e-04,
       0.00000000e+00, 2.92720654e-02, 1.18602954e-03, 2.21635672e-04,
       4.04398106e-05, 7.81140791e-04, 4.20417755e-01, 2.22589364e-02,
       7.84814655e-05, 1.32218414e-03, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.98065112e-05,
       0.00000000e+00, 0.00000000e+00, 2.28444007e-04, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 8.01771973e-04, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 6.42545787e-04,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
      

It looks like a lot of features have 0 importance. I started with a feature selection of everything above 0, and after trying different values, I found that a feature importance threshold of 0.0001 slightly improves the model.

In [19]:
sel = SelectFromModel(rfr, threshold=0.0001)
sel.fit(X_train, y_train)
selected_feat= X_train.columns[(sel.get_support())]
print(len(selected_feat))
print(selected_feat)

46
Index(['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtUnfSF',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath',
       'FullBath', 'HalfBath', 'BedroomAbvGr', 'TotRmsAbvGrd', 'Fireplaces',
       'GarageCars', 'WoodDeckSF', 'MSSubClass_40', 'MSSubClass_70',
       'MSZoning_RM', 'BldgType_2fmCon', 'OverallQual_8', 'OverallQual_9',
       'YearRemodAdd_1951', 'Exterior1st_Wd Sdng', 'ExterQual_TA',
       'ExterCond_Ex', 'Foundation_PConc', 'Foundation_Slab', 'BsmtQual_Fa',
       'BsmtQual_Na', 'BsmtCond_Fa', 'BsmtFinType1_LwQ', 'HeatingQC_Fa',
       'CentralAir_Y', 'Electrical_FuseA', 'KitchenQual_Fa', 'KitchenQual_TA',
       'Functional_Maj1', 'FireplaceQu_Po', 'GarageType_Basment',
       'GarageType_Na', 'GarageFinish_Na', 'GarageQual_Ex', 'GarageCond_Ex',
       'PavedDrive_N'],
      dtype='object')
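
As a sketch of how the selection step could be combined with the final fit, `SelectFromModel` can be placed in a scikit-learn `Pipeline` so that selection and training run together. The threshold is the one from the text; the synthetic data and the step names `select` and `model` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the training data (illustrative only)
X_arr, y = make_regression(
    n_samples=300, n_features=15, n_informative=5, noise=10.0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(15)])

pipe = Pipeline([
    # Keep features whose importance in a first forest is at least the threshold
    ("select", SelectFromModel(
        RandomForestRegressor(n_estimators=70, random_state=123), threshold=0.0001)),
    # Fit the final forest on the kept columns only
    ("model", RandomForestRegressor(n_estimators=70, random_state=123)),
])
pipe.fit(X, y)

mask = pipe.named_steps["select"].get_support()
kept = X.columns[mask]
print(len(kept), list(kept))
```

With a pipeline, `pipe.predict(X_test)` applies the same column selection automatically, so there is no need to build separate reduced train and test frames.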


I will use the 46 features printed above, those whose importance meets the threshold, to create new X_train/X_test sets and retrain the model.

In [20]:
X_train_rfr = X_train[selected_feat]
X_test_rfr = X_test[selected_feat]

rfr.fit(X_train_rfr, y_train)
rfr_y_test_pred = rfr.predict(X_test_rfr)
rfr_r2_test = rfr.score(X_test_rfr, y_test)
rfr_mae_test = mean_absolute_error(y_test, rfr_y_test_pred)
print('TEST R2:', rfr_r2_test, ', MAE:', rfr_mae_test)

TEST R2: 0.759389711644032 , MAE: 22925.267378791483


Selecting the 46 features with the highest importance improved the model's performance only marginally (test MAE went from 22928.17 to 22925.27). The best Random Forest Regression performance scores that I have found in this part of the project are:  
R2: 0.7594 and MAE: 22925.2674

##### Light Gradient Boosted Machine Algorithm (LGBM)

So far, I have built a Ridge Regression model and a Random Forest model. The Ridge Regression model is currently the best performing one.  
Finally, I want to build an LGBM model to see if it can outperform my Ridge Regression. As before, I will begin by building a base model, training and testing it on the data as is. For this part of the project, I have been using this site as a guide: https://machinelearningmastery.com/light-gradient-boosted-machine-lightgbm-ensemble/

In [21]:
lgbm = LGBMRegressor(random_state=123)
lgbm.fit(X_train, y_train)
lgbm_y_train_pred = lgbm.predict(X_train)
lgbm_r2_train = lgbm.score(X_train, y_train)
lgbm_mae_train = mean_absolute_error(y_train, lgbm_y_train_pred)
print('TRAIN R2:', lgbm_r2_train, ', MAE:', lgbm_mae_train)

TRAIN R2: 0.9776385877338646 , MAE: 5647.614989434855


In [22]:
lgbm_y_test_pred = lgbm.predict(X_test)
lgbm_r2_test = lgbm.score(X_test, y_test)
lgbm_mae_test = mean_absolute_error(y_test, lgbm_y_test_pred)
print('TEST R2:', lgbm_r2_test, ', MAE:', lgbm_mae_test)

TEST R2: 0.8507357166979355 , MAE: 17327.200122753264


The train and test results show that the model is overfitted. Let's see if that can be fixed by hyperparameter tuning. I will begin with num_leaves. As before, I will iterate over a range of possible values of num_leaves (in increments of ten) and find which one performs best.

In [23]:
leaves = []
r2_diff = []
mae_diff = []

for i in range(10, 160, 10):
    lgbm = LGBMRegressor(num_leaves=i, random_state=123)
    lgbm.fit(X_train, y_train)
    lgbm_y_train_pred = lgbm.predict(X_train)
    lgbm_r2_train = lgbm.score(X_train, y_train)
    lgbm_mae_train = mean_absolute_error(y_train, lgbm_y_train_pred)
    #print({i}, 'TRAIN R2:', lgbm_r2_train, ', MAE:', lgbm_mae_train)
    lgbm_y_test_pred = lgbm.predict(X_test)
    lgbm_r2_test = lgbm.score(X_test, y_test)
    lgbm_mae_test = mean_absolute_error(y_test, lgbm_y_test_pred)
    leaves.append(i)
    r2_diff.append(lgbm_r2_train - lgbm_r2_test)
    mae_diff.append(lgbm_mae_test-lgbm_mae_train) 
    print({i}, 'TEST R2:', lgbm_r2_test, ', MAE:', lgbm_mae_test)

{10} TEST R2: 0.8594634117211949 , MAE: 17397.849294308904
{20} TEST R2: 0.8452620300251685 , MAE: 17968.286819192905
{30} TEST R2: 0.8496770829011758 , MAE: 17519.3971804949
{40} TEST R2: 0.8467046425125285 , MAE: 17855.905693430646
{50} TEST R2: 0.8495531433952419 , MAE: 17769.13229547067
{60} TEST R2: 0.8495531433952419 , MAE: 17769.13229547067
{70} TEST R2: 0.8495531433952419 , MAE: 17769.13229547067
{80} TEST R2: 0.8495531433952419 , MAE: 17769.13229547067
{90} TEST R2: 0.8495531433952419 , MAE: 17769.13229547067
{100} TEST R2: 0.8495531433952419 , MAE: 17769.13229547067
{110} TEST R2: 0.8495531433952419 , MAE: 17769.13229547067
{120} TEST R2: 0.8495531433952419 , MAE: 17769.13229547067
{130} TEST R2: 0.8495531433952419 , MAE: 17769.13229547067
{140} TEST R2: 0.8495531433952419 , MAE: 17769.13229547067
{150} TEST R2: 0.8495531433952419 , MAE: 17769.13229547067


In [24]:
print(min(r2_diff), min(mae_diff))
print(r2_diff.index(min(r2_diff)), mae_diff.index(min(mae_diff)))
print(leaves[r2_diff.index(min(r2_diff))], leaves[mae_diff.index(min(mae_diff))])

0.0987101633526759 6738.829073672934
0 0
10 10
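
The same sweep pattern has now been written out for both the random forest and the LGBM model, so it could be factored into one helper that returns a table. The sketch below is an assumption-laden illustration: `sweep` and `make_model` are hypothetical names, the data is synthetic, and scikit-learn's `GradientBoostingRegressor` with `max_leaf_nodes` stands in for `LGBMRegressor` with `num_leaves` so that the example runs without LightGBM.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def sweep(make_model, values, X_tr, y_tr, X_te, y_te):
    """Fit one model per value and tabulate test scores and train-test gaps."""
    rows = []
    for value in values:
        model = make_model(value).fit(X_tr, y_tr)
        r2_tr, r2_te = model.score(X_tr, y_tr), model.score(X_te, y_te)
        mae_tr = mean_absolute_error(y_tr, model.predict(X_tr))
        mae_te = mean_absolute_error(y_te, model.predict(X_te))
        rows.append({"value": value, "r2_test": r2_te, "mae_test": mae_te,
                     "r2_gap": r2_tr - r2_te, "mae_gap": mae_te - mae_tr})
    return pd.DataFrame(rows).set_index("value")


# Synthetic stand-in data (illustrative only)
X, y = make_regression(n_samples=400, n_features=10, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# max_depth is raised so that max_leaf_nodes, not depth, limits tree size
results = sweep(
    lambda leaves: GradientBoostingRegressor(
        max_leaf_nodes=leaves, max_depth=10, random_state=123),
    range(10, 60, 10), X_tr, y_tr, X_te, y_te,
)
print(results)
print("smallest R2 gap at:", results["r2_gap"].idxmin())
```

Passing a different factory, for example one that builds an `LGBMRegressor` from a `num_leaves` value, would reuse the same loop unchanged.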


Setting the parameter num_leaves=10 clearly improved the model's testing performance, but the model is still overfitted.

In [25]:
lgbm = LGBMRegressor(num_leaves=10, random_state=123)
lgbm.fit(X_train, y_train)
lgbm_y_train_pred = lgbm.predict(X_train)
lgbm_r2_train = lgbm.score(X_train, y_train)
lgbm_mae_train = mean_absolute_error(y_train, lgbm_y_train_pred)
print('TRAIN R2:', lgbm_r2_train, ', MAE:', lgbm_mae_train)
lgbm_y_test_pred = lgbm.predict(X_test)
lgbm_r2_test = lgbm.score(X_test, y_test)
lgbm_mae_test = mean_absolute_error(y_test, lgbm_y_test_pred)
print('TEST R2:', lgbm_r2_test, ', MAE:', lgbm_mae_test)

TRAIN R2: 0.9581735750738708 , MAE: 10659.02022063597
TEST R2: 0.8594634117211949 , MAE: 17397.849294308904


To prevent overfitting, I will adjust the parameter min_data_in_leaf. As before, I will iterate over a range of possible values to see which leads to the better performance.

In [26]:
leaves = []
r2_diff = []
mae_diff = []

for i in range(10, 310, 10):
    lgbm = LGBMRegressor(num_leaves=10, min_data_in_leaf=i, random_state=123)
    lgbm.fit(X_train, y_train)
    lgbm_y_train_pred = lgbm.predict(X_train)
    lgbm_r2_train = lgbm.score(X_train, y_train)
    lgbm_mae_train = mean_absolute_error(y_train, lgbm_y_train_pred)
    #print({i}, 'TRAIN R2:', lgbm_r2_train, ', MAE:', lgbm_mae_train)
    lgbm_y_test_pred = lgbm.predict(X_test)
    lgbm_r2_test = lgbm.score(X_test, y_test)
    lgbm_mae_test = mean_absolute_error(y_test, lgbm_y_test_pred)
    leaves.append(i)
    r2_diff.append(lgbm_r2_train - lgbm_r2_test)
    mae_diff.append(lgbm_mae_test-lgbm_mae_train) 
    print({i}, 'TEST R2:', lgbm_r2_test, ', MAE:', lgbm_mae_test)

{10} TEST R2: 0.8637098980850493 , MAE: 17067.585589870814
{20} TEST R2: 0.8594634117211949 , MAE: 17397.849294308904
{30} TEST R2: 0.8677917704363621 , MAE: 16973.465692653655
{40} TEST R2: 0.86842942410872 , MAE: 16984.84373594008
{50} TEST R2: 0.857732491976405 , MAE: 17529.681041150707
{60} TEST R2: 0.8593371637312683 , MAE: 17489.426370786805
{70} TEST R2: 0.8279364968618854 , MAE: 19184.18063181552
{80} TEST R2: 0.8270402813748383 , MAE: 19381.032084283986
{90} TEST R2: 0.8234612135755379 , MAE: 19506.0434165535
{100} TEST R2: 0.8092841717084351 , MAE: 20904.976086681156
{110} TEST R2: 0.8096671495050073 , MAE: 21029.3749748495
{120} TEST R2: 0.8063022346983 , MAE: 21078.79289465227
{130} TEST R2: 0.7967275309675871 , MAE: 21964.26421474057
{140} TEST R2: 0.8011568866276068 , MAE: 21570.993163111918
{150} TEST R2: 0.7945727459829284 , MAE: 22286.252829702436
{160} TEST R2: 0.7943672786516713 , MAE: 22065.99464795908
{170} TEST R2: 0.7844504682408888 , MAE: 22910.52586491235
{180}

In [27]:
print(min(r2_diff), min(mae_diff))
print(r2_diff.index(min(r2_diff)), mae_diff.index(min(mae_diff)))
print(leaves[r2_diff.index(min(r2_diff))], leaves[mae_diff.index(min(mae_diff))])

0.02195973018842101 294.10834436078585
24 24
250 250


There we go! Using num_leaves=10 and min_data_in_leaf=250 has minimized the overfitting. Let's rebuild the model below.

In [28]:
lgbm = LGBMRegressor(num_leaves=10, min_data_in_leaf=250, random_state=123)
lgbm.fit(X_train, y_train)
lgbm_y_train_pred = lgbm.predict(X_train)
lgbm_r2_train = lgbm.score(X_train, y_train)
lgbm_mae_train = mean_absolute_error(y_train, lgbm_y_train_pred)
print('TRAIN R2:', lgbm_r2_train, ', MAE:', lgbm_mae_train)
lgbm_y_test_pred = lgbm.predict(X_test)
lgbm_r2_test = lgbm.score(X_test, y_test)
lgbm_mae_test = mean_absolute_error(y_test, lgbm_y_test_pred)
print('TEST R2:', lgbm_r2_test, ', MAE:', lgbm_mae_test)

TRAIN R2: 0.7827620281671568 , MAE: 23909.35211837513
TEST R2: 0.7608022979787358 , MAE: 24203.460462735915


Now that I have tuned some of the parameters, I will take a look at feature importance. I will repeat the feature selection steps I used for my random forest model.

In [29]:
lgbm.feature_importances_

array([ 3, 10,  4, 18,  0,  2, 15, 23,  6,  0, 39,  0,  0,  0, 11,  1,  1,
        6,  2, 14,  7,  6,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0

In [30]:
for i in range(1, 39):
    lgbm = LGBMRegressor(num_leaves=10, min_data_in_leaf=250, random_state=123)
    sel = SelectFromModel(lgbm, threshold=i)
    sel.fit(X_train, y_train)
    selected_feat= X_train.columns[(sel.get_support())]
    print(len(selected_feat))  
    X_train_lgbm = X_train[selected_feat]
    X_test_lgbm = X_test[selected_feat]
    lgbm.fit(X_train_lgbm, y_train)
    lgbm_y_test_pred = lgbm.predict(X_test_lgbm)
    lgbm_r2_test = lgbm.score(X_test_lgbm, y_test)
    lgbm_mae_test = mean_absolute_error(y_test, lgbm_y_test_pred)
    print('TEST R2:', lgbm_r2_test, ', MAE:', lgbm_mae_test)

34
TEST R2: 0.7608022979787358 , MAE: 24203.460462735915
29
TEST R2: 0.7598343010604406 , MAE: 24269.586699173396
27
TEST R2: 0.757702660996993 , MAE: 24289.849569335798
25
TEST R2: 0.7580642262802978 , MAE: 24216.48203657827
19
TEST R2: 0.7515782269865307 , MAE: 24954.85677322027
15
TEST R2: 0.7554922148259646 , MAE: 24153.064266330988
10
TEST R2: 0.7381453209515632 , MAE: 25842.76462971077
9
TEST R2: 0.7361814909457668 , MAE: 25921.848021195085
9
TEST R2: 0.7361814909457668 , MAE: 25921.848021195085
9
TEST R2: 0.7361814909457668 , MAE: 25921.848021195085
7
TEST R2: 0.7187527459528948 , MAE: 26499.22168796868
5
TEST R2: 0.6575851870010971 , MAE: 29045.906747069384
5
TEST R2: 0.6575851870010971 , MAE: 29045.906747069384
5
TEST R2: 0.6575851870010971 , MAE: 29045.906747069384
4
TEST R2: 0.6104984217330727 , MAE: 31295.800308435006
3
TEST R2: 0.562139710465301 , MAE: 33641.74239256328
3
TEST R2: 0.562139710465301 , MAE: 33641.74239256328
3
TEST R2: 0.562139710465301 , MAE: 33641.74239256

For this model too, dropping columns only makes the model perform worse than keeping all of them. My Ridge Regression is outperforming both the Random Forest and LGBM models.

##### Model Score Comparison table:  
  

| Model | R2 Score | MAE | Upper/Lower bound |
| :---: | :---: | :---: | :---: |
| Ridge Regression | 0.8458 | 20716.5322 | TBA |
| Random Forest | 0.7594 | 22925.2674 | TBA |
| LGBM | 0.7608 | 24203.4605 | TBA |
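
The last column of the table is still to be filled in. One common way to obtain such bounds, assuming they are meant as a confidence interval on the test MAE, is a percentile bootstrap over the test-set errors. The sketch below uses made-up values rather than the project's predictions, and `bootstrap_mae_interval` is a hypothetical helper.

```python
import numpy as np


def bootstrap_mae_interval(y_true, y_pred, n_boot=2000, alpha=0.05, seed=123):
    """Return (MAE, lower, upper) using a percentile bootstrap over absolute errors."""
    rng = np.random.default_rng(seed)
    abs_err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    n = abs_err.size
    # Resample the absolute errors with replacement; each row is one bootstrap sample
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_maes = abs_err[idx].mean(axis=1)
    lower, upper = np.quantile(boot_maes, [alpha / 2, 1 - alpha / 2])
    return abs_err.mean(), lower, upper


# Illustrative values only, not the project's test set or predictions
rng = np.random.default_rng(0)
y_true_demo = rng.normal(180_000, 60_000, size=300)
y_pred_demo = y_true_demo + rng.normal(0, 25_000, size=300)

mae, lower, upper = bootstrap_mae_interval(y_true_demo, y_pred_demo)
print(f"MAE: {mae:.0f}, 95% interval: [{lower:.0f}, {upper:.0f}]")
```

Calling the helper with `y_test` and each model's test predictions would give one interval per row of the table.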