# Random Forest

Random forest is an ensemble modeling technique that aggregates many decision trees to reduce the high variance inherent in a single decision tree.  Each tree makes its own prediction, and the forest's output is the aggregate of those predictions: a majority vote for classification or an average for regression.  To make the individual trees differ from one another, each tree is trained on a **bagged** sample of the rows and considers only a **random subspace** of the features.  **Bagging** is sampling with replacement, and the **random subspace** is a random subset of the features (in scikit-learn, a fresh subset is drawn at each split).  In other words, each tree sees a random sample of the rows and only *some* of the features.  This randomness decorrelates the trees, which is a key feature of the model and the main reason that random forests are usually very accurate.  
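
The two sources of randomness can be sketched in a few lines. This is a minimal illustration using only the standard library, not how scikit-learn implements it internally; the helper names `bootstrap_rows` and `random_subspace` are made up for this example.

```python
import random

def bootstrap_rows(n_rows, rng):
    """Bagging: draw n_rows row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_subspace(feature_names, k, rng):
    """Random subspace: pick k distinct features (no replacement)."""
    return rng.sample(feature_names, k)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
features = ["loan_amnt", "int_rate", "installment", "annual_inc", "dti"]

rows = bootstrap_rows(n_rows=8, rng=rng)
subset = random_subspace(features, k=2, rng=rng)

print("bootstrap row indices:", rows)  # repeated indices are expected
print("feature subset:", subset)
```

Because rows are drawn with replacement, some rows typically appear more than once and others not at all, so every tree ends up with a slightly different view of the data.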


## Model Example

We will work through data from [2015 Lending Club](https://www.lendingclub.com/info/download-data.action) to predict the state of a loan given some information about it.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

%matplotlib inline

In [9]:
# skipfooter is only supported by the Python parser engine
yr2015 = pd.read_csv('LoanStats3d.csv',
                    skipinitialspace=True,
                    header=1,
                    skipfooter=2,
                    engine='python')



In [10]:
yr2015.tail()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
421090,36371250,39102635,10000,10000,10000.0,36 months,11.99%,332.1,B,B5,...,0,1,100.0,100.0,0,0,32950,25274,9200,15850
421091,36441262,39152692,24000,24000,24000.0,36 months,11.99%,797.03,B,B5,...,0,2,56.5,100.0,0,0,152650,8621,9000,0
421092,36271333,38982739,13000,13000,13000.0,60 months,15.99%,316.07,D,D2,...,0,3,100.0,50.0,1,0,51239,34178,10600,33239
421093,36490806,39222577,12000,12000,12000.0,60 months,19.99%,317.86,E,E3,...,1,2,95.0,66.7,0,0,96919,58418,9700,69919
421094,36271262,38982659,20000,20000,20000.0,36 months,11.99%,664.2,B,B5,...,0,1,100.0,50.0,0,1,43740,33307,41700,0


In [11]:
yr2015.dtypes

id                                  int64
member_id                           int64
loan_amnt                           int64
funded_amnt                         int64
funded_amnt_inv                   float64
term                               object
int_rate                           object
installment                       float64
grade                              object
sub_grade                          object
emp_title                          object
emp_length                         object
home_ownership                     object
annual_inc                        float64
verification_status                object
issue_d                            object
loan_status                        object
pymnt_plan                         object
url                                object
desc                               object
purpose                            object
title                              object
zip_code                           object
addr_state                        

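scikit-learn's decision trees, and therefore its random forests, require numeric inputs, so the categorical (`object`) columns must be one-hot encoded before modeling.  Each unique value becomes its own column, so we need to cull the categorical columns based on the number of unique values each has.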
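
To see why the number of unique values matters, here is a small sketch of how one-hot encoding widens a table. The toy values are made up for illustration and are not taken from `LoanStats3d.csv`.

```python
import pandas as pd

# Toy data (invented for this example)
toy = pd.DataFrame({
    "term": ["36 months", "60 months", "36 months", "60 months"],
    "emp_title": ["Teacher", "Nurse", "Driver", "Manager"],
    "loan_amnt": [10000, 24000, 13000, 12000],
})

encoded = pd.get_dummies(toy)

# Each text column becomes one new column per unique value;
# the numeric column passes through unchanged.
print(toy[["term", "emp_title"]].nunique())
print(encoded.columns.tolist())
```

A column like `term` adds only two columns, while a column where almost every row is unique (such as `emp_title` in the real data) would add one column per distinct value.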

In [13]:
categorical = yr2015.select_dtypes(include=['object'])
for i in categorical:
    column = categorical[i]
    print(i)
    print(column.nunique())

term
2
int_rate
110
grade
7
sub_grade
35
emp_title
120812
emp_length
11
home_ownership
4
verification_status
3
issue_d
12
loan_status
7
pymnt_plan
1
url
421095
desc
34
purpose
14
title
27
zip_code
914
addr_state
49
earliest_cr_line
668
revol_util
1211
initial_list_status
2
last_pymnt_d
25
next_pymnt_d
4
last_credit_pull_d
26
application_type
2
verification_status_joint
3


In [14]:
# Converting interest rate to numeric
yr2015['int_rate'] = pd.to_numeric(yr2015['int_rate'].str.strip('%'), errors='coerce')

# Dropping columns with too many unique values
yr2015.drop(['emp_title', 'url', 'zip_code', 'addr_state', 'earliest_cr_line', 
             'revol_util'], axis=1, inplace=True)



In [15]:
# Testing the get_dummies method on the full dataset
# This can be very memory intensive, so the data may need to be pared down first
pd.get_dummies(yr2015)

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,dti,delinq_2yrs,...,last_credit_pull_d_Nov-2016,last_credit_pull_d_Oct-2015,last_credit_pull_d_Oct-2016,last_credit_pull_d_Sep-2015,last_credit_pull_d_Sep-2016,application_type_INDIVIDUAL,application_type_JOINT,verification_status_joint_Not Verified,verification_status_joint_Source Verified,verification_status_joint_Verified
0,68009401,72868139,16000,16000,16000.0,14.85,379.39,48000.0,33.18,0,...,0,0,0,0,0,1,0,0,0,0
1,68354783,73244544,9600,9600,9600.0,7.49,298.58,60000.0,22.44,0,...,0,0,0,0,0,1,0,0,0,0
2,68466916,73356753,25000,25000,25000.0,7.49,777.55,109000.0,26.02,0,...,0,0,0,0,0,1,0,0,0,0
3,68466961,73356799,28000,28000,28000.0,6.49,858.05,92000.0,21.60,0,...,0,0,0,0,0,1,0,0,0,0
4,68495092,73384866,8650,8650,8650.0,19.89,320.99,55000.0,25.49,0,...,0,0,0,0,0,1,0,0,0,0
5,68506798,73396623,23000,23000,23000.0,8.49,471.77,64000.0,18.28,0,...,0,0,0,0,0,1,0,0,0,0
6,68566886,73456723,29900,29900,29900.0,12.88,678.49,65000.0,21.77,0,...,0,0,0,0,0,1,0,0,0,0
7,68577849,73467703,18000,18000,18000.0,11.99,400.31,112000.0,8.68,0,...,0,0,0,0,0,1,0,0,0,0
8,66310712,71035433,35000,35000,35000.0,14.85,829.90,110000.0,17.06,0,...,0,0,0,0,0,1,0,0,0,0
9,68476807,73366655,10400,10400,10400.0,22.45,289.91,104433.0,25.37,1,...,0,0,0,0,0,1,0,0,0,0


In [16]:
# Instantiating the model

rfc = ensemble.RandomForestClassifier()

X = yr2015.drop('loan_status', axis=1)
X = pd.get_dummies(X)
# Dropping columns that contain NA values instead of imputing,
# because the data is probably rich enough
X = X.dropna(axis=1)
Y = yr2015['loan_status']

cross_val_score(rfc, X, Y, cv=10)



array([0.97962528, 0.98026644, 0.98157251, 0.98159626, 0.97751128,
       0.98019473, 0.94609228, 0.98033675, 0.97981333, 0.98040659])

Those are great cross-validation scores, but the model is bloated.  Let's see if we can trim the input data down to strictly what's necessary to get CV scores of ~90%.
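
When comparing models by their fold scores, it can help to reduce each set of scores to a mean and a spread. A small sketch using the ten scores printed above (standard library only; the variable names are illustrative):

```python
import statistics

# The ten fold scores reported by cross_val_score above
scores = [0.97962528, 0.98026644, 0.98157251, 0.98159626, 0.97751128,
          0.98019473, 0.94609228, 0.98033675, 0.97981333, 0.98040659]

mean_score = statistics.mean(scores)
spread = statistics.stdev(scores)   # sample standard deviation across folds
lowest_fold = min(scores)

print(f"mean={mean_score:.4f}  std={spread:.4f}  min={lowest_fold:.4f}")
```

The minimum is worth printing alongside the mean because a single weak fold, like the seventh one here, is easy to miss in the raw array.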

In [18]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y)

In [19]:
rfc.fit(x_train, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [61]:
rfc1_fi = rfc.feature_importances_
indices = np.argsort(rfc1_fi)
feat_names = X.columns

In [76]:
# Top 30 features in the model, listed from least to most important
top_thirty_features = []

for i in indices[-30:]:
    print(f'{i}.  {feat_names[i]}')
    top_thirty_features.append(feat_names[i])

222.  last_pymnt_d_Jul-2016
1.  member_id
21.  total_rec_late_fee
5.  int_rate
213.  last_pymnt_d_Aug-2016
217.  last_pymnt_d_Feb-2016
242.  last_credit_pull_d_Aug-2016
224.  last_pymnt_d_Jun-2016
6.  installment
20.  total_rec_int
2.  loan_amnt
234.  last_pymnt_d_Sep-2016
4.  funded_amnt_inv
3.  funded_amnt
18.  total_pymnt_inv
232.  last_pymnt_d_Oct-2016
245.  last_credit_pull_d_Dec-2016
23.  collection_recovery_fee
22.  recoveries
230.  last_pymnt_d_Nov-2016
262.  last_credit_pull_d_Oct-2016
235.  next_pymnt_d_Feb-2017
215.  last_pymnt_d_Dec-2016
17.  total_pymnt
19.  total_rec_prncp
250.  last_credit_pull_d_Jan-2017
220.  last_pymnt_d_Jan-2017
16.  out_prncp_inv
24.  last_pymnt_amnt
15.  out_prncp


The feature names with dates attached come from the one-hot encoding.  I will use the corresponding base features in the next model.
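
Mapping encoded names back to their base columns can be done by hand, as in this notebook, or with a small helper. The sketch below is one possible approach; `base_feature` is a hypothetical function written for this example, and it assumes encoded names follow the `column_value` pattern that `pd.get_dummies` produces by default.

```python
def base_feature(encoded_name, base_columns):
    """Return the original column an encoded name came from, or None."""
    # Check longer names first so 'out_prncp_inv' is not mistaken for 'out_prncp'
    for col in sorted(base_columns, key=len, reverse=True):
        if encoded_name == col or encoded_name.startswith(col + "_"):
            return col
    return None

base_columns = ["last_pymnt_d", "last_credit_pull_d", "next_pymnt_d",
                "out_prncp", "out_prncp_inv", "last_pymnt_amnt"]
encoded_names = ["last_pymnt_d_Jul-2016", "last_credit_pull_d_Jan-2017",
                 "next_pymnt_d_Feb-2017", "last_pymnt_d_Jan-2017",
                 "out_prncp_inv", "out_prncp"]

# Collect each base column once, in order of first appearance
bases = []
for name in encoded_names:
    col = base_feature(name, base_columns)
    if col is not None and col not in bases:
        bases.append(col)

print(bases)
```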

In [78]:
# The top 30 encoded features reduce to this list of 14 base features for the next model.
x1 = yr2015.loc[:, 
                ['loan_amnt',
                'funded_amnt_inv',
                'funded_amnt',
                'total_pymnt_inv',
                'collection_recovery_fee',
                'recoveries',
                'next_pymnt_d',
                'total_pymnt',
                'total_rec_prncp',
                'last_credit_pull_d',
                'last_pymnt_d',
                'out_prncp_inv',
                'last_pymnt_amnt',
                'out_prncp']]


In [79]:
x1.head()

Unnamed: 0,loan_amnt,funded_amnt_inv,funded_amnt,total_pymnt_inv,collection_recovery_fee,recoveries,next_pymnt_d,total_pymnt,total_rec_prncp,last_credit_pull_d,last_pymnt_d,out_prncp_inv,last_pymnt_amnt,out_prncp
0,16000,16000.0,16000,4519.68,0.0,0.0,Jan-2017,4519.68,2331.12,Jan-2017,Jan-2017,13668.88,379.39,13668.88
1,9600,9600.0,9600,3572.97,0.0,0.0,Jan-2017,3572.97,2964.31,Jan-2017,Jan-2017,6635.69,298.58,6635.69
2,25000,25000.0,25000,26224.23,0.0,0.0,,26224.23,25000.0,Jan-2017,Sep-2016,0.0,20807.39,0.0
3,28000,28000.0,28000,10271.36,0.0,0.0,Jan-2017,10271.36,8736.23,Jan-2017,Jan-2017,19263.77,858.05,19263.77
4,8650,8650.0,8650,9190.49,0.0,0.0,,9190.49,8650.0,Jun-2016,May-2016,0.0,8251.42,0.0


In [83]:
x1 = pd.get_dummies(x1)
x1 = x1.dropna(axis=1)
y1 = Y

rfc1 = ensemble.RandomForestClassifier()

cross_val_score(rfc1, x1, y1, cv=10)



array([0.96815559, 0.97214505, 0.9772506 , 0.98088385, 0.96943719,
       0.97492282, 0.97249994, 0.97869814, 0.97176241, 0.98100033])

Those are still great scores.  I will continue to pare down the model, trying to get ~90% CV scores with the least amount of input data.

In [85]:
# Model with just the top four features

x2 = yr2015.loc[:,
               ['last_pymnt_d',
                'out_prncp_inv',
                'last_pymnt_amnt',
                'out_prncp']]
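# Note: dropna(axis=1) runs before get_dummies here, so any column with
# missing values is dropped before it can be encoded
x2 = x2.dropna(axis=1)
x2 = pd.get_dummies(x2)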

y2 = y1

rfc2 = ensemble.RandomForestClassifier(n_estimators=10)

cross_val_score(rfc2, x2, y2, cv=10)

array([0.87361972, 0.93574126, 0.91394173, 0.94267531, 0.94101164,
       0.93478984, 0.91569498, 0.9261678 , 0.93758757, 0.93908232])

That's probably about right.  The CV scores vary from fold to fold, but that could most likely be reduced by increasing the number of estimators.
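
One way to check that hunch is to rerun the cross-validation with different values of `n_estimators` and compare the spread of the fold scores. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption made so the example is self-contained; it is not the Lending Club data), so the numbers it prints say nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, not the Lending Club data
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=4, random_state=0)

results = {}
for n in (10, 100):
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[n] = (scores.mean(), scores.std())
    print(f"n_estimators={n}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

More trees usually make each fold's prediction more stable, but part of the fold-to-fold spread comes from differences between the folds themselves, so a larger forest is not guaranteed to remove it.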