# Random Forest: Guided Example -- Kristofer Schobert

## I will be following along with Thinkful's code, which exemplifies a Random Forest model

We will try to classify a loan's status based on a large number of features.

After following the course's model, we are asked to reduce the number of features while keeping the average cross-validation score of our random forest above 90%. I did this using Principal Component Analysis (PCA). I chose 50 principal components. This is a large reduction from the 200+ features used in the model from Thinkful's walkthrough.



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Replace the path with the correct path for your data.
y2015 = pd.read_csv(
    'https://www.dropbox.com/s/0so14yudedjmm5m/LoanStats3d.csv?dl=1',
    skipinitialspace=True,
    header=1
)

# Note the warning about dtypes.
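
A dtype warning of this kind usually means pandas found mixed value types in a column while inferring types chunk by chunk. The warning text itself suggests either passing `dtype` or setting `low_memory=False`. Below is a minimal sketch of the second option on a tiny in-memory CSV. The footer row is a made-up stand-in, not taken from the loan file, and a file this small would not trigger the warning in the first place, so this only illustrates the parameter.

```python
import io

import pandas as pd

# Tiny stand-in for the loan file; the text footer row is hypothetical.
csv_text = "id,loan_amnt\n1,1000\n2,2500\nTotal amount funded,\n"

# low_memory=False lets pandas infer each column's dtype from the whole file
# at once instead of chunk by chunk.
demo = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# The footer text keeps 'id' from being parsed as a numeric column.
print(demo.dtypes)
print(pd.api.types.is_numeric_dtype(demo['id']))
```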



In [3]:
y2015.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
0,68009401,72868139.0,16000.0,16000.0,16000.0,60 months,14.85%,379.39,C,C5,...,0.0,2.0,78.9,0.0,0.0,2.0,298100.0,31329.0,281300.0,13400.0
1,68354783,73244544.0,9600.0,9600.0,9600.0,36 months,7.49%,298.58,A,A4,...,0.0,2.0,100.0,66.7,0.0,0.0,88635.0,55387.0,12500.0,75635.0
2,68466916,73356753.0,25000.0,25000.0,25000.0,36 months,7.49%,777.55,A,A4,...,0.0,0.0,100.0,20.0,0.0,0.0,373572.0,68056.0,38400.0,82117.0
3,68466961,73356799.0,28000.0,28000.0,28000.0,36 months,6.49%,858.05,A,A2,...,0.0,0.0,91.7,22.2,0.0,0.0,304003.0,74920.0,41500.0,42503.0
4,68495092,73384866.0,8650.0,8650.0,8650.0,36 months,19.89%,320.99,E,E3,...,0.0,12.0,100.0,50.0,1.0,0.0,38998.0,18926.0,2750.0,18248.0


In [4]:
# viewing the columns with categorical data
y2015.select_dtypes(include='object').head()

Unnamed: 0,id,term,int_rate,grade,sub_grade,emp_title,emp_length,home_ownership,verification_status,issue_d,...,zip_code,addr_state,earliest_cr_line,revol_util,initial_list_status,last_pymnt_d,next_pymnt_d,last_credit_pull_d,application_type,verification_status_joint
0,68009401,60 months,14.85%,C,C5,Bookkeeper/Accounting,10+ years,MORTGAGE,Not Verified,Dec-2015,...,297xx,SC,Jun-1991,29.6%,w,Jan-2017,Jan-2017,Jan-2017,INDIVIDUAL,
1,68354783,36 months,7.49%,A,A4,tech,8 years,MORTGAGE,Not Verified,Dec-2015,...,299xx,SC,Jun-1996,59.4%,w,Jan-2017,Jan-2017,Jan-2017,INDIVIDUAL,
2,68466916,36 months,7.49%,A,A4,Sales Manager,10+ years,MORTGAGE,Not Verified,Dec-2015,...,226xx,VA,Dec-2001,54.3%,w,Sep-2016,,Jan-2017,INDIVIDUAL,
3,68466961,36 months,6.49%,A,A2,Senior Manager,10+ years,MORTGAGE,Not Verified,Dec-2015,...,275xx,NC,May-1984,64.5%,w,Jan-2017,Jan-2017,Jan-2017,INDIVIDUAL,
4,68495092,36 months,19.89%,E,E3,Program Coordinator,8 years,RENT,Verified,Dec-2015,...,462xx,IN,Mar-2005,46%,w,May-2016,,Jun-2016,INDIVIDUAL,


In [5]:
# seeing the number of different categories for each of these columns
categorical = y2015.select_dtypes(include=['object'])
for i in categorical:
    column = categorical[i]
    print(i)
    print(column.nunique())

id
421097
term
2
int_rate
110
grade
7
sub_grade
35
emp_title
120812
emp_length
11
home_ownership
4
verification_status
3
issue_d
12
loan_status
7
pymnt_plan
1
url
421095
desc
34
purpose
14
title
27
zip_code
914
addr_state
49
earliest_cr_line
668
revol_util
1211
initial_list_status
2
last_pymnt_d
25
next_pymnt_d
4
last_credit_pull_d
26
application_type
2
verification_status_joint
3
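
The same per-column counts can be produced in a single call and sorted, which makes the high-cardinality columns easier to spot. This is a small sketch on a toy frame: the column names mirror the loan data, but the values and the cutoff of 3 are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for y2015; values are illustrative only.
toy = pd.DataFrame({
    'term': ['36 months', '60 months', '36 months', '36 months'],
    'grade': ['A', 'B', 'C', 'A'],
    'emp_title': ['Bookkeeper', 'tech', 'Sales Manager', 'Senior Manager'],
    'loan_amnt': [16000.0, 9600.0, 25000.0, 28000.0],
})

# Count distinct values in every non-numeric column, largest first.
unique_counts = (
    toy.select_dtypes(exclude='number')
       .nunique()
       .sort_values(ascending=False)
)
print(unique_counts)

# Hypothetical cutoff: flag columns with more than 3 categories as drop candidates.
max_categories = 3
too_many = unique_counts[unique_counts > max_categories].index.tolist()
print(too_many)
```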


In [6]:

# Convert ID and Interest Rate to numeric.
# errors='coerce' turns any value that cannot be parsed into NaN.
y2015['id'] = pd.to_numeric(y2015['id'], errors='coerce')
y2015['int_rate'] = pd.to_numeric(y2015['int_rate'].str.strip('%'), errors='coerce')

# Drop other columns with many unique values.
# axis=1 is passed by keyword; recent pandas versions no longer accept it positionally.
y2015.drop(['url', 'emp_title', 'zip_code', 'earliest_cr_line', 'revol_util',
            'sub_grade', 'addr_state', 'desc'], axis=1, inplace=True)
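
To see what the interest rate conversion does in isolation, here is a small sketch on a hand-made Series. The `'n/a'` entry is a hypothetical bad value, included only to show the effect of `errors='coerce'`.

```python
import pandas as pd

# Two rates in the same "12.34%" format as int_rate, plus one hypothetical bad value.
rates = pd.Series(['14.85%', '7.49%', 'n/a'])

# Strip the percent sign, then parse; unparseable entries become NaN.
numeric_rates = pd.to_numeric(rates.str.strip('%'), errors='coerce')
print(numeric_rates)
```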

In [8]:
# inspecting the tail of the DataFrame; the last two rows contain only NaN values
y2015.tail()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,emp_length,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
421092,36271333.0,38982739.0,13000.0,13000.0,13000.0,60 months,15.99,316.07,D,5 years,...,0.0,3.0,100.0,50.0,1.0,0.0,51239.0,34178.0,10600.0,33239.0
421093,36490806.0,39222577.0,12000.0,12000.0,12000.0,60 months,19.99,317.86,E,1 year,...,1.0,2.0,95.0,66.7,0.0,0.0,96919.0,58418.0,9700.0,69919.0
421094,36271262.0,38982659.0,20000.0,20000.0,20000.0,36 months,11.99,664.2,B,10+ years,...,0.0,1.0,100.0,50.0,0.0,1.0,43740.0,33307.0,41700.0,0.0
421095,,,,,,,,,,,...,,,,,,,,,,
421096,,,,,,,,,,,...,,,,,,,,,,


In [9]:
# removing bottom two rows with nan values
y2015 = y2015[:-2]
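
Slicing off the last two rows works here because we inspected the tail first. An alternative that does not depend on row position is to drop every row in which all values are missing. This is a sketch on a toy frame with made-up trailing rows; on the real data it would give the same result only if those two rows are the only fully empty ones.

```python
import numpy as np
import pandas as pd

# Toy frame with two fully empty trailing rows, mimicking the tail shown above.
toy = pd.DataFrame({
    'id': [36271333.0, 36490806.0, np.nan, np.nan],
    'loan_amnt': [13000.0, 12000.0, np.nan, np.nan],
    'grade': ['D', 'E', None, None],
})

# how='all' drops a row only when every column in it is missing.
cleaned = toy.dropna(how='all')
print(cleaned.shape)
```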

In [10]:
# we see no more nan values in the last two rows
y2015.tail()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,emp_length,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
421090,36371250.0,39102635.0,10000.0,10000.0,10000.0,36 months,11.99,332.1,B,8 years,...,0.0,1.0,100.0,100.0,0.0,0.0,32950.0,25274.0,9200.0,15850.0
421091,36441262.0,39152692.0,24000.0,24000.0,24000.0,36 months,11.99,797.03,B,10+ years,...,0.0,2.0,56.5,100.0,0.0,0.0,152650.0,8621.0,9000.0,0.0
421092,36271333.0,38982739.0,13000.0,13000.0,13000.0,60 months,15.99,316.07,D,5 years,...,0.0,3.0,100.0,50.0,1.0,0.0,51239.0,34178.0,10600.0,33239.0
421093,36490806.0,39222577.0,12000.0,12000.0,12000.0,60 months,19.99,317.86,E,1 year,...,1.0,2.0,95.0,66.7,0.0,0.0,96919.0,58418.0,9700.0,69919.0
421094,36271262.0,38982659.0,20000.0,20000.0,20000.0,36 months,11.99,664.2,B,10+ years,...,0.0,1.0,100.0,50.0,0.0,1.0,43740.0,33307.0,41700.0,0.0


In [11]:
# inspecting how the number of features grows when each categorical feature is turned into one dummy column per category
print(pd.get_dummies(y2015).shape)
print(y2015.shape)

(421095, 237)
(421095, 103)
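
The jump from 103 to 237 columns comes from one-hot encoding: every categorical column is replaced by one indicator column per category. A toy example (values made up) makes the mechanics visible.

```python
import pandas as pd

# One numeric column and two categorical columns with two categories each.
toy = pd.DataFrame({
    'loan_amnt': [16000.0, 9600.0, 25000.0],
    'term': ['60 months', '36 months', '36 months'],
    'grade': ['C', 'A', 'A'],
})

# Numeric columns pass through; each categorical column becomes one column per category.
encoded = pd.get_dummies(toy)
print(toy.shape, encoded.shape)
print(list(encoded.columns))
```

Here 3 columns become 5: the numeric column plus two indicators for each of the two categorical columns.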


In [12]:
# the model presented in Thinkful's lesson

from sklearn import ensemble
from sklearn.model_selection import cross_val_score

rfc = ensemble.RandomForestClassifier()
X = y2015.drop('loan_status', axis=1)
Y = y2015['loan_status']
X = pd.get_dummies(X)
X = X.dropna(axis=1)  # drop every column that still contains missing values

cross_val_score(rfc, X, Y, cv=10)



array([0.97976776, 0.98029019, 0.9813113 , 0.9816675 , 0.95891712,
       0.97898361, 0.9617184 , 0.98062172, 0.97990833, 0.9800741 ])
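
Because a random forest is randomized, rerunning the cell above gives slightly different scores each time. The sketch below shows one way to make such a run repeatable and to summarize it. It uses synthetic data from `make_classification` rather than the loan data, and the choices of `n_estimators=10` and `random_state=0` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the notebook itself uses the dummy-encoded loan features.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

# Setting n_estimators explicitly avoids depending on the library default
# (which changed from 10 to 100 in scikit-learn 0.22); random_state fixes the seed.
rfc_demo = RandomForestClassifier(n_estimators=10, random_state=0)
scores = cross_val_score(rfc_demo, X_demo, y_demo, cv=5)

# Mean and spread across folds summarize the run in two numbers.
print(scores.mean(), scores.std())
```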

## Model 1

We will use 50 principal components as our features.
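
One common way to sanity-check a component count is to look at the cumulative explained variance. The sketch below does this on synthetic data, not the loan data, and the 80% target is a hypothetical threshold chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in data with 30 features.
X_demo, _ = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=0
)

pca = PCA(n_components=10)
pca.fit(X_demo)

# Running total of the variance share captured by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative.round(3))

# Smallest number of components reaching the hypothetical target, if any of the 10 do.
target = 0.80
reached = np.nonzero(cumulative >= target)[0]
n_needed = int(reached[0]) + 1 if reached.size else None
print(n_needed)
```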

In [42]:
# my model using PCA with 50 components

from sklearn.decomposition import PCA

rfc = ensemble.RandomForestClassifier()
Y = y2015['loan_status']
X = y2015.drop('loan_status', axis=1)
X = pd.get_dummies(X)
X = X.dropna(axis=1)  # drop every column that still contains missing values

# project the dummy-encoded features onto the first 50 principal components
sklearn_pca = PCA(n_components=50)
X_PCA = sklearn_pca.fit_transform(X)

In [43]:
cross_val_score(rfc, X_PCA, Y, cv=10)

array([0.80271188, 0.96112655, 0.96162523, 0.95984422, 0.95922584,
       0.95841843, 0.96022228, 0.95704006, 0.95801173, 0.94801216])
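
The cells above fit PCA once on the full, unscaled data and then cross-validate on the transformed features. A variant worth trying standardizes the features and refits PCA inside each fold by wrapping the steps in a pipeline. This is not what the notebook does, and its scores would differ from those shown. The sketch below uses synthetic data, and the component and tree counts are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; swap in the dummy-encoded loan features to try it for real.
X_demo, y_demo = make_classification(n_samples=500, n_features=30, random_state=0)

# Each fold scales, projects, and trains using only that fold's training rows.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
pipe_scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(pipe_scores.mean())
```

Scaling matters for PCA because unscaled columns with large values (dollar amounts, for example) tend to dominate the leading components.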

## Model 1 Results

Most folds score a consistent 95-96%, with the first fold an outlier at about 80%. The mean is still above 90% (about 94%). We are also using far fewer features than before: 50 principal components instead of 200+.

The Thinkful lesson brought to my attention the outstanding principal column (out_prncp), which, as we can see below, tracks the outcome far too closely. We will remove it for our next model.

In [54]:
# compare outstanding principal with loan status side by side
y2015[['out_prncp', 'loan_status']].head(25)

Unnamed: 0,out_prncp,loan_status
0,13668.88,Current
1,6635.69,Current
2,0.0,Fully Paid
3,19263.77,Current
4,0.0,Fully Paid
5,19143.69,Current
6,25346.66,Current
7,0.0,Fully Paid
8,29900.89,Current
9,0.0,Fully Paid
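
The pattern in these rows can be checked directly. The sketch below rebuilds just the ten rows shown above and computes, for each loan status, the share of loans with zero outstanding principal. Running the same grouping on the full y2015 frame would be the natural next step.

```python
import pandas as pd

# The ten (out_prncp, loan_status) pairs displayed above.
pairs = pd.DataFrame({
    'out_prncp': [13668.88, 6635.69, 0.0, 19263.77, 0.0,
                  19143.69, 25346.66, 0.0, 29900.89, 0.0],
    'loan_status': ['Current', 'Current', 'Fully Paid', 'Current', 'Fully Paid',
                    'Current', 'Current', 'Fully Paid', 'Current', 'Fully Paid'],
})

# Share of loans in each status that have nothing left to repay.
zero_share = (pairs['out_prncp'] == 0).groupby(pairs['loan_status']).mean()
print(zero_share)
```

In this sample, zero outstanding principal separates the two statuses perfectly, which is why the column acts as a leak of the outcome.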


Yes, outstanding principal is too closely tied to the outcome variable: in the rows shown, it is zero for every fully paid loan and positive for every current one. Let's remove this column from the data. We will also try using only 20 principal components this time.

## Model 2

In [50]:
# model 2: drop out_prncp and use 20 principal components

from sklearn.decomposition import PCA

rfc = ensemble.RandomForestClassifier()
Y = y2015['loan_status']
X = y2015.drop(['loan_status', 'out_prncp'], axis=1)
X = pd.get_dummies(X)
X = X.dropna(axis=1)  # drop every column that still contains missing values

# project the dummy-encoded features onto the first 20 principal components
sklearn_pca = PCA(n_components=20)
X_PCA = sklearn_pca.fit_transform(X)
cross_val_score(rfc, X_PCA, Y, cv=10)

array([0.80316307, 0.95488115, 0.95640094, 0.95480991, 0.95404892,
       0.95214913, 0.95129307, 0.94352751, 0.93623388, 0.91020282])
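
To compare the three runs in this notebook at a glance, the fold scores printed above can be summarized by their mean and minimum. The arrays below are copied from the outputs; the labels are mine.

```python
import numpy as np

# Fold scores copied from the three cross_val_score outputs in this notebook.
full_model = np.array([0.97976776, 0.98029019, 0.9813113, 0.9816675, 0.95891712,
                       0.97898361, 0.9617184, 0.98062172, 0.97990833, 0.9800741])
pca_50 = np.array([0.80271188, 0.96112655, 0.96162523, 0.95984422, 0.95922584,
                   0.95841843, 0.96022228, 0.95704006, 0.95801173, 0.94801216])
pca_20 = np.array([0.80316307, 0.95488115, 0.95640094, 0.95480991, 0.95404892,
                   0.95214913, 0.95129307, 0.94352751, 0.93623388, 0.91020282])

results = {
    'all features': full_model,
    'PCA, 50 components': pca_50,
    'PCA, 20 components, no out_prncp': pca_20,
}
for name, scores in results.items():
    print(f"{name}: mean={scores.mean():.4f}, min={scores.min():.4f}")
```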

## Model 2 Results

Our second model performs similarly well. This time we use only 20 principal components and no outstanding principal feature, and the results are very close to those of Model 1. We once again have a mean cross-validation score above 90% (about 93%).