In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

We've talked about Random Forests. Now it's time to build one.

Here we'll use data from Lending Club (2015) to predict the status of a loan given some information about it. You can download the dataset [here](https://www.dropbox.com/s/0so14yudedjmm5m/LoanStats3d.csv?dl=1).

In [2]:
# Replace the path with the correct path for your data.
y2015 = pd.read_csv(
    'https://www.dropbox.com/s/0so14yudedjmm5m/LoanStats3d.csv?dl=1',
    skipinitialspace=True,
    header=1
)

# Note the warning about dtypes.

In [3]:
y2015.shape

(421097, 111)

In [3]:
y2015.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
0,68009401,72868139.0,16000.0,16000.0,16000.0,60 months,14.85%,379.39,C,C5,...,0.0,2.0,78.9,0.0,0.0,2.0,298100.0,31329.0,281300.0,13400.0
1,68354783,73244544.0,9600.0,9600.0,9600.0,36 months,7.49%,298.58,A,A4,...,0.0,2.0,100.0,66.7,0.0,0.0,88635.0,55387.0,12500.0,75635.0
2,68466916,73356753.0,25000.0,25000.0,25000.0,36 months,7.49%,777.55,A,A4,...,0.0,0.0,100.0,20.0,0.0,0.0,373572.0,68056.0,38400.0,82117.0
3,68466961,73356799.0,28000.0,28000.0,28000.0,36 months,6.49%,858.05,A,A2,...,0.0,0.0,91.7,22.2,0.0,0.0,304003.0,74920.0,41500.0,42503.0
4,68495092,73384866.0,8650.0,8650.0,8650.0,36 months,19.89%,320.99,E,E3,...,0.0,12.0,100.0,50.0,1.0,0.0,38998.0,18926.0,2750.0,18248.0


## Data Cleaning

Well, `get_dummies` can be very memory intensive, particularly if the data are poorly typed. We got a warning about that earlier. Mixed data types get converted to objects, and that can create huge problems. Our dataset has about 400,000 rows. If a column has a bad type, `get_dummies` can see up to 400,000 distinct values in it and try to create a dummy column for each one. That's bad. Let's look at all our categorical variables and see how many distinct values each one has.
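
To make the memory concern concrete, here is a minimal sketch that estimates how many dummy columns a frame would produce and how many bytes a dense indicator matrix would need. The helper name `estimate_dummy_footprint` and the toy frame are assumptions for illustration, and the one-byte-per-cell figure assumes a dense `uint8` or boolean result.

```python
import pandas as pd


def estimate_dummy_footprint(frame):
    """Estimate the columns and bytes needed to one-hot encode the object columns."""
    objects = frame.select_dtypes(include=["object"])
    # get_dummies creates one indicator column per distinct non-null value.
    n_dummy_columns = int(objects.nunique().sum())
    # Assumption: one byte per cell in a dense indicator matrix.
    n_bytes = len(frame) * n_dummy_columns
    return n_dummy_columns, n_bytes


# Hypothetical toy frame: "id" is badly typed because of one stray string.
toy = pd.DataFrame({
    "grade": pd.Series(["A", "B", "A", "C"], dtype=object),
    "id": pd.Series(["1", "2", "3", "total"], dtype=object),
    "loan_amnt": [1000, 2000, 1500, 0],
})

n_cols, n_bytes = estimate_dummy_footprint(toy)
print(n_cols, n_bytes)
```

On the real data, an object-typed `id` column with 421,097 distinct values across 421,097 rows would need 421,097 × 421,097 bytes on its own, which is roughly 177 GB.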

In [6]:
y2015.select_dtypes(include=['object'])

Unnamed: 0,id,term,int_rate,grade,sub_grade,emp_title,emp_length,home_ownership,verification_status,issue_d,...,zip_code,addr_state,earliest_cr_line,revol_util,initial_list_status,last_pymnt_d,next_pymnt_d,last_credit_pull_d,application_type,verification_status_joint
0,68009401,60 months,14.85%,C,C5,Bookkeeper/Accounting,10+ years,MORTGAGE,Not Verified,Dec-2015,...,297xx,SC,Jun-1991,29.6%,w,Jan-2017,Jan-2017,Jan-2017,INDIVIDUAL,
1,68354783,36 months,7.49%,A,A4,tech,8 years,MORTGAGE,Not Verified,Dec-2015,...,299xx,SC,Jun-1996,59.4%,w,Jan-2017,Jan-2017,Jan-2017,INDIVIDUAL,
2,68466916,36 months,7.49%,A,A4,Sales Manager,10+ years,MORTGAGE,Not Verified,Dec-2015,...,226xx,VA,Dec-2001,54.3%,w,Sep-2016,,Jan-2017,INDIVIDUAL,
3,68466961,36 months,6.49%,A,A2,Senior Manager,10+ years,MORTGAGE,Not Verified,Dec-2015,...,275xx,NC,May-1984,64.5%,w,Jan-2017,Jan-2017,Jan-2017,INDIVIDUAL,
4,68495092,36 months,19.89%,E,E3,Program Coordinator,8 years,RENT,Verified,Dec-2015,...,462xx,IN,Mar-2005,46%,w,May-2016,,Jun-2016,INDIVIDUAL,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
421092,36271333,60 months,15.99%,D,D2,Radiologist Technologist,5 years,RENT,Verified,Jan-2015,...,378xx,TN,Sep-2003,61.3%,w,May-2016,,Dec-2016,INDIVIDUAL,
421093,36490806,60 months,19.99%,E,E3,Painter,1 year,RENT,Source Verified,Jan-2015,...,010xx,MA,Oct-2003,30.6%,w,Jan-2016,,Oct-2016,INDIVIDUAL,
421094,36271262,36 months,11.99%,B,B5,Manager Hotel Operations Oasis,10+ years,RENT,Verified,Jan-2015,...,331xx,FL,Dec-2001,79.8%,f,Jan-2017,Feb-2017,Jan-2017,INDIVIDUAL,
421095,Total amount funded in policy code 1: 6417608175,,,,,,,,,,...,,,,,,,,,,


In [4]:
categorical = y2015.select_dtypes(include=['object'])
for i in categorical:
    column = categorical[i]
    print(i)
    print(column.nunique())

id
421097
term
2
int_rate
110
grade
7
sub_grade
35
emp_title
120812
emp_length
11
home_ownership
4
verification_status
3
issue_d
12
loan_status
7
pymnt_plan
1
url
421095
desc
34
purpose
14
title
27
zip_code
914
addr_state
49
earliest_cr_line
668
revol_util
1211
initial_list_status
2
last_pymnt_d
25
next_pymnt_d
4
last_credit_pull_d
26
application_type
2
verification_status_joint
3


Well, that right there is what's called a problem. Some of these have over a hundred thousand distinct values. Let's drop the ones with over 30 unique values, converting to numeric where it makes sense. Doing this properly involves writing a lot of code just to check whether each numeric conversion makes sense. It's a manual process that we'll skip over here, showing only the conversions themselves.
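
The 30-value cutoff can also be applied programmatically rather than by reading the printout. The following is a sketch only: `high_cardinality_columns` is a hypothetical helper, and the toy frame stands in for `y2015`.

```python
import pandas as pd


def high_cardinality_columns(frame, max_unique=30):
    """Return the object-typed columns with more than max_unique distinct values."""
    counts = frame.select_dtypes(include=["object"]).nunique()
    return counts[counts > max_unique].index.tolist()


# Hypothetical stand-in for the loan data.
toy = pd.DataFrame({
    "grade": pd.Series(["A", "B", "A", "C"] * 25, dtype=object),
    "emp_title": pd.Series([f"title_{i}" for i in range(100)], dtype=object),
    "loan_amnt": range(100),
})

to_drop = high_cardinality_columns(toy, max_unique=30)
trimmed = toy.drop(columns=to_drop)
print(to_drop)
```

Columns that are really numeric (such as `id` and `int_rate` here) would still need converting before this check, otherwise they would be dropped along with the genuinely high-cardinality text columns.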

You could extract numeric features from the dates, but here we'll just drop them. There's a lot of data, so it shouldn't be a huge problem.
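
For reference, extracting numeric features from the date strings could look like the sketch below. It assumes the `Mon-YYYY` format seen in columns such as `issue_d` and an English locale for month abbreviations; the output column names `year` and `month` are made up for the example.

```python
import pandas as pd

# Hypothetical sample in the same "Mon-YYYY" format as issue_d.
dates = pd.Series(["Dec-2015", "Jan-2015", None, "Jun-1991"])

# Unparseable or missing entries become NaT instead of raising.
parsed = pd.to_datetime(dates, format="%b-%Y", errors="coerce")

features = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
})
print(features)
```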

In [7]:
# Convert ID and Interest Rate to numeric.
y2015['id'] = pd.to_numeric(y2015['id'], errors='coerce')
y2015['int_rate'] = pd.to_numeric(y2015['int_rate'].str.strip('%'), errors='coerce')

# Drop other columns with many unique values.
# Pass axis by keyword: recent pandas versions no longer accept it positionally.
y2015.drop(['url', 'emp_title', 'zip_code', 'earliest_cr_line', 'revol_util',
            'sub_grade', 'addr_state', 'desc'], axis=1, inplace=True)

Wonder what was causing the dtype error on the id column, which _should_ have all been integers? Let's look at the end of the file.

In [6]:
y2015.tail()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,emp_length,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
421092,36271333.0,38982739.0,13000.0,13000.0,13000.0,60 months,15.99,316.07,D,5 years,...,0.0,3.0,100.0,50.0,1.0,0.0,51239.0,34178.0,10600.0,33239.0
421093,36490806.0,39222577.0,12000.0,12000.0,12000.0,60 months,19.99,317.86,E,1 year,...,1.0,2.0,95.0,66.7,0.0,0.0,96919.0,58418.0,9700.0,69919.0
421094,36271262.0,38982659.0,20000.0,20000.0,20000.0,36 months,11.99,664.2,B,10+ years,...,0.0,1.0,100.0,50.0,0.0,1.0,43740.0,33307.0,41700.0,0.0
421095,,,,,,,,,,,...,,,,,,,,,,
421096,,,,,,,,,,,...,,,,,,,,,,


In [4]:
# Remove two summary rows at the end that don't actually contain data.
y2015 = y2015[:-2]
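
Rather than relying on the summary rows always being the last two, you can locate them by checking where the `id` fails to parse as a number. This is a small sketch on an in-memory CSV; the banner line and the footer text are invented stand-ins that only mimic the layout of the real file.

```python
import io

import pandas as pd

# Hypothetical miniature of the file: a banner line, a header, data, then summary rows.
raw = io.StringIO(
    "banner line (ignored because header=1)\n"
    "id,loan_amnt\n"
    "101,5000\n"
    "102,7500\n"
    "Total amount funded in policy code 1: 12500,\n"
    "Total amount funded in policy code 2: 0,\n"
)
toy = pd.read_csv(raw, header=1)

# Rows whose id cannot be parsed become NaN under errors='coerce'.
as_number = pd.to_numeric(toy["id"], errors="coerce")
bad_rows = toy[as_number.isna()]

clean = toy[as_number.notna()].copy()
clean["id"] = as_number[as_number.notna()].astype("int64")
print(bad_rows)
```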

Now this should be better. Let's try again.

In [9]:
pd.get_dummies(y2015)

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,dti,delinq_2yrs,...,last_credit_pull_d_Nov-2016,last_credit_pull_d_Oct-2015,last_credit_pull_d_Oct-2016,last_credit_pull_d_Sep-2015,last_credit_pull_d_Sep-2016,application_type_INDIVIDUAL,application_type_JOINT,verification_status_joint_Not Verified,verification_status_joint_Source Verified,verification_status_joint_Verified
0,68009401.0,72868139.0,16000.0,16000.0,16000.0,14.85,379.39,48000.0,33.18,0.0,...,0,0,0,0,0,1,0,0,0,0
1,68354783.0,73244544.0,9600.0,9600.0,9600.0,7.49,298.58,60000.0,22.44,0.0,...,0,0,0,0,0,1,0,0,0,0
2,68466916.0,73356753.0,25000.0,25000.0,25000.0,7.49,777.55,109000.0,26.02,0.0,...,0,0,0,0,0,1,0,0,0,0
3,68466961.0,73356799.0,28000.0,28000.0,28000.0,6.49,858.05,92000.0,21.60,0.0,...,0,0,0,0,0,1,0,0,0,0
4,68495092.0,73384866.0,8650.0,8650.0,8650.0,19.89,320.99,55000.0,25.49,0.0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
421090,36371250.0,39102635.0,10000.0,10000.0,10000.0,11.99,332.10,31000.0,28.69,0.0,...,0,0,0,0,0,1,0,0,0,0
421091,36441262.0,39152692.0,24000.0,24000.0,24000.0,11.99,797.03,79000.0,3.90,0.0,...,0,0,0,0,0,1,0,0,0,0
421092,36271333.0,38982739.0,13000.0,13000.0,13000.0,15.99,316.07,35000.0,30.90,0.0,...,0,0,0,0,0,1,0,0,0,0
421093,36490806.0,39222577.0,12000.0,12000.0,12000.0,19.99,317.86,64400.0,27.19,1.0,...,0,0,1,0,0,1,0,0,0,0


It finally works! We had to sacrifice sub grade, state address and description, but that's fine. If you want to include them you could run the dummies independently and then append them back to the dataframe.
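
Encoding a dropped column separately and appending it back could be sketched as follows, using a made-up three-row frame in place of the real data.

```python
import pandas as pd

# Hypothetical miniature of the loan data with one categorical column.
toy = pd.DataFrame({
    "loan_amnt": [16000, 9600, 25000],
    "addr_state": ["SC", "SC", "VA"],
})

# Encode the single column on its own, then attach the dummies to the other columns.
state_dummies = pd.get_dummies(toy["addr_state"], prefix="addr_state")
combined = pd.concat([toy.drop(columns="addr_state"), state_dummies], axis=1)
print(combined)
```

Because `pd.concat` with `axis=1` aligns on the index, this only works cleanly if the dummies are built from rows with the same index as the main frame.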

## Second Attempt

Now let's try this model again.

We're also going to drop NA columns, rather than impute, because our data is rich enough that we can probably get away with it.

This model may take a few minutes to run.

In [122]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

rfc = ensemble.RandomForestClassifier()

# remove loan_status from feature set
X = y2015.drop('loan_status', axis=1)
# include loan_status as target variable for model
Y = y2015['loan_status']

X = pd.get_dummies(X)

# remove columns with nan values
X = X.dropna(axis=1)

# cross_val_score(rfc, X, Y, cv=10)

MemoryError: Unable to allocate array with shape (421097, 421097) and data type uint8

The scores that `cross_val_score` reports are the accuracy of the forest on each fold. Here we're about 98% accurate.

That works pretty well, but there are a few potential problems. Firstly, we didn't really do much in the way of feature selection or model refinement. As such there are a lot of features in there that we don't really need. Some of them are actually quite impressively useless.

There's also some variance in the scores. The fact that one fold gave us only 93% accuracy while others gave higher than 98% is concerning. This variance could be reduced by increasing the number of estimators. That will make it take even longer to run, however, and it is already quite slow.
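
One way to check whether more trees actually steadies the fold scores is to compare the mean and standard deviation for two forest sizes. The sketch below uses synthetic data from `make_classification`, not the loan data, and the tree counts 5 and 50 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the loan features and target.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)

summary = {}
for n_trees in (5, 50):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(forest, X_toy, y_toy, cv=5)
    # Record the mean accuracy and the spread across folds.
    summary[n_trees] = (scores.mean(), scores.std())
    print(n_trees, round(scores.mean(), 3), round(scores.std(), 3))
```

More trees reduce the randomness that comes from the forest itself, but differences between folds that come from the data will remain, so the spread may not shrink as much as hoped.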

## DRILL: Third Attempt

So here's your task. Get rid of as much data as possible without dropping below an average of 90% accuracy in a 10-fold cross validation.

You'll want to do a few things in this process. First, dive into the data that we have and see which features are most important. This can be the raw features or the generated dummies. You may want to use PCA or correlation matrices.

Can you do it without using anything related to payment amount or outstanding principal? How do you know?

In [87]:
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999

In [64]:
# based on features from above
# what are some categories that 
# I might want to keep?
X.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,dti,delinq_2yrs,inq_last_6mths,open_acc,pub_rec,revol_bal,total_acc,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_amnt,collections_12_mths_ex_med,policy_code,acc_now_delinq,tot_coll_amt,tot_cur_bal,total_rev_hi_lim,acc_open_past_24mths,avg_cur_bal,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_tl_bal_gt_0,num_sats,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,term_ 36 months,term_ 60 months,grade_A,grade_B,grade_C,grade_D,grade_E,grade_F,grade_G,emp_length_1 year,emp_length_10+ years,emp_length_2 years,emp_length_3 years,emp_length_4 years,emp_length_5 years,emp_length_6 years,emp_length_7 years,emp_length_8 years,emp_length_9 years,emp_length_< 1 year,home_ownership_ANY,home_ownership_MORTGAGE,home_ownership_OWN,home_ownership_RENT,verification_status_Not Verified,verification_status_Source Verified,verification_status_Verified,issue_d_Apr-2015,issue_d_Aug-2015,issue_d_Dec-2015,issue_d_Feb-2015,issue_d_Jan-2015,issue_d_Jul-2015,issue_d_Jun-2015,issue_d_Mar-2015,issue_d_May-2015,issue_d_Nov-2015,issue_d_Oct-2015,issue_d_Sep-2015,pymnt_plan_n,purpose_car,purpose_credit_card,purpose_debt_consolidation,purpose_educational,purpose_home_improvement,purpose_house,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_renewable_energy,purpose_small_business,purpose_vacation,purpose_wedding,title_Business,title_Car financing,title_Credit Card/Auto Repair,title_Credit card refinancing,title_Debt consolidation,title_DebtC,title_Green loan,title_Home buying,title_Home 
improvement,title_Learning and training,title_Major purchase,title_Medical expenses,title_Moving and relocation,title_New Baby and New House (CC Consolidate),title_Other,title_Pay off Lowes Card,title_Paying off higher interest cards & auto,title_Prescription Drug and Medical Costs,title_SAVE,title_Simple Loan Until Contract Is Completed,title_Student Loan,title_Trying to come back to reality!,title_Vacation,title_considerate,title_new day,title_new kitchen for momma!,title_odymeds,initial_list_status_f,initial_list_status_w,last_pymnt_d_Apr-2015,last_pymnt_d_Apr-2016,last_pymnt_d_Aug-2015,last_pymnt_d_Aug-2016,last_pymnt_d_Dec-2015,last_pymnt_d_Dec-2016,last_pymnt_d_Feb-2015,last_pymnt_d_Feb-2016,last_pymnt_d_Jan-2015,last_pymnt_d_Jan-2016,last_pymnt_d_Jan-2017,last_pymnt_d_Jul-2015,last_pymnt_d_Jul-2016,last_pymnt_d_Jun-2015,last_pymnt_d_Jun-2016,last_pymnt_d_Mar-2015,last_pymnt_d_Mar-2016,last_pymnt_d_May-2015,last_pymnt_d_May-2016,last_pymnt_d_Nov-2015,last_pymnt_d_Nov-2016,last_pymnt_d_Oct-2015,last_pymnt_d_Oct-2016,last_pymnt_d_Sep-2015,last_pymnt_d_Sep-2016,next_pymnt_d_Feb-2017,next_pymnt_d_Jan-2017,next_pymnt_d_Jul-2016,next_pymnt_d_Mar-2017,last_credit_pull_d_Apr-2015,last_credit_pull_d_Apr-2016,last_credit_pull_d_Aug-2015,last_credit_pull_d_Aug-2016,last_credit_pull_d_Dec-2014,last_credit_pull_d_Dec-2015,last_credit_pull_d_Dec-2016,last_credit_pull_d_Feb-2015,last_credit_pull_d_Feb-2016,last_credit_pull_d_Jan-2015,last_credit_pull_d_Jan-2016,last_credit_pull_d_Jan-2017,last_credit_pull_d_Jul-2015,last_credit_pull_d_Jul-2016,last_credit_pull_d_Jun-2015,last_credit_pull_d_Jun-2016,last_credit_pull_d_Mar-2015,last_credit_pull_d_Mar-2016,last_credit_pull_d_May-2015,last_credit_pull_d_May-2016,last_credit_pull_d_Nov-2015,last_credit_pull_d_Nov-2016,last_credit_pull_d_Oct-2015,last_credit_pull_d_Oct-2016,last_credit_pull_d_Sep-2015,last_credit_pull_d_Sep-2016,application_type_INDIVIDUAL,application_type_JOINT,verification_status_joint_Not 
Verified,verification_status_joint_Source Verified,verification_status_joint_Verified
0,68009401.0,72868139.0,16000.0,16000.0,16000.0,14.85,379.39,48000.0,33.18,0.0,0.0,11.0,2.0,19108.0,19.0,13668.88,13668.88,4519.68,4519.68,2331.12,2188.56,0.0,0.0,0.0,379.39,0.0,1.0,0.0,0.0,31329.0,284700.0,6.0,2848.0,0.0,0.0,294.0,11.0,6.0,2.0,2.0,6.0,9.0,6.0,8.0,6.0,9.0,9.0,11.0,0.0,0.0,2.0,78.9,0.0,2.0,298100.0,31329.0,281300.0,13400.0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
1,68354783.0,73244544.0,9600.0,9600.0,9600.0,7.49,298.58,60000.0,22.44,0.0,0.0,7.0,0.0,7722.0,9.0,6635.69,6635.69,3572.97,3572.97,2964.31,608.66,0.0,0.0,0.0,298.58,0.0,1.0,0.0,0.0,55387.0,13000.0,2.0,7912.0,0.0,0.0,91.0,9.0,9.0,0.0,0.0,3.0,3.0,3.0,3.0,5.0,4.0,3.0,7.0,0.0,0.0,2.0,100.0,0.0,0.0,88635.0,55387.0,12500.0,75635.0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,68466916.0,73356753.0,25000.0,25000.0,25000.0,7.49,777.55,109000.0,26.02,0.0,1.0,9.0,0.0,20862.0,19.0,0.0,0.0,26224.23,26224.23,25000.0,1224.23,0.0,0.0,0.0,20807.39,0.0,1.0,0.0,0.0,305781.0,38400.0,2.0,33976.0,0.0,0.0,168.0,13.0,13.0,3.0,0.0,3.0,3.0,5.0,6.0,7.0,5.0,3.0,9.0,0.0,0.0,0.0,100.0,0.0,0.0,373572.0,68056.0,38400.0,82117.0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,68466961.0,73356799.0,28000.0,28000.0,28000.0,6.49,858.05,92000.0,21.6,0.0,0.0,16.0,0.0,51507.0,24.0,19263.77,19263.77,10271.36,10271.36,8736.23,1535.13,0.0,0.0,0.0,858.05,0.0,1.0,0.0,0.0,221110.0,79900.0,1.0,13819.0,0.0,0.0,379.0,19.0,19.0,2.0,0.0,7.0,9.0,9.0,11.0,4.0,13.0,9.0,16.0,0.0,0.0,0.0,91.7,0.0,0.0,304003.0,74920.0,41500.0,42503.0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,68495092.0,73384866.0,8650.0,8650.0,8650.0,19.89,320.99,55000.0,25.49,0.0,4.0,18.0,1.0,9568.0,19.0,0.0,0.0,9190.49,9190.49,8650.0,540.49,0.0,0.0,0.0,8251.42,0.0,1.0,0.0,0.0,18926.0,20750.0,17.0,1051.0,0.0,0.0,95.0,0.0,0.0,0.0,0.0,2.0,17.0,2.0,2.0,2.0,17.0,13.0,18.0,0.0,0.0,12.0,100.0,1.0,0.0,38998.0,18926.0,2750.0,18248.0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


### Subset data to make it more manageable to investigate features

#### Investigate features using random forest

In [4]:
# subset full dataset to attempt 
# to investigate feature importance
# then generalize to full dataset
df = y2015.head(20000).copy()

In [6]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

rfc = ensemble.RandomForestClassifier()

X = df.drop('loan_status', axis=1)
Y = df['loan_status']
X = pd.get_dummies(X)
X = X.dropna(axis=1)

rfc_model = rfc.fit(X, Y)

In [7]:
f_importances = pd.DataFrame(rfc_model.feature_importances_,
                            X.columns,
                            columns=['importance']).sort_values('importance', ascending=False)

In [8]:
f_importances.head(10)

Unnamed: 0,importance
last_pymnt_d_Jan-2017,0.099832
total_rec_prncp,0.067383
total_pymnt_inv,0.063343
out_prncp,0.059673
total_pymnt,0.053923
last_pymnt_amnt,0.052826
next_pymnt_d_Feb-2017,0.042364
out_prncp_inv,0.029299
last_credit_pull_d_Jan-2017,0.025774
next_pymnt_d_Jan-2017,0.020848


Most of these features are related to payment amount or outstanding principal.
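
To attempt the drill without payment or principal information, those columns have to be removed before fitting. A minimal sketch, assuming that the substrings `pymnt` and `prncp` are enough to identify them (that choice of tokens is mine, not part of the dataset's documentation):

```python
import pandas as pd

# Assumed markers for payment- and principal-related column names.
LEAKY_TOKENS = ("pymnt", "prncp")


def drop_payment_columns(frame, tokens=LEAKY_TOKENS):
    """Drop every column whose name contains one of the given tokens."""
    leaky = [col for col in frame.columns if any(tok in col for tok in tokens)]
    return frame.drop(columns=leaky), leaky


# Hypothetical one-row frame using column names from the output above.
toy = pd.DataFrame(
    [[14.85, 13668.88, 4519.68, 1, 0]],
    columns=["int_rate", "out_prncp", "total_pymnt", "last_pymnt_d_Jan-2017", "grade_A"],
)

kept, removed = drop_payment_columns(toy)
print(removed)
```

Note that this filter also removes `pymnt_plan_n`, and that other columns recorded after the loan was issued (for example `total_rec_int` or `recoveries`) would need their own tokens if you want to exclude them too.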

#### Investigate features using decision trees to see how it compares to using random forests

In [12]:
from sklearn import tree

decision_tree = tree.DecisionTreeClassifier(
    criterion='entropy',
    max_features=1,
    max_depth=2)

decision_tree.fit(X, Y)

# maybe adjust max_features - square root of the number of features


MemoryError: Unable to allocate array with shape (52689, 20000) and data type float64

In [10]:
f_importances = pd.DataFrame(decision_tree.feature_importances_,
                            X.columns,
                            columns=['importance']).sort_values('importance', ascending=False)

In [11]:
f_importances.head(10)

Unnamed: 0,importance
url_https://lendingclub.com/browse/loanDetail.action?loan_id=68545516,0.647121
emp_title_Customer service,0.192507
zip_code_425xx,0.096218
id_67810265,0.032078
id_66441242,0.032076
member_id,0.0
url_https://lendingclub.com/browse/loanDetail.action?loan_id=67346558,0.0
url_https://lendingclub.com/browse/loanDetail.action?loan_id=67346574,0.0
url_https://lendingclub.com/browse/loanDetail.action?loan_id=67346583,0.0
url_https://lendingclub.com/browse/loanDetail.action?loan_id=67346608,0.0


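These features don't make sense at all. The top entries are dummy columns generated from near-unique fields (`url`, `id`, `emp_title`, `zip_code`), so the tree is splitting on individual rows rather than on general patterns. The decision tree classifier seems to be massively overfitting.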

### By visual examination, these categories look promising to me:
 - home_ownership
 - purpose
 - employment length

In [89]:
y2015[['home_ownership', 'purpose', "emp_length"]]

Unnamed: 0,home_ownership,purpose,emp_length
0,MORTGAGE,credit_card,10+ years
1,MORTGAGE,credit_card,8 years
2,MORTGAGE,debt_consolidation,10+ years
3,MORTGAGE,debt_consolidation,10+ years
4,RENT,debt_consolidation,8 years
...,...,...,...
421090,RENT,debt_consolidation,8 years
421091,MORTGAGE,home_improvement,10+ years
421092,RENT,debt_consolidation,5 years
421093,RENT,debt_consolidation,1 year


### Correlated continuous features

#### PCA 1

In [98]:
# find the top 5 attributes correlated with int_rate
X.corr().loc[:,'int_rate'].sort_values(ascending=False).head().index

MemoryError: Unable to allocate array with shape (28234, 28234) and data type float64
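
`X.corr()` builds the full feature-by-feature correlation matrix, which is where the memory goes. Since only one column of that matrix is needed, `DataFrame.corrwith` can compute just the correlations against `int_rate`. A sketch on synthetic data (the column names `pos`, `neg`, and `noise` are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic stand-in: one positively, one negatively, one barely correlated column.
toy = pd.DataFrame({
    "int_rate": base,
    "pos": base + rng.normal(scale=0.1, size=200),
    "neg": -base + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
})

# Correlate every other column with int_rate without building the full matrix.
corr_with_rate = (
    toy.drop(columns="int_rate")
    .corrwith(toy["int_rate"])
    .sort_values(ascending=False)
)
top = corr_with_rate.head(1).index.tolist()
bottom = corr_with_rate.tail(1).index.tolist()
print(corr_with_rate)
```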

Build a one-component PCA from the features most positively correlated with `int_rate`.

In [94]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X2 = StandardScaler().fit_transform(X[['int_rate', 'total_rec_int', 'grade_E', 'term_ 60 months', 'grade_D']])

sklearn_pca = PCA(n_components=1)

# what is happening by the creation of this new column?
X["pca_1"] = sklearn_pca.fit_transform(X2)

# ratio of total variance in the dataset explained 
# by each component from PCA sklearn
sklearn_pca.explained_variance_ratio_

KeyError: "['grade_E', 'term_ 60 months', 'grade_D'] not in index"
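
The scale-then-PCA step is repeated for each group of features, so it can be wrapped in a helper that also checks the requested columns exist before indexing. This is a sketch under my own naming (`first_component` is hypothetical) and it runs on synthetic data rather than the loan features.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def first_component(frame, columns):
    """Standardize the columns and return their first principal component."""
    missing = [col for col in columns if col not in frame.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    scaled = StandardScaler().fit_transform(frame[columns])
    pca = PCA(n_components=1)
    # fit_transform returns shape (n_rows, 1); flatten it to a 1-D column.
    component = pca.fit_transform(scaled).ravel()
    return component, pca.explained_variance_ratio_[0]


rng = np.random.default_rng(0)
signal = rng.normal(size=300)
toy = pd.DataFrame({
    "int_rate": signal + rng.normal(scale=0.2, size=300),
    "total_rec_int": signal + rng.normal(scale=0.2, size=300),
    "unrelated": rng.normal(size=300),
})

component, ratio = first_component(toy, ["int_rate", "total_rec_int", "unrelated"])
toy["pca_1"] = component
print(round(ratio, 3))
```

The new column is a single weighted combination of the standardized inputs, chosen to capture as much of their joint variance as possible; the returned ratio says how much that is.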

#### PCA 2

In [100]:
# find the bottom 5 attributes correlated with int_rate
X.corr().loc[:,'int_rate'].sort_values(ascending=False).tail().index

MemoryError: Unable to allocate array with shape (28234, 28234) and data type float64

Build a one-component PCA from the features most negatively correlated with `int_rate`.

In [87]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X2 = StandardScaler().fit_transform(X[['grade_B', 'term_ 36 months', 'grade_A', 'policy_code', 'pymnt_plan_n']])

sklearn_pca = PCA(n_components=1)

# what is happening by the creation of this new column?
X["pca_2"] = sklearn_pca.fit_transform(X2)

# ratio of total variance in the dataset explained 
# by each component from PCA sklearn
sklearn_pca.explained_variance_ratio_

array([0.4444703])

#### PCA 3

In [56]:
# find the bottom 5 attributes correlated with total_rec_late_fee
X.corr().loc[:,'total_rec_late_fee'].sort_values(ascending=False).tail().index

Index(['next_pymnt_d_Feb-2017', 'grade_A', 'last_pymnt_d_Jan-2017',
       'policy_code', 'pymnt_plan_n'],
      dtype='object')

In [85]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X2 = StandardScaler().fit_transform(X[['next_pymnt_d_Feb-2017', 'grade_A', 'last_pymnt_d_Jan-2017',
       'policy_code', 'pymnt_plan_n']])

sklearn_pca = PCA(n_components=1)

# what is happening by the creation of this new column?
X["pca_3"] = sklearn_pca.fit_transform(X2)

# ratio of total variance in the dataset explained 
# by each component from PCA sklearn
sklearn_pca.explained_variance_ratio_

array([0.5544726])

#### PCA 4

In [55]:
# find the top 5 attributes correlated with total_rec_late_fee 
X.corr().loc[:,'total_rec_late_fee'].sort_values(ascending=False).head().index

Index(['total_rec_late_fee', 'collection_recovery_fee', 'recoveries',
       'installment', 'int_rate'],
      dtype='object')

In [84]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X2 = StandardScaler().fit_transform(X[['total_rec_late_fee', 'collection_recovery_fee', 'recoveries',
       'installment', 'int_rate']])

sklearn_pca = PCA(n_components=1)

# what is happening by the creation of this new column?
X["pca_4"] = sklearn_pca.fit_transform(X2)

# ratio of total variance in the dataset explained 
# by each component from PCA sklearn
sklearn_pca.explained_variance_ratio_

array([0.40669958])

In [91]:
# drop everything except my pca features

X.drop(X.columns.difference(['pca_1', 'pca_2', 'pca_3', 'pca_4']), axis=1, inplace=True)

In [92]:
X.head()

0
1
2
3
4


In [62]:
# test with the above 4 features to get a baseline

from sklearn import ensemble
from sklearn.model_selection import cross_val_score

rfc = ensemble.RandomForestClassifier()

# include loan_status as target variable for model
Y = y2015['loan_status']

cross_val_score(rfc, X, Y, cv=10)



array([0.87098383, 0.85084657, 0.84213151, 0.87794163, 0.83436238,
       0.86231299, 0.86667933, 0.8551616 , 0.8605695 , 0.87628841])

For a model that uses only four PCA features, I'm impressed with how well this did. However, I'm still below 90% accuracy.

### Add 3 categorical features to the above PCA features to see if this gives me any lift

In [90]:
X = pd.concat((X, pd.get_dummies(y2015[['home_ownership', 'purpose', "emp_length"]])), axis=1)

In [106]:
# verify categories have been added correctly
X.shape

(421095, 22)

In [93]:
# test again with the 4 PCA features plus the 3 categorical features

from sklearn import ensemble
from sklearn.model_selection import cross_val_score

rfc = ensemble.RandomForestClassifier()

# include loan_status as target variable for model
Y = y2015['loan_status']

cross_val_score(rfc, X, Y, cv=10)



array([0.87354848, 0.85618959, 0.8503004 , 0.88207357, 0.84573735,
       0.86471147, 0.87055024, 0.85993493, 0.86035576, 0.87647841])

Adding a few categories didn't seem to make any real difference; I'm still below a 90% cross-validation score.

How could I run the above on the full feature set to find the most relevant features? When run with the full feature set, the kernel crashes.
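
One possible approach, sketched here on synthetic data, is to rank features with a forest fitted on a random sample of rows and then keep only the top-ranked columns for the full run. The helper `rank_features_on_sample` and its defaults are assumptions for illustration. It also assumes the near-unique columns have already been dropped or converted, since those are the likely cause of the memory errors.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def rank_features_on_sample(X, y, n_rows=20000, n_trees=50, seed=0):
    """Fit a forest on a random row sample and rank features by importance."""
    sample_idx = X.sample(n=min(n_rows, len(X)), random_state=seed).index
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X.loc[sample_idx], y.loc[sample_idx])
    importances = pd.Series(forest.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False)


# Synthetic stand-in: with shuffle=False, only f0 and f1 carry signal.
features, labels = make_classification(
    n_samples=600, n_features=10, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_toy = pd.DataFrame(features, columns=[f"f{i}" for i in range(10)])
y_toy = pd.Series(labels)

ranking = rank_features_on_sample(X_toy, y_toy, n_rows=300)
top_columns = ranking.head(2).index.tolist()
print(ranking.head())
```

A random sample is used instead of `head(20000)` so the ranking does not depend on the order of rows in the file.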

Can you do it without using anything related to payment amount or outstanding principal? How do you know?
