In [19]:
import pandas as pd
import numpy as np
import scipy
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline

We've talked about Random Forests and Decision Trees, but now it's time to build one.

Here we'll use data from Lending Club to predict the state of a loan given some information about it. You can find the dataset [here](https://www.lendingclub.com/info/download-data.action). We'll use 2015 Lending data.

In [120]:
y2015 = pd.read_csv('C:\\Users\\Peter\\Desktop\\Thinkful\\LoanStats3d.csv', skipinitialspace=True, header=1, low_memory=False)

In [21]:
y2015.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_il_6m,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog
0,,,18000.0,18000.0,18000.0,60 months,19.48%,471.7,E,E2,...,,,,,,,,,,
1,,,21000.0,21000.0,21000.0,60 months,13.99%,488.53,C,C4,...,,,,,,,,,,
2,,,16000.0,16000.0,16000.0,60 months,13.99%,372.21,C,C4,...,,,,,,,,,,
3,,,15700.0,15700.0,15700.0,60 months,16.59%,386.74,D,D2,...,,,,,,,,,,
4,,,23850.0,23850.0,23850.0,60 months,17.27%,596.21,D,D3,...,,,,,,,,,,


## The Blind Approach

Now, as we've seen before, creating a model is the easy part. Let's try just using everything we've got and throwing it, without much thought, into a Random Forest. scikit-learn requires the independent variables to be numeric, and all we want is dummy variables, so let's use `get_dummies` from pandas and see what happens with this naive approach.

In [22]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

rfc = ensemble.RandomForestClassifier()
X = y2015.drop('loan_status', axis=1)
Y = y2015['loan_status']
X = pd.get_dummies(X)

cross_val_score(rfc, X, Y, cv=5)

MemoryError: 

Did your kernel die? My kernel died.

Guess it isn't always going to be that easy...

Can you think of what went wrong?

(You're going to have to restart your kernel and reload the data, BUT DON'T RUN THE MODEL AGAIN OR YOU'LL CRASH THE KERNEL AGAIN! You can restart the kernel from the 'Kernel' menu at the top of the page.)

## Data Cleaning

Well, `get_dummies` can be very memory intensive, particularly if columns are typed poorly. We got a warning about that earlier. Mixed types get converted to objects, and that can create huge problems. Our dataset is about 400,000 rows. If there's a badly typed column, pandas is going to see up to 400,000 distinct values and try to create a dummy for each of them. That's bad. Let's look at all our categorical variables and see how many distinct values each one has...

In [23]:
categorical = y2015.select_dtypes(include=['object'])
for i in categorical:
    column = categorical[i]
    print(i)
    print(column.nunique())

id
2
term
2
int_rate
111
grade
7
sub_grade
35
emp_title
120812
emp_length
12
home_ownership
4
verification_status
3
issue_d
12
loan_status
7
pymnt_plan
2
desc
34
purpose
14
title
27
zip_code
914
addr_state
49
earliest_cr_line
668
revol_util
1211
initial_list_status
2
last_pymnt_d
28
next_pymnt_d
2
last_credit_pull_d
29
application_type
2
verification_status_joint
1
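
To make the cost concrete, here is a minimal sketch on a made-up toy frame (the values are hypothetical, not taken from the loan data). It assumes each distinct value in a non-numeric column becomes one dummy column and that the result is stored densely at one byte per cell.

```python
import pandas as pd

# Toy stand-in for the loan data; the values are made up for illustration.
toy = pd.DataFrame({
    'grade': ['A', 'B', 'A', 'C'],
    'emp_title': ['nurse', 'teacher', 'driver', 'chef'],
    'loan_amnt': [1000.0, 2000.0, 1500.0, 3000.0],
})

# Each distinct value in a non-numeric column becomes one dummy column.
non_numeric = toy.select_dtypes(exclude=['number'])
n_dummy_cols = int(non_numeric.nunique().sum())

# Rough size of a dense dummy block, assuming one byte per cell.
approx_bytes = len(toy) * n_dummy_cols

print(n_dummy_cols, approx_bytes)
print(pd.get_dummies(toy).shape)
```

By the same arithmetic, about 400,000 rows times the 120,812 distinct `emp_title` values is roughly 48 billion cells for that one column, which is consistent with the kernel running out of memory.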



Well, those counts right there are what's called a problem. One of these columns (`emp_title`) has over a hundred thousand distinct values. Let's drop the ones with over 30 unique values, converting to numeric where it makes sense. Along the way a lot of code gets written just to check whether a numeric conversion makes sense. It's a manual process that we'll abstract away here, showing only the final conversion.

You could extract numeric features from the dates, but here we'll just drop them. There's a lot of data, so it shouldn't be a huge problem.
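
If you did want to keep the date information, one possible approach is sketched below. It assumes the date strings follow a `Mon-YYYY` pattern (as in values like `May-2016`) and uses made-up sample values; the `year` and `month` column names are just illustrative.

```python
import pandas as pd

# Made-up values in the 'Mon-YYYY' style the date columns appear to use.
dates = pd.Series(['Aug-2003', 'Dec-1999', 'not a date', None])

# errors='coerce' turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(dates, format='%b-%Y', errors='coerce')

# Numeric features a model can use directly.
features = pd.DataFrame({
    'year': parsed.dt.year,
    'month': parsed.dt.month,
})
print(features)
```

Two numeric columns like these replace what would otherwise be hundreds of dummy columns for a field such as `earliest_cr_line`.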

In [25]:
# Convert ID and interest rate to numeric.
y2015['id'] = pd.to_numeric(y2015['id'], errors='coerce')
y2015['int_rate'] = pd.to_numeric(
    y2015['int_rate'].str.strip('%'), errors='coerce')

# Drop other columns with many unique values.
y2015.drop(['url', 'emp_title', 'zip_code', 'earliest_cr_line',
            'revol_util', 'sub_grade', 'addr_state', 'desc'],
           axis=1, inplace=True)

# Remove two summary rows that don't actually contain data
y2015 = y2015[:-2]

Note that in `pd.to_numeric` we used the `'coerce'` option for our conversion. If the function cannot figure out how to convert a value to numeric, it returns a null for that value rather than raising an error for the whole column.
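
As a small illustration of that behavior, here is a sketch on a made-up Series (the sample values are hypothetical):

```python
import pandas as pd

# Two well-formed rates, one junk string, and one missing value.
rates = pd.Series(['19.48%', '13.99%', 'unknown', None])

# Strip the percent sign, then convert; anything unparseable becomes NaN.
clean = pd.to_numeric(rates.str.strip('%'), errors='coerce')
print(clean)
```

The junk string and the missing value both come back as NaN, so it is worth counting nulls after a coerced conversion to see how much was lost.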

Now this should be better. Let's try again.

In [26]:
pd.get_dummies(y2015)

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,dti,delinq_2yrs,...,last_credit_pull_d_May-2016,last_credit_pull_d_Nov-2015,last_credit_pull_d_Nov-2016,last_credit_pull_d_Oct-2015,last_credit_pull_d_Oct-2016,last_credit_pull_d_Sep-2015,last_credit_pull_d_Sep-2016,application_type_INDIVIDUAL,application_type_JOINT,verification_status_joint_Not Verified
0,,,18000.0,18000.0,18000.0,19.48,471.70,150000.0,9.39,0.0,...,0,0,0,0,0,0,0,1,0,0
1,,,21000.0,21000.0,21000.0,13.99,488.53,52000.0,14.47,0.0,...,0,0,0,0,0,0,0,1,0,0
2,,,16000.0,16000.0,16000.0,13.99,372.21,142000.0,17.74,2.0,...,0,0,0,0,0,0,0,1,0,0
3,,,15700.0,15700.0,15700.0,16.59,386.74,48000.0,29.13,0.0,...,0,0,0,0,0,0,0,1,0,0
4,,,23850.0,23850.0,23850.0,17.27,596.21,68046.0,24.71,1.0,...,0,0,0,0,0,0,0,1,0,0
5,,,7200.0,7200.0,7200.0,9.17,229.53,50000.0,19.25,0.0,...,0,0,0,0,0,0,0,1,0,0
6,,,28000.0,28000.0,28000.0,14.85,663.92,85000.0,28.52,0.0,...,0,0,0,0,0,0,0,1,0,0
7,,,23975.0,23975.0,23975.0,19.89,633.73,70000.0,33.20,0.0,...,0,0,0,0,0,0,0,1,0,0
8,,,17600.0,17600.0,17600.0,19.89,465.22,44000.0,17.56,0.0,...,0,0,0,0,0,0,0,1,0,0
9,,,20200.0,20200.0,20200.0,18.49,518.35,60000.0,34.84,0.0,...,0,0,0,0,0,0,0,1,0,0


It works! We had to sacrifice sub grade, state address and description, but that's fine. If you want to include them you could run the dummies independently and then append them back to the dataframe.
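
A minimal sketch of that idea on a toy frame follows. The values are made up, and with the real data you would need to keep a copy of the column before dropping it.

```python
import pandas as pd

# Toy frame standing in for the loan data.
toy = pd.DataFrame({
    'loan_amnt': [1000.0, 2000.0, 1500.0],
    'addr_state': ['CA', 'NY', 'CA'],
})

# Build the dummies for one column on their own...
state_dummies = pd.get_dummies(toy['addr_state'], prefix='addr_state')

# ...then attach them to the rest of the frame, column-wise.
combined = pd.concat([toy.drop(columns='addr_state'), state_dummies], axis=1)
print(combined.columns.tolist())
```

Handling one column at a time keeps each intermediate result small and lets you inspect how many columns each feature adds.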

## Second Attempt

Now let's try this model again.

We're also going to drop columns that contain missing values, rather than impute, because our data is rich enough that we can probably get away with it.

In [27]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

In [28]:
rfc = ensemble.RandomForestClassifier()
X = y2015.drop('loan_status', axis=1)
Y = y2015['loan_status']
X = pd.get_dummies(X)
X = X.dropna(axis=1)

In [29]:
cross_val_score(rfc, X, Y, cv=10)

array([ 0.98660714,  0.98655965,  0.98710551,  0.98658308,  0.98663025,
        0.98679617,  0.98325774,  0.98613123,  0.98641556,  0.98525116])

The score that cross-validation reports is the accuracy of the forest on each held-out fold. Here we're about 98% accurate.

That works pretty well, but there are a few potential problems. Firstly, we didn't really do much in the way of feature selection or model refinement. As such there are a lot of features in there that we don't really need. Some of them are actually quite impressively useless.

There's also some variance in the scores. In the run above the folds range from about 98.3% to 98.7%, and because the forest is randomized the spread can differ from run to run. This variance could be reduced by increasing the number of estimators. That will make it take even longer to run, however, and it is already quite slow.
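
The number of estimators is set with the `n_estimators` parameter of `RandomForestClassifier` (the default was 10 in older scikit-learn releases and is 100 since version 0.22). The sketch below compares two settings on a small synthetic dataset, not the loan data, so the numbers only illustrate how to measure the spread.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Small synthetic problem so the comparison runs quickly.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

spread = {}
for n in (5, 50):
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(forest, X_toy, y_toy, cv=5)
    # Keep the mean and the fold-to-fold standard deviation for each setting.
    spread[n] = (scores.mean(), scores.std())
    print(n, round(scores.mean(), 3), round(scores.std(), 3))
```

On the real data you would run the same loop with `X` and `Y` and watch both the standard deviation and the run time as `n_estimators` grows.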

## DRILL: Third Attempt

So here's your task. Get rid of as much data as possible without dropping below an average of 90% accuracy in a 10-fold cross validation.

You'll want to do a few things in this process. First, dive into the data that we have and see which features are most important. This can be the raw features or the generated dummies. You may want to use PCA or correlation matrices.

Can you do it without using anything related to payment amount or outstanding principal? How do you know?
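
As a starting point for the correlation idea, here is a hedged sketch on a toy frame. The 0.95 threshold and the toy values are assumptions for illustration. It flags any column that is almost perfectly correlated with a column that appears before it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
loan_amnt = np.arange(100, dtype=float)
toy = pd.DataFrame({
    'loan_amnt': loan_amnt,
    'funded_amnt': loan_amnt,          # exact copy, so fully redundant
    'dti': rng.normal(size=100),       # unrelated noise
})

# Absolute correlations, keeping only the upper triangle so each pair
# is considered once and the diagonal is ignored.
corr = toy.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is redundant if it correlates strongly with an earlier column.
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(redundant)
```

In the loan data, columns such as `loan_amnt`, `funded_amnt` and `funded_amnt_inv` look identical in the first few rows, so they are natural candidates for this kind of check.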

In [None]:
# EXAMPLE SOLUTION

In [30]:
# Fit the forest once and record how important each feature is.
df = pd.DataFrame()
df['imp'] = rfc.fit(X, Y).feature_importances_
df['features'] = X.columns

In [14]:
x = X.drop(['last_pymnt_amnt', 'out_prncp', 'last_pymnt_d_Jan-2017', 'out_prncp_inv'], axis=1)

In [15]:
cross_val_score(rfc, x, Y, cv=10)

array([ 0.98480243,  0.98477869,  0.98563321,  0.98473083,  0.98397055,
        0.98456387,  0.98233157,  0.98416016,  0.9831382 ,  0.98280489])

In [34]:
df2 = df.sort_values(by='imp', ascending=False)

In [111]:
import time

_start_time = time.time()

def tic():
    """Start (or restart) the timer."""
    global _start_time
    _start_time = time.time()

def tac():
    """Print the time elapsed since the last call to tic()."""
    t_sec = round(time.time() - _start_time)
    (t_min, t_sec) = divmod(t_sec, 60)
    (t_hour, t_min) = divmod(t_min, 60)
    print('Time passed: {}hour:{}min:{}sec'.format(t_hour, t_min, t_sec))

In [115]:
tic()
# Baseline: use every feature, ordered by importance.
listofgoodones = np.asarray(df2['features'])
x2 = X[listofgoodones]
scores = cross_val_score(rfc, x2, Y, cv=10)
print(scores)
avg = sum(scores) / len(scores)
print(avg)
print('With {} features the run time for this operation was:'.format(len(listofgoodones)))
tac()

[ 0.98632219  0.98639343  0.98717675  0.98663057  0.98677274  0.98667743
  0.98484884  0.98636871  0.98660555  0.98558366]
0.986337987156006
With 203 features the run time for this operation was:
Time passed: 0hour:5min:5sec


In [116]:
tic()
# Keep only the 50 most important features.
listofgoodones = np.asarray(df2['features'].iloc[:50])
x2 = X[listofgoodones]
scores = cross_val_score(rfc, x2, Y, cv=10)
print(scores)
avg = sum(scores) / len(scores)
print(avg)
print('With {} features the run time for this operation was:'.format(len(listofgoodones)))
tac()

[ 0.98651216  0.98641717  0.98731923  0.98674931  0.98693897  0.98670118
  0.98655869  0.98615498  0.98686679  0.98598741]
0.9866205889933353
With 50 features the run time for this operation was:
Time passed: 0hour:5min:14sec


In [119]:
tic()
# Keep only the 2 most important features.
listofgoodones = np.asarray(df2['features'].iloc[:2])
x2 = X[listofgoodones]
scores = cross_val_score(rfc, x2, Y, cv=10)
print(scores)
avg = sum(scores) / len(scores)
print(avg)
print('With {} features the run time for this operation was:'.format(len(listofgoodones)))
tac()

[ 0.8603011   0.93415179  0.92714493  0.94550118  0.94511992  0.94559358
  0.94580731  0.92445795  0.94680219  0.94171714]
0.9316597086184453
With 2 features the run time for this operation was:
Time passed: 0hour:2min:41sec
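
The three cells above repeat the same steps with a different number of features. One way to tidy that up is a small helper, sketched here on synthetic data so it runs on its own. The name `score_top_k` and the toy feature names are hypothetical, not part of the original solution.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def score_top_k(model, X, y, ranked_features, k, cv=10):
    """Mean cross-validated accuracy using only the k top-ranked features."""
    subset = X[list(ranked_features[:k])]
    return cross_val_score(model, subset, y, cv=cv).mean()

# Synthetic stand-in for the loan data.
data, target = make_classification(n_samples=300, n_features=8,
                                   n_informative=3, random_state=0)
X_toy = pd.DataFrame(data, columns=['f{}'.format(i) for i in range(8)])

# Rank features by importance from a single fit, most important first.
model = RandomForestClassifier(n_estimators=20, random_state=0)
importances = pd.Series(model.fit(X_toy, target).feature_importances_,
                        index=X_toy.columns)
ranked = importances.sort_values(ascending=False).index

# Score progressively smaller feature sets.
results = {k: score_top_k(model, X_toy, target, ranked, k, cv=5)
           for k in (8, 4, 2)}
print(results)
```

With the notebook's variables, the equivalent call would pass `rfc`, `X`, `Y` and `df2['features'].values`, looping over whichever feature counts you want to compare.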
