## Random Forest Thinkful Guided Example - Third Attempt
In this notebook I'll try to discard as much data as possible without the average accuracy of a 10-fold cross-validation dropping below 90%. First, I'll explore the data to see which features matter most. These can be the raw features or the generated dummies. I'll use PCA or correlation matrices to decide.

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn import ensemble

from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('ignore')

We've talked about Random Forests. Now it's time to build one.

Here we'll use data from Lending Club to predict the state of a loan given some information about it. You can find the dataset [here](https://www.lendingclub.com/info/download-data.action). We'll use 2015 data. ([Thinkful mirror](https://www.dropbox.com/s/m7z42lubaiory33/LoanStats3d.csv?dl=0))

In [26]:
# Replace the path with the correct path for your data.
y2015 = pd.read_csv(r'C:\Users\hafeez_poldz\Desktop\Thinkful\Unit 3\data\LoanStats3d.csv', 
                    skipinitialspace=True, header=1)

# Note the warning about dtypes.

In [27]:
y2015.head()

| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | hardship_payoff_balance_amount | hardship_last_payment_amount | disbursement_method | debt_settlement_flag | debt_settlement_flag_date | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | 35000.0 | 35000.0 | 35000.0 | 60 months | 14.85% | 829.9 | C | C5 | ... | | | Cash | N | | | | | | |
| 1 | | | 10400.0 | 10400.0 | 10400.0 | 60 months | 22.45% | 289.91 | F | F1 | ... | | | Cash | N | | | | | | |
| 2 | | | 20000.0 | 20000.0 | 20000.0 | 60 months | 10.78% | 432.66 | B | B4 | ... | | | Cash | N | | | | | | |
| 3 | | | 3600.0 | 3600.0 | 3600.0 | 36 months | 13.99% | 123.03 | C | C4 | ... | | | Cash | N | | | | | | |
| 4 | | | 24700.0 | 24700.0 | 24700.0 | 36 months | 11.99% | 820.28 | C | C1 | ... | | | Cash | N | | | | | | |


## Data Cleaning

`get_dummies` can be very memory intensive, particularly if data are typed poorly. We got a warning about that earlier. Mixed data types get converted to objects, and that could create huge problems. Our dataset has about 400,000 rows. If a column there is badly typed, `get_dummies` will see up to 400,000 distinct values and try to create a dummy for each of them. That's bad. Let's look at all our categorical variables and count the distinct values in each.

In [28]:
categorical = y2015.select_dtypes(include=['object'])
for i in categorical:
    column = categorical[i]
    print(i)
    print(column.nunique())

id
2
term
2
int_rate
111
grade
7
sub_grade
35
emp_title
120812
emp_length
11
home_ownership
4
verification_status
3
issue_d
12
loan_status
7
pymnt_plan
2
desc
34
purpose
14
title
27
zip_code
914
addr_state
49
earliest_cr_line
668
revol_util
1211
initial_list_status
2
last_pymnt_d
50
next_pymnt_d
6
last_credit_pull_d
51
application_type
2
verification_status_joint
1
hardship_flag
2
hardship_type
1
hardship_reason
9
hardship_status
3
hardship_start_date
25
hardship_end_date
25
payment_plan_start_date
25
hardship_loan_status
4
disbursement_method
1
debt_settlement_flag
2
debt_settlement_flag_date
43
settlement_status
3
settlement_date
46


Well, that right there is what's called a problem. Some of these have over a hundred thousand distinct values. Let's drop the ones with over 30 unique values, converting to numeric where it makes sense. Doing this involves writing a lot of code just to check whether the numeric conversion makes sense. It's a manual process that we'll abstract away here, including only the conversion itself.
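To make the link between distinct values and dummy columns concrete, here is a minimal sketch on a toy frame. The data, the `MAX_UNIQUE` cutoff of 3, and the variable names are all hypothetical; the notebook itself applies a cutoff of 30 by hand to the real columns.

```python
import pandas as pd

# Toy frame (hypothetical data, not the Lending Club file).
toy = pd.DataFrame({
    'grade': ['A', 'B', 'A', 'C', 'B', 'A'],
    'emp_title': ['nurse', 'driver', 'teacher', 'clerk', 'welder', 'chef'],
    'loan_amnt': [1000, 2500, 4000, 1200, 800, 3000],
})

MAX_UNIQUE = 3  # hypothetical cutoff; the notebook uses 30 on the real data

# Count distinct values in every non-numeric column.
counts = toy.select_dtypes(exclude=['number']).nunique()
too_wide = counts[counts > MAX_UNIQUE].index.tolist()

# Each distinct value becomes one dummy column, so width tracks cardinality.
wide = pd.get_dummies(toy)
narrow = pd.get_dummies(toy.drop(columns=too_wide))

print(too_wide)
print(wide.shape, narrow.shape)
```

On the toy frame, dropping the one high-cardinality column shrinks the encoded frame from ten columns to four.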

You could extract numeric features from the dates, but here we'll just drop them. There's a lot of data, so it shouldn't be a huge problem.
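For reference, a rough sketch of both ideas: coercing a percent string to a number, and pulling numeric parts out of a date string. The sample values and the `%b-%Y` date format are assumptions for illustration, and `issue_month` and `issue_year` are hypothetical column names.

```python
import pandas as pd

# Hypothetical sample values in the style of the raw columns.
sample = pd.DataFrame({
    'int_rate': ['14.85%', '22.45%', 'n/a'],
    'issue_d': ['Dec-2015', 'Nov-2015', 'Jan-2015'],
})

# Strip the percent sign, then coerce anything unparseable to NaN.
sample['int_rate'] = pd.to_numeric(sample['int_rate'].str.strip('%'), errors='coerce')

# Parse month-year strings (the format is an assumption) and keep numeric parts.
issued = pd.to_datetime(sample['issue_d'], format='%b-%Y')
sample['issue_month'] = issued.dt.month
sample['issue_year'] = issued.dt.year
sample = sample.drop(columns=['issue_d'])

print(sample)
```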

In [29]:
# Convert ID and Interest Rate to numeric.
y2015['id'] = pd.to_numeric(y2015['id'], errors='coerce')
y2015['int_rate'] = pd.to_numeric(y2015['int_rate'].str.strip('%'), errors='coerce')

# Drop the other columns with many unique values.
y2015.drop(['url', 'emp_title', 'zip_code', 'earliest_cr_line', 'revol_util',
            'sub_grade', 'addr_state', 'desc', 'debt_settlement_flag_date', 'settlement_date',
            'last_credit_pull_d'], axis=1, inplace=True)

What was causing the dtype warning on the `id` column, which _should_ have contained only integers? Let's look at the end of the file.
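A small sketch of how trailing non-data lines can turn a numeric column into an object column, and how to spot them. The miniature file below is hypothetical and its footer text is invented; it only mimics the general shape of the problem, not the actual contents of `LoanStats3d.csv`.

```python
import io
import pandas as pd

# Hypothetical miniature file: two data rows followed by two footer lines.
raw = io.StringIO(
    "id,loan_amnt\n"
    "101,35000\n"
    "102,10400\n"
    "footer text line one,\n"
    "footer text line two,\n"
)
df = pd.read_csv(raw)

print(df['id'].dtype)   # not an integer dtype, because of the footer text
print(df.tail(2))       # the offending rows show up at the end

# Coercion marks the footer rows as NaN, which makes them easy to locate.
bad_rows = pd.to_numeric(df['id'], errors='coerce').isna()
clean = df[~bad_rows]
print(clean)
```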

Now this should be better. Let's drop the last two rows and try again.

In [30]:
# Drop the last two rows of the frame.
y2015 = y2015[:-2]
print(pd.get_dummies(y2015).shape)
print(y2015.shape)

(421095, 358)
(421095, 134)


It works! I had to sacrifice 11 variables, but that's fine.

## Model 1.

Now let's try this model again.

We're also going to drop the columns that contain missing values, rather than impute them, because our data is rich enough that we can probably get away with it.
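As a quick illustration of what column-wise dropping does, here is a sketch on a toy frame. The data and the `emp_length_years` column are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'loan_amnt': [1000.0, 2500.0, 4000.0],
    'emp_length_years': [2.0, np.nan, 10.0],   # one missing value
    'member_id': [np.nan, np.nan, np.nan],     # entirely missing
})

# axis=1 drops every column that has at least one NaN; rows are untouched.
kept = toy.dropna(axis=1)

print(kept.columns.tolist())
print(toy.shape, kept.shape)
```

Note that a column with a single missing value is discarded along with the empty one, which is the price of skipping imputation.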

This model may take a few minutes to run.

In [31]:
# random forest classifier
rfc = ensemble.RandomForestClassifier()

# define input variables and outcome variable
X = y2015.drop('loan_status', axis = 1)
Y = y2015['loan_status']

# convert categorical variables into dummy variables
X = pd.get_dummies(X)
X = X.dropna(axis = 1)

# cross validation with 10 folds
cross_val_score(rfc, X, Y, cv=10)

array([0.99073942, 0.99104811, 0.99183131, 0.99190197, 0.992353  ,
       0.99202052, 0.99145056, 0.99228175, 0.99235282, 0.99301779])

The scores that cross-validation reports are the forest's accuracy on each held-out fold. Here we're about 99% accurate.
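The ten fold scores can be summarized with a mean and a spread. This sketch simply copies the values printed above into an array; the variable names are illustrative.

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above.
scores = np.array([0.99073942, 0.99104811, 0.99183131, 0.99190197, 0.992353,
                   0.99202052, 0.99145056, 0.99228175, 0.99235282, 0.99301779])

mean_acc = scores.mean()
std_acc = scores.std()
print(f"mean accuracy: {mean_acc:.4f}, std across folds: {std_acc:.4f}")
```

A small standard deviation across folds suggests the accuracy does not depend much on which rows were held out.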

That works pretty well, but there are a few potential problems. First, we didn't really do much in the way of feature selection or model refinement, so there are a lot of features in there that we don't really need. Some of them are actually quite impressively useless.
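One common way to see which features a forest actually relies on is its `feature_importances_` attribute. The sketch below uses synthetic data in which only one column carries signal; the data and column names are hypothetical, and this is not a step the notebook runs.

```python
import numpy as np
import pandas as pd
from sklearn import ensemble

# Synthetic data (hypothetical): the label depends only on 'signal'.
rng = np.random.RandomState(0)
X_toy = pd.DataFrame({
    'signal': rng.normal(size=500),
    'noise_a': rng.normal(size=500),
    'noise_b': rng.normal(size=500),
})
y_toy = (X_toy['signal'] > 0).astype(int)

forest = ensemble.RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_toy, y_toy)

# Impurity-based importances sum to 1; higher means a larger share of the
# impurity reduction came from splits on that feature.
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```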

## Model 2.

I'll get rid of as much data as possible without dropping below an average of 90% accuracy in a 10-fold cross validation.

I want to do a few things in this process. First, I'll check for missing values and remove the columns that contain them. Later, I may use PCA or correlation matrices.
The lesson also gave me the idea to pay attention to the relationship between payment amount and outstanding principal. Let's check it.


In [32]:
# check missing values
y2015.isnull().sum()

id                                            421095
member_id                                     421095
loan_amnt                                          0
funded_amnt                                        0
funded_amnt_inv                                    0
term                                               0
int_rate                                           0
installment                                        0
grade                                              0
emp_length                                     23817
home_ownership                                     0
annual_inc                                         0
verification_status                                0
issue_d                                            0
loan_status                                        0
pymnt_plan                                         0
purpose                                            0
title                                            132
dti                                           

In [33]:
# drop every column that contains a missing value
y2015 = y2015.dropna(how ='any', axis = 1)
print(y2015.isnull().sum())
print(y2015.shape)

loan_amnt                     0
funded_amnt                   0
funded_amnt_inv               0
term                          0
int_rate                      0
installment                   0
grade                         0
home_ownership                0
annual_inc                    0
verification_status           0
issue_d                       0
loan_status                   0
pymnt_plan                    0
purpose                       0
delinq_2yrs                   0
inq_last_6mths                0
open_acc                      0
pub_rec                       0
revol_bal                     0
total_acc                     0
initial_list_status           0
out_prncp                     0
out_prncp_inv                 0
total_pymnt                   0
total_pymnt_inv               0
total_rec_prncp               0
total_rec_int                 0
total_rec_late_fee            0
recoveries                    0
collection_recovery_fee       0
                             ..
acc_open

In [34]:
# check the relationship between outstanding principal and loan status
out_prncp_loan_st = y2015[['out_prncp', 'loan_status']]
out_prncp_loan_st.head(10)

| | out_prncp | loan_status |
|---|---|---|
| 0 | 16523.08 | Current |
| 1 | 0.0 | Fully Paid |
| 2 | 0.0 | Fully Paid |
| 3 | 0.0 | Fully Paid |
| 4 | 0.0 | Fully Paid |
| 5 | 0.0 | Fully Paid |
| 6 | 7364.72 | Current |
| 7 | 0.0 | Fully Paid |
| 8 | 0.0 | Fully Paid |
| 9 | 0.0 | Charged Off |
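A cross-tabulation makes the pattern in these ten rows easier to read. The sketch below retypes the rows shown above into a small frame; the `balance` labels are illustrative names, and ten rows are of course not evidence about the whole dataset.

```python
import pandas as pd

# The ten rows shown in the head(10) output above.
rows = pd.DataFrame({
    'out_prncp': [16523.08, 0.0, 0.0, 0.0, 0.0, 0.0, 7364.72, 0.0, 0.0, 0.0],
    'loan_status': ['Current', 'Fully Paid', 'Fully Paid', 'Fully Paid', 'Fully Paid',
                    'Fully Paid', 'Current', 'Fully Paid', 'Fully Paid', 'Charged Off'],
})

# Label each row by whether any principal is still outstanding.
balance = rows['out_prncp'].eq(0).map({True: 'zero balance', False: 'balance remaining'})

# Count loan statuses within each balance group.
table = pd.crosstab(balance, rows['loan_status'])
print(table)
```

In this sample every loan with a remaining balance is `Current` and none of the zero-balance loans are, so the column gives away a large part of the label.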


In [35]:
# remove the outstanding principal column
y2015 = y2015.drop(['out_prncp'], axis = 1)

# random forest classifier
rfc = ensemble.RandomForestClassifier()

# define input variables and outcome variable
X = y2015.drop(['loan_status'], axis = 1)
Y = y2015['loan_status']

# convert categorical variables into indicator variables
X = pd.get_dummies(X)
X = X.dropna(axis = 1)

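# reduce the features to 20 principal components
# (PCA is applied to the unscaled dummies; no normalization is done here)
pca = PCA(n_components = 20)
X_PCA = pca.fit_transform(X)

# fit the model on all rows and report training accuracy
rfc.fit(X_PCA, Y)
print("score: ", rfc.score(X_PCA, Y))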

# cross validation with 10 folds
cross_val_score(rfc, X_PCA, Y, cv = 10)

score:  0.9986107647917928


array([0.98608539, 0.98888731, 0.98784195, 0.98738987, 0.98788829,
       0.98822077, 0.98786454, 0.98841075, 0.98810174, 0.98703303])

After dropping the columns with missing values, removing `out_prncp`, and reducing the features to 20 principal components, mean cross-validated accuracy fell from about 99.2% to about 98.8%, which is still very good. The 10-fold cross-validation accuracy stays well above the 90% target, which is satisfactory.
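One possible variation, not what this notebook ran, is to put scaling and PCA inside a pipeline so that both are refit on the training part of each fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the encoded loan features; the sample size, feature counts, and `n_estimators` are arbitrary assumptions.

```python
from sklearn import ensemble
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the dummy-encoded loan features (hypothetical data).
X_toy, y_toy = make_classification(n_samples=400, n_features=30,
                                   n_informative=8, random_state=0)

# Scaler and PCA are refit on the training part of every fold.
model = make_pipeline(
    MinMaxScaler(),
    PCA(n_components=20),
    ensemble.RandomForestClassifier(n_estimators=50, random_state=0),
)

scores = cross_val_score(model, X_toy, y_toy, cv=10)
print(scores.mean())
```

Keeping the preprocessing inside the pipeline means the held-out fold never influences the scaling or the principal components, so the cross-validated score is a cleaner estimate.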