# 21.06 Guided Example

Random Forests were discussed in notebooks 21.04 & 21.05; now it's time to build one.  The forest will use data from Lending Club (2015) to predict the state of a loan given some information about it.  The dataset can be downloaded [here](https://www.dropbox.com/s/0so14yudedjmm5m/LoanStats3d.csv?dl=1).

In [34]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import ensemble 
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.decomposition import PCA

In [3]:
y2015 = pd.read_csv("F:\\thinkful\\Data_Science\\21_supervised_learning_random_forest_models\\LoanStats3d.csv", skipinitialspace=True, header=1)



In [4]:
y2015.head()

|  | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | num_tl_90g_dpd_24m | num_tl_op_past_12m | pct_tl_nvr_dlq | percent_bc_gt_75 | pub_rec_bankruptcies | tax_liens | tot_hi_cred_lim | total_bal_ex_mort | total_bc_limit | total_il_high_credit_limit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68009401 | 72868139.0 | 16000.0 | 16000.0 | 16000.0 | 60 months | 14.85% | 379.39 | C | C5 | ... | 0.0 | 2.0 | 78.9 | 0.0 | 0.0 | 2.0 | 298100.0 | 31329.0 | 281300.0 | 13400.0 |
| 1 | 68354783 | 73244544.0 | 9600.0 | 9600.0 | 9600.0 | 36 months | 7.49% | 298.58 | A | A4 | ... | 0.0 | 2.0 | 100.0 | 66.7 | 0.0 | 0.0 | 88635.0 | 55387.0 | 12500.0 | 75635.0 |
| 2 | 68466916 | 73356753.0 | 25000.0 | 25000.0 | 25000.0 | 36 months | 7.49% | 777.55 | A | A4 | ... | 0.0 | 0.0 | 100.0 | 20.0 | 0.0 | 0.0 | 373572.0 | 68056.0 | 38400.0 | 82117.0 |
| 3 | 68466961 | 73356799.0 | 28000.0 | 28000.0 | 28000.0 | 36 months | 6.49% | 858.05 | A | A2 | ... | 0.0 | 0.0 | 91.7 | 22.2 | 0.0 | 0.0 | 304003.0 | 74920.0 | 41500.0 | 42503.0 |
| 4 | 68495092 | 73384866.0 | 8650.0 | 8650.0 | 8650.0 | 36 months | 19.89% | 320.99 | E | E3 | ... | 0.0 | 12.0 | 100.0 | 50.0 | 1.0 | 0.0 | 38998.0 | 18926.0 | 2750.0 | 18248.0 |


### The Blind Approach

Creating a model is the easy part.  Try feeding everything in the dataset into a Random Forest.  Scikit-learn requires the independent variables to be numeric, so use `get_dummies` from pandas to generate a dummy variable for each value of each categorical column, and see what happens with this naive approach.

In [5]:
rfc = ensemble.RandomForestClassifier()
X = y2015.drop("loan_status", axis=1)
Y = y2015['loan_status']
X = pd.get_dummies(X)

cross_val_score(rfc,X, Y, cv=5)

This approach will probably crash your Jupyter kernel.
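One way to see why is to count the columns that `get_dummies` produces on a small frame. The sketch below uses a made-up `toy` DataFrame (not the Lending Club data) with one low-cardinality column and one ID-like column in which every value is distinct.

```python
import pandas as pd

# Toy frame (hypothetical data): one low-cardinality and one ID-like column.
toy = pd.DataFrame({
    "grade": ["A", "B", "A", "C", "B", "A"],
    "url": [f"https://example.com/loan/{n}" for n in range(6)],
})

# Each distinct value in a categorical column becomes its own dummy column.
dummies = pd.get_dummies(toy)
n_grade_cols = sum(c.startswith("grade_") for c in dummies.columns)
n_url_cols = sum(c.startswith("url_") for c in dummies.columns)

print(toy.nunique().to_dict())
print(n_grade_cols, n_url_cols, dummies.shape)
```

The ID-like column contributes as many dummy columns as there are rows, so on a frame with hundreds of thousands of rows the result grows roughly with the square of the row count.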

### Data Cleaning 

Well, `get_dummies` can be very memory intensive, particularly if the data are typed poorly.  The warning thrown at import was evidence of that.  Columns with mixed datatypes get converted to objects, and that can create huge problems.  The dataset is about 400,000 rows.  If a column is badly typed, `get_dummies` may see up to 400,000 distinct values and try to create a dummy for each of them.  Look at the number of distinct values in the categorical columns.


In [6]:
categorical = y2015.select_dtypes(include=["object"]) 

for i in categorical: 
    column = categorical[i]
    print(f"{i}: ", end="")
    print(column.nunique())

id: 421097
term: 2
int_rate: 110
grade: 7
sub_grade: 35
emp_title: 120812
emp_length: 11
home_ownership: 4
verification_status: 3
issue_d: 12
loan_status: 7
pymnt_plan: 1
url: 421095
desc: 34
purpose: 14
title: 27
zip_code: 914
addr_state: 49
earliest_cr_line: 668
revol_util: 1211
initial_list_status: 2
last_pymnt_d: 25
next_pymnt_d: 4
last_credit_pull_d: 26
application_type: 2
verification_status_joint: 3


This gives you some idea of the problem.  Some of these columns have over a hundred thousand distinct values.  Drop the ones with over 30 unique values, converting to numeric where it makes sense.  You could extract numeric features from the dates, but here you'll just drop them.  There is a lot of data, so it shouldn't be a problem.  There are also two summary rows at the end that need to be removed.

In [7]:
# Convert ID and Interest Rate to numeric.
y2015['id'] = pd.to_numeric(y2015['id'], errors='coerce')
y2015['int_rate'] = pd.to_numeric(y2015['int_rate'].str.strip('%'), errors='coerce')

# Drop other columns with many unique values
y2015.drop(['url', 'emp_title', 'zip_code', 'earliest_cr_line', 'revol_util',
            'sub_grade', 'addr_state', 'desc'], axis=1, inplace=True)

y2015 = y2015[:-2]
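The column list above was typed by hand from the unique-value counts. As a rough alternative, the same selection could be made programmatically. In this sketch the helper `high_cardinality_columns`, the threshold default, and the `toy` frame are all hypothetical; it also selects non-numeric columns with `exclude=["number"]`, which is an assumption that differs slightly from the `include=["object"]` filter used earlier.

```python
import pandas as pd

def high_cardinality_columns(df, max_unique=30):
    """Return names of non-numeric columns with more than max_unique distinct values."""
    categorical = df.select_dtypes(exclude=["number"])
    counts = categorical.nunique()
    return counts[counts > max_unique].index.tolist()

# Hypothetical stand-in for the loan data.
toy = pd.DataFrame({
    "grade": ["A", "B"] * 20,
    "emp_title": [f"title_{n}" for n in range(40)],
    "loan_amnt": range(40),
})

to_drop = high_cardinality_columns(toy, max_unique=30)
cleaned = toy.drop(columns=to_drop)
print(to_drop)
print(list(cleaned.columns))
```

On the real data, columns such as `id` and `int_rate` would need to be converted to numeric first, as in the cell above, or they would be dropped along with the rest.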

Now try to create the dummies again.

In [8]:
pd.get_dummies(y2015)

|  | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | int_rate | installment | annual_inc | dti | delinq_2yrs | ... | last_credit_pull_d_Nov-2016 | last_credit_pull_d_Oct-2015 | last_credit_pull_d_Oct-2016 | last_credit_pull_d_Sep-2015 | last_credit_pull_d_Sep-2016 | application_type_INDIVIDUAL | application_type_JOINT | verification_status_joint_Not Verified | verification_status_joint_Source Verified | verification_status_joint_Verified |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68009401.0 | 72868139.0 | 16000.0 | 16000.0 | 16000.0 | 14.85 | 379.39 | 48000.0 | 33.18 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 68354783.0 | 73244544.0 | 9600.0 | 9600.0 | 9600.0 | 7.49 | 298.58 | 60000.0 | 22.44 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 68466916.0 | 73356753.0 | 25000.0 | 25000.0 | 25000.0 | 7.49 | 777.55 | 109000.0 | 26.02 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 68466961.0 | 73356799.0 | 28000.0 | 28000.0 | 28000.0 | 6.49 | 858.05 | 92000.0 | 21.60 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 68495092.0 | 73384866.0 | 8650.0 | 8650.0 | 8650.0 | 19.89 | 320.99 | 55000.0 | 25.49 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 421090 | 36371250.0 | 39102635.0 | 10000.0 | 10000.0 | 10000.0 | 11.99 | 332.10 | 31000.0 | 28.69 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 421091 | 36441262.0 | 39152692.0 | 24000.0 | 24000.0 | 24000.0 | 11.99 | 797.03 | 79000.0 | 3.90 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 421092 | 36271333.0 | 38982739.0 | 13000.0 | 13000.0 | 13000.0 | 15.99 | 316.07 | 35000.0 | 30.90 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 421093 | 36490806.0 | 39222577.0 | 12000.0 | 12000.0 | 12000.0 | 19.99 | 317.86 | 64400.0 | 27.19 | 1.0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |


That worked.  You had to sacrifice several columns of data, but in this case that's okay.
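If the dropped or dummied date columns turn out to matter, one option is to turn them into numeric features instead. The following is a minimal sketch, assuming the dates are month-year strings like the `Nov-2016` values visible in the dummy column names above and that month abbreviations are in English; the `dates` sample is made up.

```python
import pandas as pd

# Hypothetical sample in the "Mon-YYYY" form, with one missing entry.
dates = pd.Series(["Nov-2016", "Oct-2015", "Sep-2016", None])

# Parse month-year strings; unparseable or missing entries become NaT.
parsed = pd.to_datetime(dates, format="%b-%Y", errors="coerce")

# Year and month as numeric features (NaN where the date was missing).
features = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
})
print(features)
```

Two numeric columns per date field replace what would otherwise be one dummy column per distinct month.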

### Second Attempt

In this attempt, columns containing missing values will be dropped rather than imputed.

In [22]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

rfc = ensemble.RandomForestClassifier()
X = y2015.drop('loan_status', axis=1)
Y = y2015['loan_status']
X = pd.get_dummies(X)
X = X.dropna(axis=1)

cross_val_score(rfc, X, Y, cv=10)

array([0.98188079, 0.98112087, 0.98209451, 0.98192828, 0.98100214,
       0.98019426, 0.96499561, 0.98100169, 0.98107293, 0.98097794])

The score cross validation reports is the accuracy of the forest on each held-out fold. Here we're about 98% accurate.

That works pretty well, but there are a few potential problems. Firstly, we didn't really do much in the way of feature selection or model refinement. As such there are a lot of features in there that we don't really need. Some of them are actually quite impressively useless.

There's also some variance in the scores. The fact that one fold gave us only about 96.5% accuracy while the others gave 98% or higher is concerning. This variance could be reduced by increasing the number of estimators. That will make it take even longer to run, however, and it is already quite slow.
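The effect of the number of estimators on fold-to-fold spread can be checked on a small scale. This sketch uses synthetic data from `make_classification` as a stand-in (an assumption, chosen so it runs quickly), and the tree counts 10 and 100 are arbitrary. More trees tend to stabilise the scores, but that is not guaranteed on any particular dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the Lending Club set).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

results = {}
for n_trees in (10, 100):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(forest, X_demo, y_demo, cv=5)
    # Store the mean and the fold-to-fold spread for comparison.
    results[n_trees] = (scores.mean(), scores.std())

for n_trees, (mean, std) in results.items():
    print(f"n_estimators={n_trees}: mean={mean:.3f}, std={std:.3f}")
```

Note that the default `n_estimators` changed from 10 to 100 in scikit-learn 0.22, so the behaviour of `RandomForestClassifier()` with no arguments depends on the installed version.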

## DRILL: Third Attempt

So here's your task. Get rid of as much data as possible without dropping below an average of 90% accuracy in a 10-fold cross validation.

You'll want to do a few things in this process. First, dive into the data that we have and see which features are most important. This can be the raw features or the generated dummies. You may want to use PCA or correlation matrices.
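One common way to see which features matter most is the `feature_importances_` attribute of a fitted forest. The sketch below runs on synthetic data with hypothetical column names (`feature_0` to `feature_7`); it is an illustration of the attribute, not a result from the loan data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical column names.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{n}" for n in range(8)])

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances: one value per column, summing to 1.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
ranked = importances.sort_values(ascending=False)
print(ranked.head())
```

Impurity-based importances can favour high-cardinality features, so they are best treated as a first pass rather than a final ranking.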

Can you do it without using anything related to payment amount or outstanding principal? How do you know?

I didn't know what to do with this until I came across this article: [Scale, Standardize, or Normalize with Scikit-Learn](https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02).  The article discusses several scikit-learn preprocessing classes: _MinMaxScaler, RobustScaler, and Normalizer_.  Additionally, [PCA using Python (scikit-learn)](https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60) discusses the PCA class in scikit-learn.
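To make the difference between those three classes concrete, here is a small sketch on made-up numbers. The arrays `values` and `rows` are hypothetical; the point is only how each class treats them.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, Normalizer

# Toy column with one outlier (hypothetical values).
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# MinMaxScaler maps each column onto [0, 1] using its min and max.
minmax = MinMaxScaler().fit_transform(values)

# RobustScaler centers on the median and scales by the interquartile range.
robust = RobustScaler().fit_transform(values)

# Normalizer works per row, not per column: each row is scaled to unit norm.
rows = np.array([[3.0, 4.0], [1.0, 0.0]])
unit_rows = Normalizer().fit_transform(rows)

print(minmax.ravel())
print(robust.ravel())
print(unit_rows)
```

With the outlier present, `MinMaxScaler` squeezes the other four values close to 0, while `RobustScaler` keeps them spread around the median.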

In [60]:
# reload the dataset 
y2015b = pd.read_csv("F:\\thinkful\\Data_Science\\21_supervised_learning_random_forest_models\\LoanStats3d.csv", skipinitialspace=True, header=1, skipfooter=5, engine="python")

In [61]:
# Convert ID and Interest Rate to numeric.
y2015b['id'] = pd.to_numeric(y2015b['id'], errors='coerce')
y2015b['int_rate'] = pd.to_numeric(y2015b['int_rate'].str.strip('%'), errors='coerce')

# Drop other columns with many unique values
y2015b.drop(['url', 'emp_title', 'zip_code', 'earliest_cr_line', 'revol_util',
            'sub_grade', 'addr_state', 'desc'], axis=1, inplace=True)

In [86]:
# Split the data into continuous and categorical columns
# iloc[:, 2:] skips the first two columns (id and member_id)
y2015_contin = y2015b.iloc[:,2:].copy().select_dtypes(include=["float64","int64"]) # Continuous columns
y2015_cat = y2015b.copy().select_dtypes(include="object") # Categorical columns (this includes loan_status)
# pd.get_dummies(y2015b)

In [95]:
# Scale the data using the preprocessing scaler
# y2015_scaled =  pd.DataFrame(preprocessing.scale(y2015_contin), columns=y2015_contin.columns)
scaler = preprocessing.MinMaxScaler()
scaler.fit(y2015_contin)
# print(scaler.transform(y2015_contin))
y2015_scaled = pd.DataFrame(scaler.transform(y2015_contin), columns=y2015_contin.columns)
# Replace infinite values with NaN
y2015_scaled = y2015_scaled.replace([np.inf,-np.inf],np.nan)
# Then replace NaN with 0
y2015_scaled = y2015_scaled.fillna(0)
y2015_scaled

|  | loan_amnt | funded_amnt | funded_amnt_inv | int_rate | installment | annual_inc | dti | delinq_2yrs | inq_last_6mths | mths_since_last_delinq | ... | num_tl_90g_dpd_24m | num_tl_op_past_12m | pct_tl_nvr_dlq | percent_bc_gt_75 | pub_rec_bankruptcies | tax_liens | tot_hi_cred_lim | total_bal_ex_mort | total_bc_limit | total_il_high_credit_limit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.441176 | 0.441176 | 0.442815 | 0.402619 | 0.250334 | 0.005053 | 0.003318 | 0.000000 | 0.000000 | 0.187500 | ... | 0.000000 | 0.066667 | 0.789 | 0.000 | 0.000000 | 0.023529 | 0.029567 | 0.010723 | 0.337169 | 0.006375 |
| 1 | 0.252941 | 0.252941 | 0.255132 | 0.091677 | 0.193508 | 0.006316 | 0.002244 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.066667 | 1.000 | 0.667 | 0.000000 | 0.000000 | 0.008616 | 0.018958 | 0.014983 | 0.035984 |
| 2 | 0.705882 | 0.705882 | 0.706745 | 0.091677 | 0.530322 | 0.011474 | 0.002602 | 0.000000 | 0.166667 | 0.000000 | ... | 0.000000 | 0.000000 | 1.000 | 0.200 | 0.000000 | 0.000000 | 0.037116 | 0.023294 | 0.046027 | 0.039068 |
| 3 | 0.794118 | 0.794118 | 0.794721 | 0.049430 | 0.586930 | 0.009684 | 0.002160 | 0.000000 | 0.000000 | 0.238636 | ... | 0.000000 | 0.000000 | 0.917 | 0.222 | 0.000000 | 0.000000 | 0.030158 | 0.025644 | 0.049742 | 0.020221 |
| 4 | 0.225000 | 0.225000 | 0.227273 | 0.615547 | 0.209267 | 0.005789 | 0.002549 | 0.000000 | 0.666667 | 0.000000 | ... | 0.000000 | 0.400000 | 1.000 | 0.500 | 0.090909 | 0.000000 | 0.003651 | 0.006478 | 0.003296 | 0.008682 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 421089 | 0.323529 | 0.323529 | 0.325513 | 0.281791 | 0.263786 | 0.006632 | 0.002369 | 0.025641 | 0.000000 | 0.051136 | ... | 0.025641 | 0.100000 | 0.983 | 0.500 | 0.000000 | 0.000000 | 0.032306 | 0.038675 | 0.018459 | 0.046684 |
| 421090 | 0.264706 | 0.264706 | 0.266862 | 0.281791 | 0.217079 | 0.003263 | 0.002869 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.033333 | 1.000 | 1.000 | 0.000000 | 0.000000 | 0.003046 | 0.008651 | 0.011027 | 0.007541 |
| 421091 | 0.676471 | 0.676471 | 0.677419 | 0.281791 | 0.544021 | 0.008316 | 0.000390 | 0.000000 | 0.166667 | 0.147727 | ... | 0.000000 | 0.066667 | 0.565 | 1.000 | 0.000000 | 0.000000 | 0.015019 | 0.002951 | 0.010787 | 0.000000 |
| 421092 | 0.352941 | 0.352941 | 0.354839 | 0.450782 | 0.205807 | 0.003684 | 0.003090 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.100000 | 1.000 | 0.500 | 0.090909 | 0.000000 | 0.004875 | 0.011699 | 0.012705 | 0.015814 |


In [107]:
# Set up the PCA, setting the number of components
pca = PCA(n_components=10)

# Fit the PCA and return an array of transformed values (output suppressed)
pca.fit_transform(y2015_scaled);
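Before relying on a fixed number of components, it can help to check how much variance they capture via `explained_variance_ratio_`. This is a minimal sketch on synthetic data (the `demo` array is made up so that half of its columns nearly duplicate the other half); it does not report anything about the loan data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 200 rows, 6 columns, where columns 3-5 nearly copy 0-2.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
demo = np.hstack([base, base + 0.01 * rng.normal(size=(200, 3))])

pca_demo = PCA(n_components=3)
pca_demo.fit(demo)

# Fraction of total variance captured by each component, and the running total.
ratios = pca_demo.explained_variance_ratio_
cumulative = np.cumsum(ratios)
print(ratios)
print(cumulative)
```

Applied to the fitted `pca` object above, the same attribute would show whether 10 components is more or fewer than needed.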

In [110]:
# Build a DataFrame of values from the PCA analysis
pca_df = pd.DataFrame(pca.components_,columns=y2015_contin.columns,index=range(10))

pca_df = np.abs(pca_df)

# Declare a set to store the continuous features
top_feaures = set()

for idx,row in pca_df.iterrows():
    # Sort this component's absolute loadings and keep the 10 largest
    top_row = row.sort_values(ascending=False)[:10]
    # Add those feature names to the set
    top_feaures.update(list(top_row.index))

# Number of distinct features collected across the 10 components
len(top_feaures)

44

In [119]:
# Create dummy variables for the categorical columns
y2015_dummied = pd.get_dummies(y2015_cat)

y2015_dummied

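|  | term_ 36 months | term_ 60 months | grade_A | grade_B | grade_C | grade_D | grade_E | grade_F | grade_G | emp_length_1 year | ... | last_credit_pull_d_Nov-2016 | last_credit_pull_d_Oct-2015 | last_credit_pull_d_Oct-2016 | last_credit_pull_d_Sep-2015 | last_credit_pull_d_Sep-2016 | application_type_INDIVIDUAL | application_type_JOINT | verification_status_joint_Not Verified | verification_status_joint_Source Verified | verification_status_joint_Verified |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 421089 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 421090 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 421091 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 421092 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |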


In [154]:
# Bring dataframes together for a new model 
model = pd.concat([y2015_dummied, y2015b["loan_status"], y2015b[list(top_feaures)]], axis=1)

In [160]:
# Rerun the model and compare the results
rfc2 = ensemble.RandomForestClassifier()
# Several features contain nulls; list the 18 columns with the most nulls to drop
drops = model.isna().sum().sort_values(ascending=False)[:18].index.to_list()
# Append the loan_status column
drops.append("loan_status")
# Model Features
X = model.drop(drops, axis=1)
# Target variable 
Y = model["loan_status"]

cross_val_score(rfc2,X,Y,cv=10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
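A perfect score on every fold is usually a sign that the target has leaked into the features rather than evidence of a perfect model. In the cells above, `y2015_cat` is built with `select_dtypes(include="object")`, which appears to include `loan_status`, so `pd.get_dummies(y2015_cat)` produces `loan_status_*` columns that stay in `X` even after the original `loan_status` column is dropped. The sketch below reproduces the effect on made-up data (`cat_demo` is hypothetical) and shows one way to avoid it.

```python
import pandas as pd

# Hypothetical miniature of the categorical frame, target included.
cat_demo = pd.DataFrame({
    "grade": ["A", "B", "A", "C"],
    "loan_status": ["Current", "Fully Paid", "Current", "Charged Off"],
})

# Dummying the whole frame encodes the target as loan_status_* columns.
leaky = pd.get_dummies(cat_demo)
leaked_cols = [c for c in leaky.columns if c.startswith("loan_status_")]

# Excluding the target before dummying keeps it out of the features.
safe = pd.get_dummies(cat_demo.drop(columns="loan_status"))

print(leaked_cols)
print(list(safe.columns))
```

Dropping `loan_status` from `y2015_cat` before the `get_dummies` call and rerunning the cross validation would give a score that reflects the remaining features; that rerun is not shown here.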