# DRILL: Third Attempt

So here's your task. Get rid of as much data as possible without dropping below an average of 90% accuracy in a 10-fold cross validation.

You'll want to do a few things in this process. First, dive into the data that we have and see which features are most important. This can be the raw features or the generated dummies. You may want to use PCA or correlation matrices.
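As a rough illustration of the correlation-matrix idea, here is a minimal sketch that flags columns which are nearly duplicates of an earlier column. The helper name `highly_correlated_columns`, the 0.9 threshold, and the synthetic frame are all assumptions for the example, not part of the drill.

```python
import numpy as np
import pandas as pd


def highly_correlated_columns(frame, threshold=0.9):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair of columns is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Tiny synthetic frame (not the loan data): 'b' is almost a scaled copy of 'a'.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'b': a * 2 + rng.normal(scale=0.01, size=200),
    'c': rng.normal(size=200),
})
redundant = highly_correlated_columns(demo)
print(redundant)
```

On the real data the same helper could be pointed at the numeric columns of `X`, and the flagged columns would be candidates to drop before modeling.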

Can you do it without using anything related to payment amount or outstanding principal? How do you know?
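One way to answer this would be to drop every column whose name looks payment- or principal-related and then rerun the cross validation. The sketch below only shows the dropping step. The substrings `'pymnt'` and `'prncp'` and the demo column names are assumptions about how such columns are named, so check them against `df.columns` first.

```python
import pandas as pd

# Substrings assumed to mark payment- or principal-related columns.
# This list is a guess; verify it against the actual column names.
EXCLUDE_PATTERNS = ('pymnt', 'prncp')


def drop_matching_columns(frame, patterns=EXCLUDE_PATTERNS):
    """Return a copy of frame without columns whose name contains any pattern."""
    keep = [col for col in frame.columns
            if not any(pattern in col for pattern in patterns)]
    return frame[keep].copy()


# Hypothetical one-row frame, just to show the effect.
demo = pd.DataFrame({
    'loan_amnt': [1000],
    'total_pymnt': [1100.0],
    'out_prncp': [0.0],
    'int_rate': [7.5],
})
trimmed = drop_matching_columns(demo)
print(list(trimmed.columns))
```

The "how do you know" part would then come from comparing the 10-fold score on the trimmed features against the 90% target.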

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv('LoanStats3d.csv', skipinitialspace=True, header=1, low_memory=False)

Cleaning from the example...

In [3]:
df = df[:-2].copy()

In [4]:
# Coerce id to numeric, and strip the percent sign from int_rate before converting.
df['id'] = pd.to_numeric(df['id'], errors='coerce')
df['int_rate'] = pd.to_numeric(df['int_rate'].str.strip('%'), errors='coerce')

In [5]:
df.drop(['url', 'emp_title', 'zip_code', 'earliest_cr_line', 'revol_util',
         'sub_grade', 'addr_state', 'desc'], axis=1, inplace=True)

First, do it the way the example did.

In [6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

In [7]:
rfc = RandomForestClassifier()

In [8]:
X = df.drop(columns='loan_status')
y = df.loan_status
X = pd.get_dummies(X, drop_first=True).copy()
X = X.dropna(axis=1)

In [9]:
#cross_val_score(rfc, X, y, cv=10)

That took a long time...think twice before running the above again.

Let's look at the data...

In [13]:
from sklearn.linear_model import LogisticRegression

In [21]:
logreg = LogisticRegression(solver='sag', multi_class='multinomial')

In [36]:
from sklearn.model_selection import train_test_split

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1)

In [41]:
from sklearn.preprocessing import MinMaxScaler

In [42]:
scaler = MinMaxScaler()

In [47]:
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [48]:
from sklearn.decomposition import PCA

In [53]:
pca = PCA(.95)

In [54]:
pca.fit(X_train)
X_tr_pca = pca.transform(X_train)
X_te_pca = pca.transform(X_test)

In [65]:
print(pca.explained_variance_ratio_.sum())
print(len(pca.components_))
print(X.shape[1])

0.9523006785267814
54
186


So PCA on all of the (scaled) data reduces 186 features to 54 components while still explaining 95% of the variance. Let's see how this performs with Random Forest. It may be useful to train/test on a range of component counts first and compare the scores, in order to tune before running the big cross validation.
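To see where a count like 54 comes from, the cumulative explained variance can be inspected directly. This is a minimal sketch on synthetic data (three hidden factors spread over ten columns). The helper name `components_for_variance` is made up for the example, and it only mirrors the idea behind `PCA(.95)`, not necessarily its exact internals.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for_variance(X, target=0.95):
    """Return the smallest component count whose cumulative variance ratio reaches target."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted gives the first index where cumulative >= target; add 1 for a count.
    k = int(np.searchsorted(cumulative, target) + 1)
    return k, cumulative


# Synthetic data: 10 observed columns driven by 3 latent factors plus tiny noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + rng.normal(scale=0.01, size=(300, 10))

k, cumulative = components_for_variance(X_demo, target=0.95)
print(k, cumulative.round(4))
```

Printing `cumulative` for the scaled loan data would show how quickly the curve flattens, which helps pick a sensible range of component counts to sweep.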

In [66]:
rfc.fit(X_tr_pca, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [67]:
rfc.score(X_te_pca, y_test)

0.9699121348848254

In [76]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [78]:
scores = []
rfc = RandomForestClassifier(n_estimators=10)
for k in range(2, 103, 10):
    print('-----Running PCA with {} components-----'.format(k))
    pca = PCA(n_components=k)
    print('Fitting and transforming PCA on training data...')
    X_tr_pca = pca.fit_transform(X_train)
    print('done...')
    print('Transforming testing data...')
    X_te_pca = pca.transform(X_test)
    print('done...')
    print('Fitting training data using random forest...')
    rfc.fit(X_tr_pca, y_train)
    print('done...')
    print('assigning score...')
    score = rfc.score(X_te_pca, y_test)
    print('done...')
    scores.append(score)
    print('PCA n_components={} done... Random Forest Score: {}'.format(k, score))

In [79]:
scores

[0.6789204336313659,
 0.9095334781937567,
 0.9108277229603771,
 0.9133924648832211,
 0.9600446455075458,
 0.96909248506869,
 0.9704935940820955,
 0.9709804200952279,
 0.9732364430829148,
 0.9737945119272373,
 0.9734857929920802]

Looks like 2 components don't work well. Once the number of components reaches 12, test accuracy is almost 91%. I am curious how low we could go and still get consistent results, but the sweep above was computationally expensive (Random Forest and PCA both take a long time to fit). So, let's run the cross validation using PCA with 12 components (down from 186).

In [83]:
pca = PCA(n_components=12)
X_pca = pca.fit_transform(X)
comp_12 = cross_val_score(rfc, X_pca, y, cv=10, n_jobs=-1)
comp_12.mean()

0.6684871174375336

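That is much lower than the 91% from the single split. The train/test split may just have been favorable, but note also that the earlier sweep fit PCA on MinMaxScaler-scaled data, while here PCA is fit on the unscaled X, so the two sets of scores are not directly comparable. Let's see if it's better at 22 components (about 91% in the sweep above).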
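One way to keep the scaling and PCA steps identical between a single split and cross validation is to wrap them in a pipeline, so each fold fits the scaler and PCA on its own training portion. This is a sketch on a small synthetic dataset, not the loan data, and the component count and fold count are arbitrary choices for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Small synthetic classification problem standing in for X and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0)

# Scaling, PCA, and the forest are refit inside every fold.
pipe = make_pipeline(
    MinMaxScaler(),
    PCA(n_components=5),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
demo_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(demo_scores.mean())
```

With this setup the scores from a quick split and from the full cross validation come from the same preprocessing, which makes them easier to compare.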

In [84]:
pca = PCA(n_components=22)
X_pca = pca.fit_transform(X)
comp_22 = cross_val_score(rfc, X_pca, y, cv=10, n_jobs=-1)
comp_22.mean()

0.9443001840082635

Looks like this does meet the average score of 90% or higher. I wonder if we can do it with fewer components.

In [85]:
pca = PCA(n_components=17)
X_pca = pca.fit_transform(X)
comp_17 = cross_val_score(rfc, X_pca, y, cv=10, n_jobs=-1)
comp_17.mean()

0.9384346023393105

Closer...

So the threshold is somewhere between 12 and 17. Let's loop over the values in between; it's going to take a while.

In [88]:
scores = []
for k in range(13, 17, 1):
    print('-----Running cross validation for Random Forest using PCA with {} components-----'.format(k))
    pca = PCA(n_components=k)
    print('Fitting PCA...')
    X_pca = pca.fit_transform(X)
    print('done...')
    print('Cross validating...')
    cv_mean = cross_val_score(rfc, X_pca, y, cv=10, n_jobs=-1).mean()
    print('done...')
    scores.append(cv_mean)
    print('k={} cv_mean={}'.format(k, cv_mean))

-----Running cross validation for Random Forest using PCA with 13 components-----
Fitting PCA...
done...
Cross validating...
done...
k=13 cv_mean=0.673642633299513
-----Running cross validation for Random Forest using PCA with 14 components-----
Fitting PCA...
done...
Cross validating...
done...
k=14 cv_mean=0.9171235149425987
-----Running cross validation for Random Forest using PCA with 15 components-----
Fitting PCA...
done...
Cross validating...
done...
k=15 cv_mean=0.9321248378498821
-----Running cross validation for Random Forest using PCA with 16 components-----
Fitting PCA...
done...
Cross validating...
done...
k=16 cv_mean=0.9362285172570581


There is a significant jump from n_components=13 (about 67%) to n_components=14 (about 92%). Now I wonder if we can get the model to run faster while keeping accuracy above 90% with 14 PCA components. The next thing to tune is the depth of the Random Forest (how many 'stories' of nodes each tree has). First, let's find out how deep the trees grow when depth is unlimited.

In [89]:
pca = PCA(n_components=14)

In [90]:
X_pca = pca.fit_transform(X)

In [95]:
X.shape

(421095, 186)

In [96]:
X_pca.shape

(421095, 14)

In [112]:
pca.explained_variance_ratio_.sum()

0.9999999304665965

In [118]:
rfc.fit(X_pca, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [124]:
np.array([estimator.tree_.max_depth for estimator in rfc.estimators_]).max()

57

So the deepest tree in the forest has a depth of 57. Let's limit the depth to roughly the mean tree depth instead and see if we can still get good accuracy.
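For reference, here is a small self-contained version of the depth check, run on synthetic data. It uses each tree's `get_depth()` method and then confirms that setting `max_depth` really caps the trees. The dataset and the cap of 3 are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem standing in for X_pca and y.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Unlimited depth: inspect how deep each tree actually grew.
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)
depths = np.array([tree.get_depth() for tree in forest.estimators_])
print(depths.min(), depths.mean(), depths.max())

# Capped depth: no tree should exceed the limit.
capped = RandomForestClassifier(
    n_estimators=10, max_depth=3, random_state=0).fit(X_demo, y_demo)
capped_depths = np.array([tree.get_depth() for tree in capped.estimators_])
print(capped_depths.max())
```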

In [128]:
np.array([estimator.tree_.max_depth for estimator in rfc.estimators_]).mean()

49.7

In [129]:
rfc.max_depth = 50

In [130]:
rfc

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=50, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [131]:
rfc.fit(X_pca, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=50, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [132]:
# Note: this scores on the same data used for fitting, so it is an optimistic estimate.
rfc.score(X_pca, y)

0.9925005046367209

In [134]:
scores = []
means = []
for k in range(5, 51, 5):
    rfc = RandomForestClassifier(n_estimators=10, max_depth=k)
    print('-----Using Random Forest with max depth of {}-----'.format(k))
    print('Cross validating...')
    score = cross_val_score(rfc, X_pca, y, cv=10, n_jobs=-1)
    print('done...')
    scores.append(score)
    mean = score.mean()
    means.append(mean)
    print('max_depth: {} mean_cv: {}'.format(k, mean))

-----Using Random Forest with max depth of 5-----
Cross validating...
done...
max_depth: 5 mean_cv: 0.921720435746149
-----Using Random Forest with max depth of 10-----
Cross validating...
done...
max_depth: 10 mean_cv: 0.9137228324801177
-----Using Random Forest with max depth of 15-----
Cross validating...
done...
max_depth: 15 mean_cv: 0.9173085766610027
-----Using Random Forest with max depth of 20-----
Cross validating...
done...
max_depth: 20 mean_cv: 0.9154327110793178
-----Using Random Forest with max depth of 25-----
Cross validating...
done...
max_depth: 25 mean_cv: 0.920509959270351
-----Using Random Forest with max depth of 30-----
Cross validating...
done...
max_depth: 30 mean_cv: 0.9180782130435556
-----Using Random Forest with max depth of 35-----
Cross validating...
done...
max_depth: 35 mean_cv: 0.9203152505237393
-----Using Random Forest with max depth of 40-----
Cross validating...
done...
max_depth: 40 mean_cv: 0.9137489346851522
-----Using Random Forest with max de

Surprisingly, the model with a max depth of 5 seems to be the most general and also the most consistent. Its mean accuracy of about 92.2% is the highest of the depths shown above and stays over the 90% target.
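To make the comparison concrete, the printed means can be collected and ranked. The values below are copied from the output above for depths 5 through 40 only, since the output for the remaining depths is cut off. The variable names are just for this sketch.

```python
# Mean 10-fold CV accuracy per max_depth, copied from the printed output above.
depth_means = {
    5: 0.921720435746149,
    10: 0.9137228324801177,
    15: 0.9173085766610027,
    20: 0.9154327110793178,
    25: 0.920509959270351,
    30: 0.9180782130435556,
    35: 0.9203152505237393,
    40: 0.9137489346851522,
}

# Depth with the highest mean score, and the gap between best and worst.
best_depth = max(depth_means, key=depth_means.get)
spread = max(depth_means.values()) - min(depth_means.values())
print(best_depth, round(spread, 4))
```

The gap between the best and worst depth is under one percentage point, so the ranking may well be within run-to-run noise. The stronger argument for depth 5 is probably that it is the cheapest to fit.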

In [140]:
pca

PCA(copy=True, iterated_power='auto', n_components=14, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

In [142]:
rfc.max_depth=5

In [144]:
rfc.n_jobs=-1

In [145]:
rfc

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=5, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [146]:
cross_val_score(rfc, X_pca, y, cv=10, n_jobs=-1)

array([0.87884401, 0.9361687 , 0.9427228 , 0.91375175, 0.9151033 ,
       0.9315127 , 0.90595835, 0.8991427 , 0.91200988, 0.91485774])

In [147]:
rfc.fit(X_pca, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=5, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

So, the above is the final model. PCA reduced the data from 186 dimensions to 14, and a Random Forest Classifier was then fit on those components. There is some concern about the variability between cross-validation folds (scores range from about 0.88 to 0.94), but this was the most consistent and most general of the Random Forest Classifiers, so it is the model I would use moving forward. Let's instantiate the model once more and cross validate again to check that the results are consistent.

In [163]:
rfc = RandomForestClassifier(n_estimators=10, max_depth=5, n_jobs=-1)
pca = PCA(n_components=14)

In [164]:
rfc, pca

(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                        max_depth=5, max_features='auto', max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
                        oob_score=False, random_state=None, verbose=0,
                        warm_start=False),
 PCA(copy=True, iterated_power='auto', n_components=14, random_state=None,
     svd_solver='auto', tol=0.0, whiten=False))

In [165]:
X_pca = pca.fit_transform(X)
final_scores = cross_val_score(rfc, X_pca, y, cv=10, n_jobs=-1)
final_scores, final_scores.mean()

(array([0.88668044, 0.94443257, 0.93472014, 0.93678611, 0.9337687 ,
        0.93545476, 0.93120236, 0.93113111, 0.8715178 , 0.92511756]),
 0.9230811558723373)

In [172]:
print(f'{round(final_scores.mean()*100, 2)}% accuracy over 10-fold Cross Validation')

92.31% accuracy over 10-fold Cross Validation
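Since fold-to-fold variability was the main concern, it may help to summarize the two 10-fold runs side by side. The arrays below are copied from the outputs of the two `cross_val_score` calls above. The names `first_run` and `second_run` are just labels for this sketch.

```python
import numpy as np

# Fold scores copied from the two 10-fold runs with max_depth=5 and 14 components.
first_run = np.array([0.87884401, 0.9361687, 0.9427228, 0.91375175, 0.9151033,
                      0.9315127, 0.90595835, 0.8991427, 0.91200988, 0.91485774])
second_run = np.array([0.88668044, 0.94443257, 0.93472014, 0.93678611, 0.9337687,
                       0.93545476, 0.93120236, 0.93113111, 0.8715178, 0.92511756])

for name, run in (('first', first_run), ('second', second_run)):
    # Sample standard deviation (ddof=1) across the 10 folds.
    print(f'{name}: mean={run.mean():.4f} std={run.std(ddof=1):.4f} '
          f'min={run.min():.4f} max={run.max():.4f}')
```

Both runs average above 90%, but each has at least one fold below 0.90, so the 90% target holds on average rather than for every fold.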
