# Drill - Third Attempt

So here's your task. Get rid of as much data as possible without dropping below an average of 90% accuracy in a 10-fold cross validation.

You'll want to do a few things in this process. First, dive into the data that we have and see which features are most important. This can be the raw features or the generated dummies. You may want to use PCA or correlation matrices.
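One way to act on the correlation-matrix suggestion is to drop one column from every highly correlated pair. The sketch below is a minimal illustration on toy data, not the notebook's method: the helper name `drop_correlated`, the 0.9 threshold, and the toy columns are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop one column from each pair whose absolute correlation exceeds threshold.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy data standing in for numeric loan columns (made-up values).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'a_copy': 2 * a + rng.normal(scale=0.01, size=200),  # nearly duplicates 'a'
    'b': rng.normal(size=200),                            # independent of 'a'
})

reduced, dropped = drop_correlated(toy)
print(dropped)  # → ['a_copy']
```

On the real data this would be applied to the numeric columns of `X` before modeling. Which member of a correlated pair gets dropped depends only on column order, so it is worth checking the result by hand.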

Can you do it without using anything related to payment amount or outstanding principal? How do you know?
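A simple way to test this question is to remove every payment- or principal-related column by name and rerun the same cross validation on what is left. The sketch below only shows the filtering step. The substrings `'pymnt'` and `'prncp'` are an assumption about how such columns are named in this dataset, and `without_payment_features` is a hypothetical helper.

```python
import pandas as pd

# Substrings assumed to mark payment- or principal-related columns.
BLOCKED = ('pymnt', 'prncp')

def without_payment_features(X, blocked=BLOCKED):
    """Return X without any column whose name contains a blocked substring."""
    keep = [c for c in X.columns if not any(b in c for b in blocked)]
    return X[keep]

# Empty frame with a few column names of the kind found in the loan data.
demo = pd.DataFrame(columns=[
    'out_prncp', 'last_pymnt_amnt', 'last_pymnt_d_Dec-2016',
    'total_rec_prncp', 'int_rate', 'loan_amnt',
])

print(list(without_payment_features(demo).columns))
# → ['int_rate', 'loan_amnt']
```

If the mean 10-fold score on the filtered features stays at or above 0.90, the answer is yes. If it falls below, the payment columns are carrying the prediction. Other columns may leak the same information under different names, so the blocked list should be reviewed against the full column list.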

In [28]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#For selecting features
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif

#Import Models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

In [7]:
# Replace the path with the correct path for your data.
y2015 = pd.read_csv(
    'https://www.dropbox.com/s/0so14yudedjmm5m/LoanStats3d.csv?dl=1',
    skipinitialspace=True,
    header=1
)

In [8]:
# Convert ID and Interest Rate to numeric.
y2015['id'] = pd.to_numeric(y2015['id'], errors='coerce')
y2015['int_rate'] = pd.to_numeric(y2015['int_rate'].str.strip('%'), errors='coerce')

# Drop other columns with many unique values.
# Pass axis by keyword; the positional form is not accepted by recent pandas versions.
y2015.drop(['url', 'emp_title', 'zip_code', 'earliest_cr_line', 'revol_util',
            'sub_grade', 'addr_state', 'desc'], axis=1, inplace=True)

In [9]:
# Remove two summary rows at the end that don't actually contain data.
y2015 = y2015[:-2]

### Feature Engineering

In [10]:
y2015.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421095 entries, 0 to 421094
Columns: 103 entries, id to total_il_high_credit_limit
dtypes: float64(87), object(16)
memory usage: 330.9+ MB


In [11]:
y2015['loan_status'].value_counts()

Current               287414
Fully Paid             87989
Charged Off            29178
Late (31-120 days)      9510
In Grace Period         4320
Late (16-30 days)       1888
Default                  796
Name: loan_status, dtype: int64

In [12]:
X = y2015.drop('loan_status', axis=1)
Y = y2015['loan_status']
X = pd.get_dummies(X)
X = X.dropna(axis=1)

In [18]:
#Split the data into training and validation
X_train, X_test, y_train, y_test = train_test_split(X,Y)

In [22]:
#Create single Tree
dt = DecisionTreeClassifier()
model = dt.fit(X_train, y_train)
prediction = dt.predict(X_test)
dt.score(X_test, y_test)

0.96715238330451958

In [23]:
#Look at important Features
zipped = zip(X.columns, dt.feature_importances_)
zipped_sorted = sorted(zipped, key=lambda x: x[1], reverse=True)
for feat, importance in zipped_sorted:
    if importance > 0.01:
        print('feature: {f}, importance: {i}'.format(f=feat, i=importance))

feature: out_prncp, importance: 0.6310804696548986
feature: last_pymnt_amnt, importance: 0.16035979376856613
feature: last_pymnt_d_Dec-2016, importance: 0.053838909787141974
feature: total_rec_prncp, importance: 0.03965800302197799
feature: last_pymnt_d_Jan-2017, importance: 0.025422683404351887


In [29]:
#Create new feature set

features = X[['out_prncp','last_pymnt_amnt','last_pymnt_d_Dec-2016','total_rec_prncp','last_pymnt_d_Jan-2017']]

In [30]:
#Use cross validation to fit model
rfc = RandomForestClassifier()

cross_val_score(rfc, features, Y, cv=10)

array([ 0.96150649,  0.96886799,  0.97041153,  0.97603951,  0.96124436,
        0.97190691,  0.96696668,  0.9717875 ,  0.9697675 ,  0.97534793])
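The task is stated in terms of the average over the ten folds, so the array above still has to be reduced to a single number. A minimal check, using the fold scores copied from the output above:

```python
import numpy as np

# Fold scores copied from the cross_val_score output above.
scores = np.array([0.96150649, 0.96886799, 0.97041153, 0.97603951, 0.96124436,
                   0.97190691, 0.96696668, 0.9717875, 0.9697675, 0.97534793])

mean_score = scores.mean()
print('mean: {:.4f}, min: {:.4f}, max: {:.4f}'.format(
    mean_score, scores.min(), scores.max()))

# The drill requires an average of at least 90% accuracy.
meets_target = mean_score >= 0.90
print(meets_target)
```

The mean is roughly 0.97, and even the weakest fold is above 0.96, so the five-feature model clears the 90% target.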

### Write Up

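I used the `feature_importances_` attribute of a single decision tree to rank the features and kept the five with importance above 0.01. A random forest trained on only those five features scored between 96% and 98% in every fold of a 10-fold cross validation, so most of the columns could be dropped without hurting accuracy. The outstanding principal and the most recent payment amounts and dates turned out to be the most important features. Because all five selected features relate to payments or principal, this result does not answer whether the target can be reached without them.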
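One caveat about the approach in this notebook: the feature list was chosen from a tree fit on one train/test split and then cross-validated on the full data, so the selection step saw rows that later served as validation folds. A possible alternative, sketched here on synthetic data, is to put the selection inside a pipeline so it is refit in every fold. `SelectFromModel`, the 0.01 threshold, and the `make_classification` data are stand-ins chosen for the example, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data: 20 features, 4 of them informative.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=4, random_state=0)

# Selection and model are fit together on the training part of each fold.
pipe = make_pipeline(
    SelectFromModel(DecisionTreeClassifier(random_state=0), threshold=0.01),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

demo_scores = cross_val_score(pipe, X_demo, y_demo, cv=10)
print(demo_scores.mean())
```

With this layout the reported score reflects the whole procedure, selection included, rather than only the final model.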