To-do:
* Update table of contents
* Write introduction
* Write conclusion
* More EDA?

# Loan Status Prediction: Lending Club, 2007-2017

***By Joe Corliss***

## Table of Contents

(Currently outdated)

1. [Summary](#1)
    1. [Spoilers](#1.1)
2. [Import the Data](#2)
3. [Target Variable](#3)
4. [Feature Selection](#4)
    1. [Drop columns that have only one distinct value](#4.1)
    2. [Remove columns that have < 2% data](#4.2)
    3. [Remove irrelevant features](#4.3)
    4. [Remove features that could make predictions too easy](#4.4)
    5. [Inspect non-numerical features](#4.5)
5. [Exploratory Data Analysis](#5)
6. [Correlations with 'charged_off'](#6)
    1. [Create dummy variables](#6.1)
    2. [Compute correlations with 'charged_off'](#6.2)
7. [More Pre-processing](#7)
    1. [Train/test split](#7.1)
    2. [Imputation with mean substitution](#7.2)
    3. [Standardize the data](#7.3)
8. [Predictive Modeling: SGDClassifier](#8)
    1. [Train with grid search](#8.1)
    2. [Test set evaluation](#8.2)

# Introduction

[Data source on Kaggle](https://www.kaggle.com/wordsforthewise/lending-club)

[Older, smaller dataset](https://www.kaggle.com/wendykan/lending-club-loan-data)

[Notebook on Kaggle](https://www.kaggle.com/pileatedperch/predicting-loan-status-mcc-0-73)

[Lending Club Statistics](https://www.lendingclub.com/info/download-data.action)

[GitHub Repository](https://github.com/jgcorliss/lending-club)

The goal of this project is to predict whether a loan will be fully paid or charged off. We'll remove some features that would make this prediction too easy, such as the total payments received on the loan to date.

# Import the Data
<a id="2"></a>

Import basic libraries.

In [1]:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Pandas options
pd.set_option('display.max_colwidth', 1000, 'display.max_rows', None, 'display.max_columns', None)

# Plotting options
%matplotlib inline
mpl.style.use('ggplot')
sns.set(style='whitegrid')

Read in the data.

In [2]:
df = pd.read_csv('accepted_2007_to_2017Q3.csv.gz', compression='gzip')


Check basic dataframe info.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1646801 entries, 0 to 1646800
Columns: 150 entries, id to settlement_term
dtypes: float64(113), object(37)
memory usage: 1.8+ GB


Sample some rows:

In [10]:
df.sample(3)

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,fico_range_low,fico_range_high,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,last_fico_range_high,last_fico_range_low,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_fico_range_low,sec_app_fico_range_high,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
647519,41001038,,17600.0,17600.0,17600.0,36 months,9.99,567.82,B,B3,Guest Service Team Leader,< 1 year,RENT,40000.0,Source Verified,Feb-2015,Fully Paid,n,,debt_consolidation,Debt consolidation,967xx,HI,12.72,0.0,Sep-2002,745.0,749.0,1.0,,,6.0,0.0,6571.0,30.7,19.0,w,0.0,0.0,19626.27,19626.27,17600.0,2026.27,0.0,0.0,0.0,Jul-2016,10560.69,,Aug-2016,774.0,770.0,0.0,,1.0,Individual,,,,0.0,0.0,19843.0,,,,,,,,,,,,21400.0,,,,3.0,3307.0,5944.0,52.4,0.0,0.0,149.0,103.0,3.0,3.0,0.0,17.0,,0.0,,0.0,1.0,2.0,1.0,4.0,11.0,4.0,8.0,2.0,6.0,0.0,0.0,0.0,1.0,100.0,0.0,0.0,0.0,43592.0,19843.0,12500.0,22192.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
360905,63389326,,6000.0,6000.0,6000.0,36 months,7.26,185.98,A,A4,,,OWN,27750.0,Not Verified,Oct-2015,Current,n,,credit_card,Credit card refinancing,480xx,MI,26.08,0.0,Feb-1999,715.0,719.0,0.0,,,6.0,0.0,30154.0,75.6,11.0,w,1973.3,1973.3,4647.08,4647.08,4026.7,620.38,0.0,0.0,0.0,Nov-2017,185.98,Jan-2018,Dec-2017,714.0,710.0,0.0,,1.0,Individual,,,,0.0,0.0,30154.0,,,,,,,,,,,,85900.0,,,,2.0,5026.0,8546.0,77.9,0.0,0.0,156.0,200.0,9.0,9.0,0.0,9.0,,15.0,,0.0,3.0,3.0,4.0,7.0,1.0,6.0,10.0,3.0,6.0,0.0,0.0,0.0,1.0,100.0,50.0,0.0,0.0,85900.0,30154.0,38700.0,0.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
699954,112052125,,12000.0,12000.0,12000.0,60 months,16.02,291.95,C,C5,department head,10+ years,MORTGAGE,42000.0,Source Verified,Jun-2017,Current,n,,debt_consolidation,Debt consolidation,443xx,OH,23.74,1.0,Aug-1989,700.0,704.0,1.0,10.0,,11.0,0.0,13014.0,59.2,15.0,w,11323.42,11323.42,1433.05,1433.05,676.58,756.47,0.0,0.0,0.0,Dec-2017,291.95,Jan-2018,Dec-2017,739.0,735.0,0.0,,1.0,Individual,,,,0.0,0.0,118536.0,2.0,6.0,2.0,4.0,5.0,32817.0,79.0,1.0,1.0,10096.0,72.0,22000.0,0.0,1.0,1.0,5.0,10776.0,8986.0,59.2,0.0,0.0,144.0,334.0,4.0,4.0,3.0,4.0,,4.0,,0.0,4.0,4.0,4.0,4.0,7.0,4.0,5.0,4.0,11.0,0.0,0.0,0.0,3.0,86.7,50.0,0.0,0.0,153237.0,45831.0,22000.0,41637.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,


# Response Variable
<a id="3"></a>

We're going to try to predict the `loan_status` column. What are the value counts in this column?

In [11]:
df['loan_status'].value_counts()

Current                                                788950
Fully Paid                                             646902
Charged Off                                            168084
Late (31-120 days)                                      23763
In Grace Period                                         10474
Late (16-30 days)                                        5786
Does not meet the credit policy. Status:Fully Paid       1988
Does not meet the credit policy. Status:Charged Off       761
Default                                                    70
Name: loan_status, dtype: int64

Let's only consider loans that meet the credit policy and have either been fully paid or charged off. These are the two cases we'll try to distinguish with a predictive model.

Retain only the rows with `loan_status` "Fully Paid" or "Charged Off."

In [None]:
df = df.loc[df['loan_status'].isin(['Fully Paid', 'Charged Off'])]

How many observations remain?

In [None]:
df.shape

In [None]:
df['loan_status'].value_counts(normalize=True)

About 79% of the loans have been fully paid, and 21% have been charged off.

Let's convert the `loan_status` column to a binary `charged_off` column.

In [None]:
df['loan_status'] = df['loan_status'].isin(['Charged Off'])

In [None]:
df['loan_status'].value_counts(normalize=True, dropna=False)

Rename the `loan_status` column to `charged_off`.

In [None]:
df.rename(columns={'loan_status':'charged_off'}, inplace=True)

Our target variable is ready to go. We have two classes to try to predict.
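As a quick sanity check, the filtering and encoding steps can be reproduced on a tiny frame. This is only a sketch: the `toy` DataFrame and its rows are made up for illustration, and it builds the `charged_off` column directly rather than renaming `loan_status`.

```python
import pandas as pd

# Hypothetical toy data; the status strings mimic values seen in `loan_status`.
toy = pd.DataFrame({'loan_status': ['Fully Paid', 'Charged Off', 'Current',
                                    'Fully Paid', 'Late (31-120 days)']})

# Keep only finished loans, then encode the target as a boolean column.
toy = toy.loc[toy['loan_status'].isin(['Fully Paid', 'Charged Off'])].copy()
toy['charged_off'] = toy['loan_status'] == 'Charged Off'
toy = toy.drop(columns='loan_status')

print(toy['charged_off'].tolist())
```

Only the three finished loans survive, and exactly one of them is flagged as charged off.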

# Feature Selection
<a id="4"></a>

The dataset has 150 features, but we won't be using all the features to predict the loan status.

Definitions of the columns are given in the Lending Club "Data Dictionary" [available here](https://www.lendingclub.com/info/download-data.action).

First let's locate columns with missing data:

## Missing Data

In [None]:
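def incomplete_cols(df):
    """Return the fraction of non-null values in each column that has missing data, sorted ascending."""
    cmp = df.notnull().mean().sort_values()
    return cmp.loc[cmp < 1]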

In [None]:
incmpl_cols = incomplete_cols(df)
incmpl_cols

In [None]:
incmpl_cols = incmpl_cols.index

There are LOTS of features with missing data.
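To make the output of `incomplete_cols` concrete, here is a small sketch on made-up data. The function is restated so the snippet runs on its own, and the `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

def incomplete_cols(df):
    """Return the fraction of non-null values for each column that has missing data."""
    frac_present = df.notnull().mean().sort_values()
    return frac_present.loc[frac_present < 1]

# Hypothetical toy data: 'a' is complete, 'b' is 75% complete, 'c' is 25% complete.
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [1.0, np.nan, 3.0, 4.0],
                    'c': [np.nan, np.nan, np.nan, 4.0]})

result = incomplete_cols(toy)
print(result)
```

The complete column `a` is omitted, and the remaining columns are listed from least to most complete.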

## Create features indicating missing data

In [None]:
for col in incmpl_cols:
    df[col+'_NA'] = df[col].isnull()

In [None]:
df.shape

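Remove duplicate indicator columns. Features that are always missing together produce identical `_NA` columns, so we keep only the last column of each identical group.

In [None]:
# Compare each `_NA` indicator with every later one; mark it for removal if an identical later column exists.
na_cols = [col for col in df.columns if col.endswith('_NA')]
cols_to_remove = []
for i, col_m in enumerate(na_cols):
    for col_n in na_cols[i+1:]:
        if (df[col_m] == df[col_n]).all():
            cols_to_remove.append(col_m)
            break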

In [None]:
df.drop(labels=cols_to_remove, axis=1, inplace=True)

In [None]:
df.shape
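The following sketch shows, on made-up data, why some missing-data indicators end up duplicated and how the pairwise comparison removes them. The `toy` frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'x' and 'y' are missing in exactly the same row.
toy = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                    'y': [4.0, np.nan, 6.0],
                    'z': [np.nan, 8.0, 9.0]})

# Add one boolean indicator per incomplete column.
for col in ['x', 'y', 'z']:
    toy[col + '_NA'] = toy[col].isnull()

# An indicator is a duplicate if some later indicator is identical to it.
na_cols = [c for c in toy.columns if c.endswith('_NA')]
dupes = [a for i, a in enumerate(na_cols)
         if any((toy[a] == toy[b]).all() for b in na_cols[i + 1:])]
toy = toy.drop(columns=dupes)

print(dupes)
print(toy.columns.tolist())
```

Because `x` and `y` are missing together, `x_NA` carries no information beyond `y_NA` and is dropped.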

## Drop features that have >30% missing data
<a id="4.2"></a>

Find the features to drop:

In [None]:
drop_list = []
for col in df.columns:
    if df[col].isnull().mean() > 0.3:
        drop_list.append(col)

drop_list

Drop these columns.

In [None]:
df.shape

In [None]:
df.drop(labels=drop_list, axis=1, inplace=True)

In [None]:
df.shape

## Remove irrelevant features
<a id="4.3"></a>

Let's drop some features that we don't think will be useful for predicting the loan status.

Analyzing text in the borrower loan description, job title, or loan title could be an interesting direction, but we won't explore this for now. The last three features listed below contain date information. We could convert these to numerical values, but we won't bother doing so.

In [None]:
df.shape

In [None]:
df.drop(labels=['id', 'emp_title', 'title', 'issue_d', 'last_credit_pull_d', 'earliest_cr_line'], axis=1, inplace=True)

In [None]:
df.shape

## Remove features that could make predictions too easy
<a id="4.4"></a>

Some features perfectly predict the loan status. For example, a `debt_settlement_flag` of 'Y' implies that the loan was charged off. Likewise, if `total_pymnt` is greater than `loan_amnt`, the loan must have been paid off, and the same reasoning applies to the interest and fee payment features. Let's remove these columns so that predicting charge-offs isn't trivially simple.

In [None]:
df.shape

In [None]:
df.drop(labels=['collection_recovery_fee', 'debt_settlement_flag', 'last_pymnt_amnt', 'last_pymnt_d', 'recoveries', 'total_pymnt', 'total_pymnt_inv', 'total_rec_int', 'total_rec_late_fee', 'total_rec_prncp'], axis=1, inplace=True)

In [None]:
df.shape

## Inspect non-numerical features
<a id="4.5"></a>

We're going to inspect features of type `object`, i.e. text data.

In [None]:
df.sample(3)

Which columns have text data?

In [None]:
text_cols = []
for col in df.columns:
    if df[col].dtype == object:  # the builtin `object`; the `np.object` alias was removed in NumPy 1.24
        text_cols.append(col)

text_cols

### term

In [None]:
df['term'].value_counts(dropna=False)

Convert `term` to integer values.

In [None]:
df['term'] = df['term'].apply(lambda s: np.int8(s.split()[0])) # split() ignores the leading space in the raw strings

In [None]:
df['term'].value_counts()

### grade, sub_grade

In [None]:
df['grade'].value_counts().sort_index()

The grade is implied by the subgrade, so let's drop the grade column.

In [None]:
df.drop(labels=['grade'], axis=1, inplace=True)

### emp_length

In [None]:
df['emp_length'].value_counts(dropna=False).sort_index()

Let's convert `emp_length` to floats:

In [None]:
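def emp_conv(s):
    """Convert an `emp_length` string to a float number of years: '< 1 year' -> 0.0, '10+ years' -> 10.0."""
    try:
        if pd.isnull(s):
            return s
        elif s[0] == '<':
            return 0.0
        elif s[:2] == '10':
            return 10.0
        else:
            return np.float16(s[0])
    except TypeError:
        # `s` is already numeric (e.g. the cell was re-run), so indexing it fails
        return np.float16(s)

In [None]:
df['emp_length'] = df['emp_length'].apply(emp_conv)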

In [None]:
df['emp_length'].value_counts(dropna=False).sort_index()
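Here is a minimal sketch of the conversion on a handful of example strings. The function below is a simplified restatement using plain `float`, and the input values are assumed to follow the `'< 1 year'` / `'10+ years'` pattern seen in the sampled rows; `'1 year'` and `'3 years'` are assumed formats.

```python
import numpy as np
import pandas as pd

def emp_conv(s):
    """Map strings such as '< 1 year', '3 years', '10+ years' to 0.0, 3.0, 10.0."""
    if pd.isnull(s):
        return s          # keep missing values missing
    if s[0] == '<':
        return 0.0        # '< 1 year'
    if s[:2] == '10':
        return 10.0       # '10+ years'
    return float(s[0])    # single leading digit, e.g. '3 years'

# Hypothetical sample of employment-length strings.
toy = pd.Series(['< 1 year', '1 year', '3 years', '10+ years', np.nan])
converted = toy.apply(emp_conv)
print(converted.tolist())
```

Missing values pass through unchanged, so they can still be imputed later.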

### home_ownership

In [None]:
df['home_ownership'].value_counts()

### verification_status

In [None]:
df['verification_status'].value_counts()

### pymnt_plan

In [None]:
df['pymnt_plan'].value_counts(dropna=False)

This feature will be removed automatically later.

### purpose

In [None]:
df['purpose'].value_counts()

### zip_code, addr_state

In [None]:
df['zip_code'].nunique()

In [None]:
df['addr_state'].nunique()

There are a lot of zip codes, so let's just retain the state column.

In [None]:
df.drop(labels=['zip_code'], axis=1, inplace=True)

### initial_list_status

In [None]:
df['initial_list_status'].value_counts()

I wasn't able to determine what the initial list status means.

### application_type

In [None]:
df['application_type'].value_counts()

### hardship_flag

In [None]:
df['hardship_flag'].value_counts()

This feature will be removed automatically later.

### disbursement_method

In [None]:
df['disbursement_method'].value_counts()

# Exploratory Data Analysis
<a id="5"></a>

View some random rows:

In [None]:
df.sample(3)

Let's make a count plot of the loan purpose, separated by the `charged_off` value.

In [None]:
plt.figure(figsize=(12,6), dpi=100)
sns.countplot(y='purpose', hue='charged_off', data=df, orient='h')

It appears that most of the charge-offs come from loans for debt consolidation or to pay off credit cards.

Let's make a similar plot, but with `sub_grade` instead of `purpose`.

In [None]:
plt.figure(figsize=(16,6), dpi=120)
sns.countplot(x='sub_grade', hue='charged_off', data=df, order=sorted(df['sub_grade'].value_counts().index))

There's a clear trend of higher probability of charge-off as the subgrade worsens.
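Count plots show absolute counts, so the charge-off rate per category has to be read off by eye. One way to get the rate directly is a grouped mean of the boolean target. The sketch below uses made-up rows; on the real data the same `groupby` call would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Lending Club dataset.
toy = pd.DataFrame({
    'sub_grade':   ['A1', 'A1', 'A1', 'A1', 'C3', 'C3', 'C3', 'C3', 'F2', 'F2'],
    'charged_off': [False, False, False, False, True, False, False, False, True, True],
})

# The mean of a boolean column is the fraction of True values, i.e. the charge-off rate.
rate = toy.groupby('sub_grade')['charged_off'].mean()
print(rate)
```

The same pattern works for any categorical feature, such as `purpose` or `term`.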

Let's make a similar plot, but with `term` instead of `sub_grade`.

In [None]:
plt.figure(figsize=(4,4), dpi=90)
sns.countplot(x='term', hue='charged_off', data=df)

Loans with a term of 60 months are much more likely to be charged off.

Now let's compare the interest rate to the loan status using a distribution plot: a histogram overlaid with a kernel density estimate (KDE), which approximates the probability density of the data.

In [None]:
plt.figure(figsize=(10,4), dpi=90)
sns.distplot(df['int_rate'].loc[df['charged_off']==0], kde=True, label='charged_off = 0')
sns.distplot(df['int_rate'].loc[df['charged_off']==1], kde=True, label='charged_off = 1')
plt.xlabel('int_rate')
plt.ylabel('density')
plt.legend()

Charged-off loans tend to have higher interest rates.

Now let's compare the borrower's most recent FICO score (a credit score) to the loan status.

In [None]:
plt.figure(figsize=(10,4), dpi=90)
sns.distplot(df['last_fico_range_high'].loc[df['charged_off']==0], label='charged_off = 0')
sns.distplot(df['last_fico_range_high'].loc[df['charged_off']==1], label='charged_off = 1')
plt.xlabel('last_fico_range_high')
plt.ylabel('density')
plt.legend()

There's a strong association here: charged-off loans tend to have much lower FICO scores.

# More Pre-processing
<a id="7"></a>

## Create dummy variables
<a id="6.1"></a>

In [None]:
df.shape

In [None]:
df = pd.get_dummies(df, drop_first=True)

In [None]:
df.shape

Now all the features are numerical. What does the data look like after converting categorical features to dummy variables?

In [None]:
df.sample(3)
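To illustrate what `pd.get_dummies(..., drop_first=True)` does, here is a small sketch on a hypothetical frame with one categorical and one numerical column.

```python
import pandas as pd

# Hypothetical toy data with three categories.
toy = pd.DataFrame({'home_ownership': ['RENT', 'OWN', 'MORTGAGE', 'RENT'],
                    'loan_amnt': [1000.0, 2000.0, 3000.0, 4000.0]})

# Each category becomes its own indicator column; the first category
# (alphabetically, 'MORTGAGE') is dropped and acts as the baseline.
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies.columns.tolist())
```

Dropping the first level avoids perfectly collinear indicator columns: a row with zeros in both remaining indicators is a 'MORTGAGE' row.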

## Train/test split
<a id="7.1"></a>

In [None]:
X = df.drop(labels=['charged_off'], axis=1) # Features
y = df.loc[:,'charged_off'] # Response variable
df = None

In [None]:
from sklearn.model_selection import train_test_split

Let's do a 90/10 train/test split.

In [None]:
random_state = 1 # Just to make the results fixed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=random_state, stratify=y)
X = None
y = None

In [None]:
X_train.shape

In [None]:
X_test.shape
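The `stratify=y` argument keeps the class proportions the same in both splits. The sketch below checks this on synthetic data with 20% positives; the arrays are made up and only loosely mirror the class balance of the loan data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 20% positive labels.
rng = np.random.RandomState(1)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 200 + [0] * 800)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=1, stratify=y_toy)

# With stratification, both splits keep the 20% positive rate.
print(y_tr.mean(), y_te.mean())
```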

## Imputation with mean substitution
<a id="7.2"></a>

How complete is our training data?

In [None]:
incomplete_cols(X_train)

The learning algorithms cannot handle missing values. We'll perform mean substitution, computing the means from the training set only to prevent test set leakage.

In [None]:
from sklearn.impute import SimpleImputer  # replaces `sklearn.preprocessing.Imputer`, removed in scikit-learn 0.22

In [None]:
imputer = SimpleImputer(strategy='mean').fit(X_train)

In [None]:
X_train = pd.DataFrame(imputer.transform(X_train), columns=X_train.columns)
X_test  = pd.DataFrame(imputer.transform(X_test),  columns=X_test.columns)
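The sketch below shows what "using only the means of the training set" looks like on tiny made-up arrays. It assumes scikit-learn 0.20 or later, where `SimpleImputer` lives in `sklearn.impute`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy arrays with missing entries.
X_tr = np.array([[1.0, 10.0],
                 [3.0, np.nan],
                 [np.nan, 30.0]])
X_te = np.array([[np.nan, np.nan],
                 [100.0, 200.0]])

# The column means are learned from the training rows only.
imputer = SimpleImputer(strategy='mean').fit(X_tr)
X_te_filled = imputer.transform(X_te)

print(imputer.statistics_)  # per-column training means
print(X_te_filled)
```

The missing test values are filled with the training means, even though the test set's own observed values are much larger.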

## Standardize the data
<a id="7.3"></a>

Shift and scale each column individually so that it has zero mean and unit variance. SGD-based linear models are sensitive to feature scale, so this helps training converge.

Train the scaler using only the training data to prevent test set leakage.

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler().fit(X_train)

In [None]:
X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
X_test  = pd.DataFrame(scaler.transform(X_test),  columns=X_test.columns)
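As a small illustration of train-only scaling, the sketch below standardizes a single made-up column. The arrays are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-column data: the training mean is 2.
X_tr = np.array([[0.0], [2.0], [4.0]])
X_te = np.array([[2.0], [6.0]])

# Mean and standard deviation come from the training rows only.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print(X_tr_s.ravel())
print(X_te_s.ravel())
```

The scaled training column has zero mean and unit variance by construction. The scaled test column generally does not, which is expected.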

## Automatic feature selection

In [None]:
from sklearn.feature_selection import SelectKBest, mutual_info_classif

In [None]:
selector = SelectKBest(mutual_info_classif, k=15)

In [None]:
selector.fit(X_train.iloc[:50000], y_train.iloc[:50000])  # fit on a subsample: the first 50,000 training rows

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
X_train = selector.transform(X_train)
X_test  = selector.transform(X_test)

In [None]:
X_train.shape

In [None]:
X_test.shape
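After `selector.transform`, the column names are lost because the result is a NumPy array. The sketch below shows, on synthetic data, how `get_support()` reveals which columns were kept. The data is made up so that only one column is informative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data: five noise columns, of which only column 2 depends on the label.
rng = np.random.RandomState(0)
n = 500
y_toy = rng.randint(0, 2, size=n)
X_toy = rng.normal(size=(n, 5))
X_toy[:, 2] += 3.0 * y_toy

selector = SelectKBest(mutual_info_classif, k=1).fit(X_toy, y_toy)

# get_support() returns a boolean mask over the input columns.
kept = np.flatnonzero(selector.get_support())
print(kept)
```

With a DataFrame, the same mask can index the column labels (for example `X_train.columns[selector.get_support()]`) as long as it is taken before the frame is replaced by the transformed array.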

# Predictive Modeling: SGDClassifier
<a id="8"></a>

I chose an SGD classifier by following the scikit-learn machine learning flowchart: http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html.

The `SGDClassifier` estimator implements linear classifiers (SVM, logistic regression, among others) trained with stochastic gradient descent (SGD). The `loss` hyperparameter selects which linear classifier is fit.

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import matthews_corrcoef, make_scorer

## Train with grid search
<a id="8.1"></a>

We're going to search through many hyperparameters of SGDClassifier using an exhaustive grid search with 3-fold cross-validation, implemented in GridSearchCV.

Here are the hyperparameters that we'll try:

In [None]:
param_grid = [{'loss': ['hinge'],
               'alpha': [10.0**k for k in range(-1, 0)],
               'class_weight': [None, 'balanced']},
              {'loss': ['log_loss'],  # logistic regression; this loss is named 'log' in scikit-learn < 1.1
               'alpha': [10.0**k for k in range(-1, 0)],
               'penalty': ['l2', 'l1']}]

Instantiate the grid estimator. We'll use the Matthews correlation coefficient as our scoring metric.

In [None]:
grid = GridSearchCV(estimator=SGDClassifier(max_iter=1000, tol=1e-3, random_state=random_state, warm_start=True),
                    param_grid=param_grid,
                    scoring=make_scorer(matthews_corrcoef),
                    cv=3,  # 3-fold CV, stated explicitly because the default changed to 5 in scikit-learn 0.22
                    n_jobs=4, pre_dispatch=8, verbose=1)
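For reference, the Matthews correlation coefficient of a binary classifier can be computed from the confusion matrix as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below checks this formula against `matthews_corrcoef` on made-up labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Hypothetical labels: 3 positives, 7 negatives.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
mcc_manual = (tp * tn - fp * fn) / np.sqrt(
    float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
mcc_sklearn = matthews_corrcoef(y_true, y_pred)

print(mcc_manual, mcc_sklearn)
```

The score ranges from -1 to 1, with 0 corresponding to predictions no better than chance.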

Run the grid search (this could take some time).

In [None]:
grid.fit(X_train, y_train)

Hyperparameters that gave the best mean score across the cross-validation folds:

In [None]:
grid.best_params_

Mean cross-validated MCC score of the best estimator:

In [None]:
grid.best_score_
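Beyond `best_params_` and `best_score_`, a fitted `GridSearchCV` stores the scores of every candidate in `cv_results_`. The sketch below shows how to inspect them on synthetic data. The dataset, the small grid, and the name `toy_grid` are assumptions for illustration, not the settings used above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced binary problem (roughly 80/20).
X_toy, y_toy = make_classification(n_samples=600, n_features=8, n_informative=4,
                                   weights=[0.8, 0.2], random_state=1)

toy_grid = GridSearchCV(
    estimator=SGDClassifier(loss='hinge', max_iter=1000, tol=1e-3, random_state=1),
    param_grid={'alpha': [1e-3, 1e-2, 1e-1], 'class_weight': [None, 'balanced']},
    scoring=make_scorer(matthews_corrcoef),
    cv=3,
)
toy_grid.fit(X_toy, y_toy)

# One row per hyperparameter combination, best candidate first.
results = pd.DataFrame(toy_grid.cv_results_)[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(results.sort_values('rank_test_score'))
```

Comparing `mean_test_score` against `std_test_score` gives a sense of whether the best candidate is clearly better than the runners-up or within fold-to-fold noise.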

Weights assigned to the features by the best estimator:

In [None]:
#plt.figure(figsize=(8,36), dpi=90)
weights = grid.best_estimator_.coef_[0,:]
sns.barplot(y=np.arange(len(weights)), x=weights, orient='h')  # one bar per selected feature
plt.title("Classifier Weights")
plt.xlabel("classifier weight")

## Test set evaluation
<a id="8.2"></a>

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, matthews_corrcoef, cohen_kappa_score, accuracy_score, average_precision_score, roc_auc_score

In [None]:
def classification_eval(estimator, X_test, y_test):
    """
    Print several metrics of classification performance of an estimator, given features X_test and labels y_test.
    
    Input: estimator or GridSearchCV instance, X_test, y_test
    Returns: None; the metrics are printed
    """
    y_pred = estimator.predict(X_test)
    
    # Number of decimal places based on number of samples
    dec = int(np.ceil(np.log10(len(y_test))))
    
    print('Confusion matrix -----------------------'+3*(dec-1)*'-')
    print(confusion_matrix(y_test, y_pred), '\n')
    
    print('Classification report ------------------'+3*(dec-1)*'-')
    print(classification_report(y_test, y_pred, digits=dec))
    
    print('Scalar metrics -------------------------'+3*(dec-1)*'-')
    format_str = '%%13s = %%.%if' % dec
    print(format_str % ('MCC', matthews_corrcoef(y_test, y_pred)))
    if y_test.nunique() <= 2: # Additional metrics for binary classification
        try:
            y_score = estimator.predict_proba(X_test)[:,1]
        except AttributeError:
            # Estimators without probability estimates (e.g. hinge loss) expose decision_function instead
            y_score = estimator.decision_function(X_test)
        print(format_str % ('AUPRC', average_precision_score(y_test, y_score)))
        print(format_str % ('AUROCC', roc_auc_score(y_test, y_score)))
    print(format_str % ("Cohen's kappa", cohen_kappa_score(y_test, y_pred)))
    print(format_str % ('Accuracy', accuracy_score(y_test, y_pred)))

Test set evaluation metrics:

In [None]:
classification_eval(grid, X_test, y_test)