# Loan Status Prediction: Lending Club, 2007-2017

## Table of Contents

1. [Summary](#1)
    1. [Spoilers](#1.1)
2. [Import the Data](#2)
3. [Target Variable](#3)
4. [Feature Selection](#4)
    1. [Drop columns that have only one distinct value](#4.1)
    2. [Remove columns that have < 2% data](#4.2)
    3. [Remove irrelevant features](#4.3)
    4. [Remove features that could make predictions too easy](#4.4)
    5. [Inspect non-numerical features](#4.5)
5. [Exploratory Data Analysis](#5)
6. [Correlations with 'charged_off'](#6)
    1. [Create dummy variables](#6.1)
    2. [Compute correlations with 'charged_off'](#6.2)
7. [More Pre-processing](#7)
    1. [Train/test split](#7.1)
    2. [Imputation with mean substitution](#7.2)
    3. [Standardize the data](#7.3)
8. [Predictive Modeling: SGDClassifier](#8)
    1. [Train with grid search](#8.1)
    2. [Test set evaluation](#8.2)

# Summary
<a id="1"></a>

[Data source](https://www.kaggle.com/wordsforthewise/lending-club)

The goal of this project is to predict whether a loan will be fully paid or charged off. We'll remove some features that would make this prediction too easy, such as the total payments received on the loan to date.

This is my first kernel on Kaggle. I would appreciate any constructive feedback!

## Spoilers
<a id="1.1"></a>

By far the most useful features for predicting whether a loan will be paid off are 'last_fico_range_low' and 'last_fico_range_high', which hold the most recent credit score of the borrower.

We will delete features that could make the prediction too easy, or trivial.

# Import the Data
<a id="2"></a>

Import basic libraries.

In [None]:
import numpy as np
import pandas as pd

Change pandas print options so we can print all desired rows/columns without truncation.

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

Read in the data.

In [None]:
df = pd.read_csv('../input/accepted_2007_to_2017Q3.csv.gz', compression='gzip', low_memory=False)
# low_memory=False prevents mixed data types in the DataFrame

Check basic dataframe info.

In [None]:
df.info()

Peek at the first few rows of the data.

In [None]:
df.head(3)

# Target Variable
<a id="3"></a>

We're going to try to predict the 'loan_status' column. What are the value counts in this column?

In [None]:
df['loan_status'].value_counts()

Let's only consider loans that meet the credit policy and have either been fully paid or charged off. These are the two cases we'll try to distinguish with a model.

Retain only the rows with 'loan_status' Fully Paid or Charged Off.

In [None]:
df = df.loc[df['loan_status'].isin(['Fully Paid', 'Charged Off'])].copy()  # copy so later column assignments act on an independent DataFrame

In [None]:
df['loan_status'].value_counts()

How many rows remain?

In [None]:
df.shape

How balanced are the classes?

In [None]:
df['loan_status'].value_counts() / df.shape[0]

About 79% of the loans have been fully paid, and 21% have been charged off.

Let's convert the 'loan_status' column to a 0/1 'charged_off' column. This will allow us to compute correlations later.

In [None]:
df['loan_status'] = (df['loan_status'] == 'Charged Off').astype(float)  # 1.0 = charged off, 0.0 = fully paid

In [None]:
df['loan_status'].value_counts()

Rename the 'loan_status' column to 'charged_off'.

In [None]:
df.rename(columns={'loan_status':'charged_off'}, inplace=True)

Our target variable is ready to go. We have two classes to try to predict.

# Feature Selection
<a id="4"></a>

The raw data has 150 features, but we won't be using all the features for our predictions, as we'll explain below.

Definitions of the columns are given in the Lending Club "Data Dictionary" [available here](https://www.lendingclub.com/info/download-data.action).

## Drop columns that have only one distinct value
<a id="4.1"></a>

Are there any columns with only one distinct value?

In [None]:
drop_list = []
for col in df.columns:
    if df[col].nunique() == 1:
        drop_list.append(col)

drop_list

These columns do not contain any useful information, so we drop them.

In [None]:
df.shape

In [None]:
df.drop(labels=drop_list, axis=1, inplace=True)

In [None]:
df.shape

## Remove columns that have < 2% data
<a id="4.2"></a>

Are there any columns in which less than 2% of the entries are non-null?

In [None]:
drop_list = []
for col in df.columns:
    if df[col].notnull().sum() / df.shape[0] < 0.02:
        drop_list.append(col)

drop_list

Drop these columns.

In [None]:
df.shape

In [None]:
df.drop(labels=drop_list, axis=1, inplace=True)

In [None]:
df.shape
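The two column filters above loop over the columns explicitly. As a minimal sketch, the same selections can also be written as vectorized pandas expressions. The DataFrame `toy` and its column names are invented for illustration and are not part of the Lending Club data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the Lending Club data).
toy = pd.DataFrame({
    'constant': ['a'] * 200,                 # only one distinct value
    'sparse': [1.0, 2.0] + [np.nan] * 198,   # 1% non-null
    'useful': np.arange(200, dtype=float),   # fully populated, many values
})

# Columns with exactly one distinct non-null value.
n_unique = toy.nunique()
single_valued = n_unique[n_unique == 1].index.tolist()

# Columns where fewer than 2% of the entries are non-null.
frac_not_null = toy.notnull().mean()
mostly_empty = frac_not_null[frac_not_null < 0.02].index.tolist()

print(single_valued)
print(mostly_empty)
```

Note that `nunique()` ignores missing values by default, so a column holding one repeated value plus NaNs also counts as single-valued, just as in the loop version.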

## Remove irrelevant features
<a id="4.3"></a>

Let's drop some features that we don't think will be useful for predicting the loan status.

Analyzing text in the borrower loan description, job title, or loan title could be an interesting direction, but we won't explore this for now. The last three features listed below contain date information. We could convert these to numerical values, but we won't bother doing so.

In [None]:
df.shape

In [None]:
df.drop(labels=['id', 'desc', 'emp_title', 'title', 'issue_d', 'last_credit_pull_d', 'earliest_cr_line'], axis=1, inplace=True)

In [None]:
df.shape

## Remove features that could make predictions too easy
<a id="4.4"></a>

Some features give away the loan status. For example, if 'debt_settlement_flag' is 'Y', this implies that the loan was charged off. Also, if 'total_pymnt' is greater than 'loan_amnt', then the loan must be paid off. Let's not make our job too easy: remove these columns!

In [None]:
df.shape

In [None]:
df.drop(labels=['collection_recovery_fee', 'debt_settlement_flag', 'last_pymnt_amnt', 'last_pymnt_d', 'recoveries', 'total_pymnt', 'total_pymnt_inv', 'total_rec_int', 'total_rec_late_fee', 'total_rec_prncp'], axis=1, inplace=True)

In [None]:
df.shape

Are there any other features I should have removed, or any that I should have kept? Let me know in the comments.

## Inspect non-numerical features
<a id="4.5"></a>

We're going to inspect features of type 'object', i.e. text data.

In [None]:
df.head(3)

Which columns have text data?

In [None]:
text_cols = []
for col in df.columns:
    if df[col].dtype == object:
        text_cols.append(col)

text_cols

### term

In [None]:
df['term'].value_counts()

Convert 'term' to numerical values.

In [None]:
df['term'] = df['term'].apply(lambda s: float(s[1:3])) # There's an extra space in the data for some reason
df['term'].value_counts()

### grade, sub_grade

Convert the subgrade to a numerical value.

In [None]:
grade_dict = {'A':0.0, 'B':1.0, 'C':2.0, 'D':3.0, 'E':4.0, 'F':5.0, 'G':6.0}
def grade_to_float(s):
    return 5 * grade_dict[s[0]] + float(s[1]) - 1

In [None]:
df['sub_grade'] = df['sub_grade'].apply(grade_to_float)

The grade is implied by the subgrade, so let's drop the grade column.

In [None]:
df.drop(labels=['grade'], axis=1, inplace=True)
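To make the subgrade encoding concrete: each letter grade spans five subgrades, so the formula maps 'A1' to 0, 'A5' to 4, 'B1' to 5, and 'G5' to 34. The snippet below is a self-contained check of that arithmetic. The names `grade_index` and `encode_sub_grade` are made up here so the example runs on its own.

```python
# Position of each letter grade, from best (A) to worst (G).
grade_index = {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6}

def encode_sub_grade(s):
    """Map a subgrade such as 'B3' to an ordinal score (A1 -> 0.0, ..., G5 -> 34.0)."""
    return 5.0 * grade_index[s[0]] + float(s[1]) - 1.0

# Print a few subgrades next to their encoded values.
for sub_grade in ['A1', 'A5', 'B1', 'B3', 'G5']:
    print(sub_grade, encode_sub_grade(sub_grade))
```

This treats the subgrades as equally spaced on a single scale, which is a modeling assumption rather than something the data guarantees.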

### emp_length

In [None]:
df['emp_length'].value_counts()

Let's convert 'emp_length' to floats.

In [None]:
def emp_conv(s):
    try:
        if pd.isnull(s):
            return s  # keep missing values as NaN
        elif s[0] == '<':
            return 0.0  # '< 1 year'
        elif s[:2] == '10':
            return 10.0  # '10+ years'
        else:
            return float(s[0])  # '1 year' through '9 years'
    except TypeError:
        return np.float64(s)  # value is already numeric

df['emp_length'] = df['emp_length'].apply(emp_conv)
df['emp_length'].value_counts()

### home_ownership

In [None]:
df['home_ownership'].value_counts()

### verification_status

In [None]:
df['verification_status'].value_counts()

### purpose

In [None]:
df['purpose'].value_counts()

### zip_code, addr_state

Convert the zip code to a float, keeping only its first three characters.

In [None]:
df['zip_code'] = df['zip_code'].apply(lambda s: float(s[:3]))

The state is implied by the zip code, so remove the state column.

In [None]:
df.drop(labels=['addr_state'], axis=1, inplace=True)

### initial_list_status

In [None]:
df['initial_list_status'].value_counts()

I don't know what the initial list status means.

### application_type

In [None]:
df['application_type'].value_counts()

### disbursement_method

In [None]:
df['disbursement_method'].value_counts()

# Exploratory Data Analysis
<a id="5"></a>

Import plotting libraries.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

View the first few rows.

In [None]:
df.head(3)

Let's make a count plot of the loan purpose, separated by the 'charged_off' value.

In [None]:
plt.figure(figsize=(12,6), dpi=100)
sns.countplot(y='purpose', hue='charged_off', data=df, orient='h')

Looks like most of the charge-offs come from loans for debt consolidation or to pay off credit cards.

Let's make a similar plot, but with 'sub_grade' instead of 'purpose'.

In [None]:
plt.figure(figsize=(16,6), dpi=120)
sns.countplot(x='sub_grade', hue='charged_off', data=df, order=sorted(df['sub_grade'].value_counts().index))

There's a clear trend of higher probability of charge-off as the subgrade worsens. (A higher value is a worse subgrade.)

Let's make a similar plot, but with 'term' instead of 'sub_grade'.

In [None]:
plt.figure(figsize=(4,4), dpi=90)
sns.countplot(x='term', hue='charged_off', data=df)

Loans with a term of 60 months are much more likely to be charged off.

Now let's compare the interest rate to the loan status using a kdeplot, which approximates the probability distribution of the data.

In [None]:
plt.figure(figsize=(10,4), dpi=90)
sns.kdeplot(df['int_rate'].loc[df['charged_off']==0], gridsize=500, label='charged_off = 0')
sns.kdeplot(df['int_rate'].loc[df['charged_off']==1], gridsize=500, label='charged_off = 1')
plt.xlabel('int_rate')
plt.ylabel('density')
plt.legend()  # make sure the curve labels are displayed

Charged-off loans tend to have higher interest rates.

Now let's compare the borrower's most recent FICO score (a credit score) to the loan status.

In [None]:
plt.figure(figsize=(10,4), dpi=90)
sns.kdeplot(df['last_fico_range_high'].loc[df['charged_off']==0], gridsize=500, label='charged_off = 0')
sns.kdeplot(df['last_fico_range_high'].loc[df['charged_off']==1], gridsize=500, label='charged_off = 1')
plt.xlabel('last_fico_range_high')
plt.ylabel('density')
plt.legend()  # make sure the curve labels are displayed

Looks like charged-off loans tend to have much lower FICO scores.

# Correlations with 'charged_off'
<a id="6"></a>

By studying correlation coefficients, we can get an idea of which features correlate most strongly with 'charged_off'.

## Create dummy variables
<a id="6.1"></a>

To study correlations with 'charged_off', we need to convert categorical features to dummy variables.

In [None]:
cat_feats = []
for col in df.columns:
    if df[col].dtype == object:
        cat_feats.append(col)

cat_feats

In [None]:
df.shape

In [None]:
df = pd.get_dummies(df, columns=cat_feats, drop_first=True)

In [None]:
df.shape

We now have 105 features, all numerical. What does the dataframe look like after converting categorical features to dummy variables?

In [None]:
df.head(3)
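As a small illustration of what `drop_first=True` does: for a categorical column with k levels, `pd.get_dummies` creates k − 1 indicator columns and drops the first level (in sorted order), which then acts as the baseline. The toy frame below is invented; only the column name 'home_ownership' is borrowed from the data.

```python
import pandas as pd

# Hypothetical three-row example, not the real data.
toy = pd.DataFrame({
    'loan_amnt': [1000.0, 2500.0, 4000.0],
    'home_ownership': ['RENT', 'OWN', 'MORTGAGE'],
})

# Three levels -> two dummy columns; 'MORTGAGE' (first alphabetically) is the baseline.
encoded = pd.get_dummies(toy, columns=['home_ownership'], drop_first=True)
print(encoded.columns.tolist())
# ['loan_amnt', 'home_ownership_OWN', 'home_ownership_RENT']
```

A row with zeros in both dummy columns therefore corresponds to the baseline level.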

## Compute correlations with 'charged_off'
<a id="6.2"></a>

Create the correlation matrix of all our data, then extract the 'charged_off' column. (Is there an easier way to compute correlations with only one variable?) Remove the entry for 'charged_off' (it's 1), and sort the features by their correlation coefficient with 'charged_off'.

In [None]:
corr_charged_off = df.corr()['charged_off']

In [None]:
corr_charged_off.drop(labels='charged_off', inplace=True)
corr_charged_off = corr_charged_off.sort_values()
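On the question of an easier way: pandas provides `DataFrame.corrwith`, which correlates every column with a single Series and so avoids building the full correlation matrix. Here is a minimal sketch on synthetic data (`toy`, `x1`, `x2`, and `target` are placeholders, not columns of the loan data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
toy['target'] = (toy['x1'] + 0.5 * rng.normal(size=200) > 0).astype(float)

# Correlate every feature column with the target only.
corr_direct = toy.drop(columns='target').corrwith(toy['target']).sort_values()

# Same numbers as extracting one column of the full correlation matrix.
corr_matrix = toy.corr()['target'].drop(labels='target').sort_values()

print(corr_direct)
```

With on the order of a hundred columns and several hundred thousand rows, computing only one column of correlations should be noticeably cheaper than the full matrix, though I have not timed it on this dataset.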

Plot the correlation coefficients.

In [None]:
plt.figure(figsize=(8,28), dpi=90)
sns.barplot(y=corr_charged_off.index, x=corr_charged_off.values, orient='h')
plt.title("Correlation with 'charged_off'")
plt.xlabel("Correlation coefficient with 'charged_off'")
xmax = np.abs(corr_charged_off).max()
plt.xlim([-xmax, xmax])

**Things to note:** The borrower's most recent FICO scores are the features most negatively correlated with 'charged_off'. The debt-to-income ratio ('dti'), the number of payments on the loan ('term'), the interest rate ('int_rate'), and the subgrade ('sub_grade') are the features most positively correlated with 'charged_off'.

# More Pre-processing
<a id="7"></a>

Let's remind ourselves how much data we have.

In [None]:
df.shape

We have 814,986 samples and 105 features.

## Train/test split
<a id="7.1"></a>

In [None]:
X = df.drop(labels=['charged_off'], axis=1) # Features
y = df['charged_off'] # Target variable

In [None]:
from sklearn.model_selection import train_test_split

Let's do a 90/10 train/test split.

In [None]:
random_state = 12 # I chose this randomly, just to make the results fixed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=random_state)
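Because the classes are imbalanced, one option (not used in this kernel) is to pass `stratify=y` to `train_test_split` so that both splits keep the same charged-off fraction. This is a sketch on synthetic labels with roughly the 79/21 balance seen earlier; `X_toy` and `y_toy` are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 samples, 21% positives.
y_toy = np.array([1.0] * 210 + [0.0] * 790)
X_toy = np.arange(1000).reshape(-1, 1)

# stratify=y_toy preserves the class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=12, stratify=y_toy)

print(y_tr.mean(), y_te.mean())
```

With this many rows a plain random split will usually land close to the same proportions anyway, so stratification matters most for small test sets or rarer classes.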

## Imputation with mean substitution
<a id="7.2"></a>

How complete is our training data?

In [None]:
pd.DataFrame((X_train.notnull().sum() / X_train.shape[0]).sort_values(), columns=['Fraction not null'])

The learning algorithms we use cannot handle missing values. Perform mean substitution, using only the means of the training set to prevent test set leakage.

**Note:** I don't know that this is the best way to handle missing data. Should some columns simply be dropped? Should we impute some other way? Should incomplete rows be dropped?

In [None]:
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

In [None]:
imputer = SimpleImputer(strategy='mean').fit(X_train)

In [None]:
X_train = pd.DataFrame(imputer.transform(X_train), columns=X_train.columns)
X_test  = pd.DataFrame(imputer.transform(X_test),  columns=X_test.columns)
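On the open question of imputing some other way: one alternative, shown here only as a sketch, is median substitution plus an extra 0/1 column that records where a value was missing, via scikit-learn's `SimpleImputer(strategy='median', add_indicator=True)`. The median is less sensitive to outliers than the mean, and the indicator lets the model use missingness itself as a signal. The numbers below are invented; only the column names 'dti' and 'annual_inc' are borrowed from the data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values for illustration.
train = pd.DataFrame({'dti': [10.0, np.nan, 30.0, 200.0],
                      'annual_inc': [40e3, 55e3, 60e3, 75e3]})
test = pd.DataFrame({'dti': [np.nan, 15.0],
                     'annual_inc': [50e3, np.nan]})

# Fit on the training data only, as with mean substitution.
imputer = SimpleImputer(strategy='median', add_indicator=True).fit(train)
train_imp = imputer.transform(train)
test_imp = imputer.transform(test)

# Output columns: dti, annual_inc, then one indicator per column that had
# missing values during fit (here only 'dti').
print(test_imp)
```

Whether this actually improves the scores on the loan data is something I have not tested.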

## Standardize the data
<a id="7.3"></a>

Shift and scale each column individually so that it has zero mean and unit variance. This will help the learning algorithms.

Train the scaler using only the training data to prevent test set leakage.

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler().fit(X_train)

In [None]:
X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
X_test  = pd.DataFrame(scaler.transform(X_test),  columns=X_test.columns)
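A caveat about cross-validation: the imputer and scaler above are fit on the whole training set, so when that set is later split into folds, each validation fold has already influenced the preprocessing statistics. One way to avoid this (not what this kernel does) is to wrap the steps in a scikit-learn `Pipeline`, which refits them on each training fold. The sketch below uses synthetic data; `X_toy`, `y_toy`, and the step names are arbitrary.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12)
X_toy = rng.normal(size=(300, 4))
y_toy = (X_toy[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)
X_toy[rng.random(X_toy.shape) < 0.05] = np.nan  # knock out ~5% of entries

# Imputation and scaling are refit inside every cross-validation split.
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
    ('clf', SGDClassifier(loss='hinge', max_iter=1000, tol=1e-3, random_state=12)),
])

scores = cross_val_score(pipe, X_toy, y_toy, cv=3)
print(scores.mean())
```

When a pipeline like this is passed to `GridSearchCV`, hyperparameter names are prefixed with the step name, e.g. `clf__alpha` instead of `alpha`.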

# Predictive Modeling: SGDClassifier
<a id="8"></a>

I decided to use an SGDClassifier by looking at the [scikit-learn machine learning flowchart](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html).

The SGDClassifier estimator implements linear classifiers (SVM, logistic regression, among others) with stochastic gradient descent (SGD) training. The linear classifier is chosen by the 'loss' hyperparameter.

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import matthews_corrcoef, make_scorer

## Train with grid search
<a id="8.1"></a>

We're going to search through many hyperparameters of SGDClassifier using an exhaustive grid search with 3-fold cross-validation, implemented in GridSearchCV.

Here are the hyperparameters that we'll try:

In [None]:
param_grid = [{'loss': ['hinge'],
               'alpha': [10.0**k for k in range(-3,4)],
               'max_iter': [1000],
               'tol': [1e-3],
               'random_state': [random_state],
               'class_weight': [None, 'balanced'],
               'warm_start': [True]},
              {'loss': ['log_loss'],  # logistic regression; named 'log' in scikit-learn < 1.1
               'penalty': ['l2', 'l1'],
               'alpha': [10.0**k for k in range(-3,4)],
               'max_iter': [1000],
               'tol': [1e-3],
               'random_state': [random_state],
               'warm_start': [True]}]
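To get a feel for the cost of the search, we can count the candidates with scikit-learn's `ParameterGrid`. The sketch below rebuilds a trimmed copy of the grid (`grid_sketch`, a name made up here), leaving out the single-valued entries because they do not change the count.

```python
from sklearn.model_selection import ParameterGrid

alphas = [10.0**k for k in range(-3, 4)]  # 7 values: 1e-3 ... 1e3

# Trimmed copy of the grid; the logistic-loss name depends on the
# scikit-learn version ('log' in older releases, 'log_loss' in newer ones).
grid_sketch = [
    {'loss': ['hinge'], 'alpha': alphas, 'class_weight': [None, 'balanced']},
    {'loss': ['log_loss'], 'penalty': ['l2', 'l1'], 'alpha': alphas},
]

n_candidates = len(ParameterGrid(grid_sketch))  # 7*2 + 2*7
n_folds = 3
print(n_candidates, n_candidates * n_folds)  # 28 84
```

So the search trains 84 models during 3-fold cross-validation, plus one final refit of the best candidate on the full training set.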

Instantiate the grid estimator. We'll use the Matthews correlation coefficient as our scoring metric.

In [None]:
grid = GridSearchCV(estimator=SGDClassifier(), param_grid=param_grid,
                    scoring=make_scorer(matthews_corrcoef), cv=3,  # 3-fold CV, set explicitly
                    n_jobs=1, pre_dispatch=1, verbose=1, return_train_score=True)
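For reference, the Matthews correlation coefficient (MCC) is computed from the four confusion-matrix counts as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). It ranges from −1 to 1 and is close to 0 for predictions unrelated to the labels, which makes it harder to game than accuracy when the classes are imbalanced. The sketch below checks the formula against scikit-learn on made-up labels (`y_true` and `y_hat` are invented).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Ten invented labels and predictions.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# MCC from the raw counts (float() guards against integer overflow).
mcc_manual = (tp * tn - fp * fn) / np.sqrt(
    float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(mcc_manual, matthews_corrcoef(y_true, y_hat))
```

A classifier that always predicts the majority class would score about 79% accuracy on our data, but the formula's denominator is then zero, and scikit-learn reports an MCC of 0 in that case.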

Run the grid search (this could take some time).

In [None]:
grid.fit(X_train, y_train)

Hyperparameters that gave the best mean score on the held-out cross-validation folds:

In [None]:
grid.best_params_

## Test set evaluation
<a id="8.2"></a>

In [None]:
y_pred = grid.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [None]:
# Display evaluation metrics
def my_eval(y_test, y_pred):
    print('Confusion matrix')
    print(confusion_matrix(y_test, y_pred),'\n')
    print('Classification report')
    print(classification_report(y_test, y_pred, digits=3))
    print('MCC = ',matthews_corrcoef(y_test, y_pred))
    print('Accuracy = ',accuracy_score(y_test, y_pred))

In [None]:
my_eval(y_test, y_pred)