# A Hitchhiker's Guide to Lending Club Loan Data

In this kernel I will be going over the Lending Club Loan Data, a dataset that is large, imbalanced and made up of many features with different data types. For modelling, I will take loans with a default status as the target variable and try to predict whether a loan will default or not.

---


# Importing the data

First, importing necessary libraries,

In [None]:
#Imports 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import warnings
import gc
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
%matplotlib inline

In [None]:
start_df = pd.read_csv('../input/loan.csv', low_memory=False)

I am working on a copy of the dataframe so that I do not have to re-read the entire dataset if I need the original again. The copy costs extra memory, which is freed later on.

In [None]:
df = start_df.copy(deep=True)
df.head()

Checking the dimensions,

In [None]:
df.shape

Printing out the column names,

In [None]:
df.columns

So, we've got a fair number of columns that we need to understand. Knowing what the columns mean can help us a lot with feature engineering later on.

---

# Understanding the data 

First, let's check the description of the various column fields in the dataset.

In [None]:
df_description = pd.read_excel('../input/LCDataDictionary.xlsx').dropna()
df_description.style.set_properties(subset=['Description'], **{'width': '1000px'})

> Looking at the columns description, a good thing we could do is find columns that carry importance and at the same time find columns that are redundant for their lack of information.


Let us also see the number and percentage of missing values,

In [None]:
def null_values(df):
    # Count and percentage of missing values per column
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    # Keep only columns with missing values, sorted by percentage missing
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print("Dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    return mis_val_table_ren_columns

In [None]:
# Missing values statistics
miss_values = null_values(df)
miss_values.head(20)

> The percentage of missing data in many columns is far higher than we can work with. So, later on we'll have to remove the columns whose share of missing values exceeds a chosen threshold.

Another thing we want to examine is how many loans have a default loan status in comparison to other loans. A common task with datasets like this one is to predict whether a new loan will default or not. I'll be keeping loans with default status as my target variable.

In [None]:
target_list = [1 if i=='Default' else 0 for i in df['loan_status']]

df['TARGET'] = target_list
df['TARGET'].value_counts()

> This clearly is a case of an imbalanced class problem, where one class has far fewer samples than the other. There are cost function based approaches and sampling based approaches for handling this kind of problem, which we will use later so that our model isn't biased towards the majority class while trying to predict if a loan will default or not.
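
As a rough illustration of the cost function based route, the sketch below computes "balanced" class weights, i.e. weights inversely proportional to class frequency. The 95/5 split and the names `y_toy` and `class_weights` are made up for illustration and are not figures from the dataset.

```python
import numpy as np

# Made-up labels for illustration: 95 non-default (0) and 5 default (1) loans.
y_toy = np.array([0] * 95 + [1] * 5)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c),
# so the rarer class gets a proportionally larger weight.
classes, counts = np.unique(y_toy, return_counts=True)
class_weights = {int(c): len(y_toy) / (len(classes) * n) for c, n in zip(classes, counts)}

print(class_weights)
```

Many scikit-learn classifiers accept weights of this kind through their `class_weight` argument, which makes mistakes on the rare class cost more during training.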

In [None]:
df.drop('loan_status',axis=1,inplace=True)

Then, seeing the distribution of data types we are working with,

In [None]:
# Number of each type of column
df.dtypes.value_counts().sort_values().plot(kind='barh')
plt.title('Number of columns distributed by Data Types',fontsize=20)
plt.xlabel('Number of columns',fontsize=15)
plt.ylabel('Data type',fontsize=15)

> So we have quite a number of columns with the object data type, which are going to pose a problem while modelling.

Let us see how many unique categories each column with the 'object' data type contains:

In [None]:
df.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

> We would want to label encode the columns having only 2 categories and one-hot encode columns with more than 2 categories. Also, columns like emp_title, url, desc, etc. should be dropped because they hold a very large number of distinct values, each of which covers only a few rows. Also, Principal Component Analysis can be carried out on the one-hot encoded columns to bring the feature dimensions down.
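
To make the two encodings concrete, here is a minimal sketch on a toy frame. The frame `toy` and its values are hypothetical, and pandas category codes are used as a stand-in for label encoding.

```python
import pandas as pd

# Toy frame (hypothetical values) with one binary and one multi-class column.
toy = pd.DataFrame({
    "term": ["36 months", "60 months", "36 months", "60 months"],
    "grade": ["A", "C", "B", "A"],
})

# Binary column: map the two categories to 0/1 (what label encoding amounts to).
toy["term"] = toy["term"].astype("category").cat.codes

# Multi-class column: one indicator column per category, no implied ordering.
encoded = pd.get_dummies(toy, columns=["grade"])

print(encoded)
```

The binary column stays a single column, while `grade` expands into one column per category, which is where the growth in feature count comes from.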

## Anomaly Detection

Let us check for any anomalies the data might have. Possible data anomalies are often found in columns dealing with time, like years of employment. Let's quickly go through them.

In [None]:
df['emp_length'].head(3)

I'll be filling the null values with 0, assuming that a missing value means the borrower hasn't worked long enough for an employment length to be recorded. Also, I'll be using regex to extract the number of years from all of the data.

In [None]:
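# Fill missing values with the string '0' so the column keeps a single type
df['emp_length'] = df['emp_length'].fillna('0')

# Strip everything except digits, e.g. '10+ years' -> '10'
df['emp_length'] = df['emp_length'].replace(to_replace='[^0-9]+', value='', regex=True)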

df['emp_length'].value_counts().sort_values().plot(kind='barh',figsize=(18,8))
plt.title('Number of loans distributed by Employment Years',fontsize=20)
plt.xlabel('Number of loans',fontsize=15)
plt.ylabel('Years worked',fontsize=15);

> The column looks fine. Also, it can be seen that people who have worked for 10 or more years are more likely to take a loan.
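
To see what the digit-only regex does to individual values, here is a small sketch on hypothetical strings written in the style of the emp_length column. The sample values and the names `raw` and `years` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw values in the style of the emp_length column.
raw = pd.Series(["10+ years", "< 1 year", "3 years", None])

# Fill missing entries with "0", strip every non-digit character, cast to int.
years = raw.fillna("0").str.replace(r"[^0-9]+", "", regex=True).astype(int)

print(years.tolist())
```

Note that with this pattern a value like "< 1 year" ends up as 1, the same as "1 year", so the two groups are merged.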

In [None]:
fig = plt.figure(figsize=(12,6))
sns.violinplot(x="TARGET",y="loan_amnt",data=df, hue="pymnt_plan", split=True)
plt.title("Payment plan - Loan Amount", fontsize=20)
plt.xlabel("TARGET", fontsize=15)
plt.ylabel("Loan Amount", fontsize=15);

> Naturally, the defaulted loans had no payment plan.

## Exploratory Data Analysis

Let me remove all the columns with more than 70% missing data as they won't help with modelling and exploration.

In [None]:
# Flag columns whose non-null count is below 30% of the rows (more than 70% missing)
temp = [i for i in df.count() < len(df) * 0.30]
df.drop(df.columns[temp],axis=1,inplace=True)
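
The same 70% rule can also be written with `DataFrame.dropna` and its `thresh` argument, which keeps columns that have at least a given number of non-missing values. The toy frame below is made up for illustration.

```python
import math

import numpy as np
import pandas as pd

# Toy frame with 10 rows: 'a' is complete, 'b' has 4 values, 'c' has only 2.
toy = pd.DataFrame({
    "a": np.arange(10.0),
    "b": [1.0, 2.0, 3.0, 4.0] + [np.nan] * 6,
    "c": [1.0, 2.0] + [np.nan] * 8,
})

# Keep columns with at least 30% non-missing values (i.e. at most 70% missing).
min_non_null = math.ceil(0.30 * len(toy))
kept = toy.dropna(axis=1, thresh=min_non_null)

print(list(kept.columns))
```

Deriving the threshold from `len(toy)` rather than a fixed row count keeps the rule valid if rows are added or removed beforehand.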

In [None]:
# Correlate numeric columns only; object columns cannot be correlated
corr = df.select_dtypes('number').corr()['TARGET'].sort_values()

# Display correlations
print('Most Positive Correlations:\n', corr.tail(10))
print('\nMost Negative Correlations:\n', corr.head(10))

> Besides the perfect correlation of the TARGET column with itself, columns like int_rate (the interest rate), out_prncp_inv (the remaining outstanding principal), etc. have a high positive correlation with the TARGET column. This makes sense: the higher the interest rate, the harder it is for a borrower to pay back a loan. However, columns like out_prncp_inv, out_prncp, total_rec_int, total_rec_late_fee, inq_last_6mths and revol_util are bound to be higher when a borrower doesn't pay back a loan and thus don't carry much significance. So, the column of interest after int_rate could be dti, the debt to income ratio, which understandably affects whether a borrower can pay back a loan or not.

> Also, columns like recoveries, total_rev_hi_lim, etc. have negative correlation with the TARGET column as a borrower who has paid back money is more likely to repay the loan.

Examining further on debt to income ratio and interest rate,

In [None]:
df.select_dtypes('number').corr()['dti'].sort_values().tail(6)

> It can be seen that the interest rate is also highly positively correlated with the debt to income ratio.

Let us make some Kernel Density Estimation plots to see how the interest rate and debt to income ratio are distributed for the two classes in the TARGET column.

In [None]:
fig = plt.figure(figsize=(22,6))
sns.kdeplot(df.loc[df['TARGET'] == 1, 'int_rate'], label = 'target = 1')
sns.kdeplot(df.loc[df['TARGET'] == 0, 'int_rate'], label = 'target = 0');
plt.xlabel('Interest Rate (%)',fontsize=15)
plt.ylabel('Density',fontsize=15)
plt.title('Distribution of Interest Rate',fontsize=20);

> The density of interest rates roughly follows a Gaussian distribution, with most of the density at interest rates between 12% and 18%.

While we are looking at distributions, some other distributions that would be interesting to examine are,

**Violin plot of TARGET classes with distribution of loan amount differentiated by the terms.**

In [None]:
fig = plt.figure(figsize=(12,6))
sns.violinplot(x="TARGET",y="loan_amnt",data=df, hue="term", split=True,color='pink')
plt.title("Term - Loan Amount", fontsize=20)
plt.xlabel("TARGET", fontsize=15)
plt.ylabel("Loan Amount", fontsize=15);

> For both TARGET classes, loans with longer terms tend to have higher amounts, and loans with shorter terms lower amounts.

**Violin plot of TARGET classes with distribution of loan amount differentiated by the application type.**

In [None]:
fig = plt.figure(figsize=(12,6))
sns.violinplot(x="TARGET",y="loan_amnt",data=df, hue="application_type", split=True,color='green')
plt.title("Application Type - Loan Amount", fontsize=20)
plt.xlabel("TARGET", fontsize=15)
plt.ylabel("Loan Amount", fontsize=15);

So all the loans that have been defaulted are from individuals rather than from two or more people. 

In [None]:
df['application_type'].value_counts()

> Seeing the number of joint applicants in comparison to the total applicants, it **isn't** significant enough to conclude that the loans taken by all joint applicants are paid back.

**Violin plot of TARGET classes with distribution of interest rate differentiated by the loan grades.**

In [None]:
fig = plt.figure(figsize=(18,8))
sns.violinplot(x="TARGET",y="int_rate",data=df, hue="grade")
plt.title("Grade - Interest Rate", fontsize=20)
plt.xlabel("TARGET", fontsize=15)
plt.ylabel("Interest Rate", fontsize=15);

> Both target classes have similar kind of interest rates by grades.

Let us also check the correlation of annual income with loan amount taken. 

In [None]:
df.select_dtypes('number').corr()['annual_inc'].sort_values().tail(10)

> The annual income of the applicant has high positive correlation with the amount of loan they have taken.

**From where do most of the loans tend to be defaulted?**

In [None]:
fig = plt.figure(figsize=(18,10))
df[df['TARGET']==1].groupby('addr_state')['TARGET'].count().sort_values().plot(kind='barh')
plt.ylabel('State',fontsize=15)
plt.xlabel('Number of loans',fontsize=15)
plt.title('Number of defaulted loans per state',fontsize=20);

In [None]:
fig = plt.figure(figsize=(18,10))
df[df['TARGET']==0].groupby('addr_state')['TARGET'].count().sort_values().plot(kind='barh')
plt.ylabel('State')
plt.xlabel('Number of loans')
plt.title('Number of not-defaulted loans per state');

> It can be seen that the states with the largest number of loans are also the states with the largest number of defaulted loans. This is why the state cannot be taken as a major feature for knowing if a loan will default or not.
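
One way to check this is to compare the share of defaulted loans per state instead of raw counts. The sketch below uses made-up rows in which a large and a small state default at the same rate; the values and the name `by_state` are hypothetical.

```python
import pandas as pd

# Made-up loans: CA has many loans, NV has few; both default at the same rate.
toy = pd.DataFrame({
    "addr_state": ["CA"] * 8 + ["NV"] * 2,
    "TARGET": [1, 1, 1, 1, 0, 0, 0, 0] + [1, 0],
})

# Per state: number of defaults, number of loans, and the share that defaulted.
by_state = toy.groupby("addr_state")["TARGET"].agg(
    defaults="sum", loans="count", default_rate="mean"
)

print(by_state)
```

In this toy data the raw default counts differ by a factor of four while the rates are identical, which is the pattern that would make the state uninformative.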


Let's see if we have any members taking multiple loans.

In [None]:
df['member_id'].value_counts().head(2)

> Surprisingly, there is not a single member taking a loan more than once. So, the member id column can also be dropped along with the id column.

# Cleaning the data


As we observed, some columns like annual_inc, int_rate, etc. may be very useful for building our model but, on the other hand, some columns like id, member_id, etc. will not help.

Also, columns like 'title' and 'emp_title' are text which cannot be one-hot encoded / label encoded as they hold arbitrary free text with very few rows for each distinct value.

In [None]:
df['emp_title'].value_counts().head()

In [None]:
df.drop(['id','member_id','emp_title','title','zip_code','url'],axis=1,inplace=True)

In [None]:
df.shape

So, now we have 48 columns remaining. Let's print them out to get a quick look of what we are dealing with,

In [None]:
df.info()

The memory usage is 325+ MB. Some of these columns still look like they could need some work i.e. more cleaning! 

I will be fixing the data types and then handling the missing data.

First, I'll be converting the date object columns into integer number of years or months just because I do not want to blow up the number of feature columns by performing one-hot encoding on them. For filling the null values I have taken the dates with the highest number of counts.

In [None]:
df['issue_d']= pd.to_datetime(df['issue_d']).apply(lambda x: int(x.strftime('%Y')))
df['last_pymnt_d']= pd.to_datetime(df['last_pymnt_d'].fillna('2016-01-01')).apply(lambda x: int(x.strftime('%m')))
df['last_credit_pull_d']= pd.to_datetime(df['last_credit_pull_d'].fillna("2016-01-01")).apply(lambda x: int(x.strftime('%m')))
df['earliest_cr_line']= pd.to_datetime(df['earliest_cr_line'].fillna('2001-08-01')).apply(lambda x: int(x.strftime('%m')))
df['next_pymnt_d'] = pd.to_datetime(df['next_pymnt_d'].fillna(value = '2016-02-01')).apply(lambda x:int(x.strftime("%Y")))
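
The same idea on a toy series: fill gaps with the most frequent date, parse, and pull out the year or month as an integer through the `.dt` accessor. The sample strings and their 'Mon-YYYY' layout are assumptions for illustration.

```python
import pandas as pd

# Hypothetical date strings; the 'Mon-YYYY' layout is an assumption.
dates = pd.Series(["Dec-2011", "Jan-2016", None, "Jan-2016"])

# Fill gaps with the most frequent value, then parse with an explicit format.
most_common = dates.mode()[0]
parsed = pd.to_datetime(dates.fillna(most_common), format="%b-%Y")

# The .dt accessor returns year and month as integers without strftime.
years = parsed.dt.year.tolist()
months = parsed.dt.month.tolist()

print(years, months)
```

Taking the fill value from `mode()` avoids hard-coding a date that would go stale if the data changed.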


Let's see how we can handle our categorical data. Two methods we can use are Label Encoding and One Hot Encoding.

The problem with label encoding is that it gives the categories an arbitrary ordering. The value assigned to each of the categories is random and does not reflect any inherent aspect of the category. So, if we only have two unique values for a categorical variable (such as Yes/No), then label encoding is fine, but for more than 2 unique categories, one-hot encoding is the better option.

However, due to the large number of columns created by one-hot encoding, we may have to conduct Principal Component Analysis (PCA) for dimensionality reduction.
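
As a sketch of what PCA would do here, the block below runs it by hand with an SVD on a toy matrix that contains redundant columns. The data, the 95% variance cut-off and all names are assumptions for illustration; in practice `sklearn.decomposition.PCA` would be the usual tool.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy matrix standing in for scaled features: 200 rows, 6 columns, where the
# last three columns are noisy copies of the first three (redundant).
base = rng.normal(size=(200, 3))
X_toy = np.hstack([base, base + 0.01 * rng.normal(size=(200, 3))])

# PCA via SVD of the centred data: rows of Vt are the principal directions.
X_centred = X_toy - X_toy.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(X_centred, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)

# Keep the smallest number of components explaining at least 95% of variance.
n_keep = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)
X_reduced = X_centred @ Vt[:n_keep].T

print(n_keep, X_reduced.shape)
```

Because half of the toy columns are near-duplicates, far fewer components than columns are needed, which is the effect hoped for on sparse one-hot columns.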

In [None]:
from sklearn import preprocessing

In [None]:
count = 0

for col in df:
    if df[col].dtype == 'object':
        if len(list(df[col].unique())) <= 2:     
            le = preprocessing.LabelEncoder()
            df[col] = le.fit_transform(df[col])
            count += 1
            print (col)
            
print('%d columns were label encoded.' % count)

And one-hot encoding the remaining categorical columns,

In [None]:
df = pd.get_dummies(df)
print(df.shape)

For the 'mths_since_last_delinq' column, I'll be filling in the missing values with the median of the column as the data in the column is continuous.

In [None]:
df['mths_since_last_delinq'] = df['mths_since_last_delinq'].fillna(df['mths_since_last_delinq'].median())

However, for columns like 'total_rev_hi_lim', 'tot_coll_amt', etc., I won't be filling in the missing data because, going by their descriptions, they are likely to be of high feature importance. If they do not turn out to be of high importance, we can always iterate again and fill the missing values later.

So, dropping all remaining null values,

In [None]:
df.dropna(inplace=True)

And checking the count,

In [None]:
df.count().sort_values().head(3)

In [None]:
df['TARGET'].value_counts()

We are now left with a reasonable amount of data for modelling.

---

# Modeling

Now, for modeling I will be using two ensemble methods and comparing them.

i) Bootstrap Aggregating or Bagging

ii) Boosting

## 1) Bagging - Random Forest

* Ensemble of Decision Trees

* Training via the bagging method (Repeated sampling with replacement)
  * Bagging: Sample from samples
  * RF: Sample from predictors. $m=\sqrt{p}$ for classification and $m=p/3$ for regression problems.

* Utilise uncorrelated trees
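
The two kinds of sampling in the list above can be sketched with NumPy alone. The sizes and names are toy assumptions, and for brevity the predictors are drawn once, whereas a real random forest redraws them at every split.

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples, n_features = 100, 16           # toy sizes, chosen for illustration
X_toy = rng.normal(size=(n_samples, n_features))

# Bagging: draw n_samples row indices *with* replacement (a bootstrap sample).
row_idx = rng.integers(0, n_samples, size=n_samples)

# Random forest: consider only m = sqrt(p) randomly chosen predictors.
m = int(np.sqrt(n_features))
col_idx = rng.choice(n_features, size=m, replace=False)

X_boot = X_toy[row_idx][:, col_idx]
print(X_boot.shape, "unique rows drawn:", len(np.unique(row_idx)))
```

Each tree sees a different bootstrap sample and different predictors, which is what keeps the trees weakly correlated.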

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Creating a classification report function,

In [None]:
def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    if train:
        print("Train Result:\n")
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_train, clf.predict(X_train))))
        print("Classification Report: \n {}\n".format(classification_report(y_train, clf.predict(X_train))))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_train, clf.predict(X_train))))

        res = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
        print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
        print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))
        
    else:
        print("Test Result:\n")        
        print("accuracy score: {0:.4f}\n".format(accuracy_score(y_test, clf.predict(X_test))))
        print("Classification Report: \n {}\n".format(classification_report(y_test, clf.predict(X_test))))
        print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_test, clf.predict(X_test))))    
        

Conducting train test split.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('TARGET',axis=1),df['TARGET'],test_size=0.15,random_state=101)

Freeing up the memory.

In [None]:
del start_df
gc.collect()

Standardizing features by removing the mean and scaling to unit variance

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test=sc.transform(X_test)
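
What the scaler does can be reproduced by hand: the mean and standard deviation are learned from the training rows only and then applied to both sets. The toy matrices below are hypothetical.

```python
import numpy as np

# Toy train/test matrices (hypothetical values).
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Learn mean and standard deviation from the training rows only ...
mu = train.mean(axis=0)
sigma = train.std(axis=0)

# ... and apply the same transformation to both sets.
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Fitting on the training set alone keeps information about the test set from leaking into the model.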

Oversampling only the training set using Synthetic Minority Oversampling Technique ([SMOTE](https://jair.org/index.php/jair/article/view/10302))

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
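# Recent imbalanced-learn releases use sampling_strategy / fit_resample
# in place of the older ratio / fit_sample
sm = SMOTE(random_state=12, sampling_strategy=1.0)
x_train_r, y_train_r = sm.fit_resample(X_train, y_train)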
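
The core idea of SMOTE is to create synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. The block below is a simplified illustration of that idea, not the imbalanced-learn implementation: it uses a single nearest neighbour, and the points and the helper `synthesize` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(12)

# Toy minority-class points in two dimensions (hypothetical values).
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def synthesize(points, rng):
    """Create one synthetic point between a random point and its nearest neighbour."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf                      # exclude the point itself
    j = int(np.argmin(dists))              # nearest neighbour (k = 1 for brevity)
    lam = rng.random()                     # interpolation factor in [0, 1)
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([synthesize(minority, rng) for _ in range(5)])
print(synthetic)
```

Every synthetic point lies on a segment between two real minority points, so the oversampled class fills in its own region instead of duplicating rows.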

Now, I'll be trying out different models to get the best prediction score.

**Creating a baseline for accuracy and recall using Logistic regression,**

In [None]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(C = 0.0001,random_state=21)

log_reg.fit(x_train_r, y_train_r)

In [None]:
print_score(log_reg, x_train_r, y_train_r, X_test, y_test, train=False)

The accuracy came out to be satisfactory for the baseline along with the recall score. However, precision seems to be very off.
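
How accuracy and recall can look fine while precision is poor is easiest to see with numbers. The counts below are made up for illustration and are not results from this notebook.

```python
# Made-up confusion-matrix counts for an imbalanced test set (not real results).
tn, fp = 9000, 900      # non-default loans: correctly passed / wrongly flagged
fn, tp = 20, 80         # default loans: missed / correctly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)          # share of actual defaults that were caught
precision = tp / (tp + fp)       # share of flagged loans that really defaulted

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```

With so few true defaults, even a modest false-positive rate on the majority class swamps the true positives and drags precision down.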

For our case, overfitting will be a huge concern. So, I'm using Random Forest as it is known to decrease overfitting by selecting features at random. 

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
clf_rf = RandomForestClassifier(n_estimators=40, random_state=21)
clf_rf.fit(x_train_r, y_train_r)

In [None]:
print_score(clf_rf, x_train_r, y_train_r, X_test, y_test, train=False)

We have high precision but a low recall for our validation set. Using this model is not a good idea as most of our default loans will be wrongly classified as non-default.

## 2) Boosting:

* Train weak classifiers 
* Add them to a final strong classifier by weighting. Weighting by accuracy (typically)
* Once added, the data are reweighted
  * Misclassified samples gain weight 
  * Algo is forced to learn more from misclassified samples    
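
The reweighting step in the list above corresponds to the classic AdaBoost update, sketched here on five toy samples. This is an illustration of the bullets only; LightGBM itself is a gradient boosting method and fits each new tree to gradients of the loss instead. All values and names are assumptions.

```python
import numpy as np

# Five samples with equal starting weights; the weak learner got the last one wrong.
weights = np.full(5, 0.2)
misclassified = np.array([False, False, False, False, True])

# Weighted error of the weak learner and its vote in the final ensemble.
err = weights[misclassified].sum()
alpha = 0.5 * np.log((1 - err) / err)

# Misclassified samples are scaled up, correct ones down, then renormalised.
new_weights = weights * np.exp(np.where(misclassified, alpha, -alpha))
new_weights /= new_weights.sum()

print(alpha, new_weights)
```

After the update the single misclassified sample carries as much weight as the four correct ones combined, so the next weak learner is pushed to get it right.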

For boosting I will be using the [LightGBM](https://www.youtube.com/watch?v=5CWwwtEM2TA) classifier (evaluation metric as AUC) along with [Kfold cross validation](https://www.youtube.com/watch?v=TIgfjmp-4BA).
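
AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way on toy labels and scores, which are hypothetical.

```python
import numpy as np

# Toy labels and predicted scores (hypothetical values).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# AUC = share of (positive, negative) pairs where the positive scores higher;
# ties count as half a win.
diff = pos[:, None] - neg[None, :]
auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(auc)
```

Because it only depends on how samples are ranked, AUC is not inflated by the majority class the way accuracy is, which suits an imbalanced target.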

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import KFold, StratifiedKFold
from lightgbm import LGBMClassifier

Function to use LightGBM with Kfold cross validation,

In [None]:
def kfold_lightgbm(train_df, num_folds, stratified = False):
    print("Starting LightGBM. Train shape: {}".format(train_df.shape))
    
    # Cross validation model
    if stratified:
        folds = StratifiedKFold(n_splits= num_folds, shuffle=True, random_state=47)
    else:
        folds = KFold(n_splits= num_folds, shuffle=True, random_state=47)

    oof_preds = np.zeros(train_df.shape[0])

    feature_importance_df = pd.DataFrame()
    feats = [f for f in train_df.columns if f not in ['TARGET']]
    
    # Splitting the training set into folds for Cross Validation
    for n_fold, (train_idx, valid_idx) in enumerate(folds.split(train_df[feats], train_df['TARGET'])):
        train_x, train_y = train_df[feats].iloc[train_idx], train_df['TARGET'].iloc[train_idx]
        valid_x, valid_y = train_df[feats].iloc[valid_idx], train_df['TARGET'].iloc[valid_idx]

        # LightGBM parameters found by Bayesian optimization
        clf = LGBMClassifier(
            nthread=4,
            n_estimators=10000,
            learning_rate=0.02,
            num_leaves=32,
            colsample_bytree=0.9497036,
            subsample=0.8715623,
            max_depth=8,
            reg_alpha=0.04,
            reg_lambda=0.073,
            min_split_gain=0.0222415,
            min_child_weight=40,
            silent=-1,
            verbose=-1,
            )

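        # Fitting the model and evaluating by AUC. LightGBM 4.x takes early stopping
        # and logging as callbacks rather than fit() keyword arguments.
        from lightgbm import early_stopping, log_evaluation
        clf.fit(train_x, train_y, eval_set=[(train_x, train_y), (valid_x, valid_y)],
            eval_metric='auc', callbacks=[early_stopping(200), log_evaluation(1000)])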
        print_score(clf, train_x, train_y, valid_x, valid_y, train=False)
        # Dataframe holding the different features and their importance
        fold_importance_df = pd.DataFrame()
        fold_importance_df["feature"] = feats
        fold_importance_df["importance"] = clf.feature_importances_
        fold_importance_df["fold"] = n_fold + 1
        feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
        
        # Freeing up memory
        del clf, train_x, train_y, valid_x, valid_y
        gc.collect()

    display_importances(feature_importance_df)
    return feature_importance_df

Function for displaying the importance of the features,

In [None]:
def display_importances(feature_importance_df_):
    cols = feature_importance_df_[["feature", "importance"]].groupby("feature").mean().sort_values(by="importance", ascending=False)[:40].index
    best_features = feature_importance_df_.loc[feature_importance_df_.feature.isin(cols)]
    plt.figure(figsize=(15, 12))
    sns.barplot(x="importance", y="feature", data=best_features.sort_values(by="importance", ascending=False))
    plt.title('LightGBM Features (avg over folds)')
    plt.tight_layout()
    plt.savefig('lgbm_importances.png')

In [None]:
feat_importance = kfold_lightgbm(df, num_folds= 3, stratified= False)

As we can see, LightGBM did a great job of getting a high precision as well as a high recall. Hence, this model is the best of the 3 models that we evaluated.

For further enhancements to the model, feature engineering could be done. Also, broader classes like 'good loan' and 'bad loan' could have been used, grouping different loan statuses together to get more balanced class counts than default/non-default.