## Loan Default Prediction

#### Project description:
Develop a basic understanding of risk analytics in banking and financial services. The project falls under **credit default** (credit risk assessment): the possibility that a loan investor loses money because a borrower fails to repay a loan. It uses machine learning and data analytics to assess a borrower's ability to repay, which helps reduce default risk and thereby maximize the investment return.


#### Aim:
Predict whether a loan will be paid in full or charged off to help investors increase their yield in P2P investing.

Techniques used: Logistic Regression, Naive Bayes, and SVM classifiers
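
As a rough illustration of how the three classifiers named above can be compared, here is a minimal sketch. It runs on synthetic data from `make_classification`, not on the LendingClub data, and the class imbalance, feature count, and default hyperparameters are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for loan features (NOT LendingClub data);
# weights=[0.8, 0.2] mimics a minority "default" class.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Logistic regression and SVM are scale-sensitive, so they get a scaler.
models = {
    'Logistic Regression': make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    'Naive Bayes': GaussianNB(),
    'SVM': make_pipeline(StandardScaler(), SVC()),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out data
    print(f'{name}: accuracy = {scores[name]:.3f}')
```

With an imbalanced target, accuracy alone can be misleading, so metrics such as recall on the default class are worth checking as well.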

In [31]:
import matplotlib.pyplot as plt
import random
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder

import os
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 500)
pd.set_option('display.width', 1000)

In [26]:
####  common functions   #####
title = lambda x: '-'*13+ '   ' + x.upper() +'  '+ '-'*15+'\n'

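def groupby_factor(data, factor):
    ''' Percentage of loans in each loan_status, for every level of `factor` '''
    # size() counts rows per (factor, loan_status) pair; unstack moves statuses into columns
    counts = data.groupby([factor, 'loan_status']).size().unstack(level=1)
    # divide each row by its total so the statuses sum to 100%
    return (counts.div(counts.sum(axis=1), axis=0)*100).round(2)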

### 1 Data 

#### 1.1 Data Structure:
Source: Kaggle.
It is a large dataset containing about 2,900,000 loans issued from 2007 to 2020Q3. The loans come in two terms: 36 months and 60 months. We first extract a random sample of roughly 5% of the dataset to perform exploratory data analysis (EDA) and gain insights into the data.


In [5]:
DATA_PATH = os.getcwd()+'/18-LendingClub/data' if 'LendingClub' not in os.getcwd() else os.getcwd()+'/data'
meta = pd.read_csv(DATA_PATH+'/LCDataDictionary.csv')
meta

Unnamed: 0,LoanStatNew,Description
0,loan_amnt,The listed amount of the loan applied for by t...
1,loan_status,Current status of the loan
2,acc_now_delinq,The number of accounts on which the borrower i...
3,addr_state,The state provided by the borrower in the loan...
4,annual_inc,The self-reported annual income provided by th...
5,application_type,Indicates whether the loan is an individual ap...
6,chargeoff_within_12_mths,Number of charge-offs within 12 months
7,delinq_2yrs,The number of 30+ days past-due incidences of ...
8,delinq_amnt,The past-due amount owed for the accounts on w...
9,dti,debt to income ratio


In [7]:
def read_loan_sample(p):
    ''' Read a random sample: each data row is kept with probability p (0.05 = 5%) '''

    cols = meta['LoanStatNew'].to_list()
    df = pd.read_csv(DATA_PATH+'/Loan_status_2007-2020Q3.gzip', skiprows=lambda i: i>0 and random.random() > p, low_memory=False)[cols]
    df['fico_score']  =  (df['fico_range_high']+df['fico_range_low'])/2.0
    df['issue_d'] = pd.to_datetime(df['issue_d'], format='%b-%Y')
    df = df.drop(columns=['fico_range_high','fico_range_low','emp_title'])
    return df

sample = read_loan_sample(p=0.05)
print ('sample size: ',len(sample), ' Issue date 2007 - 2020Q3')
sample.head(3).T

sample size:  146434  Issue date 2007 - 2020Q3


Unnamed: 0,0,1,2
loan_amnt,13000,17500,17675
loan_status,Fully Paid,Fully Paid,Fully Paid
acc_now_delinq,0,0,0
addr_state,NJ,CA,WI
annual_inc,30000.0,40000.0,50000.0
application_type,Individual,Individual,Individual
chargeoff_within_12_mths,0.0,0.0,0.0
delinq_2yrs,0,0,0
delinq_amnt,0,0,0
dti,13.72,19.47,16.46


Assumption: for "Current" loans, if the estimated payments collected so far exceed 90% of the loan amount, we treat the loan as fully paid.
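
To make the 90% rule concrete, here is a small worked sketch on three made-up loans. The toy values and the `2020-09-01` cut-off date are assumptions for illustration; payments are estimated as whole 30-day periods elapsed times the monthly installment.

```python
import pandas as pd

# Hypothetical toy loans; column names follow the dataset's dictionary.
toy = pd.DataFrame({
    'loan_status': ['Current', 'Current', 'Fully Paid'],
    'loan_amnt': [10000.0, 10000.0, 5000.0],
    'installment': [320.0, 320.0, 160.0],
    'issue_d': pd.to_datetime(['2018-01-01', '2020-01-01', '2016-06-01']),
})
last_date = pd.Timestamp('2020-09-01')  # assumed data cut-off (2020Q3)

# Approximate months elapsed as whole 30-day periods, then estimate payments.
months_elapsed = (last_date - toy['issue_d']).dt.days // 30
toy['payment_collected'] = months_elapsed * toy['installment']

# Relabel "Current" loans whose estimated payments exceed 90% of the amount.
mostly_paid = (toy['loan_status'] == 'Current') & \
              (toy['payment_collected'] > 0.90 * toy['loan_amnt'])
toy.loc[mostly_paid, 'loan_status'] = 'Fully Paid'
print(toy[['loan_status', 'payment_collected']])
```

The first loan has 32 estimated payments of 320 (10,240 in total, above the 9,000 threshold) and is relabeled, while the second has only 8 payments and stays "Current".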

In [None]:
sample.isnull().sum()

### 2 EDA:

We keep only loans with a final outcome, filtering out those that have not yet matured, and encode "Fully Paid" as 0 and "Charged Off" or "Default" as 1.
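
A minimal sketch of this filtering and encoding step on a toy status column (the `label_map` and `target` names are hypothetical, introduced only for this example):

```python
import pandas as pd

toy_status = pd.DataFrame({
    'loan_status': ['Fully Paid', 'Charged Off', 'Current',
                    'Default', 'Late (31-120 days)'],
})

# Statuses with a final outcome and their binary encoding.
label_map = {'Fully Paid': 0, 'Charged Off': 1, 'Default': 1}

# Drop unresolved loans (Current, Late, ...), then encode the target.
resolved = toy_status[toy_status['loan_status'].isin(list(label_map))].copy()
resolved['target'] = resolved['loan_status'].map(label_map)
print(resolved)
```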

#### 2.1 What could be affecting the loan default rate?
    - Has the default rate increased or decreased over time?
    - Grades: A > B > C > D > E > F
    - Loan terms: longer terms can increase default risk
    - How does income affect the default rate?
    - Debt-to-income ratio (DTI)
    - Loan amount
    - ... and more
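
For a continuous factor from the list above, such as debt-to-income ratio, one possible approach is to bin it and compare default rates across bins. The sketch below uses synthetic data in which default probability rises with DTI by construction, so it only illustrates the mechanics and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
# Synthetic data (NOT LendingClub): default probability rises with DTI by design.
dti = rng.uniform(0, 40, size=n)
default = (rng.random(n) < 0.05 + 0.005 * dti).astype(int)
toy_dti = pd.DataFrame({'dti': dti, 'default': default})

# Bin DTI into four bands of equal width.
toy_dti['dti_band'] = pd.cut(toy_dti['dti'], bins=[0, 10, 20, 30, 40],
                             include_lowest=True)

# The mean of a 0/1 target within a group is that group's default rate.
rate_by_band = (toy_dti.groupby('dti_band', observed=True)['default']
                .mean().mul(100).round(2))
print(rate_by_band)
```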


In [18]:
#Types of loan_status
# consider loan paid if 90% payment is received
def data_filtering(df):
    df= df.copy()
    last_date = '2020-09-01'
    df.loc[df['loan_status']=='Does not meet the credit policy. Status:Fully Paid','loan_status']='Fully Paid'
    df.loc[df['loan_status'].str.contains('Charged Off'),'loan_status']='Default'
    df = df.loc[df['annual_inc']>0]

    df['year'] = df.issue_d.dt.year
    df['payment_collected'] = df['issue_d'].sub(pd.to_datetime(last_date)).dt.days.abs()//30*df['installment']
    df.loc[(df['loan_status']=='Current')&(df['payment_collected']>0.90*df['loan_amnt']),'loan_status']= 'Fully Paid'
    
    overtime = groupby_factor(df,factor='year')
    print (overtime,'\n')
    print ('The correlation between years and default rate is not strong:  \n',
           np.corrcoef(range(2007,2018),overtime['Default'].values[0:-3]))

    return df
  
print ('sample size: ',len(sample),'\n')
print (title(' loan status '))
sample = data_filtering(sample)
print ('no clear trend indicates that the default rate increases or decreases over the years')



sample size:  146275 

-------------    LOAN STATUS   ---------------

loan_status  Current  Default  Fully Paid  In Grace Period  Issued  Late (16-30 days)  Late (31-120 days)
year                                                                                                     
2007             NaN    22.22       77.78              NaN     NaN                NaN                 NaN
2008             NaN    22.83       77.17              NaN     NaN                NaN                 NaN
2009             NaN    15.41       84.59              NaN     NaN                NaN                 NaN
2010             NaN    14.63       85.37              NaN     NaN                NaN                 NaN
2011             NaN    16.56       83.44              NaN     NaN                NaN                 NaN
2012             NaN    16.20       83.80              NaN     NaN                NaN                 NaN
2013             NaN    14.93       85.07              NaN     NaN               

In [34]:
def data_featuring(df):
    df = df.loc[df['loan_status'].isin(['Fully Paid','Default'])]
    X = df['application_type'].values.reshape(-1, 1)
    enc = OneHotEncoder().fit(X)
    print (enc.transform(X).toarray())
    
    print (df.title.unique())
    return df
_= data_featuring(sample)

[[1. 0.]
 [1. 0.]
 [1. 0.]
 ...
 [1. 0.]
 [1. 0.]
 [1. 0.]]
['Debt Consolidation Loan' 'Start Up' 'MAKE ON PAYMENT' ... 'Releif'
 'debt help loan' 'Trying to come back to reality!']



In [None]:
####     BY GRADE   ##########
def data_analysis(data):
    
    df = data[['term','grade','issue_d','loan_status','annual_inc']].copy()
    # -------    Grade    ----------
    grade = groupby_factor(df,factor='grade')
    print (title('grade'), grade,'\n')
    ax = grade.plot(kind='bar', rot=0)
    ax.set_ylabel('%')
    
    # -------    Term    ----------
    term = groupby_factor(df,factor='term')
    print (title('term'), term,'\n')
    
    # -------    Income class   ----------

    df['income_class'] = pd.cut(df.annual_inc,[1,50000,100000,500000,10000000,max(df.annual_inc.max(),10000001)])
    income_class = groupby_factor(df,factor='income_class')
    print (title('income_class'),income_class ,'\n')

    # -------    Has default rate increased/decreased over time   ----------
    df['year'] = df.issue_d.dt.year
    overtime = groupby_factor(df,factor='year')
    print (title('Over years'), overtime,'\n')

    
# print (df_loan.head(2).T)
print ('   IN   %    ')
data_analysis(data=sample)

We can safely ignore the earliest and latest years, because a loan's final status is only known once it matures (after 3 or 5 years). In the figure below, there is a clear upward trend in the default rate over the years.

In [None]:
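def by_issue_time():
    # NOTE: default_rate_by() is defined in a later cell; df_loan is not created in the
    # cells shown here and is assumed to hold the loans with loan_status encoded as 0/1
    df = default_rate_by(df_loan,by='issue_d')  # `by` must be a column name, not a list
    df = df.sort_values(by=['issue_d']).fillna(0)
    print ('data started :', df.index.min(),'    end :',df.index.max())

    # first plot: every issue month
    sns.scatterplot(data=df, x="issue_d", y="Default Rate %").set_title('default rate')
    # second plot: only issue months whose loans have had time to mature
    df = df.loc[(df.index>='2009-07-01')&(df.index<='2019-03-01')] #loan maturity 3-5 years
    sns.scatterplot(data=df, x="issue_d", y="Default Rate %").set_title('default rate')
    plt.legend(labels=["whole period","effective period"])

#     return df
by_issue_time()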

#### 2.2 Factors 
    What factors may affect the default rate?
    - Default rate at different grades

Because we encode charged-off loans as 1, a positive correlation means the loan is more likely to default. From the correlation map below, we see that borrowers with a delinquency history, public records, a bankruptcy history, a high debt-to-income ratio (dti), or a higher bankcard utilization rate are more likely to default. A higher loan amount and multiple inquiries within 6 months also increase the risk. By contrast, borrowers with higher income, a higher FICO score, or more credit cards in use tend to pay off their debt.
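
To show how the sign of a correlation with a 0/1 target is read, here is a small sketch on synthetic borrowers. The data is generated so that higher DTI raises and higher FICO lowers the default probability; the coefficients are arbitrary assumptions, not estimates from the loan data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 4000
# Synthetic borrowers (NOT LendingClub data).
dti = rng.uniform(0, 40, n)
fico = rng.uniform(600, 850, n)
p_default = np.clip(0.15 + 0.006 * (dti - 20) - 0.001 * (fico - 725), 0.01, 0.99)
toy_corr = pd.DataFrame({
    'dti': dti,
    'fico_score': fico,
    'default': (rng.random(n) < p_default).astype(int),
})

# Pearson correlation of each feature with the 0/1 target:
# positive means "more of this feature, more defaults".
corr_with_default = toy_corr.corr()['default'].drop('default').sort_values()
print(corr_with_default)
```

Here `dti` comes out positively correlated with `default` and `fico_score` negatively, which is the same reading applied to the heatmap.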

In [None]:
plt.figure(figsize=(20, 8))
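# numeric_only=True (pandas >= 1.5) skips text columns, which would otherwise raise in pandas >= 2.0
sns.heatmap(df_loan.corr(numeric_only=True), annot=True, cmap='viridis')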

In [None]:
#classification
def default_rate_by(data, by='grade'):
    # Assumes loan_status is already encoded as 0 (fully paid) / 1 (default)
    # Count loans per (by, loan_status) pair, with one column per status
    df = data.groupby([by,'loan_status']).size().unstack('loan_status')
    df['Default Rate %'] = df[1]/df.sum(axis=1)*100
    return df.round(2)
default_rate_by(df_loan,by='grade')

reference: https://cs229.stanford.edu/proj2018/report/69.pdf    