
# Capstone Project 

# Author : Hamidreza Salahi

# Notebook : 1


# Table of Contents
* [Introduction](#Introduction)
* [Importing Data](#Importing-Data)
* [Creating Sample Data](#Creating-Sample-Data)
* [Data Cleaning](#Data-Cleaning)
    * [Dropping NaN Columns](#Dropping-NaN-Columns)
    * [Picking Features](#Picking-Features)
    * [Dropping NaN Rows](#Dropping-NaN-Rows)
    * [Changing Non-numeric Columns to Numeric Columns](#Changing-Non-numeric-Columns-to-Numeric-Columns)
* [References](#References)

# Introduction

LendingClub is one of the largest peer-to-peer financial services companies, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC) and to offer loan trading on a secondary market. The company reported that $15.98 billion in loans had been originated through its platform up to December 31, 2015 [1]. <br>

In this project, I assume the role of a Data Scientist recruited by LendingClub to answer the following business question: <br>

**Using machine learning (ML), build a loan approval predictor that identifies whether an applicant is risky or not**. <br>

Here, a risky applicant is one who fails to pay the instalments on time over a long period. In the LendingClub dataset, the `Charged Off` value in the `loan_status` column marks applicants who have not repaid their loan and have defaulted on it. <br>

An ML predictor can be very helpful for the business for two reasons: <br>

(i) Approving a risky applicant may lead to financial loss. <br>
(ii) Rejecting an applicant who is likely to repay their loan also means lost business profit.


In this project, the target variable is the `loan_status` column. Within this column, I will concentrate only on `Fully Paid` and `Charged Off` applicants; applicants with the status `Current`, `Late (16-30 days)`, `Late (31-120 days)`, `In Grace Period`, `Issued`, `Does not meet the credit policy. Status:Fully Paid`, `Does not meet the credit policy. Status:Charged Off` or `Default` will not be considered in the analysis.

## Importing Data

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Creating a dataframe from the gzip-compressed dataset file
loans = pd.read_csv('C:\\Users\\hamid\\Desktop\\Capstone\\Data\\Loan_status_2007-2020Q3.gzip', compression='gzip', low_memory=False)
loans.head()

Unnamed: 0.1,Unnamed: 0,id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag
0,0,1077501,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,...,,,,,,,,,,N
1,1,1077430,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,...,,,,,,,,,,N
2,2,1077175,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,...,,,,,,,,,,N
3,3,1076863,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,...,,,,,,,,,,N
4,4,1075358,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,...,,,,,,,,,,N


In [334]:
loans.tail()

Unnamed: 0.1,Unnamed: 0,id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag
2925488,105446,102556443,24000.0,24000.0,24000.0,60 months,23.99%,690.3,E,E2,...,,,,,,,,,,N
2925489,105447,102653304,10000.0,10000.0,10000.0,36 months,7.99%,313.32,A,A5,...,,,,,,,,,,N
2925490,105448,102628603,10050.0,10050.0,10050.0,36 months,16.99%,358.26,D,D1,...,,,,,,,,,,N
2925491,105449,102196576,6000.0,6000.0,6000.0,36 months,11.44%,197.69,B,B4,...,,,,,,,,,,N
2925492,105450,99799684,30000.0,30000.0,30000.0,60 months,25.49%,889.18,E,E4,...,,,,,,,,,,N


In [7]:
loans['loan_status'].value_counts()/loans['loan_status'].count()

Fully Paid                                             0.511976
Current                                                0.352425
Charged Off                                            0.123927
Late (31-120 days)                                     0.005522
In Grace Period                                        0.003428
Late (16-30 days)                                      0.000929
Issued                                                 0.000705
Does not meet the credit policy. Status:Fully Paid     0.000680
Does not meet the credit policy. Status:Charged Off    0.000260
Default                                                0.000148
Name: loan_status, dtype: float64

Only the `Fully Paid` and `Charged Off` applicants will be retained. The excluded `Current` status accounts for about 35% of the applicants; the other excluded statuses combined contribute about 1.2% of the dataset.
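The shares quoted above can be checked directly from the printed proportions. The following is a small arithmetic sketch in plain Python; the dictionary simply copies the `value_counts` output shown earlier, and the variable names are chosen here for illustration.

```python
# Proportions copied from the value_counts output above
shares = {
    "Fully Paid": 0.511976,
    "Current": 0.352425,
    "Charged Off": 0.123927,
    "Late (31-120 days)": 0.005522,
    "In Grace Period": 0.003428,
    "Late (16-30 days)": 0.000929,
    "Issued": 0.000705,
    "Does not meet the credit policy. Status:Fully Paid": 0.000680,
    "Does not meet the credit policy. Status:Charged Off": 0.000260,
    "Default": 0.000148,
}

kept = {"Fully Paid", "Charged Off"}

# Share of rows that survive the filter
kept_share = sum(v for k, v in shares.items() if k in kept)

# Share of excluded rows other than 'Current'
minor_share = sum(v for k, v in shares.items() if k not in kept and k != "Current")

print(f"kept: {kept_share:.4f}")
print(f"current: {shares['Current']:.4f}")
print(f"other excluded: {minor_share:.4f}")
```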

In [9]:
loans=loans[(loans['loan_status']=='Fully Paid') | (loans['loan_status']=='Charged Off')]\
.reset_index(drop = True)
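The same filter can also be written with `isin`, and the two remaining statuses can be mapped to a binary target in the same step. This is only a sketch on a tiny invented frame; the column name `is_charged_off` is an assumption introduced here and is not used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the full LendingClub frame (values invented for illustration)
toy_loans = pd.DataFrame({
    "loan_amnt": [5000, 2500, 2400, 10000, 3000],
    "loan_status": ["Fully Paid", "Charged Off", "Current", "Fully Paid", "Default"],
})

# Keep only the two statuses of interest
keep = ["Fully Paid", "Charged Off"]
toy_filtered = toy_loans[toy_loans["loan_status"].isin(keep)].reset_index(drop=True)

# Hypothetical binary target: 1 = Charged Off (risky), 0 = Fully Paid
toy_filtered["is_charged_off"] = (toy_filtered["loan_status"] == "Charged Off").astype(int)
print(toy_filtered)
```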

In [1]:
loans.shape


In [10]:
loans['loan_status'].value_counts()

Fully Paid     1497783
Charged Off     362548
Name: loan_status, dtype: int64

## Creating Sample Data

Since the data set is too large to work with comfortably, we create a sample data set for now. The sample contains 300000 randomly chosen rows, about 16% of the 1860331 rows retained above. Rows are drawn without replacement, so the sample contains no duplicated rows.

In [12]:
# Creating 300000 random indices.
# replace=False ensures that none of the indices are the same --> avoids duplicate rows
rand_index=np.random.choice(loans.index.values, 300000, replace=False)
rand_index

array([ 631698, 1782325,  503168, ...,   73456, 1697974, 1508462],
      dtype=int64)

In [13]:
# No duplicated index
pd.DataFrame(rand_index).duplicated().sum()

0

In [14]:
# Using the indices generated above to create the random sample without duplicates
loan_sample = loans.iloc[rand_index].reset_index()
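The index-based sampling above is not seeded, so each run draws a different sample. Below is a minimal sketch of a reproducible alternative, assuming `DataFrame.sample` with a fixed `random_state` is acceptable here. The seed value 42, the sample size, and the toy frame are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with 100 rows standing in for the filtered loans data
toy_loans = pd.DataFrame({"loan_amnt": np.arange(1000, 1100)})

# Sampling without replacement cannot return the same row twice
toy_sample = toy_loans.sample(n=16, replace=False, random_state=42)

# Keep the original row labels in an 'index' column, as reset_index() does above
toy_sample = toy_sample.reset_index()

print(toy_sample["index"].is_unique)
```

Re-running the cell with the same `random_state` returns the same rows, which makes downstream results easier to compare between runs.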

In [15]:
loan_sample.head()

Unnamed: 0.1,index,Unnamed: 0,id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,...,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag
0,631698,106881,154614504,2800.0,2800.0,2800.0,36 months,8.81%,88.8,A,...,,,,,,,,,,N
1,1782325,6573,110051889,28000.0,28000.0,28000.0,36 months,10.42%,909.02,B,...,,,,,,,,,,N
2,503168,56041,138623865,15000.0,15000.0,15000.0,36 months,10.08%,484.58,B,...,,,,,,,,,,N
3,45013,5227,9767153,8400.0,8400.0,8400.0,36 months,14.98%,291.11,C,...,,,,,,,,,,N
4,1775415,94085,96430100,3200.0,3200.0,3200.0,36 months,13.99%,109.36,C,...,,,,,,,,,,N


In [16]:
#Dropping the generated columns which are useless
loan_sample.drop(["index","Unnamed: 0" , "id"] , axis = 1 , inplace=True)

## Data Cleaning

At a glance, the data set has many NaN values, especially in certain columns. We will drop the columns that have more than 10% of their values missing.

### Dropping NaN Columns

In [17]:
loan_sample.isna().sum()

loan_amnt                                          0
funded_amnt                                        0
funded_amnt_inv                                    0
term                                               0
int_rate                                           0
                                               ...  
hardship_loan_status                          297926
orig_projected_additional_accrued_interest    296847
hardship_payoff_balance_amount                296400
hardship_last_payment_amount                  296400
debt_settlement_flag                               0
Length: 140, dtype: int64

In [18]:
# Finding all columns with more than 10% values missing
nan_cols = [i for i in loan_sample.columns if loan_sample[i].isnull().sum() > 0.1*len(loan_sample)]

In [19]:
# Dropping nan columns 
loan_sample.drop(nan_cols , axis=1, inplace=True)
loan_sample.shape

(300000, 90)
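The column filter above can also be expressed through the per-column missing fraction, which makes the threshold explicit. This is a sketch on an invented three-column frame; the threshold of 0.25 is chosen only to suit the toy data, whereas the notebook uses 0.1.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "loan_amnt": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
    "hardship_length": [np.nan, np.nan, np.nan, np.nan, 3.0],  # 80% missing
    "dti": [10.0, np.nan, 12.0, 13.0, 14.0],                   # 20% missing
})

threshold = 0.25  # toy threshold; the notebook uses 0.1

# isna().mean() gives the fraction of NaN values in each column
missing_frac = toy.isna().mean()

# Columns whose missing fraction exceeds the threshold
to_drop = missing_frac[missing_frac > threshold].index.tolist()
toy_clean = toy.drop(columns=to_drop)

print(to_drop)
```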

At this point, I will save a copy of the sample dataset with all 90 columns for Principal Component Analysis (PCA) which is done in the Modeling notebook. 

In [20]:
loan_sample.to_csv('C:\\Users\\hamid\\Desktop\\Capstone\\Data\\loan_sample_PCA.csv' , index=False)

### Picking Features

In [21]:
data_Dic = pd.read_excel('C:\\Users\\hamid\\Desktop\\Capstone\\Data\\LoanDataDictionary.xlsx')

In [22]:
data_Dic.head()

Unnamed: 0,LoanStatNew,Description
0,acc_now_delinq,The number of accounts on which the borrower i...
1,acc_open_past_24mths,Number of trades opened in past 24 months.
2,addr_state,The state provided by the borrower in the loan...
3,all_util,Balance to credit limit on all trades
4,annual_inc,The self-reported annual income provided by th...


In [23]:
# Joining the dictionary and column names
desc_loans_colms = pd.merge(pd.DataFrame({"col_name": list(loan_sample.columns)}), data_Dic, \
               how='inner', right_on='LoanStatNew', left_on='col_name')
desc_loans_colms.drop(['col_name'], axis =1, inplace=True)

In [24]:
# Displaying the dictionary defining each column
from IPython.display import display
with pd.option_context('display.max_rows', 100, 'display.max_columns', 3 , 'display.max_colwidth' , None):
    display(desc_loans_colms)

Unnamed: 0,LoanStatNew,Description
0,loan_amnt,"The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value."
1,funded_amnt,The total amount committed to that loan at that point in time.
2,funded_amnt_inv,The total amount committed by investors for that loan at that point in time.
3,term,The number of payments on the loan. Values are in months and can be either 36 or 60.
4,int_rate,Interest Rate on the loan
5,installment,The monthly payment owed by the borrower if the loan originates.
6,grade,LC assigned loan grade
7,sub_grade,LC assigned loan subgrade
8,emp_title,The job title supplied by the Borrower when applying for the loan.*
9,emp_length,Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.


In [25]:
loan_sample.columns.shape

(90,)

* There are 90 columns in the dataset. However, only 89 of them have an entry in the data dictionary: the description of the column `total_rev_hi_lim` is missing from the data dictionary.
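A set difference is one way to find which column names have no dictionary entry. The sketch below uses invented stand-ins for the column list and the dictionary. Stripping whitespace from the dictionary names is an assumption worth checking on the real file, since stray spaces would cause false mismatches.

```python
import pandas as pd

# Toy stand-ins: three dataset columns, two dictionary entries
toy_columns = ["loan_amnt", "term", "total_rev_hi_lim"]
toy_dict = pd.DataFrame({
    "LoanStatNew": ["loan_amnt", "term "],  # trailing space invented for illustration
    "Description": ["Listed loan amount", "Number of payments"],
})

# Normalise the dictionary names before comparing
dict_names = set(toy_dict["LoanStatNew"].str.strip())

# Columns present in the data but absent from the dictionary
missing_desc = sorted(set(toy_columns) - dict_names)
print(missing_desc)
```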

<font size="3.5"> There are still 90 columns left even after dropping the NaN columns! At this point, it is important to keep only the relevant columns to avoid confusing the models that will be built on the sample data. It is assumed that the following columns suffice to answer the business question asked here.
</font>

* loan_amnt
* funded_amnt_inv
* term
* int_rate
* installment
* grade
* sub_grade
* emp_title
* emp_length
* home_ownership
* annual_inc
* verification_status
* <font size="4.5">loan_status</font>
* purpose
* addr_state
* dti
* delinq_2yrs
* fico_range_low
* fico_range_high (The difference between fico_range_low and fico_range_high is 4 for 99.98% of the sample data (shown below). Later on in this notebook, a new column called `fico_avg` will be created as fico_avg=(fico_range_low+fico_range_high)/2)
* open_acc
* pub_rec
* revol_bal
* revol_util
* total_acc
* pub_rec_bankruptcies
* application_type

**26 columns in total. The loan_status column is what we are trying to predict (dependent variable or y), whereas the other columns are going to be the independent variables (X).**

In [26]:
# Columns to retain as listed above
Cols = ['loan_amnt','funded_amnt_inv','term','int_rate','installment','grade','sub_grade','emp_title','emp_length',
        'home_ownership','annual_inc','verification_status','purpose','addr_state','dti','delinq_2yrs','fico_range_low',
        'fico_range_high','open_acc','pub_rec','revol_bal','revol_util','total_acc','pub_rec_bankruptcies',
        'application_type','loan_status']
# .copy() gives an independent frame, so later column assignments do not act on a view of the old one
loan_sample=loan_sample[Cols].copy()

In [27]:
# The difference between fico_range_low and fico_range_high 
(loan_sample['fico_range_high']-loan_sample['fico_range_low']).value_counts()/loan_sample.shape[0]

4.0    0.99984
5.0    0.00016
dtype: float64

In [28]:
# Replacing fico_range_low and fico_range_high by their average
loan_sample['fico_avg'] = (loan_sample['fico_range_high']+loan_sample['fico_range_low'])/2
loan_sample.drop(columns=['fico_range_high' , 'fico_range_low'] , inplace=True)

In [29]:
#Looking at possible outcomes for loan_status
loan_sample['loan_status'].value_counts()

Fully Paid     241938
Charged Off     58062
Name: loan_status, dtype: int64

### Dropping NaN Rows

In [30]:
loan_sample.dropna(axis=0, inplace=True)
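Before dropping incomplete rows, it can help to see which columns cause the losses and how many rows go. The `info()` output in the next section shows 278307 entries remaining out of 300000. The sketch below illustrates that check on an invented frame, using the non-mutating form of `dropna`.

```python
import pandas as pd

toy = pd.DataFrame({
    "emp_title": ["nurse", None, "teacher", "driver"],
    "revol_util": ["12.5%", "40%", None, "7%"],
    "dti": [10.0, 11.0, 12.0, 13.0],
})

rows_before = len(toy)

# Number of missing values per column: shows which columns cause the losses
nan_per_column = toy.isna().sum()

# Drop every row that has at least one missing value
toy_complete = toy.dropna(axis=0)
rows_after = len(toy_complete)

print(nan_per_column[nan_per_column > 0])
print(f"dropped {rows_before - rows_after} of {rows_before} rows")
```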

### Changing Non-numeric Columns to Numeric Columns

In [31]:
loan_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 278307 entries, 1 to 299999
Data columns (total 25 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   loan_amnt             278307 non-null  float64
 1   funded_amnt_inv       278307 non-null  float64
 2   term                  278307 non-null  object 
 3   int_rate              278307 non-null  object 
 4   installment           278307 non-null  float64
 5   grade                 278307 non-null  object 
 6   sub_grade             278307 non-null  object 
 7   emp_title             278307 non-null  object 
 8   emp_length            278307 non-null  object 
 9   home_ownership        278307 non-null  object 
 10  annual_inc            278307 non-null  float64
 11  verification_status   278307 non-null  object 
 12  purpose               278307 non-null  object 
 13  addr_state            278307 non-null  object 
 14  dti                   278307 non-null  float64
 15  

The columns to be converted to numeric are: `term`, `int_rate`, `emp_length`, `revol_util`

In [32]:
# term: map '36 months' / '60 months' to the integers 36 and 60
loan_sample.replace(to_replace=['36 months' , '60 months'], value=[36 , 60], regex=True , inplace = True)

In [33]:
# The int_rate column is not numeric.
# Removing the % sign and converting the int_rate column to a numeric column
loan_sample['int_rate']=loan_sample['int_rate'].str.replace('%', '').astype(float)

In [34]:
# The revol_util column is not numeric.
# Removing the % sign and converting the revol_util column to a numeric column
loan_sample['revol_util']=loan_sample['revol_util'].str.replace('%', '').astype(float)
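The conversions above can also be written column by column with the `.str` accessor, which avoids running a replacement over the whole frame. This is a sketch on invented values; the leading whitespace in the toy strings is an assumption about how such fields may appear, not something shown in the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    "term": [" 36 months", " 60 months"],
    "int_rate": ["10.65%", " 15.27%"],
})

# Pull the digits out of 'term' and convert them to integers
toy["term"] = toy["term"].str.extract(r"(\d+)", expand=False).astype("int64")

# Trim whitespace and the trailing percent sign, then convert to float
toy["int_rate"] = toy["int_rate"].str.strip().str.rstrip("%").astype(float)

print(toy.dtypes)
```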

In [35]:
# emp_length: strip the 'years'/'year' suffix and the '<' and '+' signs, then convert to integer
loan_sample['emp_length']=loan_sample['emp_length'].replace(to_replace=['years' , 'year' , '<' , r'\+'], value='', regex=True)
loan_sample['emp_length']=loan_sample['emp_length'].astype(int)

In the numeric `emp_length` column, `10+ years` becomes 10 and `< 1 year` becomes 1, so applicants with less than one year of employment share a value with those reporting `1 year`.
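The conversion above maps `< 1 year` to 1, the same value as `1 year`. If the data dictionary's convention (0 means less than one year) is preferred, an explicit parser keeps the two groups apart. The helper `parse_emp_length` below is hypothetical and shown only as a sketch on invented values.

```python
import pandas as pd

toy_emp = pd.Series(["< 1 year", "1 year", "3 years", "10+ years"])

def parse_emp_length(text: str) -> int:
    """Hypothetical helper: '< 1 year' -> 0, '10+ years' -> 10, 'n years' -> n."""
    text = text.strip()
    if text.startswith("<"):
        return 0
    # Keep only the digits, e.g. '10+ years' -> '10'
    return int("".join(ch for ch in text if ch.isdigit()))

emp_years = toy_emp.map(parse_emp_length)
print(emp_years.tolist())
```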

In [36]:
loan_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 278307 entries, 1 to 299999
Data columns (total 25 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   loan_amnt             278307 non-null  float64
 1   funded_amnt_inv       278307 non-null  float64
 2   term                  278307 non-null  int64  
 3   int_rate              278307 non-null  float64
 4   installment           278307 non-null  float64
 5   grade                 278307 non-null  object 
 6   sub_grade             278307 non-null  object 
 7   emp_title             278307 non-null  object 
 8   emp_length            278307 non-null  int32  
 9   home_ownership        278307 non-null  object 
 10  annual_inc            278307 non-null  float64
 11  verification_status   278307 non-null  object 
 12  purpose               278307 non-null  object 
 13  addr_state            278307 non-null  object 
 14  dti                   278307 non-null  float64
 15  

In [37]:
#Saving the sample data
loan_sample.to_csv('C:\\Users\\hamid\\Desktop\\Capstone\\Data\\loan_sample.csv' , index=False)

In [435]:
#loan_sample = pd.read_csv('C:\\Users\\hamid\\Desktop\\Capstone\\Data\\loan_sample.csv')

# References
[1] https://en.wikipedia.org/wiki/LendingClub