# LendUp Data Challenge - Loan Approvals

The goal of this data challenge is to use the given dataset, sample applicant data from the LendingClub portfolio, to create a model that can assess whether LendingClub should issue a loan to an applicant. The dataset includes a wide variety of information on each applicant: some of it is personal, while the rest is more situational/operational in nature.

Let's go through a typical data science process (Clean -> Explore -> Analyse -> Model -> Evaluate) with this dataset!

## Import Data + First Look

Let's first import any modules we need + the actual data, and take a look at all the different features that are present in the dataset:

In [56]:
# Import any necessary modules
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [57]:
# Import data + Read
data = pd.read_csv("lending_club_data.csv", low_memory = False)
data.head(5)

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Data derived from LendingClub Prospectus (https://www.lendingclub.com/info/prospectus.action)
id,member_id,loan_amnt,term,int_rate,emp_length,home_ownership,annual_inc,loan_status,desc,purpose,percent_bc_gt_75,bc_util,dti,inq_last_6mths,mths_since_recent_inq,revol_util,total_bc_limit,total_pymnt,mths_since_last_major_derog,tot_hi_cred_lim,tot_cur_bal
10129403,11981032.0,7550,36 months,16.24%,3 years,RENT,28000.0,Current,,debt_consolidation,100.0,96.0,8.4,0.0,17.0,72%,4000.0,1864.38,,3828.95380081,5759.0
10149342,12000897.0,27050,36 months,10.99%,10+ years,OWN,55000.0,Current,Borrower added on 12/31/13 > Combining high interest credit cards to lower interest rate.<br>,debt_consolidation,25.0,53.9,22.87,0.0,8.0,61.2%,35700.0,6198.22,,34359.9407269,114834.0
10129454,11981072.0,12000,36 months,10.99%,4 years,RENT,60000.0,Current,Borrower added on 12/31/13 > I would like to use this money to payoff existing credit card debt and use the remaining about to purchase a used car that is fuel efficient.<br>,debt_consolidation,0.0,15.9,4.62,1.0,3.0,24%,18100.0,2748.84,,16416.6177583,7137.0
10149577,12001118.0,28000,36 months,7.62%,5 years,MORTGAGE,325000.0,Fully Paid,,debt_consolidation,16.7,67.1,18.55,1.0,3.0,54.6%,42200.0,29150.98,,38014.1497567,799592.0


In [58]:
data.describe()

Unnamed: 0,Data derived from LendingClub Prospectus (https://www.lendingclub.com/info/prospectus.action)
count,160383.0
unique,118423.0
top,0.0
freq,32.0


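Immediately, we can notice that the data was not parsed properly: the first line of the file is a source note rather than the header, so the real column names (id, member_id, ...) end up as the first data row, as shown above. This discrepancy needs to be resolved so we can properly read the data coming in: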

In [59]:
# Fix dataframe issue - set header parameter
applicants = pd.read_csv("lending_club_data.csv", low_memory = True, header = 1)
applicants.columns

Index([u'id', u'member_id', u'loan_amnt', u'term', u'int_rate', u'emp_length',
       u'home_ownership', u'annual_inc', u'loan_status', u'desc', u'purpose',
       u'percent_bc_gt_75', u'bc_util', u'dti', u'inq_last_6mths',
       u'mths_since_recent_inq', u'revol_util', u'total_bc_limit',
       u'total_pymnt', u'mths_since_last_major_derog', u'tot_hi_cred_lim',
       u'tot_cur_bal'],
      dtype='object')
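
To see why setting `header = 1` fixes the parse, here is a minimal sketch on a tiny in-memory file. The file contents are made up for illustration and only mimic the layout described above (a note on the first line, the real header on the second):

```python
import io
import pandas as pd

# Hypothetical miniature of the file layout: a note on line 0, the header on line 1
raw = (
    ",,Data derived from a prospectus\n"
    "id,loan_amnt,term\n"
    "1,7550,36 months\n"
    "2,27050,36 months\n"
)

# Default parsing (header=0) takes the note line as the column names,
# so the real header becomes the first data row
bad = pd.read_csv(io.StringIO(raw))

# header=1 tells pandas that the column names live on the second line
good = pd.read_csv(io.StringIO(raw), header=1)

print(list(bad.columns))
print(list(good.columns))
```

With `header=1`, the note line is skipped and the numeric columns are parsed as numbers instead of strings.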

In [60]:
# Check out the values
applicants.head(5)

Unnamed: 0,id,member_id,loan_amnt,term,int_rate,emp_length,home_ownership,annual_inc,loan_status,desc,...,bc_util,dti,inq_last_6mths,mths_since_recent_inq,revol_util,total_bc_limit,total_pymnt,mths_since_last_major_derog,tot_hi_cred_lim,tot_cur_bal
0,10129403,11981032.0,7550,36 months,16.24%,3 years,RENT,28000.0,Current,,...,96.0,8.4,0.0,17.0,72%,4000.0,1864.38,,3828.953801,5759.0
1,10149342,12000897.0,27050,36 months,10.99%,10+ years,OWN,55000.0,Current,Borrower added on 12/31/13 > Combining high ...,...,53.9,22.87,0.0,8.0,61.2%,35700.0,6198.22,,34359.940727,114834.0
2,10129454,11981072.0,12000,36 months,10.99%,4 years,RENT,60000.0,Current,Borrower added on 12/31/13 > I would like to...,...,15.9,4.62,1.0,3.0,24%,18100.0,2748.84,,16416.617758,7137.0
3,10149577,12001118.0,28000,36 months,7.62%,5 years,MORTGAGE,325000.0,Fully Paid,,...,67.1,18.55,1.0,3.0,54.6%,42200.0,29150.98,,38014.149757,799592.0
4,10139658,11991209.0,12000,36 months,13.53%,10+ years,RENT,40000.0,Current,,...,79.6,16.94,0.0,17.0,68.8%,7000.0,2851.8,53.0,6471.462236,13605.0


In [61]:
# First look at some basic numerical statistics
applicants.describe()

Unnamed: 0,id,member_id,loan_amnt,annual_inc,percent_bc_gt_75,bc_util,dti,inq_last_6mths,mths_since_recent_inq,total_bc_limit,total_pymnt,mths_since_last_major_derog,tot_hi_cred_lim,tot_cur_bal
count,197787.0,188123.0,197787.0,188123.0,179096.0,179012.0,188123.0,188123.0,160263.0,180628.0,188123.0,32497.0,180628.0,160382.0
mean,5090397.0,5910758.0,14070.907213,72238.71,53.55703,66.829415,17.058663,0.803581,6.99177,20240.250448,8038.53961,41.792473,20239.458973,137330.5
std,2800545.0,3343605.0,8069.585694,51829.46,34.148464,26.110808,7.596977,1.032841,5.880568,18885.232505,6524.967826,20.997645,18947.098163,150758.7
min,58524.0,149512.0,1000.0,4800.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2295346.0,2169516.0,8000.0,45000.0,25.0,49.5,11.34,0.0,2.0,7800.0,3624.63,25.0,7834.304802,27471.25
50%,5445986.0,6047542.0,12000.0,62000.0,50.0,72.2,16.78,0.0,6.0,14700.0,6166.05,41.0,14635.958701,80764.0
75%,7371872.0,8721086.0,19600.0,87000.0,80.0,89.0,22.58,1.0,11.0,26500.0,10323.805,58.0,26395.960437,208185.2
max,10234830.0,12096970.0,35000.0,7141778.0,100.0,339.6,34.99,8.0,24.0,522210.0,50914.591249,165.0,520643.298178,8000078.0


Having successfully read the applicant information into a DataFrame, we can take a look at some basic statistics above, and notice a few things immediately!

**Problem Type**

Since *loan_status* is a given column in the dataset, and the goal is to predict/assess whether a loan should be issued or not, the following study is a **Supervised Learning Classification Problem**. The type of problem informs the various models we can use, ranging from Logistic Regression all the way to Neural Networks! We will spend time on which model to use (and why!) later in the challenge!

**Missing Values**

Based on the ids, the total number of entries is 197787, yet almost every other feature is missing values. The amount varies quite a bit: some features are missing a few values, whereas others are missing a lot (mths_since_last_major_derog only has 32497 values!). This will have to be dealt with during the cleaning process. The number of missing values per feature can be seen below:

In [62]:
# Missing values
applicants.isnull().sum()

id                                  0
member_id                        9664
loan_amnt                           0
term                             9664
int_rate                         9664
emp_length                       9664
home_ownership                   9664
annual_inc                       9664
loan_status                      9664
desc                           116326
purpose                          9664
percent_bc_gt_75                18691
bc_util                         18775
dti                              9664
inq_last_6mths                   9664
mths_since_recent_inq           37524
revol_util                       9789
total_bc_limit                  17159
total_pymnt                      9664
mths_since_last_major_derog    165290
tot_hi_cred_lim                 17159
tot_cur_bal                     37405
dtype: int64
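
Raw counts are easier to compare once they are expressed as a share of all rows. A minimal sketch of that idea, using a made-up toy frame in place of `applicants` (the `toy` and `missing` names are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `applicants`; values are made up
toy = pd.DataFrame({
    "loan_amnt": [7550, 27050, 12000, 28000],
    "annual_inc": [28000.0, np.nan, 60000.0, 325000.0],
    "desc": [None, "consolidation", None, None],
})

# isnull() yields booleans; their mean is the fraction of missing entries
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
}).sort_values("pct_missing", ascending=False)

print(missing)
```

Sorting by the percentage puts the sparsest features at the top, which helps when deciding which ones to impute and which ones to drop.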

**Total Features**

There are a total of 22 columns in the dataset, each adding a different dimension to the applicant knowledge base. Note that this includes the ids, as well as the target variable (Loan Status). From the remaining columns, as we clean/process the data, we will learn which features to keep/modify and which ones to get rid of!

**Balanced Dataset**

Comparing the mean of each feature with its median (the 50% value) gives a rough gauge of the skew that's present in the dataset. For most features the difference is relatively small, which implies that the dataset is fairly balanced, which is great! A few monetary features (e.g. annual_inc and tot_cur_bal) have means noticeably above their medians, hinting at a long right tail.
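
The mean-versus-median comparison can be sketched on a toy feature. The values below are made up (they are not taken from the dataset), and `skew()` is added here as an extra, assumed diagnostic rather than something the analysis above relies on:

```python
import pandas as pd

# Made-up incomes with one extreme value, mimicking a right-skewed feature
income = pd.Series([30000, 45000, 62000, 87000, 7000000], name="annual_inc")

summary = {
    "mean": income.mean(),
    "median": income.median(),
    "skew": income.skew(),  # sample skewness; > 0 indicates a long right tail
}
print(summary)
```

A single extreme value pulls the mean far above the median, which is the pattern to look for in the `describe()` output.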

Having made a few initial observations, we're now ready to start cleaning the data!

## Data Cleaning/Munging

Often the most time-consuming part of the process, data cleaning is also likely the most important. Models can only be as good as their input, so it's vital that the data is cleaned in a way that lets relevant information retain its statistical significance. Let's start with the target/output variable (*loan_status*), and work our way through the remaining feature vectors.

### Target Variable

Taking a look at the target variable, which is *loan_status*:

In [63]:
loanStatus = applicants["loan_status"]
loanStatus.value_counts()

Current               140116
Fully Paid             33309
Charged Off             9178
Late (31-120 days)      3077
In Grace Period         1570
Late (16-30 days)        780
Default                   93
Name: loan_status, dtype: int64

In [64]:
loanStatus.describe()

count      188123
unique          7
top       Current
freq       140116
Name: loan_status, dtype: object

As seen above, the clear issue with this category is that it's missing certain entries (9664 values to be precise). There are a variety of ways to deal with missing values: we can either *impute* (fill in) them, or remove the affected entries altogether.

Keeping the distribution of values in mind (140116 entries correspond to the Current category - this is > 70% of the total values), and the number of missing values (9664 - approx. 5% missing), it's a reasonable assumption to make that these missing values fall under the Current category. Thus, we will impute/fill in the values assuming they belong to the mode.

Note that there are several ways this could have been done (for example, looking at other features and observing the correlations between them and the target variable). However, in cases where there is a clear majority and a relatively small number of missing values, this method of imputation is a simple and reasonable choice.
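
As a quick sanity check on the proportions quoted above, the arithmetic can be reproduced from the counts in the earlier outputs (the variable names are illustrative):

```python
# Counts taken from the outputs above
total_rows = 197787    # number of ids
non_missing = 188123   # loan_status entries that are present
current = 140116       # entries labelled "Current"

missing = total_rows - non_missing
pct_missing = round(100 * missing / total_rows, 1)
pct_current_of_total = round(100 * current / total_rows, 1)
pct_current_of_known = round(100 * current / non_missing, 1)

print(missing, pct_missing, pct_current_of_total, pct_current_of_known)
```

The share of "Current" is computed both against all rows and against only the rows with a known status, since the two denominators give slightly different pictures.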

In [65]:
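# Fill in missing values with the mode ("Current"), writing back to the DataFrame
applicants["loan_status"] = applicants["loan_status"].fillna("Current")
loanStatus = applicants["loan_status"]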
loanStatus.describe()

count      197787
unique          7
top       Current
freq       149780
Name: loan_status, dtype: object

In order to be able to use this as a target variable in our model, it's necessary to convert this categorical variable into a numerical variable. This can be done via a simple mapping:

0 - Default

1 - Late (16 - 30 days)

2 - In Grace Period 

3 - Late (31 - 120 days)

4 - Charged Off

5 - Fully Paid

6 - Current

Applying this mapping:

In [66]:
# Convert categorical to numerical feature with an explicit mapping.
# Assigning the mapped column back to the DataFrame avoids chained assignment,
# so there is no SettingWithCopyWarning to suppress.
status_codes = {
    "Default": 0,
    "Late (16-30 days)": 1,
    "In Grace Period": 2,
    "Late (31-120 days)": 3,
    "Charged Off": 4,
    "Fully Paid": 5,
    "Current": 6,
}
loanStatus = loanStatus.map(status_codes)
applicants["loan_status"] = loanStatus

loanStatus.value_counts()

6    149780
5     33309
4      9178
3      3077
2      1570
1       780
0        93
Name: loan_status, dtype: int64
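
As an alternative to assigning the codes by hand, pandas categoricals can produce the same integers while keeping the labels around. This is only a sketch on a made-up series (the `toy_status` name is hypothetical), not the approach used in this notebook:

```python
import pandas as pd

# Status labels listed in the same order as the mapping above (code 0 to code 6)
status_order = [
    "Default",
    "Late (16-30 days)",
    "In Grace Period",
    "Late (31-120 days)",
    "Charged Off",
    "Fully Paid",
    "Current",
]

# Toy series standing in for applicants["loan_status"]
toy_status = pd.Series(["Current", "Fully Paid", "Default", "Current", "Charged Off"])

# An ordered categorical keeps the labels and exposes the integer codes;
# any label not listed in status_order would receive the code -1
as_cat = pd.Categorical(toy_status, categories=status_order, ordered=True)
codes = pd.Series(as_cat.codes, index=toy_status.index)

print(codes.tolist())
```

One practical benefit is that unexpected labels show up as -1 instead of silently remaining as strings.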

## Feature Variables

Having cleaned the target variable, we can now move on and clean the feature vectors! Let's start from the *id*, and move across the remaining features in order.

### ID

In [68]:
ids = applicants["id"]
ids.describe()

count    1.977870e+05
mean     5.090397e+06
std      2.800545e+06
min      5.852400e+04
25%      2.295346e+06
50%      5.445986e+06
75%      7.371872e+06
max      1.023483e+07
Name: id, dtype: float64

There are no missing values here, and the feature is already numerical! However, since the goal is to classify whether we should approve the applicant's loan, it seems unlikely that the id (which, based on its definition, is just a "*unique LendingClub assigned ID for the loan listing*") has *any* correlation to the final prediction. Thus, we can **remove this feature altogether**.

It's a good idea to remove any unnecessary features from the data, so that we can avoid **overfitting our model**, i.e. having it memorize the training data (rather than being able to generalize well to real world conditions).
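
To illustrate why an identifier invites memorization, here is a small sketch on synthetic data (random labels, made-up ids; the `predict` helper is hypothetical). A lookup table keyed on the id is perfect on rows it has seen, and no better than guessing the majority class on rows it has not:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "id": rng.permutation(n),             # unique per row, carries no signal
    "label": rng.integers(0, 2, size=n),  # random binary target
})
train, test = toy.iloc[:800], toy.iloc[800:]

# "Model" that simply memorizes id -> label from the training rows
lookup = train.set_index("id")["label"]
majority = train["label"].mode()[0]

def predict(ids):
    # Unseen ids have no entry in the lookup, so fall back to the majority class
    return ids.map(lookup).fillna(majority).astype(int)

train_acc = (predict(train["id"]) == train["label"]).mean()
test_acc = (predict(test["id"]) == test["label"]).mean()
print(train_acc, test_acc)
```

The gap between the two accuracies is exactly the kind of overfitting that dropping the id helps avoid.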

In [73]:
# Remove id
applicants.drop("id", axis = 1, inplace = True)
applicants.head(3)

Unnamed: 0,member_id,loan_amnt,term,int_rate,emp_length,home_ownership,annual_inc,loan_status,desc,purpose,...,bc_util,dti,inq_last_6mths,mths_since_recent_inq,revol_util,total_bc_limit,total_pymnt,mths_since_last_major_derog,tot_hi_cred_lim,tot_cur_bal
0,11981032.0,7550,36 months,16.24%,3 years,RENT,28000.0,6,,debt_consolidation,...,96.0,8.4,0.0,17.0,72%,4000.0,1864.38,,3828.953801,5759.0
1,12000897.0,27050,36 months,10.99%,10+ years,OWN,55000.0,6,Borrower added on 12/31/13 > Combining high ...,debt_consolidation,...,53.9,22.87,0.0,8.0,61.2%,35700.0,6198.22,,34359.940727,114834.0
2,11981072.0,12000,36 months,10.99%,4 years,RENT,60000.0,6,Borrower added on 12/31/13 > I would like to...,debt_consolidation,...,15.9,4.62,1.0,3.0,24%,18100.0,2748.84,,16416.617758,7137.0


### Member ID


In [80]:
memberID = applicants["member_id"]
len(memberID.value_counts())

188123

In [79]:
memberID.describe()

count    1.881230e+05
mean     5.910758e+06
std      3.343605e+06
min      1.495120e+05
25%      2.169516e+06
50%      6.047542e+06
75%      8.721086e+06
max      1.209697e+07
Name: member_id, dtype: float64

Similar to the initial id column, member_id is defined to be *A unique LendingClub assigned id for the borrower member*. Thus, the difference between this column and the previous id is that the previous column corresponded to the specific loan, whereas this id corresponds to the borrowing member (a borrower could potentially request multiple loans in the future).

Judging from the above statistics, every non-missing entry is unique (188123 distinct values out of 188123 entries), implying no repeat members/customers. Since there are missing values in this feature, some method would be necessary to fill/generate the missing ids. Imputing repeat ids would be difficult since the ids are not sequential, and we don't have any information regarding repeat applicants.

Based on the above information, and the fact that member_id, like id, is unlikely to correlate with the target classification, we can also **remove this feature**!
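
The uniqueness check behind this reasoning can be sketched as follows. The ids below are made up and deliberately contain one repeat and one missing entry; in the real data, the number of entries and the number of distinct values are both 188123, so the same check would report no repeats:

```python
import numpy as np
import pandas as pd

# Toy ids with one missing entry and one repeat; values are made up
member_id = pd.Series([101.0, 102.0, np.nan, 104.0, 102.0], name="member_id")

n_present = member_id.count()                      # non-missing entries
n_unique = member_id.nunique()                     # distinct non-missing values
n_repeats = member_id.dropna().duplicated().sum()  # entries repeating an earlier id

print(n_present, n_unique, n_repeats)
```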

In [81]:
# Remove member ID
applicants.drop("member_id", axis = 1, inplace = True)
applicants.head(3)

Unnamed: 0,loan_amnt,term,int_rate,emp_length,home_ownership,annual_inc,loan_status,desc,purpose,percent_bc_gt_75,bc_util,dti,inq_last_6mths,mths_since_recent_inq,revol_util,total_bc_limit,total_pymnt,mths_since_last_major_derog,tot_hi_cred_lim,tot_cur_bal
0,7550,36 months,16.24%,3 years,RENT,28000.0,6,,debt_consolidation,100.0,96.0,8.4,0.0,17.0,72%,4000.0,1864.38,,3828.953801,5759.0
1,27050,36 months,10.99%,10+ years,OWN,55000.0,6,Borrower added on 12/31/13 > Combining high ...,debt_consolidation,25.0,53.9,22.87,0.0,8.0,61.2%,35700.0,6198.22,,34359.940727,114834.0
2,12000,36 months,10.99%,4 years,RENT,60000.0,6,Borrower added on 12/31/13 > I would like to...,debt_consolidation,0.0,15.9,4.62,1.0,3.0,24%,18100.0,2748.84,,16416.617758,7137.0


### Loan Amount