# Final Project
## Introduction to Business Analytics
Spring 2017 
***

### Introduction

Lending Club is an online credit marketplace that facilitates peer-to-peer lending by matching investors with borrowers. Potential borrowers sign up with Lending Club and go through a quick approval process that allows them to request a personal, business, or medical loan. Investors then make offers to borrowers based on their credit ratings and Lending Club’s suggested interest rates. Borrowers can then accept offers and have 15 days to reach a cumulative amount of at least 75% of their requested loan. The risk assessment process is automated, and borrowers can begin to receive loan offers within minutes. (Lendingclub.com)

### The Business Problem

Unlike most banks and traditional lenders, Lending Club has a very flexible prepayment policy. Borrowers can prepay their loan at any time without incurring a fee. Furthermore, investor fees may be reduced depending on the timing of the prepayment. Investors pay a payment processing fee equal to 1% of each monthly payment. If the loan is prepaid within the first 12 months, the investor pays only 1% of the contractual monthly payment, and no fee is paid on the prepaid portion of the loan. However, if the loan is prepaid after the first 12 months, the investor pays a fee of 1% of the entire payment. In other words, if the borrower prepays during the first 12 months, Lending Club will receive at most 33% and 20% of the total fee on a three- and five-year loan, respectively. If a loan is prepaid after the first 12 months, Lending Club will receive the entire fee. (For simplicity, we assume that any prepayments made during the first 12 months occur at the end of the 12th month.)
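
The 33% and 20% figures follow from the simplifying assumption above: if the prepayment lands at the end of month 12, the 1% fee has been collected on 12 of the 36 (or 60) contractual payments. A minimal sketch of that arithmetic, assuming equal monthly installments; the function name `fee_share_collected` is ours and not part of any Lending Club API:

```python
def fee_share_collected(term_months, prepay_month=12):
    """Fraction of the full-term service fee collected when a loan with
    equal monthly installments is prepaid at the end of `prepay_month`
    and no fee is charged on the prepaid balance."""
    if not 0 < prepay_month <= term_months:
        raise ValueError("prepay_month must be between 1 and term_months")
    return prepay_month / term_months


for term in (36, 60):
    share = fee_share_collected(term)
    print("%d-month loan: %.0f%% of the total fee" % (term, 100 * share))
```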

Prepayments can represent a significant loss in revenue for Lending Club. In its last 10-K, Lending Club reported service fees of 11.5M, 32.8M, and 68.0M dollars for 2014, 2015, and 2016, respectively. Additionally, data analysis shows that approximately X% of loans initiated in 201X were prepaid. This translates into a loss of up to XXM dollars for Lending Club in 201X.
    
As Lending Club grows, the number of loans increases, as does the number of prepayments. One solution is to implement a dynamic prepayment policy that charges the borrower a prepayment fee based on the probability that the borrower prepays the loan. Lending Club could maintain the 0% fee for borrowers less likely to prepay and charge a 1% fee for borrowers more likely to prepay. That way, in addition to protecting itself from a loss, Lending Club would make up to 33% more in fees on loans that were prepaid within the first 12 months.
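
As a rough illustration of the dynamic policy described above, the rule can be written as a threshold on a predicted prepayment probability. This is only a sketch: the function name, the 0.5 cut-off, and the example scores are hypothetical, and a real cut-off would have to come from the profit analysis.

```python
def prepayment_fee_rate(p_prepay, threshold=0.5, fee=0.01):
    """Return the prepayment fee rate quoted to a borrower.

    `p_prepay` is a model's predicted probability that the borrower prepays
    within the first 12 months. `threshold` is a hypothetical cut-off.
    """
    if not 0.0 <= p_prepay <= 1.0:
        raise ValueError("p_prepay must lie in [0, 1]")
    return fee if p_prepay >= threshold else 0.0


# Made-up scores for two hypothetical borrowers
scores = {"borrower_a": 0.08, "borrower_b": 0.61}
for name, p in scores.items():
    print(name, prepayment_fee_rate(p))
```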

### Notes

The Data

•	Please provide some notes (bullets are fine) on the features we selected… I’ll add the discussion about the timing

The Target Variable

•	Please provide the details on how we derive the target feature from current features.

Notes for profit curve to be done later:
1.	Assume 100 dollar loan
2.	Assume a probability that a borrower will not take out a loan if they have to pay a prepayment fee – this will represent a potential loss
3.	Assume a fee of 0% or 1%
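
The three assumptions above can be combined into a small expected-value calculation. The sketch below is one possible way to set it up, not the final profit curve: the revenue model, the parameter names, the `baseline_revenue` of 1 dollar per loan, and the example probabilities are all assumptions made for illustration.

```python
def expected_gain_per_loan(p_prepay, p_decline, fee_rate=0.01,
                           loan_amount=100.0, baseline_revenue=1.0):
    """Expected change in revenue per loan from quoting a prepayment fee.

    Hypothetical model: with probability `p_decline` the borrower refuses the
    loan because of the fee and `baseline_revenue` is lost; otherwise the fee
    (fee_rate * loan_amount) is collected only if the borrower prepays.
    """
    for p in (p_prepay, p_decline):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    if fee_rate == 0.0:
        return 0.0  # no fee quoted, so nothing changes
    fee_income = (1.0 - p_decline) * p_prepay * fee_rate * loan_amount
    lost_business = p_decline * baseline_revenue
    return fee_income - lost_business


for fee in (0.0, 0.01):
    gain = expected_gain_per_loan(p_prepay=0.30, p_decline=0.05, fee_rate=fee)
    print("fee %.0f%%: expected gain %.3f dollars per loan" % (100 * fee, gain))
```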

Shortcut cheat sheet: https://gist.github.com/kidpixo/f4318f8c8143adee5b40

## Approved Applications

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# For file loading and memory monitoring
import os
import gc
import psutil
# If you get an error saying "No module named psutil", run this in a terminal:
#     sudo su
#     pip install psutil

# collect garbage and check current memory
def collect_and_check_mem():
    proc = psutil.Process(os.getpid())
    gc.collect()
    mem = proc.memory_info().rss
    print ("Memory : %.2f MB" % (mem / (1000 * 1000)))
collect_and_check_mem()

Memory : 81.27 MB


In [3]:
# data loading util
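def load_year_data(year, suffixs):
    frames = []
    columns = None
    for index, suffix in enumerate(suffixs):
        path = "data/loans/%s/xa%s.csv.gz" % (year, suffix)
        if index == 0:
            # first chunk: skip the first line, then read the header row
            frame = pd.read_csv(path, skiprows=1)
            columns = frame.columns
        else:
            # later chunks have no header, so reuse the column names from the first chunk
            frame = pd.read_csv(path, skiprows=0, names=columns)
        frames.append(frame)
    if not frames:
        return pd.DataFrame()
    # concatenate once at the end (DataFrame.append was removed in pandas 2.0)
    return pd.concat(frames, ignore_index=True)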
# create suffix from start (e.g. 'a') to end (e.g. 'z')
def create_suffixs(start, end):
    return [chr(i) for i in range(ord(start), ord(end)+1)]

In [4]:

#df1 = pd.read_csv("data/loans/2007-2011/xaa.csv.gz",skiprows=1)
#df2 = pd.read_csv("data/loans/2007-2011/xab.csv.gz",skiprows=0, names=df1.columns)
#loans_2007_2011 = pd.concat([df1,df2])
#loans_2007_2011.columns

# Why memory overflowed:
#    pd.concat does not modify its inputs; it builds a new DataFrame holding a copy of the data,
#    so df1, df2, and the combined frame are all in memory at the same time.
#    After concat we should at least delete the chunks and run garbage collection:
#        del df1
#        gc.collect()
#    Never mind: the load_year_data helper loads each year without keeping the temporary chunks around

# 'a' to 'b' for 2007-2011
# suffixs = create_suffixs('a', 'b')
# loans_2007_2011 = load_year_data('2007-2011', suffixs)


In [5]:
# 'a' to 'g' for 2012-2013
# suffixs = create_suffixs('a', 'g')
# loans_2012_2013 = load_year_data('2012-2013', suffixs)

In [6]:
# 'a' to 'h' for 2014
# suffixs = create_suffixs('a', 'h')
# loans_2014 = load_year_data('2014', suffixs)

In [7]:
# 'a' to 'o' for 2015
suffixs = create_suffixs('a', 'o')
loans_2015 = load_year_data('2015', suffixs)
collect_and_check_mem()
len(loans_2015.index)



Memory : 583.70 MB


421097

In [8]:
# Before scaling up the instance we can try a small subset of the data
# years_info = {
#     "2016Q1": create_suffixs('a', 'e'), 
#     #"2016Q2": create_suffixs('a', 'd'), 
#     #"2016Q3": create_suffixs('a', 'd'), 
#     "2016Q4": create_suffixs('a', 'd')
# }
# loans_2016 = pd.DataFrame()
# for year in years_info:
#     frame = load_year_data(year, years_info[year])
#     loans_2016 = loans_2016.append(frame, ignore_index=True)
#     collect_and_check_mem()
# collect_and_check_mem()
# len(loans_2016.index)

In [9]:
# # Merge any years you want
# loans_data = pd.concat([loans_2015, loans_2016])
# collect_and_check_mem()

# # IMPORTANT: remove the useless temporary frames:
# del loans_2015
# del loans_2016
# collect_and_check_mem()
# len(loans_data.index)

# Note: memory freed from deleted objects is not necessarily returned to the OS; Python keeps it for reuse,
# so the reported memory may not go down as expected even though the space is available again:
# reference: http://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe

In [10]:
# Rename it to loans_data
loans_data = loans_2015

In [11]:
# There are too many features, so list the ones we want to keep:
features_to_keep = set([
    # numerical
    'loan_amnt', 'funded_amnt', 'annual_inc', 'installment',
    'open_acc', 'total_acc',
        # Some of the following line features are duplicates? Do we need them all?
        # Say: total_pymnt = total_rec_prncp + total_rec_int
    'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee',
    'recoveries', 'collection_recovery_fee', 
    'last_pymnt_amnt',  
    
    # features used to extract target and drop once that done
    'loan_status',
    'last_pymnt_d', # used to extract target
    'issue_d', # categorical ? Dec-11
    
    # categorical
        #date
    'last_credit_pull_d', # 21 NULL values for 2016Q1Q4 same as above, maybe ignore it
        #other
    'verification_status', 'purpose', 'addr_state',
    'grade', 'sub_grade', 'home_ownership', 'term',
    
    # special
    #'title', # 10629 NULLs for 2016 Q1Q4 value maybe forget it
    
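    # FIXME: there is no comma after 'revol_util', so Python fuses it with 'emp_length' into the
    # single string 'revol_utilemp_length' and neither column is kept (see the set printed below).
    # Both columns are strings and would need parsing before they could be added back.
    'int_rate', 'revol_util' # trim out percentage mark: 10.65%
    'emp_length', # extracting number: 10+ years < 1 year
    
    # not sure:
    'inq_last_6mths', 'pub_rec', 'revol_bal', 'dti', 'delinq_2yrs', 
    # FIXME: same missing comma: 'earliest_cr_line' and 'initial_list_status' are fused and not kept.
    'pymnt_plan', 'earliest_cr_line' 'initial_list_status',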
    'out_prncp', 'out_prncp_inv',
    'collections_12_mths_ex_med',
    'policy_code', 'application_type',
    'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt',
    'pub_rec_bankruptcies', 'tax_liens'
])

In [12]:
features_to_keep

{'acc_now_delinq',
 'addr_state',
 'annual_inc',
 'application_type',
 'chargeoff_within_12_mths',
 'collection_recovery_fee',
 'collections_12_mths_ex_med',
 'delinq_2yrs',
 'delinq_amnt',
 'dti',
 'earliest_cr_lineinitial_list_status',
 'funded_amnt',
 'grade',
 'home_ownership',
 'inq_last_6mths',
 'installment',
 'int_rate',
 'issue_d',
 'last_credit_pull_d',
 'last_pymnt_amnt',
 'last_pymnt_d',
 'loan_amnt',
 'loan_status',
 'open_acc',
 'out_prncp',
 'out_prncp_inv',
 'policy_code',
 'pub_rec',
 'pub_rec_bankruptcies',
 'purpose',
 'pymnt_plan',
 'recoveries',
 'revol_bal',
 'revol_utilemp_length',
 'sub_grade',
 'tax_liens',
 'term',
 'total_acc',
 'total_pymnt',
 'total_pymnt_inv',
 'total_rec_int',
 'total_rec_late_fee',
 'total_rec_prncp',
 'verification_status'}

In [13]:
# Drop every column that is not in features_to_keep in a single call
loans_data = loans_data.drop(columns=[c for c in loans_data.columns if c not in features_to_keep])
loans_data.columns

Index(['loan_amnt', 'funded_amnt', 'term', 'int_rate', 'installment', 'grade',
       'sub_grade', 'home_ownership', 'annual_inc', 'verification_status',
       'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'addr_state', 'dti',
       'delinq_2yrs', 'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal',
       'total_acc', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
       'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'last_pymnt_d', 'last_pymnt_amnt', 'last_credit_pull_d',
       'collections_12_mths_ex_med', 'policy_code', 'application_type',
       'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt',
       'pub_rec_bankruptcies', 'tax_liens'],
      dtype='object')

In [14]:
# Check NULL
loans_data.isnull().sum()

loan_amnt                       2
funded_amnt                     2
term                            2
int_rate                        2
installment                     2
grade                           2
sub_grade                       2
home_ownership                  2
annual_inc                      2
verification_status             2
issue_d                         2
loan_status                     2
pymnt_plan                      2
purpose                         2
addr_state                      2
dti                             2
delinq_2yrs                     2
inq_last_6mths                  2
open_acc                        2
pub_rec                         2
revol_bal                       2
total_acc                       2
out_prncp                       2
out_prncp_inv                   2
total_pymnt                     2
total_pymnt_inv                 2
total_rec_prncp                 2
total_rec_int                   2
total_rec_late_fee              2
recoveries    

In [15]:
# Drop rows with NULL values:
loans_data = loans_data.dropna()
total_num = len(loans_data.index)
#loans_data.isnull().sum()
total_num

420793

In [16]:
loans_data["term"].unique()

array([' 60 months', ' 36 months'], dtype=object)

In [17]:
loans_data["last_pymnt_d"].unique()

array(['Feb-2017', 'Jun-2016', 'Oct-2016', 'Jul-2016', 'Nov-2016',
       'Jan-2017', 'Dec-2016', 'Apr-2016', 'Aug-2016', 'Mar-2016',
       'Sep-2016', 'May-2016', 'Feb-2016', 'Jan-2016', 'Dec-2015',
       'Nov-2015', 'Oct-2015', 'Sep-2015', 'Aug-2015', 'Jul-2015',
       'Jun-2015', 'May-2015', 'Apr-2015', 'Mar-2015', 'Feb-2015',
       'Jan-2015'], dtype=object)

In [18]:
loans_data["issue_d"].unique()

array(['Dec-2015', 'Nov-2015', 'Oct-2015', 'Sep-2015', 'Aug-2015',
       'Jul-2015', 'Jun-2015', 'May-2015', 'Apr-2015', 'Mar-2015',
       'Feb-2015', 'Jan-2015'], dtype=object)

In [19]:
loans_data["loan_status"].unique()

array(['Current', 'Fully Paid', 'Default', 'Charged Off',
       'Late (16-30 days)', 'Late (31-120 days)', 'In Grace Period'], dtype=object)

In [20]:
# Target creation
from datetime import datetime

Y = [] # pre_paid : 1   not pre_paid : 0
# A loan counts as prepaid when it is "Fully Paid" and its last payment
# falls less than 366 days (about 12 months) after the issue date.
for status, issued, last_paid in zip(loans_data["loan_status"],
                                     loans_data["issue_d"],
                                     loans_data["last_pymnt_d"]):
    if status == "Fully Paid":
        time1 = datetime.strptime(issued, '%b-%Y')
        time2 = datetime.strptime(last_paid, '%b-%Y')
        diff = abs(time1 - time2).days
        Y.append(1 if diff < 366 else 0)
    else:
        Y.append(0)
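
To make the target definition concrete, here is a small vectorized sketch of the same rule on made-up rows (the `toy` frame and its values are invented for illustration; only the column names and the 366-day cut-off come from the cell above). Note that a loan paid off exactly 12 calendar months after issue (365 days in a non-leap year) still counts as prepaid.

```python
import pandas as pd

# Made-up rows using the same column names and date format as the loan data
toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Fully Paid", "Current", "Charged Off"],
    "issue_d": ["Jan-2015", "Jan-2015", "Mar-2015", "Feb-2015"],
    "last_pymnt_d": ["Nov-2015", "Jun-2016", "Feb-2017", "Sep-2015"],
})

issued = pd.to_datetime(toy["issue_d"], format="%b-%Y")
last_paid = pd.to_datetime(toy["last_pymnt_d"], format="%b-%Y")
days_open = (last_paid - issued).dt.days

# 1 only when the loan is fully paid AND closed in under 366 days
toy["Target"] = ((toy["loan_status"] == "Fully Paid") & (days_open < 366)).astype(int)
print(toy)
```

Only the first row is labelled 1: the second loan was fully paid but took longer than 12 months, and the last two were never fully paid.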

In [21]:
#Target
len(Y)

420793

In [22]:
#convert 'grade', 'sub_grade' and 'verification_status' into numerical values
grades = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
grade_map = {g: i for i, g in enumerate(grades, start=1)}  # A -> 1, ..., G -> 7
# A1 -> 1, A2 -> 2, ..., G5 -> 35
sub_grade_map = {g + str(n): 5 * i + n for i, g in enumerate(grades) for n in range(1, 6)}
verification_map = {'Not Verified': 0, 'Source Verified': 1, 'Verified': 2}

# assign back instead of using inplace=True on a column, which may act on a temporary copy
loans_data['grade'] = loans_data['grade'].map(grade_map)
loans_data['sub_grade'] = loans_data['sub_grade'].map(sub_grade_map)
loans_data['verification_status'] = loans_data['verification_status'].map(verification_map)

In [23]:
#convert 'pymnt_plan' to numeric values
loans_data['pymnt_plan'] = (loans_data['pymnt_plan'] == 'y').astype(int)

In [24]:
# Rename it to loans_data2 before dropping variables for dummy variable creation
loans_data2 = loans_data

In [25]:
#creating dummy variables for 'term', 'home_ownership', 'purpose' and 'application type'
for field in ['term','home_ownership', 'purpose', 'application_type']:
    for value in loans_data2[field].unique():
        loans_data2[field + " _ " + value] = pd.Series(loans_data2[field]==value, dtype=int)
    loans_data2 = loans_data2.drop([field], axis=1)
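
For comparison, pandas offers `pd.get_dummies`, which does the same one-hot encoding in a single call. The sketch below runs on a made-up frame (the `toy` data is invented); passing `prefix_sep=" _ "` is our choice so that the generated names match the `field _ value` pattern used above.

```python
import pandas as pd

# Made-up rows; the leading space in the term values mirrors the raw data
toy = pd.DataFrame({
    "term": [" 36 months", " 60 months", " 36 months"],
    "home_ownership": ["RENT", "OWN", "MORTGAGE"],
    "loan_amnt": [8000.0, 16000.0, 10000.0],
})

# One 0/1 column per distinct value; the original columns are dropped
encoded = pd.get_dummies(toy, columns=["term", "home_ownership"],
                         prefix_sep=" _ ", dtype=int)
print(encoded.columns.tolist())
```

One difference to keep in mind: `get_dummies` orders the new columns by sorted category value, whereas the loop above follows the order in which values first appear.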

In [28]:
loans_data2.shape

(420793, 60)

In [29]:
loans_data2.columns

Index(['loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'grade',
       'sub_grade', 'annual_inc', 'verification_status', 'issue_d',
       'loan_status', 'pymnt_plan', 'addr_state', 'dti', 'delinq_2yrs',
       'inq_last_6mths', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc',
       'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv',
       'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries',
       'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt',
       'last_credit_pull_d', 'collections_12_mths_ex_med', 'policy_code',
       'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt',
       'pub_rec_bankruptcies', 'tax_liens', 'term _  60 months',
       'term _  36 months', 'home_ownership _ MORTGAGE',
       'home_ownership _ RENT', 'home_ownership _ OWN', 'home_ownership _ ANY',
       'purpose _ credit_card', 'purpose _ debt_consolidation',
       'purpose _ small_business', 'purpose _ car', 'purpose _ other',
       'purpose _

In [30]:
#add target to the data
loans_data2['Target']=Y

In [31]:
loans_data2.shape

(420793, 61)

In [32]:
#remove the date, status and state fields
loans_data2 = loans_data2.drop(['issue_d', 'loan_status', 'addr_state', 'last_pymnt_d', 'last_credit_pull_d'], axis=1)

In [33]:
loans_data2.shape

(420793, 56)

In [34]:
loans_data2['int_rate'].dtypes

dtype('O')

In [35]:
loans_data2.head()

Unnamed: 0,loan_amnt,funded_amnt,int_rate,installment,grade,sub_grade,annual_inc,verification_status,pymnt_plan,dti,...,purpose _ major_purchase,purpose _ house,purpose _ vacation,purpose _ moving,purpose _ renewable_energy,purpose _ wedding,purpose _ educational,application_type _ INDIVIDUAL,application_type _ JOINT,Target
0,16000.0,16000.0,8.49%,328.19,2,6,62000.0,1,0,28.92,...,0,0,0,0,0,0,0,1,0,0
1,8000.0,8000.0,10.78%,261.08,2,9,45000.0,0,0,21.23,...,0,0,0,0,0,0,0,1,0,0
2,10000.0,10000.0,10.78%,326.35,2,9,41600.0,0,0,15.78,...,0,0,0,0,0,0,0,1,0,0
3,24700.0,24700.0,11.99%,820.28,3,11,65000.0,0,0,16.06,...,0,0,0,0,0,0,0,1,0,1
4,10000.0,10000.0,11.99%,222.4,3,11,42500.0,0,0,31.04,...,0,0,0,0,0,0,0,1,0,0


In [36]:
#convert 'int_rate' field to float
loans_data2['int_rate'] = loans_data2['int_rate'].str.replace('%', '', regex=False).astype(float)

In [37]:
loans_data2['int_rate'].dtypes

dtype('float64')

In [38]:
#Separating the labels
Y=loans_data2['Target']

In [39]:
loans_data2.shape

(420793, 56)

In [40]:
loans_data2.drop('Target', axis=1, inplace = True)

In [41]:
loans_data2.shape

(420793, 55)

In [42]:
#initiate model development
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.preprocessing import scale

In [43]:
#Scaling the data
X_Scaled = pd.DataFrame(scale(loans_data2, axis=0, with_mean=True, with_std=True, copy=True), columns = loans_data2.columns.values)

In [44]:
X_Scaled.shape

(420793, 55)
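
The `scale` call above standardizes all rows before the train/test split, so the test rows contribute to the means and standard deviations. A hedged alternative, sketched here with NumPy on synthetic data (the array `X` and the 100-row split are made up for illustration), computes the statistics on the training rows only and reuses them for the test rows:

```python
import numpy as np

# Synthetic stand-in for the feature matrix
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
train, test = X[:100], X[100:]

# Statistics come from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)
std[std == 0] = 1.0  # avoid dividing by zero for constant columns

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # same shift and scale, no refitting

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

scikit-learn's `StandardScaler` follows the same fit-on-train, transform-on-test pattern.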

In [45]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

In [48]:
X_train, X_test, Y_train, Y_test = train_test_split(X_Scaled, Y, train_size=0.10)

In [49]:
#logistic regression model
LR_model = LogisticRegression()

In [50]:
LR_model.fit(X_train, Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [51]:
# accuracy_score expects (y_true, y_pred)
logistic_regression_accuracy = metrics.accuracy_score(Y_test, LR_model.predict(X_test))
print(logistic_regression_accuracy)

0.971118046864
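
Accuracy on its own can be hard to interpret if prepaid loans are a small share of the data (the class balance is not reported in this notebook). A minimal sketch of two sanity checks, using made-up labels rather than the model's real predictions: the accuracy of always predicting the majority class, and the confusion counts behind the accuracy figure. The helper names are ours.

```python
def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common class."""
    positives = sum(y_true)
    return max(positives, len(y_true) - positives) / len(y_true)


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary 0/1 labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Made-up labels: 2 prepaid loans out of 10
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print("accuracy:", accuracy, "baseline:", majority_baseline_accuracy(y_true))
print("tp, fp, tn, fn:", tp, fp, tn, fn)
```

In this toy case the classifier is no more accurate than the baseline even though it finds one of the two prepaid loans, which is why the confusion counts are worth reporting alongside accuracy.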


In [54]:
LR_model.coef_

array([[ -5.01953620e+00,  -5.01953620e+00,   1.79584098e+00,
          3.33844672e+00,  -1.91258223e-01,   1.38833059e-01,
         -6.62822455e-03,   1.99230991e-02,   3.56006129e-02,
         -2.16948672e-01,   7.08385398e-03,  -4.76689974e-02,
         -6.89079008e-02,  -1.35345177e-01,  -9.68349253e-02,
          1.50291736e-01,  -1.51693797e+00,  -1.50858052e+00,
          1.26964552e+00,   1.26113503e+00,   4.31552549e+00,
         -9.75889572e+00,  -5.24881771e-03,  -1.12890595e+00,
         -1.18997287e+00,   1.76184503e+00,  -5.64635122e-03,
          0.00000000e+00,  -2.52790248e-02,  -2.96174833e-02,
         -3.29179326e-01,   1.19696535e-01,   1.04352499e-01,
          7.50192402e-01,  -7.50192402e-01,   1.08369663e-02,
          1.90419801e-02,  -4.76131575e-02,   3.56006129e-02,
          4.47142128e-03,   1.50635979e-02,  -1.97427871e-04,
          1.55702335e-03,  -5.49910370e-02,   3.62769098e-02,
         -1.19874370e-02,  -4.02693001e-03,  -5.52093438e-03,
        

## Rejected applications (deprecated)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

reject_2007_2012 = pd.read_csv("data/rejected/2007-2012/RejectStatsA.csv.gz",skiprows=1)

#df1 = pd.read_csv("data/rejected/2013-2014/xaa.csv.gz",skiprows=1)
#df2 = pd.read_csv("data/rejected/2013-2014/xab.csv.gz",skiprows=0,names=df1.columns)
#reject_2013_2014 = pd.concat([df1,df2])

#df1 = pd.read_csv("data/rejected/2015/xaa.csv.gz",skiprows=1)
#df2 = pd.read_csv("data/rejected/2015/xab.csv.gz",skiprows=0, names=df1.columns)
#reject_2015 = pd.concat([df1,df2])

#reject_2016Q1 = pd.read_csv("data/rejected/2016Q1/RejectStats_2016Q1.csv.gz",skiprows=1)

#reject_2016Q2 = pd.read_csv("data/rejected/2016Q2/RejectStats_2016Q2.csv.gz",skiprows=1)

#reject_2016Q3 = pd.read_csv("data/rejected/2016Q3/RejectStats_2016Q3.csv.gz",skiprows=1)

#reject_2016Q4 = pd.read_csv("data/rejected/2016Q4/RejectStats_2016Q4.csv.gz",skiprows=1)




In [None]:
reject_2007_2012.columns

In [None]:
reject_2007_2012.describe()