# Predicting Approval of a Loan (In Progress)

In this excersize we are using a machine learning to model a borrower's credit risk. Data was obtained from Lending Club's website. Lending club is marketplace for personal loans; borrowers seeking a loan are matched with investors looking to lend money and make a return. The riskier the lend, the higher the interest rate. 

In [60]:
import pandas as pd
loans_2007 = pd.read_csv('C:\Users\pminh\Documents\Python Scripts\LoanStats3a.csv',skiprows=1,dtype=unicode) #skip the first row (extraneous text) 
half_count=len(loans_2007)/2
loans_2007=loans_2007.dropna(thresh=half_count, axis=1) #remove columns where 50% or more of the rows are missing  
loans_2007=loans_2007.drop(['desc'], axis =1) #remove the desc column(contains explanation of each loan) and the url column (contains a link to each loan)
loans_2007.to_csv('loans_2007.csv', index=False) #read into new csv file called Loans_2007
    

In [61]:
loans_2007=pd.read_csv('loans_2007.csv') 
print(loans_2007.iloc[1]) #display first row 
print(loans_2007.shape[1]) #display number of columns 

loan_amnt                                2500
funded_amnt                              2500
funded_amnt_inv                          2500
term                                60 months
int_rate                               15.27%
installment                             59.83
grade                                       C
sub_grade                                  C4
emp_title                               Ryder
emp_length                           < 1 year
home_ownership                           RENT
annual_inc                              30000
verification_status           Source Verified
issue_d                              Dec-2011
loan_status                       Charged Off
pymnt_plan                                  n
purpose                                   car
title                                    bike
zip_code                                309xx
addr_state                                 GA
dti                                         1
delinq_2yrs                       

In order to determine what features (columns) to use to build a good predictive model the following was considered (as suggested by DATAQUEST):
1. Does the feature leak information from the future?
2. What role does the feature play in determining the borrower's ability to pay back a lender 
3. How is formatted and does it need to be cleaned up 
4. Does it need a lot of processing to turn into a useful feature?
5. Does it contain redundant information?

The rows were analyzed in groups of 18 

Looking through the first 18 rows, I thought it it would make sense to drop the following features:
1. sub_grade - contains redundant information
2. grade - contains redundant information 
3. verification status - leaks data from future 
4. issue_id - randomly generated feild 
5. loan_status - leak data from future 
6. payment_plan - leak data from future 
7. emp_title - requires other data and a lot of processing  

After looking at what the DATAQUEST team suggested, I realized that I did not include:
1. funded_amnt - leaks data from future (determined after the loan has already started to get processed) 
2. funded_amnt_inv - leaks data from future 

Features that I did include to be removed that should not have been removed include:
1. verification status - indicates wheather income was verified by LC. I thought this information would leak information from the future, but it actually gives insight into the future, because it is not something that is determined after a loan is approved but rather helps decide whether to improve one or not. 
2. issue_id - is not a randomly generated feild. It is issued the month the loan is funded and leaks information from the future 
3. funded_amnt and funded_amnt_inv were not included in my original list but they too leak information from the future. These features refer to the total amount committed to the loan at a given time and the total amount committed by the invester at a given time, respectively. 

In [62]:
loans_2007=loans_2007.drop(['funded_amnt','funded_amnt_inv','grade','sub_grade','emp_title','issue_d'], axis=1)

In [63]:
#the process above was repeated for the remaining groups of rows
loans_2007=loans_2007.drop(['zip_code','out_prncp','out_prncp_inv','total_pymnt','total_pymnt_inv','total_rec_prncp'],axis=1)

In [64]:
loans_2007=loans_2007.drop(['total_rec_int', 'total_rec_late_fee','recoveries', 'collection_recovery_fee', 'last_pymnt_d','last_pymnt_amnt'], axis=1)
print(loans_2007.iloc[0])
print(loans_2007.shape[0])

loan_amnt                            5000
term                            36 months
int_rate                           10.65%
installment                        162.87
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
addr_state                             AZ
dti                                 27.65
delinq_2yrs                             0
earliest_cr_line                 Jan-1985
inq_last_6mths                          1
open_acc                                3
pub_rec                                 0
revol_bal                           13648
revol_util                          83.7%
total_acc                               9
initial_list_status                     f
last_credit_pull_d               M

Next we needed to decide what appropriate Target column to use. The loan_status column was chosen because it is the only column that mentions if a loan payment was made on time, had delayed payments, or was defaulted. 

In [65]:
unique_status=loans_2007['loan_status'].value_counts()
print(unique_status) 

Fully Paid                                             34115
Charged Off                                             5670
Does not meet the credit policy. Status:Fully Paid      1988
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                         1
Name: loan_status, dtype: int64


An investor is only concerned with accurately predicting whether a loan will be payed off on time or not. We can treat this as binary classification problem and select all rows with 'Fully Paid' and 'Charged Off' loan status. 

In [66]:
loans_2007=loans_2007[(loans_2007['loan_status'] == 'Fully Paid') | (loans_2007['loan_status'] == 'Charged Off')]
#create a dictionary in which if a loan is 'Fully Paid' it is assigned a value of 1, 
#and if it is 'Charged Off' it is assigned a value of '0. 
mapping_dict={'loan_status':{'Fully Paid':1,'Charged Off':0}} 
loans_2007=loans_2007.replace(mapping_dict)

In [67]:
# remove all columns that have only one unique value as they will not  add any information 
#to each loan application 
drop_columns=[]
column_names=loans_2007.columns
for name in column_names:
    non_null=loans_2007[name].dropna()
    unique_non_null=non_null.unique() 
    num_true_unique=len(unique_non_null) 
    if num_true_unique <= 1:
        drop_columns.append(name) 
        
loans_2007=loans_2007.drop(drop_columns, axis=1) 
print(drop_columns)

['pymnt_plan', 'initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']


Next, we should remove columns with a large number of null values and any rows that contain missing values. 

In [68]:
null_counts=loans_2007.isnull().sum()
print(null_counts)

loan_amnt                 0
term                      0
int_rate                  0
installment               0
emp_length                0
home_ownership            0
annual_inc                0
verification_status       0
loan_status               0
purpose                   0
title                    10
addr_state                0
dti                       0
delinq_2yrs               0
earliest_cr_line          0
inq_last_6mths            0
open_acc                  0
pub_rec                   0
revol_bal                 0
revol_util               50
total_acc                 0
last_credit_pull_d        2
pub_rec_bankruptcies    697
dtype: int64


In [69]:
loans=loans_2007.drop('pub_rec_bankruptcies', axis=1) 
loans=loans.dropna(axis = 0, subset=['title', 'revol_util', 'last_credit_pull_d']) 
print(loans.dtypes.value_counts())

object     11
float64    10
int64       1
dtype: int64


In order to use scikit learn, object columns that contain text need to be converted to numerical data types. 

In [70]:
object_columns_df=loans.select_dtypes(include=['object']) 
print(object_columns_df.iloc[0]) 

term                     36 months
int_rate                    10.65%
emp_length               10+ years
home_ownership                RENT
verification_status       Verified
purpose                credit_card
title                     Computer
addr_state                      AZ
earliest_cr_line          Jan-1985
revol_util                   83.7%
last_credit_pull_d        Mar-2017
Name: 0, dtype: object


In [71]:
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for c in cols:
    print(loans[c].value_counts()) 
    

RENT        18881
MORTGAGE    17687
OWN          3056
OTHER          96
NONE            3
Name: home_ownership, dtype: int64
Not Verified       16890
Verified           12832
Source Verified    10001
Name: verification_status, dtype: int64
10+ years    8897
< 1 year     4576
2 years      4389
3 years      4094
4 years      3435
5 years      3279
1 year       3240
6 years      2227
7 years      1771
8 years      1482
9 years      1259
n/a          1074
Name: emp_length, dtype: int64
 36 months    29041
 60 months    10682
Name: term, dtype: int64
CA    7095
NY    3815
FL    2869
TX    2729
NJ    1850
IL    1524
PA    1515
VA    1407
GA    1399
MA    1343
OH    1221
MD    1053
AZ     878
WA     841
CO     791
NC     788
CT     754
MI     722
MO     685
MN     613
NV     497
SC     472
WI     459
AL     450
OR     450
LA     436
KY     327
OK     299
KS     271
UT     259
AR     245
DC     212
RI     199
NM     190
WV     177
HI     173
NH     172
DE     113
MT      85
WY      83
AK      

In [72]:
cols=['purpose','title']
for c in cols:
    print(loans[c].value_counts()) 

debt_consolidation    18660
credit_card            5134
other                  3985
home_improvement       2980
major_purchase         2182
small_business         1827
car                    1549
wedding                 947
medical                 693
moving                  581
house                   382
vacation                380
educational             320
renewable_energy        103
Name: purpose, dtype: int64
Debt Consolidation                        2188
Debt Consolidation Loan                   1732
Personal Loan                              661
Consolidation                              516
debt consolidation                         508
Credit Card Consolidation                  357
Home Improvement                           357
Debt consolidation                         334
Small Business Loan                        329
Credit Card Loan                           319
Personal                                   309
Consolidation Loan                         256
Home Improvement

In [73]:
mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}

loans=loans.drop(['last_credit_pull_d', 'addr_state', 'title', 'earliest_cr_line'],axis=1)
loans['int_rate']=loans['int_rate'].str.rstrip('%').astype(float) 
loans['revol_util']=loans['revol_util'].str.rstrip('%').astype(float)
loans = loans.replace(mapping_dict) 

In [74]:
cat_columns = ["home_ownership", "verification_status", "emp_length", "purpose", "term"]
for c in cat_columns:
    loans[c]=loans[c].astype('category')
dummy_df = pd.get_dummies(loans[cat_columns])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(cat_columns, axis=1)

Once the data is all clean, it is important to decide and error metric for the  model. It is important to note that the loans dataset presents a class imbalance. There are 6 times as many loans that were payed off on time (1's) as there were loans that weren't paid off (0's). This implies that a classifier can predict 1 for a row and have high accuracy. Therefore, accuracy is not a proper metric to use to assess the strength of our model. Rather, our model should optimize towards a high true positive rate (the number of predictions that were correct) and a low false positive rate (number of predictions that were 1's but should have been 0's), so that an investor has less chances of losing money. 

In [75]:
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_predict, KFold
target=loans['loan_status'] 
features=loans.drop(['loan_status'],axis=1)
lr = LogisticRegression()
kf = KFold(features.shape[0], random_state=1)
predictions=cross_val_predict(lr, features, y=target, cv=kf) 

predictions = pd.Series(predictions)
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

fpr=float(fp)/float(fp+tn) 
tpr=float(tp)/float(fn+tp)

print(fpr) 
print(tpr)

0.999291408326
0.999147434955


The FPR and TPR are both very high. This is most likely due to the class imbalance. One way to correct for the imbalance is to tell the classifier to penalize misclassifications of the less prevalent class more than the other class. By setting the class_weight parameter to balanced, the penalty is set to be inversely proportional to the class frequencies. 

In [76]:
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_predict
from sklearn.cross_validation import cross_val_predict, KFold
#By setting the class_weight parameter to balanced, the penalty is set to be inversely 
#proportional to the class frequencies
lr = LogisticRegression(class_weight='balanced')
kf = KFold(features.shape[0], random_state=1)
predictions=cross_val_predict(lr, features, y=target, cv=kf) 

predictions = pd.Series(predictions)
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

fpr=float(fp)/float(fp+tn) 
tpr=float(tp)/float(fn+tp)

print(fpr) 
print(tpr)

0.615234720992
0.629310598265


Unfortunately, by adjusting the penalty for errors within both classes did not help much. THe FPR came down but so did the TPR, so we are missing out on oppurtunities that might have payed the loans on time. We can try manually adjusting the penalties or use a different type of classifier. 