# Predicting lending outcomes using logistic regression and random forest
The goal of this project is to predict if a loan will be repaid (before it is granted!). The data used can be found [here](https://www.lendingclub.com/auth/login?login_url=%2Finfo%2Fdownload-data.action). Lending Club is a website to facilitate lending money between private individuals. To make a profit, a lender should only lend to someone with a high probability of repaying. How can that be determined? By learning from past outcomes using machine learning models of course!

First, let's begin by reading and cleaning the data.
## Data reading

In [36]:
import pandas as pd
loans_2007 = pd.read_csv('loans_2007.csv')
loans_2007.info()
loans_copy = loans_2007.copy()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42538 entries, 0 to 42537
Data columns (total 52 columns):
id                            42538 non-null object
member_id                     42535 non-null float64
loan_amnt                     42535 non-null float64
funded_amnt                   42535 non-null float64
funded_amnt_inv               42535 non-null float64
term                          42535 non-null object
int_rate                      42535 non-null object
installment                   42535 non-null float64
grade                         42535 non-null object
sub_grade                     42535 non-null object
emp_title                     39909 non-null object
emp_length                    41423 non-null object
home_ownership                42535 non-null object
annual_inc                    42531 non-null float64
verification_status           42535 non-null object
issue_d                       42535 non-null object
loan_status                   42535 non-null object
p

In [37]:
loans_2007.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,...,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,...,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,...,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,...,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,...,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


## Data cleaning
### Droping columns with irrelevant, redundant, or leaky information
Some columns are irrelevant to the prediction of loan outcome: 
* id and member id columns are just for identification purposes
* the emp title (job title) is too complex to include

Other columns are redundant with others: 
* grade and sub grade are redundant with the interest column
* zip code (3 digits only) is redunant with the address state column

Other columns leak information about the prediction, or provide knowledge unavailable before the lending decision: 
* funded amnt and funded amnt inv leak information about the timelyness of the payment
* issued leaks info about the month the loan was funded - a prospective investor wouldnt know this in advance
* out prncp and inv leak also provide irrelevant data after the loan is funded
* similarly for total payment and inv, total rec, total rec int,total rec late fee, recoveries,collection recovery fee, last payment day,last payment amnt

In [38]:
loans_2007 = loans_2007.drop(['id', 
                              'member_id', 
                              'funded_amnt', 
                              'funded_amnt_inv', 
                              'grade', 
                              'sub_grade', 
                              'emp_title', 
                              'issue_d', 
                              'zip_code', 
                              'out_prncp', 
                              'out_prncp_inv', 
                              'total_pymnt', 
                              'total_pymnt_inv', 
                              'total_rec_prncp', 
                              'total_rec_int', 
                              'total_rec_late_fee', 
                              'recoveries', 
                              'collection_recovery_fee', 
                              'last_pymnt_amnt', 
                              'last_pymnt_d'], axis=1)

loans_2007.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,pymnt_plan,...,initial_list_status,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,5000.0,36 months,10.65%,162.87,10+ years,RENT,24000.0,Verified,Fully Paid,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,2500.0,60 months,15.27%,59.83,< 1 year,RENT,30000.0,Source Verified,Charged Off,n,...,f,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,2400.0,36 months,15.96%,84.33,10+ years,RENT,12252.0,Not Verified,Fully Paid,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,10000.0,36 months,13.49%,339.31,10+ years,RENT,49200.0,Source Verified,Fully Paid,n,...,f,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,3000.0,60 months,12.69%,67.79,1 year,RENT,80000.0,Source Verified,Current,n,...,f,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


### Dropping columns with little information and standard deviation
Let's find which columns say the same thing for every row, or almost every row.

In [39]:
list_low_info_cols = []

for col in loans_2007.columns:
    if len(loans_2007[col].dropna().unique()) <= 1:
        list_low_info_cols.append(col)
        print(col, loans_copy[col].unique())
loans_2007 = loans_2007.drop(list_low_info_cols, axis=1)    

initial_list_status ['f' nan]
collections_12_mths_ex_med [ 0. nan]
policy_code [ 1. nan]
application_type ['INDIVIDUAL' nan]
chargeoff_within_12_mths [ 0. nan]


We removed columns with only a single unique value, what about columns where a vast majority of the values are the same?

In [40]:
std_devs = loans_2007.std().sort_values(ascending=False)
std_devs

annual_inc              64096.349719
revol_bal               22018.441010
loan_amnt                7410.938391
installment               208.927216
delinq_amnt                29.359579
total_acc                  11.592811
dti                         6.726315
open_acc                    4.496274
inq_last_6mths              1.527455
delinq_2yrs                 0.512406
pub_rec                     0.245713
pub_rec_bankruptcies        0.208737
acc_now_delinq              0.009700
tax_liens                   0.004855
dtype: float64

Clearly, we can drop these last two.

In [41]:
loans_2007 = loans_2007.drop(['tax_liens', 'acc_now_delinq'], axis=1)

Let's explore further, in particular non numeric columns


In [42]:
col_list = []
for col in loans_2007.columns:
    if len(loans_2007[col].dropna().unique()) <=5:
        col_list.append(col)
col_list

['term',
 'home_ownership',
 'verification_status',
 'pymnt_plan',
 'delinq_amnt',
 'pub_rec_bankruptcies']

In [43]:
distribs = []
for col in col_list:
    print(loans_2007[col].value_counts(dropna=False, normalize=True),'\n')

 36 months    0.741314
 60 months    0.258616
NaN           0.000071
Name: term, dtype: float64 

RENT        0.474423
MORTGAGE    0.445696
OWN         0.076426
OTHER       0.003197
NONE        0.000188
NaN         0.000071
Name: home_ownership, dtype: float64 

Not Verified       0.440970
Verified           0.316682
Source Verified    0.242277
NaN                0.000071
Name: verification_status, dtype: float64 

n      0.999906
NaN    0.000071
y      0.000024
Name: pymnt_plan, dtype: float64 

0.0       0.999201
NaN       0.000752
6053.0    0.000024
27.0      0.000024
Name: delinq_amnt, dtype: float64 

0.0    0.924256
1.0    0.043396
NaN    0.032159
2.0    0.000188
Name: pub_rec_bankruptcies, dtype: float64 



Payment plan, delinq_amnt and pub_rec_bankruptcies seems clearly droppable, as a vast majority of the rows have the same value.

In [44]:
loans_2007 = loans_2007.drop(['pymnt_plan', 'delinq_amnt', 'pub_rec_bankruptcies'], axis=1)
loans_2007.shape

(42538, 22)

We are down to 22 columns!

### Preparing target column
We want to predict the loan outcome, which is contained in the loan_status column.

In [12]:
loans_2007['loan_status'].value_counts()

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64

Only Fully Paid or Charged Off indicate a completed loan, we can therefore safely remove all rows that dont match these two, as they indicate loans in progress, for which no prediction or model training could be performed yet.

We can convert these two outcomes into 1 and 0 and apply binary classification models.

In [45]:
loans_2007 = loans_2007[(loans_2007['loan_status'] == 'Fully Paid') | (loans_2007['loan_status'] == 'Charged Off')]
loans_2007 = loans_2007.replace({'loan_status' : {'Fully Paid' : 1, 'Charged Off' : 0}})

In [47]:
loans_2007.shape

(38770, 22)

### Handling missing values
We need to deal with missing values before we are able to fit models

In [48]:
loans_2007.isnull().sum()

loan_amnt                 0
term                      0
int_rate                  0
installment               0
emp_length             1036
home_ownership            0
annual_inc                0
verification_status       0
loan_status               0
purpose                   0
title                    11
addr_state                0
dti                       0
delinq_2yrs               0
earliest_cr_line          0
inq_last_6mths            0
open_acc                  0
pub_rec                   0
revol_bal                 0
revol_util               50
total_acc                 0
last_credit_pull_d        2
dtype: int64

There are four problem columns, how to deal with them?
* employment length is too important to drop the entire column, as it is probably highly relevant to the capacity to repay a loan. We will have to remove those rows, sadly losing a lot of data but still, we have nearly 40k rows
* title has very few missing rows, so can drop without much loss
* Similarly for revol_util and last_credit pulld

In [49]:
loans_2007 =loans_2007.dropna()
loans_2007.shape

(37675, 22)

### Preparing text columns
Text columns can also be used in our model, but first we need to convert what can be to numerical, and encode the rest.

In [50]:
print(loans_2007.dtypes.value_counts())

object     11
float64    10
int64       1
dtype: int64


In [52]:
loans_text = loans_2007.select_dtypes(include=['object'])
loans_text.head()

Unnamed: 0,term,int_rate,emp_length,home_ownership,verification_status,purpose,title,addr_state,earliest_cr_line,revol_util,last_credit_pull_d
0,36 months,10.65%,10+ years,RENT,Verified,credit_card,Computer,AZ,Jan-1985,83.7%,Jun-2016
1,60 months,15.27%,< 1 year,RENT,Source Verified,car,bike,GA,Apr-1999,9.4%,Sep-2013
2,36 months,15.96%,10+ years,RENT,Not Verified,small_business,real estate business,IL,Nov-2001,98.5%,Jun-2016
3,36 months,13.49%,10+ years,RENT,Source Verified,other,personel,CA,Feb-1996,21%,Apr-2016
5,36 months,7.90%,3 years,RENT,Source Verified,wedding,My wedding loan I promise to pay back,AZ,Nov-2004,28.3%,Jan-2016


In [53]:
for col in loans_text.columns:
    print(loans_2007[col].value_counts(), '\n')

 36 months    28234
 60 months     9441
Name: term, dtype: int64 

 10.99%    906
 11.49%    770
  7.51%    756
 13.49%    747
  7.88%    701
  7.49%    629
  9.99%    573
  7.90%    552
 11.71%    546
  5.42%    524
 11.99%    478
 10.37%    453
 12.69%    441
  8.49%    425
  6.03%    413
 12.99%    404
 12.42%    397
 10.65%    393
  5.79%    390
 11.86%    383
  7.29%    379
  6.62%    376
  8.90%    371
  9.63%    368
 10.59%    351
 14.27%    336
  9.91%    331
  5.99%    329
 12.53%    327
  7.14%    326
          ... 
 14.70%      2
 15.83%      2
 14.88%      2
 17.03%      2
 14.07%      2
 15.38%      2
 15.01%      2
 22.94%      2
 15.07%      2
 17.90%      2
 16.20%      1
 16.01%      1
 17.54%      1
 14.67%      1
 16.71%      1
 18.36%      1
 10.64%      1
 18.72%      1
 16.15%      1
 24.59%      1
 16.96%      1
 21.48%      1
 16.33%      1
 17.34%      1
 17.44%      1
 24.40%      1
 22.64%      1
 20.52%      1
 17.46%      1
 13.84%      1
Name: int_rate, Le

#### Observations:
* Should convert int rate, and revol util to numerical
* Employment length can be converted as well with some logic
* Can make dummy collumns out of term, home ownership, verification status, and purpose
* Should drop title, as there are too many different values, and similar data in purpose
* should drop state, same problem as above
* earliest cr line and last_Credit pull would require too much work to be useful

In [54]:
loans_2007 = loans_2007.drop(['last_credit_pull_d', 'addr_state', 'title', 'earliest_cr_line'], axis=1) 

In [55]:
loans_2007.shape

(37675, 18)

In [56]:
loans_2007['int_rate'] = loans_2007['int_rate'].str.rstrip('%').astype('float')
loans_2007['revol_util'] = loans_2007['revol_util'].str.rstrip('%').astype('float')

emp_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}
loans_2007 = loans_2007.replace(emp_dict)

In [57]:
loans_2007.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,emp_length,home_ownership,annual_inc,verification_status,loan_status,purpose,dti,delinq_2yrs,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc
0,5000.0,36 months,10.65,162.87,10,RENT,24000.0,Verified,1,credit_card,27.65,0.0,1.0,3.0,0.0,13648.0,83.7,9.0
1,2500.0,60 months,15.27,59.83,0,RENT,30000.0,Source Verified,0,car,1.0,0.0,5.0,3.0,0.0,1687.0,9.4,4.0
2,2400.0,36 months,15.96,84.33,10,RENT,12252.0,Not Verified,1,small_business,8.72,0.0,2.0,2.0,0.0,2956.0,98.5,10.0
3,10000.0,36 months,13.49,339.31,10,RENT,49200.0,Source Verified,1,other,20.0,0.0,1.0,10.0,0.0,5598.0,21.0,37.0
5,5000.0,36 months,7.9,156.46,3,RENT,36000.0,Source Verified,1,wedding,11.2,0.0,3.0,9.0,0.0,7963.0,28.3,12.0


In [59]:
to_dummy = ["home_ownership", "verification_status", "purpose", "term"]
dummies = pd.get_dummies(loans_2007[to_dummy])
loans_2007 = pd.concat([loans_2007, dummies], axis=1)
loans_2007 = loans_2007.drop(to_dummy, axis=1)
loans_2007.columns

Index(['loan_amnt', 'int_rate', 'installment', 'emp_length', 'annual_inc',
       'loan_status', 'dti', 'delinq_2yrs', 'inq_last_6mths', 'open_acc',
       'pub_rec', 'revol_bal', 'revol_util', 'total_acc',
       'home_ownership_MORTGAGE', 'home_ownership_NONE',
       'home_ownership_OTHER', 'home_ownership_OWN', 'home_ownership_RENT',
       'verification_status_Not Verified',
       'verification_status_Source Verified', 'verification_status_Verified',
       'purpose_car', 'purpose_credit_card', 'purpose_debt_consolidation',
       'purpose_educational', 'purpose_home_improvement', 'purpose_house',
       'purpose_major_purchase', 'purpose_medical', 'purpose_moving',
       'purpose_other', 'purpose_renewable_energy', 'purpose_small_business',
       'purpose_vacation', 'purpose_wedding', 'term_ 36 months',
       'term_ 60 months'],
      dtype='object')

In [60]:
loans_2007.head()

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,loan_status,dti,delinq_2yrs,inq_last_6mths,open_acc,...,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_renewable_energy,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
0,5000.0,10.65,162.87,10,24000.0,1,27.65,0.0,1.0,3.0,...,0,0,0,0,0,0,0,0,1,0
1,2500.0,15.27,59.83,0,30000.0,0,1.0,0.0,5.0,3.0,...,0,0,0,0,0,0,0,0,0,1
2,2400.0,15.96,84.33,10,12252.0,1,8.72,0.0,2.0,2.0,...,0,0,0,0,0,1,0,0,1,0
3,10000.0,13.49,339.31,10,49200.0,1,20.0,0.0,1.0,10.0,...,0,0,0,1,0,0,0,0,1,0
5,5000.0,7.9,156.46,3,36000.0,1,11.2,0.0,3.0,9.0,...,0,0,0,0,0,0,0,1,1,0


Data is now ready for modelling!

## Data modeling
### Choice of metric and class imbalance
Most loans are repaid, which means a model that predicts 1 for very row would be pretty accurate. However, it would probably lose money if you followed it investing, because the loss from a single false positive creates a much larger financial penalty than multiple true positives can offset. When better, it's not enough to be right most of the time, because one loss is very painful!

Therefore, the metric used cannot be accuracy, but instead:
* The true positive rate, aka. "sensitivity" or "recall", how much the model guessed TRUE out of those it should have guessed TRUE
    * Formula: true positives / (true positives + false negatives)
* the false positive rate = "Fall-out", how much the model guessed TRUE out of those it should have guessed FALSE
    * Formula: false positives / (false positives + true negatives), 

These two metrics actually track when money is lost (either by making bad loans or avoiding good ones).

Let's start with a basic logistic regression model and see what happens.
### Basic logistic model

In [61]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import recall_score
lr = LogisticRegression()
features = loans_2007.drop('loan_status', axis=1).copy()
target = loans_2007['loan_status']
lr.fit(features,target)
predictions = lr.predict(features)
tpr = recall_score(target, predictions)
tpr



0.9986062070247166

Looks like the model is pretty good at catching the good loans, but now let's find out if it catches the bad ones!

In [62]:
fp_filter = (predictions == 1) & (loans_2007["loan_status"] == 0)
fp = len(predictions[fp_filter])

tn_filter = (predictions == 0) & (loans_2007["loan_status"] == 0)
tn = len(predictions[tn_filter])
fpr = fp  / (fp + tn)
fpr

0.9964742994989794

Ouch, the model gave almost every single defaulted loan the incorrect answer. To fix, we have to set the parameter class_weight to balanced in the regression model instance.

In [63]:
lr = LogisticRegression(class_weight='balanced')
lr.fit(features,target)
predictions = lr.predict(features)

tpr = recall_score(target, predictions)
fp_filter = (predictions == 1) & (loans_2007["loan_status"] == 0)
fp = len(predictions[fp_filter])
tn_filter = (predictions == 0) & (loans_2007["loan_status"] == 0)
tn = len(predictions[tn_filter])
fpr = fp  / (fp + tn)
print(tpr, fpr)



0.6552375642693428 0.36927073668584154


The loss of sensitivity was compensated by a great reduction in the false positive rate. This was achieved with an automatic penatly that depends on the balance in the data between trues and falses. We can try to improve our fpr by manually setting a penalty.

In [64]:
penalty = {
    0: 10,
    1: 1
}
lr = LogisticRegression(class_weight=penalty)
lr.fit(features,target)
predictions = lr.predict(features)

tpr = recall_score(target, predictions)
fp_filter = (predictions == 1) & (loans_2007["loan_status"] == 0)
fp = len(predictions[fp_filter])
tn_filter = (predictions == 0) & (loans_2007["loan_status"] == 0)
tn = len(predictions[tn_filter])
fpr = fp  / (fp + tn)
print(tpr, fpr)



0.19175494022176795 0.06290591946557803


Now our model is extremely good at avoiding false positives, however, it is also very conservative, considering only 20% of good loans to be good.

To improve our predictions further, let's try a different model: decision trees aggregated using the random forest algorithm.

### Random Forest Algorithm

In [65]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(class_weight="balanced", random_state=1)
predictions = cross_val_predict(rf, features, target, cv=3)
predictions = pd.Series(predictions)
tpr = recall_score(target, predictions)
fp_filter = (predictions == 1) & (loans_2007["loan_status"] == 0)
fp = len(predictions[fp_filter])
tn_filter = (predictions == 0) & (loans_2007["loan_status"] == 0)
tn = len(predictions[tn_filter])
fpr = fp  / (fp + tn)
print(tpr, fpr)



0.9700799107972495 0.9636722606120435


This isn't better than logistic model - let's try manual penalties again.

In [66]:
penalty = {
    0: 20,
    1: 1
}
rf = RandomForestClassifier(class_weight=penalty, random_state=1)
predictions = cross_val_predict(rf, features, target, cv=3)
predictions = pd.Series(predictions)
tpr = recall_score(target, predictions)
fp_filter = (predictions == 1) & (loans_2007["loan_status"] == 0)
fp = len(predictions[fp_filter])
tn_filter = (predictions == 0) & (loans_2007["loan_status"] == 0)
tn = len(predictions[tn_filter])
fpr = fp  / (fp + tn)
print(tpr, fpr)



0.9733011212290157 0.9693978282329714


It looks like random forest just doesn't help for this type of problem.

## Conclusion
Using logistic regression, we were able to build a model that accepts 20% of the good loans and only accepts 6% of the bad ones. What is the actual default rate?

In [67]:
loans_2007['loan_status'].value_counts(normalize=True)

1    0.856961
0    0.143039
Name: loan_status, dtype: float64

14% of the Loans default, which means our model is better than picking randomly, while being quite conservative on what to grant. As long as the interest rate from our good loans is enough to offset the bad ones, we are good!