## Data Understanding and Preparation
<img src="https://github.com/CatherineCao2016/lendingclub/raw/master/cleaning.png" width="800" height="500" align="middle"/>

Many fields are populated mostly with NA values, and many others were filled in only after the loan was issued. Because we are modeling whether a loan should be granted, only the information available to Lending Club when the loan was requested is relevant. We therefore make a first pass through the data and keep only the variables of interest.

* Outcome: loan status
* Loan application info:
    * issue date
    * loan amount
    * employment title
    * employment length
    * verification status
    * home ownership
    * annual income
    * purpose
    * loan description
    * address
    * term

* Background check:
    * financial inquiries in last 6 months
    * open credit lines
    * derogatory public records
    * revolving line utilization rate
    * debt-to-income ratio
    * total credit lines
    * delinquency instances in past 2 years
    * earliest reported credit line open time
    * months since last delinquency

* Computed additional features:
    * EMP_LISTED
    * EMPTY_DESC
    * EMP_NA
    * DELING_EVER
    * TIME_HISTORY
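
As a rough sketch of the first pass described above, the listed fields can be mapped to raw LendingClub column names and selected in one step. The mapping below is an assumption inferred from the column names in the raw files (for example, `desc` for the loan description and `addr_state` for the address), and `select_application_columns` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Assumed mapping from the fields listed above to raw column names.
OUTCOME_COL = ["loan_status"]
APPLICATION_COLS = [
    "issue_d", "loan_amnt", "emp_title", "emp_length", "verification_status",
    "home_ownership", "annual_inc", "purpose", "desc", "addr_state", "term",
]
BACKGROUND_COLS = [
    "inq_last_6mths", "open_acc", "pub_rec", "revol_util", "dti",
    "total_acc", "delinq_2yrs", "earliest_cr_line", "mths_since_last_delinq",
]


def select_application_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the outcome and the fields known at application time."""
    wanted = OUTCOME_COL + APPLICATION_COLS + BACKGROUND_COLS
    missing = [c for c in wanted if c not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    # Copy so later column assignments do not write into a slice of df
    return df[wanted].copy()
```

Raising on missing columns is a deliberate choice here: a silently shortened column list would be hard to notice later in the pipeline.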

## Import Libraries

In [1]:
import glob
import math
import time
import warnings
from datetime import datetime

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Base classes for the custom transformers defined below
from sklearn.base import BaseEstimator, TransformerMixin

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Display wide frames without truncation
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


## Load Sample Data 
### The data was downloaded from lendingclub.com

In [2]:
DDIR='/data/work/osa/2018-04-lendingclub/lending-club-loan-data/lendingclub.com/'
loanstats_csv_files = glob.glob(DDIR + 'LoanStats*csv')


print("Loan Stats files = {}".format(loanstats_csv_files))


Loan Stats files = ['/data/work/osa/2018-04-lendingclub/lending-club-loan-data/lendingclub.com/LoanStats3b_securev1.csv', '/data/work/osa/2018-04-lendingclub/lending-club-loan-data/lendingclub.com/LoanStats3c_securev1.csv', '/data/work/osa/2018-04-lendingclub/lending-club-loan-data/lendingclub.com/LoanStats_securev1_2018Q1.csv', '/data/work/osa/2018-04-lendingclub/lending-club-loan-data/lendingclub.com/LoanStats_securev1_2017Q1.csv', '/data/work/osa/2018-04-lendingclub/lending-club-loan-data/lendingclub.com/LoanStats_securev1_2017Q2.csv', '/data/work/osa/2018-04-lendingclub/lending-club-loan-data/lendingclub.com/LoanStats_securev1_2017Q3.csv', '/data/work/osa/2018-04-lendingclub/lending-club-loan-data/lendingclub.com/LoanStats_securev1_2017Q4.csv', '/data/work/osa/2018-04-lendingclub/lending-club-loan-data/lendingclub.com/LoanStats_securev1_2016Q2.csv', '/data/work/osa/2018-04-lendingclub/lending-club-loan-data/lendingclub.com/LoanStats3a_securev1.csv', '/data/work/osa/2018-04-lendingc

In [3]:
loan_list = []
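# Load only the first file for now; use range(len(loanstats_csv_files)) to load them all
for i in range(1):
    # header=1 takes the column names from the second line of the file
    loan_list.append(pd.read_csv(loanstats_csv_files[i], index_col=None, header=1))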



In [4]:
loan_df = pd.concat(loan_list,axis=0)
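
The loop above reads only the first file. A minimal sketch of loading every matched file follows, assuming each file has the same layout as the one loaded here (one leading line before the header row, hence `header=1`). `load_loanstats` is a hypothetical helper name.

```python
import pandas as pd


def load_loanstats(paths):
    """Read each LoanStats CSV and stack the rows into one DataFrame."""
    frames = [
        # header=1: use the second line of the file as the column names
        pd.read_csv(path, header=1, low_memory=False)
        for path in sorted(paths)
    ]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, axis=0, ignore_index=True)
```

With this helper, `loan_df = load_loanstats(loanstats_csv_files)` would stand in for the loop and the `pd.concat` call, at the cost of holding all files in memory at once.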

## Quick Overview

In [5]:
print("There are " + str(len(loan_df)) + " observations in the dataset.")
print("There are " + str(len(loan_df.columns)) + " variables in the dataset.")

print("\n******************Dataset Quick View*****************************\n")
loan_df.head(5)

There are 188183 observations in the dataset.
There are 151 variables in the dataset.

******************Dataset Quick View*****************************



Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,fico_range_low,fico_range_high,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,last_fico_range_high,last_fico_range_low,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_fico_range_low,sec_app_fico_range_high,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,10159548,,15000.0,15000.0,15000.0,36 months,8.90%,476.3,A,A5,aircraft maintenance engineer,2 years,MORTGAGE,63000.0,Not Verified,Dec-2013,Fully Paid,n,https://lendingclub.com/browse/loanDetail.acti...,Borrower added on 12/31/13 > To pay Home Dep...,debt_consolidation,Pay off,334xx,FL,16.51,0.0,Mar-1998,670.0,674.0,0.0,34.0,,8.0,0.0,11431.0,74.2%,29.0,w,0.0,0.0,17146.725104,17146.73,15000.0,2146.73,0.0,0.0,0.0,Jan-2017,476.23,,Dec-2016,669.0,665.0,0.0,34.0,1.0,Individual,,,,0.0,1514.0,272492.0,,,,,,,,,,,,15400.0,,,,3.0,38927.0,2969.0,79.1,0.0,0.0,147.0,189.0,24.0,13.0,4.0,24.0,75.0,12.0,75.0,3.0,3.0,4.0,3.0,10.0,8.0,6.0,17.0,4.0,8.0,0.0,0.0,0.0,0.0,89.3,66.7,0.0,0.0,288195.0,39448.0,14200.0,33895.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
1,10127816,,24000.0,24000.0,24000.0,36 months,13.53%,814.8,B,B5,driver,10+ years,MORTGAGE,100000.0,Verified,Dec-2013,Fully Paid,n,https://lendingclub.com/browse/loanDetail.acti...,Borrower added on 12/31/13 > pay off my othe...,credit_card,credit card,493xx,MI,22.18,0.0,Jan-1989,660.0,664.0,0.0,,,14.0,0.0,21617.0,76.7%,39.0,w,0.0,0.0,28652.21,28652.21,24000.0,4652.21,0.0,0.0,0.0,Dec-2015,10726.61,,Nov-2017,699.0,695.0,0.0,,1.0,Individual,,,,0.0,539.0,199834.0,,,,,,,,,,,,28200.0,,,,7.0,15372.0,4822.0,77.6,0.0,0.0,179.0,299.0,18.0,7.0,3.0,18.0,,7.0,,0.0,3.0,5.0,5.0,10.0,17.0,8.0,19.0,5.0,14.0,0.0,0.0,0.0,2.0,100.0,75.0,0.0,0.0,229072.0,61397.0,21500.0,58847.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
2,10139658,,12000.0,12000.0,12000.0,36 months,13.53%,407.4,B,B5,On road manager,10+ years,RENT,40000.0,Source Verified,Dec-2013,Fully Paid,n,https://lendingclub.com/browse/loanDetail.acti...,,debt_consolidation,Debt consolidation,871xx,NM,16.94,0.0,Oct-1998,660.0,664.0,0.0,53.0,33.0,7.0,2.0,5572.0,68.8%,32.0,w,0.0,0.0,13359.776858,13359.78,12000.0,1359.78,0.0,0.0,0.0,Sep-2015,119.17,,Jul-2017,739.0,735.0,0.0,53.0,1.0,Individual,,,,0.0,15386.0,13605.0,,,,,,,,,,,,8100.0,,,,4.0,2268.0,1428.0,79.6,0.0,0.0,124.0,182.0,1.0,1.0,0.0,11.0,53.0,17.0,53.0,6.0,2.0,2.0,3.0,14.0,8.0,6.0,24.0,2.0,7.0,0.0,0.0,0.0,2.0,81.2,33.3,0.0,0.0,18130.0,13605.0,7000.0,10030.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
3,10129477,,14000.0,14000.0,14000.0,36 months,12.85%,470.71,B,B4,Assistant Director - Human Resources,4 years,RENT,88000.0,Not Verified,Dec-2013,Fully Paid,n,https://lendingclub.com/browse/loanDetail.acti...,,debt_consolidation,Debt consolidation,282xx,NC,10.02,1.0,Jun-1988,670.0,674.0,0.0,16.0,115.0,6.0,1.0,3686.0,81.9%,14.0,f,0.0,0.0,16945.318783,16945.32,14000.0,2945.32,0.0,0.0,0.0,Jan-2017,470.47,,Nov-2017,694.0,690.0,0.0,,1.0,Individual,,,,0.0,0.0,17672.0,,,,,,,,,,,,4500.0,,,,3.0,2945.0,480.0,87.7,0.0,0.0,111.0,103.0,24.0,13.0,0.0,38.0,16.0,,16.0,0.0,3.0,4.0,3.0,9.0,3.0,4.0,10.0,4.0,6.0,0.0,0.0,0.0,0.0,78.6,100.0,1.0,0.0,31840.0,17672.0,3900.0,27340.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
4,10119623,,12000.0,12000.0,12000.0,36 months,11.99%,398.52,B,B3,LTC,10+ years,MORTGAGE,130000.0,Source Verified,Dec-2013,Fully Paid,n,https://lendingclub.com/browse/loanDetail.acti...,,debt_consolidation,Debt consolidation,809xx,CO,13.03,0.0,Nov-1997,715.0,719.0,1.0,,,9.0,0.0,10805.0,67%,19.0,f,0.0,0.0,14346.47905,14346.48,12000.0,2346.48,0.0,0.0,0.0,Jan-2017,398.28,,Nov-2017,734.0,730.0,0.0,,1.0,Individual,,,,0.0,0.0,327264.0,,,,,,,,,,,,16200.0,,,,4.0,36362.0,3567.0,93.0,0.0,0.0,173.0,193.0,4.0,4.0,3.0,85.0,,4.0,,0.0,3.0,5.0,4.0,4.0,8.0,5.0,8.0,5.0,9.0,,0.0,0.0,3.0,100.0,1.0,0.0,0.0,365874.0,44327.0,10700.0,57674.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,


### Descriptive Statistics

In [6]:
print("\n******************Descriptive statistics*****************************\n")
# Note: describe() summarizes only the numeric (non-categorical) columns
loan_df.describe()


******************Descriptive statistics*****************************



Unnamed: 0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,installment,annual_inc,dti,delinq_2yrs,fico_range_low,fico_range_high,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,total_acc,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_amnt,last_fico_range_high,last_fico_range_low,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_fico_range_low,sec_app_fico_range_high,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,deferral_term,hardship_amount,hardship_length,hardship_dpd,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,settlement_amount,settlement_percentage,settlement_term
count,0.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,80608.0,17474.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,32516.0,188181.0,0.0,0.0,0.0,188181.0,160440.0,160440.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,160440.0,0.0,0.0,0.0,180686.0,160434.0,179156.0,179069.0,188181.0,188181.0,154309.0,160439.0,160439.0,160440.0,180686.0,179353.0,36751.0,160313.0,54447.0,160440.0,160440.0,160440.0,172126.0,160440.0,160440.0,160440.0,160440.0,160440.0,172126.0,160184.0,160440.0,160440.0,160440.0,160287.0,179153.0,188181.0,188181.0,160440.0,180686.0,180686.0,160440.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,66.0,66.0,66.0,66.0,58.0,66.0,66.0,1719.0,1719.0,1719.0
mean,,14354.139366,14351.61985,14339.617827,443.734012,72233.28,17.060171,0.239626,696.810119,700.810225,0.803678,34.977806,86.012762,11.000808,0.106212,16318.4,24.543456,190.190564,190.103135,16111.869584,16098.932265,12623.140466,3309.895078,0.957005,177.877027,19.660953,4143.202453,681.973807,669.38971,0.003172,41.796838,1.0,,,,0.002715,76.736188,137324.1,,,,,,,,,,,,29891.6,,,,3.92953,13796.105776,8262.724893,66.830442,0.005266,8.366961,125.076787,178.488491,14.108459,8.938581,1.81102,25.6599,40.818399,6.99123,36.612118,0.329506,3.755192,5.675374,4.66582,9.018979,7.726427,8.094459,14.950405,5.693393,11.085368,0.000474,0.002013,0.063737,1.788014,95.399571,53.558127,0.084785,0.01404,165545.3,42883.77,20238.462072,34389.56,,,,,,,,,,,,,,3.0,103.628788,3.0,12.333333,309.497069,6695.522273,178.809091,4413.133839,45.854148,3.858057
std,,8114.766207,8112.60861,8107.009785,242.648651,51824.59,7.597634,0.70373,29.958302,29.958829,1.032934,21.612651,26.433704,4.607592,0.406336,19287.61,11.070113,995.52633,995.144027,10464.410455,10457.218577,8090.530507,3498.98409,7.044596,790.222132,116.366908,5855.798972,78.105496,115.82765,0.059182,20.997098,0.0,,,,0.058392,859.656048,150754.0,,,,,,,,,,,,37769.69,,,,2.65703,16382.711908,13175.612008,26.111872,0.08169,524.9348,50.850541,88.085042,16.149764,9.696242,2.193459,29.48927,21.688092,5.88041,21.741702,0.952602,2.079794,2.918935,2.486201,4.808095,6.578654,3.887859,7.384224,2.923143,4.622629,0.022895,0.049326,0.367567,1.516775,7.391559,34.148169,0.290168,0.241542,167253.3,40177.89,18883.558196,36879.63,,,,,,,,,,,,,,0.0,47.820792,0.0,9.785913,127.966898,2893.707395,188.466433,3239.816137,9.675271,7.235815
min,,1000.0,1000.0,950.0,4.93,4800.0,0.0,0.0,660.0,664.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,0.0,0.0,0.0,,,,,,,,,,,,0.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,3.0,30.22,3.0,0.0,90.66,2221.64,0.13,181.0,0.2,0.0
25%,,8000.0,8000.0,8000.0,269.98,45000.0,11.34,0.0,675.0,679.0,0.0,17.0,69.0,8.0,0.0,7132.0,16.0,0.0,0.0,8261.250341,8253.2,6275.0,1149.4,0.0,0.0,0.0,371.06,639.0,635.0,0.0,25.0,1.0,,,,0.0,0.0,27470.0,,,,,,,,,,,,13900.0,,,,2.0,3013.0,1062.0,49.5,0.0,0.0,95.0,116.0,4.0,3.0,0.0,7.0,23.0,2.0,18.0,0.0,2.0,4.0,3.0,6.0,3.0,5.0,10.0,4.0,8.0,0.0,0.0,0.0,1.0,93.0,25.0,0.0,0.0,44795.0,18904.25,7800.0,10049.5,,,,,,,,,,,,,,3.0,67.66,3.0,0.0,210.3075,4480.26,74.91,2058.5,40.14,0.0
50%,,12175.0,12125.0,12100.0,398.21,62000.0,16.78,0.0,690.0,694.0,0.0,32.0,95.0,10.0,0.0,12436.0,23.0,0.0,0.0,13520.332276,13517.36,10800.0,2123.76,0.0,0.0,0.0,940.86,694.0,690.0,0.0,41.0,1.0,,,,0.0,0.0,80760.5,,,,,,,,,,,,23000.0,,,,4.0,7775.0,3523.0,72.2,0.0,0.0,128.0,161.0,9.0,6.0,1.0,15.0,40.0,6.0,34.0,0.0,3.0,5.0,4.0,8.0,6.0,7.0,14.0,5.0,10.0,0.0,0.0,0.0,2.0,100.0,50.0,0.0,0.0,108500.0,32963.0,14700.0,25751.5,,,,,,,,,,,,,,3.0,104.795,3.0,12.0,314.385,6225.74,150.18,3690.8,45.0,0.0
75%,,20000.0,20000.0,19975.0,578.31,87000.0,22.58,0.0,710.0,714.0,1.0,50.0,106.0,14.0,0.0,20669.0,31.0,0.0,0.0,22087.658512,22063.7,18000.0,3999.1,0.0,0.0,0.0,6079.37,734.0,730.0,0.0,58.0,1.0,,,,0.0,0.0,208170.2,,,,,,,,,,,,37200.0,,,,5.0,19668.0,9703.25,89.0,0.0,0.0,151.0,222.0,17.0,11.0,3.0,33.0,58.0,11.0,52.0,0.0,5.0,7.0,6.0,12.0,10.0,10.0,19.0,7.0,14.0,0.0,0.0,0.0,3.0,100.0,80.0,0.0,0.0,243732.8,53859.75,26500.0,46935.25,,,,,,,,,,,,,,3.0,142.935,3.0,21.0,428.805,8410.3975,222.5325,6000.0,50.0,3.0
max,,35000.0,35000.0,35000.0,1408.13,7141778.0,34.99,29.0,845.0,850.0,8.0,156.0,121.0,62.0,54.0,2568995.0,105.0,13373.34,13354.23,61557.694036,61513.72,35000.0,27074.58,367.6,39444.37,6124.938,35760.2,850.0,845.0,4.0,165.0,1.0,,,,5.0,88303.0,8000078.0,,,,,,,,,,,,9999999.0,,,,40.0,958084.0,497445.0,339.6,5.0,65000.0,649.0,760.0,264.0,211.0,31.0,554.0,152.0,24.0,165.0,29.0,30.0,37.0,35.0,65.0,66.0,58.0,94.0,37.0,62.0,2.0,4.0,24.0,25.0,100.0,100.0,8.0,53.0,9999999.0,2644442.0,522210.0,1214546.0,,,,,,,,,,,,,,3.0,247.4,3.0,29.0,531.72,15761.17,991.79,33601.0,184.36,36.0
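
The `count` row above shows that many columns are mostly or entirely empty. One way to quantify that directly is to compute the fraction of missing values per column. The sketch below uses a hypothetical helper, `missing_fraction`, and a toy frame rather than the loan data.

```python
import numpy as np
import pandas as pd


def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Fraction of NaN values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


toy = pd.DataFrame({
    "loan_amnt": [15000.0, 24000.0, 12000.0, 14000.0],
    "mths_since_last_delinq": [34.0, np.nan, 53.0, 16.0],
    "annual_inc_joint": [np.nan, np.nan, np.nan, np.nan],
})
print(missing_fraction(toy))
```

Applied to `loan_df`, the same call would rank the columns from emptiest to fullest, which makes it easier to choose a cutoff for dropping sparse fields.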


### 1. Keep Variables of Interest

In [7]:
#[print('"'+x+'",') for x in loan_df.columns]

In [8]:
# sort externally ...
keep_list = ["id",
  "addr_state",
  "annual_inc",
  "collections_12_mths_ex_med",
  "delinq_2yrs",
  "desc",
  "dti",
  "emp_length",
  "emp_title",
  "fico_range_high",
  "fico_range_low",
  "funded_amnt",
  "funded_amnt_inv",
  "grade",
  "home_ownership",
  "inq_last_6mths",
  "installment",
  "int_rate",
  "issue_d",
  "loan_amnt",
  "loan_status",
  "open_acc",
  "purpose",
  "pymnt_plan",
  "revol_bal",
  "revol_util",
  "sub_grade",
  "term",
  "total_acc",
  "total_pymnt",
  "verification_status",
  "zip_code",
   "acc_now_delinq",
   "acc_open_past_24mths",
   "all_util",
   "annual_inc_joint",
   "application_type",
   "avg_cur_bal",
   "bc_open_to_buy",
   "bc_util",
   "chargeoff_within_12_mths",
   "collection_recovery_fee",
   "debt_settlement_flag",
   "debt_settlement_flag_date",
   "deferral_term",
   "delinq_amnt",
   "disbursement_method",
   "dti_joint",
   "earliest_cr_line",
   "hardship_amount",
   "hardship_dpd",
   "hardship_end_date",
   "hardship_flag",
   "hardship_last_payment_amount",
   "hardship_length",
   "hardship_loan_status",
   "hardship_payoff_balance_amount",
   "hardship_reason",
   "hardship_start_date",
   "hardship_status",
   "hardship_type",
   "il_util",
   "initial_list_status",
   "inq_fi",
   "inq_last_12m",
   "last_credit_pull_d",
   "last_fico_range_high",
   "last_fico_range_low",
   "last_pymnt_amnt",
   "last_pymnt_d",
   "max_bal_bc",
   "mo_sin_old_il_acct",
   "mo_sin_old_rev_tl_op",
   "mo_sin_rcnt_rev_tl_op",
   "mo_sin_rcnt_tl",
   "mort_acc",
   "mths_since_last_delinq",
   "mths_since_last_major_derog",
   "mths_since_last_record",
   "mths_since_rcnt_il",
   "mths_since_recent_bc",
   "mths_since_recent_bc_dlq",
   "mths_since_recent_inq",
   "mths_since_recent_revol_delinq",
   "next_pymnt_d",
   "num_accts_ever_120_pd",
   "num_actv_bc_tl",
   "num_actv_rev_tl",
   "num_bc_sats",
   "num_bc_tl",
   "num_il_tl",
   "num_op_rev_tl",
   "num_rev_accts",
   "num_rev_tl_bal_gt_0",
   "num_sats",
   "num_tl_120dpd_2m",
   "num_tl_30dpd",
   "num_tl_90g_dpd_24m",
   "num_tl_op_past_12m",
   "open_acc_6m",
   "open_act_il",
   "open_il_12m",
   "open_il_24m",
   "open_rv_12m",
   "open_rv_24m",
   "orig_projected_additional_accrued_interest",
   "out_prncp",
   "out_prncp_inv",
   "payment_plan_start_date",
   "pct_tl_nvr_dlq",
   "percent_bc_gt_75",
   "policy_code",
   "pub_rec",
   "pub_rec_bankruptcies",
   "recoveries",
   "revol_bal_joint",
   "sec_app_chargeoff_within_12_mths",
   "sec_app_collections_12_mths_ex_med",
   "sec_app_earliest_cr_line",
   "sec_app_fico_range_high",
   "sec_app_fico_range_low",
   "sec_app_inq_last_6mths",
   "sec_app_mort_acc",
   "sec_app_mths_since_last_major_derog",
   "sec_app_num_rev_accts",
   "sec_app_open_acc",
   "sec_app_open_act_il",
   "sec_app_revol_util",
   "settlement_amount",
   "settlement_date",
   "settlement_percentage",
   "settlement_status",
   "settlement_term",
   "tax_liens",
   "title",
   "tot_coll_amt",
   "tot_cur_bal",
   "tot_hi_cred_lim",
   "total_bal_ex_mort",
   "total_bal_il",
   "total_bc_limit",
   "total_cu_tl",
   "total_il_high_credit_limit",
   "total_pymnt_inv",
   "total_rec_int",
   "total_rec_late_fee",
   "total_rec_prncp",
   "total_rev_hi_lim",
   "url",
   "verification_status_joint",
   "member_id",
]


# Copy so that adding columns later does not write into a slice of loan_df
loan_short_df = loan_df[keep_list].copy()

### 2. Encode dependent variable from loan_status 

In [9]:
# Collapse the loan_status values that indicate a bad loan into a single 1/0 default flag
loan_short_df['default'] = loan_short_df['loan_status'].isin([
    'Default',
    'Charged Off',
    'Late (31-120 days)',
    'Late (16-30 days)',
    'Does not meet the credit policy. Status:Charged Off'
]).astype(int)
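
To illustrate the encoding, the sketch below applies the same status list to a few example rows. `DEFAULT_STATUSES` and `encode_default` are hypothetical names introduced here. Any status outside the list, and any missing status, maps to 0.

```python
import pandas as pd

# Statuses treated as a default, copied from the cell above.
DEFAULT_STATUSES = [
    "Default",
    "Charged Off",
    "Late (31-120 days)",
    "Late (16-30 days)",
    "Does not meet the credit policy. Status:Charged Off",
]


def encode_default(status: pd.Series) -> pd.Series:
    """Return 1 where the loan status counts as a default, else 0."""
    return status.isin(DEFAULT_STATUSES).astype(int)


statuses = pd.Series(["Fully Paid", "Charged Off", "Current", "Late (16-30 days)"])
print(encode_default(statuses).tolist())  # [0, 1, 0, 1]
```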

In [10]:
print(loan_short_df.shape)
loan_short_df.describe()



(188183, 152)


Unnamed: 0,annual_inc,collections_12_mths_ex_med,delinq_2yrs,dti,fico_range_high,fico_range_low,funded_amnt,funded_amnt_inv,inq_last_6mths,installment,loan_amnt,open_acc,revol_bal,total_acc,total_pymnt,acc_now_delinq,acc_open_past_24mths,all_util,annual_inc_joint,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,collection_recovery_fee,deferral_term,delinq_amnt,dti_joint,hardship_amount,hardship_dpd,hardship_last_payment_amount,hardship_length,hardship_payoff_balance_amount,il_util,inq_fi,inq_last_12m,last_fico_range_high,last_fico_range_low,last_pymnt_amnt,max_bal_bc,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_last_delinq,mths_since_last_major_derog,mths_since_last_record,mths_since_rcnt_il,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,open_acc_6m,open_act_il,open_il_12m,open_il_24m,open_rv_12m,open_rv_24m,orig_projected_additional_accrued_interest,out_prncp,out_prncp_inv,pct_tl_nvr_dlq,percent_bc_gt_75,policy_code,pub_rec,pub_rec_bankruptcies,recoveries,revol_bal_joint,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_earliest_cr_line,sec_app_fico_range_high,sec_app_fico_range_low,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_mths_since_last_major_derog,sec_app_num_rev_accts,sec_app_open_acc,sec_app_open_act_il,sec_app_revol_util,settlement_amount,settlement_percentage,settlement_term,tax_liens,tot_coll_amt,tot_cur_bal,tot_hi_cred_lim,total_bal_ex_mort,total_bal_il,total_bc_limit,total_cu_tl,total_il_high_credit_limit,total_pymnt_inv,total_rec_int,total_rec_late_fee,total_rec_prncp,total_rev_hi_lim,verification_status_joint,member_id,default
count,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,180686.0,0.0,0.0,160434.0,179156.0,179069.0,188181.0,188181.0,66.0,188181.0,0.0,66.0,66.0,66.0,66.0,66.0,0.0,0.0,0.0,188181.0,188181.0,188181.0,0.0,154309.0,160439.0,160439.0,160440.0,180686.0,80608.0,32516.0,17474.0,0.0,179353.0,36751.0,160313.0,54447.0,160440.0,160440.0,160440.0,172126.0,160440.0,160440.0,160440.0,160440.0,160440.0,172126.0,160184.0,160440.0,160440.0,160440.0,0.0,0.0,0.0,0.0,0.0,0.0,58.0,188181.0,188181.0,160287.0,179153.0,188181.0,188181.0,188181.0,188181.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1719.0,1719.0,1719.0,188181.0,160440.0,160440.0,160440.0,180686.0,0.0,180686.0,0.0,160440.0,188181.0,188181.0,188181.0,188181.0,160440.0,0.0,0.0,188183.0
mean,72233.28,0.003172,0.239626,17.060171,700.810225,696.810119,14351.61985,14339.617827,0.803678,443.734012,14354.139366,11.000808,16318.4,24.543456,16111.869584,0.002715,3.92953,,,13796.105776,8262.724893,66.830442,0.005266,19.660953,3.0,8.366961,,103.628788,12.333333,178.809091,3.0,6695.522273,,,,681.973807,669.38971,4143.202453,,125.076787,178.488491,14.108459,8.938581,1.81102,34.977806,41.796838,86.012762,,25.6599,40.818399,6.99123,36.612118,0.329506,3.755192,5.675374,4.66582,9.018979,7.726427,8.094459,14.950405,5.693393,11.085368,0.000474,0.002013,0.063737,1.788014,,,,,,,309.497069,190.190564,190.103135,95.399571,53.558127,1.0,0.106212,0.084785,177.877027,,,,,,,,,,,,,,4413.133839,45.854148,3.858057,0.01404,76.736188,137324.1,165545.3,42883.77,,20238.462072,,34389.56,16098.932265,3309.895078,0.957005,12623.140466,29891.6,,,0.156295
std,51824.59,0.059182,0.70373,7.597634,29.958829,29.958302,8112.60861,8107.009785,1.032934,242.648651,8114.766207,4.607592,19287.61,11.070113,10464.410455,0.058392,2.65703,,,16382.711908,13175.612008,26.111872,0.08169,116.366908,0.0,524.9348,,47.820792,9.785913,188.466433,0.0,2893.707395,,,,78.105496,115.82765,5855.798972,,50.850541,88.085042,16.149764,9.696242,2.193459,21.612651,20.997098,26.433704,,29.48927,21.688092,5.88041,21.741702,0.952602,2.079794,2.918935,2.486201,4.808095,6.578654,3.887859,7.384224,2.923143,4.622629,0.022895,0.049326,0.367567,1.516775,,,,,,,127.966898,995.52633,995.144027,7.391559,34.148169,0.0,0.406336,0.290168,790.222132,,,,,,,,,,,,,,3239.816137,9.675271,7.235815,0.241542,859.656048,150754.0,167253.3,40177.89,,18883.558196,,36879.63,10457.218577,3498.98409,7.044596,8090.530507,37769.69,,,0.363135
min,4800.0,0.0,0.0,0.0,664.0,660.0,1000.0,950.0,0.0,4.93,1000.0,0.0,0.0,2.0,0.0,0.0,0.0,,,0.0,0.0,0.0,0.0,0.0,3.0,0.0,,30.22,0.0,0.13,3.0,2221.64,,,,0.0,0.0,0.0,,0.0,5.0,0.0,0.0,0.0,0.0,0.0,1.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,90.66,0.0,0.0,15.0,0.0,1.0,0.0,0.0,0.0,,,,,,,,,,,,,,181.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,,,0.0
25%,45000.0,0.0,0.0,11.34,679.0,675.0,8000.0,8000.0,0.0,269.98,8000.0,8.0,7132.0,16.0,8261.250341,0.0,2.0,,,3013.0,1062.0,49.5,0.0,0.0,3.0,0.0,,67.66,0.0,74.91,3.0,4480.26,,,,639.0,635.0,371.06,,95.0,116.0,4.0,3.0,0.0,17.0,25.0,69.0,,7.0,23.0,2.0,18.0,0.0,2.0,4.0,3.0,6.0,3.0,5.0,10.0,4.0,8.0,0.0,0.0,0.0,1.0,,,,,,,210.3075,0.0,0.0,93.0,25.0,1.0,0.0,0.0,0.0,,,,,,,,,,,,,,2058.5,40.14,0.0,0.0,0.0,27470.0,44795.0,18904.25,,7800.0,,10049.5,8253.2,1149.4,0.0,6275.0,13900.0,,,0.0
50%,62000.0,0.0,0.0,16.78,694.0,690.0,12125.0,12100.0,0.0,398.21,12175.0,10.0,12436.0,23.0,13520.332276,0.0,4.0,,,7775.0,3523.0,72.2,0.0,0.0,3.0,0.0,,104.795,12.0,150.18,3.0,6225.74,,,,694.0,690.0,940.86,,128.0,161.0,9.0,6.0,1.0,32.0,41.0,95.0,,15.0,40.0,6.0,34.0,0.0,3.0,5.0,4.0,8.0,6.0,7.0,14.0,5.0,10.0,0.0,0.0,0.0,2.0,,,,,,,314.385,0.0,0.0,100.0,50.0,1.0,0.0,0.0,0.0,,,,,,,,,,,,,,3690.8,45.0,0.0,0.0,0.0,80760.5,108500.0,32963.0,,14700.0,,25751.5,13517.36,2123.76,0.0,10800.0,23000.0,,,0.0
75%,87000.0,0.0,0.0,22.58,714.0,710.0,20000.0,19975.0,1.0,578.31,20000.0,14.0,20669.0,31.0,22087.658512,0.0,5.0,,,19668.0,9703.25,89.0,0.0,0.0,3.0,0.0,,142.935,21.0,222.5325,3.0,8410.3975,,,,734.0,730.0,6079.37,,151.0,222.0,17.0,11.0,3.0,50.0,58.0,106.0,,33.0,58.0,11.0,52.0,0.0,5.0,7.0,6.0,12.0,10.0,10.0,19.0,7.0,14.0,0.0,0.0,0.0,3.0,,,,,,,428.805,0.0,0.0,100.0,80.0,1.0,0.0,0.0,0.0,,,,,,,,,,,,,,6000.0,50.0,3.0,0.0,0.0,208170.2,243732.8,53859.75,,26500.0,,46935.25,22063.7,3999.1,0.0,18000.0,37200.0,,,0.0
max,7141778.0,4.0,29.0,34.99,850.0,845.0,35000.0,35000.0,8.0,1408.13,35000.0,62.0,2568995.0,105.0,61557.694036,5.0,40.0,,,958084.0,497445.0,339.6,5.0,6124.938,3.0,65000.0,,247.4,29.0,991.79,3.0,15761.17,,,,850.0,845.0,35760.2,,649.0,760.0,264.0,211.0,31.0,156.0,165.0,121.0,,554.0,152.0,24.0,165.0,29.0,30.0,37.0,35.0,65.0,66.0,58.0,94.0,37.0,62.0,2.0,4.0,24.0,25.0,,,,,,,531.72,13373.34,13354.23,100.0,100.0,1.0,54.0,8.0,39444.37,,,,,,,,,,,,,,33601.0,184.36,36.0,53.0,88303.0,8000078.0,9999999.0,2644442.0,,522210.0,,1214546.0,61513.72,27074.58,367.6,35000.0,9999999.0,,,1.0


### Remove columns with no data

In [11]:
class useless_columns(BaseEstimator, TransformerMixin):
    """Drop numeric columns that contain no data at all."""
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn
    def transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame)
        # describe() reports the non-null count of every numeric column
        counts = X.describe().loc['count']
        # A count of 0 means the column is entirely empty
        remove_cols = counts[counts == 0].index.tolist()
        return X.drop(columns=remove_cols)



uc = useless_columns()
loan_short_df = uc.transform(loan_short_df)
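
For comparison, pandas can drop all-empty columns directly with `DataFrame.dropna(axis=1, how="all")`. The sketch below is an alternative on a toy frame, not the notebook's method. Unlike a `describe()`-based check, which by default covers only numeric columns, it also considers non-numeric ones.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "loan_amnt": [15000.0, 24000.0],
    "member_id": [np.nan, np.nan],        # entirely empty
    "desc": ["Pay off", None],            # partly empty, so it is kept
})

# Drop every column whose values are all missing.
cleaned = toy.dropna(axis=1, how="all")
print(list(cleaned.columns))  # ['loan_amnt', 'desc']
```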

In [12]:
#loan_short_df[loan_short_df['addr_state'].isnull()]

### 3. Handle Missing Values

In [13]:
# Replacing missing values with 0's
#loan_short_df['EMP_LISTED'] = [1 if x != None else 0 for x in loan_sub_kp['EMP_TITLE']]
#loan_short_df['EMPTY_DESC'] = [1 if x == None else 0 for x in loan_sub_kp['DESC']]
#loan_short_df['EMP_NA'] = [1 if x == "n/a" else 0 for x in loan_sub_kp['EMP_LENGTH']]
#loan_short_df['EMP_LENGTH'] = ['Other' if x == "n/a" else x for x in loan_sub_kp['EMP_LENGTH']]
#loan_short_df['DELING_EVER'] = [0 if math.isnan(x) else 1 for x in loan_sub_kp['MTHS_SINCE_LAST_DELINQ']]
#loan_short_df['HOME_OWNERSHIP'] = ["OTHER" if x == None else x for x in loan_sub_kp['HOME_OWNERSHIP'] ]


class missing_features(BaseEstimator, TransformerMixin):
    """Drop junk rows and fill selected missing values."""
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn
    def transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame)
        # Remove the junk lines at the bottom of the CSV files (they have no addr_state);
        # copy so the assignment below does not write into a slice of the input
        X = X[X['addr_state'].notnull()].copy()
        X['total_il_high_credit_limit'] = X['total_il_high_credit_limit'].fillna(0)
        return X

mf = missing_features()
loan_short_df = mf.transform(loan_short_df)
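
The commented-out lines above outline the computed indicator features (`EMP_LISTED`, `EMPTY_DESC`, `EMP_NA`, `DELING_EVER`). A minimal sketch of how they might be derived with the lowercase column names used in this notebook follows. The function name `add_indicator_features` is hypothetical, and treating both a missing value and the string `n/a` as an unreported employment length is an assumption.

```python
import numpy as np
import pandas as pd


def add_indicator_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add 0/1 flags derived from missingness in the application fields."""
    out = df.copy()
    # 1 if the borrower listed an employer or job title
    out["EMP_LISTED"] = out["emp_title"].notna().astype(int)
    # 1 if the loan description is empty
    out["EMPTY_DESC"] = out["desc"].isna().astype(int)
    # 1 if employment length was not reported (assumed: NaN or "n/a")
    emp_na = out["emp_length"].isna() | (out["emp_length"] == "n/a")
    out["EMP_NA"] = emp_na.astype(int)
    # 1 if any delinquency is on record (months-since value present)
    out["DELING_EVER"] = out["mths_since_last_delinq"].notna().astype(int)
    return out


toy = pd.DataFrame({
    "emp_title": ["driver", None],
    "desc": [None, "pay off my other loans"],
    "emp_length": ["10+ years", "n/a"],
    "mths_since_last_delinq": [np.nan, 34.0],
})
flags = add_indicator_features(toy)
print(flags[["EMP_LISTED", "EMPTY_DESC", "EMP_NA", "DELING_EVER"]])
```

Using `isna()` and `notna()` rather than comparing against `None` handles both `None` and `NaN`, which is how pandas represents missing values read from a CSV.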

In [14]:
# Impute the median for now ..
#from sklearn.impute import SimpleImputer
#imputer = SimpleImputer(strategy='median')
# page 60 ...
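
As a sketch of the median imputation planned above, missing numeric values can be filled with each column's median. This version uses plain pandas rather than a scikit-learn imputer, and `impute_median` is a hypothetical helper. The notebook does not yet say which columns to impute, so the column list is passed in explicitly.

```python
import numpy as np
import pandas as pd


def impute_median(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Fill NaNs in the given numeric columns with the column median."""
    out = df.copy()
    medians = out[columns].median()  # NaNs are skipped by default
    out[columns] = out[columns].fillna(medians)
    return out


toy = pd.DataFrame({
    "bc_util": [79.1, np.nan, 79.6, 87.7],
    "avg_cur_bal": [38927.0, 15372.0, np.nan, 2945.0],
})
filled = impute_median(toy, ["bc_util", "avg_cur_bal"])
print(filled)
```

When the data is later split for modeling, the medians should be computed on the training portion only and reused for the test portion, so that no information leaks from the test set.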

### 4. Data Preparation - Handle Time Objects

In [15]:
class clean_time_columns(BaseEstimator, TransformerMixin):
    """Parse the date columns and derive the credit-history length in days."""
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn
    def transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame)
        X = X.copy()  # do not modify the caller's frame
        # Parse 'Mon-YYYY' strings (e.g. 'Dec-2013') into datetimes
        X['issue_d'] = pd.to_datetime(X['issue_d'], format='%b-%Y')
        X['earliest_cr_line'] = pd.to_datetime(X['earliest_cr_line'], format='%b-%Y')
        # Days between the earliest credit line and the loan issue date
        X['time_history'] = (X['issue_d'] - X['earliest_cr_line']).dt.days
        return X

cln = clean_time_columns()
loan_short_df = cln.transform(loan_short_df)
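
To make the date handling concrete, the sketch below parses `Mon-YYYY` strings taken from the first two rows of the data preview and computes the credit-history length in days. It uses `pd.to_datetime` with an explicit format and the `.dt.days` accessor. The toy frame is illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "issue_d": ["Dec-2013", "Dec-2013"],
    "earliest_cr_line": ["Mar-1998", "Jan-1989"],
})

# Parse "Mon-YYYY" strings; each date becomes the first day of that month.
issue = pd.to_datetime(toy["issue_d"], format="%b-%Y")
earliest = pd.to_datetime(toy["earliest_cr_line"], format="%b-%Y")

# Length of credit history at issue time, in whole days.
toy["time_history"] = (issue - earliest).dt.days
print(toy["time_history"].tolist())  # [5754, 9100]
```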


### 5. Handle Null Values (impute key columns later)

In [16]:
loan_short_df.describe()

Unnamed: 0,annual_inc,collections_12_mths_ex_med,delinq_2yrs,dti,fico_range_high,fico_range_low,funded_amnt,funded_amnt_inv,inq_last_6mths,installment,loan_amnt,open_acc,revol_bal,total_acc,total_pymnt,acc_now_delinq,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,collection_recovery_fee,deferral_term,delinq_amnt,hardship_amount,hardship_dpd,hardship_last_payment_amount,hardship_length,hardship_payoff_balance_amount,last_fico_range_high,last_fico_range_low,last_pymnt_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_last_delinq,mths_since_last_major_derog,mths_since_last_record,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,orig_projected_additional_accrued_interest,out_prncp,out_prncp_inv,pct_tl_nvr_dlq,percent_bc_gt_75,policy_code,pub_rec,pub_rec_bankruptcies,recoveries,settlement_amount,settlement_percentage,settlement_term,tax_liens,tot_coll_amt,tot_cur_bal,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,total_pymnt_inv,total_rec_int,total_rec_late_fee,total_rec_prncp,total_rev_hi_lim,default,time_history
count,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,180686.0,160434.0,179156.0,179069.0,188181.0,188181.0,66.0,188181.0,66.0,66.0,66.0,66.0,66.0,188181.0,188181.0,188181.0,154309.0,160439.0,160439.0,160440.0,180686.0,80608.0,32516.0,17474.0,179353.0,36751.0,160313.0,54447.0,160440.0,160440.0,160440.0,172126.0,160440.0,160440.0,160440.0,160440.0,160440.0,172126.0,160184.0,160440.0,160440.0,160440.0,58.0,188181.0,188181.0,160287.0,179153.0,188181.0,188181.0,188181.0,188181.0,1719.0,1719.0,1719.0,188181.0,160440.0,160440.0,160440.0,180686.0,180686.0,188181.0,188181.0,188181.0,188181.0,188181.0,160440.0,188181.0,188181.0
mean,72233.28,0.003172,0.239626,17.060171,700.810225,696.810119,14351.61985,14339.617827,0.803678,443.734012,14354.139366,11.000808,16318.4,24.543456,16111.869584,0.002715,3.92953,13796.105776,8262.724893,66.830442,0.005266,19.660953,3.0,8.366961,103.628788,12.333333,178.809091,3.0,6695.522273,681.973807,669.38971,4143.202453,125.076787,178.488491,14.108459,8.938581,1.81102,34.977806,41.796838,86.012762,25.6599,40.818399,6.99123,36.612118,0.329506,3.755192,5.675374,4.66582,9.018979,7.726427,8.094459,14.950405,5.693393,11.085368,0.000474,0.002013,0.063737,1.788014,309.497069,190.190564,190.103135,95.399571,53.558127,1.0,0.106212,0.084785,177.877027,4413.133839,45.854148,3.858057,0.01404,76.736188,137324.1,165545.3,42883.77,20238.462072,29319.97,16098.932265,3309.895078,0.957005,12623.140466,29891.6,0.156296,5679.545783
std,51824.59,0.059182,0.70373,7.597634,29.958829,29.958302,8112.60861,8107.009785,1.032934,242.648651,8114.766207,4.607592,19287.61,11.070113,10464.410455,0.058392,2.65703,16382.711908,13175.612008,26.111872,0.08169,116.366908,0.0,524.9348,47.820792,9.785913,188.466433,0.0,2893.707395,78.105496,115.82765,5855.798972,50.850541,88.085042,16.149764,9.696242,2.193459,21.612651,20.997098,26.433704,29.48927,21.688092,5.88041,21.741702,0.952602,2.079794,2.918935,2.486201,4.808095,6.578654,3.887859,7.384224,2.923143,4.622629,0.022895,0.049326,0.367567,1.516775,127.966898,995.52633,995.144027,7.391559,34.148169,0.0,0.406336,0.290168,790.222132,3239.816137,9.675271,7.235815,0.241542,859.656048,150754.0,167253.3,40177.89,18883.558196,36169.66,10457.218577,3498.98409,7.044596,8090.530507,37769.69,0.363137,2578.888982
min,4800.0,0.0,0.0,0.0,664.0,660.0,1000.0,950.0,0.0,4.93,1000.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,30.22,0.0,0.13,3.0,2221.64,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,90.66,0.0,0.0,15.0,0.0,1.0,0.0,0.0,0.0,181.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1095.0
25%,45000.0,0.0,0.0,11.34,679.0,675.0,8000.0,8000.0,0.0,269.98,8000.0,8.0,7132.0,16.0,8261.250341,0.0,2.0,3013.0,1062.0,49.5,0.0,0.0,3.0,0.0,67.66,0.0,74.91,3.0,4480.26,639.0,635.0,371.06,95.0,116.0,4.0,3.0,0.0,17.0,25.0,69.0,7.0,23.0,2.0,18.0,0.0,2.0,4.0,3.0,6.0,3.0,5.0,10.0,4.0,8.0,0.0,0.0,0.0,1.0,210.3075,0.0,0.0,93.0,25.0,1.0,0.0,0.0,0.0,2058.5,40.14,0.0,0.0,0.0,27470.0,44795.0,18904.25,7800.0,0.0,8253.2,1149.4,0.0,6275.0,13900.0,0.0,3957.0
50%,62000.0,0.0,0.0,16.78,694.0,690.0,12125.0,12100.0,0.0,398.21,12175.0,10.0,12436.0,23.0,13520.332276,0.0,4.0,7775.0,3523.0,72.2,0.0,0.0,3.0,0.0,104.795,12.0,150.18,3.0,6225.74,694.0,690.0,940.86,128.0,161.0,9.0,6.0,1.0,32.0,41.0,95.0,15.0,40.0,6.0,34.0,0.0,3.0,5.0,4.0,8.0,6.0,7.0,14.0,5.0,10.0,0.0,0.0,0.0,2.0,314.385,0.0,0.0,100.0,50.0,1.0,0.0,0.0,0.0,3690.8,45.0,0.0,0.0,0.0,80760.5,108500.0,32963.0,14700.0,20562.0,13517.36,2123.76,0.0,10800.0,23000.0,0.0,5175.0
75%,87000.0,0.0,0.0,22.58,714.0,710.0,20000.0,19975.0,1.0,578.31,20000.0,14.0,20669.0,31.0,22087.658512,0.0,5.0,19668.0,9703.25,89.0,0.0,0.0,3.0,0.0,142.935,21.0,222.5325,3.0,8410.3975,734.0,730.0,6079.37,151.0,222.0,17.0,11.0,3.0,50.0,58.0,106.0,33.0,58.0,11.0,52.0,0.0,5.0,7.0,6.0,12.0,10.0,10.0,19.0,7.0,14.0,0.0,0.0,0.0,3.0,428.805,0.0,0.0,100.0,80.0,1.0,0.0,0.0,0.0,6000.0,50.0,3.0,0.0,0.0,208170.2,243732.8,53859.75,26500.0,42064.0,22063.7,3999.1,0.0,18000.0,37200.0,0.0,6940.0
max,7141778.0,4.0,29.0,34.99,850.0,845.0,35000.0,35000.0,8.0,1408.13,35000.0,62.0,2568995.0,105.0,61557.694036,5.0,40.0,958084.0,497445.0,339.6,5.0,6124.938,3.0,65000.0,247.4,29.0,991.79,3.0,15761.17,850.0,845.0,35760.2,649.0,760.0,264.0,211.0,31.0,156.0,165.0,121.0,554.0,152.0,24.0,165.0,29.0,30.0,37.0,35.0,65.0,66.0,58.0,94.0,37.0,62.0,2.0,4.0,24.0,25.0,531.72,13373.34,13354.23,100.0,100.0,1.0,54.0,8.0,39444.37,33601.0,184.36,36.0,53.0,88303.0,8000078.0,9999999.0,2644442.0,522210.0,1214546.0,61513.72,27074.58,367.6,35000.0,9999999.0,1.0,22827.0
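
The `count` row above differs by column because of missing values. One way to turn that observation into a drop list is to rank columns by their fraction of missing values. This is a sketch under assumptions: `missing_report`, the 0.5 threshold, and the `demo` frame are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Fraction of missing values per column, keeping only columns above threshold."""
    frac = df.isnull().mean().sort_values(ascending=False)
    return frac[frac > threshold]

# hypothetical sample frame reusing three column names from the table above
demo = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000, 4000],
    "deferral_term": [np.nan, np.nan, np.nan, 3.0],
    "bc_util": [50.0, np.nan, 70.0, 80.0],
})
report = missing_report(demo)
print(report)
```

Columns returned by the report are candidates for dropping; columns with only a few gaps are better candidates for imputation.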


In [17]:
# Note: in the describe() output above, some columns have a count below the 188181 total rows; the gap is due to NaN values.
# loan_short_df = loan_short_df.fillna(0)
# loan_short_df[loan_short_df.isnull().any(axis=1)].shape
# Print out rows with NaNs --> loan_short_df[loan_short_df.isnull().any(axis=1)].head()

drop_list = ['url','debt_settlement_flag_date','deferral_term','next_pymnt_d','orig_projected_additional_accrued_interest']
drop_dates = ['payment_plan_start_date','last_pymnt_d','last_credit_pull_d']

drop_nlp_cand = ['title','emp_title','desc']
# hardship detail columns: dropped here (a hardship indicator could be derived from them first)
drop_hardship =['hardship_amount','hardship_dpd','hardship_end_date','hardship_flag','hardship_last_payment_amount','hardship_length','hardship_loan_status','hardship_payoff_balance_amount','hardship_reason','hardship_start_date','hardship_status','hardship_type']

# settlement detail columns: dropped here (a settlement indicator could be derived from them first)
drop_settles = [ 'settlement_amount','settlement_date','settlement_percentage','settlement_status','settlement_term']

# "Months since ..." columns: NaN suggests the event never happened, so an imputed value should be high.
# The transform logic is not implemented yet, so drop these for now.
drop_msince = ['mths_since_last_delinq','mths_since_last_major_derog','mths_since_last_record','mths_since_recent_bc','mths_since_recent_bc_dlq','mths_since_recent_inq','mths_since_recent_revol_delinq','mo_sin_old_il_acct','mo_sin_old_rev_tl_op','mo_sin_rcnt_rev_tl_op','mo_sin_rcnt_tl']

# These could be imputed, but drop for now ...
drop_total = ['total_rev_hi_lim','tot_coll_amt','tot_cur_bal','tot_hi_cred_lim']

# there is information here .. impute later (maybe based on GRADE ?)
drop_nums = ['num_accts_ever_120_pd','num_actv_bc_tl','num_actv_rev_tl','num_bc_sats','num_bc_tl','num_il_tl','num_op_rev_tl','num_rev_accts','num_rev_tl_bal_gt_0','num_sats','num_tl_120dpd_2m','num_tl_30dpd','num_tl_90g_dpd_24m','num_tl_op_past_12m']

# drop all of the column groups above in a single call
cols_to_drop = (drop_list + drop_dates + drop_nlp_cand + drop_hardship +
                drop_settles + drop_msince + drop_total + drop_nums)
loan_short_df = loan_short_df.drop(columns=cols_to_drop)
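
The comments above mention creating hardship and settlement indicators before dropping the detail columns, but the cell only drops them. A possible sketch of that step follows. It assumes that numeric detail columns such as `hardship_amount` are NaN whenever no plan exists (consistent with the low counts in the `describe()` output), and it should not be fed always-populated columns such as a Y/N flag. `add_flag_and_drop`, `had_hardship`, and the `demo` frame are hypothetical names.

```python
import numpy as np
import pandas as pd

def add_flag_and_drop(df: pd.DataFrame, cols, flag_name: str) -> pd.DataFrame:
    """Add a 0/1 flag that is 1 when any of `cols` is populated, then drop `cols`."""
    present = [c for c in cols if c in df.columns]
    out = df.copy()
    # a row is flagged when at least one detail column holds a value
    out[flag_name] = out[present].notna().any(axis=1).astype(int)
    return out.drop(columns=present)

# hypothetical sample rows
demo = pd.DataFrame({
    "loan_amnt": [8000, 12000, 20000],
    "hardship_amount": [np.nan, 103.6, np.nan],
    "hardship_length": [np.nan, 3.0, np.nan],
})
flagged = add_flag_and_drop(demo, ["hardship_amount", "hardship_length"], "had_hardship")
print(flagged)
```

Note that hardship and settlement events occur after the loan is issued, so such flags describe the outcome rather than the application and may not be suitable as model inputs.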



In [18]:
#loan_short_df.drop(columns='title',axis=1,inplace=True)

In [19]:
# Fill candidates with constants for now - add justification later ...
# (assignment form avoids chained in-place fillna, which newer pandas versions warn about)
loan_short_df['percent_bc_gt_75'] = loan_short_df['percent_bc_gt_75'].fillna(0)
loan_short_df['bc_open_to_buy'] = loan_short_df['bc_open_to_buy'].fillna(0)
loan_short_df['bc_util'] = loan_short_df['bc_util'].fillna(0)
loan_short_df['pct_tl_nvr_dlq'] = loan_short_df['pct_tl_nvr_dlq'].fillna(0)  # Percent trades never delinquent
loan_short_df['avg_cur_bal'] = loan_short_df['avg_cur_bal'].fillna(3000)  # roughly the 25th percentile (3013)
loan_short_df['acc_open_past_24mths'] = loan_short_df['acc_open_past_24mths'].fillna(4)  # the median (25th percentile is 2)
loan_short_df['mort_acc'] = loan_short_df['mort_acc'].fillna(1)  # the median (25th percentile is 0)

loan_short_df['total_bal_ex_mort'] = loan_short_df['total_bal_ex_mort'].fillna(18000)  # roughly the 25th percentile (18904); the median is 32963
loan_short_df['total_bc_limit'] = loan_short_df['total_bc_limit'].fillna(7800)  # the 25th percentile; the median is 14700
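
The constants above were read off the `describe()` table by hand. A sketch of a data-driven alternative is shown below: it fills each column with one of its own quantiles. `fill_with_quantile`, the chosen quantiles, and the `demo` frame are assumptions for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd

def fill_with_quantile(df: pd.DataFrame, col_quantiles: dict) -> pd.DataFrame:
    """Fill NaN in each listed column with that column's own quantile."""
    out = df.copy()
    for col, q in col_quantiles.items():
        # quantile() ignores NaN, so it is computed from observed values only
        out[col] = out[col].fillna(out[col].quantile(q))
    return out

# hypothetical sample rows
demo = pd.DataFrame({
    "avg_cur_bal": [1000.0, 3000.0, 5000.0, 7000.0, np.nan],
    "mort_acc": [0.0, 1.0, 2.0, np.nan, np.nan],
})
filled = fill_with_quantile(demo, {"avg_cur_bal": 0.25, "mort_acc": 0.5})
print(filled)
```

When the data is later split for modelling, computing the quantiles on the training portion only would avoid leaking information from the test portion.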


In [20]:
# create emp_length indicator variable
# emp_length <- impute that with simple formula based on diff mdl ...
def emp_func(row):
    emp_length = row['emp_length']
    if isinstance(emp_length, str):
        # assumes Lending Club labels such as '< 1 year', '1 year', '2 years', ..., '10+ years'
        # (the previous check for '1 years' would never match the singular '1 year')
        if emp_length in ('< 1 year', '1 year', '2 years', '3 years'):
            return '0_3yrs'
        elif emp_length in ('4 years', '5 years', '6 years'):
            return '4_6yrs'
        else:
            return 'gt_6yrs'
    else:
        # missing emp_length (NaN) is placed in the lowest bin
        return '0_3yrs'

loan_short_df['emp_bin'] = loan_short_df.apply(emp_func, axis=1)
loan_short_df = loan_short_df.drop(columns='emp_length')
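
Row-wise `apply` is slow on a frame of this size. A vectorized sketch of the same binning uses a lookup table and `Series.map`. The label set in `EMP_BINS` is an assumption about how Lending Club encodes `emp_length`, and unlike the row function this sketch sends any unrecognized label (not only NaN) to the lowest bin. `EMP_BINS` and `bin_emp_length` are hypothetical names.

```python
import numpy as np
import pandas as pd

# assumed emp_length labels mapped to the three bins used in this notebook
EMP_BINS = {
    "< 1 year": "0_3yrs", "1 year": "0_3yrs", "2 years": "0_3yrs", "3 years": "0_3yrs",
    "4 years": "4_6yrs", "5 years": "4_6yrs", "6 years": "4_6yrs",
    "7 years": "gt_6yrs", "8 years": "gt_6yrs", "9 years": "gt_6yrs", "10+ years": "gt_6yrs",
}

def bin_emp_length(emp_length: pd.Series) -> pd.Series:
    """Map raw emp_length labels to bins; NaN and unknown labels go to '0_3yrs'."""
    return emp_length.map(EMP_BINS).fillna("0_3yrs")

# hypothetical sample values
binned = bin_emp_length(pd.Series(["< 1 year", "5 years", "10+ years", np.nan]))
print(binned.tolist())
```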

In [21]:
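def revol_util_func(row) :
    # NOTE: this check only passes when revol_util is a Python int. If the column holds
    # percent strings, floats or NaN, every row falls through to the estimate below.
    if(isinstance(row['revol_util'], int)) :
        return row['revol_util']
    else :
        # estimated utilization: revolving balance over (revolving balance + requested loan amount)
        return float(row['revol_bal']/(row['revol_bal']+row['loan_amnt']))

loan_short_df['revol_util_1'] = loan_short_df.apply(revol_util_func, axis=1)
loan_short_df = loan_short_df.drop(columns='revol_util')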
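
If the intent is to keep the reported utilization where it exists and fall back to the estimate only for missing values, one possible sketch is shown below. It assumes `revol_util` arrives as percent strings such as `"83.7%"`; `revol_util_with_fallback` and the `demo` frame are hypothetical. Applying it would change the `revol_util_1` statistics shown in the next cell, so it is offered as an alternative rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd

def revol_util_with_fallback(df: pd.DataFrame) -> pd.Series:
    """Reported utilization as a fraction, with a balance-based estimate for gaps."""
    # strip a trailing '%' and convert; anything unparseable (including NaN) becomes NaN
    reported = pd.to_numeric(
        df["revol_util"].astype(str).str.rstrip("%"), errors="coerce"
    ) / 100
    # same fallback formula as revol_util_func above
    estimate = df["revol_bal"] / (df["revol_bal"] + df["loan_amnt"])
    return reported.fillna(estimate)

# hypothetical sample rows
demo = pd.DataFrame({
    "revol_util": ["83.7%", np.nan],
    "revol_bal": [5000, 3000],
    "loan_amnt": [10000, 9000],
})
util = revol_util_with_fallback(demo)
print(util.tolist())
```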

In [22]:
loan_short_df.describe()

Unnamed: 0,annual_inc,collections_12_mths_ex_med,delinq_2yrs,dti,fico_range_high,fico_range_low,funded_amnt,funded_amnt_inv,inq_last_6mths,installment,loan_amnt,open_acc,revol_bal,total_acc,total_pymnt,acc_now_delinq,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,collection_recovery_fee,delinq_amnt,last_fico_range_high,last_fico_range_low,last_pymnt_amnt,mort_acc,out_prncp,out_prncp_inv,pct_tl_nvr_dlq,percent_bc_gt_75,policy_code,pub_rec,pub_rec_bankruptcies,recoveries,tax_liens,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,total_pymnt_inv,total_rec_int,total_rec_late_fee,total_rec_prncp,default,time_history,revol_util_1
count,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0,188181.0
mean,72233.28,0.003172,0.239626,17.060171,700.810225,696.810119,14351.61985,14339.617827,0.803678,443.734012,14354.139366,11.000808,16318.4,24.543456,16111.869584,0.002715,3.932336,12204.236528,7866.451666,63.594414,0.005266,19.660953,8.366961,681.973807,669.38971,4143.202453,1.778718,190.190564,190.103135,81.258528,50.988671,1.0,0.106212,0.084785,177.877027,0.01404,41892.69,19743.054602,29319.97,16098.932265,3309.895078,0.957005,12623.140466,0.156296,5679.545783,0.493742
std,51824.59,0.059182,0.70373,7.597634,29.958829,29.958302,8112.60861,8107.009785,1.032934,242.648651,8114.766207,4.607592,19287.61,11.070113,10464.410455,0.058392,2.603616,15603.553598,12976.457004,29.233695,0.08169,116.366908,524.9348,78.105496,115.82765,5855.798972,2.155178,995.52633,995.144027,34.577782,35.230205,0.0,0.406336,0.290168,790.222132,0.241542,39669.24,18662.876148,36169.66,10457.218577,3498.98409,7.044596,8090.530507,0.363137,2578.888982,0.171163
min,4800.0,0.0,0.0,0.0,664.0,660.0,1000.0,950.0,0.0,4.93,1000.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1095.0,0.0
25%,45000.0,0.0,0.0,11.34,679.0,675.0,8000.0,8000.0,0.0,269.98,8000.0,8.0,7132.0,16.0,8261.250341,0.0,2.0,3000.0,831.0,45.0,0.0,0.0,0.0,639.0,635.0,371.06,0.0,0.0,0.0,87.0,20.0,1.0,0.0,0.0,0.0,0.0,18000.0,7800.0,0.0,8253.2,1149.4,0.0,6275.0,0.0,3957.0,0.395039
50%,62000.0,0.0,0.0,16.78,694.0,690.0,12125.0,12100.0,0.0,398.21,12175.0,10.0,12436.0,23.0,13520.332276,0.0,4.0,5329.0,3192.0,70.3,0.0,0.0,0.0,694.0,690.0,940.86,1.0,0.0,0.0,97.0,50.0,1.0,0.0,0.0,0.0,0.0,31711.0,14000.0,20562.0,13517.36,2123.76,0.0,10800.0,0.0,5175.0,0.496465
75%,87000.0,0.0,0.0,22.58,714.0,710.0,20000.0,19975.0,1.0,578.31,20000.0,14.0,20669.0,31.0,22087.658512,0.0,5.0,17265.0,9157.0,88.3,0.0,0.0,0.0,734.0,730.0,6079.37,3.0,0.0,0.0,100.0,80.0,1.0,0.0,0.0,0.0,0.0,52574.0,25700.0,42064.0,22063.7,3999.1,0.0,18000.0,0.0,6940.0,0.594978
max,7141778.0,4.0,29.0,34.99,850.0,845.0,35000.0,35000.0,8.0,1408.13,35000.0,62.0,2568995.0,105.0,61557.694036,5.0,40.0,958084.0,497445.0,339.6,5.0,6124.938,65000.0,850.0,845.0,35760.2,31.0,13373.34,13354.23,100.0,100.0,1.0,54.0,8.0,39444.37,53.0,2644442.0,522210.0,1214546.0,61513.72,27074.58,367.6,35000.0,1.0,22827.0,0.992216


In [23]:
# This should return an empty frame. Any row listed here still contains NaN, so an empty result validates the missing-data handling above.
loan_short_df[loan_short_df.isnull().any(axis=1)].head(50)



Unnamed: 0,id,addr_state,annual_inc,collections_12_mths_ex_med,delinq_2yrs,dti,fico_range_high,fico_range_low,funded_amnt,funded_amnt_inv,grade,home_ownership,inq_last_6mths,installment,int_rate,issue_d,loan_amnt,loan_status,open_acc,purpose,pymnt_plan,revol_bal,sub_grade,term,total_acc,total_pymnt,verification_status,zip_code,acc_now_delinq,acc_open_past_24mths,application_type,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,collection_recovery_fee,debt_settlement_flag,delinq_amnt,disbursement_method,earliest_cr_line,initial_list_status,last_fico_range_high,last_fico_range_low,last_pymnt_amnt,mort_acc,out_prncp,out_prncp_inv,pct_tl_nvr_dlq,percent_bc_gt_75,policy_code,pub_rec,pub_rec_bankruptcies,recoveries,tax_liens,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,total_pymnt_inv,total_rec_int,total_rec_late_fee,total_rec_prncp,default,time_history,emp_bin,revol_util_1
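
Rather than inspecting an empty frame by eye, the same check can fail loudly. The helper below is a small sketch; `assert_no_missing` and the two sample frames are hypothetical names used for illustration.

```python
import numpy as np
import pandas as pd

def assert_no_missing(df: pd.DataFrame) -> None:
    """Raise ValueError naming every column that still contains NaN."""
    null_counts = df.isnull().sum()
    remaining = null_counts[null_counts > 0]
    if not remaining.empty:
        raise ValueError(f"columns with missing values: {remaining.to_dict()}")

# hypothetical sample frames
clean = pd.DataFrame({"dti": [11.3, 16.8], "open_acc": [8, 10]})
dirty = pd.DataFrame({"dti": [11.3, np.nan], "open_acc": [8, 10]})

assert_no_missing(clean)  # passes silently
try:
    assert_no_missing(dirty)
    raised = False
except ValueError as err:
    raised = True
    message = str(err)
```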


In [24]:
# Just Fill Nan values with mean from above ...
#loan_sub_kp["REVOL_UTIL"].fillna(48.8, inplace=True)
#loan_sub_kp["MTHS_SINCE_LAST_DELINQ"].fillna(35.9, inplace=True)
#
#loan_sub_kp.describe()

### 6.  Cast miscategorized data

In [25]:
# a = 'int_rate', 'revol_util'
#loan_short_df['int_rate']
#loan_short_df['revol_util']
class cast_data(BaseEstimator, TransformerMixin):
    def __init__(self) :
        pass  # no parameters to configure
    def fit(self,X,y=None) :
        return self # do nothing, no implementation
    def transform(self,X,y=None) :
        assert isinstance(X, pd.DataFrame)
        # convert percent strings (e.g. '13.5%') into fractions (0.135); non-string values pass through unchanged
        X['int_rate'] = X['int_rate'].map(lambda x : float(x.strip('%'))/100 if(isinstance(x,str)) else x  )
        #X['revol_util'] = X['revol_util'].map(lambda x : float(x.strip('%'))/100 if(isinstance(x,str)) else x  )

        return X

cd = cast_data()
loan_short_df = cd.transform(loan_short_df)
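
Because the cleaning steps are written as scikit-learn transformers, they can be chained instead of being called one at a time. The sketch below shows the idea with two simplified stand-ins, `CreditHistoryDays` and `PercentToFraction`. These are hypothetical classes written for this example, not the notebook's `clean_time_columns` and `cast_data`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class CreditHistoryDays(TransformerMixin, BaseEstimator):
    """Parse 'Mon-YYYY' dates and add the credit-history length in days."""
    def fit(self, X, y=None):
        return self  # stateless
    def transform(self, X):
        X = X.copy()
        for col in ("issue_d", "earliest_cr_line"):
            X[col] = pd.to_datetime(X[col], format="%b-%Y")
        X["time_history"] = (X["issue_d"] - X["earliest_cr_line"]).dt.days
        return X

class PercentToFraction(TransformerMixin, BaseEstimator):
    """Convert percent strings in the given columns to fractions."""
    def __init__(self, columns=("int_rate",)):
        self.columns = columns
    def fit(self, X, y=None):
        return self  # stateless
    def transform(self, X):
        X = X.copy()
        for col in self.columns:
            X[col] = X[col].map(
                lambda v: float(v.strip("%")) / 100 if isinstance(v, str) else v
            )
        return X

prep = Pipeline([("dates", CreditHistoryDays()), ("cast", PercentToFraction())])

# hypothetical sample row
demo = pd.DataFrame({
    "issue_d": ["Dec-2015"],
    "earliest_cr_line": ["Dec-2012"],
    "int_rate": ["13.5%"],
})
prepared = prep.fit_transform(demo)
print(prepared)
```

A pipeline like this can be pickled alongside the model so that the same preparation is applied at scoring time.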


## Save Cleaned Dataframe for Follow-on Modelling Phase

In [26]:
loan_short_df.to_pickle("01-dataprep-loan_short_df.pkl")


In [27]:
loan_short_df.to_csv("01-dataprep-loan_short_df.csv",header=True,sep=',',na_rep='')
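
The two output formats behave differently when read back: a pickle preserves column dtypes, while a CSV stores text and needs the date columns re-parsed. The sketch below illustrates this on a one-row frame written to a temporary directory. The file names and the `demo` frame are illustrative only.

```python
import os
import tempfile
import pandas as pd

# hypothetical sample row with a datetime column, as produced by the cleaning steps
demo = pd.DataFrame({
    "issue_d": pd.to_datetime(["2015-12-01"]),
    "int_rate": [0.135],
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "demo.pkl")
    csv_path = os.path.join(tmp, "demo.csv")
    demo.to_pickle(pkl_path)
    demo.to_csv(csv_path, index=False)

    from_pkl = pd.read_pickle(pkl_path)  # dtypes preserved
    from_csv = pd.read_csv(csv_path)     # issue_d comes back as text
    from_csv_parsed = pd.read_csv(csv_path, parse_dates=["issue_d"])

print(from_pkl.dtypes)
print(from_csv.dtypes)
```

For the modelling notebook, loading the pickle is therefore the simpler choice; the CSV is mainly useful for inspection or for tools outside Python.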


In [None]:
!head -n 1 01-dataprep-loan_short_df.csv
!ls -lart
!cat 01-dataprep-loan_short_df.csv |wc
!pwd