Variables that have a reliable relationship with the default rate:
annual_inc, dti, delinq_2yrs, inq_last_6mths, total_acc, open_acc, collections_12_mths_ex_med,
home_ownership (MORTGAGE, RENT),
purpose (car, credit_card, debt_consolidation (very small effect), home_improvement, major_purchase, medical,
moving, other, small_business, wedding),
avg_inc (small effect), avg_SS (very small effect), avg_itemized (small effect), avg_unemp (very small effect), avg_mort_intr (small effect),
prop_SS (small effect), prop_itemized (small effect), prop_unemp (small effect), prop_mort_intr (small effect), prop_edu (small effect), prop_st_loans (small effect),
erlst_cred (small effect), last_cred_pl (large effect, strange)

In [1]:
#Import some modules
%matplotlib inline
import numpy as np
import pandas as pd
import utils as ut
import datetime as dt



#Load in the four loan files and stack them into one frame

df1 = pd.read_csv('LoanStats3a.csv', skiprows = 1, 
                  parse_dates = ['issue_d', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d'],
                 skipfooter = 2)
df2 = pd.read_csv('LoanStats3b.csv', skiprows = 1, 
                  parse_dates = ['issue_d', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d'],
                 skipfooter = 2, infer_datetime_format=True)
df3 = pd.read_csv('LoanStats3c.csv', skiprows = 1, 
                  parse_dates = ['issue_d', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d'],
                 skipfooter = 2, infer_datetime_format=True)
df4 = pd.read_csv('LoanStats3d.csv', skiprows = 1, 
                  parse_dates = ['issue_d', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d'],
                 skipfooter = 2, infer_datetime_format=True)

loans = pd.concat([df1, df2, df3, df4], ignore_index = True)

del df1
del df2
del df3
del df4
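
The four near-identical read_csv calls above could be folded into a loop. A minimal sketch, assuming every file shares the same layout (one preamble line before the header, two footer lines); load_loan_files is a hypothetical helper, and the in-memory CSV only stands in for the real LoanStats files so that the block runs on its own:

```python
import io
import pandas as pd

DATE_COLS = ['issue_d', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d']

def load_loan_files(sources, date_cols=DATE_COLS):
    """Read each source (a path or file-like object) and stack the frames."""
    frames = [
        # skipfooter is only supported by the python parsing engine
        pd.read_csv(src, skiprows=1, skipfooter=2, engine='python',
                    parse_dates=date_cols)
        for src in sources
    ]
    return pd.concat(frames, ignore_index=True)

# Toy stand-in for one file: preamble line, header, two rows, two footer lines
TOY_CSV = (
    "Notes offered by Prospectus\n"
    "id,loan_status,issue_d\n"
    "1,Fully Paid,2011-12-01\n"
    "2,Charged Off,2011-11-01\n"
    "Total amount funded in policy code 1: 100\n"
    "Total amount funded in policy code 2: 0\n"
)

loans_demo = load_loan_files([io.StringIO(TOY_CSV), io.StringIO(TOY_CSV)],
                             date_cols=['issue_d'])
print(loans_demo)
```

With the real data the call would presumably be load_loan_files(['LoanStats3a.csv', 'LoanStats3b.csv', 'LoanStats3c.csv', 'LoanStats3d.csv']), which also avoids holding four named intermediate frames that have to be deleted afterwards.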



In [2]:
#Begin by cleaning up the data a bit
#Percentages of the different loan statuses
loans.loan_status.value_counts(normalize = True)


Current                                                0.652293
Fully Paid                                             0.261470
Charged Off                                            0.057966
Late (31-120 days)                                     0.014763
In Grace Period                                        0.006518
Late (16-30 days)                                      0.002723
Does not meet the credit policy. Status:Fully Paid     0.002621
Does not meet the credit policy. Status:Charged Off    0.001005
Default                                                0.000634
Does not meet the credit policy. Status:Current        0.000005
dtype: float64

In [3]:
loans.replace(to_replace = 'Does not meet the credit policy. Status:Fully Paid', value = 'Fully Paid', inplace = True)
loans.replace(to_replace = 'Does not meet the credit policy. Status:Charged Off', value = 'Charged Off', inplace = True)
#Assign default to charged off? They seem equivalent from the lender's perspective
loans.replace(to_replace = 'Default', value = 'Charged Off', inplace = True)

loans.loan_status.value_counts(normalize = True)

Current                                            0.652293
Fully Paid                                         0.264091
Charged Off                                        0.059605
Late (31-120 days)                                 0.014763
In Grace Period                                    0.006518
Late (16-30 days)                                  0.002723
Does not meet the credit policy. Status:Current    0.000005
dtype: float64

In [4]:
#Keep only the 'Fully Paid' and 'Charged Off' loans
#We're just going to consider those two categories to simplify the analysis
loans = loans[(loans.loan_status == 'Fully Paid') | (loans.loan_status == 'Charged Off')]
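
The status clean-up and the filter can also be scoped to the loan_status column, which avoids scanning every column of the frame the way the frame-wide replace calls do. A minimal sketch on toy data; STATUS_MAP and keep_finished_loans are illustrative names, not part of the notebook:

```python
import pandas as pd

# Same three substitutions as the replace calls, expressed as one mapping
STATUS_MAP = {
    'Does not meet the credit policy. Status:Fully Paid': 'Fully Paid',
    'Does not meet the credit policy. Status:Charged Off': 'Charged Off',
    'Default': 'Charged Off',
}

def keep_finished_loans(df):
    """Normalize loan_status, then keep only fully paid and charged-off loans."""
    out = df.copy()
    out['loan_status'] = out['loan_status'].replace(STATUS_MAP)
    return out[out['loan_status'].isin(['Fully Paid', 'Charged Off'])]

demo = pd.DataFrame({'loan_status': [
    'Current',
    'Fully Paid',
    'Charged Off',
    'Default',
    'Does not meet the credit policy. Status:Fully Paid',
    'Late (31-120 days)',
]})
finished = keep_finished_loans(demo)
print(finished.loan_status.value_counts())
```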

In [16]:
loans.columns

Index([u'id', u'member_id', u'loan_amnt', u'funded_amnt', u'funded_amnt_inv',
       u'term', u'int_rate', u'installment', u'grade', u'sub_grade',
       u'emp_title', u'emp_length', u'home_ownership', u'annual_inc',
       u'verification_status', u'issue_d', u'loan_status', u'pymnt_plan',
       u'url', u'desc', u'purpose', u'title', u'zip_code', u'addr_state',
       u'dti', u'delinq_2yrs', u'earliest_cr_line', u'inq_last_6mths',
       u'mths_since_last_delinq', u'mths_since_last_record', u'open_acc',
       u'pub_rec', u'revol_bal', u'revol_util', u'total_acc',
       u'initial_list_status', u'out_prncp', u'out_prncp_inv', u'total_pymnt',
       u'total_pymnt_inv', u'total_rec_prncp', u'total_rec_int',
       u'total_rec_late_fee', u'recoveries', u'collection_recovery_fee',
       u'last_pymnt_d', u'last_pymnt_amnt', u'next_pymnt_d',
       u'last_credit_pull_d', u'collections_12_mths_ex_med',
       u'mths_since_last_major_derog', u'policy_code', u'application_type',
       u'annu

This section runs inferential statistics on the features of this data set, beginning with the numeric fields that have little to no missing data and do not require any processing.

In [17]:
#Check out the effect of annual income
#Get the annual income for the charged off and fully paid categories
#Get the median incomes
inc_paid_med = loans[loans.loan_status == 'Fully Paid'].annual_inc.median()
inc_chrg_med = loans[loans.loan_status == 'Charged Off'].annual_inc.median()

loans.groupby('loan_status').annual_inc.median()

loan_status
Charged Off    56000
Fully Paid     64000
Name: annual_inc, dtype: float64

In [19]:
#Look at the effect size for the incomes
inc_paid_med / inc_chrg_med
#Modest effect size

1.1428571428571428

In [20]:
#Bootstrap the median incomes for the two categories and see if that effect is significant
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].annual_inc,
                             loans[loans.loan_status == 'Charged Off'].annual_inc, 99.5, 0.5)
prct
#Reliable

[7024.9849999999997, 8773.0099999999984]
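
The utils module is not shown in this notebook, so the exact behaviour of ut.med_diff_bootstrap is unknown. A minimal sketch of what such a helper might do, assuming the argument order (sample 1, sample 2, upper percentile, lower percentile) and a [lower, upper] return value, as the calls and outputs here suggest; the resample count, the seed, and the NaN handling are assumptions, and the incomes are synthetic:

```python
import numpy as np

def med_diff_bootstrap_sketch(a, b, upper_pct, lower_pct, n_boot=1000, seed=0):
    """Percentile bootstrap interval for median(a) - median(b)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a, b = a[~np.isnan(a)], b[~np.isnan(b)]  # assumption: drop missing values
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement at its original size
        diffs[i] = (np.median(rng.choice(a, size=a.size))
                    - np.median(rng.choice(b, size=b.size)))
    return [float(np.percentile(diffs, lower_pct)),
            float(np.percentile(diffs, upper_pct))]

# Synthetic incomes, NOT the Lending Club data
demo_rng = np.random.default_rng(1)
paid = demo_rng.normal(64000, 5000, size=500)
charged = demo_rng.normal(56000, 5000, size=500)
interval = med_diff_bootstrap_sketch(paid, charged, 99.5, 0.5)
print(interval)
```

Under this reading, an interval between the 0.5th and 99.5th percentiles that excludes zero is what the notebook appears to label "Reliable".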

In [21]:
#Check out the effect of dti
#Get the dti for the charged off and fully paid categories
#Get the median dtis
dti_paid_med = loans[loans.loan_status == 'Fully Paid'].dti.median()
dti_chrg_med = loans[loans.loan_status == 'Charged Off'].dti.median()

loans.groupby('loan_status').dti.median() 

loan_status
Charged Off    18.27
Fully Paid     15.69
Name: dti, dtype: float64

In [22]:
#Look at the effect size for the dtis
dti_chrg_med / dti_paid_med
#Modest effect size

1.1644359464627152

In [23]:
#Bootstrap the dti for the two categories and see if that effect is significant
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].dti,
                             loans[loans.loan_status == 'Fully Paid'].dti, 99.5, 0.5)
prct
#Reliable

[2.4400000000000013, 2.7000000000000011]

In [24]:
#Check out the effect of delinquencies in the last two years
#Get the number of delinquencies for the charged off and fully paid categories
#Get the mean delinquencies
dlq_paid_mn = loans[loans.loan_status == 'Fully Paid'].delinq_2yrs.mean()
dlq_chrg_mn = loans[loans.loan_status == 'Charged Off'].delinq_2yrs.mean()

loans.groupby('loan_status').delinq_2yrs.mean() 

loan_status
Charged Off    0.27490
Fully Paid     0.24146
Name: delinq_2yrs, dtype: float64

In [26]:
#Look at the effect size for the delinquencies
dlq_chrg_mn / dlq_paid_mn
#Modest effect size

1.1384884893215346

In [27]:
#Bootstrap the delinquencies for the two categories and see if that effect is significant
prct = ut.mean_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].delinq_2yrs,
                              loans[loans.loan_status == 'Fully Paid'].delinq_2yrs, 99.5, 0.5)
prct
#Reliable

[0.022954457309100756, 0.043943470247482859]

In [28]:
#Check out the effect of inquiries in the last six months
#Get the inquiries for the charged off and fully paid categories
#Get the mean inquiries
inq_paid_mn = loans[loans.loan_status == 'Fully Paid'].inq_last_6mths.mean()
inq_chrg_mn = loans[loans.loan_status == 'Charged Off'].inq_last_6mths.mean()

loans.groupby('loan_status').inq_last_6mths.mean() 

loan_status
Charged Off    1.047217
Fully Paid     0.858365
Name: inq_last_6mths, dtype: float64

In [30]:
#Look at the effect size for the inquiries
inq_chrg_mn / inq_paid_mn
#Moderate effect

1.2200132638223962

In [31]:
#Bootstrap the inquiries for the two categories and see if that effect is significant
prct = ut.mean_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].inq_last_6mths,
                              loans[loans.loan_status == 'Fully Paid'].inq_last_6mths, 99.5, 0.5)
prct
#Reliable

[0.17240098985823846, 0.2063343733412828]

In [32]:
#Check out the effect of number of open accounts
#Get the number of open accounts for the charged off and fully paid categories
#Get the mean number of accounts
opn_paid_mn = loans[loans.loan_status == 'Fully Paid'].open_acc.mean()
opn_chrg_mn = loans[loans.loan_status == 'Charged Off'].open_acc.mean()

loans.groupby('loan_status').open_acc.mean()

loan_status
Charged Off    11.008180
Fully Paid     10.875742
Name: open_acc, dtype: float64

In [33]:
#Look at the effect size for the number of open accounts
opn_paid_mn / opn_chrg_mn
#Small effect

0.98796918047848403

In [34]:
#Bootstrap the open accounts for the two categories and see if that effect is significant
prct = ut.mean_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].open_acc,
                              loans[loans.loan_status == 'Charged Off'].open_acc, 99.5, 0.5)
prct
#Looks like the effect is reliable, but it is small enough to be suspect

[-0.19623503540697443, -0.066461859835128423]

In [48]:
#Check out the effect of number of derogatory public records
#Get the number of public records for the charged off and fully paid categories
#Get the mean number of derogatory public records
pub_paid_mn = loans[loans.loan_status == 'Fully Paid'].pub_rec.mean()
pub_chrg_mn = loans[loans.loan_status == 'Charged Off'].pub_rec.mean()

loans.groupby('loan_status').pub_rec.mean()


loan_status
Charged Off    0.143180
Fully Paid     0.139283
Name: pub_rec, dtype: float64

In [49]:
#Look at the effect size for the number of derogatory public records
pub_chrg_mn / pub_paid_mn
#Small effect

1.0279786874048025

In [37]:
#Bootstrap the public records for the two categories and see if that effect is significant
prct = ut.mean_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].pub_rec,
                              loans[loans.loan_status == 'Fully Paid'].pub_rec, 99.5, 0.5)
prct
#Not reliable

[-0.0018740016053750276, 0.009818523648975357]

In [38]:
#Check out the effect of revolving balance
#Get the revolving balance for the charged off and fully paid categories
#Get the median revolving balance 
rev_bal_paid_med = loans[loans.loan_status == 'Fully Paid'].revol_bal.median()
rev_bal_chrg_med = loans[loans.loan_status == 'Charged Off'].revol_bal.median()

loans.groupby('loan_status').revol_bal.median()

loan_status
Charged Off    11301.5
Fully Paid     10805.0
Name: revol_bal, dtype: float64

In [39]:
#Effect size
rev_bal_chrg_med / rev_bal_paid_med
#Small effect

1.0459509486348912

In [40]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].revol_bal, 
                             loans[loans.loan_status == 'Fully Paid'].revol_bal, 99.5, 0.5)
prct
#reliable

[352.0, 647.0]

In [42]:
#Total number of accounts
tot_acc_paid_mn = loans[loans.loan_status == 'Fully Paid'].total_acc.mean()
tot_acc_chrg_mn = loans[loans.loan_status == 'Charged Off'].total_acc.mean()

loans.groupby('loan_status').total_acc.mean() 

loan_status
Charged Off    24.141296
Fully Paid     25.138142
Name: total_acc, dtype: float64

In [43]:
#Effect size
tot_acc_paid_mn  / tot_acc_chrg_mn 
#Small effect

1.041292169210503

In [44]:
#Bootstrap
prct = ut.mean_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].total_acc,
                              loans[loans.loan_status == 'Charged Off'].total_acc, 99.5, 0.5)
prct
#Seems to be reliable, but not a very big effect

[0.8410510951780743, 1.148117810622356]

In [45]:
#Collections in the last 12 months excluding medical collections
col_paid_mn = loans[loans.loan_status == 'Fully Paid'].collections_12_mths_ex_med.mean()
col_chrg_mn = loans[loans.loan_status == 'Charged Off'].collections_12_mths_ex_med.mean()

loans.groupby('loan_status').collections_12_mths_ex_med.mean()

#Field seems to be missing some data

loan_status
Charged Off    0.007741
Fully Paid     0.006107
Name: collections_12_mths_ex_med, dtype: float64

In [46]:
#Effect size
col_chrg_mn  / col_paid_mn
#Moderate effect, but missing data?

1.2675053795187776

In [47]:
#Bootstrap
prct = ut.mean_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].collections_12_mths_ex_med,
                              loans[loans.loan_status == 'Fully Paid'].collections_12_mths_ex_med, 99.5, 0.5)
prct
#Effect is reliable

[0.00044366518691515099, 0.002854148549792756]

In [50]:
#Number of accounts now delinquent
dlq_acc_paid_mn = loans[loans.loan_status == 'Fully Paid'].acc_now_delinq.mean()
dlq_acc_chrg_mn = loans[loans.loan_status == 'Charged Off'].acc_now_delinq.mean()

loans.groupby('loan_status').acc_now_delinq.mean()

loan_status
Charged Off    0.003879
Fully Paid     0.002827
Name: acc_now_delinq, dtype: float64

In [51]:
#Effect size
dlq_acc_chrg_mn / dlq_acc_paid_mn
#Large effect

1.3722424705392382

In [52]:
#Bootstrap
prct = ut.mean_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].acc_now_delinq,
                              loans[loans.loan_status == 'Fully Paid'].acc_now_delinq, 99.5, 0.5)
prct
#Also reliable

[0.00017702472440258958, 0.0019903687465753829]

Categorical variables:
relatively few categories:
home_ownership, purpose 

lots of categories:
addr_state, zip_code

In [4]:
chrg_off_rate = len(loans[loans.loan_status == 'Charged Off'].loan_status) / float(len(loans.loan_status))
chrg_off_rate

0.1841395271001106

In [5]:
#home ownership
#There is one loan with home_ownership == 'ANY', add that to 'OTHER'
loans.home_ownership.replace(to_replace = 'ANY', value = 'OTHER', inplace = True)
#Print the fully paid and charged off rates for each ownership category
loans.groupby('home_ownership').loan_status.value_counts(normalize = True).unstack()

loan_status     Fully Paid  Charged Off
home_ownership
MORTGAGE          0.835505     0.164495
NONE              0.829787     0.170213
OTHER             0.787709     0.212291
OWN               0.811723     0.188277
RENT              0.793776     0.206224


In [10]:
#is_mort
eff_size = (chrg_off_rate / 
    loans[loans.home_ownership == 'MORTGAGE'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean())
eff_size
#Modest effect of is_mort

1.1194209091762977

In [11]:
prct = ut.mean_bootstrap(
    loans[loans.home_ownership == 'MORTGAGE'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#Effect is reliable

[0.161742103211789, 0.16729834061714779]

In [12]:
#is_none
eff_size = (chrg_off_rate / 
    loans[loans.home_ownership == 'NONE'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean())
eff_size
#Modest effect of is_none

1.0818197217131498

In [13]:
prct = ut.mean_bootstrap(
    loans[loans.home_ownership == 'NONE'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#Effect is not reliable

[0.042553191489361701, 0.31914893617021278]

In [16]:
#is_other
eff_size = (
    loans[loans.home_ownership == 'OTHER'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Modest effect of is_other

1.1528785054274671

In [17]:
prct = ut.mean_bootstrap(
    loans[loans.home_ownership == 'OTHER'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#Effect is not reliable

[0.13407821229050279, 0.29608938547486036]

In [8]:
#is_own
eff_size = (
    loans[loans.home_ownership == 'OWN'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Small effect of is_own

1.0224701578129374

In [6]:
prct = ut.mean_bootstrap(
    loans[loans.home_ownership == 'OWN'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#Effect is not reliable

[0.18144257237778832, 0.19501661129568107]

In [18]:
#is_rent
eff_size = (
    loans[loans.home_ownership == 'RENT'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Modest effect of is_rent

1.1199336806700069

In [19]:
prct = ut.mean_bootstrap(
    loans[loans.home_ownership == 'RENT'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#Effect is reliable

[0.20299468559680359, 0.20954075992086582]
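
The home-ownership checks above appear to call an effect reliable when the bootstrap interval for a category's charge-off rate excludes the overall rate (chrg_off_rate, about 0.184). A minimal sketch of that rule, with a hypothetical stand-in for ut.mean_bootstrap (the argument order is assumed from the calls above); the two intervals are copied from the outputs above and the 0/1 flags are synthetic:

```python
import numpy as np

def mean_bootstrap_sketch(x, upper_pct, lower_pct, n_boot=2000, seed=0):
    """Percentile bootstrap interval for the mean of x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    means = np.array([rng.choice(x, size=x.size).mean() for _ in range(n_boot)])
    return [float(np.percentile(means, lower_pct)),
            float(np.percentile(means, upper_pct))]

def excludes(interval, value):
    """True when value lies outside the [lower, upper] interval."""
    lo, hi = interval
    return value < lo or value > hi

overall = 0.1841395271001106                          # chrg_off_rate
mortgage_ci = [0.161742103211789, 0.16729834061714779]
own_ci = [0.18144257237778832, 0.19501661129568107]
print(excludes(mortgage_ci, overall))  # MORTGAGE: interval sits below the overall rate
print(excludes(own_ci, overall))       # OWN: interval contains the overall rate

# Synthetic category with a 10% charge-off rate, compared against a 30% baseline
flags = np.array([1] * 100 + [0] * 900)
toy_ci = mean_bootstrap_sketch(flags, 99.5, 0.5)
print(toy_ci, excludes(toy_ci, 0.3))
```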

In [22]:
#loan purpose
#Print the fully paid and charged off rates for each purpose category
loans.groupby('purpose').loan_status.value_counts(normalize = True).unstack()

loan_status         Fully Paid  Charged Off
purpose
car                   0.873259     0.126741
credit_card           0.839346     0.160654
debt_consolidation    0.808992     0.191008
educational           0.791469     0.208531
home_improvement      0.841063     0.158937
house                 0.820905     0.179095
major_purchase         0.85668      0.14332
medical               0.793521     0.206479
moving                0.785607     0.214393
other                 0.788047     0.211953


In [23]:
#p_is_car
eff_size = (chrg_off_rate /
    loans[loans.purpose == 'car'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean())
eff_size
#Large effect of is_car

1.4528811039327409

In [24]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'car'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#Effect is reliable

[0.11253481894150417, 0.14150417827298051]

In [25]:
#p_is_credit
eff_size = (chrg_off_rate /
    loans[loans.purpose == 'credit_card'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean())
eff_size
#Modest effect of is_credit

1.1461892165678129

In [26]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'credit_card'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#reliable

[0.15646924117830749, 0.16508795669824086]

In [27]:
#p_is_debt
eff_size = (
    loans[loans.purpose == 'debt_consolidation'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Small effect of is_debt

1.0372996019357483

In [28]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'debt_consolidation'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#reliable, but so small it probably doesn't matter much

[0.18837205318940686, 0.19374143572247546]

In [29]:
#p_is_edu
eff_size = (
    loans[loans.purpose == 'educational'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Modest effect of is_edu

1.1324608516770678

In [30]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'educational'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#not reliable

[0.15876777251184834, 0.26067535545023507]

In [31]:
#p_is_home_imp
eff_size = (chrg_off_rate /
    loans[loans.purpose == 'home_improvement'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean())
eff_size
#Modest effect of is_home_imp

1.1585678452803312

In [32]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'home_improvement'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#reliable

[0.15113871635610765, 0.16687370600414078]

In [33]:
#p_is_house
eff_size = (chrg_off_rate /
    loans[loans.purpose == 'house'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean())
eff_size
#Small effect of is_house

1.0281647315214366

In [34]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'house'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#not reliable

[0.15586491442542788, 0.20415647921760391]

In [35]:
#p_is_major
eff_size = (chrg_off_rate /
    loans[loans.purpose == 'major_purchase'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean())
eff_size
#Large effect of is_major

1.2848106370514505

In [36]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'major_purchase'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#reliable

[0.13197146562905318, 0.15466926070038911]

In [37]:
#p_is_med
eff_size = (
    loans[loans.purpose == 'medical'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Modest effect of is_med

1.1213191286792417

In [38]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'medical'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#reliable

[0.18689925240299038, 0.22641509433962265]

In [39]:
#p_is_moving
eff_size = (
    loans[loans.purpose == 'moving'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Modest effect of is_moving

1.1642953958583948

In [40]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'moving'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#reliable

[0.19140429785107446, 0.23888055972013994]

In [41]:
#p_is_other
eff_size = (
    loans[loans.purpose == 'other'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Modest effect of is_other

1.1510439902927163

In [42]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'other'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#reliable

[0.2032943826552161, 0.22089258059974659]

In [43]:
#p_is_renew
eff_size = (
    loans[loans.purpose == 'renewable_energy'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Modest effect of is_renew

1.127119055206893

In [44]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'renewable_energy'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#not reliable

[0.14339622641509434, 0.27547169811320754]

In [45]:
#p_is_small_bus
eff_size = (
    loans[loans.purpose == 'small_business'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Large effect of small business

1.6138188872884882

In [46]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'small_business'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#reliable

[0.28050812161599331, 0.31424406497292795]

In [47]:
#p_is_vac
eff_size = (chrg_off_rate /
    loans[loans.purpose == 'vacation'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean())
eff_size
#small effect of vacation

1.0369609135601119

In [48]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'vacation'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#not reliable

[0.15165262475696695, 0.20220349967595594]

In [49]:
#p_is_wed
eff_size = (chrg_off_rate /
    loans[loans.purpose == 'wedding'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean())
eff_size
#large effect of wedding

1.3268682169379811

In [50]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'wedding'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#reliable

[0.1192384769539078, 0.15881763527054107]

Effects of categorical variables requiring minimal processing:
Reliable:
home_ownership: MORTGAGE, RENT
purpose: car, credit_card, debt_consolidation (very small effect), home_improvement, major_purchase, medical,
moving, other, small_business, wedding

Not Reliable:
home_ownership: OWN, NONE, OTHER
purpose: educational, house, renewable_energy, vacation
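
The per-category effect sizes could also be tabulated in one pass rather than one cell per category. A minimal sketch on toy data; category_rates is a hypothetical helper, and unlike the cells above it always reports the category rate divided by the overall rate, so values below 1 mean a lower-than-average charge-off rate:

```python
import pandas as pd

def category_rates(df, col, status_col='loan_status', bad='Charged Off'):
    """Charge-off rate per category, plus its ratio to the overall rate."""
    is_bad = (df[status_col] == bad).astype(float)
    overall = is_bad.mean()
    grouped = is_bad.groupby(df[col])
    table = pd.DataFrame({'n': grouped.size(), 'rate': grouped.mean()})
    table['ratio_to_overall'] = table['rate'] / overall
    return table, overall

# Made-up loans: 1 of 4 car loans and 3 of 4 small-business loans charged off
toy = pd.DataFrame({
    'purpose': ['car'] * 4 + ['small_business'] * 4,
    'loan_status': (['Fully Paid'] * 3 + ['Charged Off']
                    + ['Fully Paid'] + ['Charged Off'] * 3),
})
table, overall = category_rates(toy, 'purpose')
print(table)
```

The n column is worth keeping next to the ratio: the categories flagged as not reliable above are mostly the ones with wide intervals, which usually points to few loans in the category.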

In [6]:
#Read in IRS zipcode data
zip_data = pd.read_pickle('anon_avg_irs_data.pkl')

In [6]:
zip_data.columns

Index([u'adj_gross_inc', u'amt_SS', u'amt_edu', u'amt_itemized',
       u'amt_mort_intr', u'amt_st_loans', u'amt_unemp', u'n_SS', u'n_edu',
       u'n_farm', u'n_itemized', u'n_mort_intr', u'n_returns', u'n_st_loans',
       u'n_unemp', u'zipcd', u'avg_inc', u'avg_SS', u'avg_itemized',
       u'avg_unemp', u'avg_mort_intr', u'avg_edu', u'avg_st_loans', u'prop_SS',
       u'prop_itemized', u'prop_unemp', u'prop_farm', u'prop_mort_intr',
       u'prop_edu', u'prop_st_loans'],
      dtype='object')

In [7]:
#Create a dictionary to convert anonymized string zipcodes into integers
code_dict = {code: int(code[0:3]) for code in loans.zip_code.unique()}
#Replace string zipcodes with integer zipcodes
loans.replace(to_replace = {'zip_code': code_dict}, inplace = True)
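
The same conversion can be written with vectorized string methods, which also tolerates missing zip codes (the dictionary comprehension above would fail on a NaN entry, since a float cannot be sliced). A minimal sketch on made-up codes, assuming the masked format looks like '941xx':

```python
import pandas as pd

# Made-up masked zip codes; the last entry is missing
zips = pd.Series(['941xx', '100xx', '007xx', None])

# Keep the first three characters and convert to numbers;
# missing entries become NaN instead of raising an error
zip_int = pd.to_numeric(zips.str[:3], errors='coerce')
print(zip_int.tolist())
```

As with int(code[0:3]) in the cell above, leading zeros are dropped, so '007xx' becomes 7.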

In [8]:
#Drop unnecessary columns from zipcode data
zip_data.drop(['adj_gross_inc', 'amt_SS', 'amt_edu', 'amt_itemized',
       'amt_mort_intr', 'amt_st_loans', 'amt_unemp', 'n_SS', 'n_edu',
       'n_farm', 'n_itemized', 'n_mort_intr', 'n_returns', 'n_st_loans',
       'n_unemp'], axis = 1, inplace=True)
#Rename the zipcd column so that we can do a join
zip_data.rename(columns = {'zipcd':'zip_code'}, inplace = True)
zip_data.head()

    zip_code    avg_inc     avg_SS  avg_itemized  avg_unemp  avg_mort_intr   avg_edu  avg_st_loans
0          0  60.891937  12.351555     26.158858   7.114837       9.953748  2.228984      1.013651
10        10  55.922672  11.008976     20.113773    8.41699       7.691459  2.277419      1.017362
11        11  45.777654  10.762002     21.024044   7.605732       7.149565  2.382857      0.990185
12        12  53.726985  11.908767      21.85106   7.568678       7.620091      2.17      1.026488
13        13  48.752319   9.950238      18.28319   7.977017       7.077144  2.146667      0.967655

     prop_SS  prop_itemized  prop_unemp     prop_farm  prop_mort_intr  prop_edu  prop_st_loans
0   0.118443       0.326262    0.089928    0.01202177        0.249079  0.011627       0.078936
10  0.130922       0.366475    0.107337   0.002385129        0.295665  0.011685       0.098242
11  0.094586       0.276972    0.127751  3.473076e-09        0.222265   0.01013       0.078144
12  0.156558       0.323024    0.106462   0.001026603        0.250585  0.008252        0.08599
13  0.130129       0.327657    0.109854   0.008545488        0.264874  0.007666       0.094808


In [9]:
#Join zipcode data to loan data on zip_code
loans = loans.join(zip_data, on = 'zip_code', rsuffix = 'r')
loans.drop('zip_coder', axis = 1, inplace = True)
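
DataFrame.join with on = 'zip_code' matches the loans column against the index of zip_data, not against its zip_code column; that works here only if the index equals the zip code, as the head() output suggests. A minimal sketch of an explicit column-to-column merge on toy frames, which does not depend on the index and needs no follow-up drop (the frames and values are made up):

```python
import pandas as pd

loans_demo = pd.DataFrame({'id': [1, 2, 3], 'zip_code': [941, 100, 999]})
zip_demo = pd.DataFrame({'zip_code': [100, 941], 'avg_inc': [55.9, 60.9]})

# Left merge keeps every loan; validate raises an error if a zip code
# appears more than once in the lookup table
merged = loans_demo.merge(zip_demo, on='zip_code', how='left',
                          validate='many_to_one')
print(merged)
```

Loans whose zip code has no match (999 in the toy data) keep their row and get NaN in the regional columns, so it is worth counting those before running the comparisons.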

Look at the effects of the features extracted from the zipcode data (note that dollar amounts are in thousands):
avg_inc, avg_SS, avg_itemized, avg_unemp, avg_mort_intr, avg_edu, avg_st_loans, prop_SS,
prop_itemized, prop_unemp, prop_farm, prop_mort_intr, prop_edu, prop_st_loans
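
The group medians and their ratios for all of these features can be computed in one pass. A minimal sketch on toy data; median_ratios is a hypothetical helper and the values are made up:

```python
import pandas as pd

def median_ratios(df, cols, status_col='loan_status'):
    """Per-status medians for each column, plus the paid / charged-off ratio."""
    med = df.groupby(status_col)[cols].median()
    ratio = med.loc['Fully Paid'] / med.loc['Charged Off']
    # Transpose so that each feature is a row
    return med.T.assign(paid_over_charged=ratio)

toy = pd.DataFrame({
    'loan_status': ['Fully Paid'] * 3 + ['Charged Off'] * 3,
    'avg_inc': [60.0, 62.0, 64.0, 50.0, 55.0, 60.0],
    'prop_SS': [0.10, 0.12, 0.14, 0.11, 0.13, 0.15],
})
summary = median_ratios(toy, ['avg_inc', 'prop_SS'])
print(summary)
```

The ratio is always paid over charged off here, so a value below 1 corresponds to the cells that divide in the other direction.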

In [25]:
#Average income of the region
reg_inc_paid_med = loans[loans.loan_status == 'Fully Paid'].avg_inc.median()
reg_inc_chrg_med = loans[loans.loan_status == 'Charged Off'].avg_inc.median()

loans.groupby('loan_status').avg_inc.median()

loan_status
Charged Off    56.675219
Fully Paid     58.896681
Name: avg_inc, dtype: float64

In [26]:
#Effect size
reg_inc_paid_med / reg_inc_chrg_med
#Small effect

1.0391963486496827

In [27]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].avg_inc,
                              loans[loans.loan_status == 'Charged Off'].avg_inc, 99.5, 0.5)
prct
#Reliable

[1.8863152713123483, 2.3915383968174089]

In [28]:
#Average SS income of the region
SS_paid_med = loans[loans.loan_status == 'Fully Paid'].avg_SS.median()
SS_chrg_med = loans[loans.loan_status == 'Charged Off'].avg_SS.median()

loans.groupby('loan_status').avg_SS.median()

loan_status
Charged Off    12.280081
Fully Paid     12.424060
Name: avg_SS, dtype: float64

In [29]:
#Effect size
SS_paid_med / SS_chrg_med
#very small effect

1.011724574159853

In [30]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].avg_SS,
                              loans[loans.loan_status == 'Charged Off'].avg_SS, 99.5, 0.5)
prct
#Reliable

[0.11135745256081719, 0.18167845421086959]

In [31]:
#Average amount itemized
item_paid_med = loans[loans.loan_status == 'Fully Paid'].avg_itemized.median()
item_chrg_med = loans[loans.loan_status == 'Charged Off'].avg_itemized.median()

loans.groupby('loan_status').avg_itemized.median()


loan_status
Charged Off    23.752441
Fully Paid     24.310692
Name: avg_itemized, dtype: float64

In [32]:
#Effect size
item_paid_med / item_chrg_med
#Small effect

1.0235028790934877

In [33]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].avg_itemized,
                              loans[loans.loan_status == 'Charged Off'].avg_itemized, 99.5, 0.5)
prct
#Reliable

[0.44778617172055846, 0.65321314892618076]

In [34]:
#Average amount of taxable unemployment compensation
unemp_paid_med = loans[loans.loan_status == 'Fully Paid'].avg_unemp.median()
unemp_chrg_med = loans[loans.loan_status == 'Charged Off'].avg_unemp.median()

loans.groupby('loan_status').avg_unemp.median()

loan_status
Charged Off    7.325424
Fully Paid     7.379383
Name: avg_unemp, dtype: float64

In [35]:
#Effect size
unemp_paid_med / unemp_chrg_med
#Very small effect

1.0073659840336695

In [36]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].avg_unemp,
                              loans[loans.loan_status == 'Charged Off'].avg_unemp, 99.5, 0.5)
prct
#Reliable

[0.037183010065774624, 0.089849978490889271]

In [37]:
#Average amount of mortgage interest deductions
mort_intr_paid_med = loans[loans.loan_status == 'Fully Paid'].avg_mort_intr.median()
mort_intr_chrg_med = loans[loans.loan_status == 'Charged Off'].avg_mort_intr.median()

loans.groupby('loan_status').avg_mort_intr.median()

loan_status
Charged Off    9.765733
Fully Paid     9.910703
Name: avg_mort_intr, dtype: float64

In [38]:
#Effect size
mort_intr_paid_med / mort_intr_chrg_med
#Small effect

1.0148447361292101

In [39]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].avg_mort_intr,
                              loans[loans.loan_status == 'Charged Off'].avg_mort_intr, 99.5, 0.5)
prct
#Reliable

[0.13791472791523773, 0.21836008908643834]

In [40]:
#Average amount of educational credit
edu_paid_med = loans[loans.loan_status == 'Fully Paid'].avg_edu.median()
edu_chrg_med = loans[loans.loan_status == 'Charged Off'].avg_edu.median()

loans.groupby('loan_status').avg_edu.median()

loan_status
Charged Off    2.220000
Fully Paid     2.217544
Name: avg_edu, dtype: float64

In [41]:
#Effect size
edu_chrg_med / edu_paid_med
#Extremely small, no way this should be reliable

1.001107594936709

In [42]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].avg_edu,
                              loans[loans.loan_status == 'Fully Paid'].avg_edu, 99.5, 0.5)
prct
#Not Reliable

[-0.0020639834881324148, 0.010758027143330295]

In [43]:
#Average amount of student loan deduction
st_loans_paid_med = loans[loans.loan_status == 'Fully Paid'].avg_st_loans.median()
st_loans_chrg_med = loans[loans.loan_status == 'Charged Off'].avg_st_loans.median()

loans.groupby('loan_status').avg_st_loans.median()

loan_status
Charged Off    1.006260
Fully Paid     1.007403
Name: avg_st_loans, dtype: float64

In [44]:
#Effect size
st_loans_paid_med / st_loans_chrg_med
#Very small, shouldn't be reliable

1.0011355277469616

In [45]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].avg_st_loans,
                              loans[loans.loan_status == 'Fully Paid'].avg_st_loans, 99.5, 0.5)
prct
#So close, I'm ignoring it

[-0.0017925058038980642, -0.00069192448465127399]

In [46]:
#Proportion of SS returns
p_SS_paid_med = loans[loans.loan_status == 'Fully Paid'].prop_SS.median()
p_SS_chrg_med = loans[loans.loan_status == 'Charged Off'].prop_SS.median()

loans.groupby('loan_status').prop_SS.median()

loan_status
Charged Off    0.118104
Fully Paid     0.116293
Name: prop_SS, dtype: float64

In [47]:
#Effect size
p_SS_chrg_med / p_SS_paid_med
#Small effect

1.0155693170677838

In [48]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].prop_SS,
                              loans[loans.loan_status == 'Fully Paid'].prop_SS, 99.5, 0.5)
prct
#Reliable

[0.00035456217134922918, 0.0024302380466321161]

In [49]:
#Proportion of itemized returns
p_item_paid_med = loans[loans.loan_status == 'Fully Paid'].prop_itemized.median()
p_item_chrg_med = loans[loans.loan_status == 'Charged Off'].prop_itemized.median()

loans.groupby('loan_status').prop_itemized.median()

loan_status
Charged Off    0.33108
Fully Paid     0.34438
Name: prop_itemized, dtype: float64

In [50]:
#Effect Size
p_item_paid_med / p_item_chrg_med
#Small effect

1.0401716803704948

In [51]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].prop_itemized,
                              loans[loans.loan_status == 'Charged Off'].prop_itemized, 99.5, 0.5)
prct
#Reliable

[0.010438340664018275, 0.014080862420544051]

In [52]:
#Proportion of unemployment compensation returns
p_unemp_paid_med = loans[loans.loan_status == 'Fully Paid'].prop_unemp.median()
p_unemp_chrg_med = loans[loans.loan_status == 'Charged Off'].prop_unemp.median()

loans.groupby('loan_status').prop_unemp.median()

loan_status
Charged Off    0.089548
Fully Paid     0.087252
Name: prop_unemp, dtype: float64

In [53]:
#Effect size
p_unemp_chrg_med / p_unemp_paid_med
#Small effect

1.0263089682710889

In [54]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].prop_unemp,
                              loans[loans.loan_status == 'Fully Paid'].prop_unemp, 99.5, 0.5)
prct
#Reliable

[0.0009472731122621425, 0.0035332816514737631]

In [55]:
#Proportion of farm returns
p_farm_paid_med = loans[loans.loan_status == 'Fully Paid'].prop_farm.median()
p_farm_chrg_med = loans[loans.loan_status == 'Charged Off'].prop_farm.median()

loans.groupby('loan_status').prop_farm.median()

loan_status
Charged Off    0.002074
Fully Paid     0.002046
Name: prop_farm, dtype: float64

In [56]:
#Effect size
p_farm_chrg_med / p_farm_paid_med
#Small effect

1.0138264923382068

In [57]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].prop_farm,
                              loans[loans.loan_status == 'Fully Paid'].prop_farm, 99.5, 0.5)
prct
#Not Reliable

[0.0, 0.00014550664520866438]

In [58]:
#Proportion of mortgage interest deduction returns
p_mort_intr_paid_med = loans[loans.loan_status == 'Fully Paid'].prop_mort_intr.median()
p_mort_intr_chrg_med = loans[loans.loan_status == 'Charged Off'].prop_mort_intr.median()

loans.groupby('loan_status').prop_mort_intr.median()

loan_status
Charged Off    0.258684
Fully Paid     0.262168
Name: prop_mort_intr, dtype: float64

In [59]:
#Effect size
p_mort_intr_paid_med / p_mort_intr_chrg_med
#Small effect

1.0134684740890052

In [60]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].prop_mort_intr,
                              loans[loans.loan_status == 'Charged Off'].prop_mort_intr, 99.5, 0.5)
prct
#Reliable

[0.0022454538489951603, 0.0056254303365271818]

In [61]:
#Proportion of returns with an educational credit
p_edu_paid_med = loans[loans.loan_status == 'Fully Paid'].prop_edu.median()
p_edu_chrg_med = loans[loans.loan_status == 'Charged Off'].prop_edu.median()

loans.groupby('loan_status').prop_edu.median()

loan_status
Charged Off    0.012222
Fully Paid     0.012526
Name: prop_edu, dtype: float64

In [62]:
#Effect size
p_edu_paid_med / p_edu_chrg_med
#Small effect

1.024879199906008

In [63]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].prop_edu,
                              loans[loans.loan_status == 'Charged Off'].prop_edu, 99.5, 0.5)
prct
#Reliable

[0.00026288811239375334, 0.00033951059520473392]

In [64]:
#Proportion of returns with student loans
p_st_loans_paid_med = loans[loans.loan_status == 'Fully Paid'].prop_st_loans.median()
p_st_loans_chrg_med = loans[loans.loan_status == 'Charged Off'].prop_st_loans.median()

loans.groupby('loan_status').prop_st_loans.median()

loan_status
Charged Off    0.074606
Fully Paid     0.076177
Name: prop_st_loans, dtype: float64

In [65]:
#Effect size
p_st_loans_paid_med / p_st_loans_chrg_med
#Small effect

1.0210597826086956

In [66]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].prop_st_loans,
                              loans[loans.loan_status == 'Charged Off'].prop_st_loans, 99.5, 0.5)
prct
#Reliable

[0.0004610607198060751, 0.001899697461255942]

Zipcode Features:
Reliable: avg_inc (small effect), avg_SS (very small), avg_itemized (small effect), avg_unemp (very small effect),
avg_mort_intr (small effect), prop_SS (small effect), prop_itemized (small effect), prop_unemp (small effect), prop_mort_intr (small effect), prop_edu (small), prop_st_loans (small)


Not Reliable: avg_edu, avg_st_loans (interval excludes zero, but the effect is negligible), prop_farm

Work on text fields:
desc, emp_title

Half of the desc values are NaN, and the field stopped being recorded on 5/19/2014, which makes it irrelevant.
Half of the emp_title values are unique, and the field's content also changes from company name to job title; the most common title accounts for less than 1% of the data.

NLP on desc and emp_title doesn't seem like a good investment of time.

Work on dates:
last_credit_pull_d, earliest_cr_line

In [5]:
#Very few missing pull dates
sum(loans.last_credit_pull_d.isnull())

21

In [6]:
#Very few missing earliest credit lines
sum(loans.earliest_cr_line.isnull())

29

In [9]:
import datetime as dt
base_time = dt.datetime.now()

cr_line = ut.parse_date_series(loans.earliest_cr_line)
pl_date = ut.parse_date_series(loans.last_credit_pull_d)

In [10]:
#Fill null times with the current time
cr_line.fillna(value = base_time, inplace = True)
pl_date.fillna(value = base_time, inplace = True)

In [11]:
#Get the months since first credit and last credit pull
cr_line_months = ut.months_since(base_time, cr_line)
pl_date_months = ut.months_since(base_time, pl_date)

In [39]:
#Check to see if there were zeros other than the original null values
sum(cr_line_months == 0)

29

In [40]:
sum(pl_date_months == 0)

21

In [12]:
#No zeros other than the values filled in
#Change null values into the means
cr_line_months.replace(to_replace=0, value=cr_line_months.mean(), inplace=True)
pl_date_months.replace(to_replace=0, value=pl_date_months.mean(), inplace=True)

In [13]:
#Add the new columns to the data
loans['erlst_cred'] = cr_line_months
loans['last_cred_pl'] = pl_date_months
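
Note on the date conversion: ut.parse_date_series and ut.months_since are not shown here, so this is a rough sketch of one way to get fractional months, not the actual helper. The name months_between and the average month length of 365.25 / 12 days are assumptions. In this version missing dates stay NaN, so they can be filled with the mean directly instead of going through a zero placeholder.

```python
import pandas as pd

DAYS_PER_MONTH = 365.25 / 12  # assumed average month length

def months_between(base_time, dates):
    """Fractional months from each date in `dates` up to `base_time` (hypothetical helper)."""
    delta = pd.Timestamp(base_time) - pd.to_datetime(dates)
    return delta.dt.total_seconds() / (24 * 3600 * DAYS_PER_MONTH)

# Toy example with one missing date
example = pd.Series(['2015-01-01', None, '2000-06-01'])
months = months_between('2016-01-01', example)
# Missing dates are NaN at this point; fill them with the mean of the known values
months = months.fillna(months.mean())
```

Using a fixed base date rather than the current time would also keep the derived columns the same from one run to the next.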

In [43]:
#Number of months of credit
earlst_cred_paid_med = loans[loans.loan_status == 'Fully Paid'].erlst_cred.median()
earlst_cred_chrg_med = loans[loans.loan_status == 'Charged Off'].erlst_cred.median()

loans.groupby('loan_status').erlst_cred.median()

loan_status
Charged Off    197.148815
Fully Paid     202.178946
Name: erlst_cred, dtype: float64

In [44]:
#Effect Size
earlst_cred_paid_med / earlst_cred_chrg_med
#Small effect

1.0255143885829572

In [46]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].erlst_cred,
                              loans[loans.loan_status == 'Charged Off'].erlst_cred, 99.5, 0.5)
prct
#Reliable

[4.0438311848425315, 7.0356087281000157]

In [55]:
#Time since last credit pull
last_cred_paid_med = loans[loans.loan_status == 'Fully Paid'].last_cred_pl.median()
last_cred_cred_chrg_med = loans[loans.loan_status == 'Charged Off'].last_cred_pl.median()

loans.groupby('loan_status').last_cred_pl.median()

loan_status
Charged Off    3.505192
Fully Paid     5.017519
Name: last_cred_pl, dtype: float64

In [56]:
#Effect size
last_cred_paid_med / last_cred_cred_chrg_med
#Large Effect

1.431453482973929

In [61]:
loans[loans.loan_status == 'Fully Paid'].last_cred_pl.describe()

count    199885.000000
mean         10.278077
std          11.787821
min           3.012041
25%           3.012041
50%           5.017519
75%          12.546278
max         106.606448
Name: last_cred_pl, dtype: float64

In [62]:
loans[loans.loan_status == 'Charged Off'].last_cred_pl.describe()

count    45114.000000
mean        11.385819
std         12.632813
min          3.012041
25%          3.012041
50%          3.505192
75%         15.998329
max        102.562617
Name: last_cred_pl, dtype: float64

In [63]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].last_cred_pl,
                              loans[loans.loan_status == 'Charged Off'].last_cred_pl, 99.5, 0.5, reps = 100)
prct
#Reliable
#Strange behavior on bootstrapping: both percentiles are identical (only 100 reps, and the values are heavily tied)
#Keep an eye on this variable

[1.5123271097785103, 1.5123271097785103]
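
A possible explanation for the identical percentiles, not verified against the loan data: when a large share of the values are tied (the describe() output above shows the minimum equal to the 25th percentile), the median of nearly every resample lands on the same value, so the bootstrap distribution collapses to a single point. The toy data below is made up purely to illustrate that effect.

```python
import numpy as np

rng = np.random.default_rng(0)
# 60% of the values share one tied value, mimicking coarse, month-level dates
tied = np.concatenate([np.full(600, 3.0), rng.uniform(3.5, 100.0, size=400)])

# Bootstrap the median 200 times
medians = np.array([
    np.median(rng.choice(tied, size=tied.size, replace=True))
    for _ in range(200)
])
# Count how many different medians the resamples produced
n_distinct = np.unique(medians).size
```

If n_distinct is 1, every percentile of the bootstrap distribution is the same number, which matches the shape of the result above.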

Date fields
Reliable: erlst_cred (small effect), last_cred_pl(large effect, strange)

Not reliable: none

Numeric fields that require processing:
emp_length, mths_since_last_delinq, mths_since_last_record, revol_util, mths_since_last_major_derog, total_credit_rv

In [14]:
#emp_length
#Create a dictionary to change the year strings into numbers
year_dict = {'emp_length':{'10+ years':10, '< 1 year':0, '3 years':3, '9 years':9, '4 years':4, '5 years':5,
       '1 year':1, '6 years':6, '2 years':2, '7 years':7, '8 years':8, 'n/a':-1}}
#Replace the year strings with numbers
loans.replace(year_dict, inplace = True)

In [25]:
#employment length
emp_paid_mn = loans[loans.loan_status == 'Fully Paid'].emp_length.mean()
emp_chrg_mn = loans[loans.loan_status == 'Charged Off'].emp_length.mean()

loans.groupby('loan_status').emp_length.mean()

loan_status
Charged Off    5.773641
Fully Paid     5.806348
Name: emp_length, dtype: float64

In [26]:
#Effect size
emp_paid_mn / emp_chrg_mn
#Very small effect

1.0056649621395004

In [27]:
#Bootstrap
prct = ut.mean_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].emp_length,
                              loans[loans.loan_status == 'Fully Paid'].emp_length, 99.5, 0.5)
prct
#Not reliable

[-0.081428922288216909, 0.016252776422998692]

In [15]:
#revol_util
#NaNs are rare and there are no zeros, so a NaN is probably either a zero or a declined-to-report value
#Replace NaNs with 0%
loans.revol_util.fillna('0%', inplace = True)

#Create a dictionary to convert revol_util strings to floats
revol_ut_dict = {revol_ut: float(revol_ut[0:-1]) for revol_ut in loans.revol_util.unique()}

#Replace string revol_util with floats
loans.replace(to_replace = {'revol_util': revol_ut_dict}, inplace = True)
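
An alternative sketch for the same conversion, using pandas string methods instead of a replacement dictionary. The toy Series is made up; on the real data the same chain would be applied to loans.revol_util, assuming every non-null entry ends in '%'.

```python
import pandas as pd

# Toy stand-in for loans.revol_util
revol_util = pd.Series(['83.7%', '9.4%', None, '21%'])
# Fill missing values, strip the trailing percent sign, convert to float
revol_util_num = revol_util.fillna('0%').str.rstrip('%').astype(float)
```

This avoids building a dictionary with one entry per unique value.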

In [39]:
#revolving credit utilization
rv_util_paid_med = loans[loans.loan_status == 'Fully Paid'].revol_util.median()
rv_util_chrg_med = loans[loans.loan_status == 'Charged Off'].revol_util.median()

loans.groupby('loan_status').revol_util.median()

loan_status
Charged Off    61.6
Fully Paid     54.5
Name: revol_util, dtype: float64

In [40]:
#effect size
rv_util_chrg_med / rv_util_paid_med
#Modest effect

1.1302752293577982

In [41]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].revol_util,
                              loans[loans.loan_status == 'Fully Paid'].revol_util, 99.5, 0.5)
prct
#reliable

[6.6000000000000014, 7.6000000000000014]

In [64]:
#Features requiring imputation
#mths_since_last_delinq, mths_since_last_record, mths_since_last_major_derog, total_credit_rv, acc_now_delinq


In [17]:
#Begin by selecting out the relevant features
loans.columns

Index([u'id', u'member_id', u'loan_amnt', u'funded_amnt', u'funded_amnt_inv',
       u'term', u'int_rate', u'installment', u'grade', u'sub_grade',
       u'emp_title', u'emp_length', u'home_ownership', u'annual_inc',
       u'verification_status', u'issue_d', u'loan_status', u'pymnt_plan',
       u'url', u'desc', u'purpose', u'title', u'zip_code', u'addr_state',
       u'dti', u'delinq_2yrs', u'earliest_cr_line', u'inq_last_6mths',
       u'mths_since_last_delinq', u'mths_since_last_record', u'open_acc',
       u'pub_rec', u'revol_bal', u'revol_util', u'total_acc',
       u'initial_list_status', u'out_prncp', u'out_prncp_inv', u'total_pymnt',
       u'total_pymnt_inv', u'total_rec_prncp', u'total_rec_int',
       u'total_rec_late_fee', u'recoveries', u'collection_recovery_fee',
       u'last_pymnt_d', u'last_pymnt_amnt', u'next_pymnt_d',
       u'last_credit_pull_d', u'collections_12_mths_ex_med',
       u'mths_since_last_major_derog', u'policy_code', u'application_type',
       u'annu

In [23]:
#Drop all columns not being used
#(duplicate entries removed from the list)
drop_list = ['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate',
            'installment', 'grade', 'sub_grade', 'emp_title', 'verification_status', 'issue_d',
            'pymnt_plan', 'url', 'desc', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'initial_list_status',
            'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
            'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt',
            'next_pymnt_d', 'last_credit_pull_d', 'policy_code', 'application_type', 'annual_inc_joint',
            'dti_joint', 'verification_status_joint', 'tot_cur_bal', 'open_acc_6m', 'open_il_6m', 'open_il_12m',
            'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m',
            'max_bal_bc', 'all_util', 'inq_fi', 'total_fi_tl', 'inq_last_12m']
loans.drop(drop_list, axis = 1, inplace = True)

In [24]:
loans.head()

Unnamed: 0,emp_length,home_ownership,annual_inc,loan_status,purpose,dti,delinq_2yrs,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,...,revol_bal,revol_util,total_acc,collections_12_mths_ex_med,mths_since_last_major_derog,acc_now_delinq,tot_coll_amt,total_credit_rv,erlst_cred,last_cred_pl
0,10,RENT,24000,Fully Paid,credit_card,27.65,0,1,,,...,13648,83.7,9,0,,0,,,374.744446,3.533894
1,0,RENT,30000,Charged Off,car,1.0,0,5,,,...,1687,9.4,4,0,,0,,,203.687099,30.525645
2,10,RENT,12252,Fully Paid,small_business,8.72,0,2,,,...,2956,98.5,10,0,,0,,,172.61864,3.533894
3,10,RENT,49200,Fully Paid,other,20.0,0,1,35.0,,...,5598,21.0,37,0,,0,,,241.659661,14.514704
5,3,RENT,36000,Fully Paid,wedding,11.2,0,3,,,...,7963,28.3,12,0,,0,,,136.585803,6.525672


In [25]:
#Change categorical variables into indicator variables
#Change loan status to a number: 'Charged Off' is the signal, so 1; 'Fully Paid' is 0
loans.replace({'loan_status': {'Fully Paid':0, 'Charged Off':1}}, inplace = True)
#Change home_ownership variables (leaving NONE out)
loans['has_mortgage'] = pd.Series(data=loans.home_ownership == 'MORTGAGE', index=loans.index)
loans['has_rent'] = pd.Series(data=loans.home_ownership == 'RENT', index=loans.index)
loans['has_own'] = pd.Series(data=loans.home_ownership == 'OWN', index=loans.index)
loans['has_other'] = pd.Series(data=loans.home_ownership == 'OTHER', index=loans.index)
#Change purpose variables (leaving out 'educational')
loans['for_car'] = pd.Series(data=loans.purpose == 'car', index=loans.index)
loans['for_cc'] = pd.Series(data=loans.purpose == 'credit_card', index=loans.index)
loans['for_debt'] = pd.Series(data=loans.purpose == 'debt_consolidation', index=loans.index)
loans['for_home_imp'] = pd.Series(data=loans.purpose == 'home_improvement', index=loans.index)
loans['for_house'] = pd.Series(data=loans.purpose == 'house', index=loans.index)
loans['for_purchase'] = pd.Series(data=loans.purpose == 'major_purchase', index=loans.index)
loans['for_med'] = pd.Series(data=loans.purpose == 'medical', index=loans.index)
loans['for_move'] = pd.Series(data=loans.purpose == 'moving', index=loans.index)
loans['for_other'] = pd.Series(data=loans.purpose == 'other', index=loans.index)
loans['for_energy'] = pd.Series(data=loans.purpose == 'renewable_energy', index=loans.index)
loans['for_business'] = pd.Series(data=loans.purpose == 'small_business', index=loans.index)
loans['for_vacation'] = pd.Series(data=loans.purpose == 'vacation', index=loans.index)
loans['for_wedding'] = pd.Series(data=loans.purpose == 'wedding', index=loans.index)
#Drop purpose and home_ownership columns
loans.drop(['purpose', 'home_ownership'], axis = 1, inplace = True)
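
A shorter way to build the same kind of indicator columns is pd.get_dummies. This is only a sketch on a made-up frame; the generated column names (purpose_car, home_ownership_RENT, and so on) differ from the has_* and for_* names used above.

```python
import pandas as pd

# Toy stand-in for the two categorical columns
toy = pd.DataFrame({'purpose': ['car', 'educational', 'wedding', 'car'],
                    'home_ownership': ['RENT', 'NONE', 'OWN', 'MORTGAGE']})
# One indicator column per category; the original columns are removed
dummies = pd.get_dummies(toy, columns=['purpose', 'home_ownership'])
# Drop the reference levels by hand, matching the levels left out above
dummies = dummies.drop(['purpose_educational', 'home_ownership_NONE'], axis=1)
```

Dropping the reference levels explicitly keeps the choice of baseline visible, rather than relying on drop_first, which would drop whichever level sorts first.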

In [26]:
#Pull out the labels and features
labels = loans['loan_status']
features = loans.drop('loan_status', axis = 1)

In [27]:
#Create a data frame for imputing the features
drop_list = ['mths_since_last_delinq', 'mths_since_last_record', 'mths_since_last_major_derog',
             'total_credit_rv']
features_impute = features.drop(drop_list, axis = 1)
#Get the features for imputation
since_delinq = features['mths_since_last_delinq']
since_record = features['mths_since_last_record']
since_derog = features['mths_since_last_major_derog']
total_cred = features['total_credit_rv']


In [36]:
#Get the known and unknown components for the values and features
since_delinq_unknown = since_delinq[since_delinq.isnull()]
since_delinq_known = since_delinq[since_delinq.notnull()]
delinq_feats_unknown = features_impute[since_delinq.isnull()]
delinq_feats_known = features_impute[since_delinq.notnull()]

since_record_unknown = since_record[since_record.isnull()]
since_record_known = since_record[since_record.notnull()]
record_feats_unknown = features_impute[since_record.isnull()]
record_feats_known = features_impute[since_record.notnull()]

since_derog_unknown = since_derog[since_derog.isnull()]
since_derog_known = since_derog[since_derog.notnull()]
derog_feats_unknown = features_impute[since_derog.isnull()]
derog_feats_known = features_impute[since_derog.notnull()]

total_cred_unknown = total_cred[total_cred.isnull()]
total_cred_known = total_cred[total_cred.notnull()]
cred_feats_unknown = features_impute[total_cred.isnull()]
cred_feats_known = features_impute[total_cred.notnull()]

3         35
16        61
18         8
27        20
28        18
33        45
41        38
45        45
71        48
72        41
75        40
76        74
79        25
80        35
84        53
91        39
93        10
94        26
98        35
114       77
116       28
117       74
120       28
121       56
123       74
126       52
133       24
136       16
141       60
146       54
          ..
756500     5
756512    44
756513    58
756534    34
756538    58
756544    14
756546    26
756565    35
756566    64
756608    29
756615    32
756623    31
756624    71
756629    49
756636    62
756653    15
756672     5
756677    60
756704    62
756705     5
756752    20
756765    10
756791    29
756804    11
756808    11
756831    73
756845    45
756851    38
756865    16
756870    65
Name: mths_since_last_delinq, dtype: float64

Numeric fields requiring processing.

Reliable: revol_util

Not reliable: emp_length

Imputation protocol:
Variables to use for imputation: emp_length, home_ownership, annual_inc, purpose, dti, delinq_2yrs, earliest_cr_line, inq_last_6mths, open_acc, pub_rec, revol_bal, total_acc, last_credit_pull_d, revol_util, plus any zip_code-based features
Variables with missing values to impute: mths_since_last_delinq, mths_since_last_record, mths_since_last_major_derog, total_credit_rv
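
The protocol does not say which model does the imputing. As one possible sketch, assuming a plain least-squares fit on the rows where the target is known, the missing values could be predicted as below. The function name impute_with_linear_fit and the toy data are hypothetical, and the predictors are assumed to be numeric with no missing values.

```python
import numpy as np
import pandas as pd

def impute_with_linear_fit(target, predictors):
    """Fill NaNs in `target` from a least-squares fit on `predictors` (hypothetical helper)."""
    known = target.notnull().to_numpy()
    # Predictor matrix with an intercept column
    X = np.column_stack([np.ones(len(predictors)), predictors.to_numpy(dtype=float)])
    # Fit only on rows where the target is known
    coef, *_ = np.linalg.lstsq(X[known], target[known].to_numpy(dtype=float), rcond=None)
    filled = target.copy()
    # Predict the rows where the target is missing
    filled.loc[~known] = X[~known] @ coef
    return filled

# Toy example: the target is exactly 5 * open_acc where it is known
toy = pd.DataFrame({'open_acc': [2.0, 4.0, 6.0, 8.0, 10.0],
                    'mths_since_last_delinq': [10.0, 20.0, np.nan, 40.0, np.nan]})
filled = impute_with_linear_fit(toy['mths_since_last_delinq'], toy[['open_acc']])
```

On the real data the known and unknown splits built earlier (for example delinq_feats_known and since_delinq_known) would play the role of the fitted and predicted rows.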
