List of variables that have a reliable relationship with the default rate:
annual_inc, dti, delinq_2yrs, inq_last_6mths, total_acc, open_acc, collections_12_mths_ex_med,
home_ownership (MORTGAGE, RENT),
purpose (car, credit_card, debt_consolidation (very small effect), home_improvement, major_purchase, medical,
moving, other, small_business, wedding)

In [1]:
#Import some modules
%matplotlib inline
import numpy as np
import pandas as pd
import utils as ut


#Load in the data
#loans = pd.read_csv('LoanStats3a.csv', skiprows = 1, 
#                  parse_dates = ['issue_d', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d'],
#                 skipfooter = 2)

df1 = pd.read_csv('LoanStats3a.csv', skiprows = 1, 
                  parse_dates = ['issue_d', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d'],
                 skipfooter = 2)
df2 = pd.read_csv('LoanStats3b.csv', skiprows = 1, 
                  parse_dates = ['issue_d', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d'],
                 skipfooter = 2, infer_datetime_format=True)
df3 = pd.read_csv('LoanStats3c.csv', skiprows = 1, 
                  parse_dates = ['issue_d', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d'],
                 skipfooter = 2, infer_datetime_format=True)
df4 = pd.read_csv('LoanStats3d.csv', skiprows = 1, 
                  parse_dates = ['issue_d', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d'],
                 skipfooter = 2, infer_datetime_format=True)

loans = pd.concat([df1, df2, df3, df4], ignore_index = True)

del df1
del df2
del df3
del df4
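
The four near-identical `read_csv` calls could be folded into one helper and a loop. A minimal sketch, using in-memory text in place of the `LoanStats3*.csv` files; the `read_loan_file` name, the toy contents, and the `'%b-%Y'` date format are assumptions for illustration. `engine='python'` is passed explicitly because `skipfooter` is not supported by the default C parser.

```python
import io
import pandas as pd


def read_loan_file(handle):
    """Read one export: skip the banner line and the two footer lines."""
    df = pd.read_csv(handle, skiprows=1, skipfooter=2, engine='python')
    # Assumed date format (e.g. 'Dec-2011'); adjust to the real files
    df['issue_d'] = pd.to_datetime(df['issue_d'], format='%b-%Y')
    return df


# Toy stand-ins for two of the export files (contents are made up)
raw_a = ("Banner line\n"
         "id,issue_d,loan_amnt\n"
         "1,Dec-2011,5000\n"
         "2,Nov-2011,2500\n"
         "Footer line one\n"
         "Footer line two\n")
raw_b = ("Banner line\n"
         "id,issue_d,loan_amnt\n"
         "3,Jan-2012,9000\n"
         "Footer line one\n"
         "Footer line two\n")

parts = [read_loan_file(io.StringIO(raw)) for raw in (raw_a, raw_b)]
toy_loans = pd.concat(parts, ignore_index=True)
print(toy_loans)
```

With real files, the list comprehension would iterate over the file names instead of the in-memory strings.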



In [2]:
#Begin by cleaning up the data a bit
#Percentages of the different loan statuses
loans.loan_status.value_counts(normalize = True)


Current                                                0.652293
Fully Paid                                             0.261470
Charged Off                                            0.057966
Late (31-120 days)                                     0.014763
In Grace Period                                        0.006518
Late (16-30 days)                                      0.002723
Does not meet the credit policy. Status:Fully Paid     0.002621
Does not meet the credit policy. Status:Charged Off    0.001005
Default                                                0.000634
Does not meet the credit policy. Status:Current        0.000005
dtype: float64

In [3]:
loans.replace(to_replace = 'Does not meet the credit policy. Status:Fully Paid', value = 'Fully Paid', inplace = True)
loans.replace(to_replace = 'Does not meet the credit policy. Status:Charged Off', value = 'Charged Off', inplace = True)
#Assign default to charged off? They seem equivalent from the lender's perspective
loans.replace(to_replace = 'Default', value = 'Charged Off', inplace = True)

loans.loan_status.value_counts(normalize = True)

Current                                            0.652293
Fully Paid                                         0.264091
Charged Off                                        0.059605
Late (31-120 days)                                 0.014763
In Grace Period                                    0.006518
Late (16-30 days)                                  0.002723
Does not meet the credit policy. Status:Current    0.000005
dtype: float64

In [4]:
#Filter out everything but 'Fully Paid' and 'Charged Off'
#We're just going to consider those two categories to simplify the analysis
loans = loans[(loans.loan_status == 'Fully Paid') | (loans.loan_status == 'Charged Off')]
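
The status clean-up in the cells above can also be written as a single mapping applied to the `loan_status` column only, which avoids scanning every column of the frame. A minimal sketch on a toy Series; the `status_map` name and the sample values are illustrative assumptions.

```python
import pandas as pd

# Toy statuses standing in for loans.loan_status
status = pd.Series([
    'Current',
    'Fully Paid',
    'Charged Off',
    'Default',
    'Does not meet the credit policy. Status:Fully Paid',
    'Does not meet the credit policy. Status:Charged Off',
])

# Fold the credit-policy variants and 'Default' into the two outcome classes
status_map = {
    'Does not meet the credit policy. Status:Fully Paid': 'Fully Paid',
    'Does not meet the credit policy. Status:Charged Off': 'Charged Off',
    'Default': 'Charged Off',
}
clean = status.replace(status_map)

# Keep only the two finished outcomes
finished = clean[clean.isin(['Fully Paid', 'Charged Off'])]
print(finished.value_counts())
```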

In [16]:
loans.columns

Index([u'id', u'member_id', u'loan_amnt', u'funded_amnt', u'funded_amnt_inv',
       u'term', u'int_rate', u'installment', u'grade', u'sub_grade',
       u'emp_title', u'emp_length', u'home_ownership', u'annual_inc',
       u'verification_status', u'issue_d', u'loan_status', u'pymnt_plan',
       u'url', u'desc', u'purpose', u'title', u'zip_code', u'addr_state',
       u'dti', u'delinq_2yrs', u'earliest_cr_line', u'inq_last_6mths',
       u'mths_since_last_delinq', u'mths_since_last_record', u'open_acc',
       u'pub_rec', u'revol_bal', u'revol_util', u'total_acc',
       u'initial_list_status', u'out_prncp', u'out_prncp_inv', u'total_pymnt',
       u'total_pymnt_inv', u'total_rec_prncp', u'total_rec_int',
       u'total_rec_late_fee', u'recoveries', u'collection_recovery_fee',
       u'last_pymnt_d', u'last_pymnt_amnt', u'next_pymnt_d',
       u'last_credit_pull_d', u'collections_12_mths_ex_med',
       u'mths_since_last_major_derog', u'policy_code', u'application_type',
       u'annu

Inferential statistics on the features of this data set, beginning with the numeric fields that have little to no missing data and require no processing.
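
One way to find the fields with little to no missing data is to rank columns by their fraction of missing values. A minimal sketch on a toy frame; the `missing_fraction` helper, the made-up values, and the 25% cut-off are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def missing_fraction(df):
    """Return the fraction of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)


# Toy stand-in for the loans frame (values are made up)
toy = pd.DataFrame({
    'annual_inc': [50000.0, 64000.0, 72000.0, 56000.0],
    'dti': [15.7, 18.3, np.nan, 12.1],
    'mths_since_last_record': [np.nan, np.nan, np.nan, 40.0],
})

frac = missing_fraction(toy)
# Keep columns with at most 25% missing (the threshold is arbitrary)
usable = frac[frac <= 0.25].index.tolist()
print(frac)
print(usable)
```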

In [17]:
#Check out the effect of annual income
#Get the annual income for the charged off and fully paid categories
#Get the median incomes
inc_paid_med = loans[loans.loan_status == 'Fully Paid'].annual_inc.median()
inc_chrg_med = loans[loans.loan_status == 'Charged Off'].annual_inc.median()

loans.groupby('loan_status').annual_inc.median()

loan_status
Charged Off    56000
Fully Paid     64000
Name: annual_inc, dtype: float64

In [19]:
#Look at the effect size for the incomes
inc_paid_med / inc_chrg_med
#Modest effect size

1.1428571428571428

In [20]:
#Bootstrap the median incomes for the two categories and see if that effect is significant
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].annual_inc,
                             loans[loans.loan_status == 'Charged Off'].annual_inc, 99.5, 0.5)
prct
#Reliable

[7024.9849999999997, 8773.0099999999984]
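
`ut.med_diff_bootstrap` lives in a local `utils` module that is not shown here. Judging from the call (two samples, then an upper and a lower percentile) and the two-element output, it probably resembles a percentile bootstrap of the difference in medians. The sketch below is an assumption about that behaviour, not the original helper; `diff_bootstrap`, `n_boot`, and `seed` are names introduced for illustration, and the data are synthetic.

```python
import numpy as np


def diff_bootstrap(a, b, stat=np.median, upper=99.5, lower=0.5,
                   n_boot=1000, seed=0):
    """Percentile bootstrap interval for stat(a) - stat(b).

    Returns [lower percentile, upper percentile] of the resampled differences.
    """
    rng = np.random.RandomState(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Drop missing values so the statistic is defined
    a = a[~np.isnan(a)]
    b = b[~np.isnan(b)]
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement, independently
        a_star = a[rng.randint(0, len(a), len(a))]
        b_star = b[rng.randint(0, len(b), len(b))]
        diffs[i] = stat(a_star) - stat(b_star)
    return [np.percentile(diffs, lower), np.percentile(diffs, upper)]


# Synthetic incomes: the "paid" group is shifted up by 10000
rng = np.random.RandomState(1)
paid = rng.lognormal(mean=11.0, sigma=0.5, size=5000) + 10000
chrg = rng.lognormal(mean=11.0, sigma=0.5, size=5000)

lo, hi = diff_bootstrap(paid, chrg)
# "Reliable" in the notebook's sense: the interval excludes zero
reliable = not (lo <= 0 <= hi)
print(lo, hi, reliable)
```

Under this reading, an interval that excludes zero is what the comments call reliable. Passing `stat=np.mean` would give a rough analogue of `ut.mean_diff_bootstrap`.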

In [21]:
#Check out the effect of dti
#Get the dti for the charged off and fully paid categories
#Get the median dtis
dti_paid_med = loans[loans.loan_status == 'Fully Paid'].dti.median()
dti_chrg_med = loans[loans.loan_status == 'Charged Off'].dti.median()

loans.groupby('loan_status').dti.median() 

loan_status
Charged Off    18.27
Fully Paid     15.69
Name: dti, dtype: float64

In [22]:
#Look at the effect size for the dtis
dti_chrg_med / dti_paid_med
#Modest effect size

1.1644359464627152

In [23]:
#Bootstrap the dti for the two categories and see if that effect is significant
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].dti,
                             loans[loans.loan_status == 'Fully Paid'].dti, 99.5, 0.5)
prct
#Reliable

[2.4400000000000013, 2.7000000000000011]

In [24]:
#Check out the effect of delinquencies in the last two years
#Get the number of delinquencies for the charged off and fully paid categories
#Get the mean delinquencies
dlq_paid_mn = loans[loans.loan_status == 'Fully Paid'].delinq_2yrs.mean()
dlq_chrg_mn = loans[loans.loan_status == 'Charged Off'].delinq_2yrs.mean()

loans.groupby('loan_status').delinq_2yrs.mean() 

loan_status
Charged Off    0.27490
Fully Paid     0.24146
Name: delinq_2yrs, dtype: float64

In [26]:
#Look at the effect size for the delinquencies
dlq_chrg_mn / dlq_paid_mn
#Modest effect size

1.1384884893215346

In [27]:
#Bootstrap the delinquencies for the two categories and see if that effect is significant
prct = ut.mean_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].delinq_2yrs,
                              loans[loans.loan_status == 'Fully Paid'].delinq_2yrs, 99.5, 0.5)
prct
#Reliable

[0.022954457309100756, 0.043943470247482859]

In [28]:
#Check out the effect of inquiries in the last six months
#Get the inquiries for the charged off and fully paid categories
#Get the mean inquiries
inq_paid_mn = loans[loans.loan_status == 'Fully Paid'].inq_last_6mths.mean()
inq_chrg_mn = loans[loans.loan_status == 'Charged Off'].inq_last_6mths.mean()

loans.groupby('loan_status').inq_last_6mths.mean() 

loan_status
Charged Off    1.047217
Fully Paid     0.858365
Name: inq_last_6mths, dtype: float64

In [30]:
#Look at the effect size for the inquiries
inq_chrg_mn / inq_paid_mn
#Moderate effect

1.2200132638223962

In [31]:
#Bootstrap the inquiries for the two categories and see if that effect is significant
prct = ut.mean_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].inq_last_6mths,
                              loans[loans.loan_status == 'Fully Paid'].inq_last_6mths, 99.5, 0.5)
prct
#Reliable

[0.17240098985823846, 0.2063343733412828]

In [32]:
#Check out the effect of number of open accounts
#Get the number of open accounts for the charged off and fully paid categories
#Get the mean number of accounts
opn_paid_mn = loans[loans.loan_status == 'Fully Paid'].open_acc.mean()
opn_chrg_mn = loans[loans.loan_status == 'Charged Off'].open_acc.mean()

loans.groupby('loan_status').open_acc.mean()

loan_status
Charged Off    11.008180
Fully Paid     10.875742
Name: open_acc, dtype: float64

In [33]:
#Look at the effect size for the number of open accounts
opn_paid_mn / opn_chrg_mn
#Small effect

0.98796918047848403

In [34]:
#Bootstrap the open accounts for the two categories and see if that effect is significant
prct = ut.mean_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].open_acc,
                              loans[loans.loan_status == 'Charged Off'].open_acc, 99.5, 0.5)
prct
#Looks like the effect is reliable, but it is small enough to be suspect

[-0.19623503540697443, -0.066461859835128423]

In [48]:
#Check out the effect of number of derogatory public records
#Get the number of public records for the charged off and fully paid categories
#Get the mean number of derogatory public records
pub_paid_mn = loans[loans.loan_status == 'Fully Paid'].pub_rec.mean()
pub_chrg_mn = loans[loans.loan_status == 'Charged Off'].pub_rec.mean()

loans.groupby('loan_status').pub_rec.mean()


loan_status
Charged Off    0.143180
Fully Paid     0.139283
Name: pub_rec, dtype: float64

In [49]:
#Look at the effect size for the number of derogatory public records
pub_chrg_mn / pub_paid_mn
#Small effect

1.0279786874048025

In [37]:
#Bootstrap the public records for the two categories and see if that effect is significant
prct = ut.mean_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].pub_rec,
                              loans[loans.loan_status == 'Fully Paid'].pub_rec, 99.5, 0.5)
prct
#Not reliable

[-0.0018740016053750276, 0.009818523648975357]

In [38]:
#Check out the effect of revolving balance
#Get the revolving balance for the charged off and fully paid categories
#Get the median revolving balance 
rev_bal_paid_med = loans[loans.loan_status == 'Fully Paid'].revol_bal.median()
rev_bal_chrg_med = loans[loans.loan_status == 'Charged Off'].revol_bal.median()

loans.groupby('loan_status').revol_bal.median()

loan_status
Charged Off    11301.5
Fully Paid     10805.0
Name: revol_bal, dtype: float64

In [39]:
#Effect size
rev_bal_chrg_med / rev_bal_paid_med
#Small effect

1.0459509486348912

In [40]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].revol_bal, 
                             loans[loans.loan_status == 'Fully Paid'].revol_bal, 99.5, 0.5)
prct
#reliable

[352.0, 647.0]

In [42]:
#Total number of accounts
tot_acc_paid_mn = loans[loans.loan_status == 'Fully Paid'].total_acc.mean()
tot_acc_chrg_mn = loans[loans.loan_status == 'Charged Off'].total_acc.mean()

loans.groupby('loan_status').total_acc.mean() 

loan_status
Charged Off    24.141296
Fully Paid     25.138142
Name: total_acc, dtype: float64

In [43]:
#Effect size
tot_acc_paid_mn  / tot_acc_chrg_mn 
#Small effect

1.041292169210503

In [44]:
#Bootstrap
prct = ut.mean_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].total_acc,
                              loans[loans.loan_status == 'Charged Off'].total_acc, 99.5, 0.5)
prct
#Seems to be reliable, but not a very big effect

[0.8410510951780743, 1.148117810622356]

In [45]:
#Collections in the last 12 months excluding medical collections
col_paid_mn = loans[loans.loan_status == 'Fully Paid'].collections_12_mths_ex_med.mean()
col_chrg_mn = loans[loans.loan_status == 'Charged Off'].collections_12_mths_ex_med.mean()

loans.groupby('loan_status').collections_12_mths_ex_med.mean()

#Field seems to be missing some data

loan_status
Charged Off    0.007741
Fully Paid     0.006107
Name: collections_12_mths_ex_med, dtype: float64

In [46]:
#Effect size
col_chrg_mn  / col_paid_mn
#Moderate effect, but missing data?

1.2675053795187776

In [47]:
#Bootstrap
prct = ut.mean_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].collections_12_mths_ex_med,
                              loans[loans.loan_status == 'Fully Paid'].collections_12_mths_ex_med, 99.5, 0.5)
prct
#Effect is reliable

[0.00044366518691515099, 0.002854148549792756]

In [50]:
#Number of accounts now delinquent
dlq_acc_paid_mn = loans[loans.loan_status == 'Fully Paid'].acc_now_delinq.mean()
dlq_acc_chrg_mn = loans[loans.loan_status == 'Charged Off'].acc_now_delinq.mean()

loans.groupby('loan_status').acc_now_delinq.mean()

loan_status
Charged Off    0.003879
Fully Paid     0.002827
Name: acc_now_delinq, dtype: float64

In [51]:
#Effect size
dlq_acc_chrg_mn / dlq_acc_paid_mn
#Large effect

1.3722424705392382

In [52]:
#Bootstrap
prct = ut.mean_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].acc_now_delinq,
                              loans[loans.loan_status == 'Fully Paid'].acc_now_delinq, 99.5, 0.5)
prct
#Also reliable

[0.00017702472440258958, 0.0019903687465753829]

Categorical variables:
Relatively few categories: home_ownership, purpose
Lots of categories: addr_state, zip_code

In [4]:
chrg_off_rate = len(loans[loans.loan_status == 'Charged Off'].loan_status) / float(len(loans.loan_status))
chrg_off_rate

0.1841395271001106

In [5]:
#home ownership
#There is one loan with home_ownership == 'ANY', add that to 'OTHER'
loans.home_ownership.replace(to_replace = 'ANY', value = 'OTHER', inplace = True)
#Print the fully paid and charged off rates for each ownership category
loans.groupby('home_ownership').loan_status.value_counts(normalize = True).unstack()

                Fully Paid  Charged Off
home_ownership
MORTGAGE          0.835505     0.164495
NONE              0.829787     0.170213
OTHER             0.787709     0.212291
OWN               0.811723     0.188277
RENT              0.793776     0.206224


In [10]:
#is_mort
eff_size = (chrg_off_rate / 
    loans[loans.home_ownership == 'MORTGAGE'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean())
eff_size
#Modest effect of is_mort

1.1194209091762977

In [11]:
prct = ut.mean_bootstrap(
    loans[loans.home_ownership == 'MORTGAGE'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#Effect is reliable

[0.161742103211789, 0.16729834061714779]
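
For the categorical checks, `ut.mean_bootstrap` takes a single 0/1 series, and the comments appear to call an effect reliable when the resulting interval excludes the overall charge-off rate (`chrg_off_rate`, about 0.184). The helper itself is not shown, so the following is a sketch under that assumption; `excludes`, `n_boot`, and `seed` are hypothetical names and the data are synthetic.

```python
import numpy as np


def mean_bootstrap(x, upper=99.5, lower=0.5, n_boot=2000, seed=0):
    """Percentile bootstrap interval for the mean of a 0/1 indicator."""
    rng = np.random.RandomState(seed)
    x = np.asarray(x, dtype=float)
    means = np.empty(n_boot)
    for i in range(n_boot):
        # Resample the group with replacement and record its mean
        means[i] = x[rng.randint(0, len(x), len(x))].mean()
    return [np.percentile(means, lower), np.percentile(means, upper)]


def excludes(interval, baseline):
    """True when the baseline rate lies outside the interval."""
    lo, hi = interval
    return baseline < lo or baseline > hi


# Synthetic subgroup: 10,000 loans with a 16% charge-off rate (made up)
flags = np.zeros(10000)
flags[:1600] = 1

ci = mean_bootstrap(flags)
print(ci, excludes(ci, 0.184))
```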

In [12]:
#is_none
eff_size = (chrg_off_rate / 
    loans[loans.home_ownership == 'NONE'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean())
eff_size
#Modest effect of is_none

1.0818197217131498

In [13]:
prct = ut.mean_bootstrap(
    loans[loans.home_ownership == 'NONE'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#Effect is not reliable

[0.042553191489361701, 0.31914893617021278]

In [16]:
#is_other
eff_size = (
    loans[loans.home_ownership == 'OTHER'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Modest effect of is_other

1.1528785054274671

In [17]:
prct = ut.mean_bootstrap(
    loans[loans.home_ownership == 'OTHER'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#Effect is not reliable

[0.13407821229050279, 0.29608938547486036]

In [8]:
#is_own
eff_size = (
    loans[loans.home_ownership == 'OWN'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Small effect of is_own

1.0224701578129374

In [6]:
prct = ut.mean_bootstrap(
    loans[loans.home_ownership == 'OWN'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#Effect is not reliable

[0.18144257237778832, 0.19501661129568107]

In [18]:
#is_rent
eff_size = (
    loans[loans.home_ownership == 'RENT'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Modest effect of is_rent

1.1199336806700069

In [19]:
prct = ut.mean_bootstrap(
    loans[loans.home_ownership == 'RENT'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#Effect is reliable

[0.20299468559680359, 0.20954075992086582]

In [22]:
#loan purpose
#Print the fully paid and charged off rates for each purpose category
loans.groupby('purpose').loan_status.value_counts(normalize = True).unstack()

                    Fully Paid  Charged Off
purpose
car                   0.873259     0.126741
credit_card           0.839346     0.160654
debt_consolidation    0.808992     0.191008
educational           0.791469     0.208531
home_improvement      0.841063     0.158937
house                 0.820905     0.179095
major_purchase        0.856680     0.143320
medical               0.793521     0.206479
moving                0.785607     0.214393
other                 0.788047     0.211953


In [23]:
#p_is_car
eff_size = (chrg_off_rate /
    loans[loans.purpose == 'car'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean())
eff_size
#Large effect of is_car

1.4528811039327409

In [24]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'car'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#Effect is reliable

[0.11253481894150417, 0.14150417827298051]

In [25]:
#p_is_credit
eff_size = (chrg_off_rate /
    loans[loans.purpose == 'credit_card'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean())
eff_size
#Modest effect of is_credit

1.1461892165678129

In [26]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'credit_card'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#reliable

[0.15646924117830749, 0.16508795669824086]

In [27]:
#p_is_debt
eff_size = (
    loans[loans.purpose == 'debt_consolidation'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Small effect of is_debt

1.0372996019357483

In [28]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'debt_consolidation'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#reliable, but so small it probably doesn't matter much

[0.18837205318940686, 0.19374143572247546]

In [29]:
#p_is_edu
eff_size = (
    loans[loans.purpose == 'educational'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Modest effect of is_edu

1.1324608516770678

In [30]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'educational'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#not reliable

[0.15876777251184834, 0.26067535545023507]

In [31]:
#p_is_home_imp
eff_size = (chrg_off_rate /
    loans[loans.purpose == 'home_improvement'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean())
eff_size
#Modest effect of is_home_imp

1.1585678452803312

In [32]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'home_improvement'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#reliable

[0.15113871635610765, 0.16687370600414078]

In [33]:
#p_is_house
eff_size = (chrg_off_rate /
    loans[loans.purpose == 'house'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean())
eff_size
#Small effect of is_house

1.0281647315214366

In [34]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'house'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#not reliable

[0.15586491442542788, 0.20415647921760391]

In [35]:
#p_is_major
eff_size = (chrg_off_rate /
    loans[loans.purpose == 'major_purchase'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean())
eff_size
#Large effect of is_major

1.2848106370514505

In [36]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'major_purchase'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#reliable

[0.13197146562905318, 0.15466926070038911]

In [37]:
#p_is_med
eff_size = (
    loans[loans.purpose == 'medical'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Modest effect of is_med

1.1213191286792417

In [38]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'medical'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#reliable

[0.18689925240299038, 0.22641509433962265]

In [39]:
#p_is_moving
eff_size = (
    loans[loans.purpose == 'moving'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Modest effect of is_moving

1.1642953958583948

In [40]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'moving'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#reliable

[0.19140429785107446, 0.23888055972013994]

In [41]:
#p_is_other
eff_size = (
    loans[loans.purpose == 'other'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Modest effect of is_other

1.1510439902927163

In [42]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'other'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#reliable

[0.2032943826552161, 0.22089258059974659]

In [43]:
#p_is_renew
eff_size = (
    loans[loans.purpose == 'renewable_energy'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Modest effect of is_renew

1.127119055206893

In [44]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'renewable_energy'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#not reliable

[0.14339622641509434, 0.27547169811320754]

In [45]:
#p_is_small_bus
eff_size = (
    loans[loans.purpose == 'small_business'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean() /
    chrg_off_rate)
eff_size
#Large effect of small business

1.6138188872884882

In [46]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'small_business'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#reliable

[0.28050812161599331, 0.31424406497292795]

In [47]:
#p_is_vac
eff_size = (chrg_off_rate /
    loans[loans.purpose == 'vacation'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean())
eff_size
#small effect of vacation

1.0369609135601119

In [48]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'vacation'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#not reliable

[0.15165262475696695, 0.20220349967595594]

In [49]:
#p_is_wed
eff_size = (chrg_off_rate /
    loans[loans.purpose == 'wedding'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]).mean())
eff_size
#large effect of wedding

1.3268682169379811

In [50]:
prct = ut.mean_bootstrap(
    loans[loans.purpose == 'wedding'].loan_status.replace(['Fully Paid', 'Charged Off'], value = [0, 1]),
    99.5, 0.5)
prct
#reliable (interval lies below the overall rate of 0.184)

[0.1192384769539078, 0.15881763527054107]

Effects of categorical variables requiring minimal processing:
Reliable:
home_ownership: MORTGAGE, RENT
purpose: car, credit_card, debt_consolidation (very small effect), home_improvement, major_purchase, medical,
moving, other, small_business, wedding

Not Reliable:
home_ownership: OWN, NONE, OTHER
purpose: educational, house, renewable_energy, vacation
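
The per-category cells all follow one pattern: encode `Charged Off` as 1, take the mean within a category, and compare it with the overall rate. A minimal sketch that computes this for every category at once on a toy frame; the `charged_off` column and the `summary` frame are names introduced here, and the data are made up. Note that the cells above flip the ratio so that it is always above 1, whereas this sketch always divides the category rate by the overall rate.

```python
import pandas as pd

# Toy stand-in for the loans frame
toy = pd.DataFrame({
    'purpose': ['car'] * 4 + ['small_business'] * 4,
    'loan_status': ['Fully Paid', 'Fully Paid', 'Fully Paid', 'Charged Off',
                    'Fully Paid', 'Charged Off', 'Charged Off', 'Charged Off'],
})

# 1 for a charged-off loan, 0 for a fully paid one
toy['charged_off'] = (toy.loan_status == 'Charged Off').astype(int)

overall = toy.charged_off.mean()
rates = toy.groupby('purpose').charged_off.mean()

# Ratio below 1: fewer charge-offs than average; above 1: more
summary = pd.DataFrame({'rate': rates, 'ratio_to_overall': rates / overall})
print(summary)
```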

In [6]:
#Read in IRS zipcode data
zip_data = pd.read_pickle('anon_avg_irs_data.pkl')

In [8]:
zip_data.columns

Index([u'adj_gross_inc', u'amt_SS', u'amt_edu', u'amt_itemized',
       u'amt_mort_intr', u'amt_st_loans', u'amt_unemp', u'n_SS', u'n_edu',
       u'n_farm', u'n_itemized', u'n_mort_intr', u'n_returns', u'n_st_loans',
       u'n_unemp', u'zipcd', u'avg_inc', u'avg_SS', u'avg_itemized',
       u'avg_unemp', u'avg_mort_intr', u'avg_edu', u'avg_st_loans', u'prop_SS',
       u'prop_itemized', u'prop_unemp', u'prop_farm', u'prop_mort_intr',
       u'prop_edu', u'prop_st_loans'],
      dtype='object')

In [None]:
#Create a dictionary to convert anonymized string zipcodes into integers
code_dict = {code: int(code[0:3]) for code in loans.zip_code.unique()}
#Replace string zipcodes with integer zipcodes
loans.replace(to_replace = {'zip_code': code_dict}, inplace = True)

In [None]:
#Drop unnecessary columns from zipcode data
zip_data.drop(['adj_gross_inc', 'amt_SS', 'amt_edu', 'amt_itemized',
       'amt_mort_intr', 'amt_st_loans', 'amt_unemp', 'n_SS', 'n_edu',
       'n_farm', 'n_itemized', 'n_mort_intr', 'n_returns', 'n_st_loans',
       'n_unemp'], axis = 1, inplace=True)
#Rename the zipcd column so that we can do a join
zip_data.rename(columns = {'zipcd':'zip_code'}, inplace = True)
zip_data.head()

In [34]:
#Join zipcode data to loan data on zip_code
loans = loans.join(zip_data, on = 'zip_code', rsuffix = 'r')
loans.drop('zip_coder', axis = 1, inplace = True)
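
`DataFrame.join` with `on=` matches the named column of the left frame against the index of the right frame, so the join above only lines up if `zip_data` is indexed by the three-digit prefix. An alternative is to merge on the column directly. A minimal sketch with toy frames; the frame names and values are made up, and the `'941xx'` input format is assumed from the `code[0:3]` slicing above.

```python
import pandas as pd

# Toy stand-ins: anonymized string zip codes on the loan side,
# integer three-digit prefixes on the IRS side
loans_toy = pd.DataFrame({'zip_code': ['941xx', '100xx', '606xx'],
                          'loan_amnt': [5000, 12000, 8000]})
zip_toy = pd.DataFrame({'zip_code': [100, 606, 941],
                        'avg_inc': [80.1, 61.5, 95.2]})

# '941xx' -> 941, vectorized instead of a per-value dictionary
loans_toy['zip_code'] = loans_toy.zip_code.str[:3].astype(int)

# A left merge keeps every loan, even when a prefix has no IRS row
merged = loans_toy.merge(zip_toy, on='zip_code', how='left')
print(merged)
```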

Look at the effects of the features extracted from the zipcode data (note that dollar amounts are in thousands):
avg_inc, avg_SS, avg_itemized, avg_unemp, avg_mort_intr, avg_edu, avg_st_loans, prop_SS,
prop_itemized, prop_unemp, prop_farm, prop_mort_intr, prop_edu, prop_st_loans

In [38]:
#Average income of the region
reg_inc_paid_med = loans[loans.loan_status == 'Fully Paid'].avg_inc.median()
reg_inc_chrg_med = loans[loans.loan_status == 'Charged Off'].avg_inc.median()

loans.groupby('loan_status').avg_inc.median()

loan_status
Charged Off    56.675219
Fully Paid     58.883092
Name: avg_inc, dtype: float64

In [39]:
#Effect size
reg_inc_paid_med / reg_inc_chrg_med
#Small effect

1.0389565776690952

In [41]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].avg_inc,
                              loans[loans.loan_status == 'Charged Off'].avg_inc, 99.5, 0.5)
prct
#Reliable

[1.8863152713123483, 2.3753684291614121]

In [42]:
#Average SS income of the region
SS_paid_med = loans[loans.loan_status == 'Fully Paid'].avg_SS.median()
SS_chrg_med = loans[loans.loan_status == 'Charged Off'].avg_SS.median()

loans.groupby('loan_status').avg_SS.median()

loan_status
Charged Off    12.283921
Fully Paid     12.424060
Name: avg_SS, dtype: float64

In [43]:
#Effect size
SS_paid_med / SS_chrg_med
#very small effect

1.0114082830826128

In [44]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].avg_SS,
                              loans[loans.loan_status == 'Charged Off'].avg_SS, 99.5, 0.5)
prct
#Reliable

[0.11135745256081542, 0.18167845421086959]

In [46]:
#Average amount itemized
item_paid_med = loans[loans.loan_status == 'Fully Paid'].avg_itemized.median()
item_chrg_med = loans[loans.loan_status == 'Charged Off'].avg_itemized.median()

loans.groupby('loan_status').avg_itemized.median()


loan_status
Charged Off    23.752402
Fully Paid     24.310692
Name: avg_itemized, dtype: float64

In [47]:
#Effect size
item_paid_med / item_chrg_med
#Small effect

1.023504556169143

In [48]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].avg_itemized,
                              loans[loans.loan_status == 'Charged Off'].avg_itemized, 99.5, 0.5)
prct
#Reliable

[0.44778617172056201, 0.65132502713042939]

In [49]:
#Average amount of taxable unemployment compensation
unemp_paid_med = loans[loans.loan_status == 'Fully Paid'].avg_unemp.median()
unemp_chrg_med = loans[loans.loan_status == 'Charged Off'].avg_unemp.median()

loans.groupby('loan_status').avg_unemp.median()

loan_status
Charged Off    7.325424
Fully Paid     7.379383
Name: avg_unemp, dtype: float64

In [51]:
#Effect size
unemp_paid_med / unemp_chrg_med
#Very small effect

1.0073659840336693

In [52]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].avg_unemp,
                              loans[loans.loan_status == 'Charged Off'].avg_unemp, 99.5, 0.5)
prct
#Reliable

[0.037183010065772848, 0.089849978490887494]

In [53]:
#Average amount of mortgage interest deductions
mort_intr_paid_med = loans[loans.loan_status == 'Fully Paid'].avg_mort_intr.median()
mort_intr_chrg_med = loans[loans.loan_status == 'Charged Off'].avg_mort_intr.median()

loans.groupby('loan_status').avg_mort_intr.median()

loan_status
Charged Off    9.765733
Fully Paid     9.909097
Name: avg_mort_intr, dtype: float64

In [54]:
#Effect size
mort_intr_paid_med / mort_intr_chrg_med
#Small effect

1.0146802665860144

In [55]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].avg_mort_intr,
                              loans[loans.loan_status == 'Charged Off'].avg_mort_intr, 99.5, 0.5)
prct
#Reliable

[0.13791472791523596, 0.21836008908644366]

In [56]:
#Average amount of educational credit
edu_paid_med = loans[loans.loan_status == 'Fully Paid'].avg_edu.median()
edu_chrg_med = loans[loans.loan_status == 'Charged Off'].avg_edu.median()

loans.groupby('loan_status').avg_edu.median()

loan_status
Charged Off    2.220000
Fully Paid     2.217544
Name: avg_edu, dtype: float64

In [57]:
#Effect size
edu_chrg_med / edu_paid_med
#Extremely small, no way this should be reliable

1.0011075949367092

In [58]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].avg_edu,
                              loans[loans.loan_status == 'Fully Paid'].avg_edu, 99.5, 0.5)
prct
#Not reliable (interval includes zero)

[-0.0020639834881324148, 0.0097833682739341121]

In [59]:
#Average amount of student loan deduction
st_loans_paid_med = loans[loans.loan_status == 'Fully Paid'].avg_st_loans.median()
st_loans_chrg_med = loans[loans.loan_status == 'Charged Off'].avg_st_loans.median()

loans.groupby('loan_status').avg_st_loans.median()
#So small I'm not even bothering with an effect size or bootstrapping

loan_status
Charged Off    1.006260
Fully Paid     1.007403
Name: avg_st_loans, dtype: float64

In [60]:
#Proportion of SS returns
p_SS_paid_med = loans[loans.loan_status == 'Fully Paid'].prop_SS.median()
p_SS_chrg_med = loans[loans.loan_status == 'Charged Off'].prop_SS.median()

loans.groupby('loan_status').prop_SS.median()

loan_status
Charged Off    0.118132
Fully Paid     0.116293
Name: prop_SS, dtype: float64

In [61]:
#Effect size
p_SS_chrg_med / p_SS_paid_med
#Small effect

1.015815436690911

In [62]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].prop_SS,
                              loans[loans.loan_status == 'Fully Paid'].prop_SS, 99.5, 0.5)
prct
#Reliable

[0.00039461880905591884, 0.0024026557863610909]

In [63]:
#Proportion of itemized returns
p_item_paid_med = loans[loans.loan_status == 'Fully Paid'].prop_itemized.median()
p_item_chrg_med = loans[loans.loan_status == 'Charged Off'].prop_itemized.median()

loans.groupby('loan_status').prop_itemized.median()

loan_status
Charged Off    0.331080
Fully Paid     0.344376
Name: prop_itemized, dtype: float64

In [65]:
#Effect Size
p_item_paid_med / p_item_chrg_med
#Small effect

1.0401585808729137

In [66]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].prop_itemized,
                              loans[loans.loan_status == 'Charged Off'].prop_itemized, 99.5, 0.5)
prct
#Reliable

[0.010566025880013874, 0.01408084073561694]

In [67]:
#Proportion of unemployment compensation returns
p_unemp_paid_med = loans[loans.loan_status == 'Fully Paid'].prop_unemp.median()
p_unemp_chrg_med = loans[loans.loan_status == 'Charged Off'].prop_unemp.median()

loans.groupby('loan_status').prop_unemp.median()

loan_status
Charged Off    0.089548
Fully Paid     0.087252
Name: prop_unemp, dtype: float64

In [68]:
#Effect size
p_unemp_chrg_med / p_unemp_paid_med
#Small effect

1.0263089682710886

In [69]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].prop_unemp,
                              loans[loans.loan_status == 'Fully Paid'].prop_unemp, 99.5, 0.5)
prct
#Reliable

[0.00092591159335435325, 0.0035332816514737631]

In [70]:
#Proportion of farm returns
p_farm_paid_med = loans[loans.loan_status == 'Fully Paid'].prop_farm.median()
p_farm_chrg_med = loans[loans.loan_status == 'Charged Off'].prop_farm.median()

loans.groupby('loan_status').prop_farm.median()

loan_status
Charged Off    0.002074
Fully Paid     0.002046
Name: prop_farm, dtype: float64

In [71]:
#Effect size
p_farm_chrg_med / p_farm_paid_med
#Small effect

1.0138264923382065

In [72]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].prop_farm,
                              loans[loans.loan_status == 'Fully Paid'].prop_farm, 99.5, 0.5)
prct
#Not reliable (the interval includes 0)

[0.0, 0.00014550664520866395]
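
The "Reliable" / "Not reliable" labels in these cells appear to follow one rule: a difference is called reliable when the bootstrap interval excludes zero, and the effect size is the ratio of the larger group median to the smaller one. A minimal sketch of that rule, with hypothetical helper names that are not part of utils:

```python
def effect_ratio(median_a, median_b):
    """Ratio of the larger median to the smaller, so the result is always >= 1."""
    larger, smaller = max(median_a, median_b), min(median_a, median_b)
    return larger / smaller


def is_reliable(interval):
    """True when a [lower, upper] bootstrap interval lies strictly on one side of 0."""
    low, high = interval
    return low > 0 or high < 0
```

Under this rule the prop_farm interval above, whose lower bound is exactly 0.0, counts as not reliable, while the prop_SS interval counts as reliable.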

In [73]:
#Proportion of mortgage interest deduction returns
p_mort_intr_paid_med = loans[loans.loan_status == 'Fully Paid'].prop_mort_intr.median()
p_mort_intr_chrg_med = loans[loans.loan_status == 'Charged Off'].prop_mort_intr.median()

loans.groupby('loan_status').prop_mort_intr.median()

loan_status
Charged Off    0.258684
Fully Paid     0.262168
Name: prop_mort_intr, dtype: float64

In [74]:
#Effect size
p_mort_intr_paid_med / p_mort_intr_chrg_med
#Small effect

1.013468474089005

In [75]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].prop_mort_intr,
                              loans[loans.loan_status == 'Charged Off'].prop_mort_intr, 99.5, 0.5)
prct
#Reliable

[0.0022454538489952158, 0.0056254303365271818]

In [76]:
#Proportion of returns with an educational credit
p_edu_paid_med = loans[loans.loan_status == 'Fully Paid'].prop_edu.median()
p_edu_chrg_med = loans[loans.loan_status == 'Charged Off'].prop_edu.median()

loans.groupby('loan_status').prop_edu.median()

loan_status
Charged Off    0.012223
Fully Paid     0.012526
Name: prop_edu, dtype: float64

In [77]:
#Effect size
p_edu_paid_med / p_edu_chrg_med
#Small effect

1.024805230548222

In [78]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].prop_edu,
                              loans[loans.loan_status == 'Charged Off'].prop_edu, 99.5, 0.5)
prct
#Reliable

[0.00026288811239375508, 0.0003143476129658547]

In [79]:
#Proportion of returns with student loans
p_st_loans_paid_med = loans[loans.loan_status == 'Fully Paid'].prop_st_loans.median()
p_st_loans_chrg_med = loans[loans.loan_status == 'Charged Off'].prop_st_loans.median()

loans.groupby('loan_status').prop_st_loans.median()

loan_status
Charged Off    0.075028
Fully Paid     0.076177
Name: prop_st_loans, dtype: float64

In [80]:
#Effect size
p_st_loans_paid_med / p_st_loans_chrg_med
#Small effect

1.0153165678871037

In [81]:
#Bootstrap
prct = ut.med_diff_bootstrap(loans[loans.loan_status == 'Fully Paid'].prop_st_loans,
                              loans[loans.loan_status == 'Charged Off'].prop_st_loans, 99.5, 0.5)
prct
#Reliable

[0.00052856175342524958, 0.0018996974612559697]

Zipcode Features:
Reliable: avg_inc (small effect), avg_SS (very small effect), avg_itemized (small effect), avg_unemp (very small effect),
avg_mort_intr (small effect), prop_SS (small effect), prop_itemized (small effect), prop_unemp (small effect),
prop_mort_intr (small effect), prop_edu (small effect), prop_st_loans (small effect)

Not Reliable: avg_edu, avg_st_loans, prop_farm

Numeric fields that require processing:
emp_length, mths_since_last_delinq, mths_since_last_record, revol_util, mths_since_last_major_derog, total_credit_rv

In [53]:
#emp_length
#Create a dictionary to change the year strings into numbers
year_dict = {'emp_length':{'10+ years':10, '< 1 year':0, '3 years':3, '9 years':9, '4 years':4, '5 years':5,
       '1 year':1, '6 years':6, '2 years':2, '7 years':7, '8 years':8, 'n/a':np.nan}}
#Replace the year strings with numbers
loans.replace(year_dict, inplace = True)
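
The dictionary above has to list every label by hand. As an alternative sketch, the same mapping can be derived from the label text, assuming the labels always look like '< 1 year', 'N years', '10+ years', or 'n/a'. The function name is illustrative.

```python
import math
import re


def parse_emp_length(label):
    """Convert an emp_length label to a number of years.

    '10+ years' -> 10.0, '< 1 year' -> 0.0, 'n/a' or non-strings -> nan.
    """
    if not isinstance(label, str):
        return math.nan
    if label.strip().startswith('<'):
        # '< 1 year' is treated as 0, matching year_dict above
        return 0.0
    match = re.search(r'\d+', label)
    return float(match.group()) if match else math.nan
```

It could be applied with loans['emp_length'] = loans['emp_length'].map(parse_emp_length). The explicit dictionary has the advantage that any label it does not list stays visible as a leftover string rather than being parsed silently.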

In [56]:
#employment length
emp_paid_mn = loans[loans.loan_status == 'Fully Paid'].emp_length.mean()
emp_chrg_mn = loans[loans.loan_status == 'Charged Off'].emp_length.mean()

loans.groupby('loan_status').emp_length.mean()

loan_status
Charged Off    5.773641
Fully Paid     5.806348
Name: emp_length, dtype: float64

In [57]:
#Effect size
emp_paid_mn / emp_chrg_mn
#Very small effect

1.0056649621395004

In [58]:
#Bootstrap
prct = ut.mean_diff_bootstrap(loans[loans.loan_status == 'Charged Off'].emp_length,
                              loans[loans.loan_status == 'Fully Paid'].emp_length, 99.5, 0.5)
prct
#Not reliable

[-0.082100414260775412, 0.018788161397101412]

Numeric fields requiring processing:
Not reliable: emp_length

Imputation protocol:
Variables to use for imputation: emp_length, home_ownership, annual_inc, purpose, zip_code, addr_state, dti, delinq_2yrs, earliest_cr_line, inq_last_6mths, open_acc, pub_rec, revol_bal, total_acc, last_credit_pull_d, plus any zip_code based features and features from NLP on desc
Variables to impute: mths_since_last_delinq, mths_since_last_record, revol_util, mths_since_last_major_derog, total_credit_rv, acc_now_delinq (before 12/10/15)
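
The protocol above does not say which imputation model will be used. As a deliberately simple placeholder, the sketch below fills a target column with the median of its group under one categorical predictor and falls back to the overall median. This is plain group-median imputation, not the planned method; the function name and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd


def impute_by_group_median(df, target, group_col):
    """Fill missing `target` values with the median of the row's `group_col` group.

    Groups whose values are all missing fall back to the overall median.
    """
    group_median = df.groupby(group_col)[target].transform('median')
    filled = df[target].fillna(group_median)
    return filled.fillna(df[target].median())


# Toy data (illustrative values only, not taken from the loan files)
toy = pd.DataFrame({
    'home_ownership': ['RENT', 'RENT', 'RENT', 'MORTGAGE', 'MORTGAGE', 'OWN'],
    'revol_util': [40.0, np.nan, 60.0, 20.0, np.nan, np.nan],
})
toy['revol_util_imputed'] = impute_by_group_median(toy, 'revol_util', 'home_ownership')
```

A model-based approach that uses all of the listed predictors at once would follow the protocol more closely; the group median is only a baseline to compare against.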
