# Are religious Americans more likely to repay loans?
The recent rise of peer-to-peer lending has been stunning. Starting from zero a few years ago, platforms like Lending Club are already placing $10 billion of unsecured personal loans per year. 

The concept is simple: a borrower fills in an application, investors choose to fund part or all of the loan, and the platform takes care of fine details, like collecting payments and annoying late borrowers. In theory, this disintermediation of banks from consumer credit lowers borrowing rates, improves investor returns, and leaves a healthy margin for the platform's profits.

The platforms publish anonymized default history from previous loans. This data helps investors make smarter decisions about future loans - attracting more investment dollars. Investors use the data to try to pick loans with a high interest rate, but a lower likelihood of default.

## Data set
For now, I will use only 36 month loans originating before 2013 (maturing before 2016). For now, I will restrict my data set to 2012 (as we go back further, the underwriting standards change). If I fill rows to handle current loans (see below), I will bring in more recent data.

## Predictive factors
Many factors look to be of interest. Obvious ones include:
- FICO Score
- Debt to Income ratio
- Size of loan

Additional factors that people have provided some evidence for:
- Loan description length
  - Long descriptions are bad: http://drjasondavis.com/blog/2012/04/08/lending-club-loan-analysis-making-money-with-logistic-regression
  - Long descriptions are good: https://lendingclubmodeling.wordpress.com/2011/04/25/why-loan-descriptions-and-qa-matter/
- Loan purpose: http://blog.lendingrobot.com/research/predicting-the-number-of-payments-in-peer-lending/

Other useful articles:
- http://blog.lendingrobot.com/research/predicting-returns-for-ongoing-loans/
- http://kldavenport.com/gradient-boosting-analysis-of-lendingclubs-data/

## Analysis

### Predicted variable
We're ultimately concerned with risk-adjusted return. However, analysis is simpler with certain assumptions. These may be revisited in a future analysis:
- Risk-adjustment can be handled separately
- Return = interest rate - probability of default * loss given default
- PD follows the same hazard curve, regardless of risk grade or other factors
- LGD is the same for different risk grades and only depends on lateness severity

Therefore, we can focus on PD only.

Explore:
- shape of default curve at different risk grades
- chance of late payments developing into default at different grades
- recovery amounts at different grades
- do late fees help out?

## Conclusion

### Caveats
This study has focused on a methodology for answering sociological questions and predicting defaults. Other issues to resolve for successful investing in P2P include:
- API access: smart investors, such as hedge funds are rumored to quickly cream off the best loans. Any superior statistical view must also compete on execution speed.
- Returns should be risk-adjusted for proper comparison to alternative investments, such as the S&P 500. An examination of similar grades of unsecuritized personal loans (credit cards) across the entire credit cycle would help capture both beta and true risk of default.
- Lending Club uses NAR (Net Annualized Return, a non-standard method that assumes all current loans will pay 100%) http://danielodio.com/one-year-in-lending-money-to-complete-strangers-via-an-api-and-why-you-should-try-it
- Investment size: what is the proper amount to invest in each loan to achieve the Kelly Criterion?
- Loans do not receive capital gains treatment and are taxed as ordinary income. 

## Load data from Lending Club
Lending Club provides information on all past applications (and current loan status)

## Input files
- **LoanStats3a.csv.zip:** 2007-2011 Lending Club default history from https://www.lendingclub.com/info/download-data.action
- **LoanStats3b.csv.zip:** 2012-2013 Lending Club default history
- **LoanStats3c.csv.zip:** 2013-2014 Lending Club default history (currently reading this one only)
- **LoanStats3d.csv.zip:** 2015-2015 Lending Club default history
  - According to http://www.lendingmemo.com/lendingrobot-investing-review/, lending standards changed drastically starting in 2013
- **ZIP_COUNTY_122015.xlsx:** list of zip codes and county codes
- **national_county.txt:** list of states and counties mapped to code from US Census
- **2010.csv:** list of US counties, population, congregations, and adherents from http://www.rcms2010.org/ 
  - This website also has info breaking down into specific religion
- **US_elect_county:** 2012 presidential elections by county from http://www.theguardian.com/news/datablog/2012/nov/07/us-2012-election-county-results-download

In [78]:
import pandas as pd
pd.set_option('display.max_columns', 100)
%matplotlib inline
# religious adherents 
zip_counties = pd.read_excel('ZIP_COUNTY_122015.xlsx', skiprows = 0, header = 0)
zip_counties['ZIP'] = zip_counties['ZIP'].str.zfill(5)
zip_counties.head()

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

In [76]:
import pandas as pd
pd.set_option('display.max_columns', 100)
%matplotlib inline
# religious adherents 
national_county = pd.read_csv('national_county.txt', 
    names=['State', 'State_Code', 'County_Code', 'County', 'H1'], dtype={1: str, 2: str})
# combine into needed fields
national_county['County_Code_Full'] = national_county['State_Code'] + national_county['County_Code']
national_county['County_Full'] = national_county['County'].str[:-7] + ', ' + national_county['State']
national_county.head()

Unnamed: 0,State,State_Code,County_Code,County,H1,County_Code_Full,County_Full
0,AL,1,1,Autauga County,H1,1001,"Autauga, AL"
1,AL,1,3,Baldwin County,H1,1003,"Baldwin, AL"
2,AL,1,5,Barbour County,H1,1005,"Barbour, AL"
3,AL,1,7,Bibb County,H1,1007,"Bibb, AL"
4,AL,1,9,Blount County,H1,1009,"Blount, AL"


In [58]:
import pandas as pd
pd.set_option('display.max_columns', 100)
%matplotlib inline
# religious adherents 
religion = pd.read_csv('2010.csv', skiprows = 2, header = 1)
religion.head()

Unnamed: 0,County or Equivalent,Population,PopRank,Adherents,AdhRank,Congregations,ConRank,Adherents %,Adh% Rank,Congregations Per 10K People,Con Per 10K Rank
0,"Autauga, AL",54571,917,36938,691,106,831,67.7,546,19,1799
1,"Baldwin, AL",182265,338,96918,303,271,248,53.2,1329,15,2268
2,"Barbour, AL",27457,1516,15101,1392,89,1024,55.0,1206,32,701
3,"Bibb, AL",22915,1693,11430,1671,81,1152,49.9,1564,35,539
4,"Blount, AL",57322,884,37352,685,156,510,65.2,662,27,1072


In [35]:
import pandas as pd
pd.set_option('display.max_columns', 100)
%matplotlib inline
# 2012-2013 loan data - files are available from deeper history
# had to unzip file
past_loans = pd.read_csv('LoanStats3b.csv', header = 1, index_col=0, skiprows=0, skip_footer=2, engine='python')
past_loans['term'] = past_loans['term'].astype('category')
past_loans['grade'] = past_loans['grade'].astype('category')
past_loans['sub_grade'] = past_loans['sub_grade'].astype('category')
past_loans['home_ownership'] = past_loans['home_ownership'].astype('category')
past_loans['verification_status'] = past_loans['verification_status'].astype('category')

# Making assumption that any currently late loans are bad; may need to revisit
# Assumptions from Lending Club https://www.lendingclub.com/account/investorReturnsAdjustments.action
# Current 0% loss
# In Grace Period 28% loss
# Late 16-30 days 58% loss
# Late 31-120 days 74% loss
# Default 89% loss
di = {'Current': 'Good', 'Fully Paid': 'Good', 'Charged Off': 'Bad', 'Late (31-120 days)': 'Bad',
       'In Grace Period': 'Bad', 'Late (16-30 days)': 'Bad', 'Default': 'Bad'}
# past_loans.replace({'loan_status': di}, inplace=True)
past_loans['loan_status'] = past_loans['loan_status'].astype('category')
past_loans['pymnt_plan'] = past_loans['pymnt_plan'].astype('category')
past_loans['purpose'] = past_loans['purpose'].astype('category')
past_loans['application_type'] = past_loans['application_type'].astype('category')

# convert dates
past_loans['issue_d'] = pd.to_datetime(past_loans['issue_d'])
past_loans['earliest_cr_line'] = pd.to_datetime(past_loans['earliest_cr_line'])
past_loans['last_pymnt_d'] = pd.to_datetime(past_loans['last_pymnt_d'])
past_loans['next_pymnt_d'] = pd.to_datetime(past_loans['next_pymnt_d'])
past_loans['last_credit_pull_d'] = pd.to_datetime(past_loans['last_credit_pull_d'])
past_loans['issue_d'] = pd.to_datetime(past_loans['issue_d'])

# convert floats
past_loans['int_rate'] = pd.to_numeric(past_loans['int_rate'].str.extract('(.*)%'))/100

# drop pointless data
past_loans=past_loans.dropna(axis=1,how='all')
past_loans.drop('application_type',axis = 1) # only contains "individual"

# drop data that's not 2012 origination and 36 months term
past_loans = past_loans.drop(past_loans[(past_loans.issue_d >= '2012-07-01') | (past_loans.term == ' 60 months')].index)

past_loans.head()

Unnamed: 0_level_0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,acc_now_delinq,tot_coll_amt,tot_cur_bal,total_rev_hi_lim,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1
1377559,1622372,2500,2500,2500,36 months,0.1465,86.24,C,C2,value concepts inc,10+ years,MORTGAGE,53000,Source Verified,2012-06-01,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 06/20/12 > home repairs th...,home_improvement,loan for life,306xx,GA,17.0,0,1987-03-01,0,30.0,,8,0,3222,54.6%,15,f,0,0,3104.435553,3104.44,2500,604.44,0,0,0,2015-07-01,89.38,NaT,2016-02-01,0,,1,INDIVIDUAL,0,,,,2,,1080.0,74.3,0,0,,,,,3,103.0,,11.0,30.0,,,,3,,,,,,8,,,,,,33.3,0,0,,12315,4200,
1377635,1621835,3000,3000,2750,36 months,0.079,93.88,A,A4,Washoe County School District,10+ years,MORTGAGE,45600,Not Verified,2012-06-01,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,,car,My Car,894xx,NV,8.21,0,1998-05-01,0,,,4,0,1046,26.1%,21,f,0,0,3379.306095,3097.7,3000,379.31,0,0,0,2015-07-01,96.03,NaT,2016-02-01,0,,1,INDIVIDUAL,0,,,,0,,,,0,0,,,,,8,,,,,,,,0,,,,,,4,,,,,,,0,0,,1046,0,
1377314,1621503,5600,5600,5600,36 months,0.1922,205.9,D,D5,OpenText,6 years,RENT,76320,Source Verified,2012-06-01,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 06/20/12 > Consolidation<b...,debt_consolidation,Consloan,323xx,FL,17.3,0,2008-09-01,0,,,9,0,3445,91.5%,13,f,0,0,7093.693707,7093.69,5600,1493.69,0,0,0,2014-05-01,2783.95,NaT,2016-02-01,0,,1,INDIVIDUAL,0,,,,5,,210.0,90.0,0,0,,,,,0,12.0,,12.0,,,,,3,,,,,,9,,,,,,100.0,0,0,,29364,2100,
1327173,1572525,2800,2800,2800,36 months,0.0762,87.26,A,A3,The Dawson Academy,7 years,MORTGAGE,65000,Not Verified,2012-06-01,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 06/20/12 > This loan is go...,major_purchase,Spa Loan,140xx,NY,5.32,0,1992-02-01,0,,,4,0,818,4.1%,6,f,0,0,3141.029828,3141.03,2800,341.03,0,0,0,2015-07-01,89.64,NaT,2015-06-01,0,,1,INDIVIDUAL,0,,,,1,,17782.0,4.4,0,0,,,,,0,139.0,,9.0,,,,,2,,,,,,4,,,,,,0.0,0,0,,12529,18600,
1363816,1607830,1925,1925,1925,36 months,0.1074,62.79,B,B2,PPT,< 1 year,OWN,20700,Not Verified,2012-06-01,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 06/20/12 > Help pay for ki...,home_improvement,Materials for home addition.,857xx,AZ,4.23,0,2002-09-01,2,,,7,0,2387,21.3%,10,f,0,0,1975.565136,1975.57,1925,50.57,0,0,0,2012-10-01,1850.67,NaT,2014-05-01,0,,1,INDIVIDUAL,0,,,,6,,4702.0,22.9,0,0,,,,,0,11.0,,0.0,,,,,2,,,,,,7,,,,,,50.0,0,0,,12834,6100,


In [54]:
past_loans.head(100)

Unnamed: 0_level_0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,acc_now_delinq,tot_coll_amt,tot_cur_bal,total_rev_hi_lim,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1
1377559,1622372,2500,2500,2500.000000,36 months,0.1465,86.24,C,C2,value concepts inc,10+ years,MORTGAGE,53000,Source Verified,2012-06-01,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 06/20/12 > home repairs th...,home_improvement,loan for life,306xx,GA,17.00,0,1987-03-01,0,30,,8,0,3222,54.6%,15,f,0,0,3104.435553,3104.44,2500.00,604.44,0.00,0.00,0.000,2015-07-01,89.38,NaT,2016-02-01,0,,1,INDIVIDUAL,0,,,,2,,1080,74.3,0,0,,,,,3,103,,11,30,,,,3,,,,,,8,,,,,,33.3,0,0,,12315,4200,
1377635,1621835,3000,3000,2750.000000,36 months,0.0790,93.88,A,A4,Washoe County School District,10+ years,MORTGAGE,45600,Not Verified,2012-06-01,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,,car,My Car,894xx,NV,8.21,0,1998-05-01,0,,,4,0,1046,26.1%,21,f,0,0,3379.306095,3097.70,3000.00,379.31,0.00,0.00,0.000,2015-07-01,96.03,NaT,2016-02-01,0,,1,INDIVIDUAL,0,,,,0,,,,0,0,,,,,8,,,,,,,,0,,,,,,4,,,,,,,0,0,,1046,0,
1377314,1621503,5600,5600,5600.000000,36 months,0.1922,205.90,D,D5,OpenText,6 years,RENT,76320,Source Verified,2012-06-01,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 06/20/12 > Consolidation<b...,debt_consolidation,Consloan,323xx,FL,17.30,0,2008-09-01,0,,,9,0,3445,91.5%,13,f,0,0,7093.693707,7093.69,5600.00,1493.69,0.00,0.00,0.000,2014-05-01,2783.95,NaT,2016-02-01,0,,1,INDIVIDUAL,0,,,,5,,210,90.0,0,0,,,,,0,12,,12,,,,,3,,,,,,9,,,,,,100.0,0,0,,29364,2100,
1327173,1572525,2800,2800,2800.000000,36 months,0.0762,87.26,A,A3,The Dawson Academy,7 years,MORTGAGE,65000,Not Verified,2012-06-01,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 06/20/12 > This loan is go...,major_purchase,Spa Loan,140xx,NY,5.32,0,1992-02-01,0,,,4,0,818,4.1%,6,f,0,0,3141.029828,3141.03,2800.00,341.03,0.00,0.00,0.000,2015-07-01,89.64,NaT,2015-06-01,0,,1,INDIVIDUAL,0,,,,1,,17782,4.4,0,0,,,,,0,139,,9,,,,,2,,,,,,4,,,,,,0.0,0,0,,12529,18600,
1363816,1607830,1925,1925,1925.000000,36 months,0.1074,62.79,B,B2,PPT,< 1 year,OWN,20700,Not Verified,2012-06-01,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 06/20/12 > Help pay for ki...,home_improvement,Materials for home addition.,857xx,AZ,4.23,0,2002-09-01,2,,,7,0,2387,21.3%,10,f,0,0,1975.565136,1975.57,1925.00,50.57,0.00,0.00,0.000,2012-10-01,1850.67,NaT,2014-05-01,0,,1,INDIVIDUAL,0,,,,6,,4702,22.9,0,0,,,,,0,11,,0,,,,,2,,,,,,7,,,,,,50.0,0,0,,12834,6100,
1376522,1620885,2000,2000,2000.000000,36 months,0.1074,65.24,B,B2,Fiserv,8 years,RENT,75000,Not Verified,2012-06-01,Charged Off,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 06/21/12 > Thank you!<br>,moving,Moving Expense,064xx,CT,13.38,0,1999-03-01,0,,,9,0,10603,56.4%,37,f,0,0,301.780000,301.78,142.92,52.29,0.00,106.57,1.000,2012-10-01,65.24,NaT,2013-03-01,0,,1,INDIVIDUAL,0,,,,3,,267,96.6,0,0,,,,,1,8,,8,,,,,4,,,,,,9,,,,,,100.0,0,0,,44297,7800,
1375343,1619895,15000,15000,15000.000000,36 months,0.1922,551.52,D,D5,CHKD,< 1 year,RENT,135000,Not Verified,2012-06-01,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 06/20/12 > I have excellen...,credit_card,To pay off high interest credit cards,279xx,NC,11.21,0,1994-11-01,0,47,,9,0,26321,87.2%,38,f,0,0,17168.480544,17168.48,15000.00,2168.48,0.00,0.00,0.000,2013-05-01,12207.55,NaT,2016-02-01,0,,1,INDIVIDUAL,0,,,,0,,2750,88.8,0,0,,,,,2,73,,,47,,,,5,,,,,,9,,,,,,80.0,0,0,,115339,24600,
1375440,1619974,1400,1400,1400.000000,36 months,0.0976,45.02,B,B1,bank of america,2 years,MORTGAGE,56200,Not Verified,2012-06-01,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 06/19/12 > Loan will be us...,vacation,Consolidate/Vacaation,891xx,NV,27.14,0,1996-11-01,0,,,6,0,6466,95.1%,22,f,0,0,1620.573312,1620.57,1400.00,220.57,0.00,0.00,0.000,2015-07-01,46.97,NaT,2015-06-01,0,,1,INDIVIDUAL,0,,,,2,,12,99.7,0,0,,,,,1,26,,14,,,,,2,,,,,,6,,,,,,100.0,0,0,,19731,3500,
1375177,1619715,2400,2400,2400.000000,36 months,0.1074,78.28,B,B2,"James Slaman, DDS",6 years,RENT,36500,Not Verified,2012-06-01,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,,credit_card,Credit Cards,871xx,NM,21.11,0,1998-11-01,0,,,15,0,29689,69.4%,28,f,0,0,2817.984806,2817.98,2400.00,417.98,0.00,0.00,0.000,2015-07-01,81.35,NaT,2016-02-01,0,,1,INDIVIDUAL,0,,,,2,,6806,80.8,0,0,,,,,0,51,,7,,,,,9,,,,,,15,,,,,,55.6,0,0,,29689,35493,
1375009,1619515,2000,2000,2000.000000,36 months,0.1367,68.04,B,B5,Dekalb Farmers market,5 years,RENT,28000,Source Verified,2012-06-01,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,Borrower added on 06/19/12 > This loan is to...,small_business,Business Start up,300xx,GA,20.40,0,2007-06-01,3,,,18,0,2078,40%,25,f,0,0,2326.413835,2326.41,2000.00,326.41,0.00,0.00,0.000,2014-01-01,1170.39,NaT,2014-01-01,0,,1,INDIVIDUAL,0,,,,11,,26,97.4,0,0,,,,,0,61,,0,,,,,1,,,,,,18,,,,,,100.0,0,0,,70571,1000,


In [36]:
past_loans[(past_loans['loan_status'] != "Charged Off") & (past_loans['loan_status'] != "Fully Paid")].head(100)

Unnamed: 0_level_0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,acc_now_delinq,tot_coll_amt,tot_cur_bal,total_rev_hi_lim,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1


# Explore different categorical data
What values are inside the categorical fields?

In [37]:
pd.unique(past_loans['term'])

array([' 36 months'], dtype=object)

In [38]:
pd.unique(past_loans['grade'])

array(['C', 'A', 'D', 'B', 'E', 'G', 'F'], dtype=object)

In [39]:
pd.unique(past_loans['sub_grade'])

array(['C2', 'A4', 'D5', 'A3', 'B2', 'B1', 'B5', 'A1', 'C5', 'B3', 'C1',
       'E1', 'B4', 'D1', 'C4', 'D3', 'A2', 'E2', 'E4', 'A5', 'D2', 'C3',
       'D4', 'G1', 'G2', 'E5', 'F1', 'G3', 'F5', 'F2', 'E3', 'F3', 'G4',
       'G5', 'F4'], dtype=object)

In [40]:
pd.unique(past_loans['home_ownership'])

array(['MORTGAGE', 'RENT', 'OWN'], dtype=object)

In [41]:
pd.unique(past_loans['verification_status'])

array(['Source Verified', 'Not Verified', 'Verified'], dtype=object)

In [42]:
pd.unique(past_loans['loan_status'])

array(['Fully Paid', 'Charged Off'], dtype=object)

In [43]:
pd.unique(past_loans['pymnt_plan'])

array(['n'], dtype=object)

In [44]:
pd.unique(past_loans['purpose'])

array(['home_improvement', 'car', 'debt_consolidation', 'major_purchase',
       'moving', 'credit_card', 'vacation', 'small_business', 'other',
       'wedding', 'medical', 'house', 'renewable_energy'], dtype=object)

# Examine field types and amount of data
Is there any missing data?

In [45]:
past_loans.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14967 entries, 1377559 to 1058722
Data columns (total 93 columns):
member_id                         14967 non-null int64
loan_amnt                         14967 non-null int64
funded_amnt                       14967 non-null int64
funded_amnt_inv                   14967 non-null float64
term                              14967 non-null category
int_rate                          14967 non-null float64
installment                       14967 non-null float64
grade                             14967 non-null category
sub_grade                         14967 non-null category
emp_title                         13938 non-null object
emp_length                        14967 non-null object
home_ownership                    14967 non-null category
annual_inc                        14967 non-null float64
verification_status               14967 non-null category
issue_d                           14967 non-null datetime64[ns]
loan_status             

# Check out summary stats on the numeric fields

In [46]:
past_loans.describe().transpose() #.to_string()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
member_id,14967,1455102.984499,95085.641457,382664.00,1383155.500000,1456996.0000,1531193.000000,1622372.000000
loan_amnt,14967,11601.224360,7188.093122,1000.00,6325.000000,10000.0000,15000.000000,35000.000000
funded_amnt,14967,11601.224360,7188.093122,1000.00,6325.000000,10000.0000,15000.000000,35000.000000
funded_amnt_inv,14967,11577.779687,7173.743526,950.00,6300.000000,10000.0000,15000.000000,35000.000000
int_rate,14967,0.118843,0.037323,0.06,0.089000,0.1212,0.142700,0.248900
installment,14967,387.315191,246.678443,30.44,208.750000,337.3800,496.800000,1379.230000
annual_inc,14967,66678.824698,43712.992175,5000.00,40000.000000,57000.0000,80000.000000,1233000.000000
dti,14967,14.849731,6.776538,0.00,9.750000,14.9000,20.020000,34.920000
delinq_2yrs,14967,0.131422,0.505779,0.00,0.000000,0.0000,0.000000,10.000000
inq_last_6mths,14967,0.766553,0.983709,0.00,0.000000,0.0000,1.000000,8.000000


## Explore date fields

In [47]:
past_loans.groupby(past_loans['last_pymnt_d'].map(lambda x: x.year))

<pandas.core.groupby.DataFrameGroupBy object at 0x13f3bc910>

In [48]:
pd.unique(past_loans.sort_values('last_pymnt_d')['last_pymnt_d'])

array(['2012-01-31T19:00:00.000000000-0500',
       '2012-02-29T19:00:00.000000000-0500',
       '2012-03-31T20:00:00.000000000-0400',
       '2012-04-30T20:00:00.000000000-0400',
       '2012-05-31T20:00:00.000000000-0400',
       '2012-06-30T20:00:00.000000000-0400',
       '2012-07-31T20:00:00.000000000-0400',
       '2012-08-31T20:00:00.000000000-0400',
       '2012-09-30T20:00:00.000000000-0400',
       '2012-10-31T20:00:00.000000000-0400',
       '2012-11-30T19:00:00.000000000-0500',
       '2012-12-31T19:00:00.000000000-0500',
       '2013-01-31T19:00:00.000000000-0500',
       '2013-02-28T19:00:00.000000000-0500',
       '2013-03-31T20:00:00.000000000-0400',
       '2013-04-30T20:00:00.000000000-0400',
       '2013-05-31T20:00:00.000000000-0400',
       '2013-06-30T20:00:00.000000000-0400',
       '2013-07-31T20:00:00.000000000-0400',
       '2013-08-31T20:00:00.000000000-0400',
       '2013-09-30T20:00:00.000000000-0400',
       '2013-10-31T20:00:00.000000000-0400',
       '20

In [49]:
pd.unique(past_loans['issue_d'])

array(['2012-05-31T20:00:00.000000000-0400',
       '2012-04-30T20:00:00.000000000-0400',
       '2012-03-31T20:00:00.000000000-0400',
       '2012-02-29T19:00:00.000000000-0500',
       '2012-01-31T19:00:00.000000000-0500',
       '2011-12-31T19:00:00.000000000-0500'], dtype='datetime64[ns]')

# Fill in rows for each date to last payment
A simple look at the data will be skewed because active loans may still default in the future. Only using 2012 36 month loans.

The data is supplied with status for each loan. Best to fill data with records for each month from loan start (and status = current) until today or last payment). This will allow me to determine the shape of the conditional default curve (what percentage of current loans default vs age?) and then use the current data set. I will also gain a better understanding of Loss Given Default for this data set.

Not sure if I have time to make this happen. When I do, I will go back and import data from 2013+.

## Create list of possible dates to match with

In [16]:
# What are all the dates in the table?
# for each date, what are the earlier dates
# create new table with "current" until paid or written off
# table should stop before defaults and late payments start registering


In [20]:
all_dates = pd.DataFrame(index=pd.unique(past_loans.sort_values('last_pymnt_d')['last_pymnt_d']))#, columns='date_d')
all_dates['report_d'] = pd.unique(past_loans.sort_values('last_pymnt_d')['last_pymnt_d'])
all_dates['common'] = 1
all_dates.dropna(how = 'any')
all_dates[(pd.notnull(all_dates['report_d']))]

Unnamed: 0,report_d,common
2014-02-01,2014-02-01,1
2014-03-01,2014-03-01,1
2014-04-01,2014-04-01,1
2014-05-01,2014-05-01,1
2014-06-01,2014-06-01,1
2014-07-01,2014-07-01,1
2014-08-01,2014-08-01,1
2014-09-01,2014-09-01,1
2014-10-01,2014-10-01,1
2014-11-01,2014-11-01,1


In [50]:
# ro's idea - just loop through the data to create the data

# past_loans['common'] = 1
# partly_merged = pd.merge(past_loans, all_dates, on = 'common')
# full_past_loans = partly_merged[(partly_merged['last_pymnt_d'] >= partly_merged['report_d'] & 
#                                 pd.notnull(partly_merged['report_d']))]
# full_past_loans.head(100)

# Create dependent and independent data


In [51]:
X = past_loans.drop('loan_status', axis = 1)
y = past_loans['loan_status']

# Split into training and test data

In [52]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# 10 cross validation iterations with 20% test / 80% train
from sklearn.cross_validation import ShuffleSplit
cv = ShuffleSplit(X_train.shape[0], n_iter=10, test_size=0.2, random_state=0)

# Standardize the data

In [53]:
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
# transform our training features
X_train_std = stdsc.fit_transform(X_train)
# transform the testing features in the same way
X_test_std = stdsc.transform(X_test)

ValueError: could not convert string to float: INDIVIDUAL