# Are religious Americans more likely to repay loans?
The recent rise of peer-to-peer lending has been stunning. Starting from zero a few years ago, platforms like Lending Club are already placing $10 billion of unsecured personal loans per year. 

The concept is simple: a borrower fills in an application, investors choose to fund part or all of the loan, and the platform takes care of fine details, like collecting payments and annoying late borrowers. In theory, this disintermediation of banks from consumer credit lowers borrowing rates, improves investor returns, and leaves a healthy margin for the platform's profits.

The platforms publish anonymized default history from previous loans. This data helps investors make smarter decisions about future loans - attracting more investment dollars. Investors use the data to try to pick loans with a high interest rate, but a lower likelihood of default.

## Predictive factors
Many factors look to be of interest. Obvious ones include:
- FICO Score
- Debt to Income ratio
- Size of loan

Additional factors that people have provided some evidence for:
- Loan description length
  - Long descriptions are bad: http://drjasondavis.com/blog/2012/04/08/lending-club-loan-analysis-making-money-with-logistic-regression
  - Long descriptions are good: https://lendingclubmodeling.wordpress.com/2011/04/25/why-loan-descriptions-and-qa-matter/
- Loan purpose: http://blog.lendingrobot.com/research/predicting-the-number-of-payments-in-peer-lending/

Other useful articles:
- http://blog.lendingrobot.com/research/predicting-returns-for-ongoing-loans/
- http://kldavenport.com/gradient-boosting-analysis-of-lendingclubs-data/

## Analysis

### Predicted variable
We're ultimately concerned with risk-adjusted return. However, analysis is simpler with certain assumptions. These may be revisited in a future analysis:
- Risk-adjustment can be handled separately
- Return = interest rate - probability of default * loss given default
- PD follows the same hazard curve, regardless of risk grade or other factors
- LGD is the same for different risk grades and only depends on lateness severity

Therefore, we can focus on PD only.

Explore:
- shape of default curve at different risk grades
- chance of late payments developing into default at different grades
- recovery amounts at different grades
- do late fees help out?

## Conclusion

### Caveats
This study has focused on a methodology for answering sociological questions and predicting defaults. Other issues to resolve for successful investing in P2P include:
- API access: smart investors, such as hedge funds are rumored to quickly cream off the best loans. Any superior statistical view must also compete on execution speed.
- Returns should be risk-adjusted for proper comparison to alternative investments, such as the S&P 500. An examination of similar grades of unsecuritized personal loans (credit cards) across the entire credit cycle would help capture both beta and true risk of default.
- Lending Club uses NAR (Net Annualized Return, a non-standard method that assumes all current loans will pay 100%) http://danielodio.com/one-year-in-lending-money-to-complete-strangers-via-an-api-and-why-you-should-try-it
- Investment size: what is the proper amount to invest in each loan to achieve the Kelly Criterion?
- Loans do not receive capital gains treatment and are taxed as ordinary income. 

## Load data from Lending Club
Lending Club provides information on all past applications (and current loan status)

## Input files
- **LoanStats3a_securev1.csv.gz:** 2007-2011 Lending Club default history
- **LoanStats3b_securev1.csv.gz:** 2012-2013 Lending Club default history
- **LoanStats3c_securev1.csv.gz:** 2013-2014 Lending Club default history (currently reading this one only)
- **LoanStats3d_securev1.csv.gz:** 2015-2015 Lending Club default history
  - According to http://www.lendingmemo.com/lendingrobot-investing-review/, lending standards changed drastically starting in 2013
- **ZIP_COUNTY_122015.xlsx:** list of zip codes and county codes
- **national_county.txt:** list of states and counties mapped to code from US Census
- **2010.csv:** list of US counties, population, congregations, and adherents from http://www.rcms2010.org/ 
  - This website also has info breaking down into specific religion
- **US_elect_county:** 2012 presidential elections by county from http://www.theguardian.com/news/datablog/2012/nov/07/us-2012-election-county-results-download

In [19]:
import pandas as pd
pd.set_option('display.max_columns', 100)
%matplotlib inline
# 2013-2014 loan data - files are available from deeper history
past_loans = pd.read_csv('LoanStats3c_securev1.csv', header = 1, index_col=0)
past_loans['term'] = past_loans['term'].astype('category')
past_loans['grade'] = past_loans['grade'].astype('category')
past_loans['sub_grade'] = past_loans['sub_grade'].astype('category')
past_loans['home_ownership'] = past_loans['home_ownership'].astype('category')
past_loans['verification_status'] = past_loans['verification_status'].astype('category')

# Making assumption that any currently late loans are bad; may need to revisit
# Assumptions from Lending Club https://www.lendingclub.com/account/investorReturnsAdjustments.action
# Current 0% loss
# In Grace Period 28% loss
# Late 16-30 days 58% loss
# Late 31-120 days 74% loss
# Default 89% loss
di = {'Current': 'Good', 'Fully Paid': 'Good', 'Charged Off': 'Bad', 'Late (31-120 days)': 'Bad',
       'In Grace Period': 'Bad', 'Late (16-30 days)': 'Bad', 'Default': 'Bad'}
# past_loans.replace({'loan_status': di}, inplace=True)
past_loans['loan_status'] = past_loans['loan_status'].astype('category')
past_loans['pymnt_plan'] = past_loans['pymnt_plan'].astype('category')
past_loans['purpose'] = past_loans['purpose'].astype('category')
past_loans['application_type'] = past_loans['application_type'].astype('category')

# convert dates
past_loans['issue_d'] = pd.to_datetime(past_loans['issue_d'])
past_loans['earliest_cr_line'] = pd.to_datetime(past_loans['earliest_cr_line'])
past_loans['last_pymnt_d'] = pd.to_datetime(past_loans['last_pymnt_d'])
past_loans['next_pymnt_d'] = pd.to_datetime(past_loans['next_pymnt_d'])
past_loans['last_credit_pull_d'] = pd.to_datetime(past_loans['last_credit_pull_d'])
past_loans['issue_d'] = pd.to_datetime(past_loans['issue_d'])

# convert floats
past_loans['int_rate'] = pd.to_numeric(past_loans['int_rate'].str.extract('(.*)%'))/100


# drop pointless data
past_loans=past_loans.dropna(axis=1,how='all')
past_loans.drop('application_type',axis = 1) # only contains "individual"
past_loans.head(100)

Unnamed: 0_level_0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,fico_range_low,fico_range_high,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,last_fico_range_high,last_fico_range_low,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,acc_now_delinq,tot_coll_amt,tot_cur_bal,total_credit_rv
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1
36805548,39558264,10400,10400,10400,36 months,0.0699,321.08,A,A3,Truck Driver Delivery Personel,8 years,MORTGAGE,58000,Not Verified,2014-12-01,Current,n,https://www.lendingclub.com/browse/loanDetail....,,credit_card,Credit card refinancing,937xx,CA,14.92,0,1989-09-01,710,714,2,42,,17,0,6133,31.6%,36,w,7449.57,7449.57,3521.78,3521.78,2950.43,571.35,0,0,0,2015-12-01,321.08,2016-01-01,2015-12-01,709,705,0,59,1,INDIVIDUAL,0,0,162110,19400
38098114,40860827,15000,15000,15000,60 months,0.1239,336.64,C,C1,MANAGEMENT,10+ years,RENT,78000,Source Verified,2014-12-01,Current,n,https://www.lendingclub.com/browse/loanDetail....,,debt_consolidation,Debt consolidation,235xx,VA,12.03,0,1994-08-01,750,754,0,,,6,0,138008,29%,17,w,12894.10,12894.10,3691.36,3691.36,2105.90,1585.46,0,0,0,2015-12-01,336.64,2016-02-01,2015-12-01,684,680,0,,1,INDIVIDUAL,0,0,149140,184500
37612354,40375473,12800,12800,12800,60 months,0.1714,319.08,D,D4,Senior Sales Professional,10+ years,MORTGAGE,125000,Verified,2014-12-01,Current,n,https://www.lendingclub.com/browse/loanDetail....,,car,Car financing,953xx,CA,8.31,1,2000-10-01,665,669,0,17,,8,0,5753,100.9%,13,w,11189.45,11189.45,3479.41,3479.41,1610.55,1868.86,0,0,0,2015-12-01,319.08,2016-01-01,2015-12-01,734,730,0,36,1,INDIVIDUAL,0,0,261815,5700
37822187,40585251,9600,9600,9600,36 months,0.1366,326.53,C,C3,Admin Specialist,10+ years,RENT,69000,Source Verified,2014-12-01,Fully Paid,n,https://www.lendingclub.com/browse/loanDetail....,,debt_consolidation,Debt consolidation,077xx,NJ,25.81,0,1992-11-01,680,684,0,,,12,0,16388,59.4%,44,f,0.00,0.00,9973.43,9973.43,9600.00,373.43,0,0,0,2015-04-01,9338.58,NaT,2015-12-01,684,680,0,,1,INDIVIDUAL,0,0,38566,27600
37842129,40605224,21425,21425,21425,60 months,0.1559,516.36,D,D1,Programming Analysis Supervisor,6 years,RENT,63800,Source Verified,2014-12-01,Current,n,https://www.lendingclub.com/browse/loanDetail....,,credit_card,Credit card refinancing,658xx,MO,18.49,0,2003-08-01,685,689,0,60,,10,0,16374,76.2%,35,w,18629.98,18629.98,5633.57,5633.57,2795.02,2838.55,0,0,0,2015-12-01,516.36,2016-01-01,2015-12-01,684,680,0,74,1,INDIVIDUAL,0,0,42315,21500
37662224,40425321,7650,7650,7650,36 months,0.1366,260.20,C,C3,Technical Specialist,< 1 year,RENT,50000,Source Verified,2014-12-01,Charged Off,n,https://www.lendingclub.com/browse/loanDetail....,,debt_consolidation,Debt consolidation,850xx,AZ,34.81,0,2002-08-01,685,689,1,,,11,0,16822,91.9%,20,f,0.00,0.00,1043.99,1043.99,704.38,339.61,0,0,0,2015-08-01,17.70,NaT,2015-12-01,539,535,0,,1,INDIVIDUAL,0,0,64426,18300
37800722,40563521,12975,12975,12975,36 months,0.1786,468.17,D,D5,Sales,10+ years,RENT,60000,Source Verified,2014-12-01,Late (31-120 days),n,https://www.lendingclub.com/browse/loanDetail....,,house,Home buying,331xx,FL,22.42,0,1999-01-01,680,684,0,48,,11,0,5200,33.1%,19,f,10346.86,10346.86,4181.34,4181.34,2628.14,1553.20,0,0,0,2015-10-01,468.17,2016-01-01,2015-12-01,634,630,0,,1,INDIVIDUAL,0,900,17281,15700
37701596,40474581,10000,10000,10000,36 months,0.1199,332.10,B,B5,Investment Consultant,8 years,RENT,90000,Verified,2014-12-01,Current,n,https://www.lendingclub.com/browse/loanDetail....,,debt_consolidation,Debt consolidation,483xx,MI,8.44,0,2003-07-01,675,679,1,,,5,0,23723,98%,10,f,7314.49,7314.49,3636.45,3636.45,2685.51,950.94,0,0,0,2015-12-01,332.10,2016-01-01,2015-12-01,669,665,0,,1,INDIVIDUAL,0,0,23723,24200
37682226,40455307,17000,17000,17000,36 months,0.1366,578.22,C,C3,Deputy sheriff,10+ years,MORTGAGE,75000,Verified,2014-12-01,Current,n,https://www.lendingclub.com/browse/loanDetail....,,debt_consolidation,Debt consolidation,144xx,NY,23.63,0,2001-01-01,675,679,0,46,,7,0,5063,46.4%,31,f,12518.99,12518.99,6328.17,6328.17,4481.01,1847.16,0,0,0,2015-12-01,578.22,2016-01-01,2015-12-01,719,715,0,51,1,INDIVIDUAL,0,0,122193,10900
37854444,40617199,16000,16000,16000,60 months,0.1144,351.40,B,B4,Foreign Service Officer,6 years,OWN,109777,Verified,2014-12-01,Current,n,https://www.lendingclub.com/browse/loanDetail....,,debt_consolidation,Debt consolidation,201xx,VA,11.63,1,2003-11-01,700,704,0,12,,7,0,7253,60.4%,14,w,13705.15,13705.15,3839.98,3839.98,2294.85,1545.13,0,0,0,2015-12-01,351.40,2016-01-01,2015-12-01,669,665,0,,1,INDIVIDUAL,0,0,373743,12000


In [49]:
past_loans[(past_loans['loan_status'] != "Current") & (past_loans['loan_status'] != "Fully Paid")].head(100)

Unnamed: 0_level_0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,fico_range_low,fico_range_high,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,last_fico_range_high,last_fico_range_low,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,acc_now_delinq,tot_coll_amt,tot_cur_bal,total_credit_rv
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1
37662224,40425321,7650,7650,7650,36 months,0.1366,260.20,C,C3,Technical Specialist,< 1 year,RENT,50000.00,Source Verified,2014-12-01,Charged Off,n,https://www.lendingclub.com/browse/loanDetail....,,debt_consolidation,Debt consolidation,850xx,AZ,34.81,0,2002-08-01,685,689,1,,,11,0,16822,91.9%,20,f,0.00,0.00,1043.99,1043.99,704.38,339.61,0.00,0.00,0.0000,2015-08-01,17.70,NaT,2015-12-01,539,535,0,,1,INDIVIDUAL,0,0,64426,18300
37800722,40563521,12975,12975,12975,36 months,0.1786,468.17,D,D5,Sales,10+ years,RENT,60000.00,Source Verified,2014-12-01,Late (31-120 days),n,https://www.lendingclub.com/browse/loanDetail....,,house,Home buying,331xx,FL,22.42,0,1999-01-01,680,684,0,48,,11,0,5200,33.1%,19,f,10346.86,10346.86,4181.34,4181.34,2628.14,1553.20,0.00,0.00,0.0000,2015-10-01,468.17,2016-01-01,2015-12-01,634,630,0,,1,INDIVIDUAL,0,900,17281,15700
37822030,40585070,18450,18450,18450,36 months,0.1431,633.36,C,C4,construction foreman,10+ years,MORTGAGE,108000.00,Not Verified,2014-12-01,Late (31-120 days),n,https://www.lendingclub.com/browse/loanDetail....,,home_improvement,Home improvement,810xx,CO,23.37,0,2003-03-01,680,684,3,,,11,0,5925,87%,20,f,12981.68,12981.68,6932.56,6932.56,5468.32,1464.24,0.00,0.00,0.0000,2015-09-01,633.36,2016-01-01,2015-12-01,614,610,0,100,1,INDIVIDUAL,0,0,237795,6800
37791995,40555038,4500,4500,4500,36 months,0.1499,155.98,C,C5,Nanny,4 years,OWN,34000.00,Not Verified,2014-12-01,Late (31-120 days),n,https://www.lendingclub.com/browse/loanDetail....,,major_purchase,Major purchase,210xx,MD,28.77,0,1997-02-01,660,664,1,25,54,18,1,12506,45%,51,f,3444.33,3444.33,1554.18,1554.18,1055.67,498.51,0.00,0.00,0.0000,2015-11-01,155.98,2016-01-01,2015-12-01,654,650,0,25,1,INDIVIDUAL,0,5098,24481,27800
37711991,40485052,14000,14000,14000,36 months,0.1649,495.60,D,D3,Customerservice,9 years,RENT,45000.00,Source Verified,2014-12-01,Charged Off,n,https://www.lendingclub.com/browse/loanDetail....,,credit_card,Credit card refinancing,331xx,FL,13.47,0,2008-08-01,660,664,1,,,20,0,10585,39.9%,22,f,0.00,0.00,2955.08,2955.08,303.22,173.14,0.00,2478.72,446.1696,2015-02-01,495.60,NaT,2015-06-01,559,555,0,,1,INDIVIDUAL,0,0,15659,26500
37671924,40434950,16000,16000,16000,36 months,0.1299,539.03,C,C2,Sr manager,< 1 year,MORTGAGE,80000.00,Source Verified,2014-12-01,Charged Off,n,https://www.lendingclub.com/browse/loanDetail....,,debt_consolidation,Debt consolidation,800xx,CO,22.32,6,2001-12-01,660,664,1,4,,17,0,10768,53.8%,46,f,0.00,0.00,2677.83,2677.83,1869.18,808.65,0.00,0.00,0.0000,2015-06-01,539.03,NaT,2015-12-01,574,570,0,,1,INDIVIDUAL,0,0,318905,20000
37752007,40515020,22200,22200,22200,60 months,0.1714,553.40,D,D4,Clinical Data Lead,10+ years,RENT,74500.00,Source Verified,2014-12-01,Late (31-120 days),n,https://www.lendingclub.com/browse/loanDetail....,,debt_consolidation,Debt consolidation,015xx,MA,8.05,0,2001-09-01,695,699,0,36,,7,0,7014,47.4%,14,w,19679.08,19679.08,5502.29,5502.29,2520.92,2981.37,0.00,0.00,0.0000,2015-11-01,553.40,2016-01-01,2015-12-01,574,570,0,,1,INDIVIDUAL,0,0,11528,14800
37108024,39870843,13825,13825,13825,60 months,0.1366,319.26,C,C3,parks dept.,< 1 year,RENT,27800.00,Source Verified,2014-12-01,Late (31-120 days),n,https://www.lendingclub.com/browse/loanDetail....,,debt_consolidation,Debt consolidation,981xx,WA,24.22,0,1991-04-01,695,699,2,,,6,0,11048,47.4%,12,w,11962.36,11962.36,3496.12,3496.12,1862.64,1633.48,0.00,0.00,0.0000,2015-12-01,319.26,2016-02-01,2015-12-01,709,705,0,,1,INDIVIDUAL,0,0,19394,23300
37711880,40484927,20000,20000,20000,36 months,0.1559,699.10,D,D1,Graphic Designer,2 years,MORTGAGE,50000.00,Source Verified,2014-12-01,Late (31-120 days),n,https://www.lendingclub.com/browse/loanDetail....,,credit_card,Credit card refinancing,189xx,PA,16.47,0,2010-03-01,660,664,1,,,9,0,7186,74.1%,12,f,16334.19,16334.19,5636.74,5636.74,3665.81,1901.01,69.92,0.00,0.0000,2015-09-01,1433.16,2016-01-01,2015-12-01,499,0,0,,1,INDIVIDUAL,0,0,21850,9700
37781998,40545047,14400,14400,14400,60 months,0.1924,375.45,E,E2,Administrative Assistant,10+ years,MORTGAGE,70000.00,Verified,2014-12-01,Charged Off,n,https://www.lendingclub.com/browse/loanDetail....,,debt_consolidation,Debt consolidation,105xx,NY,26.81,0,1976-04-01,665,669,1,29,68,14,1,11080,73.9%,29,f,0.00,0.00,727.81,727.81,291.46,436.35,0.00,0.00,0.0000,2015-03-01,375.45,NaT,2015-12-01,584,580,0,67,1,INDIVIDUAL,0,0,40195,15000


# Explore different categorical data
What values are inside the categorical fields?

In [20]:
pd.unique(past_loans['term'])

array([' 36 months', ' 60 months', nan], dtype=object)

In [21]:
pd.unique(past_loans['grade'])

array(['A', 'C', 'D', 'B', 'E', 'F', 'G', nan], dtype=object)

In [22]:
pd.unique(past_loans['sub_grade'])

array(['A3', 'C1', 'D4', 'C3', 'D1', 'D5', 'B5', 'B4', 'C4', 'E5', 'D2',
       'B3', 'C5', 'E4', 'E3', 'C2', 'B2', 'A5', 'F1', 'B1', 'D3', 'A4',
       'E1', 'E2', 'G2', 'A1', 'G1', 'A2', 'F3', 'F2', 'G3', 'F4', 'G4',
       'F5', 'G5', nan], dtype=object)

In [23]:
pd.unique(past_loans['home_ownership'])

array(['MORTGAGE', 'RENT', 'OWN', 'ANY', nan], dtype=object)

In [24]:
pd.unique(past_loans['verification_status'])

array(['Not Verified', 'Source Verified', 'Verified', nan], dtype=object)

In [25]:
pd.unique(past_loans['loan_status'])

array(['Current', 'Fully Paid', 'Charged Off', 'Late (31-120 days)',
       'In Grace Period', 'Late (16-30 days)', 'Default', nan], dtype=object)

In [26]:
pd.unique(past_loans['pymnt_plan'])

array(['n', 'y', nan], dtype=object)

In [27]:
pd.unique(past_loans['purpose'])

array(['credit_card', 'debt_consolidation', 'car', 'house',
       'home_improvement', 'other', 'medical', 'moving', 'major_purchase',
       'vacation', 'small_business', 'renewable_energy', 'wedding', nan], dtype=object)

# Examine field types and amount of data
Is there any missing data?

In [28]:
past_loans.info()

<class 'pandas.core.frame.DataFrame'>
Index: 235631 entries, 36805548 to Total amount funded in policy code 2: 873652739
Data columns (total 60 columns):
member_id                      235629 non-null float64
loan_amnt                      235629 non-null float64
funded_amnt                    235629 non-null float64
funded_amnt_inv                235629 non-null float64
term                           235629 non-null category
int_rate                       235629 non-null float64
installment                    235629 non-null float64
grade                          235629 non-null category
sub_grade                      235629 non-null category
emp_title                      222393 non-null object
emp_length                     235629 non-null object
home_ownership                 235629 non-null category
annual_inc                     235629 non-null float64
verification_status            235629 non-null category
issue_d                        235629 non-null datetime64[ns]
loan_status

# Check out summary stats on the numeric fields

In [29]:
past_loans.describe().transpose() #.to_string()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
member_id,235629,24019299.796719,8825162.024128,137225.0,15579724.0,22953173.0,31767065.0,40860827.0
loan_amnt,235629,14870.156793,8438.318193,1000.0,8325.0,13000.0,20000.0,35000.0
funded_amnt,235629,14870.156793,8438.318193,1000.0,8325.0,13000.0,20000.0,35000.0
funded_amnt_inv,235629,14865.334169,8435.524995,950.0,8325.0,13000.0,20000.0,35000.0
int_rate,235629,0.137713,0.043255,0.06,0.1099,0.1365,0.1629,0.2606
installment,235629,442.482374,245.050238,23.36,265.68,384.12,578.71,1409.99
annual_inc,235629,74854.148281,55547.533374,3000.0,45377.0,65000.0,90000.0,7500000.0
dti,235629,18.04077,8.023002,0.0,12.02,17.63,23.76,39.99
delinq_2yrs,235629,0.344512,0.898319,0.0,0.0,0.0,0.0,22.0
fico_range_low,235629,692.497358,29.246641,660.0,670.0,685.0,705.0,845.0


## Explore date fields

In [14]:
past_loans.groupby(past_loans['last_pymnt_d'].map(lambda x: x.year))

<pandas.core.groupby.DataFrameGroupBy object at 0x10fb62190>

In [38]:
pd.unique(past_loans.sort('last_pymnt_d')['last_pymnt_d'])

  if __name__ == '__main__':


array(['2014-01-31T19:00:00.000000000-0500',
       '2014-02-28T19:00:00.000000000-0500',
       '2014-03-31T20:00:00.000000000-0400',
       '2014-04-30T20:00:00.000000000-0400',
       '2014-05-31T20:00:00.000000000-0400',
       '2014-06-30T20:00:00.000000000-0400',
       '2014-07-31T20:00:00.000000000-0400',
       '2014-08-31T20:00:00.000000000-0400',
       '2014-09-30T20:00:00.000000000-0400',
       '2014-10-31T20:00:00.000000000-0400',
       '2014-11-30T19:00:00.000000000-0500',
       '2014-12-31T19:00:00.000000000-0500',
       '2015-01-31T19:00:00.000000000-0500',
       '2015-02-28T19:00:00.000000000-0500',
       '2015-03-31T20:00:00.000000000-0400',
       '2015-04-30T20:00:00.000000000-0400',
       '2015-05-31T20:00:00.000000000-0400',
       '2015-06-30T20:00:00.000000000-0400',
       '2015-07-31T20:00:00.000000000-0400',
       '2015-08-31T20:00:00.000000000-0400',
       '2015-09-30T20:00:00.000000000-0400',
       '2015-10-31T20:00:00.000000000-0400',
       '20

In [30]:
pd.unique(past_loans['issue_d'])

array(['2014-11-30T19:00:00.000000000-0500',
       '2014-10-31T20:00:00.000000000-0400',
       '2014-09-30T20:00:00.000000000-0400',
       '2014-08-31T20:00:00.000000000-0400',
       '2014-07-31T20:00:00.000000000-0400',
       '2014-06-30T20:00:00.000000000-0400',
       '2014-05-31T20:00:00.000000000-0400',
       '2014-04-30T20:00:00.000000000-0400',
       '2014-03-31T20:00:00.000000000-0400',
       '2014-02-28T19:00:00.000000000-0500',
       '2014-01-31T19:00:00.000000000-0500',
       '2013-12-31T19:00:00.000000000-0500', 'NaT'], dtype='datetime64[ns]')

# Fill in rows for each date to last payment

## Create list of possible dates to match with

In [None]:
# What are all the dates in the table?
# for each date, what are the earlier dates
# create new table with "current" until paid or written off
# table should stop before defaults and late payments start registering


In [None]:
all_dates = pd.unique(past_loans.sort('last_pymnt_d')['last_pymnt_d'])
all_dates

# Create dependent and independent data


In [80]:
X = past_loans.drop('loan_status', axis = 1)
y = past_loans['loan_status']

# Split into training and test data

In [81]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# 10 cross validation iterations with 20% test / 80% train
from sklearn.cross_validation import ShuffleSplit
cv = ShuffleSplit(X_train.shape[0], n_iter=10, test_size=0.2, random_state=0)

# Standardize the data

In [82]:
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
# transform our training features
X_train_std = stdsc.fit_transform(X_train)
# transform the testing features in the same way
X_test_std = stdsc.transform(X_test)

ValueError: could not convert string to float: INDIVIDUAL