# Loan Cancellation #

## Summary
I focused on building a parsimonious model: first reducing the variables to a workable, representative set, then examining the effect of each variable on the probability of loan cancellation and discarding those with limited predictive power. My final model uses as predictors whether the loanee has taken out a Prosper loan before, the size of the loan requested, the loan's interest rate, the loanee's monthly debt, monthly income, employment status, credit card utilization and number of real estate trades, and the total number of inquiries. My final model achieves an AUC score of xxx.

Insert Figure

## Methodology

### Dimensional Reduction

To handle the large number of columns in the data (86 to start), my initial focus was dimensional reduction: finding a smaller set of variables that still encompasses most of the original information. One way to do this would be Principal Component Analysis (PCA) followed by rotation of eigenvectors, but since I was dealing with an unfamiliar data set, I preferred a method that would let me pare down the data more deliberately. Instead, I used Pearson's r to find highly correlated variables, which I then turned into a graph with networkx to find connected families. After doing this, I chose one representative variable from each family and dropped the rest.

My first application of this graph technique, with a cutoff of r = .55, found twelve families and reduced the number of variables from 86 to 48. I chose r such that the largest family would contain no more than 10 variables. I then reapplied the same method with a lower cutoff of r = .45. This method proved to be imperfect - in the original run-through, I ended up discarding NumPriorProsperLoans, which I later realized to be highly important.
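
The grouping step can be illustrated on a toy example. This is a sketch rather than the notebook's own code: it uses only the standard library, and the column names `a` to `d` and the helpers `pearson_r` and `correlated_families` are made up for illustration. Two variables are linked when |r| exceeds the cutoff, and each connected component becomes a family. The toy data is built so that `a` and `c` are uncorrelated with each other but both correlate with `b`, which shows how a family can chain together variables that are not directly related. That is one reason to check each family by hand before dropping anything.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_families(columns, cutoff):
    """Group column names into families linked by |r| > cutoff."""
    # adjacency list: an edge joins two columns whose |r| exceeds the cutoff
    neighbours = {name: set() for name in columns}
    for u, v in combinations(columns, 2):
        if abs(pearson_r(columns[u], columns[v])) > cutoff:
            neighbours[u].add(v)
            neighbours[v].add(u)

    # depth-first search collects each connected component
    seen, families = set(), []
    for start in columns:
        if start in seen or not neighbours[start]:
            continue  # already assigned, or not linked to anything
        stack, family = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            family.append(node)
            stack.extend(neighbours[node] - seen)
        families.append(sorted(family))
    return sorted(families)

# hypothetical toy columns: b = a + c, while a, c and d are mutually uncorrelated
toy = {
    'a': [1, -1, 1, -1],
    'b': [2, 0, 0, -2],
    'c': [1, 1, -1, -1],
    'd': [1, -1, -1, 1],
}

print(correlated_families(toy, cutoff=0.55))
```

Here r(a, b) and r(b, c) are both about 0.71 while r(a, c) is 0, yet all three land in one family because `b` bridges them. `d` is correlated with nothing and is left out.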

### Logistic Regression with Individual Variables

I then performed a logistic regression with each individual variable to see whether a significant relation exists, and then I started to build my final logistic model one variable at a time. In the case of clearly related families of variables, such as the combination of DolMonthlyDebt-DolMonthlyIncome-FracDebtToIncomeRatio, I investigated the set of variables as a group to see what combination of variables provided better predictivity (in this case, DolMonthlyDebt and DolMonthlyIncome performed best). Plugging in each variable at a time proved painstaking, but given my unfamiliarity with loan data I didn't feel comfortable with making assumptions about which variables might be significant.

In the end, I created the boolean variable BoolPriorProsperLoanee and used it alongside the numerical variables DolLoanAmountRequested, PctBankcardUtil, DolMonthlyDebt, DolMonthlyIncome, BorrowerRate, NumRealEstateTrades and NumTotalInquiries, and the categorical variable StrEmploymentStatus.

## Findings

* The single best predictor of loan cancellation is simply familiarity with the Prosper platform. Users who have taken out a Prosper loan before have only a 13% chance of cancellation, versus 36% if they have not.

* Potential loanees who list their employment status as "Other" are much more likely to cancel than ...

* One might expect the interest rate of the loan to strongly influence cancellations, but in fact interest rate has only a subtle influence on cancellation, subordinate to other factors.
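
To put the first finding on the scale that logistic regression works with, the two cancellation rates can be converted into a relative risk and an odds ratio. The short calculation below uses only the 13% and 36% figures quoted above; the variable names are my own.

```python
# Cancellation rates quoted in the findings above
p_repeat = 0.13   # loanees with a prior Prosper loan
p_new = 0.36      # loanees without one

def odds(p):
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1.0 - p)

relative_risk = p_new / p_repeat
odds_ratio = odds(p_new) / odds(p_repeat)

print(f"relative risk: {relative_risk:.2f}")  # about 2.77
print(f"odds ratio:    {odds_ratio:.2f}")     # about 3.76
```

First-time loanees are therefore roughly 2.8 times as likely to cancel, which corresponds to odds about 3.8 times higher.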

In [2]:
from datetime import datetime, timedelta
import itertools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
import scipy.stats as stats
import seaborn as sns
from sklearn import linear_model
from sklearn import metrics
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

import networkx as nx
from networkx.algorithms.components.connected import connected_components

%matplotlib notebook

#show up to 100 columns and rows.
pd.set_option('display.max_columns', 100, 'display.max_rows', 100)

Load in the data from the pickle file

In [3]:
data_file = "new_theorem_data.p"

## data appears to have been saved in Python 2 - changing encoding allows us to properly load data
with open(data_file, 'rb') as pickle_file:
    data = pickle.load(pickle_file, encoding='latin1') 

Sample listing below

In [4]:
data.iloc[0]

ListingID                                                       973605
DateCreditPulled                                   2013-10-13 01:50:58
DateListingStart                            2014-01-26 19:00:08.887000
DateListingCreation                         2013-10-13 01:50:56.287000
EnumListingStatus                                                    7
DolLoanAmountRequested                                           15000
BoolPartialFundingApproved                                        True
CreditGrade                                                          B
LenderYield                                                      0.152
BorrowerRate                                                     0.162
NumMonthsTerm                                                       60
DolMonthlyLoanPayment                                           366.37
FICOScore                                                          689
ProsperScore                                                         6
EnumLi

In [5]:
print('Number of rows:',len(data))
cols = data.columns.values
print('Number of columns:',len(cols))

Number of rows: 252469
Number of columns: 86


Let's visualize how the different columns are inter-related by showing a heat map of Pearson's r between pairs of variables.

In [6]:
corr = data.corr()
corr.head()

## 86 different categories - start by taking correlation matrix to figure out which categories are redundant
f, ax = plt.subplots(figsize=(9,7))

# Draw the heatmap using seaborn
sns.heatmap(corr, square=True, cbar=True, xticklabels=False, yticklabels=False)


There are many visible families of variables encoding very similar information. Let's gather these sets of variables by turning them into a graph network. I've provided a solution using networkx below.

In [7]:
## thank you to stack overflow for the elegant solution: 
## http://stackoverflow.com/questions/4842613/merge-lists-that-share-common-elements
def to_graph(l):
    G = nx.Graph()
    for part in l:
        G.add_nodes_from(part)
        G.add_edges_from(to_edges(part))
    return G

def to_edges(l):
    """ 
        treat `l` as a path graph and return its edges 
        to_edges(['a','b','c','d']) -> [(a,b), (b,c),(c,d)]
    """
    it = iter(l)
    last = next(it)

    for current in it:
        yield last, current
        last = current    

The following function takes a dataframe of correlations and a threshold, and returns a sorted list of lists, one per family of highly correlated variables.

In [8]:
def find_families(corr_df, thresh):
    
    tentative_families = []

    ## for each variable, list every other variable that it is highly correlated with
    for name, col in corr_df.items():

        highly_correlated = (abs(col) > thresh) & (col.index != name)
        high_corr_list = col[highly_correlated].index.tolist()

        #need to add the variable itself if the family is non-empty
        if high_corr_list != []:
            tentative_families.append(sorted(high_corr_list + [ name ]))

    ## sort, then drop duplicate lists (groupby collapses identical neighbours)
    tentative_families.sort()
    tentative_families = [fam for fam, _ in itertools.groupby(tentative_families)]

    ## use networkx to convert the lists of connected nodes into a graph
    G = to_graph(tentative_families)
    cc = nx.connected_components(G)  ## yields one set of node names per connected component
    families = []

    for nodes in cc:
        families.append(list(nodes))

    families = sorted([ sorted(fam) for fam in families ])

    ## let's print a list of all the families we've found
    for fam in families:
        print(fam, len(fam))
   
    ll = [ len(fam) for fam in families ]
    print("In total,", sum(ll), "variables found in", len(families), "families.")
    
    return families

For each variable in our data, we find a list of variables that it is highly correlated with, transform these linkages into a graph, and output families of connected nodes.

In [9]:
## cutoff of .6 recovers 47 inter-connected variables, .5 recovers 52, .55 recovers 50
## cutoff of .55 features largest group at length 10 - probably don't want to exceed that
cutoff = .55
families_1 = find_families(corr,cutoff)

['BoolEverWholeLoan', 'BoolIsFractionalLoan', 'EnumLoanFractionalType'] 3
['BoolOwnsHome', 'DolMonthlyDebt', 'DolRealEstateBalance', 'DolRealEstatePayment', 'DolTotalBalanceAllOpenTrades6', 'DolTotalBalanceInstallTradesReptd6', 'DolTotalPaymentAllOpenTrades6'] 7
['BorrowerRate', 'LenderYield', 'ProsperScore'] 3
['DolLoanAmountRequested', 'DolMonthlyLoanPayment'] 2
['DolMaxPriorProsperLoan', 'DolMinPriorProsperLoan', 'DolPriorProsperLoansBalanceOutstanding', 'DolPriorProsperLoansPrincipalBorrowed', 'DolPriorProsperLoansPrincipalOutstanding', 'NumPriorProsperLoans', 'NumPriorProsperLoansActive', 'NumPriorProsperLoansCyclesBilled', 'NumPriorProsperLoansEarliestPayOff', 'NumPriorProsperLoansOnTimePayments'] 10
['DolRevolvingBalance', 'DolTotalBalanceOpenRevolving6'] 2
['NumBankcardTradesOpened12', 'NumCreditLines84', 'NumCurrentCreditLines', 'NumOpenCreditLines', 'NumOpenRevolvingAccounts', 'NumSatisfactoryAccounts', 'NumTrades'] 7
['NumDelinquencies84', 'NumDelinquenciesOver30Days', 'NumD

I go through each family individually, give each one a name, and then choose a single variable to represent that family. Then, drop all of the variables in each family except for the chosen representative variable.

In [10]:
## give custom descriptive names to each family
# two items: first, each unit in the family; second, the variable(s) you're going to keep around.
family_dict = {}
family_dict['fractional_loan'] = [families_1[0],'BoolEverWholeLoan']
family_dict['debt'] = [families_1[1],'DolMonthlyDebt']
family_dict['borrower_rate'] = [families_1[2],'BorrowerRate']
family_dict['loan_amount'] = [families_1[3],'DolLoanAmountRequested']
family_dict['prosper_history'] = [families_1[4],'NumPriorProsperLoans']
family_dict['revolving_balance'] = [families_1[5],'DolTotalBalanceOpenRevolving6']
family_dict['credit'] = [families_1[6],'NumCurrentCreditLines']
family_dict['delinquencies'] = [families_1[7],'PctTradesNeverDelinquent']
family_dict['inquiries'] = [families_1[8],'NumTotalInquiries']
family_dict['prior_prosper_loans'] = [families_1[9],'NumPriorProsperLoans61dpd']
family_dict['real_estate'] = [families_1[10],'NumRealEstateTrades']
family_dict['current_delinquency'] = [families_1[11],'NumTradesCurr30DPDOrDerog6']

## leave original data unaltered, work with dataframe reduced_data instead...
reduced_data = data.copy(deep=True)

In [11]:
## drop similar columns as determined by families
for fam, items in family_dict.items():
    drop_cols = list(items[0])
    drop_cols.remove(items[1]) #keep our chosen representative variable for that family
    reduced_data.drop(drop_cols, axis = 1, inplace = True)
    
print('New number of columns after first filtering:',len(reduced_data.columns.values))

## save this preliminary set of variables as a pickle file.
reduced_data.to_pickle('theorem_reduced_firstfilter.pkl')

New number of columns after first filtering: 48


We can take a quick look at the remaining columns.

In [12]:
print(reduced_data.columns.values)

['ListingID' 'DateCreditPulled' 'DateListingStart' 'DateListingCreation'
 'EnumListingStatus' 'DolLoanAmountRequested' 'BoolPartialFundingApproved'
 'CreditGrade' 'BorrowerRate' 'NumMonthsTerm' 'FICOScore'
 'EnumListingCategory' 'DolMonthlyIncome' 'BoolIncomeVerifiable'
 'FracDebtToIncomeRatio' 'StrEmploymentStatus' 'StrOccupation'
 'NumMonthsEmployed' 'StrState' 'StrBorrowerCity' 'NumPriorProsperLoans'
 'NumPriorProsperLoansLateCycles' 'NumPriorProsperLoansLatePayments'
 'NumPriorProsperLoans61dpd' 'BoolIsLender' 'BoolInGroup' 'EnumChannelCode'
 'NumTradesOpened6' 'NumOpenTradesDelinqOrPastDue6'
 'NumTradesCurr30DPDOrDerog6' 'DolTotalBalanceOnPublicRecords'
 'AgeOldestTrade' 'PctTradesNeverDelinquent' 'DolTotalAvailBankcardCredit6'
 'NumRealEstateTrades' 'DolTotalBalanceOpenRevolving6' 'DolMonthlyDebt'
 'NumCurrentDelinquencies' 'NumPublicRecordsLast10Years'
 'NumPublicRecords12' 'DateFirstCredit' 'DolAmountDelinquent'
 'NumCurrentCreditLines' 'PctBankcardUtil' 'NumTotalInquiries'
 'D

Start looking at individual columns and remove extraneous columns such as dates and ID numbers. For instance, variables such as DateCreditPulled, DateListingCreation, DateListingStart and DateWholeLoanStart are unlikely to reflect consumer decision-making. Similarly, while there may be interesting things to be said about the influence of state/city on loan cancellation, they again shouldn't really strongly affect decision-making, and any effects are probably subordinate to direct economic effects.

In [13]:
## immediately drop columns that are unlikely/unable to cause loan cancellation (ID numbers/dates)
drop_cols = ['ListingID', 'DateCreditPulled', 'DateListingStart', 'DateListingCreation', 'DateWholeLoanStart', 'DateWholeLoanEnd']
reduced_data.drop(drop_cols, axis = 1, inplace = True)

## manually drop a few other columns that won't help this analysis.
reduced_data.drop(['StrState','StrBorrowerCity'], axis=1, inplace=True)

Although it's still too early to work with each column individually, some of these columns can be translated into a more useful form - in particular, DateFirstCredit.

In [14]:
## replace DateFirstCredit column with an integer
days_since_firstcredit = datetime.now()-reduced_data['DateFirstCredit']
reduced_data['DaysSinceFirstCredit'] = [ x.days for x in days_since_firstcredit ]
reduced_data.drop('DateFirstCredit', axis=1, inplace=True)

Similarly, it turns out that BorrowerRate encodes the same information as CreditGrade, only with greater precision, so CreditGrade can be dropped.

In [15]:
## quick plot to show that BorrowerRate and CreditGrade are pretty much equivalent - drop CreditGrade
reduced_data['BorrowerRate'].hist(by=reduced_data['CreditGrade'])
reduced_data.drop('CreditGrade',axis =1, inplace=True)
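
Besides the histogram, the equivalence can be checked numerically by asking whether each grade occupies its own band of interest rates. The sketch below shows the idea on a handful of hypothetical (grade, rate) pairs using only the standard library. The numbers are invented for illustration and are not taken from the data, and the helper names are my own.

```python
from collections import defaultdict

# hypothetical (CreditGrade, BorrowerRate) pairs, for illustration only
listings = [('A', 0.105), ('A', 0.119), ('B', 0.142),
            ('B', 0.162), ('C', 0.185), ('C', 0.209)]

def rate_range_by_grade(rows):
    """Return {grade: (min_rate, max_rate)} for (grade, rate) pairs."""
    rates = defaultdict(list)
    for grade, rate in rows:
        rates[grade].append(rate)
    return {g: (min(r), max(r)) for g, r in rates.items()}

def ranges_overlap(ranges):
    """True if any two grades share part of their rate interval."""
    # after sorting by lower bound, any overlap shows up between neighbours
    ordered = sorted(ranges.values())
    return any(prev_hi >= lo
               for (_, prev_hi), (lo, _) in zip(ordered, ordered[1:]))

ranges = rate_range_by_grade(listings)
print(ranges)
print('grades overlap:', ranges_overlap(ranges))
```

On the real data, the same table could be produced before CreditGrade is dropped with `reduced_data.groupby('CreditGrade')['BorrowerRate'].agg(['min', 'max'])`. If the bands barely overlap, the grade adds little beyond the rate.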


We're down to 39 columns.

In [16]:
print('New number of columns after removing variables:',len(reduced_data.columns.values))

New number of columns after removing variables: 39


Let's use the graph-family-correlation method again to further reduce the number of variables we're working with, this time with a lower threshold.

In [17]:
## show another correlation matrix, this time with reduced set of variables.
reduced_corr = reduced_data.corr()

f, ax = plt.subplots(figsize=(9,7))

# Draw the heatmap using seaborn
sns.heatmap(reduced_corr, square=True, cbar=True, xticklabels=False, yticklabels=False)


There are still pockets of highly correlated variables, so let's rerun the method.

In [18]:
cutoff = .45
families_2 = find_families(reduced_corr,cutoff)

['AgeOldestTrade', 'DaysSinceFirstCredit'] 2
['BorrowerRate', 'DolTotalAvailBankcardCredit6', 'FICOScore', 'PctTradesNeverDelinquent'] 4
['DolMonthlyDebt', 'DolTotalBalanceOpenRevolving6', 'NumCurrentCreditLines'] 3
['NumCurrentDelinquencies', 'NumTradesCurr30DPDOrDerog6'] 2
['NumPriorProsperLoansLateCycles', 'NumPriorProsperLoansLatePayments'] 2
In total, 13 variables found in 5 families.


In [19]:
## give custom descriptive names to each family
# two items: first, each unit in the family; second, the variable(s) you're going to keep around.
family_dict_2 = {}
family_dict_2['oldest_trade'] = [families_2[0],'DaysSinceFirstCredit']
family_dict_2['borrower_rate'] = [families_2[1],'BorrowerRate']
family_dict_2['monthly_debt'] = [families_2[2],'DolMonthlyDebt']
family_dict_2['current_delinquencies'] = [families_2[3],'NumCurrentDelinquencies']
family_dict_2['prosper_delinquencies'] = [families_2[4],'NumPriorProsperLoansLatePayments']

## again, drop similar columns as determined by families
for fam, items in family_dict_2.items():
    drop_cols = list(items[0])
    drop_cols.remove(items[1]) #keep our chosen representative variable for that family
    reduced_data.drop(drop_cols, axis = 1, inplace = True)

In [20]:
print("Number of columns:",len(reduced_data.columns.values))
reduced_data.head()

Number of columns: 31


ListingNumber,EnumListingStatus,DolLoanAmountRequested,BoolPartialFundingApproved,BorrowerRate,NumMonthsTerm,EnumListingCategory,DolMonthlyIncome,BoolIncomeVerifiable,FracDebtToIncomeRatio,StrEmploymentStatus,StrOccupation,NumMonthsEmployed,NumPriorProsperLoans,NumPriorProsperLoansLatePayments,NumPriorProsperLoans61dpd,BoolIsLender,BoolInGroup,EnumChannelCode,NumTradesOpened6,NumOpenTradesDelinqOrPastDue6,DolTotalBalanceOnPublicRecords,NumRealEstateTrades,DolMonthlyDebt,NumCurrentDelinquencies,NumPublicRecordsLast10Years,NumPublicRecords12,DolAmountDelinquent,PctBankcardUtil,NumTotalInquiries,BoolEverWholeLoan,DaysSinceFirstCredit
973605,7,15000.0,True,0.162,60,1,6000.0,True,0.27,Employed,Tradesman - Mechanic,445.0,0,,,0,False,70000,1,0,0,2,1242,0,0,0,0,0.97,5,True,13914
981099,7,15000.0,True,0.1585,60,1,7916.6667,True,0.35,Other,,32.0,0,,,0,False,70000,1,0,0,2,2289,0,0,0,0,0.48,3,True,14251
1025766,6,4000.0,True,0.2085,36,1,2083.3333,True,0.53,Employed,Professional,4.0,3,0.0,,0,False,80000,0,0,0,0,911,0,0,0,0,0.93,5,False,4159
1003835,7,10000.0,True,0.1299,36,13,3750.0,True,0.14,Employed,Medical Technician,2.0,0,,,0,False,90000,1,0,0,0,223,0,0,0,0,0.26,1,True,2955
1011335,6,20000.0,True,0.144,60,1,9000.0,True,0.16,Employed,Executive,90.0,1,0.0,,0,False,80000,1,0,1249,1,1264,1,2,0,0,0.81,17,False,8342


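A closer look at BoolIncomeVerifiable shows that it is almost entirely determined by StrEmploymentStatus: income is unverifiable almost exclusively for self-employed loanees (99.96% of the unverifiable rows), so the variable can safely be dropped.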

In [21]:
print('BoolIncomeVerifiable is false:\n', \
      reduced_data[reduced_data['BoolIncomeVerifiable'] == False]['StrEmploymentStatus'].value_counts(normalize = True))
print('BoolIncomeVerifiable is true:\n', \
      reduced_data[reduced_data['BoolIncomeVerifiable'] == True]['StrEmploymentStatus'].value_counts(normalize = True))

reduced_data.drop('BoolIncomeVerifiable', axis = 1, inplace = True)

BoolIncomeVerifiable is false:
 Self-employed    0.999616
Not employed     0.000256
Employed         0.000128
Name: StrEmploymentStatus, dtype: float64
BoolIncomeVerifiable is true:
 Employed         0.914816
Other            0.082516
Full-time        0.002592
Self-employed    0.000055
Part-time        0.000021
Name: StrEmploymentStatus, dtype: float64


Last change: logistic regression requires that the target variable be either 0 or 1. EnumListingStatus takes the values 6 and 7 in the rows shown above, so let's create a new variable called "Cancelled", which is just EnumListingStatus minus 6.

In [22]:
## replace EnumListingStatus with Cancelled
reduced_data['Cancelled'] = reduced_data['EnumListingStatus']-6
reduced_data.drop('EnumListingStatus', axis=1, inplace = True)
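
Subtracting 6 only gives a valid 0/1 target if EnumListingStatus never takes any other value. A slightly more defensive variant maps the codes explicitly and fails loudly on anything unexpected. The sketch below is plain Python with names of my own choosing, and it assumes (as "status minus 6" implies) that 7 marks a cancelled listing.

```python
# Assumption: EnumListingStatus only takes the codes 6 and 7, and 7 marks a
# cancelled listing (this is what "status minus 6" implies in the cell above).
STATUS_TO_CANCELLED = {6: 0, 7: 1}

def to_cancelled(status_codes):
    """Map listing status codes to a 0/1 target, failing loudly on surprises."""
    unexpected = sorted(set(status_codes) - set(STATUS_TO_CANCELLED))
    if unexpected:
        raise ValueError(f"unexpected EnumListingStatus codes: {unexpected}")
    return [STATUS_TO_CANCELLED[code] for code in status_codes]

# status codes of the five sample rows shown by reduced_data.head()
print(to_cancelled([7, 7, 6, 7, 6]))  # [1, 1, 0, 1, 0]
```

With pandas, `Series.map` with the same dictionary does the equivalent job, turning any unmapped code into NaN instead of raising.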

In [23]:
## save to pickle file again.
reduced_data.to_pickle('theorem_reduced_secondfilter.pkl')

In [24]:
## What's left:
print(reduced_data.columns.values)

['DolLoanAmountRequested' 'BoolPartialFundingApproved' 'BorrowerRate'
 'NumMonthsTerm' 'EnumListingCategory' 'DolMonthlyIncome'
 'FracDebtToIncomeRatio' 'StrEmploymentStatus' 'StrOccupation'
 'NumMonthsEmployed' 'NumPriorProsperLoans'
 'NumPriorProsperLoansLatePayments' 'NumPriorProsperLoans61dpd'
 'BoolIsLender' 'BoolInGroup' 'EnumChannelCode' 'NumTradesOpened6'
 'NumOpenTradesDelinqOrPastDue6' 'DolTotalBalanceOnPublicRecords'
 'NumRealEstateTrades' 'DolMonthlyDebt' 'NumCurrentDelinquencies'
 'NumPublicRecordsLast10Years' 'NumPublicRecords12' 'DolAmountDelinquent'
 'PctBankcardUtil' 'NumTotalInquiries' 'BoolEverWholeLoan'
 'DaysSinceFirstCredit' 'Cancelled']


This is the final set of variables that we're going to start investigating. We've gone from 86 columns down to 30 columns. Here's a breakdown of what remains:

* Target: Cancelled
* Numerical variables: DolLoanAmountRequested, BorrowerRate, NumMonthsTerm, DolMonthlyIncome, FracDebtToIncomeRatio, NumMonthsEmployed, NumPriorProsperLoans, NumPriorProsperLoansLatePayments, NumPriorProsperLoans61dpd, NumTradesOpened6, NumOpenTradesDelinqOrPastDue6, DolTotalBalanceOnPublicRecords, NumRealEstateTrades, DolMonthlyDebt, NumCurrentDelinquencies, NumPublicRecordsLast10Years, NumPublicRecords12, DolAmountDelinquent, PctBankcardUtil, NumTotalInquiries, DaysSinceFirstCredit (21)
* Categorical variables: EnumListingCategory, StrEmploymentStatus, StrOccupation, EnumChannelCode (4)
* Boolean variables: BoolPartialFundingApproved, BoolIsLender, BoolInGroup, BoolEverWholeLoan (4)

Many of these variables have strange distributions, and will have to be treated individually. In particular, some of these variables are likely better suited to be bools, as demonstrated shortly, and others have large numbers of NaNs that we'll have to figure out what to do with.

Time to visualize the data! NumPriorProsperLoans61dpd is left out because, on closer inspection, all but 50 of the rows have NaNs in that variable. We also use NumPublicRecordsLast10Years instead of NumPublicRecords12 to explore the link between public records and cancellation.

In [25]:
## only 50 people have a non-zero value for this criterion!
sum(np.isnan(reduced_data['NumPriorProsperLoans61dpd']))
reduced_data['NumPriorProsperLoans61dpd'].value_counts()

## similarly, very few people have a public record from the last 12 months - let's use 10-year version instead for now
sum(np.isnan(reduced_data['NumPublicRecords12']))
reduced_data['NumPublicRecords12'].value_counts()

0    251457
1       894
2        92
3        18
4         7
6         1
Name: NumPublicRecords12, dtype: int64

In [26]:
## 19 numerical variables - show a plot of each one against Cancelled (the target stays in reduced_data)
numerics = reduced_data[['DolLoanAmountRequested','BorrowerRate','NumMonthsTerm','DolMonthlyIncome','FracDebtToIncomeRatio',\
'NumMonthsEmployed','NumPriorProsperLoans','NumPriorProsperLoansLatePayments',\
'NumTradesOpened6','NumOpenTradesDelinqOrPastDue6','DolTotalBalanceOnPublicRecords','NumRealEstateTrades',\
'DolMonthlyDebt','NumCurrentDelinquencies','NumPublicRecordsLast10Years','DolAmountDelinquent','PctBankcardUtil',\
'NumTotalInquiries','DaysSinceFirstCredit']]

In [27]:
## create one big figure behind subpanels; 7 x 3 = 21 panels leave room for all 19 variables
fig, axes = plt.subplots(7,3,figsize=(12,16), facecolor='w')

## cycle through each numeric variable and logistic regress with cancellation (excluding cancellation itself)
for ax, var in zip(axes.reshape(-1),numerics.columns.values):
    sns.regplot(ax=ax, x=var, y="Cancelled", data=reduced_data, logistic=True, ci = None, x_bins = 200)


* Variables that immediately appear to show a strong correlation with loan cancellation: DolLoanAmountRequested, DolMonthlyIncome, FracDebtToIncomeRatio, NumMonthsEmployed, NumTradesOpened6, NumOpenTradesDelinqOrPastDue6, NumRealEstateTrades, DolMonthlyDebt, NumCurrentDelinquencies, PctBankcardUtil, NumTotalInquiries.

* A few variables immediately seem to demonstrate nonlinear behavior, most notably DaysSinceFirstCredit, DolMonthlyDebt, NumMonthsEmployed and maybe even BorrowerRate. That said, I'm reluctant to add non-linear terms to my eventual model for risk of overfitting.

* NumPriorProsperLoans is a great predictor of whether you cancel or not. This makes some intuitive sense: if you've been through the process before, it is much easier to follow through. In fact, a boolean variable seems more appropriate, which we implement immediately below.

In [28]:
fig, ax = plt.subplots()
sns.regplot(x="NumPriorProsperLoans", y="Cancelled", data=reduced_data, logistic=True, ci = None, x_bins = 500)


There's no evidence that taking out more than one loan makes you any less likely to cancel than taking out just one, so a logistic regression on the number of loans is a poor model. Let's change this variable to a boolean instead.

In [29]:
#Create new Boolean variable that shows whether someone has ever taken out a Prosper loan before or not.
reduced_data['BoolPriorProsperLoanee'] = reduced_data['NumPriorProsperLoans'] > 0
reduced_data.drop('NumPriorProsperLoans', inplace = True, axis =1)

As we did above for the numerical variables, let's check out the influence of the boolean variables in our set.

In [30]:
## 5 boolean variables to consider.
booleans = reduced_data[['BoolPriorProsperLoanee','BoolPartialFundingApproved','BoolIsLender','BoolInGroup','BoolEverWholeLoan']]

In [31]:
## create one big figure behind subpanels
fig, axes = plt.subplots(3,2,figsize=(8,10), facecolor='w')

## cycle through each boolean variable and logistic regress with cancellation
for ax, var in zip(axes.reshape(-1), booleans.columns.values):
    sns.regplot(ax=ax, x=var, y="Cancelled", data=reduced_data, logistic=True, ci = None, x_bins = 500)


BoolPriorProsperLoanee is highly significant, and BoolIsLender and BoolInGroup also look significant; BoolPartialFundingApproved and BoolEverWholeLoan do not. We drop the latter two variables right away.

In [32]:
reduced_data.drop(['BoolPartialFundingApproved','BoolEverWholeLoan'], inplace = True, axis =1)

In [33]:
print(reduced_data['BoolPriorProsperLoanee'].value_counts(normalize = True))
print(reduced_data['BoolIsLender'].value_counts(normalize = True))
print(reduced_data['BoolInGroup'].value_counts(normalize = True))

False    0.935481
True     0.064519
Name: BoolPriorProsperLoanee, dtype: float64
0    0.987567
1    0.012433
Name: BoolIsLender, dtype: float64
False    0.997366
True     0.002634
Name: BoolInGroup, dtype: float64


Even though these variables all looked significant, BoolIsLender and BoolInGroup are true for only a very small subset of the total loan population (about 1.2% and 0.3% respectively), so including them in our model probably won't improve skill very much. We'll still keep them in as candidate variables for now.

In [34]:
reduced_data.to_pickle('theorem_model_variables_test.pkl')

We added a couple of variables and dropped others based on our findings above - let's save our final variable pool.

# Building a logistic model

### First parameters

The plot above gives us a good idea of a few variables that make a big difference, including DolLoanAmountRequested, BoolPriorProsperLoanee and PctBankcardUtil. Loanees are more likely to cancel if the amount of the loan is higher, less likely to cancel if they have received a Prosper loan before, and less likely to cancel with higher bank card utilization. Let's plot some ROC curves for these variables, and also test out another variable, NumTotalInquiries.

In [90]:
variables = [['DolLoanAmountRequested'],['BoolPriorProsperLoanee','DolLoanAmountRequested'],\
       ['BoolPriorProsperLoanee','DolLoanAmountRequested','PctBankcardUtil'],\
            ['BoolPriorProsperLoanee','DolLoanAmountRequested','PctBankcardUtil','NumTotalInquiries']]

y = reduced_data['Cancelled']
X = []
X_train = []
X_test = []
y_train = []
y_test = []

fig, ax = plt.subplots()

## legend labels, one per variable set, in the same order as `variables`
mylabels = ['DolLoanAmountRequested','BoolPriorProsperLoanee, DolLoanAmountRequested',\
       'BoolPriorProsperLoanee, DolLoanAmountRequested,\nPctBankcardUtil',\
            'BoolPriorProsperLoanee, DolLoanAmountRequested,\nPctBankcardUtil, NumTotalInquiries']

for i,v in enumerate(variables):
    standard_scaler = preprocessing.StandardScaler()
    X.append(standard_scaler.fit_transform(reduced_data[v]))
    X_tr, X_te, y_tr, y_te = train_test_split(X[i], y, test_size=0.3, random_state=0)
    logm = linear_model.LogisticRegression()
    logm.fit(X_tr,y_tr)
    probs = logm.predict_proba(X_te)
    fpr, tpr, thresholds = metrics.roc_curve(y_te,probs[:,1])

    ## plot the ROC curve, labelling it here so the legend entry always matches the curve
    ax.plot(fpr,tpr, label = mylabels[i])
    print(logm.coef_)

plt.title('Receiver Operating Characteristic (ROC) Curve', fontsize = 16)
plt.xlabel('FPR (False Positive Rate)', fontsize=14)
plt.ylabel('TPR (True Positive Rate)', fontsize=14)

## the legend picks up the labels attached to each curve above
legend = ax.legend(loc = 2)

#show the "no-skill" line
ax.plot([0, 1], [0, 1], color='navy', linestyle='--')


[[ 0.16317715]]
[[-0.30253921  0.15834587]]
[[-0.30678644  0.16329316 -0.1515659 ]]
[[-0.30796493  0.16270451 -0.14653551  0.04711068]]





This series of ROC curves shows that starting with DolLoanAmountRequested, then adding BoolPriorProsperLoanee, then PctBankcardUtil, all make successive contributions to quality of fit. On the other hand, adding NumTotalInquiries makes a relatively minor impact on the resulting ROC curve, so we should consider leaving it out.
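
One detail of the cell above: the scaler is fit on all rows before the train/test split, so the test rows contribute to the mean and standard deviation used for scaling. With this many rows the effect is probably negligible, but the cleaner pattern is to learn the scaling from the training portion only. The sketch below shows that pattern in plain Python on hypothetical loan amounts; the helper names are my own.

```python
import statistics

def fit_scaler(train_values):
    """Learn the mean and population standard deviation from training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, std

def apply_scaler(values, mean, std):
    """Standardize values with previously learned statistics."""
    return [(v - mean) / std for v in values]

# hypothetical loan amounts, already split into train and test portions
train_amounts = [4000.0, 10000.0, 15000.0, 20000.0]
test_amounts = [15000.0, 35000.0]

mean, std = fit_scaler(train_amounts)
train_scaled = apply_scaler(train_amounts, mean, std)
test_scaled = apply_scaler(test_amounts, mean, std)  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

In scikit-learn the same effect comes from splitting first and fitting `make_pipeline(preprocessing.StandardScaler(), linear_model.LogisticRegression())` on the training split, since the pipeline fits the scaler only on the data passed to `fit`.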

### Employment Status

Next, let's figure out how to treat the categorical variable StrEmploymentStatus.

In [97]:
print(reduced_data['StrEmploymentStatus'].value_counts())
print(reduced_data.groupby('StrEmploymentStatus')['Cancelled'].mean())

Employed         216678
Other             19553
Self-employed     15624
Full-time           614
Name: StrEmploymentStatus, dtype: int64
StrEmploymentStatus
Employed         0.329807
Full-time        0.190554
Other            0.494707
Self-employed    0.202573
Name: Cancelled, dtype: float64


Clearly, self-reported employment status makes a huge difference to the probability of cancellation, especially if potential loanees list themselves as "Other." Let's fold the rare Part-time and Not employed categories into Other to reduce the number of categories, and convert StrEmploymentStatus to dummy variables.

In [101]:
## important step...let's test the treatment of the categorical variable.
reduced_data['StrEmploymentStatus'] = reduced_data['StrEmploymentStatus'].replace(['Part-time','Not employed'],'Other')
X_job = reduced_data['StrEmploymentStatus'].to_frame()
X_dummies = pd.get_dummies(X_job)

## these are the core variables we've already identified
variables = ['BoolPriorProsperLoanee','DolLoanAmountRequested','PctBankcardUtil']
X = reduced_data[variables]

#y = reduced_data['Cancelled']
X_without_jobs = reduced_data[variables]
X_with_jobs = pd.concat([X,X_dummies], axis=1)
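
For readers unfamiliar with dummy variables, here is what the two steps above amount to, written out in plain Python on a few example statuses. This is an illustrative sketch with helper names of my own, not a replacement for `pd.get_dummies`.

```python
def collapse_rare(values, rare, replacement='Other'):
    """Replace every value listed in `rare` with `replacement`."""
    rare = set(rare)
    return [replacement if v in rare else v for v in values]

def one_hot(values, prefix):
    """Return (column_names, rows) with one 0/1 indicator column per category."""
    categories = sorted(set(values))
    names = [f"{prefix}_{c}" for c in categories]
    rows = [[int(v == c) for c in categories] for v in values]
    return names, rows

# example statuses, one of each kind that appears in the data
statuses = ['Employed', 'Part-time', 'Self-employed',
            'Not employed', 'Other', 'Full-time']

collapsed = collapse_rare(statuses, ['Part-time', 'Not employed'])
names, rows = one_hot(collapsed, 'StrEmploymentStatus')

print(names)
for status, row in zip(statuses, rows):
    print(f"{status:>14} -> {row}")
```

Each row has exactly one 1, so the four indicator columns always sum to 1 and are collinear with the model's intercept. The default regularization in scikit-learn's LogisticRegression tolerates this; passing `drop_first=True` to `pd.get_dummies` is the usual way to avoid it.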

In [107]:
## let's plot 3 curves: job status only, core variables without job status, core variables with job status
fig, ax = plt.subplots()

standard_scaler1 = preprocessing.StandardScaler()
X_without_jobs_sc = standard_scaler1.fit_transform(X_without_jobs)

standard_scaler2 = preprocessing.StandardScaler()
X_with_jobs_sc = standard_scaler2.fit_transform(X_with_jobs)

mylabels = ['StrEmploymentStatus only','3 vars without job status','3 vars with job status']

for X, label in zip([X_dummies, X_without_jobs_sc, X_with_jobs_sc], mylabels):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    logm = linear_model.LogisticRegression()
    logm.fit(X_tr,y_tr)
    probs = logm.predict_proba(X_te)
    fpr, tpr, thresholds = metrics.roc_curve(y_te,probs[:,1])
    ## label each curve as it is drawn so that it appears in the legend
    ax.plot(fpr,tpr, label = label)
    print(logm.coef_)
    
plt.title('Receiver Operating Characteristic (ROC) Curve', fontsize = 16)
plt.xlabel('FPR (False Positive Rate)', fontsize=14)
plt.ylabel('TPR (True Positive Rate)', fontsize=14)

## the legend picks up the labels attached to each curve above
legend = ax.legend(loc = 2)

## draw the "no-skill" line
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')


[[ 0.00223078 -0.75684759  0.67767825 -0.63696438]]
[[-0.30678644  0.16329316 -0.1515659 ]]
[[-0.30072478  0.1711396  -0.15424153 -0.02124451 -0.02022657  0.17427729
  -0.15844917]]





Adding job type notably increases the area under the curve, especially at the fringes. Again, this reflects small sub-populations (like those who respond 'Other' as job status) for which we're able to make much stronger predictions. StrEmploymentStatus is added as the fourth core variable of the model.
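Since the comparison above relies on reading curves by eye, it can help to reduce each curve to a single area. The sketch below is not part of the original analysis: it computes AUC from scratch on made-up scores, using the fact that AUC equals the probability that a randomly chosen positive case is scored above a randomly chosen negative one. The function name `auc_from_scores` and the toy arrays are illustrative assumptions.

```python
import numpy as np

def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true]
    neg = scores[~y_true]
    # Compare every positive score against every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

# Toy example (made-up numbers, not from the loan data).
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = auc_from_scores(y_toy, scores_toy)
print(auc_toy)
```

The pairwise comparison needs memory proportional to positives times negatives, so for the full data set `sklearn.metrics.roc_auc_score(y_te, probs[:, 1])` would be the more practical call.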

### Monthly Debt/Monthly Income/Debt-To-Income Ratio

These three variables are clearly inter-related (although, as we'll see below, FracDebtToIncomeRatio is not exactly equal to DolMonthlyDebt/DolMonthlyIncome). Rather than plugging each one in individually, I tested the possible combinations of these variables to see which had the best predictivity for loan cancellation.
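As a small illustration of what "possible combinations" covers, the sketch below enumerates every non-empty subset of the three variables. The cells that follow test most of these by hand; the list names here are my own and are not used elsewhere in the notebook.

```python
from itertools import combinations

debt_income_vars = ['DolMonthlyDebt', 'DolMonthlyIncome', 'FracDebtToIncomeRatio']

# All non-empty subsets: 3 singles, 3 pairs, 1 triple.
candidate_sets = [
    list(combo)
    for size in range(1, len(debt_income_vars) + 1)
    for combo in combinations(debt_income_vars, size)
]

for cols in candidate_sets:
    print(cols)
```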

In [109]:
print('Does variable contain any NaNs?')
reduced_data.isnull().any()

Does variable contain any NaNs?


DolLoanAmountRequested              False
BorrowerRate                        False
NumMonthsTerm                       False
EnumListingCategory                 False
DolMonthlyIncome                    False
FracDebtToIncomeRatio                True
StrEmploymentStatus                 False
StrOccupation                        True
NumMonthsEmployed                    True
NumPriorProsperLoansLatePayments     True
NumPriorProsperLoans61dpd            True
BoolIsLender                        False
BoolInGroup                         False
EnumChannelCode                     False
NumTradesOpened6                    False
NumOpenTradesDelinqOrPastDue6       False
DolTotalBalanceOnPublicRecords      False
NumRealEstateTrades                 False
DolMonthlyDebt                      False
NumCurrentDelinquencies             False
NumPublicRecordsLast10Years         False
NumPublicRecords12                  False
DolAmountDelinquent                 False
PctBankcardUtil                   

One immediate problem - some values of FracDebtToIncomeRatio are null. Let's figure out why.

In [122]:
DI_null = reduced_data[reduced_data['FracDebtToIncomeRatio'].isnull()]
print(DI_null['StrEmploymentStatus'].value_counts())

Self-employed    15605
Employed             2
Name: StrEmploymentStatus, dtype: int64


15605 of 15624 loanees listed as Self-employed don't have a value for FracDebtToIncomeRatio. Let's fix that by creating a pseudo-value based on the observed relationships for other kinds of jobs.
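Before patching the gap, the same missingness pattern can be summarised as a share per category rather than a raw count. This is a minimal sketch on a made-up frame; `toy` and its values are assumptions standing in for `reduced_data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for reduced_data (values are made up).
toy = pd.DataFrame({
    'StrEmploymentStatus': ['Self-employed', 'Self-employed', 'Employed',
                            'Employed', 'Other'],
    'FracDebtToIncomeRatio': [np.nan, np.nan, 0.27, np.nan, 0.35],
})

# Share of missing ratios within each employment category.
missing_share = (
    toy['FracDebtToIncomeRatio']
    .isnull()
    .groupby(toy['StrEmploymentStatus'])
    .mean()
)
print(missing_share)
```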

In [129]:
## make a smaller data-frame just with DolMonthlyDebt, DolMonthlyIncome and FracDebtToIncomeRatio.
## .copy() makes test_df an independent frame rather than a slice of reduced_data
test_df = reduced_data[['DolMonthlyDebt','DolMonthlyIncome','FracDebtToIncomeRatio']].copy()
test_df['PreliminaryDTI'] = test_df['DolMonthlyDebt']/test_df['DolMonthlyIncome']

## first try - let's compare the variable FracDebtToIncomeRatio to what we get by dividing DolMonthlyDebt by DolMonthlyIncome.
test_df['prelim_vs_actual_frac'] = test_df['PreliminaryDTI']/test_df['FracDebtToIncomeRatio']

## key line below - a very small number of loanees have a reported DolMonthlyIncome of 0, in which case the fraction is infinite
test_df = test_df.replace([np.inf, -np.inf], np.nan)
print("Mean ratio between our estimated Debt-to-Income ratio and actual:", test_df['prelim_vs_actual_frac'].mean())
print("Standard deviation:", test_df['prelim_vs_actual_frac'].std())

fig,ax = plt.subplots()
test_df['prelim_vs_actual_frac'].hist(bins = 100)
plt.title('Ratio of estimated Debt-to-Income Ratio versus Actual', size = 14)



Mean ratio between our estimated Debt-to-Income ratio and actual: 0.6932558700618643
Standard deviation: 0.15203597840598992





Ad hoc formula for creating a pseudo-Debt-to-Income ratio: divide DolMonthlyDebt by DolMonthlyIncome, then scale by factor of 1/.6933 (about 1.442)

In [137]:
## let's make the pseudo-variable.
## Two steps: 1) Use FracDebtToIncomeRatio where available. 2) Otherwise, estimate it using method described just above.
test_df['EstimatedDebtToIncomeRatio'] = test_df['FracDebtToIncomeRatio']
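## .ix has been removed from pandas; .loc with a boolean mask does the same label-based assignment
test_df.loc[test_df['EstimatedDebtToIncomeRatio'].isnull(), 'EstimatedDebtToIncomeRatio'] = test_df['PreliminaryDTI']/.6933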
test_df[:20]



| ListingNumber | DolMonthlyDebt | DolMonthlyIncome | FracDebtToIncomeRatio | PreliminaryDTI | prelim_vs_actual_frac | EstimatedDebtToIncomeRatio |
|---|---|---|---|---|---|---|
| 973605 | 1242 | 6000.0 | 0.27 | 0.207 | 0.766667 | 0.27 |
| 981099 | 2289 | 7916.6667 | 0.35 | 0.289137 | 0.826105 | 0.35 |
| 1025766 | 911 | 2083.3333 | 0.53 | 0.43728 | 0.825057 | 0.53 |
| 1003835 | 223 | 3750.0 | 0.14 | 0.059467 | 0.424762 | 0.14 |
| 1011335 | 1264 | 9000.0 | 0.16 | 0.140444 | 0.877778 | 0.16 |
| 1010105 | 3455 | 8416.6667 | 0.45 | 0.410495 | 0.912211 | 0.45 |
| 1029573 | 1488 | 6250.0 | NaN | 0.23808 | NaN | 0.343401 |
| 1014296 | 308 | 3333.3333 | 0.16 | 0.0924 | 0.5775 | 0.16 |
| 1009580 | 846 | 6450.9167 | 0.15 | 0.131144 | 0.874294 | 0.15 |
| 743482 | 1559 | 12500.0 | 0.28 | 0.12472 | 0.445429 | 0.28 |
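The same fallback can also be written with `fillna`, which keeps the reported ratio wherever it exists and guards against zero income in one place. This is only a sketch: the helper name `estimate_dti` is hypothetical, the 0.6933 scale is the mean ratio reported above, and the third toy row is made up to show the zero-income case.

```python
import numpy as np
import pandas as pd

SCALE = 0.6933  # mean ratio reported above; treated here as a fixed constant

def estimate_dti(df, scale=SCALE):
    """Use the reported ratio where present, else the scaled debt/income quotient."""
    # Zero income would give an infinite quotient; mask it as missing instead.
    income = df['DolMonthlyIncome'].replace(0, np.nan)
    fallback = df['DolMonthlyDebt'] / income / scale
    return df['FracDebtToIncomeRatio'].fillna(fallback)

# Two rows copied from the table above, plus one made-up zero-income row.
toy = pd.DataFrame(
    {
        'DolMonthlyDebt': [1242, 1488, 500],
        'DolMonthlyIncome': [6000.0, 6250.0, 0.0],
        'FracDebtToIncomeRatio': [0.27, np.nan, np.nan],
    },
    index=pd.Index([973605, 1029573, -1], name='ListingNumber'),
)
toy['EstimatedDebtToIncomeRatio'] = estimate_dti(toy)
print(toy)
```

Rows with zero income stay missing under this approach, so they would still need to be dropped or imputed before fitting a model.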


Now, let's try out some models with different combinations of variables, starting with individual variables.

In [143]:
cols_debt_income = [['DolMonthlyDebt'],['DolMonthlyIncome'],['EstimatedDebtToIncomeRatio']]

fig, ax = plt.subplots()

for cols in cols_debt_income:
    X_debt = test_df[cols]
    standard_scaler2 = preprocessing.StandardScaler()
    X_debt_sc = standard_scaler2.fit_transform(X_debt)
    X_tr, X_te, y_tr, y_te = train_test_split(X_debt_sc, y, test_size=0.3, random_state=0)
    logm = linear_model.LogisticRegression()
    logm.fit(X_tr,y_tr)
    probs = logm.predict_proba(X_te)
    fpr, tpr, thresholds = metrics.roc_curve(y_te,probs[:,1])
    plt.plot(fpr,tpr)
    print(logm.coef_)
    
plt.title('ROC curves for single variables', fontsize = 16)
plt.xlabel('FPR (False Positive Rate)', fontsize=14)
plt.ylabel('TPR (True Positive Rate)', fontsize=14)

handles, labels = ax.get_legend_handles_labels()
mylabels = ['DolMonthlyDebt','DolMonthlyIncome','EstimatedDebtToIncomeRatio']
legend = ax.legend(handles, labels = mylabels, loc = 2)

## draw the "no-skill" line
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')


[[-0.1089228]]
[[ 0.23961597]]
[[-0.33228433]]





If we were going to pick a single variable, DolMonthlyDebt would perform best and EstimatedDebtToIncomeRatio not too much worse, while DolMonthlyIncome would have very little predictivity.

In [146]:
cols_debt_income = [['DolMonthlyDebt','DolMonthlyIncome'],\
      ['DolMonthlyDebt','EstimatedDebtToIncomeRatio'],\
      ['DolMonthlyIncome','EstimatedDebtToIncomeRatio']]

fig, ax = plt.subplots()

for cols in cols_debt_income:
    X = test_df[cols]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    logm = linear_model.LogisticRegression()
    logm.fit(X_tr,y_tr)
    probs = logm.predict_proba(X_te)
    fpr, tpr, thresholds = metrics.roc_curve(y_te,probs[:,1])
    plt.plot(fpr,tpr)
    print(logm.coef_)
    
plt.title('ROC curves for pairs of variables', fontsize = 16)
plt.xlabel('FPR (False Positive Rate)', fontsize=14)
plt.ylabel('TPR (True Positive Rate)', fontsize=14)

handles, labels = ax.get_legend_handles_labels()
mylabels = ['DolMonthlyDebt + DolMonthlyIncome','DolMonthlyDebt + EstimatedDebtToIncomeRatio',\
           'DolMonthlyIncome + EstimatedDebtToIncomeRatio']
legend = ax.legend(handles, labels = mylabels, loc = 2)

## draw the "no-skill" line
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')


[[ -2.03352998e-04   1.80366695e-05]]
[[-0.0001316  -0.09274625]]
[[  3.28953809e-06  -1.73600887e-01]]





Perhaps surprisingly, even though DolMonthlyIncome wasn't useful individually, it works better in tandem with DolMonthlyDebt than EstimatedDebtToIncomeRatio does.

In [147]:
cols_debt_income = [['DolMonthlyDebt'],['DolMonthlyDebt','DolMonthlyIncome'],\
                   ['DolMonthlyDebt','DolMonthlyIncome','EstimatedDebtToIncomeRatio']]

fig, ax = plt.subplots()
## the "no-skill" line is drawn after the legend (below), so the legend labels line up with the three model curves

for cols in cols_debt_income:
    X = test_df[cols]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    logm = linear_model.LogisticRegression()
    logm.fit(X_tr,y_tr)
    probs = logm.predict_proba(X_te)
    fpr, tpr, thresholds = metrics.roc_curve(y_te,probs[:,1])
    plt.plot(fpr,tpr)
    print(logm.coef_)
    
handles, labels = ax.get_legend_handles_labels()
mylabels = ['DolMonthlyDebt','DolMonthlyDebt + DolMonthlyIncome',\
           'DolMonthlyDebt + DolMonthlyIncome + EstimatedDebtToIncomeRatio']
legend = ax.legend(handles, labels = mylabels, loc = 2)

## draw the "no-skill" line
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')


[[-0.00013843]]
[[ -2.03352998e-04   1.80366695e-05]]
[[ -1.93526489e-04   1.68784012e-05  -9.05197942e-02]]





The last plot above shows that adding EstimatedDebtToIncomeRatio doesn't add skill to the combination of DolMonthlyDebt and DolMonthlyIncome, so we can leave it out entirely. It makes sense that one of the three variables can be omitted, since they are clearly not independent of one another. Let's take a look at the improvement in skill from including these two variables:

In [156]:
X_nodebt = X_with_jobs 
X_withdebt = pd.concat([X_nodebt,reduced_data[['DolMonthlyDebt','DolMonthlyIncome']]], axis=1)
X_withdebt.head()

| ListingNumber | BoolPriorProsperLoanee | DolLoanAmountRequested | PctBankcardUtil | StrEmploymentStatus_Employed | StrEmploymentStatus_Full-time | StrEmploymentStatus_Other | StrEmploymentStatus_Self-employed | DolMonthlyDebt | DolMonthlyIncome |
|---|---|---|---|---|---|---|---|---|---|
| 973605 | False | 15000.0 | 0.97 | 1 | 0 | 0 | 0 | 1242 | 6000.0 |
| 981099 | False | 15000.0 | 0.48 | 0 | 0 | 1 | 0 | 2289 | 7916.6667 |
| 1025766 | True | 4000.0 | 0.93 | 1 | 0 | 0 | 0 | 911 | 2083.3333 |
| 1003835 | False | 10000.0 | 0.26 | 1 | 0 | 0 | 0 | 223 | 3750.0 |
| 1011335 | True | 20000.0 | 0.81 | 1 | 0 | 0 | 0 | 1264 | 9000.0 |


In [159]:
fig, ax = plt.subplots()

## scale the variables first
X_nodebt_sc = X_with_jobs_sc
standard_scaler = preprocessing.StandardScaler()
X_withdebt_sc = standard_scaler.fit_transform(X_withdebt)

for X in [X_nodebt_sc, X_withdebt_sc]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    logm = linear_model.LogisticRegression()
    logm.fit(X_tr,y_tr)
    probs = logm.predict_proba(X_te)
    fpr, tpr, thresholds = metrics.roc_curve(y_te,probs[:,1])
    plt.plot(fpr,tpr)
    print(logm.coef_)
    
plt.title('Receiver Operating Characteristic (ROC) Curve', fontsize = 16)
plt.xlabel('FPR (False Positive Rate)', fontsize=14)
plt.ylabel('TPR (True Positive Rate)', fontsize=14)

handles, labels = ax.get_legend_handles_labels()
mylabels = ['Without Monthly Debt/Income','With Monthly Debt/Income']
legend = ax.legend(handles, labels = mylabels, loc = 2)

## draw the "no-skill" line
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')


[[-0.30072478  0.1711396  -0.15424153 -0.02124451 -0.02022657  0.17427729
  -0.15844917]]
[[-0.28830317  0.1968249  -0.13185078 -0.01997185 -0.01937201  0.17228927
  -0.15826079 -0.14862107  0.34991002]]





### Borrower Rate and Real Estate Trades
From our variable-by-variable logistic regressions above, it seemed that BorrowerRate and NumRealEstateTrades both had a weak influence on cancellation probability. Let's test it out here.

In [171]:
X_borrower_rate = pd.concat([X_withdebt,reduced_data[['BorrowerRate']]], axis=1)
X_rate_plus_realestate = pd.concat([X_withdebt,reduced_data[['BorrowerRate',\
                                                                'NumRealEstateTrades']]], axis=1)

fig, ax = plt.subplots()

## scale the variables first
standard_scaler = preprocessing.StandardScaler()
X_borrower_rate_sc = standard_scaler.fit_transform(X_borrower_rate)

standard_scaler_2 = preprocessing.StandardScaler()
X_rate_plus_realestate_sc = standard_scaler_2.fit_transform(X_rate_plus_realestate)

for X in [X_withdebt_sc, X_borrower_rate_sc, X_rate_plus_realestate_sc]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    logm = linear_model.LogisticRegression()
    logm.fit(X_tr,y_tr)
    probs = logm.predict_proba(X_te)
    fpr, tpr, thresholds = metrics.roc_curve(y_te,probs[:,1])
    plt.plot(fpr,tpr)
    print(logm.coef_)
    
plt.title('Receiver Operating Characteristic (ROC) Curve', fontsize = 16)
plt.xlabel('FPR (False Positive Rate)', fontsize=14)
plt.ylabel('TPR (True Positive Rate)', fontsize=14)

handles, labels = ax.get_legend_handles_labels()
mylabels = ['BoolPriorProsperLoanee, DolLoanAmountRequested, PctBankcardUtil,\n\
StrEmploymentStatus, DolMonthlyDebt and DolMonthlyIncome', '+ BorrowerRate','+ BorrowerRate + NumRealEstateTrades']
legend = ax.legend(handles, labels = mylabels, loc = 2)

## draw the "no-skill" line
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')


[[-0.28830317  0.1968249  -0.13185078 -0.01997185 -0.01937201  0.17228927
  -0.15826079 -0.14862107  0.34991002]]
[[-0.28481846  0.2140983  -0.15345215 -0.01874342 -0.01981258  0.17388755
  -0.16172216 -0.15796579  0.37362016  0.08109587]]
[[-0.28483716  0.2253451  -0.15113019 -0.02038505 -0.01987256  0.17816365
  -0.16407718 -0.14316928  0.41307171  0.07629983 -0.09721096]]





The coefficient for BorrowerRate is smaller in magnitude than for other variables, but it does make a noticeable difference in the center of our ROC curve, and so I include it in the final model. For other variables such as BoolPriorProsperLoanee which only affect a small percentage of loanees (less than 10%), the effect is primarily on the fringes of the ROC curve. Even though the BorrowerRate effect is small (higher borrower rate leads to slightly higher probability of cancellation), it still provides added information for every loan.

## Are we missing any significant variables?
As it stands, the model includes BoolPriorProsperLoanee (derived from NumPriorProsperLoans), DolLoanAmountRequested, PctBankcardUtil, StrEmploymentStatus, DolMonthlyDebt, DolMonthlyIncome, BorrowerRate and NumRealEstateTrades, for 8 variables in total (11 if we account for treating StrEmploymentStatus with dummy variables). Let's test out a few other possibly important variables to see if we're missing anything.

### Categorical variables
We still have EnumChannelCode, EnumListingCategory and StrOccupation to evaluate, since they require prior treatment before their influence can be determined.

In [178]:
channel_codes = reduced_data.groupby('EnumChannelCode')
print(channel_codes['Cancelled'].mean())


EnumChannelCode
40000    0.347317
50000    0.350827
70000    0.369273
80000    0.133648
90000    0.341167
Name: Cancelled, dtype: float64


EnumChannelCode looks promising, but there's a catch:

In [184]:
channel_codes.get_group('80000')['BoolPriorProsperLoanee'].mean()

1.0

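Every listing with an EnumChannelCode of 80000 comes from a prior Prosper loanee (the mean of BoolPriorProsperLoanee in that group is 1.0), so the channel code adds no new information beyond what BoolPriorProsperLoanee already provides - we can go ahead and discard it.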
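A cross-tabulation makes this kind of redundancy check visible for every channel code at once, rather than one group at a time. The sketch below uses a made-up frame (`toy` and its rows are assumptions); on the real data the same call would take the two columns of `reduced_data`.

```python
import pandas as pd

# Toy stand-in for reduced_data (made-up rows).
toy = pd.DataFrame({
    'EnumChannelCode': ['40000', '40000', '80000', '80000', '90000', '90000'],
    'BoolPriorProsperLoanee': [False, True, True, True, False, False],
})

# Readable column labels instead of raw booleans.
loanee_type = toy['BoolPriorProsperLoanee'].map({True: 'prior', False: 'new'})

# Row-normalised counts: each row sums to 1.
overlap = pd.crosstab(toy['EnumChannelCode'], loanee_type, normalize='index')
print(overlap)
```

A row whose `prior` share is exactly 1 (or 0) signals a channel code that is fully determined by the prior-loanee flag.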

In [187]:
listing_category = reduced_data.groupby('EnumListingCategory')
print(listing_category['Cancelled'].sum())
print(listing_category['Cancelled'].mean())

EnumListingCategory
0        81
1     62989
2      4453
3      2659
6       846
7      6687
8       134
9        55
11      112
12       57
13      933
14     1627
15     2111
16      104
17       60
18      400
19      581
20      497
21       31
Name: Cancelled, dtype: int64
EnumListingCategory
0     0.355263
1     0.322420
2     0.352350
3     0.421997
6     0.374668
7     0.376605
8     0.330864
9     0.343750
11    0.336336
12    0.518182
13    0.343267
14    0.402126
15    0.394137
16    0.371429
17    0.410959
18    0.269542
19    0.431329
20    0.351734
21    0.244094
Name: Cancelled, dtype: float64


### Remaining boolean variables

We have two remaining variables that seem to show strong predictivity of loan cancellation: BoolIsLender and BoolInGroup. The problem, as shown below, is that there are very few loanees for whom either variable is true, so including them will only add predictive skill for about 1% of loans.

In [196]:
print(reduced_data['BoolIsLender'].value_counts())
print(reduced_data['BoolInGroup'].value_counts())

0    249330
1      3139
Name: BoolIsLender, dtype: int64
False    251804
True        665
Name: BoolInGroup, dtype: int64


In [202]:
X_bool_test= pd.concat([X_rate_plus_realestate,reduced_data[['BoolIsLender','BoolInGroup']]], axis=1)

fig, ax = plt.subplots()

## scale the variables first
standard_scaler = preprocessing.StandardScaler()
X_bool_test_sc = standard_scaler.fit_transform(X_bool_test)

for X in [X_rate_plus_realestate_sc, X_bool_test_sc]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    logm = linear_model.LogisticRegression()
    logm.fit(X_tr,y_tr)
    probs = logm.predict_proba(X_te)
    fpr, tpr, thresholds = metrics.roc_curve(y_te,probs[:,1])
    plt.plot(fpr,tpr)
    print(logm.coef_)
    
plt.title('Receiver Operating Characteristic (ROC) Curve', fontsize = 16)
plt.xlabel('FPR (False Positive Rate)', fontsize=14)
plt.ylabel('TPR (True Positive Rate)', fontsize=14)

handles, labels = ax.get_legend_handles_labels()
mylabels = ['Core model without booleans','Core model with booleans']
legend = ax.legend(handles, labels = mylabels, loc = 2)

## draw the "no-skill" line
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')


[[-0.28483716  0.2253451  -0.15113019 -0.02038505 -0.01987256  0.17816365
  -0.16407718 -0.14316928  0.41307171  0.07629983 -0.09721096]]
[[-0.28782379  0.22521576 -0.1510228  -0.02029528 -0.02108214  0.17832402
  -0.16413778 -0.14298997  0.4124762   0.07623037 -0.09719635  0.00577262
   0.01147506]]





No perceptible difference exists between including or excluding BoolIsLender and BoolInGroup, so we leave them out of the model.
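A visual judgement like this can be backed up with a number by comparing held-out AUC with and without the candidate variables. The sketch below is an illustration on synthetic data, not a rerun of the loan model: `holdout_auc`, `core` and `rare_flag` are hypothetical names, and the rare flag is deliberately uninformative. It also wraps the scaler and the regression in one pipeline so the scaler is fitted on the training rows only, which differs slightly from the cells above where scaling happens before the split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def holdout_auc(X, y, seed=0):
    """Hold out 30% of rows, fit scaler + logistic regression, return test AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    # The pipeline fits the scaler on the training rows only.
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

# Synthetic data (not the loan data): the outcome depends on `core` only.
rng = np.random.default_rng(0)
n = 5000
core = rng.normal(size=n)
rare_flag = (rng.random(n) < 0.01).astype(float)  # true for about 1% of rows
y_syn = (rng.random(n) < 1 / (1 + np.exp(-1.5 * core))).astype(int)

auc_core = holdout_auc(core.reshape(-1, 1), y_syn)
auc_with_flag = holdout_auc(np.column_stack([core, rare_flag]), y_syn)
print(f"core only: {auc_core:.3f}  core + rare flag: {auc_with_flag:.3f}")
```

If the two AUC values differ by less than the run-to-run variation across a few seeds, leaving the extra variables out is the simpler choice.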

### Remaining numerical variables

A list of numerical variables that we haven't put into the model yet:
DolTotalBalanceOnPublicRecords, NumMonthsTerm, NumMonthsEmployed, NumPriorProsperLoansLatePayments, NumPriorProsperLoans61dpd, NumTradesOpened6, NumOpenTradesDelinqOrPastDue6, NumCurrentDelinquencies, NumPublicRecordsLast10Years, NumPublicRecords12, DolAmountDelinquent, DaysSinceFirstCredit

In [190]:
print(reduced_data[reduced_data['DolTotalBalanceOnPublicRecords']!=0]['Cancelled'].mean())
print(reduced_data[reduced_data['DolTotalBalanceOnPublicRecords']==0]['Cancelled'].mean())

0.34695719443
0.333966728372


In [192]:
print(reduced_data[reduced_data['DolAmountDelinquent']!=0]['Cancelled'].mean())
print(reduced_data[reduced_data['DolAmountDelinquent']==0]['Cancelled'].mean())

0.342281440841
0.333273232123


As also indicated by the regression, DolTotalBalanceOnPublicRecords and DolAmountDelinquent show little difference in cancellation rate between loanees with zero and non-zero values (roughly one percentage point in each case), so both are omitted from the model.
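The zero versus non-zero comparison can be wrapped in a small helper so it is easy to repeat for the other numerical variables listed above. This is a sketch on a made-up frame; `zero_vs_nonzero_rate` and `toy` are hypothetical names, and the values are not from the loan data.

```python
import pandas as pd

# Toy stand-in for reduced_data (made-up rows).
toy = pd.DataFrame({
    'DolAmountDelinquent': [0, 0, 0, 120, 3400, 0],
    'DolTotalBalanceOnPublicRecords': [0, 500, 0, 0, 0, 0],
    'Cancelled': [0, 1, 0, 1, 0, 1],
})

def zero_vs_nonzero_rate(df, col, target='Cancelled'):
    """Mean of `target` for rows where `col` is zero versus non-zero."""
    group = (df[col] != 0).map({True: 'non-zero', False: 'zero'})
    return df.groupby(group)[target].mean()

for col in ['DolAmountDelinquent', 'DolTotalBalanceOnPublicRecords']:
    rates = zero_vs_nonzero_rate(toy, col)
    print(col, rates.to_dict())
```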

# Final thoughts and caveats

Some of the variables in this study are treated as linear in this model, but clearly merit further investigation. For instance, the distribution of cancellation rates as a function of BorrowerRate looks parabolic, with lower cancellation probability at both low and high rates. Similarly, younger and older borrowers appear much more likely to cancel than middle-aged borrowers. However, for a first pass at a model I have decided not to include non-linear terms, to avoid overcomplicating it.
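For reference, a parabolic dependence of this kind could be given to a logistic regression by adding a squared column next to the original one. The sketch below only shows the feature construction on made-up rates; `BorrowerRateSq` is a hypothetical column name and nothing like it is in the model described here.

```python
import pandas as pd

# Made-up interest rates; the real column is reduced_data['BorrowerRate'].
toy = pd.DataFrame({'BorrowerRate': [0.06, 0.12, 0.18, 0.24, 0.30]})

# Centre before squaring so the linear and squared terms are less correlated.
centred = toy['BorrowerRate'] - toy['BorrowerRate'].mean()
toy['BorrowerRateSq'] = centred ** 2

print(toy)
```

A model that is linear in both columns can then represent a cancellation rate that peaks (or dips) at intermediate interest rates.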