## Predicting Probability of Loan Delinquency

#### In this assignment, we will explore the historical loan data from [lending club](https://www.kaggle.com/wendykan/lending-club-loan-data/data), and build a very short machine learning pipeline to predict on likelihood of loan delinquency. 

#### We will start out with inspecting and describing the data, then prepare the dataset for a K-Nearest Neighbor algorithm, and evaluate the performance of the algorithm. This exercise walks you through _most_ steps of a machine learning project with much simplicity and focus on data manipulation rather than algorithm training. You are encouraged to check out the [sklearn library](http://scikit-learn.org/stable/index.html) and the recommended reading materials this week to learn more about ML!

#### There are 17 questions in total. Each question worths 5 points. Proper documentation and readability of your code will be 15 points.

In [108]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 999
pd.set_option('display.float_format', lambda x: '%.3f' % x)

### Inspect the data 

In [109]:
loan = pd.read_csv('loan.csv')

  interactivity=interactivity, compiler=compiler, result=result)


In [110]:
print(loan.shape)

(887379, 74)


In [111]:
print(list(loan))

['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'policy_code', 'application_type', 'annual_inc_joint', 'dti_joint', 'verification_status_joint', 'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m', 'open_il_6m', 'open_il_12m', 'open_il

In [112]:
print(loan['member_id'][0:5])
print(loan['id'][0:5])

s = set()
for i in loan['member_id']:
    s.add(i)

print(len(s))

0    1296599
1    1314167
2    1313524
3    1277178
4    1311748
Name: member_id, dtype: int64
0    1077501
1    1077430
2    1077175
3    1076863
4    1075358
Name: id, dtype: int64
887379


In [113]:
u_mid = loan.member_id.unique()
print(len(u_mid))
print(max(u_mid), min(u_mid))

887379
73544841 70473


In [114]:
uid = loan.id.unique()
print(len(uid))
print(max(uid), min(uid))

887379
68617057 54734


In [115]:
issue_dates = loan['issue_d']
print(max(issue_dates), min(issue_dates))

Sep-2015 Apr-2008


#### Question 1: Browse the data dictionary and use pandas functions to browse the data and answer the following questions with a paragraph. You don't need to show your work :
    - How many columns are there? What general categories of information do they provide? 
    - What’s the time frame of the dataset based on the issue date of the loans? (You will have to parse the date) 
    - What’s the difference between “Id” and “Member Id”? 
    - How many unique borrowers and loans are in the dataset?

#### write a summary paragraph here: 

The "Loan.csv" file has 74 columns. The general categories are loan amount, location of person, past loan performance, where they got the loan at.  
The timeframe of the data set is between April 2008 to September 2015.
The number of unique Id and Member Id are the same. The maximum integer value of member id is higher than id and the minimun value is lower than id. 
There are 887379 borrowers and 887379 loans. 

### Descriptive statistics - About the clients

In [116]:
print(loan['annual_inc'].mean())

75027.5877608


####  What's the average annual individual income of the borrowers?

In [117]:
75027.5877608

75027.5877608

#### What are the top ten purposes people borrow money for? 

In [118]:
mapping = {}
for purpose in loan['purpose']:
    if purpose in mapping:
        mapping[purpose] += 1
    else:
        mapping[purpose] = 1

import operator
sorted_x = sorted(mapping.items(), key=operator.itemgetter(1))
for idx, e in enumerate(sorted_x):
    if idx == 10:
        break
    print(idx+1, e)

1 ('educational', 423)
2 ('renewable_energy', 575)
3 ('wedding', 2347)
4 ('house', 3707)
5 ('vacation', 4736)
6 ('moving', 5414)
7 ('medical', 8540)
8 ('car', 8863)
9 ('small_business', 10377)
10 ('major_purchase', 17277)


#### Which 10 states has the most number of loans? Sort by descending order

In [119]:
mapping = {}
for s in loan['addr_state']:
    if s in mapping:
        mapping[s] += 1
    else:
        mapping[s] = 1

sorted_x = sorted(mapping.items(), key=operator.itemgetter(1), reverse=True)
for idx, e in enumerate(sorted_x):
    if idx == 10:
        break
    print(idx+1, e) 

1 ('CA', 129517)
2 ('NY', 74086)
3 ('TX', 71138)
4 ('FL', 60935)
5 ('IL', 35476)
6 ('NJ', 33256)
7 ('PA', 31393)
8 ('OH', 29631)
9 ('GA', 29085)
10 ('VA', 26255)


### Descriptive statistics - About the loans

#### How many loans were issued per year?

In [120]:
mapping = {}

issue_d = loan['issue_d']
for d in issue_d:
    year = d.split('-')[1]
    if year in mapping:
        mapping[year] += 1
    else:
        mapping[year] = 1

for k in mapping:
    print(k, mapping[k])

2011 21721
2010 12537
2009 5281
2008 2393
2007 603
2013 134755
2012 53367
2014 235628
2015 421094


#### What’s the average interest rate for loans in each grade and within that for each sub_grade? 

In [121]:
grades = loan.grade.unique()
print(grades)
subgrades = loan.sub_grade.unique()
print(subgrades)

mapping = {}
for grade in grades:
    mapping[grade] = {}
    
for subgrade in subgrades:
    grade = subgrade[0]
    mapping[grade][subgrade] = []

print(mapping)

['B' 'C' 'A' 'E' 'F' 'D' 'G']
['B2' 'C4' 'C5' 'C1' 'B5' 'A4' 'E1' 'F2' 'C3' 'B1' 'D1' 'A1' 'B3' 'B4' 'C2'
 'D2' 'A3' 'A5' 'D5' 'A2' 'E4' 'D3' 'D4' 'F3' 'E3' 'F4' 'F1' 'E5' 'G4' 'E2'
 'G3' 'G2' 'G1' 'F5' 'G5']
{'B': {'B2': [], 'B5': [], 'B1': [], 'B3': [], 'B4': []}, 'C': {'C4': [], 'C5': [], 'C1': [], 'C3': [], 'C2': []}, 'A': {'A4': [], 'A1': [], 'A3': [], 'A5': [], 'A2': []}, 'E': {'E1': [], 'E4': [], 'E3': [], 'E5': [], 'E2': []}, 'F': {'F2': [], 'F3': [], 'F4': [], 'F1': [], 'F5': []}, 'D': {'D1': [], 'D2': [], 'D5': [], 'D3': [], 'D4': []}, 'G': {'G4': [], 'G3': [], 'G2': [], 'G1': [], 'G5': []}}


In [122]:
int_rates = loan['int_rate']

for idx, subgrade in enumerate(loan['sub_grade']):
    int_rate = loan['int_rate'][idx]
    grade = subgrade[0]
    mapping[grade][subgrade].append(int_rate)

for grade in mapping:
    grade_avg_int_list = []
    for sg in mapping[grade]:
        grade_avg_int_list += mapping[grade][sg]
        subgrade_avg_int = sum(mapping[grade][sg]) / float(len(mapping[grade][sg]))
        print(grade, sg, subgrade_avg_int)
        
    grade_avg_int = sum(grade_avg_int_list) / float(len(grade_avg_int_list))
    print(grade, grade_avg_int)
    print("\n")
    print("\n")

B B2 9.99664336524
B B5 12.2691690046
B B1 8.94879124789
B B3 10.9007918612
B B4 11.7448662496
B 10.8296181665




C C4 14.598234644
C C5 15.3181527936
C C1 12.8921316051
C C3 14.0159937003
C C2 13.4235697603
C 13.980098308




A A4 7.52918131534
A A1 5.71201850478
A A3 7.14102016456
A A5 8.26961955552
A A2 6.42585901712
A 7.24331156125




E E1 18.8276696956
E E4 20.780979188
E E3 19.9682418282
E E5 21.6807968668
E E2 19.3739884733
E 19.8973216887




F F2 23.3167433234
F F3 23.9705594406
F F4 24.4894925198
F F1 22.6062177889
F F5 24.9989090208
F 23.5827866007




D D1 16.079182626
D D2 16.8473925444
D D5 18.4172051054
D D3 17.3409072079
D D4 17.9032471242
D 17.1758144501




G G4 25.622428356
G G3 25.8429765545
G G2 25.6756723891
G G1 25.4576910743
G G5 25.6934548611
G 25.6267061396






#### What's the percentage breakdown of loan status? In particular we want to see the percentage of loans that were defaulted and fully paid.

In [123]:
mapping = {}
status = loan['loan_status']
total = float(len(status))
for s in status:
    if s in mapping:
        mapping[s] += 1
    else:
        mapping[s] = 1

for s in mapping:
    print(s, str(round((mapping[s]/total * 100),2)) + "%")

Fully Paid 23.41%
Charged Off 5.1%
Current 67.82%
Default 0.14%
Late (31-120 days) 1.31%
In Grace Period 0.7%
Late (16-30 days) 0.27%
Does not meet the credit policy. Status:Fully Paid 0.22%
Does not meet the credit policy. Status:Charged Off 0.09%
Issued 0.95%


Fully Paid: 23.41%
Defaulted: 0.14%

#### How does default rate varies from year to year? Create one dataframe showing year vs default rate

In [124]:
mapping = {}
# status is array of status
for idx, date in enumerate(loan['issue_d']):
    year = date.split("-")[0]
    is_default = 0
    if year in mapping:
        if status[idx] == 'Default':
            is_default = 1
        mapping[year].append(is_default)
    else:
        mapping[year] = [is_default]

In [125]:
tmp = []
for year in mapping:
    default_rate = sum(mapping[year]) / len(mapping[year])
    tmp.append([year, default_rate])
    print(year, str(round(default_rate * 100.0, 2)) + "%")

year_default=pd.DataFrame(tmp, columns = ["Year", "Default_Rate"])

Dec 0.06%
Nov 0.11%
Oct 0.09%
Sep 0.08%
Aug 0.12%
Jul 0.14%
Jun 0.13%
May 0.2%
Apr 0.17%
Mar 0.21%
Feb 0.2%
Jan 0.21%


In [126]:
print(year_default)

   Year  Default_Rate
0   Dec         0.001
1   Nov         0.001
2   Oct         0.001
3   Sep         0.001
4   Aug         0.001
5   Jul         0.001
6   Jun         0.001
7   May         0.002
8   Apr         0.002
9   Mar         0.002
10  Feb         0.002
11  Jan         0.002


### Case Study

#### You are considering approval of a loan of 100,000 dollars to a couple whose joint income is 200,000 dollars. You've verfied their information and now you are considering whether you want to approve this loan offer and what interest/term you would offer based on historical data. Write a short paragraph about how you would decide if you want to approve the loan or not based on available data on borrowers such as income, employment histories, and so on. You are encouraged to but not required to include numbers you obtained from analyzing historical data. You may also discuss other uncertainties you would like to consider to justify your decision. 

I want to see, given their income, age, and employment bracket, what the average default rate is and corresponding interest rate payment is. In addition, I want to look at how different kinds of loan offered to that couple performed and perhaps consider offer them the terms with the most probability of successful repayments. 

### Machine Learning

#### Machine learning algorithms allow us to take in large quantity of information, identify patterns, and make data-driven decisions. Now we want to use it to aid our decision-making by predicting a client's likelihood of deliquency. 

#### Some lingos before we start:
__features__: also your regressor, left-hand variables, variables that you use to predict your outcome 

__target variable__: also outcome variable, variable whose value you are trying to predict. 



### A machine learning pipeline generally consists of :

__Data preprocessing__: clean the data by getting rid of outliers and idiosyncracies, standarizeding the data on the same scale, and creating target variable if it's not clearly defined by the problem.
    

#### Then an re-iterative process of:

- __Feature generation and selection__: create predictors, ie features, to provide "hints" to the model and select the most informative features.


- __Cross-validation__: divide the dataset into training and testing to obtain various performance metrics for the algorithms. Common partition scheme for this are 10-/5-fold cross validation. Read more [here](https://genome.tugraz.at/proclassify/help/pages/XV.html)


- __Algorithm selection/Grid search__: find the best performing algorithm and the best hyper-parameters for the algorithm. this process can be time consuming, which is the so-called "no free lunch thereom", but generally the process is quite automated.



- __Evaluation__: choose your loss function and metrics you want to maximize based on your real concerns and tradeoffs between false positive/false negative rate, training time/latency, and so on.


- __Implementation__: In a real project, what algorithm you choose is also informed by accessiblity to the data in real time, the required data literacy from the users/complexity of your algorithm, transparency/ethics, costs of maintaince, and so on. You will also build in check-ins for abnormal prediction results.

#### In the following section you will get the data ready to feed it through a machine learning pipeline.

In [299]:
# read in your data frame here
# warning: this will update the df global variable 
# and erase the columns you created previously for descriptive analysis! we want to start fresh here.

df = pd.read_csv('loan.csv')
loan = df

  interactivity=interactivity, compiler=compiler, result=result)


#### First of all, we will sub-set our dataframe. To keep the training time short we will only use data from 2015. Then let's divide it into two parts: historical records with known payment outcomes vs loans that are currently active or just issued. We will use the historical records (with known outcomes) to train, test, and evaluate our model. In reality, we will apply the trained model to predict on loans that are still active (outcomes unkown) . 

__Task__: Create one df that contains all loans in 2015 then split it into two dataframes: one with status that are either 'Current' or 'Issued' - call it _application_, and one that contains the rest -- call it _development_

In [300]:
issue_d = df['issue_d']
tmp = []
for d in issue_d:
    tmp.append(int(d.split("-")[1]))

print(loan.shape)
print(df.shape)
df['issue_y'] = tmp

# # Create variable with TRUE if issue_d, year is > 2015
recent = df['issue_y'] >= 2015

# # Select all casess where issue_d's year is > 2015
df = df[recent]
print(df.shape)

(887379, 74)
(887379, 74)
(421094, 75)


In [301]:
issue_d = df['loan_status']
tmp = []
for d in issue_d:
    if d == 'Current' or d == 'Issued':
        tmp.append(1)
    else:
        tmp.append(0)  

df['is_historical'] = tmp

#  historical records with known payment outcomes vs loans that are currently active or just issued
historical = df['is_historical'] == 1
not_historical = df['is_historical'] == 0

application = df[historical]
development = df[not_historical]

# df = df.drop('is_historical', 1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [302]:
print(application.shape)
print(development.shape)

(386013, 76)
(35081, 76)


#### Now we'll work with our _development set_. 

#### We will adjust our outcome variable slightly -- instead of predicting on default rate which is quite low and can make accurate prediction more challenging, we will predict on "delinquency" which include loans that are charged off, late, and defaulted. In this way, we increase the number of outcomes and also simplifies the problem into a binary classification -- meaning we have two outcomes essentially: delinquent (1) / not delinquent (0)

__Task:__ Create a new column called "delinquency" that consolidate the values from "loan_status" and is a binary variable that is 1 for delinquent loan and 0 otherwise. A delinquent loan is one whose status is any of the following:  charged off, late, in grace period, does not meet the credit policy - charged off, and default. Drop "loan_status" variable once you are done creating "delinquency" since it's no longer needed.

In [303]:
# delinquency
# Charged Off 5.1%
# Late (31-120 days) 1.31%
# In Grace Period 0.7%
# Late (16-30 days) 0.27%
# Does not meet the credit policy. Status:Charged Off 0.09%
# Default
print(loan.shape)
loan = development
df = development

issue_d = loan['loan_status']
tmp = []
for d in issue_d:
    if d in ["Charged Off", "Late (31-120 days)", "In Grace Period", "Late (16-30 days)", "Does not meet the credit policy. Status:Charged Off", "Default" ]:
        tmp.append(1)
    else:
        tmp.append(0)  

df['delinquency'] = tmp

print(df.shape)
# drop the column
# df = df.drop('loan_status', 1)

(887379, 75)
(35081, 77)


#### Feature generation: Once we have our target variable ready, we can move on to generate some features. The model does not learn if you do not provide it with informative hints! 

#### First, we will drop the id columns since they have no inherent value but for book keeping in this case. To simplify the task, we will also drop 'url', 'desc', 'title', 'zip_code', 'emp_title' columns because they contain text information that require more tedious cleaning and natural language processing. Finally, we will drop 'grade' because the information is contained in 'sub_grade'.

__Task__: drop 'id', 'member_id', 'url', 'desc', 'title', 'zip_code', 'emp_title', 'grade' columns

In [304]:
for column_name in ['id', 'member_id', 'url', 'desc', 'title', 'zip_code', 'emp_title', 'grade']:
    if column_name in df:
        df = df.drop(column_name, 1)
    else:
        print(column_name + ' is not a header')

#### Browse through a few rows, you'll notice we have a mix of categorical, date, and numerical variables. Turn all categorical variables into dummy variables using pd.get_dummies().

__Task__: Turn all categorical variables into dummy variables and join them back to the dataframe. Drop the original variable that we used to create the dummy variables. *Hint: There are 11 categorical variables*

In [305]:
print(df.shape)

df = pd.get_dummies(df)
print(df.shape)

(35081, 69)
(35081, 792)


#### Now let's clean the date variables. Date variables genernally don't provide much information unless we extract something useful from them. For example from entry/exit dates of a project, we can infer the project duration. We can also extract the month/year from the dates especially when there's seasonality involved in the data. For instance, most people may fail to pay off their loands after 2008 financial crisis. In this data, we will extract the year from the dates.

__Task__: 
        - 1) Make sure date columns contain datetime types 
        - 2) Create 'issued_yr' by extracting year from 'issued_d'
        - 3) Create 'earliest_cr_line_yr' by extracting year from 'earliest_cr_line' 
        - 4) Create 'payment_period' by subtracting 'next_pymnt_d' - 'last_pymnt_d' (make sure you convert timestamp to integers: 54days => 54)
        - 5) Create 'last_credit_pull_yr' by extracting year from 'last_credit_pull_d'
        - 6) Drop all original date variables

In [306]:
# issued_yr
# earliest_cr_line_yr
# payment_period
# last_credit_pull_yr

issued_yr = []
# print(df.head())
# for d in df['issued_d']:
#     year = d.split("-")[1]
#     issued_yr.append(year)
df['issued_yr'] = df['issue_y']
print(df['issued_yr'].head())

467234    2015
467594    2015
467825    2015
468190    2015
468331    2015
Name: issued_yr, dtype: int64


In [307]:
# df['earliest_cr_line_yr'] = earliest_cr_line_yr
earliest_cr_line = []
print(loan['earliest_cr_line'].head())

for idx, e in enumerate(loan['earliest_cr_line']):
    #     print(e)
    if type(e) == type(float('nan')):
        earliest_cr_line.append(e)
    else:
        earliest_cr_line.append(e.split("-")[1])
    
df['earliest_cr_line_yr'] = earliest_cr_line

if 'earliest_cr_line' in df:
    df = df.drop('earliest_cr_line', 1)

print(df['earliest_cr_line_yr'].head())

467234    Oct-2005
467594    Sep-2000
467825    May-2000
468190    Apr-1991
468331    Nov-1999
Name: earliest_cr_line, dtype: object
467234    2005
467594    2000
467825    2000
468190    1991
468331    1999
Name: earliest_cr_line_yr, dtype: object


In [308]:
from dateutil.parser import parser
p = parser()

payment_period = []
last_pymnt_d = loan['last_pymnt_d']
next_pymnt_d = loan['next_pymnt_d']
print(len(last_pymnt_d) == len(next_pymnt_d))
print(len(last_pymnt_d))
print(df.shape)

True
35081
(35081, 794)


In [309]:
def foo(var):
    if type(var) == type(0.0):
        return 0
    return p.parse(var)

def foo1(a, b):
    if type(a) == type(0.0):
        return 0
    if type(b) == type(0.0):
        return 0
#     print(a, b)
#     print(type(a) == NaTType)
#     print(type(b) == NaTType)
#     print(type(a))
#     print(type(b))
    try:
        result = (p.parse(a) - p.parse(a)).days
    except:
        return 0
    return result

print(loan['next_pymnt_d'].head())
print(loan['last_pymnt_d'].head())
tmp = loan
# tmp['next_pymnt_d'] = tmp.apply(lambda x: foo(x['next_pymnt_d']), axis=1)
# tmp['last_pymnt_d'] = tmp.apply(lambda x: foo(x['last_pymnt_d']), axis=1)

tmp['payment_period'] = tmp.apply(lambda x: foo1(x['next_pymnt_d'], x['last_pymnt_d']), axis=1)
# df.apply(lambda x: foo(x['a'], x['b']), axis=1)

# print(tmp['next_pymnt_d'].head())
# print(tmp['last_pymnt_d'].head())
# print(tmp['next_pymnt_d'].sub(tmp['last_pymnt_d'], axis=0).head())
print(tmp['payment_period'].head())
df['payment_period'] = tmp['payment_period']

467234    NaN
467594    NaN
467825    NaN
468190    NaN
468331    NaN
Name: next_pymnt_d, dtype: object
467234    Jan-2016
467594    Jan-2016
467825    Jan-2016
468190    Jan-2016
468331    Jan-2016
Name: last_pymnt_d, dtype: object
467234    0
467594    0
467825    0
468190    0
468331    0
Name: payment_period, dtype: int64


In [310]:
def last_credit_pull_d_fun(var):
    try:
        result = var.split("-")[1]
    except:
        return 0
    return result

df['last_credit_pull_yr'] = loan.apply(lambda x: last_credit_pull_d_fun(x['last_credit_pull_d']), axis=1)
print(loan['last_credit_pull_d'].head())
print(df['last_credit_pull_yr'].head())

467234    Jan-2016
467594    Jan-2016
467825    Jan-2016
468190    Jan-2016
468331    Jan-2016
Name: last_credit_pull_d, dtype: object
467234    2016
467594    2016
467825    2016
468190    2016
468331    2016
Name: last_credit_pull_yr, dtype: object


#### As a last step, we will fill in the missing values for each column. For numerical columns such as annual income, it's reasonable to assume the missing value will be around the median or mean income. For others such as payment, accounts opened, a missing value may simply indicate no activities happened and the value can be 0. You can also use interpolation (see more) to fill in the missing values. 

__Task__: Fill in missing values at your discretion. include justification of your methods in the comments. 

In [311]:
df.head()

# for column in list(df.columns.values):
#     df = df[np.isfinite(df[column])]

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,dti,delinq_2yrs,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_amnt,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,annual_inc_joint,dti_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_il_6m,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,issue_y,is_historical,delinquency,term_ 36 months,term_ 60 months,sub_grade_A1,sub_grade_A2,sub_grade_A3,sub_grade_A4,sub_grade_A5,sub_grade_B1,sub_grade_B2,sub_grade_B3,sub_grade_B4,sub_grade_B5,sub_grade_C1,sub_grade_C2,sub_grade_C3,sub_grade_C4,sub_grade_C5,sub_grade_D1,sub_grade_D2,sub_grade_D3,sub_grade_D4,sub_grade_D5,sub_grade_E1,sub_grade_E2,sub_grade_E3,sub_grade_E4,sub_grade_E5,sub_grade_F1,sub_grade_F2,sub_grade_F3,sub_grade_F4,sub_grade_F5,sub_grade_G1,sub_grade_G2,sub_grade_G3,sub_grade_G4,sub_grade_G5,emp_length_1 year,emp_length_10+ years,emp_length_2 years,emp_length_3 years,emp_length_4 years,emp_length_5 years,emp_length_6 years,emp_length_7 years,emp_length_8 years,emp_length_9 years,emp_length_< 1 year,home_ownership_MORTGAGE,home_ownership_OWN,home_ownership_RENT,verification_status_Not Verified,verification_status_Source Verified,verification_status_Verified,issue_d_Apr-2015,issue_d_Aug-2015,issue_d_Dec-2015,issue_d_Feb-2015,issue_d_Jan-2015,issue_d_Jul-2015,issue_d_Jun-2015,issue_d_Mar-2015,issue_d_May-2015,issue_d_Nov-2015,issue_d_Oct-2015,issue_d_Sep-2015,loan_status_Charged Off,loan_status_Default,loan_status_Fully Paid,loan_status_In Grace Period,loan_status_Late (16-30 days),loan_status_Late (31-120 days),pymnt_plan_n,purpose_car,purpose_credit_card,purpose_debt_consolidation,purpose_home_improvement,purpose_house,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_renewable_energy,purpose_small_business,purpose_vacation,purpose_wedding,addr_state_AK,addr_state_AL,addr_state_AR,addr_state_AZ,addr_state_CA,addr_state_CO,addr_state_CT,addr_state_DC,addr_state_DE,addr_state_FL,addr_state_GA,addr_state_HI,addr_state_IL,addr_state_IN,addr_state_KS,addr_state_KY,addr_state_LA,addr_state_MA,addr_state_MD,addr_state_ME,addr_state_MI,addr_state_MN,addr_state_MO,addr_state_MS,addr_state_MT,addr_state_NC,addr_state_ND,addr_state_NE,addr_state_NH,addr_state_NJ,addr_state_NM,addr_state_NV,addr_state_NY,addr_state_OH,addr_state_OK,addr_state_OR,addr_state_PA,addr_state_RI,addr_state_SC,addr_state_SD,addr_state_TN,addr_state_TX,addr_state_UT,addr_state_VA,addr_state_VT,addr_state_WA,addr_state_WI,addr_state_WV,addr_state_WY,earliest_cr_line_Apr-1964,earliest_cr_line_Apr-1966,earliest_cr_line_Apr-1967,earliest_cr_line_Apr-1970,earliest_cr_line_Apr-1971,earliest_cr_line_Apr-1972,earliest_cr_line_Apr-1973,earliest_cr_line_Apr-1974,earliest_cr_line_Apr-1975,earliest_cr_line_Apr-1976,earliest_cr_line_Apr-1977,earliest_cr_line_Apr-1978,earliest_cr_line_Apr-1979,earliest_cr_line_Apr-1980,earliest_cr_line_Apr-1981,earliest_cr_line_Apr-1982,earliest_cr_line_Apr-1983,earliest_cr_line_Apr-1984,earliest_cr_line_Apr-1985,earliest_cr_line_Apr-1986,earliest_cr_line_Apr-1987,earliest_cr_line_Apr-1988,earliest_cr_line_Apr-1989,earliest_cr_line_Apr-1990,earliest_cr_line_Apr-1991,earliest_cr_line_Apr-1992,earliest_cr_line_Apr-1993,earliest_cr_line_Apr-1994,earliest_cr_line_Apr-1995,earliest_cr_line_Apr-1996,earliest_cr_line_Apr-1997,earliest_cr_line_Apr-1998,earliest_cr_line_Apr-1999,earliest_cr_line_Apr-2000,earliest_cr_line_Apr-2001,earliest_cr_line_Apr-2002,earliest_cr_line_Apr-2003,earliest_cr_line_Apr-2004,earliest_cr_line_Apr-2005,earliest_cr_line_Apr-2006,earliest_cr_line_Apr-2007,earliest_cr_line_Apr-2008,earliest_cr_line_Apr-2009,earliest_cr_line_Apr-2010,earliest_cr_line_Apr-2011,earliest_cr_line_Apr-2012,earliest_cr_line_Aug-1950,earliest_cr_line_Aug-1959,earliest_cr_line_Aug-1961,earliest_cr_line_Aug-1966,earliest_cr_line_Aug-1967,earliest_cr_line_Aug-1970,earliest_cr_line_Aug-1971,earliest_cr_line_Aug-1972,earliest_cr_line_Aug-1973,earliest_cr_line_Aug-1974,earliest_cr_line_Aug-1975,earliest_cr_line_Aug-1976,earliest_cr_line_Aug-1977,earliest_cr_line_Aug-1978,earliest_cr_line_Aug-1979,earliest_cr_line_Aug-1980,earliest_cr_line_Aug-1981,earliest_cr_line_Aug-1982,earliest_cr_line_Aug-1983,earliest_cr_line_Aug-1984,earliest_cr_line_Aug-1985,earliest_cr_line_Aug-1986,earliest_cr_line_Aug-1987,earliest_cr_line_Aug-1988,earliest_cr_line_Aug-1989,earliest_cr_line_Aug-1990,earliest_cr_line_Aug-1991,earliest_cr_line_Aug-1992,earliest_cr_line_Aug-1993,earliest_cr_line_Aug-1994,earliest_cr_line_Aug-1995,earliest_cr_line_Aug-1996,earliest_cr_line_Aug-1997,earliest_cr_line_Aug-1998,earliest_cr_line_Aug-1999,earliest_cr_line_Aug-2000,earliest_cr_line_Aug-2001,earliest_cr_line_Aug-2002,earliest_cr_line_Aug-2003,earliest_cr_line_Aug-2004,earliest_cr_line_Aug-2005,earliest_cr_line_Aug-2006,earliest_cr_line_Aug-2007,earliest_cr_line_Aug-2008,earliest_cr_line_Aug-2009,earliest_cr_line_Aug-2010,earliest_cr_line_Aug-2011,earliest_cr_line_Aug-2012,earliest_cr_line_Dec-1961,earliest_cr_line_Dec-1963,earliest_cr_line_Dec-1964,earliest_cr_line_Dec-1965,earliest_cr_line_Dec-1966,earliest_cr_line_Dec-1967,earliest_cr_line_Dec-1968,earliest_cr_line_Dec-1971,earliest_cr_line_Dec-1972,earliest_cr_line_Dec-1973,earliest_cr_line_Dec-1974,earliest_cr_line_Dec-1975,earliest_cr_line_Dec-1976,earliest_cr_line_Dec-1977,earliest_cr_line_Dec-1978,earliest_cr_line_Dec-1979,earliest_cr_line_Dec-1980,earliest_cr_line_Dec-1981,earliest_cr_line_Dec-1982,earliest_cr_line_Dec-1983,earliest_cr_line_Dec-1984,earliest_cr_line_Dec-1985,earliest_cr_line_Dec-1986,earliest_cr_line_Dec-1987,earliest_cr_line_Dec-1988,earliest_cr_line_Dec-1989,earliest_cr_line_Dec-1990,earliest_cr_line_Dec-1991,earliest_cr_line_Dec-1992,earliest_cr_line_Dec-1993,earliest_cr_line_Dec-1994,earliest_cr_line_Dec-1995,earliest_cr_line_Dec-1996,earliest_cr_line_Dec-1997,earliest_cr_line_Dec-1998,earliest_cr_line_Dec-1999,earliest_cr_line_Dec-2000,earliest_cr_line_Dec-2001,earliest_cr_line_Dec-2002,earliest_cr_line_Dec-2003,earliest_cr_line_Dec-2004,earliest_cr_line_Dec-2005,earliest_cr_line_Dec-2006,earliest_cr_line_Dec-2007,earliest_cr_line_Dec-2008,earliest_cr_line_Dec-2009,earliest_cr_line_Dec-2010,earliest_cr_line_Dec-2011,earliest_cr_line_Feb-1965,earliest_cr_line_Feb-1967,earliest_cr_line_Feb-1969,earliest_cr_line_Feb-1970,earliest_cr_line_Feb-1971,earliest_cr_line_Feb-1972,earliest_cr_line_Feb-1973,earliest_cr_line_Feb-1974,earliest_cr_line_Feb-1975,earliest_cr_line_Feb-1976,earliest_cr_line_Feb-1977,earliest_cr_line_Feb-1978,earliest_cr_line_Feb-1979,earliest_cr_line_Feb-1980,earliest_cr_line_Feb-1981,earliest_cr_line_Feb-1982,earliest_cr_line_Feb-1983,earliest_cr_line_Feb-1984,earliest_cr_line_Feb-1985,earliest_cr_line_Feb-1986,earliest_cr_line_Feb-1987,earliest_cr_line_Feb-1988,earliest_cr_line_Feb-1989,earliest_cr_line_Feb-1990,earliest_cr_line_Feb-1991,earliest_cr_line_Feb-1992,earliest_cr_line_Feb-1993,earliest_cr_line_Feb-1994,earliest_cr_line_Feb-1995,earliest_cr_line_Feb-1996,earliest_cr_line_Feb-1997,earliest_cr_line_Feb-1998,earliest_cr_line_Feb-1999,earliest_cr_line_Feb-2000,earliest_cr_line_Feb-2001,earliest_cr_line_Feb-2002,earliest_cr_line_Feb-2003,earliest_cr_line_Feb-2004,earliest_cr_line_Feb-2005,earliest_cr_line_Feb-2006,earliest_cr_line_Feb-2007,earliest_cr_line_Feb-2008,earliest_cr_line_Feb-2009,earliest_cr_line_Feb-2010,earliest_cr_line_Feb-2011,earliest_cr_line_Feb-2012,earliest_cr_line_Jan-1954,earliest_cr_line_Jan-1956,earliest_cr_line_Jan-1959,earliest_cr_line_Jan-1960,earliest_cr_line_Jan-1961,earliest_cr_line_Jan-1962,earliest_cr_line_Jan-1963,earliest_cr_line_Jan-1964,earliest_cr_line_Jan-1965,earliest_cr_line_Jan-1967,earliest_cr_line_Jan-1968,earliest_cr_line_Jan-1969,earliest_cr_line_Jan-1970,earliest_cr_line_Jan-1971,earliest_cr_line_Jan-1972,earliest_cr_line_Jan-1973,earliest_cr_line_Jan-1974,earliest_cr_line_Jan-1975,earliest_cr_line_Jan-1976,earliest_cr_line_Jan-1977,earliest_cr_line_Jan-1978,earliest_cr_line_Jan-1979,earliest_cr_line_Jan-1980,earliest_cr_line_Jan-1981,earliest_cr_line_Jan-1982,earliest_cr_line_Jan-1983,earliest_cr_line_Jan-1984,earliest_cr_line_Jan-1985,earliest_cr_line_Jan-1986,earliest_cr_line_Jan-1987,earliest_cr_line_Jan-1988,earliest_cr_line_Jan-1989,earliest_cr_line_Jan-1990,earliest_cr_line_Jan-1991,earliest_cr_line_Jan-1992,earliest_cr_line_Jan-1993,earliest_cr_line_Jan-1994,earliest_cr_line_Jan-1995,earliest_cr_line_Jan-1996,earliest_cr_line_Jan-1997,earliest_cr_line_Jan-1998,earliest_cr_line_Jan-1999,earliest_cr_line_Jan-2000,earliest_cr_line_Jan-2001,earliest_cr_line_Jan-2002,earliest_cr_line_Jan-2003,earliest_cr_line_Jan-2004,earliest_cr_line_Jan-2005,earliest_cr_line_Jan-2006,earliest_cr_line_Jan-2007,earliest_cr_line_Jan-2008,earliest_cr_line_Jan-2009,earliest_cr_line_Jan-2010,earliest_cr_line_Jan-2011,earliest_cr_line_Jan-2012,earliest_cr_line_Jul-1965,earliest_cr_line_Jul-1966,earliest_cr_line_Jul-1967,earliest_cr_line_Jul-1968,earliest_cr_line_Jul-1969,earliest_cr_line_Jul-1970,earliest_cr_line_Jul-1971,earliest_cr_line_Jul-1972,earliest_cr_line_Jul-1973,earliest_cr_line_Jul-1974,earliest_cr_line_Jul-1975,earliest_cr_line_Jul-1976,earliest_cr_line_Jul-1977,earliest_cr_line_Jul-1978,earliest_cr_line_Jul-1979,earliest_cr_line_Jul-1980,earliest_cr_line_Jul-1981,earliest_cr_line_Jul-1982,earliest_cr_line_Jul-1983,earliest_cr_line_Jul-1984,earliest_cr_line_Jul-1985,earliest_cr_line_Jul-1986,earliest_cr_line_Jul-1987,earliest_cr_line_Jul-1988,earliest_cr_line_Jul-1989,earliest_cr_line_Jul-1990,earliest_cr_line_Jul-1991,earliest_cr_line_Jul-1992,earliest_cr_line_Jul-1993,earliest_cr_line_Jul-1994,earliest_cr_line_Jul-1995,earliest_cr_line_Jul-1996,earliest_cr_line_Jul-1997,earliest_cr_line_Jul-1998,earliest_cr_line_Jul-1999,earliest_cr_line_Jul-2000,earliest_cr_line_Jul-2001,earliest_cr_line_Jul-2002,earliest_cr_line_Jul-2003,earliest_cr_line_Jul-2004,earliest_cr_line_Jul-2005,earliest_cr_line_Jul-2006,earliest_cr_line_Jul-2007,earliest_cr_line_Jul-2008,earliest_cr_line_Jul-2009,earliest_cr_line_Jul-2010,earliest_cr_line_Jul-2011,earliest_cr_line_Jul-2012,earliest_cr_line_Jun-1957,earliest_cr_line_Jun-1963,earliest_cr_line_Jun-1965,earliest_cr_line_Jun-1966,earliest_cr_line_Jun-1969,earliest_cr_line_Jun-1970,earliest_cr_line_Jun-1971,earliest_cr_line_Jun-1973,earliest_cr_line_Jun-1974,earliest_cr_line_Jun-1975,earliest_cr_line_Jun-1976,earliest_cr_line_Jun-1977,earliest_cr_line_Jun-1978,earliest_cr_line_Jun-1979,earliest_cr_line_Jun-1980,earliest_cr_line_Jun-1981,earliest_cr_line_Jun-1982,earliest_cr_line_Jun-1983,earliest_cr_line_Jun-1984,earliest_cr_line_Jun-1985,earliest_cr_line_Jun-1986,earliest_cr_line_Jun-1987,earliest_cr_line_Jun-1988,earliest_cr_line_Jun-1989,earliest_cr_line_Jun-1990,earliest_cr_line_Jun-1991,earliest_cr_line_Jun-1992,earliest_cr_line_Jun-1993,earliest_cr_line_Jun-1994,earliest_cr_line_Jun-1995,earliest_cr_line_Jun-1996,earliest_cr_line_Jun-1997,earliest_cr_line_Jun-1998,earliest_cr_line_Jun-1999,earliest_cr_line_Jun-2000,earliest_cr_line_Jun-2001,earliest_cr_line_Jun-2002,earliest_cr_line_Jun-2003,earliest_cr_line_Jun-2004,earliest_cr_line_Jun-2005,earliest_cr_line_Jun-2006,earliest_cr_line_Jun-2007,earliest_cr_line_Jun-2008,earliest_cr_line_Jun-2009,earliest_cr_line_Jun-2010,earliest_cr_line_Jun-2011,earliest_cr_line_Jun-2012,earliest_cr_line_Mar-1963,earliest_cr_line_Mar-1966,earliest_cr_line_Mar-1967,earliest_cr_line_Mar-1969,earliest_cr_line_Mar-1973,earliest_cr_line_Mar-1974,earliest_cr_line_Mar-1975,earliest_cr_line_Mar-1976,earliest_cr_line_Mar-1977,earliest_cr_line_Mar-1978,earliest_cr_line_Mar-1979,earliest_cr_line_Mar-1980,earliest_cr_line_Mar-1981,earliest_cr_line_Mar-1982,earliest_cr_line_Mar-1983,earliest_cr_line_Mar-1984,earliest_cr_line_Mar-1985,earliest_cr_line_Mar-1986,earliest_cr_line_Mar-1987,earliest_cr_line_Mar-1988,earliest_cr_line_Mar-1989,earliest_cr_line_Mar-1990,earliest_cr_line_Mar-1991,earliest_cr_line_Mar-1992,earliest_cr_line_Mar-1993,earliest_cr_line_Mar-1994,earliest_cr_line_Mar-1995,earliest_cr_line_Mar-1996,earliest_cr_line_Mar-1997,earliest_cr_line_Mar-1998,earliest_cr_line_Mar-1999,earliest_cr_line_Mar-2000,earliest_cr_line_Mar-2001,earliest_cr_line_Mar-2002,earliest_cr_line_Mar-2003,earliest_cr_line_Mar-2004,earliest_cr_line_Mar-2005,earliest_cr_line_Mar-2006,earliest_cr_line_Mar-2007,earliest_cr_line_Mar-2008,earliest_cr_line_Mar-2009,earliest_cr_line_Mar-2010,earliest_cr_line_Mar-2011,earliest_cr_line_Mar-2012,earliest_cr_line_May-1966,earliest_cr_line_May-1968,earliest_cr_line_May-1969,earliest_cr_line_May-1970,earliest_cr_line_May-1971,earliest_cr_line_May-1972,earliest_cr_line_May-1973,earliest_cr_line_May-1974,earliest_cr_line_May-1975,earliest_cr_line_May-1976,earliest_cr_line_May-1977,earliest_cr_line_May-1978,earliest_cr_line_May-1979,earliest_cr_line_May-1980,earliest_cr_line_May-1981,earliest_cr_line_May-1982,earliest_cr_line_May-1983,earliest_cr_line_May-1984,earliest_cr_line_May-1985,earliest_cr_line_May-1986,earliest_cr_line_May-1987,earliest_cr_line_May-1988,earliest_cr_line_May-1989,earliest_cr_line_May-1990,earliest_cr_line_May-1991,earliest_cr_line_May-1992,earliest_cr_line_May-1993,earliest_cr_line_May-1994,earliest_cr_line_May-1995,earliest_cr_line_May-1996,earliest_cr_line_May-1997,earliest_cr_line_May-1998,earliest_cr_line_May-1999,earliest_cr_line_May-2000,earliest_cr_line_May-2001,earliest_cr_line_May-2002,earliest_cr_line_May-2003,earliest_cr_line_May-2004,earliest_cr_line_May-2005,earliest_cr_line_May-2006,earliest_cr_line_May-2007,earliest_cr_line_May-2008,earliest_cr_line_May-2009,earliest_cr_line_May-2010,earliest_cr_line_May-2011,earliest_cr_line_May-2012,earliest_cr_line_Nov-1963,earliest_cr_line_Nov-1964,earliest_cr_line_Nov-1965,earliest_cr_line_Nov-1966,earliest_cr_line_Nov-1968,earliest_cr_line_Nov-1969,earliest_cr_line_Nov-1970,earliest_cr_line_Nov-1971,earliest_cr_line_Nov-1972,earliest_cr_line_Nov-1973,earliest_cr_line_Nov-1974,earliest_cr_line_Nov-1975,earliest_cr_line_Nov-1976,earliest_cr_line_Nov-1977,earliest_cr_line_Nov-1978,earliest_cr_line_Nov-1979,earliest_cr_line_Nov-1980,earliest_cr_line_Nov-1981,earliest_cr_line_Nov-1982,earliest_cr_line_Nov-1983,earliest_cr_line_Nov-1984,earliest_cr_line_Nov-1985,earliest_cr_line_Nov-1986,earliest_cr_line_Nov-1987,earliest_cr_line_Nov-1988,earliest_cr_line_Nov-1989,earliest_cr_line_Nov-1990,earliest_cr_line_Nov-1991,earliest_cr_line_Nov-1992,earliest_cr_line_Nov-1993,earliest_cr_line_Nov-1994,earliest_cr_line_Nov-1995,earliest_cr_line_Nov-1996,earliest_cr_line_Nov-1997,earliest_cr_line_Nov-1998,earliest_cr_line_Nov-1999,earliest_cr_line_Nov-2000,earliest_cr_line_Nov-2001,earliest_cr_line_Nov-2002,earliest_cr_line_Nov-2003,earliest_cr_line_Nov-2004,earliest_cr_line_Nov-2005,earliest_cr_line_Nov-2006,earliest_cr_line_Nov-2007,earliest_cr_line_Nov-2008,earliest_cr_line_Nov-2009,earliest_cr_line_Nov-2010,earliest_cr_line_Nov-2011,earliest_cr_line_Oct-1962,earliest_cr_line_Oct-1965,earliest_cr_line_Oct-1968,earliest_cr_line_Oct-1969,earliest_cr_line_Oct-1970,earliest_cr_line_Oct-1971,earliest_cr_line_Oct-1973,earliest_cr_line_Oct-1974,earliest_cr_line_Oct-1975,earliest_cr_line_Oct-1976,earliest_cr_line_Oct-1977,earliest_cr_line_Oct-1978,earliest_cr_line_Oct-1979,earliest_cr_line_Oct-1980,earliest_cr_line_Oct-1981,earliest_cr_line_Oct-1982,earliest_cr_line_Oct-1983,earliest_cr_line_Oct-1984,earliest_cr_line_Oct-1985,earliest_cr_line_Oct-1986,earliest_cr_line_Oct-1987,earliest_cr_line_Oct-1988,earliest_cr_line_Oct-1989,earliest_cr_line_Oct-1990,earliest_cr_line_Oct-1991,earliest_cr_line_Oct-1992,earliest_cr_line_Oct-1993,earliest_cr_line_Oct-1994,earliest_cr_line_Oct-1995,earliest_cr_line_Oct-1996,earliest_cr_line_Oct-1997,earliest_cr_line_Oct-1998,earliest_cr_line_Oct-1999,earliest_cr_line_Oct-2000,earliest_cr_line_Oct-2001,earliest_cr_line_Oct-2002,earliest_cr_line_Oct-2003,earliest_cr_line_Oct-2004,earliest_cr_line_Oct-2005,earliest_cr_line_Oct-2006,earliest_cr_line_Oct-2007,earliest_cr_line_Oct-2008,earliest_cr_line_Oct-2009,earliest_cr_line_Oct-2010,earliest_cr_line_Oct-2011,earliest_cr_line_Oct-2012,earliest_cr_line_Sep-1964,earliest_cr_line_Sep-1967,earliest_cr_line_Sep-1968,earliest_cr_line_Sep-1969,earliest_cr_line_Sep-1971,earliest_cr_line_Sep-1972,earliest_cr_line_Sep-1973,earliest_cr_line_Sep-1974,earliest_cr_line_Sep-1975,earliest_cr_line_Sep-1976,earliest_cr_line_Sep-1977,earliest_cr_line_Sep-1978,earliest_cr_line_Sep-1979,earliest_cr_line_Sep-1980,earliest_cr_line_Sep-1981,earliest_cr_line_Sep-1982,earliest_cr_line_Sep-1983,earliest_cr_line_Sep-1984,earliest_cr_line_Sep-1985,earliest_cr_line_Sep-1986,earliest_cr_line_Sep-1987,earliest_cr_line_Sep-1988,earliest_cr_line_Sep-1989,earliest_cr_line_Sep-1990,earliest_cr_line_Sep-1991,earliest_cr_line_Sep-1992,earliest_cr_line_Sep-1993,earliest_cr_line_Sep-1994,earliest_cr_line_Sep-1995,earliest_cr_line_Sep-1996,earliest_cr_line_Sep-1997,earliest_cr_line_Sep-1998,earliest_cr_line_Sep-1999,earliest_cr_line_Sep-2000,earliest_cr_line_Sep-2001,earliest_cr_line_Sep-2002,earliest_cr_line_Sep-2003,earliest_cr_line_Sep-2004,earliest_cr_line_Sep-2005,earliest_cr_line_Sep-2006,earliest_cr_line_Sep-2007,earliest_cr_line_Sep-2008,earliest_cr_line_Sep-2009,earliest_cr_line_Sep-2010,earliest_cr_line_Sep-2011,earliest_cr_line_Sep-2012,initial_list_status_f,initial_list_status_w,last_pymnt_d_Apr-2015,last_pymnt_d_Aug-2015,last_pymnt_d_Dec-2015,last_pymnt_d_Feb-2015,last_pymnt_d_Jan-2015,last_pymnt_d_Jan-2016,last_pymnt_d_Jul-2015,last_pymnt_d_Jun-2015,last_pymnt_d_Mar-2015,last_pymnt_d_May-2015,last_pymnt_d_Nov-2015,last_pymnt_d_Oct-2015,last_pymnt_d_Sep-2015,next_pymnt_d_Feb-2016,next_pymnt_d_Jan-2016,next_pymnt_d_Mar-2016,last_credit_pull_d_Apr-2015,last_credit_pull_d_Aug-2015,last_credit_pull_d_Dec-2014,last_credit_pull_d_Dec-2015,last_credit_pull_d_Feb-2015,last_credit_pull_d_Jan-2015,last_credit_pull_d_Jan-2016,last_credit_pull_d_Jul-2015,last_credit_pull_d_Jun-2015,last_credit_pull_d_Mar-2015,last_credit_pull_d_May-2015,last_credit_pull_d_Nov-2015,last_credit_pull_d_Oct-2015,last_credit_pull_d_Sep-2015,application_type_INDIVIDUAL,application_type_JOINT,verification_status_joint_Not Verified,verification_status_joint_Source Verified,verification_status_joint_Verified,issued_yr,earliest_cr_line_yr,payment_period,last_credit_pull_yr
467234,19800.0,19800.0,19800.0,12.88,666.0,78924.0,13.12,0.0,0.0,,,12.0,0.0,12736.0,24.7,27.0,0.0,0.0,19870.84,19870.84,19800.0,70.84,0.0,0.0,0.0,19906.26,0.0,,1.0,,,0.0,0.0,138611.0,2.0,2.0,1.0,1.0,6.0,11438.0,51.0,3.0,3.0,5826.0,32.7,51700.0,9.0,2.0,15.0,2015,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,2015,2005,0,2016
467594,35000.0,35000.0,35000.0,12.88,1177.27,95000.0,7.19,0.0,1.0,66.0,,14.0,0.0,13576.0,33.6,23.0,0.0,0.0,35125.23,35125.23,35000.0,125.23,0.0,0.0,0.0,35187.84,0.0,66.0,1.0,,,0.0,0.0,18219.0,1.0,1.0,0.0,0.0,81.0,4643.0,38.6,3.0,4.0,4953.0,34.7,40400.0,1.0,0.0,3.0,2015,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,2015,2000,0,2016
467825,20000.0,20000.0,20000.0,12.88,672.73,56000.0,22.4,0.0,0.0,69.0,,6.0,0.0,6365.0,46.8,29.0,0.0,0.0,20009.31,20009.31,20000.0,9.31,0.0,0.0,0.0,20045.09,0.0,69.0,1.0,,,0.0,0.0,202156.0,2.0,3.0,2.0,3.0,2.0,42303.0,84.5,1.0,1.0,5744.0,76.4,13600.0,1.0,1.0,3.0,2015,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,2015,2000,0,2016
468190,28000.0,28000.0,28000.0,12.88,635.37,96000.0,31.0,0.0,1.0,,,15.0,0.0,45751.0,66.0,39.0,0.0,0.0,28013.03,28013.03,28000.0,13.03,0.0,0.0,0.0,28063.12,0.0,,1.0,,,0.0,0.0,219222.0,1.0,6.0,2.0,3.0,3.0,78703.0,78.7,1.0,3.0,11075.0,73.5,69300.0,1.0,18.0,1.0,2015,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,2015,1991,0,2016
468331,23100.0,23100.0,23100.0,19.48,605.35,55000.0,22.24,0.0,0.0,,,24.0,0.0,3857.0,11.8,40.0,0.0,0.0,23116.25,23116.25,23100.0,16.25,0.0,0.0,0.0,23178.75,0.0,,1.0,,,0.0,3090.0,183546.0,4.0,11.0,3.0,11.0,2.0,73207.0,88.1,1.0,4.0,2006.0,66.5,32700.0,2.0,0.0,2.0,2015,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,2015,1999,0,2016


### Now let's feed the cleaned dataset through the pipeline! 

#### X represents all the predictors - features; y represents the target variable - whether the borrower has been delinquent on the loan

__task__: divide development dataframe into two X and y. X will be a dataframe that contains all predictors. y will be a single column therefore a Series that contains the target variable 'delinquency'

In [316]:
tmp = df

tmp =  df.fillna(0)

print(tmp.head())
y = tmp['delinquency']
X = tmp
X.drop('delinquency', 1)

print(y.head())
print(X.head())

        loan_amnt  funded_amnt  funded_amnt_inv  int_rate  installment  \
467234  19800.000    19800.000        19800.000    12.880      666.000   
467594  35000.000    35000.000        35000.000    12.880     1177.270   
467825  20000.000    20000.000        20000.000    12.880      672.730   
468190  28000.000    28000.000        28000.000    12.880      635.370   
468331  23100.000    23100.000        23100.000    19.480      605.350   

        annual_inc    dti  delinq_2yrs  inq_last_6mths  \
467234   78924.000 13.120        0.000           0.000   
467594   95000.000  7.190        0.000           1.000   
467825   56000.000 22.400        0.000           0.000   
468190   96000.000 31.000        0.000           1.000   
468331   55000.000 22.240        0.000           0.000   

        mths_since_last_delinq  mths_since_last_record  open_acc  pub_rec  \
467234                   0.000                   0.000    12.000    0.000   
467594                  66.000                   0.0

#### For this pipeline, we will experiment with K-Nearest Neighbor algorithm. Similar to what you've done in the case study by finding comparable examples to the case at hand and decide whether you want to lend money or not, this algorithm calculates the distances of each sample to the rest in the feature space, find the K-nearest neighbors to this particular sample, and assign the sample the most common class from its neighbors.

#### There are several variation of the algorithm with some options more efficient than the others -- calculating distance between one and the rest of the samples can get computationally expensive really fast! Read more about KNN [here](https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/)



__task__: Call pipeline function and store the returned scores in a variable. You do NOT need to modify the pipeline function

In [317]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.model_selection import train_test_split, KFold
from sklearn.model_selection import cross_val_score
import time

def pipeline(X, y):
    
    begin = time.time()
    
    # we will split the development set into training(80%) and testing (20%)
    # 80-20 is a convention. 70-30 is also used. The choice really depends on the base rate of your outcome
    X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=123, stratify=y)

    # here we use the make_pipeline module to make a pipeline by using standardscaler to standarize
    # the data and feed it through a K-nearest neighbor algorithm
    pipe_kn = make_pipeline(StandardScaler(), 
                            KNN(n_neighbors=30))

    # we will set up a 5-fold cross validation scheme to evaluate the model
    scores = cross_val_score(estimator=pipe_kn,
                             X=X_train,
                             y=y_train,
                             cv=5,
                             scoring='accuracy',
                             n_jobs=1)
    end = time.time()

    print("The model took %d seconds to train." % (end-begin))
    
    return scores

In [318]:
scores = pipeline(X, y)

The model took 1605 seconds to train.


#### The model has now been trained, let's see how it performs. There are many metrics you can use to evaluate an algorithm including accuracy, precision, recall, ROC_AUC, and so on. You can read more about it [here](https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/). For simplicity, we will use accuracy -- of all the predictiions we made, how many times did we make a right decision. 

#### To establish comparison, we need a baseline of accuracy. Accuracy is often not a good enough metric by itself. Imagine if you have a very skewed outcome that you are trying to predict -- the outcome only happens 1% of the time. Your model will be very accurate if it keeps giving out negative predictions and would by default reach an accuracy of 1%. So we want our model to have a higher accuracy than the default rate of positive outcomes. 

__Task__: Identify the baseline accuracy. That is ( 1 - (# of delinquency) / (total # of predictions made))

In [320]:
num_of_del = y
print(y.head())
# # Create variable with TRUE if issue_d, year is > 2015
# delin = y['delinquency'] >= 2015

# # # Select all casess where issue_d's year is > 2015
# df = df[recent]
# print(df.shape)

467234    0
467594    0
467825    0
468190    0
468331    0
Name: delinquency, dtype: int64


In [322]:
baseline = (1 - sum(y) / len(y))
print(baseline)

0.6551694649525384


#### Now let's print out our accuracy obtained from cross-validation by running the following code

In [323]:
print('CV accuracy scores: %s' % scores)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

CV accuracy scores: [ 0.75614535  0.759886    0.75908767  0.76086957  0.76496793]
CV accuracy: 0.760 +/- 0.003


### As you can see the average accuracy is way above the baseline rate. Congrats!

#### Bonus step: Now we are happy with the development algorithm, you can train the model on the entire development dataset and use it to predict on the application dataset where the actual outcomes are not yet oberved. Since it can take quite some time to train a model, you can "pickle" the model to avoid having to train the model each time you use it. See more about pickle library [here](https://docs.python.org/3/library/pickle.html).