# Identifying safe loans with decision trees

LendingClub is a peer-to-peer lending company that directly connects borrowers and potential lenders/investors. In this notebook, you will build a classification model to predict whether or not a loan provided by LendingClub is likely to default.

You will use data from LendingClub to predict whether a loan will be paid off in full or will be charged off and possibly go into default. In this assignment you will:

- Use SFrames to do some feature engineering.
- Train a decision-tree on the LendingClub dataset.
- Visualize the tree.
- Predict whether a loan will default along with prediction probabilities (on a validation set).
- Train a complex tree model and compare it to a simple tree model.

## Load the Lending Club dataset

We will be using a dataset from the [LendingClub](https://www.lendingclub.com/).

Load the dataset into a data frame named **loans**. Using SFrame, this would look like:

In [58]:
# import pandas as pd
import sframe
loans = sframe.SFrame('lending-club-data.gl/')
# loans = loans.to_dataframe()

## Exploring some features

Let's quickly explore what the dataset looks like. First, print out the column names to see what features we have in this dataset. On SFrame, you can run this code:

In [59]:
loans.column_names()

['id',
 'member_id',
 'loan_amnt',
 'funded_amnt',
 'funded_amnt_inv',
 'term',
 'int_rate',
 'installment',
 'grade',
 'sub_grade',
 'emp_title',
 'emp_length',
 'home_ownership',
 'annual_inc',
 'is_inc_v',
 'issue_d',
 'loan_status',
 'pymnt_plan',
 'url',
 'desc',
 'purpose',
 'title',
 'zip_code',
 'addr_state',
 'dti',
 'delinq_2yrs',
 'earliest_cr_line',
 'inq_last_6mths',
 'mths_since_last_delinq',
 'mths_since_last_record',
 'open_acc',
 'pub_rec',
 'revol_bal',
 'revol_util',
 'total_acc',
 'initial_list_status',
 'out_prncp',
 'out_prncp_inv',
 'total_pymnt',
 'total_pymnt_inv',
 'total_rec_prncp',
 'total_rec_int',
 'total_rec_late_fee',
 'recoveries',
 'collection_recovery_fee',
 'last_pymnt_d',
 'last_pymnt_amnt',
 'next_pymnt_d',
 'last_credit_pull_d',
 'collections_12_mths_ex_med',
 'mths_since_last_major_derog',
 'policy_code',
 'not_compliant',
 'status',
 'inactive_loans',
 'bad_loans',
 'emp_length_num',
 'grade_num',
 'sub_grade_num',
 'delinq_2yrs_zero',
 'pub_rec

In [60]:
loans.head()

id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade
1077501,1296599,5000,5000,4975,36 months,10.65,162.87,B,B2
1077430,1314167,2500,2500,2500,60 months,15.27,59.83,C,C4
1077175,1313524,2400,2400,2400,36 months,15.96,84.33,C,C5
1076863,1277178,10000,10000,10000,36 months,13.49,339.31,C,C1
1075269,1311441,5000,5000,5000,36 months,7.9,156.46,A,A4
1072053,1288686,3000,3000,3000,36 months,18.64,109.43,E,E1
1071795,1306957,5600,5600,5600,60 months,21.28,152.39,F,F2
1071570,1306721,5375,5375,5350,60 months,12.69,121.45,B,B5
1070078,1305201,6500,6500,6500,60 months,14.65,153.45,C,C3
1069908,1305008,12000,12000,12000,36 months,12.69,402.54,B,B5

emp_title,emp_length,home_ownership,annual_inc,is_inc_v,issue_d,loan_status,pymnt_plan
,10+ years,RENT,24000,Verified,20111201T000000,Fully Paid,n
Ryder,< 1 year,RENT,30000,Source Verified,20111201T000000,Charged Off,n
,10+ years,RENT,12252,Not Verified,20111201T000000,Fully Paid,n
AIR RESOURCES BOARD,10+ years,RENT,49200,Source Verified,20111201T000000,Fully Paid,n
Veolia Transportaton,3 years,RENT,36000,Source Verified,20111201T000000,Fully Paid,n
MKC Accounting,9 years,RENT,48000,Source Verified,20111201T000000,Fully Paid,n
,4 years,OWN,40000,Source Verified,20111201T000000,Charged Off,n
Starbucks,< 1 year,RENT,15000,Verified,20111201T000000,Charged Off,n
Southwest Rural metro,5 years,OWN,72000,Not Verified,20111201T000000,Fully Paid,n
UCLA,10+ years,OWN,75000,Source Verified,20111201T000000,Fully Paid,n

url,desc,purpose,title,zip_code
https://www.lendingclub.c om/browse/loanDetail. ...,Borrower added on 12/22/11 > I need to ...,credit_card,Computer,860xx
https://www.lendingclub.c om/browse/loanDetail. ...,Borrower added on 12/22/11 > I plan to use ...,car,bike,309xx
https://www.lendingclub.c om/browse/loanDetail. ...,,small_business,real estate business,606xx
https://www.lendingclub.c om/browse/loanDetail. ...,Borrower added on 12/21/11 > to pay for ...,other,personel,917xx
https://www.lendingclub.c om/browse/loanDetail. ...,,wedding,My wedding loan I promise to pay back ...,852xx
https://www.lendingclub.c om/browse/loanDetail. ...,Borrower added on 12/16/11 > Downpayment ...,car,Car Downpayment,900xx
https://www.lendingclub.c om/browse/loanDetail. ...,Borrower added on 12/21/11 > I own a small ...,small_business,Expand Business & Buy Debt Portfolio ...,958xx
https://www.lendingclub.c om/browse/loanDetail. ...,Borrower added on 12/16/11 > I'm trying to ...,other,Building my credit history. ...,774xx
https://www.lendingclub.c om/browse/loanDetail. ...,Borrower added on 12/15/11 > I had recived ...,debt_consolidation,High intrest Consolidation ...,853xx
https://www.lendingclub.c om/browse/loanDetail. ...,,debt_consolidation,Consolidation,913xx

addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record
AZ,27.65,0,19850101T000000,1,,
GA,1.0,0,19990401T000000,5,,
IL,8.72,0,20011101T000000,2,,
CA,20.0,0,19960201T000000,1,35.0,
AZ,11.2,0,20041101T000000,3,,
CA,5.35,0,20070101T000000,2,,
CA,5.55,0,20040401T000000,2,,
TX,18.08,0,20040901T000000,0,,
AZ,16.12,0,19980101T000000,2,,
CA,10.78,0,19891001T000000,0,,

open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt
3,0,13648,83.7,9,f,0.0,0.0,5861.07
3,0,1687,9.4,4,f,0.0,0.0,1008.71
2,0,2956,98.5,10,f,0.0,0.0,3003.65
10,0,5598,21.0,37,f,0.0,0.0,12226.3
9,0,7963,28.3,12,f,0.0,0.0,5631.38
4,0,8221,87.5,4,f,0.0,0.0,3938.14
11,0,5210,32.6,13,f,0.0,0.0,646.02
2,0,9279,36.5,3,f,0.0,0.0,1476.19
14,0,4032,20.6,23,f,0.0,0.0,7677.52
12,0,23336,67.1,34,f,0.0,0.0,13943.1

total_pymnt_inv,...
5831.78,...
1008.71,...
3003.65,...
12226.3,...
5631.38,...
3938.14,...
646.02,...
1469.34,...
7677.52,...
13943.1,...


Here, we should see that we have some feature columns that have to do with grade of the loan, annual income, home ownership status, etc.

## Exploring the target column

The target column (label column) of the dataset that we are interested in is called **bad_loans**. In this column, 1 means a risky (bad) loan and 0 means a safe loan.

In order to make this more intuitive and consistent with the lectures, we reassign the target to be:

- +1 as a safe loan
- -1 as a risky (bad) loan

We put this in a new column called **safe_loans**.

In [61]:
# safe_loans =  1 => safe
# safe_loans = -1 => risky
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
loans.remove_columns(['bad_loans'])


Now, let us explore the distribution of the column **safe_loans**. This gives us a sense of how many safe and risky loans are present in the dataset. Print out the percentage of safe loans and risky loans in the data frame.

In [62]:
safe_loans=len(loans[loans['safe_loans']==+1])
risky_loans=len(loans[loans['safe_loans']==-1])
print "safe loans percent:", float(safe_loans)/len(loans)
print "risk loans percent: ", float(risky_loans)/len(loans)

safe loans percent: 0.811185331996
risk loans percent:  0.188814668004


You should have:

- Around 81% safe loans
- Around 19% risky loans

It looks like most of these loans are safe loans (thankfully). But this does make our problem of identifying risky loans challenging.



## Features for the classification algorithm

In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are described in the code comments below. If you are a finance geek, the LendingClub website has a lot more details about these features. Extract these feature columns and target column from the dataset. We will only use these features.

In [63]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

target = 'safe_loans'                    # prediction target (y) (+1 means safe, -1 is risky)

# Extract the feature columns and target column
loans = loans[features + [target]]

What remains now is a subset of features and the target that we will use for the rest of this notebook.

## Sample data to balance classes

As we explored above, our data is disproportionally full of safe loans. Let's create two datasets: one with just the safe loans (**safe_loans_raw**) and one with just the risky loans (**risky_loans_raw**).

In [64]:
safe_loans_raw = loans[loans[target] == +1]
risky_loans_raw = loans[loans[target] == -1]
print "Number of safe loans  : %s" % len(safe_loans_raw)
print "Number of risky loans : %s" % len(risky_loans_raw)

Number of safe loans  : 99457
Number of risky loans : 23150


*One way to combat class imbalance is to undersample the larger class until the class distribution is approximately half and half*. Here, we will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. We used seed=1 so everyone gets the same results.



In [65]:
# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))

risky_loans = risky_loans_raw
safe_loans = safe_loans_raw.sample(percentage, seed=1)

# Append the risky_loans with the downsampled version of safe_loans
loans_data = risky_loans.append(safe_loans)

You can verify now that loans_data is comprised of approximately 50% safe loans and 50% risky loans.
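The undersampling idea can also be sketched without SFrame. The block below is a minimal plain-Python illustration, not the SFrame implementation: the helper `undersample` and the toy labels are hypothetical, and it assumes the sampler keeps each majority-class row independently with the given probability, which would be consistent with the two class counts coming out close but not identical.

```python
import random

def undersample(rows, label_of, majority_label, seed=1):
    """Keep every minority row; keep each majority row with probability p."""
    majority = [r for r in rows if label_of(r) == majority_label]
    minority = [r for r in rows if label_of(r) != majority_label]
    # Same ratio as `percentage` in the notebook: minority size / majority size.
    p = len(minority) / float(len(majority))
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    kept = [r for r in majority if rng.random() < p]
    return minority + kept

# Toy data (hypothetical): 800 safe (+1) labels and 200 risky (-1) labels.
toy = [+1] * 800 + [-1] * 200
balanced = undersample(toy, label_of=lambda y: y, majority_label=+1)
safe_fraction = balanced.count(+1) / float(len(balanced))
print("rows kept: %d, safe fraction: %.2f" % (len(balanced), safe_fraction))
```

Because each row is kept at random, the balance is approximate rather than exact.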

**Note**: There are many approaches for dealing with imbalanced data, including some where we modify the learning algorithm. These approaches are beyond the scope of this course, but some of them are reviewed in this [paper](https://eventing.coursera.org/api/redirectStrict/hVNup67UWWWN3D8M2sPTpGyYmh0cmrCwo2hbqnaVGSDU487mZx6UNmyiXny9R7kRQBIH5JM-oQH67SoE7VwNYg.08XvO4C7hxWIH-4B8M-aOQ.7WnhJw__PLz0AmF-yjc1svX0QCUKEnzzNCiIvAickSfnCkxyn4W6pkqQtA_LZFU-RBvBpJBZly3gdPYmjkmmCGw1SJjA6OhUbGSPOgOpqyD_orrcJZ4tLU3mrnQ4BN7ujW3D37tm-IHKjLi3Ia3uTS6r1W7Zv86awdsjKAoncyDQY08MnhJy8WobT5DaZa8arl_w8BGp18MvD01ssz9GpNC4-kpJL23DSfLlZ4bm2QHpUdCoV2LpUJfp4XfPQcZ0jx4MDhcMCQbm3s1pYXoIFawXCdXpRbcFvUmhPvgptwwOGNhlSIPOaTXCAeedZepgoXbfs5RNRwOJacNUBvUU6YeC5Z3lJOgLImviQ6gc0ICc4YuxBZyBXyjMDTDKoOHlm9F_u4V40pQFa2cF1xWsZcTheJBKfrkYURBlEbciBKv0zp8BXO5eC7KZYlIPRryG9tQcpD4BCmczRw0Nkh00O9c4lPZtJaTqL_gN2N8iBtbChcRvYjbJBzmJSBv6jZaBoMC1zF35I-QsuFBI4j1wPWP0h77gGZcjbFK_hTj5ERzG9kkrUFTRszs9eW7FVDC9IDhKNEys1MGNb0QGZuxB7qDZp_392vu3aDRE8xxjsNoAOmu39t8iIZ3u7wmVm1vb). For this assignment, we use the simplest possible approach, where we subsample the overly represented class to get a more balanced dataset. In general, and especially when the data is highly imbalanced, we recommend using more advanced methods.



In [66]:
print len(safe_loans), len(risky_loans)

23358 23150


## One-hot encoding

scikit-learn's decision tree implementation requires numerical values for its data matrix. This means you will have to *turn categorical variables into binary features* via **one-hot encoding**. The next assignment has more details about this.

If you are using SFrame, feel free to use this piece of code as is. Refer to the SFrame API documentation for a deeper understanding. If you are using different machine learning software, make sure you prepare the data to be passed to the learning software.
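For readers using other software, the transformation itself can be sketched in plain Python. This is an illustrative sketch only: the helper `one_hot_encode` and the toy rows are hypothetical, and the `column.category` naming is chosen to mirror the column names that appear later in this notebook (for example `grade.A`).

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted(set(row[column] for row in rows))
    encoded = []
    for row in rows:
        # Copy every column except the categorical one being replaced.
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            name = "%s.%s" % (column, category)
            new_row[name] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded

# Toy rows (hypothetical values).
toy_rows = [
    {"grade": "A", "dti": 11.2},
    {"grade": "C", "dti": 15.96},
    {"grade": "A", "dti": 7.9},
]
encoded_rows = one_hot_encode(toy_rows, "grade")
print(encoded_rows[1])
```

Each row ends up with exactly one of the new columns set to 1, which is what lets a numeric learner treat the category as a set of yes/no questions.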

In [67]:
loans_data = risky_loans.append(safe_loans)

categorical_variables = []
for feat_name, feat_type in zip(loans_data.column_names(), loans_data.column_types()):
    if feat_type == str:
        categorical_variables.append(feat_name)

for feature in categorical_variables:
    loans_data_one_hot_encoded = loans_data[feature].apply(lambda x: {x: 1})
    loans_data_unpacked = loans_data_one_hot_encoded.unpack(column_name_prefix=feature)

    # Change None's to 0's
    for column in loans_data_unpacked.column_names():
        loans_data_unpacked[column] = loans_data_unpacked[column].fillna(0)

    loans_data.remove_column(feature)
    loans_data.add_columns(loans_data_unpacked)
    


## Split data into training and validation

We split the data into training and validation sets using an 80/20 split and specifying seed=1 so everyone gets the same results. Call the training and validation sets **train_data** and **validation_data**, respectively.

**Note**: In previous assignments, we have called this a train-test split. However, the portion of data that we don't train on will be used to help **select model parameters** (this is known as model selection). Thus, this portion of data should be called a **validation set**. 

Recall that examining performance of various potential models (i.e. models with different parameters) should be on validation set, while evaluation of the final selected model should always be on test data. Typically, we would also save a portion of the data (a real test set) to test our final model on or use cross-validation on the training set to select our final model. But for the learning purposes of this assignment, we won't do that.
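A seeded split can be sketched in plain Python as follows. This is a minimal sketch under the assumption that each row is assigned to the first part independently with probability 0.8; the helper `random_split` here is a hypothetical stand-in, not the SFrame method of the same name.

```python
import random

def random_split(rows, fraction, seed=1):
    """Send each row to the first part with probability `fraction`."""
    rng = random.Random(seed)  # same seed gives the same split every time
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

# Toy data (hypothetical): 1000 row identifiers.
toy_rows = list(range(1000))
train_rows, validation_rows = random_split(toy_rows, 0.8, seed=1)
print("train: %d, validation: %d" % (len(train_rows), len(validation_rows)))
```

Fixing the seed is what makes results comparable across runs; the split sizes are close to 80/20 but not exact.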



In [74]:
train_data, validation_data = loans_data.random_split(.8, seed=1)

In [75]:
print train_data.shape, validation_data.shape

(37224, 68) (9284, 68)


## Build a decision tree classifier

Now, let's use the built-in scikit learn decision tree learner ([sklearn.tree.DecisionTreeClassifier](https://eventing.coursera.org/api/redirectStrict/NVZnc1g0zc28UHyPaI7xb865W0umciNnDenzMo3WnRZG6tecl3zjgLTGMkPwXK3zmRdHEXVZuU_NmIlRxq6M0A.2ZMbprPTgXI3EO0PxwJ2iA.eOdCj-UDFuZzPtiw9zxlQARoIVGQRXQ35nCGFbK0RgdaNA5C6-WLo5KI93dyPFtwD-qsru12KjlpNqd88DK5OGLGdcqTjS5q3Ax707Cl_4Aar5YbOc3t2BBZ0BRJI-ti35gvNqG8ZQhQ70paUWeHe91Ccw8LXV4GDt0ff9XSRncDx4A4S7HZnCUM_3-XHNv9G9RTcuaASqfdNkSUyJkQDVoWsXWe2waaA08Vs--X2RRBrfKwrYDClbLzhAYEccgqXzsIkURhJ7tm5BPvJuJTCr39puNIyHhvQjSVkIuZpzvZDNV_N1Js-4rNMPhYC8C7euQNb96KG9YDu7doqBcIwz7ln4hqO1xB_ww-7r-ElP_QnJgwl9NQqLip5IKRcX56sKLPrrGDjlrCCkSk98Ptpz0y1quMJuXSnXokjmE6nreJ_cBZRPl2LXFMUjja2TGL)) to create a loan prediction model on the training data. To do this, you will need to import **sklearn**, **sklearn.tree**, and **numpy**.

Note: You will have to first convert the SFrame into a numpy data matrix, and extract the target labels as a numpy array (Hint: you can use the **.to_numpy()** method call on SFrame to turn SFrames into numpy arrays). See the [API](https://eventing.coursera.org/api/redirectStrict/NVZnc1g0zc28UHyPaI7xb865W0umciNnDenzMo3WnRZG6tecl3zjgLTGMkPwXK3zmRdHEXVZuU_NmIlRxq6M0A.2ZMbprPTgXI3EO0PxwJ2iA.eOdCj-UDFuZzPtiw9zxlQARoIVGQRXQ35nCGFbK0RgdaNA5C6-WLo5KI93dyPFtwD-qsru12KjlpNqd88DK5OGLGdcqTjS5q3Ax707Cl_4Aar5YbOc3t2BBZ0BRJI-ti35gvNqG8ZQhQ70paUWeHe91Ccw8LXV4GDt0ff9XSRncDx4A4S7HZnCUM_3-XHNv9G9RTcuaASqfdNkSUyJkQDVoWsXWe2waaA08Vs--X2RRBrfKwrYDClbLzhAYEccgqXzsIkURhJ7tm5BPvJuJTCr39puNIyHhvQjSVkIuZpzvZDNV_N1Js-4rNMPhYC8C7euQNb96KG9YDu7doqBcIwz7ln4hqO1xB_ww-7r-ElP_QnJgwl9NQqLip5IKRcX56sKLPrrGDjlrCCkSk98Ptpz0y1quMJuXSnXokjmE6nreJ_cBZRPl2LXFMUjja2TGL) for more information. Make sure to set **max_depth=6**.

Call this model **decision_tree_model**.

In [76]:
feature_names = train_data.column_names()
feature_names.remove(target)

target_values_train_data = train_data[target].to_numpy()
feature_matrix_train_data = train_data.select_columns(feature_names).to_numpy()

feature_matrix_train_data.shape

(37224, 67)

In [77]:
import sklearn
import sklearn.tree
import numpy
decision_tree_model = sklearn.tree.DecisionTreeClassifier(max_depth=6)
decision_tree_model.fit(feature_matrix_train_data, target_values_train_data)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

Also train a tree with **max_depth=2**. Call this model **small_model**.

In [78]:
small_model = sklearn.tree.DecisionTreeClassifier(max_depth=2)
small_model.fit(feature_matrix_train_data, target_values_train_data)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [79]:
print feature_matrix_train_data.shape

(37224, 67)


## Visualizing a learned model (Optional)

For this optional section, we would like to see what the small learned tree looks like. If you are using scikit-learn and have the **Graphviz** package, you will be able to complete this section. If you are using different software, try your best to follow along.

Visualize **small_model** in the software of your choice.

## Making predictions

Let's consider two positive and two negative examples from the **validation set** and see what the model predicts. We will do the following:

- Predict whether or not a loan is safe.
- Predict the probability that a loan is safe.

First, let's grab 2 positive examples and 2 negative examples. In SFrame, that would be:

In [80]:
validation_safe_loans = validation_data[validation_data[target] == 1]
validation_risky_loans = validation_data[validation_data[target] == -1]

sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]

sample_validation_data = sample_validation_data_safe.append(sample_validation_data_risky)
sample_validation_data

short_emp,emp_length_num,dti,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans
0,11,11.18,1,1,82.4,0.0,1
0,10,16.85,1,1,96.4,0.0,1
0,3,13.97,0,1,59.5,0.0,-1
0,11,16.33,1,1,62.1,0.0,-1

grade.A,grade.B,grade.C,grade.D,grade.E,grade.F,grade.G,sub_grade.A1,sub_grade.A2,sub_grade.A3,sub_grade.A4
0,1,0,0,0,0,0,0,0,0,0
0,0,0,1,0,0,0,0,0,0,0
0,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0

sub_grade.A5,sub_grade.B1,sub_grade.B2,sub_grade.B3,sub_grade.B4,sub_grade.B5,sub_grade.C1,sub_grade.C2
0,0,0,1,0,0,0,0
0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0

sub_grade.C3,sub_grade.C4,sub_grade.C5,sub_grade.D1,sub_grade.D2,sub_grade.D3,sub_grade.D4,sub_grade.D5
0,0,0,0,0,0,0,0
0,0,0,1,0,0,0,0
0,0,0,0,1,0,0,0
0,0,0,0,0,0,0,0

sub_grade.E1,sub_grade.E2,sub_grade.E3,sub_grade.E4,sub_grade.E5,...
0,0,0,0,0,...
0,0,0,0,0,...
0,0,0,0,0,...
0,0,0,0,0,...


Now, we will use our model to predict whether or not a loan is likely to default. For each row in the **sample_validation_data**, use the **decision_tree_model** to predict whether or not the loan is classified as a safe loan. (Hint: if you are using scikit-learn, you can use the **.predict()** method)

**Quiz Question**: What percentage of the predictions on **sample_validation_data** did **decision_tree_model** get correct?

In [81]:
sample_validation_feature_matrix = sample_validation_data.select_columns(feature_names).to_numpy()
print "Predictions: " ,decision_tree_model.predict(sample_validation_feature_matrix)
print "Actual:   ", sample_validation_data[target]

Predictions:  [ 1 -1 -1  1]
Actual:    [1, 1, -1, -1]


## Explore probability predictions

For each row in the **sample_validation_data**, what is the probability (according to **decision_tree_model**) of a loan being classified as safe? (Hint: if you are using scikit-learn, you can use the **.predict_proba()** method)

**Quiz Question**: Which loan has the highest probability of being classified as a safe loan?

**Checkpoint**: Can you verify that for all the predictions with probability >= 0.5, the model predicted the label +1?
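The checkpoint can be checked mechanically. The sketch below copies the probabilities and predicted labels from this notebook's outputs and assumes the probability columns follow the order of the model's `classes_` attribute, which for labels -1 and +1 would be `[-1, +1]`; the variable names are illustrative.

```python
# Values copied from the predict_proba and predict outputs in this notebook.
probabilities = [
    [0.34156543, 0.65843457],
    [0.53630646, 0.46369354],
    [0.64750958, 0.35249042],
    [0.20789474, 0.79210526],
]
predicted_labels = [1, -1, -1, 1]

# Assumed column order (decision_tree_model.classes_): risky first, safe second.
classes = [-1, +1]
safe_column = classes.index(+1)

# A loan is labelled +1 exactly when its probability of being safe is >= 0.5.
implied_labels = [+1 if row[safe_column] >= 0.5 else -1 for row in probabilities]
print(implied_labels == predicted_labels)

# Index of the loan with the highest probability of being safe.
safest_index = max(range(len(probabilities)),
                   key=lambda i: probabilities[i][safe_column])
print(safest_index)
```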

In [82]:
decision_tree_model.predict_proba(sample_validation_feature_matrix)

array([[ 0.34156543,  0.65843457],
       [ 0.53630646,  0.46369354],
       [ 0.64750958,  0.35249042],
       [ 0.20789474,  0.79210526]])

## Tricky predictions!

Now, we will explore something pretty interesting. For each row in the **sample_validation_data**, what is the probability (according to **small_model**) of a loan being classified as safe?

**Quiz Question**: Notice that the probability predictions are exactly the same for the 2nd and 3rd loans. Why would this happen?
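One way to reason about this is that a depth-2 tree asks at most two questions, so it has at most four leaves, and every loan that lands in the same leaf receives that leaf's stored probability. The sketch below uses a hand-written, hypothetical tree (its split features, thresholds, and leaf values are made up, not read from **small_model**) applied to two grade D loans whose `dti` and `emp_length_num` values come from the sample rows above.

```python
def tiny_tree_p_safe(loan):
    """Hypothetical depth-2 tree: at most two tests, so at most four leaves."""
    if loan["grade.A"] == 1:
        # Leaf values are made-up fractions of safe training loans.
        return 0.77 if loan["dti"] <= 20.0 else 0.60
    return 0.58 if loan["grade.B"] == 1 else 0.41

# Two loans that differ in several features but are both grade D,
# so neither of the tree's tests can tell them apart.
loan_2 = {"grade.A": 0, "grade.B": 0, "dti": 16.85, "emp_length_num": 10}
loan_3 = {"grade.A": 0, "grade.B": 0, "dti": 13.97, "emp_length_num": 3}

p_2 = tiny_tree_p_safe(loan_2)
p_3 = tiny_tree_p_safe(loan_3)
print("loan 2: %.2f, loan 3: %.2f" % (p_2, p_3))
```

The features the tree never tests (here `emp_length_num`, and `dti` on this branch) have no effect on the output, however much they differ.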

In [83]:
small_model.predict_proba(sample_validation_feature_matrix)

array([[ 0.41896585,  0.58103415],
       [ 0.59255339,  0.40744661],
       [ 0.59255339,  0.40744661],
       [ 0.23120112,  0.76879888]])

## Visualize the prediction on a tree

Note that you should be able to look at the small tree (of depth 2), traverse it yourself, and visualize the prediction being made. 
Consider the following point in the **sample_validation_data**:

In [84]:
sample_validation_data[1]

{'dti': 16.85,
 'emp_length_num': 10,
 'grade.A': 0,
 'grade.B': 0,
 'grade.C': 0,
 'grade.D': 1,
 'grade.E': 0,
 'grade.F': 0,
 'grade.G': 0,
 'home_ownership.MORTGAGE': 0,
 'home_ownership.OTHER': 0,
 'home_ownership.OWN': 0,
 'home_ownership.RENT': 1,
 'last_delinq_none': 1,
 'last_major_derog_none': 1,
 'purpose.car': 0,
 'purpose.credit_card': 0,
 'purpose.debt_consolidation': 1,
 'purpose.home_improvement': 0,
 'purpose.house': 0,
 'purpose.major_purchase': 0,
 'purpose.medical': 0,
 'purpose.moving': 0,
 'purpose.other': 0,
 'purpose.small_business': 0,
 'purpose.vacation': 0,
 'purpose.wedding': 0,
 'revol_util': 96.4,
 'safe_loans': 1,
 'short_emp': 0,
 'sub_grade.A1': 0,
 'sub_grade.A2': 0,
 'sub_grade.A3': 0,
 'sub_grade.A4': 0,
 'sub_grade.A5': 0,
 'sub_grade.B1': 0,
 'sub_grade.B2': 0,
 'sub_grade.B3': 0,
 'sub_grade.B4': 0,
 'sub_grade.B5': 0,
 'sub_grade.C1': 0,
 'sub_grade.C2': 0,
 'sub_grade.C3': 0,
 'sub_grade.C4': 0,
 'sub_grade.C5': 0,
 'sub_grade.D1': 1,
 'sub_grad

If you have Graphviz, go ahead and re-visualize **small_model** here to do the traversing for this data point.

In [88]:
sklearn.tree.export_graphviz(small_model, "small_model.dot")

In [89]:
%%bash
dot -Tpng small_model.dot -o small_model.png

In [90]:
#from IPython.display import Image
#Image("small_model.png")

![small_model_decision_tree](small_model.png)

**Quiz Question**: Based on the visualized tree, what prediction would you make for this data point (according to **small_model**)? (If you don't have Graphviz, you can answer this quiz question by executing the next part.)

Now, verify your prediction by examining the prediction made using **small_model**.

In [94]:
print sample_validation_feature_matrix[1]
small_model.predict(sample_validation_feature_matrix)

[  0.    10.    16.85   1.     1.    96.4    0.     0.     0.     0.     1.
   0.     0.     0.     0.     0.     0.     0.     0.     0.     0.     0.
   0.     0.     0.     0.     0.     0.     0.     1.     0.     0.     0.
   0.     0.     0.     0.     0.     0.     0.     0.     0.     0.     0.
   0.     0.     0.     0.     0.     0.     0.     0.     1.     0.     0.
   1.     0.     0.     0.     0.     0.     0.     0.     0.     0.     1.
   0.  ]


array([ 1, -1, -1,  1])

## Evaluating accuracy of the decision tree model

Recall that the accuracy is defined as follows:

$$
\text{accuracy} = \frac{\text{number of correctly classified examples}}{\text{total number of examples}}
$$

Evaluate the accuracy of **small_model** and **decision_tree_model** on the training data. (Hint: if you are using scikit-learn, you can use the **.score()** method)

**Checkpoint**: You should see that the **small_model** performs worse than the **decision_tree_model** on the training data.
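The accuracy definition above is short enough to write out by hand. The following sketch uses a hypothetical helper `accuracy` and applies it to the four sample predictions and labels printed earlier in this notebook; for a scikit-learn classifier, **.score()** reports this same mean accuracy.

```python
def accuracy(actual, predicted):
    """Fraction of examples whose predicted label equals the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual))

# Labels and predictions for the four rows of sample_validation_data.
actual = [1, 1, -1, -1]
predicted = [1, -1, -1, 1]
sample_accuracy = accuracy(actual, predicted)
print(sample_accuracy)  # 2 of 4 correct -> 0.5
```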

In [95]:
small_model.score(feature_matrix_train_data, target_values_train_data)

0.61350204169353106

In [96]:
decision_tree_model.score(feature_matrix_train_data, target_values_train_data)

0.64052761659144641

Now, evaluate the accuracy of the **small_model** and **decision_tree_model** on the entire **validation_data**, not just the subsample considered above.

**Quiz Question**: What is the accuracy of **decision_tree_model** on the **validation set**, rounded to the nearest .01?



In [99]:
feature_matrix_validation_data = validation_data.select_columns(feature_names).to_numpy()
target_values_validation_data = validation_data[target].to_numpy()

print "decision_tree_model scores: %.2f on train set" % decision_tree_model.score(feature_matrix_train_data, target_values_train_data)
print "decision_tree_model scores: %.2f on validation set" % decision_tree_model.score(feature_matrix_validation_data, target_values_validation_data)

print "small_model scores: %.2f on train set" % small_model.score(feature_matrix_train_data, target_values_train_data)
print "small_model scores: %.2f on validation set" % small_model.score(feature_matrix_validation_data, target_values_validation_data)

decision_tree_model scores: 0.64 on train set
decision_tree_model scores: 0.64 on validation set
small_model scores: 0.61 on train set
small_model scores: 0.62 on validation set


## Evaluating accuracy of a complex decision tree model

Here, we will train a large decision tree with **max_depth=10**. This will allow the learned tree to become very deep, and result in a very complex model. Recall from lecture that, among models with similar predictive power, we prefer the simpler one. This will be an example of a more complicated model which has similar predictive power, i.e. something we don't want.

Using **sklearn.tree.DecisionTreeClassifier**, train a decision tree with **maximum depth = 10**. Call this model **big_model**.

Evaluate the accuracy of **big_model** on the **training set** and **validation set**.

**Checkpoint**: We should see that **big_model** has even better performance on the training set than **decision_tree_model** did on the training set.

**Quiz Question**: How does the performance of **big_model** on the validation set compare to **decision_tree_model** on the validation set? Is this a sign of overfitting?
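A simple way to frame the comparison is to look at the gap between training and validation accuracy for each model. The sketch below uses the rounded accuracies that this notebook reports for the three models; the variable names are illustrative, and with scores rounded to two decimals a gap this small is suggestive rather than conclusive.

```python
# (train accuracy, validation accuracy), rounded values from this notebook's outputs.
scores = {
    "small_model": (0.61, 0.62),
    "decision_tree_model": (0.64, 0.64),
    "big_model": (0.66, 0.63),
}

# A positive gap means the model does better on data it was trained on.
gaps = {name: round(train - validation, 2)
        for name, (train, validation) in scores.items()}

for name in sorted(gaps, key=gaps.get):
    print("%-20s train - validation = %+.2f" % (name, gaps[name]))

widest = max(gaps, key=gaps.get)
print("widest gap: %s" % widest)
```

A model whose training accuracy rises while its validation accuracy stalls or falls is showing the usual pattern associated with overfitting.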

In [100]:
big_model = sklearn.tree.DecisionTreeClassifier(max_depth=10)
big_model.fit(feature_matrix_train_data, target_values_train_data)

print "big_model scores: %.2f on train set" % big_model.score(feature_matrix_train_data, target_values_train_data)
print "big_model scores: %.2f on validation set" % big_model.score(feature_matrix_validation_data, target_values_validation_data)

big_model scores: 0.66 on train set
big_model scores: 0.63 on validation set


## Quantifying the cost of mistakes

Every mistake the model makes costs money. In this section, we will try to quantify the cost of each mistake made by the model. Assume the following:

- **False negatives**: Loans that were actually safe but were predicted to be risky. This results in an opportunity cost of losing a loan that would have otherwise been accepted.
- **False positives**: Loans that were actually risky but were predicted to be safe. These are much more expensive because they result in a risky loan being given.
- **Correct predictions**: All correct predictions don't typically incur any cost.

Let's write code that can compute the cost of mistakes made by the model. Complete the following 4 steps:

- First, let us compute the predictions made by the model.
- Second, compute the number of false positives.
- Third, compute the number of false negatives.
- Finally, compute the cost of mistakes made by the model by adding up the costs of false negatives and false positives.

**Quiz Question**: Let's assume that each mistake costs us money: a false negative costs \$10,000, while a false positive costs \$20,000. What is the total cost of mistakes made by **decision_tree_model** on **validation_data**?
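The four steps can be sketched in plain Python on a small example before running them on the full validation set. The helper `cost_of_mistakes` is hypothetical; the labels and predictions are the four sample rows used earlier, and the costs are the ones stated in the question.

```python
def cost_of_mistakes(actual, predicted, fn_cost=10000, fp_cost=20000):
    """Total cost, with +1 meaning safe and -1 meaning risky."""
    pairs = list(zip(actual, predicted))
    # False negative: a safe loan (+1) predicted as risky (-1).
    false_negatives = sum(1 for a, p in pairs if a == +1 and p == -1)
    # False positive: a risky loan (-1) predicted as safe (+1).
    false_positives = sum(1 for a, p in pairs if a == -1 and p == +1)
    return false_negatives * fn_cost + false_positives * fp_cost

actual = [1, 1, -1, -1]
predicted = [1, -1, -1, 1]
sample_cost = cost_of_mistakes(actual, predicted)
print(sample_cost)  # 1 false negative + 1 false positive -> 30000
```

Correct predictions contribute nothing, so only the two kinds of mistakes need to be counted.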


In [101]:
import numpy as np
predictions = decision_tree_model.predict(feature_matrix_validation_data)

delta = target_values_validation_data - predictions
false_negatives = (delta == 2)
false_positives = (delta == -2)
correct_predictions = (delta == 0)

false_negatives_count = np.count_nonzero(false_negatives)
false_positives_count = np.count_nonzero(false_positives)
correct_predictions_count = np.count_nonzero(correct_predictions)

assert(false_negatives_count + false_positives_count + correct_predictions_count 
       == feature_matrix_validation_data.shape[0])

In [102]:
cost = false_negatives_count * 10000 + false_positives_count * 20000
print cost

50370000
