# Decision Tree

In this notebook, we explore the theory and application of the [*Decision Tree*](https://en.wikipedia.org/wiki/Decision_tree) algorithm to classify loan applications as *safe* or *risky*. We use [Turicreate](https://github.com/apple/turicreate) together with our own functions for this task.

A decision tree is a classification method that builds a sequence of *if* conditions on the input features, choosing the combination of conditions that gives the lowest error on the training set.
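
To make the idea of "a sequence of *if* conditions" concrete, here is a minimal hand-written sketch. The feature names (`grade`, `dti`, `term`) are columns of the loan dataset, but the split values, the helper names `predict_loan` and `classification_error`, and the four toy rows are assumptions chosen only for illustration. A learned tree would pick its own conditions.

```python
# Hypothetical hand-written tree: the split values are made up for illustration.
def predict_loan(loan):
    """Return +1 (safe) or -1 (risky) for a dict describing one loan."""
    if loan['grade'] in ('A', 'B'):
        if loan['dti'] < 20.0:
            return +1
        return -1
    if loan['term'] == '36 months':
        return +1
    return -1


def classification_error(rows, labels):
    """Fraction of rows whose predicted label differs from the true label."""
    mistakes = sum(1 for row, y in zip(rows, labels) if predict_loan(row) != y)
    return mistakes / float(len(rows))


# Toy rows, only to exercise the rules above.
toy_rows = [
    {'grade': 'B', 'dti': 27.65, 'term': '36 months'},
    {'grade': 'C', 'dti': 1.0, 'term': '60 months'},
    {'grade': 'A', 'dti': 11.2, 'term': '36 months'},
    {'grade': 'F', 'dti': 5.55, 'term': '60 months'},
]
toy_labels = [+1, -1, +1, -1]

print(classification_error(toy_rows, toy_labels))
```

Training a decision tree amounts to searching for the conditions that make this error as small as possible on the training set.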

First, import the necessary packages:

In [1]:
import turicreate as tc
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline



The dataset for this tutorial is *loan_data.csv*, which contains information about the customers applying for a loan and whether it was approved. Here, we will use the **decision tree** algorithm to predict whether a specific customer can have the loan approved.

In [2]:
data = tc.SFrame("../Data/loan_data.csv")

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,int,int,int,str,float,float,str,str,str,str,str,int,str,str,str,str,str,str,str,str,str,str,float,int,str,int,int,int,int,int,int,float,int,str,float,float,float,float,float,float,float,float,float,str,float,str,str,int,str,int,int,str,int,int,int,int,float,int,int,int,int,float,str,int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [3]:
data

id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade
1077501,1296599,5000,5000,4975,36 months,10.65,162.87,B,B2
1077430,1314167,2500,2500,2500,60 months,15.27,59.83,C,C4
1077175,1313524,2400,2400,2400,36 months,15.96,84.33,C,C5
1076863,1277178,10000,10000,10000,36 months,13.49,339.31,C,C1
1075269,1311441,5000,5000,5000,36 months,7.9,156.46,A,A4
1072053,1288686,3000,3000,3000,36 months,18.64,109.43,E,E1
1071795,1306957,5600,5600,5600,60 months,21.28,152.39,F,F2
1071570,1306721,5375,5375,5350,60 months,12.69,121.45,B,B5
1070078,1305201,6500,6500,6500,60 months,14.65,153.45,C,C3
1069908,1305008,12000,12000,12000,36 months,12.69,402.54,B,B5

emp_title,emp_length,home_ownership,annual_inc,is_inc_v,issue_d,loan_status,pymnt_plan
,10+ years,RENT,24000,Verified,20111201T000000,Fully Paid,n
Ryder,< 1 year,RENT,30000,Source Verified,20111201T000000,Charged Off,n
,10+ years,RENT,12252,Not Verified,20111201T000000,Fully Paid,n
AIR RESOURCES BOARD,10+ years,RENT,49200,Source Verified,20111201T000000,Fully Paid,n
Veolia Transportaton,3 years,RENT,36000,Source Verified,20111201T000000,Fully Paid,n
MKC Accounting,9 years,RENT,48000,Source Verified,20111201T000000,Fully Paid,n
,4 years,OWN,40000,Source Verified,20111201T000000,Charged Off,n
Starbucks,< 1 year,RENT,15000,Verified,20111201T000000,Charged Off,n
Southwest Rural metro,5 years,OWN,72000,Not Verified,20111201T000000,Fully Paid,n
UCLA,10+ years,OWN,75000,Source Verified,20111201T000000,Fully Paid,n

url,desc,purpose,title,zip_code
https://www.lendingclub.c om/browse/loanDetail. ...,Borrower added on 12/22/11 > I need to ...,credit_card,Computer,860xx
https://www.lendingclub.c om/browse/loanDetail. ...,Borrower added on 12/22/11 > I plan to use ...,car,bike,309xx
https://www.lendingclub.c om/browse/loanDetail. ...,,small_business,real estate business,606xx
https://www.lendingclub.c om/browse/loanDetail. ...,Borrower added on 12/21/11 > to pay for ...,other,personel,917xx
https://www.lendingclub.c om/browse/loanDetail. ...,,wedding,My wedding loan I promise to pay back ...,852xx
https://www.lendingclub.c om/browse/loanDetail. ...,Borrower added on 12/16/11 > Downpayment ...,car,Car Downpayment,900xx
https://www.lendingclub.c om/browse/loanDetail. ...,Borrower added on 12/21/11 > I own a small ...,small_business,Expand Business & Buy Debt Portfolio ...,958xx
https://www.lendingclub.c om/browse/loanDetail. ...,Borrower added on 12/16/11 > I'm trying to ...,other,Building my credit history. ...,774xx
https://www.lendingclub.c om/browse/loanDetail. ...,Borrower added on 12/15/11 > I had recived ...,debt_consolidation,High intrest Consolidation ...,853xx
https://www.lendingclub.c om/browse/loanDetail. ...,,debt_consolidation,Consolidation,913xx

addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record
AZ,27.65,0,19850101T000000,1,,
GA,1.0,0,19990401T000000,5,,
IL,8.72,0,20011101T000000,2,,
CA,20.0,0,19960201T000000,1,35.0,
AZ,11.2,0,20041101T000000,3,,
CA,5.35,0,20070101T000000,2,,
CA,5.55,0,20040401T000000,2,,
TX,18.08,0,20040901T000000,0,,
AZ,16.12,0,19980101T000000,2,,
CA,10.78,0,19891001T000000,0,,

open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt
3,0,13648,83.7,9,f,0.0,0.0,5861.07
3,0,1687,9.4,4,f,0.0,0.0,1008.71
2,0,2956,98.5,10,f,0.0,0.0,3003.65
10,0,5598,21.0,37,f,0.0,0.0,12226.3
9,0,7963,28.3,12,f,0.0,0.0,5631.38
4,0,8221,87.5,4,f,0.0,0.0,3938.14
11,0,5210,32.6,13,f,0.0,0.0,646.02
2,0,9279,36.5,3,f,0.0,0.0,1476.19
14,0,4032,20.6,23,f,0.0,0.0,7677.52
12,0,23336,67.1,34,f,0.0,0.0,13943.1

total_pymnt_inv,...
5831.78,...
1008.71,...
3003.65,...
12226.3,...
5631.38,...
3938.14,...
646.02,...
1469.34,...
7677.52,...
13943.1,...


In [4]:
data.column_names()

['id',
 'member_id',
 'loan_amnt',
 'funded_amnt',
 'funded_amnt_inv',
 'term',
 'int_rate',
 'installment',
 'grade',
 'sub_grade',
 'emp_title',
 'emp_length',
 'home_ownership',
 'annual_inc',
 'is_inc_v',
 'issue_d',
 'loan_status',
 'pymnt_plan',
 'url',
 'desc',
 'purpose',
 'title',
 'zip_code',
 'addr_state',
 'dti',
 'delinq_2yrs',
 'earliest_cr_line',
 'inq_last_6mths',
 'mths_since_last_delinq',
 'mths_since_last_record',
 'open_acc',
 'pub_rec',
 'revol_bal',
 'revol_util',
 'total_acc',
 'initial_list_status',
 'out_prncp',
 'out_prncp_inv',
 'total_pymnt',
 'total_pymnt_inv',
 'total_rec_prncp',
 'total_rec_int',
 'total_rec_late_fee',
 'recoveries',
 'collection_recovery_fee',
 'last_pymnt_d',
 'last_pymnt_amnt',
 'next_pymnt_d',
 'last_credit_pull_d',
 'collections_12_mths_ex_med',
 'mths_since_last_major_derog',
 'policy_code',
 'not_compliant',
 'status',
 'inactive_loans',
 'bad_loans',
 'emp_length_num',
 'grade_num',
 'sub_grade_num',
 'delinq_2yrs_zero',
 'pub_rec

In [5]:
data['bad_loans'].head()

dtype: int
Rows: 10
[0, 1, 0, 0, 0, 0, 1, 1, 0, 0]

The target column *bad_loans* flags risky loans: it is 1 if the loan is risky and 0 otherwise. For this tutorial, we create the column *safe_loans*, which is +1 for a safe loan and -1 for a risky loan.

In [6]:
data['safe_loans'] = data['bad_loans'].apply(lambda x: 1 if x==0 else -1)
data = data.remove_column('bad_loans')

In [7]:
data['safe_loans']

dtype: int
Rows: 122607
[1, -1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, -1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, 1, -1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, -1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, ... ]

In [8]:
# Count the loans in each class
good = len(data[data['safe_loans'] == 1])
bad = len(data[data['safe_loans'] == -1])
total = len(data['safe_loans'])
# Convert the counts to percentages of the whole dataset
safe_perc = (good*1.0/total) * 100
risky_perc = (bad*1.0/total) * 100
print('Percentage of safe loans: %.2f' % safe_perc)
print('Percentage of risky loans: %.2f' % risky_perc)

Percentage of safe loans: 81.12
Percentage of risky loans: 18.88


Let's take a look at some of the features, which is very useful for understanding the dataset. We will check the distribution of some of them using the [.show()](https://apple.github.io/turicreate/docs/api/generated/turicreate.SFrame.show.html#turicreate.SFrame.show) function of Turicreate. Initially, it was only available on macOS, but it has since been updated to work on Linux as well. If it is not working for you, please update to the latest version of Turicreate by typing in your terminal:

> source venv/bin/activate

> pip install -U turicreate

In [9]:
data['grade'].show()

In [10]:
data['sub_grade'].show()

In [11]:
data['home_ownership'].show()

The [.explore()](https://apple.github.io/turicreate/docs/api/generated/turicreate.SFrame.explore.html#turicreate.SFrame.explore) option opens a new GUI window with the whole dataset, so you can explore all the features.

In [12]:
data.explore(title = 'Loans Dataset')

Now, let's select the target and some of the features to be used to create our model.

In [13]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

target = 'safe_loans'                   # prediction target (y) (+1 means safe, -1 is risky)

# Extract the feature columns and target column
data = data[features + [target]]
print(data)

+-------+-----------+-----------+----------------+----------------+-------+
| grade | sub_grade | short_emp | emp_length_num | home_ownership |  dti  |
+-------+-----------+-----------+----------------+----------------+-------+
|   B   |     B2    |     0     |       11       |      RENT      | 27.65 |
|   C   |     C4    |     1     |       1        |      RENT      |  1.0  |
|   C   |     C5    |     0     |       11       |      RENT      |  8.72 |
|   C   |     C1    |     0     |       11       |      RENT      |  20.0 |
|   A   |     A4    |     0     |       4        |      RENT      |  11.2 |
|   E   |     E1    |     0     |       10       |      RENT      |  5.35 |
|   F   |     F2    |     0     |       5        |      OWN       |  5.55 |
|   B   |     B5    |     1     |       1        |      RENT      | 18.08 |
|   C   |     C3    |     0     |       6        |      OWN       | 16.12 |
|   B   |     B5    |     0     |       11       |      OWN       | 10.78 |
+-------+---

Here we **removed** from the SFrame *data* all the columns we are not using, keeping only the selected features and the target.

As we saw earlier, the data is unbalanced (around 81% safe loans and 19% risky loans). With unbalanced classes, a classifier can reach high accuracy simply by favoring the majority class, which makes the results misleading. There are several ways to deal with this, including handling it inside the classifier itself. Here, we simply downsample the safe loans so that their number is roughly equal to the number of risky loans. For that, we need to create some new variables:

In [14]:
safe_loans_raw = data[data[target] == +1]
risky_loans_raw = data[data[target] == -1]
print("Number of safe loans  : %s" % len(safe_loans_raw))
print("Number of risky loans : %s" % len(risky_loans_raw))

Number of safe loans  : 99457
Number of risky loans : 23150
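
These counts also show why accuracy alone can mislead on unbalanced data. As a small worked check (plain Python, using only the two counts printed above), a trivial classifier that predicts "safe" for every loan would already reach the majority-class share as its accuracy. The same counts give the fraction of safe loans we need to keep for the two classes to be about the same size. The variable names are ours.

```python
# Class counts reported for this dataset.
n_safe = 99457
n_risky = 23150

# Accuracy of a trivial classifier that predicts "safe" for every loan.
baseline_accuracy = n_safe / float(n_safe + n_risky)
print('Majority-class baseline accuracy: %.4f' % baseline_accuracy)

# Fraction of safe loans to keep so both classes end up about the same size.
keep_fraction = n_risky / float(n_safe)
print('Fraction of safe loans to keep  : %.4f' % keep_fraction)
```

Any model trained on the unbalanced data has to beat that baseline before its accuracy means much, which is one motivation for balancing the classes first.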


In [15]:
# Ratio of risky loans to safe loans (a fraction between 0 and 1, despite the variable name)
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))
risky_loans = risky_loans_raw

# Sample the safe loans down to roughly the number of risky loans. seed=1 makes the results reproducible
safe_loans = safe_loans_raw.sample(percentage, seed=1)

# Append the risky loans with the downsampled version of safe loans
loans_data = risky_loans.append(safe_loans)

print("Percentage of safe loans                 : %s" % (len(safe_loans) / float(len(loans_data)) * 100))
print("Percentage of risky loans                : %s" % (len(risky_loans) / float(len(loans_data)) * 100))
print("Total number of loans in our new dataset : %s" % len(loans_data))

Percentage of safe loans                 : 50.2236174422
Percentage of risky loans                : 49.7763825578
Total number of loans in our new dataset : 46508
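
The classes are close to, but not exactly, 50/50. This is what we would expect if the sampling keeps each row at random, so that the number of rows returned is only approximately the fraction times the number of rows. The sketch below illustrates that behaviour in plain Python. It is an assumption-laden illustration, not Turicreate's implementation: `sample_rows` is a hypothetical helper, and its random numbers differ from Turicreate's, so the exact count will not match the one above.

```python
import random


def sample_rows(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


n_safe, n_risky = 99457, 23150
fraction = n_risky / float(n_safe)

# Stand-in for the safe loans: we only care about how many rows survive.
kept = sample_rows(range(n_safe), fraction, seed=1)
print('Safe loans kept: %d (target was about %d)' % (len(kept), n_risky))
```

A small imbalance of this size is harmless for our purposes.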


Now we have the *loans_data* that will be used in our analysis. The next step is to split the data into training and validation sets.

In [16]:
train_data, validation_data = loans_data.random_split(.8, seed=1)

### Creating the first model

We use the Turicreate [decision_tree_classifier](https://apple.github.io/turicreate/docs/api/generated/turicreate.decision_tree_classifier.create.html#turicreate.decision_tree_classifier.create) to generate the model. If we do not specify the *max_depth* (maximum depth of the tree) option, it is automatically set to **6**.

In [17]:
decision_tree_model = tc.decision_tree_classifier.create(train_data, validation_set=None,
                                                         target = target, features = features)

In [18]:
decision_tree_model

Class                          : DecisionTreeClassifier

Schema
------
Number of examples             : 37224
Number of feature columns      : 12
Number of unpacked features    : 12
Number of classes              : 2

Settings
--------
Number of trees                : 1
Max tree depth                 : 6
Training time (sec)            : 0.1203
Training accuracy              : 0.6406
Validation accuracy            : None
Training log_loss              : 0.6314
Validation log_loss            : None

Let's now make predictions on the validation set:

In [19]:
predictions = decision_tree_model.predict(validation_data)
results = decision_tree_model.evaluate(validation_data)

In [20]:
results['accuracy']

0.6367944851357173
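
Accuracy is the fraction of validation examples whose predicted label matches the true label. A minimal plain-Python sketch of that computation follows. The helper `accuracy` is ours (not part of Turicreate) and the labels are made up for illustration.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError('both sequences must have the same length')
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / float(len(true_labels))


# Made-up labels: 3 of the 5 predictions are correct.
y_true = [+1, -1, +1, +1, -1]
y_pred = [+1, +1, +1, -1, -1]
print(accuracy(y_true, y_pred))
```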

So, for this model, we got close to 64% accuracy. What if we increase the maximum depth of the model? Let's try $depth = 12$.

In [21]:
decision_tree_model_large = tc.decision_tree_classifier.create(train_data, validation_set=None, target = target, 
                                                               features = features, max_depth = 12)

In [22]:
predictions_large = decision_tree_model_large.predict(validation_data)
results_large = decision_tree_model_large.evaluate(validation_data)

In [23]:
results_large['accuracy']

0.6200990952175787

The accuracy on the validation set actually decreased. A deeper decision tree can fit the training data more closely without generalizing any better, which is over-fitting.
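
A simple guard against over-fitting is to choose the depth by validation accuracy rather than training accuracy. The sketch below does this with the two validation accuracies obtained above. `best_depth` is a hypothetical helper, and with only two depths tried this is an illustration rather than a proper search.

```python
# Validation accuracies reported in this tutorial, keyed by max_depth.
validation_accuracy = {6: 0.6367944851357173, 12: 0.6200990952175787}


def best_depth(accuracy_by_depth):
    """Return the depth with the highest accuracy; ties go to the shallower tree."""
    return min(accuracy_by_depth, key=lambda d: (-accuracy_by_depth[d], d))


print(best_depth(validation_accuracy))
```

Breaking ties in favor of the shallower tree reflects the usual preference for the simpler model when two models perform equally well.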

Now, let's evaluate the wrong predictions. Some predictions were +1 where the true label was -1 (false positives), and some were -1 where the true label was +1 (false negatives). Not all mistakes cost the same. In a medical dataset, for example when trying to identify whether a patient has cancer, a false negative could be considered more dangerous than a false positive. In our loan example, misclassification leads to a loss of money, and here a false positive (approving a loan for someone who will not pay it back) can cost more than a false negative (where only the interest is lost).

So, let's first count how many of the misclassifications were false positives (fp) and how many were false negatives (fn).

In [24]:
predictions = decision_tree_model.predict(validation_data)
validation_safe_loans = validation_data['safe_loans']
# False positive: the loan is risky (-1) but the model predicted safe (+1)
fp = ((validation_safe_loans == -1) & (predictions == 1)).sum()
# False negative: the loan is safe (+1) but the model predicted risky (-1)
fn = ((validation_safe_loans == 1) & (predictions == -1)).sum()
print(fp)
print(fn)

1656
1716
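
The same counting can be written in plain Python, which makes the two definitions explicit. The helper `count_errors` and the toy labels are ours, for illustration only. The second part is a consistency check on the numbers above: it assumes the validation set holds the 46508 balanced loans minus the 37224 training examples shown in the model summary.

```python
def count_errors(true_labels, predicted_labels):
    """Return (false_positives, false_negatives) for labels in {+1, -1}."""
    pairs = list(zip(true_labels, predicted_labels))
    fp = sum(1 for t, p in pairs if t == -1 and p == +1)  # risky predicted safe
    fn = sum(1 for t, p in pairs if t == +1 and p == -1)  # safe predicted risky
    return fp, fn


# Made-up labels to exercise the helper.
y_true = [+1, -1, +1, +1, -1, -1]
y_pred = [+1, +1, -1, +1, -1, +1]
print(count_errors(y_true, y_pred))

# Consistency check: every mistake is either a false positive or a false
# negative, so accuracy = 1 - (fp + fn) / n.
fp_reported, fn_reported = 1656, 1716
n_validation = 46508 - 37224  # assumed validation size (total minus training)
implied_accuracy = 1.0 - (fp_reported + fn_reported) / float(n_validation)
print(implied_accuracy)
```

The implied accuracy agrees with the value returned by `evaluate` for the first model.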


Now, let's assume the following cost for each case:

* Assume a cost of \$10,000 per false negative.
* Assume a cost of \$20,000 per false positive.

How much money was lost in the process?

In [25]:
# Total cost: $20,000 per false positive plus $10,000 per false negative
loss = fp*20000 + fn*10000
print(loss)

50280000


A deeper analysis is necessary to understand why we have so many misclassifications. Ideally, we would look at the features of the misclassified examples and try to figure out how to improve the model. However, we do not do that here, as this tutorial only aims to show how to use the ML tools.

# Creating our functions