# BT2101 Introduction to Decision Tree

## 1 Goal

In this notebook, we will explore **Decision Tree** including:
* User-defined functions
* Open-source package: `scikit-learn`

For the **Decision Tree** method, you will:
* Use numpy to write functions
* Write binary recursive splitting functions
* Write decision functions
* Write pruning functions
* Use open-source package to do classification

In [1]:
# -*- coding:utf-8 -*-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from math import sqrt, log
from __future__ import division
from collections import defaultdict
%matplotlib inline

## 2 Summary of Classification Tree

#### Classification Tree
A typical classification tree looks like this:
<img src="https://cdn-images-1.medium.com/max/750/1*2jnsFCe0YmRjb8EvVAo93w.gif" width="500">

#### Steps for Binary Splitting (E.g., Entropy)
1. Compute the entropy for data-set;
2. For every attribute/feature, calculate information gain for this attribute;
3. Pick the feature with highest information gain;
4. Repeat until we get the tree we desired;

#### Entropy and Information Gain
<img src="https://cdn-images-1.medium.com/max/2000/1*EoWJ8bxc-iqBS-dF-XxsBA.jpeg" width="900">
<img src="https://cdn-images-1.medium.com/max/2000/1*wQjVzx7zCVb87htqk46vUA.jpeg" width="900">

#### Alternative Criterion for Binary Splitting
There are a few possible criteria we can use for selecting features and making the binary splits of classification decision tree:
* Classification Error Rate
* Gini Index

## 3 Case: Kaggle Competition - Lending Club Loan Status
### 3.1 Data

#### Overview
The file "LoanStats_2018Q1.csv" contains complete loan data for all loans issued through the 2018 Quarter-1, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the "present" contains complete loan data for all loans issued through the previous completed calendar quarter. Additional features include credit scores, number of finance inquiries, address including zip codes, and state, and collections among others.  <br/>
Please see https://www.kaggle.com/wendykan/lending-club-loan-data/home.

#### Attributes
The dataset can be downloaded [here](https://www.lendingclub.com/info/download-data.action). Information on the columns and features can be found in data dictionary. A data dictionary is provided in a separate file "LCDataDictionary.xlsx".

#### Goal
Our goal is to show how to do binary splitting and tree pruning for a classification tree.

#### Selected Features
For the sake of simplicity, We only select 3 categorical variables as features. We will further transform these categorical variables into binary ones. You need to learn how to fit decision trees when features are continuous variables. 

### 3.2 Build Tree

#### Function 1. Calculating entropy value of a given tree node with labels of samples.

In [5]:
def entropy(sample_labels):
    '''This function is used to calculate entropy value of a given tree node, in which there are samples with labels (0, 1) or (-1, 1).
    Inputs:
    1) sample_labels: Labels for samples in the current tree node, such as (1, 0, 0, 1, 0) or (1, -1, -1, 1, 0)
    
    Outputs:
    1) entropy: Entropy value of labels in the current tree node.       
    
    '''
    
    # Assert np.array
    sample_labels = np.array(sample_labels)
    
    # What if sample_labels are empty
    if sample_labels.size == 0:
        return 0  
    
    # What if all the labels are the same
    class_values = np.unique(sample_labels) # Sample labels/classes; Usually (0,1), sometimes (-1,1)
    num0 = len(list(filter(lambda x:x==class_values[0], sample_labels))) # Number of samples with one label
    num1 = len(list(filter(lambda x:x==class_values[1], sample_labels))) if class_values.size > 1 else 0 # Number of samples with another label
    
    if sample_labels.size == num0 or sample_labels.size == num1:
        return 0
    
    # Calculate entropy value      
    p0 = num0 / (num0+num1) # Probability of class 0 labels
    p1 = 1 - p0 # Probability of class 1 labels
    
    entropy = -(p0*log(p0,2) + p1*log(p1,2))    
    
    return entropy

#### Function 2. Calculating information gain when a given tree node is splitted by a given feature

In [10]:
def info_gain(samples, output, feature):
    '''This function is used to calculate information gain when a given tree node is splitted by a given feature.
    Inputs:
    1) samples: Samples in the current tree node before making split on the feature (Pandas Dataframe)
    1) output: Name of the output column
    2) feature: Name of the feature used to split the current tree node. Remember the features we selected in this case are binary.
    
    Outputs:
    1) information_gain: How much reduction in entropy value if the current tree node is splitted by the feature 
    2) subsamples[0]: Data samples where feature values are one label (e.g., 0 or -1)
    3) subsamples[1]: Data samples where feature values are another label (e.g., 1)
    
    '''
    
    # Split samples by feature values into subsamples
    subsamples = defaultdict()
    entropy_after = 0 # Entropy value after splitting
    
    for feature_value in np.unique(samples[output]):
        subsamples[feature_value] = samples[samples[feature] == feature_value]
        temp = subsamples[feature_value] # Store a temporary copy
        p = len(temp) / len(samples) # Proportion of this subsample
        entropy_after += p * entropy(temp[output])
        
    # Calculate information gain 
    information_gain = entropy(samples[output]) - entropy_after
    
    return (information_gain, subsamples[0], subsamples[1])    

In [11]:
# Let us have a test
a = np.array([[1,0,0,1],[0,1,1,0],[1,1,1,1],[0,0,0,0],[1,1,0,0]])
data = pd.DataFrame(a, columns=['x1','x2','x3','y'])
info_gain(data, 'y', 'x1')[0]

0.4199730940219749

#### Function 3. Decide the best feature to split on: Using information gain and entropy as criterion
1. Loop over each feature in the feature list;
2. For each loop (feature f), split the data into 2 groups: In group 1 (left split), all samples' feature f has value 0. In group 2 (right split), all samples' feature f has value 1;
3. Calculate the information gain for this split;
4. If the information gain for this split using this feature is highest, then pick this feature.

In [12]:
def best_feature_split(samples, output, features):
    '''This function is used to determine the best feature to split based on maximized information gain.
    Inputs:
    1) samples: Samples in the current tree node before making split on the feature (Pandas Dataframe)
    2) output: Name of the output column
    3) features: A list of feature names
    
    Outputs:
    1) best_feature: The best feature which is used to do binary splitting
    2) best_left_split: Data samples where the best feature's values are 0
    3) best_right_split: Data samples where the best feature's values are 1      
    
    '''
    
    # Initialize best feature, best information gain value, best left/right split samples
    best_feature = None 
    best_information_gain = 0
    best_left_split = None
    best_right_split = None    
    
    samples_row = float(len(samples)) # Number of rows in the data samples
    
    # Loop through features and find the best feature
    for feature in features:
        
        # Splitting the data samples
        current_split = info_gain(samples, output, feature)
        information_gain = current_split[0]
        left_split = current_split[1]
        right_split = current_split[2]
        
        # Check if this feature is better
        if information_gain >= best_information_gain:
            best_feature, best_information_gain, best_left_split, best_right_split = (feature, information_gain, left_split, right_split)
    
    return (best_feature, best_information_gain, best_left_split, best_right_split)

In [13]:
# Let us have a test
a = np.array([[1,0,0,1],[0,1,1,0],[1,1,1,1],[0,0,0,0],[1,1,0,0]])
data = pd.DataFrame(a, columns=['x1','x2','x3','y'])
best_feature_split(data, 'y', ['x1','x2','x3'])[0]


'x1'

#### Function 4. Build our classification tree and do pre-pruning
We need to decide stopping conditions (i.e., pre-pruning):
1. The samples' labels in the current node are the same (either 0 or 1);
2. All the features have already been used for split;
3. The current tree has already reached maximum depth **max_depth**;
4. The number of samples in the current node is lower than minimum number **min_number**;
5. The information gain for the current split is lower than a threshold **min_infogain** 

##### Stopping Condition 1: The samples' labels in the current node are the same (either 0/-1 or 1)

In [16]:
def stop_1(node_labels):
    '''This function is used to verify whether stopping condition 1 is satisfied.
    Inputs:
    1) node_labels: The samples' labels in the current node
    
    Outputs:
    1) True if they are all the same, False if otherwise
    
    '''
    
    # numpy array
    node_labels = np.array(node_labels)
    
    # Empty labels
    if len(node_labels) == 0:
        return True
    
    if len(np.unique(node_labels)) == 1:
        print("Stopping Condition 1: The samples' labels in the current node are the same (either 0/-1 or 1)")
        return True
    else:
        return False

##### Stopping Condition 2: All the features have already been used for split

In [19]:
def stop_2(features):
    '''This function is used to verify whether stopping condition 2 is satisfied.
    Inputs:
    1) features: A list of feature names
    
    Outputs:
    1) True if the feature list is empty, False if otherwise
    
    '''
    
    if len(features) == 0 or features == None:
        print("Stopping Condition 2: All the features have already been used for split")
        return True
    else:
        return False  

##### Stopping Condition 3: The current tree has already reached maximum depth **max_depth**

In [20]:
def stop_3(tree_depth, max_depth):
    '''This function is used to verify whether stopping condition 3 is satisfied.
    Inputs:
    1) tree_depth: The depth of the current tree
    2) max_depth: Maximum tree depth
    
    Outputs:
    1) True if the current depth reaches maximum depth, False if otherwise
    
    '''
    
    if tree_depth >= max_depth:
        print("Stopping Condition 3: The current tree has already reached maximum depth")
        return True
    else:
        return False  

##### Stopping Condition 4: The number of samples in the current node is lower than minimum number **min_number**

In [21]:
def stop_4(samples, min_number):
    '''This function is used to verify whether stopping condition 4 is satisfied.
    Inputs:
    1) samples: Data samples in the current node (Pandas DataFrame)
    2) min_number: Minimum number of node size
    
    Outputs:
    1) True if sample size is smaller than the minimum number, False if otherwise
    
    '''
    
    if samples.size <= min_number:
        print("Stopping Condition 4: The number of samples in the current node is lower than minimum number")
        return True
    else:
        return False      

##### Stopping Condition 5: The information gain for the current split is lower than a threshold **min_infogain** 

In [22]:
# info_gain(samples, output, feature) -> information gain, left, right
# best_feature_split(samples, output, features) -> feature, information gain, left, right
def stop_5(info_gain, min_infogain):
    '''This function is used to verify whether stopping condition 5 is satisfied.
    Inputs:
    1) info_gain: Information gain after this best split
    2) min_infogain: Minimum information gain
    
    Outputs:
    1) True if information gain after this best splitting is smaller than the minimum number, False if otherwise
    
    '''
    
    if info_gain <= min_infogain:
        print("Stopping Condition 5: The information gain for the current split is lower than a threshold")
        return True
    else:
        return False      

##### Build classification tree
The data structure for the nested tree structure (including temporary tree nodes, and leaf nodes) is shown as:

{ <br/>
   'label': None for temporary node, or predicted label at the leaf node (e.g., "Majority Voting" criterion) for leaf node; <br/>
   'left_tree': Left tree after the selected feature (=0 or -1) is splitted for temporary node, None for leaf node; <br/>
   'right_tree': Right tree after the selected feature (=1) is splitted for temporary node, None for leaf node; <br/>
   'best_feature': The feature that is selected to do binary split for temporary node, None for leaf node. <br/>
}

In [23]:
def majority_vote(output_labels):
    '''This function is used to get predicted label based on "Majority Voting" criterion for the current leaf node.     
    Inputs:
    1) output_labels: Outputs (labels) in this leaf node, such as [1, 0, 0, 1, 1]
    
    Outputs:
    1) prediction: Predicted label for this leaf node (e.g., 0/-1, or 1)
    
    '''
    
    # numpy array
    output_labels = np.array(output_labels)
    
    # Empty label
    if output_labels.size == 0:
        return None
    
    # Count output labels (0/-1 or 1)
    values = np.unique(output_labels)
    
    if len(values) == 1:
        return values[0]
    else:
        num0 = len(output_labels[output_labels == values[0]])
        num1 = len(output_labels[output_labels == values[1]])
        return values[1] if num1 >= num0 else values[0] # Prediction based on "Majority Voting" criterion   

In [1]:
def ClassificationTree(samples, output, features, step, tree_depth, max_depth, min_number, min_infogain):
    '''This function is used to build a classification tree in a recursive way.
       Remember how you build a binary tree in the previous C++ and Data Structure courses).
       
    Inputs:
    1) samples: Samples in the current tree node before making split on the feature (Pandas Dataframe)
    2) output: Name of the output column
    3) features: A list of feature names
    4) step: The current binary split step
    5) tree_depth: The depth of the current tree
    6) max_depth: Maximum depth this tree can grow
    7) min_number: Minimum number of node size
    8) min_infogain: Minimum information gain
    
    Outputs:
    1) tree_nodes: Nested tree nodes, which are stored and shown in nested dictionary type    
    
    '''
    
    current_features = features # Current feature list
    labels = samples[output] # Output labels in the current tree node

    print ("----------------------------------------------------------------------------")
    print ("----------------------------------------------------------------------------")
    print ("Step %s: Current tree depth is %s. Current tree node has %s data points" % (step, tree_depth, samples.size))
    
    # Verify whether stopping conditions 1-4 are satisfied. If satisfied, return a leaf_node
    if stop_1(labels) or stop_2(current_features) or stop_3(tree_depth, max_depth) or stop_4(samples, min_number):
        return {
                'label': majority_vote(labels),
                'left_tree': None,
                'right_tree': None,
                'best_feature': None            
                }
    
    # If pass stopping conditions 1-4, then do best splitting
    best_split = best_feature_split(samples, output, current_features)
    best_feature, best_infogain, best_left, best_right = (best_split[0], best_split[1], best_split[2], best_split[3])
    
    # Verify whether stopping condition 5 is satisfied. If satisfied, return a leaf node
    if stop_5(best_infogain, min_infogain):
        return {
                'label': majority_vote(labels),
                'left_tree': None,
                'right_tree': None,
                'best_feature': None          
            
                } 
    
    # If pass stopping condition 5, then move on
    step += 1
    print("Step %s: Binary split on %s. Size of Left and Right tree is (%s, %s)" % (step, best_feature, len(best_left), len(best_right)))
    current_features.remove(best_feature) # Remove this feature if this feature is used for split
    
    # Do binary split on left tree and right tree in a recursive way
    left_split = ClassificationTree(best_left, output, current_features, step+1, tree_depth+1, max_depth, min_number, min_infogain)
    right_split = ClassificationTree(best_right, output, current_features, step+1, tree_depth+1, max_depth, min_number, min_infogain)
    
    return {
            'label': None,
            'left_tree': left_split,
            'right_tree': right_split,
            'best_feature': best_feature        
            
            }  

### 3.3 Data Cleaning
We need to do some simple data cleaning work for original lend club loan data.

In [86]:
%pwd
loan_data = pd.read_csv("./LoanStats_2018Q1.csv", low_memory=False, header=1)
loan_data.head(n=10)
loan_data["loan_status"].unique()

array(['Current', 'Fully Paid', 'Late (31-120 days)', 'Late (16-30 days)',
       'Charged Off', 'In Grace Period', nan], dtype=object)

In [77]:
loan_data.shape

(107866, 145)

In [78]:
loan_data.describe()

Unnamed: 0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,installment,annual_inc,url,desc,dti,delinq_2yrs,...,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,settlement_amount,settlement_percentage,settlement_term
count,0.0,107864.0,107864.0,107864.0,107864.0,107864.0,0.0,0.0,107602.0,107864.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,3.0,3.0
mean,,16147.94278,16147.94278,16143.857775,469.663151,78542.27,,,19.648209,0.223773,...,,,,,,,,3287.666667,66.68,13.333333
std,,10184.024938,10184.024938,10182.885389,289.224545,76874.36,,,21.795902,0.730417,...,,,,,,,,1821.483004,2.901189,4.163332
min,,1000.0,1000.0,1000.0,30.12,0.0,,,0.0,0.0,...,,,,,,,,1387.0,65.0,10.0
25%,,8000.0,8000.0,8000.0,254.56,45000.0,,,11.23,0.0,...,,,,,,,,2422.5,65.005,11.0
50%,,14000.0,14000.0,14000.0,389.36,65000.0,,,17.67,0.0,...,,,,,,,,3458.0,65.01,12.0
75%,,22400.0,22400.0,22375.0,637.84,95000.0,,,25.02,0.0,...,,,,,,,,4238.0,67.52,15.0
max,,40000.0,40000.0,40000.0,1618.03,8365188.0,,,999.0,20.0,...,,,,,,,,5018.0,70.03,18.0


In [82]:
loan_data["loan_status"].value_counts()


AttributeError: 'DataFrame' object has no attribute 'unique'

In [72]:
# Select features and output
features = ['grade', 'term', 'home_ownership']       
output = 'risky'
loan_data = loan_data[loan_data['loan_status'] != 'Current']
loan_data[output] = loan_data['loan_status'].map(lambda x: 1 if x in ['Late (31-120 days)', 'Late (16-30 days)', 'Charged Off'] else 0)

In [73]:
dataset = loan_data[features+[output]]
dataset.head()

Unnamed: 0,grade,term,home_ownership,risky
37,A,36 months,RENT,0
83,C,36 months,RENT,0
99,B,36 months,OWN,0
112,D,36 months,RENT,0
135,D,36 months,RENT,1


In [74]:
# Transform categorical features to binary features
grade_dummy = pd.get_dummies(dataset['grade'], prefix='grade')  
term_dummy = pd.get_dummies(dataset['term'], prefix='term')
home_ownership_dummy = pd.get_dummies(dataset['home_ownership'], prefix='home_ownership')

In [75]:
dataset = dataset.join([grade_dummy, term_dummy, home_ownership_dummy])

(6950, 17)

In [67]:
dataset = dataset.drop(features, axis=1)

(6950, 17)


In [37]:
dataset = dataset.dropna() # Remove all missing values

In [38]:
dataset = dataset.reset_index()

In [39]:
dataset.shape

(6950, 15)

In [40]:
# Update our features and output
features = list(dataset.columns[2:])
output = dataset.columns[1]

### 3.4 Classification and Performance

In [41]:
# Suppose max_depth = 6; min_infogain=5e-4
features = list(dataset.columns[2:])
output = dataset.columns[1]
tree_model = ClassificationTree(dataset, output, features, step=0, tree_depth=0, max_depth=6, min_number=5, min_infogain=5e-4)

----------------------------------------------------------------------------
----------------------------------------------------------------------------
Step 0: Current tree depth is 0. Current tree node has 104250 data points
Step 1: Binary split on grade_A. Size of Left and Right tree is (5628, 1322)
----------------------------------------------------------------------------
----------------------------------------------------------------------------
Step 2: Current tree depth is 1. Current tree node has 84420 data points
Step 3: Binary split on home_ownership_MORTGAGE. Size of Left and Right tree is (3015, 2613)
----------------------------------------------------------------------------
----------------------------------------------------------------------------
Step 4: Current tree depth is 2. Current tree node has 45225 data points
Step 5: Binary split on grade_B. Size of Left and Right tree is (2153, 862)
------------------------------------------------------------------------

In [42]:
tree_model

{'label': None,
 'left_tree': {'label': None,
  'left_tree': {'label': None,
   'left_tree': {'label': None,
    'left_tree': {'label': 0,
     'left_tree': None,
     'right_tree': None,
     'best_feature': None},
    'right_tree': {'label': None,
     'left_tree': {'label': 0,
      'left_tree': None,
      'right_tree': None,
      'best_feature': None},
     'right_tree': {'label': None,
      'left_tree': {'label': 0,
       'left_tree': None,
       'right_tree': None,
       'best_feature': None},
      'right_tree': {'label': 0,
       'left_tree': None,
       'right_tree': None,
       'best_feature': None},
      'best_feature': 'home_ownership_RENT'},
     'best_feature': 'term_ 60 months'},
    'best_feature': 'grade_C'},
   'right_tree': {'label': None,
    'left_tree': {'label': 0,
     'left_tree': None,
     'right_tree': None,
     'best_feature': None},
    'right_tree': {'label': 0,
     'left_tree': None,
     'right_tree': None,
     'best_feature': None},
    'b

You can try different initial parameters.

### 3.5 Predictions

Suppose you want to predict new samples' labels. <br/>

Remember our tree structure is like: <br/>
{ <br/>
   'label': None for temporary node, or predicted label at the leaf node (e.g., "Majority Voting" criterion) for leaf node; <br/>
   'left_tree': Left tree after the selected feature (=0 or -1) is splitted for temporary node, None for leaf node; <br/>
   'right_tree': Right tree after the selected feature (=1) is splitted for temporary node, None for leaf node; <br/>
   'best_feature': The feature that is selected to do binary split for temporary node, None for leaf node. <br/>
}

In [43]:
def predict_label(new_sample, train_tree):   
    '''This function is used to predict the label of one new sample.
    Inputs:
    1) new_sample: A new sample, we would like to predict its label (Pandas DataFrame)
    2) train_tree: The classification tree we have just trained
    
    Outputs:
    1) predict_label: The predicted label for this new sample  
    
    '''
    
    # If move to the leaf node
    if train_tree['best_feature']==None:
        return train_tree['label']
    
    # If still stay at temporary node
    else:
        # Find the value of the best feature in the current node
        # If value is 0, then go to left tree
        # If value is 1, then go to right tree
        # Remember what your have learned in Data Structure course, about binary tree
        best_feature = train_tree['best_feature']
        return predict_label(new_sample, train_tree['left_tree']) if new_sample[best_feature]==0 else predict_label(new_sample, train_tree['right_tree'])
        

In [44]:
# You need to learn partial and apply function. They are powerful.
from functools import partial
prediction = partial(predict_label, train_tree=tree_model)
predicted_labels = dataset.apply(lambda x: prediction(x), axis=1)

In [45]:
# Concatenate predicted_labels into our dataset
dataset['prediction'] = predicted_labels

In [46]:
dataset.head()

Unnamed: 0,index,risky,grade_A,grade_B,grade_C,grade_D,grade_E,grade_F,grade_G,term_ 36 months,term_ 60 months,home_ownership_ANY,home_ownership_MORTGAGE,home_ownership_OWN,home_ownership_RENT,prediction
0,37,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0
1,83,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0
2,99,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0
3,112,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0
4,135,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0


#### Assignments: 
* Write functions to use `Gini index` and `misclassification error rate` as metrics for binary splitting. 

## 4 Open-Source Packages

Take a break and let us use open-source package to run decision tree models. <br/>
Use `Scikit-learn`to make classification trees and make predictions: http://scikit-learn.org/stable/modules/tree.html.

In [None]:
# Import libraries
from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

In [None]:
# Update our features and output
features = list(dataset.columns[2:])
output = dataset.columns[1]

# Split dataset to do validation
X = dataset[features]
y = dataset[output]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [None]:
# Fit the model on train data
decision_tree = DecisionTreeClassifier()
decision_tree_model = decision_tree.fit(X_train, y_train)
decision_tree_model.classes_

In [None]:
# Get predicted labels for test data
y_pred = decision_tree_model.predict(X_test)

In [None]:
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)
TN, FP, FN, TP = cm.ravel()
print cm
print TN, FP, FN, TP

In [None]:
# Performance of decision tree model
print "Accuracy: ", accuracy_score(y_test, y_pred)
print "Sensitivity: ", recall_score(y_test, y_pred)
print "Precision: ", precision_score(y_test, y_pred)

How to calculate:
1. Accuracy
2. Misclassification rate
3. Precision
4. Sensitivity

In [None]:
# ROC and AUC
from sklearn.metrics import roc_curve, auc

# Get predicted scores Pr(y=1): Used as thresholds for calculating TP Rate and FP Rate
score = decision_tree_model.predict_proba(X_test)[:, 1]

# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, score) # fpr: FP Rate, tpr: TP Rate, thresholds: Pr(y=1)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.1])
plt.ylim([-0.1,1.1])
plt.title('Receiver operating characteristic')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
# Plot the decision tree
# Remember you should install package graphviz first
import graphviz

In [None]:
dot_data = tree.export_graphviz(decision_tree_model, out_file=None, feature_names=features, class_names=output, 
                                filled=True, rounded=True, special_characters=True)  
graph = graphviz.Source(dot_data)  
graph

In [None]:
# Store in .pdf 
graph.render("Lending Club Loan Status") 

#### Assignments: 
* Use Titanic data (“train.csv”); Fit the model using `scikit-learn` with different metrics (e.g., information gain, gini index)
* Observe and report the differences (e.g., best features for splitting, tree structure, performance, etc.)
* There are no right or wrong answers. Don't worry. Just report what you've seen. 

**Note:** You need to do simple data cleaning by yourself, such as binarizing output variable "survived", and transforming categorical variables to dummy variables.

## 5 Questions (Just think about them)

### 5.1 What if features are continuous?
### 5.2 What if output is continuous? 
* Regression Tree

## 6 References

[1] Jason Brownlee, 2018, [Machine Learning Algorithms from Scratch with Python](https://machinelearningmastery.com/machine-learning-algorithms-from-scratch/). <br/>
[2] Peter Harrington, 2012. Machine Learning in Action. Shelter Island, NY: Manning Publications Co.