# Decision Trees in Practice (not working)
In this assignment we will explore various tec hniques for preventing overfitting in decision trees. We will extend the implementation of the binary decision trees that we implemented in the previous assignment. You will have to use your solutions from this previous assignment and extend them.

In this assignment you will:

* Implement binary decision trees with different early stopping methods.
* Compare models with different stopping parameters.
* Visualize the concept of overfitting in decision trees.

Let's get started!

In [17]:
import pandas as pd
import numpy as np
import json

In [18]:
dataFile = r'lending-club-data.csv'
#1. Load in the LendingClub dataset 
loans = pd.read_csv(dataFile, header=0, low_memory=False)
#2. Reassign the labels to have +1 for a safe loan, and -1 for a risky (bad) loan.
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
#delete column 'bad_loans'
loans = loans.drop('bad_loans', 1)
#3. We will be using the following features:
features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home_ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]
target = 'safe_loans'
#Extract these feature columns from the dataset, and discard the rest of the feature columns.
loans = loans[features + [target]]
#name the new features to keep the core name: for example grade_A, etc...
for colName in features:
    if loans[colName].dtype == object :
        # Create a set of dummy variables from the sex variable
        dummies = pd.get_dummies(loans[colName])
        #update dummies cols name to include initial name reference
        for dummiesName in dummies.columns.values:
            newName = colName + '_' + dummiesName
            dummies.rename(columns={dummiesName: newName}, inplace=True)
        # Join the dummy variables to the main dataframe
        loans = loans.join(dummies)
        loans = loans.drop(colName, 1)       
features = loans.columns.values
features = np.setdiff1d(features, target)
 
print('We will be using %d features in the model.' %(len(features)))
#Load list of indices for the training and validation sets
#open the json file as a string and parse it with json.load ==> a list
train_idx = json.load(open(r'module-6-assignment-train-idx.json')) 
validation_idx = json.load(open(r'module-6-assignment-validation-idx.json'))
train_data = loans.iloc[train_idx]
validation_set = loans.iloc[validation_idx]

We will be using 25 features in the model.


In [19]:
#Early Stopping Methods

#Early stopping condition 1: Maximum depth
#This is already implemented the maximum depth stopping condition in the main function.

#Early stopping condition 2: minimum node size (set by parameter min_node_size)
#The function takes 2 arguments: the data (from a node) and the minimum number of data points that a node is 
# allowed to split on, min_node_size
#This function simply calculates whether the number of data points at a given node is less than or equal to the 
#specified minimum node size. This function will be used to detect this early stopping condition in the 
#decision_tree_create function 
def reached_minimum_node_size(data, min_node_size):
    #return True if the number of data points is less than or equal to the minimum node.size
    return (data.shape[0] <= min_node_size)

#Early stopping condition 3: minimum gain in error reduction
#The function error_reduction takes 2 arguments: error before a split (error_before_split) and error after a split 
#(error_after_split)
# This function computes the gain in error reduction, i.e., the difference between the error before the split 
#and that after the split. This function will be used to detect this early stopping condition in the 
#decision_tree_create function.

def error_reduction(error_before_split, error_after_split):
    return error_before_split - error_after_split

#Calculates number of misclassified examples when predicting the majority class.
#This is used to help determine which feature is best to split on at a given node of the tree
def intermediate_node_num_mistakes(labels_in_node):
    #labels_in_node is 'safe_loans' column 
    # Corner case: If labels_in_node is empty, return 0
    if labels_in_node.shape[0] == 0:
        return 0    
    # Count the number of 1's (safe loans)
    count_safe_loans = sum(labels_in_node[labels_in_node == 1])
    # Count the number of -1's (risky loans)
    count_risky_loans = -sum(labels_in_node[labels_in_node == -1])             
    # Return the number of mistakes that the majority classifier makes.
    return min(count_safe_loans, count_risky_loans)

#finds the best feature to split on given the data and a list of features to consider.
def best_splitting_feature(data, features_list, target):
    target_values = data[target]
    best_feature = None # Keep track of the best feature 
    best_error = 10 # Keep track of the best error so far 
    # Note: Since error is always <= 1, we should intialize it with something larger than 1.
    # Convert to float to make sure error gets computed correctly.
    num_data_points = float(data.shape[0])
    #loop thru each feature to consider splitting on that feature
    for feature in features_list:
        # The left split will have all data points where the feature value is 0
        left_split = data[data[feature]==0]
        # The right split will have all data points where the feature value is 1
        right_split = data[data[feature]==1]
        # Calculate the number of misclassified examples in the left/right split.
        #using the function intermediate_node_num_mistakes()
        left_split_mistakes = intermediate_node_num_mistakes(left_split[target])
        right_split_mistakes = intermediate_node_num_mistakes(right_split[target])       
        # Compute the classification error of this split.
        # Error = (# of mistakes (left) + # of mistakes (right)) / (# of data points)
        error =(left_split_mistakes + right_split_mistakes)/num_data_points
        # If this is the best error we have found so far, store the feature as best_feature and the error as best_error
        ## YOUR CODE HERE
        if error < best_error: 
            best_error = error
            best_feature = feature            
    return best_feature #Return the best feature, feature split that gives the lowest error 

#Creates a leaf node given a set of target values.
def create_leaf(target_values):    
    # Create a leaf node
    leaf = {'splitting_feature' : None,
            'left' : None,
            'right' : None,
            'is_leaf': True,  
            'prediction': None }
   
    # Count the number of data points that are +1 and -1 in this node.
    num_ones = sum(target_values[target_values == +1])
    num_minus_ones = -sum(target_values[target_values == -1])  

    # For the leaf node, set the prediction to be the majority class.
    # Store the predicted class (1 or -1) in leaf['prediction']
    if num_ones > num_minus_ones:
        leaf['prediction'] = +1
    else:
        leaf['prediction'] = -1       

    # Return the leaf node
    return leaf

#Incorporating new early stopping conditions in binary decision tree implementation

def decision_tree_create(data, features, target, current_depth = 0, max_depth = 10, min_node_size = 1, \
                         min_error_reduction = 0.0):
    
    remaining_features = features[:] # features remaining to be splitted
    target_values = data[target] #loans = +/-1
    print("--------------------------------------------------------------------")
    print("Subtree, depth = %s (%s data points)." % (current_depth, len(target_values)))
    

    # Stopping condition 1: Check if all nodes are of the same type (no mistakes)

    if intermediate_node_num_mistakes(target_values) == 0:
        print("Stopping condition 1 reached. All data points have the same target value")     
        return create_leaf(target_values)
    
    # Stopping condition 2: No more features to split on
    if remaining_features == []:  
        print("Stopping condition 2 reached. No remaining features")  
        return create_leaf(target_values)    

    # Early stopping condition 1 : Reached maximum depth limit 
    if current_depth >= max_depth:
        print("Early stopping condition 1 reached. Reached maximum depth")
        return create_leaf(target_values)

    # Early stopping condition 2: Reached the minimum node size.
    # If the number of data points is less than or equal to the minimum size, return a leaf.
    if reached_minimum_node_size(data, min_node_size):           ## YOUR CODE HERE 
        print("Early stopping condition 2 reached. Reached minimum node size.")
        return create_leaf(target_values)
    # Find the best splitting feature
    splitting_feature = best_splitting_feature(data, remaining_features, target)
    # Split on the best feature that we found. 
    left_split = data[data[splitting_feature] == 0]
    right_split = data[data[splitting_feature] == 1]
    remaining_features = np.setdiff1d(remaining_features, splitting_feature)
    #print("Split on feature %s. (%s, %s)" % (splitting_feature, len(left_split), len(right_split)))
    
    # Early stopping condition 3: Minimum error reduction
    # Calculate the error before splitting (number of misclassified examples divided by the total number of examples)
    error_before_split = intermediate_node_num_mistakes(target_values) / float(data.shape[0])
    
    # Calculate the error after splitting (number of misclassified example in both groups divided by the 
    #total number of examples)
    left_mistakes = min(len(left_split[left_split[target]==1]), len(left_split[left_split[target]==-1]))
    print(left_mistakes)
    right_mistakes = min(len(right_split[right_split[target]==1]), len(right_split[right_split[target]==-1]))
    error_after_split = (left_mistakes + right_mistakes) / float(len(data))
    # If the error reduction is LESS THAN OR EQUAL TO min_error_reduction, return a leaf.
    if error_reduction(error_before_split, error_after_split) < min_error_reduction:
        print("Early stopping condition 3 reached. Minimum error reduction.")
        return create_leaf(target_values)
    
    remaining_features = np.setdiff1d(remaining_features, splitting_feature)
    print("Split on feature %s. (%s, %s)" % (splitting_feature, len(left_split), len(right_split)))
    
    
    # Repeat (recurse) on left and right subtrees
    left_tree = decision_tree_create(left_split, remaining_features, target, 
                                     current_depth + 1, max_depth, min_node_size, min_error_reduction)        
    
    right_tree = decision_tree_create(right_split, remaining_features, target, current_depth + 1, \
                                      max_depth, min_node_size, min_error_reduction)

    return {'is_leaf'          : False, 
            'prediction'       : None,
            'splitting_feature': splitting_feature,
            'left'             : left_tree, 
            'right'            : right_tree}


#Build the tree!
def classify(tree, x, annotate = False):
       # if the node is a leaf node.
    if tree['is_leaf']:
        if annotate:
            print("At leaf, predicting %s" % tree['prediction'])
        return tree['prediction']
    else:
        # split on feature.
        split_feature_value = x[tree['splitting_feature']]
        if annotate:
             print("Split on %s = %s" % (tree['splitting_feature'], split_feature_value))
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate)
        else:
            return classify(tree['right'], x, annotate)

def evaluate_classification_error(tree, data, target):
    # Apply the classify(tree, x) to each row in your data
    prediction = data.apply(lambda x: classify(tree, x, annotate=True), axis=1) 
    #needs axis to apply to each row, per default columns
    # Once you've made the predictions, calculate the classification error and return it
    data['prediction'] = prediction
    data['mistakes'] = data.apply(lambda x : 0 if x['prediction'] == x[target] else 1, axis=1)
    Nbr_errors = sum(data['mistakes'])
    
    return (Nbr_errors*1.0/data.shape[0])

In [20]:
#print(train_data.iloc[0])
#14. Now, let's consider the first example of the test set and see what my_decision_tree model predicts 
#for this data point.
print('************ train data ************')

my_decision_tree_new = decision_tree_create(train_data, features, target, max_depth=6, min_node_size=100, \
                                            min_error_reduction=0.0)
print('----------------------------------------------')
#16. Now, let's consider the first example of the validation set and see what the 
# my_decision_tree_new model predicts for this data point. Your code should be analogous to
print(validation_set.iloc[0])
print('Predicted class: %s ' % classify(my_decision_tree_new, validation_set.iloc[0], annotate=True))

#17. Let's add some annotations to our prediction to see what the prediction path was that lead to this 
#predicted class:


************ train data ************
--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).




3221
Split on feature term_ 36 months. (9223, 28001)
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
3159
Split on feature grade_A. (9122, 101)
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
2985
Split on feature emp_length_1 year. (8602, 520)
--------------------------------------------------------------------
Subtree, depth = 3 (8602 data points).
1865
Split on feature emp_length_10+ years. (5518, 3084)
--------------------------------------------------------------------
Subtree, depth = 4 (5518 data points).
1592
Split on feature emp_length_2 years. (4755, 763)
--------------------------------------------------------------------
Subtree, depth = 5 (4755 data points).
1359
Split on feature emp_length_3 years. (4081, 674)
--------------------------------------------------------------------
Subtree, depth = 6 (4081 data points).
Early stopping condition 1 reached. Re

In [21]:
my_decision_tree_old = decision_tree_create(train_data, features, target, max_depth=6, min_node_size=0, \
                                            min_error_reduction=-1)
print('Predicted class: %s ' % classify(my_decision_tree_old, validation_set.iloc[0],annotate = True))


--------------------------------------------------------------------
Subtree, depth = 0 (37224 data points).




3221
Split on feature term_ 36 months. (9223, 28001)
--------------------------------------------------------------------
Subtree, depth = 1 (9223 data points).
3159
Split on feature grade_A. (9122, 101)
--------------------------------------------------------------------
Subtree, depth = 2 (9122 data points).
2985
Split on feature emp_length_1 year. (8602, 520)
--------------------------------------------------------------------
Subtree, depth = 3 (8602 data points).
1865
Split on feature emp_length_10+ years. (5518, 3084)
--------------------------------------------------------------------
Subtree, depth = 4 (5518 data points).
1592
Split on feature emp_length_2 years. (4755, 763)
--------------------------------------------------------------------
Subtree, depth = 5 (4755 data points).
1359
Split on feature emp_length_3 years. (4081, 674)
--------------------------------------------------------------------
Subtree, depth = 6 (4081 data points).
Early stopping condition 1 reached. Re

Quiz question: For my_decision_tree_new trained with max_depth = 6, min_node_size = 100, min_error_reduction=0.0, 
    is the prediction path for validation_set[0] shorter, longer, or the same as for my_decision_tree_old that 
    ignored the early stopping conditions 2 and 3?

Quiz question: For my_decision_tree_new trained with max_depth = 6, min_node_size = 100, min_error_reduction=0.0, 
    is the prediction path for any point always shorter, always longer, always the same, shorter or the same, or 
    longer or the same as for my_decision_tree_old that ignored the early stopping conditions 2 and 3?

Quiz question: For a tree trained on any dataset using max_depth = 6, min_node_size = 100, min_error_reduction=0.0, 
    what is the maximum number of splits encountered while making a single prediction?



In [22]:
"""


print('Quiz question: For my_decision_tree_new trained with max_depth = 6, min_node_size = 100, \
min_error_reduction=0.0, is the prediction path for validation_set[0] shorter, longer, or the same \
as for my_decision_tree_old that ignored the early stopping conditions 2 and 3?')
print('same')

#20. Now, let's use this function to evaluate the classification error of my_decision_tree_new on the 
#validation_set. Your code should be analogous to
err_new_tree = evaluate_classification_error(my_decision_tree_new, validation_set, target)
err_old_tree = evaluate_classification_error(my_decision_tree_old, validation_set, target)
print('Validation error new tree:', err_new_tree)
print('*************************************************')
print('*************************************************')
print('Validation error old tree:', err_old_tree)


#Exploring the effect of max_depth
#We will compare three models trained with different values of the stopping criterion. We intentionally 
#picked models at the extreme ends (too small, just right, and too large).

#22. Train three models with these parameters:

#model_1: max_depth = 2 (too small)
#model_2: max_depth = 6 (just right)
#model_3: max_depth = 14 (may be too large)
#For each of these three, set min_node_size = 0 and min_error_reduction = -1. Make sure to call 
#the models model_1, model_2, and model_3.

#Note: Each tree can take up to a few minutes to train. In particular, model_3 will probably take the longest to train.

#23. Let us evaluate the models on the train and validation data. Let us start by evaluating the classification 
#error on the training data. Your code should be analogous to:


model_1 = decision_tree_create(train_data, features, target, max_depth=2, min_node_size=0, min_error_reduction=-1)
model_2 = decision_tree_create(train_data, features, target, max_depth=6, min_node_size=0, min_error_reduction=-1)
model_3 = decision_tree_create(train_data, features, target, max_depth=14, min_node_size=0, min_error_reduction=-1)
model_1_TrainErr = evaluate_classification_error(model_1, train_data, target)
model_2_TrainErr =  evaluate_classification_error(model_2, train_data, target)
model_3_TrainErr =  evaluate_classification_error(model_3, train_data, target)
model_1_ValidationErr = evaluate_classification_error(model_1, validation_set, target)
model_2_ValidationErr =  evaluate_classification_error(model_2, validation_set, target)
model_3_ValidationErr =  evaluate_classification_error(model_3, validation_set, target)
print("Training data, classification error (model 1):", model1_TrainErr)
print("Training data, classification error (model 2):", model2_TrainErr)
print("Training data, classification error (model 3):", model3_TrainErr)
print("Validation Set, classification error (model 1):", model1_ValidationErr)
print("Validation Set, classification error (model 2):", model2_ValidationErr)
print("Validation Set, classification error (model 3):", model3_ValidationErr)
"""


'\n\n\nprint(\'Quiz question: For my_decision_tree_new trained with max_depth = 6, min_node_size = 100, min_error_reduction=0.0, is the prediction path for validation_set[0] shorter, longer, or the same as for my_decision_tree_old that ignored the early stopping conditions 2 and 3?\')\nprint(\'same\')\n\n#20. Now, let\'s use this function to evaluate the classification error of my_decision_tree_new on the \n#validation_set. Your code should be analogous to\nerr_new_tree = evaluate_classification_error(my_decision_tree_new, validation_set, target)\nerr_old_tree = evaluate_classification_error(my_decision_tree_old, validation_set, target)\nprint(\'Validation error new tree:\', err_new_tree)\nprint(\'*************************************************\')\nprint(\'*************************************************\')\nprint(\'Validation error old tree:\', err_old_tree)\n\n\n#Exploring the effect of max_depth\n#We will compare three models trained with different values of the stopping criterio