## Module 5 Assignment 2: Using pandas and a custom decision tree
## Implementing binary decision trees

The goal of this notebook is to implement your own binary decision tree classifier. You will:
    
* Use pandas DataFrames (in place of SFrames) to do some feature engineering.
* Transform categorical variables into binary variables.
* Write a function to compute the number of misclassified examples in an intermediate node.
* Write a function to find the best feature to split on.
* Build a binary decision tree from scratch.
* Make predictions using the decision tree.
* Evaluate the accuracy of the decision tree.
* Visualize the decision at the root node.


**Important Note**: In this assignment, we will focus on building decision trees where the data contain **only binary (0 or 1) features**. This allows us to avoid dealing with:
* Multiple intermediate nodes in a split
* The thresholding issues of real-valued features.

More details of the tree-building process:
* It is computationally infeasible to consider every possible partition of the predictor feature space.
* For this reason, we take a **top-down, greedy** approach that is known as **recursive binary splitting.**
* The approach is **top-down** because it begins at the top of the tree and then successively splits the predictor space; each split is indicated via **two new branches** further down the tree, which we store as **left_tree** and **right_tree**.
* It is **greedy** because at each step of the tree-building process the **best split** is made at that particular step (the split on the feature that gives the **lowest classification error rate**), rather than looking ahead and picking a split that would lead to a better tree in some future step.


# Load Pandas

In [35]:
import pandas as pd

# Load the lending club dataset

**1.** We will be using the same [LendingClub](https://www.lendingclub.com/) dataset as in the previous assignment.

In [36]:
loans = pd.read_csv('lending-club-data.csv')

In [37]:
loans.head(2)

| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | sub_grade_num | delinq_2yrs_zero | pub_rec_zero | collections_12_mths_zero | short_emp | payment_inc_ratio | final_d | last_delinq_none | last_record_none | last_major_derog_none |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 1296599 | 5000 | 5000 | 4975 | 36 months | 10.65 | 162.87 | B | B2 | ... | 0.4 | 1.0 | 1.0 | 1.0 | 0 | 8.1435 | 20141201T000000 | 1 | 1 | 1 |
| 1 | 1077430 | 1314167 | 2500 | 2500 | 2500 | 60 months | 15.27 | 59.83 | C | C4 | ... | 0.8 | 1.0 | 1.0 | 1.0 | 1 | 2.3932 | 20161201T000000 | 1 | 1 | 1 |


In [38]:
print "Features are ",loans.columns
print "Number of features = %d" %len(loans.columns)
print "Number of samples = %d" %(loans.size/len(loans.columns))
print "Dimension of data frame=", loans.ndim
print "Indexing for samples=", loans.index
print "Number of samples/rows = %d and num of features/cols =  %d" %loans.shape
(num_rows,num_cols) = loans.shape  #shape returns tuple (num_rows,num_cols)
#print "Product columns data types", loans.dtypes

Features are  Index([u'id', u'member_id', u'loan_amnt', u'funded_amnt', u'funded_amnt_inv',
       u'term', u'int_rate', u'installment', u'grade', u'sub_grade',
       u'emp_title', u'emp_length', u'home_ownership', u'annual_inc',
       u'is_inc_v', u'issue_d', u'loan_status', u'pymnt_plan', u'url', u'desc',
       u'purpose', u'title', u'zip_code', u'addr_state', u'dti',
       u'delinq_2yrs', u'earliest_cr_line', u'inq_last_6mths',
       u'mths_since_last_delinq', u'mths_since_last_record', u'open_acc',
       u'pub_rec', u'revol_bal', u'revol_util', u'total_acc',
       u'initial_list_status', u'out_prncp', u'out_prncp_inv', u'total_pymnt',
       u'total_pymnt_inv', u'total_rec_prncp', u'total_rec_int',
       u'total_rec_late_fee', u'recoveries', u'collection_recovery_fee',
       u'last_pymnt_d', u'last_pymnt_amnt', u'next_pymnt_d',
       u'last_credit_pull_d', u'collections_12_mths_ex_med',
       u'mths_since_last_major_derog', u'policy_code', u'not_compliant',
       u'stat

**2.** Like the previous assignment, we reassign the labels to have +1 for a safe loan, and -1 for a risky (bad) loan.

In [39]:
# safe_loans =  1 => safe
# safe_loans = -1 => risky
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)

# just check if conversion done properly, before removing the 'bad_loans' col
print loans[['id', 'member_id', 'loan_amnt', 'bad_loans', 'safe_loans']].head(2)

# removing a column is straightforward:
# to remove/drop a column, axis=1 denotes that we are referring to a column
# see http://chrisalbon.com/python/pandas_dropping_column_and_rows.html

loans = loans.drop('bad_loans', axis=1)

        id  member_id  loan_amnt  bad_loans  safe_loans
0  1077501    1296599       5000          0           1
1  1077430    1314167       2500          1          -1
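
The `apply` call above relabels one row at a time. The same mapping can also be written as a vectorized lookup. The following is a minimal sketch on a made-up four-row frame (the `toy` values are assumptions for illustration, not LendingClub data):

```python
import pandas as pd

# Toy stand-in for the loans frame (values are made up for illustration).
toy = pd.DataFrame({'bad_loans': [0, 1, 0, 0]})

# Map 0 (not bad) -> +1 (safe) and 1 (bad) -> -1 (risky) in one vectorized step.
toy['safe_loans'] = toy['bad_loans'].map({0: 1, 1: -1})

print(toy['safe_loans'].tolist())
```

Either form gives the same labels; the dictionary version simply makes the 0/1 to +1/-1 correspondence explicit.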


**3.** Unlike the previous assignment, where we used several features, in this assignment we will use just 4 categorical features:

1. grade of the loan 
2. the length of the loan term
3. the home ownership status: own, mortgage, rent
4. number of years of employment.

Since we are building a binary decision tree, we will have to convert these categorical features to a binary representation in a subsequent section using 1-hot encoding.

Extract these feature columns from the dataset, and discard the rest of the feature columns.

In [40]:
features = ['grade',              # grade of the loan
            'term',               # the term of the loan
            'home_ownership',     # home_ownership status: own, mortgage or rent
            'emp_length',         # number of years of employment
           ]
target = 'safe_loans'

#Extract these feature columns from the dataset, and discard the rest of the feature columns.
loans = loans[features + [target]]

print features

['grade', 'term', 'home_ownership', 'emp_length']


Let's explore what the dataset looks like.

In [41]:
print len(loans)
loans.head(10)

122607


| | grade | term | home_ownership | emp_length | safe_loans |
|---|---|---|---|---|---|
| 0 | B | 36 months | RENT | 10+ years | 1 |
| 1 | C | 60 months | RENT | < 1 year | -1 |
| 2 | C | 36 months | RENT | 10+ years | 1 |
| 3 | C | 36 months | RENT | 10+ years | 1 |
| 4 | A | 36 months | RENT | 3 years | 1 |
| 5 | E | 36 months | RENT | 9 years | 1 |
| 6 | F | 60 months | OWN | 4 years | -1 |
| 7 | B | 60 months | RENT | < 1 year | -1 |
| 8 | C | 60 months | OWN | 5 years | 1 |
| 9 | B | 36 months | OWN | 10+ years | 1 |


Notes to people using other tools

If you are using SFrame, proceed to the next section, "Subsample dataset to make sure classes are balanced". We do not do this here because we are using pandas. Instead, the train and test sets are built from pre-selected index lists that were sampled to give some class balance.

If you are NOT using SFrame, download the list of indices for the training and test sets: module-5-assignment-2-train-idx.json, module-5-assignment-2-test-idx.json. Then follow these steps:
*    Apply one-hot encoding to loans. Your tool may have a function for one-hot encoding; in pandas it is `get_dummies()`.
*    Load the JSON files into the lists train_idx and test_idx.
*    Perform train/validation split using train_idx and test_idx. In Pandas, for instance:

## Subsample dataset to make sure classes are balanced

**4.** Just as we did in the previous assignment, we will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. We use `seed=1` so everyone gets the same results.

Note. We are NOT using SFrame, so we download the list of indices for the training and test sets: module-5-assignment-2-train-idx.json, module-5-assignment-2-test-idx.json. Some rows of loans are included in neither train_data nor test_data; they were left out by the sampling that achieves class balance. With pandas there is therefore no need to address class imbalance as in the cell below, which is kept disabled inside a string for reference.

In [42]:
'''
safe_loans_raw = loans[loans[target] == 1]
risky_loans_raw = loans[loans[target] == -1]

print "Percentage of safe loans b4               :", len(safe_loans_raw) / float(len(loans))
print "Percentage of risky loans b4              :", len(risky_loans_raw) / float(len(loans))

# Since there are less risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))
safe_loans = safe_loans_raw.sample(percentage, seed = 1)
risky_loans = risky_loans_raw
loans_data = risky_loans.append(safe_loans)

print "Percentage of safe loans                 :", len(safe_loans) / float(len(loans_data))
print "Percentage of risky loans                :", len(risky_loans) / float(len(loans_data))
print "Total number of loans in our new dataset :", len(loans_data)
'''


**Note:** There are many approaches for dealing with imbalanced data, including some where we modify the learning algorithm. These approaches are beyond the scope of this course, but some of them are reviewed in "[Learning from Imbalanced Data](http://www.ele.uri.edu/faculty/he/PDFfiles/ImbalancedLearning.pdf)" by Haibo He and Edwardo A. Garcia, *IEEE Transactions on Knowledge and Data Engineering* **21**(9) (June 26, 2009), p. 1263–1284. For this assignment, we use the simplest possible approach, where we subsample the overly represented class to get a more balanced dataset. In general, and especially when the data is highly imbalanced, we recommend using more advanced methods.
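
The disabled cell above uses the SFrame form of `sample`. For readers who want to undersample directly in pandas, here is a minimal sketch on a made-up frame. The `toy` data are an assumption, and pandas' `random_state=1` will not reproduce the rows that SFrame's `seed=1` selects, so the counts in this notebook would differ:

```python
import pandas as pd

# Toy frame: 8 safe loans (+1) and 2 risky loans (-1); values are made up.
toy = pd.DataFrame({'safe_loans': [1] * 8 + [-1] * 2})

safe_raw = toy[toy['safe_loans'] == 1]
risky_raw = toy[toy['safe_loans'] == -1]

# Fraction of safe loans to keep so that both classes end up the same size.
fraction = len(risky_raw) / float(len(safe_raw))

# pandas takes frac= and random_state= where SFrame takes a fraction and seed=.
safe_sample = safe_raw.sample(frac=fraction, random_state=1)

# pd.concat stacks the two pieces (DataFrame.append was removed in pandas 2.0).
balanced = pd.concat([risky_raw, safe_sample])

print(balanced['safe_loans'].value_counts().to_dict())
```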

About 82% of the data samples were safe loans and about 18% were risky loans, so there is a large class imbalance. Given that 82% of the loans are safe, predicting that all loans are safe would be correct 82% of the time. That sounds like high accuracy, but it is no better than the prior probability (the background class distribution), so our accuracy needs to exceed that 82% baseline to be useful. Fog forecasting is similar: if fog occurs on only 2% of days in a year, forecasting no fog every time is correct 98% of the time. The remaining 2% are false negatives, and we have to determine whether that risk is acceptable to industry.
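
The baseline argument can be checked with a few lines of arithmetic. This sketch uses a made-up list of 100 labels in the 82/18 proportion quoted above (the list itself is an assumption, not the real data):

```python
# Toy label list with the class proportions quoted above: 82 safe, 18 risky.
labels = [1] * 82 + [-1] * 18

num_safe = sum(1 for y in labels if y == 1)
num_risky = len(labels) - num_safe

# Always predicting the majority class is right on exactly the majority share.
baseline_accuracy = max(num_safe, num_risky) / float(len(labels))

print("Majority-class baseline accuracy: %.2f" % baseline_accuracy)
```

Any classifier trained on the imbalanced data has to beat this number before its accuracy means anything.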

## Transform categorical data into binary features/Apply one-hot encoding

In this assignment, we will implement **binary decision trees** (decision trees for binary features, a specific case of categorical variables taking on two values, e.g., true/false, fog/no fog, rain/no rain etc). Since all of our features are currently categorical features, we want to turn them into binary features. 

For instance, the **home_ownership** feature represents the home ownership status of the loanee, which is either `own`, `mortgage` or `rent`. For example, if a data point has the feature 
```
   {'home_ownership': 'RENT'}
```
we want to turn this into three features: 
```
 { 
   'home_ownership = OWN'      : 0, 
   'home_ownership = MORTGAGE' : 0, 
   'home_ownership = RENT'     : 1
 }
```
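
As a rough illustration of the transformation above, `pd.get_dummies` can produce column names in the same `'home_ownership = RENT'` form. The three-row `toy` frame is an assumption, and `dtype=int` is passed so the result holds 0/1 integers rather than booleans:

```python
import pandas as pd

# Toy column with the three ownership levels named in the text.
toy = pd.DataFrame({'home_ownership': ['RENT', 'OWN', 'MORTGAGE']})

# prefix and prefix_sep build column names of the form 'home_ownership = RENT';
# dtype=int forces 0/1 integers instead of booleans.
one_hot = pd.get_dummies(toy['home_ownership'],
                         prefix='home_ownership',
                         prefix_sep=' = ',
                         dtype=int)

# The first row is the RENT example from the text.
print(one_hot.iloc[0].to_dict())
```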

**5.** This technique of turning categorical variables into binary variables is called one-hot encoding. Using the software of your choice, perform one-hot encoding on the four features described above. You should now have 25 binary features.

Since this code requires a few Python tricks, feel free to use the course's block of code as is, but note that it only works with SFrames. Refer to the API documentation for a deeper understanding. Note that one-hot encoding can create more features than the levels listed in the description: home_ownership is described with 3 levels ('OWN', 'RENT', 'MORTGAGE'), but the data also contain a level 'OTHER', so the encoding yields 4 separate features rather than the expected 3. Another approach is to use sklearn's LabelEncoder().

In [43]:
### convert all categorical/"object" data type to numeric labels
#from sklearn.preprocessing import LabelEncoder

### create label encoders for categorical features
#for feature in categorical_features:
#    number = LabelEncoder() #different number object for each cat var
#    loans[feature] = number.fit_transform(loans[feature].astype('str'))

Generally, one-hot encoding appears to be a better approach than LabelEncoder, which assigns a different numerical value to each level of a categorical feature. For example, the home_ownership values might be assigned 1 for 'RENT', 2 for 'OWN' and 3 for 'MORTGAGE'. This often causes a decision tree classifier to treat the feature as numeric, so we start to see splits like (home_ownership <= 2.5), which may not have a meaningful interpretation.
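
To see the artificial ordering that integer labels introduce, here is a small sketch that uses pandas categorical codes as a stand-in for LabelEncoder (an assumption made to avoid the sklearn dependency; the exact integers depend on the encoder, and here the levels are numbered in sorted order):

```python
import pandas as pd

toy = pd.Series(['RENT', 'OWN', 'MORTGAGE', 'RENT'])

# Integer label encoding: levels are sorted, then numbered 0, 1, 2, ...
codes = toy.astype('category').cat.codes

# A numeric threshold now groups levels purely by their sort position.
print(codes.tolist())
print((codes <= 1).tolist())
```

The threshold `codes <= 1` puts MORTGAGE and OWN on one side and RENT on the other only because of alphabetical order, which is the interpretability problem described above.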

Thankfully, one-hot encoding is supported in pandas through `pd.get_dummies()`.
* Apply one-hot encoding to loans. See [this](https://gist.github.com/ramhiser/982ce339d5f8c9a769a0) and also [this](http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/).

In [44]:
# loans_data = risky_loans.append(safe_loans) #already appended earlier!

for feature in features:
    # Get one hot encoding of column feature
    one_hot = pd.get_dummies(loans[feature])
    
    # Drop the original categorical column as it is now encoded
    loans = loans.drop(feature, axis=1)
    
    # Join the encoded df
    loans = loans.join(one_hot)
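
The loop above names each new column after the bare level ('A', 'OTHER', '1 year'), so levels with the same name in two different features would collide. A possible alternative, sketched here on a made-up frame (the `toy` data are an assumption), is a single `get_dummies` call with `columns=` and a `.` separator, which keeps the feature name as a prefix:

```python
import pandas as pd

# Toy frame with two of the four categorical columns plus the target.
toy = pd.DataFrame({
    'grade': ['A', 'B', 'A'],
    'home_ownership': ['RENT', 'OWN', 'OTHER'],
    'safe_loans': [1, -1, 1],
})

# columns= encodes only the listed columns and drops the originals;
# prefix_sep='.' yields names such as 'grade.A'.
encoded = pd.get_dummies(toy, columns=['grade', 'home_ownership'],
                         prefix_sep='.', dtype=int)

print(list(encoded.columns))
```

The rest of this notebook keeps the unprefixed names produced by the loop, so this variant would require the feature names used later to change accordingly.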

For comparison, SFrame's one-hot encoding gave these features:
* `grade.A`
* `grade.B`
* `grade.C`
* `grade.D`
* `grade.E`
* `grade.F`
* `grade.G`
* `term. 36 months`
* `term. 60 months`
* `home_ownership.MORTGAGE`
* `home_ownership.OTHER`
* `home_ownership.OWN`
* `home_ownership.RENT`
* `emp_length.1 year`
* `emp_length.10+ years`
* `emp_length.2 years`
* `emp_length.3 years`
* `emp_length.4 years`
* `emp_length.5 years`
* `emp_length.6 years`
* `emp_length.7 years`
* `emp_length.8 years`
* `emp_length.9 years`
* `emp_length.< 1 year`
* `emp_length.n/a`

Let's see what the feature columns look like now:

In [45]:
features = list(loans.columns)   #loans.columns return index object, force it to be list
# AttributeError: 'Index' object has no attribute 'remove' , so make features list as above
features.remove('safe_loans')  # Remove the response variable
features

['A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 ' 36 months',
 ' 60 months',
 'MORTGAGE',
 'OTHER',
 'OWN',
 'RENT',
 '1 year',
 '10+ years',
 '2 years',
 '3 years',
 '4 years',
 '5 years',
 '6 years',
 '7 years',
 '8 years',
 '9 years',
 '< 1 year',
 'n/a']

In [46]:
print "Number of features (after binarizing categorical variables) = %s" % len(features)

Number of features (after binarizing categorical variables) = 25


Let's explore what one of these columns looks like:

In [47]:
print len(loans['A'])
loans['A'][0:10]

122607


0    0.0
1    0.0
2    0.0
3    0.0
4    1.0
5    0.0
6    0.0
7    0.0
8    0.0
9    0.0
Name: A, dtype: float64

## Train-test split

**6.** We split the data into a training set with 80% of the data and a test set with 20% of the data.
* Load the JSON files into the lists train_idx and test_idx. json.load(file) outputs a list.
* Perform train/validation split using train_idx and test_idx.

In [48]:
#train_data, validation_data = loans_data.random_split(.8, seed=1)

import json
with open('module-5-assignment-2-train-idx.json', mode='r') as file1: 
    train_idx = json.load(file1)   #reads entire file in one go into list

with open('module-5-assignment-2-test-idx.json', mode='r') as file2: 
    test_idx = json.load(file2)    #reads entire file in one go into list

# No explicit close() calls are needed: each `with` block closes its file on exit.

# Now grab all columns from loans for the rows whose positions are listed
# in train_idx (training set) and test_idx (test set)
train_data      =  loans.iloc[train_idx] 
test_data       =  loans.iloc[test_idx]

print 'Training set   : %d data points' % len(train_data)
print 'Test set       : %d data points' % len(test_data)

Training set   : 37224 data points
Test set       : 9284 data points
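
The index-based split can be tried on a small example without the JSON files. In this sketch the frame and both index lists are made up (assumptions for illustration), and `json.loads` on a string stands in for `json.load` on a file:

```python
import json
import pandas as pd

# Toy frame and made-up index lists standing in for the two JSON files.
toy = pd.DataFrame({'A': [0, 1, 0, 1, 0, 1],
                    'safe_loans': [1, -1, 1, 1, -1, -1]})
train_idx = json.loads('[0, 2, 3, 5]')   # json.load(file) returns the same kind of list
test_idx = json.loads('[1]')

# .iloc selects rows by integer position and keeps the original index labels.
train_data = toy.iloc[train_idx]
test_data = toy.iloc[test_idx]

# Row 4 appears in neither list, mirroring the rows left out for class balance.
unused = sorted(set(range(len(toy))) - set(train_idx) - set(test_idx))
print(unused)
```

Because `.iloc` keeps the original labels, the training rows are still indexed 0, 2, 3, 5 rather than renumbered from 0.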


Note. Some elements in loans are included neither in train_data nor test_data. This is to perform sampling to achieve class balance.

The subsampling, one-hot encoding and train/test split steps are now done, so we can proceed to the section "Decision tree implementation" after a quick check of the training data.

In [49]:
print len(train_data['A'])
train_data['A'][0:10]

37224


1     0.0
6     0.0
7     0.0
10    0.0
12    0.0
18    0.0
21    0.0
23    0.0
45    0.0
48    0.0
Name: A, dtype: float64

This column is set to 1 if the loan grade is A and 0 otherwise (i.e., when the loan grade is any of B, C, D, E, F, G).

**Checkpoint:** Make sure the following answers match up.

In [50]:
print "Total number of grade.A loans : %s" % train_data['A'].sum()
print "Expected answer               : 6422"  
# Our count (5130) is lower than the course's expected 6422, apparently because
# the index files subsample rows to balance the classes

Total number of grade.A loans : 5130.0
Expected answer               : 6422


# Decision tree implementation

In this section, we will implement binary decision trees from scratch. There are several steps involved in building a decision tree. For that reason, we have split the entire assignment into several sections.

## Function to count number of mistakes while predicting majority class

Recall from the lecture that prediction at an intermediate node works by predicting the **majority class** for all data points that belong to this node.

Now, we will write a function that calculates the number of **misclassified examples** when predicting the **majority class**. This will be used to help determine which feature is the best to split on at a given node of the tree.

**Note**: Keep in mind that in order to compute the number of mistakes for a majority classifier, we only need the label (y values) of the data points in the node. 

**Steps to follow**:
* **Step 1:** Calculate the number of safe loans and risky loans.
* **Step 2:** Since we are assuming majority class prediction, all the data points that are **not** in the majority class are considered **mistakes**.
* **Step 3:** Return the number of **mistakes**.


**7.** Now, let us write the function `intermediate_node_num_mistakes` which computes the number of misclassified examples of an intermediate node given the set of labels (y values) of the data points contained in the node. Fill in the places where you find `## YOUR CODE HERE`. There are **three** places in this function for you to fill in.

In [51]:
def intermediate_node_num_mistakes(labels_in_node):
    
    # Corner case: If labels_in_node is empty, return 0
    if len(labels_in_node) == 0:
        return 0
    
    # print "Number of samples %d " %len(labels_in_node)
    # print "Samples class distribution %s " %(labels_in_node)
    # mistakes = None
    
    # Count the number of 1's (safe loans)
    # num_of_1s =      (loans['safe_loan'] == +1).sum()
    safe_loans_count = sum(labels_in_node == +1)
    # safe_loans_count = sum([1 if x == 1 else 0 for x in labels_in_node])
    
    # Count the number of -1's (risky loans)
    risky_loans_count = sum(labels_in_node == -1)
    # risky_loans_count = sum([1 if x == -1 else 0 for x in labels_in_node])
       
    
    # Diagnostic output (comment out the print below to silence it)
    
    print "(num risky: %d , num safe: %d) in the split." %(risky_loans_count,safe_loans_count)
    
    # Return the number of mistakes that the majority classifier makes.
    # maj_class = '+1' if safe_loans_count >= risky_loans_count else '-1'
    # mistakes = risky_loans_count if (maj_class == '+1') else safe_loans_count
    #
    # The lines above just pick the smaller of the two counts. Under majority
    # class prediction, every data point that is not in the majority class is
    # a mistake, so we do not even need to determine the majority class:
    # the larger count is the majority and the smaller count is the mistakes.
    # For example, with more safe loans than risky ones, the whole node is
    # classified as safe and the mistakes are exactly the risky loans.
    # mistakes = min(safe_loans_count, risky_loans_count)
    
    '''
    # uncomment if want to print some diagnostics... 
    if (maj_class == '+1'):
        #mistakes = len(labels_in_node) - safe_loans_count
        mistakes = risky_loans_count
        print "Majority class is %s so num mistakes will be %d " %(maj_class, mistakes)
    else:
        #mistakes = len(labels_in_node) - risky_loans_count
        mistakes = safe_loans_count
        print "Majority class is %s so num mistakes will be %d " %(maj_class, mistakes)
    '''  
    return min(safe_loans_count, risky_loans_count)

**8.** Because there are several steps in this assignment, we have introduced some stopping points where you can check your code and make sure it is correct before proceeding. To test your `intermediate_node_num_mistakes` function, run the following code until you get a **Test passed!**, then you should proceed. Otherwise, you should spend some time figuring out where things went wrong.

In [52]:
# Test case 1
#example_labels = sf.SArray([-1, -1, 1, 1, 1])
example_labels = pd.Series([-1, -1, 1, 1, 1])
if intermediate_node_num_mistakes(example_labels) == 2:
    print 'Test passed!\n'
else:
    print 'Test 1 failed... try again!'

# Test case 2
example_labels = pd.Series([-1, -1, 1, 1, 1, 1, 1])
if intermediate_node_num_mistakes(example_labels) == 2:
    print 'Test passed!\n'
else:
    print 'Test 2 failed... try again!'
    
# Test case 3
example_labels = pd.Series([-1, -1, -1, -1, -1, 1, 1])
if intermediate_node_num_mistakes(example_labels) == 2:
    print 'Test passed!'
else:
    print 'Test 3 failed... try again!'

(num risky: 2 , num safe: 3) in the split.
Test passed!

(num risky: 2 , num safe: 5) in the split.
Test passed!

(num risky: 5 , num safe: 2) in the split.
Test passed!


## Function to pick best feature to split on

The function **best_splitting_feature** takes 3 arguments: 
1. The data (a pandas DataFrame here, an SFrame in the original course, which includes all of the feature columns and the label column)
2. The features to consider for splits (a list of strings of column names to consider for splits)
3. The name of the target/label column (string)

The function will loop through the list of possible features, and consider splitting on each of them. It will calculate the classification error of each split and return the feature that had the smallest classification error when split on.

Recall that the **classification error** is defined as follows:
$$
\mbox{classification error} = \frac{\mbox{# mistakes}}{\mbox{# total examples}}
$$
For a single node under majority-class prediction, this is the same as
$$
\mbox{classification error} = \frac{\mbox{min(# +ve class, # -ve class)}}{\mbox{# total examples}}
$$
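
For a split, the mistakes of the two child nodes are added before dividing by the total. The sketch below applies this to the grade A counts printed in the training-set output further down in this notebook; the helper name `split_classification_error` is hypothetical and not part of the assignment:

```python
def split_classification_error(left_counts, right_counts):
    """Classification error of a binary split under majority-class prediction.

    Each argument is a (num_risky, num_safe) pair for one side of the split.
    """
    # The minority count on each side is the number of mistakes on that side.
    mistakes = min(left_counts) + min(right_counts)
    total = sum(left_counts) + sum(right_counts)
    return mistakes / float(total)

# Counts for feature A: left split (A == 0) and right split (A == 1).
error_a = split_classification_error((17218, 14876), (1258, 3872))
print("%.4f" % error_a)   # prints 0.4334
```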

**9.** Follow these steps to implement *best_splitting_feature*
* **Step 1:** Loop over each feature in the feature list
* **Step 2:** Within the loop, split the data into two groups: one group where all of the data has feature value 0 or False (we will call this the **left** split), and one group where all of the data has feature value 1 or True (we will call this the **right** split). Make sure the **left** split corresponds with 0 and the **right** split corresponds with 1 to ensure your implementation fits with our implementation of the tree building process.
* **Step 3:** Calculate the number of misclassified examples in both groups of data and use the above formula to compute the **classification error**.
* **Step 4:** If the computed error is smaller than the best error found so far, store this **feature and its error**.

This may seem like a lot, but we have provided pseudocode in the comments in order to help you implement the function correctly.

**Note:** Remember that since we are only dealing with binary features, we do not have to consider thresholds for real-valued features. This makes the implementation of this function much easier.

Fill in the places where you find `## YOUR CODE HERE`. There are **five** places in this function for you to fill in.

In [53]:
def best_splitting_feature(data, features, target):
    
    best_feature = None # Keep track of the best feature 
    best_error = 10     # Keep track of the best error so far 
    # Note: Since error is always <= 1, we should initialize it with something larger than 1.

    # Convert to float to make sure error gets computed correctly.
    num_data_points = float(len(data))  
    # print "num_data_points = %d" %(num_data_points)
    # print data[features].head(1)
    
    # Loop through each feature to consider splitting on that feature
    for feature in features:
        
        print "\n************Trying out feature:", feature
        
        # The left split will have all data points where the feature value is 0
        left_split = data[ data[feature] == 0 ]
        #print "left split",left_split[feature] #this will just split out a list of 0's
        print "num_data_points in (data[%s] == 0) i.e. left_split = %d " %(feature,len(left_split))
        
        # The right split will have all data points where the feature value is 1
        right_split =  data[ data[feature] == 1 ]
        #print "right split", right_split[feature] #this will just split out a list of 1's
        print "num_data_points (data[%s] == 1) i.e. right_split = %d " %(feature,len(right_split))
        
        
        # Calculate the number of misclassified examples in the left split.
        left_mistakes = intermediate_node_num_mistakes((left_split[target]))
        
        # The call above counts how many target labels are +ve and how many
        # are -ve among the rows where the feature value is 0, then takes the
        # majority label for that group. It reuses intermediate_node_num_mistakes():
        # if the majority is +ve, the mistakes are simply the count of -ves
        # in the group (and vice versa).
        
             
        # Calculate the number of misclassified examples in the right split.
        right_mistakes = intermediate_node_num_mistakes((right_split[target]))
        
        # Same as above, for the rows where data[feature] == 1: find the
        # majority class and count the data points of the minority class.
        
        #print num_data_points == (len(left_split) + len(right_split))                  
        
            
        # -------------------------------------------------------------------------    
        # Compute the classification error of this split, i.e. if we chose this
        # feature to split on, how pure would the two resulting nodes be?
        # Error = (# of mistakes (left) + # of mistakes (right)) / (# of data points)
        error = (left_mistakes + right_mistakes)/num_data_points
        print "Error using feature: %s for split is %0.4f " %(feature, error)
        
        # If this is the best error we have found so far, store the feature as best_feature and the error as best_error
        if error < best_error:
            print "\nFOUND LOWER ERROR using feature: %s, error is %0.4f, previous best error %0.4f" \
                                             %(feature, error, best_error)
            best_feature = feature
            best_error = error
            
    
    return best_feature # Return the best feature we found
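
To watch the greedy choice on data small enough to check by hand, here is a quiet, self-contained variant run on a made-up frame. The names `count_mistakes` and `pick_best_feature` and the `toy` data are hypothetical; this is a sketch of the same idea, not a replacement for the function above:

```python
import pandas as pd

def count_mistakes(labels):
    # Minority-class count = mistakes made by a majority-class prediction.
    return min((labels == 1).sum(), (labels == -1).sum())

def pick_best_feature(data, features, target):
    # Greedy search: the feature with the lowest split error wins.
    best_feature, best_error = None, float('inf')
    for feature in features:
        left = data[data[feature] == 0][target]    # rows with feature value 0
        right = data[data[feature] == 1][target]   # rows with feature value 1
        error = (count_mistakes(left) + count_mistakes(right)) / float(len(data))
        if error < best_error:
            best_feature, best_error = feature, error
    return best_feature, best_error

toy = pd.DataFrame({
    'x1': [0, 0, 1, 1],
    'x2': [0, 1, 0, 1],
    'y':  [-1, -1, 1, 1],
})
print(pick_best_feature(toy, ['x2', 'x1'], 'y'))
```

Here `x2` leaves one mistake on each side (error 0.5), while `x1` separates the classes perfectly (error 0), so `x1` is chosen even though it is tried second. Because the comparison is a strict `<`, ties go to the feature tried first.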

To test your `best_splitting_feature` function, run the following code:

In [54]:
print train_data.dtypes
print len(train_data.dtypes)

# drop 'safe_loans' from features by name, so the result does not depend on column order
features = [col for col in train_data.columns if col != 'safe_loans']
print features
print len(features)

safe_loans      int64
A             float64
B             float64
C             float64
D             float64
E             float64
F             float64
G             float64
 36 months    float64
 60 months    float64
MORTGAGE      float64
OTHER         float64
OWN           float64
RENT          float64
1 year        float64
10+ years     float64
2 years       float64
3 years       float64
4 years       float64
5 years       float64
6 years       float64
7 years       float64
8 years       float64
9 years       float64
< 1 year      float64
n/a           float64
dtype: object
26
['A', 'B', 'C', 'D', 'E', 'F', 'G', ' 36 months', ' 60 months', 'MORTGAGE', 'OTHER', 'OWN', 'RENT', '1 year', '10+ years', '2 years', '3 years', '4 years', '5 years', '6 years', '7 years', '8 years', '9 years', '< 1 year', 'n/a']
25


In [55]:
(14876.0 + 1258)/ (14876 + 1258+ 17218+3872) 

0.4334300451321728

In [56]:
if best_splitting_feature(train_data, features, 'safe_loans') == ' 36 months':
    print 'Test passed!'
else:
    print 'Test failed... try again!'


************Trying out feature: A
num_data_points in (data[A] == 0) i.e. left_split = 32094 
num_data_points (data[A] == 1) i.e. right_split = 5130 
(num risky: 17218 , num safe: 14876) in the split.
(num risky: 1258 , num safe: 3872) in the split.
Error using feature: A for split is 0.4334 

FOUND LOWER ERROR using feature: A, error is 0.4334, previous best error 10.0000

************Trying out feature: B
num_data_points in (data[B] == 0) i.e. left_split = 26858 
num_data_points (data[B] == 1) i.e. right_split = 10366 
(num risky: 14133 , num safe: 12725) in the split.
(num risky: 4343 , num safe: 6023) in the split.
Error using feature: B for split is 0.4585 

************Trying out feature: C
num_data_points in (data[C] == 0) i.e. left_split = 27812 
num_data_points (data[C] == 1) i.e. right_split = 9412 
(num risky: 13562 , num safe: 14250) in the split.
(num risky: 4914 , num safe: 4498) in the split.
Error using feature: C for split is 0.4852 

************Trying out feature: D


## Building the tree

With the above functions implemented correctly, we are now ready to build our decision tree. Each node in the decision tree is represented as a dictionary which contains the following keys and possible values:

    { 
       'splitting_feature'  : The feature that this node splits on.    
       'left'               : (dictionary object corresponding to the left tree)
       'right'              : (dictionary object corresponding to the right tree).
       'is_leaf'            : True/False.
       'prediction'         : Prediction at the leaf node, +1 or -1
    }
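
To make the node layout concrete, here is a hand-built tree with one split and two leaves, together with a small traversal. The feature name `term_36` and the helper `classify_one` are hypothetical; the left branch is taken for feature value 0 and the right branch for 1, matching the split convention used above:

```python
# Hand-built tree: one intermediate node with two leaves.
toy_tree = {
    'splitting_feature': 'term_36',
    'is_leaf': False,
    'prediction': None,
    'left': {'splitting_feature': None, 'left': None, 'right': None,
             'is_leaf': True, 'prediction': -1},
    'right': {'splitting_feature': None, 'left': None, 'right': None,
              'is_leaf': True, 'prediction': +1},
}

def classify_one(tree, x):
    # Follow 'left' when the feature is 0 and 'right' when it is 1,
    # until a leaf is reached.
    while not tree['is_leaf']:
        if x[tree['splitting_feature']] == 0:
            tree = tree['left']
        else:
            tree = tree['right']
    return tree['prediction']

print(classify_one(toy_tree, {'term_36': 1}))
```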

**10.** First, we will write a function that creates a leaf node given a set of target values. The main thing the function does is create a leaf node and set its prediction to the class with the largest count (a majority class classifier). For a true leaf node, `splitting_feature`, `left` and `right` always remain `None`, and `is_leaf` is `True`. Intermediate nodes have `splitting_feature` set to the feature the node is split on, `left` set to the resulting left branch (left_tree), and `right` set to the right tree in the same way. Intermediate nodes are not built by `create_leaf`: see the return statement of *decision_tree_create(data, features, target, current_depth = 0, max_depth = 10)*, where `is_leaf` is set to `False` and `prediction` is set to `None` for them. *create_leaf* is only called by the **3 stopping conditions** below and when we hit a **pure node**.

In [57]:
def create_leaf(target_values):
    
    # Create a leaf node
    leaf = {'splitting_feature' : None, # unknown at this stage
            'left' : None,              # unknown at this stage
            'right' : None,             # unknown at this stage
            'is_leaf': True     #set to True cause we are creating a leaf
           }  
    
    # Count the number of data points that are +1 and -1 in this node.
    num_ones =      len(target_values[target_values == +1])
    num_minus_ones = len(target_values[target_values == -1])
    
    # For the leaf node, set the prediction to be the majority class.
    # Store the predicted class (1 or -1) in leaf['prediction']
    if num_ones > num_minus_ones:
        leaf['prediction'] =  +1        ## YOUR CODE HERE
    else:
        leaf['prediction'] =  -1        ## YOUR CODE HERE
        
    # Return the leaf node        
    return leaf 
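
As a quick check of the majority rule, the sketch below mirrors the counting logic of `create_leaf` on plain Python lists. `majority_prediction` is a hypothetical helper, not part of the assignment. Note that a tie goes to -1, because the comparison is a strict `>`.

```python
def majority_prediction(target_values):
    """Return +1 if +1 is the strict majority, otherwise -1 (same rule as create_leaf)."""
    num_ones = sum(1 for t in target_values if t == +1)
    num_minus_ones = sum(1 for t in target_values if t == -1)
    return +1 if num_ones > num_minus_ones else -1

clear_majority = majority_prediction([+1, +1, -1])   # two +1s against one -1
tie = majority_prediction([+1, -1])                  # equal counts fall through to -1
empty = majority_prediction([])                      # no data also falls through to -1
```

The tie-breaking direction matters little on large nodes, but it does decide the prediction for very small or empty nodes.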

We have provided a function that learns the decision tree recursively and implements 3 stopping conditions:
1. **Stopping condition 1:** All data points in a node are from the same class, so stop and return this node as a leaf. No further subtrees or left/right splits.
2. **Stopping condition 2:** No more features to split on. We have used all the features for split decisions, so return a leaf.
3. **Additional stopping condition:** In addition to the above two stopping conditions covered in lecture, in this assignment we will also consider a stopping condition based on the **max_depth** of the tree. By not letting the tree grow too deep, we save computational effort in the learning process and reduce the risk of overfitting.
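
The three conditions can be sketched as a single check that is independent of pandas. This is an illustration under assumptions, not the assignment's code: `stopping_reason` is a hypothetical helper, and it takes the labels as a plain list.

```python
def stopping_reason(target_values, remaining_features, current_depth, max_depth):
    """Return which stopping condition applies, or None if the node should be split."""
    # Condition 1: every label is the same (this also covers an empty node).
    if len(set(target_values)) <= 1:
        return 'pure node'
    # Condition 2: the feature list is empty. In Python, [] == 0 is False,
    # so emptiness has to be tested with len() (or `not remaining_features`).
    if len(remaining_features) == 0:
        return 'no features left'
    # Condition 3: the depth cap has been reached.
    if current_depth >= max_depth:
        return 'max depth reached'
    return None

reason_pure = stopping_reason([+1, +1, +1], ['A'], 0, 3)
reason_features = stopping_reason([+1, -1], [], 1, 3)
reason_depth = stopping_reason([+1, -1], ['A'], 3, 3)
reason_none = stopping_reason([+1, -1], ['A'], 1, 3)
```

The order of the checks matters only for which message is reported; any one of them is enough to turn the node into a leaf.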

**11.** Now, we will write down the skeleton of the learning algorithm. Fill in the places where you find `## YOUR CODE HERE`. There are **seven** places in this function for you to fill in.

In [58]:
def decision_tree_create(data, features, target, current_depth = 0, max_depth = 10):
    remaining_features = features[:] # Make a copy of the features.
    
    target_values = data[target]
    
    print "Subtree, current depth = %s (num data points %s)." % (current_depth, len(target_values))
    # root node will be our special case of subtree when we start processing!!

    # STOPPING CONDITION 1
    # Check if there are mistakes at the current node.
    if  intermediate_node_num_mistakes(target_values) == 0:  
        print "####### STOPPING CONDITION 1 reached ################# No mistakes - PURE node! make current node a leaf."     
        # If there are no mistakes at the current node, make it a leaf node
        return create_leaf(target_values)
    
    # STOPPING CONDITION 2 (check if there are remaining features to consider splitting on)
    # An empty list means every feature has already been used for a split.
    # (Comparing the list itself to 0 would always be False, so test its length.)
    if len(remaining_features) == 0:   
        print "####### STOPPING CONDITION 2 reached ################# All features used for split - can't go further, make current node a leaf."    
        # If there are no remaining features to consider, make current node a leaf node
        return create_leaf(target_values)    
    
    # STOPPING CONDITION 3
    # Additional stopping condition (limit tree depth)
    if current_depth >= max_depth:  
        print "####### STOPPING CONDITION 3 reached ################# Reached maximum depth, not allowed to grow any taller trees. Stopping after creating a leaf at the current node."
        # If the max tree depth has been reached, make current node a leaf node
        return create_leaf(target_values)

    # Find the best splitting feature (recall the function best_splitting_feature implemented above)
    # We assume this function always finds and returns a feature, so there is no error checking
    splitting_feature = best_splitting_feature(data, features, target)
    
    
    # Split on the best feature that we found. 
    left_split  = data[ data[splitting_feature] == 0 ]
    right_split = data[ data[splitting_feature] == 1 ]      
    remaining_features.remove(splitting_feature)      #remove feature we split on
    print "\n\nSPLIT ON FEATURE that gives lowest error -> %s. (Left = %s, Right = %s)" % (\
                      splitting_feature, len(left_split), len(right_split))
    
    # Create a leaf node if the split sends every data point to one side.
    # (The node itself need not be pure; the split simply separates nothing.)
    if len(left_split) == len(data):
        print "CREATING A PURE LEFT LEAF NODE."
        return create_leaf(left_split[target])
    if len(right_split) == len(data):
        print "CREATING A PURE RIGHT LEAF NODE"
        return create_leaf(right_split[target])

    # Now enter recursive binary splitting, a top-down, greedy approach.
    # Start at the top of the tree (the point at which all observations belong to a single region)
    # and then successively split the predictor space;
    # each split is indicated via two new branches further down the tree.
    # remaining_features no longer contains the feature we split on above,
    # current_depth increases by one as we go one level deeper, and target is just 'safe_loans'.
    print "########### Growing left branch of tree further ##############"
    # Takes dataframe where value for predictor_feature (splitting_feature) is 0 and
    # subset of features and grows decision tree on that branch
    left_tree = decision_tree_create(left_split, remaining_features, target, current_depth + 1, max_depth)        
    
    
    print "########### Growing right branch of tree further ##############"
    # Takes dataframe where value for predictor_feature (splitting_feature) is 1 and
    # subset of features and grows decision tree on that branch    
    right_tree = decision_tree_create(right_split, remaining_features, target, current_depth + 1, max_depth)        

    # We now have two dictionary objects, left_tree and right_tree,
    # so the node we are at cannot be a leaf: is_leaf is False.
    # The left_tree and right_tree dictionaries are stored at this node.
    # All of this happens in the return statement below,
    # which is where the recursion comes together.
    
    return {'is_leaf'          : False,   # we split left/right above, so this can't be a leaf
            'prediction'       : None,    # only leaves carry a prediction (set in create_leaf)
            'splitting_feature': splitting_feature,  #what we used in split above
            'left'             : left_tree, 
            'right'            : right_tree}

Here is a recursive function to count the nodes in your tree:

In [59]:
def count_nodes(tree):
    if tree['is_leaf']:
        return 1
    return 1 + count_nodes(tree['left']) + count_nodes(tree['right'])
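
A binary tree of depth d has at most 2^(d+1) - 1 nodes, so the node count also shows whether any branch stopped before the depth cap. The sketch below applies the same counting recursion to a small hand-built tree; `leaf`, `node` and `max_nodes` are hypothetical helpers used only to keep the example short.

```python
def count_nodes(tree):
    # Same recursion as above: one for this node plus the nodes in both subtrees.
    if tree['is_leaf']:
        return 1
    return 1 + count_nodes(tree['left']) + count_nodes(tree['right'])

def leaf(prediction):
    return {'is_leaf': True, 'prediction': prediction,
            'splitting_feature': None, 'left': None, 'right': None}

def node(feature, left, right):
    return {'is_leaf': False, 'prediction': None,
            'splitting_feature': feature, 'left': left, 'right': right}

def max_nodes(depth):
    """Largest possible node count for a binary tree of the given depth."""
    return 2 ** (depth + 1) - 1

# Depth-2 tree in which the right child of the root stopped early as a leaf.
toy_tree = node('A', node('B', leaf(-1), leaf(+1)), leaf(+1))

toy_count = count_nodes(toy_tree)   # 2 intermediate nodes + 3 leaves
full_count = max_nodes(2)           # a complete depth-2 tree
```

Here the toy tree has 5 nodes against a maximum of 7, which reflects the one branch that became a leaf at depth 1.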

Run the following test code to check your implementation. Make sure you get **'Test passed'** before proceeding.
Note that `max_depth = 3` passed into `decision_tree_create()` overrides the default value `max_depth = 10` in the function definition above.

In [60]:
small_data_decision_tree = decision_tree_create(train_data, features, 'safe_loans', max_depth = 3)
if count_nodes(small_data_decision_tree) == 13:
    print 'Test passed!'
else:
    print 'Test failed... try again!'
    print 'Number of nodes found                :', count_nodes(small_data_decision_tree)
    print 'Number of nodes that should be there : 13' 

Subtree, current depth = 0 (num data points 37224).
(num risky: 18476 , num safe: 18748) in the split.

************Trying out feature: A
num_data_points in (data[A] == 0) i.e. left_split = 32094 
num_data_points (data[A] == 1) i.e. right_split = 5130 
(num risky: 17218 , num safe: 14876) in the split.
(num risky: 1258 , num safe: 3872) in the split.
Error using feature: A for split is 0.4334 

FOUND LOWER ERROR using feature: A, error is 0.4334, previous best error 10.0000

************Trying out feature: B
num_data_points in (data[B] == 0) i.e. left_split = 26858 
num_data_points (data[B] == 1) i.e. right_split = 10366 
(num risky: 14133 , num safe: 12725) in the split.
(num risky: 4343 , num safe: 6023) in the split.
Error using feature: B for split is 0.4585 

************Trying out feature: C
num_data_points in (data[C] == 0) i.e. left_split = 27812 
num_data_points (data[C] == 1) i.e. right_split = 9412 
(num risky: 13562 , num safe: 14250) in the split.
(num risky: 4914 , num sa

## Build the tree!

**12.** Now train a tree model on the **train_data**. Limit the depth to 6 (**max_depth = 6**) to make sure the algorithm doesn't run for too long. Call this tree **my_decision_tree**. 

**Warning**: This code block may take 1-2 minutes to learn.  Only 20-30s using pandas!

In [61]:
# Make sure to cap the depth at 6 by using max_depth = 6

my_decision_tree = decision_tree_create(train_data, features, 'safe_loans', max_depth = 6)

Subtree, current depth = 0 (num data points 37224).
(num risky: 18476 , num safe: 18748) in the split.

************Trying out feature: A
num_data_points in (data[A] == 0) i.e. left_split = 32094 
num_data_points (data[A] == 1) i.e. right_split = 5130 
(num risky: 17218 , num safe: 14876) in the split.
(num risky: 1258 , num safe: 3872) in the split.
Error using feature: A for split is 0.4334 

FOUND LOWER ERROR using feature: A, error is 0.4334, previous best error 10.0000

************Trying out feature: B
num_data_points in (data[B] == 0) i.e. left_split = 26858 
num_data_points (data[B] == 1) i.e. right_split = 10366 
(num risky: 14133 , num safe: 12725) in the split.
(num risky: 4343 , num safe: 6023) in the split.
Error using feature: B for split is 0.4585 

************Trying out feature: C
num_data_points in (data[C] == 0) i.e. left_split = 27812 
num_data_points (data[C] == 1) i.e. right_split = 9412 
(num risky: 13562 , num safe: 14250) in the split.
(num risky: 4914 , num sa

In [62]:
my_decision_tree

{'is_leaf': False,
 'left': {'is_leaf': False,
  'left': {'is_leaf': False,
   'left': {'is_leaf': False,
    'left': {'is_leaf': False,
     'left': {'is_leaf': False,
      'left': {'is_leaf': True,
       'left': None,
       'prediction': -1,
       'right': None,
       'splitting_feature': None},
      'prediction': None,
      'right': {'is_leaf': True,
       'left': None,
       'prediction': -1,
       'right': None,
       'splitting_feature': None},
      'splitting_feature': 'E'},
     'prediction': None,
     'right': {'is_leaf': True,
      'left': None,
      'prediction': -1,
      'right': None,
      'splitting_feature': None},
     'splitting_feature': 'D'},
    'prediction': None,
    'right': {'is_leaf': True,
     'left': None,
     'prediction': -1,
     'right': None,
     'splitting_feature': None},
    'splitting_feature': 'C'},
   'prediction': None,
   'right': {'is_leaf': False,
    'left': {'is_leaf': True,
     'left': None,
     'prediction': -1,
     '

## Making predictions with a decision tree

**13.** As discussed in the lecture, we can make predictions from the decision tree with a simple recursive function. Below, we call this function `classify`, which takes in a learned `tree` (such as `my_decision_tree`) and a test data point `x` to classify. We include an option `annotate` that describes the prediction path when set to `True`.

In [63]:
def classify(tree, x, annotate = False):   
    # if the node is a leaf node.
    # x = x.to_dict()
    if tree['is_leaf']:
        if annotate: 
            print "At leaf, predicting %s" % tree['prediction']
        return tree['prediction'] 
    else:
        # split on feature.
        split_feature_value = x[tree['splitting_feature']]#.item()
        if annotate: 
            print "\nSplit on %s = %s" % (tree['splitting_feature'], split_feature_value)
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate)
        else:
            return classify(tree['right'], x, annotate)
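
The same walk can also be written as a loop that records the features visited. This is a sketch, not part of the assignment: `prediction_path`, `leaf` and `node` are hypothetical helpers, the toy tree is made up, and `x` is assumed to support lookup by feature name (a dict here).

```python
def prediction_path(tree, x):
    """Return (prediction, [(feature, value), ...]) for one data point x."""
    path = []
    while not tree['is_leaf']:
        feature = tree['splitting_feature']
        value = x[feature]
        path.append((feature, value))
        # Value 0 goes left; anything else goes right, as in classify().
        tree = tree['left'] if value == 0 else tree['right']
    return tree['prediction'], path

def leaf(prediction):
    return {'is_leaf': True, 'prediction': prediction,
            'splitting_feature': None, 'left': None, 'right': None}

def node(feature, left, right):
    return {'is_leaf': False, 'prediction': None,
            'splitting_feature': feature, 'left': left, 'right': right}

toy_tree = node('A', node('B', leaf(-1), leaf(+1)), leaf(+1))

prediction, path = prediction_path(toy_tree, {'A': 0, 'B': 1})
```

For this point the walk goes left at `A` and right at `B`, so `path` holds those two steps and the prediction is the label of the leaf reached.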

**14.** Now, let's consider the first example of the test set and see what `my_decision_tree` model predicts for this data point.

In [64]:
test_data[0:1]

Unnamed: 0,safe_loans,A,B,C,D,E,F,G,36 months,60 months,...,2 years,3 years,4 years,5 years,6 years,7 years,8 years,9 years,< 1 year,n/a
24,-1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [65]:
print test_data.iloc[0]

safe_loans   -1.0
A             0.0
B             0.0
C             0.0
D             1.0
E             0.0
F             0.0
G             0.0
 36 months    0.0
 60 months    1.0
MORTGAGE      0.0
OTHER         0.0
OWN           0.0
RENT          1.0
1 year        0.0
10+ years     0.0
2 years       1.0
3 years       0.0
4 years       0.0
5 years       0.0
6 years       0.0
7 years       0.0
8 years       0.0
9 years       0.0
< 1 year      0.0
n/a           0.0
Name: 24, dtype: float64


In [66]:
print 'Predicted class: %s ' % classify(my_decision_tree, test_data.iloc[0])

Predicted class: -1 


**15.** Let's add some annotations to our prediction to see the prediction path that led to this predicted class. Note: `annotate=True` is passed into the function, overriding the default `annotate=False` in the function definition.

In [67]:
classify(my_decision_tree, test_data.iloc[0], annotate=True)


Split on  36 months = 0.0

Split on A = 0.0

Split on B = 0.0

Split on C = 0.0

Split on D = 1.0
At leaf, predicting -1


-1

We follow our prediction path through the decision tree below.


At the root (first node), depth 0: not a leaf.
The node holds 18476 -1s and 18748 +1s, the class counts of the target over the full training set.

So we need to go further into the tree (the else branch).
We check the dictionary to see which feature the root is split on:
splitting_feature is '36 months'.
We then look up the value of the '36 months' column in our test data point.
It is 0, meaning this loan's term is not '36 months'.
Since the feature value is 0, we follow the LEFT branch of the tree.

classify(my_decision_tree, test_data.iloc[0], annotate=True)
    if tree['is_leaf']: FALSE  
    else:
        tree['splitting_feature'] --> 36 months, split_feature_value --> 0.0
        feature = 36 months, value is 0 for the test row with index 24
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate)

-----------------------------------------------------------------------        
classify(tree['left'], x, annotate)  depth 1

Now we use tree['left'], the node containing all training data where feature '36 months' has value 0.
This is not a leaf, just a left node with 6002 -1s and 3211 +1s.

We check the dictionary to see which feature this node is split on:
splitting_feature is 'A' (grade A).
We then look up the value of feature A in our test data point.
It is 0, meaning this loan's grade is not A, so we need to go further into the tree.
Since the feature value is 0, we follow the LEFT branch of the tree.


classify(tree['left'], x, annotate)        
    if tree['is_leaf']: FALSE 
    else:
        tree['splitting_feature'] --> A, split_feature_value --> 0.0
        feature = A, value is 0 for the test row with index 24
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate)
-----------------------------------------------------------------------  
classify(tree['left'], x, annotate)  depth 2


Now we check tree['left'], the node containing all training data where feature A has value 0.
This is not a leaf, just a left node with 5936 -1s and 3151 +1s.

We check the dictionary to see which feature this node is split on:
splitting_feature is 'B' (grade B).
We then look up the value of feature B in our test data point.
It is 0, meaning this loan's grade is not B, so we need to go further into the tree.
Since the feature value is 0, we follow the LEFT branch of the tree.



classify(tree['left'], x, annotate)        
    if tree['is_leaf']: FALSE 
    else:
        tree['splitting_feature'] --> B, split_feature_value --> 0.0
        feature = B, value is 0 for the test row with index 24
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate) 
-----------------------------------------------------------------------    
classify(tree['left'], x, annotate)  depth 3


Now we go to tree['left'], the node containing all training data where feature B has value 0.
This is not a leaf, just a left node with 5376 -1s and 2698 +1s.

We check the dictionary to see which feature this node is split on:
splitting_feature is 'C' (grade C).
We then look up the value of feature C in our test data point.
It is 0, meaning this loan's grade is not C, so we need to go further into the tree.
Since the feature value is 0, we follow the LEFT branch of the tree.


classify(tree['left'], x, annotate)        
    if tree['is_leaf']: FALSE 
    else:
        tree['splitting_feature'] --> C, split_feature_value --> 0.0
        feature = C, value is 0 for the test row with index 24
        if split_feature_value == 0:
            return classify(tree['left'], x, annotate)   
        
-----------------------------------------------------------------------   
classify(tree['left'], x, annotate)  depth 4


Now we go to tree['left'], the node containing all training data where feature C has value 0.
This is not a leaf, just a left node with 4100 -1s and 1784 +1s.

We check the dictionary to see which feature this node is split on:
splitting_feature is 'D' (grade D).
We then look up the value of feature D in our test data point.
It is 1, meaning this loan's grade is D. This node is not a leaf, so we need to go further into the tree.
Since the feature value is 1, we follow the RIGHT branch of the tree.


classify(tree['left'], x, annotate)        
    if tree['is_leaf']: FALSE 
    else:
        tree['splitting_feature'] --> D, split_feature_value --> 1.0
        feature = D, value is 1 for the test row with index 24
        if split_feature_value == 0:  FALSE
            return classify(tree['left'], x, annotate)  NOT THIS PATH
        else:
            return classify(tree['right'], x, annotate)   THIS ***********
-----------------------------------------------------------------------  
classify(tree['right'], x, annotate)  depth 5

At first sight this node looks like just another intermediate node, with 1335 -1s and 723 +1s.

But looking at the tree-growing output, we see that although the best splitting feature
for this node is determined to be E (grade E), no split is made. Why?
See the output below.

SPLIT ON FEATURE that gives lowest error -> E. (Left = 2058, Right = 0)
CREATING A PURE LEFT LEAF NODE.
LEFT BRANCH IS A LEAF - Prediction -1 

len(left_split) == len(data) at this node

Every one of the 2058 data points at this node has E == 0, so a split on E would put
all 2058 points (1335 -1s and 723 +1s) in the left node and none in the right node.
Such a split changes nothing: the left child would be identical to the current node,
and the classification error would stay the same.
Despite the "PURE" wording in the printed message, the node is not pure. The check
that fires is len(left_split) == len(data), not the purity check of stopping condition 1.

So the code takes a shortcut: it turns the node with 1335 -1s and 723 +1s
into a leaf and predicts its majority class, -1.

classify(tree['right'], x, annotate)         
    if tree['is_leaf']: TRUE  - node with 1335 -1s, 723 +1s        
        print "At leaf, predicting %s" % tree['prediction']  - PRINTS -1
        return tree['prediction']  - this return exits the function
        


See page 314 of ISLR for why a split can be worth making even when it does not reduce the classification error.

Paraphrasing that discussion: such a split is performed because it leads to increased node purity. Why is node purity important? Suppose that a test observation falls in a leaf where every training observation has a response value of -1. Then we can be fairly confident that its response value is -1. In contrast, if a test observation falls in a leaf that mixes -1s and +1s, its response value is probably the majority class, but we are MUCH LESS certain. Even though such a split does not reduce the classification error, it improves the Gini index and the cross-entropy, which are more sensitive to node purity.

That argument needs a split that actually separates the training data. The split on grade E here does not: all 2058 observations fall in the left-hand leaf (1335 -1s and 723 +1s) and none in the right-hand leaf, although a test data set may still contain points with E == 1.
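
To see how a split can leave the classification error unchanged while improving the Gini index, here is a worked example with made-up class counts (they are not taken from the loan data). The helper names are hypothetical.

```python
def classification_error(counts):
    """counts = (num_minus_ones, num_plus_ones); error of predicting the majority class."""
    total = float(sum(counts))
    return 1.0 - max(counts) / total

def gini(counts):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = float(sum(counts))
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted(measure, children):
    """Average a node measure over child nodes, weighted by child size."""
    total = float(sum(sum(c) for c in children))
    return sum(sum(c) / total * measure(c) for c in children)

parent = (16, 4)               # hypothetical node: 16 of one class, 4 of the other
children = [(9, 0), (7, 4)]    # hypothetical split: one pure child, one mixed child

err_before = classification_error(parent)
err_after = weighted(classification_error, children)
gini_before = gini(parent)
gini_after = weighted(gini, children)
```

Both children predict the same class as the parent, so the classification error stays at 0.2 before and after the split, while the Gini index drops from 0.32 to roughly 0.25 because one child is now pure.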

**Quiz question:** What was the feature that **my_decision_tree** first split on while making the prediction for test_data[0]? term.36 months

**Quiz question:** What was the first feature that led to a right split of test_data[0]? grade.D

**Quiz question:** What was the last feature split on before reaching a leaf node for test_data[0]? grade.D.
grade.E was considered next, as it gave the lowest error rate, but every data point about to be split on E had value 0 for feature E, so the right branch would have been empty. The split was never made and the intermediate node was converted to a leaf node.

## Evaluating your decision tree

**16.** Now, we will write a function to evaluate a decision tree by computing the classification error of the tree on the given dataset.

Again, recall that the **classification error** is defined as follows:
$$
\mbox{classification error} = \frac{\mbox{# mistakes}}{\mbox{# total examples}}
$$

Now, write a function called `evaluate_classification_error` that takes in as input:
1. `tree` (as described above)
2. `data` (a pandas DataFrame in this notebook; an SFrame in the original assignment)
3. `target` (a string - the name of the target/label column)

This function should calculate a prediction (class label) for each row in `data` using the decision `tree` and return the classification error computed using the above formula. 
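
The formula can be checked on a handful of labels before applying it to the full dataset. The sketch below uses plain lists and made-up values; `classification_error` is a hypothetical stand-in, not the assignment's `evaluate_classification_error`.

```python
def classification_error(predictions, labels):
    """Fraction of positions where the prediction differs from the true label."""
    if len(predictions) != len(labels):
        raise ValueError('predictions and labels must have the same length')
    mistakes = sum(1 for p, y in zip(predictions, labels) if p != y)
    return float(mistakes) / len(labels)

predictions = [-1, -1, +1, +1, -1]
labels      = [-1, +1, +1, -1, -1]

error = classification_error(predictions, labels)   # 2 mistakes out of 5 examples
```

The explicit `float(...)` keeps the division from truncating to an integer under Python 2, which is what the `1.0 *` factor does in the notebook's own function.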

Note that when using `.apply()` on a pandas DataFrame ([pandas.DataFrame.apply()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html)), you also need to supply the `axis` argument. See [also](http://chrisalbon.com/python/pandas_apply_operations_to_dataframes.html).

* axis : {0 or ‘index’, 1 or ‘columns’}, default 0
    * 0 or ‘index’: apply function to each column
    * 1 or ‘columns’: apply function to each row


In [69]:
def evaluate_classification_error(tree, data, target):
    # Apply the classify(tree, x) to each row in your data
    prediction = data.apply(lambda x: classify(tree, x), axis=1)

    # Once you've made the predictions, calculate the classification error and return it
    # correct = sum(prediction == data[target])
    # error = len(data) - correct
    # return 1.0*error/len(data)
    return 1.0 * sum(prediction != data[target])/len(data)
    
    #check https://gist.github.com/why-not/4582705
    #http://stackoverflow.com/questions/36742169/indexing-pandas-series-with-parent-dataframe-index
    #http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-callable
    #http://pandas.pydata.org/pandas-docs/stable/indexing.html

In [None]:
# For loop takes longer - use lambda
#def evaluate_classification_error(tree, data, target):
#    prediction = list()
#    for i in range(len(data)):
#        prediction.append( classify(tree, data.iloc[i], annotate=False))
#    return 1.0 * sum(prediction != data[target])/len(data)

**17.** Now, let's use this function to evaluate the classification error on the test set.

In [70]:
# print len(test_data)
print round(evaluate_classification_error(my_decision_tree, test_data, target),2)

0.38


**Quiz Question:** Rounded to 2nd decimal point, what is the classification error of **my_decision_tree** on the **test_data**? 0.38

## Printing out a decision stump

**18.** As discussed in the lecture, we can print out a single decision stump (printing out the entire tree is left as an exercise to the curious reader). The `name` parameter is the label printed above the stump; in the calls on subtrees below it is the splitting feature of the parent node.

In [71]:
def print_stump(tree, name = 'root'):
    split_name = tree['splitting_feature'] # split_name is something like 'term. 36 months'
    if split_name is None:
        print "(leaf, label: %s)" % tree['prediction']
        return None
    
    #split_feature, split_value = split_name.split('.')
    print '                       %s' % name
    print '         |---------------|----------------|'
    print '         |                                |'
    print '         |                                |'
    print '         |                                |'
    print '  [{0} == 0]               [{0} == 1]    '.format(split_name)
    print '         |                                |'
    print '         |                                |'
    print '         |                                |'
    print '    (%s)                         (%s)' \
        % (('leaf, label: ' + str(tree['left']['prediction']) if tree['left']['is_leaf'] else 'subtree'),
           ('leaf, label: ' + str(tree['right']['prediction']) if tree['right']['is_leaf'] else 'subtree'))

In [72]:
def print_all(tree, name = 'root'):
    split_name = tree['splitting_feature'] # split_name is something like 'term. 36 months'
    if split_name is None:
        print "(leaf, label: %s)" % tree['prediction']
        return None
    
    #split_feature, split_value = split_name.split('.')
    print '                       %s' % name
    print '         |---------------|----------------|'
    print '         |                                |'
    print '         |                                |'
    print '         |                                |'
    print '  [{0} == 0]               [{0} == 1]    '.format(split_name)
    print '         |                                |'
    print '         |                                |'
    print '         |                                |'
    print '    (%s)                         (%s)' \
        % (('leaf, label: ' + str(tree['left']['prediction']) \
            if tree['left']['is_leaf'] else print_all(tree['left'], split_name)),
           
           ('leaf, label: ' + str(tree['right']['prediction']) \
            if tree['right']['is_leaf'] else print_all(tree['right'], split_name)))
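
For larger trees, an indented layout with one line per node can be easier to read than nested stumps. The following is one possible sketch, not the assignment's solution: `tree_to_lines`, `leaf` and `node` are hypothetical helpers, and the function returns the text lines instead of printing them so the caller decides how to display them.

```python
def tree_to_lines(tree, indent=0):
    """Return a list of text lines describing the tree, one node or branch per line."""
    pad = '    ' * indent
    if tree['is_leaf']:
        return ['%s(leaf, label: %s)' % (pad, tree['prediction'])]
    feature = tree['splitting_feature']
    lines = ['%s[%s == 0]' % (pad, feature)]
    lines += tree_to_lines(tree['left'], indent + 1)    # subtree for value 0
    lines.append('%s[%s == 1]' % (pad, feature))
    lines += tree_to_lines(tree['right'], indent + 1)   # subtree for value 1
    return lines

def leaf(prediction):
    return {'is_leaf': True, 'prediction': prediction,
            'splitting_feature': None, 'left': None, 'right': None}

def node(feature, left, right):
    return {'is_leaf': False, 'prediction': None,
            'splitting_feature': feature, 'left': left, 'right': right}

toy_tree = node('A', node('B', leaf(-1), leaf(+1)), leaf(+1))
lines = tree_to_lines(toy_tree)
print('\n'.join(lines))
```

Because each recursive call returns its own lines, nothing is printed out of order and no branch ends up displayed as `None`.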
    

**19.**  Using this function, we can print out the root of the decision tree.

In [73]:
print_all(my_decision_tree)

                       root
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [ 36 months == 0]               [ 36 months == 1]    
         |                                |
         |                                |
         |                                |
                        36 months
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [A == 0]               [A == 1]    
         |                                |
         |                                |
         |                                |
                       A
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [B == 0]               [B

In [74]:
print_stump(my_decision_tree)

                       root
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [ 36 months == 0]               [ 36 months == 1]    
         |                                |
         |                                |
         |                                |
    (subtree)                         (subtree)


**Quiz Question:** What is the feature that is used for the split at the root node?

### Exploring the intermediate left subtree

The tree is a recursive dictionary, so we do have access to all the nodes! We can use
* `my_decision_tree['left']` to go left
* `my_decision_tree['right']` to go right
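
Chained lookups get long for deep nodes, so a small helper that follows a list of steps can help. This is a sketch under assumptions: `get_subtree`, `leaf` and `node` are hypothetical helpers and the toy tree is made up.

```python
def get_subtree(tree, path):
    """Follow a sequence of 'left'/'right' steps starting from the given node."""
    for step in path:
        if step not in ('left', 'right'):
            raise ValueError("each step must be 'left' or 'right'")
        if tree['is_leaf']:
            raise ValueError('reached a leaf before the path ended')
        tree = tree[step]
    return tree

def leaf(prediction):
    return {'is_leaf': True, 'prediction': prediction,
            'splitting_feature': None, 'left': None, 'right': None}

def node(feature, left, right):
    return {'is_leaf': False, 'prediction': None,
            'splitting_feature': feature, 'left': left, 'right': right}

toy_tree = node('A', node('B', leaf(-1), leaf(+1)), leaf(+1))

inner = get_subtree(toy_tree, ['left'])              # same as toy_tree['left']
deep_leaf = get_subtree(toy_tree, ['left', 'right']) # same as toy_tree['left']['right']
```

An empty path returns the node it was given, so the helper can be used uniformly for the root and for any descendant.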

**20.** We can print out the left subtree:

In [75]:
print_stump(my_decision_tree['left'], my_decision_tree['splitting_feature'])

                        36 months
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [A == 0]               [A == 1]    
         |                                |
         |                                |
         |                                |
    (subtree)                         (subtree)


### Exploring the left subtree of the left subtree


In [76]:
print_stump(my_decision_tree['left']['left'], my_decision_tree['left']['splitting_feature'])

                       A
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [B == 0]               [B == 1]    
         |                                |
         |                                |
         |                                |
    (subtree)                         (subtree)


**Quiz question:** What is the path of the **first 3 feature splits** considered along the **left-most** branch of **my_decision_tree**?

**Quiz question:** What is the path of the **first 3 feature splits** considered along the **right-most** branch of **my_decision_tree**?
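
One way to read off such a path without printing each stump by hand is to keep stepping in one direction and collect the splitting features. The sketch below assumes the dictionary layout used in this notebook; `branch_features`, `leaf` and `node` are hypothetical helpers and the toy tree is made up.

```python
def branch_features(tree, side, max_splits=3):
    """Collect splitting features while always stepping to `side` ('left' or 'right')."""
    if side not in ('left', 'right'):
        raise ValueError("side must be 'left' or 'right'")
    features = []
    while not tree['is_leaf'] and len(features) < max_splits:
        features.append(tree['splitting_feature'])
        tree = tree[side]
    return features

def leaf(prediction):
    return {'is_leaf': True, 'prediction': prediction,
            'splitting_feature': None, 'left': None, 'right': None}

def node(feature, left, right):
    return {'is_leaf': False, 'prediction': None,
            'splitting_feature': feature, 'left': left, 'right': right}

toy_tree = node('A', node('B', leaf(-1), leaf(+1)), leaf(+1))

left_most = branch_features(toy_tree, 'left')    # stops early when a leaf is reached
right_most = branch_features(toy_tree, 'right')
```

On the toy tree the left-most branch yields two features and the right-most only one, since the walk stops as soon as it reaches a leaf.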

In [77]:
print_stump(my_decision_tree['left']['left']['left'], my_decision_tree['left']['left']['splitting_feature'])

                       B
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [C == 0]               [C == 1]    
         |                                |
         |                                |
         |                                |
    (subtree)                         (leaf, label: -1)


In [78]:
print_stump(my_decision_tree['left']['left']['left']['left'], my_decision_tree['left']['left']['left']['splitting_feature'])

                       C
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [D == 0]               [D == 1]    
         |                                |
         |                                |
         |                                |
    (subtree)                         (leaf, label: -1)


In [79]:
print_stump(my_decision_tree['left']['left']['left']['left'], 'C')

                       C
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [D == 0]               [D == 1]    
         |                                |
         |                                |
         |                                |
    (subtree)                         (leaf, label: -1)


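Is the above the path we take when making a prediction for test_data[0]? Yes: the stumps above split on 36 months, A, B, C and D in turn, and the D == 1 branch ends in a leaf with label -1. This matches the annotated output of

classify(my_decision_tree, test_data.iloc[0], annotate=True)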

In [80]:
print_stump(my_decision_tree['right'], my_decision_tree['splitting_feature'])

                        36 months
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [D == 0]               [D == 1]    
         |                                |
         |                                |
         |                                |
    (subtree)                         (leaf, label: -1)


In [81]:
print_stump(my_decision_tree['right']['right'], my_decision_tree['right']['splitting_feature'])

(leaf, label: -1)


In [82]:
print_stump(my_decision_tree['right']['left'], my_decision_tree['right']['splitting_feature'])

                       D
         |---------------|----------------|
         |                                |
         |                                |
         |                                |
  [E == 0]               [E == 1]    
         |                                |
         |                                |
         |                                |
    (subtree)                         (leaf, label: -1)
