### Decision Trees 
1. <b>ID3 algorithm</b>
2. <b>Storing the tree</b>
3. <b>A prettier tree</b>
4. <b>Automatic prediction</b>
5. <b>Making Multiple Predictions</b>
6. <b>Conclusion</b>

In [1]:
import json
import matplotlib
import warnings
import pandas as pd
import numpy as np
import math
import pickle
from matplotlib import pyplot as plt
from IPython.core.pylabtools import figsize
from IPython.display import Image, display


warnings.simplefilter("ignore")
root = r"/Users/Kenneth-Aristide/anaconda3/bin/python_prog/ML/styles/bmh_matplotlibrc.json"
# Use a context manager so the style file is closed after loading
with open(root) as style_file:
    s = json.load(style_file)
matplotlib.rcParams.update(s)
%matplotlib inline

In [2]:
_headers =['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation',
         'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country']

with open("/Users/Kenneth-Aristide/anaconda3/bin/python_prog/ML/data/income.pickle", "rb") as income_file:
    income = pickle.load(income_file)

### ID3 Algorithm
In the last notebook, we learned about the basics of decision trees, including *entropy* and *information gain*. In this notebook, we'll build on those concepts to construct a full decision tree in Python and make predictions.
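
As a reminder, the two quantities can be written as follows. This is a sketch of the standard definitions as they are used by the helper functions in this notebook; the symbols are chosen here for illustration: $T$ is the set of rows at a node, $p_i$ is the fraction of rows in $T$ whose target equals class $i$, and $T_{\text{left}}$, $T_{\text{right}}$ are the rows at or below and above the median of column $A$.

$$
H(T) = -\sum_{i} p_i \log_2 p_i
$$

$$
IG(T, A) = H(T) - \sum_{v \in \{\text{left},\, \text{right}\}} \frac{|T_v|}{|T|} \, H(T_v)
$$

A column with a higher information gain produces splits whose targets are, on average, less mixed than the targets at the parent node.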

We will use the [ID3 algorithm](https://en.wikipedia.org/wiki/ID3_algorithm) to construct decision trees.
This algorithm involves recursion and an understanding of time complexity.
In general, recursion is the process of splitting a large problem into small chunks. Recursive functions will call themselves, then combine the results to create a final result.

Building trees is a perfect case for a recursive algorithm -- at each node, we'll call a recursive function, which will split the data into two branches. Each branch will lead to a node, and the function will call itself to build out the tree.
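
Before applying recursion to trees, here is a minimal sketch of the same pattern on a simpler problem. The function name `recursive_sum` is made up for this illustration: it splits a list in half, calls itself on each half, and combines the two partial results.

```python
def recursive_sum(values):
    """Sum a list by splitting it in half and combining the two partial sums."""
    # Base cases: nothing left to split.
    if len(values) == 0:
        return 0
    if len(values) == 1:
        return values[0]

    middle = len(values) // 2
    # Recursive case: solve each half separately, then combine the results.
    return recursive_sum(values[:middle]) + recursive_sum(values[middle:])


print(recursive_sum([20, 60, 40, 25, 35, 55]))  # prints 235
```

The base cases play the same role as the pure leaves in a decision tree: they are the point where the function stops calling itself.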

def id3(data, target, columns):
    1. Create a node for the tree
    2. If all values of the target attribute are 1, return the node with label 1
    3. If all values of the target attribute are 0, return the node with label 0
    4. Using information gain, find A, the column that splits the data best
    5. Find the median value in column A
    6. Split A into values below or equal to the median (0) and values above the median (1)
    7. For each possible value vi (0 or 1) of A
        8. Add a new tree branch below Root, corresponding to the rows in the data where A = vi
        9. Let Examples(vi) be the subset of examples that have the value vi for A
        10. Below this new branch, add the subtree id3(data[A==vi], target, columns)
    11. Return Root
    
We've made a minor modification to the algorithm to only make two branches from each node. This will simplify constructing the tree, and make it easier to demonstrate the principles involved.

The recursive nature of the algorithm comes into play in step 10. Every node in the tree is created by a call to the id3 function, and the final tree is the result of all of these calls.

In [3]:
def compute_entropy(column):
    """
    Convenience function:
        Calculate entropy given a pandas Series, list, or numpy array
    """
    counts = np.bincount(column)
    probabilities = counts / len(column)
    
    entropy = 0
    
    for prob in probabilities:
        if prob > 0:
            entropy += prob * math.log(prob, 2)
        
    return -entropy

def compute_information_gain(data, split_name, target_name):
    """
    Calculate information gain given a dataset, column to split on, and target
    """
    total_entropy = compute_entropy(data[target_name])
    
    column = data[split_name]
    median = column.median()
    
    left_split = data[column <= median]
    right_split = data[column > median]
    
    to_subtract = 0
    
    for subset in [left_split, right_split]:
        prob = (subset.shape[0] / data.shape[0])
        to_subtract += prob * compute_entropy(subset[target_name])
        
    return total_entropy - to_subtract


def find_best_column(data, target_name, columns):
    """
    Convenience function:
        find the optimal column for ID3 to split on
    """
    information_gain = {}
    for column in columns:
        information_gain[column] = compute_information_gain(data, column, target_name)
        
    return max(information_gain, key = information_gain.get)

In [4]:
label_0s = []
label_1s = []

def id3(data, target, columns):
    """
    Convenience function:
        id3 algorithm, where we change step 1 and step 2 instead just store labels in a list
        just label we bulding a new tree on the left 0 or right 1 sides
    """
    
    unique_targets = data[target].unique()
    if len(unique_targets) == 1:
        if 0 in unique_targets:
            label_0s.append(0)
        elif 1 in unique_targets:
            label_1s.append(1)
            
        return
    
    best_column = find_best_column(data, target, columns)
    column_median = data[best_column].median()
    
    left_split = data[data[best_column] <= column_median]
    right_split = data[data[best_column] > column_median]
    
    # Loop through the split and call id3 recursively
    for split in [left_split, right_split]:
        id3(split, target, columns) 
        
        

# test
data = pd.DataFrame([
    [0,20,0],
    [0,60,2],
    [0,40,1],
    [1,25,1],
    [1,35,2],
    [1,55,1]
    ])
# Assign column names to the data.
data.columns = ["high_income", "age", "marital_status"]
id3(data, "high_income", ["age", "marital_status"])
print(label_1s)
print(label_0s)

[1, 1, 1]
[0, 0, 0]
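
To see which column the first call to id3 picks on this toy data, the gains can be recomputed by hand. The block below is a standard-library sketch rather than the notebook's own helpers: `entropy` and `information_gain` are stand-in names that mirror `compute_entropy` and `compute_information_gain`, and the rows are the same six toy rows written as tuples.

```python
import math
from statistics import median


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    result = 0.0
    for value in set(labels):
        prob = labels.count(value) / total
        result -= prob * math.log2(prob)
    return result


def information_gain(rows, split_index, target_index):
    """Gain from splitting rows at the median of the column at split_index."""
    targets = [row[target_index] for row in rows]
    split_median = median(row[split_index] for row in rows)
    left = [row[target_index] for row in rows if row[split_index] <= split_median]
    right = [row[target_index] for row in rows if row[split_index] > split_median]

    weighted = 0.0
    for subset in (left, right):
        if subset:  # an empty side contributes nothing
            weighted += len(subset) / len(rows) * entropy(subset)
    return entropy(targets) - weighted


# Same toy rows as above: (high_income, age, marital_status)
rows = [(0, 20, 0), (0, 60, 2), (0, 40, 1), (1, 25, 1), (1, 35, 2), (1, 55, 1)]
age_gain = information_gain(rows, 1, 0)
marital_gain = information_gain(rows, 2, 0)
print(age_gain, marital_gain)
```

Splitting on age at its median of 37.5 gives a gain of roughly 0.08, while splitting on marital_status at its median of 1 leaves both sides evenly mixed and gives a gain of zero, so age is chosen for the first split.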


### Storing the tree
We can now store the entire tree instead of just the labels at the leaves. In order to do this, we'll use nested dictionaries.

In order to keep track of the tree, we'll need to make some modifications to id3. The first is that we'll change the function definition so that the tree dictionary is passed in as an argument.

def id3(data, target, columns, tree):
    1. Create a node for the tree
    2. Number the node
    3. If all values of the target attribute are 1, assign 1 to the label key in tree and return
    4. If all values of the target attribute are 0, assign 0 to the label key in tree and return
    5. Using information gain, find A, the column that splits the data best
    6. Find the median value in column A
    7. Assign the column and median keys in tree
    8. Split A into values below or equal to the median (0) and values above the median (1)
    9. For each possible value vi (0 or 1) of A
        10. Add a new tree branch below Root, corresponding to the rows in the data where A = vi
        11. Let Examples(vi) be the subset of examples that have the value vi for A
        12. Create a new key with the name corresponding to the side of the split (0=left, 1=right). The value of this key should be an empty dictionary
        13. Below this new branch, add the subtree id3(data[A==vi], target, columns, tree[split_side])
    14. Return Root
    
    
The main difference is that we're now passing the tree dictionary into our id3 function, and setting some keys on it. One complexity is in how we're creating the nested dictionary. For the left split, we're adding a key to the tree dictionary that looks like: tree["left"] = {}. For the right side, we're doing tree["right"] = {}. After we add this key, we're able to pass the newly created dictionary into the recursive call to id3. This new dictionary will be the dictionary for that specific node, but will be tied back to the parent dictionary (because it's a key of the original dictionary).

This will keep building up the nested dictionary, and we'll be able to access the whole thing using the variable tree we define before the function. Think of it like each recursive call building a piece of the tree, which we can access after all the functions are done.
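
The key point is that dictionaries are passed by reference, so a child call that fills in its own dictionary is also filling in part of the parent. Here is a minimal sketch of that mechanism on its own, without any data. The names `grow` and `demo_tree` are made up for this illustration and are not part of the id3 code.

```python
def grow(node, depth, max_depth):
    """Fill `node` in place, adding child dictionaries until max_depth is reached."""
    node["depth"] = depth
    if depth == max_depth:
        node["label"] = "leaf"
        return

    for side in ("left", "right"):
        # Store a new empty dictionary under the parent...
        node[side] = {}
        # ...and let the recursive call mutate that same dictionary.
        grow(node[side], depth + 1, max_depth)


demo_tree = {}
grow(demo_tree, 0, 1)
print(demo_tree)
# {'depth': 0, 'left': {'depth': 1, 'label': 'leaf'}, 'right': {'depth': 1, 'label': 'leaf'}}
```

Nothing is returned from `grow`, yet `demo_tree` ends up holding the whole structure, which is the same way the tree dictionary is filled in by id3.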

In [5]:
# Create a dictionary to hold the tree. Like label_0s and label_1s before, it has to live outside the function so we can access it later.
tree = {}

# This list will let us number the nodes.  It has to be a list so we can access it inside the function.
nodes = []

def id3(data, target, columns, tree):
    """
    Convenience function : 
        id3 algorithm
    """
    unique_targets  = data[target].unique()
    
    # Number this node: append the next counter value to the nodes list
    nodes.append(len(nodes) + 1)
    tree["number"] = nodes[-1]
    
    if len(unique_targets) == 1:
        if 0 in unique_targets:
            tree["label"] = 0
        elif 1 in unique_targets:
            tree["label"] = 1
        
        return
    best_column = find_best_column(data, target, columns)
    column_median = data[best_column].median()
    
    tree["column"] = best_column
    tree["median"] = column_median
    left_split = data[data[best_column] <= column_median]
    right_split = data[data[best_column] > column_median]
    
    split_dict = [["left", left_split], ["right", right_split]]
    
    for name, split in split_dict:
        tree[name] = {}
        id3(split, target, columns, tree[name])
        
# test
id3(data, "high_income", ["age", "marital_status"], tree)

In [6]:
tree

{'column': 'age',
 'left': {'column': 'age',
  'left': {'column': 'age',
   'left': {'label': 0, 'number': 4},
   'median': 22.5,
   'number': 3,
   'right': {'label': 1, 'number': 5}},
  'median': 25.0,
  'number': 2,
  'right': {'label': 1, 'number': 6}},
 'median': 37.5,
 'number': 1,
 'right': {'column': 'age',
  'left': {'column': 'age',
   'left': {'label': 0, 'number': 9},
   'median': 47.5,
   'number': 8,
   'right': {'label': 1, 'number': 10}},
  'median': 55.0,
  'number': 7,
  'right': {'label': 0, 'number': 11}}}

### A Prettier Tree
The tree dictionary shows all the relevant information, but it doesn't look very good. We can fix this by printing out our dictionary in a nicer way.

In order to do this, we'll need to recursively iterate through our tree dictionary. If we find a dictionary with a label key, then we know it's a leaf, so we print out the label of the leaf. Otherwise, we loop through the left and right keys of the tree, and recursively call the same function. We'll also need to keep track of a depth variable so we can indent the nodes properly to indicate which nodes come before others. When we print out anything, we'll take the depth variable into account by adding space beforehand.

In [7]:
def print_with_depth(string, depth):
    """
    Convenience function:
        
    """
    # add space before the string
    prefix = " " * depth
    print("{0}{1}".format(prefix, string))
    
def print_node(tree, depth):
    """
    Convenience function:
        
    """
    if "label" in tree:
        print_with_depth("Leaf: Label {0}".format(tree["label"]), depth)
        # This is critical -- without it, you'll get infinite recursion.
        return
    print_with_depth("{0} > {1}".format(tree["column"], tree["median"]), depth)
    
    branches = [tree["left"], tree["right"]]
    for branch in branches:
        print_node(branch, depth + 1)
        
print_node(tree, 0)

age > 37.5
 age > 25.0
  age > 22.5
   Leaf: Label 0
   Leaf: Label 1
  Leaf: Label 1
 age > 55.0
  age > 47.5
   Leaf: Label 0
   Leaf: Label 1
  Leaf: Label 0


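The left branch is printed first, then the right branch. Each node prints the criterion it splits on: for a node printed as age > 37.5, rows where the condition is false go to the left branch and rows where it is true go to the right branch. It's easy to tell how to predict a new value by looking at this tree.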

Let's say we wanted to predict the following row:
age = 50, marital_status = 1


We'd first split on age > 37.5, and go to the right. Then, we'd split on age > 55.0, and go to the left. Then, we'd split on age > 47.5, and go to the right. We'd end up predicting a 1 for high_income.
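
The walk described above can be checked with a short sketch. It assumes a hand-copied version of the tree dictionary printed earlier (with the number keys left out), stored under the hypothetical name `toy_tree`, and uses a made-up helper `trace_path` that records which branch is taken at each node.

```python
# Hand-copied from the tree printed earlier; "number" keys omitted for brevity.
toy_tree = {
    "column": "age", "median": 37.5,
    "left": {
        "column": "age", "median": 25.0,
        "left": {
            "column": "age", "median": 22.5,
            "left": {"label": 0},
            "right": {"label": 1},
        },
        "right": {"label": 1},
    },
    "right": {
        "column": "age", "median": 55.0,
        "left": {
            "column": "age", "median": 47.5,
            "left": {"label": 0},
            "right": {"label": 1},
        },
        "right": {"label": 0},
    },
}


def trace_path(node, row):
    """Return the branches followed for `row` and the label of the leaf reached."""
    path = []
    while "label" not in node:
        # Values above the median go right; everything else goes left.
        side = "right" if row[node["column"]] > node["median"] else "left"
        path.append(side)
        node = node[side]
    return path, node["label"]


path, label = trace_path(toy_tree, {"age": 50, "marital_status": 1})
print(path, label)  # prints ['right', 'left', 'right'] 1
```

The loop here is only a tracing aid; it follows the same comparison at each node as the manual walk above.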

### Automatic prediction
It's simple to make predictions with such a small tree, but what if we want to use the whole income dataframe? We wouldn't be able to make predictions by eye, and would want an automated way to do so.

In [10]:
def predict(tree, row):
    """
    Convenience function:
        predict from row
    """
    if "label" in tree:
        return tree["label"]
    
    column = tree["column"]
    median = tree["median"]
    
    if row[column] <= median:
        return predict(tree["left"], row)
    else:
        return predict(tree["right"], row)
    
print(predict(tree, data.iloc[4]))

1


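### Making Multiple Predictions
Now that we can predict a single row, we can predict a whole dataframe by applying predict to each of its rows.

In [14]: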
new_data = pd.DataFrame([
    [40,0],
    [20,2],
    [80,1],
    [15,1],
    [27,2],
    [38,1]
    ])
# Assign column names to the data.
new_data.columns = ["age", "marital_status"]

def batch_predict(tree, df):
    """
    Convenience function:
        make multiple prediction
    """
    return df.apply(lambda df : predict(tree, df), axis = 1)

predictions = batch_predict(tree, new_data)
predictions

0    0
1    0
2    0
3    0
4    1
5    0
dtype: int64

### Conclusion 
In this notebook, we learned how to create a full decision tree model, print the results, and make predictions using the tree. We applied a modified version of the ID3 algorithm.