# 1. Intro to the data

In the last mission, we used a data set on U.S. income from the 1994 census; we'll continue using it here. It contains information on marital status, age, type of work, and more. The target column, high_income, indicates a salary less than or equal to 50k per year (0), or more than 50k per year (1).

You can download the data from the University of California, Irvine's website.

# 2. Overview of the ID3 algorithm

In the last mission, we learned about the basics of decision trees, including entropy and information gain. In this mission, we'll build on those concepts to construct a full decision tree in Python and use it to make predictions.

We'll use the ID3 Algorithm for constructing decision trees to accomplish this. This algorithm involves recursion and an understanding of time complexity. If you're unfamiliar with these topics, we suggest trying our Data Structures and Algorithms course. We also suggest learning about lambda functions through our command line course.

In general, recursion is the process of splitting a large problem into smaller chunks. Recursive functions will call themselves, then combine the results into a final output.

Building a tree is a perfect use case for recursive algorithms. At each node, we'll call a recursive function that will split the data into two branches. Each branch will lead to a node, and the function will call itself to build the tree out.
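As a minimal sketch of this idea (not part of the ID3 code itself), the hypothetical helper `split_in_half()` below calls itself on each half of a list until only single-element pieces remain, then combines the results into a nested list. The function name and the halving rule are assumptions chosen purely for illustration.

```python
def split_in_half(values):
    """Recursively split a list into two branches until each piece has one item."""
    # Base case: nothing left to split, so return the piece as-is
    if len(values) <= 1:
        return values
    middle = len(values) // 2
    # Recursive case: the function calls itself once per branch,
    # then combines both results into a two-element list
    return [split_in_half(values[:middle]), split_in_half(values[middle:])]


nested = split_in_half([1, 2, 3, 4])
print(nested)  # [[[1], [2]], [[3], [4]]]
```

The nested output has the same shape as a small binary tree: every two-element list is a node, and every single-element list is a leaf.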

We've created a pseudocode version of the full ID3 Algorithm below. Pseudocode is a plain-text outline of a piece of code that explains how it works. Exploring the pseudocode for an algorithm is a good way to understand it better before trying to code it.

    def id3(data, target, columns)
        1 Create a node for the tree
        2 If all values of the target attribute are 1, Return the node, with label = 1
        3 If all values of the target attribute are 0, Return the node, with label = 0
        4 Using information gain, find A, the column that splits the data best
        5 Find the median value in column A
        6 Split column A into values below or equal to the median (0), and values above the median (1)
        7 For each possible value (0 or 1), vi, of A,
        8    Add a new tree branch below Root that corresponds to rows of data where A = vi
        9    Let Examples(vi) be the subset of examples that have the value vi for A
       10    Below this new branch, add the subtree id3(data[A==vi], target, columns)
       11 Return Root
       
We've made a minor modification to the algorithm so that it only creates two branches from each node. This will simplify the process of constructing the tree, and make it easier to demonstrate the principles it involves.

The recursive nature of the algorithm comes into play on line 10. Every node in the tree will call the id3() function, and the final tree will be the result of all of these calls.

# 3. Walking through an example of the ID3 algorithm
Let's make ID3 easier to follow by walking through an example with a dummy data set. We want to predict high_income using age and marital_status. In the marital_status column, 0 means unmarried, 1 means married, and 2 means divorced.

    high_income    age    marital_status
    0              20     0
    0              60     2
    0              40     1
    1              25     1
    1              35     2
    1              55     1

We start with our algorithm. There are both 0s and 1s in high_income, so we skip lines 2 and 3 and jump to line 4. We won't go through the information gain calculations here, but the column we split on is age.

On line 5, we find the median, which is 37.5.

Per line 6, we make everything less than or equal to the median 0, and anything greater than the median 1. Next, we start the loop on line 7. Because we're going through the possible values for A in order, we hit the 0 values first. We make a branch going to the left for rows of data where age <= 37.5.

We reach line 10, and call id3() on the new node at the end of that branch. We "pause" this current execution of id3() because we called the function again. We'll call this paused state Node 1.

The following diagram illustrates this chain of events. We've numbered the nodes in the bottom right corner.

![](node2.png)

The new node has the following data:

    high_income    age    marital_status
    0              20     0
    1              25     1
    1              35     2
Because we recursively called the id3() function on line 10, we start over at the top, with only the post-split data. We skip lines 2 and 3 again, and find another variable to split on. age is again the best split variable, with a median of 25. We make a branch to the left where age <= 25.

![](node3.png)

The new node has the following data:

    high_income    age    marital_status
    0              20     0
    1              25     1
We'll hit line 10 again, and "pause" node 2 to start over in the id3() function. We find that the best column to split on is again age, and the median is 22.5.

We perform another split:

![](node4.png)

All of the values for high_income in node 4 are 0. This means that line 3 applies, and we don't continue building the tree lower. This causes the id3 function for node 4 to return. This "unpauses" the id3() function for node 3, which then moves on to building the right side of the tree. Line 7 specifies that we're in a for loop. When the id3() algorithm for node 4 returns, node 3 goes to the next iteration in the for loop, which is the right branch.

We're now on node 5, which is the right side of the split we make from node 3. This calls the id3() function for node 5, which stops at line 2 and returns. There's only one row in this split, and we end up with a leaf node again, where the label is 1.

![](node5.png)

We're done with the entire loop for node 3. We've constructed a left-hand subtree and a right-hand subtree, both of which end in terminal leaves having only one value for the target column.

The id3() function for node 3 now hits line 11 and returns. This "unpauses" node 2, where we construct the right split. There's only one row here -- the 35 year old. This again creates a leaf node, which will have the label 1.

![](node6.png)

This causes node 2 to finish processing and return on line 11. This causes node 1 to "unpause" and start building the right side of the tree.

We won't build out the entire right side of the tree right now. Instead, we'll dive into some code that will construct trees automatically.
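The walkthrough skipped the information gain calculation at the root. The sketch below checks it for the six-row dummy data set using only the standard library. The helper names `entropy()` and `information_gain()` are assumptions for this illustration; they follow the same median-split rule described above but are not the course's own functions.

```python
import math
from statistics import median


def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    total = len(labels)
    result = 0.0
    for value in set(labels):
        prob = labels.count(value) / total
        result -= prob * math.log2(prob)
    return result


def information_gain(rows, split_name, target_name):
    """Entropy reduction from splitting rows at the median of split_name."""
    split_median = median(row[split_name] for row in rows)
    left = [row[target_name] for row in rows if row[split_name] <= split_median]
    right = [row[target_name] for row in rows if row[split_name] > split_median]
    original = entropy([row[target_name] for row in rows])
    # Weight each side's entropy by its share of the rows
    remainder = sum(len(part) / len(rows) * entropy(part) for part in (left, right) if part)
    return original - remainder


# The dummy data set from the walkthrough
rows = [
    {"high_income": 0, "age": 20, "marital_status": 0},
    {"high_income": 0, "age": 60, "marital_status": 2},
    {"high_income": 0, "age": 40, "marital_status": 1},
    {"high_income": 1, "age": 25, "marital_status": 1},
    {"high_income": 1, "age": 35, "marital_status": 2},
    {"high_income": 1, "age": 55, "marital_status": 1},
]

gain_age = information_gain(rows, "age", "high_income")
gain_marital = information_gain(rows, "marital_status", "high_income")
print(round(gain_age, 4), round(gain_marital, 4))  # age is about 0.0817, marital_status is about 0
```

Splitting on age leaves each side with a 2-to-1 mix of labels, while splitting on marital_status leaves both sides evenly mixed, so age gives the larger gain and is chosen for the root.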

# 4. Determining the Column to Split On

In the last mission, we wrote functions to calculate entropy and information gain. We've loaded these functions in as calc_entropy() and calc_information_gain().

Now we need a function that returns the name of the column we should use to split a data set. The function should take the data set, the name of the target column, and a list of columns we might want to split on as input.

**Instructions:**

- Write a function named find_best_column() that returns the name of a column to split the data on. We've started to define this function for you.
- Use find_best_column() to find the best column on which to split income.
- The target is the high_income column, and the potential columns to split with are in the list columns below.
- Assign the result to income_split.

In [68]:
import pandas

# Set index_col to False to avoid pandas thinking that the first column is row indexes (it's age)
income = pandas.read_csv("income.csv", index_col=False)
print(income.head(5))

# Convert a single column from text categories to numbers
# (pandas.Categorical.from_array was removed from pandas; use the constructor instead)
col = pandas.Categorical(income["workclass"])
income["workclass"] = col.codes
print(income["workclass"].head(5))
for name in ["education", "marital_status", "occupation", "relationship", "race", "sex", "native_country", "high_income"]:
    col = pandas.Categorical(income[name])
    income[name] = col.codes

   age          workclass  fnlwgt   education  education_num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital_status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital_gain  capital_loss  hours_per_week  native_country high_income  
0          2174             0              40   United-States   



In [69]:
import numpy as np
import math
# first re-load the entropy function
def calc_entropy(column):
    """
    Calculate entropy given a pandas series, list, or numpy array.
    """
    # Compute the counts of each unique value in the column
    counts = np.bincount(column)
    # Divide by the total column length to get a probability
    probabilities = counts / len(column)
    
    # Initialize the entropy to 0
    entropy = 0
    # Loop through the probabilities, and add each one to the total entropy
    for prob in probabilities:
        if prob > 0:
            entropy += prob * math.log(prob, 2)
    
    return -entropy

In [70]:
def calc_information_gain(data, split_name, target_name):
    """
    Calculate information gain given a data set, column to split on, and target
    """
    # Calculate the original entropy
    original_entropy = calc_entropy(data[target_name])
    
    # Find the median of the column we're splitting
    column = data[split_name]
    median = column.median()
    
    # Make two subsets of the data, based on the median
    left_split = data[column <= median]
    right_split = data[column > median]
    
    # Loop through the splits and calculate the subset entropies
    to_subtract = 0
    for subset in [left_split, right_split]:
        prob = (subset.shape[0] / data.shape[0]) 
        to_subtract += prob * calc_entropy(subset[target_name])
    
    # Return information gain
    return original_entropy - to_subtract

In [71]:
columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]
def find_best_column(data, target_name, columns):
    information_gains = []
    # Loop through and compute information gains
    for col in columns:
        information_gain = calc_information_gain(data, col, target_name)
        information_gains.append(information_gain)

    # Find the name of the column with the highest gain
    highest_gain_index = information_gains.index(max(information_gains))
    highest_gain = columns[highest_gain_index]
    return highest_gain

income_split = find_best_column(income, "high_income", columns)

# 5. Creating a Simple Recursive Algorithm

Let's build up to making the full id3() function by creating a simpler algorithm that we can extend. Here's what that algorithm looks like in pseudocode:

    def id3(data, target, columns)
        1 Create a node for the tree
        2 If all values of the target attribute are 1, add 1 to counter_1
        3 If all values of the target attribute are 0, add 1 to counter_0
        4 Using information gain, find A, the column that splits the data best
        5 Find the median value in column A
        6 Split A into values below or equal to the median (0), and values above the median (1)
        7 For each possible value (0 or 1), vi, of A,
        8    Add a new tree branch below Root that corresponds to rows of data where A = vi
        9    Let Examples(vi) be the subset of examples that have the value vi for A
       10    Below this new branch, add the subtree id3(data[A==vi], target, columns)
       11 Return Root

This version is very similar to the algorithm above, but lines 2 and 3 are different. Rather than storing the entire tree (which is a bit complicated), we'll just tally how many leaves end up with the label 1, and how many end up with the label 0.

We'll replicate this algorithm in code, and apply it to the same data set we just stepped through on a previous screen:

    high_income    age    marital_status
    0              20     0
    0              60     2
    0              40     1
    1              25     1
    1              35     2
    1              55     1

**Instructions:**

- Read the id3() function below and fill in the lines that say "Insert code here...".
    - The function should append 1 to label_1s if the node should be a leaf, and only has 1s for high_income.
    - It should append 0 to label_0s if the node should be a leaf, and only has 0s for high_income.

In [72]:
# We'll use lists to store our labels for nodes (when we find them)
# A function can append to a list defined outside it, but it can't reassign
# an outside integer (e.g. counter += 1) without a global declaration.
# Look at the python missions on scoping for more information on this topic
label_1s = []
label_0s = []

def id3(data, target, columns):
    unique_targets = pandas.unique(data[target])

    if len(unique_targets) == 1:
        if 0 in unique_targets:
            label_0s.append(0)
        elif 1 in unique_targets:
            label_1s.append(1)
        return
    
    best_column = find_best_column(data, target, columns)
    column_median = data[best_column].median()
    
    left_split = data[data[best_column] <= column_median]
    right_split = data[data[best_column] > column_median]
    
    for split in [left_split, right_split]:
        id3(split, target, columns)


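# Build the six-row dummy data set shown above
# (it was preloaded as data in the original exercise)
data = pandas.DataFrame([
    [0, 20, 0],
    [0, 60, 2],
    [0, 40, 1],
    [1, 25, 1],
    [1, 35, 2],
    [1, 55, 1]
    ], columns=["high_income", "age", "marital_status"])

id3(data, "high_income", ["age", "marital_status"])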
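
The id3() function above tallies leaves by appending to the module-level lists label_1s and label_0s. As a rough sketch of why lists are convenient here, the example below contrasts mutating a shared list with rebinding a shared integer. All of the names in it (`leaf_labels`, `leaf_count`, and the three functions) are hypothetical and exist only for this illustration.

```python
leaf_labels = []  # mutable list defined outside the functions
leaf_count = 0    # plain integer defined outside the functions


def record_in_list():
    # Mutating the existing list works without any declaration
    leaf_labels.append(1)


def record_in_int_broken():
    # Augmented assignment rebinds the name, so Python treats leaf_count
    # as a local variable and fails when it tries to read it first
    leaf_count += 1


def record_in_int():
    # Declaring the name global lets the function rebind the outer integer
    global leaf_count
    leaf_count += 1


record_in_list()
record_in_int()
try:
    record_in_int_broken()
except UnboundLocalError:
    print("leaf_count += 1 fails without a global declaration")
```

Both approaches can work; appending to a list simply avoids the extra global declaration inside the recursive function.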

# 6. Storing the tree

Now we can store the entire tree, rather than the leaf labels only. We'll use nested dictionaries to do this. We can represent the root node with a dictionary, and branches with the keys left and right. We'll store the column we're splitting on as the key column, and the median value as the key median. Finally, we can store the label for a leaf as the key label. We'll also number each node as we go along using the number key.

We'll use the same data set we've been working with:

    high_income    age    marital_status
    0              20     0
    0              60     2
    0              40     1
    1              25     1
    1              35     2
    1              55     1
Here's what the dictionary for the decision tree will look like:

    {  
       "left":{  
          "left":{  
             "left":{  
                "number":4,
                "label":0
             },
             "column":"age",
             "median":22.5,
             "number":3,
             "right":{  
                "number":5,
                "label":1
             }
          },
          "column":"age",
          "median":25.0,
          "number":2,
          "right":{  
             "number":6,
             "label":1
          }
       },
       "column":"age",
       "median":37.5,
       "number":1,
       "right":{  
          "left":{  
             "left":{  
                "number":9,
                "label":0
             },
             "column":"age",
             "median":47.5,
             "number":8,
             "right":{  
                "number":10,
                "label":1
             }
          },
          "column":"age",
          "median":55.0,
          "number":7,
          "right":{  
             "number":11,
             "label":0
          }
       }
    }
If we look at node 2 (the left branch of the root node), we see that it matches the hand exercise we completed a few screens ago. It splits, creating a right branch (node 6) with the label 1, and a left branch (node 3) that splits again.

In order to keep track of the tree, we'll need to make some modifications to id3(). The first modification involves changing the definition to pass in the tree dictionary:

    def id3(data, target, columns, tree)
        1 Create a node for the tree
        2 Number the node
        3 If all of the values of the target attribute are 1, assign 1 to the label key in tree
        4 If all of the values of the target attribute are 0, assign 0 to the label key in tree
        5 Using information gain, find A, the column that splits the data best
        6 Find the median value in column A
        7 Assign the column and median keys in tree
        8 Split A into values less than or equal to the median (0), and values above the median (1)
        9 For each possible value (0 or 1), vi, of A,
       10    Add a new tree branch below Root that corresponds to rows of data where A = vi
       11    Let Examples(vi) be the subset of examples that have the value vi for A
       12    Create a new key with the name corresponding to the side of the split (0=left, 1=right).  The value of this key should be an empty dictionary.
       13    Below this new branch, add the subtree id3(data[A==vi], target, columns, tree[split_side])
       14 Return Root
       
Under this approach, we're now passing the tree dictionary into our id3 function and setting some keys on it. One complexity is in how we're creating the nested dictionary. For the left split, we're adding a key to the tree dictionary that looks like this:

    tree["left"] = {}

For the right side, we're adding:

    tree["right"] = {}

Now that we've added this key, we're able to pass our new dictionary into the recursive call to id3(). While this new dictionary will be the dictionary for that specific node, it stays tied to the parent dictionary, because it's stored as the value of a key in the parent. Any keys the recursive call sets on it are visible through the parent.

This process will continue building up the nested dictionary. We'll be able to access the entire dictionary using the variable tree we define before the function. Think of each recursive call as building a piece of the tree, which we can then access after all of the functions have terminated.
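
A small sketch of this behavior, separate from id3() itself: the hypothetical helper `label_node()` receives only the child dictionary, yet its changes show up in the parent, because both names refer to the same object.

```python
def label_node(node, label):
    # Sets a key on whatever dictionary is passed in; no copy is made
    node["label"] = label


tree = {}
tree["left"] = {}             # empty child dictionary stored under the parent
label_node(tree["left"], 0)   # the function only ever sees the child

print(tree)  # {'left': {'label': 0}}
```

This is the reason the recursive calls don't need to return anything: each call fills in its own piece of the shared nested dictionary.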

# 7. Storing the tree
**Instructions:**

Fill in the sections labelled "Insert code here..." in the id3() function.
- The first section should assign the correct label to the tree dictionary.
    - You can do this by setting the label key equal to the correct label.
- The second section should assign the column and median keys to the tree dictionary.
    - The values should be equal to best_column and column_median.

Finally, call the id3 function with the correct inputs -- id3(data, "high_income", ["age", "marital_status"], tree).

In [73]:
tree = {}
nodes = []

def id3(data, target, columns, tree):
    unique_targets = pandas.unique(data[target])
    nodes.append(len(nodes) + 1)
    tree["number"] = nodes[-1]

    if len(unique_targets) == 1:
        if 0 in unique_targets:
            tree["label"] = 0
        elif 1 in unique_targets:
            tree["label"] = 1
        return
    
    best_column = find_best_column(data, target, columns)
    column_median = data[best_column].median()
    
    tree["column"] = best_column
    tree["median"] = column_median
    
    left_split = data[data[best_column] <= column_median]
    right_split = data[data[best_column] > column_median]
    split_dict = [["left", left_split], ["right", right_split]]
    
    for name, split in split_dict:
        tree[name] = {}
        id3(split, target, columns, tree[name])


id3(data, "high_income", ["age", "marital_status"], tree)

# 8. Printing Labels for a more Attractive Tree

The tree dictionary shows all of the relevant information, but it doesn't look very nice. We can fix its appearance by printing it out in a nicer format.

To do this, we'll need to recursively iterate through our tree dictionary. Any dictionary that has a label key is a leaf. Whenever we find one, we'll print out the label. Otherwise, we'll loop through the tree's left and right keys and recursively call the same function.

We also need to keep track of a depth variable. This variable will allow us to use indentation to indicate the order of the nodes. Before we print anything out, we'll prefix it with the number of spaces corresponding to the depth variable.

Here's the pseudocode:

    def print_node(tree, depth):
        1 Check for the presence of the "label" key in the tree
        2     If found, print the label and return
        3 Print out the tree's "column" and "median" keys
        4 Iterate through the tree's "left" and "right" keys
        5     Recursively call print_node(tree[key], depth+1)
**Instructions:**

Fill in the gaps in the print_node() function that say "Insert code here...".
- Your code should iterate through both branches of the branches list (in order), and recursively call print_node().
    - Don't forget to increment depth when you call print_node.

Call print_node(), and pass in tree and a depth of 0.

In [74]:
def print_with_depth(string, depth):
    # Add space before a string
    prefix = "    " * depth
    # Print a string, and indent it appropriately
    print("{0}{1}".format(prefix, string))
    
    
def print_node(tree, depth):
    if "label" in tree:
        print_with_depth("Leaf: Label {0}".format(tree["label"]), depth)
        return
    print_with_depth("{0} > {1}".format(tree["column"], tree["median"]), depth)
    for branch in [tree["left"], tree["right"]]:
        print_node(branch, depth+1)

print_node(tree, 0)

age > 37.5
    age > 25.0
        age > 22.5
            Leaf: Label 0
            Leaf: Label 1
        Leaf: Label 1
    age > 55.0
        age > 47.5
            Leaf: Label 0
            Leaf: Label 1
        Leaf: Label 0


# 9. Making Predictions with the Printed Tree

Now that we've printed the tree out, we can see what the split points are:

    age > 37.5
        age > 25.0
            age > 22.5
                Leaf: Label 0
                Leaf: Label 1
            Leaf: Label 1
        age > 55.0
            age > 47.5
                Leaf: Label 0
                Leaf: Label 1
            Leaf: Label 0
The left branch prints out first, then the right branch. Each node prints the criteria on which it was split. Can you tell how to predict a new value by looking at this tree?

Let's say we want to predict the following row:

    age    marital_status
    50     1
    
First, we'd split on age > 37.5 and go to the right. Then, we'd split on age > 55.0 and go to the left. Then, we'd split on age > 47.5 and go to the right. We'd end up predicting a 1 for high_income.
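
The same walk can be sketched in code. The example below rebuilds the tree from this mission as a plain dictionary (the number keys are left out for brevity) and uses a hypothetical helper, `trace_path()`, to record which branch is taken at each split. It is an iterative illustration of the manual steps, not the prediction function we'll write later.

```python
# The tree from this mission, without the "number" keys
tree = {
    "column": "age", "median": 37.5,
    "left": {
        "column": "age", "median": 25.0,
        "left": {"column": "age", "median": 22.5,
                 "left": {"label": 0}, "right": {"label": 1}},
        "right": {"label": 1},
    },
    "right": {
        "column": "age", "median": 55.0,
        "left": {"column": "age", "median": 47.5,
                 "left": {"label": 0}, "right": {"label": 1}},
        "right": {"label": 0},
    },
}


def trace_path(node, row):
    """Follow the split points for one row and return (branches taken, leaf label)."""
    path = []
    while "label" not in node:
        # Values at or below the median go left; values above it go right
        side = "left" if row[node["column"]] <= node["median"] else "right"
        path.append(side)
        node = node[side]
    return path, node["label"]


path, label = trace_path(tree, {"age": 50, "marital_status": 1})
print(path, label)  # ['right', 'left', 'right'] 1
```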

Making predictions with such a small tree is fairly straightforward, but what if we want to use the entire income dataframe? We wouldn't be able to eyeball predictions; we'd want an automated way to do this instead.

# 10. Making Predictions Automatically

Let's write a function that makes predictions automatically. All we need to do is follow the split points we've already defined with a new row.

Here's the pseudocode:

    def predict(tree, row):
        1 Check for the presence of "label" in the tree dictionary
        2    If found, return tree["label"]
        3 Extract tree["column"] and tree["median"]
        4 Check whether row[tree["column"]] is less than or equal to tree["median"]
        5    If it's less than or equal, call predict(tree["left"], row) and return the result
        6    If it's greater, call predict(tree["right"], row) and return the result
        
The major difference here is that we're returning values. Because we're only calling the function recursively once in each iteration (we only go "down" a single branch), we can return a single value up the chain of recursion. This will let us get a value back when we call the function.

**Instructions:**

Fill in the gaps in the predict() function that say "Insert code here...".
- The code should check whether row[column] is less than or equal to median, and return the appropriate result for each side of the tree.
- Print the result of predicting the first row of the data with predict(tree, data.iloc[0])

In [75]:
def predict(tree, row):
    if "label" in tree:
        return tree["label"]
    
    column = tree["column"]
    median = tree["median"]
    if row[column] <= median:
        return predict(tree["left"], row)
    else:
        return predict(tree["right"], row)

print(predict(tree, data.iloc[0]))

0


# 11. Making Multiple Predictions

Now that we can make a prediction for a single row, we can write a function that makes predictions for multiple rows simultaneously.

To do this, we'll use the apply() method on pandas dataframes to apply a function across each row. You can read more about the function in the pandas documentation. You'll need to pass in the axis=1 argument to apply the function to each row. Because our function returns a single value per row, this method will return a pandas Series with one prediction per row.

You can use the apply() method along with lambda functions to apply the predict() function to each row of new_data.
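
As a minimal sketch of that pattern, the example below applies a simple lambda to each row of a small dataframe. The dataframe `rows` and the threshold rule are made up for illustration and stand in for new_data and predict().

```python
import pandas

# A tiny made-up dataframe with the same column names as our data
rows = pandas.DataFrame({"age": [20, 40, 60], "marital_status": [0, 1, 2]})

# With axis=1, the lambda receives one row at a time and can index it by column name
over_37 = rows.apply(lambda row: int(row["age"] > 37.5), axis=1)

print(type(over_37).__name__)  # Series
print(over_37.tolist())        # [0, 1, 1]
```

The lambda is what lets us pass extra arguments along: inside it, we can call any function with the row plus whatever other values we need, such as the tree.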

**Instructions:**

Create a function named batch_predict() that takes two parameters, tree and df.
- It should use the apply() method to apply the predict() function across each row of df.
    - You can use lambda functions to pass tree and row into predict.

Call batch_predict() with new_data as the parameter df, and assign the result to predictions

In [77]:
new_data = pandas.DataFrame([
    [40,0],
    [20,2],
    [80,1],
    [15,1],
    [27,2],
    [38,1]
    ])
# Assign column names to the data
new_data.columns = ["age", "marital_status"]

def batch_predict(tree, df):
    return df.apply(lambda x: predict(tree, x), axis=1)

predictions = batch_predict(tree, new_data)
predictions

0    0
1    0
2    0
3    0
4    1
5    0
dtype: int64

# 12. Next Steps
In this mission, we learned how to create a full decision tree model, print the results, and use the tree to make predictions. We applied a modified version of the ID3 algorithm on a small data set for clarity.

In future missions, we'll apply decision trees across larger data sets, learn the trade-offs associated with different algorithms, and explore how to generate more accurate predictions from decision trees.