# Module 11 - Programming Assignment

In [1]:
from __future__ import division # so that 1/2 = 0.5 and not 0
from IPython.core.display import *
import csv
import pprint
import copy
import random
import math

## Decision Trees

For this assignment you will be implementing and evaluating a Decision Tree using the ID3 Algorithm (**no** pruning or normalized information gain). Use the provided pseudocode. The data is located at (copy link):

http://archive.ics.uci.edu/ml/datasets/Mushroom

You can download the two files and read them to find out the attributes, attribute values and class labels as well as their locations in the file.

One of the things we did not talk about in the lectures was how to deal with missing values. In C4.5, missing values were handled by treating "?" as an implicit attribute value for every feature. For example, if the attribute was "size" then the domain would be ["small", "medium", "large", "?"]. Another approach is to skip instances with missing values. Yet another approach is to infer the missing value conditioned on the class. For example, if the class is "safe" and the color is missing, then we would infer the attribute value that is most often associated with "safe", perhaps "red". **Use the "?" approach for this assignment.**
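
As a rough illustration of the "?" approach, here is a minimal sketch that collects attribute domains without filtering anything out. It assumes rows are lists of single-character strings with the class label in column 0; the helper name `collect_domains` and the toy rows are hypothetical and not part of the assignment.

```python
def collect_domains(rows):
    """Map each attribute column index to the sorted list of values seen in it."""
    domains = {}
    for row in rows:
        # Column 0 holds the class label, so attributes start at index 1.
        for index in range(1, len(row)):
            domains.setdefault(index, set()).add(row[index])
    return {index: sorted(values) for index, values in domains.items()}


# Hypothetical toy data: [label, size, color]
toy_rows = [
    ['e', 's', 'r'],
    ['p', 'l', '?'],   # the color is missing for this instance
    ['e', 'm', 'g'],
]

domains = collect_domains(toy_rows)
print(domains[2])  # ['?', 'g', 'r']
```

Because "?" is kept as an ordinary value, a tree built from these domains can grow a "?" branch like any other.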

As we did with the neural network, you should randomize your data (always randomize your data...you don't know if it is in some particular order like date of collection, by class label, etc.) and split it into two (2) sets. Train on the first set then test on the second set. Then train on the second set and test on the first set.

For regression, we almost always use something like Mean Squared Error to judge the performance of a model. For classification, there are a lot more options but for this assignment we will just look at classification error:

$$error\_rate=\frac{errors}{n}$$
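
A small worked example of the formula, assuming the actual and predicted labels are given as two parallel lists. The function name `error_rate` and the toy labels are illustrative only.

```python
def error_rate(actual, predicted):
    """Fraction of predictions that disagree with the actual labels."""
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / float(len(actual))


actual = ['e', 'p', 'e', 'e']
predicted = ['e', 'e', 'e', 'p']
print(error_rate(actual, predicted))  # 2 errors out of 4 instances: 0.5
```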

You must implement four functions. `train` takes training_data and returns the Decision Tree as a data structure or object (for this one, I'm removing the OOP restriction...people often feel more comfortable writing a Tree in an OOP fashion). Make sure your Tree can be represented somehow.

```
def train(training_data):
    pass  # returns a decision tree data structure
```

`view` takes a tree and prints it out:

```
def view( tree):
    pass # probably doesn't return anything.
```

The purpose of `view` is to let you see what the tree looks like, so its output should be legible. You can use ASCII if you like or use something like NetworkX.

`classify` takes a tree and a List of instances (possibly just one) and returns the classifications:

```
def classify(tree, test_data):
    pass  # returns a list of classifications
```

`evaluate` takes the test_data and the classifications and returns the error rate:

```
def evaluate(test_data, classifications):
    pass  # returns an error rate
```

Basically, you're going to:

1. learn the tree for set 1
2. view the tree
3. classify set 2
4. evaluate the tree
5. learn the tree for set 2
6. view the tree
7. classify set 1
8. evaluate the tree
9. average the classification error.

This is all that is required for this assignment. I'm leaving more of the particulars up to you but you can definitely use the last module as a guide.

**This assignment is a good opportunity to reflect on the use of deepcopy, because the algorithm has a natural recursive implementation.**

-----

## Helper Structures and Methods

**Attribute Index - Name Dictionary**

This dictionary has attribute indices as keys and the corresponding attribute names as values. It is used to pretty print the decision tree, since the tree is built using the indices and not the names.

In [2]:
attribute_names = {
    1: 'cap-shape',                
    2: 'cap-surface',              
    3: 'cap-color',                
    4: 'bruises?',                
    5: 'odor',                 
    6: 'gill-attachment',          
    7: 'gill-spacing',         
    8: 'gill-size',            
    9: 'gill-color',               
    10: 'stalk-shape',              
    11: 'stalk-root',             
    12: 'stalk-surface-above-ring', 
    13: 'stalk-surface-below-ring',
    14: 'stalk-color-above-ring',   
    15: 'stalk-color-below-ring',  
    16: 'veil-type',                
    17: 'veil-color',               
    18: 'ring-number',              
    19: 'ring-type',             
    20: 'spore-print-color',        
    21: 'population',         
    22: 'habitat'
}

&nbsp;

**Read CSV**

Reads a CSV file and returns its rows as a list of lists, one inner list per instance.

In [3]:
def read_csv(file_name):
    with open(file_name, 'rb') as f:
        reader = csv.reader(f)
        table = list(reader)
    
    return table

&nbsp;

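**Create Train and Test Sets**

Shuffles the data in place and splits it in half. The first half is returned as the test set and the second half as the training set; the two halves swap roles for the second round of training and testing.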

In [4]:
def create_train_test_sets(data):
    random.shuffle(data)
    split_point = int(len(data) / 2)
    test_set = data[:split_point]
    train_set = data[split_point:]
    
    return train_set, test_set

&nbsp;

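**Attribute Domains**

Builds a dictionary that maps each attribute index to the list of values observed for that attribute in the data. Column 0 holds the class label and is skipped. Since "?" is treated as an ordinary value, it appears in a domain whenever it occurs in the data.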

In [5]:
def get_attribute_domains(data):
    attributes = {}
    
    # Column 0 is the class label, so attribute columns start at index 1.
    for i in range(1, len(data[0])):
        attributes[i] = []
    
    for row in data:
        for i in range(1, len(row)):
            attribute_value = row[i]
            if attribute_value not in attributes[i]:
                attributes[i].append(attribute_value)
                
    return attributes

&nbsp;

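**Majority Label**

Returns the most frequent class label in the data ('p' or 'e'). Ties are broken at random.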

In [6]:
def get_majority_label(data):
    poisonous_count = 0
    edible_count = 0
    
    for row in data:
        if row[0] == 'p':
            poisonous_count += 1
        if row[0] == 'e':
            edible_count += 1
            
    if poisonous_count > edible_count:
        return 'p'
    elif edible_count > poisonous_count:
        return 'e'
    else:
        random_choice = random.randint(0, 1)
        if random_choice == 0:
            return 'p'
        else:
            return 'e'

&nbsp;

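**Homogeneous**

Returns True if every instance in the data has the same class label. The data must not be empty.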

In [7]:
def homogeneous(data):
    label = data[0][0]
    for row in data:
        if row[0] != label:
            return False
    
    return True

&nbsp;

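**Data Subset**

Returns deep copies of the rows whose value for the given attribute equals the given value.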

In [8]:
def get_data_subset(best_attribute, value, data):
    data_subset = []
    for row in data:
        if row[best_attribute] == value:
            row_copy = copy.deepcopy(row)
            data_subset.append(row_copy)

    return data_subset

&nbsp;

**Entropy**

Computes the entropy of the class labels in the data, using the natural logarithm.

In [9]:
def calculate_entropy(data):
    poisonous_count = 0.0
    edible_count = 0.0
    
    for row in data:
        if row[0] == 'p':
            poisonous_count += 1
        if row[0] == 'e':
            edible_count += 1
    
    length_data = len(data)
    entropy = 0.0
    for count in (poisonous_count, edible_count):
        # Treat 0 * log(0) as 0; math.log(0) would raise a ValueError.
        if count > 0:
            p = count / length_data
            entropy -= p * math.log(p)
    
    return entropy
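
The function above computes $H = -\sum_c p_c \ln p_c$ over the two class labels, so values are in nats rather than bits. Below is a minimal standalone sketch of the same quantity for an arbitrary list of labels; the name `entropy` and the toy inputs are illustrative, not part of the assignment code.

```python
import math


def entropy(labels):
    """Shannon entropy (natural log) of a list of class labels."""
    total = float(len(labels))
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        # p > 0 here because every label in the set occurs at least once.
        result -= p * math.log(p)
    return result


print(entropy(['p', 'e']))       # evenly mixed: ln(2), about 0.693
print(entropy(['e', 'e', 'e']))  # pure: 0.0
```

Using the natural logarithm instead of base 2 rescales every entropy by the same constant, so it does not change which attribute has the highest information gain.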

&nbsp;

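**Information Gain**

Computes the information gain of splitting the data on the given attribute: the entropy of the data minus the weighted entropy of the subsets induced by each value of the attribute.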

In [10]:
def calculate_information_gain(attribute, data, entropy):
    value_counts = {}
    for row in data:
        value = row[attribute]
        label = row[0]
        
        if value in value_counts:
            value_counts[value]['count'] += 1.0
            value_counts[value][label] += 1.0
        else:
            value_counts[value] = {}
            value_counts[value]['count'] = 1.0
            value_counts[value]['p'] = 0.0
            value_counts[value]['e'] = 0.0
            value_counts[value][label] += 1.0
    
    summation = 0.0
    data_length = len(data)
    
    for value in value_counts:
        count = value_counts[value]['count']
        p = value_counts[value]['p']
        e = value_counts[value]['e']
        
        #if count != 0:
        if p/count == 0.0:
            summation -= (count / data_length) * ( (e/count) * math.log(e/count) )
        elif e/count == 0.0:
            summation -= (count / data_length) * ( (p/count) * math.log(p/count) )
        else:
            summation -= (count / data_length) * ( (p/count) * math.log(p/count) + (e/count) * math.log(e/count) )

    information_gain = entropy - summation
    
    return information_gain
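
To make the calculation concrete, here is a small self-contained sketch on toy data, assuming rows with the class label in column 0. The helper names `entropy` and `information_gain` and the rows are hypothetical; the sketch groups labels by attribute value rather than keeping per-value counts, but computes the same quantity.

```python
import math


def entropy(labels):
    """Shannon entropy (natural log) of a list of class labels."""
    total = float(len(labels))
    return -sum((labels.count(c) / total) * math.log(labels.count(c) / total)
                for c in set(labels))


def information_gain(rows, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    labels = [row[0] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[0])
    remainder = sum(len(g) / float(len(rows)) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


rows = [
    ['p', 'a', 'x'],
    ['p', 'a', 'y'],
    ['e', 'b', 'x'],
    ['e', 'b', 'y'],
]
print(information_gain(rows, 1))  # separates the classes perfectly: about 0.693
print(information_gain(rows, 2))  # says nothing about the class: about 0.0
```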

&nbsp;

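**Pick Best Attribute**

Computes the entropy of the data once, then returns the index of the attribute with the highest information gain.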

In [11]:
def pick_best_attribute(data, attributes):
    entropy = calculate_entropy(data)
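    # Start below any possible gain so that an attribute is always chosen,
    # even when no attribute has a positive information gain.
    max_information_gain = float('-inf')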
    best_attribute = None
    
    for attribute in attributes:
        information_gain = calculate_information_gain(attribute, data, entropy)
        if information_gain > max_information_gain:
            max_information_gain = information_gain
            best_attribute = attribute
            
    return best_attribute   

&nbsp;

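**ID3**

Builds the tree recursively. There are three base cases: empty data returns the default label passed down from the parent, homogeneous data returns its single label, and an empty attribute set returns the majority label. Otherwise the function splits on the best attribute and returns a nested dictionary of the form `{attribute: {value: subtree}}`.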

In [12]:
def id3(data, attributes, default):
    if not data:
        return default
        
    if homogeneous(data):
        label = data[0][0] #TODO make sure this works
        return label
        
    if not attributes:
        label = get_majority_label(data)
        return label
    
    default_label = get_majority_label(data)
    best_attribute = pick_best_attribute(data, attributes)
    domain = attributes[best_attribute]
    
    tree = { best_attribute: {} }
        
    # Every branch gets the same reduced attribute set, so copy it once.
    new_attributes = copy.deepcopy(attributes)
    new_attributes.pop(best_attribute, None)
    
    for value in domain:        
        subset = get_data_subset(best_attribute, value, data)
        subtree = id3(subset, new_attributes, default_label)
        tree[best_attribute][value] = subtree
    
    return tree
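
The value returned by `id3` is either a bare class label (a leaf) or a nested dictionary. The sketch below shows that shape with a hand-written tree; the attribute indices, values, and the helper `count_leaves` are hypothetical and are not output from the mushroom data.

```python
# Hypothetical hand-built tree in the same shape that id3 returns.
# Inner nodes: {attribute_index: {attribute_value: subtree}}; leaves: class labels.
toy_tree = {
    5: {                                # split on attribute 5
        'n': {                          # value 'n' leads to another split
            20: {'r': 'p', 'w': 'e'},
        },
        'f': 'p',                       # value 'f' leads straight to a leaf
        'a': 'e',
    }
}


def count_leaves(tree):
    """Count the class-label leaves in a nested-dict tree."""
    if not isinstance(tree, dict):
        return 1
    branches = next(iter(tree.values()))  # the single inner dict of value -> subtree
    return sum(count_leaves(subtree) for subtree in branches.values())


print(count_leaves(toy_tree))  # 4
```

Each inner node has exactly one key, which is why both the printing and the classification code can read the split attribute with `next(iter(tree))`.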

&nbsp;

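**Train**

Computes the default (majority) label and the attribute domains from the training data, then builds the decision tree with `id3`.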

In [13]:
def train(training_data):
    default_label = get_majority_label(training_data)
    attributes = get_attribute_domains(training_data)
     
    decision_tree = id3(training_data, attributes, default_label)

    return decision_tree

&nbsp;

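**Pretty Print Tree**

Prints the tree recursively, adding one level of indentation per level of nesting and replacing attribute indices with attribute names.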

In [14]:
def pretty_print_tree(tree, indent=0):
    for key, value in tree.iteritems():
        if type(key) == int:
            key = attribute_names[key]
        print '    ' * indent + str(key)
        if isinstance(value, dict):
            pretty_print_tree(value, indent+1)
        else:
            print '    ' * (indent+1) + str(value)    

&nbsp;

**View**

Prints the tree by delegating to `pretty_print_tree`.

In [15]:
def view(tree):
    pretty_print_tree(tree)

&nbsp;

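**Classify Instance**

Follows a single instance's attribute values from the root of the tree down to a leaf and returns the class label found there.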

In [16]:
def classify_instance(tree, instance):
    # Walk down the tree until a leaf (a class label) is reached.
    # This also handles a tree that is just a single label.
    while isinstance(tree, dict):
        root = next(iter(tree))  # the only key is the attribute index
        tree = tree[root][instance[root]]
            
    return tree
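
One case the function above does not cover is a test instance whose value for the split attribute never appeared in the training data: the lookup raises a `KeyError`. A sketch of one possible way to handle this, assuming the caller supplies a fallback label; `classify_with_default` and the toy tree are hypothetical and not part of the assignment.

```python
def classify_with_default(tree, instance, default):
    """Walk the tree; return `default` if no branch exists for the instance's value."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))   # the single key is the attribute index
        branches = tree[attribute]
        value = instance[attribute]
        if value not in branches:
            return default             # value was never seen during training
        tree = branches[value]
    return tree


# Hypothetical tree: split on attribute 1, then on attribute 2 under value 'b'.
toy_tree = {1: {'a': 'e', 'b': {2: {'x': 'p', 'y': 'e'}}}}

print(classify_with_default(toy_tree, ['?', 'b', 'x'], 'e'))  # p
print(classify_with_default(toy_tree, ['?', 'c', 'x'], 'e'))  # e, because 'c' has no branch
```

A natural choice for the fallback would be the majority label of the training set, though other choices are possible.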

&nbsp;

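**Classify**

Classifies every instance in `test_data` and returns the list of predicted labels, in the same order as the instances.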

In [17]:
def classify(tree, test_data):
    classifications = []
    for row in test_data:
        classification = classify_instance(tree, row)
        classifications.append(classification)
    
    return classifications    

&nbsp;

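**Evaluate**

Returns the error rate: the fraction of instances whose predicted label differs from the actual label in column 0.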

In [18]:
def evaluate(test_data, classifications):
    errors = 0.0
    for i in range(len(test_data)):
        if classifications[i] != test_data[i][0]:
            errors += 1
            
    error_rate = errors / len(test_data)
    return error_rate

-----

Put your final function invocations starting here, one per cell:

In [19]:
data = read_csv('agaricus-lepiota.data')
set1, set2 = create_train_test_sets(data)

tree = train(set1)
view(tree)
classifications = classify(tree, set2)
error_rate_1 = evaluate(set2, classifications)
print error_rate_1
print
print
print


tree = train(set2)
view(tree)
classifications = classify(tree, set1)
error_rate_2 = evaluate(set1, classifications)
print error_rate_2

# Step 9: average the classification error over the two runs.
print (error_rate_1 + error_rate_2) / 2

[['p', 'x', 's', 'n', 't', 'p', 'f', 'c', 'n', 'k', 'e', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 's', 'u'], ['e', 'x', 's', 'y', 't', 'a', 'f', 'c', 'b', 'k', 'e', 'c', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'n', 'n', 'g'], ['e', 'b', 's', 'w', 't', 'l', 'f', 'c', 'b', 'n', 'e', 'c', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'n', 'n', 'm'], ['p', 'x', 'y', 'w', 't', 'p', 'f', 'c', 'n', 'n', 'e', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 's', 'u'], ['e', 'x', 's', 'g', 'f', 'n', 'f', 'w', 'b', 'k', 't', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'e', 'n', 'a', 'g'], ['e', 'x', 'y', 'y', 't', 'a', 'f', 'c', 'b', 'n', 'e', 'c', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 'n', 'g'], ['e', 'b', 's', 'w', 't', 'a', 'f', 'c', 'b', 'g', 'e', 'c', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 'n', 'm'], ['e', 'b', 'y', 'w', 't', 'l', 'f', 'c', 'b', 'n', 'e', 'c', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'n', 's', 'm'], ['p', 'x', 'y', 'w', 't', 'p', 'f', 'c', 'n', 'p', 'e', 'e', 's