# Module 8 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Because sure you use your JHED ID (it's made out of your name and not your student id which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

In [1]:
from copy import deepcopy
from math import log2
import random
from pprint import pprint

## Decision Trees

For this assignment you will be implementing and evaluating a Decision Tree using the ID3 Algorithm (**no** pruning or normalized information gain). Use the provided pseudocode. The data is located at (copy link):

http://archive.ics.uci.edu/ml/datasets/Mushroom

**Just in case** the UCI repository is down, which happens from time to time, I have included the data and name files on Blackboard.

One of the things we did not talk about in the lectures was how to deal with missing values. In C4.5, missing values were handled by treating "?" as an implicit attribute value for every feature. For example, if the attribute was "size" then the domain would be ["small", "medium", "large", "?"]. Another approach is to skip instances with missing values. Yet another approach is to infer the missing value conditioned on the class. For example, if the class is "safe" and the color is missing, then we would infer the attribute value that is most often associated with "safe", perhaps "red". **Use the "?" approach for this assignment.**



You must implement the following functions:

`cross_validate` takes the data and performs 10 fold cross validation (from Module 3!).

`train` takes training_data and returns the Decision Tree as a data structure or object (a tree is definitely an Abstract Data Type, ADT, so OOP is warranted). Make sure your Tree can be represented somehow.

`classify` takes a tree produced from the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data).

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

Format your error rate so that it appears as a percent. That is, not 0.0234 but 2.34%.


```
def train(training_data):
   # returns a decision tree data structure
```

and `classify` takes a tree and a List of instances (possibly just one) and returns the classifications:

```
def classify( tree, test_data):
    # returns a list of classifications
```

and `evaluate` takes the actual classifications and the predicted classes and returns the classification error rate:

```
def evaluate(actual, predicted):
    # returns an error rate
```

You must apply 10 fold cross validation to your data set. You will treat each fold as a test set, using the combined remainder as the training set. You should print out the error rate for each fold and then an average error rate for the entire cross validation process.

This is all that is required for this assignment. I'm leaving more of the particulars up to you but you can definitely use the last module as a guide.

**Note** Because this assignment has a natural recursive implementation, you should consider using `deepcopy` at the appropriate places.

-----

**load_file**<br>
The `load_file` load the file given the filename.

Parameters:
* **file_name** is the name of the file we want to load.

retuns:<br>
It return the data loaded form the file in `list of lists` format.

In [2]:
def load_file(file_name):
    data = []
    file = open(file_name, "r")
    for line in file:
        tmp_data = line.rstrip().split(",")
        data += [tmp_data[1:] + tmp_data[:1]] # made sure the class column was the last one
    return data

**get_attribute_metadata**<br>
The `get_attribute_metadata` is a helper function for `train`. It creates a dictionary of all the attributes and its domain values.

Parameters:
* **data** is the data for which we want to get the attributes and its domains.

retuns:<br>
It return the `dictionary` of attributes and its domain values.

In [3]:
def get_attribute_metadata(data):
    attributes = {}
    for col_num in range(len(data[0])-1):
        attributes[col_num] = list(set([row[col_num] for row in data]))
    return attributes    

**entropy**<br>
The `entropy` is a helper function for `id3 algorithm`. it calculates the entorpy of the given data. The below function is used to calculte the entropy.
$$E(S) = -\sum_{i} p_ilog_2(p_i)$$

Parameters:
* **data** is the data whose entropy we want to find.

retuns:<br>
It return the `entropy` of the data.

In [4]:
def entropy(data):
    n = len(data)
    if n == 0: return 0
    attr1 = sum([1 for row in data if row[-1] == 'e'])
    attr2 = n - attr1
    if attr1 & attr2 != 0:
        entrpy = -(attr1/n)*log2(attr1/n)-(attr2/n)*log2(attr2/n)
        return entrpy
    return -(attr2/n)*log2(attr2/n) if attr1 == 0 else -(attr1/n)*log2(attr1/n)

**homogeneous**<br>
The `homogeneous` is a helper function for `id3 algorithm`. It check if the given dataset has all class labels of the same type or not.

Parameters:
* **data** is the data we want to check.

retuns:<br>
It returns `majority class label` if the data is homogeneours, otherwise it returns `False`.

In [5]:
def homogeneous(data):
    count = {'e':0, 'p':0}
    count['e'] = sum([1 for row in data if row[-1] == 'e'])
    count['p'] = sum([1 for row in data if row[-1] == 'p'])
    if count['e']== 0 or count['p'] == 0:
        return 'p' if count['e'] == 0 else 'e'
    else:
        return False

**majority_label**<br>
The `majority_label` is a helper function for `id3 algorithm`. It finds the majority class label in the data.
Parameters:
* **data** is the data we want to find majority class for.

retuns:<br>
It returnns `majority class label` for the givend data.

In [6]:
def majority_label(data):
    count = {'e':0, 'p':0}
    count['e'] = sum([1 for row in data if row[-1] == 'e'])
    count['p'] = sum([1 for row in data if row[-1] == 'p'])
    return 'e' if count['e'] > count['p'] else 'p'

**filter_data**<br>
The `filter_data` is a helper function for `id3 algorithm`. It creates a new list of data that maches that condition that attribute = attribute value.

Parameters:
* **data** is the data we want to filter.

retuns:<br>
It returns `filtered data` that mahces the condition.

In [7]:
def filter_data(data, attr, attr_val):
    filtered_data =  [row for row in data if row[attr] == attr_val]
    return filtered_data

**pick_best_attributes**<br>
The `pick_best_attributes` is a helper function for `id3 algorithm`. it finds the best attribute from the attribuets list by calculating the information gain for each attribute. The below function was used to calculte the information gain.
$$ G(S,A) = E(S) -\sum_{v \epsilon V_A} \frac{|S_v|}{|S|}E(S_v) $$


Parameters:
* **data** is the data we want to filter.
* **attributes** is the dictionary of attributes and its domain values.


retuns:<br>
It returns `filtered data` that mahces the condition.

In [8]:
def pick_best_attributes(data, attributes):
    start_entrpy = entropy(data)
    info_gain = []
    attr_list = list(attributes.keys())[:-1]
    for attr in attr_list:
        attr_entropy = []
        for attr_val in attributes[attr]:
            f_data = filter_data(data, attr, attr_val)
            attr_entropy += [(len(f_data)/len(data))*entropy(f_data)]
        info_gain += [start_entrpy - sum(attr_entropy)]
    best_attr_indx = info_gain.index(max(info_gain))
    return attr_list[best_attr_indx]

**id3**<br>
The `id3` is the algorithm that train the decision tree and creates a tree.

Parameters:
* **data** is the data used to train.
* **attributes** is the dictionary of attributes and its domain values.
* **default** is majority class.


retuns:<br>
It returns a trained `tree`.

In [9]:
def id3(data, attributes, default):
    if len(data) == 0:
        return default
    if homogeneous(data) != False:
        return homogeneous(data)
    if not attributes:
        return majority_label(data)
    best_attr = pick_best_attributes(data, attributes)
    node = {best_attr: {val:None for val in attributes[best_attr]}}  # create a new node
    default_label = majority_label(data)
    for val in attributes[best_attr]:
        subset = filter_data(data, best_attr, val)
        attr_copy =  {attr: attributes[attr] for attr in attributes if attr != best_attr}
        child = id3(subset, attr_copy, default_label)
        node[best_attr][val] = child   # add child to node 
    return node

**train**<br>
The `train` is the function that prepares the required data for `id3 algorithm`.

Parameters:
* **data** is the data used to train.

retuns:<br>
It returns a trained `tree`.

In [10]:
def train(data):
    attributes = get_attribute_metadata(data)
    tree = id3(data, attributes, default = 'p')
    return tree

**predict**<br>
The `predict` is the recursive function used to predic the class label for the given test.

Parameters:
* **tree** is trained decision tree.
* **test** is the test point we want to predict the class label.


retuns:<br>
It returns the predicted `class label`.

In [11]:
def predict(tree, test):
    num = list(tree.keys())[0]
    return tree[num][test[num]] if type(tree[num][test[num]]) == str else predict(tree[num][test[num]], test)

**classify**<br>
The `classify` function that makes a call to the `predict` for each test point we want to test.

Parameters:
* **tree** is trained decision tree.
* **test_data** is the test points we want to predict the class label.


retuns:<br>
It returns the predicted `class label` for each testpoint.

In [12]:
def classify(tree, test_data):
    if type(test_data) == list and type(test_data[0]) == list:
        return [predict(tree, test) for test in test_data]
    else:
        return predict(tree, test_data)

**evaluate**<br>
The `evaluate` function that calculates the error rate between the predicted and the actuaal. The below equation is used to calculate the erorr rate.
$$error\_rate=\frac{errors}{n}$$

Parameters:
* **actual** is the actual test label for the each test observation.
* **predicted** is the class labels that were predicted by our model.

retuns:<br>
It returns the `error rate`.

In [13]:
def evaluate(actual, predicted):
    result  = [1 for i in range(len(predicted)) if actual[i] != predicted[i]]
    error_rate = sum(result)/len(actual)
    return error_rate

**cross_validate**<br>
The `cross_validate` check the accuracy of our model by running the 10 fold cross validation. It prints the error rate of each fold and the averate error rate of 10 folds.

Parameters:
* **data** is data we want to use for cross validation.
* **cross_validate** indicates what percent(as decimal) of the data we want to use to train the model.

retuns:<br>
It does not return anything

In [14]:
def cross_validate(data, train_percent=.8):
    errors = []
    for test_nun in range(10):
        train_data = random.sample(data, int(len(data)*train_percent))
        test_data = [row[:-1] for row in data if row not in train_data]
        test_label = [row[-1] for row in data if row not in train_data]

        tree = train(train_data)
        predicted = classify(tree, test_data)
        error_rate = evaluate(test_label, predicted)
        errors += [error_rate]
        print(f"Test# {test_nun+1:2}: Error Rate:{round(error_rate*100,3)}%")
    print(f"Average Error Rate: {round((sum(errors)/ len(errors))*100, 3)}%")

In [15]:
# read the data
data_file = 'agaricus-lepiota.data'
data = load_file(data_file)

# Train the tree for representaiton
tree = train(data)
print("Tree representation:")
pprint(tree)

Tree representation:
{4: {'a': 'e',
     'c': 'p',
     'f': 'p',
     'l': 'e',
     'm': 'p',
     'n': {19: {'b': 'e',
                'h': 'e',
                'k': 'e',
                'n': 'e',
                'o': 'e',
                'r': 'p',
                'u': 'e',
                'w': {14: {'b': 'e',
                           'c': 'e',
                           'e': 'e',
                           'g': 'e',
                           'n': {10: {'?': 'p',
                                      'b': 'e',
                                      'c': 'e',
                                      'e': 'e',
                                      'r': 'e'}},
                           'o': 'e',
                           'p': 'e',
                           'w': {7: {'b': 'e', 'n': 'p'}},
                           'y': 'p'}},
                'y': 'e'}},
     'p': 'p',
     's': 'p',
     'y': 'p'}}


In [16]:
# run the cross validation
cross_validate(data)

Test#  1: Error Rate:0.0%
Test#  2: Error Rate:0.0%
Test#  3: Error Rate:0.0%
Test#  4: Error Rate:0.123%
Test#  5: Error Rate:0.0%
Test#  6: Error Rate:0.0%
Test#  7: Error Rate:0.0%
Test#  8: Error Rate:0.0%
Test#  9: Error Rate:0.0%
Test# 10: Error Rate:0.0%
Average Error Rate: 0.012%


## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Rull All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.