# Module 8 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Be sure you use your JHED ID (it's made out of your name, unlike your student ID, which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

In [1]:
from copy import deepcopy

## Decision Trees

For this assignment you will be implementing and evaluating a Decision Tree using the ID3 Algorithm (**no** pruning or normalized information gain). Use the provided pseudocode. The data is located at (copy link):

http://archive.ics.uci.edu/ml/datasets/Mushroom

**Just in case** the UCI repository is down, which happens from time to time, I have included the data and name files on Blackboard.

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        No Pandas. The only acceptable libraries in this class are those contained in the `environment.yml`. No OOP, either. You can use Dicts, NamedTuples, etc. as your abstract data type (ADT) for the tree and nodes.
    </p>
</div>

One of the things we did not talk about in the lectures was how to deal with missing values. There are two aspects of the problem here. What do we do with missing values in the training data? What do we do with missing values when doing classification?

There are a lot of different ways that we can handle this.
A common algorithm is to use something like kNN to impute the missing values.
We can use conditional probability as well.
There are also clever modifications to the Decision Tree algorithm itself that one can make.

We're going to do something simpler, given the size of the data set: remove the observations with missing values ("?").

You must implement the following functions:

`train` takes training_data and returns the Decision Tree as a data structure.

```
def train(training_data):
   # returns the Decision Tree.
```

`classify` takes a tree produced from the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data).

```
def classify(tree, observations, labeled=True):
    # returns a list of classifications
```

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

Do not use anything else as the evaluation metric or the submission will be deemed incomplete, i.e., an "F". (Hint: accuracy rate is not the error rate!)
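
As a small illustration of the metric, the sketch below computes the error rate from two parallel lists of labels. The function name `error_rate` and its arguments are hypothetical and are not part of the required interface.

```python
from typing import List


def error_rate(actual: List[str], predicted: List[str]) -> float:
    """Fraction of predictions that differ from the true labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)


# One of four predictions is wrong, so the error rate is 1/4.
print(f"{error_rate(['e', 'p', 'e', 'p'], ['e', 'p', 'p', 'p']):.4f}")  # 0.2500
```

Note that the corresponding accuracy here would be 0.75, which is why the two must not be confused.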

`cross_validate` takes the data and uses 10 fold cross validation (from Module 3!) to `train`, `classify`, and `evaluate`. **Remember to shuffle your data before you create your folds**. I leave the exact signature of `cross_validate` to you but you should write it so that you can use it with *any* `classify` function of the same form (using higher order functions and partial application).

Following Module 3's assignment, `cross_validate` should print out a table in exactly the same format. What you are looking for here is a consistent evaluation metric across the folds. Print the error rate to 4 decimal places. **Do not convert to a percentage.**
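
One possible shape for such a function is sketched below. It is only an illustration: the parameter names `train_fn`, `classify_fn`, `n_folds`, and `seed` are assumptions, the toy "majority label" model stands in for a real decision tree, and the printed lines are not the Module 3 table format, which is not reproduced here. The sketch also assumes the label sits in column 0 and that there are at least `n_folds` rows.

```python
import random
from functools import partial
from typing import Callable, List


def cross_validate(data: List[List[str]],
                   train_fn: Callable,
                   classify_fn: Callable,
                   n_folds: int = 10,
                   seed: int = 0) -> List[float]:
    """Return the test error rate of each fold."""
    rows = list(data)                   # copy so the caller's list is not reordered
    random.Random(seed).shuffle(rows)   # shuffle before creating the folds
    folds = [rows[i::n_folds] for i in range(n_folds)]
    rates = []
    for i, test in enumerate(folds):
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train_fn(training)      # fit only on the other folds
        errors = sum(1 for row in test if classify_fn(model, row) != row[0])
        rates.append(errors / len(test))
    return rates


# Toy stand-ins: the "model" is just the majority label of the training rows.
def train_majority(rows):
    labels = [row[0] for row in rows]
    return max(set(labels), key=labels.count)


def classify_constant(model, row, labeled=True):
    return model


toy = [["e", "x"]] * 15 + [["p", "y"]] * 5
rates = cross_validate(toy, train_majority, partial(classify_constant, labeled=True))
for i, rate in enumerate(rates):
    print(f"fold {i}: {rate:.4f}")
```

Because the training and classification steps are passed in as functions, the same loop can be reused with any model whose `classify` has the same form; `partial` fixes the extra `labeled` argument ahead of time.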

```
def pretty_print_tree(tree):
    # pretty prints the tree
```

This should be a text representation of a decision tree trained on the entire data set (no train/test).
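
A minimal sketch of one way to do this follows, assuming the tree is stored as a nested dict of the form `{attribute_index: {value: subtree_or_label}}` with plain strings as leaves. That representation, the helper name `format_tree`, and the indentation layout are all assumptions; the assignment does not prescribe them.

```python
from typing import List


def format_tree(tree, depth: int = 0) -> List[str]:
    """Return the lines of an indented text rendering of a nested-dict tree."""
    pad = "    " * depth
    if not isinstance(tree, dict):          # a leaf holds the class label
        return [f"{pad}-> {tree}"]
    lines = []
    for attribute, branches in tree.items():
        for value, subtree in branches.items():
            lines.append(f"{pad}attribute {attribute} == {value!r}")
            lines.extend(format_tree(subtree, depth + 1))
    return lines


def pretty_print_tree(tree) -> None:
    print("\n".join(format_tree(tree)))


toy_tree = {5: {"a": "e", "n": {20: {"k": "e", "r": "p"}}}}
pretty_print_tree(toy_tree)
```

Building the lines first and printing them afterwards keeps the recursion easy to test without capturing standard output.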

To summarize...

Apply the Decision Tree algorithm to the Mushroom data set using 10 fold cross validation and the error rate as the evaluation metric. When you are done, apply the Decision Tree algorithm to the entire data set and print out the resulting tree.

**Note** Because this assignment has a natural recursive implementation, you should consider using `deepcopy` at the appropriate places.

-----

<a id="Import Data"></a>
## Import Data

This code is copied from the Module 03 programming assignment. `parse_data` line 5 was updated for data type `str`. 

In [2]:
import random
import numpy as np
from typing import List, Dict, Tuple, Callable

In [3]:
def parse_data(file_name: str) -> List[List]:
    data = []
    with open(file_name, "r") as file:
        for line in file:
            datum = [str(value) for value in line.rstrip().split(",")]
            data.append(datum)
    random.shuffle(data)
    return data

In [4]:
data = parse_data("agaricus-lepiota.data")

In [5]:
len(data)

8124

In [6]:
len(data[0])

23

### Remove Missing Values
Per the assignment directions, this code will remove all lines in the data with "?" values. Per the dataset description, these values are only expected in attribute #11. 

In [7]:
data = [row for row in data if "?" not in row]

In [8]:
len(data)

5644

In [9]:
len(data[0])

23

## Train/Test Splits - n folds

This code is copied from the Module 03 programming assignment. It creates folds from the data, then creates train and test datasets. 

`create_folds` takes a list (`xs`) and splits it into `n` folds of nearly equal size; with `n = 10`, each fold holds roughly one-tenth of the observations.
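
To see how `divmod` spreads a remainder over the folds, the sketch below restates the same slicing on a toy list of 23 items. With the 5,644 rows used in this notebook, `divmod(5644, 10)` is `(564, 4)`, so folds 0 through 3 hold 565 rows and the remaining folds hold 564.

```python
from typing import List


def create_folds(xs: List, n: int) -> List[List]:
    k, m = divmod(len(xs), n)   # k = base fold size, m = folds that get one extra item
    return [xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]


sizes = [len(fold) for fold in create_folds(list(range(23)), 10)]
print(sizes)  # [3, 3, 3, 2, 2, 2, 2, 2, 2, 2]
```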

In [10]:
def create_folds(xs: List, n: int) -> List[List[List]]:
    k, m = divmod(len(xs), n) #k = base fold size, m = leftover rows (the first m folds get one extra)
    # be careful of generators...
    return list(xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

In [11]:
folds = create_folds(data, 10)

In [12]:
len(folds)

10

In [13]:
def create_train_test(folds: List[List[List]], index: int) -> Tuple[List[List], List[List]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    return training, test

We can test the function by building train and test datasets in which the test set is the fold at index 0:

In [14]:
training_data, test_data = create_train_test(folds, 0)
assert len(training_data) == 5079
assert len(test_data) == 565 #10%

## `pick_best_attribute` subfunctions

<a id="get_cols"></a>
### get_cols

`get_cols` takes in the full dataset and outputs a list of all the column data (including y-column)

* **data**: List: full dataset

**returns** cols: List[List] of columns

In [15]:
def get_cols(data: List[List]):
    cols = [[] for i in range(len(data[0]))] #init empty lists
    for row in data: 
        for i in range(len(data[0])): #every column, including y
            cols[i].append(row[i])

    return cols

In [16]:
#Unit Tests / Assertions
test_mat = [
    [1, 2, 7], 
    [3, 4, 7]]
assert get_cols(test_mat) == [[1, 3], [2, 4], [7, 7]]
assert len(get_cols(test_mat)) == len(test_mat[0])
test_mat = [
    [1, 2], 
    [3, 4]]
assert get_cols(test_mat) == [[1, 3], [2, 4]]

<a id="get_unique"></a>
### get_unique

`get_unique` takes in a column and outputs the unique values from that column. 

* **col**: List: single column of data

**returns** unique_vals: List of unique values

In [17]:
def get_unique(col): 

    unique_vals = [] #init
    for val in col: 
        if val not in unique_vals: 
            unique_vals.append(val)
    
    return unique_vals

In [18]:
# Unit Tests / Assertions
test1 = [1, 2, 3]
assert get_unique(test1) == [1, 2, 3]

test2 = [2, 2, 2]
assert get_unique(test2) == [2]

test3 = [-1, 0, 0]
assert get_unique(test3) == [-1, 0]

<a id="get_domain"></a>
### get_domain

`get_domain` takes in the dataset and a column index (`best_attr`), calls the `get_cols` and `get_unique` functions, and returns the domain of the given attribute. The output has the same form as that of `get_unique`, but this function can start from the full dataset. This is a cleaner implementation, but it was created after `get_unique`, so both remain. 

* **data**: List: full dataset
* **best_attr**: int: single column index

**returns** unique_vals: List of unique values

In [19]:
def get_domain(data, best_attr): 
    cols = get_cols(data)
    test_col = cols[best_attr]
    return get_unique(test_col)

In [20]:
# Unit Tests / Assertions
test1 = [[1, 2, -1], 
         [2, 2, 0], 
         [3, 2, 0]]
assert get_domain(test1, 2) == [-1, 0]
assert get_domain(test1, 1) == [2]

test2 = [[1, 2, 3]]
assert get_domain(test2, 0) == [1]

<a id="calc_e"></a>
### calc_e

`calc_e` takes in number of yes & no items, and total items, and performs the basic entropy calculation. Given this is a basic math function, no unit tests / assertions are performed. 

* **num_yes**: int: number of observations resulting in positive / yes / 'e' edible label
* **num_no**: int: number of observations resulting in negative / no / 'p' poisonous label
* **total**: int: total observations

**returns** E: float, entropy
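
Although no assertions were written for `calc_e`, a few known values make handy sanity checks. The sketch below restates the binary entropy formula $H = -p_{yes}\log_2 p_{yes} - p_{no}\log_2 p_{no}$ under the hypothetical name `binary_entropy`. It uses `math.log2` in place of `np.log2` only to keep the example free of dependencies, and it derives the total from the two counts rather than taking it as an argument.

```python
from math import isclose, log2


def binary_entropy(num_yes: int, num_no: int) -> float:
    """Entropy in bits of a two-class split; 0 * log2(0) is treated as 0."""
    total = num_yes + num_no
    entropy = 0.0
    for count in (num_yes, num_no):
        if count:                       # skip empty classes to avoid log2(0)
            p = count / total
            entropy -= p * log2(p)
    return entropy


assert binary_entropy(5, 5) == 1.0                  # even split: maximum uncertainty
assert binary_entropy(8, 0) == 0.0                  # pure node: no uncertainty
assert isclose(binary_entropy(3, 1), 0.8112781245)  # 3:1 split
```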

In [21]:
def calc_e(num_yes, num_no, total): 
    # class proportions; a zero proportion contributes nothing to the entropy
    p_yes = num_yes / total
    p_no = num_no / total
    E = 0.0
    
    if p_yes != 0: 
        E -= (p_yes * np.log2(p_yes))
    if p_no != 0: 
        E -= (p_no * np.log2(p_no))
    
    return E

<a id="get_initial_entropy"></a>
### get_initial_entropy

`get_initial_entropy` takes in the label column `y`, and calculates the initial entropy for a given dataset. This is different than the `get_entropy` function because it only considers the `y` column and no attributes. 'e' and 'p' labels are set in these functions; in the future they could be passed in as function inputs. This is a subfunction of `pick_best_attribute`; unit tests and assertions will be performed there. 

* **y**: List: label column of data

**returns** Ew: float, weighted entropy based on number of yes/no items in total set

In [22]:
def get_initial_entropy(y): 
    yes, no = 'e', 'p'
    Ew = []
    
    total = len(y) #total number
    num_yes = sum(1 for i in range(len(y)) if (y[i] == yes))
    num_no = sum(1 for i in range(len(y)) if (y[i] == no))
    # E = - (num_yes / total)*np.log2(num_yes / total) - (num_no/total)*np.log2(num_no/total)
    E = calc_e(num_yes, num_no, total)

    weight = (num_yes+num_no) / total
    Ew.append(E*weight)

    return sum(Ew)

<a id="get_entropy"></a>
### get_entropy

`get_entropy` takes in an attribute column `att_col` and the label column `y`, and calculates the entropy for that attribute. 'e' and 'p' labels are set in these functions; in the future they could be passed in as function inputs. This is a subfunction of `pick_best_attribute`; unit tests and assertions will be performed there. 

* **att_col**: List: column of data for a given attribute
* **y**: List: label column of data

**returns** Ew: float, weighted entropy based on number of yes/no items in total set

In [23]:
def get_entropy(att_col, y): #weighted entropy

    yes, no = 'e', 'p'
    unique_vals = get_unique(att_col)
    len_col = len(att_col)
    Ew = [] #init

    for val in unique_vals: 
        total = att_col.count(val) #total number
        num_yes = sum(1 for i in range(len(att_col)) if (att_col[i] == val and y[i] == yes))
        num_no = sum(1 for i in range(len(att_col)) if (att_col[i] == val and y[i] == no))

        E = calc_e(num_yes, num_no, total)
        
        weight = (num_yes+num_no) / len_col
        Ew.append(E*weight)

    return sum(Ew)

<a id="pick_best_attribute"></a>
### pick_best_attribute

`pick_best_attribute` takes in the dataset and a list of attribute indices, and outputs the current best attribute. This uses several helper functions defined above. 

* **data**: List[list]: full dataset
* **attributes**: List[int]: attribute indices

**returns** best_att: int, best attribute index
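
To make the selection criterion concrete, the sketch below computes information gain (the entropy of the labels minus the weighted entropy after splitting on an attribute) for a toy dataset. The helper names `entropy` and `information_gain` are hypothetical and are written with `collections.Counter` rather than the helpers above, so this is an independent cross-check and not the notebook's implementation.

```python
from collections import Counter
from math import log2
from typing import List


def entropy(labels: List[str]) -> float:
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows: List[list], attribute: int) -> float:
    """Entropy of the labels (column 0) minus the weighted entropy after the split."""
    labels = [row[0] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[0])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


rows = [["e", "a", "x"], ["e", "a", "y"], ["p", "b", "x"], ["p", "b", "y"]]
print(information_gain(rows, 1))  # 1.0: column 1 separates the labels perfectly
print(information_gain(rows, 2))  # 0.0: column 2 says nothing about the label
```

When two attributes have the same gain, `pick_best_attribute` keeps the later one because it compares with `>=`; that is why the first assertion in the unit tests expects index 2 rather than 1.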

In [24]:
def pick_best_attribute(data, attributes): 

    cols = get_cols(data) #all cols
    y = cols[0]
    entropy_0 = get_initial_entropy(y) #starting
    best_att, best_ig = -1, 0 #init

    for att in attributes: 
        info = get_entropy(cols[att], y)
        ig = entropy_0 - info
        if ig >= best_ig: 
            best_att, best_ig = att, ig
        
    return best_att

In [25]:
#unit tests / assertions
test_mat = [
    ['e', 2, 7], 
    ['e', 4, 7], 
    ['p', 6, 6]]
assert pick_best_attribute(test_mat, [1, 2]) == 2

test_mat = [
    ['e', 2, 1, 5, 3], 
    ['e', 5, 7, 2, 3], 
    ['p', 2, 4, 6, 6], 
    ['e', 8, 9, 2, 3]]
assert pick_best_attribute(test_mat, [1, 2, 3, 4]) == 4
assert pick_best_attribute(test_mat, [1, 2, 3]) == 3

<a id="majority_label"></a>
### majority_label

`majority_label` takes in the dataset, assumes the label column is at col[0], and returns the majority label from these values. 

* **data**: List[list]: full dataset

**returns** maj_label: majority label from data
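
As a cross-check on the counting, the same majority vote can be sketched with `collections.Counter`. The name `majority_label_counter` is hypothetical, and this is an alternative formulation rather than the function used in this notebook. The toy rows deliberately put the minority label first, a case that a first-label shortcut would get wrong.

```python
from collections import Counter
from typing import List


def majority_label_counter(data: List[list]) -> str:
    """Most frequent value in column 0; ties go to the label seen first."""
    counts = Counter(row[0] for row in data)
    return counts.most_common(1)[0][0]


rows = [["p", 2], ["e", 5], ["e", 2], ["e", 8]]
print(majority_label_counter(rows))  # e
```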

In [26]:
def majority_label(data): 
    
    labels = [row[0] for row in data] #label column
    vals = get_unique(labels)
    maj_count, maj_label = 0, vals[0] #y[0]
    
    for val in vals: 
        sub_count = labels.count(val) #count over all rows, not just the unique values
        if sub_count > maj_count: 
            maj_count = sub_count
            maj_label = val

    return maj_label

In [27]:
test_mat = [
    ['e', 2, 1, 5, 3], 
    ['e', 5, 7, 2, 3], 
    ['p', 2, 4, 6, 6], 
    ['e', 8, 9, 2, 3]]
assert majority_label(test_mat) == 'e'

test_mat = [
    ['p', 2, 1, 5, 3], 
    ['p', 5, 7, 2, 3], 
    ['p', 2, 4, 6, 6], 
    ['e', 8, 9, 2, 3]]
assert majority_label(test_mat) == 'p'

test_mat = [
    ['e', 2, 1, 5, 3], 
    ['e', 5, 7, 2, 3], 
    ['p', 2, 4, 6, 6], 
    ['p', 8, 9, 2, 3]]
assert majority_label(test_mat) == 'e' #first to reach tied majority

<a id="is_homogeneous"></a>
### is_homogeneous

`is_homogeneous` takes in the dataset, assumes the label column is at col[0], and determines if the data all has the same label.

* **data**: List[list]: full dataset

**returns** homogeneous: bool, True/False on whether cols[0] is all one value

In [28]:
def is_homogeneous(data): 
    cols = get_cols(data)
    y = cols[0]
    val = y[0]
    homogeneous = True
    for i in y: 
        if i != val: 
            homogeneous = False

    return homogeneous    

In [29]:
#Unit Tests / Assertions
test_mat = [
    ['e', 2, 1, 5, 3], 
    ['e', 5, 7, 2, 3], 
    ['p', 2, 4, 6, 6], 
    ['p', 8, 9, 2, 3]]
assert is_homogeneous(test_mat) == False

test_mat = [
    ['e', 2, 1, 5, 3], 
    ['e', 5, 7, 2, 3], 
    ['e', 2, 4, 6, 6], 
    ['e', 8, 9, 2, 3]]
assert is_homogeneous(test_mat) == True

test_mat = [
    ['e', 2, 1, 5, 3], 
    ['e', 5, 7, 2, 3], 
    ['e', 2, 4, 6, 6], 
    ['p', 8, 9, 2, 3]]
assert is_homogeneous(test_mat) == False

<a id="id3"></a>
### id3

`id3` is the main recursive function to generate a decision tree. This takes in the full dataset, a set of attribute indices, and the default (majority) label for the dataset. This recursively builds the decision tree by forming subsets of data and calling itself. Unit tests and assertions are shown in the following sections; `train` is a main function that calls on ID3. 

* **data**: List[list]: full dataset
* **attributes**: List[int]: attribute indices
* **default**: str: majority / default class label

**returns** Dict or str: a nested dict mapping the best attribute index to its children, or a class label at a leaf

In [30]:
def id3(data, attributes, default): 
    if not data: 
        return default #no rows left: fall back to the parent's majority label
    if is_homogeneous(data): 
        return data[0][0] #class label
    if not attributes: 
        return majority_label(data)
    best_attr = pick_best_attribute(data, attributes)
    nodes = {}
    default_label = majority_label(data)
    domain = get_domain(data, best_attr)
    for value in domain: 
        subset = deepcopy(data)
        subset = [row for row in subset if row[best_attr]==value]
        new_attr = deepcopy(attributes)
        new_attr.remove(best_attr)       
        child = id3(subset, new_attr, default_label)
        nodes[value] = child
    return {best_attr: nodes}

<a id="train"></a>
## train

`train` takes in the training dataset, and outputs the decision tree. 

* **training_data**: List[List[str]]: training portion of the dataset, 9 folds

**returns** Decision Tree


In [31]:
def train(training_data):
    
    maj = majority_label(training_data)
    attributes = list(range(1, len(training_data[0])))
    tree = id3(training_data, attributes, maj)

    return tree

In [32]:
#test output: 
ans = train(training_data)
print(ans)

{5: {'n': {20: {'k': 'e', 'n': 'e', 'w': {21: {'y': 'e', 'v': 'e', 'c': 'p'}}, 'r': 'p'}}, 'f': 'p', 'p': 'p', 'a': 'e', 'l': 'e', 'c': 'p', 'm': 'p'}}


<a id="classify"></a>
## classify

`classify` takes a tree produced from the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data). `observations` is a single row of attribute data. 

* **tree**: Dict: decision tree
* **observations**: List[str]: single row of attribute data
* **labeled**: Bool: whether or not first column is labels for data

**returns** str: prediction for label of observation
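
The traversal can be sketched on a toy tree as follows. This is an illustration under assumptions, not the notebook's function: the name `classify_row`, the `default` argument, and the choice to return `"unknown"` for an attribute value that never appeared in training are all hypothetical. It assumes the nested-dict form `{attribute_index: {value: subtree_or_label}}` with attribute indices that count the label as column 0.

```python
from typing import List, Union

Tree = Union[str, dict]


def classify_row(tree: Tree, row: List[str], labeled: bool = True,
                 default: str = "unknown") -> str:
    """Follow one observation down a nested-dict tree to a leaf label."""
    offset = 0 if labeled else 1        # unlabeled rows have no label in column 0
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        value = row[attribute - offset]
        if value not in branches:       # value never seen during training
            return default
        tree = branches[value]
    return tree


toy_tree = {1: {"a": "e", "b": {2: {"x": "p", "y": "e"}}}}
print(classify_row(toy_tree, ["e", "b", "y"]))            # e
print(classify_row(toy_tree, ["b", "x"], labeled=False))  # p
print(classify_row(toy_tree, ["e", "c", "x"]))            # unknown
```

Shifting the index by one for unlabeled rows keeps the tree itself unchanged, so the same tree serves both the labeled test set and new, unlabeled data.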


In [33]:
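def classify(tree, observations, labeled=True):
    
    # attribute indices in the tree count the label as column 0,
    # so shift by one when the observation has no label column
    offset = 0 if labeled else 1
    for key in tree.keys():
        val = observations[key - offset]
        next_val = tree[key][val]
        if isinstance(next_val, str): #leaf: class label
            prediction = next_val
        else: #subtree: keep descending, passing `labeled` along
            prediction = classify(next_val, observations, labeled)

    return prediction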

In [34]:
test_row = test_data[4]
ans = train(training_data) #tree 
eval = classify(ans, test_row) #prediction
assert eval == test_row[0]

test_row = test_data[0]
assert classify(ans, test_row, True) == test_row[0]

test_row = test_data[1]
assert classify(ans, test_row, True) == test_row[0]

<a id="evaluate"></a>
## evaluate

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

* **data**: List[List[str]]: Full dataset

*returns print statements*



In [35]:
def evaluate(data): 
    for i in range(len(folds)): 
        train_data, test_data = create_train_test(folds, i)
        tree = train(train_data) #fit on the training folds, not the held-out test fold
        # evals = []
        errors, n = 0, len(test_data)
        for row in test_data: 
            # evals.append(classify(tree, row, True))
            eval = classify(tree, row, True)
            if eval != row[0]: #incorrect classification
                errors += 1
        error_rate = (errors / n)
        print('Fold ', i, ' Error: ', error_rate*100, '%')

In [36]:
evaluate(data)

Fold  0  Error:  0.0 %
Fold  1  Error:  0.0 %
Fold  2  Error:  0.0 %
Fold  3  Error:  0.0 %
Fold  4  Error:  0.0 %
Fold  5  Error:  0.0 %
Fold  6  Error:  0.0 %
Fold  7  Error:  0.0 %
Fold  8  Error:  0.0 %
Fold  9  Error:  0.0 %


<a id="Discussion"></a>
## Discussion

This seems a little fishy. The `evaluate` function reports an error rate of 0 on every fold, meaning the tree predicts every class label correctly. It's unclear whether this is the intended outcome or whether I'm somehow skewing or overfitting my data. Given how prone decision trees are to overfitting, it is possible this is the natural outcome for this decision tree and dataset. 

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.

# Submit Notes

* Initial submittal Sunday 3/17/24
    * Submittal not complete - family health news resulting in delay
    * Messaged professor via Canvas on 3/17/24; will resubmit complete assignment and notify professor when submitted
    * Current progress stopped in the middle of debugging the `id3` algorithm
 
* Resubmit Notes 3/24/24
    * Error Rate: seems odd that I'd be showing 0% overall - perhaps that's just overfitting? 
    * Not sure how to handle `labeled=False`; I may have needed to pass this label into other functions like `id3` and `pick_best_attribute`.
    * Overall I feel my code is far more complex than it needs to be... I did struggle a bit with understanding this algorithm and figuring out the recursion. I welcome any feedback you have on how I could have simplified this! 
    * Thank you again for the extension on this assignment - I truly appreciate your understanding! 