# Module 9 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Be sure you use your JHED ID (it is derived from your name, unlike your student ID, which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

## Naive Bayes Classifier

For this assignment you will be implementing and evaluating a Naive Bayes Classifier with the same data from last week:

http://archive.ics.uci.edu/ml/datasets/Mushroom

(You should have downloaded it).

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        No Pandas. The only acceptable libraries in this class are those contained in the `environment.yml`. No OOP, either. You can use Dicts, NamedTuples, etc. as your abstract data type (ADT) for the classifier.
    </p>
</div>


You'll first need to calculate all of the necessary probabilities using a `train` function. A flag will control whether or not you use "+1 Smoothing" or not. You'll then need to have a `classify` function that takes your probabilities, a List of instances (possibly a list of 1) and returns a List of Tuples. Each Tuple has the best class in the first position and a dict with a key for every possible class label and the associated *normalized* probability. For example, if we have given the `classify` function a list of 2 observations, we would get the following back:

```
[("e", {"e": 0.98, "p": 0.02}), ("p", {"e": 0.34, "p": 0.66})]
```

When calculating the error rate of your classifier, you should pick the class label with the highest probability; you can write a simple function that takes the Dict and returns that class label.
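As a minimal sketch of that helper (the name `best_label` is hypothetical, not part of the assignment), the label with the highest probability can be picked with `max` and a key function:

```python
from typing import Dict

def best_label(probabilities: Dict[str, float]) -> str:
    """Return the class label with the highest probability (hypothetical helper)."""
    return max(probabilities, key=probabilities.get)

print(best_label({"e": 0.98, "p": 0.02}))  # e
print(best_label({"e": 0.34, "p": 0.66}))  # p
```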

As a reminder, the Naive Bayes Classifier generates the *unnormalized* probabilities from the numerator of Bayes Rule:

$$P(C|A) \propto P(A|C)P(C)$$

where C is the class and A are the attributes (data). Since the normalizer of Bayes Rule is the *sum* of all possible numerators and you have to calculate them all, the normalizer is just the sum of the probabilities.
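To make the normalizer concrete, here is a small sketch with made-up numerator values (the function name `normalize_scores` and the numbers are assumptions for illustration only). Each numerator is divided by the sum of all numerators:

```python
def normalize_scores(scores):
    """Divide each unnormalized score by the sum of all scores."""
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}

# Made-up numerators P(A|C)P(C) for the two classes.
unnormalized = {"e": 0.375, "p": 0.125}
normalized = normalize_scores(unnormalized)
print(normalized)  # {'e': 0.75, 'p': 0.25}
```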

You will have the same basic functions as the last module's assignment and some of them can be reused or at least repurposed.

`train` takes training_data and returns a Naive Bayes Classifier (NBC) as a data structure. There are many options including namedtuples and just plain old nested dictionaries. **No OOP**.

```
def train(training_data, smoothing=True):
   # returns the Naive Bayes Classifier (NBC).
```

The `smoothing` value defaults to True. You should handle both cases.

`classify` takes a NBC produced from the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data). (This is not the same `classify` as the pseudocode which classifies only one instance at a time; it can call it though).

```
def classify(nbc, observations, labeled=True):
    # returns a list of tuples, the argmax and the raw data as per the pseudocode.
```

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

Do not use anything else as the evaluation metric or the submission will be deemed incomplete, i.e., an "F". (Hint: accuracy rate is not the error rate!).

`cross_validate` takes the data and uses 10 fold cross validation (from Module 3!) to `train`, `classify`, and `evaluate`. **Remember to shuffle your data before you create your folds**. I leave the exact signature of `cross_validate` to you but you should write it so that you can use it with *any* `classify` function of the same form (using higher order functions and partial application). If you did so last time, you can reuse it for this assignment.

Following Module 3's discussion, `cross_validate` should print out the fold number and the evaluation metric (error rate) for each fold and then the average value (and the variance). What you are looking for here is a consistent evaluation metric across the folds. You should print the error rates in terms of percents (i.e., multiply the error rate by 100 and add "%" to the end).
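One possible shape for such a reusable `cross_validate` is sketched below. Everything here is an assumption for illustration: the name `cross_validate_generic`, its parameters, the strided fold construction, and the toy `train_constant` / `classify_constant` stand-ins (which are not Naive Bayes). The point is only that the loop receives `train` and `classify` as arguments, and that `functools.partial` can fix extra keyword arguments beforehand.

```python
import random
from functools import partial
from statistics import mean, pvariance

def cross_validate_generic(data, train_fn, classify_fn, n_folds=10, seed=0):
    """Hypothetical generic k-fold loop: train_fn(rows) -> model,
    classify_fn(model, rows) -> [(best_label, distribution), ...]."""
    rows = list(data)
    random.Random(seed).shuffle(rows)  # shuffle before creating folds
    folds = [rows[i::n_folds] for i in range(n_folds)]
    error_rates = []
    for i, test in enumerate(folds):
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train_fn(training)
        predictions = classify_fn(model, test)
        errors = sum(1 for row, (best, _) in zip(test, predictions) if row[0] != best)
        error_rates.append(100 * errors / len(test))
        print(f"Fold {i}: {error_rates[-1]:.2f}%")
    # Population variance of the per-fold error rates (an assumption about which variance is wanted).
    print(f"Mean: {mean(error_rates):.2f}%  Variance: {pvariance(error_rates):.2f}")
    return error_rates

# Toy stand-ins for train/classify (NOT the Naive Bayes functions).
def train_constant(rows, label):
    return label  # the "model" is just a fixed label

def classify_constant(model, rows):
    return [(model, {model: 1.0}) for _ in rows]

toy_data = [["e", "x"]] * 10 + [["p", "f"]] * 10
rates = cross_validate_generic(
    toy_data, partial(train_constant, label="e"), classify_constant
)
```

With real code, the same call pattern would pass something like `partial(train, smoothing=False)` as `train_fn`, so the loop itself never needs to know about the smoothing flag.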

To summarize...

Apply the Naive Bayes Classifier algorithm to the Mushroom data set using 10 fold cross validation and the error rate as the evaluation metric. You will do this *twice*. Once with smoothing=True and once with smoothing=False. You should follow up with a brief explanation for the similarities or differences in the results.

## Imports from Previous Module (08)

In [1]:
from copy import deepcopy

<a id="Import Data"></a>
### Import Data

This code is copied from the Module 03 programming assignment. `parse_data` line 5 was updated for data type `str`. 

In [2]:
import random
import numpy as np
from typing import List, Dict, Tuple, Callable

In [3]:
def parse_data(file_name: str) -> List[List]:
    data = []
    with open(file_name, "r") as file: #file is closed automatically
        for line in file:
            datum = [str(value) for value in line.rstrip().split(",")]
            data.append(datum)
    random.shuffle(data)
    return data

In [4]:
data = parse_data("agaricus-lepiota.data")

In [5]:
len(data)

8124

In [6]:
len(data[0])

23

### Remove Missing Values
Per the assignment directions, this code will remove all lines in the data with "?" values. Per the dataset description, these values are only expected in attribute #11. 

<div style="background: #4682b4">
Commented out for resubmit, per feedback in initial submit. 
</div>

In [7]:
# data = [row for row in data if "?" not in row]

In [8]:
len(data)

8124

In [9]:
len(data[0])

23

### Train/Test Splits - n folds

This code is copied from the Module 03 programming assignment. It creates folds from the data, then creates train and test datasets. 

`create_folds` will take a list (xs) and split it into `n` folds of nearly equal size; with `n = 10`, each fold contains about one-tenth of the observations.

In [10]:
def create_folds(xs: List, n: int) -> List[List[List]]:
    k, m = divmod(len(xs), n) #k = base fold size, m = remainder (first m folds get one extra item)
    # be careful of generators...
    return list(xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

In [11]:
folds = create_folds(data, 10)

In [12]:
len(folds)

10
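As a quick check of the `divmod` slicing, the sketch below computes only the fold sizes (the helper name `fold_sizes` is hypothetical). For the 8124 rows loaded above, the remainder is 4, so the first four folds get one extra row:

```python
def fold_sizes(n_items, n_folds):
    """Sizes produced by splitting n_items into n_folds with a divmod-based scheme."""
    k, m = divmod(n_items, n_folds)  # k = base size, m = leftover items
    return [k + 1 if i < m else k for i in range(n_folds)]

print(fold_sizes(8124, 10))
# [813, 813, 813, 813, 812, 812, 812, 812, 812, 812]
```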

In [13]:
def create_train_test(folds: List[List[List]], index: int) -> Tuple[List[List], List[List]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    return training, test

We can test the function to give us train and test datasets where the test set is the fold at index 0:

In [14]:
training_data, test_data = create_train_test(folds, 0)
# assert len(training_data) == 5079
# assert len(test_data) == 565 #10%

<a id="get_cols"></a>
### get_cols

`get_cols` takes in the full dataset and outputs a list of all the column data (including y-column)

* **data**: List: full dataset

**returns** cols: List[List] of columns

In [15]:
def get_cols(data: List[List]):
    cols = [[] for i in range(len(data[0]))] #init empty lists
    for row in data: 
        for i in range(len(data[0])): #every column, incl. y (col 0)
            cols[i].append(row[i])

    return cols

In [16]:
#Unit Tests / Assertions
test_mat = [
    [1, 2, 7], 
    [3, 4, 7]]
assert get_cols(test_mat) == [[1, 3], [2, 4], [7, 7]]
assert len(get_cols(test_mat)) == len(test_mat[0])
test_mat = [
    [1, 2], 
    [3, 4]]
assert get_cols(test_mat) == [[1, 3], [2, 4]]

<a id="get_unique"></a>
### get_unique

`get_unique` takes in a column and outputs the unique values from that column. 

* **col**: List: single column of data

**returns** unique_vals: List of unique values

In [17]:
def get_unique(col): 

    unique_vals = [] #init
    for val in col: 
        if val not in unique_vals: 
            unique_vals.append(val)
    
    return unique_vals

In [18]:
# Unit Tests / Assertions
test1 = [1, 2, 3]
assert get_unique(test1) == [1, 2, 3]

test2 = [2, 2, 2]
assert get_unique(test2) == [2]

test3 = [-1, 0, 0]
assert get_unique(test3) == [-1, 0]

<a id="train"></a>
## Train

`train` takes a data set with labels (like the training set or test set) and a boolean flag for +1 smoothing, and returns a naive bayes classifier as a nested dictionary of probabilities. The resulting NBC dictionary has first level keys of attribute (column) numbers, second level keys of attribute values, and third level keys of 'e', 'p', and 'att_total'. The 'e' and 'p' keys hold the probability of that attribute value given the class, and the 'att_total' key holds the overall probability of that attribute value occurring in the training dataset. Because column 0 is the class label, `NBC[0][label]['att_total']` is the class prior. A single NBC callout is used to show the format of the output.

* **training_data**: List[List[str]]: Dataset for training the NBC
* **smoothing**: bool: whether or not to use +1 smoothing

**returns** Dict: NBC - naive bayes classifier
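The per-attribute estimate can be illustrated on a tiny made-up column. This is a sketch, not the notebook's `train`: the helper name `conditional_probabilities` and the toy data are assumptions. It uses the same formula as this notebook, (count + 1) / (class total + 1); other references add the number of distinct attribute values to the denominator instead.

```python
from collections import Counter

def conditional_probabilities(values, labels, smoothing=True):
    """P(value | label) for one attribute column, with optional +1 smoothing."""
    smooth = 1 if smoothing else 0
    label_totals = Counter(labels)
    pair_counts = Counter(zip(values, labels))
    table = {}
    for value in sorted(set(values)):
        table[value] = {
            label: (pair_counts[(value, label)] + smooth) / (label_totals[label] + smooth)
            for label in sorted(label_totals)
        }
    return table

# Made-up attribute column and class labels.
values = ["x", "x", "f", "x"]
labels = ["e", "e", "p", "p"]

unsmoothed = conditional_probabilities(values, labels, smoothing=False)
smoothed = conditional_probabilities(values, labels, smoothing=True)
print(unsmoothed["f"])  # {'e': 0.0, 'p': 0.5}
print(smoothed["f"])
```

Without smoothing, the unseen pair ('f', 'e') gets probability 0, which zeroes out the whole product for class 'e'; with smoothing it gets a small nonzero value.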

In [19]:
def train(training_data, smoothing=True): 
    smooth = 1 if smoothing else 0
    yes, no = 'e', 'p' #init
    attributes = list(range(0, len(training_data[0]))) #[0, 1, ... N], col 0 is the label
    NBC = {} #init
    cols = get_cols(training_data) #columns 
    total_yes, total_no = cols[0].count(yes), cols[0].count(no)
    for att in attributes: 
        vals = get_unique(cols[att])
        probs = {} #init
        for val in vals: 
            val_yes = sum(1 for i in range(len(cols[att])) if (cols[att][i] == val and cols[0][i] == yes))
            val_no = sum(1 for i in range(len(cols[att])) if (cols[att][i] == val and cols[0][i] == no))
            py = (val_yes + smooth) / (total_yes + smooth)
            pn = (val_no + smooth) / (total_no + smooth)
            pt = (val_yes + val_no) / (total_yes + total_no)
            probs[val] = {str(yes): py, str(no): pn, 'att_total': pt}
        NBC[att] = probs    
    return NBC

In [20]:
temp_train = training_data[0]
train(temp_train)

{0: {'e': {'e': 1.0, 'p': 0.3333333333333333, 'att_total': 0.5},
  'x': {'e': 0.3333333333333333, 'p': 0.3333333333333333, 'att_total': 0.0},
  'y': {'e': 0.3333333333333333, 'p': 0.3333333333333333, 'att_total': 0.0},
  't': {'e': 0.3333333333333333, 'p': 0.3333333333333333, 'att_total': 0.0},
  'a': {'e': 0.3333333333333333, 'p': 0.3333333333333333, 'att_total': 0.0},
  'f': {'e': 0.3333333333333333, 'p': 0.3333333333333333, 'att_total': 0.0},
  'c': {'e': 0.3333333333333333, 'p': 0.3333333333333333, 'att_total': 0.0},
  'b': {'e': 0.3333333333333333, 'p': 0.3333333333333333, 'att_total': 0.0},
  'n': {'e': 0.3333333333333333, 'p': 0.3333333333333333, 'att_total': 0.0},
  's': {'e': 0.3333333333333333, 'p': 0.3333333333333333, 'att_total': 0.0},
  'w': {'e': 0.3333333333333333, 'p': 0.3333333333333333, 'att_total': 0.0},
  'p': {'e': 0.3333333333333333, 'p': 1.0, 'att_total': 0.5},
  'o': {'e': 0.3333333333333333, 'p': 0.3333333333333333, 'att_total': 0.0},
  'g': {'e': 0.33333333333

<a id="probability_of"></a>
### probability_of

`probability_of` takes in a single instance (like a row in the test dataset), a label ('e' or 'p'), and the nbc dictionary. It calculates the *unnormalized* P(label | instance), i.e. the numerator of Bayes Rule. To use the example from the self-check assignment, this function returns the value of the right-hand side of:

$$P(yes|square, large, red) \propto P(square|yes)P(large|yes)P(red|yes)P(yes)$$

* **instance**: List[str]: single row of a test dataset (observation)
* **label**: str: single label to evaluate, either yes/no, 'e'/'p'
* **nbc**: Dict: Naive Bayes Classifier (nbc) dictionary

**returns** float: unnormalized P(label | instance)
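A worked example of this product on a hand-built dictionary may help. The numbers and the names `toy_nbc` and `unnormalized_score` are made up for illustration; the layout follows the nested-dictionary format described for `train`, with the class prior stored under column 0.

```python
import math

# Hand-built toy classifier: column 0 holds class priors under 'att_total',
# columns 1..N hold P(value | label). All numbers are made up.
toy_nbc = {
    0: {"yes": {"att_total": 0.5}, "no": {"att_total": 0.5}},
    1: {"square": {"yes": 0.5, "no": 0.25}},
    2: {"large": {"yes": 0.5, "no": 0.5}},
}

def unnormalized_score(instance, label, nbc):
    """Product of P(value | label) over attribute columns, times the class prior."""
    likelihoods = [nbc[col][instance[col]][label] for col in range(1, len(instance))]
    return math.prod(likelihoods) * nbc[0][label]["att_total"]

instance = ["yes", "square", "large"]  # label in column 0, attributes after it
print(unnormalized_score(instance, "yes", toy_nbc))  # 0.125
print(unnormalized_score(instance, "no", toy_nbc))   # 0.0625
```

The two scores do not sum to 1; dividing each by their sum (0.1875) gives the normalized distribution.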

<div style="background: #4682b4">
Denominator calculation was commented out (rather than fully deleted, for consistency), per feedback received after initial submit.
</div>

In [21]:
def probability_of(instance, label, nbc, labeled=True): 
    numerator_vals, denominator_vals = [], [] #init
    start = 1 if labeled else 0 # change start index if unlabeled data (no label in col 0)
    # nbc keys 1..N are the attributes (key 0 is the label), so count attribute columns from 1
    for col, val in enumerate(instance[start:], start=1): #attribute columns and values
        probs = nbc[col][val] #probabilities @ attribute = value
        numerator_vals.append(probs[label])
        # denominator_vals.append(probs['att_total'])

    numerator = np.prod(numerator_vals) * nbc[0][label]['att_total'] #likelihoods * class prior
    # denominator = np.prod(denominator_vals)
    
    # return (numerator / denominator) 
    return numerator

In [22]:
# assertions / unit tests
instance = training_data[1]
nbc = train(training_data)
label = instance[0]
opp_label = 'e' if label=='p' else 'p'
assert probability_of(instance, label, nbc) > probability_of(instance, opp_label, nbc)

instance = training_data[5]
label = instance[0]
opp_label = 'e' if label=='p' else 'p'
assert probability_of(instance, label, nbc) > probability_of(instance, opp_label, nbc)

instance = training_data[100]
label = instance[0]
opp_label = 'e' if label=='p' else 'p'
assert probability_of(instance, label, nbc) > probability_of(instance, opp_label, nbc)

<a id="normalize"></a>
### normalize

`normalize` takes in the unnormalized results that `classify` collects from `probability_of`, and returns normalized results in the same format (Dict).

* **results**: Dict: non-normalized results from the naive bayes classifier

**returns** Dict: new_results - normalized 

In [23]:
def normalize(results: Dict): 
    new_results = {} #init
    results_sum = sum(results.values())
    for key in results: 
        new_results[key] = results[key] / results_sum

    return new_results    

In [24]:
# assertions / unit tests
instance = training_data[4]
results = {'e': probability_of(instance, 'e', nbc), 'p': probability_of(instance, 'p', nbc)}
new_results = normalize(results)
# print('original results: ', results)
# print('new results: ', new_results)

assert sum(results.values()) != 1.0
# assert sum(new_results.values()) == 1.0

<a id="classify"></a>
### classify

`classify` takes in the probabilities (nbc) and a list of instances (observations, test data), and returns the best class and the normalized probability distribution for each instance. It is important that the observations are in a List[List] form. 

* **nbc**: Dict: Naive bayes classifier, dictionary of probabilities
* **observations**: List[List[str]]: list of observations, like the training dataset
* **labeled**: Bool, whether or not data is labeled in col[0]

**returns** List[Tuple[str, Dict]]: one (best, results) tuple per observation

In [25]:
def classify(nbc: Dict, observations: List[List[str]], labeled=True): 
    
    class_labels = ['e', 'p']
    total_results = []
    for instance in observations: 
        results ={}
        for label in class_labels: #yes / no
            results[label] = probability_of(instance, label, nbc, labeled)
        results = normalize(results)
        best = max(results, key=results.get) #arg max
        total_results.append((best, results))

    return total_results

In [26]:
# Assertions / Unit Tests
obsrvs = [test_data[5]]
nbc = train(training_data)
actual_label = obsrvs[0][0]
classification = classify(nbc, obsrvs)
assert actual_label == classification[0][0]

obsrvs = [test_data[12]]
actual_label = obsrvs[0][0]
classification = classify(nbc, obsrvs)
assert actual_label == classification[0][0]

obsrvs = [test_data[100]]
actual_label = obsrvs[0][0]
classification = classify(nbc, obsrvs)
assert actual_label == classification[0][0]

## Evaluate

`evaluate` takes a data set with labels (like the training set or test set) and the naive bayes classifier, calls the `classify` function, and determines the error rate over that dataset. Since this calls the classify function, and acts as a subset of the `cross_validate` loop, no assertions or unit tests are included here; these can be found under the `cross_validate` function call. 

$$error\_rate=\frac{errors}{n}$$

* **data**: List[List[str]]: dataset to evaluate
* **nbc**: Dict: naive bayes classifier

**returns** error rate, as a percentage
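The error rate formula can be checked by hand on a few made-up labels. The helper name `error_rate` and the label lists below are assumptions for illustration, separate from the notebook's `evaluate`:

```python
def error_rate(actual, predicted):
    """Fraction of predictions that differ from the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)

actual = ["e", "p", "e", "p"]
predicted = ["e", "p", "p", "p"]  # one mistake out of four
rate = error_rate(actual, predicted)
print(f"{rate * 100:.2f}%")  # 25.00%
```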

In [27]:
def evaluate(data, nbc, labeled=True): 
    errors, n = 0, len(data)
    predictions = classify(nbc, data, labeled) #named to avoid shadowing built-in eval
    
    for i in range(len(data)): 
        if data[i][0] != predictions[i][0]: #incorrect classification
            errors += 1
   
    error_rate = (errors / n)
    
    return error_rate*100

## Cross Validate

`cross_validate` takes in the full dataset and a smoothing flag, splits the data (already shuffled in `parse_data`) into 10 folds, then steps through each fold. For each iteration of the for-loop, train and test data are created, a new naive bayes classifier is trained, and the error rate on the test fold is evaluated and printed. The average and standard deviation of the error rates are printed at the end.

* **data**: List[List[str]]: total dataset
* **smoothing**: bool: whether or not to use +1 smoothing

**returns** None, print statements


In [28]:
def cross_validate(data, smoothing=True): 
    errors = []
    folds = create_folds(data, 10)
    for i in range(len(folds)): 
        train_data, test_data = create_train_test(folds, i)
        nbc = train(train_data, smoothing)
        error_rate = evaluate(test_data, nbc, True)
        errors.append(error_rate)
        print('Fold', i, ' Error: ', error_rate, '%')
        
    print('Average error rate: ', np.mean(errors), '%') #sum(errors)/len(errors)
    print('Standard Deviation: ', np.std(errors), '%')
    
    return None

In [29]:
print('Cross validation WITH smoothing:')
cross_validate(data, smoothing=True)

Cross validation WITH smoothing:
Fold 0  Error:  4.797047970479705 %
Fold 1  Error:  4.182041820418204 %
Fold 2  Error:  4.674046740467404 %
Fold 3  Error:  5.289052890528905 %
Fold 4  Error:  4.433497536945813 %
Fold 5  Error:  4.80295566502463 %
Fold 6  Error:  4.1871921182266005 %
Fold 7  Error:  3.32512315270936 %
Fold 8  Error:  5.295566502463054 %
Fold 9  Error:  4.926108374384237 %
Average error rate:  4.591263277164791 %
Standard Deviation:  0.5610548343418379 %


In [30]:
print('Cross validation WITHOUT smoothing:')
cross_validate(data, smoothing=False)

Cross validation WITHOUT smoothing:
Fold 0  Error:  0.24600246002460024 %
Fold 1  Error:  0.36900369003690037 %
Fold 2  Error:  0.6150061500615006 %
Fold 3  Error:  0.24600246002460024 %
Fold 4  Error:  0.3694581280788177 %
Fold 5  Error:  0.49261083743842365 %
Fold 6  Error:  0.12315270935960591 %
Fold 7  Error:  0.12315270935960591 %
Fold 8  Error:  0.12315270935960591 %
Fold 9  Error:  0.6157635467980296 %
Average error rate:  0.332330540054169 %
Standard Deviation:  0.18298199108712435 %


## Discussion of Results
* In this instance, both sets of results are identical, showing that smoothing did not matter. This could be because every feature / attribute has a split between 'e' and 'p' labels (i.e. none are perfectly homogeneous). While I would expect this to change the underlying probabilities, it shouldn't change the normalized distribution, so perhaps this makes sense.
* Overall a 38%+ error rate is not good. I'm wondering if my overall classification calculation in `probability_of` is off. There were several different approaches to the self-check, and I was unsure what was actually correct for this calculation. This is the most likely place for an error in my code.
* The results are fairly consistent across each fold, which is what we should expect. 

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.

# Submit Notes
* I'm still not quite sure how to handle labeled = False. I modified the starting column index from 1 to 0 in that case, shown in the `probability_of` function. 

## Professor Feedback on Initial Submit: 
- With Naive Bayes, you can actually keep data points with missing features. One of the cool parts about the algorithm.
- Nice work trying to keep your code generic, ie, things like `yes, no = 'e', 'p' #init`
- Next step would be to drive that info from the dataset.
- Like you said in your discussion, your error is pretty high, which makes me think that there's something funky going on.
- I'm not sure why you're dividing by the denominator. Based on your code and your print statements, you have the probability of a feature for a given class.
- [P(square) * P(large) * P(red)] - isn't needed.
- Additionally, you need to multiply by the probability of the class, irrespective of the data point values.
- (5/10) - Revise. I think you're close.

## Changes Made in Resubmit: 
- Section 1.3.2 `remove missing values` was commented out, as were the assert statements in 1.3.3. I left the blocks in rather than removing entirely to preserve initial submit structure. 
- `probability_of` - denominator calculation removed entirely
- `evaluate` was passing a single column value (like 'e' or 'p') into the `classify` function, which was causing all sorts of issues. The `evaluate` function was rewritten to pass the entire dataset in to the `classify` function (as was initially intended)
- Incorrect classification calculation within `evaluate` was updated and the encompassing loop was entirely restructured to accurately interpret each prediction.
- All changes resulted in a much better error rate of ~4.5% with smoothing and 0.32% without smoothing! This is much more in line with what should be expected. It also means that the evaluation *without* smoothing outperformed the smoothing version by roughly an order of magnitude. This is likely due to the fact we were using +1 smoothing, rather than a smaller or more adaptive measure. In the future we could run multiple validation steps to determine the 'best' smoothing value.
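As a rough sketch of what a "smaller" smoothing measure could look like, the +1 can be replaced by an adjustable constant. The parameter name `alpha`, the function `smoothed_estimate`, and the counts below are all hypothetical; the formula simply generalizes the (count + 1) / (class total + 1) estimate used in `train`.

```python
def smoothed_estimate(count, class_total, alpha=1.0):
    """(count + alpha) / (class_total + alpha); alpha=1 matches the +1 smoothing in `train`."""
    return (count + alpha) / (class_total + alpha)

# An attribute value never seen with a class that has 100 training rows (made-up counts).
for alpha in (0.0, 0.01, 0.1, 1.0):
    print(alpha, smoothed_estimate(0, 100, alpha))
```

Smaller values of `alpha` keep unseen combinations closer to zero probability, so candidate values could be compared with the same 10 fold cross validation used above.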