# Module 9 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Be sure you use your JHED ID (it's made out of your name and not your student id which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

In [1]:
import numpy as np
import random

## Naive Bayes Classifier

For this assignment you will be implementing and evaluating a Naive Bayes Classifier with the same data from last week:

http://archive.ics.uci.edu/ml/datasets/Mushroom

(You should have downloaded it).

You'll first need to calculate all of the necessary probabilities (don't forget to use +1 smoothing) using a `train` function. You'll then need to have a `classify` function that takes your probabilities, a List of instances (possibly a list of 1) and returns a List of Dicts. Each Dict has a key for every possible class label and the associated *normalized* probability. For example, if we have given the `classify` function a list of 2 observations, we would get the following back:

```
[{"e": 0.98, "p": 0.02}, {"e": 0.34, "p": 0.66}]
```

When calculating the error rate of your classifier, you should pick the class label with the highest probability; you can write a simple function that takes the Dict and returns that class label.
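A minimal sketch of such a helper, assuming each Dict maps a class label to its normalized probability. The name `best_label` is hypothetical and not part of the required interface.

```python
def best_label(distribution):
    # Return the key (class label) whose probability is the largest.
    return max(distribution, key=distribution.get)

results = [{"e": 0.98, "p": 0.02}, {"e": 0.34, "p": 0.66}]
labels = [best_label(d) for d in results]
print(labels)  # ['e', 'p']
```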

As a reminder, the Naive Bayes Classifier generates the *unnormalized* probabilities from the numerator of Bayes Rule:

$$P(C|A) \propto P(A|C)P(C)$$

where C is the class and A are the attributes (data). Since the normalizer of Bayes Rule is the *sum* of all possible numerators and you have to calculate them all anyway, the normalizer is just the sum of the unnormalized probabilities.
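As a worked instance of this normalization with two classes $e$ and $p$ (the numbers are made up for illustration):

$$P(e|A) = \frac{P(A|e)P(e)}{P(A|e)P(e) + P(A|p)P(p)}$$

If the two numerators come out to 0.0396 and 0.048, the normalizer is 0.0876, so $P(e|A) \approx 0.452$ and $P(p|A) \approx 0.548$.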

You'll also need an `evaluate` function as before. You should use the $error\_rate$ again.

With +1 smoothing, the Naive Bayes Classifier has quite a different way of handling missing values.
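A small sketch of what +1 smoothing does to a conditional probability, assuming the convention of adding 1 to both the count and the class total (the directions do not spell out the denominator; textbook Laplace smoothing adds the number of distinct attribute values there instead). The name `smoothed_likelihood` is hypothetical. One reading of the remark about missing values is that a marker such as `?` can simply be counted like any other attribute value.

```python
def smoothed_likelihood(value_count, class_count):
    # +1 smoothing: add one to the co-occurrence count and to the class count.
    return (value_count + 1) / (class_count + 1)

# A value never observed with a class still gets a small non-zero probability,
# so a single unseen combination cannot zero out the whole product.
unseen = smoothed_likelihood(0, 99)   # 1 / 100
common = smoothed_likelihood(49, 99)  # 50 / 100
print(unseen, common)
```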


Again, you must implement the following functions:

`cross_validate` takes the data and performs 10 fold cross validation.

`train` takes training_data and returns the probabilities as a data structure. If some kind of ADT seems reasonable to you, then you can create one but you don't really need one. Nested Dicts will work just fine.

`classify` takes the probabilities produced by the function above and applies them to labeled data (like the test set) or unlabeled data (like some new data).

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$


```
def train(training_data):
   # returns a NBC probability structure
```

and `classify` takes probabilities and a List of instances (possibly just one) and returns the classifications:

```
def classify(probabilities, test_data):
    # returns a list of classifications
```

and `evaluate` takes the actual classifications and the predicted classes and returns the classification error rate:

```
def evaluate(actual, predicted):
    # returns an error rate
```

You must apply 10 fold cross validation to your data set. You will treat each fold as a test set, using the combined remainder as the training set. You should print out the error rate for each fold and then an average error rate for the entire cross validation process. Format the error rate as a percent (2.34% not 0.0234).
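One possible way to build the folds and format the result is sketched below. It assumes the data is a list of rows; `create_folds` and `train_test_splits` are hypothetical helper names, not functions the assignment prescribes.

```python
import random

def create_folds(rows, k=10, seed=None):
    # Shuffle a copy so the caller's data is left untouched.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Deal the rows round-robin into k disjoint folds of near-equal size.
    return [shuffled[i::k] for i in range(k)]

def train_test_splits(folds):
    # Yield (training_rows, test_rows) once per fold.
    for i, test_rows in enumerate(folds):
        training_rows = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield training_rows, test_rows

folds = create_folds(range(25), k=10, seed=0)
splits = list(train_test_splits(folds))

error_rate = 0.0234
print(f"{error_rate:.2%}")  # 2.34%
```

Because the folds are disjoint, every row is used as test data exactly once across the 10 splits.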

This is all that is required for this assignment. I'm leaving more of the particulars up to you but you can definitely use the last module as a guide.

**load_file**<br>
The `load_file` function loads the data from the file with the given name.

Parameters:
* **file_name** is the name of the file we want to load.

The function returns the loaded data as a NumPy array of strings, one row per line, with the first column of the file (the class label) moved to the last column.<br>
For example, if the data file contains
```
        0,1,2
        2,3,4
```
returns:<br>
`[['1','2','0'], ['3','4','2']]`

In [2]:
def load_file(file_name):
    data = []
    # the context manager closes the file even if reading fails
    with open(file_name, "r") as file:
        for line in file:
            if not line.strip():
                continue # skip blank lines such as a trailing newline
            tmp_data = line.rstrip().split(",")
            # move the class label from the first column to the last one
            data += [tmp_data[1:] + tmp_data[:1]]
    return np.array(data)

**filter_data**<br>
The `filter_data` function is a helper for `train`. It returns the rows of the data whose value for the given feature equals the given feature value.

Parameters:
* **data** is the data we want to filter.
* **feature** is the column index of the feature we want to filter on.
* **feature_val** is the value the feature must have for a row to be kept.

It returns the `filtered data` as an array of the matching rows.<br>
For the data below, where columns are features, feature = 1 and feature_val = 1
```
        0,1,2
        2,5,4
        3,1,2
```

returns:<br>
`[[0,1,2], [3,1,2]]`

In [3]:
def filter_data(data, feature, feature_val):
    return data[np.where(data[:,feature] == feature_val)]

**train**<br>
The `train` function calculates the probabilities of the class labels and of each feature value given each class label, and saves them in a dictionary.

Parameters:
* **data** is the data used to train.

It returns a `dictionary` containing, for each feature, the +1-smoothed probability of each value given each class label, together with the probability of each class label.<br>
For example, if the given data is as shown below, where the last column is the class label
```
        2,5,p
        3,1,e
```

returns:<br>
```
{
    0: {2: {e: 0.5, p: 1.0}, 3: {e: 1.0, p: 0.5}},
    1: {1: {e: 1.0, p: 0.5}, 5: {e: 0.5, p: 1.0}},
    p: 0.5,
    e: 0.5
}
```

In [4]:
def train(data):
    nbc_prob = {}
    labels = data[:,-1].tolist()
    # number of training rows for each class label
    count = {class_label: labels.count(class_label) for class_label in set(labels)}

    for col in range(data.shape[1]-1):
        nbc_prob[col] = {}
        for domain_val in set(data[:,col]):
            subset = filter_data(data, col, domain_val)
            nbc_prob[col][domain_val] = {}
            for class_label in count:
                # +1 smoothing: P(value | class) = (matches + 1) / (class count + 1)
                nbc_prob[col][domain_val][class_label] = (subset[:,-1].tolist().count(class_label)+1) / (count[class_label] + 1)
    for clss in count: # add class probabilities
        nbc_prob[clss] = count[clss]/len(data)
    return nbc_prob   # returns a NBC probability structure

**normalize**<br>
The `normalize` function is a helper for `classify`. It takes a dictionary of class label probabilities as an argument and calculates the normalized probabilities.

Parameters:
* **dictionary** is the class label probabilities we want to normalize.

It returns a `dictionary` of normalized probabilities.<br>

For example, if the dictionary is <br>
`{e:0.2 , p:0.6}`

returns:<br>
`{e:0.25 , p:0.75}`

In [5]:
def normalize(dictionary):
    probs = dictionary.copy()
    total_prob = sum(probs.values()) 
    for key in probs:
        probs[key] /= total_prob
    return probs

**classify**<br>
The `classify` function takes the test data and predicts a label for each test point using the calculated probabilities. For each test point it calculates the unnormalized probability of every class label, normalizes those probabilities, and then picks the label with the highest probability.

Parameters:
* **nbc_prob** is the dictionary of calculated probabilities for each feature.
* **test_data** is the test points whose class labels we want to predict.

It returns a tuple of the predicted `class labels` and a list of `normalized probabilities` for each test point.<br>

For example, if nbc_prob is as shown below and test data is [[2,1], [3,5]]<br>
```
{
    0: {2: {e: 0.22, p: 0.6}, 3: {e: 0.8, p: 0.1}},
    1: {1: {e: 0.3, p: 0.2}, 5: {e: 0.9, p: 0.05}},
    p: 0.4,
    e: 0.6
}
```
returns:<br>
`([p, e], [{e:0.4520 , p:0.5479 },{e:0.995 , p:0.005}])`

In [6]:
def classify(nbc_prob, test_data):
    # get the total possible class labels
    result_list = []
    class_labels = [key for key in nbc_prob if type(key) != int]  # get the class labels
    for testpoint in test_data:
        result = {}
        for c_label in class_labels:
            result[c_label] = nbc_prob[c_label]    # initialize probability
            for col in range(len(testpoint)):
                result[c_label] *= nbc_prob[col][testpoint[col]][c_label] # multiply by probability of each feature
        result_list += [result]
    
    norm_probs = [normalize(result) for result in result_list] # normalize
    label = [max(rst, key = lambda x: rst[x]) for rst in norm_probs] # find the label
    return (label, norm_probs)

**evaluate**<br>
The `evaluate` function calculates the error rate between the predicted and the actual class labels. The equation below is used to calculate the error rate.
$$error\_rate=\frac{errors}{n}$$

Parameters:
* **actual** is the actual class label for each test observation.
* **predicted** is the class labels that were predicted by our model.

It returns the `error rate`.<br>

For example, if actual = [e,e,e,p], predicted = [e,p,e,p]<br>

returns:<br>
`0.25`

In [7]:
def evaluate(actual, predicted):
    count = np.where(np.array(actual) != np.array(predicted))[0]
    error_rate = len(count)/len(actual)
    return error_rate

**cross_validate**<br>
The `cross_validate` function checks the accuracy of our model over 10 rounds. In each round it randomly samples `train_percent` of the data as the training set and uses the remaining rows as the test set. It prints the error rate of each round (labeled as a fold) and the average error rate over the 10 rounds.

Parameters:
* **data** is the data we want to use for cross validation.
* **train_percent** indicates what fraction (as a decimal) of the data we want to use to train the model.

returns:<br>
It does not return anything.

In [8]:
def cross_validate(data, train_percent=.8):
    errors = []
    rows = data.tolist()
    for test_num in range(10):
        # randomly pick the training rows; the remaining rows form the test set
        train_data = random.sample(rows, int(len(rows)*train_percent))
        held_out = [row for row in rows if row not in train_data]
        test_data = np.array([row[:-1] for row in held_out])
        test_label = [row[-1] for row in held_out]
        nbc_prob = train(np.array(train_data))
        predicted, result_prob = classify(nbc_prob, test_data)
        error_rate = evaluate(test_label, predicted)
        errors += [error_rate]
        print(f"Fold# {test_num+1:2}: Error Rate: {round(error_rate*100,3)}%")
    print(f"\nAverage Error Rate: {round((sum(errors)/ len(errors))*100, 3)}%")

In [9]:
#read the file
data = load_file("agaricus-lepiota.data")
cross_validate(data)

Fold#  1: Error Rate: 5.231%
Fold#  2: Error Rate: 4.369%
Fold#  3: Error Rate: 4.123%
Fold#  4: Error Rate: 4.431%
Fold#  5: Error Rate: 4.308%
Fold#  6: Error Rate: 4.862%
Fold#  7: Error Rate: 5.969%
Fold#  8: Error Rate: 6.215%
Fold#  9: Error Rate: 3.938%
Fold# 10: Error Rate: 5.538%

Average Error Rate: 4.898%


## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.