# Practicum 6: Classification - Alternative techniques


## Learning objectives:

  - Implementing a Naive Bayes classifier
  - Using various classifiers implemented in scikit-learn


## References
  - [Matplotlib plotting framework](http://matplotlib.org/api/pyplot_api.html)
    * [How to make beautiful data visualizations in Python with matplotlib](http://www.randalolson.com/2014/06/28/how-to-make-beautiful-data-visualizations-in-python-with-matplotlib/)
  - [Numpy](http://www.python-course.eu/numpy.php)
    * [Numpy arrays](http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html#numpy.array)
    * [Numpy statistics](http://docs.scipy.org/doc/numpy/reference/routines.statistics.html)

## Task 1: Implementing a Naive Bayes classifier

  - Load the Iris dataset and divide it into a training set (2/3 of the instances) and a test set (1/3).
  - Implement a Naive Bayes classifier
   * a) Use categorical attributes by discretizing each attribute into three equally-sized bins: low, medium, high.
   * b) Use continuous attributes and assume a Gaussian (normal) distribution. Estimate the parameters of the distribution (mean and variance) from the training data (you'll have different parameters for each attribute)!
  - Compare the performance of the two solutions in terms of accuracy and error rate.
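
For variant b), the per-class parameters and the resulting likelihood could be computed roughly as follows. This is only a minimal sketch: the helper names `estimate_gaussian` and `gaussian_log_pdf` are hypothetical, and it assumes the maximum-likelihood (population) variance and that the variance is non-zero.

```python
import math


def estimate_gaussian(values):
    """Maximum-likelihood mean and variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var


def gaussian_log_pdf(x, mean, var):
    """Log of the normal density N(x; mean, var); assumes var > 0."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


# Parameters are estimated separately for each (class, attribute) pair,
# e.g. from the sepal lengths of all training instances of one class.
mean, var = estimate_gaussian([1.0, 2.0, 3.0])
log_likelihood = gaussian_log_pdf(2.5, mean, var)
```

Working with the log density means the per-attribute terms can simply be added to the log prior when scoring a class.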

### Task 1A) Implement a Naive Bayes classifier
 
   - We use categorical attributes by discretizing each attribute into three equally-sized bins: low, medium, high.
   - We need to apply smoothing to avoid zero probabilities.
   - Additionally, we compute probabilities in the log space.
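
The last two points can be illustrated with a small sketch. It assumes additive (Laplace) smoothing with a pseudo-count `alpha`; the function name `smoothed_log_prob` and the example counts are hypothetical.

```python
import math


def smoothed_log_prob(count, total, num_values, alpha=1.0):
    """log P(value | class) with additive smoothing.

    count      -- training instances of the class that have this value
    total      -- training instances of the class
    num_values -- number of distinct values of the attribute (3 bins here)
    """
    return math.log((count + alpha) / (total + alpha * num_values))


# A value never observed with a class still gets a small non-zero probability,
# so the logarithm is always defined.
unseen = smoothed_log_prob(count=0, total=30, num_values=3)

# In log space the product of probabilities becomes a sum, which avoids
# numerical underflow when many factors are combined.
log_score = math.log(1 / 3) + unseen
```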

In [None]:
import csv
from collections import Counter
import numpy as np
import pprint
import math

The four attributes in the dataset:

In [None]:
ATTRS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

We implement the Naive Bayes classifier for categorical attributes as a class with one method for learning the model (`train`) and one for applying it to a single instance (`apply`).

In [None]:
class NB(object):
    def __init__(self):
        self.model = None
    
    def train(self, attributes, labels):
        self.model = {}
        # TODO    
    
    def apply(self, attributes):
        if not self.model:
            raise Exception("Model has not been trained")
        label = "Iris-setosa"
        # TODO
        return label
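
One possible way to fill in the TODOs is sketched below. It is not the reference solution: the class name `NBSketch`, the layout of `self.model`, and the `alpha` parameter are assumptions made for illustration. It counts attribute values per class, applies additive smoothing, and picks the class with the highest log score.

```python
import math
from collections import Counter, defaultdict


class NBSketch(object):
    """Categorical Naive Bayes with additive smoothing, scored in log space."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.model = None

    def train(self, attributes, labels):
        class_counts = Counter(labels)
        # value_counts[(label, attr)][value] = number of matching training instances
        value_counts = defaultdict(Counter)
        # distinct values seen per attribute, used in the smoothing denominator
        values = defaultdict(set)
        for instance, label in zip(attributes, labels):
            for attr, value in instance.items():
                value_counts[(label, attr)][value] += 1
                values[attr].add(value)
        self.model = {
            "class_counts": class_counts,
            "value_counts": value_counts,
            "values": values,
            "n": len(labels),
        }

    def apply(self, attributes):
        if not self.model:
            raise Exception("Model has not been trained")
        m = self.model
        best_label, best_score = None, -math.inf
        for label, c_count in m["class_counts"].items():
            score = math.log(c_count / m["n"])  # log prior
            for attr, value in attributes.items():
                count = m["value_counts"][(label, attr)][value]
                denom = c_count + self.alpha * len(m["values"][attr])
                score += math.log((count + self.alpha) / denom)  # smoothed log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

The sketch keeps the same `train`/`apply` interface as `NB`, so it can be swapped in wherever `NB()` is instantiated.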

We define a data loading function that returns the attribute values and the class labels separately for the training set and for the test set. Every third instance is assigned to the test set.

In [None]:
def load_data(filename):
    train_x = []
    train_y = []
    test_x = []
    test_y = []
    with open(filename, 'rt') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=',')
        i = 0
        for row in csvreader:
            if len(row) == 5:
                i += 1
                instance = {ATTRS[j]: float(row[j]) for j in range(4)}  # first four values are attributes (j avoids shadowing the row counter i)
                label = row[4]  # 5th value is the class label
                if i % 3 == 0:  # test instance
                    test_x.append(instance)
                    test_y.append(label)
                else:  # train instance
                    train_x.append(instance)
                    train_y.append(label)
                    
    return train_x, train_y, test_x, test_y

We need to define how to evaluate the model predictions.

In [None]:
def evaluate(predictions, true_labels):
    correct = 0
    incorrect = 0
    for i in range(len(predictions)):
        if predictions[i] == true_labels[i]:
            correct += 1
        else:
            incorrect += 1

    print("\tAccuracy:   ", correct / len(predictions))
    print("\tError rate: ", incorrect / len(predictions))

Discretization. We need to replace numerical values with labels 'low', 'medium', 'high' such that 1/3 of the values are assigned 'low', 1/3 of the values are assigned 'medium', and 1/3 of the values are assigned 'high'. 
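
The thresholds for such equal-frequency bins can be taken from the 1/3 and 2/3 quantiles of the observed values. The following is a minimal sketch for a single attribute, assuming `np.percentile` with its default interpolation; the helper name `equal_frequency_labels` is hypothetical. When many values are tied, as in the Iris data, the three bins will only be approximately equal in size.

```python
import numpy as np


def equal_frequency_labels(values):
    """Map each number to 'low', 'medium' or 'high' using tercile thresholds."""
    values = np.asarray(values, dtype=float)
    # thresholds at the 1/3 and 2/3 quantiles of the observed values
    t_low, t_high = np.percentile(values, [100 / 3, 200 / 3])
    labels = []
    for v in values:
        if v <= t_low:
            labels.append("low")
        elif v <= t_high:
            labels.append("medium")
        else:
            labels.append("high")
    return labels


example_labels = equal_frequency_labels([1, 2, 3, 4, 5, 6])
```

Inside `discretize`, the same thresholds would be computed once per attribute and the resulting label stored in the corresponding dict of `attrs2`.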

In [None]:
def discretize(attributes):
    attrs2 = [{} for _ in range(len(attributes))]  # initialize list of empty dicts
    for a in ATTRS:
        # find thresholds
        values = np.array([x[a] for x in attributes])
        # TODO

    return attrs2

#### Main logic

Load data.

In [None]:
train_x, train_y, test_x, test_y = load_data("data/iris.data")

Discretize attribute values.

Importantly, we discretize the entire dataset (training and test instances together), so that the same thresholds define the bins in both parts. We then split the data back into the training and test sets.

In [None]:
x2 = discretize(train_x + test_x)
train_x2 = x2[:len(train_x)]
test_x2 = x2[len(train_x):]  # remaining instances; also correct if the test set is empty

Train the model.

In [None]:
nb = NB()
nb.train(train_x2, train_y)

Apply the model.

In [None]:
predictions = []
for instance in test_x2:
    label = nb.apply(instance)
    predictions.append(label)

And evaluate predictions.

In [None]:
evaluate(predictions, test_y)