In [9]:
from builtins import range, input
from datetime import datetime
from future.utils import iteritems
from scipy.stats import multivariate_normal as mvn
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

Below is the implementation of the NaiveBayes class, which contains three methods. The first, fit, takes the feature array X, the label array Y, and a smoothing constant. It creates two dictionaries: gaussians, which stores the per-feature mean and variance of each class, computed with the built-in numpy methods .mean() https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html and .var() https://docs.scipy.org/doc/numpy-1.6.0/reference/generated/numpy.var.html, and priors, which stores the fraction of training samples that belong to each class. Here c is the current class [0 ... 9] and current_x holds the rows of X labeled c. The smoothing constant is added to every variance so that a pixel that never varies within a class does not end up with a variance of zero. This is the training method we are going to be using.
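To make the bookkeeping in fit concrete, here is a minimal sketch of the same per-class statistics on a toy array. The toy data and the names stats and rows are made up for illustration and are not part of the notebook.

```python
import numpy as np

# Toy data (an assumption for illustration): 4 samples, 2 features, classes 0 and 1.
X = np.array([[0.0, 1.0],
              [2.0, 3.0],
              [4.0, 4.0],
              [6.0, 8.0]])
Y = np.array([0, 0, 1, 1])
smoothing = 1e-2

stats, priors = {}, {}
for c in np.unique(Y):
    c = int(c)
    rows = X[Y == c]                          # boolean mask keeps only the rows of class c
    stats[c] = {
        'mean': rows.mean(axis=0),            # one mean per feature
        'var': rows.var(axis=0) + smoothing,  # one variance per feature, kept above zero
    }
    priors[c] = rows.shape[0] / len(Y)        # fraction of samples in class c

print(stats[0]['mean'], stats[0]['var'], priors[0])
```

Each class ends up with one mean and one variance per feature, which is all the model needs to store.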

The score method takes the features and labels again. It first gets the model's predictions for the features, then loops over the classes and, for each one, counts the true positives, the number of samples predicted as that class, and the number of samples that actually belong to it. From these counts it prints the recall and precision of each class, and it returns the overall accuracy.
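As a small worked example of those counts, the sketch below computes precision and recall for one class with numpy masks. The helper name per_class_precision_recall and the toy arrays are assumptions made for illustration, not code from the notebook.

```python
import numpy as np

def per_class_precision_recall(predictions, labels, cls):
    """Hypothetical helper: precision and recall for a single class."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    true_positives = np.sum((predictions == cls) & (labels == cls))
    predicted = np.sum(predictions == cls)   # true positives + false positives
    actual = np.sum(labels == cls)           # true positives + false negatives
    precision = true_positives / predicted if predicted else float('nan')
    recall = true_positives / actual if actual else float('nan')
    return precision, recall

# Toy example: two of the three 4s are predicted as 9.
labels      = np.array([4, 4, 4, 9, 9, 5])
predictions = np.array([4, 9, 9, 9, 9, 5])

print(per_class_precision_recall(predictions, labels, 4))
print(per_class_precision_recall(predictions, labels, 9))
```

In this toy case the misread 4s lower the recall of class 4 and the precision of class 9, while the precision of 4 and the recall of 9 stay at 1.0.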

The predict function uses the numpy attribute shape https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.shape.html to get the dimensions of the feature array: N samples by D features. It then uses the numpy function zeros to create an array P of shape (N, K), where K is the number of classes. It iterates through the gaussians dictionary and uses the scipy.stats logpdf function to compute, for every sample, the log-density of that sample under class c, to which it adds the log of the class prior. At the end, np.argmax picks the class with the highest score in each row, and the resulting array of N predicted labels is returned.
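The "naive" part of the model is hidden in the cov argument: passing a 1-D array of variances to logpdf is treated as a diagonal covariance matrix, so the features are modeled as independent within a class. The sketch below checks this on random toy data (the arrays and the seed are assumptions for illustration) by comparing the joint log-density with a sum of univariate Gaussian log-densities, and then shows the argmax step on a tiny score array.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn, norm

rng = np.random.default_rng(0)
features = rng.random((5, 3))            # toy data: 5 samples, 3 features
mean = np.array([0.2, 0.5, 0.8])
var = np.array([0.05, 0.10, 0.20])       # one variance per feature

# A 1-D cov is read as the diagonal of the covariance matrix.
joint = mvn.logpdf(features, mean=mean, cov=var)

# The same quantity as a sum of independent univariate log-densities.
per_feature = norm.logpdf(features, loc=mean, scale=np.sqrt(var)).sum(axis=1)
print(np.allclose(joint, per_feature))

# argmax over axis 1 returns the column (class) with the highest score per row.
P = np.array([[-1.0, -3.0],
              [-5.0, -2.0]])
best = np.argmax(P, axis=1)
print(best)
```

Working with log-densities rather than densities keeps the scores numerically stable when hundreds of per-pixel terms are combined.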

The fit function and predict function were taken from https://github.com/lazyprogrammer/machine_learning_examples/blob/master/supervised_class/nb.py 

In [10]:
class NaiveBayes(object):
    def fit(self, X, Y, smoothing=1e-2):
        self.gaussians = dict()
        self.priors = dict()
        labels = set(Y)
        for c in labels:
            current_x = X[Y == c]
            self.gaussians[c] = {
                'mean': current_x.mean(axis=0),
                'var': current_x.var(axis=0) + smoothing,
            }
            self.priors[c] = float(len(Y[Y == c])) / len(Y)

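    def score(self, features, labels):
        labels = np.asarray(labels)
        P = self.predict(features)
        # Report recall and precision for every class seen during fit.
        for j in sorted(self.gaussians):
            true_positives = np.sum((P == j) & (labels == j))
            predicted_amount = np.sum(P == j)   # samples predicted as class j
            tag_amount = np.sum(labels == j)    # samples actually labeled j
            # Avoid dividing by zero when a class is absent or never predicted.
            recall = true_positives / tag_amount if tag_amount else float('nan')
            precision = true_positives / predicted_amount if predicted_amount else float('nan')
            print(j,"- Recall:", round(recall, 2), "Precision:", round(precision, 2))
        # Overall accuracy: fraction of predictions that match the labels.
        return np.mean(P == labels)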

    def predict(self, features):
        N, D = features.shape
        K = len(self.gaussians)
        P = np.zeros((N, K))
        for c, g in iteritems(self.gaussians):
            mean, var = g['mean'], g['var']
            P[:,c] = mvn.logpdf(features, mean=mean, cov=var) + np.log(self.priors[c])
        return np.argmax(P, axis=1)


The get_data function reads train.csv into a pandas DataFrame and shuffles the rows to randomize the dataset for each run. Each pixel value is then divided by 255, which maps the original 0 to 255 integer range to a float from 0.0 to 1.0. The data is split into features and labels: the first column holds the labels (passed to fit as Y) and the remaining columns hold the features (passed to fit as X). The limit argument caps the number of rows that are returned.

In [13]:
def get_data(limit):
    print("Reading in and transforming data...")
    df = pd.read_csv("train.csv", encoding="ISO-8859-1")
    data = df.values
    np.random.shuffle(data)
    features = data[:, 1:] / 255.0 
    labels = data[:, 0]
    features, labels = features[:limit], labels[:limit]
    return features, labels
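Because get_data shuffles with the global numpy random state, each run produces a different split. If repeatable runs are wanted, one option is a seeded generator. The sketch below is an assumption rather than part of the notebook: the helper split_pixels_and_labels is hypothetical, and it works on a small in-memory array instead of train.csv.

```python
import numpy as np

def split_pixels_and_labels(data, limit=None, seed=None):
    """Hypothetical helper mirroring get_data for an in-memory array."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(data)       # shuffles rows and returns a copy
    features = shuffled[:, 1:] / 255.0     # scale pixels to the range 0.0 to 1.0
    labels = shuffled[:, 0].astype(int)    # first column is the digit label
    return features[:limit], labels[:limit]

# Toy rows in the same layout as train.csv: label first, then pixel values.
toy = np.array([[3,   0, 255],
                [7, 128,  64],
                [1, 255, 255],
                [3,  51,   0]])
toy_features, toy_labels = split_pixels_and_labels(toy, limit=3, seed=0)
print(toy_features.shape, toy_labels.shape)
```

Calling the helper twice with the same seed returns the same rows in the same order.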

Here we get our features and our labels using the above get_data function, with a limit of 42000 rows, which is the entire dataset. The variable Ntrain marks where the data is split. Despite its name, it is the size of the test set: the first 30% of the shuffled rows (12600) become the test set, and the remaining 70% (29400) become the training set. We create a NaiveBayes object, then train it using the model.fit method. We then use the model.score method on both our training set and our test set to get our results.

In [14]:
features, labels = get_data(42000)
Ntrain = len(labels) // 10 * 3
featuresTrain, labelsTrain = features[Ntrain:], labels[Ntrain:]
featuresTest, labelsTest = features[:Ntrain], labels[:Ntrain]

model = NaiveBayes()
model.fit(featuresTrain, labelsTrain)

print("Train accuracy:", round(model.score(featuresTrain, labelsTrain)*100, 2),"%")
print("Train size:", len(labelsTrain))

print()
print("Test accuracy:", round(model.score(featuresTest, labelsTest)*100, 2),"%")
print("Test size:", len(labelsTest))

Reading in and transforming data...
Training time: 0:00:01.130846
0 - Recall: 0.9 Precision: 0.93
1 - Recall: 0.96 Precision: 0.79
2 - Recall: 0.75 Precision: 0.9
3 - Recall: 0.75 Precision: 0.81
4 - Recall: 0.65 Precision: 0.84
5 - Recall: 0.61 Precision: 0.88
6 - Recall: 0.93 Precision: 0.83
7 - Recall: 0.8 Precision: 0.94
8 - Recall: 0.75 Precision: 0.64
9 - Recall: 0.87 Precision: 0.6
Train accuracy: 79.84 %
Train size: 29400

0 - Recall: 0.9 Precision: 0.94
1 - Recall: 0.97 Precision: 0.78
2 - Recall: 0.74 Precision: 0.91
3 - Recall: 0.76 Precision: 0.8
4 - Recall: 0.64 Precision: 0.83
5 - Recall: 0.61 Precision: 0.86
6 - Recall: 0.91 Precision: 0.85
7 - Recall: 0.81 Precision: 0.95
8 - Recall: 0.72 Precision: 0.63
9 - Recall: 0.87 Precision: 0.63
Test accuracy: 80.12 %
Test size: 12600


As you can see from the output above, we get around 80% accuracy for both the training set and the test set. This is about what I was expecting: the downfall of Naive Bayes is that it treats every feature as independent given the class, so there isn't a way to tie features together to gain a different perspective. Many of the pixels are active in several different digits, which leads to a greater level of misclassification. If you look at our recall and precision, you notice a dip in recall for 4s and 5s, and a dip in precision for 8s and 9s. This pattern is consistent with 4s being classified as 9s and 5s being classified as 8s, although the per-class numbers alone do not show which digits are confused with which.
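One way to check which digits are actually mistaken for which is a confusion matrix. The sketch below is not part of the original notebook: the helper confusion_matrix and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def confusion_matrix(labels, predictions, num_classes=10):
    """Hypothetical helper: rows are true digits, columns are predicted digits."""
    labels = np.asarray(labels, dtype=int)
    predictions = np.asarray(predictions, dtype=int)
    counts = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(counts, (labels, predictions), 1)   # also counts repeated (true, predicted) pairs
    return counts

# Toy example: two 4s read as 9 and one 5 read as 8.
toy_true = np.array([4, 4, 4, 9, 9, 5, 5, 8])
toy_pred = np.array([4, 9, 9, 9, 9, 8, 5, 8])
cm = confusion_matrix(toy_true, toy_pred)
print(cm[4, 9], cm[5, 8])
```

With the objects from this notebook, the call would presumably be confusion_matrix(labelsTest, model.predict(featuresTest)). Large off-diagonal entries in row 4 or row 5 would show where the missing recall goes.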