**CSC 466: Knowledge Discovery in Data **

** Individual Test**

**Task 3**

**Your Name :** 

**Cal Poly Email:** 

**Your Assignment**:

1. Complete the code for the Naive Bayes Classifier
2. Complete the training and testing of the Classifier
3. Compute the accuracy of the classifier and output the overall accuracy and the confusion matrix.

In [1]:
## Imports
import numpy as np
from matplotlib import pyplot as plt
import seaborn
import pandas as pd
%matplotlib inline

**estimateNBModel**: function to produce all parameter estimates for the Naive Bayes model.

This function returns two structures: an array of class probabilities P(Class = c_i),
and a collection of 
$$P(A_j = a_k|Class = c_i)$$
probability estimators for the probability that an observed record from class $c_i$ will take the value $a_k$ for its $j$th attribute $A_j$.

The latter collection is stored as a list model[0..nClasses-1], where each model[i] is itself a list of length nAttributes of probability distributions over the values of each attribute.


Input parameters for estimateNBModel():

data: training set data points

labels: labels for the training set data

nAttributes: number of attributes in the dataset

attributeRanges: number of unique values each attribute takes

nClasses: number of classes (class labels)


**Your Task** :   write the getClassProbability() function that given
   
   * the list of labels (labels) of the training set

   * the class label (classId),
   
   * and the total number of classes in the dataset (nClasses)
   
 returns the probability estimate for $P(Class = classId)$

In [2]:
def getClassProbability(labels, classId,nClasses):
    return sum([1 for i in labels if i == classId]) / len(labels)

**Your Task**: write the getNBEstimate() function that given

* the training set (data)

* the class labels for the training set (labels)

* the class label for which the estimate is being produced (classLabel)

* the attribute for which the estimate is being produced (attId)

* the attribute value for which the estimate is being produced (attValue)

* and the total number of values attribute attId has (nValues)

produces the probability estimate for $P(A_{attId} = attValue | Class = classLabel)$

In [3]:
def getNBEstimate(data, labels, classLabel, attId, attValue, nValues):
    indices, = np.where(labels == classLabel)
    subset = data[indices,attId]
    return sum([1 for i in subset if i == attValue]) / len(subset)

In [4]:
def estimateNBModel(data, labels, nAttributes, attributeRanges, nClasses):
    ## Naive Bayes Model consists of two types of estimators
    
    ## First, we estimate the probability of seeing an object from a specific class
    
    classProbabilities = [getClassProbability(labels,l, nClasses) for l in range(nClasses)]
    #print(classProbabilities, nClasses)
    
    ## now we estimate the probabilities of seeing a specific value of a specific attribute in
    ## a data point from a given class
    
    ## for each class create the appropriate estimates
    model = []                  # model is the list of estimates for all classes
    for i in range(nClasses):   # for each class
        ## for each attribute
        classDistr = []         # classDistr is the collection of estimates for one class
        for j in range(nAttributes):
            estimates = []                    # estimates is a distribution of estimates for a single attribute
            for k in range(attributeRanges[j]):
                #print(i)
                est = getNBEstimate(data, labels, i,j,k, attributeRanges[j])
                #print(est)
                #break
            #break
                estimates.append(est)
            classDistr.append(estimates)
        model.append(classDistr)
    
    return classProbabilities, model
    

**Predicting The Class**

function predictNBLabels() predicts the class labels for all data points in the test set.

function predictNB() computes the probability estimates for each class and selects the class with the highest estimate for a single data point

function predictNBClass computes the probability estimate for a specific class.

** Your Task **: implement predictNBClass()

its parameters are:

* point: the data point for which the estimate is given

* classProb:  the class probability P(Class = classID) for the class 

* classModel: the portion of the Naive Bayes model related to predicting this particular class

(note that class label is not passed, but all proper values are selected in predictNB())

In [5]:
def predictNBClass(point, classProb, classModel):
    ans = 1
    for i in range(len(classModel)):
        ans *= classModel[i][int(point[i])]
    return ans * classProb

predictNB()  and predictNBLabels() parameters

point: data point

classProbabilities: the list of $P(Class = c_i)$ estimates

model: the collection of $P(A_j = a_k |Class = c_i)$ probability estimates

nClasses: number of class labels in the dataset

You do not need to touch this code

In [6]:
def predictNB(point, classProbabilities, model,nClasses):
    
    predictions= np.array([predictNBClass(point, classProbabilities[i],model[i]) for i in range(nClasses)])
    
    predictedClass = np.argmax(predictions)
    return predictedClass

In [7]:
def predictNBLabels(data, classProbabilities, model, nClasses):
    predicted = [predictNB(point, classProbabilities, model, nClasses) for point in data]
    return predicted

** Load Data **

In [8]:
filename="data8.csv"

rawData = np.loadtxt(filename, delimiter = ",")

## let's keep only the two columns with the data attributes

nAttributes = rawData.shape[1] - 1

data = rawData[:,0:nAttributes]
labels = rawData[:,nAttributes]

** Train the Model**

In the cells below the entire dataset  is used to train the model. 

This allows us to see the predictions, but this is not a fair way to evaluate the quality of prediction.

In [9]:
### Let us compute how many unique values each attribute has.
### all attributes have values 0,1,.., k-1 where k is the number of unique values for that attribute.

attributeRanges= [np.unique(data[:,i]).shape[0] for i in range(nAttributes)]

## number of classes

nClasses = np.unique(labels).shape[0]

In [10]:
d,m = estimateNBModel(data,labels,nAttributes, attributeRanges, nClasses)

In [11]:
predicted = predictNBLabels(data,d,m,nClasses)

An easy way to see where we missed:

In [12]:
predicted - labels

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0., -2.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., -1.,  0.,  0.,  0.,
       -2.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0., -1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  1., -2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  2.,  0.,
        0.,  0., -2.,  1.,  0.,  0.,  1., -1.,  0.,  0.,  0.,  0., -1.,
        0., -1.,  0.,  0.,  0.,  0.,  0.,  0., -1.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., -2.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0., -2.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,
       -2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., -1

**Your Task**:

Split your dataset into the training set (2/3 of the data = 200 data points) and the test set (the remaining 1/3 of the data points = 100 points). Select your points at random, but make it reproducible by setting the seed.

In [13]:
np.random.seed(0)
indices = list(range(300))
np.random.shuffle(indices)
diff = np.array(list(set(list(range(300))).difference(indices[:200])))
X_train, y_train, X_test, y_test = data[indices,:], labels[indices], data[diff,:], labels[diff]

** Your Task**: train the model on the training set

In [14]:
attributeRanges= [np.unique(X_train[:,i]).shape[0] for i in range(nAttributes)]
nClasses = np.unique(y_train).shape[0]
d, m = estimateNBModel(X_train, y_train, nAttributes, attributeRanges, nClasses)

** Your Task**: evaluate the model on the test set. Retrieve the predicted labels for the test set data points.

In [15]:
predicted_train = predictNBLabels(X_train, d, m, nClasses) 
predicted_test = predictNBLabels(X_test, d, m, nClasses)

** Your Task**: compute the predictive accuracy on the training set and report it. 

Compute the predictive accuract on the test set and report it.

Is there any evidence that the model overfits? (put a note in a markdown cell)

In [16]:
print("Accuracy on train")
print(sum([1 for i in range(len(y_train)) if y_train[i] == predicted_train[i]]) / len(y_train))
print()
print("Accuracy on test")
print(sum([1 for i in range(len(y_test)) if y_test[i] == predicted_test[i]]) / len(y_test))

Accuracy on train
0.8466666666666667

Accuracy on test
0.86


Answer the overfit question here.

There does not seem to be much overfitting on the test set, if at all. Since the accuracy of training set is not drastically higher than the accuracy on the test set (84.6% vs. 86%), which is a good sign of overfitting on the training set, we can conclude Naïve Bayes does not overfit on this dataset.

**Your Task**: compute and output the confusion matrix

In [17]:
mat = np.zeros((3,3))
for i in range(len(predicted_test)):
    mat[predicted_test[i], int(y_test[i])] += 1
df = pd.DataFrame({"True value: 0": mat[:,0], "True value: 1": mat[:,1], "True value: 2": mat[:,2]})
df.index = ["Predicted value: {}".format(i) for i in range(3)]

In [18]:
df

Unnamed: 0,True value: 0,True value: 1,True value: 2
Predicted value: 0,27.0,2.0,4.0
Predicted value: 1,1.0,20.0,4.0
Predicted value: 2,1.0,2.0,39.0


**Congratulations!** Your are done.

Download the notebook, and submit it using the 

        handin dekhtyar 446-test <file>
 command.