# A Notebook to Use Naïve Bayes Classifiers

This notebook shows how to train a Naïve Bayes classifier to classify unseen instances.

For those of you interested in understanding the code, it uses predefined functions from the [sklearn](http://scikit-learn.org) library of machine learning primitives. A few more details about the code:  
* The variable "dataset" stores the name of the text file that you input, and it is passed as an argument to the function "loadDataSet()".  
* After processing, the loadDataSet function returns three variables: "attributes", "instances", and "labels".  
* "attributes" stores the names of all features. "instances" stores the feature vector of each instance. "labels" stores the labels of all instances.   
* The variable "n_foldCV" stores the number of folds that you input for n-fold cross-validation.
* The variables "clf_G" and "clf_M" store the Gaussian and Multinomial Naïve Bayes models. Each is fitted with "instances" and "labels". Once the models are trained, they can be used to predict unseen instances. 
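
To make the three return values concrete, here is a minimal sketch of the file layout the loader expects: a header row, numeric feature columns, and the class label in the last column. The column names, values, and labels are made up for illustration and are not taken from lenses.csv. The sketch uses the standard csv module rather than the notebook's own loadDataSet function.

```python
import csv
import os
import tempfile

import numpy as np

# Hypothetical file: header row, two numeric features, label in the last column.
csv_text = "age,astigmatic,class\n1,2,hard\n2,1,soft\n3,1,none\n"

# Write the text to a temporary file so it can be read back like a real dataset.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write(csv_text)
    path = f.name

with open(path, newline="") as f:
    rows = list(csv.reader(f))
os.remove(path)

attributes = rows[0][:-1]                     # feature names (header minus label)
instances = np.array([[float(v) for v in row[:-1]] for row in rows[1:]])
labels = [row[-1] for row in rows[1:]]        # last field of every data row

print(attributes)       # ['age', 'astigmatic']
print(instances.shape)  # (3, 2)
print(labels)           # ['hard', 'soft', 'none']
```

Each row of "instances" is one feature vector, and "labels" has one entry per row, which is the shape that the fit() method of the sklearn classifiers expects.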

The Naïve Bayes classifier is a probabilistic classifier based on Bayes' theorem. It is used to find the probability that a hypothesis (H) is true given that an event (E) has occurred. 

![alt text](https://drive.google.com/uc?id=1Hdtn7HztaQC8R5U3wQEgMVvC4dTvvNLg)
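
In the H and E notation used above, the standard statement of Bayes' theorem is given in the first line below. The second line is the "naïve" step: it assumes that the individual feature values $e_1, \dots, e_n$ that make up E are conditionally independent given H. The symbols $e_i$ and $n$ are introduced here for illustration.

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

$$P(H \mid e_1, \dots, e_n) \propto P(H)\prod_{i=1}^{n} P(e_i \mid H)$$

The classifier predicts the class H that maximizes the right-hand side of the second line.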


In [None]:
import numpy as np
from sklearn.naive_bayes import BernoulliNB,GaussianNB,MultinomialNB
from sklearn.model_selection import cross_val_score

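def loadDataSet(dataset):
    """Read a CSV file with a header row and the class label in the last column."""
    with open(dataset) as f:
        # Split every non-empty line into its comma-separated fields.
        rows = [line.rstrip().split(',') for line in f if line.strip()]
    # The header row without the label column gives the feature names.
    attributes = rows[0][:-1]
    # Feature values must be numeric; float() raises ValueError otherwise.
    instances = np.array([[float(value) for value in row[:-1]] for row in rows[1:]])
    # The last field of each data row is its class label.
    labels = [row[-1] for row in rows[1:]]
    return attributes, instances, labels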



def predict(testset):
    if "clf_G" in globals():
        prediction=clf_G.predict(testset)
        print("GaussianNB: ",prediction)
    if "clf_M" in globals():
        prediction=clf_M.predict(testset)
        print("MultinomialNB: ",prediction)

## Training: Build a Naïve Bayes Classifier
The cells below load a dataset and train a Naïve Bayes classifier on it. Three classifiers are imported above (Bernoulli, Gaussian, and Multinomial); this notebook trains the Gaussian and Multinomial variants. They rest on different mathematical foundations and may perform differently on different datasets.   

We provide the lenses dataset, which can be used with the Naïve Bayes algorithms. 
* "lenses.csv" contains four attributes with discrete values and three classes.
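
As a rough illustration of how the two variants differ, the sketch below fits both on a small made-up dataset and prints the class probabilities each assigns to the same query. GaussianNB models each feature as normally distributed within a class, while MultinomialNB treats the feature values as non-negative counts. The data, class names, and query are hypothetical and unrelated to lenses.csv.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Hypothetical integer-coded data: 6 instances, 2 features, 2 classes.
X = np.array([[1, 1], [1, 2], [2, 1], [3, 2], [3, 3], [2, 3]], dtype=float)
y = ["a", "a", "a", "b", "b", "b"]
query = np.array([[3.0, 3.0]])

results = {}
for model in (GaussianNB(), MultinomialNB()):
    model.fit(X, y)
    probs = model.predict_proba(query)[0]   # one probability per class
    results[type(model).__name__] = probs
    pairs = [f"{c}: {p:.3f}" for c, p in zip(model.classes_, probs)]
    print(type(model).__name__, pairs)
```

Both models are fitted on identical inputs, so any difference in the printed probabilities comes from the distributional assumption alone.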

### Gaussian Naïve Bayes Classifier

In [None]:
dataset="lenses.csv"
attributes,instances,labels=loadDataSet(dataset)
clf_G = GaussianNB()
clf_G.fit(instances, labels)
print("Gaussian Naive Bayes is used.")


### Multinomial Naïve Bayes Classifier





In [None]:
attributes,instances,labels=loadDataSet(dataset)
clf_M = MultinomialNB()
clf_M.fit(instances, labels)
print("Multinomial Naive Bayes is used.")

## Predict unseen instances
When you are prompted to input a prediction set, enter one instance that looks like the instances in the lenses.csv file. For example:

"young,myope,yes,normal"


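Separate the feature values with commas, and give the instance the same number of feature values as the instances in the original dataset. The cell below converts every value with "float()", so type each value in the numeric form in which it is stored in the file; a text value raises a ValueError. 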
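
The input rules above can be checked before prediction. The sketch below uses a hypothetical helper, parse_instance, which is not part of the notebook, and a made-up numeric instance; it rejects input whose length does not match the expected number of features.

```python
import numpy as np

def parse_instance(raw, n_features):
    """Hypothetical helper: turn 'v1,v2,...' into a (1, n_features) float array."""
    values = [float(v) for v in raw.split(",")]  # ValueError on non-numeric text
    if len(values) != n_features:
        raise ValueError(f"expected {n_features} values, got {len(values)}")
    # One row and n_features columns: the 2D shape that predict() expects.
    return np.array(values).reshape(1, -1)

row = parse_instance("1,2,1,2", n_features=4)  # made-up numeric codes
print(row.shape)  # (1, 4)
```

In the notebook, n_features could be taken as len(attributes) from the loaded dataset.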

In [None]:
testset=input('Please Enter Your Prediction Set:')
# Convert the comma-separated values to floats and shape them as a single-row
# 2D array, which is the input format that predict() expects.
testset=np.array([float(value) for value in testset.split(",")]).reshape(1,-1)

In [None]:
predict(testset)

## Evaluate a classifier
The following cells output the accuracy score of each fold, followed by the mean accuracy plus or minus two standard deviations of the fold scores, which roughly approximates a 95% confidence interval.  
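
The summary line printed by the evaluation cells is plain arithmetic on the per-fold scores. The sketch below reproduces it on a made-up array of five fold accuracies; the numbers are hypothetical, not results from lenses.csv.

```python
import numpy as np

# Hypothetical accuracies from a 5-fold run.
scores = np.array([0.8, 0.6, 1.0, 0.8, 0.8])

mean = scores.mean()        # average accuracy over the folds
spread = 2 * scores.std()   # two population standard deviations (ddof=0)

summary = "Accuracy: %0.2f (+/- %0.2f)" % (mean, spread)
print(summary)  # Accuracy: 0.80 (+/- 0.25)
```

With only a handful of folds the interval is a coarse indication of variability rather than a precise confidence bound.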

### Gaussian Naïve Bayes Classifier



In [None]:
%%time
# dataset=input('Please Enter Your Test Data:')
n_foldCV=int(input("Please Enter the Number of Folds:"))
attributes,instances,labels=loadDataSet(dataset)
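# cross_val_score fits clones of the estimator, so pass a fresh model here and
# leave the trained clf_G untouched for predict().
scores = cross_val_score(GaussianNB(), instances, labels, cv=n_foldCV)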
print("======GaussianNB======")
print(np.array2string(scores,separator=","))
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

### Multinomial Naïve Bayes Classifier

In [None]:
%%time
# dataset=input('Please Enter Your Test Data:')
n_foldCV=int(input("Please Enter the Number of Folds:"))
attributes,instances,labels=loadDataSet(dataset)
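# cross_val_score fits clones of the estimator, so pass a fresh model here and
# leave the trained clf_M untouched for predict().
scores = cross_val_score(MultinomialNB(), instances, labels, cv=n_foldCV)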
print("======MultinomialNB======")
print(np.array2string(scores,separator=","))
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))