# Introduction:
[Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) belongs to the family of generative machine learning models, which model the joint distribution of the features and the class by learning $P(x|y)$ and $P(y)$, where 
1. $P(x | y)$ = probability of the features given the class.
2. $P(y)$ = probability of the class.

Naive Bayes is based on Bayes' theorem. Despite its simplicity, it is widely used and can be competitive with more sophisticated classification methods.
 

# Assumptions in Naive Bayes:
1. All the features of the data are independent of each other given the class (conditional independence). This assumption is the reason Naive Bayes is called naive.
2. The data points are independent and identically distributed (IID).


# Algorithm:
Let us consider a binary classification problem where the classes ($C$) are 0 and 1. Assume our data ($X$) has $d$ features and $n$ samples. As stated above, Naive Bayes is based on Bayes' theorem, which states that:
$$P(C|X) = \frac{P(X|C)P(C)}{P(X)}$$
where,
1. $P(C|X)$ is called the posterior probability of the class given the predictors.
2. $P(X|C)$ is called the likelihood.
3. $P(C)$ is called the prior probability of the class (the class distribution).
4. $P(X)$ is called the marginal probability of the predictors (the evidence).
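
To make the four terms concrete, here is a minimal numeric sketch of Bayes' theorem for a single observation. The prior and likelihood values are hypothetical numbers chosen only for illustration; they do not come from any dataset used in this notebook.

```python
# Hypothetical numbers chosen only for illustration.
prior = {0: 0.6, 1: 0.4}        # P(C) for classes 0 and 1
likelihood = {0: 0.2, 1: 0.7}   # P(X|C) for one observed X

# Evidence P(X), obtained by summing P(X|C) P(C) over the classes.
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Posterior P(C|X) for each class.
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}

# The predicted class is the one with the largest posterior.
predicted = max(posterior, key=posterior.get)
print(posterior, predicted)
```

With these numbers, class 1 wins even though its prior is smaller, because its likelihood is much larger.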

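We need to estimate $P(X|C)$ and $P(C)$ in order to classify a data point. Given a data point $X = \{x_{1},x_{2},x_{3},...,x_{d}\}$ with $d$ features, $P(X)$ is the same for every class, so we can estimate $P(C|X)$ up to a constant as 
$$P(C|X)=\frac{P(X|C)P(C)}{P(X)} \propto P(X|C)P(C)$$

Because the features are assumed to be independent given the class, the likelihood factorizes:

$$P(C|X) \propto P(x_{1}|C)P(x_{2}|C) \cdots P(x_{d}|C) P(C)$$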
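
Read as a decision rule, the proportionality above amounts to choosing the class with the largest unnormalized posterior. The symbol $\hat{C}$ for the predicted class is notation introduced here:

$$\hat{C} = \arg\max_{c \in \{0,1\}} P(C=c)\prod_{j=1}^{d} P(x_{j}|C=c)$$

For the Gaussian variant implemented in this notebook, each factor is modelled as a normal density. Writing $\mu_{jc}$ and $\sigma_{jc}$ (again, notation introduced here) for the mean and standard deviation of feature $j$ within class $c$:

$$P(x_{j}|C=c) = \frac{1}{\sqrt{2\pi\sigma_{jc}^{2}}}\exp\left(-\frac{(x_{j}-\mu_{jc})^{2}}{2\sigma_{jc}^{2}}\right)$$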

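**Note:** For binary classification, Naive Bayes yields a linear discriminant function in many common cases, for example Gaussian features whose variances are shared across the two classes. With class-specific variances the decision boundary is quadratic in general.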
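
A short derivation shows when the note above holds for Gaussian features. The symbols $\pi_{c} = P(C=c)$, $\mu_{jc}$ (mean of feature $j$ in class $c$) and $\sigma_{j}$ are notation introduced here, and the derivation assumes that each feature has the same standard deviation $\sigma_{j}$ in both classes, which the implementation in this notebook does not enforce:

$$\log\frac{P(C=1|X)}{P(C=0|X)} = \log\frac{\pi_{1}}{\pi_{0}} + \sum_{j=1}^{d}\left[\frac{(x_{j}-\mu_{j0})^{2}}{2\sigma_{j}^{2}} - \frac{(x_{j}-\mu_{j1})^{2}}{2\sigma_{j}^{2}}\right]$$

$$= \log\frac{\pi_{1}}{\pi_{0}} + \sum_{j=1}^{d}\frac{\mu_{j1}-\mu_{j0}}{\sigma_{j}^{2}}x_{j} - \sum_{j=1}^{d}\frac{\mu_{j1}^{2}-\mu_{j0}^{2}}{2\sigma_{j}^{2}}$$

The terms in $x_{j}^{2}$ cancel, so the log-odds is a linear function of $X$ and the decision boundary (log-odds equal to zero) is a hyperplane. If the variances differ between the classes, the $x_{j}^{2}$ terms no longer cancel and the boundary is quadratic.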

# Python implementation:
Before implementing the classifier, let us define a few helper functions that we will need.

In [17]:
# Import libraries
from __future__ import division
import numpy as np
from collections import defaultdict
import time
# sklearn.cross_validation was removed from scikit-learn; ShuffleSplit now lives in model_selection
from sklearn.model_selection import ShuffleSplit
from sklearn.preprocessing import StandardScaler

In [45]:
# Utility functions
def splitData(test_size, cv, numpoints):
    '''
    Wraps sklearn's ShuffleSplit: each split is a random permutation of the indices
    divided into a train part and a test part (sampling without replacement, not a bootstrap).

    :param test_size: fraction of the data to hold out for testing (value between 0 and 1).
    :param cv: number of re-shuffled splits to generate.
    :param numpoints: total number of data points.
    :return: iterable of (train_indices, test_indices) pairs, one per split.
    '''
    ss = ShuffleSplit(n_splits=cv, test_size=test_size, random_state=32)
    # split() only needs the number of samples, so a dummy index array is enough
    return ss.split(np.arange(numpoints))

def calAccuracy(pred, ytest):
    '''
    :param pred: vector containing all the predicted classes
    :param ytest: vector containing all the true classes
    :return: accuracy of classification (fraction of correct predictions)
    '''
    # Flatten both inputs so that column vectors and flat lists compare element-wise
    return np.mean(np.ravel(pred) == np.ravel(ytest))

def calgaussianprob(value, mean, std):
    '''
    :param value: point at which the density is evaluated.
    :param mean: mean of the distribution.
    :param std: standard deviation of the distribution.
    :return: value of the normal probability density with the given mean and std at the point.
    '''
    var = np.power(std, 2)
    return np.exp(-np.power(value - mean, 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
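
The Gaussian helper above returns a probability density rather than a probability, so its value can be larger than 1. The following standard-library sketch re-implements the same formula under the hypothetical name `gaussian_pdf` to illustrate this; it is a check of the formula, not part of the notebook's pipeline.

```python
import math

def gaussian_pdf(value, mean, std):
    """Density of a normal distribution N(mean, std^2) evaluated at value."""
    var = std ** 2
    return math.exp(-((value - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Peak of the standard normal density: 1 / sqrt(2 * pi)
peak_standard = gaussian_pdf(0.0, 0.0, 1.0)

# A narrow distribution has a peak density well above 1
peak_narrow = gaussian_pdf(0.0, 0.0, 0.1)

print(peak_standard, peak_narrow)
```

Note that the formula divides by the variance, so a feature that is constant within a class (zero standard deviation) breaks it. scikit-learn's `GaussianNB` guards against this with its `var_smoothing` parameter.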

# Naive Bayes:

In [46]:
# Assume that each feature is normally distributed within each class
class naivebayesGaussian():
    def __init__(self):
        self.meanandstd = defaultdict(list)
        self.classes = None
        self.classCount = None
        self.numFeatures = None

    def fit(self, Xtrain, ytrain):
        start = time.time()
        self.numFeatures = Xtrain.shape[1]
        # Accept both column vectors and flat label vectors
        ytrain = np.ravel(ytrain)
        # Reset the statistics so that calling fit twice does not accumulate stale values
        self.meanandstd = defaultdict(list)

        # Get the classes and their counts; the counts are used to compute the class priors
        self.classes, self.classCount = np.unique(ytrain, return_counts=True)

        for cls in self.classes:
            # Rows of the training data that belong to this class
            Xclass = Xtrain[ytrain == cls]
            # Mean and standard deviation of each feature within this class
            for i in range(self.numFeatures):
                self.meanandstd[cls].append((Xclass[:, i].mean(), Xclass[:, i].std()))
        end = time.time()
        print('Time taken to fit the data: {}'.format(end - start))

    def predict(self, Xtest):
        start = time.time()
        # Class priors P(C), in the same order as self.classes
        priors = self.classCount / float(self.classCount.sum())
        predictions = []
        for i in range(len(Xtest)):
            # Unnormalized posterior P(x|C) P(C) for each class, in the order of self.classes
            scores = []
            for k, cls in enumerate(self.classes):
                prob = priors[k]
                for j in range(self.numFeatures):
                    mean, std = self.meanandstd[cls][j]
                    prob *= calgaussianprob(Xtest[i][j], mean, std)
                scores.append(prob)
            # Pick the class with the largest score
            predictions.append(self.classes[int(np.argmax(scores))])
        end = time.time()
        print('Time taken to predict: {}'.format(end - start))
        return predictions
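
The prediction step above multiplies one density per feature. With many features, or with points far from the class means, such a product can underflow to zero in floating point. A common remedy is to sum log-densities instead. The sketch below illustrates the effect with the standard library only; the setup (400 features, unit standard deviation, class means of 0.0 and 1.0, equal priors) and the helper name `gaussian_log_pdf` are hypothetical and chosen just to trigger the underflow.

```python
import math

def gaussian_log_pdf(value, mean, std):
    """Log-density of a normal distribution N(mean, std^2) evaluated at value."""
    var = std ** 2
    return -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)

# Hypothetical setup: 400 features, two classes with equal priors,
# unit standard deviation, and per-class means of 0.0 and 1.0.
num_features = 400
priors = {0: 0.5, 1: 0.5}
class_means = {0: 0.0, 1: 1.0}
x = [3.0] * num_features

product_scores = {}
log_scores = {}
for cls in priors:
    prob = priors[cls]                 # running product P(C) * prod_j P(x_j|C)
    log_prob = math.log(priors[cls])   # running sum log P(C) + sum_j log P(x_j|C)
    for value in x:
        log_density = gaussian_log_pdf(value, class_means[cls], 1.0)
        prob *= math.exp(log_density)
        log_prob += log_density
    product_scores[cls] = prob
    log_scores[cls] = log_prob

predicted = max(log_scores, key=log_scores.get)
print(product_scores)  # both products underflow to 0.0
print(log_scores)
print(predicted)
```

In this setup the products cannot separate the classes, while the log scores still rank class 1 above class 0. For the 30 features of the breast cancer data the plain product may well be adequate, so this is a robustness consideration rather than a required change.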

# Load the dataset

In [47]:
from sklearn.datasets import load_breast_cancer
breastcancer = load_breast_cancer()
numSamples, numFeat = breastcancer.data.shape
ss = splitData(test_size=0.25,cv=1, numpoints=numSamples)
for train_index, test_index in ss:
    Xtrain = breastcancer.data[train_index, :]
    ytrain = breastcancer.target[train_index].reshape((train_index.shape[0], 1))
    Xtest = breastcancer.data[test_index, :]
    ytest = breastcancer.target[test_index].reshape((test_index.shape[0], 1))
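
For readers unfamiliar with `ShuffleSplit`, the idea behind a single shuffled split can be sketched with the standard library. The helper `shuffle_split_indices` is hypothetical, and it will not reproduce the exact indices that scikit-learn draws for `random_state=32`; it only illustrates that the indices are permuted once and then cut into a test part and a train part, with the test size rounded up.

```python
import math
import random

def shuffle_split_indices(num_points, test_size, seed):
    """Hypothetical helper: one random train/test split of range(num_points)."""
    rng = random.Random(seed)
    indices = list(range(num_points))
    rng.shuffle(indices)
    # Round the test size up, so 25% of 569 points gives 143 test points
    num_test = int(math.ceil(test_size * num_points))
    return indices[num_test:], indices[:num_test]

train_index, test_index = shuffle_split_indices(569, 0.25, seed=32)
print(len(train_index), len(test_index))
```

Every index lands in exactly one of the two parts, which is what distinguishes this scheme from a bootstrap, where indices are drawn with replacement.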

In [48]:
# Standardize the features to zero mean and unit variance, using statistics from the training set only
scaler = StandardScaler()
Xtrain = scaler.fit_transform(Xtrain)
Xtest = scaler.transform(Xtest)
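
The scaling step above computes the mean and standard deviation on the training split and reuses them for the test split, so no information from the test data leaks into the preprocessing. A minimal standard-library sketch of the same idea for a single feature, with hypothetical values:

```python
import statistics

# Hypothetical values of one feature
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

# Statistics are computed from the training values only
mean = statistics.mean(train)
std = statistics.pstdev(train)   # population standard deviation (ddof = 0)

train_scaled = [(v - mean) / std for v in train]
test_scaled = [(v - mean) / std for v in test]   # reuses the training statistics

print(train_scaled, test_scaled)
```

After scaling, the training values have zero mean and unit variance, while the test values generally do not, which is expected.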

# Running our algorithms:
As a sanity check, I also compare the accuracy of my Naive Bayes implementation with scikit-learn's `GaussianNB`.

In [49]:
from sklearn.naive_bayes import GaussianNB

In [50]:
myNB = naivebayesGaussian()
myNB.fit(Xtrain,ytrain)
pred = myNB.predict(Xtest)
print('Accuracy: {}'.format(calAccuracy(pred, ytest) * 100))

Time taken to fit the data: 0.00693106651306
Time taken to predict: 0.121052980423
Accuracy: 92.3076923077


In [51]:
clf = GaussianNB()
# GaussianNB expects a 1-D label vector, so flatten the column vector
clf.fit(Xtrain, ytrain.ravel())
pred = clf.predict(Xtest)
print('Accuracy of sklearn: {}'.format(calAccuracy(pred, ytest) * 100))

Accuracy of sklearn: 92.3076923077


# Conclusion:
The Naive Bayes classifier that we have implemented gives the same accuracy as scikit-learn's Gaussian Naive Bayes (about 92.31% on the held-out 25% of the breast cancer dataset).

# References:
My class notes and class slides.