## Naive Bayes classifier
Based on https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/


In [83]:
%load_ext autoreload
%autoreload 2

In [84]:
import numpy as np 
import naive

### Load data 

In [85]:
filename = 'data.csv'
dataset = naive.loadCsv(filename)

print(dataset[0:5])

[[6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0], [1.0, 85.0, 66.0, 29.0, 0.0, 26.6, 0.351, 31.0, 0.0], [8.0, 183.0, 64.0, 0.0, 0.0, 23.3, 0.672, 32.0, 1.0], [1.0, 89.0, 66.0, 23.0, 94.0, 28.1, 0.167, 21.0, 0.0], [0.0, 137.0, 40.0, 35.0, 168.0, 43.1, 2.288, 33.0, 1.0]]
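
The loader itself is not shown in this notebook. As a rough sketch only, assuming 'naive.loadCsv' simply parses every CSV field into a float (the helper name `load_csv_rows` is hypothetical, and the code is written for Python 3), the step might look like this:

```python
import csv
import io

def load_csv_rows(handle):
    """Parse each non-empty CSV line into a list of floats (hypothetical helper)."""
    return [[float(value) for value in row] for row in csv.reader(handle) if row]

# Two rows copied from the output above stand in for data.csv.
sample = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)
rows = load_csv_rows(sample)
print(rows[0])  # [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]
```

With a real file, the same function could be called on an open file handle instead of the in-memory sample.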


### Split into training and test set data 

In [86]:
splitRatio = 0.67

trainingSet, testSet = naive.splitDataset(dataset, splitRatio)
print('Split {0} rows into train={1} and test={2} rows'.format(len(dataset), len(trainingSet), len(testSet)))

Split 768 rows into train=514 and test=254 rows
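
The split sizes above are consistent with taking int(768 * 0.67) = 514 rows for training. A minimal sketch of such a random split, assuming that is how 'naive.splitDataset' behaves (the function name `split_dataset` and the `seed` parameter are illustrative and not part of the original module):

```python
import random

def split_dataset(dataset, split_ratio, seed=None):
    """Shuffle a copy of the rows and cut it at int(len * split_ratio)."""
    rng = random.Random(seed)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    train_size = int(len(shuffled) * split_ratio)
    return shuffled[:train_size], shuffled[train_size:]

# Integers stand in for the 768 data rows.
train, test = split_dataset(list(range(768)), 0.67, seed=0)
print(len(train), len(test))  # 514 254
```

Because the split is random, downstream numbers such as the accuracy can change from run to run unless a seed is fixed.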


### Prepare model 
Separate the training set by class. The attributes of each class are then used to compute per-class statistics.

In [87]:
summaries = naive.summarizeByClass(trainingSet)

#### What's happening in summarizeByClass


In [88]:
# Here we show how one of the 'summarizeByClass' routines separates the data into 0 (no diabetes) and 1 (with diabetes) data sets
trialSet = trainingSet[0:10]
separated = naive.separateByClass(trialSet)
print('Separated instances: {0}'.format(separated))

Separated instances: {0.0: [[6.0, 111.0, 64.0, 39.0, 0.0, 34.2, 0.26, 24.0, 0.0], [1.0, 79.0, 75.0, 30.0, 0.0, 32.0, 0.396, 22.0, 0.0], [1.0, 71.0, 62.0, 0.0, 0.0, 21.8, 0.416, 26.0, 0.0], [1.0, 95.0, 74.0, 21.0, 73.0, 25.9, 0.673, 36.0, 0.0], [0.0, 105.0, 64.0, 41.0, 142.0, 41.5, 0.173, 22.0, 0.0], [1.0, 135.0, 54.0, 0.0, 0.0, 26.7, 0.687, 62.0, 0.0], [1.0, 112.0, 72.0, 30.0, 176.0, 34.4, 0.528, 25.0, 0.0]], 1.0: [[1.0, 163.0, 72.0, 0.0, 0.0, 39.0, 1.222, 33.0, 1.0], [4.0, 171.0, 72.0, 0.0, 0.0, 43.6, 0.479, 26.0, 1.0], [4.0, 95.0, 64.0, 0.0, 0.0, 32.0, 0.161, 31.0, 1.0]]}
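
The grouping shown above can be sketched as follows. This is an illustrative reimplementation, assuming the class label is the last element of each row; `separate_by_class` is a hypothetical name, and the three short rows are toy data rather than rows from the data set:

```python
from collections import defaultdict

def separate_by_class(dataset):
    """Group rows by their last element, which is taken to be the class label."""
    separated = defaultdict(list)
    for row in dataset:
        separated[row[-1]].append(row)
    return dict(separated)

# Shortened toy rows: two attributes followed by the class label.
toy_rows = [
    [1.0, 79.0, 0.0],
    [4.0, 171.0, 1.0],
    [1.0, 71.0, 0.0],
]
groups = separate_by_class(toy_rows)
print(groups[1.0])  # [[4.0, 171.0, 1.0]]
```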


We compute the mean and standard deviation of each attribute for each class.
'summarizeByClass' calls 'summarize' to compute the mean and standard deviation of each attribute:
<code>
def summarize(dataset):
	summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
	del summaries[-1]
	return summaries
</code>

Note: each 'event' is reported as one line in the csv file. Each line contains 9 columns, of which the last is the output (class label). Hence this function reports statistics for each attribute across all events, and 'del summaries[-1]' drops the statistics of the class column.
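
To make the column-wise computation concrete, here is a small self-contained sketch. It assumes the sample standard deviation (dividing by n - 1), which the notebook does not confirm, and it uses the standard-library `statistics` module in place of the module's own 'mean' and 'stdev' helpers. The toy rows are made up for illustration:

```python
from statistics import mean, stdev

def summarize(dataset):
    """Return (mean, sample stdev) per column, skipping the last (class) column."""
    columns = list(zip(*dataset))
    return [(mean(column), stdev(column)) for column in columns[:-1]]

# Toy rows: two attributes followed by the class label.
toy = [[1.0, 20.0, 0.0], [2.0, 21.0, 0.0], [3.0, 22.0, 0.0]]
print(summarize(toy))  # [(2.0, 1.0), (21.0, 1.0)]
```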

Below I show only the mean for each attribute; the 'real' code also computes the standard deviation.

In [89]:

summaries = [(np.mean(attribute)) for attribute in zip(*trialSet)]
print('Separated instances: {0}'.format(summaries))


Separated instances: [2.0, 113.7, 67.299999999999997, 16.100000000000001, 39.100000000000001, 33.109999999999999, 0.49949999999999994, 30.699999999999999, 0.29999999999999999]


Show class statistics. By inspection, several attributes have clearly different means for the two outcomes (for example, the second attribute averages about 110 for outcome=0 and about 141 for outcome=1).

In [90]:

summary = naive.summarizeByClass(trainingSet)

print('Summary by class value for outcome=0: {0}'.format(summary[0]))

print('Summary by class value for outcome=1: {0}'.format(summary[1]))

Summary by class value for outcome=0: [(3.3264094955489614, 3.1234972365245057), (109.85756676557864, 26.142191495829245), (68.58753709198812, 17.82685137448383), (19.51038575667656, 15.273129771121265), (67.20771513353115, 96.61918158328272), (30.198219584569742, 7.569491343653093), (0.438427299703264, 0.31081933041120535), (31.792284866468844, 12.24325129221498)]
Summary by class value for outcome=1: [(4.858757062146893, 3.7715057025688683), (141.08474576271186, 30.48972710878125), (70.13559322033899, 22.375507328264693), (21.497175141242938, 16.788837029199332), (99.00564971751412, 127.28491203412793), (34.99209039548024, 6.614279029697137), (0.5547514124293789, 0.3736294580784771), (36.90395480225989, 10.767524624694392)]


### Compute probabilities of belonging to a given class

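First, we'll compute the likelihood $P(D|\theta)$ for a single attribute, assuming Gaussian statistics. Here the attribute value 'x' we pass in is our 'D', and 'theta' stands for a given class, described by its mean and standard deviation. The result is a probability density, so it scores how well 'x' fits the class rather than giving a true probability.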

In [91]:
mean = 73
stdev = 6.2

x = 100
probability = naive.calculateProbability(x, mean, stdev)
print('Probability of belonging to this class: {0}'.format(probability))

x = 75
probability = naive.calculateProbability(x, mean, stdev)
print('Probability of belonging to this class: {0}'.format(probability))


Probability of belonging to this class: 4.90233998485e-06
Probability of belonging to this class: 0.0610832884561
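
The two values above agree with the standard Gaussian probability density function. The following sketch assumes 'naive.calculateProbability' implements exactly that formula; `gaussian_pdf` is an illustrative name, not a function from the module:

```python
import math

def gaussian_pdf(x, mean, stdev):
    """Gaussian density of x for a distribution with the given mean and stdev."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)

far = gaussian_pdf(100, 73, 6.2)   # x far from the class mean
near = gaussian_pdf(75, 73, 6.2)   # x close to the class mean
print(far, near)
```

A value near the class mean scores several orders of magnitude higher than one more than four standard deviations away, which is what lets the classifier tell the classes apart.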


Since we have multiple attributes that are assumed independent, the probability of an event belonging to a given class is the product of the per-attribute probabilities for that event.

<code>
def calculateClassProbabilities(summaries, inputVector):
	probabilities = {}
	for classValue, classSummaries in summaries.items():
		probabilities[classValue] = 1
		for i in range(len(classSummaries)):
			mean, stdev = classSummaries[i]
			x = inputVector[i]
			probabilities[classValue] *= calculateProbability(x, mean, stdev)
	return probabilities
</code>    


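Example: here we create a 'summaries' dictionary with two classes, each with one attribute described by its mean (1 or 20) and standard deviation (0.5 or 5.0). We then ask for the probability of belonging to each class, given that the attribute value in our input vector is 1.1. The trailing '?' stands in for the unknown class label and is not used in the computation.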

In [92]:
summaries = {0:[(1, 0.5)], 1:[(20, 5.0)]}
inputVector = [1.1, '?']
probabilities = naive.calculateClassProbabilities(summaries, inputVector)
print('Probabilities for each class: {0}'.format(probabilities))
print("My class is {0}".format(naive.predict(summaries, inputVector)))

Probabilities for each class: {0: 0.7820853879509118, 1: 6.298736258150442e-05}
My class is 0
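
The prediction step is not shown in the notebook. A plausible sketch, assuming 'naive.predict' simply returns the class with the largest product of Gaussian densities (all function names below are illustrative):

```python
import math

def gaussian_pdf(x, mean, stdev):
    """Gaussian density of x for the given mean and stdev."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)

def class_probabilities(summaries, input_vector):
    """Multiply the per-attribute densities for every class."""
    probabilities = {}
    for class_value, class_summaries in summaries.items():
        probabilities[class_value] = 1.0
        for i, (mean, stdev) in enumerate(class_summaries):
            probabilities[class_value] *= gaussian_pdf(input_vector[i], mean, stdev)
    return probabilities

def predict(summaries, input_vector):
    """Pick the class whose product of densities is largest."""
    probabilities = class_probabilities(summaries, input_vector)
    return max(probabilities, key=probabilities.get)

summaries = {0: [(1, 0.5)], 1: [(20, 5.0)]}
input_vector = [1.1, '?']  # the '?' is never read: each class has one attribute
print(predict(summaries, input_vector))  # 0
```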


Now we'll run the classifier on a subset of the test set 

In [93]:
# prepare model
summaries = naive.summarizeByClass(trainingSet)
# test model
subset = testSet[0:3]
print("Subset {0}".format(subset))
predictions = naive.getPredictions(summaries, subset)
print("Predictions {0}".format(predictions))

Subset [[6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0], [8.0, 183.0, 64.0, 0.0, 0.0, 23.3, 0.672, 32.0, 1.0], [5.0, 116.0, 74.0, 0.0, 0.0, 25.6, 0.201, 30.0, 0.0]]
Predictions [1.0, 1.0, 0.0]


### Execute all steps
We also have an accuracy metric that lets us test the classifier's performance on our test data set

In [98]:
naive.main()

Split 768 rows into train=514 and test=254 rows
Accuracy: 81.4960629921%
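
The accuracy figure is presumably the percentage of test rows whose predicted class matches the true label in the last column. A minimal sketch under that assumption (`get_accuracy` is a hypothetical name, and the rows below are toy data containing only the label column):

```python
def get_accuracy(test_set, predictions):
    """Percentage of rows whose last element equals the predicted class."""
    correct = sum(
        1 for row, predicted in zip(test_set, predictions) if row[-1] == predicted
    )
    return 100.0 * correct / len(test_set)

# Toy rows containing only the class label; three of four predictions match.
toy_rows = [[1.0], [1.0], [0.0], [0.0]]
toy_predictions = [1.0, 0.0, 0.0, 0.0]
print(get_accuracy(toy_rows, toy_predictions))  # 75.0
```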
