## Naive Bayes

Bayes’ Theorem gives us a way to calculate the probability that a piece of data belongs to a given class, given our prior knowledge. Bayes’ Theorem is stated as:

    P(class|data) = (P(data|class) * P(class)) / P(data)

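Where P(class|data) is the probability of class given the provided data, P(data|class) is the likelihood of the data under that class, P(class) is the prior probability of the class and P(data) is the probability of the data regardless of class.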
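
As a small worked illustration of the formula, the sketch below plugs made-up numbers into Bayes’ Theorem for a two-class problem. The function name `posterior` and all of the values are assumptions chosen for illustration; they do not come from any dataset used in this notebook.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' Theorem: P(class|data) = P(data|class) * P(class) / P(data)."""
    return likelihood * prior / evidence

# Hypothetical numbers for a two-class problem (classes A and B)
p_data_given_a = 0.8    # P(data|class=A)
p_data_given_b = 0.1    # P(data|class=B)
p_a, p_b = 0.3, 0.7     # priors P(class=A) and P(class=B)

# P(data) follows from the law of total probability
p_data = p_data_given_a * p_a + p_data_given_b * p_b

post_a = posterior(p_data_given_a, p_a, p_data)
post_b = posterior(p_data_given_b, p_b, p_data)
print(post_a, post_b)
```

Because both posteriors share the same denominator P(data), they sum to 1 and the larger numerator always wins.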

Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification problems. It is called Naive Bayes, or idiot Bayes, because the calculation of the probabilities for each class is simplified to make it tractable.

Rather than attempting to calculate the joint probability of every combination of attribute values, the attributes are assumed to be conditionally independent given the class value.

This is a very strong assumption that is most unlikely in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well on data where this assumption does not hold.

### Separate By Class

We will need to calculate the probability of data by the class they belong to, the so-called base rate.

This means that we will first need to separate our training data by class, a relatively straightforward operation.

We can create a dictionary object where each key is the class value and then add a list of all the records as the value in the dictionary.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [47]:
df = pd.read_csv('./Iris.csv')

In [48]:
df.head()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [20]:
# Group the rows by the class they belong to, so that any later computation can be run directly on each group.

def seperate_by_class(dataset):
    """Return a dict mapping each class value to the list of rows with that class."""
    separated = dict()
    for row in dataset:
        cls = row[-1]  # the class label is the last element of each row
        if cls not in separated:
            separated[cls] = []
        separated[cls].append(row)
    return separated

In [4]:
# let's test our function

# Test separating data by class
dataset = [[3.393533211,2.331273381,0],
	[3.110073483,1.781539638,0],
	[1.343808831,3.368360954,0],
	[3.582294042,4.67917911,0],
	[2.280362439,2.866990263,0],
	[7.423436942,4.696522875,1],
	[5.745051997,3.533989803,1],
	[9.172168622,2.511101045,1],
	[7.792783481,3.424088941,1],
	[7.939820817,0.791637231,1]]

In [52]:
test_df = pd.DataFrame(dataset)
test_df.columns = ['X1','X2','y']
test_df

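| | X1 | X2 | y |
|---|---|---|---|
| 0 | 3.393533 | 2.331273 | 0 |
| 1 | 3.110073 | 1.781540 | 0 |
| 2 | 1.343809 | 3.368361 | 0 |
| 3 | 3.582294 | 4.679179 | 0 |
| 4 | 2.280362 | 2.866990 | 0 |
| 5 | 7.423437 | 4.696523 | 1 |
| 6 | 5.745052 | 3.533990 | 1 |
| 7 | 9.172169 | 2.511101 | 1 |
| 8 | 7.792783 | 3.424089 | 1 |
| 9 | 7.939821 | 0.791637 | 1 |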


In [23]:
separated = seperate_by_class(dataset)
for label in separated:
	print(label)
	for row in separated[label]:
		print(row)
print(separated)

0
[3.393533211, 2.331273381, 0]
[3.110073483, 1.781539638, 0]
[1.343808831, 3.368360954, 0]
[3.582294042, 4.67917911, 0]
[2.280362439, 2.866990263, 0]
1
[7.423436942, 4.696522875, 1]
[5.745051997, 3.533989803, 1]
[9.172168622, 2.511101045, 1]
[7.792783481, 3.424088941, 1]
[7.939820817, 0.791637231, 1]
{0: [[3.393533211, 2.331273381, 0], [3.110073483, 1.781539638, 0], [1.343808831, 3.368360954, 0], [3.582294042, 4.67917911, 0], [2.280362439, 2.866990263, 0]], 1: [[7.423436942, 4.696522875, 1], [5.745051997, 3.533989803, 1], [9.172168622, 2.511101045, 1], [7.792783481, 3.424088941, 1], [7.939820817, 0.791637231, 1]]}
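
The base rate mentioned at the start of this section can be read directly off a dictionary of this shape. The sketch below is one possible way to do it, using `collections.defaultdict` for the grouping; the tiny dataset and the names `rows`, `by_class` and `priors` are hypothetical and used only for illustration.

```python
from collections import defaultdict

# Small made-up dataset: two features followed by a class label
rows = [[1.0, 2.0, 0], [1.5, 1.8, 0], [5.0, 8.0, 1], [6.0, 9.0, 1], [5.5, 8.5, 1]]

# Group rows by their class label (the last element of each row)
by_class = defaultdict(list)
for row in rows:
    by_class[row[-1]].append(row)

# Base rate P(class) = rows in the class / all rows
priors = {cls: len(members) / len(rows) for cls, members in by_class.items()}
print(priors)
```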


### Summarize Dataset

We need two statistics from a given set of data: the mean and the standard deviation (a measure of how far values typically deviate from the mean).

We’ll see how these statistics are used in the calculation of probabilities in a few steps.

The mean is the average value and can be calculated as:

    mean = sum(x) / count(x)

Where x is the list of values or the column we are looking at.

The sample standard deviation is the square root of the average squared difference from the mean value, where the average divides by N-1 rather than N. This can be calculated as:

    standard deviation = sqrt((sum i to N (x_i - mean(x))^2) / (N - 1))

You can see that we square the difference between the mean and a given value, calculate the average squared difference from the mean, then take the square root to return the result to the original units.

In [6]:
def mean(data):
    return sum(data)/len(data)

In [7]:
def std(data):
    avg = mean(data)
    var = sum([(x-avg)**2 for x in data])/float(len(data)-1)
    return np.sqrt(var)
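
One way to gain confidence in these two helpers is to compare them against Python's built-in `statistics` module, whose `stdev()` also uses the N-1 (sample) form. The sketch below redefines the helpers with `math.sqrt` so that it runs on its own; the list `values` is made up for illustration.

```python
import math
import statistics

def mean(data):
    return sum(data) / len(data)

def std(data):
    # Sample standard deviation: divide the squared deviations by N - 1
    avg = mean(data)
    var = sum((x - avg) ** 2 for x in data) / (len(data) - 1)
    return math.sqrt(var)

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data
print(mean(values), statistics.mean(values))
print(std(values), statistics.stdev(values))
```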

In [8]:
t = zip(*dataset)

In [9]:
for each in t:
    print(each)

(3.393533211, 3.110073483, 1.343808831, 3.582294042, 2.280362439, 7.423436942, 5.745051997, 9.172168622, 7.792783481, 7.939820817)
(2.331273381, 1.781539638, 3.368360954, 4.67917911, 2.866990263, 4.696522875, 3.533989803, 2.511101045, 3.424088941, 0.791637231)
(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)


In [14]:
a,b,c = zip(*dataset)
print(a,'-',b,'-',c)

(3.393533211, 3.110073483, 1.343808831, 3.582294042, 2.280362439, 7.423436942, 5.745051997, 9.172168622, 7.792783481, 7.939820817) - (2.331273381, 1.781539638, 3.368360954, 4.67917911, 2.866990263, 4.696522875, 3.533989803, 2.511101045, 3.424088941, 0.791637231) - (0, 0, 0, 0, 0, 1, 1, 1, 1, 1)


In [16]:
# Calculate the mean, stdev and count for each column in a dataset
def summarize_dataset(dataset):
    summaries = [(mean(column), std(column), len(column)) for column in zip(*dataset)]
    del(summaries[-1])  # drop the summary of the last column (the class labels)
    return summaries

In [25]:
summarize_dataset([[1,2,3],[4,5,6]])

[(2.5, 2.1213203435596424, 2), (3.5, 2.1213203435596424, 2)]

In [28]:
summarize_dataset(dataset)

[(5.178333386499999, 2.7665845055177263, 10),
 (2.9984683241, 1.218556343617447, 10)]

### Summarize Data By Class

We require statistics from our training dataset organized by class.

Above, we developed the seperate_by_class() function to separate a dataset into rows by class, and the summarize_dataset() function to calculate summary statistics for each column.

We can put all of this together and summarize the columns in the dataset organized by class values.

In [26]:
# Summarize the data separately for each class group, so that the statistics of different classes can be told apart.
def summarize_by_class(dataset):
    summaries = dict()
    separated = seperate_by_class(dataset)
    for cls,rows in separated.items():
        summaries[cls] = summarize_dataset(rows)
    return summaries
        

In [27]:
summarize_by_class(dataset)

{0: [(2.7420144012, 0.9265683289298018, 5),
  (3.0054686692, 1.1073295894898725, 5)],
 1: [(7.6146523718, 1.2344321550313704, 5),
  (2.9914679790000003, 1.4541931384601618, 5)]}

### Gaussian Probability Density Function

Calculating the probability or likelihood of observing a given real value like X1 is difficult.

One way we can do this is to assume that X1 values are drawn from a distribution, such as a bell curve or Gaussian distribution.

A Gaussian distribution can be summarized using only two numbers: the mean and the standard deviation. Therefore, with a little math, we can estimate the probability of a given value. This piece of math is called the Gaussian Probability Density Function (or Gaussian PDF) and can be calculated as:

    f(x) = (1 / (sqrt(2 * PI) * sigma)) * exp(-((x - mean)^2 / (2 * sigma^2)))

Where sigma is the standard deviation for x, mean is the mean for x and PI is the value of pi.

In [33]:
from math import pi,sqrt,exp
def calculate_probability(mean,std,x):
    z = (x-mean)/std
    prob = (1/sqrt(2*pi*pow(std,2))*(exp((-1/2)*pow(z,2))))
    return prob

In [36]:
# Test Gaussian PDF
print('calculate_probability(1.0, 1.0, 1.0)->',calculate_probability(1.0, 1.0, 1.0))
print('calculate_probability(2.0, 1.0, 1.0)->',calculate_probability(2.0, 1.0, 1.0))
print('calculate_probability(0.0, 1.0, 1.0)->',calculate_probability(0.0, 1.0, 1.0))

calculate_probability(1.0, 1.0, 1.0)-> 0.3989422804014327
calculate_probability(2.0, 1.0, 1.0)-> 0.24197072451914337
calculate_probability(0.0, 1.0, 1.0)-> 0.24197072451914337

Running it prints the probability density of some input values. Note that the arguments are passed in the order (mean, std, x). When the value is 1 and the mean and standard deviation are both 1, our input is the most likely (top of the bell curve) and has a density of about 0.40.

When we keep the value and the standard deviation the same and move the mean 1 standard deviation to either side of the value (means of 2 and 0, the same distance either side of the input), the densities are the same, at about 0.24.
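
As an independent check on the formula, the standard library's `statistics.NormalDist` (Python 3.8+) exposes a `pdf()` method that should agree with a direct implementation. The function name `gaussian_pdf` and its argument order are assumptions made for this sketch; they differ from `calculate_probability()` above.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def gaussian_pdf(x, mean, std):
    # Direct transcription of f(x) = 1 / (sqrt(2*pi) * sigma) * exp(-(x-mean)^2 / (2*sigma^2))
    return (1 / (sqrt(2 * pi) * std)) * exp(-((x - mean) ** 2) / (2 * std ** 2))

reference = NormalDist(mu=1.0, sigma=1.0)
for x in (0.0, 1.0, 2.0):
    print(x, gaussian_pdf(x, 1.0, 1.0), reference.pdf(x))
```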

### Class Probabilities

Now it is time to use the statistics calculated from our training data to calculate probabilities for new data.

Probabilities are calculated separately for each class. This means that we first calculate the probability that a new piece of data belongs to the first class, then calculate probabilities that it belongs to the second class, and so on for all the classes.

The probability that a piece of data belongs to a class is calculated as follows:

    P(class|data) = P(data|class) * P(class)

You may note that this differs from Bayes’ Theorem as stated above: the division by P(data) has been removed to simplify the calculation.

This means that the result is no longer strictly a probability of the data belonging to a class. Because P(data) is the same for every class, dropping it does not change which class scores highest, so the class whose calculation gives the largest value is still taken as the prediction. This is a common implementation simplification, as we are often more interested in the class prediction than in the probability.

The input variables are treated separately, giving the technique its name “naive”. For the above example, where we have 2 input variables, the probability that a row belongs to the first class 0 can be calculated as:

    P(class=0|X1,X2) = P(X1|class=0) * P(X2|class=0) * P(class=0)

Now you can see why we need to separate the data by class value. The Gaussian Probability Density function in the previous step is how we calculate the probability of a real value like X1 and the statistics we prepared are used in this calculation.

Below is a function named calculate_class_probabilities() that ties all of this together.

It takes a set of prepared summaries and a new row as input arguments.

First, the total number of training records is calculated from the counts stored in the summary statistics. This is used to calculate the probability of a given class, P(class), as the ratio of rows with that class to all rows in the training data.

Next, probabilities are calculated for each input value in the row using the Gaussian probability density function and the statistics for that column and class. The probabilities are multiplied together as they are accumulated.

This process is repeated for each class in the dataset.

Finally a dictionary of probabilities is returned with one entry for each class.

In [44]:
def calculate_class_probabilities(summaries, row):
    # Every column of a class has the same count, so the first column's count gives the class size.
    total_rows = sum([summaries[cls][0][2] for cls in summaries])
    probabilities = dict()
    for cls, cls_summaries in summaries.items():
        # Start from the prior P(class) = rows in this class / all rows
        probabilities[cls] = summaries[cls][0][2] / float(total_rows)
        for i in range(len(cls_summaries)):  # for each input column of this class
            mean, std, count = cls_summaries[i]
            probabilities[cls] *= calculate_probability(mean, std, row[i])
    return probabilities

In [43]:
summaries = summarize_by_class(dataset)
sum([summaries[cls][0][2] for cls in summaries])

10

In [46]:
probabilities = calculate_class_probabilities(summaries,dataset[0])
print(probabilities)

{0: 0.05032427673372073, 1: 0.00011557718379945776}
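
These values are unnormalized scores rather than probabilities. If actual class probabilities are wanted, one option is to divide each score by the sum of all scores, which restores the missing P(data) denominator under the model's assumptions. The sketch below does this for the two scores printed above; the names `scores`, `posteriors` and `best` are hypothetical.

```python
# Unnormalized scores P(data|class) * P(class), copied from the output above
scores = {0: 0.05032427673372073, 1: 0.00011557718379945776}

# Dividing by the total rescales the scores so that they sum to 1
total = sum(scores.values())
posteriors = {cls: score / total for cls, score in scores.items()}
print(posteriors)

# The predicted class is the same before and after normalization
best = max(scores, key=scores.get)
print(best)
```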


### Iris Flower Species Case Study

This section applies the Naive Bayes algorithm to the Iris flowers dataset.

The first step is to load the dataset and convert the loaded data to numbers that we can use with the mean and standard deviation calculations. For this we will use the helper function load_csv() to load the file, str_column_to_float() to convert string numbers to floats and str_column_to_int() to convert the class column to integer values.
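
The helper functions named above are not shown in this notebook, so the following is only a minimal sketch of what they might look like, built on the standard `csv` module. The function bodies are assumptions. The sample file is a two-row stand-in written to a temporary path, not the real Iris.csv (which, as loaded earlier, also has a header row and an Id column that would need to be skipped).

```python
import csv
import os
import tempfile

def load_csv(filename):
    """Read a CSV file into a list of rows, skipping blank lines."""
    with open(filename, newline="") as f:
        return [row for row in csv.reader(f) if row]

def str_column_to_float(dataset, column):
    """Convert one column from strings to floats, in place."""
    for row in dataset:
        row[column] = float(row[column].strip())

def str_column_to_int(dataset, column):
    """Map each distinct class string to an integer, in place; return the lookup."""
    values = sorted({row[column] for row in dataset})
    lookup = {value: i for i, value in enumerate(values)}
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Tiny stand-in file for illustration only
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n7.0,3.2,4.7,1.4,Iris-versicolor\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)

data = load_csv(tmp.name)
os.remove(tmp.name)

for col in range(len(data[0]) - 1):
    str_column_to_float(data, col)
lookup = str_column_to_int(data, len(data[0]) - 1)
print(data)
print(lookup)
```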

We will evaluate the algorithm using k-fold cross-validation with 5 folds. This means that 150/5=30 records will be in each fold. We will use the helper functions evaluate_algorithm() to evaluate the algorithm with cross-validation and accuracy_metric() to calculate the accuracy of predictions.
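
The evaluation helpers are likewise not defined in this notebook. The sketch below shows one plausible shape for them, under stated assumptions: `cross_validation_split()` is a hypothetical helper, the fold construction drops any leftover rows when the dataset size is not divisible by the number of folds, and `majority_class()` is a placeholder algorithm used only to make the example runnable.

```python
from random import Random

def cross_validation_split(dataset, n_folds, seed=1):
    """Shuffle the rows and deal them into n_folds folds of equal size."""
    rows = list(dataset)
    Random(seed).shuffle(rows)
    fold_size = len(rows) // n_folds  # leftover rows, if any, are dropped
    return [rows[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]

def accuracy_metric(actual, predicted):
    """Percentage of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual) * 100.0

def evaluate_algorithm(dataset, algorithm, n_folds):
    """Train on n_folds - 1 folds, test on the held-out fold, once per fold."""
    folds = cross_validation_split(dataset, n_folds)
    scores = []
    for i, test_fold in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        # Hide the labels from the algorithm by replacing them with None
        test = [row[:-1] + [None] for row in test_fold]
        predicted = algorithm(train, test)
        actual = [row[-1] for row in test_fold]
        scores.append(accuracy_metric(actual, predicted))
    return scores

def majority_class(train, test):
    """Placeholder algorithm: always predict the most common training label."""
    labels = [row[-1] for row in train]
    guess = max(set(labels), key=labels.count)
    return [guess for _ in test]

# Made-up dataset with 150 rows, mirroring the size of the Iris data
toy = [[float(i), i % 3] for i in range(150)]
folds = cross_validation_split(toy, 5)
print([len(fold) for fold in folds])  # five folds of 30 rows each
scores = evaluate_algorithm(toy, majority_class, 5)
print(scores)
```

Any function with the signature `algorithm(train, test)` that returns one prediction per test row can be passed in place of the placeholder.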

A new function named predict() was developed to manage the calculation of the probabilities of a new row belonging to each class and selecting the class with the largest probability value.
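
The notebook does not include the body of predict(), so the following is a hedged sketch of how it could be written on top of calculate_class_probabilities(). The two helper functions are restated compactly so that the block runs on its own, and the `summaries` literal is copied from the summarize_by_class() output shown earlier.

```python
from math import exp, pi, sqrt

def calculate_probability(mean, std, x):
    # Gaussian probability density of x for the given mean and standard deviation
    z = (x - mean) / std
    return exp(-0.5 * z * z) / (sqrt(2 * pi) * std)

def calculate_class_probabilities(summaries, row):
    total_rows = sum(summaries[cls][0][2] for cls in summaries)
    probabilities = {}
    for cls, cls_summaries in summaries.items():
        probabilities[cls] = cls_summaries[0][2] / total_rows  # prior P(class)
        for i, (mean, std, _) in enumerate(cls_summaries):
            probabilities[cls] *= calculate_probability(mean, std, row[i])
    return probabilities

def predict(summaries, row):
    """Return the class whose score P(data|class) * P(class) is largest."""
    probabilities = calculate_class_probabilities(summaries, row)
    return max(probabilities, key=probabilities.get)

# Per-class (mean, std, count) summaries copied from the earlier output
summaries = {
    0: [(2.7420144012, 0.9265683289298018, 5), (3.0054686692, 1.1073295894898725, 5)],
    1: [(7.6146523718, 1.2344321550313704, 5), (2.9914679790000003, 1.4541931384601618, 5)],
}
print(predict(summaries, [3.393533211, 2.331273381]))
print(predict(summaries, [7.423436942, 4.696522875]))
```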

Another new function named naive_bayes() was developed to manage the application of the Naive Bayes algorithm, first learning the statistics from a training dataset and using them to make predictions for a test dataset.
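
Since naive_bayes() is described but not shown, here is a minimal self-contained sketch of how the pieces could fit together. The helper bodies are compact restatements written for this example, not the notebook's own cells, and the train/test split below is a hand-picked subset of the small test dataset used earlier rather than the Iris data.

```python
from math import exp, pi, sqrt

def summarize_by_class(dataset):
    """Per class, a (mean, std, count) tuple for every input column."""
    separated = {}
    for row in dataset:
        separated.setdefault(row[-1], []).append(row)
    summaries = {}
    for cls, rows in separated.items():
        stats = []
        for column in list(zip(*rows))[:-1]:  # skip the class column
            avg = sum(column) / len(column)
            var = sum((x - avg) ** 2 for x in column) / (len(column) - 1)
            stats.append((avg, sqrt(var), len(column)))
        summaries[cls] = stats
    return summaries

def calculate_probability(mean, std, x):
    return exp(-0.5 * ((x - mean) / std) ** 2) / (sqrt(2 * pi) * std)

def predict(summaries, row):
    total = sum(stats[0][2] for stats in summaries.values())
    scores = {}
    for cls, stats in summaries.items():
        scores[cls] = stats[0][2] / total  # prior P(class)
        for i, (mean, std, _) in enumerate(stats):
            scores[cls] *= calculate_probability(mean, std, row[i])
    return max(scores, key=scores.get)

def naive_bayes(train, test):
    """Learn the statistics from train, then predict a label for each test row."""
    summaries = summarize_by_class(train)
    return [predict(summaries, row) for row in test]

train = [[3.393533211, 2.331273381, 0], [3.110073483, 1.781539638, 0],
         [1.343808831, 3.368360954, 0], [7.423436942, 4.696522875, 1],
         [5.745051997, 3.533989803, 1], [9.172168622, 2.511101045, 1]]
# Test rows carry None in place of the label, which predict() never reads
test = [[2.280362439, 2.866990263, None], [7.792783481, 3.424088941, None]]
print(naive_bayes(train, test))
```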