# Naive Bayes

Bayes’ Theorem offers a method to compute the probability of data belonging to a specific class, given our prior knowledge. The theorem can be expressed as follows:

P(class|data) = (P(data|class) * P(class)) / P(data)

Here, P(class|data) represents the probability of the class given the provided data.
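
As a quick illustration of the formula, the sketch below plugs numbers into Bayes' Theorem. The three input probabilities are hypothetical values chosen only to show the arithmetic; they do not come from any dataset used in this tutorial.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
p_class = 0.3             # P(class): prior probability of the class
p_data_given_class = 0.8  # P(data|class): likelihood of the data within the class
p_data_given_other = 0.1  # P(data|not class): likelihood of the data outside the class

# P(data) via the law of total probability
p_data = p_data_given_class * p_class + p_data_given_other * (1 - p_class)

# Bayes' Theorem: P(class|data) = P(data|class) * P(class) / P(data)
p_class_given_data = p_data_given_class * p_class / p_data
print(round(p_class_given_data, 4))  # 0.7742
```

Observing the data raises the probability of the class from the prior of 0.3 to roughly 0.77 in this made-up case.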

For a comprehensive introduction to Bayes' Theorem, you can refer to the tutorial titled:

"A Gentle Introduction to Bayes' Theorem for Machine Learning." https://machinelearningmastery.com/bayes-theorem-for-machine-learning/

Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification problems. It is called Naive Bayes, or idiot Bayes, because the probability calculations for each class are simplified to keep them tractable.

Rather than modeling how the attribute values vary together, Naive Bayes assumes that the attributes are conditionally independent given the class value.

This assumption is quite strong and often improbable in real-world data, as it assumes no interaction among attributes. Nevertheless, Naive Bayes performs surprisingly well on data even when this assumption is not met.

The goals of this tutorial are to understand:
- How to calculate the probabilities required by the Naive Bayes algorithm.
- How to implement the Naive Bayes algorithm from scratch.
- How to apply Naive Bayes to a real-world predictive modeling problem.


## Dataset
We will use the Iris dataset from scikit-learn.

In [13]:
from sklearn.datasets import load_iris

In [14]:
data = load_iris()

In [15]:
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [16]:
data['DESCR']



In [17]:
import pandas as pd
import numpy as np

In [22]:
X = data['data']
y = data['target']
data = np.concatenate((X, y.reshape(-1, 1)), axis=1)

# Breakdown The Algorithm
This Naive Bayes tutorial is organized into 5 distinct parts, each contributing to the implementation of Naive Bayes from scratch and its application in predictive modeling problems:

- Step 1: Separate By Class.
- Step 2: Summarize Dataset.
- Step 3: Summarize Data By Class.
- Step 4: Gaussian Probability Density Function.
- Step 5: Class Probabilities.

By following these steps, you will build a solid foundation to understand and effectively utilize Naive Bayes for your own machine learning tasks.

## Step 1: Separate By Class

To calculate the probability of data based on their respective class, known as the base rate, we must initially separate our training data by their corresponding classes. This operation is relatively straightforward.

We can achieve this by creating a dictionary object where each key represents a class value, and the corresponding value is a list containing all records belonging to that class.

Below is a function called separate_by_class() that illustrates this process. It assumes that the last column in each row denotes the class value.

In [27]:
# Split the dataset by class values, returns a dictionary
def separate_by_class(dataset):
	separated = dict()
	for i in range(len(dataset)):
		vector = dataset[i]
		class_value = vector[-1]
		if (class_value not in separated):
			separated[class_value] = list()
		separated[class_value].append(vector)
	return separated

In [28]:
separated = separate_by_class(data)

In [29]:
for label in separated:
    print(label)
    for row in separated[label]:
        print(row)

0.0
[5.1 3.5 1.4 0.2 0. ]
[4.9 3.  1.4 0.2 0. ]
[4.7 3.2 1.3 0.2 0. ]
[4.6 3.1 1.5 0.2 0. ]
[5.  3.6 1.4 0.2 0. ]
[5.4 3.9 1.7 0.4 0. ]
[4.6 3.4 1.4 0.3 0. ]
[5.  3.4 1.5 0.2 0. ]
[4.4 2.9 1.4 0.2 0. ]
[4.9 3.1 1.5 0.1 0. ]
[5.4 3.7 1.5 0.2 0. ]
[4.8 3.4 1.6 0.2 0. ]
[4.8 3.  1.4 0.1 0. ]
[4.3 3.  1.1 0.1 0. ]
[5.8 4.  1.2 0.2 0. ]
[5.7 4.4 1.5 0.4 0. ]
[5.4 3.9 1.3 0.4 0. ]
[5.1 3.5 1.4 0.3 0. ]
[5.7 3.8 1.7 0.3 0. ]
[5.1 3.8 1.5 0.3 0. ]
[5.4 3.4 1.7 0.2 0. ]
[5.1 3.7 1.5 0.4 0. ]
[4.6 3.6 1.  0.2 0. ]
[5.1 3.3 1.7 0.5 0. ]
[4.8 3.4 1.9 0.2 0. ]
[5.  3.  1.6 0.2 0. ]
[5.  3.4 1.6 0.4 0. ]
[5.2 3.5 1.5 0.2 0. ]
[5.2 3.4 1.4 0.2 0. ]
[4.7 3.2 1.6 0.2 0. ]
[4.8 3.1 1.6 0.2 0. ]
[5.4 3.4 1.5 0.4 0. ]
[5.2 4.1 1.5 0.1 0. ]
[5.5 4.2 1.4 0.2 0. ]
[4.9 3.1 1.5 0.2 0. ]
[5.  3.2 1.2 0.2 0. ]
[5.5 3.5 1.3 0.2 0. ]
[4.9 3.6 1.4 0.1 0. ]
[4.4 3.  1.3 0.2 0. ]
[5.1 3.4 1.5 0.2 0. ]
[5.  3.5 1.3 0.3 0. ]
[4.5 2.3 1.3 0.3 0. ]
[4.4 3.2 1.3 0.2 0. ]
[5.  3.5 1.6 0.6 0. ]
[5.1 3.8 1.9 0.4 0. ]
...
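
The same grouping can be written more compactly with `collections.defaultdict`. This is a minimal sketch on a tiny made-up dataset; `separate_by_class_compact` and `toy_rows` are hypothetical names used only here, and the last column is again assumed to hold the class value.

```python
from collections import defaultdict

def separate_by_class_compact(dataset):
    """Group rows by the class value stored in the last column."""
    separated = defaultdict(list)
    for row in dataset:
        separated[row[-1]].append(row)
    return dict(separated)

# Tiny made-up dataset: two features followed by a class label
toy_rows = [
    [1.0, 2.0, 0],
    [1.5, 1.8, 0],
    [5.0, 8.0, 1],
]
groups = separate_by_class_compact(toy_rows)
print(groups[0])  # rows labelled 0
print(groups[1])  # rows labelled 1
```

Using `defaultdict(list)` removes the explicit check for whether a class has been seen before; the behavior is otherwise intended to match `separate_by_class()`.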

## Step 2: Summarize Dataset

To proceed with calculating probabilities in subsequent steps, we require two essential statistics from the dataset.

These statistics are the mean and the standard deviation, which describe the average value and the spread of the values around that average.

The mean, denoted as the average value, can be computed using the formula:

mean = sum(x) / n, where x represents the list of values or a specific column we are examining, and n denotes the count of values in that column.

Below is a concise function called mean() that calculates the mean of a list of numbers.

In [30]:
# Calculate the mean of a list of numbers
def mean(numbers):
	return sum(numbers)/float(len(numbers))

The sample standard deviation measures how far the values typically lie from the mean. Mathematically, it is computed as:

standard deviation = sqrt((sum i to N (x_i - mean(x))^2) / (N - 1))

This formula squares the difference between each data point and the mean, sums the squared differences, divides the sum by N - 1 (rather than N, which is what makes it the sample estimate), and finally takes the square root to return the result to the original units.

Below is a concise function named stdev() that calculates the sample standard deviation of a list of numbers. Note that it computes the mean internally. For efficiency, the mean could instead be calculated once and passed to stdev() as a parameter; this optimization is left as an optional exercise.

In [33]:
from math import sqrt

# Calculate the standard deviation of a list of numbers
def stdev(numbers):
	avg = mean(numbers)
	variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
	return sqrt(variance)
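
As a sanity check, the sample standard deviation formula above (with N - 1 in the denominator) is also what Python's standard `statistics.stdev()` computes. The sketch below compares a hand-written version against it on a made-up list; `sample_stdev` and `values` are hypothetical names used only for this check.

```python
from math import sqrt, isclose
import statistics

def sample_stdev(numbers):
    """Sample standard deviation: divide the squared deviations by N - 1."""
    avg = sum(numbers) / len(numbers)
    variance = sum((x - avg) ** 2 for x in numbers) / (len(numbers) - 1)
    return sqrt(variance)

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values
print(sample_stdev(values))
print(statistics.stdev(values))
print(isclose(sample_stdev(values), statistics.stdev(values)))
```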

To compute the mean and standard deviation statistics for each input attribute (column) in our dataset, we can follow the following steps:

1. Gather all the values of each column into a list.
2. Calculate the mean and standard deviation on that list of values.
3. Store the statistics for each column in a list or tuple.
4. Repeat the above steps for each column in the dataset.
5. Return a list of tuples containing the statistics for all columns.

Below is a function called summarize_dataset() that uses a couple of Python tricks to keep the implementation short. It calculates the statistics for each column in the dataset and returns a list of tuples, one per column, each containing the mean, the standard deviation, and the number of rows.

In [31]:
# Calculate the mean, stdev and count for each column in a dataset
def summarize_dataset(dataset):
	summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)]
	del(summaries[-1])
	return summaries

The first trick is the zip() function, which aggregates elements from each of its arguments. Passing the dataset to zip() with the * operator unpacks the dataset (a sequence of rows) so that each row becomes a separate argument. zip() then groups together the first element of every row, then the second element of every row, and so on, yielding each column of the dataset as a tuple of numbers.

Subsequently, we calculate the mean, standard deviation, and count of rows for each column. These three numbers are then combined into a tuple, and a list of these tuples is created and stored. To ensure we exclude the statistics for the class variable, we remove the corresponding tuple.
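
The zip() trick described above can be seen in isolation on a tiny made-up dataset. In this sketch, `rows`, `columns`, and `feature_columns` are hypothetical names, and slicing off the last column stands in for the `del(summaries[-1])` step.

```python
# Made-up rows: two features followed by a class label
rows = [
    [3.0, 10.0, 0],
    [4.0, 20.0, 0],
    [5.0, 30.0, 1],
]

# zip(*rows) regroups the values column by column
columns = list(zip(*rows))
print(columns)
# [(3.0, 4.0, 5.0), (10.0, 20.0, 30.0), (0, 0, 1)]

# Dropping the last column leaves only the input attributes
feature_columns = columns[:-1]
print(feature_columns)
# [(3.0, 4.0, 5.0), (10.0, 20.0, 30.0)]
```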

Let's put these functions to the test on the Iris dataset prepared earlier. The cell below summarizes every input column of the full dataset.

In [34]:
summary = summarize_dataset(data)
summary

[(5.843333333333335, 0.8280661279778629, 150),
 (3.057333333333334, 0.435866284936698, 150),
 (3.7580000000000027, 1.7652982332594667, 150),
 (1.199333333333334, 0.7622376689603465, 150)]

Now we are ready to use these functions on each group of rows in our dataset.

## Step 3: Summarize Data By Class

In this step, our goal is to obtain relevant statistics from our training dataset, organized by class values.

To achieve this, we have already developed two essential functions: 

1. `separate_by_class()`: This function effectively segregates the dataset into rows based on their respective class labels.

2. `summarize_dataset()`: This function calculates summary statistics, such as mean, standard deviation, and count, for each column in the dataset.

By combining these two functions, we can conveniently summarize the columns in the dataset, grouped by their class values.

For your convenience, the function named `summarize_by_class()` has been created to accomplish this task. It efficiently splits the dataset by class, and then computes statistics for each subset. The resulting statistics are represented as a list of tuples and stored in a dictionary, with each class value serving as the key.

In summary, `summarize_by_class()` streamlines the process of organizing the dataset by class values and deriving relevant statistics for each class, which is crucial for various classification tasks, such as building Naive Bayes classifiers or other classification models.

In [35]:
# Split dataset by class then calculate statistics for each column
def summarize_by_class(dataset):
	separated = separate_by_class(dataset)
	summaries = dict()
	for class_value, rows in separated.items():
		summaries[class_value] = summarize_dataset(rows)
	return summaries

In [36]:
summary = summarize_by_class(data)
for label in summary:
    print(label)
    for row in summary[label]:
        print(row)

0.0
(5.005999999999999, 0.3524896872134512, 50)
(3.428000000000001, 0.3790643690962886, 50)
(1.4620000000000002, 0.1736639964801841, 50)
(0.2459999999999999, 0.10538558938004569, 50)
1.0
(5.936, 0.5161711470638635, 50)
(2.7700000000000005, 0.3137983233784114, 50)
(4.26, 0.46991097723995806, 50)
(1.3259999999999998, 0.197752680004544, 50)
2.0
(6.587999999999998, 0.635879593274432, 50)
(2.9739999999999998, 0.3224966381726376, 50)
(5.552, 0.5518946956639835, 50)
(2.026, 0.27465005563666733, 50)


## Step 4: Gaussian Probability Density Function

Calculating the probability or likelihood of observing a given real value, such as X1, is difficult to do directly. One approach is to assume that the X1 values are drawn from a known distribution, such as a bell curve (Gaussian distribution).

A Gaussian distribution can be characterized by just two numbers: the mean and the standard deviation. With these parameters, we can use a mathematical function called the Gaussian Probability Density Function (PDF) to estimate the probability of a given value.

The Gaussian PDF can be expressed as follows:

f(x) = (1 / (sqrt(2 * PI) * sigma)) * exp(-((x - mean)^2 / (2 * sigma^2)))

Where:
- sigma is the standard deviation of the variable x.
- mean is the mean value of the variable x.
- PI is the mathematical constant representing the value of π (approximately 3.14159).

Implementing this function can be done as follows:

In [37]:
# Example of Gaussian PDF
from math import sqrt
from math import pi
from math import exp

# Calculate the Gaussian probability density function for x
def calculate_probability(x, mean, stdev):
	exponent = exp(-((x-mean)**2 / (2 * stdev**2 )))
	return (1 / (sqrt(2 * pi) * stdev)) * exponent

# Test Gaussian PDF
print(calculate_probability(1.0, 1.0, 1.0))
print(calculate_probability(2.0, 1.0, 1.0))
print(calculate_probability(0.0, 1.0, 1.0))

0.3989422804014327
0.24197072451914337
0.24197072451914337


Running the cell prints the density for three input values. When the input value is 1 and both the mean and standard deviation are 1, the input sits at the peak of the bell curve and is the most likely value, with a density of about 0.399.

If we keep the same statistics and move the input 1 standard deviation away from the mean in either direction (2 and 0, which are equidistant from the center of the bell curve), the two densities are identical at about 0.242. Strictly speaking these values are probability densities rather than probabilities, which is sufficient here because we only compare them across classes.
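
One way to double-check the formula is to compare it with `statistics.NormalDist` from the Python standard library (available from Python 3.8). The sketch below is only a cross-check; `gaussian_pdf` is a hypothetical name for a function equivalent to `calculate_probability()`.

```python
from math import sqrt, pi, exp, isclose
from statistics import NormalDist

def gaussian_pdf(x, mean, stdev):
    """Gaussian density at x for the given mean and standard deviation."""
    exponent = exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (sqrt(2 * pi) * stdev)

# Compare against the standard library for the three test inputs
for x in (0.0, 1.0, 2.0):
    ours = gaussian_pdf(x, 1.0, 1.0)
    reference = NormalDist(mu=1.0, sigma=1.0).pdf(x)
    print(x, ours, isclose(ours, reference))

# With a small standard deviation the density at the mean exceeds 1,
# which shows that it is a density and not a probability.
print(gaussian_pdf(0.0, 0.0, 0.1))
```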

## Step 5: Class Probabilities

We will now use the statistics calculated from our training data to estimate probabilities for new data.

Probabilities are calculated separately for each class, meaning that we calculate the probability that a new piece of data belongs to the first class, then calculate probabilities for the second class, and so on for all classes.

The probability that a piece of data belongs to a class is calculated as follows:

P(class|data) = P(X|class) * P(class)

You may notice that this is different from the Bayes Theorem described above. The division has been removed to simplify the calculation. Although the result is no longer strictly a probability of the data belonging to a class, we still maximize the value, meaning that the calculation for the class that results in the largest value is taken as the prediction. This is a common implementation simplification, as we are often more interested in the class prediction rather than the probability.

The input variables are treated separately, which gives the technique its name "naive". For example, with 2 input variables X1 and X2, the probability that a row belongs to the first class (class 0) is calculated as:

P(class=0|X1,X2) = P(X1|class=0) * P(X2|class=0) * P(class=0)

Now you can see why we need to separate the data by class value. The Gaussian Probability Density function in the previous step is how we calculate the probability of a real value like X1, and the statistics we prepared are used in this calculation.

Below is a function named calculate_class_probabilities() that ties all of this together.

It takes a set of prepared summaries and a new row as input arguments.

First, the total number of training records is calculated from the counts stored in the summary statistics. This is used in the calculation of the probability of a given class (P(class)) as the ratio of rows with a given class to all rows in the training data.

Next, probabilities are calculated for each input value in the row using the Gaussian probability density function and the statistics for that column and of that class. Probabilities are multiplied together as they are accumulated.

This process is repeated for each class in the dataset.

Finally, a dictionary of probabilities is returned with one entry for each class.

In [38]:
# Calculate the probabilities of predicting each class for a given row
def calculate_class_probabilities(summaries, row):
	total_rows = sum([summaries[label][0][2] for label in summaries])
	probabilities = dict()
	for class_value, class_summaries in summaries.items():
		probabilities[class_value] = summaries[class_value][0][2]/float(total_rows)
		for i in range(len(class_summaries)):
			mean, stdev, count = class_summaries[i]
			probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
	return probabilities

In [41]:
probabilities = calculate_class_probabilities(summary, data[0])
print(probabilities)

{0.0: 2.7915339171768885, 1.0: 8.322426199968131e-18, 2.0: 6.008422572010989e-25}
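
The scores above are not probabilities (the value for class 0 is greater than 1) because the division by P(data) was dropped. If actual probabilities are wanted, one option is to divide each score by the sum of all scores. The sketch below does this for the values printed above; `scores` and `posteriors` are hypothetical names.

```python
# Unnormalized class scores, copied from the output above
scores = {
    0.0: 2.7915339171768885,
    1.0: 8.322426199968131e-18,
    2.0: 6.008422572010989e-25,
}

# Dividing by the sum restores the missing P(data) term of Bayes' Theorem
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}

print(posteriors)
print(max(posteriors, key=posteriors.get))  # predicted class is unchanged: 0.0
```

Normalizing rescales every score by the same factor, so the class with the largest value stays the same.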


In [53]:
# Predict the class for a given row
def predict(summaries, row):
    probabilities = calculate_class_probabilities(summaries, row)
    best_label, best_prob = None, -1
    for class_value, probability in probabilities.items():
        if best_label is None or probability > best_prob:
            best_prob = probability
            best_label = class_value
    return best_label

In [82]:
model = summarize_by_class(data)
# define a new record
row = [5.0 , 3.5, 1.3, 0.3]
# predict the label
label = predict(model, row)

In [83]:
print('Data=%s, Predicted: %s' % (row, label))

Data=[5.0, 3.5, 1.3, 0.3], Predicted: 0.0
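
Multiplying many small densities can underflow to zero when there are many input variables. A common workaround, not used in the code above, is to add log-densities instead of multiplying densities. The sketch below assumes summaries in the same `(mean, stdev, count)` format; `log_gaussian_pdf`, `calculate_class_log_scores`, and `rounded_summaries` are hypothetical names, and the statistics are rounded from the Step 3 output.

```python
from math import log, pi

def log_gaussian_pdf(x, mean, stdev):
    """Natural log of the Gaussian density, computed without calling exp()."""
    return -((x - mean) ** 2) / (2 * stdev ** 2) - log(stdev) - 0.5 * log(2 * pi)

def calculate_class_log_scores(summaries, row):
    """Return log(P(class)) plus the sum of log densities for each class."""
    total_rows = sum(class_summaries[0][2] for class_summaries in summaries.values())
    log_scores = {}
    for class_value, class_summaries in summaries.items():
        log_scores[class_value] = log(class_summaries[0][2] / total_rows)
        for value, (mean, stdev, _count) in zip(row, class_summaries):
            log_scores[class_value] += log_gaussian_pdf(value, mean, stdev)
    return log_scores

# Per-class (mean, stdev, count) tuples, rounded from the Step 3 output
rounded_summaries = {
    0.0: [(5.006, 0.352, 50), (3.428, 0.379, 50), (1.462, 0.174, 50), (0.246, 0.105, 50)],
    1.0: [(5.936, 0.516, 50), (2.770, 0.314, 50), (4.260, 0.470, 50), (1.326, 0.198, 50)],
    2.0: [(6.588, 0.636, 50), (2.974, 0.322, 50), (5.552, 0.552, 50), (2.026, 0.275, 50)],
}

log_scores = calculate_class_log_scores(rounded_summaries, [5.0, 3.5, 1.3, 0.3])
print(log_scores)
print(max(log_scores, key=log_scores.get))  # 0.0, the same prediction as above
```

Because the logarithm is monotonic, the class with the largest log score is the same class that would have the largest product.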


## Comparison

In [81]:
# Make Predictions with Naive Bayes On The Iris Dataset
from csv import reader
from math import sqrt
from math import exp
from math import pi

# Load a CSV file
# def load_csv(filename):
# 	dataset = list()
# 	with open(filename, 'r') as file:
# 		csv_reader = reader(file)
# 		for row in csv_reader:
# 			if not row:
# 				continue
# 			dataset.append(row)
# 	return dataset

# Convert string column to float
# def str_column_to_float(dataset, column):
# 	for row in dataset:
# 		row[column] = float(row[column].strip())

# # Convert string column to integer
# def str_column_to_int(dataset, column):
# 	class_values = [row[column] for row in dataset]
# 	unique = set(class_values)
# 	lookup = dict()
# 	for i, value in enumerate(unique):
# 		lookup[value] = i
# 		print('[%s] => %d' % (value, i))
# 	for row in dataset:
# 		row[column] = lookup[row[column]]
# 	return lookup

# Split the dataset by class values, returns a dictionary
def separate_by_class(dataset):
	separated = dict()
	for i in range(len(dataset)):
		vector = dataset[i]
		class_value = vector[-1]
		if (class_value not in separated):
			separated[class_value] = list()
		separated[class_value].append(vector)
	return separated

# Calculate the mean of a list of numbers
def mean(numbers):
	return sum(numbers)/float(len(numbers))

# Calculate the standard deviation of a list of numbers
def stdev(numbers):
	avg = mean(numbers)
	variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
	return sqrt(variance)

# Calculate the mean, stdev and count for each column in a dataset
def summarize_dataset(dataset):
	summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)]
	del(summaries[-1])
	return summaries

# Split dataset by class then calculate statistics for each column
def summarize_by_class(dataset):
	separated = separate_by_class(dataset)
	summaries = dict()
	for class_value, rows in separated.items():
		summaries[class_value] = summarize_dataset(rows)
	return summaries

# Calculate the Gaussian probability density function for x
def calculate_probability(x, mean, stdev):
	exponent = exp(-((x-mean)**2 / (2 * stdev**2 )))
	return (1 / (sqrt(2 * pi) * stdev)) * exponent

# Calculate the probabilities of predicting each class for a given row
def calculate_class_probabilities(summaries, row):
	total_rows = sum([summaries[label][0][2] for label in summaries])
	probabilities = dict()
	for class_value, class_summaries in summaries.items():
		probabilities[class_value] = summaries[class_value][0][2]/float(total_rows)
		for i in range(len(class_summaries)):
			mean, stdev, _ = class_summaries[i]
			probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
	return probabilities

# Predict the class for a given row
def predict(summaries, row):
	probabilities = calculate_class_probabilities(summaries, row)
	best_label, best_prob = None, -1
	for class_value, probability in probabilities.items():
		if best_label is None or probability > best_prob:
			best_prob = probability
			best_label = class_value
	return best_label

# # Make a prediction with Naive Bayes on Iris Dataset
# filename = 'iris.csv'
# dataset = load_csv(filename)
# for i in range(len(dataset[0])-1):
# 	str_column_to_float(dataset, i)
# convert class column to integers
# str_column_to_int(dataset, len(dataset[0])-1)
# fit model
model = summarize_by_class(data)
# define a new record
row = [5. , 3.5, 1.3, 0.3]
# predict the label
label = predict(model, row)
print('Data=%s, Predicted: %s' % (row, label))

Data=[5.0, 3.5, 1.3, 0.3], Predicted: 0.0


# Comparison With the Scikit-Learn Library

In [72]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

In [73]:
X, y = load_iris(return_X_y=True)

In [74]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

In [75]:
gnb = GaussianNB()

In [76]:
y_pred = gnb.fit(X_train, y_train).predict(X_test)

In [79]:

# Predict Output
predicted = gnb.predict([X_test[6]])

print("Actual Value:", y_test[6])
print("Predicted Value:", predicted[0])

Actual Value: 0
Predicted Value: 0


In [80]:
X_test[6]

array([5. , 3.5, 1.3, 0.3])

In [84]:
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    f1_score,
)

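y_pred = gnb.predict(X_test)
# Note: scikit-learn documents the argument order as (y_true, y_pred).
# Accuracy is symmetric in its arguments, but the weighted F1 score is not.
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test, average="weighted")

print("Accuracy:", accuracy)
print("F1 Score:", f1)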

Accuracy: 0.9466666666666667
F1 Score: 0.9474242424242425
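
Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it, along with simple confusion counts, in plain Python on made-up labels; `accuracy` and `confusion_counts` are hypothetical helpers written for illustration, not scikit-learn functions.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

# Made-up labels for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

print(accuracy(y_true, y_pred))  # 5 of 6 predictions are correct
print(confusion_counts(y_true, y_pred)[(1, 2)])  # 1: one class-1 sample predicted as class 2
```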
