# Naive Bayes Classification from Scratch

In this notebook, we'll implement the Naive Bayes algorithm from scratch. We'll go through each step in detail, explaining the concepts and code to ensure a thorough understanding.

## Importing Libraries

First, we'll import the necessary libraries.

In [36]:
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from collections import defaultdict
import matplotlib.pyplot as plt

In [40]:
d1 = {}
d1[1]

KeyError: 1

In [48]:
def defaultValue():
    return 0

d2 = defaultdict(defaultValue)
d2[3]

0
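
The two cells above contrast a plain `dict`, which raises `KeyError` for a missing key, with `defaultdict`, which calls a factory function to create the missing value. As a minimal sketch, the built-in `int` can serve as that factory, since `int()` returns 0. The label list below is made up purely for illustration.

```python
from collections import defaultdict

# int() returns 0, so it can stand in for the defaultValue function above
counts = defaultdict(int)

# Hypothetical labels, used only to illustrate counting
labels = ["setosa", "virginica", "setosa"]
for label in labels:
    counts[label] += 1  # a missing key starts at 0 before the increment

print(dict(counts))
```

Note that merely reading a missing key, as in `d2[3]` above, also inserts that key into the dictionary.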

## Loading the Dataset

We'll use the Iris dataset for our Naive Bayes implementation. Let's load and inspect the dataset.

In [50]:
# Loading the Iris dataset
from sklearn.datasets import load_iris

iris = load_iris()
data = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                    columns=iris['feature_names'] + ['species'])

# Mapping target labels to iris species
data['species'] = data['species'].map({0: 'Iris-setosa', 1: 'Iris-versicolor', 2: 'Iris-virginica'})

# Displaying the first few rows of the dataset
print(data.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

       species  
0  Iris-setosa  
1  Iris-setosa  
2  Iris-setosa  
3  Iris-setosa  
4  Iris-setosa  


## Splitting the Dataset

We'll split the dataset into training and testing sets to evaluate our Naive Bayes model.

In [52]:
# Splitting the dataset into training and testing sets
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## Naive Bayes Classifier

The Naive Bayes classifier is based on Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable.

### Bayes' Theorem

Bayes' theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. The formula is:

\[
P(A|B) = \frac{P(B|A) \, P(A)}{P(B)}
\]

Where:
- \( P(A|B) \) is the posterior probability of the class (target) given the predictor (attribute).
- \( P(B|A) \) is the likelihood: the probability of the predictor given the class.
- \( P(A) \) is the prior probability of the class.
- \( P(B) \) is the marginal probability of the predictor, also called the evidence.
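
To make the formula concrete, here is a small worked example. All numbers are hypothetical and chosen only for illustration; \( P(B) \) is obtained from the law of total probability over the two cases "A" and "not A".

```python
# Hypothetical numbers chosen only for illustration
p_a = 0.3              # P(A): prior probability of class A
p_b_given_a = 0.8      # P(B|A): likelihood of observing B when the class is A
p_b_given_not_a = 0.2  # P(B|not A): likelihood of observing B otherwise

# P(B) via the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: posterior = likelihood * prior / evidence
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))
```

Observing B raises the probability of A from the prior of 0.3 to roughly 0.63, because B is four times more likely under A than under "not A".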

### Naive Bayes Assumption

The "naive" assumption is that, within a class, the value of one feature tells us nothing about the value of any other feature. In other words, the features are conditionally independent given the class.
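
Written out for a feature vector \( (x_1, \dots, x_n) \) and a class \( y \) (symbols introduced here for illustration), the assumption lets the joint likelihood factor into per-feature terms, so the posterior is proportional to the prior times a product of one-dimensional likelihoods:

$$ P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y) \quad\Longrightarrow\quad P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y) $$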

### Gaussian Naive Bayes

For continuous data, we assume that each feature follows a Gaussian (normal) distribution within each class. The probability density function of a normal distribution is given by:

$$ P(x|\mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

Where \( \mu \) is the mean and \( \sigma \) is the standard deviation.
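
As a quick sanity check of the formula, the sketch below evaluates it with the standard library only. The function name `gaussian_pdf` is a hypothetical helper for this example. It also illustrates that a density, unlike a probability, can exceed 1 when \( \sigma \) is small.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# At the mean of a standard normal the density equals 1 / sqrt(2 * pi)
peak_standard = gaussian_pdf(0.0, 0.0, 1.0)

# With a small sigma the density at the mean exceeds 1,
# which is valid for a density but impossible for a probability
peak_narrow = gaussian_pdf(0.0, 0.0, 0.1)

print(peak_standard, peak_narrow)
```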


In [60]:
np.unique(y)

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [12]:
# Function to calculate, for each class, the mean and (population) standard deviation of every feature
def summarize_by_class(X, y):
    summaries = defaultdict(list)
    for label in np.unique(y):
        features = X[y == label]
        summaries[label] = [(np.mean(feature), np.std(feature)) for feature in zip(*features)]
    return summaries
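
The same per-class summaries can be computed column-wise with NumPy's `axis` argument. The following is a sketch of an alternative, not the notebook's own function: the name `summarize_by_class_vectorized` and the toy data are assumptions made for illustration. Like `np.std` above, `ndarray.std` defaults to the population standard deviation (`ddof=0`).

```python
import numpy as np

def summarize_by_class_vectorized(X, y):
    """Return {label: (means, stds)} with one array entry per feature column."""
    summaries = {}
    for label in np.unique(y):
        rows = X[y == label]
        # axis=0 aggregates down each column, i.e. once per feature
        summaries[label] = (rows.mean(axis=0), rows.std(axis=0))
    return summaries

# Toy data, made up for illustration: two features, two classes
X_toy = np.array([[1.0, 10.0], [3.0, 14.0], [5.0, 20.0], [7.0, 20.0]])
y_toy = np.array(["a", "a", "b", "b"])

toy_summaries = summarize_by_class_vectorized(X_toy, y_toy)
print(toy_summaries["a"])
```

In the toy data, the second feature of class "b" has a standard deviation of 0. Plugging that into the Gaussian density would divide by zero, which is one reason library implementations typically add a small smoothing term to the variance.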

In [68]:
X_test

array([[6.1, 2.8, 4.7, 1.2],
       [5.7, 3.8, 1.7, 0.3],
       [7.7, 2.6, 6.9, 2.3],
       [6. , 2.9, 4.5, 1.5],
       [6.8, 2.8, 4.8, 1.4],
       [5.4, 3.4, 1.5, 0.4],
       [5.6, 2.9, 3.6, 1.3],
       [6.9, 3.1, 5.1, 2.3],
       [6.2, 2.2, 4.5, 1.5],
       [5.8, 2.7, 3.9, 1.2],
       [6.5, 3.2, 5.1, 2. ],
       [4.8, 3. , 1.4, 0.1],
       [5.5, 3.5, 1.3, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.1, 3.8, 1.5, 0.3],
       [6.3, 3.3, 4.7, 1.6],
       [6.5, 3. , 5.8, 2.2],
       [5.6, 2.5, 3.9, 1.1],
       [5.7, 2.8, 4.5, 1.3],
       [6.4, 2.8, 5.6, 2.2],
       [4.7, 3.2, 1.6, 0.2],
       [6.1, 3. , 4.9, 1.8],
       [5. , 3.4, 1.6, 0.4],
       [6.4, 2.8, 5.6, 2.1],
       [7.9, 3.8, 6.4, 2. ],
       [6.7, 3. , 5.2, 2.3],
       [6.7, 2.5, 5.8, 1.8],
       [6.8, 3.2, 5.9, 2.3],
       [4.8, 3. , 1.4, 0.3],
       [4.8, 3.1, 1.6, 0.2],
       [4.6, 3.6, 1. , 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [6.7, 3.1, 4.4, 1.4],
       [4.8, 3.4, 1.6, 0.2],
       [4.4, 3

## Calculating Class Probabilities

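Using the summaries (mean and standard deviation), we can score how likely a data point is under each class. The functions below multiply the per-feature Gaussian densities for each class. They do not multiply by the class prior \( P(A) \), which is equivalent to treating all classes as equally likely beforehand, and the results are not normalized, so they are relative scores rather than probabilities that sum to 1.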

In [14]:
# Function to calculate the Gaussian probability density function
def calculate_probability(x, mean, stdev):
    exponent = np.exp(-((x - mean) ** 2 / (2 * stdev ** 2)))
    return (1 / (np.sqrt(2 * np.pi) * stdev)) * exponent

In [15]:
# Function to calculate the class probabilities for a given input
def calculate_class_probabilities(summaries, input_vector):
    probabilities = {}
    for class_value, class_summaries in summaries.items():
        probabilities[class_value] = 1
        for i in range(len(class_summaries)):
            mean, stdev = class_summaries[i]
            x = input_vector[i]
            probabilities[class_value] *= calculate_probability(x, mean, stdev)
    return probabilities
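
Multiplying many small densities can underflow to 0.0 in floating point. A common remedy is to add log densities instead of multiplying densities, which preserves the ranking of the classes. The sketch below shows that idea under stated assumptions: `log_gaussian_pdf` and `class_log_scores` are hypothetical names, and the summaries are toy values in the same `(mean, stdev)` layout used above.

```python
import math

def log_gaussian_pdf(x, mean, stdev):
    """Logarithm of the Gaussian density, computed without calling exp."""
    return -0.5 * math.log(2 * math.pi * stdev ** 2) - (x - mean) ** 2 / (2 * stdev ** 2)

def class_log_scores(summaries, input_vector):
    """Sum the per-feature log densities for every class."""
    scores = {}
    for class_value, class_summaries in summaries.items():
        scores[class_value] = sum(
            log_gaussian_pdf(x, mean, stdev)
            for x, (mean, stdev) in zip(input_vector, class_summaries)
        )
    return scores

# Hypothetical summaries: (mean, stdev) per feature for two classes
toy_summaries = {
    "near": [(0.0, 1.0), (0.0, 1.0)],
    "far": [(100.0, 1.0), (100.0, 1.0)],
}
scores = class_log_scores(toy_summaries, [0.0, 0.0])
print(scores)
```

The score for the "far" class stays finite in log space, whereas exponentiating it gives exactly 0.0, so two such classes could no longer be told apart in probability space.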

In [104]:
class_

dict_items([('Iris-setosa', [(4.964516129032259, 0.3346142575455468), (3.3774193548387097, 0.36957663435023635), (1.4645161290322577, 0.18236508419487185), (0.24838709677419357, 0.10737623856189187)]), ('Iris-versicolor', [(5.8621621621621625, 0.524714260130706), (2.724324324324324, 0.29537461821220007), (4.210810810810811, 0.48922651791278887), (1.3027027027027025, 0.20333235539247596)]), ('Iris-virginica', [(6.559459459459459, 0.6499311645295484), (2.986486486486486, 0.310328650689888), (5.545945945945946, 0.5370563383777094), (2.005405405405405, 0.29311555321809374)])])

## Making Predictions

To make a prediction, we select the class with the highest score.

In [106]:
# Function to make a prediction for a given input
def predict(summaries, input_vector):
    probabilities = calculate_class_probabilities(summaries, input_vector)
    best_label, best_prob = None, -1
    print(probabilities)
    for class_value, probability in probabilities.items():
        if best_label is None or probability > best_prob:
            best_prob = probability
            best_label = class_value
    return best_label
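
The scores returned by `calculate_class_probabilities` are products of densities, so they do not sum to 1. If posterior estimates are wanted, one option is to weight each score by a class prior and normalize. The sketch below assumes a hypothetical helper `posterior_from_scores` and made-up scores and priors; it also shows `max` with a `key` as a compact way to pick the best label.

```python
def posterior_from_scores(scores, priors):
    """Weight each class score by its prior and normalize so the results sum to 1."""
    weighted = {label: scores[label] * priors[label] for label in scores}
    total = sum(weighted.values())
    return {label: value / total for label, value in weighted.items()}

# Hypothetical likelihood scores and priors for two classes
scores = {"a": 0.6, "b": 0.2}
priors = {"a": 0.5, "b": 0.5}

posteriors = posterior_from_scores(scores, priors)

# Pick the label whose posterior is largest
best_label = max(posteriors, key=posteriors.get)
print(posteriors, best_label)
```

Normalizing does not change which class wins, so it matters only when the probabilities themselves are of interest.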

## Training the Model

We'll use the training data to calculate the summaries (mean and standard deviation) for each class.

In [19]:
# Training the Naive Bayes model
summaries = summarize_by_class(X_train, y_train)
summaries

defaultdict(list,
            {'Iris-setosa': [(4.964516129032259, 0.3346142575455468),
              (3.3774193548387097, 0.36957663435023635),
              (1.4645161290322577, 0.18236508419487185),
              (0.24838709677419357, 0.10737623856189187)],
             'Iris-versicolor': [(5.8621621621621625, 0.524714260130706),
              (2.724324324324324, 0.29537461821220007),
              (4.210810810810811, 0.48922651791278887),
              (1.3027027027027025, 0.20333235539247596)],
             'Iris-virginica': [(6.559459459459459, 0.6499311645295484),
              (2.986486486486486, 0.310328650689888),
              (5.545945945945946, 0.5370563383777094),
              (2.005405405405405, 0.29311555321809374)]})

## Evaluating the Model

We'll use the testing data to evaluate the accuracy of our Naive Bayes model.

In [116]:
# Function to evaluate the accuracy of the model
def evaluate_model(summaries, X_test, y_test):
    correct = 0
    y_pred = []
    for i in range(len(X_test)):
        input_vector = X_test[i]
        true_label = y_test[i]
        predicted_label = predict(summaries, input_vector)
        print("Predicted", predicted_label, "True", true_label)
        if predicted_label == true_label:
            correct += 1
        y_pred.append(predicted_label)
    accuracy = correct / len(X_test)
    return accuracy, y_pred

# Evaluating the model
accuracy, y_pred = evaluate_model(summaries, X_test, y_test)
print(f'Accuracy: {accuracy}')

{'Iris-setosa': 3.819364566872026e-88, 'Iris-versicolor': 0.7660318658826788, 'Iris-virginica': 0.0034412108459275525}
Predicted Iris-versicolor True Iris-versicolor
{'Iris-setosa': 0.1880670217551684, 'Iris-versicolor': 2.0646523299900197e-14, 'Iris-virginica': 3.494454015975696e-21}
Predicted Iris-setosa True Iris-setosa
{'Iris-setosa': 2.3233632571942632e-287, 'Iris-versicolor': 5.348993080940257e-15, 'Iris-virginica': 0.001980094348251244}
Predicted Iris-virginica True Iris-virginica
{'Iris-setosa': 8.152283962643816e-92, 'Iris-versicolor': 0.6974409547337572, 'Iris-virginica': 0.017984777902018273}
Predicted Iris-versicolor True Iris-versicolor
{'Iris-setosa': 2.1672636827285892e-104, 'Iris-versicolor': 0.13899956541713993, 'Iris-virginica': 0.028085151216358836}
Predicted Iris-versicolor True Iris-versicolor
{'Iris-setosa': 1.6210218247921802, 'Iris-versicolor': 9.204114449941676e-13, 'Iris-virginica': 9.709314209511034e-21}
Predicted Iris-setosa True Iris-setosa
{'Iris-setosa': 

In [118]:
pd.crosstab(y_test,y_pred)

col_0            Iris-setosa  Iris-versicolor  Iris-virginica
row_0
Iris-setosa               19                0               0
Iris-versicolor            0               12               1
Iris-virginica             0                0              13
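
Overall accuracy and per-class recall can be read off this table directly. The sketch below copies the counts from the crosstab above into an array, with rows as true classes and columns as predicted classes.

```python
import numpy as np

# Confusion matrix copied from the crosstab above
# (rows: true class, columns: predicted class)
confusion = np.array([
    [19, 0, 0],   # Iris-setosa
    [0, 12, 1],   # Iris-versicolor
    [0, 0, 13],   # Iris-virginica
])

# Correct predictions lie on the diagonal
accuracy = np.trace(confusion) / confusion.sum()

# Recall: fraction of each true class that was predicted correctly
recall_per_class = np.diag(confusion) / confusion.sum(axis=1)
print(accuracy, recall_per_class)
```

With 44 of the 45 test samples on the diagonal, the accuracy is about 0.978. The single error is an Iris-versicolor sample predicted as Iris-virginica.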


## Conclusion

In this notebook, we've implemented the Naive Bayes algorithm from scratch. We've covered the following steps:
- Understanding Bayes' theorem and the naive assumption.
- Calculating the mean and standard deviation for each feature in the training data.
- Calculating class probabilities using the Gaussian probability density function.
- Making predictions by selecting the class with the highest probability.
- Evaluating the accuracy of our model on the testing data.

Naive Bayes is a simple yet powerful algorithm that performs well on a variety of tasks, especially those involving text classification.

In [70]:
from sklearn.naive_bayes import GaussianNB

In [76]:
nb = GaussianNB()

nb.fit(X_train,y_train)

y_pred = nb.predict(X_test)

In [84]:
pd.crosstab(y_test,y_pred)

col_0            Iris-setosa  Iris-versicolor  Iris-virginica
row_0
Iris-setosa               19                0               0
Iris-versicolor            0               12               1
Iris-virginica             0                0              13


In [88]:
nb.classes_

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype='<U15')

In [92]:
nb.class_prior_

array([0.2952381 , 0.35238095, 0.35238095])
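
`class_prior_` holds the relative frequency of each class in the training data. The counts in the sketch below are not printed anywhere in the notebook; they are inferred from the priors above on the assumption of 105 training rows (70% of the 150 Iris samples).

```python
import numpy as np

# Training-set class counts inferred from the priors above (an assumption:
# 105 training rows, i.e. 70% of the 150 Iris samples)
class_counts = np.array([31, 37, 37])

# Each prior is the class count divided by the total number of training rows
priors = class_counts / class_counts.sum()
print(priors)
```

`GaussianNB` multiplies these priors into its class scores, whereas the from-scratch `calculate_class_probabilities` starts every class at 1 and so leaves them out. On this split the two confusion matrices nevertheless agree.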
