<a href="https://colab.research.google.com/github/m607stars/Machine-Learning-Algorithms/blob/master/Naive_Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing Libraries and data

In [None]:
#Import the necessary libraries

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

from math import sqrt, pi, exp

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix


In [None]:
#Mount Google Drive on colab Notebook 

from google.colab import drive 
drive.mount('/content/drive', force_remount=True)

In [None]:
#Load Dataset from Google Drive to notebook into a dataframe named data

data=pd.read_csv('/content/drive/My Drive/Datasets /Iris Data Set.csv')
data.drop(['Id'], axis=1, inplace=True)
data.head(3)

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa


# **Naive Bayes Algorithm** 
For simplicity, we implement Naive Bayes from scratch on a small synthetic dataset with only two features and two classes. This keeps the implementation straightforward and easy to follow.

## Step 1: Define the necessary functions



In [None]:
# Calculate the mean of a list of numbers
def mean(numbers):
  return sum(numbers) / len(numbers)

# Calculate the sample standard deviation of a list of numbers (divides by n-1)
def stdev(numbers):
  avg = mean(numbers)
  variance = sum((x - avg) ** 2 for x in numbers) / (len(numbers) - 1)
  return sqrt(variance)

# Summarize the dataset provided, i.e. calculate the mean, stdev and row count for each feature
# Note that summarize() receives a dataframe, so each feature column is converted to a list before calling mean() and stdev()
def summarize(dataframe):
  summary = []                                       #Create empty list
  n = len(dataframe)                                 #Number of rows in the dataframe
  for i in range(len(dataframe.columns) - 1):        #Iterate over the feature columns (the last column, Species, is excluded)
    values = dataframe.iloc[:, i].tolist()           #Convert the feature column to a list (avoid naming it `list`, which shadows the built-in)
    stats = [mean(values), stdev(values), n]         #Store the mean, stdev and count of this feature in a list
    summary.append(stats)                            #Append the list to the summary holding the other features' lists
  return summary

In [None]:
#Testing on our sample Data

sample_data = [[3.393533211,2.331273381,0],
	[3.110073483,1.781539638,0],
	[1.343808831,3.368360954,0],
	[3.582294042,4.67917911,0],
	[2.280362439,2.866990263,0],
	[7.423436942,4.696522875,1],
	[5.745051997,3.533989803,1],
	[9.172168622,2.511101045,1],
	[7.792783481,3.424088941,1],
	[7.939820817,0.791637231,1]]

#Converting the list data to dataframe for easy implementation 
df = pd.DataFrame(sample_data, columns = ['a', 'b','Species']) 
k = summarize(df) 

#print the summaries
print(k)

[[5.178333386499999, 2.7665845055177263, 10], [2.9984683241, 1.218556343617447, 10]]
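
As a cross-check, the same per-feature statistics can be computed with NumPy. This is only a sketch for comparison, not part of the from-scratch implementation; the variable names `features`, `means` and `stdevs` are introduced here for illustration. Note that `ddof=1` is needed so that NumPy divides by n-1, as `stdev` does.

```python
import numpy as np

# Same ten sample rows as above: two features followed by the class label.
sample_data = [[3.393533211, 2.331273381, 0],
               [3.110073483, 1.781539638, 0],
               [1.343808831, 3.368360954, 0],
               [3.582294042, 4.67917911, 0],
               [2.280362439, 2.866990263, 0],
               [7.423436942, 4.696522875, 1],
               [5.745051997, 3.533989803, 1],
               [9.172168622, 2.511101045, 1],
               [7.792783481, 3.424088941, 1],
               [7.939820817, 0.791637231, 1]]

# Keep only the two feature columns; the last column is the label.
features = np.array(sample_data)[:, :2]

means = features.mean(axis=0)
stdevs = features.std(axis=0, ddof=1)  # ddof=1: sample standard deviation (divide by n-1)

print(means)
print(stdevs)
```

The printed values should agree with the summaries above up to floating-point rounding.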


## Step 2: Gaussian Probability Function 

In [None]:
#implementing the Gaussian probability distribution function 

def gaussian_probability(x,mean,stdev):
  power = exp(-((x-mean)**2)/(2*(stdev**2)))
  return (1/(sqrt(2*pi)*stdev)) * power

In [None]:
#Testing Gaussian Probability Function on simple values

print(gaussian_probability(1.0, 1.0, 1.0))

0.3989422804014327
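
One point worth keeping in mind: the function returns a probability density, not a probability. The value printed above is the peak of the standard normal curve, 1/sqrt(2*pi), and for a small standard deviation the density can exceed 1. The short sketch below repeats the function so it runs on its own; `peak` and `narrow_peak` are names introduced here for illustration.

```python
from math import sqrt, pi, exp

def gaussian_probability(x, mean, stdev):
    # Gaussian density evaluated at x
    power = exp(-((x - mean) ** 2) / (2 * (stdev ** 2)))
    return (1 / (sqrt(2 * pi) * stdev)) * power

# At x == mean the exponent is 0, so the density equals 1 / (sqrt(2*pi) * stdev).
peak = gaussian_probability(1.0, 1.0, 1.0)

# With a narrow distribution (stdev = 0.1) the density at the mean is larger than 1.
narrow_peak = gaussian_probability(0.0, 0.0, 0.1)

print(peak, narrow_peak)
```

This does not affect classification, because only the relative size of the class scores matters.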


## Step 3: Collecting all together in Naive Bayes and final output

For demonstration purposes, we will make our Naive Bayes function calculate the probability of the datapoint belonging to only one class at a time. This means that we need to call our Naive Bayes function as many times as the number of classes.
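
Written out, the quantity computed for one class is the class prior times the product of the per-feature Gaussian densities. The notation below is introduced here for illustration: c is the class, x_i is the value of feature i, and mu and sigma are the mean and standard deviation of feature i within class c.

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n}
  \frac{1}{\sqrt{2\pi}\,\sigma_{c,i}}
  \exp\!\left( -\frac{(x_i - \mu_{c,i})^2}{2\,\sigma_{c,i}^2} \right)
```

The predicted class is the one for which this expression is largest.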

In [None]:
#Defining the Naive Bayes function, which calculates the score (prior times likelihood) of the datapoint belonging to a particular class

def naive_bayes(summaries,test_value):
  p=1               
  for i in range (len(summaries)):   #We iterate as many times as the number of features 
    temp = gaussian_probability(test_value[i],summaries[i][0],summaries[i][1])     #We calculate gaussian probability of each feature 
    p = p*temp                          #We obtain the product of gaussian probabilities of all the features as we iterate through the for loop
  p = p * 0.5              #We multiply by 0.5 as there are equal no. of datapoints in both the classes (i.e. 5 datapoints in class 0 and 5 in class 1)
  return p
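
With many features, a product of small densities can underflow to 0.0. A common workaround is to add log-densities instead of multiplying densities. The following is a minimal sketch of that variant, not the notebook's method: `log_gaussian_probability`, `naive_bayes_log`, the `prior` parameter and the example values are all hypothetical names and numbers chosen for illustration.

```python
from math import log, pi, exp

def log_gaussian_probability(x, mean, stdev):
    # Logarithm of the Gaussian density, written out directly so exp() is never called
    return -((x - mean) ** 2) / (2 * stdev ** 2) - log(stdev) - 0.5 * log(2 * pi)

def naive_bayes_log(summaries, test_value, prior=0.5):
    # summaries holds one [mean, stdev, count] entry per feature
    log_p = log(prior)
    for i in range(len(summaries)):
        log_p += log_gaussian_probability(test_value[i], summaries[i][0], summaries[i][1])
    return log_p

# Hypothetical per-feature summaries and test point, for illustration only
example_summaries = [[0.0, 1.0, 5], [1.0, 2.0, 5]]
example_point = [0.5, 1.5]

log_score = naive_bayes_log(example_summaries, example_point)
print(log_score, exp(log_score))

# A point far from the mean still gets a finite log-score,
# whereas the plain product of densities would underflow to 0.0.
far_score = naive_bayes_log(example_summaries, [1000.0, 1000.0])
print(far_score)
```

Because the logarithm is monotonic, comparing log-scores between classes gives the same decision as comparing the original products.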


In [None]:
#We generate our sample data

sample_data = [[3.393533211,2.331273381,0],
	[3.110073483,1.781539638,0],
	[1.343808831,3.368360954,0],
	[3.582294042,4.67917911,0],
	[2.280362439,2.866990263,0],
	[7.423436942,4.696522875,1],
	[5.745051997,3.533989803,1],
	[9.172168622,2.511101045,1],
	[7.792783481,3.424088941,1],
	[7.939820817,0.791637231,1]]

#Convert the list to dataframe
df = pd.DataFrame(sample_data, columns = ['a', 'b','Species']) 

#We compute the summaries one by one for each class (the first 5 rows are class 0, the last 5 rows are class 1)
summaries0 = summarize(df.head(5))
summaries1 = summarize(df.tail(5))

#Store the values returned by the naive bayes function for each class in separate variables and then print them
#Here we score the first sample datapoint
#If the first value is greater, the datapoint is assigned to class 0; otherwise it is assigned to class 1

probabilities0 = naive_bayes(summaries0, sample_data[0])   
probabilities1 = naive_bayes(summaries1, sample_data[0])
print(probabilities0,probabilities1)

0.05032427673372076 0.00011557718379945765


In the cell above, we scored the first sample datapoint, which belongs to class 0 in the dataset. Since probabilities0 is greater than probabilities1, the Naive Bayes algorithm assigns this datapoint to class 0, which matches its true label. Note that these two values are unnormalized scores (prior times likelihood), so they do not sum to 1.
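
The same idea can be generalized so that a single call handles all classes, estimates the priors from the class frequencies instead of hard-coding 0.5, and normalizes the scores into posterior probabilities. This is a rough sketch under those assumptions; `fit`, `predict` and the dictionary layout are hypothetical names introduced here, and pandas' `mean()` and `std()` are used in place of the hand-written helpers.

```python
import pandas as pd
from math import sqrt, pi, exp

def gaussian_probability(x, mean, stdev):
    power = exp(-((x - mean) ** 2) / (2 * (stdev ** 2)))
    return (1 / (sqrt(2 * pi) * stdev)) * power

def fit(dataframe, label="Species"):
    # For each class, store its prior and the per-feature mean and stdev
    model = {}
    for cls, group in dataframe.groupby(label):
        feats = group.drop(columns=[label])
        model[cls] = {
            "prior": len(group) / len(dataframe),
            "mean": feats.mean().tolist(),
            "stdev": feats.std().tolist(),  # pandas std() divides by n-1 by default
        }
    return model

def predict(model, row):
    # Score every class, then normalize so the posteriors sum to 1
    scores = {}
    for cls, params in model.items():
        score = params["prior"]
        for x, m, s in zip(row, params["mean"], params["stdev"]):
            score *= gaussian_probability(x, m, s)
        scores[cls] = score
    total = sum(scores.values())
    posteriors = {cls: s / total for cls, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

sample_data = [[3.393533211, 2.331273381, 0], [3.110073483, 1.781539638, 0],
               [1.343808831, 3.368360954, 0], [3.582294042, 4.67917911, 0],
               [2.280362439, 2.866990263, 0], [7.423436942, 4.696522875, 1],
               [5.745051997, 3.533989803, 1], [9.172168622, 2.511101045, 1],
               [7.792783481, 3.424088941, 1], [7.939820817, 0.791637231, 1]]
df = pd.DataFrame(sample_data, columns=['a', 'b', 'Species'])

model = fit(df)
predicted_class, posteriors = predict(model, sample_data[0][:2])
print(predicted_class, posteriors)
```

Grouping by the label also removes the reliance on the rows being sorted by class, which `head(5)` and `tail(5)` depend on.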

# **Using the scikit-learn Library**
Now that we have seen how the algorithm works, we turn to the Gaussian Naive Bayes implementation provided by the scikit-learn library and apply it to the Iris dataset.

In [None]:
#For classification we rename the species of the flowers as 0,1,2

data["Species"] = data["Species"].replace({"Iris-setosa":0,"Iris-versicolor":1, "Iris-virginica":2})   #assign the result back instead of using inplace=True on a selected column
data.head(150)

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2


In [None]:
# for smooth implementation we convert the data format from dataframe to list 

y = data['Species'].values.tolist()

X_features = data.drop(columns=['Species'])  #for the features, we simply drop the label column, which leaves the 4 feature columns to pass to tolist()
X = X_features.values.tolist()
print(X)
print(y)

[[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2], [4.6, 3.1, 1.5, 0.2], [5.0, 3.6, 1.4, 0.2], [5.4, 3.9, 1.7, 0.4], [4.6, 3.4, 1.4, 0.3], [5.0, 3.4, 1.5, 0.2], [4.4, 2.9, 1.4, 0.2], [4.9, 3.1, 1.5, 0.1], [5.4, 3.7, 1.5, 0.2], [4.8, 3.4, 1.6, 0.2], [4.8, 3.0, 1.4, 0.1], [4.3, 3.0, 1.1, 0.1], [5.8, 4.0, 1.2, 0.2], [5.7, 4.4, 1.5, 0.4], [5.4, 3.9, 1.3, 0.4], [5.1, 3.5, 1.4, 0.3], [5.7, 3.8, 1.7, 0.3], [5.1, 3.8, 1.5, 0.3], [5.4, 3.4, 1.7, 0.2], [5.1, 3.7, 1.5, 0.4], [4.6, 3.6, 1.0, 0.2], [5.1, 3.3, 1.7, 0.5], [4.8, 3.4, 1.9, 0.2], [5.0, 3.0, 1.6, 0.2], [5.0, 3.4, 1.6, 0.4], [5.2, 3.5, 1.5, 0.2], [5.2, 3.4, 1.4, 0.2], [4.7, 3.2, 1.6, 0.2], [4.8, 3.1, 1.6, 0.2], [5.4, 3.4, 1.5, 0.4], [5.2, 4.1, 1.5, 0.1], [5.5, 4.2, 1.4, 0.2], [4.9, 3.1, 1.5, 0.1], [5.0, 3.2, 1.2, 0.2], [5.5, 3.5, 1.3, 0.2], [4.9, 3.1, 1.5, 0.1], [4.4, 3.0, 1.3, 0.2], [5.1, 3.4, 1.5, 0.2], [5.0, 3.5, 1.3, 0.3], [4.5, 2.3, 1.3, 0.3], [4.4, 3.2, 1.3, 0.2], [5.0, 3.5, 1.6, 0.6], [5.1, 3.8, 1.9, 0.4], [4.8, 3.0

In [None]:
#We split our data into training and testing sets (70% training, 30% testing)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [None]:
#We train our model on training set using Gaussian Naive Bayes class present in sklearn library 
model = GaussianNB()
model.fit(X_train, y_train)

#We try to predict y values of testing data  
y_pred = model.predict(X_test)

In [None]:
# In order to assess our model, we print the confusion matrix 

print(confusion_matrix(y_test, y_pred))

[[16  0  0]
 [ 0 18  0]
 [ 0  0 11]]


The diagonal elements of the confusion matrix count the test samples that were classified correctly, while the off-diagonal elements count the misclassified ones. Here every off-diagonal entry is zero, so the model classified all 45 test samples correctly.
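
The reading of the confusion matrix can be turned into numbers. The sketch below copies the matrix printed above into an array (the names `cm`, `accuracy` and `per_class_recall` are introduced here for illustration) and derives the overall accuracy and the per-class recall from it.

```python
import numpy as np

# Confusion matrix as printed above: rows are true classes, columns are predicted classes
cm = np.array([[16, 0, 0],
               [0, 18, 0],
               [0, 0, 11]])

# Accuracy: correctly classified samples (the diagonal) divided by all samples
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions for a class divided by the true count of that class
per_class_recall = np.diag(cm) / cm.sum(axis=1)

print(accuracy)          # 1.0
print(per_class_recall)
```

A perfect score on 45 test samples is encouraging, but with such a small test set it is a rough estimate rather than a guarantee of performance on new data.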