# Naive bayes 

It is the most widely used algorithm of Bayesian Decision Theory for classification a supervised learning data, it depends on using probability to make predections in machine learning, Bayes’ theorem states the following relationship:

$$ P(class|data) = \frac{P(data|class) * P(class)}{ P(data)} $$

# load iris flowers data 

The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris Setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

This dataset became a typical test case for many statistical classification techniques in machine learning such as support vector machines and Naive Bayes

Reference : https://www.kaggle.com/arshid/iris-flower-dataset#IRIS.csv 

In [306]:
# Naive Bayes On The Iris Dataset
from csv import reader
from random import seed
from random import randrange
from math import sqrt
from math import exp
from math import pi

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

seed(1)
filename = 'irisflowers.csv'
dataset = load_csv(filename)
dataset[:10]

[['5.1', '3.5', '1.4', '0.2', 'setosa'],
 ['4.9', '3.0', '1.4', '0.2', 'setosa'],
 ['4.7', '3.2', '1.3', '0.2', 'setosa'],
 ['4.6', '3.1', '1.5', '0.2', 'setosa'],
 ['5.0', '3.6', '1.4', '0.2', 'setosa'],
 ['5.4', '3.9', '1.7', '0.4', 'setosa'],
 ['4.6', '3.4', '1.4', '0.3', 'setosa'],
 ['5.0', '3.4', '1.5', '0.2', 'setosa'],
 ['4.4', '2.9', '1.4', '0.2', 'setosa'],
 ['4.9', '3.1', '1.5', '0.1', 'setosa']]

In [280]:
# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

for i in range(len(dataset[0])-1):
    str_column_to_float(dataset, i)
    
# convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)

{'virginica': 0, 'versicolor': 1, 'setosa': 2}

# Algorithm Implementation 

### 1- Seperate by class


In [282]:
# Split the dataset by class values, returns a dictionary
def separate_by_class(dataset):
    separated = dict()
    for i in range(len(dataset)):
        vector = dataset[i]
        class_value = vector[-1]
        if (class_value not in separated):
            separated[class_value] = list()
        separated[class_value].append(vector)
    return separated
separate_by_class(dataset)
# those three functions can make the seperation too
# df1 = df[df['species'].isin([1])]  - this function will return all of the rows containing the desired value (string or integer)- 
# df1 = df[df['species'].str.contains("setosa")] - this function will return all of the rows containing the desired value (string)-
#dfs = dict(tuple(df.groupby('species'))) # - this function will return grouped tables when you call it by 
#dfs[1]

{2: [[5.0, 3.3, 1.4, 0.2, 2],
  [5.0, 3.5, 1.3, 0.3, 2],
  [4.9, 3.1, 1.5, 0.1, 2],
  [4.4, 2.9, 1.4, 0.2, 2],
  [5.0, 3.5, 1.6, 0.6, 2],
  [5.7, 3.8, 1.7, 0.3, 2],
  [5.1, 3.7, 1.5, 0.4, 2],
  [4.6, 3.4, 1.4, 0.3, 2],
  [5.5, 4.2, 1.4, 0.2, 2],
  [5.1, 3.5, 1.4, 0.3, 2],
  [5.0, 3.2, 1.2, 0.2, 2],
  [5.8, 4.0, 1.2, 0.2, 2],
  [5.2, 4.1, 1.5, 0.1, 2],
  [5.4, 3.7, 1.5, 0.2, 2],
  [4.5, 2.3, 1.3, 0.3, 2],
  [5.0, 3.0, 1.6, 0.2, 2],
  [5.1, 3.4, 1.5, 0.2, 2],
  [4.8, 3.0, 1.4, 0.3, 2],
  [5.1, 3.8, 1.5, 0.3, 2],
  [5.4, 3.4, 1.7, 0.2, 2],
  [4.3, 3.0, 1.1, 0.1, 2],
  [4.8, 3.4, 1.6, 0.2, 2],
  [4.6, 3.2, 1.4, 0.2, 2],
  [5.1, 3.8, 1.6, 0.2, 2],
  [4.6, 3.6, 1.0, 0.2, 2],
  [5.4, 3.4, 1.5, 0.4, 2],
  [5.0, 3.6, 1.4, 0.2, 2],
  [5.5, 3.5, 1.3, 0.2, 2],
  [4.4, 3.0, 1.3, 0.2, 2],
  [4.4, 3.2, 1.3, 0.2, 2],
  [5.7, 4.4, 1.5, 0.4, 2],
  [5.1, 3.3, 1.7, 0.5, 2],
  [4.8, 3.0, 1.4, 0.1, 2],
  [4.7, 3.2, 1.3, 0.2, 2],
  [4.9, 3.1, 1.5, 0.1, 2],
  [5.1, 3.8, 1.9, 0.4, 2],
  [4.7, 3.2, 1.6, 0.2, 2]

### 2- Summarize data set

Get some statistical calculation like mean:

$$ mean(column) = \frac{sum(column)}{lenght(column)} $$

In [283]:
# Calculate the mean of a list of numbers
def mean(numbers):
    return sum(numbers)/float(len(numbers))

and the standard deviation:

$$ stdev(column) =  \sqrt {\frac {\sum_ {i=1}^{N} (column(i)-Mean)^2} {N-1} } $$

**where N is the length of the column** 

In [284]:
# Calculate the standard deviation of a list of numbers
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
    return sqrt(variance)

then we can implement our dataset summarization for each column 

In [285]:
# Calculate the mean, stdev and count for each column in a dataset
def summarize_dataset(dataset):
    summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)]
    del(summaries[-1])
    return summaries
summarize_dataset(dataset)

[(5.843333333333335, 0.8280661279778632, 150),
 (3.054, 0.4335943113621738, 150),
 (3.758666666666667, 1.7644204199522624, 150),
 (1.198666666666667, 0.7631607417008416, 150)]

### 3- Summarize data set by Class

In [286]:
# Split dataset by class then calculate statistics for each row
def summarize_by_class(dataset):
    separated = separate_by_class(dataset)
    summaries = dict()
    for class_value, rows in separated.items():
        summaries[class_value] = summarize_dataset(rows)
    return summaries
summarize_by_class(dataset)

{2: [(5.005999999999999, 0.35248968721345114, 50),
  (3.418, 0.3810243979546911, 50),
  (1.4639999999999997, 0.17351115943644538, 50),
  (0.24399999999999988, 0.10720950308167837, 50)],
 1: [(5.936, 0.5161711470638636, 50),
  (2.7700000000000005, 0.31379832337841135, 50),
  (4.26, 0.46991097723995806, 50),
  (1.3259999999999998, 0.197752680004544, 50)],
 0: [(6.587999999999998, 0.6358795932744322, 50),
  (2.9739999999999993, 0.32249663817263746, 50),
  (5.552, 0.5518946956639835, 50),
  (2.0259999999999994, 0.27465005563666733, 50)]}

### 4- Gaussian probability density function

Calculating the probability or likelihood of observing a given real-value is difficult.

One way we can do this is to assume that the feature values are drawn from a distribution, such as a bell curve or Gaussian distribution.

A Gaussian distribution can be summarized using only two numbers: the **mean** and the **standard deviation**. Therefore, with a little math, we can estimate the probability of a given value, and can be calculated as:

$$ f(x) = \frac{1}{\sqrt{2\pi} * stdev} e^{\frac{-(x-mean)^{2}}{2*stdev^{2}}} $$

In [287]:
# Calculate the Gaussian probability distribution function for x
def calculate_probability(x, mean, stdev):
    exponent = exp(-((x-mean)**2 / (2 * stdev**2 )))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent

### 5-Class Probabilities

Now it is time to use the statistics calculated from our training data to calculate probabilities for new data.

Probabilities are calculated separately for each class. This means that we first calculate the probability that a new piece of data belongs to the first class, then calculate probabilities that it belongs to the second class, and so on for all the classes.

The probability that a piece of data belongs to a class is calculated as follows:

$$ P(class|data) = P(X|class) * P(class) $$



In [288]:
# Calculate the probabilities of predicting each class for a given row
def calculate_class_probabilities(summaries, row):
    total_rows = sum([summaries[label][0][2] for label in summaries])
    probabilities = dict()
    for class_value, class_summaries in summaries.items():
        probabilities[class_value] = summaries[class_value][0][2]/float(total_rows)
        for i in range(len(class_summaries)):
            mean, stdev, _ = class_summaries[i]
            probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
    return probabilities

# Cross validation split

In [301]:
# using Fisher–Yates shuffle Algorithm 
# to shuffle a list 
for i in range(len(dataset)-1, 0, -1): 
      
    # Pick a random index from 0 to i  
    j = random.randint(0, i + 1)  
    
    # Swap arr[i] with the element at random index  
    dataset[i], dataset[j] = dataset[j], dataset[i] 

    
# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for _ in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split
dataset_split = cross_validation_split(dataset,5)

In [302]:
train_split = dataset_split[0] + dataset_split[1] + dataset_split[2]
test_split = dataset_split[3] + dataset_split[4] 

# Preciction model 

In [303]:
# Predict the class for a given row
def predict(summaries, row):
    probabilities = calculate_class_probabilities(summaries, row)
    best_label, best_prob = None, -1
    for class_value, probability in probabilities.items():
        if best_label is None or probability > best_prob:
            best_prob = probability
            best_label = class_value
    return best_label

In [304]:
# Make a prediction with Naive Bayes on Iris Dataset
# fit model
model = summarize_by_class(train_split)
# define a new record
accuracy = 0 
for i in range (len(test_split)):
    row = dataset[i][:-1] 
    # predict the label
    label = predict(model, row)
    print('Data=%s, Predicted: %s , and the right prediction: %s' % (row, label , dataset[i][-1]))
    if (dataset[i][-1]  == label):
        accuracy = accuracy + 1
print ('accuracy = %s' % ((accuracy/len(test_split))*100))

Data=[6.2, 2.2, 4.5, 1.5], Predicted: 1 , and the right prediction: 1
Data=[7.0, 3.2, 4.7, 1.4], Predicted: 1 , and the right prediction: 1
Data=[6.5, 3.0, 5.2, 2.0], Predicted: 0 , and the right prediction: 0
Data=[5.4, 3.4, 1.7, 0.2], Predicted: 2 , and the right prediction: 2
Data=[6.3, 2.5, 4.9, 1.5], Predicted: 1 , and the right prediction: 1
Data=[5.4, 3.4, 1.5, 0.4], Predicted: 2 , and the right prediction: 2
Data=[5.7, 2.8, 4.5, 1.3], Predicted: 1 , and the right prediction: 1
Data=[6.3, 3.3, 6.0, 2.5], Predicted: 0 , and the right prediction: 0
Data=[6.9, 3.2, 5.7, 2.3], Predicted: 0 , and the right prediction: 0
Data=[6.3, 2.7, 4.9, 1.8], Predicted: 0 , and the right prediction: 0
Data=[7.3, 2.9, 6.3, 1.8], Predicted: 0 , and the right prediction: 0
Data=[5.6, 2.7, 4.2, 1.3], Predicted: 1 , and the right prediction: 1
Data=[5.6, 2.9, 3.6, 1.3], Predicted: 1 , and the right prediction: 1
Data=[7.2, 3.2, 6.0, 1.8], Predicted: 0 , and the right prediction: 0
Data=[5.9, 3.0, 5.1,