# Gaussian Naive Bayes from Scratch

## Representation for Gaussian Naive Bayes

* With categorical inputs, the probability of each input value for each class can be estimated from how often that value occurs (a frequency).
* With real-valued inputs, we can instead calculate the mean and standard deviation of the input values (x) for each class to summarize the distribution. This means that in addition to the probabilities for each class, we must also store the mean and standard deviation of each input variable for each class.

## Learn a Gaussian Naive Bayes Model From Data

* This is as simple as calculating the mean and standard deviation values of each input variable (x) for each class value.

![S6.1.png](attachment:S6.1.png)

* Where n is the number of instances and x are the values for an input variable in your training data.

In [18]:
# Calculate the mean of a list of numbers
def mean(numbers):
    return sum(numbers)/float(len(numbers))

* We can calculate the standard deviation using the following equation:

![S6.2.png](attachment:S6.2.png)

* This is the square root of the average squared difference of each value of x from the mean value of x, where n is the number of instances, xi is the value of the x variable for the i-th instance, and mean(x) is described above. Note that the `stdev` function below divides by n - 1 rather than n, which gives the sample standard deviation.
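* For reference, the two statistics can be written out as follows. This is a transcription that follows the `mean` and `stdev` functions in this notebook, so the standard deviation uses the n - 1 denominator; the figure may show the form with n instead.

$$
\text{mean}(x) = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
\text{stdev}(x) = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\bigl(x_i - \text{mean}(x)\bigr)^2}
$$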

In [19]:
# Calculate the sample standard deviation of a list of numbers
from math import sqrt

def stdev(numbers):
    avg = mean(numbers)
    # Divide by n - 1 (sample variance); this needs at least two values
    variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
    return sqrt(variance)
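* As a quick sanity check, the two helpers can be compared against Python's standard `statistics` module. This is a minimal sketch: the functions are repeated so the block runs on its own, and the `values` list is made up for illustration.

```python
import statistics
from math import sqrt

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum((x - avg) ** 2 for x in numbers) / float(len(numbers) - 1)
    return sqrt(variance)

# Made-up sample, not taken from the iris data
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(mean(values))               # 5.0
print(stdev(values))              # sample form, divides by n - 1
print(statistics.stdev(values))   # standard library, also divides by n - 1
print(statistics.pstdev(values))  # population form, divides by n
```

* The from-scratch `stdev` agrees with `statistics.stdev`, while `statistics.pstdev` gives a slightly smaller value because it divides by n.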

## Make Predictions With a Gaussian Naive Bayes Model

* Probabilities of new x values are calculated using the **Gaussian Probability Density Function (PDF)**.
* When making predictions, the mean and standard deviation learned for a class are plugged into the Gaussian PDF together with the new input value. The result is a density that serves as an estimate of how likely that value is for that class. It is not a true probability and can exceed 1.

![S6.3.png](attachment:S6.3.png)
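* Written out, the Gaussian PDF is shown below. The symbols μ and σ are notation chosen here for the per-class mean and standard deviation of the input variable; they correspond to the `mean` and `stdev` arguments in the code.

$$
f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$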

In [20]:
from math import sqrt
from math import exp
from math import pi
# Calculate the Gaussian probability density function (PDF) for x
def calculate_probability(x, mean, stdev):
    exponent = exp(-((x-mean)**2 / (2 * stdev**2 )))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent
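* A few evaluations help build intuition for what this function returns. The sketch below repeats the function so it runs on its own; the input values are chosen only for illustration.

```python
from math import sqrt, exp, pi

def calculate_probability(x, mean, stdev):
    exponent = exp(-((x - mean) ** 2 / (2 * stdev ** 2)))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent

# Density at the mean of a standard normal: 1 / sqrt(2 * pi)
peak = calculate_probability(0.0, 0.0, 1.0)
print(round(peak, 4))      # 0.3989

# One standard deviation away from the mean, the density is lower
one_sd = calculate_probability(1.0, 0.0, 1.0)
print(round(one_sd, 4))    # 0.242

# A narrow distribution (stdev = 0.1) has a peak density above 1
narrow = calculate_probability(0.0, 0.0, 0.1)
print(round(narrow, 4))    # 3.9894
```

* The last value shows why the result is best read as a density used for comparing classes rather than as a probability.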

## Class Probabilities

* Now it is time to use the statistics calculated from our training data to calculate probabilities for new data. 
* Probabilities are calculated separately for each class. 
* This means that we first calculate the probability that a new piece of data belongs to the first class, then calculate probabilities that it belongs to the second class, and so on for all the classes. 
* The probability that a piece of data belongs to a class is calculated as follows:

![S6.4.png](attachment:S6.4.png)

* The input variables are treated separately, as if they were independent of one another given the class, which gives the technique its name: naive.
* For example, with 2 input variables, the probability that a row belongs to the first class (class 0) is calculated as:

![S6.5-2.png](attachment:S6.5-2.png)
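* Following the comments in the code, the two-input case can be written as shown below. The proportionality sign is an addition here: the product is not divided by the probability of the inputs, so the results are scores for comparing classes rather than probabilities that sum to 1.

$$
P(\text{class}=0 \mid X_1, X_2) \propto P(X_1 \mid \text{class}=0)\, P(X_2 \mid \text{class}=0)\, P(\text{class}=0)
$$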

In [21]:
# Calculate the probabilities of predicting each class for a given row
def calculate_class_probabilities(summaries, row):
    total_rows = sum([summaries[label][0][2] for label in summaries])
    probabilities = dict()
    for class_value, class_summaries in summaries.items():
        probabilities[class_value] = summaries[class_value][0][2]/float(total_rows) # prior: P(class)
        for i in range(len(class_summaries)):
            mean, stdev, _ = class_summaries[i]
            probabilities[class_value] *= calculate_probability(row[i], mean, stdev) # multiply by P(X_i|class) for each input
    return probabilities
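* The function can be tried on a hand-written `summaries` dictionary. In this sketch the `toy_summaries` values are hypothetical, and the final normalization step is an optional addition that is not part of the notebook: it only rescales the scores so they sum to 1.

```python
from math import sqrt, exp, pi

def calculate_probability(x, mean, stdev):
    exponent = exp(-((x - mean) ** 2 / (2 * stdev ** 2)))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent

def calculate_class_probabilities(summaries, row):
    total_rows = sum(summaries[label][0][2] for label in summaries)
    probabilities = dict()
    for class_value, class_summaries in summaries.items():
        # Start from the class prior, then multiply in one density per input
        probabilities[class_value] = class_summaries[0][2] / float(total_rows)
        for i in range(len(class_summaries)):
            mean, stdev, _ = class_summaries[i]
            probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
    return probabilities

# Hypothetical summaries: one (mean, stdev, count) tuple per input column
toy_summaries = {
    0: [(1.0, 0.5, 10), (2.0, 0.5, 10)],
    1: [(4.0, 0.5, 10), (5.0, 0.5, 10)],
}

scores = calculate_class_probabilities(toy_summaries, [1.2, 2.1])
print(scores)

# Optional: rescale the scores so they sum to 1 (not done in the notebook)
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
print(posteriors)
```

* The row lies close to the means of class 0, so that class receives by far the larger score. Normalizing does not change which class wins.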

In [22]:
# Calculate the mean, stdev and count for each column in a dataset
def descriptive_stat_by_class(dataset):
    summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)] # One (mean, stdev, count) tuple per column
    del summaries[-1]  # Drop the summary of the last column, which holds the class label
    return summaries

In [23]:
# Split the dataset by class values, returns a dictionary
def separate_data_by_class(dataset):
    separated = dict()
    for i in range(len(dataset)):
        vector = dataset[i]
        class_value = vector[-1]
        if (class_value not in separated):
            separated[class_value] = list()
        separated[class_value].append(vector)
    return separated
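* A small example shows the shape of the dictionary this function returns. The function is repeated, in a lightly simplified form, so the block runs on its own; `toy_dataset` is made up, with the class label in the last position of each row.

```python
def separate_data_by_class(dataset):
    separated = dict()
    for vector in dataset:
        class_value = vector[-1]  # the class label is the last element
        if class_value not in separated:
            separated[class_value] = list()
        separated[class_value].append(vector)
    return separated

# Made-up rows: two input values followed by a class label
toy_dataset = [
    [1.0, 2.0, 0],
    [1.2, 1.8, 0],
    [4.0, 5.0, 1],
    [4.2, 5.1, 1],
    [0.8, 2.2, 0],
]

separated = separate_data_by_class(toy_dataset)
for class_value, rows in separated.items():
    print(class_value, rows)
```

* Each key is a class value, and each value is the list of full rows (label included) that belong to that class.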

In [24]:
# Split dataset by class, then calculate statistics for each column within each class
def descriptive_statistics_by_class(dataset):
    separated = separate_data_by_class(dataset)
    summaries = dict()
    for class_value, rows in separated.items():
        summaries[class_value] = descriptive_stat_by_class(rows)
    return summaries
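* The resulting `summaries` dictionary is the whole learned model. The sketch below repeats the helper functions so it runs on its own and uses a made-up `toy_dataset` whose statistics are easy to check by hand.

```python
from math import sqrt

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum((x - avg) ** 2 for x in numbers) / float(len(numbers) - 1)
    return sqrt(variance)

def descriptive_stat_by_class(dataset):
    summaries = [(mean(c), stdev(c), len(c)) for c in zip(*dataset)]
    del summaries[-1]  # drop the class-label column
    return summaries

def separate_data_by_class(dataset):
    separated = dict()
    for vector in dataset:
        separated.setdefault(vector[-1], list()).append(vector)
    return separated

def descriptive_statistics_by_class(dataset):
    separated = separate_data_by_class(dataset)
    return {label: descriptive_stat_by_class(rows) for label, rows in separated.items()}

# Made-up rows: two input values followed by a class label
toy_dataset = [
    [1.0, 2.0, 0],
    [2.0, 4.0, 0],
    [3.0, 6.0, 0],
    [10.0, 20.0, 1],
    [12.0, 22.0, 1],
]

summaries = descriptive_statistics_by_class(toy_dataset)
print(summaries[0])  # [(2.0, 1.0, 3), (4.0, 2.0, 3)]
print(summaries[1])  # means 11.0 and 21.0, stdev sqrt(2) for both columns, count 2
```

* Because `stdev` divides by n - 1, every class needs at least two training rows; a class with a single row raises a `ZeroDivisionError`.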

In [25]:
# Predict the class for a given row
def predict(summaries, row):
    probabilities = calculate_class_probabilities(summaries, row)
    best_label, best_prob = None, -1
    for class_value, probability in probabilities.items():
        if best_label is None or probability > best_prob:
            best_prob = probability
            best_label = class_value
    return best_label

In [26]:
# Naive Bayes Algorithm
def naive_bayes(train, test):
    summarize = descriptive_statistics_by_class(train)
    predictions = list()
    for row in test:
        output = predict(summarize, row)
        predictions.append(output)
    return predictions
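* Before moving to real data, the whole pipeline can be exercised on a tiny dataset. This is a condensed sketch of the functions above, not a replacement for them: several helpers are merged, the parameter names `mu` and `sigma` are chosen here, and the `train` and `test` rows are made up.

```python
from math import sqrt, exp, pi

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    return sqrt(sum((x - avg) ** 2 for x in numbers) / float(len(numbers) - 1))

def calculate_probability(x, mu, sigma):
    return exp(-((x - mu) ** 2 / (2 * sigma ** 2))) / (sqrt(2 * pi) * sigma)

def descriptive_statistics_by_class(dataset):
    separated = dict()
    for row in dataset:
        separated.setdefault(row[-1], list()).append(row)
    summaries = dict()
    for class_value, rows in separated.items():
        columns = list(zip(*rows))[:-1]  # drop the class-label column
        summaries[class_value] = [(mean(c), stdev(c), len(c)) for c in columns]
    return summaries

def predict(summaries, row):
    total_rows = sum(s[0][2] for s in summaries.values())
    best_label, best_prob = None, -1
    for class_value, class_summaries in summaries.items():
        probability = class_summaries[0][2] / float(total_rows)  # class prior
        for i, (mu, sigma, _) in enumerate(class_summaries):
            probability *= calculate_probability(row[i], mu, sigma)
        if best_label is None or probability > best_prob:
            best_label, best_prob = class_value, probability
    return best_label

def naive_bayes(train, test):
    summaries = descriptive_statistics_by_class(train)
    return [predict(summaries, row) for row in test]

# Made-up, well-separated data: two inputs followed by a class label
train = [
    [1.0, 2.0, 0], [1.2, 1.8, 0], [0.8, 2.2, 0],
    [4.0, 5.0, 1], [4.2, 5.1, 1], [3.8, 4.9, 1],
]
test = [[1.1, 2.0, 0], [4.1, 5.0, 1]]

predictions = naive_bayes(train, test)
print(predictions)  # [0, 1]
```

* The test rows keep their label in the last position, as in the notebook. The label is never read during prediction because only as many values as there are column summaries are used.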

In [27]:
import numpy as np
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [28]:
df = pd.read_csv("Data/iris.csv",header=None)

In [29]:
df.head()

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [30]:
df[4].value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: 4, dtype: int64

In [31]:
df[4] = df[4].map({'Iris-setosa':0, 'Iris-virginica':1, 'Iris-versicolor':2})

In [32]:
df.head()

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [33]:
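# Split whole rows; the class label stays in the last column, where the from-scratch functions expect it
X_train, X_test = train_test_split(df, test_size=0.33, random_state=42)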

In [34]:
predictions = naive_bayes(X_train.values, X_test.values)

In [35]:
print(accuracy_score(X_test[4].values, predictions))

0.96


In [36]:
from sklearn.naive_bayes import GaussianNB

In [37]:
X = df.drop(4, axis=1)
y = df[4]

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [39]:
model = GaussianNB()

In [40]:
model.fit(X_train, y_train)

GaussianNB()

In [41]:
predictions = model.predict(X_test)

In [42]:
print(accuracy_score(y_test, predictions))

0.96
