# Naive Bayes
---
Naive Bayes classifiers are a family of classification algorithms based on Bayes’ Theorem. Rather than a single algorithm, they are a group of algorithms that share one common assumption: given the class label, every feature is independent of every other feature.
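
Written out, the shared assumption looks roughly like this. The symbols are notation chosen here for illustration, not taken from the notebook: $y$ is the class label and $x_1, \dots, x_n$ are the feature values.

```latex
% Bayes' Theorem for a class y and features x_1, ..., x_n
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}

% The "naive" conditional-independence assumption factorizes the likelihood
P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)

% The denominator is the same for every class, so the predicted class is
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The three model types differ only in how they model each factor $P(x_i \mid y)$.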


*Three common types of Naive Bayes model are described below:*

**Gaussian:** The Gaussian model assumes that, within each class, every feature follows a normal distribution. It is used when the predictors take continuous rather than discrete values, which the model treats as samples from a Gaussian distribution.

**Multinomial:** The Multinomial Naïve Bayes classifier is used when the features follow a multinomial distribution, typically counts. It is primarily used for document classification, that is, deciding which category (such as Sports, Politics, or Education) a particular document belongs to.
The classifier uses word frequencies as the predictors.
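
As a rough illustration of word frequencies acting as predictors, here is a minimal sketch of a multinomial classifier. The toy corpus, the function names `train_multinomial` and `predict_multinomial`, and the use of add-one (Laplace) smoothing are all assumptions made for this example; the notebook itself only implements the Gaussian variant.

```python
import math
from collections import Counter

# Hypothetical toy corpus invented for illustration: (tokens, label) pairs.
train_docs = [
    (["goal", "match", "team"], "Sports"),
    (["team", "win", "match"], "Sports"),
    (["vote", "election", "party"], "Politics"),
    (["party", "vote", "policy"], "Politics"),
]


def train_multinomial(docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_counts}
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocabulary = {word for tokens, _ in docs for word in tokens}
    return class_counts, word_counts, vocabulary


def predict_multinomial(tokens, class_counts, word_counts, vocabulary):
    """Return the class with the highest log-posterior score."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label, doc_count in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)  # log prior
        for word in tokens:
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # add-one smoothing keeps unseen (word, class) pairs above zero
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)


model = train_multinomial(train_docs)
sports_guess = predict_multinomial(["match", "goal"], *model)
politics_guess = predict_multinomial(["vote", "policy"], *model)
```

A word that appears twice in a document contributes its log-probability twice, which is how the frequency enters the score.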

**Bernoulli:** The Bernoulli classifier works similarly to the Multinomial classifier, but the predictors are independent Boolean variables, such as whether a particular word is present in a document. This model is also widely used for document classification tasks.
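
The sketch below shows one way the Boolean view could be coded. It differs from a count-based model in that the absence of a word also contributes to the score. The toy documents and the names `train_bernoulli` and `predict_bernoulli` are hypothetical, and the add-one smoothing is an assumption of this example.

```python
import math

# Hypothetical toy data: each document is the set of words it contains.
train_docs = [
    ({"goal", "match"}, "Sports"),
    ({"match", "team"}, "Sports"),
    ({"vote", "party"}, "Politics"),
    ({"vote", "policy"}, "Politics"),
]
vocabulary = sorted(set().union(*(words for words, _ in train_docs)))


def train_bernoulli(docs, vocabulary):
    """Estimate class priors and P(word present | class) with add-one smoothing."""
    labels = sorted({label for _, label in docs})
    priors, word_probs = {}, {}
    for label in labels:
        class_docs = [words for words, doc_label in docs if doc_label == label]
        priors[label] = len(class_docs) / len(docs)
        word_probs[label] = {
            word: (sum(word in words for words in class_docs) + 1) / (len(class_docs) + 2)
            for word in vocabulary
        }
    return priors, word_probs


def predict_bernoulli(words, priors, word_probs, vocabulary):
    """Score every vocabulary word as present (p) or absent (1 - p)."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for word in vocabulary:
            p = word_probs[label][word]
            score += math.log(p if word in words else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)


priors, word_probs = train_bernoulli(train_docs, vocabulary)
sports_guess = predict_bernoulli({"match"}, priors, word_probs, vocabulary)
politics_guess = predict_bernoulli({"vote"}, priors, word_probs, vocabulary)
```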

**About the dataset**
---
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

**Gaussian Naive Bayes Classifier**

In [None]:
# importing required libraries
import csv
import math
import random

In [None]:
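def load_csv(filename):
  # this directory is environment-specific; adjust it to wherever the CSV file lives
  path = r'/content/drive/MyDrive/Colab Notebooks/' + filename
  # a context manager closes the file once it has been read
  with open(path, newline='') as csv_file:
    dataset = list(csv.reader(csv_file))
  # removing the header row from the dataset
  del dataset[0]
  # converting every value to float, since csv.reader yields strings
  for i in range(len(dataset)):
    dataset[i] = [float(x) for x in dataset[i]]
  return dataset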

In [None]:
def split_dataset(dataset, split_ratio):
  train_size = int(len(dataset) * split_ratio)
  train_set = []
  copy = list(dataset)
  while len(train_set) < train_size:
    index = random.randrange(len(copy))
    # appending to training set and removing from testing set
    train_set.append(copy.pop(index))
  # returns training set and testing set
  return [train_set, copy]
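
Because the split draws from the global random generator without a seed, every run produces a different train/test partition and therefore a different accuracy. A minimal sketch of a reproducible variant follows; the name `split_with_seed` and the `seed` parameter are hypothetical additions, and it shuffles a copy rather than popping random rows, which gives the same kind of random partition.

```python
import random


def split_with_seed(dataset, split_ratio, seed):
    """Randomly partition the rows using a private, seeded generator."""
    rng = random.Random(seed)  # does not touch the global random state
    shuffled = list(dataset)   # copy, so the caller's list is left unchanged
    rng.shuffle(shuffled)
    train_size = int(len(shuffled) * split_ratio)
    return shuffled[:train_size], shuffled[train_size:]


# Hypothetical rows: one feature followed by a class label.
rows = [[float(i), float(i % 2)] for i in range(10)]
train_a, test_a = split_with_seed(rows, 0.67, seed=42)
train_b, test_b = split_with_seed(rows, 0.67, seed=42)
```

Calling the function twice with the same seed returns the same partition, which makes accuracy figures comparable between runs.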

In [None]:
def separate_by_class(dataset):
  # group the rows by class label, which is stored in the last column
  separated = {}
  for vector in dataset:
    if vector[-1] not in separated:
      separated[vector[-1]] = []
    separated[vector[-1]].append(vector)
  return separated

In [None]:
def mean(numbers):
  return sum(numbers)/float(len(numbers))

In [None]:
def standard_deviation(numbers):
  avg = mean(numbers)
  # sample variance: dividing by n - 1 means at least two values are required
  variance = sum([pow(x-avg,2) for x in numbers]) / float(len(numbers)-1)
  return math.sqrt(variance)
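
The function above divides by `n - 1`, which makes it the sample standard deviation. As a quick sanity check, the sketch below compares an equivalent standalone version (named `sample_std` here, a hypothetical name) with `statistics.stdev` from the standard library, which uses the same definition.

```python
import math
import statistics


def sample_std(numbers):
    """Sample standard deviation, dividing the squared deviations by n - 1."""
    avg = sum(numbers) / len(numbers)
    variance = sum((x - avg) ** 2 for x in numbers) / (len(numbers) - 1)
    return math.sqrt(variance)


values = [1.0, 2.0, 3.0, 4.0, 5.0]
ours = sample_std(values)
reference = statistics.stdev(values)
print(ours, reference)
```

One consequence of the `n - 1` divisor is that a class with only one training row cannot be summarized: the division raises `ZeroDivisionError`.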

In [None]:
def summarize(dataset):
  # zip(*dataset) transposes the rows, so each attribute is one column of values
  summaries = [(mean(attribute), standard_deviation(attribute)) for attribute in zip(*dataset)]
  # drop the summary of the last column, which holds the class label
  del summaries[-1]
  return summaries
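
To see what `zip(*dataset)` does in `summarize`, here is a small standalone illustration with made-up rows (two features followed by a class label).

```python
# Hypothetical rows: [feature_1, feature_2, class_label]
rows = [
    [1.0, 20.0, 0.0],
    [2.0, 21.0, 0.0],
    [3.0, 22.0, 0.0],
]

# Unpacking the rows into zip() regroups the values column by column.
columns = list(zip(*rows))

# The last column is the class label, so only the others are features.
feature_columns = columns[:-1]
```

Each tuple in `columns` holds all values of one attribute, which is the shape that `mean` and `standard_deviation` expect.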

In [None]:
def summarize_by_class(dataset):
  separated = separate_by_class(dataset)
  summaries = {}
  for class_value, instances in separated.items():
    summaries[class_value] = summarize(instances)
  return summaries

In [None]:
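def calculate_probability(x, mean, standard_deviation):
  # Gaussian density at x; this is a density rather than a probability, so it can exceed 1
  # exponent holds exp(-(x - mean)^2 / (2 * standard_deviation^2))
  exponent = math.exp(-(math.pow(x-mean, 2) / (2*math.pow(standard_deviation, 2))))
  return (1/(math.sqrt(2*math.pi)*standard_deviation))*exponent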
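
The function above evaluates the normal density $\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. A short standalone check, using the hypothetical name `gaussian_pdf` for an equivalent function, shows two properties worth keeping in mind: the peak of the standard normal is $1/\sqrt{2\pi}$, and a narrow distribution gives density values above 1.

```python
import math


def gaussian_pdf(x, mean, std):
    """Normal density at x for the given mean and standard deviation."""
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * std)


peak_standard = gaussian_pdf(0.0, 0.0, 1.0)  # equals 1 / sqrt(2 * pi)
peak_narrow = gaussian_pdf(0.0, 0.0, 0.1)    # ten times higher, so above 1
one_sigma = gaussian_pdf(1.0, 0.0, 1.0)      # lower than the peak
```

Values above 1 are fine for classification, because only the comparison between classes matters. A standard deviation of exactly zero (a feature that is constant within a class) would raise `ZeroDivisionError`.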

In [None]:
def calculate_class_probabilities(summaries, input_vector):
  probabilities = {}
  # items() returns a view of the dictionary's (key, value) pairs
  for class_value, class_summaries in summaries.items():
    # start from 1 and multiply in each feature's likelihood; the class prior is not included
    probabilities[class_value] = 1
    for i in range(len(class_summaries)):
      mean, standard_deviation = class_summaries[i]
      x = input_vector[i]
      probabilities[class_value] *= calculate_probability(x, mean, standard_deviation)
  return probabilities
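
The function above multiplies per-feature likelihoods starting from 1, so it implicitly treats all classes as equally likely, and a long product of small densities can underflow to zero. The sketch below is one possible variant, not the notebook's method: it adds the log of a class prior and sums log-densities instead. The names `log_gaussian_pdf` and `class_log_scores`, the `priors` argument, and the example numbers are all assumptions made for illustration.

```python
import math


def log_gaussian_pdf(x, mean, std):
    """Logarithm of the normal density at x."""
    return (-0.5 * math.log(2 * math.pi)
            - math.log(std)
            - ((x - mean) ** 2) / (2 * std ** 2))


def class_log_scores(summaries, priors, input_vector):
    """log P(class) plus the sum of log-likelihoods for each class."""
    scores = {}
    for class_value, class_summaries in summaries.items():
        score = math.log(priors[class_value])
        # zip stops at the shorter sequence, so a trailing label is ignored
        for (mean, std), x in zip(class_summaries, input_vector):
            score += log_gaussian_pdf(x, mean, std)
        scores[class_value] = score
    return scores


# Hypothetical per-class (mean, std) summaries for two features.
summaries = {
    0.0: [(1.0, 0.5), (10.0, 2.0)],
    1.0: [(3.0, 0.5), (14.0, 2.0)],
}
# Hypothetical priors, e.g. the fraction of training rows in each class.
priors = {0.0: 0.65, 1.0: 0.35}

scores = class_log_scores(summaries, priors, [1.2, 10.5, 0.0])
best = max(scores, key=scores.get)
```

Because the logarithm is monotonic, picking the largest log-score selects the same class as picking the largest product would, as long as the product has not underflowed.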


In [None]:
def predict(summaries, input_vector):
  probabilities = calculate_class_probabilities(summaries, input_vector)
  best_label, best_prob = None, -1
  for class_value, probability in probabilities.items():
    if best_label is None or probability > best_prob:
      best_prob = probability
      best_label = class_value
  return best_label
   

In [None]:
def get_predictions(summaries, test_set):
  predictions = []
  for i in range(len(test_set)):
    result = predict(summaries, test_set[i])
    predictions.append(result)
  return predictions

In [None]:
def get_accuracy(test_set, predictions):
  correct = 0
  for x in range(len(test_set)):
    if test_set[x][-1] == predictions[x]:
      correct += 1
  return (correct/float(len(test_set)))*100.0
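
Accuracy alone does not show which kind of mistake the classifier makes, which matters for a diagnostic task. As a complement, here is a minimal sketch that counts the four outcomes of a binary prediction. The function name `confusion_counts`, the assumption that label `1.0` marks the positive (diabetic) class, and the example labels are all hypothetical.

```python
def confusion_counts(actual, predicted, positive=1.0):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


# Made-up labels for illustration only.
actual = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
predicted = [1.0, 0.0, 0.0, 0.0, 1.0, 1.0]

counts = confusion_counts(actual, predicted)
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])
```

With the notebook's data, `actual` could be built as `[row[-1] for row in test_set]` and `predicted` would be the list returned by `get_predictions`.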

**Driver Code**

In [None]:
filename = 'diabetes.csv'
split_ratio = 0.67
dataset = load_csv(filename)
training_set, test_set = split_dataset(dataset, split_ratio)
print('Split {0} rows into train = {1} and test = {2} rows'.format(len(dataset), len(training_set), len(test_set)))

# prepare model
summaries = summarize_by_class(training_set)

# test model
predictions = get_predictions(summaries, test_set)
accuracy = get_accuracy(test_set, predictions)
print('Accuracy: {0}'.format(accuracy))

Split 768 rows into train = 514 and test = 254 rows
Accuracy: 75.19685039370079
