# BT2101 Introduction to Naive Bayes Classification

## 1 Goal:

In this notebook, we will explore Naive Bayes classification using:
* Maximum a posteriori (MAP) estimation
* Open-source package: `scikit-learn`

For MAP, you will:
* Understand Bayes formula: Likelihood, Prior, Posterior
* Understand total/complete probability and conditional probability
* Write functions to calculate likelihood, prior and posterior
* Write a prediction function
* Use MAP to classify new data

In [None]:
# -*- coding:utf-8 -*-
# __future__ imports must come before any other statement in the cell
from __future__ import division
from math import sqrt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### 1.1 Summary of Bayes Formula

From the lecture class, we know that a typical Bayes formula with regard to an event $\theta$ is:
\begin{align}
P(\theta|X) &= \frac{P(X|\theta)P(\theta)}{P(X)} \\
&\propto P(X|\theta)P(\theta)  \\
\end{align}

which is equivalently expressed as:
\begin{align}
\text{Posterior} &= \frac{\text{Likelihood}\times\text{Prior}}{\text{Evidence}} \\
&\propto \text{Likelihood}\times\text{Prior}  \\
\end{align}

Let us suppose:
1. Prior probability distribution of an event $\theta$ is: $P(\theta)$

2. Data likelihood function of $X=(x_{1},...,x_{f})$ with $f$ features:
\begin{align}
l(\theta) &= P(X=(x_{1},...,x_{f})|\theta) \\
&= \prod_{j=1}^{f}P(x_{j}|\theta)
\end{align}

3. Posterior probability is: $P(\theta|X=(x_{1},...,x_{f}))$

4. According to Bayes formula, we can get:
\begin{align}
P(\theta|X) &\propto P(\theta)\times\prod_{j=1}^{f}P(x_{j}|\theta) \\
\end{align}

Assumption: The features are conditionally independent given $\theta$, which is a very strong assumption.
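
As a minimal sketch of the formula above, the following computes a posterior for two hypothetical events from made-up priors and per-feature conditional probabilities. All names and numbers here (`priors`, `cond_probs`, `naive_posterior`) are illustrative assumptions and do not come from the lecture material. The evidence $P(X)$ is obtained from the law of total probability.

```python
# Hypothetical priors P(theta) for two mutually exclusive events
priors = {'A': 0.3, 'B': 0.7}

# Hypothetical conditionals P(x_j | theta) for one observed X with f = 2 features
cond_probs = {'A': [0.8, 0.5], 'B': [0.1, 0.4]}


def naive_posterior(priors, cond_probs):
    """Return normalized posteriors P(theta | X) under the independence assumption."""
    unnormalized = {}
    for theta, prior in priors.items():
        likelihood = 1.0
        for p in cond_probs[theta]:
            likelihood *= p  # product over the features
        unnormalized[theta] = likelihood * prior
    # Evidence P(X) = sum over theta of P(X | theta) P(theta)
    evidence = sum(unnormalized.values())
    return {theta: score / evidence for theta, score in unnormalized.items()}


posterior = naive_posterior(priors, cond_probs)
print(posterior)
```

Dividing by the evidence only rescales the scores so that they sum to 1. It does not change which event has the largest posterior, which is why the proportional form is enough for classification.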

### 1.2 A Typical Naive Bayes Problem 

In machine learning, Naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features [Wikipedia](https://en.wikipedia.org/wiki/Naive_Bayes_classifier).

Suppose output $y$ is assigned to one of the classes $C_{1},...,C_{K}$, then the Naive Bayes classification problem is to **maximize** the posterior probability:

\begin{align}
y &= \mathop{\arg\max}_{k\in\{1,...,K\}}P(C_{k}|X) \\
&= \mathop{\arg\max}_{k\in\{1,...,K\}}P(C_{k})\prod_{j=1}^{f}P(x_{j}|C_{k}) \\
\end{align}

This expression has two components:
1. Prior probability of class $C_{k}$: $P(C_{k})$ for $k=1,...,K$
2. Conditional probability of each feature $j$: $P(x_{j}|C_{k})$ for $j=1,...,f$

Given the data, we need to calculate (1) the prior probability of each class, and (2) the probability of each feature value conditional on each class.

### 1.3 An Example

The dataset for this example can be found [here](http://www.inf.u-szeged.hu/~ormandi/ai2/06-naiveBayes-example.pdf).

Attributes information:
1. Color: Car colour
2. Type: Car type
3. Origin: Whether the car is domestic or imported
4. Stolen: Whether the car was stolen (written as 1=Yes, 0=No in the formulas; the code uses the labels `Yes` and `No`)

We aim to predict whether a (Red, Domestic, SUV) car will be stolen or not. <br/>

So the problem becomes: <br/>
Comparing $ P(Color=Red | Stolen=1)P(Type=SUV | Stolen=1)P(Origin=Domestic | Stolen=1)P(Stolen=1) $ <br/>
with $ P(Color=Red | Stolen=0)P(Type=SUV | Stolen=0)P(Origin=Domestic | Stolen=0)P(Stolen=0) $. <br/>

You need to calculate these conditional probabilities and the class prior probabilities.

In [None]:
# Create this dataset
Color = ['Red','Red','Red','Yellow','Yellow','Yellow','Yellow','Yellow','Red','Red']
Type = ['Sports','Sports','Sports','Sports','Sports','SUV','SUV','SUV','SUV','Sports']
Origin = ['Domestic','Domestic','Domestic','Domestic','Imported','Imported','Imported','Domestic','Imported','Imported']
Stolen = ['Yes','No','Yes','No','Yes','No','Yes','No','No','Yes']
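# Materialize the rows as a list so that len() and repeated iteration work (zip is lazy in Python 3)
data = list(zip(Stolen, Color, Type, Origin))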
colnames = ['Stolen','Color','Type','Origin']

In [None]:
data

Create a dictionary of dictionaries for each class and each feature.
For example, the dictionary `data_dict` has two keys, `Yes` and `No`. Each key maps to a dictionary whose keys are the 3 features `Color`, `Type` and `Origin`. Each feature in turn maps to a dictionary whose keys are the feature's values and whose values are the number of occurrences.

In [None]:
# Example: data_dict['Yes']['Color']['Red'] = 3 -> data_dict[Class_label][Feature_name][Feature_value] = Number_of_occurrences
from collections import defaultdict

def load_data(dataset, column_names):
    '''This function is used to load dataset and transform to the nested dictionary as illustrated above.
    Inputs:
    1) dataset: Dataset as a list of tuples
    2) column_names: Column names of this dataset
    
    Outputs:
    1) data_dictionary: A nested dictionary
    2) prior_class: A dictionary of frequencies for each class label
    3) prior_prob: A dictionary of prior probabilities for each class label
    
    
    '''
    
    # Initialize this dictionary
    data_dictionary = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: 0)))
    prior_class = defaultdict(lambda: 0)
    prior_prob = defaultdict(lambda: 0)
    
    # Number of rows
    nrow = len(dataset)

    for row in dataset:
        
        # Update priors for class labels
        prior_class[row[0]] += 1
        
        # Update prior probabilities for class labels
        prior_prob[row[0]] += 1/nrow
        
        # Update nested dictionary for feature values of each class label
        for j in range(1, len(row)):
            data_dictionary[row[0]][column_names[j]][row[j]] += 1
            
    return (data_dictionary, prior_class, prior_prob)    
    

In [None]:
data_dict, prior, prior_probability = load_data(data, colnames)

In [None]:
def predict(data_dictionary, prior_class, prior_prob, new_dataset, new_data_colnames):
    '''Calculate the unnormalized posterior score of each class label for new_dataset.
    
    Inputs:
    1) data_dictionary: A nested dictionary
    2) prior_class: A dictionary for class labels
    3) prior_prob: A dictionary of prior probabilities for each class label
    4) new_dataset: A tuple of features
    5) new_data_colnames: Column names of this new_data
    
    Outputs:
    1) posterior: A dictionary of unnormalized posterior scores (prior x likelihood) for each class label
    
    '''
    
    # Initialize the scores with a copy of the prior probabilities,
    # so that the caller's prior_prob dictionary is not modified in place
    posterior = dict(prior_prob)
    
    # Multiplying priors with likelihood values
    for key in posterior.keys():
        for k in range(len(new_dataset)):
            # Calculate conditional probability term of this feature
            cond_pr = data_dictionary[key][new_data_colnames[k]][new_dataset[k]] / prior_class[key]
            
            # Multiply
            posterior[key] *= cond_pr

    return posterior

In [None]:
# Let us have a test
# Remember we aim to predict whether a (Red, Domestic, SUV) car will be stolen or not. 
new_data = ['Red','SUV','Domestic']
new_column_names = ['Color','Type','Origin']
predict(data_dict, prior, prior_probability, new_data, new_column_names)

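Since the score for 'No' is larger than the score for 'Yes', the classifier predicts that this car will not be stolen. Note that these scores are unnormalized: they are proportional to the posterior probabilities but do not sum to 1.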
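
The arithmetic behind this result can be checked by hand from the counts in the dataset, where each class has 5 of the 10 rows and $X=(Red, SUV, Domestic)$:
\begin{align}
P(Stolen=1|X) &\propto \frac{5}{10}\times\frac{3}{5}\times\frac{1}{5}\times\frac{2}{5} = 0.024 \\
P(Stolen=0|X) &\propto \frac{5}{10}\times\frac{2}{5}\times\frac{3}{5}\times\frac{3}{5} = 0.072
\end{align}

Dividing each score by the evidence $0.024+0.072=0.096$ gives the normalized posteriors $P(Stolen=1|X)=0.25$ and $P(Stolen=0|X)=0.75$.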

### Questions:
1. How to calculate posterior probability if $P(x_{j}|C_{k})=0$? <br/>
Answer: [Laplace smoothing](http://classes.engr.oregonstate.edu/eecs/winter2011/cs434/notes/bayes-6.pdf) <br/>
2. What if $x_{j}$ is a continuous variable rather than a categorical variable? <br/>
Answer: Assume that $P(x_{j}|C_{k})$ follows a Normal/Gaussian distribution within each class <br/>
3. How to overcome floating-point underflow when you calculate the data likelihood $\prod_{j=1}^{f}P(x_{j}|C_{k})$? <br/>
Answer: Sum the logarithms of the probabilities instead of multiplying them, and transform back only if the probabilities themselves are needed
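
The first and third answers can be sketched together. The snippet below is a minimal illustration, not a reference implementation: the function names `smoothed_prob` and `log_score`, the smoothing strength `alpha=1`, and the toy counts are assumptions chosen for this example. Add-one (Laplace) smoothing replaces `count / class_total` with `(count + alpha) / (class_total + alpha * n_values)`, and summing logarithms avoids multiplying many small numbers.

```python
from math import log, exp


def smoothed_prob(count, class_total, n_values, alpha=1.0):
    """Laplace-smoothed estimate of P(x_j = v | C_k).

    count: occurrences of value v within class C_k
    class_total: number of training rows in class C_k
    n_values: number of distinct values that feature j can take
    """
    return (count + alpha) / (class_total + alpha * n_values)


def log_score(prior, cond_probs):
    """Log of prior * product of conditionals, computed as a sum of logs."""
    return log(prior) + sum(log(p) for p in cond_probs)


# Toy counts (assumed): a class with 5 rows and a feature with 2 possible values
p_unseen = smoothed_prob(0, 5, 2)  # 1/7 instead of 0
p_seen = smoothed_prob(3, 5, 2)    # 4/7 instead of 3/5

# Compare classes in log space; exponentiate only if the product itself is needed
score = log_score(0.5, [p_unseen, p_seen])
print(p_unseen, p_seen, score, exp(score))
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest posterior, so the comparison can stay in log space.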

## 2 Try Open-Source Package
### 2.1 Scikit-Learn
The package `scikit-learn` can be found at http://scikit-learn.org/stable/index.html. <br/>
Please install the package first. <br/>
The introduction of Naive Bayes function can be found [here](http://scikit-learn.org/stable/modules/naive_bayes.html).

In [None]:
# Let us try a Gaussian Naive Bayes Method
from sklearn.naive_bayes import GaussianNB

In [None]:
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

In [None]:
# Train the model
clf = GaussianNB().fit(X, Y)

In [None]:
# Suppose we want to know which class [-0.8, -1] belongs to
new_data = np.array([[-0.8, -1]])
print(clf.predict(new_data))

In [None]:
# Posterior probabilities for each class label
print(clf.classes_)
print(clf.predict_proba(new_data))

More about `GaussianNB` can be found at http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html.
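
To see what a Gaussian Naive Bayes model computes, the following pure-Python sketch estimates a per-class mean and variance for each feature and scores the new point with Gaussian log densities. It is an illustration under stated assumptions, not scikit-learn's implementation: the helper names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical, and the sketch omits the small variance-smoothing term (`var_smoothing`) that `GaussianNB` adds for numerical stability, so its numbers may differ slightly from the library's.

```python
from math import log, pi

# Same toy data as in the cells above
X_train = [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]
y_train = [1, 1, 1, 2, 2, 2]


def fit_gaussian_nb(X, y):
    """Estimate the prior and the per-feature mean/variance for each class."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        n = float(len(rows))
        means = [sum(col) / n for col in zip(*rows)]
        # Population (maximum-likelihood) variance of each feature
        variances = [sum((v - m) ** 2 for v in col) / n
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(y), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the largest log posterior score, plus all scores."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = log(prior)
        for value, m, var in zip(x, means, variances):
            # Log of the Normal density N(value; m, var)
            score += -0.5 * log(2 * pi * var) - (value - m) ** 2 / (2 * var)
        scores[label] = score
    return max(scores, key=scores.get), scores


model = fit_gaussian_nb(X_train, y_train)
label, scores = predict_gaussian_nb(model, [-0.8, -1])
print(label)
```

The scores are unnormalized log posteriors, so only their ordering matters for the predicted class.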

## 3 Summary of Naive Bayes

### Advantages
* Easy to implement
* Often gives good results in practice

### Disadvantages
* Assumes class-conditional independence between features, which can cost accuracy when the assumption is violated
* In practice, dependencies among variables usually exist
* Such dependencies cannot be modeled by a Naïve Bayes classifier