$\def\*#1{\mathbf{#1}}$
$\DeclareMathOperator*{\argmax}{arg\,max}$

# Mathematical Models : the case of the Naive Bayes classifier

In [None]:
from scipy import stats
import matplotlib.pyplot as plt
import numpy as np
from math import log
%matplotlib inline

## Classifier

A classifier is a function $M$ that predicts the class label $\hat{y} \in \{c_1, c_2, \ldots, c_k\}$ for a given point $\*x \in \mathbb{R}^d$ :

$$\hat{y} = M(\*x)$$

## Supervised Learning

1. Collect a **Training Data Set** composed of points along with their known classes
2. Train Classifier
3. Make Predictions

### Collect Training Data

A training set of points with their known classes :

| Day | Outlook | Temp | Humidity | Beach? |
|-----|---------|------|----------|--------|
| 1   | Sunny   | High | High     | Yes    |
| 2   | Sunny   | High | Normal   | Yes    |
| 3   | Sunny   | Low  | Normal   | No     |
| 4   | Sunny   | Mild | High     | Yes    |
| 5   | Rain    | Mild | Normal   | No     |
| 6   | Rain    | High | High     | No     |
| 7   | Rain    | Low  | Normal   | No     |
| 8   | Cloudy  | High | High     | No     |
| 9   | Cloudy  | High | Normal   | Yes    |
| 10  | Cloudy  | Mild | Normal   | No     |

## Probabilistic Classification

We consider a dataset $\*D = \{\*x_1, \ldots, \*x_n\}$ of $n$ points, where each point $\*x_i$ has a known class $y_i \in \{c_1, c_2, \ldots, c_k\}$, for all $i = 1, 2, \ldots, n$.

The **predicted class** for a new data point $\*x$ is obtained as follows :

$$
\hat{y} = \argmax_{c_i}\{P(c_i | \*x)\}
$$

**Bayes' formula**

Let $E_1, E_2, \ldots, E_k$ be a partition of $\Omega$ :

* $\Omega = E_1 \cup E_2 \cup \ldots \cup E_k$ 
* $E_i \cap E_j = \emptyset$, for all pairs $i \neq j$ with $i, j \in \{1, 2, \ldots, k\}$

Based on the probability axioms and elementary set theory, it can be shown that the **marginal probability** $P(E)$ of any event $E \subseteq \Omega$ is equal to :

$$
P(E) = P(E \cap \Omega) = P(E_1 \cap E) + P(E_2 \cap E) + \ldots + P(E_k \cap E)
$$

Based on conditional probability, this can be rewritten as follows :

$$
P(E) = P(E_1) P(E | E_1) + P(E_2) P(E | E_2) + \ldots + P(E_k) P(E | E_k)
$$

This leads to **Bayes' formula** :

$$
P(E_i|E) = \frac{P(E_i \cap E)}{P(E)} = \frac{P(E_i)P(E|E_i)}{\sum_{l = 1}^k P(E_l)P(E|E_l)}
$$
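As a small numerical check of the two formulas above, the sketch below uses the beach training table: the partition is $E_1$ = "Beach" and $E_2$ = "No Beach", and the event $E$ is "Outlook is Sunny". The function names `total_probability` and `bayes_posteriors` are illustrative choices, not part of any library, and the probabilities are simply the relative frequencies read off the table.

```python
from fractions import Fraction


def total_probability(priors, likelihoods):
    """P(E) = sum over l of P(E_l) * P(E | E_l)."""
    return sum(p * l for p, l in zip(priors, likelihoods))


def bayes_posteriors(priors, likelihoods):
    """P(E_i | E) for every event E_i of the partition."""
    evidence = total_probability(priors, likelihoods)
    return [p * l / evidence for p, l in zip(priors, likelihoods)]


# P(Beach) = 4/10 and P(No Beach) = 6/10 in the training table
priors = [Fraction(4, 10), Fraction(6, 10)]
# P(Sunny | Beach) = 3/4 and P(Sunny | No Beach) = 1/6
likelihoods = [Fraction(3, 4), Fraction(1, 6)]

evidence = total_probability(priors, likelihoods)
posteriors = bayes_posteriors(priors, likelihoods)

print(evidence)    # 2/5
print(posteriors)  # [Fraction(3, 4), Fraction(1, 4)]
```

The result agrees with a direct count: 4 of the 10 days are sunny, and 3 of those 4 sunny days are beach days.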

### Bayes Classifier

Based on this formula, the **predicted class** for a new data point $\*x$ is obtained as follows :
$$
\begin{align}
\hat{y} &= \argmax_{c_i}\{P(c_i | \*x)\}\\
        &= \argmax_{c_i}\Bigg\{\frac{P(\*x | c_i) P(c_i)}{P(\*x)}\Bigg\}\\
        &= \argmax_{c_i}\{P(\*x | c_i) P(c_i)\}
\end{align}
$$

**Prior Probability Estimator** :

$$
\hat{P}(c_i) = \frac{|\*D_i|}{|\*D|} = \frac{n_i}{n}
$$

where $\*D_i = \{\*x_j \in \*D : y_j = c_i\}$, $|\*D| = n$, and $|\*D_i| = n_i$.
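The prior estimator amounts to counting class labels. A minimal sketch, assuming the labels of the beach training table are stored in a plain Python list (the helper name `estimate_priors` is hypothetical):

```python
from collections import Counter

# "Beach?" column of the training table, days 1 to 10
labels = ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "Yes", "No"]


def estimate_priors(labels):
    """Return the estimate n_i / n of P(c_i) for every class c_i."""
    n = len(labels)
    counts = Counter(labels)  # n_i for each class
    return {c: n_i / n for c, n_i in counts.items()}


priors = estimate_priors(labels)
print(priors)  # {'Yes': 0.4, 'No': 0.6}
```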

**Likelihood estimator**, *i.e.* an estimator for $P(\*x | c_i)$ : see previous labs/lectures.

### Naive Bayes Classifier
In this case we assume that the attributes are conditionally independent given the class. This allows us to write the likelihood as follows :
    
$$
P(\*x|c_i) = P(x_1, x_2, \ldots, x_d | c_i) = \prod_{j = 1}^d P(x_j|c_i)
$$

For **categorical attributes**, 

$$
P(\*x|c_i) = \prod_{j = 1}^d P(x_j|c_i) \approx \prod_{j = 1}^d \hat{f}(\*v_j | c_i) = \prod_{j = 1}^d \frac{n_i(\*v_j)}{n_i}
$$

where $n_i(\*v_j)$ is the number of points of class $c_i$ whose attribute $X_j$ takes the value $\*v_j$, *i.e.* the observed frequency of $\*v_j$ in the class $c_i$.

| Day | Outlook | Temp | Humidity | Beach? |
|-----|---------|------|----------|--------|
| 1   | Sunny   | High | High     | Yes    |
| 2   | Sunny   | High | Normal   | Yes    |
| 3   | Sunny   | Low  | Normal   | No     |
| 4   | Sunny   | Mild | High     | Yes    |
| 5   | Rain    | Mild | Normal   | No     |
| 6   | Rain    | High | High     | No     |
| 7   | Rain    | Low  | Normal   | No     |
| 8   | Cloudy  | High | High     | No     |
| 9   | Cloudy  | High | Normal   | Yes    |
| 10  | Cloudy  | Mild | Normal   | No     |

| Outlook      | Beach | No Beach |
|--------------|-------|----------|
| Sunny        | 3/4   | 1/6      |
| Rain         | 0/4   | 3/6      |
| Cloudy       | 1/4   | 2/6      |

| Temperature  | Beach | No Beach |
|--------------|-------|----------|
| High         | 3/4   | 2/6      |
| Mild         | 1/4   | 2/6      |
| Low          | 0/4   | 2/6      |

| Humidity     | Beach | No Beach |
|--------------|-------|----------|
| High         | 2/4   | 2/6      |
| Normal       | 2/4   | 4/6      |

For example, based on the weather conditions (Sunny, Mild, High), would this be a day to go to the beach ?
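One way to answer this question is to evaluate $\hat{P}(c_i) \prod_j \hat{f}(\*v_j | c_i)$ for both classes and keep the larger value. The sketch below recomputes the frequencies directly from the training table. The function name `naive_bayes_scores` is a hypothetical helper, and no smoothing is applied, so a value never observed in a class (such as Rain for "Yes") would give a score of zero.

```python
from collections import Counter
from fractions import Fraction

# Training table: (Outlook, Temp, Humidity, Beach?)
data = [
    ("Sunny", "High", "High", "Yes"),
    ("Sunny", "High", "Normal", "Yes"),
    ("Sunny", "Low", "Normal", "No"),
    ("Sunny", "Mild", "High", "Yes"),
    ("Rain", "Mild", "Normal", "No"),
    ("Rain", "High", "High", "No"),
    ("Rain", "Low", "Normal", "No"),
    ("Cloudy", "High", "High", "No"),
    ("Cloudy", "High", "Normal", "Yes"),
    ("Cloudy", "Mild", "Normal", "No"),
]


def naive_bayes_scores(data, x):
    """Return P(c_i) * prod_j n_i(v_j) / n_i for every class c_i."""
    n = len(data)
    class_counts = Counter(row[-1] for row in data)  # n_i
    scores = {}
    for c, n_c in class_counts.items():
        score = Fraction(n_c, n)  # prior estimate n_i / n
        for j, v in enumerate(x):
            # n_i(v_j): points of class c whose attribute j equals v
            n_c_v = sum(1 for row in data if row[-1] == c and row[j] == v)
            score *= Fraction(n_c_v, n_c)
        scores[c] = score
    return scores


scores = naive_bayes_scores(data, ("Sunny", "Mild", "High"))
prediction = max(scores, key=scores.get)

print(scores["Yes"], scores["No"])  # 3/80 1/90
print(prediction)                   # Yes
```

Since $3/80 = 0.0375$ is larger than $1/90 \approx 0.0111$, the predicted class is "Yes".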

For **numerical attributes**, we can, for example, assume a normal distribution for each class $c_i$ :

$$
P(x_j | c_i) \propto f(x_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2} } \; e^{ -\frac{(x_j-\mu_{ij})^2}{2\sigma_{ij}^2} }
$$

where $\mu_{ij}$ and $\sigma_{ij}^2$ denote the mean and variance for attribute $X_j$ and class $c_i$.
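To make the role of $\mu_{ij}$ and $\sigma_{ij}^2$ concrete, here is a minimal pure-Python sketch of a Gaussian naive Bayes classifier. It is an illustration under stated assumptions, not the scikit-learn implementation: the toy data and the names `fit_gaussian_nb` and `predict_gaussian_nb` are made up, the variance is the population variance of each attribute within each class, and every attribute is assumed to vary within each class (a zero variance would cause a division by zero). Scores are summed in log space to avoid multiplying many small densities.

```python
import math
from statistics import mean, pvariance


def gaussian_pdf(x, mu, var):
    """Normal density with mean mu and variance var, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def gaussian_log_pdf(x, mu, var):
    """Logarithm of gaussian_pdf, computed without exponentiating."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def fit_gaussian_nb(points, labels):
    """Estimate the prior, the means and the variances of every class."""
    model = {}
    for c in sorted(set(labels)):
        rows = [p for p, y in zip(points, labels) if y == c]
        columns = list(zip(*rows))  # one tuple of values per attribute
        model[c] = {
            "prior": len(rows) / len(points),
            "means": [mean(col) for col in columns],
            "variances": [pvariance(col) for col in columns],
        }
    return model


def predict_gaussian_nb(model, x):
    """Return the class maximising log P(c_i) + sum_j log f(x_j)."""
    def log_score(params):
        score = math.log(params["prior"])
        for x_j, mu, var in zip(x, params["means"], params["variances"]):
            score += gaussian_log_pdf(x_j, mu, var)
        return score
    return max(model, key=lambda c: log_score(model[c]))


# Hypothetical toy data: two well-separated classes in two dimensions
points = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]]
labels = ["a", "a", "a", "b", "b", "b"]

model = fit_gaussian_nb(points, labels)
print(predict_gaussian_nb(model, [1.1, 2.0]))  # a
print(predict_gaussian_nb(model, [3.1, 0.4]))  # b
```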

In [None]:
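from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

# Load the iris data set (numerical attributes, three classes)
iris = datasets.load_iris()

# Fit a Gaussian naive Bayes model, then predict the classes of the same points.
# Note : this measures the error on the training data, not on unseen data.
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)

n_errors = (iris.target != y_pred).sum()
print("Number of mislabeled points out of a total {} points : {}".format(iris.data.shape[0], n_errors))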

In [None]:
# naive_bayes.MultinomialNB([alpha, ...]) : Naive Bayes classifier for multinomial models
# naive_bayes.BernoulliNB([alpha, binarize, ...]) : Naive Bayes classifier for multivariate Bernoulli models.

## Classifier Evaluation

### Performance Measure

* $\*D$ : the testing set composed of $n$ points in a $d$ dimensional space
* $k$ : the number of classes
* $M$ : the classifier
* $y_i$ : the "true" class that corresponds to $\*x_i \in \*D$
* $\hat{y}_i = M(\*x_i)$ : the predicted class
    
#### Error Rate

The fraction of incorrect predictions :

$$
Error\ Rate = \frac{1}{n} \sum_{i = 1}^n I(y_i \neq \hat{y}_i)
$$

#### Accuracy

The fraction of correct predictions :
$$
Accuracy = \frac{1}{n} \sum_{i = 1}^n I(y_i = \hat{y}_i) = 1 - Error\ Rate
$$
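Both measures are simple averages of an indicator function. A minimal sketch, with hypothetical label lists chosen only for illustration:

```python
def error_rate(y_true, y_pred):
    """Fraction of points whose predicted class differs from the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of points whose predicted class equals the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true and predicted classes for five test points
y_true = ["Yes", "No", "No", "Yes", "No"]
y_pred = ["Yes", "No", "Yes", "Yes", "No"]

print(error_rate(y_true, y_pred))  # 0.2
print(accuracy(y_true, y_pred))    # 0.8
```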

### Evaluation

The data set $\*D$ is split into two disjoint sets :

* **Training set** : used to train $M$
* **Testing set** : used to evaluate the performance measure (denoted $\theta$) of $M$
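A random split into two disjoint sets could be sketched as follows. The helper `holdout_split`, the 30% test fraction and the fixed seed are assumptions made for this illustration, not values prescribed by the text.

```python
import random


def holdout_split(points, labels, test_fraction=0.3, seed=0):
    """Randomly split the data into disjoint training and testing sets."""
    indices = list(range(len(points)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(test_fraction * len(points)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([points[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([points[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


# Hypothetical data: ten points identified by their day number
points = list(range(1, 11))
labels = ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "Yes", "No"]

(train_points, train_labels), (test_points, test_labels) = holdout_split(points, labels)
print(len(train_points), len(test_points))  # 7 3
```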

#### $K$-fold Cross-Validation

The data set $\*D$ is divided into $K$ equal-sized parts, denoted $\*D_1, \*D_2, \ldots, \*D_K$, called folds.

For each $i = 1,2,\ldots,K$, a model $M_i$ is trained and the corresponding measure $\theta_i$ is evaluated with respect to :

* $\*D_i$ : the testing set
* $\*D \backslash \*D_i = \cup_{j \neq i} \*D_j$ : the training set

This leads to the following estimates of the expected value and the variance of the performance measure :

* $\displaystyle \hat{\mu}_{\theta} = \frac{1}{K} \sum_{i=1}^K \theta_i$
* $\displaystyle \hat{\sigma}_{\theta}^2 = \frac{1}{K} \sum_{i=1}^K (\theta_i - \hat{\mu}_{\theta})^2$

This validation can be repeated several times, each time with a new random partition of the data set $\*D$ into folds.
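The procedure can be sketched in pure Python as follows. The `train` and `evaluate` arguments are placeholders for any classifier and any performance measure. The majority-class classifier used here is only a hypothetical stand-in so that the example stays self-contained, and the folds differ in size by at most one point when $n$ is not a multiple of $K$.

```python
import random
from collections import Counter
from statistics import mean, pvariance


def k_fold_indices(n, k, seed=0):
    """Randomly partition the indices 0..n-1 into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(points, labels, k, train, evaluate, seed=0):
    """Return the measures theta_i with their mean and (population) variance."""
    folds = k_fold_indices(len(points), k, seed)
    thetas = []
    for i, test_idx in enumerate(folds):
        # Training set: union of all folds except fold i
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = train([points[j] for j in train_idx], [labels[j] for j in train_idx])
        theta_i = evaluate(model, [points[j] for j in test_idx], [labels[j] for j in test_idx])
        thetas.append(theta_i)
    return thetas, mean(thetas), pvariance(thetas)


def train_majority(points, labels):
    """Hypothetical stand-in classifier: remember the most frequent class."""
    return Counter(labels).most_common(1)[0][0]


def accuracy_of_majority(model, points, labels):
    """Accuracy of always predicting the remembered class."""
    return sum(model == y for y in labels) / len(labels)


labels = ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "Yes", "No"]
points = list(range(len(labels)))  # placeholders, ignored by this classifier

thetas, mu_hat, var_hat = cross_validate(points, labels, 5, train_majority, accuracy_of_majority)
print(thetas, mu_hat, var_hat)
```

Calling `cross_validate` with different values of `seed` gives one way to repeat the validation over different random partitions.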

## References

* **The Data Science Design Manual**, by Steven Skiena, 2017, Springer;
* Python notebooks available at [http://data-manual.com/data](http://data-manual.com/data);
* Lecture slides available at [http://www3.cs.stonybrook.edu/~skiena/data-manual/lectures/](http://www3.cs.stonybrook.edu/~skiena/data-manual/lectures/);
* The Google Developers channel : Machine Learning Recipes;
* Spring 2015 Boston University CS591 "Tools and Techniques for Data Mining and Applications" course.