# Naive Bayes Classifier

---

In this tutorial, we will introduce the **naive Bayes classifier**, which is a simple yet effective classification method based on probability and reasoning under uncertainty.

## Bayes Rule <a name="bayes-rule"></a>

---

First, we introduce the Bayes rule, which the naive Bayes classifier is based on.

From the <a href="https://en.wikipedia.org/wiki/Chain_rule_(probability)">product/chain rule</a>, we know that given two propositions $A$ and $B$, we have 

$$
\begin{aligned}
P(A,B) & = P(B) * P(A\ |\ B) \\
& = P(A) * P(B\ |\ A).
\end{aligned}
$$

Dividing both sides of the last equality, $P(B) * P(A\ |\ B) = P(A) * P(B\ |\ A)$, by $P(B)$ gives

$$
P(A\ |\ B) = \frac{P(A) * P(B\ |\ A)}{P(B)}.
$$

The above equation is exactly the **Bayes rule**. It is named after the famous statistician [Thomas Bayes](https://en.wikipedia.org/wiki/Thomas_Bayes), who proposed this rule/theorem.
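As a quick sanity check, the identity can be verified numerically. The sketch below uses a small, made-up joint distribution over two binary propositions (the numbers are purely illustrative) and exact fractions, so that both sides can be compared without rounding.

```python
from fractions import Fraction

# Hypothetical joint distribution P(A, B); the values are made up and sum to 1.
joint = {
    (True, True): Fraction(1, 10),
    (True, False): Fraction(2, 10),
    (False, True): Fraction(3, 10),
    (False, False): Fraction(4, 10),
}

# Marginals: sum the joint over the other proposition.
p_a = sum(p for (a, _), p in joint.items() if a)
p_b = sum(p for (_, b), p in joint.items() if b)

# Conditionals from the product rule, e.g. P(A | B) = P(A, B) / P(B).
p_a_given_b = joint[True, True] / p_b
p_b_given_a = joint[True, True] / p_a

# Bayes rule recovers P(A | B) from P(A), P(B | A) and P(B).
bayes = p_a * p_b_given_a / p_b
print(p_a_given_b, bayes)  # 1/4 1/4
```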

We can also extend the Bayes rule to more variables. Given propositions $X_1, X_2, \dots, X_n, Y$, we have 

$$
P(Y\ |\ X_1, X_2, \dots, X_n) = \frac{P(Y) * P(X_1, X_2, \dots, X_n\ |\ Y)}{P(X_1, X_2, \dots, X_n)}.
$$

<!-- <img src='img/thomas-bayes.gif' width=200></img> -->

### Why is Bayes rule useful?

The Bayes rule looks very simple, and is just one step away from the product rule. However, it is very helpful for reasoning under uncertainty due to the following interpretation.

- Let $A$ be the proposition that we want to reason about. We call it the **query proposition**, and its probability $P(A)$ is called the **prior probability/belief**.
- Let $B$ be another proposition that we can observe as evidence. We call it the **evidence proposition**, and its probability $P(B)$ indicates the probability that we observe the evidence.
- $P(A\ |\ B)$ is the probability of the query $A$ given that the evidence $B$ has been observed. This is called the **posterior probability/belief** given the evidence $B$.
- $P(B\ |\ A)$ is the conditional probability of the evidence $B$ given that $A$ is true.

In most cases, it is easier to estimate or calculate $P(A)$, $P(B)$ and $P(B\ |\ A)$ than directly estimating $P(A\ |\ B)$. Thus, by the Bayes rule, we can calculate the posterior probability $P(A\ |\ B)$ from our domain knowledge of $P(A)$, $P(B)$ and $P(B\ |\ A)$. 

### Example: Medical Test

You are worried about having a rare cancer. The cancer is very rare, occurring in only one of every 10,000 people. You take a test for the cancer, which has 99% accuracy (if you have the cancer, the test is positive with 99% probability; if you do not have the cancer, the test is negative with 99% probability).

#### QUESTION

If your test result comes back positive, what are your chances that you actually have the cancer?

#### ANSWER

First we define the following propositions:

- $A$: You have the cancer
- $B$: Your test result comes back positive.

From the problem description, we know that the probability of having the cancer is 1/10000, i.e.,

$$
P(A) = \frac{1}{10000} = 0.0001.
$$

In addition, from the 99% accuracy of the test, we have

$$
P(B\ |\ A) = 0.99,
$$

$$
P(\neg B\ |\ \neg A) = 0.99.
$$

The question is to calculate $P(A\ |\ B)$. By Bayes rule, we have

$$
P(A\ |\ B) = \frac{P(A) * P(B\ |\ A)}{P(B)},
$$

where $P(A) = 0.0001$ and $P(B\ |\ A) = 0.99$ are already known. However, $P(B)$ is not known yet. Here, we use the probability rules introduced in a previous [tutorial](https://github.com/meiyi1986/tutorials/blob/master/notebooks/reasoning-under-uncertainty-basics.ipynb) to calculate $P(B)$.

First, we use the **sum rule** to include $A$ jointly with $B$ as follows:

$$
P(B) = P(B, A) + P(B, \neg A).
$$

Then, for each term on the right-hand side, we use the **product rule** as follows.

$$
P(B, A) = P(A) * P(B\ |\ A),
$$

$$
P(B, \neg A) = P(\neg A) * P(B\ |\ \neg A).
$$

The first term can be easily calculated as

$$
\begin{aligned}
P(B, A) & = P(A) * P(B\ |\ A) \\
& = 0.0001 \times 0.99 = 0.000099.
\end{aligned}
$$

To calculate the second term, we use the **normalisation rule** as follows.

$$
P(\neg A) = 1 - P(A) = 0.9999,
$$

$$
P(B\ |\ \neg A) = 1 - P(\neg B\ |\ \neg A) = 0.01.
$$

Thus, we have 

$$
P(B, \neg A) = 0.9999 \times 0.01 = 0.009999.
$$

Overall, we have $P(B) = P(B, A) + P(B, \neg A) = 0.000099 + 0.009999 = 0.010098$.

Finally, we have

$$
P(A\ |\ B) = \frac{P(A) * P(B\ |\ A)}{P(B)} = \frac{0.0001 \times 0.99}{0.010098} \approx 0.01.
$$

We can see that even with a positive test result, the chance of actually having the cancer is still only about 1%. This is due to the very small prior probability $P(A) = 0.0001$. Note that the positive test result already makes the posterior probability about 100 times the prior probability.
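The calculation above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `posterior` and its parameter names are chosen here for illustration and are not part of any library.

```python
def posterior(prior, p_pos_given_true, p_neg_given_false):
    """P(A | B) for a binary query A after observing a positive result B."""
    # Normalisation rule: P(B | not A) = 1 - P(not B | not A).
    p_pos_given_false = 1 - p_neg_given_false
    # Sum rule + product rule: P(B) = P(A) P(B | A) + P(not A) P(B | not A).
    p_pos = prior * p_pos_given_true + (1 - prior) * p_pos_given_false
    # Bayes rule.
    return prior * p_pos_given_true / p_pos


p = posterior(prior=0.0001, p_pos_given_true=0.99, p_neg_given_false=0.99)
print(f"{p:.4f}")           # 0.0098
print(f"{p / 0.0001:.0f}")  # 98
```

The second printed value is the ratio between the posterior and the prior, which is where the "about 100 times" comes from.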

## Naive Bayes Classifier Algorithm <a name="nb"></a>

---

### Classification by Probability <a name="prob"></a>

In classification, an instance is represented as a feature vector. If we consider the classification problem from the perspective of probability, we are asking the following question:

> **QUESTION**: What is the **conditional probability** of the **class label** of an instance given its **feature vector**?

For this purpose, we define the class label and each feature as a random variable. Let $Y$ be the random variable for the class label, and $X_i$ the random variable for the $i$th feature. Given an instance (feature vector) $[X_1 = x_1, \dots, X_n = x_n]$, we aim to estimate

$$
P(Y = y\ |\ X_1 = x_1, \dots, X_n = x_n)
$$

for each class label $y$. Then, we can use the **winner-take-all** approach, and predict the class label with the highest probability.
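In code, the winner-take-all step is simply an argmax over the class labels. The sketch below assumes the conditional probabilities have already been estimated; the labels and numbers are made up for illustration.

```python
# Hypothetical conditional probabilities P(Y = y | x) for one instance.
posteriors = {"class_a": 0.7, "class_b": 0.2, "class_c": 0.1}

# Winner-take-all: predict the class label with the highest probability.
predicted = max(posteriors, key=posteriors.get)
print(predicted)  # class_a
```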

### Use Bayes Rule <a name="nb-rule"></a>

To directly estimate $P(Y = y\ |\ X_1 = x_1, \dots, X_n = x_n)$ from the training data, we need to

1. Find the instances with $[X_1 = x_1, \dots, X_n = x_n]$ in the training data. Say that we have found $M$ such instances from the training data.
2. Find the instances with class label $Y = y$ among those in Step 1. Say we have found $K (K \leq M)$ such instances.
3. Calculate the empirical probability $P(Y = y\ |\ X_1 = x_1, \dots, X_n = x_n) = \frac{K}{M}$.

Normally, this direct estimation requires a **LOT** of training instances to ensure a reasonable accuracy of the empirical probability for any possible test instance (feature vector). For example, if there are 10 binary features and 2 classes, the number of different feature vectors is $2^{10} = 1024$. If we need at least 30 instances for each possible feature vector and each class to ensure the accuracy of the empirical probability, then we need at least $30 \times 1024 \times 2 = 61440$ instances in total.
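The three steps, and the instance count from the example, can be sketched as follows. The helper name `direct_estimate` and the toy data are assumptions made for illustration only.

```python
def direct_estimate(instances, labels, x, y):
    """Empirical P(Y = y | X = x) as K / M, or None if x never occurs."""
    # Step 1: labels of the M training instances whose feature vector equals x.
    matching = [label for inst, label in zip(instances, labels) if inst == x]
    if not matching:
        return None  # M = 0, so no estimate is possible
    # Step 2: K of those instances have class label y.
    k = sum(1 for label in matching if label == y)
    # Step 3: the empirical probability K / M.
    return k / len(matching)


# Toy training data with two binary features (made up).
instances = [(0, 1), (0, 1), (0, 1), (1, 0)]
labels = ["pos", "pos", "neg", "neg"]

seen = direct_estimate(instances, labels, (0, 1), "pos")    # K = 2, M = 3
unseen = direct_estimate(instances, labels, (1, 1), "pos")  # never observed
print(seen, unseen)

# Instances needed in the example: 30 per (feature vector, class) pair.
required = 30 * 2 ** 10 * 2
print(required)  # 61440
```

The `None` case shows the practical problem: any feature vector that never appears in the training data gets no estimate at all.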

To address this issue, the naive Bayes classifier uses the Bayes rule as follows:

$$
P(Y = y\ |\ X_1 = x_1, \dots, X_n = x_n) = \frac{P(Y = y) * P(X_1 = x_1, \dots, X_n = x_n\ |\ Y = y)}{P(X_1 = x_1, \dots, X_n = x_n)}
$$

### Conditional Independence Assumption <a name="independence"></a>

Note that the use of Bayes rule still does not relax the requirement, since we need the same number of instances to estimate $P(X_1 = x_1, \dots, X_n = x_n\ |\ Y)$ and $P(X_1 = x_1, \dots, X_n = x_n)$. Therefore, the naive Bayes classifier further makes the following assumption:

> **DEFINITION**: In the **conditional independence assumption**, the features are assumed to be conditionally independent of each other given the class label. 

For each $x_1 \in \Omega(X_1), \dots, x_n \in \Omega(X_n), y \in \Omega(Y)$, we have

$$
P(X_1 = x_1, \dots, X_n = x_n\ |\ Y = y) = P(X_1 = x_1\ |\ Y = y) * \dots * P(X_n = x_n\ |\ Y = y).
$$

Therefore,

$$
\begin{aligned}
P(Y = y\ |\ X_1 = x_1, \dots, X_n = x_n) & = \frac{P(Y = y) * P(X_1 = x_1, \dots, X_n = x_n\ |\ Y = y)}{P(X_1 = x_1, \dots, X_n = x_n)} \\
& = \frac{P(Y = y) * P(X_1 = x_1\ |\ Y = y) * \dots * P(X_n = x_n\ |\ Y = y)}{P(X_1 = x_1, \dots, X_n = x_n)}
\end{aligned}
$$

Thus, we only need to estimate $P(X_i = x_i\ |\ Y = y)$ for each feature independently. This can greatly reduce the amount of training data required, as there are many more instances matching $[X_i = x_i, Y = y]$ for each feature.
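To get a feeling for the saving, the sketch below counts how many probabilities have to be estimated with and without the assumption, reusing the earlier setting of 10 binary features and 2 classes. It then shows the factorised likelihood as a plain product; the per-feature values are made up for illustration.

```python
import math

n_features, n_values, n_classes = 10, 2, 2

# Without the assumption: one probability per (feature vector, class) pair.
joint_entries = n_values ** n_features * n_classes

# With the assumption: one probability per (feature, value, class) triple.
factorised_entries = n_features * n_values * n_classes

print(joint_entries, factorised_entries)  # 2048 40

# Hypothetical values of P(X_i = x_i | Y = y) for one instance and one class.
per_feature = [0.8, 0.6, 0.5]
# Under the assumption, P(x_1, x_2, x_3 | y) is just their product.
likelihood = math.prod(per_feature)
```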

> **NOTE**: The conditional independence assumption is **naive** (this is why the algorithm is called **naive** Bayes classifier). In practice, this assumption is almost always wrong. However, the naive Bayes classifier can often show high classification accuracy. Why?

### No Need for Denominator Calculation <a name="denom"></a>

Note that the denominator $P(X_1 = x_1, \dots, X_n = x_n)$ is independent of the class label. Therefore, it is a constant coefficient in the conditional probabilities of all the class labels. In other words, the conditional probability of each class label is proportional to its numerator. That is,

$$
\begin{aligned}
P(Y = y\ |\ X_1 = x_1, \dots, X_n = x_n) & = \alpha * P(Y = y) * P(X_1 = x_1\ |\ Y = y) * \dots * P(X_n = x_n\ |\ Y = y) \\
& \propto P(Y = y) * P(X_1 = x_1\ |\ Y = y) * \dots * P(X_n = x_n\ |\ Y = y)
\end{aligned}
$$

where $\alpha = 1/P(X_1 = x_1, \dots, X_n = x_n)$ is the constant coefficient.

Therefore, the class label with the highest $P(Y = y) * P(X_1 = x_1\ |\ Y = y) * \dots * P(X_n = x_n\ |\ Y = y)$ will have the highest conditional probability.
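If the actual conditional probabilities are needed, $\alpha$ can be recovered afterwards: by the sum and product rules, the denominator equals the sum of the numerators over all class labels. The sketch below uses made-up scores to illustrate that normalising does not change which class wins.

```python
# Hypothetical class scores P(y) * P(x_1 | y) * ... * P(x_n | y).
scores = {"y1": 0.03, "y2": 0.01}

# alpha = 1 / P(x_1, ..., x_n), where the denominator is the sum of the scores.
alpha = 1 / sum(scores.values())
posteriors = {y: alpha * s for y, s in scores.items()}

# The ranking of the class labels is the same before and after normalising.
best_by_score = max(scores, key=scores.get)
best_by_posterior = max(posteriors, key=posteriors.get)
print(best_by_score, best_by_posterior)  # y1 y1
```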

### Overall Algorithm <a name="overall"></a>

Putting everything together, the overall naive Bayes classifier can be written in Python as follows.

```Python
from collections import Counter


def train(instances, labels):
    """Preprocessing: estimate P(class) and P(feature = value | class).

    instances: list of dicts mapping each feature to its value
    labels: list of class labels, one per instance
    """
    n = len(labels)
    class_count = Counter(labels)  # N[class_label]
    prior = {label: class_count[label] / n for label in class_count}

    # N[feature, value, class_label], counted over the training instances
    joint_count = Counter()
    for instance, label in zip(instances, labels):
        for feature, value in instance.items():
            joint_count[feature, value, label] += 1

    # P[feature, value, class_label] = N[feature, value, class_label] / N[class_label]
    cond = {key: count / class_count[key[2]] for key, count in joint_count.items()}
    return prior, cond


def predict(feature_vector, prior, cond):
    """Make a prediction for a new feature_vector (dict of feature -> value)."""
    pred_class, pred_score = None, -1.0
    for label in prior:
        score = prior[label]
        for feature, value in feature_vector.items():
            # Combinations never seen in training have probability 0
            score *= cond.get((feature, value, label), 0.0)
        if score > pred_score:
            pred_class, pred_score = label, score
    # The score is the numerator only, not a normalised probability
    return pred_class, pred_score
```

## Case Study on Bank Loan Application <a name="case-study"></a>

---

Suppose that we are building a system to approve/decline bank loan applications. The system learns the decision based on three features:

1. The **Job** feature: whether the applicant has a job or not.
2. The **Deposit** feature: whether the applicant has a high or low deposit.
3. The **Family** feature: whether the applicant is single, a couple (without children), or has children.

Below are 10 historical applications with their decisions. We will use them to train our Naive Bayes classifier.

| Applicant | Job | Deposit | Family | Decision |
| --------- | --- | ------- | ------ | -------- |
|     1     | true | low    | single | Approve  |
|     2     | true | low    | couple | Approve  |
|     3     | true | high    | single | Approve  |
|     4     | true | high    | single | Approve  |
|     5     | false | high    | couple | Approve  |
|     6     | true | low    | couple | Decline  |
|     7     | false | low    | couple | Decline  |
|     8     | true | low    | children | Decline  |
|     9     | false | low    | single | Decline  |
|    10     | false | high    | children | Decline  |

To calculate the `P(class)` and `P(feature = value | class)` for each class and feature, we count the **number of instances** matching different class labels and feature values as follows.

| Class | Approve | Decline |
| ----- | ------- | ------- |
| Total |    5    |    5    |
| Job = true |    4    |    2    |
| Job = false |    1    |    3    |
| Dep = low |    2    |    4    |
| Dep = high |    3    |    1    |
| Fam = single |    3    |    1    |
| Fam = couple |    2    |    2    |
| Fam = children |    0    |    2    |

Then, we calculate the (conditional) probabilities as follows.


| Class | Approve | Decline |
| ----- | ------- | ------- |
| Total |    5/10    |    5/10    |
| Job = true |    4/5    |    2/5    |
| Job = false |    1/5    |    3/5    |
| Dep = low |    2/5    |    4/5    |
| Dep = high |    3/5    |    1/5    |
| Fam = single |    3/5    |    1/5    |
| Fam = couple |    2/5    |    2/5    |
| Fam = children |    0/5    |    2/5    |
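The counting and division steps above can be reproduced programmatically. The sketch below encodes the 10 historical applications as tuples and uses exact fractions; the variable and function names are chosen here for illustration.

```python
from collections import Counter
from fractions import Fraction

# The 10 historical applications: (Job, Deposit, Family, Decision).
data = [
    (True, "low", "single", "Approve"),
    (True, "low", "couple", "Approve"),
    (True, "high", "single", "Approve"),
    (True, "high", "single", "Approve"),
    (False, "high", "couple", "Approve"),
    (True, "low", "couple", "Decline"),
    (False, "low", "couple", "Decline"),
    (True, "low", "children", "Decline"),
    (False, "low", "single", "Decline"),
    (False, "high", "children", "Decline"),
]
features = ("Job", "Dep", "Fam")

# Number of instances per class label.
class_count = Counter(row[-1] for row in data)

# Number of instances per (feature, value, label); unseen combinations give 0.
joint_count = Counter(
    (f, v, row[-1]) for row in data for f, v in zip(features, row)
)


def cond_prob(feature, value, label):
    """P(feature = value | label) as an exact fraction."""
    return Fraction(joint_count[feature, value, label], class_count[label])


print(cond_prob("Job", True, "Approve"))        # 4/5
print(cond_prob("Fam", "children", "Approve"))  # 0
```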

Given a new instance [Job = true, Dep = high, Fam = children], we use the naive Bayes classifier to calculate the **class score** (ignoring the denominator) for each class label (Decline and Approve).

$$
\begin{aligned}
& P(Decline\ |\ Job = true, Dep = high, Fam = children) \\
& \propto P(Decline) * P(Job = true\ |\ Decline) * P(Dep = high\ |\ Decline) * P(Fam = children\ |\ Decline) \\
& = 5/10 \times 2/5 \times 1/5 \times 2/5 \\
& = 0.016.
\end{aligned}
$$

$$
\begin{aligned}
& P(Approve\ |\ Job = true, Dep = high, Fam = children) \\
& \propto P(Approve) * P(Job = true\ |\ Approve) * P(Dep = high\ |\ Approve) * P(Fam = children\ |\ Approve) \\
& = 5/10 \times 4/5 \times 3/5 \times 0/5 \\
& = 0.
\end{aligned}
$$

The predicted class will be **Decline**, since it has the highest score.
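The two class scores can be checked with a short script. This is a minimal sketch that copies the relevant entries from the probability table above; the dictionary layout is an assumption made for illustration.

```python
from fractions import Fraction as F

# Entries copied from the probability table for the new instance
# [Job = true, Dep = high, Fam = children].
prior = {"Approve": F(5, 10), "Decline": F(5, 10)}
cond = {
    ("Job", "true"): {"Approve": F(4, 5), "Decline": F(2, 5)},
    ("Dep", "high"): {"Approve": F(3, 5), "Decline": F(1, 5)},
    ("Fam", "children"): {"Approve": F(0, 5), "Decline": F(2, 5)},
}

scores = {}
for label, p in prior.items():
    score = p
    for table in cond.values():
        score *= table[label]  # multiply in P(feature = value | label)
    scores[label] = score

print({k: float(v) for k, v in scores.items()})  # {'Approve': 0.0, 'Decline': 0.016}
print(max(scores, key=scores.get))               # Decline
```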

## Dealing with Zero Occurrence <a name="zero"></a>

---

In the above example, if we look further into the instance, we can see that the applicant is actually very promising: the applicant has a job and the deposit is high. However, the application is declined, because $P(Fam = children\ |\ Approve) = 0$ in the training data (also note that the two declined applications with $Fam = children$ had either no job or a low deposit). This cancels out the effects of all the other features, since zero multiplied by anything is still zero.

To deal with zero occurrence for a feature that voids the effects of all the other features, we can take a simple approach: we start counting the number of instances from 1 rather than 0.

After adjusting the counting, the table of the number of instances matching the class labels and feature values is shown as follows.

| Class | Approve | Decline |
| ----- | ------- | ------- |
| Total |    6    |    6    |
| Job = true |    5    |    3    |
| Job = false |    2    |    4    |
| Dep = low |    3    |    5    |
| Dep = high |    4    |    2    |
| Fam = single |    4    |    2    |
| Fam = couple |    3    |    3    |
| Fam = children |    1    |    3    |

We can see that compared with the original table, each entry in this adjusted table is incremented by 1.

From this table, we calculate the (conditional) probabilities as follows.

$$
P(Class) = \frac{N(Class)}{N(Class = Approve) + N(Class = Decline)}
$$

$$
P(Job\ |\ Class) = \frac{N(Job, Class)}{N(Job = true, Class) + N(Job = false, Class)}
$$

$$
P(Dep\ |\ Class) = \frac{N(Dep, Class)}{N(Dep = low, Class) + N(Dep = high, Class)}
$$

$$
P(Fam\ |\ Class) = \frac{N(Fam, Class)}{N(Fam = single, Class) + N(Fam = couple, Class) + N(Fam = children, Class)}
$$
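The formulas above share one pattern: add 1 to each original count, then divide by the sum of the adjusted counts over all values of that feature. This way of counting is commonly known as add-one (Laplace) smoothing. A minimal sketch, with a hypothetical helper name, is:

```python
from fractions import Fraction


def smoothed_prob(count, counts_for_all_values):
    """(count + 1) divided by the sum of (c + 1) over all values of the feature."""
    return Fraction(count + 1, sum(c + 1 for c in counts_for_all_values))


# Original Fam counts for the Approve class: single, couple, children.
fam_approve = [3, 2, 0]
fam_probs = [smoothed_prob(c, fam_approve) for c in fam_approve]
print([str(p) for p in fam_probs])  # ['1/2', '3/8', '1/8']

# Original Job counts for the Approve class: true, false.
job_true_approve = smoothed_prob(4, [4, 1])
print(job_true_approve)  # 5/7
```

Note that `Fraction` reduces 4/8 to 1/2 automatically.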

Then, the table for the (conditional) probabilities is below.

| Class | Approve | Decline |
| ----- | ------- | ------- |
| Total |    6/12    |    6/12    |
| Job = true |    5/7    |    3/7    |
| Job = false |    2/7    |    4/7    |
| Dep = low |    3/7    |    5/7    |
| Dep = high |    4/7    |    2/7    |
| Fam = single |    4/8    |    2/8    |
| Fam = couple |    3/8    |    3/8    |
| Fam = children |    1/8    |    3/8    |

From the new table, we calculate the class scores as follows.

$$
\begin{aligned}
& P(Decline\ |\ Job = true, Dep = high, Fam = children) \\
& \propto P(Decline) * P(Job = true\ |\ Decline) * P(Dep = high\ |\ Decline) * P(Fam = children\ |\ Decline) \\
& = 6/12 \times 3/7 \times 2/7 \times 3/8 \\
& = 0.0230.
\end{aligned}
$$

$$
\begin{aligned}
& P(Approve\ |\ Job = true, Dep = high, Fam = children) \\
& \propto P(Approve) * P(Job = true\ |\ Approve) * P(Dep = high\ |\ Approve) * P(Fam = children\ |\ Approve) \\
& = 6/12 \times 5/7 \times 4/7 \times 1/8 \\
& = 0.0255.
\end{aligned}
$$

The predicted class now becomes **Approve**. This shows the effectiveness of dealing with zero occurrence.
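The adjusted scores can be verified in the same way. The sketch below starts from the adjusted count table (each original count plus 1) and follows the formulas above; the nested-dictionary layout is an assumption made for illustration.

```python
# Adjusted counts, copied from the adjusted table above.
class_count = {"Approve": 6, "Decline": 6}
feature_count = {
    "Job": {
        "true": {"Approve": 5, "Decline": 3},
        "false": {"Approve": 2, "Decline": 4},
    },
    "Dep": {
        "low": {"Approve": 3, "Decline": 5},
        "high": {"Approve": 4, "Decline": 2},
    },
    "Fam": {
        "single": {"Approve": 4, "Decline": 2},
        "couple": {"Approve": 3, "Decline": 3},
        "children": {"Approve": 1, "Decline": 3},
    },
}
instance = {"Job": "true", "Dep": "high", "Fam": "children"}


def class_score(label):
    """P(label) times the product of P(feature = value | label)."""
    score = class_count[label] / sum(class_count.values())
    for feature, value in instance.items():
        by_value = feature_count[feature]
        # Denominator: adjusted counts summed over all values of this feature.
        denom = sum(counts[label] for counts in by_value.values())
        score *= by_value[value][label] / denom
    return score


scores = {label: class_score(label) for label in class_count}
print({k: round(v, 4) for k, v in scores.items()})  # {'Approve': 0.0255, 'Decline': 0.023}
print(max(scores, key=scores.get))                  # Approve
```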

---

- More tutorials can be found [here](https://github.com/meiyi1986/tutorials).
- [Yi Mei's homepage](https://meiyi1986.github.io/)