# Naive Bayes

## Review of Probability
* Conditional probability
* Product rule (chain rule)
* Independence
* Sum rule
* Bayes' rule

Welcome to the Naive Bayes lecture! Before we discuss Naive Bayes in particular, we review some probability basics that will allow us to understand the details. In particular, we will look at conditional probability, the product rule, independence, the sum rule and Bayes' rule.

## Conditional Probability
The **conditional** probability of $A$ given $B$ is defined by:
$$p(A|B)=\frac{p(A, B)}{p(B)}$$

* $p(A, B)$ is the **joint** probability of observing $A$ and $B$
* $p(B)$ is the probability of observing $B$

Example:

In [None]:
p_b = 0.4
p_a_b = 0.1
p_a_cond_b = p_a_b / p_b
p_a_cond_b

The conditional probability of $A$ given $B$ is defined as the ratio of the joint probability of observing $A$ and $B$ over the probability of observing $B$. For example, we could calculate the probability a pet is a dog ($A$) if its height is more than 30 cm ($B$). The example shows how this can be calculated using code.

## Product Rule (Chain Rule)
From the formula for conditional probability, we can express **joint** probability $p(A, B)$ as:
$$p(A, B)=p(A|B)p(B)$$

Alternatively:
$$p(A, B)=p(B|A)p(A)$$

Example:

In [None]:
p_a_cond_b = 0.25
p_b = 0.4
p_a_b = p_a_cond_b * p_b
p_a_b

We can take the formula for conditional probability and use it to express the joint probability $p(A, B)$ as the product of $p(A|B)$ and $p(B)$ - this is known as the product or chain rule. For example, we could calculate the joint probability a pet is a dog and at the same time its height is more than 30 cm. We can easily swap the random variables $A$ and $B$ in the joint probability and it still remains the same probability. This allows us to write $p(A,B)$ alternatively in terms of the product of $p(B|A)$ and $p(A)$. We again include a code snippet example.

## Independence
If random variables $A$ and $B$ are **independent**, then $p(A|B)=p(A)$ and $p(B|A)=p(B)$.

Joint probability for independent random variables $A$ and $B$ then becomes:
$$p(A, B) = p(A)p(B)$$

Example:

In [None]:
p_a = 0.25
p_b = 0.4
p_a_b = p_a * p_b
p_a_b

If random variables $A$ and $B$ are independent, then the conditional probability $p(A|B)$ is equal to $p(A)$. Similarly, the conditional probability $p(B|A)$ is equal to $p(B)$. As a result, the joint probability for independent random variables $A$ and $B$ becomes $p(A, B) = p(A)p(B)$. For example, the probability of a pet being a dog given it is sunny today is the same as the prior probability of the pet being a dog, because the fact it is sunny does not tell us anything about the pet.
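One way to check independence numerically is to compare each joint probability with the product of the corresponding marginals. The sketch below uses a small made-up joint table for two binary variables; the table and the variable names are assumptions for illustration only.

```python
import math

# Hypothetical joint probabilities p(A=a, B=b) for two binary variables.
joint = {
    (True, True): 0.10,
    (True, False): 0.15,
    (False, True): 0.30,
    (False, False): 0.45,
}

# Marginal probabilities, obtained by adding up the joint over the other variable.
p_a = {a: sum(p for (a2, _), p in joint.items() if a2 == a) for a in (True, False)}
p_b = {b: sum(p for (_, b2), p in joint.items() if b2 == b) for b in (True, False)}

# Conditional probability p(A|B) = p(A, B) / p(B) for A and B both true.
p_a_cond_b = joint[(True, True)] / p_b[True]

# Independence means every joint entry equals the product of its marginals.
is_independent = all(
    math.isclose(joint[(a, b)], p_a[a] * p_b[b]) for (a, b) in joint
)
print(p_a[True], p_a_cond_b, is_independent)
```

In this table $p(A|B)$ matches $p(A)$, so knowing $B$ tells us nothing about $A$.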

## Sum Rule
We can express probability $p(A)$ by summing over all possible values $b_1, ..., b_N$ of random variable $B$:
$$p(A)=p(A, B=b_1) + p(A, B=b_2) + ... + p(A, B=b_N) = \sum_{i=1}^N p(A, B=b_i)$$

Using product rule:
$$
\begin{aligned}
 p(A)&=p(A| B=b_1)p(B=b_1) + ... + p(A| B=b_N)p(B=b_N)\\
 &= \sum_{i=1}^N p(A|B=b_i)p(B=b_i)
\end{aligned}
$$

Example:

In [None]:
# The values b_1 and b_2 must cover all possibilities for B,
# so their probabilities sum to 1.
p_b1 = 0.7
p_b2 = 0.3
p_a_cond_b1 = 0.5
p_a_cond_b2 = 0.1
p_a = p_a_cond_b1 * p_b1 + p_a_cond_b2 * p_b2
p_a

Another very useful rule from probability is the sum rule. It allows us to express probability $p(A)$ by summing the joint probability over all possible values of random variable $B$. The values must be mutually exclusive and cover all possibilities, so their probabilities sum to 1. The sum rule is often combined with the product rule so that we can express the probability $p(A)$ by conditioning on the different values of $B$. The code snippet gives an example that conditions on two values of $B$. We could again calculate the probability that a pet is a dog and we could condition on the various colours of its coat: for example, brown, black, red or white.

## Bayes' Rule
* In machine learning, we typically want to find the probability of class $y$ given data $x$: $p(y|x)$.

* Using the conditional probability formula and the product rule, we obtain Bayes' rule:
$$p(y|x)=\frac{p(y, x)}{p(x)}=\frac{p(x|y)p(y)}{p(x)}$$

* Prediction is made by selecting the most likely class $y$.

* We will use capital letters $X, Y$ for random variables ($X$ represents the features and $Y$ the class), while lowercase $x, y$ denote the values they take (e.g. $y$ can be *dog*).

In machine learning, we typically want to find the probability of class $y$ given data $x$: $p(y|x)$. However, this is difficult to obtain directly from the data, so we use the conditional probability formula and combine it with the product rule. This leads to the formula shown on the slide. To make a prediction, we select the most likely class $y$. To keep the notation clear, we use capital letters for random variables - $X$ for the features and $Y$ for the class. Lower-case letters correspond to the values that these random variables take. A random variable is called random because its value depends on the outcome of a random process. For example, capital $Y$ denotes the random variable representing the class, while lowercase $y$ is a specific value it takes - for example *dog*.

## Bayes' Rule
Bayes' rule:
$$p(y|x)=\frac{p(y, x)}{p(x)}=\frac{p(x|y)p(y)}{p(x)}$$

Explanation:
* $p(y|x)$: posterior probability
    * What is the probability of class $y$ if we have seen data $x$?
* $p(x|y)$: model likelihood
    * What is the probability of seeing $x$ if the class is $y$?
* $p(y)$: prior probability of class $y$
    * What is the probability of class $y$ if we do not know $x$?
* $p(x)$: normalization
    * Normalizes $p(y|x)$ to sum to 1 for various values of $y$, but does not affect which class is the most likely

Let's explain what the individual parts of Bayes' rule actually mean. $p(y|x)$ is the posterior probability: the probability of class $y$ once we have seen data $x$. $p(x|y)$ is the model likelihood: the probability of seeing $x$ if the class is $y$. $p(y)$ is the prior probability of class $y$: the probability of class $y$ if we do not know $x$. $p(x)$ is the normalization, which makes $p(y|x)$ sum to 1 over the values of $y$, but does not affect which class is the most likely.

## Bayes' Rule Example
* Goal: find the probability a pet is a cat or dog based on height

* Classes: cat ($c$) and dog ($d$)

* Height values: $<30$ cm (short $s$), $\geq30$ cm (tall $t$)

* Specific problem: calculate $p(y|x)$ for $y$ being a cat given its height $x$ is $<30$ cm ($s$)
    * $p(y|x)$: What is the probability it is a cat if its height is $<30$ cm?
    * $p(x|y)$: What is the probability of height $<30$ cm if it is a cat?
    * $p(y)$: What is the probability it is a cat if we do not know the height?
    * $p(x)$: What is the probability its height is $<30$ cm?

To make it simpler to understand what these specific parts mean, we include an example. In our example, we want to find the probability a pet is a cat or dog based on height. We have two classes: cat and dog. We have two categories for height values: less than 30 cm, more than or equal to 30 cm. The specific problem we are trying to solve is to calculate $p(y|x)$ for $y$ being a cat given its height $x$ is less than 30 cm.

We break down the formula again into small parts, which we now interpret in the context of our specific example.
* $p(y|x)$: What is the probability it is a cat if its height is $<30$ cm?
* $p(x|y)$: What is the probability of height $<30$ cm if it is a cat?
* $p(y)$: What is the probability it is a cat if we do not know the height?
* $p(x)$: What is the probability its height is $<30$ cm?

## Bayes' Rule
Bayes' rule:
$$p(y|x)=\frac{p(y, x)}{p(x)}=\frac{p(x|y)p(y)}{p(x)}$$

* Probability $p(x)$ can be expanded using the sum and product rule (for $N$ classes):
$$p(x)=\sum_{i=1}^N p(x|y_i)p(y_i)$$

* In practice, $p(x)$ can be skipped as it is enough to know that $p(y|x)$ is proportional to $p(x|y)p(y)$ (posterior $\propto$ likelihood $\times$ prior).

* Values of $p(x|y)$ and $p(y)$ can be easily estimated using the data. 

Having explained the details on an example, we go back to the formula. In the formula, there is a term $p(x)$ which acts as the normalization. This term can be expanded using the sum and product rule with the resulting formula shown on the slide. Note that $N$ in the formula corresponds to the number of classes and $i$ describes the currently considered class. In practice, this term can be skipped as it is enough to know that $p(y|x)$ is proportional to $p(x|y)p(y)$. In other words, the posterior is proportional to the product of the likelihood and the prior.

Values of $p(x|y)$ (model likelihood) and $p(y)$ (prior) can be easily estimated using the data. 

## Bayes' Rule Example
Problem: find the probability $p(c|s)$ that a pet with height $<30$ cm ($s$) is a cat ($c$)

Available information:
* there are 30 cats and 70 dogs in our dataset
* 90% of cats have height smaller than 30 cm
* 30% of dogs have height smaller than 30 cm



Now we go through an example to explain Bayes' rule in more depth. In this example, we want to find the probability that a pet with height lower than 30 cm is a cat. We know that there are 30 cats and 70 dogs in our dataset, 90% of cats have height smaller than 30 cm, 30% of dogs have height smaller than 30 cm.

## Bayes' Rule Example
Available information in terms of probabilities:
* $p(c)=0.3$
* $p(d)=0.7$
* $p(s|c)=0.9$
* $p(s|d)=0.3$

In [None]:
p_c = 0.3
p_d = 0.7
p_s_cond_c = 0.9
p_s_cond_d = 0.3

We can write down this information easily in terms of probabilities.

## Bayes' Rule Example
Using the Bayes' rule formula:
$$p(c|s)=\frac{p(s|c)p(c)}{p(s)}=\frac{p(s|c)p(c)}{p(s|c)p(c)+p(s|d)p(d)}$$

In [None]:
p_s = p_s_cond_c * p_c + p_s_cond_d * p_d
p_c_cond_s = p_s_cond_c * p_c / p_s
p_c_cond_s

It looks like it is a cat!

Using the formula for Bayes' rule that we have seen earlier, we can calculate the probability the pet with the given height is a cat. We also include the expanded formula for the probability that the pet is short (has height less than 30 cm). This part of the formula acts as normalization.

The probability it is a cat is about 56%, so it looks like it is a cat!
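The statement that the posterior is proportional to likelihood times prior can be checked on this example by computing the unnormalized score for both classes and only then normalizing. The following is a minimal sketch using the numbers given above; the helper name `normalize` is an assumption for illustration.

```python
def normalize(scores):
    """Turn unnormalized scores p(x|y)p(y) into posteriors p(y|x)."""
    total = sum(scores.values())  # equals p(x) by the sum and product rules
    return {label: score / total for label, score in scores.items()}

# Values from the example: class priors and likelihoods of a short pet.
priors = {"cat": 0.3, "dog": 0.7}
likelihoods = {"cat": 0.9, "dog": 0.3}

# Unnormalized scores: likelihood times prior for each class.
scores = {label: likelihoods[label] * priors[label] for label in priors}
posteriors = normalize(scores)

# The most likely class is the same before and after normalization.
best_unnormalized = max(scores, key=scores.get)
best_normalized = max(posteriors, key=posteriors.get)
print(posteriors, best_unnormalized, best_normalized)
```

Normalization rescales both scores by the same factor, so it changes the reported probabilities but not the predicted class.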

## Naive Bayes Classifiers
Generative classifier:
* Naive Bayes models how the data $x$ is distributed within each class $y$, $p(x|y)$, together with the class prior $p(y)$
* Allows us to sample a value $x$ for class $y$ using the underlying probability distribution - "generate data"
* Discriminative classifiers (e.g. logistic regression) do not generate data
    * They model $p(y|x)$ or the class boundary directly rather than the distribution of the data

Now we can go to the main part of our lesson: Naive Bayes classifiers. Naive Bayes is an example of a generative classifier - the reason for this is that it models how the data $x$ is distributed within each class $y$, $p(x|y)$, together with the prior $p(y)$, and obtains $p(y|x)$ from these using Bayes' rule. The approach allows us to select a class $y$ and then sample an observation $x$ based on the underlying probability distribution - in other words "generate data". Not all probabilistic classifiers allow this: discriminative classifiers (e.g. logistic regression) do not generate data. They model $p(y|x)$ or the class boundary directly rather than the distribution of the data.

## Naive Bayes Classifiers
In practice, we can have many features for a point: $N$ rather than one
* $N$ can be e.g. $1000$, and each feature can take e.g. $2$ values: together $2^{1000}$ combinations

Challenge: we need to estimate $p(x_1, x_2, ..., x_N|y)$ from data, but we are unlikely to see all possible combinations of values
* **Curse of dimensionality**: as the number of dimensions increases, the number of combinations grows exponentially, so it becomes impossible to obtain enough data

Solution: assume the features are **conditionally independent** of each other
* Estimate the probabilities separately for each feature value rather than their combination


When working with more realistic machine learning problems, we often encounter many features - generally $N$ rather than just one. For example, $N$ can be 1000: in computer vision the features can correspond to pixel values, and in spam classification to words. If each feature takes only 2 values, we end up with 2 to the power of 1000 combinations. This is a challenge because we need to estimate $p(x_1, x_2, ..., x_N|y)$ from data, but we are unlikely to see all possible combinations of values. This phenomenon is also known as the curse of dimensionality: as the number of dimensions increases, the number of combinations grows exponentially, so it becomes impossible to obtain enough data.

Naive Bayes offers a simple solution to this problem: assume the features are conditionally independent of each other given the class. It estimates the probabilities separately for each feature value rather than their combination.
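To get a feeling for the size of the problem, we can count how many probabilities have to be estimated per class for binary features. The sketch below is only a counting exercise; the function names are assumptions for illustration.

```python
def full_joint_parameters(num_features):
    """Free parameters of p(x_1, ..., x_N | y) for binary features, per class.

    There are 2^N combinations and their probabilities must sum to 1.
    """
    return 2 ** num_features - 1

def naive_bayes_parameters(num_features):
    """Free parameters per class when each binary feature is modelled separately."""
    return num_features

for n in (2, 10, 1000):
    full = full_joint_parameters(n)
    naive = naive_bayes_parameters(n)
    # The full count is printed by its number of digits because it gets very large.
    print(f"N={n}: full joint needs a {len(str(full))}-digit number of values, "
          f"Naive Bayes needs {naive}")
```

The full joint grows exponentially with the number of features, while the separate per-feature estimates grow only linearly.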

## Naive Bayes Classifiers
Exact calculation:
$$
\begin{aligned}
 p(x_1, x_2, ..., x_N|y) &=p(x_1|y)p(x_2|x_1,y)...p(x_N|x_1, x_2, ..., x_{N-1}, y) \\
  &= \prod_{i=1}^N p(x_i|x_1, x_2, ..., x_{i-1}, y)
\end{aligned}
$$

Assume the $N$ features are conditionally independent given the class:
$$
\begin{aligned}
 p(x_1, x_2, ..., x_N|y) &=p(x_1|y)p(x_2|y)...p(x_N|y) \\
  &= \prod_{i=1}^N p(x_i|y)
\end{aligned}
$$

Naive Bayes: "naively" assume the features are conditionally independent of each other

Let's have a closer look at this assumption. The exact calculation would be done using the formula at the top of the slide. If we assume the $N$ features are conditionally independent given the class, then we can use the simpler formula. In this formula, we condition only on the class value and not on the values of other features. In fact, the name Naive Bayes comes from the fact that it somewhat "naively" assumes the features are conditionally independent of each other.
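The difference between the exact chain rule and the Naive Bayes product can be seen on a tiny example with two dependent binary features. The joint distribution below is made up for illustration and is not taken from the lecture data.

```python
# Hypothetical joint distribution p(x1, x2 | y) for one class y (made-up numbers).
# The two features tend to take the same value, so they are not independent.
joint = {(1, 1): 0.4, (1, 0): 0.1, (0, 1): 0.1, (0, 0): 0.4}

# Marginals p(x1=1 | y) and p(x2=1 | y).
p_x1 = sum(p for (x1, _), p in joint.items() if x1 == 1)
p_x2 = sum(p for (_, x2), p in joint.items() if x2 == 1)

# Exact chain rule: p(x1=1, x2=1 | y) = p(x1=1 | y) * p(x2=1 | x1=1, y).
p_x2_cond_x1 = joint[(1, 1)] / p_x1
exact = p_x1 * p_x2_cond_x1

# Naive Bayes approximation: p(x1=1 | y) * p(x2=1 | y).
naive = p_x1 * p_x2

print(exact, naive)
```

The chain rule recovers the joint value exactly, while the naive product differs from it here because the features are dependent. When the features really are conditionally independent, the two agree.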

## Naive Bayes Classifiers
Going back to the Bayes' rule:
* Naive Bayes says $p(y|x_1, x_2, ..., x_N)$ is **proportional** to $p(x_1|y)...p(x_N|y)p(y)$
* Proportionality is enough to say which class is the most likely

How can we estimate these terms?
* The prior $p(y)$ can be specified
    * Default prior: $N_C/N$ (number of examples from class $C$ over number of all examples)
* The probabilities $p(x_i|y)$ can be modelled
     * if $x_i$ is continuous, as a **Gaussian**
     * if $x_i$ is a count (e.g. the number of times a word occurs), as a **Multinomial**
     * if $x_i$ is binary, as a **Bernoulli**
     
The parameters of each distribution are fitted using maximum likelihood estimation on the training data.

Naive Bayes uses the Bayes' rule to do the classification - using the conditional independence assumption. It says that the probability of a class for the data is proportional to the product of the probabilities of seeing the specific feature values for the given class, which is then multiplied by the prior probability of the class. Proportionality is enough to say which class is the most likely, so we do not need to worry about normalizing it.

The formula at the top of the slide has two main parts. The first is the prior $p(y)$, which can be specified and is usually calculated as the number of examples from class $C$ over the number of all examples. The other part is the product of $p(x_i|y)$ for different values of $i$. These probabilities can be modelled using different probability distributions, depending on the type of the feature. If the feature is continuous, then we can model it as a Gaussian. If it is a count (e.g. how many times a word occurs), then we can model it as a Multinomial. If it is binary, then we can model it as a Bernoulli.

The parameters of each distribution are fitted using maximum likelihood estimation on the training data. In ``sklearn``, there are various versions of Naive Bayes, including ``GaussianNB``, ``MultinomialNB`` and ``BernoulliNB``. We will use ``GaussianNB`` during the practical, but all of them are used in a very similar way and which one to use depends on the type of the feature. You can read more about them [here](https://scikit-learn.org/stable/modules/naive_bayes.html).
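As a rough sketch of how ``GaussianNB`` can be used, the snippet below fits it on a handful of made-up pet measurements. The data values are assumptions for illustration, and the practical notebook may use the class differently.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical measurements: [weight in kg, height in cm].
X = np.array([[10, 50], [12, 60], [8, 45], [4, 25], [5, 28], [3, 22]])
y = np.array(["dog", "dog", "dog", "cat", "cat", "cat"])

model = GaussianNB()  # with no priors given, they are estimated from y
model.fit(X, y)

new_pet = np.array([[12, 60]])
predicted_class = model.predict(new_pet)[0]
class_probabilities = model.predict_proba(new_pet)[0]  # ordered as model.classes_
print(model.classes_, predicted_class, class_probabilities)
```

``MultinomialNB`` and ``BernoulliNB`` follow the same ``fit`` / ``predict`` / ``predict_proba`` pattern.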

## Modelling Continuous Features

Model the probabilities $p(x_i|y)$ as a Gaussian:
$$p(x_i|y)=\frac{1}{\sqrt{2\pi\sigma^2_{x_i,y}}}\exp\left(-\frac{\left(x_i-\mu_{x_i,y}\right)^2}{2\sigma^2_{x_i, y}}\right),$$

where the mean $\mu_{x_i,y}$ and variance $\sigma^2_{x_i, y}$ for feature $x_i$ and class $y$ can be calculated from the $N_y$ training examples that belong to class $y$:
$$\mu_{x_i,y}=\frac{1}{N_y}\sum_{j:\,y^{(j)}=y} x_{i}^{(j)}$$
$$\sigma^2_{x_i, y}=\frac{1}{N_y}\sum_{j:\,y^{(j)}=y} \left(x_{i}^{(j)}-\mu_{x_i,y}\right)^2$$

We will now have a closer look at modelling continuous features and we will then work through an example in detail. For continuous features, we can model the probabilities $p(x_i|y)$ as a Gaussian, using the formula on the slide.

We need to specify the mean $\mu_{x_i,y}$ and variance $\sigma^2_{x_i, y}$ for feature $x_i$ and class $y$. These are estimated from the training examples of class $y$ using the further formulas on the slide.

## Gaussian Distribution Example


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

mu = 0
# Plot all curves over the same range so they can be compared directly.
x = np.linspace(mu - 6, mu + 6, 100)
for variance in [1, 4, 0.25]:
    std = np.sqrt(variance)
    # Raw f-string so that \mu and \sigma are not treated as escape sequences.
    label = rf"$\mu={mu}, \sigma^2={variance}$"
    plt.plot(x, stats.norm.pdf(x, mu, std), label=label, linewidth=4)
plt.legend(fontsize=16, bbox_to_anchor=(1.05, 1))
plt.yticks(fontsize=14)
plt.xticks(fontsize=14)
plt.xlabel("Value $x$", fontsize=16)
plt.ylabel("Probability density", fontsize=16)
plt.show()

Now we include an example of what the Gaussian distribution looks like - or actually three examples. The blue curve shows a Gaussian with mean 0 and variance 1. The orange curve has mean 0 and variance 4. The variance is larger than 1, so it makes it flatter than the blue curve. The green curve is more peaked than the blue one because its variance is 0.25 - less than 1. The mean controls where the curve is centred - in our case all are centred at 0.

## Method to Calculate Gaussian Probability

In [None]:
import math

def gaussian_probability(x, mean, variance):
    part_1 = math.exp(-(x - mean) ** 2 / (2 * variance))
    part_2 = (2 * math.pi * variance) ** 0.5
    return part_1 / part_2

We include a code snippet to show how we can calculate the Gaussian probability density for a point $x$ given specific values of the mean and variance.
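Because the Gaussian formula gives a density rather than a probability, its value can exceed 1 when the variance is small. The following sketch repeats the function so that it runs on its own; the test values are chosen for illustration.

```python
import math

def gaussian_probability(x, mean, variance):
    """Gaussian density at x (same formula as in the lecture)."""
    part_1 = math.exp(-(x - mean) ** 2 / (2 * variance))
    part_2 = (2 * math.pi * variance) ** 0.5
    return part_1 / part_2

# Density at the mean for a wide and for a narrow Gaussian.
wide = gaussian_probability(0.0, 0.0, 1.0)
narrow = gaussian_probability(0.0, 0.0, 0.01)
print(wide, narrow)
```

This does not cause problems for Naive Bayes, because the densities are only compared between classes and then normalized.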

## Method to Calculate Mean

In [None]:
def calculate_mean(values):
    total_sum = 0
    num_examples = len(values)
    
    for value in values:
        total_sum += value
        
    return total_sum / num_examples

We also include a method to calculate the mean. It follows closely the formula mentioned earlier and expects a list as the input.

## Method to Calculate Variance

In [None]:
def calculate_variance(values, mean):
    total_sum_squares = 0
    num_examples = len(values)
    
    for value in values:
        total_sum_squares += (value - mean) ** 2
        
    return total_sum_squares / num_examples

Similarly, we include a method to calculate the variance of the data. We pass the mean to it, which can be calculated using the previous method.
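The same quantities can also be obtained from Python's standard ``statistics`` module, which is a convenient way to check hand-written methods. This is only a sketch with made-up values; note that ``pvariance`` divides by the number of examples, as the formula in this lecture does, whereas ``variance`` divides by one less.

```python
import statistics

values = [48.0, 52.0, 55.0, 45.0]  # hypothetical heights in cm

mean = statistics.fmean(values)
# Population variance: sum of squared deviations divided by len(values).
population_variance = statistics.pvariance(values, mu=mean)
# Sample variance: divides by len(values) - 1, so it is slightly larger.
sample_variance = statistics.variance(values)

print(mean, population_variance, sample_variance)
```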

## Example with Continuous Features
Problem: find the probability a pet is a dog or cat based on height and weight measurements
* Classes: dog ($d$) and cat ($c$)
* Features: height ($h$) and weight ($w$)

Now we work through a specific example. The problem will be to find the probability a pet is a dog or cat based on height and weight measurements. We will have two classes: dog and cat and the features will be height and weight.

## Example with Continuous Features
Our data:
* 6 dogs and 4 cats

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [None]:
np.random.seed(0)
# [weight in kg, height in cm]
dogs = np.random.normal([10, 50], [5, 20], (6,2))
cats = np.random.normal([4, 25], [1, 5], (4,2))
wd = [x[0] for x in dogs]
hd = [x[1] for x in dogs]
wc = [x[0] for x in cats]
hc = [x[1] for x in cats]

plt.scatter(wd, hd, s=130, label='Dogs')
plt.scatter(wc, hc, s=130, label='Cats')
plt.yticks(fontsize=14)
plt.xticks(fontsize=14)
plt.ylabel('Height (cm)', fontsize=16)
plt.xlabel('Weight (kg)', fontsize=16)
plt.legend(fontsize=16, loc=2)
plt.show()

Our dataset consists of 6 dogs and 4 cats. Their weights and heights are shown on the plot. As you can see, the two clusters are quite far away from each other, so it should be fairly simple to distinguish dogs and cats in this case.

## Example with Continuous Features
What is the probability a pet with height $h=60$ cm and weight $w=12$ kg is a dog?

Define the question using equations:
$$
\begin{aligned}
 p(d|h,w)&=\frac{p(h,w|d)p(d)}{p(h,w)} \\
  &= \frac{p(h|d)p(w|d)p(d)}{p(h,w)}
\end{aligned}
$$

Let's calculate the individual parts!

In our specific case we want to find the probability that a pet with height $h=60$ cm and weight $w=12$ kg is a dog. We write down the question using equations, so that we know what to calculate. The first line is a direct application of Bayes' rule, and the second line expands it using the Naive Bayes assumption. The next step is to calculate the individual parts - we will start by calculating the prior probabilities.

## Example with Continuous Features
Prior probabilities (there are 6 dogs and 4 cats):
$$p(d)=\frac{6}{6+4}=0.6$$
$$p(c)=\frac{4}{6+4}=0.4$$

In [None]:
p_d = len(dogs) / (len(dogs) + len(cats))
print(p_d)
p_c = len(cats) / (len(dogs) + len(cats))
print(p_c)

The prior probabilities can be easily calculated from the data. The prior probability for dogs is the number of dogs over the number of all pets - 6 over 10. We calculate the prior for cats in a similar way - getting a prior of 0.4.

## Example with Continuous Features
* Heights and weights are stored in lists ``hd``, ``wd`` and ``hc``, ``wc`` for dogs and cats respectively.

* We want to calculate means $\mu_{h,d}, \mu_{w,d}, \mu_{h,c}, \mu_{w,c}$ and variances $\sigma^2_{h,d}, \sigma^2_{w,d}, \sigma^2_{h,c}, \sigma^2_{w,c}$:

In [None]:
mu_hd = calculate_mean(hd)
mu_wd = calculate_mean(wd)
mu_hc = calculate_mean(hc)
mu_wc = calculate_mean(wc)

var_hd = calculate_variance(hd, mu_hd)
var_wd = calculate_variance(wd, mu_wd)
var_hc = calculate_variance(hc, mu_hc)
var_wc = calculate_variance(wc, mu_wc)

One thing to note is that the heights and weights are stored in ``hd``, ``wd`` and ``hc``, ``wc`` for dogs and cats respectively.

We use these measurements and the methods we have defined earlier to calculate the means and variances.

## Example with Continuous Features
Calculate probabilities $p(h|d),p(w|d),p(h|c),p(w|c)$ for a pet with height 60 cm and weight 12 kg:

In [None]:
h = 60
w = 12

p_h_cond_d = gaussian_probability(h, mu_hd, var_hd)
p_w_cond_d = gaussian_probability(w, mu_wd, var_wd)
p_h_cond_c = gaussian_probability(h, mu_hc, var_hc)
p_w_cond_c = gaussian_probability(w, mu_wc, var_wc)

Now we calculate the probabilities $p(h|d),p(w|d),p(h|c),p(w|c)$ for a pet with height 60 cm and weight 12 kg. We use a Gaussian distribution because heights and weights are continuous. Each case has its own Gaussian with the mean and variance calculated earlier. We use the method we have defined earlier.

## Example with Continuous Features
Normalization $p(h,w)$ can be calculated as:
$$
\begin{aligned}
 p(h,w)&=p(h,w|d)p(d)+p(h,w|c)p(c) \\
  &= p(h|d)p(w|d)p(d)+p(h|c)p(w|c)p(c)
\end{aligned}
$$

Note we have used the Naive Bayes assumption here for $p(h,w|d)$ and $p(h,w|c)$.

In [None]:
p_h_w = p_h_cond_d * p_w_cond_d * p_d + p_h_cond_c * p_w_cond_c * p_c

The next step is to calculate the normalization so that we can get the actual probability - rather than just say which class is the most likely. We use the sum rule to find the value of the joint probability $p(h, w)$. During the calculation, you can see that we have used the Naive Bayes assumption for expanding $p(h,w|d)$ and $p(h,w|c)$.

## Example with Continuous Features
Final step - calculate probability $p(d|h,w)$:

$$p(d|h,w)=\frac{p(h|d)p(w|d)p(d)}{p(h,w)}$$

In [None]:
p_d_cond_h_w = p_h_cond_d * p_w_cond_d * p_d / p_h_w
p_d_cond_h_w

The pet is definitely a dog! A cat is not so large!

The final step is to combine all of the calculations together and find the probability of the pet being a dog. As you can see, the pet is definitely a dog - which agrees with our expectations because a cat would not be so large.

## Example with Continuous Features
We can also check the probability of the pet being a cat:

In [None]:
p_c_cond_h_w = p_h_cond_c * p_w_cond_c * p_c / p_h_w
p_c_cond_h_w

The probability of the pet being a cat is essentially zero.

We can also look at the probability of the pet being a cat. The calculation says it is essentially zero.
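The steps of this example (priors, per-class means and variances, Gaussian densities, normalization) can be collected into two small functions. The sketch below is one possible arrangement, not the code used in the practical; the function names and the measurements are assumptions for illustration.

```python
import math

def gaussian_density(x, mean, variance):
    """Gaussian density at x for the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def fit(examples_by_class):
    """Estimate the prior and the per-feature mean and variance for each class."""
    total = sum(len(rows) for rows in examples_by_class.values())
    model = {}
    for label, rows in examples_by_class.items():
        columns = list(zip(*rows))  # one tuple of values per feature
        means = [sum(col) / len(col) for col in columns]
        variances = [
            sum((value - mean) ** 2 for value in col) / len(col)
            for col, mean in zip(columns, means)
        ]
        model[label] = (len(rows) / total, means, variances)
    return model

def posterior(model, x):
    """Return p(class | x) for every class using the Naive Bayes assumption."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        likelihood = math.prod(
            gaussian_density(value, mean, variance)
            for value, mean, variance in zip(x, means, variances)
        )
        scores[label] = likelihood * prior
    evidence = sum(scores.values())  # normalization p(x)
    return {label: score / evidence for label, score in scores.items()}

# Hypothetical measurements: (weight in kg, height in cm).
pets = {
    "dog": [(10, 50), (14, 65), (8, 40), (12, 55)],
    "cat": [(4, 25), (5, 27), (3, 22), (4, 26)],
}
model = fit(pets)
result = posterior(model, (12, 60))
print(result)
```

With these made-up measurements, the large pet is again assigned to the dog class with probability close to 1.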

## Example with Discrete Features
* Problem: spam classification based on words present in the email

* Dataset:

| Text                | Class |
|:--------------------|:-----:|
| your lottery ticket | spam  |
| your winning ticket | spam  |
| your bus ticket     |  ham  |
| your train ticket   |  ham  |

* New email: "your airplane ticket"

We also include an example with discrete features. This example is shorter than the previous one and discusses spam classification based on words present in the email. Our dataset consists of four emails, two being spam and two ham (not spam). We want to classify a new email: "your airplane ticket".

## Example with Discrete Features
How to classify the new email?
* Calculate prior probabilities of spam and ham emails
* Calculate conditional probabilities of the words occurring in spam and ham emails
* Calculate $p(\text{spam}|\text{your ticket})$ using Naive Bayes assumption
    * Word airplane is omitted because it is not in the dataset

So how do we classify the new email? There are a few main steps:
* Calculate prior probabilities of spam and ham emails
* Calculate conditional probabilities of the words occurring in spam and ham emails
* Calculate $p(\text{spam}|\text{your ticket})$ using Naive Bayes assumption

We omit the word ``airplane`` from the new email because this word was not seen in the training data and so cannot be used for the classification.

## Example with Discrete Features
Prior probabilities:
* Calculated based on the number of spam and ham emails out of all emails

$$p(\text{spam})=\frac{2}{2+2}=0.5$$

$$p(\text{ham})=\frac{2}{2+2}=0.5$$

Similarly as before, we calculate the prior probabilities. As we have two spam and two ham emails, both of the prior probabilities are equal to 0.5.

## Example with Discrete Features
Conditional probabilities of the words occurring in spam and ham emails:
* Calculated based on how many spam or ham emails include the word

| Word    | Spam | Ham |
|:--------|:----:|:---:|
| your    | 2/2  | 2/2 |
| lottery | 1/2  | 0/2 |
| ticket  | 2/2  | 2/2 |
| winning | 1/2  | 0/2 |
| bus     | 0/2  | 1/2 |
| train   | 0/2  | 1/2 |


The next step is to calculate the conditional probabilities of the words occurring in spam and ham emails. These are calculated based on how many spam or ham emails include the word.

## Example with Discrete Features
Conditional probability of combination of words for spam or ham emails:
* Based on presence or absence of the words
    * Binary values - Bernoulli distribution
* We take the following order: your, lottery, ticket, winning, bus, train

$$
\begin{aligned}
&p(\text{your ticket}|\text{spam})\\
&=p(1,0,1,0,0,0|\text{spam})\\
&=\left(\frac{2}{2}\right)\left(1-\frac{1}{2}\right)\left(\frac{2}{2}\right)\left(1-\frac{1}{2}\right)\left(1-\frac{0}{2}\right)\left(1-\frac{0}{2}\right)\\
&=0.25
\end{aligned}
$$

$$
\begin{aligned}
&p(\text{your ticket}|\text{ham})\\
&=p(1,0,1,0,0,0|\text{ham})\\
&=\left(\frac{2}{2}\right)\left(1-\frac{0}{2}\right)\left(\frac{2}{2}\right)\left(1-\frac{0}{2}\right)\left(1-\frac{1}{2}\right)\left(1-\frac{1}{2}\right)\\
&=0.25
\end{aligned}
$$

Now we calculate the conditional probability of combination of words for spam or ham emails - rather than just the single words. This calculation is again based on the presence or absence of the words, so the values are binary and can be modelled as the Bernoulli probability distribution. In the calculation, we take the following order: your, lottery, ticket, winning, bus, train - this order describes to which words the zeros and ones belong. The calculation gives 0.25 probability in both cases. The Naive Bayes assumption comes here in the assumption that the words are conditionally independent of each other given the class.

## Example with Discrete Features
Probability of the email being spam:
$$
\begin{aligned}
 &p(\text{spam}|\text{your ticket})\\&=\frac{p(\text{your ticket}|\text{spam})p(\text{spam})}{p(\text{your ticket}|\text{spam})p(\text{spam}) + p(\text{your ticket}|\text{ham})p(\text{ham})} \\
 &= \frac{0.25 \times 0.5}{0.25 \times 0.5 + 0.25 \times 0.5}=0.5
\end{aligned}
$$

For this email, we cannot decide if it is spam or not!

The final step is to calculate the probability of the email being spam. We use the Bayes' rule with the previously calculated probabilities. For this specific email, we cannot say if it is spam or not - it is too similar to both spam and ham emails.
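The whole spam example can be reproduced in a few lines of code. The sketch below follows the steps described above on the four emails from the dataset; the function names are assumptions for illustration.

```python
emails = [
    ("your lottery ticket", "spam"),
    ("your winning ticket", "spam"),
    ("your bus ticket", "ham"),
    ("your train ticket", "ham"),
]
vocabulary = ["your", "lottery", "ticket", "winning", "bus", "train"]

def prior(label):
    """Fraction of all emails that belong to the given class."""
    return sum(cls == label for _, cls in emails) / len(emails)

def word_probabilities(label):
    """Fraction of emails of the given class that contain each vocabulary word."""
    texts = [set(text.split()) for text, cls in emails if cls == label]
    return {word: sum(word in words for words in texts) / len(texts) for word in vocabulary}

def likelihood(text, label):
    """Bernoulli Naive Bayes likelihood: uses both present and absent words."""
    present = set(text.split())
    result = 1.0
    for word, p in word_probabilities(label).items():
        result *= p if word in present else 1 - p
    return result

# "airplane" is not in the vocabulary, so it does not enter the calculation.
new_email = "your airplane ticket"
scores = {label: likelihood(new_email, label) * prior(label) for label in ("spam", "ham")}
p_spam = scores["spam"] / sum(scores.values())
print(scores, p_spam)
```

Both classes receive the same score, which gives the 0.5 probability calculated by hand.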

## Limitations of Naive Bayes
* Problem with zero counts:
    * If probability for value of one feature is 0 (e.g. $p(x_3|y)=0$), posterior probability will be 0
    * Solution: add a small value to each count (**smoothing**), so we get no 0 probabilities

* Conditional independence assumption:
    * In practice, some features can be conditionally dependent
    * Allows us to fool the method, e.g. in spam classification

As you would expect, such a simple method as Naive Bayes has some limitations. One of them is that its simple version has a problem with zero counts. If the probability for a value of one feature is 0 (e.g. $p(x_3|y)=0$), the posterior probability will be 0 regardless of the other features, because the probabilities are multiplied together. This is not desirable because e.g. in spam classification many words (attributes) will have zero probability, yet we want to do the classification. The solution is to add a small value to each count (smoothing), so we get no 0 probabilities.

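The second limitation is the conditional independence assumption. In practice, some features can be conditionally dependent, and the method then counts the same evidence several times. This also allows us to fool the method: e.g. in spam classification, a spammer can add words that are typical of ham emails to shift the prediction.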
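Smoothing can be sketched as follows for the word counts from the spam example. The exact form used here (adding the same value `alpha` to every count, often called additive or Laplace smoothing) is an assumption, since the lecture only says that a small value is added to each count.

```python
def smoothed_probability(count, total, alpha=1.0, num_values=2):
    """Estimate a probability from counts with additive smoothing.

    alpha is the value added to each count; num_values is the number of
    values the feature can take (2 for a word that is present or absent).
    """
    return (count + alpha) / (total + alpha * num_values)

# "lottery" appears in 0 of 2 ham emails and in 1 of 2 spam emails.
p_lottery_ham = smoothed_probability(0, 2)
p_lottery_spam = smoothed_probability(1, 2)
print(p_lottery_ham, p_lottery_spam)
```

Without smoothing the ham estimate would be exactly 0, which would force the posterior for ham to 0 for any email containing the word.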

## Benefits of Naive Bayes
* Able to handle **missing values** for a feature:
    * Can simply ignore the feature for an example that has the value missing
    * If $x_j$ has the value missing, then we use $\prod_{i=1, i\neq j}^N p(x_i|y)$ in place of $p(x_1,x_2,..., x_N|y)$

* Naive Bayes allows us to look at the probabilities and **estimate uncertainty** in the prediction

* Naive Bayes is **fast**

Naive Bayes also has some very useful benefits. First, it is able to easily handle missing values for a feature: we can simply ignore the feature for an example that has the value missing. Second, Naive Bayes allows us to look at the probabilities and estimate uncertainty in the prediction. Third, Naive Bayes is very fast as it only needs to do a few calculations to find the probabilities.
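Ignoring a missing feature amounts to leaving its factor out of the product. A minimal sketch, assuming missing values are marked with `None` and using made-up probabilities for three binary features:

```python
import math

def likelihood_with_missing(feature_values, feature_probability):
    """Product of p(x_i|y) over the features whose value is present (not None)."""
    return math.prod(
        feature_probability(i, value)
        for i, value in enumerate(feature_values)
        if value is not None
    )

# Hypothetical probabilities p(x_i = 1 | y) for three binary features.
p_one = [0.9, 0.5, 0.2]

def bernoulli(i, value):
    """p(x_i = value | y) for a binary feature."""
    return p_one[i] if value == 1 else 1 - p_one[i]

complete = likelihood_with_missing([1, 0, 1], bernoulli)         # 0.9 * 0.5 * 0.2
with_missing = likelihood_with_missing([1, None, 1], bernoulli)  # 0.9 * 0.2
print(complete, with_missing)
```

The same set of features has to be skipped for every class, so that the resulting scores remain comparable.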

## Jupyter Exercise

Naive Bayes in Practice
Open `naive_bayes_practical.ipynb`

Now it is your turn to get some practice with Naive Bayes!