# Bayesian Learning

## Contents
1. **Introduction**  
2. **The Bayes theorem**  
    2.1. Maximum a posteriori (MAP)  
    2.2. Relation with concept learning  
    2.3. Bayes optimal classifier   
3. **Gibbs classifier**  
4. **Naive Bayes classifiers**  
    4.1. The naive Bayes algorithm  
    4.2. Observations

# Introduction

Bayesian learning is a family of machine learning approaches that, unlike the approaches we have encountered so far, which aim to find a single hypothesis, associates with each hypothesis a probability of being correct. It is based on the concept of conditional probability. With a Bayesian approach, it is possible to combine prior knowledge with observed data. Bayesian learning treats model parameters as random variables. Consequently, parameter estimation amounts to computing posterior distributions for these random variables based on the observed data. Even though Bayesian methods are not applicable in certain contexts, because they require a large number of probability estimates, they are important from a theoretical perspective because they provide the so-called *gold standard*, a theoretical landmark for evaluating other learning algorithms.

We need a few concepts from probability theory:
- **Chain (or product) rule**: given two random events $A$ and $B$, their joint probability is

$$P(A\cap B) = P(B|A)\cdot P(A).$$

In the general case, given the events $A_1,A_2,...,A_n$, the chain rule extends to the formula

$$P(A_n\cap ... \cap A_1) = P(A_n | A_{n-1} \cap ... \cap A_1) \cdot P(A_{n-1} \cap ... \cap A_1),$$

which, by induction, may be turned into

$$P(A_n \cap ... \cap A_1) = \prod_{k=1}^n{P\left(A_k \big\vert \bigcap_{j=1}^{k-1} A_j \right)} .$$

- **Sum rule (or addition law)**: given two random events $A$ and $B$, the probability that $A$ **or** $B$ will happen is:

$$P(A\cup B) = P(A) + P(B) - P(A\cap B).$$

- **Law of total probability:** if events $A_1,...,A_n$ are mutually exclusive with $\sum_{i=1}^n P(A_i)=1$, then for every event $B$ of the same probability space

$$P(B) = \sum_{i=1}^n {P(B\cap A_i)} = \sum_{i=1}^n P(B|A_i)P(A_i).$$

# The Bayes theorem
Bayes' theorem, formulated by Thomas Bayes (1701-1761), allows us to compute the posterior probability of an event from a prior probability and the given observations. The theorem states:

$$ P(h | D) = \frac{P(D | h) P(h)}{P(D)}$$

where
- $P(h)$ = prior probability of hypothesis $h$;
- $P(D)$ = prior probability of training data $D$, also called *evidence*;
- $P(h|D)$ = probability of $h$ given $D$ (posterior probability);
- $P(D|h)$ = probability of $D$ given $h$.

Generally, we want the most probable hypothesis given $D$. In the context of a classification problem, the posterior probability can be interpreted as the probability that a particular object belongs to class $i$ given its observed feature values. Note that, in contrast to a frequentist approach, here we introduced a *prior probability*, which can be interpreted as the *prior belief* or as the *a priori knowledge*.

### Example
Suppose that a patient tests positive ($+$) for a disease that affects $1$ in every $10000$ people, with a test having $99\%$ accuracy (that is, it gives the correct result for $99\%$ of sick people and for $99\%$ of healthy people). Therefore,
- Sick: $P(S)=0.0001$
- Healthy: $P(H)=0.9999$
- Tested Positive: $P(+) = P(S)P(+|S) + P(H)P(+|H)$
- $P(+|S) = 0.99$
- $P(+|H) = 0.01$

$$P(S|+) = \frac{P(S)\cdot P(+|S)}{P(S)P(+|S) + P(H)P(+|H)}=0.0098 < 1\%$$
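The computation above can be checked numerically. The following is a minimal sketch, assuming that "99% accuracy" means $P(+|S)=0.99$ and $P(+|H)=0.01$; the function name `posterior_sick` and its parameter names are illustrative, not part of the original notes.

```python
def posterior_sick(prior_sick, sensitivity, false_positive_rate):
    """Return P(S | +) via Bayes' theorem and the law of total probability."""
    prior_healthy = 1.0 - prior_sick
    # Evidence: P(+) = P(S)P(+|S) + P(H)P(+|H)
    evidence = prior_sick * sensitivity + prior_healthy * false_positive_rate
    return prior_sick * sensitivity / evidence


# Assumed reading of "99% accuracy": P(+|S) = 0.99 and P(+|H) = 0.01.
p = posterior_sick(prior_sick=0.0001, sensitivity=0.99, false_positive_rate=0.01)
print(f"P(S | +) = {p:.4f}")
```

Even with a positive result, the posterior stays below $1\%$ because the prior $P(S)$ is so small that false positives among healthy people dominate the evidence term.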


## Maximum a posteriori
A simple estimation method is *maximum a posteriori (MAP)* estimation, which is closely related to *maximum likelihood (ML)* estimation (MAP can be seen as a regularization of ML estimation, because it incorporates a prior distribution into its objective function).

$$h_{MAP} = \underset{h\in H}{\mathrm{argmax}}P(h|D) = \underset{h\in H}{\mathrm{argmax}} \frac{P(D|h)\cdot P(h)}{P(D)} = \underset{h\in H}{\mathrm{argmax}} P(D|h)P(h),$$

because $P(D)$ does not depend on $h$.

If we **assume a uniform probability distribution** over $H$, that is $P(h_i)=P(h_j)\; \forall h_i,h_j \in H$, we can further simplify, and choose the *maximum likelihood* hypothesis

$$h_{ML} = \underset{h\in H}{\mathrm{argmax}} P(D|h)$$

### Applying MAP to the hypothesis learning problem
We can adopt a brute force approach, by calculating the posterior probability for each hypothesis $h\in H$:

$$P(h|D) = \frac{P(D|h)\cdot P(h)}{P(D)},$$

then choosing the hypothesis with the highest posterior probability:

$$h_{MAP} = \underset{h\in H}{\mathrm{argmax}} P(h|D).$$
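For a finite hypothesis space, the brute-force procedure can be sketched in a few lines. The hypothesis names, priors and likelihoods below are hypothetical values chosen only for illustration, and the sketch assumes that $P(D) > 0$.

```python
def map_hypothesis(priors, likelihoods):
    """Brute-force MAP: priors[h] = P(h), likelihoods[h] = P(D | h)."""
    # Unnormalised posteriors P(D|h)P(h); their sum is the evidence P(D).
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(joint.values())  # assumed to be strictly positive
    posteriors = {h: j / evidence for h, j in joint.items()}
    h_map = max(posteriors, key=posteriors.get)
    return h_map, posteriors


# Hypothetical three-hypothesis space, for illustration only.
priors = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
likelihoods = {"h1": 0.1, "h2": 0.4, "h3": 0.3}
h_map, posteriors = map_hypothesis(priors, likelihoods)
print(h_map, posteriors)
```

Note that `h1` has the largest prior but is not the MAP hypothesis here, because the data are much more likely under `h2`.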


### Relation with concept learning

Consider an instance space $X$, a hypothesis space $H$ and a training set $D$. Consider also the `FindS` algorithm (introduced in the chapter *"Concept Learning"*), which outputs the most specific hypothesis from the version space $VS_{H,D}$ (the subset of hypotheses that are consistent with the training data in $D$). What is the difference between the hypothesis found by `FindS` and the MAP hypothesis produced by the Bayes rule?

Assume that the training data are noise-free and that the target function is contained in $H$. Consider a fixed set of instances $\langle x_1,...,x_m\rangle$ and let $D$ be the sequence of binary classifications $D=\langle c(x_1),...,c(x_m)\rangle$.

$$P(D|h) =  
\begin{cases}
1 & \text{if $h$ is consistent with $D$ (because no noisy data)} \\
0 & \text{otherwise}
\end{cases}$$

Assume $P(h)$ to be the uniform distribution: $P(h) = \frac{1}{|H|} \forall h\in H$ (note that `FindS` does not make this assumption).

Then,

$$P(D) = \sum_{h_i\in H} P(D|h_i)P(h_i) = \sum_{h_i\in VS_{H,D}}1\cdot \frac{1}{|H|} + \sum_{h_i \not\in VS_{H,D}}0\cdot \frac{1}{|H|} = \sum_{h_i\in VS_{H,D}}1\cdot \frac{1}{|H|} = \frac{|VS_{H,D}|}{|H|},$$

therefore,

$$P(h|D) = \begin{cases} 
\frac{1}{|VS_{H,D}|} & \text{if $h$ is consistent with $D$ (i.e. $h\in VS_{H,D}$)} \\
0 & \text{otherwise}
\end{cases}$$
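The result can be verified on a toy problem. The sketch below assumes a hypothetical hypothesis space of threshold concepts $h_t(x) = (x \ge t)$ with $t\in\{0,\dots,5\}$ and two noise-free examples; none of these specifics come from the notes.

```python
from fractions import Fraction

# Hypothetical hypothesis space: threshold concepts h_t(x) = (x >= t).
H = list(range(6))
D = [(1, False), (3, True)]  # noise-free labelled examples (x, c(x))


def consistent(t, data):
    """True if h_t classifies every example in data correctly."""
    return all((x >= t) == label for x, label in data)


prior = Fraction(1, len(H))                          # uniform P(h) = 1/|H|
likelihood = {t: int(consistent(t, D)) for t in H}   # P(D|h) is 0 or 1
evidence = sum(likelihood[t] * prior for t in H)     # equals |VS| / |H|
posterior = {t: likelihood[t] * prior / evidence for t in H}

version_space = [t for t in H if likelihood[t]]
print(version_space, posterior)
```

Only the thresholds in the version space receive non-zero posterior, and each of them receives exactly $\frac{1}{|VS_{H,D}|}$.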

**Definition (Consistent learner):** a learning algorithm that outputs $h$ with zero errors on the training examples.

- Every consistent hypothesis has posterior probability $\frac{1}{|VS_{H,D}|}$.
- Every consistent hypothesis is a MAP hypothesis.
- If we assume uniform prior probabilities and deterministic, noise-free training data, then consistent learners output a MAP hypothesis.
- Consistent learners can output MAP hypotheses also with different prior probability distributions.

For example, `FindS` doesn't assume a uniform distribution for $H$, but it outputs a MAP hypothesis if we assume a probability distribution $P(h)$ that assigns $P(h_1)\ge P(h_2)$ if $h_1$ is **more specific** than $h_2$.

## Example:

The following example highlights how the posterior probabilities evolve as new observations arrive.

Consider $5$ bags of candies, each containing a large, unspecified number of candies (the flavours are: Cherry = Red, Lime = Green). Each bag represents a single hypothesis. Assume that the probabilities of the candy picks do not change, because every time we pick a candy, we do not remove it from the bag.

|       | Prior $P(h_i)$ | Red candies | Green candies |
|:-----:|:-------------:|:-----------:|:-------------:|
| $h_1$ |      10%      |     100%    |       0%      |
| $h_2$ |      20%      |     75%     |      25%      |
| $h_3$ |      40%      |     50%     |      50%      |
| $h_4$ |      20%      |     25%     |      75%      |
| $h_5$ |      10%      |      0%     |      100%     |

Let $d=\{G,G,G,G,G,G,G,G,G,G\}$ represent a sequence of draws from a bag (i.e. 10 green candies).

- What kind of bag is it? (what is the MAP hypothesis?)
- What colour will the next candy be?

The next graph plots the evolution of the posterior probabilities as we pick candies from a certain bag. Note that when no candy has been drawn yet ($|d|=0$), the posterior is equal to the prior probability. Observe that if we keep picking green candies, the hypothesis $h_5$ becomes more and more likely to be the correct one.


Posteriors evolution            |  Probability of extracting a green candy
:-------------------------:|:-------------------------:
<img src="images/bayesian_learning/posteriors.png" alt="Posterior probabilities" style="width: 30em;"/>  |  <img src="images/bayesian_learning/next_candies.png" alt="Posterior probabilities" style="width: 30em;"/>
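The curves in the two plots can be reproduced with a short script. This is a sketch under the assumption that the first numeric column of the table gives the prior of each bag and that its rows are listed in order from the all-red bag to the all-green bag; the function names are illustrative.

```python
# Priors and per-bag probability of drawing a green candy, read from the table.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_green = [0.0, 0.25, 0.5, 0.75, 1.0]


def update(posterior, likelihoods):
    """One Bayesian update: multiply by the likelihood and renormalise."""
    joint = [p * l for p, l in zip(posterior, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


def posterior_after(n_green):
    """Posterior over the five bags after observing n_green green candies."""
    posterior = list(priors)
    for _ in range(n_green):
        posterior = update(posterior, p_green)
    return posterior


def prob_next_green(posterior):
    """Predictive probability: sum_i P(green | h_i) P(h_i | d)."""
    return sum(pg * p for pg, p in zip(p_green, posterior))


for n in range(11):
    post = posterior_after(n)
    print(n, [round(p, 3) for p in post], round(prob_next_green(post), 3))
```

The first printed row reproduces the prior, and the last column answers the second question: the probability that the next candy is green is a posterior-weighted average over all bags, not the prediction of the MAP bag alone.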

## Example

Consider now a new manufacturer producing bags with an arbitrary choice of red/green candies. Each bag can be characterized by the parameter $\theta \equiv \frac{\text{n. of red candies}}{\text{n. of candies}} \in [0,1]$, i.e. the fraction of red candies in the bag. Consequently we have a continuous space of hypotheses $h_\theta$. Consider now an experiment consisting of $N=r+g$ picks, yielding $r$ red candies and $g$ green candies.

If we assume that the hypotheses are uniformly distributed (all proportions of red candies are equally likely a priori), then we can find the maximum likelihood hypothesis:

$$h_{ML} = \underset{h_\theta}{\mathrm{argmax}}P(d|h_\theta) = \underset{h_\theta}{\mathrm{argmax}}L(d|h_\theta) $$

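where $L(d|h_\theta) = \log P(d|h_\theta)$ is the log-likelihood (the logarithm is monotonically increasing, so it does not change the maximizer):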

$$P(d|h_\theta) = \prod_{j=1}^N P(d_j | h_\theta) = \theta^r\cdot(1-\theta)^g$$

$$L(d|h_\theta) = r\cdot \log\theta + g\cdot \log(1-\theta)$$

In order to find the maximum, we compute the derivative and set it equal to zero:

$$\frac{\text{d}L(d|h_\theta)}{\text{d}\theta} = \frac{r}{\theta} - \frac{g}{1-\theta}$$

$$\quad \Rightarrow \quad \frac{r(1-\theta)-g\theta}{\theta(1-\theta)} = 0 \quad \Leftrightarrow \quad r(1-\theta) - g\theta = 0 \quad \Leftrightarrow \quad \theta_{ML} = \frac{r}{r+g} = \frac{r}{N}$$
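The closed-form result can be sanity-checked by maximizing the log-likelihood numerically. The sketch below uses a simple grid search over $\theta\in(0,1)$; the counts `r = 3` and `g = 7` are hypothetical.

```python
import math


def log_likelihood(theta, r, g):
    """L(d | h_theta) = r*log(theta) + g*log(1 - theta)."""
    return r * math.log(theta) + g * math.log(1.0 - theta)


def theta_ml_grid(r, g, steps=10_000):
    """Approximate the maximizer on an open grid in (0, 1), avoiding log(0)."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda t: log_likelihood(t, r, g))


r, g = 3, 7  # hypothetical counts of red and green candies
print(theta_ml_grid(r, g), r / (r + g))
```

Both printed values should agree up to the grid resolution, which matches $\theta_{ML} = r/N$.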

### Example: Most probable classification of new instances

So far we have sought the most probable hypothesis given the data $D$ (i.e. $h_{MAP}$). Given a new instance $x$, is $h_{MAP}(x)$ the most probable classification? Consider the case in which we have three hypotheses:

- $P(h_1 | D) = 0.4$
- $P(h_2 | D) = 0.3$
- $P(h_3 | D) = 0.3$

In this case $h_{MAP}=h_1$. Consider a new instance $x$ and assume that we obtain the following classifications:

- $h_1(x) = True$
- $h_2(x) = False$
- $h_3(x) = False$

If we consider just $h_{MAP}$, then the answer would be `True`. However, if we consider all the hypotheses, we would predict `True` with probability $0.4$, and `False` with probability $0.3+0.3=0.6$.

## Bayes optimal classifier
Consider the classification of a new instance $x$ assuming target values $v_j\in V$ (in the previous example we had $V=\{True, False\}$).

$$P(v_j|D) = \sum_{h_i \in H}P(v_j|h_i)P(h_i|D)$$

where $P(v_j|h_i)$ is the probability that $h_i$ predicts class $v_j$.

The Bayes optimal classification is:


$$v^* = \underset{v_j\in V}{\mathrm{argmax}} \sum_{h_i\in H} P(v_j | h_i)P(h_i | D)$$

Sticking with the previous example we now have:

- $P(h_1 | D) = 0.4 \quad P(False | h_1) = 0 \quad P(True | h_1) = 1 $
- $P(h_2 | D) = 0.3 \quad P(False | h_2) = 1 \quad P(True | h_2) = 0 $
- $P(h_3 | D) = 0.3 \quad P(False | h_3) = 1 \quad P(True | h_3) = 0 $

therefore,

$$P(True | D) = \sum_{h_i\in H} P(True | h_i)P(h_i | D) = 0.4$$

$$P(False | D) = \sum_{h_i\in H} P(False | h_i)P(h_i | D) = 0.6$$

and

$$v^* = \underset{v_j\in V}{\mathrm{argmax}} \sum_{h_i\in H} P(v_j | h_i)P(h_i | D) = False$$
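The same computation can be written as a small function. This is a minimal sketch for a finite hypothesis space; the dictionary layout and the name `bayes_optimal` are choices made here for illustration.

```python
def bayes_optimal(posteriors, predictions, values):
    """posteriors[h] = P(h | D); predictions[h][v] = P(v | h)."""
    scores = {
        v: sum(predictions[h][v] * posteriors[h] for h in posteriors)
        for v in values
    }
    return max(scores, key=scores.get), scores


# The three-hypothesis example from the text.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {
    "h1": {True: 1.0, False: 0.0},
    "h2": {True: 0.0, False: 1.0},
    "h3": {True: 0.0, False: 1.0},
}
v_star, scores = bayes_optimal(posteriors, predictions, values=[True, False])
print(v_star, scores)
```

The cost is one pass over all of $H$ for every candidate value, which is what makes the classifier expensive when $H$ is large.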

### Observations

- The Bayes optimal classifier is the **optimal learner:** no other classification method using the same hypothesis space and the same prior knowledge can outperform this method on average (in terms of probability). 
- It maximizes the probability that the new instance is classified correctly. For example, in learning boolean concepts using version spaces, it takes a weighted vote among all the members of the $VS$, with each candidate hypothesis weighted by its posterior probability $\left(\frac{1}{|VS_{H,D}|}\right)$. 
- Predictions can correspond to a hypothesis not contained in $H$. Labeling new instances with $\underset{v_j\in V}{\mathrm{argmax}}P(v_j|D)$ can correspond to none of the hypotheses in $H$, because Bayesian learning linearly combines multiple hypotheses (the resulting hypothesis space is different).

# Gibbs classifier
The Bayes optimal classifier provides the best result, but it can be expensive if we have a large number of hypotheses. To overcome this issue, we can use the **Gibbs algorithm**:

1. Choose one hypothesis at random from the version space $VS$, according to the posterior probability distribution $P(h|D)$.
2. Use this hypothesis to classify new instances.

Surprisingly, it works quite well. Assume that the target concepts are drawn at random from $H$ according to the priors on $H$; then, on average

$$\mathbb{E}[error_{Gibbs}] \le 2\cdot\mathbb{E}[error_{BayesOptimal}]$$
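The two steps of the Gibbs algorithm can be sketched as follows. The example reuses the three-hypothesis posterior from the Bayes optimal classifier section; representing hypotheses as Python callables and the name `gibbs_classify` are assumptions made for illustration.

```python
import random


def gibbs_classify(posteriors, hypotheses, x, rng=random):
    """Sample one hypothesis h ~ P(h | D) and return its prediction h(x)."""
    names = list(posteriors)
    weights = [posteriors[name] for name in names]
    chosen = rng.choices(names, weights=weights, k=1)[0]
    return hypotheses[chosen](x)


# Three-hypothesis example: h1 predicts True, h2 and h3 predict False.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
hypotheses = {
    "h1": lambda x: True,
    "h2": lambda x: False,
    "h3": lambda x: False,
}

rng = random.Random(0)  # fixed seed so that the run is repeatable
votes = [gibbs_classify(posteriors, hypotheses, x=None, rng=rng) for _ in range(10_000)]
frac_true = sum(votes) / len(votes)
print(frac_true)  # should be close to P(h1 | D) = 0.4
```

Each call evaluates a single hypothesis instead of summing over all of $H$, which is where the saving with respect to the Bayes optimal classifier comes from.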

# Naive Bayes Classifiers
Naive Bayes classifiers, along with decision trees, neural networks and nearest neighbor methods, are among the most practical learning methods. They are linear classifiers based on Bayes' theorem, and they are known for being simple yet very efficient. They are typically used when the training set is moderately large or when the attributes that describe instances are conditionally independent given the classification. Examples of their application are diagnosis, classification of RNA sequences in taxonomic studies and text document classification such as spam email filtering. The adjective *naive* comes from the assumption that the variables (the features) in the dataset are mutually independent given the class. This is a quite unrealistic assumption because, in practice, independence is often violated. Nevertheless, these classifiers still perform very well in many cases. Note that strong violations of the independence assumption and non-linear classification problems can lead to very poor performance of this type of classifier.

**Definition (Conditional independence):** let $X,Y,Z$ be random variables. $X$ is conditionally independent of $Y$ given $Z$ if the probability distribution governing $X$ is independent of the value of $Y$ given the value of $Z$. Mathematically speaking:

$$P(X=x_i | Y=y_j, Z=z_k) = P(X=x_i | Z=z_k) \quad \forall x_i,y_j,z_k,$$

that can be written in a more compact notation as:

$$P(X|Y,Z) = P(X|Z)$$

Naive Bayes uses conditional independence to justify

$$P(X,Y|Z) = P(X|Y,Z)P(Y|Z) = P(X|Z)P(Y|Z)$$

Check [this explanation](https://math.stackexchange.com/questions/23093/could-someone-explain-conditional-independence) out!


Now let's go back to the *naive Bayes classifier*. Assume $f:X\rightarrow V$ is the target function, where each instance $x$ is described by attributes $\langle a_1,...,a_n\rangle$, $V$ is a finite set and $d$ is the set of instances (the dataset).

The most probable value $f(x)$ of a new instance $x$ given the training dataset $d$ is:

$$\begin{align}
v_{MAP} &= \underset{v_j\in V}{\mathrm{argmax}} P(v_j | a_1, \dots, a_n, d) \notag \\
&= \underset{v_j\in V}{\mathrm{argmax}} \frac{P(a_1,\dots,a_n | v_j,d)P(v_j|d)}{P(a_1,\dots,a_n | d)} \qquad \text{(Bayes th.)} \notag \\
&= \underset{v_j\in V}{\mathrm{argmax}} P(a_1,\dots,a_n | v_j,d) P(v_j|d). \qquad \text{(Denominator independent from $v_j$)}\notag
\end{align}$$

We would need a very large number of training instances to estimate the joint probability $P(a_1,\dots,a_n | v_j,d)$ accurately. We simplify the problem by making the **Naive Bayes assumption**:

$$P(a_1,\dots,a_n | v_j, d) = \prod_{i=1}^n P(a_i | v_j, d)$$

which gives an approximation of $v_{MAP}$. Consequently:

$$v_{NB} = \underset{v_j\in V}{\mathrm{argmax}} P(v_j | d) \prod_{i=1}^n P(a_i | v_j, d).$$

**Question:** how do we estimate $P(v_j | d)$ and $P(a_i | v_j, d)$?  In the case of categorical data, we count the frequencies of $v_j$ and of the pairs $(a_i,v_j)$ in the training dataset $d$.

$$\hat{P}(a_i | v_j) = \frac{N_{a_i,v_j}}{N_{v_j}}$$

where:
- $N_{a_i,v_j}$ is the number of times that feature value $a_i$ appears in samples from class $v_j$.
- $N_{v_j}$ is the number of entries of class $v_j$.

## The naive Bayes algorithm

<img src="images/bayesian_learning/naive_bayes_alg.png" alt="Posterior probabilities" style="width: 30em;"/>

then for a new instance $x=\langle \alpha_1,\dots, \alpha_n\rangle$ we estimate its value using:

$$v_{NB} = \underset{v_j\in V}{\mathrm{argmax}} \hat{P}(v_j | d) \prod_{i=1}^n \hat{P}(\alpha_i | v_j, d).$$

Note that:
- The set of $\hat{P}$'s and the $v_{NB}$ rule corresponds to the learned $h$.
- No search in the hypothesis space.
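The training and classification steps can be sketched for categorical attributes as follows. This is a minimal frequency-counting implementation without smoothing; the function names and the tiny dataset are hypothetical and are not taken from the figure.

```python
from collections import Counter


def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, label). Returns (priors, cond)."""
    class_counts = Counter(label for _, label in examples)
    # attr_counts[(i, value, label)]: examples of class `label` whose
    # i-th attribute equals `value`.
    attr_counts = Counter()
    for attrs, label in examples:
        for i, value in enumerate(attrs):
            attr_counts[(i, value, label)] += 1
    priors = {v: c / len(examples) for v, c in class_counts.items()}

    def cond(i, value, label):
        # Frequency estimate N_{a_i, v_j} / N_{v_j} (no smoothing).
        return attr_counts[(i, value, label)] / class_counts[label]

    return priors, cond


def classify(priors, cond, attrs):
    """Return the argmax label and the unnormalised score of each label."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for i, value in enumerate(attrs):
            score *= cond(i, value, label)
        scores[label] = score
    return max(scores, key=scores.get), scores


# Tiny hypothetical dataset with attributes (outlook, wind).
examples = [
    (("sun", "weak"), "yes"),
    (("sun", "strong"), "no"),
    (("rain", "weak"), "yes"),
    (("rain", "strong"), "no"),
    (("sun", "weak"), "yes"),
]
priors, cond = train_naive_bayes(examples)
label, scores = classify(priors, cond, ("rain", "weak"))
print(label, scores)
```

Training is a single counting pass over the data, which reflects the observation that no search in the hypothesis space takes place. In this toy dataset the score of `no` is exactly zero because `weak` never occurs with that class.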

## Observations

#### Conditional independence assumption is often violated but it works well anyway
We do not need the estimated posteriors $\hat{P}(v_j | x)$ to be correct; we only need that:

$$\underset{v_j\in V}{\mathrm{argmax}} \hat{P}(v_j | d) \prod_{i=1}^n \hat{P}(a_i | v_j, d) = \underset{v_j\in V}{\mathrm{argmax}} {P}(v_j | d) \prod_{i=1}^n {P}(a_i | v_j, d) $$


#### Problems with attribute values missing in the training examples
Suppose that none of the training instances with target value $v_j$ have attribute value $a_i$. (Earlier, $a_i$ denoted the *i*-th attribute and $\alpha_i$ its value; here we consider a single attribute $a$ and write $a_i$ for its *i*-th value.) Then

$$\hat{P}(a_i | v_j, d)=0$$

and so

$$\hat{P}(v_j | d) \prod_{i} \hat{P}(a_i | v_j, d)=0.$$

To overcome this problem, instead of estimating the probability of a certain attribute value given the target value as $\frac{n_c}{n}$, we use the *m-estimate*:

$$\hat{P}(a_i | v_j) = \frac{n_c+mp}{n+m}$$

where:
- $n$ is the number of training examples for which $v=v_j$.
- $n_c$ is the number of examples for which $v=v_j$ and $a=a_i$.
- $p$ is the prior estimate for $\hat{P}(a_i | v_j)$ (e.g. a uniform distribution: $p = 1/k$ if the attribute has $k$ possible values).
- $m$ is the weight given to prior, and it is called *equivalent sample size*, which is interpretable as augmenting the $n$ examples by $m$ virtual examples distributed as $p$.
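The smoothed estimate is a one-line function. In the sketch below the counts are hypothetical; the function name `m_estimate` is illustrative.

```python
def m_estimate(n_c, n, p, m):
    """Smoothed estimate (n_c + m*p) / (n + m) of P(a_i | v_j)."""
    return (n_c + m * p) / (n + m)


# Hypothetical case: the value never occurs among n = 5 examples of the class,
# the attribute has k = 3 values (uniform prior p = 1/3) and m = 3.
smoothed = m_estimate(n_c=0, n=5, p=1 / 3, m=3)
unsmoothed = m_estimate(n_c=0, n=5, p=1 / 3, m=0)  # m = 0 recovers n_c / n
print(smoothed, unsmoothed)
```

With $p = 1/k$ and $m = k$ the formula reduces to $\frac{n_c + 1}{n + k}$, i.e. add-one (Laplace) smoothing, so an unseen value no longer forces the whole product to zero.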

## Example
Consider the problem of determining if someone is going to play tennis according to the values of some attributes. Our instance is:

$$\langle \text{Outlook} = sun, \text{Temp} = cool, \text{Humid} = high, \text{Wind} = strong \rangle$$

and we want to compute

$$v_{NB} = \underset{v_j\in V}{\mathrm{argmax}} {P}(v_j | d) \prod_{i=1}^n {P}(\alpha_i | v_j, d).$$

The first thing to do is to estimate conditional probabilities:

$$\hat{P}(v_j|d) = \frac{|\{V=v_j\}|}{|d|}$$

- $\hat{P}(\text{PlayTennis}=yes | d) = P(y|d) = 9/14 = 0.64$
- $\hat{P}(\text{PlayTennis}= no | d) = P(n|d) = 5/14 = 0.36$


$$\hat{P}(\alpha_i|v_j, d) = \frac{|\{A=\alpha_i, V=v_j\}|}{|\{V=v_j\}|}$$

- $\hat{P}(\text{Wind}= strong | \text{PlayTennis}=yes, d) = 3/9 = 0.33$
- $\hat{P}(\text{Wind}= strong | \text{PlayTennis}=no, d) = 3/5 = 0.60$

Then we compute the naive Bayes solution:

$$v_{NB} = \underset{v_j\in V}{\mathrm{argmax}} \hat{P}(v_j | d) \prod_{i=1}^n \hat{P}(\alpha_i | v_j, d)$$

where:
- $j\in\{1,2\}$,
- $v_1 = yes(y)$
- $v_2 = no(n)$

We compute both:

- $\hat{P}(y | d) \hat{P}(sun | y,d) \hat{P}(cool | y,d) \hat{P}(high | y,d) \hat{P}(strong |y,d) = 0.005$
- $\hat{P}(n | d) \hat{P}(sun | n,d) \hat{P}(cool | n,d) \hat{P}(high | n,d) \hat{P}(strong |n,d) = 0.021$

allowing us to conclude that $v_{NB} = no$
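The two products can be reproduced exactly with rational arithmetic. Only the class priors and the `Wind = strong` estimates appear in the text above; the remaining conditional estimates in this sketch are assumed from the standard 14-example PlayTennis table in Mitchell's textbook, which is not reproduced in these notes.

```python
from fractions import Fraction as F

priors = {"yes": F(9, 14), "no": F(5, 14)}

# Estimates for the instance <sun, cool, high, strong>, in that order.
# Only the last entry of each list is given in the text; the others are
# assumed from Mitchell's PlayTennis table.
cond = {
    "yes": [F(2, 9), F(3, 9), F(3, 9), F(3, 9)],
    "no": [F(3, 5), F(1, 5), F(4, 5), F(3, 5)],
}

scores = {}
for v in priors:
    score = priors[v]
    for p in cond[v]:
        score *= p
    scores[v] = score

v_nb = max(scores, key=scores.get)
# Normalising the two scores gives an estimate of the posterior of "no".
posterior_no = scores["no"] / sum(scores.values())
print({v: float(s) for v, s in scores.items()}, v_nb, float(posterior_no))
```

Normalising the scores is optional for classification, since the argmax does not change, but it turns the two products into an estimate of how confident the classifier is.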

### Example
Consider a dataset of $500$ documents, where $100$ of them are classified as spam. This is a binary classification problem with $V=\{ham,spam\}$. We now want to compute the class-conditional probability for the message `Hello world`, which consists of the two words "hello" and "world". Assume that, of the $100$ spam messages in our dataset, the word "hello" appears in $20$ of them, while the word "world" appears in $2$ of them. According to the naive Bayes assumption, if we know that a message is spam, then the probability that the message contains "hello" is independent of the probability that it contains "world" (this is not a realistic hypothesis; however, naive Bayes classifiers are known to still perform well in such cases). Consequently:

$$P(a=\begin{bmatrix} \text{hello,} & \text{world}\end{bmatrix} | v = \text{spam}) = P(\text{hello } | \text{ spam}) \cdot P(\text{world } | \text{ spam}).$$

We now compute the maximum-likelihood estimate from the observed frequencies:

$$\hat{P}(a=\begin{bmatrix} \text{hello,} & \text{world}\end{bmatrix} | v = \text{spam}) = \frac{20}{100}\cdot \frac{2}{100} = 0.004$$

Recall that, by Bayes' theorem:

$$\text{posterior probability} = \frac{\text{conditional probability} \cdot \text{prior probability}}{\text{evidence}}$$

If we assume that the prior probability follows a uniform distribution, then the posterior probability is completely determined by the class-conditional probability and by the evidence. Since the evidence term is a constant, the decision rule then depends entirely on the class-conditional probability (similar to a frequentist approach and maximum-likelihood estimation). A uniform prior is not realistic, though: we should instead consult a domain expert or estimate the prior from the training data, assuming that the training data are i.i.d. and a representative sample of the entire population; in that case it would be $\hat{P}(\text{spam})=100/500=0.2$.

Under the uniform-prior assumption, the decision rule for this particular example will be:

```python
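def classify(p_a_given_spam, p_a_given_ham):
    # Uniform priors: compare the class-conditional probabilities only.
    if p_a_given_spam >= p_a_given_ham:
        return "spam"
    return "ham"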
```

With non-uniform priors, the posterior probability is proportional to the product of the class-conditional probability and the prior probability:

$$P(spam | a) \propto P(a | v = spam) \cdot P(spam)$$

$$P(ham | a) \propto P(a | v = ham) \cdot P(ham)$$

where the priors are estimated from the training set:
- $\hat{P}(v=spam) = \frac{\text{# spam messages}}{\text{# messages}} = \frac{100}{500} = 0.2$
- $\hat{P}(v=ham) = \frac{\text{# ham messages}}{\text{# messages}} = \frac{400}{500} = 0.8$
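Putting the estimated priors and the class-conditional probabilities together gives the unnormalised posterior of each class. In the sketch below the spam counts come from the example, while the ham word counts (`120` and `40`) are hypothetical, since the text does not provide them; the function name `nb_score` is also illustrative.

```python
def nb_score(word_doc_counts, n_class_docs, n_docs):
    """Unnormalised posterior: P(class) * prod_w P(w | class)."""
    score = n_class_docs / n_docs  # prior estimated from class frequency
    for count in word_doc_counts:
        score *= count / n_class_docs  # frequency estimate of P(w | class)
    return score


# Spam: "hello" in 20 and "world" in 2 of the 100 spam messages (from the text).
spam_score = nb_score([20, 2], n_class_docs=100, n_docs=500)
# Ham: hypothetical counts for "hello" and "world" among the 400 ham messages.
ham_score = nb_score([120, 40], n_class_docs=400, n_docs=500)

label = "spam" if spam_score >= ham_score else "ham"
print(spam_score, ham_score, label)
```

For the spam class this yields $0.2 \cdot 0.004 = 0.0008$; the final label depends on the assumed ham counts.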