# The Bayesian reasoning

Suppose you have a fever and search the internet for possible causes. You find that a fever occurs in 20% of colds, while it occurs in 75% of appendicitis cases. What would your reaction be?
In practical terms, we are saying that the conditional probabilities of the symptom are

$$ p(fever | cold) = 0.2, $$

and that 

$$ p(fever | appendicitis) = 0.75. $$


This probably sounds a bit odd to your common sense. That common sense is really your knowledge about the world, which tells you that appendicitis is not such a common event, at least compared to a cold. There is a great difference between the probability of the symptoms given a disease (what you read on the internet) and the probability of the disease given the symptoms (what you want to know).

This difference in probabilities can be formalized within the framework of Bayesian reasoning. 

In particular, what we called common sense, or knowledge about the world, is what we call the _prior_. Our reasoning consists of weighting the possible outcomes by the respective prior information. For example, supposing that we are in winter, a cold would not be such a rare event, and it may affect 80% of the population. On the contrary, appendicitis is much rarer, and affects 2% of the population.
In other words, we are saying that:

$$p(cold) = 0.8,$$

and 

$$p(appendicitis) = 0.02.$$


If we use these probabilities to weight our outcomes, the probability of having a cold is related to the product $p(fever | cold) \, p(cold)$, while the probability of having appendicitis is related to $p(fever | appendicitis) \, p(appendicitis)$. In numbers we get

$$ p(fever | cold) \, p(cold) = 0.2 \times 0.8 = 0.16,$$

and 

$$ p(fever | appendicitis) \, p(appendicitis) = 0.75 \times 0.02 = 0.015.$$




In [62]:
import numpy as np


p_fever_given_cold = 0.2
p_fever_given_appendicits = 0.75

p_cold = 0.8
p_appendicitis = 0.02

print('p(fever | cold) * p(cold) = {}'.format(p_fever_given_cold*p_cold))
print('p(fever | appendicitis) * p(appendicitis) = {}'.format(p_fever_given_appendicits*p_appendicitis))


p(fever | cold) * p(cold) = 0.16
p(fever | appendicitis) * p(appendicitis) = 0.015


In practice, we would like these two quantities to sum to one, so that they express probabilities. The natural way to obtain this is to divide each product by their sum. We then obtain:

In [63]:
denominator = p_fever_given_cold*p_cold + p_fever_given_appendicits*p_appendicitis

print('Normalized probabilities:')
print('p(cold | fever) = {}'.format(p_fever_given_cold*p_cold/denominator))
print('p(appendicitis | fever) = {}'.format(p_fever_given_appendicits*p_appendicitis/denominator))

Normalized probabilities:
p(cold | fever) = 0.914285714286
p(appendicitis | fever) = 0.0857142857143


### A bit of notation

We now rephrase the problem with some formalism. Let's denote the event _having fever_ as the data $X$, and the possible causes, cold and appendicitis, as the models $M_1$ and $M_2$ respectively. Rewriting what we have seen so far, we get:

$$ p(X|M_1) = 0.2, $$

$$ p(X|M_2) = 0.75, $$

$$ p(M_1) = 0.8, $$

$$ p(M_2) = 0.02. $$

The term $p(X|M)$ is called the _data likelihood_. It expresses the probability of observing the data $X$ (fever) when the model $M$ (cold) is known to be true. It is therefore the link between the observations and the underlying model. The term $p(M)$ is instead the _prior_. It does not depend on the data, and quantifies our belief that the model $M$ is true. When making inference (which disease do we have?) we are interested in computing the quantity $p(M|X)$. This quantity is called the _posterior_, as it expresses the probability of the model _a posteriori_, i.e. after the data has been observed. Bayes' theorem links these elements together through the famous formula:


$$ p(M|X) = \frac{p(X|M)p(M)}{p(X)}. $$

To compute the posterior via Bayes' formula we are still missing the denominator $p(X)$. This quantity acts as a normalizing term since, as we have seen before, the posterior must sum (or integrate) to one over the possible models. To compute this term in practice, we can use the law of total probability:

$$ p(X) = \int p(X|M)p(M) dM, $$

which states that the probability of an event is the sum (or the integral, when the models form a continuum) of the probability of the event conditioned on each possible model, multiplied by the probability of that model. In our case, if we assume that all the possibilities reduce to the events _cold_ and _appendicitis_, we can simply write:

$$ p(X) = p(X|M_1)p(M_1) +  p(X|M_2)p(M_2). $$

Back to our simple example, we note that this term corresponds to the normalizing factor that we used before. 
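
As a minimal sketch of the formula above for a finite set of models, the helper below multiplies each likelihood by its prior and divides by their sum. The function name `posterior` and the dictionary layout are illustrative choices, not part of the original notebook; the numbers are the ones used in the fever example.

```python
def posterior(likelihoods, priors):
    """Return p(M | X) for each model M, given p(X | M) and p(M).

    Assumes the models listed are the only possible explanations,
    so that p(X) is the sum of p(X | M) * p(M) over the listed models.
    """
    joint = {m: likelihoods[m] * priors[m] for m in likelihoods}
    evidence = sum(joint.values())  # p(X), by the law of total probability
    return {m: joint[m] / evidence for m in joint}


likelihoods = {'cold': 0.2, 'appendicitis': 0.75}  # p(fever | M)
priors = {'cold': 0.8, 'appendicitis': 0.02}       # p(M)

post = posterior(likelihoods, priors)
for model, p in post.items():
    print('p({} | fever) = {:.4f}'.format(model, p))
```

By construction the returned values sum to one, which is exactly the role played by $p(X)$ in the denominator.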


## Preliminaries for Bayesian model selection

Suppose that, for given data $X$, we would like to compare two competing models $M_1$ and $M_2$.
In a Bayesian setting, we would like to quantify the probability of a model given the observations, which is expressed as:

$$ p(M|X).$$

This quantity can be reformulated thanks to the Bayes rule:

$$ p(M|X) = \frac{p(X|M)p(M)}{p(X)}. $$

In this formula, the term linking the data to the model is the probability $p(X|M)$. Therefore, two different models can be compared on the basis of the discrepancy between the probabilities they assign to the data. This discrepancy is called the __Bayes factor__, and is quantified as the ratio:


$$ B = \frac{p(X|M_1)}{p(X|M_2)},$$

which is equivalent to 

$$ B = \frac{p(M_1|X)}{p(M_2|X)} \frac{p(M_2)}{p(M_1)}. $$
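
The step between the two expressions can be spelled out as follows. Inverting Bayes' rule for each model gives $p(X|M_i) = p(M_i|X)\,p(X)/p(M_i)$, and the common factor $p(X)$ cancels in the ratio:

$$ B = \frac{p(X|M_1)}{p(X|M_2)} = \frac{p(M_1|X)\,p(X)/p(M_1)}{p(M_2|X)\,p(X)/p(M_2)} = \frac{p(M_1|X)}{p(M_2|X)} \frac{p(M_2)}{p(M_1)}. $$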

If we assume that the models $M_1$ and $M_2$ have the same probability a priori, $p(M_2) = p(M_1)$, the Bayes factor is the ratio between the models' posteriors: 

$$ B = \frac{p(M_1|X)}{p(M_2|X)}. $$


This ratio is also called _posterior odds_, and is defined more generally as:

$$ R = \frac{p(M_1|X)}{p(M_2|X)}  = \frac{p(X|M_1)p(M_1)}{p(X|M_2)p(M_2)}.$$

The posterior odds are therefore the product of the Bayes factor and the ratio between the priors, called the _prior odds_.


In [68]:
likelihood_cold = p_fever_given_cold
prior_cold = p_cold

likelihood_appendicitis = p_fever_given_appendicits
prior_appendicitis = p_appendicitis

p_fever = p_fever_given_cold*p_cold + p_fever_given_appendicits*p_appendicitis

posterior_cold = likelihood_cold * prior_cold / p_fever
posterior_appendicitis = likelihood_appendicitis * prior_appendicitis / p_fever

bayes_factor = posterior_cold * prior_appendicitis/ (posterior_appendicitis * prior_cold)
prior_odds = prior_cold/prior_appendicitis
posterior_odds = posterior_cold/posterior_appendicitis


print('The prior and posterior for cold are: {} and {}'.format(prior_cold,posterior_cold))
print('The prior and posterior for appendicitis are: {} and {}'.format(prior_appendicitis, posterior_appendicitis))
print('The Bayes factor is: ' + str(bayes_factor))
print('The prior odds are: ' + str(prior_odds))
print('The posterior odds are: ' + str(posterior_odds))


The prior and posterior for cold are: 0.8 and 0.914285714286
The prior and posterior for appendicitis are: 0.02 and 0.0857142857143
The Bayes factor is: 0.266666666667
The prior odds are: 40.0
The posterior odds are: 10.6666666667


## Formalizing the model evidence

Beyond the toy problem above, the Bayes factor depends entirely on the _model evidence_ $p(X|M)$. A model is identified by its parameters $\boldsymbol{\theta} = \{ \theta_1, \theta_2, \ldots, \theta_k \}$. Therefore, the probability of the data under the model depends on the probability of those parameters. 

According to the law of total probability we have:

$$ p(X|M) = \int_{\boldsymbol{\theta}} p(X|\boldsymbol{\theta}, M) \, p(\boldsymbol{\theta}|M) \, d\boldsymbol{\theta}. $$

The model evidence is sometimes also called the marginal likelihood because it can be viewed as a likelihood function over the space of models, in which the parameters have been integrated (marginalized) out. Evaluating this integral lets us average over the model parameters instead of making point estimates of their values, which naturally compensates for the over-fitting associated with maximum likelihood.

From a sampling perspective, the marginal likelihood can be viewed as the probability of generating the data set $X$ from a model whose parameters $\boldsymbol{\theta}$ are sampled at random from the prior $p(\boldsymbol{\theta}|M)$. 




As a worked example, suppose we have a random variable that produces either a success or a failure. We want to compare a model $M_1$ where the probability of success is $q = 0.5$, and another model $M_2$ where $q$ is unknown and we take a prior distribution for $q$ that is uniform on $[0,1]$. We take a sample of 200 trials, and find 115 successes and 85 failures. 

The evidence for model $M_1$ will be:

$$ p(X|M_1) = p(X|q=0.5) $$

In [69]:
import numpy as np
from scipy.stats import binom

N_trials  = 200
N_success = 115

## Model 1:  p(q=0.5) = 1
q = 0.5
evidence_M1 = binom.pmf(N_success,N_trials,q)

print('Model 1:  p( q={}) = 1'.format(q))
print('p(X = {} | M1) = {}'.format(N_success, evidence_M1))


Model 1:  p( q=0.5) = 1
p(X = 115 | M1) = 0.00595589219024


For model $M_2$ we need to integrate over all the possible values of the parameter $q$.

The evidence for model $M_2$ will be:

\begin{align*} 
p(X|M_2) &= \int_0^1 p(X|q,M_2) \, p(q|M_2) \, dq = \int_0^1 p(X|q,M_2) \, dq \\
 &= {200 \choose 115} \int_0^1 q^{115}(1-q)^{85} \, dq\\
 &= {200 \choose 115} B(116,86) = {200 \choose 115} \frac{115! \, 85!}{(115+85+1)!} = \frac{1}{201}
\end{align*}
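
The value $1/201$ can be checked numerically. The sketch below is one possible check, not the notebook's own method: it evaluates the Beta-function expression in log space with `scipy.special.betaln` and `gammaln`, and separately integrates the binomial likelihood over $q$ with `scipy.integrate.quad`. The variable names are illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import betaln, gammaln
from scipy.stats import binom as binom_dist

n_trials, n_success = 200, 115
n_fail = n_trials - n_success

# log of the binomial coefficient C(200, 115), computed via log-Gamma
log_coeff = gammaln(n_trials + 1) - gammaln(n_success + 1) - gammaln(n_fail + 1)

# Closed form: C(200, 115) * B(116, 86), evaluated in log space for stability
evidence_closed_form = np.exp(log_coeff + betaln(n_success + 1, n_fail + 1))

# Numerical quadrature of p(X | q) over the uniform prior on [0, 1]
evidence_quadrature, abs_err = quad(
    lambda q: binom_dist.pmf(n_success, n_trials, q), 0.0, 1.0
)

print('closed form: {}'.format(evidence_closed_form))
print('quadrature:  {}'.format(evidence_quadrature))
print('1/201:       {}'.format(1.0 / 201))
```

Working in log space avoids forming the very large binomial coefficient and the very small Beta function separately.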

In [104]:
print('Model 2:  p( q) = U(0,1)')

evidence_M2 = 1./201

print('p(X = 115  | M2) = {}'.format(evidence_M2))


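## Alternative approach based on Monte Carlo sampling of q from the uniform prior
# Use comb for the binomial coefficient, so that scipy.stats.binom is not shadowed
from scipy.special import comb
q = np.random.rand(1000)
integral = np.mean(np.exp(115*np.log(q) + 85*np.log(1-q))) 

# Keep the Monte Carlo estimate separate from the exact value evidence_M2
evidence_M2_mc = integral * comb(200,115)
print('Alternative computation based on Monte Carlo sampling')
print('p(X = 115  | M2) = {}'.format(evidence_M2_mc))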


Model 2:  p( q) = U(0,1)
p(X = 115  | M2) = 0.00497512437811
Alternative computation based on Monte Carlo sampling
p(X = 115  | M2) = 0.0046455523538
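
The sampling interpretation of the marginal likelihood can also be illustrated by simulating the whole generative process: draw $q$ from the prior, draw a data set from the model with that $q$, and count how often exactly 115 successes are produced. This is a rough sketch under assumed settings (the seed and the number of simulations are arbitrary choices), and the estimate is noisy.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility
n_sim = 200_000                 # number of simulated data sets (assumed value)

# Step 1: sample the parameter from the uniform prior p(q | M2)
q_samples = rng.uniform(0.0, 1.0, size=n_sim)

# Step 2: for each sampled q, generate the number of successes in 200 trials
x_samples = rng.binomial(200, q_samples)

# Step 3: the fraction of simulated data sets equal to the observed one
# estimates the marginal likelihood p(X = 115 | M2)
evidence_sim = np.mean(x_samples == 115)

print('Simulated p(X = 115 | M2) = {}'.format(evidence_sim))
print('Exact value 1/201         = {}'.format(1.0 / 201))
```

Under the uniform prior every count from 0 to 200 is equally likely a priori, which is why the exact value is $1/201$.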


In [71]:
print('The Bayes factor is: ' + str(evidence_M1/evidence_M2))

The Bayes factor is: 1.19713433024


The Bayes factor therefore indicates slightly larger evidence for $M_1$. Indeed, $M_2$ is a more complex model than $M_1$ because it has a free parameter which allows it to model the data more closely. The ability of Bayes factors to take this into account is a reason why Bayesian inference has been put forward as a theoretical justification for and generalisation of Occam's razor.

The maximum likelihood estimate for the parameter $q$ under the model $M_2$ would be $q = 115/200 = 0.575$.
Using this parameter estimate we obtain a likelihood of $p(X|q=0.575) = {200 \choose 115} q^{115}(1-q)^{85} = 0.056991$. 

We can then use the AIC for comparing the two models, where $L$ denotes the (maximized) likelihood of the model. Under $M_1$ we have 0 parameters, as we fixed $q$ from the beginning, and the AIC will be:

$$ AIC_{M_1} = -2 \ln L_{M_1} + 2 N_{parameters} = -2 \ln(0.005956) + 2 \cdot 0 = 10.2467. $$

Under model $M_2$ we have 1 free parameter $q\in [0,1]$, whose maximum likelihood estimate is $q = 0.575$.
The AIC will be:

$$ AIC_{M_2} = -2 \ln L_{M_2} + 2 N_{parameters} = -2 \ln(0.056991) + 2 \cdot 1 = 7.7297. $$

We observe that, although the likelihood of $M_2$ is about 10 times larger than that of $M_1$, the AIC only slightly favours $M_2$. This is because of the penalty on the complexity of $M_2$. If we compute the relative likelihood of $M_1$ with respect to $M_2$, the quantity from which Akaike weights are built, we obtain:

$$w = \exp(-(10.2467-7.7297)/2) = 0.284.$$


Thus $M_1$ is about 0.28 times as probable as $M_2$, which means that we cannot exclude $M_1$ on the basis of this experiment.
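
The AIC comparison above can be reproduced with a few lines of code. This is a minimal sketch: the helper `aic` and the variable names are illustrative, and the log-likelihoods are taken from `scipy.stats.binom.logpmf` rather than from the rounded values quoted in the text, so the last digits may differ slightly.

```python
import numpy as np
from scipy.stats import binom as binom_dist

n_trials, n_success = 200, 115


def aic(log_likelihood, n_parameters):
    """Akaike information criterion: -2 ln L + 2 k."""
    return -2.0 * log_likelihood + 2.0 * n_parameters


# Model 1: q fixed at 0.5, no free parameters
loglik_m1 = binom_dist.logpmf(n_success, n_trials, 0.5)
aic_m1 = aic(loglik_m1, 0)

# Model 2: q estimated by maximum likelihood, one free parameter
q_mle = n_success / n_trials
loglik_m2 = binom_dist.logpmf(n_success, n_trials, q_mle)
aic_m2 = aic(loglik_m2, 1)

# Relative likelihood of the model with the larger AIC (here M1)
relative_likelihood = np.exp(-(aic_m1 - aic_m2) / 2.0)

print('AIC M1 = {:.4f}'.format(aic_m1))
print('AIC M2 = {:.4f}'.format(aic_m2))
print('Relative likelihood of M1 w.r.t. M2 = {:.3f}'.format(relative_likelihood))
```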


## Exercises.

__1.__ Given the following data, estimate by repeated cross-validation the prediction accuracy of the following classification algorithms: Logistic Regression, Decision Tree, and K-Nearest Neighbors.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

dataframe[:10]

__2.__ Compute a 95% confidence interval for the difference between the prediction accuracy of Logistic Regression and K-Nearest Neighbors. What can you conclude?

__3.__ Compute the probability that the average prediction accuracy of K-Nearest Neighbors is greater than 0.75.