In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import factorial, comb
from scipy.stats import gamma, beta

from scipy.special import gamma as gamma_func
from scipy.special import beta as beta_func

import warnings

# Lecture 2: Model comparison

* **2.1 Bayesian estimation**
    * Bayesian inference over discrete variable
    * Bayesian inference over continuous variable

## Bayesian inference over discrete variable

### Exercise (MacKay Ex 3.1)

A die is selected at random from two 20-faced dice on which the symbols 1-10 are written with nonuniform frequency as follows:

| Symbol | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Number of faces of die **A** | 6 | 4 | 3 | 2 | 1 | 1 | 1 | 1 | 1 | 0 |
| Number of faces of die **B** | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1 |

The randomly chosen die is rolled 7 times, with the following outcomes:

<center>5, 3, 9, 3, 8, 4, 7</center>

What is the probability that the die was die **A**?

#### Solution

Denote $x=A,B$ the two dice.

Denote $y=1,\ldots,10$ the possible outcomes of a single throw of a die.

Denote $D=\{5,3,9,3,8,4,7\}=\{y_1,\ldots,y_7\}$ the observed data.

We wish to compute the probability of $x=A,B$ given the observed data $D$, i.e. $p(x|D)$ from Bayes' rule:
$$p(x|D)=\frac{p(D|x)p(x)}{p(D)}$$

$p(D|x)$ is the probability to observe the data under the two different models (dice):
$$p(D|x=A)=\prod_{i=1}^7 p(y_i|x=A)=\frac{1\cdot3\cdot1\cdot3\cdot1\cdot2\cdot1}{20^7}=\frac{18}{20^7}$$
$$p(D|x=B)=\prod_{i=1}^7 p(y_i|x=B)=\frac{2\cdot2\cdot1\cdot2\cdot2\cdot2\cdot2}{20^7}=\frac{64}{20^7}$$

$p(x)=1/2$ is the prior probability to choose a die.

The ratio of the two posterior probabilities is easily calculated:
$$\frac{p(D|x=A)}{p(D|x=B)}=\frac{p(x=A|D)}{p(x=B|D)}=\frac{18}{64}=\frac{9}{32}$$

and since $p\left(D\right)=\frac{18}{20^{7}}\frac{1}{2}+\frac{64}{20^{7}}\frac{1}{2}$ we get:
$$p(x=A|D)=\frac{18}{18+64}=\frac{9}{41}\qquad p(x=B|D)=\frac{64}{18+64}=\frac{32}{41}$$
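The arithmetic above can be checked with exact rational arithmetic. This is a minimal sketch: the helper name `dice_posterior` and the dictionary layout are illustrative choices, not part of the original exercise.

```python
from fractions import Fraction

# Face counts for symbols 1..10, copied from the table in the exercise.
dice_faces = {
    "A": [6, 4, 3, 2, 1, 1, 1, 1, 1, 0],
    "B": [3, 3, 2, 2, 2, 2, 2, 2, 1, 1],
}
dice_rolls = [5, 3, 9, 3, 8, 4, 7]
dice_prior = {"A": Fraction(1, 2), "B": Fraction(1, 2)}


def dice_posterior(faces, rolls, prior):
    """Exact posterior p(x|D) over the dice (hypothetical helper)."""
    joint = {}
    for die, counts in faces.items():
        n_faces = sum(counts)  # 20 for both dice
        likelihood = Fraction(1)
        for y in rolls:
            likelihood *= Fraction(counts[y - 1], n_faces)  # p(y_i|x)
        joint[die] = likelihood * prior[die]  # p(D|x) p(x)
    evidence = sum(joint.values())  # p(D)
    return {die: p / evidence for die, p in joint.items()}


dice_post = dice_posterior(dice_faces, dice_rolls, dice_prior)
for die, p in dice_post.items():
    print(die, p)  # A 9/41, then B 32/41
```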

## Bayesian inference over continuous variable

### Example of decaying particle

<center><img src="figs/evidence_decay_constant.png" width=500></center>

Given $\lambda$, the probability to observe $x_i$ is $p_0(x_i|\lambda)=\frac{1}{\lambda}e^{-x_i/\lambda}$.

When we can observe only $1 \le x \le 20$, we need to renormalize the exponential distribution on this restricted support:
$$p(x|\lambda)=\begin{cases}\frac{1}{Z(\lambda)}p_0(x|\lambda) & 1\le x \le 20\\ 0 & \text{otherwise}\end{cases}$$
$$Z(\lambda)=\int_1^{20} dx\, p_0(x|\lambda)=e^{-1/\lambda}-e^{-20/\lambda}$$
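As a sanity check, the truncated density should integrate to 1 over $[1, 20]$ for any $\lambda$. The sketch below verifies this numerically with `scipy.integrate.quad`; the names `Z_trunc` and `p_trunc` are illustrative and chosen so they do not clash with the notebook's own functions.

```python
import numpy as np
from scipy.integrate import quad


def Z_trunc(lam, lo=1.0, hi=20.0):
    """Normalization of the exponential restricted to [lo, hi]."""
    return np.exp(-lo / lam) - np.exp(-hi / lam)


def p_trunc(x, lam, lo=1.0, hi=20.0):
    """Truncated exponential density for a scalar x."""
    if x < lo or x > hi:
        return 0.0
    return np.exp(-x / lam) / lam / Z_trunc(lam, lo, hi)


# Integrate the density over the observable window for a few decay lengths.
norm_checks = {}
for lam in (2.0, 5.0, 10.0):
    total, _ = quad(p_trunc, 1.0, 20.0, args=(lam,))
    norm_checks[lam] = total
    print(f"λ = {lam}: integral = {total:.6f}")
```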

Since the particles decay independently, the probability of the data set $D=\{x_1,\ldots,x_N\}$ is the product of the probability of the data points
$$p(D|\lambda)=\prod_{i=1}^N p(x_i|\lambda)$$

Using Bayes' rule:
$$p(\lambda|D)=\frac{p(D|\lambda)p(\lambda)}{p(D)}=\frac{1}{p(D)}\frac{1}{\left(\lambda Z(\lambda)\right)^N} \exp\left(-\sum_{i=1}^N x_i/\lambda\right)p(\lambda)$$
$$p(D)=\int_0^\infty d\lambda\, \frac{1}{\left(\lambda Z(\lambda)\right)^N} \exp\left(-\sum_{i=1}^N x_i/\lambda\right)p(\lambda)$$

Let's visualize the likelihood for a single point

In [None]:
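# Normalization of the exponential restricted to 1 <= x <= 20
Z = lambda lam : np.exp(-1./lam) - np.exp(-20./lam)
def p_x_lam(x, lam):
    # Truncated exponential density: zero outside the observable window [1, 20]
    return ((x >= 1.) & (x <= 20.)) * np.exp(-x/lam)/lam/Z(lam)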

plt.figure(figsize=(12,4))

# visualize prob of x given λ
plt.subplot(121)
lams = [2.,5.,10.]
x_plot = np.linspace(0., 20., 100)
for lam in lams:
    plt.plot(x_plot, p_x_lam(x_plot, lam), label=f"λ = {lam}")
plt.xlabel('x')
plt.ylabel('P(x|λ)')
plt.legend();

# visualize likelihood (note that the function over λ is not normalized)
plt.subplot(122)
xs = [3., 5., 12]
lam_plot = np.logspace(-1, 2, 100)
for x in xs:
    plt.plot(lam_plot, p_x_lam(x, lam_plot), label=f"x = {x}")
plt.vlines(x=lams, ymin=0, ymax=0.2, ls='--', color='gray')

plt.xscale('log')
plt.xlabel('λ')
plt.ylabel('P(x|λ)')
plt.legend();

plt.tight_layout();

and the likelihood for the data set $D=\{1.5, 2, 3, 4, 5, 12\}$

In [None]:
data = np.array([1.5, 2., 3., 4., 5., 12.])

p_D_lam = [np.prod(p_x_lam(data, lam)) for lam in lam_plot]

plt.plot(lam_plot, p_D_lam)
plt.xscale('log');
plt.xlabel('λ')
plt.ylabel('P(D|λ)');
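For larger data sets the product of densities underflows quickly, so it is common to work with the log-likelihood instead. The sketch below evaluates $\log p(D|\lambda)$ from the closed form $-N\left[\log\lambda+\log Z(\lambda)\right]-\sum_i x_i/\lambda$ and reads off the grid point where it peaks. The names `log_likelihood`, `decay_data` and `lam_grid` are illustrative, and the grid maximum is only an approximation to the maximum-likelihood value.

```python
import numpy as np


def log_likelihood(lam, data, lo=1.0, hi=20.0):
    """log p(D|λ) for the exponential truncated to [lo, hi].

    Assumes every data point lies inside [lo, hi].
    """
    data = np.asarray(data, dtype=float)
    log_Z = np.log(np.exp(-lo / lam) - np.exp(-hi / lam))
    return -data.size * (np.log(lam) + log_Z) - data.sum() / lam


decay_data = np.array([1.5, 2., 3., 4., 5., 12.])
lam_grid = np.logspace(-1, 2, 1000)

# log_likelihood is vectorized over lam, so the whole grid is evaluated at once.
log_L = log_likelihood(lam_grid, decay_data)
lam_best = lam_grid[np.argmax(log_L)]
print(f"grid maximum of the likelihood at λ ≈ {lam_best:.2f}")
```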

## Introduction to Model comparison

### A new look at the bent coin

A bent coin is tossed $N$ times with $N_H$ times outcome 'head' and $N-N_H$ times outcome 'tail'.
We consider two hypotheses:
* $H_0$: the coin is fair, so the probability of 'head' is $1/2$.
* $H_1$: the coin is not fair. $\lambda$ is the probability of outcome 'head'. Our prior assumption about $\lambda$ is $p(\lambda)=1$.

Assuming equal prior probabilities on the hypotheses, $p(H_0)=p(H_1)=1/2$, what is the probability of each of the hypotheses after seeing the data $D$?

By Bayes' rule, with equal prior probabilities the ratio of the posteriors equals the ratio of the likelihoods of the hypotheses (the **evidence**):
$$p(H_i|D)=\frac{p(D|H_i) p(H_i)}{p(D)}\qquad \frac{p(H_1|D)}{p(H_0|D)}=\frac{p(D|H_1)}{p(D|H_0)}$$

$H_0$:
$$p(D|H_0)= \left(\frac{1}{2}\right)^{N_H}\left(\frac{1}{2}\right)^{N-N_H}= \left(\frac{1}{2}\right)^{N}$$

$H_1$:
$$p(D|\lambda,H_1)=\lambda^{N_H}(1-\lambda)^{N-N_H}$$
$$p(\lambda|H_1)=1\qquad \left(\text{Note:} \int_0^1 d\lambda\, p(\lambda|H_1)=1\right)$$
$$p(D|H_1)=\int_0^1 d\lambda\, p(\lambda|H_1)p(D|\lambda,H_1)=\int_0^1 d\lambda\, \lambda^{N_H}(1-\lambda)^{N-N_H}=\frac{N_H! (N-N_H)!}{(N+1)!}$$
(recall the Beta function from Lec 1).
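The closed form for $p(D|H_1)$ can be compared against a direct numerical integration and against the Beta function $B(N_H+1, N-N_H+1)$. This is a rough check rather than a proof; the helper names `evidence_H1_closed` and `evidence_H1_numeric` are illustrative.

```python
import math

from scipy.integrate import quad
from scipy.special import beta as beta_func


def evidence_H1_closed(N_H, N):
    """N_H! (N - N_H)! / (N + 1)!"""
    return math.factorial(N_H) * math.factorial(N - N_H) / math.factorial(N + 1)


def evidence_H1_numeric(N_H, N):
    """Integrate λ^N_H (1 - λ)^(N - N_H) over [0, 1] under the uniform prior."""
    value, _ = quad(lambda lam: lam**N_H * (1 - lam)**(N - N_H), 0.0, 1.0)
    return value


for N_H, N in [(0, 2), (1, 2), (3, 6), (10, 20)]:
    print(N_H, N,
          evidence_H1_closed(N_H, N),
          evidence_H1_numeric(N_H, N),
          beta_func(N_H + 1, N - N_H + 1))
```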

Thus
$$\frac{p(H_1|D)}{p(H_0|D)}=2^N \frac{N_H! (N-N_H)!}{(N+1)!}$$

Consider a different hypothesis $H_0$:

The coin is unfair with probability 'head' $p_0=1/6$. Then:
$$\frac{p(H_1|D)}{p(H_0|D)}=\frac{\frac{N_H! (N-N_H)!}{(N+1)!}}{p_0^{N_H}(1-p_0)^{N-N_H}}$$

Let's look at the outcome of model comparison between models $H_0$ and $H_1$ (MacKay Table 3.5; note that we changed $N$→$F$, $N_H$→$F_a$, $N-N_H$→$F_b$).

In [None]:
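from scipy.special import gammaln

def odd_ratio(Fa, Fb, p0):
    # Evidence ratio p(D|H1)/p(D|H0), computed through log-gamma because the
    # factorials themselves overflow float64 once Fa + Fb exceeds about 170.
    log_ratio = (gammaln(Fa + 1) + gammaln(Fb + 1) - gammaln(Fa + Fb + 2)
                 - Fa * np.log(p0) - Fb * np.log(1 - p0))
    return np.exp(log_ratio)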

p0 = 1/6

print("total number of tosses F = 6:")
for FaFb in [(5, 1),
             (3, 3),
             (2, 4),
             (1, 5),
             (0, 6)]:
    
    print("(Fa, Fb) :\t", FaFb, "\tratio:\t", odd_ratio(*FaFb, p0))

In [None]:
p0 = 1/6

print("total number of tosses F = 20")
for FaFb in [(10, 10),
             (3, 17),
             (0, 20)]:
    
    print("(Fa, Fb) :\t", FaFb, "\tratio:\t", odd_ratio(*FaFb, p0))

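The simple model $H_0$ (with $p_0=1/6$) 'likes' outcomes in which about one sixth of the tosses are heads, such as $(1,5)$ or $(3,17)$: there the ratio is below 1. The complex model $H_1$ 'likes' all outcomes: it spreads its probability over them, so it is favoured when the observed proportion of heads is far from $1/6$, as for $(5,1)$ or $(10,10)$.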

### Evidence accumulation (MacKay Fig. 3.6)

Let's have a look at the typical behaviour of the evidence in favour of $H_1$ as bent coin tosses accumulate.

Try out different possibilities for `pa`:

`pa = 1/6`

`pa = 0.25`

`pa = 0.5`

and look at how evidence evolves as the number of draws grows.

In [None]:
pa = 0.5

num_draws = 200

draws = (np.random.rand(num_draws) < pa) * 1

Fas = np.cumsum(draws)
Fbs = np.arange(1, num_draws + 1) - Fas

warnings.filterwarnings("ignore")
odds = odd_ratio(Fas, Fbs, p0)

plt.figure(figsize=(12,4))
plt.subplot(121)
plt.plot(Fas);
plt.xlabel('draws')
plt.ylabel('# of heads');

plt.subplot(122)
plt.hlines(y=0, xmin=0, xmax=num_draws, ls=':', color='gray')
plt.plot(np.log(odds), '.-');
plt.xlabel('draws')
plt.ylabel('log [P(s|F, $H_1$)/P(s|F, $H_0$)]');

plt.tight_layout();

# <center>Assignments</center>

#### Ex 2.1 (MacKay Ex 3.12)

A bag contains one counter, known to be either white or black. A white
counter is put in, the bag is shaken, and a counter is drawn out,
which proves to be white. What is now the chance of drawing a white
counter? [Notice that the state of the bag, after the operations,
is exactly identical to its state before.]

$H_B$: the original counter is black

$H_W$: the original counter is white

$E$: the first counter drawn is white

$P(E|H_B) = \frac{1}{2}$

$P(E|H_W) = 1$

$P(H_W|E) = \frac{P(E|H_W)\cdot P(H_W)}{P(E|H_W) \cdot P(H_W) + P(E|H_B) \cdot P(H_B)}  $

$P(H_W|E) = \frac{1 \cdot \frac{1}{2}}{1 \cdot \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{2}} $

$P(H_W|E) = \frac{2}{3} $

$P(H_B|E) = \frac{1}{3} $

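After a white counter has been drawn, the counter left in the bag is white under $H_W$ and black under $H_B$, so

$P(W) = P(H_W|E) \cdot 1 + P(H_B|E) \cdot 0$

$P(W) = \frac{2}{3}$

i.e. the probability that the next counter drawn is white is $\frac{2}{3}$.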
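The same answer can be reached by brute-force enumeration of the equally likely "worlds" (colour of the original counter, and which of the two counters is drawn). This is a small sketch with illustrative variable names, using exact fractions.

```python
from fractions import Fraction
from itertools import product

# Each world: (probability, colour drawn, colour left in the bag).
bag_worlds = []
for original, drawn_index in product(["white", "black"], [0, 1]):
    bag = [original, "white"]  # original counter plus the added white one
    drawn = bag[drawn_index]
    remaining = bag[1 - drawn_index]
    bag_worlds.append((Fraction(1, 4), drawn, remaining))

# Condition on the event E: the first counter drawn is white.
p_E = sum(p for p, drawn, _ in bag_worlds if drawn == "white")
p_E_and_next_white = sum(
    p for p, drawn, remaining in bag_worlds
    if drawn == "white" and remaining == "white"
)
p_next_white = p_E_and_next_white / p_E
print(p_next_white)  # 2/3
```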

#### Ex 2.2

Consider the bent coin model comparison example of MacKay section
3.2-3 with $N=2$, where you take as model $H_0$ that the coin
is fair with probability of 'head'
$f=0.5$:

  * Compute the posterior probability of the two models $H_0$ and $H_1$
    for $N_H=0,1,2$.
  * You will find that for $N_H=0,2$, model $H_1$ is more likely
    and for $N_H=1$ model $H_0$ is more likely. Explain these results.

$$\frac{p(H_1|D)}{p(H_0|D)}=2^N \frac{N_H! (N-N_H)!}{(N+1)!}$$

$p(H_0) = p(H_1) = \frac{1}{2}$

$p(D|H_0) = (\frac{1}{2})^N$

$p(D|H_1) = \int_0^1 f^{N_H}(1-f)^{N_T}df = B(N_H+1, N_T+1)$

$N=2$ 

i.e $N_H+N_T=2$

$N_T=2-N_H$

$p(D|H_0) = (\frac{1}{2})^2$

$p(D|H_0) = \frac{1}{4}$



$N_H=0$

$p(D|H_1) = \int_0^1 f^{N_H}(1-f)^{N_T}df = B(N_H+1, N_T+1)$

$p(D|H_1) = B(1, 3) = \frac{0!2!}{3!} = \frac{2}{6} = \frac{1}{3}$

$p(H_1|D) = \frac{p(D|H_1)}{p(D|H_1) + p(D|H_0)} = \frac{\frac{1}{3}}{ \frac{1}{4} + \frac{1}{3}}$

$p(H_1|D) = \frac{4}{7}$

$p(H_0|D) = 1 - \frac{4}{7} = \frac{3}{7}$

$H_1$ is the more probable model when $N_H=0$

$N_H=1$

$p(D|H_1) = B(N_H+1, N_T+1) = B(2, 2) = \frac{1!1!}{3!} = \frac{1}{6}$

$p(H_1|D) = \frac{p(D|H_1)}{p(D|H_1) + p(D|H_0)} = \frac{\frac{1}{6}}{ \frac{1}{6} + \frac{1}{4}} = \frac{2}{5}$

$p(H_0|D) = \frac{3}{5}$

$H_0$ is the more probable model when $N_H=1$

$N_H=2$

$p(D|H_1) = B(N_H+1, N_T+1) = B(3, 1) = \frac{2!0!}{3!} = \frac{2}{6} = \frac{1}{3}$

$p(H_1|D) = \frac{p(D|H_1)}{p(D|H_1) + p(D|H_0)} = \frac{\frac{1}{3}}{ \frac{1}{3} + \frac{1}{4}} = \frac{4}{7}$

$p(H_0|D) = \frac{3}{7}$

$H_1$ is the more probable model when $N_H=2$

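Under $H_0$, the fair model, every specific sequence of two tosses has the same probability, $\frac{1}{4}$.

$H_1$ is a flexible model: averaging over the uniform prior on $f \in [0,1]$ gives probability $\frac{1}{3}$ to a sequence with $N_H=0$ or $N_H=2$, but only $\frac{1}{6}$ to a sequence with $N_H=1$.

Outcomes with all heads or all tails are better explained by a biased coin ($f$ near 0 or 1), so $H_1$ assigns them more than $\frac{1}{4}$ and is preferred. A mixed outcome is what a fair coin predicts best; $H_1$ spends part of its probability on biased values of $f$ that make such an outcome unlikely, so it assigns less than $\frac{1}{4}$ and $H_0$ is preferred.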
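The three posteriors computed by hand above can be reproduced with exact fractions. This is a minimal sketch assuming equal priors on $H_0$ and $H_1$ and a uniform prior on $f$, as in the exercise; the function name `posterior_H1` is illustrative.

```python
import math
from fractions import Fraction


def posterior_H1(N_H, N):
    """p(H_1|D) for equal model priors and a uniform prior on f."""
    ev_H1 = Fraction(math.factorial(N_H) * math.factorial(N - N_H),
                     math.factorial(N + 1))  # Beta integral
    ev_H0 = Fraction(1, 2**N)                # fair coin
    return ev_H1 / (ev_H1 + ev_H0)


for N_H in range(3):
    p1 = posterior_H1(N_H, 2)
    print(f"N_H = {N_H}: p(H_1|D) = {p1}, p(H_0|D) = {1 - p1}")
```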