In [1]:
%autosave 120
import numpy as np
import pandas as pd
%matplotlib notebook
import matplotlib.pyplot as plt
from IPython.display import display, Markdown, clear_output
import ipywidgets as widgets

Autosaving every 120 seconds


# Bayesian reasoning

Through these exercises, we will see how to make probabilistic statements using Bayes' rule. These examples will allow us to work with the concepts of prior, likelihood and posterior probabilities.

## Exercise 1 (Bayesian Data Analysis, Gelman et al, Chapter 1)

Human males have one X-chromosome and one Y-chromosome, whereas females have two X-chromosomes, each chromosome being inherited from one parent. Hemophilia is a disease that exhibits X-chromosome-linked recessive inheritance, meaning that a male who inherits the gene that causes the disease on the X-chromosome is affected, while a female carrying the gene on only one of her two X-chromosomes is not affected.

Let's consider a woman who has an affected brother and an unaffected father. This implies that her mother is a carrier of the hemophilia gene, with one "good" and one "bad" gene. Let's consider the random variable $\theta$ describing the state of the woman: $\theta = 1$ if she is a carrier and $\theta = 0$ otherwise.

**1) Give the prior distribution of $\theta$.**

$$\mathbb{P}(\theta = 1) = 0.5$$


**2)** We are told that the woman has two sons, neither of whom is affected. We consider the random variable $y_i \in \{0, 1\}$, which denotes whether son number $i$ is affected ($y_i = 1$) or not ($y_i = 0$). The outcomes of the two sons are exchangeable and, conditional on the unknown $\theta$, independent. We'll denote the data $(y_1, y_2)$ as $y$.

**Relying on this information, derive the posterior probability of the woman being a carrier.**

\begin{align*}
\mathbb{P}(\theta = 1 | y)
&= \frac{\mathbb{P}(y | \theta = 1) \mathbb{P}(\theta = 1)}{\mathbb{P}(y)} \\
&= \frac{\mathbb{P}(y | \theta = 1) * 0.5}{\sum_{i=0, 1} \mathbb{P}(y | \theta = i) \mathbb{P}(\theta = i)} \\
&= \frac{\mathbb{P}(y_1 | \theta = 1) * \mathbb{P}(y_2 | \theta = 1) * 0.5}{0.25 * 0.5 + 1 * 0.5} \\
&= \frac{0.5 * 0.5 * 0.5}{0.25 * 0.5 + 1 * 0.5} \\
&= \frac{0.125}{0.625} \\
&= 0.2 \\
\end{align*}

**3)** Let's suppose that the woman has a third son who is also unaffected. 

**What is the new posterior probability $\mathbb{P}(\theta = 1 | y_1, y_2, y_3)$? Here $y$ denotes the data $(y_1, y_2, y_3)$.**

\begin{align*}
\mathbb{P}(\theta = 1 | y)
&= \frac{\mathbb{P}(y | \theta = 1) \mathbb{P}(\theta = 1)}{\mathbb{P}(y)} \\
&= \frac{\mathbb{P}(y | \theta = 1) * 0.5}{\sum_{i=0, 1} \mathbb{P}(y | \theta = i) \mathbb{P}(\theta = i)} \\
&= \frac{\mathbb{P}(y_1 | \theta = 1) * \mathbb{P}(y_2 | \theta = 1) * \mathbb{P}(y_3 | \theta = 1) * 0.5}{0.125 * 0.5 + 1 * 0.5} \\
&= \frac{0.5 * 0.5 * 0.5 * 0.5}{0.125 * 0.5 + 1 * 0.5} \\
&= \frac{0.0625}{0.5625} \\
&= 0.111\ldots \\
\end{align*}
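
As a numerical check of the two derivations above, the update can be sketched in Python. The helper name `update_carrier_prob` is hypothetical. It assumes, as in the text, that each son of a carrier is unaffected with probability 0.5 and each son of a non-carrier is unaffected with probability 1. Using the posterior after two sons as the prior for the third son should give the same answer as conditioning on all three sons at once.

```python
def update_carrier_prob(prior, n_unaffected_sons):
    """Posterior P(theta = 1 | n sons, all unaffected) from Bayes' rule."""
    lik_carrier = 0.5 ** n_unaffected_sons   # P(y | theta = 1)
    lik_not_carrier = 1.0                    # P(y | theta = 0)
    evidence = lik_carrier * prior + lik_not_carrier * (1 - prior)
    return lik_carrier * prior / evidence


post_two_sons = update_carrier_prob(0.5, 2)              # two sons at once
post_three_sons = update_carrier_prob(0.5, 3)            # three sons at once
post_sequential = update_carrier_prob(post_two_sons, 1)  # third son added later
print(post_two_sons, post_three_sons, post_sequential)
```

The last two values coincide, which illustrates that yesterday's posterior can serve as today's prior.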

## Exercise 2 (Bayesian Data Analysis, Gelman et al, Chapter 1)

Approximately 1/125 of all births are fraternal twins and 1/300 of all births are identical twins. Elvis Presley had a twin brother (who died at birth). 

**What is the probability that Elvis was an identical twin? We will approximate the probability of a boy or a girl birth as 1/2.**

Let $F$ be the event of having fraternal twins, with $\mathbb{P}(F) = \frac{1}{125}$. \
Let $I$ be the event of having identical twins, with $\mathbb{P}(I) = \frac{1}{300}$. \
Let $T$ be the event of having twin brothers.

\begin{align*}
\mathbb{P}(I | T) 
&= \frac{\mathbb{P}(I, T)}{\mathbb{P}(T)} \\
&= \frac{\frac{1}{2} \frac{1}{300}}{\mathbb{P}(T, F) + \mathbb{P}(T, I)} \\
&= \frac{\frac{1}{600}}{\mathbb{P}(T|F) \mathbb{P}(F) + \mathbb{P}(T|I) \mathbb{P}(I)} \\
&= \frac{\frac{1}{600}}{\frac{1}{2} \frac{1}{2} \frac{1}{125} + \frac{1}{2} \frac{1}{300}} \\
&= \frac{\frac{1}{600}}{\frac{1}{500} + \frac{1}{600}} \\
&= \frac{5}{11} \approx 0.45 \\
\end{align*}
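
The arithmetic above can be checked with exact fractions. This is a minimal sketch; the variable names are illustrative and not part of the original exercise.

```python
from fractions import Fraction

p_fraternal = Fraction(1, 125)   # P(F)
p_identical = Fraction(1, 300)   # P(I)

# P(T | F): fraternal twins are both boys (two independent draws at 1/2 each)
p_boys_given_fraternal = Fraction(1, 2) * Fraction(1, 2)
# P(T | I): identical twins share their sex, so a single draw at 1/2
p_boys_given_identical = Fraction(1, 2)

joint_identical = p_boys_given_identical * p_identical   # P(T, I)
joint_fraternal = p_boys_given_fraternal * p_fraternal   # P(T, F)
p_identical_given_boys = joint_identical / (joint_identical + joint_fraternal)
print(p_identical_given_boys, float(p_identical_given_boys))
```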

## Exercise 3 (Bayesian Data Analysis, Gelman et al, Chapter 2)

An early study on *placenta previa*, a condition of pregnancy, found that in a sample of 980 births, 437 were female. We also know that the proportion of female births in the general population is 0.485. We will denote by $\theta$ the probability of a female birth when the mother is suffering from *placenta previa*. We will assume a prior distribution $p(\theta) = \text{Beta}(\alpha, \beta)$.

**1) Write the data likelihood.**

We have $n = 980$ births, among which $k = 437$ are female. $\theta$ is the probability of a female birth, given placenta previa. \
Let $y$ be the number of female births in the observed population. We are interested in determining the following quantity:
\begin{align*}
\mathbb{P}(y = k | \theta) 
&= {n \choose k} \theta^k (1 - \theta)^{n - k} \\
&= {980 \choose 437} \theta^{437} (1 - \theta)^{980 - 437} \\
&= {980 \choose 437} \theta^{437} (1 - \theta)^{543} \\
\end{align*}

**2) Give the posterior distribution of $\theta$, the probability of a female birth (up to a constant).**

\begin{align*}
\mathbb{P}(\theta | y)
&\propto \mathbb{P}(y | \theta) \mathbb{P}(\theta) \\
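&\propto \theta^k (1 - \theta)^{n - k} \, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \\
&\propto \theta^{k + \alpha - 1} (1 - \theta)^{n + \beta - k - 1}  \\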
\end{align*}

**3) How much evidence does this data provide for the claim that the proportion of female births is below 0.485, the proportion of females in the general population? You will summarize information about the posterior distribution using statistics such as the median or posterior intervals.**

In [2]:
from scipy.stats import beta
from itertools import product


n = 980
k = 437
N = 500

alpha_prior = [1, 5, 10, 50]
beta_prior = [1, 5, 10, 50]
pairs = [*product(alpha_prior, beta_prior)]

results = []

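for a, b in pairs:
    # Conjugate update: Beta(a, b) prior + binomial data -> Beta(a + k, b + n - k)
    alpha_post = a + k
    beta_post = n + b - k
    sample_post = beta.rvs(a=alpha_post, b=beta_post, size=N)

    # 2.5%, 50% (median) and 97.5% quantiles of the posterior sample
    sample_stats = np.quantile(sample_post, [0.025, 0.5, 0.975])
    # "mean" and "var" describe the prior Beta(a, b), not the posterior
    results.append([a, b, beta.mean(a=a, b=b), beta.var(a=a,b=b), *sample_stats])
pd.DataFrame(results, columns=["alpha", "beta", "mean", "var", "q0.025", "q0.5", "q0.975"])

Unnamed: 0,alpha,beta,mean,var,q0.025,q0.5,q0.975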
0,1,1,0.5,0.083333,0.41449,0.445646,0.482586
1,1,5,0.166667,0.019841,0.41428,0.443124,0.476789
2,1,10,0.090909,0.006887,0.408879,0.440626,0.471417
3,1,50,0.019608,0.00037,0.394878,0.424239,0.454075
4,5,1,0.833333,0.019841,0.415073,0.448716,0.477171
5,5,5,0.5,0.022727,0.412144,0.446218,0.479782
6,5,10,0.333333,0.013889,0.413986,0.445285,0.474335
7,5,50,0.090909,0.001476,0.396693,0.426597,0.456019
8,10,1,0.909091,0.006887,0.419837,0.450484,0.480414
9,10,5,0.666667,0.013889,0.416087,0.448398,0.47881
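
One direct way to quantify the evidence asked for in question 3 is the posterior probability $\mathbb{P}(\theta < 0.485 | y)$, read off the Beta posterior CDF. The sketch below is an illustration only: the helper `prob_below` is hypothetical, and the two priors shown are just two of the pairs from the grid above.

```python
from scipy.stats import beta

n, k = 980, 437       # births and female births, as in the text
threshold = 0.485     # proportion of female births in the general population


def prob_below(a, b, threshold=threshold):
    """P(theta < threshold | y) under a Beta(a, b) prior (conjugate update)."""
    return beta.cdf(threshold, a + k, b + n - k)


p_uniform = prob_below(1, 1)        # flat Beta(1, 1) prior
p_informative = prob_below(50, 50)  # prior concentrated around 0.5
print(p_uniform, p_informative)
```

Unlike the sampled quantiles in the table, this uses the exact posterior CDF, so it does not vary from run to run.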


# Probability assignment (Bayesian Data Analysis, Gelman et al, Chapter 1)

This exercise aims at showing how probabilities can be assigned starting from a set of subjective assessments.
We will see how this can be done by first relying only on observed data. Then we will see how we can build a simple parametric model based on this empirical evidence.

In [3]:
football_dataset_path = "../assignments/assignment1/football-dataset.txt"
data = pd.read_csv(football_dataset_path, index_col=False, header=0, sep=",")
data

Unnamed: 0,home,favorite,underdog,spread,favorite.name,underdog.name,week
0,1,21,13,2.0,TB,MIN,1
1,1,27,0,9.5,ATL,NO,1
2,1,31,0,4.0,BUF,NYJ,1
3,1,9,16,4.0,CHI,GB,1
4,1,27,21,4.5,CIN,SEA,1
...,...,...,...,...,...,...,...
2235,1,23,13,6.5,PIT,CLE,17
2236,1,27,7,2.0,MIN,GB,17
2237,0,31,14,9.5,SD,SEA,17
2238,0,3,27,4.0,BUF,HOU,17


Football experts provide a *point spread* for every football game as a measure of the difference in ability between two teams. For instance, team A might be a 4-point favorite to defeat team B. This means that $\Pr(\text{team A wins by more than 4 points}) = \frac{1}{2}$. The football dataset provides the point spread and actual game outcome for professional football games played between 1981 and 1984.

In [4]:
outcome = np.array(data['favorite'] - data['underdog']) 
point_spread = np.array(data['spread'])

plt.figure(figsize=(6,4))
plt.scatter(point_spread +  0.2*np.random.rand((point_spread.shape[0])) - 0.1,
            outcome + 0.4*np.random.rand((outcome.shape[0])) - 0.2, s=3)
plt.xlabel('Point spread')
plt.ylabel('Outcome')
plt.title('Outcome VS Point spread')
plt.show()
print('Number of games in dataset = ' + str(len(outcome)))

[Figure: scatter plot "Outcome VS Point spread" (x-axis: Point spread, y-axis: Outcome)]

Number of games in dataset = 2240


## Assigning probabilities based on observed frequencies

It is of interest to assign probabilities to particular events. A first and natural approach can be to rely on the data that's been gathered to obtain empirical estimates.

**1) Compute:**

- **P1 = Pr(Favorite wins)**
- **P2 = Pr(Favorite wins | point spread = 3.5)**
- **P3 = Pr(Favorite wins by more than the point spread)**
- **P4 = Pr(Favorite wins by more than the point spread | point spread = 3.5)**

We will consider a tied game as one-half win and one-half loss. We will also ignore games without any favorite (point spread = 0).

**2) Compute the following probabilities and comment on the results:**

- **P5 = Pr(Favorite wins | point spread = 8.5)**
- **P6 = Pr(Favorite wins | point spread = 9)**
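
The empirical estimates P1 to P6 can be sketched as simple frequency counts. The helper names `win_fraction` and `empirical_probs` are hypothetical, and the toy arrays are made up for illustration (they are not taken from the football dataset). Counting a game whose margin exactly equals the point spread as one-half is an assumption that extends the tie rule stated above.

```python
import numpy as np


def win_fraction(margin):
    """Average score: 1 if margin > 0, 0.5 if margin == 0, 0 if margin < 0."""
    margin = np.asarray(margin, dtype=float)
    return np.mean(np.where(margin > 0, 1.0, np.where(margin == 0, 0.5, 0.0)))


def empirical_probs(outcome, spread, at_spread=None):
    """Return (Pr(favorite wins), Pr(favorite beats the spread)) as frequencies."""
    outcome = np.asarray(outcome, dtype=float)
    spread = np.asarray(spread, dtype=float)
    keep = spread != 0                   # ignore games without a favorite
    if at_spread is not None:
        keep &= spread == at_spread      # condition on one point spread value
    return win_fraction(outcome[keep]), win_fraction(outcome[keep] - spread[keep])


# Toy data (hypothetical, NOT the football dataset)
toy_outcome = np.array([7, 0, -3, 10, 3])
toy_spread = np.array([3.5, 3.5, 3.5, 9.0, 0.0])
p1, p3 = empirical_probs(toy_outcome, toy_spread)
p2, p4 = empirical_probs(toy_outcome, toy_spread, at_spread=3.5)
print(p1, p2, p3, p4)
```

With the notebook's arrays, a call such as `empirical_probs(outcome, point_spread, at_spread=8.5)` would give the conditional estimates. Conditioning on a single spread value keeps only the games played at that spread, so such estimates rest on far fewer games than P1 and P3.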

## A parametric model for the difference between outcome and point spread

The graph below shows the difference between a game outcome and the point spread, plotted against the point spread.
Let's denote by $y$ the outcome of a game and by $x$ its point spread.

In [5]:
y = np.array(data['favorite'] - data['underdog'])
x = np.array(data['spread'])
z = y - x
plt.figure(figsize=(6,4))
plt.scatter(x +  0.2*np.random.rand((x.shape[0])) - 0.1, 
            z + 0.4*np.random.rand((z.shape[0])) - 0.2, s=3)

plt.xlabel('x')
plt.ylabel('z = y - x')
plt.title('z vs x')
plt.show()

[Figure: scatter plot "z vs x" (x-axis: x, y-axis: z = y - x)]

**3) Plot the histogram of z, and the approximated Gaussian distribution of z|x.**

**4) Making use of the approximated distribution of z|x, compute the following probabilities:**

- **P7 = Pr(Favorite wins | point spread = 3.5)**
- **P8 = Pr(Favorite wins | point spread = 8.5)**
- **P9 = Pr(Favorite wins | point spread = 9)**
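
A minimal sketch of the parametric approach follows. It assumes that $z | x$ is well approximated by a single normal distribution $N(\mu, \sigma^2)$ that does not depend on $x$, so that $\Pr(y > 0 | x) = \Pr(z > -x)$. The helper names are hypothetical, and the synthetic sample (including its scale of 14) is a made-up stand-in for the real `z` array, not a value estimated from the dataset.

```python
import numpy as np
from scipy.stats import norm


def fit_normal(z):
    """Estimate the mean and standard deviation of z = y - x."""
    z = np.asarray(z, dtype=float)
    return z.mean(), z.std(ddof=1)


def prob_favorite_wins(x, mu, sigma):
    """Pr(y > 0 | x) = Pr(z > -x) when z | x ~ N(mu, sigma^2)."""
    return norm.sf(-x, loc=mu, scale=sigma)


# Synthetic stand-in for z (hypothetical values, NOT the football data)
rng = np.random.default_rng(0)
z_synthetic = rng.normal(loc=0.0, scale=14.0, size=2000)
mu_hat, sigma_hat = fit_normal(z_synthetic)

for spread in (3.5, 8.5, 9.0):
    print(spread, prob_favorite_wins(spread, mu_hat, sigma_hat))
```

In the notebook, `fit_normal(z)` would be applied to the `z` array computed above. Because the model pools all games, the resulting probabilities vary smoothly with the point spread, unlike the raw frequencies at individual spread values.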

# Posterior inference (Bayesian Data Analysis, Gelman et al, Chapter 2)

This exercise illustrates how to do posterior inference using standard probability distributions introduced in the class.

Suppose you have a Beta(4,4) prior distribution on the probability $\theta$ that a coin will yield a "head" when spun in a specified manner. The coin is independently spun ten times, and "heads" appear fewer than 3 times. You don't know how many heads were seen, but only that their number is less than 3. We will denote by $y$ the random variable giving the number of heads obtained in the 10 spins.

**1) Write the prior probability distribution of $\theta$ and the conditional y|$\theta$.**

**2) Write the data likelihood.**

**3) Calculate the posterior density of $\theta$ (up to a constant).**

**4) Plot the posterior distribution of $\theta$ (up to a constant).**
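
One possible way to evaluate this posterior numerically is on a grid. The sketch below assumes the likelihood of the observed event is $\mathbb{P}(y < 3 | \theta) = \sum_{j=0}^{2} {10 \choose j} \theta^j (1 - \theta)^{10 - j}$, computed here through the binomial CDF. The grid size and variable names are arbitrary choices.

```python
import numpy as np
from scipy.stats import beta, binom

theta = np.linspace(0.0, 1.0, 1001)
d_theta = theta[1] - theta[0]

prior = beta.pdf(theta, 4, 4)          # Beta(4, 4) prior density
likelihood = binom.cdf(2, 10, theta)   # P(y < 3 | theta) = P(y <= 2 | theta)
unnormalized = prior * likelihood      # posterior up to a constant

# Normalize on the grid (simple Riemann sum) so the curve integrates to 1
posterior = unnormalized / (unnormalized.sum() * d_theta)
posterior_mean = (theta * posterior).sum() * d_theta
print(posterior_mean)
```

Calling `plt.plot(theta, posterior)` would then draw the curve. Since the data favor few heads, the posterior mass moves to the left of the prior mean of 0.5.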

# Predictive prior distribution (Bayesian Data Analysis, Gelman et al, Chapter 2)

In this exercise, we show how to incorporate all the information we have about the parameters of an experiment in order to derive a predictive prior over the results of this experiment.

Let y be the number of 6's in 1000 independent rolls of a particular real die, which may be unfair. Let $\theta$ be the probability that the die lands on 6. We assume the following prior distribution for $\theta$:

$$
\begin{align}
Pr(\theta = \frac{1}{12}) & = 0.25 \\
Pr(\theta = \frac{1}{6}) & = 0.5 \\
Pr(\theta = \frac{1}{4}) & = 0.25.
\end{align}
$$

**1) Using the normal approximation, give the conditional distribution p(y|$\theta$).**

**2) Give an approximate predictive prior distribution p(y) and plot it.**

**3) Give approximate 5%, 25%, 50%, 75%, 95% points for the distribution of y.**
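
A sketch of one way to approach these questions: under the normal approximation, $y | \theta \approx N(1000\theta, 1000\theta(1 - \theta))$, so $p(y)$ is approximately a mixture of three normals weighted by the prior. The quantiles are found by numerically inverting the mixture CDF. The function names are hypothetical, and root finding with `brentq` is only one of several workable choices.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

n_rolls = 1000
thetas = np.array([1 / 12, 1 / 6, 1 / 4])
weights = np.array([0.25, 0.5, 0.25])            # prior probabilities from the text

means = n_rolls * thetas                         # E[y | theta]
sds = np.sqrt(n_rolls * thetas * (1 - thetas))   # sd[y | theta]


def prior_predictive_cdf(y):
    """CDF of the normal mixture approximating p(y)."""
    return float(np.sum(weights * norm.cdf(y, loc=means, scale=sds)))


def prior_predictive_quantile(p):
    """Solve cdf(y) = p numerically on [0, n_rolls]."""
    return brentq(lambda y: prior_predictive_cdf(y) - p, 0.0, n_rolls)


quantiles = {p: prior_predictive_quantile(p) for p in (0.05, 0.25, 0.5, 0.75, 0.95)}
print(quantiles)
```

The three components barely overlap, so the mixture CDF is almost flat near 0.25 and 0.75. The 25% and 75% points are therefore poorly determined and should be read as rough values.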