# Coin Flip

![Coin flip](coin_flip.jpg)

This notebook walks through the obligatory coin flip example of Bayesian model setup and inference. It is structured as follows:

1. Description of the problem setting
2. Initial model setup with code
3. Investigation of different prior distributions

In [None]:
# Necessary libraries, in particular the binomial and Beta distributions from Scipy
import numpy as np
from scipy.stats import binom, beta
import ipywidgets as widgets
import matplotlib.pyplot as plt

# Interactive plotting and style
%matplotlib widget
plt.close('all')
plt.style.use('bmh')

# Layout of our slider and text box widgets
slider_layout = {'style': {'description_width': 'initial'},
                 'layout': widgets.Layout(width='35%'),
                 'continuous_update': True} 

textbox_layout = {'style': {'description_width': 'initial'},
                  'layout': widgets.Layout(width='20%'),
                  'continuous_update': False}

## Problem Setting

We encounter a weird-looking coin and wonder whether it is "fair", meaning that heads and tails are equally likely. The law of large numbers tells us that we could determine the coin's *bias-weighting* $\theta$, the probability of heads, by flipping the coin ever more often: the fraction of flips that come up heads converges to $\theta$. Let us see what we can infer in a Bayesian setting from flipping the coin a finite number of times.
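
As a small illustration of this convergence, the following sketch simulates coin flips and tracks the running fraction of heads. The values `theta_true = 0.4`, the number of flips, and the random seed are assumptions chosen purely for illustration.

```python
import numpy as np

# Illustrative values (assumptions, not part of the original experiment)
theta_true = 0.4
num_flips = 100_000

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility
flips = rng.random(num_flips) < theta_true  # True = heads, with probability theta_true

# Running fraction of heads after 1, 2, ..., num_flips flips
running_fraction = np.cumsum(flips) / np.arange(1, num_flips + 1)

for n_shown in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n_shown:>6d}: fraction of heads = {running_fraction[n_shown - 1]:.4f}")
```

For small $n$ the fraction can be far from $\theta$, which is exactly the regime where a Bayesian treatment of the uncertainty is useful.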

More mathematically, this is a **Bernoulli** trial experiment. For $n$ independent flips of a coin with bias-weighting $\theta$, the probability of observing $y$ heads follows a **binomial** distribution,

$$ p(y | n, \theta) = Bin(y | n,\theta) \propto \theta^y (1-\theta)^{n-y}, $$

where the constant of proportionality is the binomial coefficient $\binom{n}{y}$, which depends on $n$ and $y$ but not on $\theta$.
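
The full probability mass function, including the binomial coefficient $\binom{n}{y}$ as normalizing factor, can be checked against SciPy. This is a minimal sketch; the values of `n_flips` and `theta_example` are arbitrary assumptions.

```python
import numpy as np
from scipy.special import comb
from scipy.stats import binom

# Illustrative values (assumptions)
n_flips, theta_example = 10, 0.3
y_values = np.arange(n_flips + 1)

# Closed form: binomial coefficient times theta^y * (1 - theta)^(n - y)
pmf_by_hand = (comb(n_flips, y_values)
               * theta_example**y_values
               * (1 - theta_example)**(n_flips - y_values))

# Reference values from SciPy
pmf_scipy = binom.pmf(y_values, n_flips, theta_example)

print("closed form matches scipy:", np.allclose(pmf_by_hand, pmf_scipy))
print("sum over all y:", pmf_by_hand.sum())
```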

Let us visualize such a distribution:

In [None]:
# Define a function to plot binomial pmf on given axis object
def plot_binomial_pmf(num_samples, success_probability, mpl_axis):
    sample_space = np.arange(0, num_samples+1)
    # Compute the probability mass function
    pmf_values = binom.pmf(sample_space, num_samples, success_probability)

    mpl_axis.clear()
    # Format theta to two decimals to avoid float artifacts such as 0.15000000000000002
    mpl_axis.set_title(rf'Binomial Distribution: $n = {num_samples}$, $\theta = {success_probability:.2f}$')
    mpl_axis.set_xlabel(r'$y$')
    mpl_axis.set_ylabel(r'$P_{bin}(y | n,\theta)$')
    mpl_axis.set_xlim(-0.5, num_samples + 0.5)
    mpl_axis.set_ylim(0, 1.1*np.max(pmf_values))
    mpl_axis.plot(sample_space, pmf_values, marker='o', linestyle='', color='royalblue')

# Construct sliders for interactive visualization
slider_num_samples = widgets.IntSlider(value=10, min=0, max=100, step=1,
                                       description='Number of samples n:',
                                       **slider_layout)
slider_success_probability = widgets.FloatSlider(value=0.5, min=0, max=1, step=0.05,
                                                 description='Success probability \u03B8:',
                                                 **slider_layout)

# Generate interactive plot
_, ax = plt.subplots()
interactive_plot = widgets.interact(plot_binomial_pmf,
                                    num_samples=slider_num_samples,
                                    success_probability=slider_success_probability,
                                    mpl_axis=widgets.fixed(ax))

## Bayesian Model Formulation

Initially, we assume that we do not know anything about the suspicious coin. This means we assign a **uniform prior distribution**,

$$ p(\theta) = Uni(\theta | 0, 1). $$

We have further seen that the **likelihood** of observing $y$ heads (our data) out of $n$ flips follows a binomial distribution. Hence, we may write

$$ p(y | \theta) \propto \theta^y (1-\theta)^{n-y}. $$

Note that we have omitted $n$ in the explicit dependency formulation of the likelihood. This is because we assume $n$ to be a fixed parameter or *covariate* of the Bayesian model.

Invoking Bayes' theorem, we can construct a **posterior** distribution

$$ p(\theta | y) \propto p(y | \theta) p(\theta) \propto \theta^y (1-\theta)^{n-y},\quad \theta\in [0,1], $$

where the constant of proportionality does not depend on $\theta$. Note that, in contrast to the discrete binomial distribution, the posterior is defined in terms of the continuous variable $\theta$. More specifically, it is a **Beta** distribution, whose density generally reads

$$ Beta(\theta | a, b) \propto \theta^{a-1} (1-\theta)^{b-1} .$$
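
The omitted normalizing constant is $1/B(a, b)$, with $B$ the Euler Beta function. The sketch below checks this against SciPy and also confirms that $a = b = 1$ gives the uniform density on $[0, 1]$. The shape parameters and the grid are illustrative assumptions.

```python
import numpy as np
from scipy.special import beta as beta_function    # Euler Beta function B(a, b)
from scipy.stats import beta as beta_distribution  # Beta probability distribution

theta_grid = np.linspace(0.01, 0.99, 99)
a_example, b_example = 3.0, 5.0  # illustrative shape parameters (assumptions)

# Density written out by hand, including the normalizing constant 1 / B(a, b)
pdf_by_hand = (theta_grid**(a_example - 1) * (1 - theta_grid)**(b_example - 1)
               / beta_function(a_example, b_example))
pdf_scipy = beta_distribution.pdf(theta_grid, a_example, b_example)
print("closed form matches scipy:", np.allclose(pdf_by_hand, pdf_scipy))

# Beta(1, 1) coincides with the uniform density on [0, 1]
uniform_check = beta_distribution.pdf(theta_grid, 1, 1)
print("Beta(1, 1) is uniform:", np.allclose(uniform_check, 1.0))
```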

Let us again visualize this distribution:

In [None]:
# Define a function to plot beta pdf on given axis object
def plot_beta_pdf(param_a, param_b, mpl_axis):
    sample_space = np.linspace(0.01, 0.99, num=1001, endpoint=True)
    # Compute probability density function
    pdf_values = beta.pdf(sample_space, param_a, param_b)

    mpl_axis.clear()
    mpl_axis.set_title(rf'Beta Distribution: $a = {param_a:.1f}$, $b = {param_b:.1f}$')
    mpl_axis.set_xlabel(r'$\theta$')
    mpl_axis.set_ylabel(r'$p_{beta}(\theta | a, b)$')
    mpl_axis.set_xlim(0, 1)
    mpl_axis.set_ylim(0, 1.1 * np.max(pdf_values))
    mpl_axis.plot(sample_space, pdf_values, color='royalblue')
    
# Construct sliders for interactive visualization
slider_param_a = widgets.FloatSlider(value=1, min=0.1, max=10, step=0.1,
                                     description="Parameter a:",
                                     **slider_layout)

slider_param_b = widgets.FloatSlider(value=1, min=0.1, max=10, step=0.1,
                                     description="Parameter b:",
                                     **slider_layout)

# Generate interactive plot
_, ax = plt.subplots()
interactive_plot = widgets.interact(plot_beta_pdf,
                                    param_a=slider_param_a,
                                    param_b=slider_param_b,
                                    mpl_axis=widgets.fixed(ax))

Clearly, we can identify the posterior as $Beta(\theta | a=y+1, b=n-y+1)$.
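
This identification can be checked numerically with a simple grid approximation: evaluate $\theta^y (1-\theta)^{n-y}$ on a grid, normalize it by numerical integration, and compare with the Beta density. The data (`y_obs = 7` heads in `n_obs = 20` flips) and the grid resolution are hypothetical values chosen for illustration.

```python
import numpy as np
from scipy.stats import beta as beta_distribution

# Hypothetical data (assumption): 7 heads in 20 flips
n_obs, y_obs = 20, 7

# Uniform grid on [0, 1]
theta_grid = np.linspace(0.0, 1.0, 2001)
d_theta = theta_grid[1] - theta_grid[0]

# Unnormalized posterior: likelihood kernel times the uniform prior (equal to 1 on [0, 1])
unnormalized = theta_grid**y_obs * (1 - theta_grid)**(n_obs - y_obs)

# Normalize with the trapezoidal rule, written out to stay independent of the NumPy version
normalizer = np.sum(0.5 * (unnormalized[1:] + unnormalized[:-1])) * d_theta
posterior_grid = unnormalized / normalizer

# Closed-form posterior Beta(y + 1, n - y + 1)
posterior_exact = beta_distribution.pdf(theta_grid, y_obs + 1, n_obs - y_obs + 1)
max_abs_error = np.max(np.abs(posterior_grid - posterior_exact))
print("maximum absolute deviation:", max_abs_error)
```

The deviation should be negligible compared to the height of the density, since only the numerical integration introduces an error.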
With this result, we can look at what we learn from repeated coin flips. To this end, let us assume we have data for a coin with $\theta=0.4$. We can explore how our knowledge of $\theta$ develops with increasing sample size:

In [None]:
# Generate data representing 1000 coin flips with bias-weighting 0.4
NUM_SAMPLES_MAX = 1000
BIAS_WEIGHTING = 0.4

trial_data_raw = binom.rvs(n=1, p=BIAS_WEIGHTING, size=NUM_SAMPLES_MAX)
trial_data_accumulated = np.cumsum(trial_data_raw)
trial_data_accumulated = np.insert(trial_data_accumulated, 0, 0)

# Plot function for the posterior with a given sample size
def plot_posterior_pdf_uniform_prior(num_samples, trial_data_accumulated, mpl_axis):
    sample_space = np.linspace(0.01, 0.99, num=1000, endpoint=True)
    trial_successes = trial_data_accumulated[num_samples]
    pdf_values = beta.pdf(sample_space, trial_successes+1, num_samples-trial_successes+1)

    mpl_axis.clear()
    mpl_axis.set_title(rf'Posterior Distribution: $n = {num_samples}$')
    mpl_axis.set_xlabel(r'$\theta$')
    mpl_axis.set_ylabel(r'$p_{post}(\theta | y)$')
    mpl_axis.set_xlim(0, 1)
    mpl_axis.set_ylim(0, 1.1 * np.max(pdf_values))
    mpl_axis.plot(sample_space, pdf_values, color='royalblue')
    mpl_axis.axvline(BIAS_WEIGHTING, color='black', linestyle='--', alpha=0.5)
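    # Place the label relative to the curve height so it stays near the top as the y-axis rescales
    mpl_axis.text(BIAS_WEIGHTING+0.01, 1.035 * np.max(pdf_values), 'true value')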

# Create an interactive textbox for sample size adjustment
textbox_num_samples = widgets.BoundedIntText(value=0, min=0, max=NUM_SAMPLES_MAX, step=1,
                                             description='Number of samples n:',
                                             **textbox_layout)

# Generate interactive plot
_, ax = plt.subplots()
interactive_plot = widgets.interact(plot_posterior_pdf_uniform_prior,
                                    num_samples=textbox_num_samples,
                                    trial_data_accumulated=widgets.fixed(trial_data_accumulated),
                                    mpl_axis=widgets.fixed(ax))


Initially, before any data are included, $\theta$ is uniformly distributed, as prescribed by the prior. As more samples are included, the posterior takes on a bell-like shape. The more data we include, the closer (ideally) the center of the distribution gets to the true $\theta$. Moreover, the variance of the posterior shrinks as additional samples are introduced.
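
Both trends can be quantified with the closed-form posterior $Beta(\theta | y+1, n-y+1)$. The sketch below prints the posterior mean, standard deviation, and central 95% credible interval for a few sample sizes. It simulates its own flips with a fixed seed, so the data differ from the interactive example; the seed and the chosen sample sizes are assumptions.

```python
import numpy as np
from scipy.stats import beta as beta_distribution

theta_true = 0.4  # same bias-weighting as in the experiment
rng = np.random.default_rng(seed=1)  # fixed seed (assumption) for reproducibility
flips = rng.random(1000) < theta_true

# heads_so_far[n] = number of heads among the first n flips
heads_so_far = np.concatenate(([0], np.cumsum(flips)))

posterior_means, posterior_stds = [], []
for n in (0, 10, 100, 1000):
    y = heads_so_far[n]
    posterior = beta_distribution(y + 1, n - y + 1)  # frozen Beta(y + 1, n - y + 1)
    lower, upper = posterior.interval(0.95)          # central 95% credible interval
    posterior_means.append(posterior.mean())
    posterior_stds.append(posterior.std())
    print(f"n = {n:4d}: mean = {posterior_means[-1]:.3f}, "
          f"std = {posterior_stds[-1]:.3f}, "
          f"95% interval = [{lower:.3f}, {upper:.3f}]")
```

For $n = 0$ the summary is that of the uniform prior, with mean $0.5$ and standard deviation $\sqrt{1/12} \approx 0.289$.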