In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [3]:
!pip install lets-plot

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting lets-plot
  Downloading lets_plot-2.5.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Collecting pypng
  Downloading pypng-0.20220715.0-py3-none-any.whl (58 kB)
Installing collected packages: pypng, lets-plot
Successfully installed lets-plot-2.5.1 pypng-0.20220715.0


In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import lets_plot as lp
import seaborn as sns

sns.set()
lp.LetsPlot.setup_html()

## Experiment 1 - How to determine P(H)?

One idea is to flip the coin many times and use the fraction of heads as an estimate of P(H):

\begin{equation}
P(H) \approx \frac{\text{num(Heads)}}{\text{num(coin tosses)}}
\end{equation}

Let's try an experiment with this.

#### First, we 'choose' a coin. For this, we pick a random probability value

In [20]:
p_heads = np.random.beta(2, 4)

#### Without knowing what this value is, let's try to guess it with experiments

In [23]:
num_tosses = 100
num_heads = 0

for i in range(num_tosses):
    toss_result = np.random.choice(["H", "T"], p=[p_heads, 1 - p_heads])

    if toss_result == "H":
        num_heads = num_heads + 1

p_guess = num_heads / num_tosses

In [24]:
lp.ggplot(
    data=pd.DataFrame(
        dict(value=[p_heads, p_guess], prob=["True P(H)", "Guessed P(H)"])
    )
) + lp.geom_bar(lp.aes(x="prob", y="value", fill="prob"), stat="identity")

#### So, 100 tosses get us close, but not quite there

#### Will we always guess the same value for P(H)? 

- With the same number of tosses?
- What if we change the number of tosses, to a very large number or to very few?
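The questions above can be explored with a small side simulation. This is only a sketch under stated assumptions: `p_true = 0.3` is a hypothetical coin standing in for `p_heads`, the seed is arbitrary, and `rng.binomial` is used as a shortcut that draws the number of heads per trial directly instead of simulating each toss.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed so the sketch is repeatable
p_true = 0.3                    # hypothetical coin, stands in for p_heads
num_retries = 500

spread = {}
for num_tosses in (10, 100, 10_000):
    # Each draw is the number of heads seen in one trial of num_tosses tosses.
    heads = rng.binomial(num_tosses, p_true, size=num_retries)
    guesses = heads / num_tosses
    # The standard deviation measures how much the guess varies between trials.
    spread[num_tosses] = guesses.std()
    print(f"{num_tosses:>6} tosses: mean guess {guesses.mean():.3f}, "
          f"std of guesses {spread[num_tosses]:.4f}")
```

If the sketch behaves as expected, the guesses differ from trial to trial even with the same number of tosses, and their spread shrinks as the number of tosses grows.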

## Experiment 2: Repeated Guessing

Let's retry the experiment above, 500 times.

In [25]:
num_retries = 500
num_tosses = 100

tosses = pd.DataFrame(
    np.random.choice(
        ["H", "T"], size=(num_retries, num_tosses), p=[p_heads, 1 - p_heads]
    ),
    columns=[f"Toss {i}" for i in range(1, num_tosses + 1)],
    index=[f"Try #{i}" for i in range(1, num_retries + 1)],
)

In [26]:
tosses

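| | Toss 1 | Toss 2 | Toss 3 | Toss 4 | Toss 5 | Toss 6 | Toss 7 | Toss 8 | Toss 9 | Toss 10 | ... | Toss 91 | Toss 92 | Toss 93 | Toss 94 | Toss 95 | Toss 96 | Toss 97 | Toss 98 | Toss 99 | Toss 100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Try #1 | T | T | H | T | T | T | T | T | T | T | ... | T | T | T | T | T | T | T | T | T | T |
| Try #2 | T | T | T | T | T | H | T | T | T | T | ... | T | T | T | T | T | T | T | T | T | T |
| Try #3 | T | T | T | T | T | T | H | T | T | T | ... | T | T | T | T | T | T | T | H | T | T |
| Try #4 | T | T | T | T | T | H | T | T | T | T | ... | T | T | T | T | T | T | T | T | T | H |
| Try #5 | T | T | T | T | T | T | T | T | T | T | ... | T | T | T | T | H | T | H | T | T | T |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Try #496 | T | T | T | T | T | T | T | T | T | T | ... | T | T | T | T | T | T | T | T | T | T |
| Try #497 | H | T | T | T | T | T | T | T | T | T | ... | T | T | T | T | T | T | T | H | T | H |
| Try #498 | T | T | T | T | T | T | T | T | T | H | ... | T | T | T | T | H | T | T | T | T | T |
| Try #499 | H | T | T | H | T | T | T | T | T | H | ... | T | H | T | T | T | T | T | T | T | T |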


In [27]:
guesses = (tosses == "H").sum(axis=1).to_frame("Number of Heads")
guesses

| | Number of Heads |
|---|---|
| Try #1 | 7 |
| Try #2 | 6 |
| Try #3 | 9 |
| Try #4 | 9 |
| Try #5 | 7 |
| ... | ... |
| Try #496 | 11 |
| Try #497 | 10 |
| Try #498 | 9 |
| Try #499 | 14 |


In [28]:
guesses["Guessed P(H)"] = guesses["Number of Heads"] / num_tosses

In [29]:
guesses

| | Number of Heads | Guessed P(H) |
|---|---|---|
| Try #1 | 7 | 0.07 |
| Try #2 | 6 | 0.06 |
| Try #3 | 9 | 0.09 |
| Try #4 | 9 | 0.09 |
| Try #5 | 7 | 0.07 |
| ... | ... | ... |
| Try #496 | 11 | 0.11 |
| Try #497 | 10 | 0.10 |
| Try #498 | 9 | 0.09 |
| Try #499 | 14 | 0.14 |


In [30]:
lp.ggplot(guesses) + lp.geom_bar(lp.aes(x="Guessed P(H)"))

#### What happens if we do a lot more tosses per trial?

Let's do 10,000 tosses per trial!

In [17]:
num_retries = 500
num_tosses = 10000

tosses = pd.DataFrame(
    np.random.choice(
        ["H", "T"], size=(num_retries, num_tosses), p=[p_heads, 1 - p_heads]
    ),
    columns=[f"Toss {i}" for i in range(1, num_tosses + 1)],
    index=[f"Try #{i}" for i in range(1, num_retries + 1)],
)

guesses = (tosses == "H").sum(axis=1).to_frame("Number of Heads")
guesses["Guessed P(H)"] = guesses["Number of Heads"] / num_tosses

lp.ggplot(guesses) + lp.geom_bar(lp.aes(x="Guessed P(H)")) + lp.xlim(0.01, 0.4)

#### What happens if we do many more trials?

Let's keep doing 100 tosses per trial, but now run 50,000 trials!

In [18]:
num_retries = 50000
num_tosses = 100

tosses = pd.DataFrame(
    np.random.choice(
        ["H", "T"], size=(num_retries, num_tosses), p=[p_heads, 1 - p_heads]
    ),
    columns=[f"Toss {i}" for i in range(1, num_tosses + 1)],
    index=[f"Try #{i}" for i in range(1, num_retries + 1)],
)

guesses = (tosses == "H").sum(axis=1).to_frame("Number of Heads")
guesses["Guessed P(H)"] = guesses["Number of Heads"] / num_tosses

lp.ggplot(guesses) + lp.geom_bar(lp.aes(x="Guessed P(H)")) + lp.xlim(0.1, 0.6)

#### We've ended up with not one guess, but many 'possible' guesses

- Some guesses occur more often than others across the repeated trials; those are the guesses we are more likely to end up with.
- The most frequent guesses cluster around the true P(H), so a guess made from a single trial is more likely to land near the truth than far from it.
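To make the idea of a "most frequent guess" concrete, here is a minimal sketch that repeats the experiment many times and picks out the guess that occurs most often. It is not the notebook's own code: `p_true = 0.3` is a hypothetical coin, the seed is arbitrary, and `rng.binomial` draws the head count of each trial directly.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary fixed seed
p_true = 0.3                    # hypothetical coin, stands in for p_heads
num_retries, num_tosses = 50_000, 100

# Number of heads in each trial, converted to a guessed P(H).
heads = rng.binomial(num_tosses, p_true, size=num_retries)
guesses = heads / num_tosses

# Count how often each distinct guess occurred and pick the most frequent one.
values, counts = np.unique(guesses, return_counts=True)
most_frequent_guess = values[counts.argmax()]

print(f"true P(H) = {p_true}, most frequent guess = {most_frequent_guess:.2f}")
```

With this many trials the most frequent guess should sit very close to the hypothetical true value, which is the pattern the bar charts above show visually.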

#### For this coin flipping experiment, n(H) / (n(H) + n(T)) is the 'best' guess - it is the most likely

This way of picking a guess is called *Maximum Likelihood Estimation* (MLE), and its value can be derived mathematically!
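As a sketch of how that derivation might go, assume the tosses are independent and that we saw $h$ heads in $n$ tosses with $0 < h < n$ (the symbols $h$, $n$, and $L$ are introduced here for illustration). The likelihood of one particular sequence of tosses, as a function of a candidate value $p$ for P(H), is

\begin{equation}
L(p) = p^{h} (1 - p)^{n - h}
\end{equation}

Taking the logarithm, which does not move the location of the maximum, and setting the derivative to zero gives

\begin{equation}
\frac{d}{dp} \log L(p) = \frac{h}{p} - \frac{n - h}{1 - p} = 0
\end{equation}

Solving for $p$ yields

\begin{equation}
\hat{p} = \frac{h}{n} = \frac{\text{num(Heads)}}{\text{num(coin tosses)}}
\end{equation}

which is the same ratio used as the guess throughout this notebook.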