In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from numpy import mean, sqrt
from scipy.stats import norm as ndist

# stats60 specific
from code.probability import BoxModel, Binomial, RandomVariable, SumIntegerRV
from code import roulette
from code.week1 import standardize_right, standardize_left, normal_curve
from code.utils import probability_histogram
figsize = (8,8)

# Probability histogram & normal approximation

- This chapter considers a histogram
approximation to what we have been calling the `mass_function`
of the `sum of draws`.

- It turns out that, with enough draws, the probability histogram begins to follow
the normal curve.

## Probability histogram for tossing a fair coin

- When tossing a fair coin, there is 1/2 probability of getting 1 head, 1/2 of getting 0 heads.
- We can make a histogram with a rectangle of width 1 and area 1/2 around 0, and an identical rectangle around 1.

In [None]:
coin_trial = BoxModel(['H','T'])
coin_trial.mass_function

### Probability histogram of successes

In [None]:
%%capture
one_toss = plt.figure(figsize=figsize)
one_toss_ax = probability_histogram(Binomial(1, coin_trial, ['H']),
                                    draw_bins=np.arange(2)-0.5,
                                    alpha=0.5, facecolor='gray',
                                    xlabel='Number of heads',
                                    ylabel='% per head')[0]
one_toss_ax.set_xlim([-0.6,1.6])

In [None]:
one_toss

## Difference between probability histogram and mass function

- The `mass_function` tells us the exact chances of either 0 or 1 success
in our trial.

- The probability histogram is based on breaking the numbers into bins (like
we did with sample data earlier). 

- It then finds all the `mass` in those bins.

- In our example, we have two bins: [-0.5,0.5) and [0.5,1.5).

- There is a 1/2 chance that the number of successes falls in the first bin, and
a 1/2 chance that it falls in the second.

- Just as in a histogram for data, the areas of the bars represent percentages (chances).
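
The binning step described above can be sketched in plain Python. This is only an illustration, not the course's `probability_histogram`; the helper `bin_mass` and the variable names are hypothetical.

```python
from fractions import Fraction

def bin_mass(mass_function, edges):
    """Add up the probability mass falling in each bin [edges[i], edges[i+1])."""
    totals = [Fraction(0)] * (len(edges) - 1)
    for value, prob in mass_function.items():
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                totals[i] += prob
                break
    return totals

# Mass function for the number of heads in one toss of a fair coin.
one_toss_mass = {0: Fraction(1, 2), 1: Fraction(1, 2)}
edges = [-0.5, 0.5, 1.5]

areas = bin_mass(one_toss_mass, edges)
widths = [edges[i + 1] - edges[i] for i in range(len(edges) - 1)]
# Bar height is area divided by width, so the area of each bar is the chance.
heights = [float(area) / width for area, width in zip(areas, widths)]

print(areas)
print(heights)
```

Because both bins have width 1, the height of each bar equals its area here; with wider bins the two would differ.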


## Tossing a fair coin twice

* When tossing a fair coin twice, there is
    * 1/4 probability of getting 2 heads
    * 1/2 probability of getting 1 head
    * 1/4 probability of getting 0 heads
* We can similarly make a histogram for this experiment.
* This histogram is called a *probability histogram*.
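
These chances can be checked by listing the four equally likely outcomes of two tosses. The sketch below uses only the standard library; `two_toss_mass` is a name introduced here, not part of the course code.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# The four equally likely outcomes: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# Count how many outcomes give 0, 1 or 2 heads.
counts = Counter(outcome.count("H") for outcome in outcomes)

# Chance of each number of heads = (number of outcomes) / 4.
two_toss_mass = {heads: Fraction(n, len(outcomes)) for heads, n in counts.items()}

for heads in sorted(two_toss_mass):
    print(heads, two_toss_mass[heads])
```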


### Probability histogram of successes

In [None]:
%%capture
two_toss = plt.figure(figsize=figsize)
two_toss_ax = probability_histogram(Binomial(2, coin_trial, ['H']),
                                    draw_bins=np.arange(3)-0.5,
                                    alpha=0.5, facecolor='gray',
                                    xlabel='Number of heads',
                                    ylabel='% per head')[0]
two_toss_ax.set_xlim([-0.6,2.6])
two_toss_ax.legend()


In [None]:
two_toss

Let's compare this with
the observed number of heads when we repeat the two-toss experiment many times.

In [None]:
two_draws = Binomial(2, BoxModel(['H','T']), ['H'])
two_draws.mass_function

In [None]:
two_draws.sample(5)

In [None]:
%%capture
two_toss = plt.figure(figsize=figsize)
two_toss_ax = probability_histogram(Binomial(2, coin_trial, ['H']),
                                    bins=np.arange(4)-0.5,
                                    alpha=0.5, facecolor='gray',
                                    xlabel='Number of heads',
                                    ylabel='% per head',
                                    ndraws=500)[0]
two_toss_ax.set_xlim([-0.6,2.6])
two_toss_ax.legend()


In [None]:
two_toss

- The sample histogram looks close to the probability histogram.

- The larger the sample, the closer it gets.

## Probability histogram: law of large numbers

* Choose an experiment (e.g. tossing a fair coin twice and counting the number of heads, $H$).
* Repeat the experiment 500 times creating a list $[H_1, H_2, \dots, H_{500}].$
* The frequentist view of probability tells us that the histogram of the list $[H_1, H_2, \dots, H_{500}]$ should look like the *probability histogram*.
* Or, the empirical histogram *converges* to the probability histogram.
* We call this the *Law of Large Numbers*.
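
The recipe above can be imitated with the standard `random` module. This is a rough sketch rather than the course's `sample` method; the helper names and the fixed seed are choices made here for illustration.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so that the run is repeatable

def count_heads(ntoss=2):
    """One experiment: toss a fair coin ntoss times and count the heads."""
    return sum(random.random() < 0.5 for _ in range(ntoss))

def empirical_frequencies(nrepeat):
    """Repeat the experiment nrepeat times; return the fraction of runs with 0, 1, 2 heads."""
    results = Counter(count_heads() for _ in range(nrepeat))
    return {h: results[h] / float(nrepeat) for h in range(3)}

exact = {0: 0.25, 1: 0.5, 2: 0.25}
for nrepeat in (50, 500, 50000):
    freq = empirical_frequencies(nrepeat)
    worst = max(abs(freq[h] - exact[h]) for h in exact)
    print(nrepeat, round(worst, 4))
```

The printed number is the largest difference between an empirical bar and the matching bar of the probability histogram. It typically shrinks as the number of repetitions grows, although any single run can buck the trend.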
  

In [None]:
%%capture
tosses = {}
ntoss = 5
tosses[ntoss] = plt.figure(figsize=figsize)
probability_histogram(Binomial(ntoss, coin_trial, ['H']),
                                    bins=np.arange(7)-0.5,
                                    alpha=0.5, facecolor='gray',
                                    xlabel='Number of heads',
                                    ylabel='% per head',
                                    ndraws=500)
tosses[ntoss].gca().set_xlim([-0.6,5.6])
tosses[ntoss].gca().legend()


In [None]:
tosses[5]

### Probability histogram of successes

In [None]:
%%capture
ntoss = 30
tosses[ntoss] = plt.figure(figsize=figsize)
probability_histogram(Binomial(ntoss, coin_trial, ['H']),
                                    bins=np.arange(32)-0.5,
                                    alpha=0.5, facecolor='gray',
                                    xlabel='Number of heads',
                                    ylabel='% per head',
                                    ndraws=500)
tosses[ntoss].gca().set_xlim([5.6,25.6])
tosses[ntoss].gca().legend()


In [None]:
tosses[30]

The probability histogram looks a lot like a normal curve!

**This is not an accident!**

## Roulette

Let's look at another of our usual games: roulette.

The numbers [2,24,29] are my lucky numbers. I think I will try betting on them!

In [None]:
roulette.examples['lucky numbers']

### Probability histogram of successes

In [None]:
%%capture
lucky_numbers = {}
nbet = 10
lucky_trial = roulette.examples['lucky numbers'].model
lucky_numbers[nbet] = plt.figure(figsize=figsize)
probability_histogram(Binomial(nbet, lucky_trial, None),
                                    bins=np.arange(12)-0.5,
                                    alpha=0.5, facecolor='gray',
                                    xlabel='Number of successes',
                                    ylabel='% per success')
lucky_numbers[nbet].gca().set_xlim([-0.6,5.6])
lucky_numbers[nbet].gca().set_title('After %d bets' % nbet, fontsize=15)
lucky_numbers[nbet].gca().legend()


In [None]:
lucky_numbers[10]


In [None]:
%%capture
nbet = 100
lucky_numbers[nbet] = plt.figure(figsize=figsize)
probability_histogram(Binomial(nbet, lucky_trial, None),
                                    bins=np.arange(102)-0.5,  # one bin per possible count, 0..100
                                    alpha=0.5, facecolor='gray',
                                    xlabel='Number of successes',
                                    ylabel='% per success')
lucky_numbers[nbet].gca().set_xlim([-0.6,20.6])
lucky_numbers[nbet].gca().set_title('After %d bets' % nbet, fontsize=15)
lucky_numbers[nbet].gca().legend()



In [None]:
lucky_numbers[100]

### Probability histogram of successes

In [None]:
%%capture
nbet = 1000
lucky_numbers[nbet] = plt.figure(figsize=figsize)
probability_histogram(Binomial(nbet, lucky_trial, None),
                                    bins=np.arange(1002)-0.5,  # one bin per possible count, 0..1000
                                    alpha=0.5, facecolor='gray',
                                    xlabel='Number of successes',
                                    ylabel='% per success')
lucky_numbers[nbet].gca().set_xlim([50.6,110.6])
lucky_numbers[nbet].gca().set_title('After %d bets' % nbet, fontsize=15)
lucky_numbers[nbet].gca().legend()




In [None]:
lucky_numbers[1000]

- After 10 bets, the histogram doesn't look like the normal curve.

- Eventually, it looks very close to the normal curve.
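
One way to put a number on "looks close to the normal curve" is to compare areas: for every bar edge, take the area of the probability histogram to its left and the area under the matching normal curve to its left, and record the largest difference. The sketch below does this with the standard library only. It assumes an American wheel with 38 slots, so a bet on three numbers wins with chance 3/38; the helper names are hypothetical and are not part of the course code.

```python
from math import erf, exp, lgamma, log, sqrt

def binomial_pmf(n, p, k):
    """Chance of exactly k successes in n independent trials."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def largest_gap(n, p):
    """Largest difference between histogram area and normal area left of a bar edge."""
    mean, sd = n * p, sqrt(n * p * (1 - p))
    running, gap = 0.0, 0.0
    for k in range(n + 1):
        running += binomial_pmf(n, p, k)          # histogram area left of k + 0.5
        normal_area = normal_cdf((k + 0.5 - mean) / sd)
        gap = max(gap, abs(running - normal_area))
    return gap

p = 3 / 38.0
gaps = {n: largest_gap(n, p) for n in (10, 100, 1000)}
for n in sorted(gaps):
    print(n, round(gaps[n], 4))
```

The gap is largest for 10 bets and smallest for 1000 bets, in line with the pictures.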

# Betting money in roulette

- Suppose we start with 100\$ and repeatedly place a 10\$ bet on my lucky numbers.

- In this case, the outcome is now measured in dollars (and we'll eventually lose to the casino...)
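
The step from number of wins to dollars is a simple linear rule. The sketch below restates it as a plain function; the 110\$ gain per winning bet and the 10\$ loss per losing bet are taken from the `lambda` used in the cells of this section, and `total_after_bets` is a name introduced here.

```python
def total_after_bets(wins, nbet, start=100, stake=10, payout=110):
    """Money held after nbet bets, of which `wins` were winners."""
    return start + payout * wins - stake * (nbet - wins)

nbet = 10
totals = [total_after_bets(wins, nbet) for wins in range(nbet + 1)]

# Difference between the totals for consecutive numbers of wins.
gaps = set(b - a for a, b in zip(totals, totals[1:]))

print(totals[:3])
print(gaps)
```

Each extra win moves the total by 120\$ (110\$ gained plus 10\$ not lost), which presumably explains the `width=120` used for the bars.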

In [None]:
%%capture
winnings = {}
nbet = 10
winnings[nbet] = plt.figure(figsize=figsize)

successes = Binomial(nbet, roulette.examples['lucky numbers'])
total = RandomVariable(successes, lambda wins : 100 + 110 * wins - 10 * (nbet - wins) )
probability_histogram(total, width=120, facecolor='gray',
                      xlabel='Total (\$)', ylabel='% per \$')
winnings[nbet].gca().set_xlim([-60,610])
winnings[nbet].gca().set_title('After %d bets of 10\$' % nbet, fontsize=15)


In [None]:
winnings[10]

In [None]:
%%capture
winnings = {}
nbet = 100
winnings[nbet] = plt.figure(figsize=figsize)

successes = Binomial(nbet, roulette.examples['lucky numbers'])
total = RandomVariable(successes, lambda wins : 100 + 110 * wins - 10 * (nbet - wins) )
probability_histogram(total, width=120, facecolor='gray',
                      xlabel='Total (\$)', ylabel='% per \$')

winnings[nbet].gca().set_title('After %d bets of 10\$' % nbet, fontsize=15)


In [None]:
winnings[100]


In [None]:
%%capture
winnings = {}
nbet = 1000
winnings[nbet] = plt.figure(figsize=figsize)

successes = Binomial(nbet, roulette.examples['lucky numbers'])
total = RandomVariable(successes, lambda wins : 100 + 110 * wins - 10 * (nbet - wins) )
probability_histogram(total, width=120, facecolor='gray',
                      xlabel='Total (\$)', ylabel='% per \$')

winnings[nbet].gca().set_title('After %d bets of 10\$' % nbet, fontsize=15)


In [None]:
winnings[1000]

## Normal approximation

### Central limit theorem

* When making many independent draws from a box, the central limit theorem says that we can use the normal curve to approximate probabilities for the **sum of draws**.
* Specifically, the normal curve applies to 
$$\frac{\text{ sum of draws} - \text{expected( sum of draws)}}{\text{SE( sum of draws)}}$$

### Example

In roulette, suppose we bet on 5 a total of 100 times, at 10\$ per bet, starting with 100\$.

What are the chances we will finish with more than 200\$?

Here is the box:

In [None]:
places = {}
for i in list(range(1,37)) + ['0','00']:
    if i in [5]:
        places[i] = roulette.roulette_position(350,
                                               facecolor='green',
                                               bg_alpha=None,
                                               fontsize=90)
    else:
        places[i] = roulette.roulette_position(-10,
                                               facecolor='red',
                                               bg_alpha=None,
                                               fontsize=90)
winnings = roulette.roulette_table(places)
from IPython.core.display import HTML

In [None]:
HTML(winnings)

We can now compute:
- $\text{expected( sum of 100 draws)} = 100 \times (-0.52)\$ = -52\$ $
- $$\text{SE( sum of  100 draws)} = \sqrt{100} \times 360 \times \sqrt{\frac{1}{38} \times \frac{37}{38}} \approx 576\$ $$
- Finishing with more than 200\$ means the **sum of draws** was greater than 100\$.
- In standardized units, this is $$\frac{100-(-52)}{576} \approx 0.26$$
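
The arithmetic in this list can be reproduced from the box itself. The sketch below is self-contained and uses `math.erf` for the normal curve instead of `scipy`; the variable names are chosen here for illustration. Because it does not round the intermediate values, its results differ slightly from the hand calculation.

```python
from math import erf, sqrt

# The box: one ticket worth +350 (the ball lands on 5) and 37 tickets worth -10.
box = [350] + [-10] * 37
ndraws = 100

box_avg = sum(box) / float(len(box))
box_sd = sqrt(sum((x - box_avg) ** 2 for x in box) / len(box))

expected_sum = ndraws * box_avg
se_sum = sqrt(ndraws) * box_sd

# Finishing above 200 dollars means the sum of draws exceeds 100 dollars.
z = (100 - expected_sum) / se_sum

# Area under the standard normal curve to the right of z.
chance = 0.5 * (1 - erf(z / sqrt(2)))

print(round(expected_sum, 1), round(se_sum, 1), round(z, 2), round(chance, 3))
```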

In [None]:
%%capture
with plt.xkcd():
    winnings_stand = plt.figure(figsize=(10,5))
    standardize_right(100, -52, 576, units="Total amount", standardized=True,
                      data=False)

In [None]:
winnings_stand

In [None]:
%%capture
normal_fig = plt.figure(figsize=figsize)
ax = normal_curve()
interval = np.linspace(0.26, 4, 101)
ax.fill_between(interval, 0*interval, ndist.pdf(interval),
                hatch='+', color='green', alpha=0.5)
ax.set_title('The green area is %0.1f%%' % (100 * ndist.sf(0.26)), fontsize=20, color='green')


In [None]:
normal_fig

### Probability histogram of successes

In [None]:
%%capture
winnings = {}
nbet = 10
winnings[nbet] = plt.figure(figsize=figsize)

successes = Binomial(nbet, roulette.examples['lucky numbers'])
total = RandomVariable(successes, lambda wins : 100 + 110 * wins - 10 * (nbet - wins) )
ax, avg, sd = probability_histogram(total, width=120, facecolor='gray',
                      xlabel='Total (\$)', ylabel='% per \$')

winnings[nbet].gca().set_title('After %d bets of 10\$' % nbet, fontsize=15)
normal_curve(mean=avg, SD=sd, ax=ax, alpha=0.3, facecolor='green', color='green',
             xlabel=None, ylabel=None)


In [None]:
winnings[10]

In [None]:
%%capture
winnings = {}
nbet = 100
winnings[nbet] = plt.figure(figsize=figsize)

successes = Binomial(nbet, roulette.examples['lucky numbers'])
total = RandomVariable(successes, lambda wins : 100 + 110 * wins - 10 * (nbet - wins) )
ax, avg, sd = probability_histogram(total, width=120, facecolor='gray',
                      xlabel='Total (\$)', ylabel='% per \$')

winnings[nbet].gca().set_title('After %d bets of 10\$' % nbet, fontsize=15)
normal_curve(mean=avg, SD=sd, ax=ax, alpha=0.3, facecolor='green', color='green',
             xlabel=None, ylabel=None)


In [None]:
winnings[100]

In [None]:
%%capture
winnings = {}
nbet = 1000
winnings[nbet] = plt.figure(figsize=figsize)

successes = Binomial(nbet, roulette.examples['lucky numbers'])
total = RandomVariable(successes, lambda wins : 100 + 110 * wins - 10 * (nbet - wins) )
ax, avg, sd = probability_histogram(total, width=120, facecolor='gray',
                      xlabel='Total (\$)', ylabel='% per \$')

winnings[nbet].gca().set_title('After %d bets of 10\$' % nbet, fontsize=15)
normal_curve(mean=avg, SD=sd, ax=ax, alpha=0.3, facecolor='green', color='green',
             xlabel=None, ylabel=None)


In [None]:
winnings[1000]

In [None]:
%%capture
tosses = {}
ntoss = 100
tosses[ntoss] = plt.figure(figsize=figsize)
ax, avg, sd = probability_histogram(Binomial(ntoss, coin_trial, ['H']),
                                    bins=np.arange(102)-0.5,  # one bin per possible count, 0..100
                                    alpha=0.5, facecolor='gray',
                                    xlabel='Number of heads',
                                    ylabel='% per head',
                                    ndraws=500)
normal_curve(mean=avg, SD=sd, ax=ax, alpha=0.3, facecolor='green', color='green',
             xlabel=None, ylabel=None)


In [None]:
tosses[100]

## Continuity correction

- When using the normal approximation to the sum of draws, we sometimes use the *continuity correction*.
- This means we might add or subtract 1/2 at the endpoints.

- For example 
   - {observing more than 40 heads in 100 flips} = {observing more than 40.5 heads in 100 flips}
   - {observing less than 45 heads in 100 flips} = {observing less than 44.5 heads in 100 flips}
   - {observing exactly 40 heads in 100 flips} = {observing between 39.5 and 40.5 heads in 100 flips}
   - {observing greater than or equal to 41 heads but less than 52 heads in 100 flips } = {observing between 40.5 and 51.5 heads in 100 flips}
   
   - {observing greater than 41 heads but less than 52 heads in 100 flips } = {observing between 41.5 and 51.5 heads in 100 flips}
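
The examples above all follow one rule: move each endpoint by 1/2 so that the interval covers exactly the bars of the counts we want. A small sketch of that rule follows; the function `corrected_interval` and its arguments are hypothetical and are not part of the course code.

```python
def corrected_interval(low=None, high=None, low_inclusive=True, high_inclusive=True):
    """Turn an event about whole-number counts into an interval for the normal curve.

    `None` means there is no bound on that side.
    """
    lo = None if low is None else (low - 0.5 if low_inclusive else low + 0.5)
    hi = None if high is None else (high + 0.5 if high_inclusive else high - 0.5)
    return lo, hi

# More than 40 heads.
print(corrected_interval(low=40, low_inclusive=False))
# Less than 45 heads.
print(corrected_interval(high=45, high_inclusive=False))
# Exactly 40 heads.
print(corrected_interval(low=40, high=40))
# At least 41 but less than 52 heads.
print(corrected_interval(low=41, high=52, high_inclusive=False))
# More than 41 but less than 52 heads.
print(corrected_interval(low=41, high=52, low_inclusive=False, high_inclusive=False))
```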

### Normal approximation

In [None]:
interval = np.linspace(0, 44.5, 101)
ax.fill_between(interval, 0*interval, ndist.pdf((interval - avg) / sd) / sd,
                hatch='+', color='red', alpha=0.5)
ax.set_title('Using continuity correction', fontsize=20, color='red')
ax.set_xlim([ax.get_xlim()[0],50])

## Less than 45 heads with continuity correction

In [None]:
tosses[100]

In [None]:
%%capture
with plt.xkcd():
    heads_stand = plt.figure(figsize=figsize)
    standardize_left(44.5, avg, sd, units="Heads", standardized=True,
                     data=False)

In [None]:
from numpy import sqrt
# SE of the number of heads: sqrt(number of draws) x (1 - 0) x sqrt(p x (1 - p))
SE = sqrt(100) * (1-0) * sqrt(1/2. * 1/2.)
print(SE)
heads_stand

In [None]:
%%capture
normal_fig = plt.figure(figsize=figsize)
ax = normal_curve()
interval = np.linspace(-4,-1.10, 101)
ax.fill_between(interval, 0*interval, ndist.pdf(interval),
                hatch='+', color='green', alpha=0.5)
ax.set_title('The green area is %0.1f%%' % (100 * ndist.cdf(-1.10)), fontsize=20, color='green')


In [None]:
normal_fig

## Observing exactly 40 heads using continuity correction

The standardized units are

- (39.5 - 50) / 5 = -2.1
- (40.5 - 50) / 5 = -1.9

In [None]:
%%capture
normal_fig = plt.figure(figsize=figsize)
ax = normal_curve()
interval = np.linspace(-2.1,-1.9, 101)
ax.fill_between(interval, 0*interval, ndist.pdf(interval),
                hatch='+', color='green', alpha=0.5)
ax.set_title('The green area is %0.1f%%' % (100 * (ndist.cdf(-1.9) - ndist.cdf(-2.10))), fontsize=20, color='green')



In [None]:
normal_fig

Compare this to the true value:

$$
\binom{100}{40} \left(\frac{1}{2} \right)^{40} \left(\frac{1}{2}\right)^{60}
$$

In [None]:
Binomial(100, coin_trial, ['H']).mass_function[40]
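
The same comparison can be made without the course code. The sketch below computes the normal approximation with continuity correction and the exact binomial chance using only the standard library; the variable names are chosen here for illustration.

```python
from math import erf, factorial, sqrt

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, k = 100, 40
expected = n * 0.5
se = sqrt(n) * (1 - 0) * sqrt(0.5 * 0.5)

# Normal approximation: area between 39.5 and 40.5 heads, in standardized units.
approx = normal_cdf((k + 0.5 - expected) / se) - normal_cdf((k - 0.5 - expected) / se)

# Exact chance: (100 choose 40) / 2^100.
ways = factorial(n) // (factorial(k) * factorial(n - k))
exact = ways / 2.0 ** n

print(round(approx, 4), round(exact, 4))
```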

## Example

Use the normal approximation to estimate the probability of observing greater than or equal to 45 heads in 80 flips of a fair coin.

We know

- $\text{expected(sum of draws)} = 80 \times 0.5 = 40 $
- $\text{SE( sum of draws)} = \sqrt{80} \times (1 - 0) \times \sqrt{\frac{1}{2}  \times \frac{1}{2}} \approx 4.5 $
- Observing 45 or more heads is the same as observing more than 44.5 heads
- In standardized units, this is $\frac{44.5-40}{4.5} \approx 1$
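
As a check on these steps, the sketch below computes the approximation and also adds up the exact binomial chances for 45, 46, ..., 80 heads. It uses only the standard library; `choose` is a helper defined here, and the values are not rounded along the way.

```python
from math import erf, factorial, sqrt

def choose(n, k):
    """Number of ways to pick k of n tosses to be heads."""
    return factorial(n) // (factorial(k) * factorial(n - k))

n, p, k = 80, 0.5, 45
expected = n * p
se = sqrt(n) * (1 - 0) * sqrt(p * (1 - p))

# Continuity correction: 45 or more heads becomes more than 44.5 heads.
z = (k - 0.5 - expected) / se
approx = 0.5 * (1 - erf(z / sqrt(2)))   # area to the right of z

# Exact chance for a fair coin: add up the chances of 45, 46, ..., 80 heads.
exact = sum(choose(n, j) for j in range(k, n + 1)) / 2.0 ** n

print(round(z, 2), round(approx, 3), round(exact, 3))
```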

In [None]:
%%capture
normal_fig = plt.figure(figsize=figsize)
ax = normal_curve()
interval = np.linspace(4.5 / np.sqrt(20), 4, 101)
ax.fill_between(interval, 0*interval, ndist.pdf(interval),
                hatch='+', color='green', alpha=0.5)
ax.set_title('The green area is %0.1f%%' % (100 * ndist.sf(4.5 / np.sqrt(20))), fontsize=20, color='green')


In [None]:
normal_fig

## Example (continued)


Use the normal approximation to estimate the probability of observing exactly 45 heads in 80 flips of a fair coin.

- Observing 45 heads is the same as observing between 44.5 and 45.5 heads.
- In standardized units, the endpoints are $$\begin{aligned}
               \text{lower}&= \frac{44.5-40}{4.5}    \approx 1 \\
               \text{upper}&= \frac{45.5-40}{4.5}    \approx 1.2 \\
               \end{aligned}$$

In [None]:
%%capture
normal_fig = plt.figure(figsize=figsize)
ax = normal_curve()
interval = np.linspace(4.5 / np.sqrt(20), 5.5 / np.sqrt(20), 101)
ax.fill_between(interval, 0*interval, ndist.pdf(interval),
                hatch='+', color='green', alpha=0.5)
ax.set_title('The green area is %0.1f%%' % (100 * (ndist.sf(4.5 / np.sqrt(20)) - ndist.sf(5.5 / np.sqrt(20)))), fontsize=20, color='green')




In [None]:
normal_fig

Compare this to the true value:

$$
\binom{80}{45} \left(\frac{1}{2} \right)^{45}  \left(\frac{1}{2} \right)^{35}
$$

In [None]:
Binomial(80, coin_trial, ['H']).mass_function[45]

## Central limit theorem

* The central limit theorem applies to **sum of draws**.
* The number of draws should be reasonably large.
* The more lopsided the values are, the more draws are needed for a reasonable approximation (compare the approximations for betting on 5 in roulette to those for flipping a fair coin).
* It is another type of *convergence*: as the number of draws grows, the normal approximation gets better.
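
For a lopsided box such as [1, 2, 7], the chances for the sum of draws can be built up one draw at a time. The sketch below shows one way to do this. It is not the course's `SumIntegerRV`, whose internals are not shown here, and the helper names are hypothetical.

```python
from math import sqrt

def convolve(a, b):
    """Chances for the sum of two independent draws with chances a and b."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def sum_of_draws_mass(mass, ndraw):
    """mass[v] is the chance that one draw equals v; return the same list for the sum."""
    total = [1.0]  # the sum of zero draws is 0 with certainty
    for _ in range(ndraw):
        total = convolve(total, mass)
    return total

# One draw from the box [1, 2, 7]: each value has chance 1/3.
mass = [0, 1 / 3.0, 1 / 3.0, 0, 0, 0, 0, 1 / 3.0]

dist = sum_of_draws_mass(mass, 3)
avg = sum(v * pr for v, pr in enumerate(dist))
sd = sqrt(sum((v - avg) ** 2 * pr for v, pr in enumerate(dist)))

print(round(avg, 3), round(sd, 3))
```

The expected value and SE computed this way agree with the usual shortcuts: 3 times the box average, and the square root of 3 times the box SD.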

In [None]:
%%capture
lopsided = {}
ndraw = 3 
lopsided[ndraw] = plt.figure(figsize=figsize)
mass = np.array([0,1,1,0,0,0,0,1])/3.
rv = SumIntegerRV(mass, ndraw)
ax, avg, sd = probability_histogram(rv,
                                    facecolor='gray')
normal_curve(mean=avg, SD=sd, ax=ax, alpha=0.3, facecolor='green', color='green',
             xlabel=None, ylabel=None)
ax.set_ylim([0,1.1*max(rv.mass_function.values())])
ax.set_title('%d draws from [1,2,7]' % ndraw, fontsize=17)

In [None]:
lopsided[3]

In [None]:
%%capture
ndraw = 10

lopsided[ndraw] = plt.figure(figsize=figsize)
mass = np.array([0,1,1,0,0,0,0,1])/3.
rv = SumIntegerRV(mass, ndraw)
ax, avg, sd = probability_histogram(rv,
                                    facecolor='gray')
normal_curve(mean=avg, SD=sd, ax=ax, alpha=0.3, facecolor='green', color='green',
             xlabel=None, ylabel=None)
ax.set_ylim([0,1.1*max(rv.mass_function.values())])
ax.set_title('%d draws from [1,2,7]' % ndraw, fontsize=17)

In [None]:
lopsided[10]

In [None]:
%%capture
ndraw = 30

lopsided[ndraw] = plt.figure(figsize=figsize)
mass = np.array([0,1,1,0,0,0,0,1])/3.
rv = SumIntegerRV(mass, ndraw)
ax, avg, sd = probability_histogram(rv,
                                    facecolor='gray')
normal_curve(mean=avg, SD=sd, ax=ax, alpha=0.3, facecolor='green', color='green',
             xlabel=None, ylabel=None)
ax.set_ylim([0,1.1*max(rv.mass_function.values())])
ax.set_title('%d draws from [1,2,7]' % ndraw, fontsize=17)

In [None]:
lopsided[30]

In [None]:
%%capture
ndraw = 50

lopsided[ndraw] = plt.figure(figsize=figsize)
mass = np.array([0,1,1,0,0,0,0,1])/3.
rv = SumIntegerRV(mass, ndraw)
ax, avg, sd = probability_histogram(rv,
                                    facecolor='gray')
normal_curve(mean=avg, SD=sd, ax=ax, alpha=0.3, facecolor='green', color='green',
             xlabel=None, ylabel=None)
ax.set_ylim([0,1.1*max(rv.mass_function.values())])
ax.set_title('%d draws from [1,2,7]' % ndraw, fontsize=17)

In [None]:
lopsided[50]


## Take away 

- If the box is lopsided, convergence to the normal curve may be slower.

- But it still happens (and can be used)!

## How many samples should we take?

- The normal approximation works when we take enough
samples.
- But how many should we take?
- There have been various rules proposed...
- For counts, a [rule of thumb](http://en.wikipedia.org/wiki/Binomial_distribution#Normal_approximation) says the Normal approximation to the Binomial is OK when $np \geq k$ and $n(1-p) \geq k$
where $k$ is of the order of 5 or 10.
- **For concreteness, we take $k=10$.**
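
The rule of thumb is easy to state as a function. The sketch below applies it to the examples from this chapter; `normal_approx_ok` is a name introduced here, and the roulette chances assume a wheel with 38 slots.

```python
def normal_approx_ok(n, p, k=10):
    """Rule of thumb: both n*p and n*(1-p) should be at least k."""
    return n * p >= k and n * (1 - p) >= k

# 100 tosses of a fair coin.
print(normal_approx_ok(100, 0.5))
# 100 bets on a single number in roulette.
print(normal_approx_ok(100, 1 / 38.0))
# 10 and 1000 bets on three lucky numbers.
print(normal_approx_ok(10, 3 / 38.0))
print(normal_approx_ok(1000, 3 / 38.0))
```

Under this rule, a bet on a single number needs at least 380 bets before $np \geq 10$.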
