In [1]:
# Run this cell to set up your notebook

import numpy as np
from scipy import stats
from datascience import *
from prob140 import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# These lines hide FutureWarning messages
import warnings
warnings.simplefilter('ignore', FutureWarning)

# Week 8 Part 6 #

Throughout, we are going to use the notation $S_n = X_1 + X_2 + \cdots + X_n$ for i.i.d. $X_1, X_2, \ldots, X_n$.

The CLT says that "if $n$ is large" then the distribution of $S_n$ is roughly normal, regardless of the distribution of each $X_i$ (as long as that distribution has a finite variance).

How large is "large"? 

The answer does depend on the distribution of $X_i$. The more skewed or irregular that underlying distribution is, the larger $n$ typically has to be before the CLT kicks in.

Some elementary classes tell students that 30 is some kind of magic sample size beyond which you can use the CLT. I regret to inform you that such is not the case. 

- Start by going back to the total number of spots on [$n$ rolls of a die](http://prob140.org/textbook/Chapter_14/02_PGFs_in_NumPy.html#The-Sum-of-the-Numbers-on-$n$-Rolls-of-a-Die). The distribution is pretty close to normal when $n$ is only 10.
- Now recall what happened when we started with something just [a tad weird](http://prob140.org/textbook/Chapter_14/02_PGFs_in_NumPy.html#Making-Waves). The distribution of $S_{30}$ clearly isn't normal. It's trying to be, but it's got a way to go.
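To put rough numbers on this contrast, here is a minimal sketch that computes the exact distribution of $S_n$ by repeated convolution and measures the largest gap between its CDF and the matching normal CDF. The `lopsided` distribution is a made-up example and is not necessarily the one used in the textbook section linked above; the function names `dist_of_sum` and `max_cdf_gap` are also just illustrative.

```python
import numpy as np
from scipy import stats

def dist_of_sum(probs, n):
    """Exact distribution of S_n, where probs[k] = P(X = k) for k = 0, 1, 2, ..."""
    result = np.array([1.0])            # the empty sum is 0 with probability 1
    for _ in range(n):
        result = np.convolve(result, probs)
    return result

def max_cdf_gap(probs, n):
    """Largest gap between the CDF of S_n and the normal CDF with the same mean and SD."""
    probs = np.asarray(probs, dtype=float)
    values = np.arange(len(probs))
    mu = np.sum(values * probs)
    var = np.sum(values**2 * probs) - mu**2
    s_probs = dist_of_sum(probs, n)
    s_values = np.arange(len(s_probs))
    exact_cdf = np.cumsum(s_probs)
    # Evaluate the normal CDF at k + 0.5 (continuity correction)
    normal_cdf = stats.norm.cdf(s_values + 0.5, n * mu, np.sqrt(n * var))
    return np.max(np.abs(exact_cdf - normal_cdf))

# A fair die: values 1 through 6, equally likely
die = np.array([0, 1, 1, 1, 1, 1, 1]) / 6

# Hypothetical lopsided distribution on the values 1, 2, and 9 (an assumption for illustration)
lopsided = np.array([0, 0.35, 0.5, 0, 0, 0, 0, 0, 0, 0.15])

for n in (10, 30, 100):
    print(n, max_cdf_gap(die, n), max_cdf_gap(lopsided, n))
```

The gap for the die should already be small at $n = 10$, while the lopsided distribution should lag well behind at the same $n$. The maximum CDF gap is only one of several reasonable ways to measure closeness to normal.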

So how do you know whether $n$ is large enough for the normal approximation to the sample sum, especially if you don't know the distribution of $X_i$?

There's no clear answer. But one way to get a sense of the shape is to construct a bootstrap approximation to the distribution of the sample sum. That is, resample from the sample with replacement (keeping the same size $n$), compute the sum of the resample, and repeat many times. If the resulting histogram looks roughly normal, then go ahead and use the CLT.
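A minimal sketch of the bootstrap idea, using plain NumPy. The sample here is simulated from an exponential distribution purely as a stand-in for real data, and the name `bootstrap_sums` and the number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data (an assumption): in practice, use your own observed sample
sample = rng.exponential(scale=2.0, size=50)

def bootstrap_sums(sample, repetitions=5000, rng=rng):
    """Resample with replacement, same size as the sample, and return the sum of each resample."""
    n = len(sample)
    resamples = rng.choice(sample, size=(repetitions, n), replace=True)
    return resamples.sum(axis=1)

sums = bootstrap_sums(sample)

# The bootstrap sums should be centered near the observed sample sum
print("observed sum:          ", sample.sum())
print("mean of bootstrap sums:", sums.mean())
print("SD of bootstrap sums:  ", sums.std())
```

To judge the shape, draw a histogram of `sums`, for example with `plt.hist(sums, bins=50)`, and look for skewness or multiple bumps.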

One way to know that you *shouldn't* use a normal approximation: if your normal curve covers impossible values. For example, if the interval "mean $\pm$ three SDs" goes outside the possible range of your variable, the normal approximation might not be valid.
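This check is easy to automate. The sketch below uses a hypothetical helper, `three_sd_interval_fits`, and tries it on the sum of $n$ i.i.d. exponential (rate 1) variables, which has mean $n$, SD $\sqrt{n}$, and cannot be negative. Passing the check does not guarantee that the normal approximation is good; failing it is a warning sign.

```python
import numpy as np

def three_sd_interval_fits(mean, sd, lowest, highest):
    """True if the interval mean +/- 3 SDs stays within [lowest, highest]."""
    return lowest <= mean - 3 * sd and mean + 3 * sd <= highest

# Sum of n i.i.d. exponential(1) variables: mean n, SD sqrt(n), possible values [0, infinity)
for n in (4, 9, 25):
    fits = three_sd_interval_fits(n, np.sqrt(n), 0, np.inf)
    print(n, n - 3 * np.sqrt(n), fits)
```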

## Reading: Approximations to the Binomial ##

We know two of these:

- The Poisson approximation when $n$ is large and $p$ is small
- The normal approximation when $n$ is large

That seems contradictory, as the normal and Poisson distributions can look quite different. But there's actually no problem:

The CLT says that whatever the value of $p$ (could be small; it just has to be fixed), eventually $n$ is going to be large enough that the binomial $(n, p)$ distribution will look normal.

If $n$ is "large" in some absolute sense but the binomial $(n, p)$ histogram has a Poisson shape, that just means that $n$ isn't yet large enough for the normal approximation. Keep increasing $n$ (while keeping $p$ fixed), and eventually the shape will become normal.
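One way to watch this happen numerically is to fix $p$ and measure how far the binomial $(n, p)$ distribution is from each approximation as $n$ grows. The sketch below uses total variation distance restricted to the values $0, 1, \ldots, n$, with the normal curve discretized over the intervals $(k - 0.5, k + 0.5)$. The choice $p = 0.01$, the distance measure, and the name `tv_distances` are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def tv_distances(n, p):
    """Distances from binomial (n, p) to its Poisson and normal approximations, over 0..n."""
    k = np.arange(n + 1)
    binom = stats.binom.pmf(k, n, p)
    poisson = stats.poisson.pmf(k, n * p)
    mean, sd = n * p, np.sqrt(n * p * (1 - p))
    # Normal probability assigned to the interval (k - 0.5, k + 0.5)
    normal = stats.norm.cdf(k + 0.5, mean, sd) - stats.norm.cdf(k - 0.5, mean, sd)
    tv_poisson = 0.5 * np.sum(np.abs(binom - poisson))
    tv_normal = 0.5 * np.sum(np.abs(binom - normal))
    return tv_poisson, tv_normal

p = 0.01   # small but fixed
for n in (100, 1000, 10000):
    tv_poisson, tv_normal = tv_distances(n, p)
    print(n, tv_poisson, tv_normal)
```

With $p$ held fixed, the distance to the normal should shrink as $n$ increases, while for the smaller values of $n$ the Poisson is the closer of the two.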

Now work through [an example](http://prob140.org/textbook/Chapter_14/03_Central_Limit_Theorem.html#Approximating-the-Binomial-$(n,-p)$-Distribution) and related discussion. 

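The issue is whether the binomial histogram is all squished near 0 (or $n$), or whether it has some room on either side of the mean. That is, the issue is the size of the SD $\sqrt{npq}$, where $q = 1 - p$, relative to the distance from the mean $np$ to the endpoints 0 and $n$.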
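As a rough illustration, apply the "mean $\pm$ three SDs" check to the binomial, writing $q = 1 - p$. The interval stays inside the possible range $[0, n]$ when both of the following hold:

$$np - 3\sqrt{npq} \ge 0 \iff n^2p^2 \ge 9npq \iff n \ge \frac{9q}{p}$$

$$np + 3\sqrt{npq} \le n \iff n^2q^2 \ge 9npq \iff n \ge \frac{9p}{q}$$

So the check passes when $n \ge 9\max\left(\frac{q}{p}, \frac{p}{q}\right)$. This is only a heuristic threshold based on the three-SD check, not a guarantee: it says the histogram has room on both sides of the mean, not that the normal approximation is accurate to any particular degree.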

## Vitamins ##

**1.** Is the binomial $(100, 0.1)$ distribution approximately normal?

**2.** Is the binomial $(10000, 0.1)$ distribution approximately normal?

## Break. Coming up: the sample mean. ##