# Sampling Distributions and Central Limit Theorem #

In this notebook, we will learn about:

- Sampling Distributions for both mean and median
- The Central Limit Theorem

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from datascience import *

import scipy.stats

from hide_code3 import hide_code

hide_code()

## Variation in Samples ##

If you have a large population, it's very likely that every time you draw a sample, it will be a different sample.


Then if you compute a statistic from that sample (mean, median, standard deviation, etc.), it could be slightly different from sample to sample.  

Run and re-run the next few cells to see what I mean.


In [None]:
united = Table.read_table('united.csv')
#united
# mybins = np.arange(-15, 175, 25)
# united.hist("Delay", bins = mybins)
# A cell only displays its last expression, so print both values.
print(np.mean(united.column('Delay')))
print(np.std(united.column("Delay")))

In [None]:
x = united.sample(100).column('Delay')
x

In [None]:
np.mean(x)

In [None]:
np.mean(united.sample(100).column('Delay'))

## Sampling Distribution ##

For a fixed population, a fixed sample size, and a particular statistic (most often the mean), the collection of all possible values of that statistic over all possible samples of that size forms what we call the sampling distribution.
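
For a population small enough, you can actually list every possible sample and see the whole sampling distribution. The sketch below does that for a made-up population of five numbers and samples of size 2. The population values are an assumption chosen for illustration, and the samples are enumerated without replacement (order ignored) just to keep the list short.

```python
from itertools import combinations
import numpy as np

# Hypothetical toy population (made up for illustration).
population = [2, 4, 6, 8, 10]
sample_size = 2

# Every possible sample of size 2, drawn without replacement, order ignored.
all_samples = list(combinations(population, sample_size))

# The mean of each sample; the full collection is the sampling distribution.
all_means = np.array([np.mean(s) for s in all_samples])

print(len(all_samples))                          # how many samples are possible
print(np.sort(all_means))                        # every possible sample mean
print(np.mean(all_means), np.mean(population))   # compare the two averages
```

For a real data set the number of possible samples is far too large to list, which is why the rest of this notebook simulates a partial sampling distribution instead.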

The graphs below show a fixed population and (partial) sampling distributions for various sample sizes. 

### Uniform Population ###

In [None]:
from hidden_demos import sampling_distribution_demo1
sampling_distribution_demo1()



### Symmetric but non-uniform population ###

In [None]:
from hidden_demos import sampling_distribution_demo2

sampling_distribution_demo2()

### Non-symmetric population ###

In [None]:
from hidden_demos import sampling_distribution_demo3

sampling_distribution_demo3()

### Normal population ###

In [None]:
from hidden_demos import sampling_distribution_demo4

sampling_distribution_demo4()

## What were we supposed to observe in these graphs above? ##

1. As the sample size increased, all the sampling distributions began to approach normal distributions.  Did you notice the normal curve superimposed on most graphs?

2. For the example where the population was normal to start with, the sampling distribution *always* looked normal.

3. Did you notice that the scaling on the x-axis kept changing?  Go back and look again.

    As the sample size *increased* the spread along the x-axis *decreased*.  That's not just a coincidence or a trick.  The standard deviations are shrinking as the sample size increases, and there's even a formula that predicts what the sampling distribution's standard deviation should be.  



### Law of Large Numbers ###

If $\overline{X}_n$ is the average of $n$ many $x_i$ all drawn from the same population/distribution with mean $\mu$ then as $n$ increases, $\overline{X}_n$ will approach $\mu$.  

That is, when your sample size is very large, you can have some confidence that the sample average is "pretty close" to the population average.  Later, when we study confidence intervals, we'll be able to quantify what counts as "pretty close".
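
Here is a minimal sketch of the Law of Large Numbers in action. It assumes a made-up skewed population (an exponential distribution with mean 5, not the flight data) and tracks the running average as more and more values are drawn. The seed and the population are choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical skewed population with known mean mu = 5.
mu = 5.0
draws = rng.exponential(scale=mu, size=100_000)

# running_mean[k - 1] is the average of the first k draws.
running_mean = np.cumsum(draws) / np.arange(1, draws.size + 1)

for n in [10, 100, 1_000, 10_000, 100_000]:
    print(n, running_mean[n - 1])
```

The printed averages should wander for small `n` and settle near `mu` as `n` grows.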


### Central Limit Theorem ###

Assume $\overline{X}_n$ is the average of $n$ many $x_i$ all drawn from the same population/distribution with mean $\mu$ and population standard deviation $\sigma$.  Then $\overline{X}_n$ is a member of a sampling distribution.  For large values of $n$, this sampling distribution can be assumed approximately normal.  Specifically, the sampling distribution can be assumed to be $\displaystyle N\left(\mu, \frac{\sigma}{\sqrt{n}} \right)$.  

*Recall N(m, s) refers to a normal distribution with a mean of m and a standard deviation of s.*

What counts as a large sample size?  Generally, above 32 is considered large enough that most sampling distributions are at least approximately normal.  We saw with our Non-Symmetric Original Population example that sometimes it can take much larger samples before the graph appears normal; fortunately, such strongly skewed original populations don't come up often, and when they do, it's easy to detect.  
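
One rough numerical check for "approximately normal" is skewness, which is 0 for any normal distribution. The sketch below assumes a strongly skewed made-up population (exponential with mean 1) and measures the skewness of simulated sample means for several sample sizes. The population, the seed, and the helper name `skewness_of_means` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable
reps = 10_000

def skewness_of_means(n):
    """Skewness of `reps` simulated sample means, each from a sample of size n."""
    # Hypothetical strongly skewed population: exponential with mean 1.
    samples = rng.exponential(scale=1.0, size=(reps, n))
    means = samples.mean(axis=1)
    centered = means - means.mean()
    return np.mean(centered ** 3) / np.std(means) ** 3

for n in [2, 10, 50, 200]:
    print(n, skewness_of_means(n))
```

The skewness should shrink toward 0 as `n` grows, which matches what the graphs for the non-symmetric population showed: the lopsidedness fades, but it can take a fairly large sample.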


### Normal populations don't need CLT ###

If the original population is already normal, the sampling distribution for literally any size will automatically be normal and will follow $\displaystyle N\left(\mu, \frac{\sigma}{\sqrt{n}} \right)$.  




## Large Random Samples ##

In [None]:
#Load the united table again

united = Table.read_table('united.csv')
united

In [None]:
united_bins = np.arange(-20, 201, 5)
united.hist('Delay', bins = united_bins)

#Think about what this code is producing. It is creating a histogram of all the delay times (recall that negative times
#mean they arrived early). Pay attention to the bins.

In [None]:
#This is the actual average of delay times, just like we computed in class.

np.average(united.column('Delay'))

In [None]:
#Now let's find the average of a sample of size 10 of delay times. Run this a few times to see the different
#averages you get. Notice the variation in the results. What's the greatest average you get? The least? How close to the 
#real average of 16.658155515370705 are they?

np.average(united.sample(10).column('Delay'))

In [None]:
#Let's create a function that will allow us to do the above with whatever sample size we choose.

def sample_mean(size):
    return np.average(united.sample(size).column('Delay'))

In [None]:
sample_means = make_array()

for i in np.arange(1000):
    new_mean = sample_mean(10)
    sample_means = np.append(sample_means, new_mean)
    
print(sample_means)

#Decipher this code before running. You are first making an empty array. Then, you are finding the mean of a sample of 10 
#delay times. Notice you are using the function you defined, "sample_mean", to accomplish this. You repeat this 1000 times.
#That is what the 'for loop' accomplishes. Each time, you are entering the result into the array called sample_means. 
#Look at the first number in your array. Python took a random sample of 10 delay times and found the mean. 
#That first number is the mean it found. There are 1000 results in the array!

In [None]:
#Let's make a histogram of these means. Note that the histogram will change each time you run this cell because your array
#will change based on the samples. Run it a few times to see.

sample_means = make_array()

for i in np.arange(1000):
    new_mean = sample_mean(10)
    sample_means = np.append(sample_means, new_mean)

Table().with_column('Sample means', sample_means).hist()


In [None]:
#Let's find the average of all these means. We're going to take the 1000 numbers from the array and calculate their mean.
#How close is the average of the array to the actual average of flight delays? Run it a few times to see the different results.

sample_means = make_array()

for i in np.arange(1000):
    new_mean = sample_mean(10)
    sample_means = np.append(sample_means, new_mean)

np.average(sample_means)


In [None]:
#This code does all of the above in one step.

sample_means = make_array()

for i in np.arange(1000):
    new_mean = sample_mean(10)
    sample_means = np.append(sample_means, new_mean)

print(np.average(sample_means))

Table().with_column('Sample means', sample_means).hist()


In [None]:
#Let's do the same thing, but this time the sample size will be 25. Study the code below and see how it makes sense. 

sam_size = 25

sample_means = make_array()

for i in np.arange(1000):
    new_mean = sample_mean(sam_size)
    sample_means = np.append(sample_means, new_mean)
    
print(np.average(sample_means))
    
Table().with_column('Sample means', sample_means).hist()

Compare the variation, or spread, of the histograms from samples of size 10 and samples of size 25. Which has the larger spread? Is one a better predictor of the actual mean of United flight delays? Recall the actual mean is 16.658155515370705.
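
If you'd like to put numbers on the spread instead of eyeballing the histograms, here is a hedged sketch. It does not read `united.csv`; it assumes a made-up right-skewed array as a stand-in for the delays, and the helper name `sd_of_sample_means` is hypothetical. It compares the standard deviation of the simulated sample means with the value $\sigma / \sqrt{n}$ from the Central Limit Theorem.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed so the run is repeatable

# Hypothetical stand-in for the delay data (NOT the real united.csv values).
population = rng.exponential(scale=40.0, size=10_000) - 10.0
sigma = np.std(population)

def sd_of_sample_means(size, reps=2_000):
    """SD of `reps` sample means, each from a sample drawn with replacement."""
    means = [np.mean(rng.choice(population, size=size)) for _ in range(reps)]
    return np.std(means)

for size in [10, 25]:
    # simulated spread, then the spread predicted by sigma / sqrt(n)
    print(size, sd_of_sample_means(size), sigma / np.sqrt(size))
```

To try this on the real data, you could replace `population` with `united.column('Delay')`.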

In [None]:
#Predict the histogram you will get if you do the same thing with samples of size 50. Predict how close the mean will 
#be to the true average of flight delays and also how spread out the histogram will be. Look at the scale at the bottom.

sam_size = 50

sample_means = make_array()

for i in np.arange(1000):
    new_mean = sample_mean(sam_size)
    sample_means = np.append(sample_means, new_mean)
    
print(np.average(sample_means))
    
Table().with_column('Sample means', sample_means).hist()

In [None]:
#What will a histogram of means from samples of size 4000 look like? Make a prediction before running. It may take a while
#for Python to perform this. Try it at least a couple of times to see how close to the true average it gets. Also notice
#how spread out the data is.

sam_size = 4000

sample_means = make_array()

for i in np.arange(1000):
    new_mean = sample_mean(sam_size)
    sample_means = np.append(sample_means, new_mean)
    
print(np.average(sample_means))
    
Table().with_column('Sample means', sample_means).hist()
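
The formula from the Central Limit Theorem, $\sigma / \sqrt{n}$, lets you predict the spread of each of these histograms before running anything. The sketch below works the arithmetic for the four sample sizes used above. The value `sigma = 40.0` is a placeholder assumption, not the measured standard deviation of the delays.

```python
import numpy as np

# Hypothetical population SD in minutes; replace it with
# np.std(united.column('Delay')) to use the real value.
sigma = 40.0

# Predicted SD of the sampling distribution for each sample size.
predicted_sd = {n: sigma / np.sqrt(n) for n in [10, 25, 50, 4000]}

for n, sd in predicted_sd.items():
    print(n, round(sd, 2))
```

Notice that making the sample 400 times larger (10 to 4000) only shrinks the predicted spread by a factor of 20, because of the square root.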

## Sampling Distributions with Medians ##

A sampling distribution exists (in theory) for any statistic you can compute from a sample.  Another common statistic to work with is the median.  

It's less common to work with the median than with the mean, and it doesn't work out as nicely, because the Central Limit Theorem, which applies to the mean, does not necessarily apply to the median.
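
One way to see that medians behave less smoothly is to count how many different values each statistic takes. The sketch below assumes a made-up population of whole-minute delays (a stand-in, not the real `united.csv` data) and draws 1000 samples of size 10, computing both the mean and the median of each.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed so the run is repeatable

# Hypothetical stand-in for the delay data: whole-minute, right-skewed values.
population = np.round(rng.exponential(scale=40.0, size=10_000)) - 10

reps, size = 1_000, 10
samples = rng.choice(population, size=(reps, size))  # drawn with replacement
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

# How many distinct values does each statistic take across the 1000 samples?
print(len(np.unique(means)), len(np.unique(medians)))
```

With whole-number data, the median of 10 values can only land on a whole or half minute, so its histogram tends to look lumpier than the histogram of means.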


In [None]:
np.median(united.column('Delay'))

In [None]:
np.median(united.sample(10).column('Delay'))

In [None]:
def sample_median(size):
    return np.median(united.sample(size).column('Delay'))

In [None]:
sample_median(10)

In [None]:
sample_medians = make_array()

for i in np.arange(1000):
    new_median = sample_median(10)
    sample_medians = np.append(sample_medians, new_median)

In [None]:
Table().with_column('Sample medians', sample_medians).hist(bins = np.arange(-10,31))

In [None]:
sample_medians = make_array()

for i in np.arange(1000):
    new_median = sample_median(1000)
    sample_medians = np.append(sample_medians, new_median)


Table().with_column('Sample medians', sample_medians).hist(bins = np.arange(-10,31))
