<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Remember-Sampling?" data-toc-modified-id="Remember-Sampling?-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Remember Sampling?</a></span><ul class="toc-item"><li><span><a href="#🧠-Knowledge-Check:-How-could-we-see-the-chances-of-this?" data-toc-modified-id="🧠-Knowledge-Check:-How-could-we-see-the-chances-of-this?-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>🧠 Knowledge Check: How could we see the chances of this?</a></span><ul class="toc-item"><li><span><a href="#And-what-is-the-mean-of-these-sample-means?-Was-it-close-to-the-true-population-mean?" data-toc-modified-id="And-what-is-the-mean-of-these-sample-means?-Was-it-close-to-the-true-population-mean?-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>And what is the mean of these sample means? Was it close to the true population mean?</a></span></li></ul></li></ul></li><li><span><a href="#Central-Limit-Theorem" data-toc-modified-id="Central-Limit-Theorem-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Central Limit Theorem</a></span></li></ul></div>

# Remember Sampling?

Let's go back to our discussion on [sampling](sampling.ipynb).

We looked at the ages of the Titanic passengers and took a sample.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('data/titanic.csv')
all_ages = df['Age'].dropna()
age_mean = all_ages.mean()

print(f'There are {all_ages.size} people, average age is {age_mean :.1f}')

In [None]:
all_ages.hist();

We took a random sample and saw how far that sample mean was from the population mean.

In [None]:
# Take a random sample
number_of_ppl_in_sample = 100
sample = all_ages.sample(n=number_of_ppl_in_sample, random_state=27) #Take a sample of n people
mean_s = sample.mean()

def calc_percent_error(pop_mean, sample_mean):
    # Absolute error expressed as a fraction of the population mean
    return np.abs(sample_mean - pop_mean) / pop_mean

percent_err = calc_percent_error(age_mean, mean_s)

print(f'The sample mean was {mean_s:.1f} with a percent error of {percent_err*100:.2f}%')

But what are the chances we randomly sample the extreme values? That could cause issues...

## 🧠 Knowledge Check: How could we see the chances of this?

One way is to literally repeat the experiment!

Sample a bunch of times from the population and see how many times we get extreme values!

In [None]:
# Repeatedly take samples and plot out sample means
sample_means = []
for i in range(10**3):
    sample = all_ages.sample(n=number_of_ppl_in_sample) 
    sample_means.append(sample.mean()) # Calculate the sample mean


plt.hist(sample_means, bins=250);

Huh, that shape looks familiar...
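The histogram shows the shape, but we can also put a number on "the chances" by counting how often a sample mean lands far from the population mean. The sketch below is self-contained, so it uses a made-up, right-skewed `population` as a stand-in for `all_ages`, and the 5% cutoff for "extreme" is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(27)

# Hypothetical stand-in for `all_ages`: 714 right-skewed "ages" (not the real data)
population = rng.gamma(shape=4.0, scale=7.5, size=714)
pop_mean = population.mean()

n_people = 100     # people per sample, same as above
n_trials = 1000    # how many times we repeat the experiment
threshold = 0.05   # assumption: "extreme" means more than 5% off the population mean

# Draw a fresh sample (without replacement) each trial and record its mean
sample_means = np.array([
    rng.choice(population, size=n_people, replace=False).mean()
    for _ in range(n_trials)
])

# Fraction of trials whose sample mean missed by more than the threshold
percent_errors = np.abs(sample_means - pop_mean) / pop_mean
chance_extreme = np.mean(percent_errors > threshold)

print(f'{chance_extreme:.1%} of samples were more than {threshold:.0%} off')
```

With the real data you could swap `population` for `all_ages.to_numpy()` and keep the rest as is.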

### And what is the mean of these sample means? Was it close to the true population mean?

In [None]:
mean_of_means = np.mean(sample_means)

print(f'The mean of sample means was {mean_of_means:.3f} vs the actual population mean {age_mean:.3f}')
print(f'Percent error: {100*calc_percent_error(age_mean, mean_of_means):.2f}%')

![](images/neat.gif)

So, a pretty good estimate!

# Central Limit Theorem

The Central Limit Theorem (CLT) basically says that the distribution of sample means will approach a normal curve as the sample size grows. But not only that: (nearly) every population distribution will produce this bell curve from the sample means, even if the population itself looks nothing like a bell curve!
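To see this with a population that is clearly not normal, here is a minimal sketch using a made-up exponential population (not the Titanic data). It tracks skewness, which is 0 for a symmetric bell curve. The helper names `skewness` and `sample_means_for` are just illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def skewness(x):
    """Sample skewness: roughly 0 for a symmetric distribution such as the normal."""
    x = np.asarray(x)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

# A decidedly non-normal "population": exponential, heavily right-skewed
population = rng.exponential(scale=10.0, size=100_000)

def sample_means_for(pop, n, n_trials=2000):
    # Each row is one sample of size n (drawn with replacement); average each row
    return rng.choice(pop, size=(n_trials, n)).mean(axis=1)

# Skewness of the sample means for increasingly large samples
skews = {n: skewness(sample_means_for(population, n)) for n in (2, 10, 100)}

print(f'population skew: {skewness(population):.2f}')
for n, skew in skews.items():
    print(f'n={n:>3}: skew of sample means = {skew:.2f}')
```

The skew of the sample means should shrink toward 0 as `n` grows, which is the bell curve emerging from a lopsided population.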

And since we know the math of normal curves pretty dang well, we can use our math skills to learn something about the population from the sample! That means we can estimate the population mean $\mu$ and standard deviation $\sigma$ from a sample. Note that the spread of the sample means, the **standard error**, is actually smaller than the standard deviation of the population: it is $\sigma / \sqrt{n}$ for samples of size $n$. And because every normal curve can be described with just a mean and a standard deviation, we have a pretty good picture of the distribution of sample means! We can actually use this to figure out confidence intervals too!!

> **tl;dr** We like 'em Gaussians! Thanks CLT!!
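As a rough sketch of how the estimates, the standard error, and a confidence interval fit together, the example below again uses a made-up `population` rather than the Titanic ages. It leans on the normal approximation (the familiar $z = 1.96$ for 95%), which is an assumption that works best for reasonably large samples.

```python
import numpy as np

rng = np.random.default_rng(27)

# Hypothetical population standing in for `all_ages` (not the real data)
population = rng.gamma(shape=4.0, scale=7.5, size=100_000)

n = 100
sample = rng.choice(population, size=n, replace=False)

x_bar = sample.mean()        # estimate of the population mean (mu)
s = sample.std(ddof=1)       # estimate of the population standard deviation (sigma)
std_err = s / np.sqrt(n)     # standard error of the mean: sigma / sqrt(n)

# Approximate 95% confidence interval from the normal curve
z = 1.96
ci_low, ci_high = x_bar - z * std_err, x_bar + z * std_err

print(f'sample mean = {x_bar:.2f}, standard error = {std_err:.2f}')
print(f'approx. 95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})')
print(f'true population mean: {population.mean():.2f}')
```

Note the `ddof=1`: NumPy's `std` divides by `n` by default, while pandas' `Series.std` already divides by `n - 1`.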

![](images/central-limit-theorem.png)

Essentially, this tells us that sample means are centered on the true population mean, and that larger samples make those sample means cluster more tightly around it.

We can compare the mean of our one sample to our predicted population mean (usually taken from the distribution of sample means).

If our sample mean is very different, we are either very lucky/unlucky or there is something fundamentally different about our sample!
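One common way to put a number on "very different" is to count how many standard errors the sample mean sits from the population mean (a z-score). The function name and all the numbers below are made up purely for illustration.

```python
import numpy as np

def z_score_of_sample_mean(sample_mean, pop_mean, pop_std, n):
    """How many standard errors the sample mean sits from the population mean."""
    std_err = pop_std / np.sqrt(n)
    return (sample_mean - pop_mean) / std_err

# Made-up numbers: population mean 30, population std 15, samples of 100 people
z_typical = z_score_of_sample_mean(sample_mean=30.5, pop_mean=30.0, pop_std=15.0, n=100)
z_unusual = z_score_of_sample_mean(sample_mean=36.0, pop_mean=30.0, pop_std=15.0, n=100)

print(f'typical sample: z = {z_typical:.2f}')
print(f'unusual sample: z = {z_unusual:.2f}')
```

Under a normal curve, roughly 95% of sample means land within about 2 standard errors of the center, so a sample 4 standard errors out is either a very rare draw or a sign that the sample comes from a different population.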

![](images/something-is-different.png)