# The Central Limit Theorem

The central limit theorem (CLT) is one of the most useful results in probability theory. The idea is that if you take a reasonably large random sample from almost **any** distribution (strictly, any distribution with a finite variance), the mean of that sample is approximately normally distributed. Let me clarify.

Suppose you:

-	Take a random sample of at least 30 elements from a population of **any** distribution.

-	Then find the mean of that sample.

If you drew thousands of such samples, you would find that the distribution of all of those averages is approximately normal (bell-shaped).
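The steps above can be sketched with nothing but the Python standard library. This is a minimal illustration, not the simulation used later in this notebook: the exponential population, the seed, and the helper name `sample_mean` are all assumptions made for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible (an assumption)

# Hypothetical population: 10,000 draws from a skewed (exponential) distribution
population = [random.expovariate(1.0) for _ in range(10_000)]

def sample_mean(pop, size=30):
    """Draw `size` elements at random (with replacement) and return their mean."""
    return statistics.mean(random.choices(pop, k=size))

# Repeat the two steps 2,000 times and collect the averages
means = [sample_mean(population) for _ in range(2_000)]

print(f"population mean:      {statistics.mean(population):.3f}")
print(f"mean of sample means: {statistics.mean(means):.3f}")
```

The two printed values should be close to each other, and a histogram of `means` should look roughly bell-shaped even though the population itself is strongly skewed.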

Cool! So what's the CLT good for?

Well, because of the CLT, we know that sample means will be approximately normally distributed regardless of the shape of the underlying distribution. We can use this information to create confidence intervals (i.e., construct a range of values where the population’s mean is likely to lie). We can also do something called z-tests, where we estimate whether a given value is likely to have come from the population, based on what we know about the sample.
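To make the z-test idea a little more concrete, here is a hedged sketch of a two-sided one-sample z-test. It assumes the population standard deviation is known, and the function name `z_test` and all the numbers are made up for illustration.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, pop_mean, pop_sd, n):
    """Two-sided one-sample z-test; returns (z, p_value)."""
    standard_error = pop_sd / math.sqrt(n)          # spread of the sample mean
    z = (sample_mean - pop_mean) / standard_error   # distance in standard errors
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # probability of a value this extreme
    return z, p_value

# Made-up numbers for illustration only
z, p = z_test(sample_mean=52.0, pop_mean=50.0, pop_sd=6.0, n=36)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Here the sample mean sits two standard errors above the hypothesized population mean, so the p-value comes out a little under 0.05.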

If you'd like to know more about sampling and the CLT, here are two useful videos to watch:

https://www.youtube.com/watch?v=YAlJCEDH2uY

https://www.youtube.com/watch?v=XLCWeSVzHUU


In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo( "XLCWeSVzHUU", width=600, height=400) 

In [None]:
YouTubeVideo( "YAlJCEDH2uY", width=600, height=400)

# Let's see how the CLT works!

First, let's load some Python libraries we need.

In [None]:
# Load libraries and utility functions
import random
import statistics
import matplotlib.pyplot as plt  # plotting library
import seaborn as sns # KDE viz library

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning) # suppress warnings
from IPython.display import display, Markdown # do pretty printing

# Print out text and add markdown formatting marks to be interpreted
def pprint(txt, marks=''):
  display(Markdown( f"{marks}{txt}{marks}") )


In [None]:

# Simulate data drawn from different distributions  
from scipy.stats import uniform
# random numbers from uniform distribution
n = 10000
start = 10
width = 20
data_uniform = uniform.rvs(size=n, loc = start, scale=width)

from scipy.stats import norm
data_normal = norm.rvs(size=10000,loc=0,scale=1)

from scipy.stats import gamma
data_gamma = gamma.rvs(a=5, size=10000)

from scipy.stats import expon
data_expon = expon.rvs(scale=1,loc=0,size=10000)

from scipy.stats import poisson
data_poisson = poisson.rvs(mu=3, size=10000)

from scipy.stats import binom
data_binom = binom.rvs(n=10,p=0.8,size=10000)

from scipy.stats import bernoulli
data_bern = bernoulli.rvs(size=10000,p=0.6)

# Central Limit Theorem (CLT)

When we take a sample from a population, we can calculate a sample mean. The basic idea of the CLT is that when we take many samples, the distribution of their means will be approximately **normally distributed**.

# What is the normal distribution?

The normal distribution is a probability distribution that looks like a bell. It is symmetric around the mean, and data closer to the mean occur more frequently than data far from the mean.
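The shape described above can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module (Python 3.8+); the choice of a standard normal (mean 0, standard deviation 1) is just an assumption for the example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Symmetry: the density is the same at equal distances on either side of the mean
print(std_normal.pdf(-1.5), std_normal.pdf(1.5))

# The density falls as we move away from the mean
print([round(std_normal.pdf(x), 4) for x in (0, 1, 2, 3)])

# Share of values within 1, 2 and 3 standard deviations of the mean
for k in (1, 2, 3):
    share = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"within {k} sd: {share:.4f}")
```

Roughly 68% of values fall within one standard deviation of the mean, about 95% within two, and nearly all within three.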

<img src="https://busan302.mycourses.work/images/NormalDist.png" width=400px>

# Why is the CLT a big deal?

In practice, we are usually constrained to taking a single sample from a population, not thousands of samples. The CLT tells us that, for a sample of roughly 30 or more elements, we can treat the sample mean as if it were drawn from an approximately normal distribution.
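The normal distribution that the sample mean follows is also narrower than the population: its standard deviation (often called the standard error) is the population standard deviation divided by the square root of the sample size. The sketch below checks this for a uniform population on the range 10 to 30. The range, seed, and variable names are assumptions chosen for the example.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed, chosen arbitrarily for reproducibility

low, high = 10, 30      # assumed range of a uniform population
sample_size = 30
num_samples = 5_000

# Standard deviation of a uniform distribution is (high - low) / sqrt(12)
pop_sd = (high - low) / math.sqrt(12)
predicted = pop_sd / math.sqrt(sample_size)   # standard error predicted by the CLT

# Draw many samples and record the mean of each one
means = [
    statistics.fmean([random.uniform(low, high) for _ in range(sample_size)])
    for _ in range(num_samples)
]
observed = statistics.stdev(means)

print(f"predicted spread of the sample mean: {predicted:.3f}")
print(f"observed spread of the sample means: {observed:.3f}")
```

The two numbers should agree closely, which is what lets us reason about a single sample mean without actually drawing thousands of samples.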

Next, we define a helper function that draws repeated samples from the data distributions we created earlier. The sample size (SAMPLE_SIZE) and the number of samples (NUM_SAMPLES) are set in the simulation section that follows.

In [None]:
# Draw repeated random samples (with replacement) from a given distribution
# and return the mean of each sample.
# The caller passes in num_samples and sample_size
# (the simulation below passes NUM_SAMPLES and SAMPLE_SIZE).
def get_samples( some_dist, num_samples=0, sample_size=0 ):
  lst = []
  for i in range(num_samples):
    spl = [] # create a list to hold elements from our sample
    for j in range(sample_size):
      idx = random.randint( 0, len(some_dist)-1 ) # randomly draw an observation from the data distribution
      spl.append( float(some_dist[idx]) ) # add item to our sample
    lst.append( statistics.mean(spl) ) # take the mean of the sample and save that 
  return lst # return the list of sample means

# Simulation
This next section runs a program to simulate how the CLT works!

The list **example_distributions** contains different population distributions. The simulation will draw a series of random samples from each population. The distribution of these sample means appears in red on the right side. Overlaid on top is a smoothed line called a **kernel density estimation (KDE).** 

Look at the population distribution on the left side. Compare the distribution of the sample means to the population.

Is the resulting KDE bell-shaped like a normal distribution or some other shape? What happens if you change the number of samples (NUM_SAMPLES) drawn from the population?


In [None]:
example_distributions = [ ("Uniform",data_uniform), ("Normal",data_normal), ("Gamma",data_gamma), ("Exponential",data_expon), ("Poisson",data_poisson), ("Binomial",data_binom) , ("Bernoulli",data_bern) ]
# Global variables
NUM_SAMPLES = 50  # The number of samples to take from the population
SAMPLE_SIZE = 30  # The size of each sample to take (i.e., how many elements to draw from the population.)


# Loop over distributions and find the results
for the_name, this_dist in example_distributions:
  
  samples = get_samples( this_dist, NUM_SAMPLES, SAMPLE_SIZE )
  plt.subplot(1, 2, 1) # row 1, col 2 index 1
  plt.title(f"Population:\n{the_name} distribution")
  
  pprint(f"Source population distribution: {the_name}",'##')
  pprint(f"Take repeated samples from {the_name} shaped population.")
  plt.hist( this_dist, bins=100) 
  
  #pprint( f"Distribution of the means from repeated samples taken from {the_name} distribution." )
  plt.subplot(1, 2, 2) # index 2
  sns.histplot( samples, kde=True, stat='density', bins=100, color='red', edgecolor='white' ) # histogram of the sample means with a KDE overlay

  plt.title(f"Distribution of means\nfrom repeated samples\nof {the_name} distribution") 
  plt.show()

# Questions - Central Limit Theorem

- Look at the global variable NUM_SAMPLES. See what happens to the distribution of the sample means when you change this variable and run the code blocks again.

- What happens to the **distribution of the means** if you set NUM_SAMPLES equal to **50**?

- What happens to the **distribution of the means** if you set NUM_SAMPLES equal to **5000**?

- What happens to the **distribution of the means** if you set NUM_SAMPLES equal to **50000**?



# Some notes on the importance of the CLT

These simulations show that the mean of a sample randomly drawn from a population is approximately normally distributed (i.e., it behaves as if drawn from a bell-shaped distribution). The caveat is that the sample size must be sufficiently large for the normal approximation to be accurate. How large is large enough? A rule of thumb is that a sample must have at least 30 elements before we can assume normality, although strongly skewed populations may need more.
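One rough way to explore the "how large is large enough" question is to measure how lopsided the distribution of sample means still is at different sample sizes. The sketch below does this for an exponential population using skewness, which is 0 for a perfectly symmetric distribution. The helper names, the seed, and the sample sizes tried are assumptions for illustration.

```python
import random
import statistics

random.seed(2)  # fixed seed, chosen arbitrarily for reproducibility

def skewness(values):
    """Skewness of a list of numbers: 0 for a perfectly symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def means_from_exponential(sample_size, num_samples=5_000):
    """Means of `num_samples` samples drawn from an exponential population."""
    return [
        statistics.fmean([random.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(num_samples)
    ]

skew_by_size = {size: skewness(means_from_exponential(size)) for size in (2, 30, 300)}
for size, skew in skew_by_size.items():
    print(f"sample size {size:>3}: skewness of the sample means = {skew:.2f}")
```

The skewness shrinks toward 0 as the sample size grows. For a population this skewed, some asymmetry typically remains at 30 elements, so the rule of thumb is a starting point rather than a guarantee.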

It is often helpful to know that a value has been drawn from a normal distribution because many statistical tests assume exactly that.

When we take a random sample, we usually want to estimate the population's mean. If we know that the sample mean is drawn from an (approximately) normal distribution, we can infer the likely value of the population mean. If our sample is large enough (n ≥ 30), we can take the sample mean and standard deviation and calculate confidence intervals around our estimates.
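As a minimal sketch of that last step, the code below builds a 95% confidence interval for the population mean from a single sample. The sample is generated artificially here (40 made-up values), and using the normal critical value of about 1.96 is an assumption that leans on the CLT; it is not the only way to build such an interval.

```python
import math
import random
import statistics
from statistics import NormalDist

random.seed(3)  # fixed seed, chosen arbitrarily for reproducibility

# Hypothetical sample of 40 observations
sample = [random.gauss(100, 15) for _ in range(40)]

n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)            # sample standard deviation

z_crit = NormalDist().inv_cdf(0.975)     # about 1.96 for a 95% interval
margin = z_crit * sd / math.sqrt(n)      # critical value times the standard error
ci_low, ci_high = mean - margin, mean + margin

print(f"sample mean: {mean:.2f}")
print(f"95% confidence interval: ({ci_low:.2f}, {ci_high:.2f})")
```

If we repeated this procedure on many independent samples, roughly 95% of the intervals built this way would be expected to contain the true population mean.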

There’s a lot to unpack in the previous sentences, but suffice it to say that the CLT is a big deal, making much of statistics possible!
