# The Central Limit Theorem


**IMPORTANT INSTRUCTIONS:** This activity is designed for you to experiment with Python code about sampling, variance, and mean. Feel free to change any numerical value in the code throughout the activity to see how the outcomes and results change.

## The Central Limit Theorem

The central limit theorem (CLT) is a fundamental theorem of statistics that lays the foundation for understanding the results of *sampling*.


### Formal Definition

The central limit theorem (CLT) states that the distribution of *sample* means approximates a normal distribution as the *sample* size becomes larger, assuming that all *samples* are identical in size and are drawn independently from a population with finite variance, regardless of the shape of the population's actual distribution.

In practice, this also means that the mean of a *sample* tends to be closer to the mean of the overall population as the *sample* size increases, because the spread of the *sample* means shrinks as the *sample* size grows.
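
For readers who want the statement in symbols, one common formulation is the classical Lindeberg-Lévy version, given here as a sketch. The notation $X_i$, $\bar{X}_n$, $\mu$, and $\sigma$ is introduced only for this statement. If $X_1, \dots, X_n$ are independent observations drawn from the same population with mean $\mu$ and finite variance $\sigma^2$, and $\bar{X}_n$ is their sample mean, then as $n \to \infty$:

$$\sqrt{n}\,\left(\bar{X}_n - \mu\right) \xrightarrow{d} \mathcal{N}\left(0, \sigma^2\right)$$

Informally, for large $n$ the sample mean $\bar{X}_n$ is approximately normally distributed with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.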

### Key Terms 

Imagine performing a *trial* and getting an *observation*. Next, imagine repeating the *trial* and getting a new, independent *observation*. Collected together, multiple *observations* form a *sample*.

As you know, a *sample* is a group of *observations* drawn from a broader *population*.

Let's first understand the differences between the words *observation*, *sample*, and *population*:

- *Observation*: the result from one *trial* of an experiment.
- *Sample*: the group of results gathered from separate, independent *trials*.
- *Population*: the space of all possible *observations* that could be seen from a *trial*.
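
The three terms above can be illustrated with a minimal sketch. It assumes a fair six-sided die as the experiment, and the names `population`, `observation`, and `sample` as well as the seed and the sample size of 10 are chosen here purely for illustration. Only the Python standard library is used so that the cell runs without extra packages.

```python
import random

rng = random.Random(7)  # fixed seed (an arbitrary choice) for repeatability

# Population: every outcome that a single trial (one die roll) could produce
population = [1, 2, 3, 4, 5, 6]

# Observation: the result of one trial
observation = rng.choice(population)

# Sample: the results of several independent trials (10 here, an assumption)
sample = [rng.choice(population) for _ in range(10)]

print("Observation:", observation)
print("Sample:", sample)
```

Every value in `sample` is itself an *observation*, and each one comes from the same *population*.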

The mean of a *sample* won't be exactly the same as the mean of the *population* distribution: like any estimate, it will contain some error. However, if you draw multiple independent *samples* of sufficient size and calculate their means, the distribution of those means will be approximately Gaussian.

### Key Takeaways

Without going too deep into the mathematical details of the CLT, here are some fundamental takeaways from it:


- The CLT states that the distribution of *sample* means approximates a normal distribution as the sample size gets larger, regardless of the *population*'s distribution.
- Sample sizes equal to or greater than **30** are often considered sufficient for the CLT to hold.
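- The average of the *sample* means will approximately equal the mean of the *population*, and the standard deviation of the *sample* means will approximately equal the *population* standard deviation divided by the square root of the *sample* size.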
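
The takeaways above can be checked numerically. The following is a rough sketch rather than a definitive experiment: it assumes a fair six-sided die as the population, a sample size of 30, and 5,000 repeated samples, all of which are illustrative choices. It compares the mean and standard deviation of the sample means with the population mean and with $\sigma/\sqrt{n}$.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed (an arbitrary choice) for repeatability

# Assumed population: the faces of a fair six-sided die
faces = [1, 2, 3, 4, 5, 6]
pop_mean = statistics.fmean(faces)   # population mean
pop_sd = statistics.pstdev(faces)    # population standard deviation

n = 30               # sample size (illustrative)
num_samples = 5000   # number of independent samples (illustrative)

# Draw many independent samples and record the mean of each one
sample_means = [
    statistics.fmean(rng.choices(faces, k=n)) for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
standard_error = pop_sd / n ** 0.5   # sigma / sqrt(n)

print("Population mean:     ", pop_mean)
print("Mean of sample means:", round(mean_of_means, 3))
print("SD of sample means:  ", round(sd_of_means, 3))
print("sigma / sqrt(n):     ", round(standard_error, 3))
```

Try changing `n` and observe how the standard deviation of the sample means shrinks as the sample size grows.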


## Uniform Distribution

For this example, assume that you are rolling a fair six-faced die. 

As you already know, the distribution of the numbers that turn up from a die roll is uniform.

Assume that we roll the die 50 times.

Theoretically, the mean of a single die roll (the *population* mean) is given by:

$$\mu_x = \frac{\sum X}{N} =\frac{1+2+3+4+5+6}{6} = 3.5$$
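
The same arithmetic can be reproduced in a few lines of Python. This is a small sketch that uses exact fractions from the standard library; the variable names are chosen for illustration, and the variance is included only as an extra quantity you may want when experimenting.

```python
from fractions import Fraction

faces = range(1, 7)   # the six faces of the die
p = Fraction(1, 6)    # each face is equally likely for a fair die

# Mean: sum of each face value times its probability
mu = sum(x * p for x in faces)

# Variance: expected squared distance from the mean
var = sum((x - mu) ** 2 * p for x in faces)

print("Mean:", mu, "=", float(mu))
print("Variance:", var, "=", float(var))
```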

In the code cell below, we use Python to generate a sample of 50 die rolls and compute the mean of the sample.

In [None]:
from numpy.random import seed
from numpy.random import randint
from numpy import mean

# seed the random number generator for reproducibility
seed(1)

# generate a sample of 50 die rolls (the upper bound, 7, is exclusive)
rolls = randint(1, 7, 50)
print(rolls)
print(mean(rolls))


We can see that our *sample* mean is not exactly the same as the theoretical one.

Can you guess why? What happens to the *sample* mean if you increase the number of trials?

**DOUBLE CLICK HERE TO TYPE YOUR SOLUTION**




Next, let's repeat the process multiple times, say 100. This will give us 100 sample means.

Run the code cell below:

In [None]:
# calculate the mean of 50 die rolls, repeated 100 times
means = [mean(randint(1, 7, 50)) for _ in range(100)]

Finally, we can use Matplotlib to plot the distribution of these sample means.

In [None]:
import matplotlib.pyplot as plt
# plot the distribution of sample means
plt.hist(means)
plt.show()

You may notice that the distribution doesn't really resemble a Gaussian.

How can you improve the result? What happens if you increase the number of *trials*?

**DOUBLE CLICK HERE TO TYPE YOUR SOLUTION**




The image below displays the sample mean distribution when we repeat the experiment 10,000 times.

<img src="clt.png" alt="Drawing" style="width: 400px;"/>

You can see that, as we increase the number of experiments, the distribution of the sample means resembles a Gaussian with mean equal to 3.5, in accordance with the CLT.
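
The 10,000-repetition experiment can also be summarized numerically instead of visually. The sketch below is one possible way to do it, using only the standard library; the seed, the helper name `sample_mean`, and the "share within one standard deviation" check are illustrative choices and not part of the original activity.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed (an arbitrary choice) for repeatability

ROLLS_PER_SAMPLE = 50
NUM_EXPERIMENTS = 10_000


def sample_mean(n_rolls):
    """Roll a fair die n_rolls times and return the mean of the rolls."""
    return statistics.fmean([rng.randint(1, 6) for _ in range(n_rolls)])


means = [sample_mean(ROLLS_PER_SAMPLE) for _ in range(NUM_EXPERIMENTS)]

# Theoretical standard deviation of the sample mean: sigma / sqrt(n),
# where sigma is the standard deviation of a single fair die roll
sigma = statistics.pstdev([1, 2, 3, 4, 5, 6])
expected_sd = sigma / ROLLS_PER_SAMPLE ** 0.5

mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)

# Share of sample means within one expected standard deviation of 3.5;
# for an exactly normal distribution this share is about 68%
within_one_sd = sum(abs(m - 3.5) <= expected_sd for m in means) / NUM_EXPERIMENTS

print("Mean of sample means:", round(mean_of_means, 3))
print("SD of sample means:  ", round(sd_of_means, 3))
print("Expected SD:         ", round(expected_sd, 3))
print("Share within one SD: ", round(within_one_sd, 3))
```

Because the sample mean of 50 die rolls can only take values in steps of 0.02, the share will not match the normal value exactly, but it should be close.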

## Exponential Distribution

As you know, the exponential distribution is a continuous distribution that is often used to model the expected time one needs to wait before the occurrence of an event.

As a reminder, the probability density function of the exponential distribution is defined as:

$$f(x) = \begin{cases} 
      \lambda e^{-\lambda x} \text{  if  } x >0 \\
      0 \text{  otherwise.}
   \end{cases}
$$

Additionally, the mean, $\mu_x$, and the standard deviation, $\sigma_x$, of an exponential distribution are both given by:

$$\mu_x = \sigma_x = \frac{1}{\lambda}.$$
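
As a quick sanity check of this formula, the sketch below draws a large number of exponential values and compares their mean and standard deviation with $1/\lambda$. It assumes $\lambda = 0.25$ and 50,000 draws, both illustrative choices, and uses `random.expovariate` from the standard library, which takes the rate $\lambda$ directly.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed (an arbitrary choice) for repeatability

lam = 0.25          # rate parameter (illustrative)
num_draws = 50_000  # number of simulated observations (illustrative)

# expovariate takes the rate lambda, not the scale 1 / lambda
draws = [rng.expovariate(lam) for _ in range(num_draws)]

empirical_mean = statistics.fmean(draws)
empirical_sd = statistics.stdev(draws)

print("1 / lambda:    ", 1 / lam)
print("Empirical mean:", round(empirical_mean, 3))
print("Empirical SD:  ", round(empirical_sd, 3))
```

Note that NumPy's `np.random.exponential` is parameterized the other way around: it expects the scale $1/\lambda$ as its first argument.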


Suppose we are dealing with a population that is exponentially distributed with parameter $\lambda = 0.25$. The mean and the standard deviation for this case are computed below:

In [None]:
import numpy as np

# rate parameter for the exponentially distributed population
rate = 0.25

# population mean
mu = 1/rate

# population standard deviation
sd = np.sqrt(1/(rate**2))

print('Population mean:', mu)
print('Population standard deviation:', sd)

Now we want to see how the sampling distribution looks for this population. We will consider two cases: a small sample size (n=2) and a large sample size (n=500).

First, we will draw 50 random samples of size 2 each from our population. The Python code to do so is given below:

In [None]:
import pandas as pd
import numpy as np

# drawing 50 random samples of size 2 from the exponentially distributed population
sample_size = 2
# generating an empty dataframe with one row per observation
df2 = pd.DataFrame(index=['x1', 'x2'])

# filling the dataframe with exponentially distributed points
for i in range(1, 51):
    exponential_sample = np.random.exponential((1/rate), sample_size)
    col = f'sample {i}'
    df2[col] = exponential_sample

df2

For each of the 50 *samples*, we can calculate the *sample* mean and plot its distribution using Seaborn as follows:

In [None]:
import seaborn as sns
# calculating sample means and plotting their distribution
df2_sample_means = df2.mean()
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df2_sample_means, kde=True);

You can observe that for a small *sample* size such as two, the distribution of *sample* means is a poor approximation of a normal distribution, typically with a noticeable right skew (a long tail toward larger values). This is because a *sample* size of 2 is not large enough for the CLT to hold.


What happens if you repeat the above process, but with a much larger *sample* size?

In the code cell below, fill in the ellipsis to set the *sample* size equal to 500.

In [None]:
# drawing 50 random samples of size 500
sample_size=...

df500 = pd.DataFrame()

for i in range(1, 51):
    exponential_sample = np.random.exponential((1/rate), sample_size)
    col = f'sample {i}'
    df500[col] = exponential_sample


df500_sample_means = pd.DataFrame(df500.mean(),columns=['Sample means'])
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df500_sample_means['Sample means'], kde=True);

After running the experiment with a *sample* size equal to 500, you should see a distribution similar to the one below:

<img src="exp.png" alt="Drawing" style="width: 400px;"/>


You can see that the *sampling* distribution now looks much more like a normal distribution because we used a much larger *sample* size (n=500). You can also observe that the mean of the *sampling* distribution is close to the theoretical mean that we computed above.
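
One way to put a number on "looks more like a normal distribution" is to compare the skewness of the *sample* means for the two sample sizes, since a normal distribution has a skewness of 0. The sketch below is an illustration under stated assumptions: it uses only the standard library, draws 3,000 samples per case instead of 50 so that the estimate is stable, and defines two hypothetical helpers, `sample_means` and `skewness`. For the mean of $n$ independent exponential draws, the theoretical skewness is $2/\sqrt{n}$.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed (an arbitrary choice) for repeatability

RATE = 0.25          # same rate parameter as in the activity
NUM_SAMPLES = 3000   # more than the 50 used above, to stabilize the estimate


def sample_means(sample_size, num_samples=NUM_SAMPLES):
    """Return the means of num_samples exponential samples of a given size."""
    return [
        statistics.fmean([rng.expovariate(RATE) for _ in range(sample_size)])
        for _ in range(num_samples)
    ]


def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for symmetric data)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


skew_n2 = skewness(sample_means(2))
skew_n500 = skewness(sample_means(500))

print("Skewness of sample means, n=2:  ", round(skew_n2, 3))
print("Skewness of sample means, n=500:", round(skew_n500, 3))
```

A clearly positive skewness for n=2 and a value near 0 for n=500 is consistent with what the two histograms show.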