# Central Limit Theorem

Let's apply the [Central Limit Theorem](https://en.wikipedia.org/wiki/Central_limit_theorem) to a dataset.

Take a population and measure a value (height, weight, etc.) for each individual.

The important thing to know is that **whatever** the shape of the distribution over the population (as long as its variance is finite), the **sampling** distribution of the mean tends to a Gaussian as the sample size grows, and its dispersion is given by the Central Limit Theorem.

Let's verify this experimentally.
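Stated a little more formally, as a sketch under the usual assumptions (independent draws from a population with mean $\mu$ and finite standard deviation $\sigma$; the symbols $\bar{X}_n$ and $\mathcal{N}$ are notation introduced here, not taken from the notebook), the mean of a sample of size $n$ is approximately distributed as:

$$
\bar{X}_n \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)
$$

In other words, the standard deviation of the sample means is about $\sigma / \sqrt{n}$, which is the $\sqrt n$ adjustment used later on.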

---

## Let's start

Run the following cell to import the modules used in this livecode.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt

❓ Load the `"tips"` dataset from seaborn into a `df` variable and display its head.

<details>
    <summary>💡 View hint</summary>
    You can use <a href="https://seaborn.pydata.org/generated/seaborn.load_dataset.html"><code>seaborn.load_dataset</code></a>
</details>

❓ How many rows are available in that dataset?

❓ Plot the distribution of the `total_bill` column of that dataset.

❓ What is the [**skewness**](https://whatis.techtarget.com/definition/skewness) value of this distribution?
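As a minimal sketch of what skewness measures, the cell below computes it twice: once with `scipy.stats.skew` and once by hand from the central moments. The `total_bill` array here is a synthetic right-skewed stand-in (an assumption made so the cell runs on its own); in the notebook you would pass `df["total_bill"]` instead.

```python
import numpy as np
import scipy.stats as stats

# Synthetic right-skewed stand-in for df["total_bill"] (assumption, not the real data)
rng = np.random.default_rng(42)
total_bill = rng.gamma(shape=2.0, scale=10.0, size=250)

# Skewness as computed by scipy (default: biased Fisher-Pearson coefficient)
skewness = stats.skew(total_bill)

# Same quantity by hand: third central moment over the cubed standard deviation
centered = total_bill - total_bill.mean()
skewness_manual = np.mean(centered**3) / np.mean(centered**2) ** 1.5

print(f"skewness (scipy):  {skewness:.3f}")
print(f"skewness (manual): {skewness_manual:.3f}")
```

A positive value means a longer tail on the right. Note that `df["total_bill"].skew()` in pandas applies a small-sample bias correction, so it can differ slightly from the scipy default.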

❓ Create variables `mu` and `sigma` storing the mean and standard deviation of the `total_bill` distribution.
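One possible way to compute `mu` and `sigma` is sketched below, assuming the distribution of interest is `total_bill` (the column used in the rest of the notebook) and treating the dataset itself as the population. The array is again a synthetic stand-in, not the real data.

```python
import numpy as np

# Synthetic stand-in for df["total_bill"] (assumption, not the real data)
rng = np.random.default_rng(42)
total_bill = rng.gamma(shape=2.0, scale=10.0, size=250)

mu = total_bill.mean()

# ddof=0 (numpy's default) gives the population standard deviation,
# which fits here because the dataset plays the role of the population
sigma = total_bill.std(ddof=0)

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
```

Keep in mind that pandas defaults to `ddof=1` in `Series.std()`, so `df["total_bill"].std()` returns a slightly larger value unless you pass `ddof=0`.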

## Sampling

❓ Randomly pick 10 rows of the dataset, with replacement, and compute the mean $\bar{x}$ of that sample.

Run this cell a few times. Do you get the same result each time? Is this expected?
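Here is a minimal sketch of drawing one sample with replacement, using numpy on a synthetic stand-in array (an assumption so the cell is self-contained). With the real DataFrame, `df.sample(n=10, replace=True)` would do the same job on rows.

```python
import numpy as np

# Synthetic stand-in for df["total_bill"] (assumption, not the real data)
total_bill = np.random.default_rng(42).gamma(shape=2.0, scale=10.0, size=250)

# Unseeded generator on purpose: every run draws a different sample
rng = np.random.default_rng()

# Draw 10 values with replacement, so the same value can be picked twice
sample = rng.choice(total_bill, size=10, replace=True)
x_bar = sample.mean()

print(f"sample mean: {x_bar:.2f}")
```

Because the generator is not seeded, `x_bar` changes from run to run, which is exactly the sampling variability the rest of the notebook studies.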

❓ Create a `means` list storing the means of $N$ samples, each of size $n$.

Start with $n = 5$ and $N = 10$

In the same cell, **plot** the distribution of `means`. Keeping $n$ constant, increase $N$ and observe. Then increase $n$ and try another range of $N$. What do you observe?

Try plotting a grid of 6 distributions for $n \in \{1, 5, 50, 100, 500, 1000\}$ (sample with replacement, since the largest values of $n$ exceed the number of rows).
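The sampling loop can be sketched as below. The helper name `sample_means` and the dictionary `means_by_n` are hypothetical names introduced for illustration, and the population is a synthetic stand-in for `df["total_bill"]`. The helper returns a numpy array rather than a list, which is a design choice, not something the exercise requires.

```python
import numpy as np


def sample_means(values, n, N, rng):
    """Return the means of N samples of size n drawn with replacement."""
    # One row per sample: shape (N, n), then average across each row
    samples = rng.choice(values, size=(N, n), replace=True)
    return samples.mean(axis=1)


rng = np.random.default_rng(0)
# Synthetic stand-in for df["total_bill"] (assumption, not the real data)
total_bill = rng.gamma(shape=2.0, scale=10.0, size=250)

N = 2000
sizes = (1, 5, 50, 100, 500, 1000)
means_by_n = {n: sample_means(total_bill, n, N, rng) for n in sizes}

for n, means in means_by_n.items():
    print(f"n={n:>4}  mean of means={means.mean():6.2f}  std of means={means.std():5.2f}")
```

For the grid, each array in `means_by_n` can be passed to `sns.histplot(means, kde=True, ax=ax)` with the axes coming from `plt.subplots(2, 3)`. The spread should visibly shrink as $n$ grows.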

## Checking the CLT

![](https://upload.wikimedia.org/wikipedia/commons/thumb/7/7b/IllustrationCentralTheorem.png/400px-IllustrationCentralTheorem.png)

❓ Let's verify the Central Limit Theorem computationally:

For each value of `n`:
- Compare `mu` with the mean of means
- Compare `sigma` with the standard deviation of the means (don't forget the $\sqrt n$ adjustment: the standard deviation of the means should be close to `sigma` divided by $\sqrt n$)
- Compute the `skewness` of the sampling distribution
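The three checks above can be sketched in one loop. This is only one possible layout: the `results` dictionary and its keys are hypothetical names, $N = 2000$ is an arbitrary choice, and the population is a synthetic stand-in for `df["total_bill"]`.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
# Synthetic stand-in for df["total_bill"] (assumption, not the real data)
total_bill = rng.gamma(shape=2.0, scale=10.0, size=250)
mu, sigma = total_bill.mean(), total_bill.std()  # population values (ddof=0)

N = 2000  # number of samples per sample size (arbitrary choice)
results = {}
for n in (1, 5, 50, 100, 500, 1000):
    # N samples of size n, drawn with replacement, reduced to their means
    means = rng.choice(total_bill, size=(N, n), replace=True).mean(axis=1)
    results[n] = {
        "mean_of_means": means.mean(),               # should be close to mu
        "rescaled_std": means.std() * np.sqrt(n),    # should be close to sigma
        "skewness": stats.skew(means),               # should shrink toward 0
    }

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
for n, r in results.items():
    print(
        f"n={n:>4}  mean of means={r['mean_of_means']:6.2f}  "
        f"std*sqrt(n)={r['rescaled_std']:6.2f}  skewness={r['skewness']:6.3f}"
    )
```

If the CLT holds, the first two columns stay close to `mu` and `sigma` for every `n`, while the skewness drifts toward 0 as `n` increases.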

## Probability

Imagine I take a sample of 100 rows from the dataset. What is the probability that the average bill of that sample is **lower than 18€** (`target`)? 

❓ Plot the `pdf` of [`scipy.stats.norm`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html), parametrized with `mu`, `sigma` and `n`, to show the sampling distribution of the mean total bill.

❓ What is the probability we are looking for? Use the `cdf` method to find it.
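A minimal sketch of the `cdf` computation follows. The values of `mu` and `sigma` are illustrative round numbers chosen for this example, not the actual statistics of the dataset; in the notebook you would reuse the variables computed earlier.

```python
import numpy as np
import scipy.stats as stats

# Illustrative round numbers (assumption): replace with the real mu and sigma
mu, sigma = 20.0, 9.0
n = 100       # sample size
target = 18   # threshold on the sample mean

# Under the CLT, the sample mean is approximately N(mu, sigma / sqrt(n))
sampling_dist = stats.norm(loc=mu, scale=sigma / np.sqrt(n))

# P(sample mean < target) is the area under the pdf to the left of target
probability = sampling_dist.cdf(target)

print(f"P(mean < {target}) = {probability:.4f}")
```

The key point is the `scale` argument: it is the standard deviation of the sample means, $\sigma / \sqrt{n}$, not the standard deviation of individual bills.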

❓ Compute the z-score of the `target` value (18€).

❓ Plot the standard normal distribution $\mathcal{N}(0, 1)$ and mark the z-score of the target with a red dot (use the `pdf`).
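The z-score expresses the target in units of standard deviations of the sampling distribution, $z = \frac{\text{target} - \mu}{\sigma / \sqrt{n}}$. The sketch below uses illustrative round numbers for `mu` and `sigma` (an assumption, not the dataset's actual values) and checks that the standard normal `cdf` at `z` matches the probability obtained directly from the sampling distribution.

```python
import numpy as np
import scipy.stats as stats

# Illustrative round numbers (assumption): replace with the real mu and sigma
mu, sigma = 20.0, 9.0
n = 100
target = 18

# Distance from the mean, measured in standard errors (sigma / sqrt(n))
z = (target - mu) / (sigma / np.sqrt(n))

# The probability is the same whether computed on the z scale or the original scale
p_from_z = stats.norm.cdf(z)
p_direct = stats.norm.cdf(target, loc=mu, scale=sigma / np.sqrt(n))

print(f"z = {z:.3f}")
print(f"P(Z < z) = {p_from_z:.4f}")
```

A negative `z` means the target sits below the mean of the sampling distribution.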

---

## More Resources

- [StatQuest - Probability vs Likelihood](https://www.youtube.com/watch?v=pYxNSUDSFH4)
- [StatQuest - Central Limit Theorem](https://www.youtube.com/watch?v=YAlJCEDH2uY)
- [Le théorème central limite 🇫🇷](https://www.youtube.com/watch?v=4dhm2QAA2x4?hl=en&cc_lang_pref=en&cc=1)
- [3blue1brown - Bayes Theorem](https://www.youtube.com/watch?v=HZGCoVF3YvM)