# MODULES


import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import scipy.stats as stats
from sklearn import datasets

sns.set()


___

# FROM SAMPLE TO POPULATION

## Sampling Distribution of Sample Statistics

The true value of a population's parameter is usually unknown; we try to estimate it from the available sample data. But how does a sample statistic relate to the actual population parameter?

As covered earlier:
+ the true value of a population's parameter is fixed.
+ a sample is only part of the population; the numerical value of its statistic will not be the exact value of the parameter.
+ the observed value of the statistic depends on the selected sample.
+ some variability in the values of a statistic, over different samples, is unavoidable.

As it depends on the sample, the sample statistic is random and has a **sampling distribution** we can study. 


## Central Limit Theorem

Let ${X_{1},\ldots ,X_{n}}$ be a sequence of independent and identically distributed (i.i.d.) random variables drawn from a distribution of expected value $\mu$ and finite variance $\sigma^2$. Let ${\bar {X}}_{n}$ be the sample average: ${\bar {X}}_{n} = ({X_{1} + \ldots + X_{n}}) / n$.

### Law of Large Numbers (LLN)

The [Law of Large Numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers) states that the sample mean converges to $\mu$ as the sample size increases.
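A minimal sketch of this convergence, assuming an exponential population with mean 2 (the distribution, seed and sample sizes are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

# Assumed population: exponential with scale 2, so the true mean mu is 2
mu = 2.0
draws = rng.exponential(scale=mu, size=100_000)

# Running sample mean after 1, 2, ..., n observations
running_mean = np.cumsum(draws) / np.arange(1, draws.size + 1)

for n in (10, 1_000, 100_000):
    print(f"n = {n:>7}: sample mean = {running_mean[n - 1]:.4f}")
```

The printed sample means should drift toward $\mu$ as $n$ grows.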


### Central Limit Theorem (CLT)

The [Central Limit Theorem](https://en.wikipedia.org/wiki/Central_limit_theorem) states that, for large $n$, the difference between the sample mean and its limit $\mu$ is approximately normally distributed with mean 0 and variance $\sigma ^{2}/n$. A very important property of the CLT is that it holds regardless of the distribution of $X_i$, as long as its variance is finite.

This means that for large samples (typically $n$ greater than 30), the sampling distribution of the sample mean is approximately normal and has the following parameters:

+ mean: $\mu$
+ standard error: $\sigma / \sqrt{n}$

The **standard error** is the **standard deviation** of the **sampling distribution** of the sample mean.
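The two parameters above can be checked by simulation. The sketch below assumes an exponential population with scale 1 (mean 1, standard deviation 1) and draws many samples of size 50; all numerical choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed

mu, sigma = 1.0, 1.0   # exponential with scale 1: mean 1, standard deviation 1
n = 50                 # size of each sample
n_samples = 20_000     # number of repeated samples

# One row per sample, then one sample mean per row
samples = rng.exponential(scale=1.0, size=(n_samples, n))
sample_means = samples.mean(axis=1)

print("mean of the sample means:  ", sample_means.mean())
print("std of the sample means:   ", sample_means.std(ddof=1))
print("theoretical standard error:", sigma / np.sqrt(n))
```

The empirical standard deviation of the sample means should be close to $\sigma / \sqrt{n}$.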

### Example for the Exponential distribution

![missing](../../img/exp-clt.png)

## Confidence Interval

As shown in the example, the sample means can take a large range of values, some being quite far from the actual population mean. We usually have only one sample to study the population, with no way of knowing where our sample mean sits in the sampling distribution. What we can do is leverage the CLT to quantify this uncertainty and build a [confidence interval](https://en.wikipedia.org/wiki/Confidence_interval) for the population mean. 

Given a confidence level $1 - \alpha$, we can calculate the interval of values inside which a proportion $1 - \alpha$ of all sample means will fall. A small proportion $\alpha$ of all samples, the ones least representative of the population, will have a sample mean so far from the actual mean that it falls outside of this interval. 

Leveraging the fact that the sampling distribution of the sample mean is roughly normal, our sample mean has a 95% probability of being between -2 and +2 standard errors of the population mean $\mu$:

<img class="center-block" src="https://sebastienplat.s3.amazonaws.com/9ec352c1ff3263bdd17c8407d30c1f0b1490007929308"/>


Mathematically, this translates to:

$P(\bar{y} \in [\space\mu \pm 2 se\space] ) = P(\bar{y} \in [\space\mu \pm 2 \sigma/\sqrt{n}\space] ) \simeq 0.95$

We can deduce that:

$P(\mu \in [\space\bar{y} \pm 2 \sigma/\sqrt{n}\space] ) \simeq 0.95$

In other words: for 95% of all samples, the population mean lies within two standard errors of the sample mean. For the remaining 5% of samples, the population mean is outside that confidence interval and our inference is incorrect.

_Note: we have no way of knowing whether our sample is one of these 5%. In hypothesis testing, $\alpha$ is the probability of making a Type I Error._
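The coverage claim can be illustrated by simulation. This sketch assumes a normal population with known, hypothetical parameters ($\mu = 10$, $\sigma = 3$) and counts how often the interval $\bar{y} \pm 2 \sigma/\sqrt{n}$ contains $\mu$.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed

mu, sigma = 10.0, 3.0   # hypothetical population parameters (sigma assumed known)
n = 40                  # size of each sample
n_samples = 10_000      # number of repeated samples

samples = rng.normal(loc=mu, scale=sigma, size=(n_samples, n))
sample_means = samples.mean(axis=1)
se = sigma / np.sqrt(n)

# One interval per sample: sample mean +/- 2 standard errors
lower = sample_means - 2 * se
upper = sample_means + 2 * se

# Fraction of intervals that contain the population mean
coverage = np.mean((lower <= mu) & (mu <= upper))
print(f"fraction of intervals containing mu: {coverage:.3f}")
```

The fraction should land near 0.95; each individual interval either contains $\mu$ or does not.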


___

# T-DISTRIBUTION
## Limits of the CLT

The CLT confidence intervals **do not work** when either:

+ $\sigma$ is unknown.
+ the sample size $n$ is small.

The [Student’s t distribution](https://en.wikipedia.org/wiki/Student's_t-distribution) is used instead. 


## Assumptions of the t-distribution

The sampling distribution of the sample mean has to be roughly normal for the t-distribution to work well. It means that either:

+ the population is normally distributed, even for small samples.
+ the sample is large, regardless of the underlying distribution of data, thanks to the CLT.

_Note: If the data distribution is far from normal, using a t-test that focuses on the mean [might not be the most relevant test](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6676026/)._

If the sample size is very small, we can use normal probability plots to check whether the sample may come from a normal distribution. If the t-distribution cannot be used, it is possible to use a more robust procedure such as the one-sample [**Wilcoxon procedure**](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test).
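A rough sketch of both checks with `scipy.stats`, on a small simulated sample. The data, the seed and the hypothesized center `m0` are assumptions made for illustration only; note that the Wilcoxon signed-rank test itself assumes a distribution symmetric around its center.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(seed=3)            # arbitrary seed
sample = rng.exponential(scale=1.0, size=15)   # small, skewed sample (simulated data)

# Normal probability plot coordinates and least-squares fit.
# r close to 1 means the ordered data lie near a straight line.
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm")
print(f"probability plot correlation r = {r:.3f}")

# One-sample Wilcoxon signed-rank test applied to the differences from m0
m0 = 1.0  # hypothesized center, chosen for illustration
result = stats.wilcoxon(sample - m0)
print(f"Wilcoxon statistic = {result.statistic}, p-value = {result.pvalue:.3f}")
```

Passing `plot=plt` to `stats.probplot` draws the probability plot itself instead of only returning its coordinates.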


## Properties of the t-distribution

Small samples are more likely to underestimate $\sigma$ and have a mean that differs from $\mu$. The t-distribution accounts for this uncertainty with heavier tails compared to a Gaussian: the probability of extreme values becomes comparatively higher. This means its confidence intervals are wider than CLT ones for the same confidence level.

We have seen that under the assumptions of the CLT:

$$\frac { \bar {X_n} - \mu }{\sigma /\sqrt {n}} \sim N(0, 1)$$

Under the assumptions of the t-distribution, we can replace $\sigma$ with $\widehat {\sigma}$, the square root of the unbiased sample variance $\widehat {\sigma}^2$; the standardized sample mean then follows a t-distribution ([mathematical proof](https://www.math.arizona.edu/~jwatkins/ttest.pdf)):

$$\frac { \bar {X_n} - \mu }{\widehat {\sigma} /\sqrt {n}} \sim t_{n-1}$$

The distribution $t_{n-1}$ is the t-distribution with $n-1$ degrees of freedom.

_Note: the unbiased variance calculated from a sample of size $n$ divides the sum of squared deviations from the sample mean by $n-1$ instead of $n$. This is called [Bessel's correction](https://en.wikipedia.org/wiki/Bessel%27s_correction) and it [removes the bias](https://dawenl.github.io/files/mle_biased.pdf) of the variance estimate._
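The effect of Bessel's correction can be seen by averaging both variance estimates over many small samples. The sketch assumes a normal population with variance 4 and samples of size 5; in NumPy, `ddof=0` divides by $n$ and `ddof=1` divides by $n-1$.

```python
import numpy as np

rng = np.random.default_rng(seed=4)  # arbitrary seed

sigma2 = 4.0   # true variance of the simulated population
n = 5          # small sample size, where the bias is most visible
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(200_000, n))

# Average of each estimator over all simulated samples
biased = samples.var(axis=1, ddof=0).mean()     # divides by n
unbiased = samples.var(axis=1, ddof=1).mean()   # divides by n - 1

print(f"true variance:          {sigma2}")
print(f"mean estimate (ddof=0): {biased:.3f}")
print(f"mean estimate (ddof=1): {unbiased:.3f}")
```

The `ddof=0` average should sit near $\sigma^2 (n-1)/n$, while the `ddof=1` average should sit near $\sigma^2$.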

_Note: When the sample size is large (30+ observations), the Student Distribution becomes extremely close to the normal distribution._
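To see how quickly the t-distribution approaches the normal distribution, the sketch below compares the two-sided 95% critical values for a few sample sizes (the sample sizes are arbitrary choices).

```python
import scipy.stats as stats

alpha = 0.05

# Two-sided critical value of the standard normal distribution
z_crit = stats.norm.ppf(1 - alpha / 2)
print(f"normal:         {z_crit:.3f}")

# Matching critical values of the t-distribution with n - 1 degrees of freedom
t_crit = {}
for n in (2, 5, 10, 30, 100):
    t_crit[n] = stats.t.ppf(1 - alpha / 2, df=n - 1)
    print(f"t with n = {n:>3}: {t_crit[n]:.3f}")
```

The t critical values are always larger than the normal one and shrink toward it as $n$ increases.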


## Confidence intervals

The $1 - \alpha$ T Confidence Interval is:

$$\bar{y} \pm T_{\alpha/2, n-1} \times \widehat {\sigma} / \sqrt{n}$$

Where $T_{\alpha/2, n-1}$ is the value above which lies a proportion $\alpha/2$ of the t-distribution with $n-1$ degrees of freedom.
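A short sketch of this interval in code, using made-up measurements. It computes the interval from the sample mean, the sample standard deviation (with `ddof=1`) and the t critical value, then cross-checks the result against `scipy.stats.t.interval`.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical measurements, for illustration only
sample = np.array([4.8, 5.1, 5.6, 4.9, 5.3, 5.0, 5.4, 4.7])
alpha = 0.05

n = sample.size
y_bar = sample.mean()
sigma_hat = sample.std(ddof=1)        # square root of the unbiased sample variance
se = sigma_hat / np.sqrt(n)           # estimated standard error

# Critical value: a proportion alpha / 2 of t_{n-1} lies above it
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
ci = (y_bar - t_crit * se, y_bar + t_crit * se)

# Same interval computed directly by SciPy
ci_scipy = stats.t.interval(1 - alpha, n - 1, loc=y_bar, scale=se)

print(f"manual: [{ci[0]:.3f}, {ci[1]:.3f}]")
print(f"scipy:  [{ci_scipy[0]:.3f}, {ci_scipy[1]:.3f}]")
```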



## Example for the Exponential distribution

Back to the Exponential Distribution, the figure below shows the experimental sample mean distribution vs T vs Normal for different sample sizes: 2, 5, 10 and 20:
+ the t-distribution gets close to normal even for relatively small sample sizes. 
+ it does not approximate the empirical distribution very well for smaller sample sizes because its assumptions are not met: the exponential distribution is far from normal.

<img class="center-block" src="https://sebastienplat.s3.amazonaws.com/dc954f3e9562d53b7829a2adcd2854ff1490011103173"/>
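One way to quantify the mismatch is to measure how often a nominal 95% t interval contains the true mean. This sketch is not the code behind the figure; it assumes an exponential population with scale 1 and, for comparison, a normal population with the same mean.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(seed=5)  # arbitrary seed
alpha = 0.05
n_samples = 20_000


def t_coverage(samples, mu):
    """Fraction of t intervals (one per row of samples) that contain mu."""
    n = samples.shape[1]
    means = samples.mean(axis=1)
    se = samples.std(axis=1, ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    return np.mean(np.abs(means - mu) <= t_crit * se)


coverage_exp, coverage_norm = {}, {}
for n in (2, 5, 10, 20):
    exp_samples = rng.exponential(scale=1.0, size=(n_samples, n))
    norm_samples = rng.normal(loc=1.0, scale=1.0, size=(n_samples, n))
    coverage_exp[n] = t_coverage(exp_samples, mu=1.0)
    coverage_norm[n] = t_coverage(norm_samples, mu=1.0)
    print(f"n = {n:>2}: exponential {coverage_exp[n]:.3f}, normal {coverage_norm[n]:.3f}")
```

With normal data the coverage should stay close to 0.95 at every sample size, while with exponential data it tends to fall short for small $n$.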


## Sample size

In a normal distribution, about 95% of the data is between -2 and +2 standard deviations from the mean. Even for skewed data, going two standard deviations away from the mean often captures nearly all of the data.

If we know the minimum and maximum values that the population is likely to take (excluding outliers), we can suppose they represent this interval of four standard deviations.

It means the standard deviation of a population $\sigma$ can be approximated by:

$\sigma \simeq 1/4 \times \Delta_{range}$

If we know the margin of error $E$ we are ready to accept at $1 - \alpha$ confidence, the sample size we need can be approximated by:

$n \simeq [Z_{\alpha/2} \times \sigma / E]^2 \simeq [Z_{\alpha/2} \times  \Delta_{range} / (4 E)]^2 $
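As a worked example, the sketch below applies this formula with hypothetical inputs: a plausible range of 40 units and an accepted margin of error of 2 units at 95% confidence. The function name `sample_size_z` is illustrative, and rounding up to the next integer is a design choice, not part of the formula.

```python
import math

import scipy.stats as stats


def sample_size_z(value_range, margin, alpha=0.05):
    """Approximate sample size from the range rule, using the normal critical value."""
    sigma = value_range / 4                 # range rule: sigma ~ range / 4
    z = stats.norm.ppf(1 - alpha / 2)       # Z_{alpha/2}
    return math.ceil((z * sigma / margin) ** 2)


n_z = sample_size_z(value_range=40, margin=2)
print(n_z)  # (1.96 * 10 / 2)^2 is about 96.04, rounded up to 97
```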

A more accurate method replaces $Z_{\alpha/2}$ with the t-value, which itself depends on $n$: iteratively evaluate the following formula until the $n$ used to calculate the t-value matches the resulting $n$.

$n \simeq [t_{\alpha/2, n-1} \times  \Delta_{range} / (4 E)]^2 $
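A possible implementation of this iteration, with the same hypothetical inputs as before (range of 40, margin of error of 2). The function name, the starting point taken from the normal approximation and the iteration cap are all assumptions of this sketch.

```python
import math

import scipy.stats as stats


def sample_size_t(value_range, margin, alpha=0.05, max_iter=100):
    """Iterate n = (t_{alpha/2, n-1} * range / (4 * margin))^2 until n stops changing."""
    sigma = value_range / 4
    # Start from the normal approximation (at least 2, so that n - 1 >= 1)
    z = stats.norm.ppf(1 - alpha / 2)
    n = max(2, math.ceil((z * sigma / margin) ** 2))
    for _ in range(max_iter):
        t = stats.t.ppf(1 - alpha / 2, df=n - 1)
        n_new = max(2, math.ceil((t * sigma / margin) ** 2))
        if n_new == n:      # the n used for the t-value matches the result
            break
        n = n_new
    return n                # last value if max_iter is reached without a match


n_t = sample_size_t(value_range=40, margin=2)
print(n_t)
```

Because t critical values are larger than the normal one, the result is at least as large as the estimate based on $Z_{\alpha/2}$.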