# Central limit theorem

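The distribution of sample means approaches a normal distribution as the sample size grows, regardless of the shape of the population distribution, provided the population has a finite variance. A sample size above 30 is a common rule of thumb for when the approximation is good enough, although strongly skewed populations may need more.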

If we have a population with mean MU and standard deviation SIGMA, and we draw many samples of size n from it, then the distribution of the sample means will be approximately normal.  
The mean of the sample means will be approximately equal to MU, and their standard deviation (the standard error) will be approximately SIGMA/SQRT(n).  
If SIGMA is unknown and n > 30, we can use the sample standard deviation to estimate the standard error: STD/SQRT(n).
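
The claims above can be checked with a small simulation. This is a minimal sketch using only the standard library; the exponential population, the sample size of 50, the 2000 repetitions and the seed are arbitrary choices made for illustration, not values from these notes.

```python
import random
import statistics
from math import sqrt

random.seed(0)  # fixed seed so the run is repeatable

N = 50              # size of each sample (illustrative choice)
NUM_SAMPLES = 2000  # number of samples drawn (illustrative choice)

# Exponential population with rate 1: clearly non-normal, with MU = 1 and SIGMA = 1.
MU, SIGMA = 1.0, 1.0

# Draw NUM_SAMPLES samples of size N and keep the mean of each one.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(N)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
std_of_means = statistics.stdev(sample_means)

print("mean of sample means:", mean_of_means, "vs MU:", MU)
print("std of sample means: ", std_of_means, "vs SIGMA/sqrt(N):", SIGMA / sqrt(N))
```

Both printed pairs should be close to each other even though the population itself is strongly skewed.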

## Confidence interval

A confidence interval (CI) is a range of plausible values for an unknown parameter, defined by a lower and an upper bound.

For example, we can give a 95% confidence interval for the population mean using sample mean, sample standard deviation and sample size. 

First we compute the standard error, STD/sqrt(sample size). Then, using the central limit theorem to justify the normal approximation and the fact that about 95% of a normal distribution lies within 1.96 standard deviations of its mean, we can say with 95% confidence that the population mean lies in the interval sample_mean +- (SE*1.96).

In [9]:
from math import sqrt
SAMPLE_MEAN = 100
SAMPLE_STD = 10
SAMPLE_SIZE = 64

SE = SAMPLE_STD / sqrt(SAMPLE_SIZE)

print(SAMPLE_MEAN-(1.96*SE), SAMPLE_MEAN+(1.96*SE)) #   interval at 95 % confidence level for population mean
print(SAMPLE_MEAN-(2.58*SE), SAMPLE_MEAN+(2.58*SE)) #  interval at 99 % confidence level for population mean

97.55 102.45
96.775 103.225
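
The constants 1.96 and 2.58 are rounded quantiles of the standard normal distribution, so they can be derived instead of hard-coded. A possible sketch follows; the helper name `mean_confidence_interval` is made up for this example, and `statistics.NormalDist` (Python 3.8+) is used here in place of a z-table.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_std, sample_size, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    # Two-sided critical value: roughly 1.96 for 0.95 and 2.58 for 0.99.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sample_std / sqrt(sample_size)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin


# Same inputs as the cell above.
low_95, high_95 = mean_confidence_interval(100, 10, 64, confidence=0.95)
low_99, high_99 = mean_confidence_interval(100, 10, 64, confidence=0.99)
print(low_95, high_95)
print(low_99, high_99)
```

Because the exact quantiles are used, the bounds can differ from the cell above in the third decimal place.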


## Margin of error

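MoE is the half-width of a confidence interval: the amount added to and subtracted from the sample estimate (a mean or a proportion), here Z*STD/sqrt(n). The MoE is usually fixed in advance, and the sample size required to achieve it is calculated by solving for n: n = (Z*STD/MoE)^2.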

In [55]:
Z_at_confidence_level = 1.96
MoE = 0.1
n = (SAMPLE_STD**2 * Z_at_confidence_level**2)/MoE**2
n
# Z_at_confidence_level is the z-value for the chosen confidence level (for example, 1.96 for 95% and 2.58 for 99%)

38415.99999999999

So, with the given sample std of 10, we need about 38,416 examples (almost 40k) to estimate the population mean at a 95% confidence level with a margin of error of 0.1.
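
The same calculation can be wrapped in a small helper to see how the required sample size reacts to the margin of error. This is only a sketch: `required_sample_size` is a hypothetical name, the result is rounded up because a sample size must be a whole number, and the exact normal quantile is used, so the figure differs slightly from the one obtained with 1.96.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(std, margin_of_error, confidence=0.95):
    """Smallest n for which z * std / sqrt(n) does not exceed margin_of_error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * std / margin_of_error) ** 2)


n_tight = required_sample_size(std=10, margin_of_error=0.1)  # MoE from the cell above
n_loose = required_sample_size(std=10, margin_of_error=1.0)  # illustrative, 10x wider
print(n_tight, n_loose)
```

Since n grows with 1/MoE^2, making the margin ten times wider cuts the required sample size by roughly a factor of one hundred.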

## P-value

The p-value is the probability of getting test results at least as extreme as those observed, assuming the null hypothesis is true. It is not the probability that the null hypothesis itself is true.

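We can use properties of the normal distribution to compute this probability. For example, if we know the population mean and standard deviation and observe a value drawn from that population, we can express the difference in units of standard deviation (the z-score: the difference between the observed value and the population mean, divided by the standard deviation) and look up the probability in a z-table or compute it with the normal CDF. When the observed value is the mean of a sample of size n, divide by the standard error, STD/sqrt(n), rather than by the standard deviation; the cell below divides by the standard deviation itself, which corresponds to n = 1.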



In [12]:
import scipy.stats as st

In [27]:
P_MEAN = 100
SAMPLE_MEAN = 120
SAMPLE_STD = 10

z_score = (P_MEAN - SAMPLE_MEAN)/SAMPLE_STD
z_score

-2.0

In [28]:
# one sided
st.norm.cdf(z_score)

0.022750131948179195
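
For a sample mean, the score is computed against the standard error rather than the standard deviation. The sketch below makes the sample size explicit and also shows the two-sided variant; `z_test_p_value` is a hypothetical helper, `statistics.NormalDist` stands in for `st.norm`, and the second set of numbers (mean 104, n = 25) is invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_value(sample_mean, population_mean, std, n=1, two_sided=False):
    """Tail probability of a z-score at least as extreme as the observed one."""
    standard_error = std / sqrt(n)
    z = (sample_mean - population_mean) / standard_error
    tail = NormalDist().cdf(-abs(z))  # area in one tail beyond |z|
    return 2 * tail if two_sided else tail


# n = 1 reproduces the one-sided result of the cell above (z = 2 in magnitude).
p_single = z_test_p_value(120, 100, 10)
p_two_sided = z_test_p_value(120, 100, 10, two_sided=True)

# Invented example: a mean of 104 over 25 observations also gives z = 4 / (10 / 5) = 2.
p_mean = z_test_p_value(104, 100, 10, n=25)
print(p_single, p_two_sided, p_mean)
```

A much smaller difference in means reaches the same z-score once it is averaged over 25 observations, which is the effect of dividing by sqrt(n).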

## T-distribution

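When the sample size is small and the population standard deviation has to be estimated from the sample, the standardized sample mean no longer follows a normal distribution closely. The t (Student's) distribution is used instead. It has an additional parameter, the degrees of freedom, which equals sample size - 1 for a one-sample test. Its tails are heavier than the normal's, so the same score gives a larger p-value. When the sample size is large, the t distribution is almost identical to the normal, so using t in place of the normal is a safe default whenever the standard deviation is estimated from the sample. The cell below evaluates the same score of -2 as in the previous section, this time under a t distribution with N - 1 = 9 degrees of freedom.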

In [31]:
P_MEAN = 100
SAMPLE_MEAN = 120
SAMPLE_STD = 10
N = 10

z_score = (P_MEAN - SAMPLE_MEAN)/SAMPLE_STD
z_score

-2.0

In [32]:
# one sided
st.t.cdf(z_score, df=N-1)

0.03827641188535046
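
To see how quickly the t distribution approaches the normal, we can evaluate the same score for increasing sample sizes. This is a rough sketch using `scipy.stats`; the list of sample sizes is an arbitrary choice.

```python
from scipy import stats

SCORE = -2.0  # same standardized score as in the cells above

normal_tail = stats.norm.cdf(SCORE)

# One-sided tail probability under t for growing sample sizes (df = n - 1).
t_tails = {n: stats.t.cdf(SCORE, df=n - 1) for n in (5, 10, 30, 100, 1000)}

for n, tail in t_tails.items():
    print(f"n={n:>4}  t tail={tail:.5f}  normal tail={normal_tail:.5f}")
```

The t tail is always somewhat larger than the normal tail, and the gap shrinks as n grows.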

## Two-sample t-test

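We can test whether the difference between two sample means is statistically significant using the t distribution.  
We calculate t-score = (MEAN_1 - MEAN_2)/SQRT(VARIANCE_1/n1 + VARIANCE_2/n2), take degrees of freedom = n1 + n2 - 2, and infer the probability of getting a result at least this extreme. Using n1 + n2 - 2 assumes the two populations have equal variances; with equal sample sizes, as in the example below, this t-score is identical to the pooled-variance one.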

In [46]:
MEAN_1 = 89.9
MEAN_2 = 80.7

sigma_1 = 11.3
sigma_2 = 11.7

n1 = 20
n2 = 20

In [47]:
t_score = (MEAN_1 - MEAN_2)/sqrt(((sigma_1**2)/n1) + ((sigma_2**2)/n2))
t_score

2.529439633102561

In [48]:
# two sided (the one-tail probability is doubled)
st.t.cdf(-abs(t_score), df=n1+n2-2)*2

0.0156935300771004
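
As a cross-check, SciPy can run the same test directly from summary statistics with `scipy.stats.ttest_ind_from_stats`. The sketch below uses the numbers from the cells above; `equal_var=True` matches the n1 + n2 - 2 degrees of freedom, and the Welch variant (`equal_var=False`) is included only for comparison.

```python
from scipy import stats

# Pooled-variance (Student) two-sample t-test from summary statistics.
pooled = stats.ttest_ind_from_stats(
    mean1=89.9, std1=11.3, nobs1=20,
    mean2=80.7, std2=11.7, nobs2=20,
    equal_var=True,
)

# Welch's t-test: does not assume equal variances and adjusts the degrees of freedom.
welch = stats.ttest_ind_from_stats(
    mean1=89.9, std1=11.3, nobs1=20,
    mean2=80.7, std2=11.7, nobs2=20,
    equal_var=False,
)

print("pooled:", pooled.statistic, pooled.pvalue)
print("welch: ", welch.statistic, welch.pvalue)
```

With equal sample sizes the two statistics coincide and only the degrees of freedom, and therefore the p-values, differ slightly.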

In [None]:
T-statistics

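t = (sample_mean - MU)/(STD/sqrt(n)). Why divide by sqrt(n)? Because the statistic measures how far the sample mean is from MU in units of the standard error STD/sqrt(n), which is the standard deviation of the sample mean, rather than in units of the standard deviation of individual observations.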