# Confidence Interval (CI)
-----------------
Imagine we have a population we are interested in, say 10_000 people, and we want to measure the proportion of smokers in it. Only 377 of the 10_000 agreed to answer the question, and there is no way to get answers from the whole population.  
Even with a sample of 377 we can calculate a confidence interval. Being 95% (or some other amount) confident means that if we repeated the sampling many times and built an interval from each sample, about 95% of those intervals would contain the true population proportion. 

But we need to make some assumptions before calculating a confidence interval:
1. Sample can be considered a simple random sample
2. Large enough sample size

Calculating a confidence interval around an estimate gives us additional information: the range of likely values for the quantity we are estimating and, when comparing groups, whether an observed difference is significant. 

In [18]:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

### Let's start with a One Proportion example

In [19]:
np.random.seed(1)
true_proportion = 0.4321

# Create population
population = np.random.binomial(1, true_proportion, 10_000)
population  # array of zeros and ones, simulates smokers proportion in the population

array([0, 1, 0, ..., 0, 0, 1])

In [20]:
n = 377  # sample size
sample = np.random.choice(population, size=n, replace=False)

best_estimate = np.mean(sample)  # we have zeros and ones, and can calculate proportion through mean()

z = 1.96  # for 95% confidence
margin_of_error = z * np.sqrt((best_estimate * (1 - best_estimate)) / n)
lcb = best_estimate - margin_of_error  # lower confidence bound
ucb = best_estimate + margin_of_error  # upper confidence bound

lcb, ucb 
# Run this cell several times to see how the confidence interval changes from sample to sample

(0.3666827770648099, 0.4662084696195402)

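We can get almost the same result with proportion_confint(). The small difference comes from rounding z to 1.96, while the function uses the exact normal quantile (about 1.959964).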

In [21]:
yes = np.sum(sample)
sm.stats.proportion_confint(yes, n)

(0.3666836914687879, 0.46620755521556223)

95% of intervals formed this way are expected to cover the true population proportion!  

You can find an awesome visualization of this process [here](https://seeing-theory.brown.edu/frequentist-inference/index.html#section2), or search for 'seeing-theory.brown.edu'.
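
The coverage claim can be checked with a small simulation. This is a minimal sketch, not part of the original notebook: it draws many samples of size 377 directly from a binomial distribution (a simplification of sampling without replacement from the 10_000 people), builds an interval from each one, and counts how often the true proportion falls inside. The seed and the number of repeats are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
true_proportion = 0.4321        # same value as in the example above
n = 377                         # sample size
z = 1.96                        # multiplier for 95% confidence
n_repeats = 10_000              # number of simulated samples (arbitrary)

# Each simulated sample gives a count of smokers and hence a sample proportion
p_hat = rng.binomial(n, true_proportion, size=n_repeats) / n

margin_of_error = z * np.sqrt(p_hat * (1 - p_hat) / n)
lcb = p_hat - margin_of_error
ucb = p_hat + margin_of_error

# Fraction of intervals that contain the true proportion
coverage = np.mean((lcb <= true_proportion) & (true_proportion <= ucb))
print(f"Coverage: {coverage:.3f}")
```

The printed coverage should land close to 0.95, though not exactly on it, because the interval relies on a normal approximation.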

### General CI Equation  
**BestEstimate** $\pm$ **MarginOfError**  
We already understand what the best estimate is, but let's talk more about the margin of error.  
**MarginOfError** = 'a few' × EstimatedStandardError, where 'a few' (**z** in the first example) is a multiplier from the appropriate distribution, chosen according to the desired confidence level and the sample design.  
A 95% confidence level corresponds to a 0.05 significance level.  

LowerConfidenceBound (**lcb**) = BestEstimate - MarginOfError  
UpperConfidenceBound (**ucb**) = BestEstimate + MarginOfError    

MarginOfError is computed differently for different tasks, as you'll see later.

### z and t multipliers
We will use **z** for proportions, and **t** for means.  

How to get these multipliers:

In [22]:
# Find z: 
# Search for 'z multiplier for 95 confidence interval' and look it up in a table

# Find t:
# As with z, you can look it up in a table, or compute it:
from scipy.stats import t

degree_of_freedom = n - 1  # there is a good explanation of df on YT by Vallia 
# If you have two independent means, df = n1 + n2 - 2
probability = 1 - (1 - 0.95) / 2  
t.ppf(probability, degree_of_freedom)

1.9662932291779265
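
The z multiplier can be computed the same way instead of being looked up in a table. A minimal sketch, assuming `scipy.stats.norm` is used for the standard normal distribution (the original cell only covers `t`):

```python
from scipy.stats import norm

confidence_level = 0.95
# Two-sided interval: half of the remaining probability goes in each tail
probability = 1 - (1 - confidence_level) / 2
z = norm.ppf(probability)
print(z)
```

For 95% confidence this gives roughly 1.96, the value used in the first example. With 376 degrees of freedom the t multiplier above is already very close to it.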

## Proportions
$\hat{p}$ - best estimate (proportion $\frac{part}{all}$)  
$n$ - sample size  
$y$ - positive part  


### One Proportion
**Example**: What proportion of the population smokes?  
**Assumptions**:  
1. Sample can be considered a simple random sample
2. Large enough sample size  

**Equation**: $\hat{p} \pm z\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$  
**Python**: statsmodels.api.stats.proportion_confint($y, n$)

### Two Proportions Difference
**Example**: Difference between the proportion of women who smoke and the proportion of men who smoke.   
**Assumptions**:  
1. Samples can be considered two simple random samples
2. Samples can be considered independent of one another
3. Large enough sample sizes to assume that the distribution of our estimate is normal

**Equation**: $(\hat{p}_1-\hat{p}_2) \pm z\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$
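
The equation can be applied directly with numpy. This is a sketch only: the counts are borrowed from the swimming-lessons example in the Hypothesis Testing part of this document, and z = 1.96 assumes a 95% confidence level.

```python
import numpy as np

# Counts taken from the swimming-lessons example later in this document
n1, y1 = 247, 91
n2, y2 = 308, 120

p1, p2 = y1 / n1, y2 / n2
z = 1.96  # multiplier for 95% confidence

best_estimate = p1 - p2
standard_error = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
margin_of_error = z * standard_error

lcb = best_estimate - margin_of_error  # lower confidence bound
ucb = best_estimate + margin_of_error  # upper confidence bound
print(lcb, ucb)
```

The resulting interval contains 0, so these data do not rule out equal proportions in the two groups.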

## Means
$a$ - array of sample observations  
$\mu$ - best estimate (sample mean)   
$n$ - sample size   
$s$ - sample standard deviation   

### One Mean
**Example**: Blood pressure of the population.  
**Assumptions**: 
1. Sample can be considered a simple random sample
2. Population is normal or sample is large enough

**Equation**: $\mu \pm t\frac{s}{\sqrt{n}}$  
**Python**: statsmodels.api.stats.DescrStatsW($a$).tconfint_mean() (or .zconfint_mean() for the z-based interval, which is nearly identical for large samples)
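
A minimal sketch of the same interval computed by hand. The blood pressure readings are simulated (hypothetical data, not from any real study), and `scipy.stats.t.interval` is used only as a cross-check.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated systolic blood pressure readings (hypothetical data)
a = rng.normal(loc=120, scale=15, size=50)

n = len(a)
mean = np.mean(a)
s = np.std(a, ddof=1)            # sample standard deviation
standard_error = s / np.sqrt(n)

t_multiplier = stats.t.ppf(0.975, df=n - 1)  # 95% confidence
lcb = mean - t_multiplier * standard_error
ucb = mean + t_multiplier * standard_error

# Cross-check with scipy's built-in t interval
lcb_scipy, ucb_scipy = stats.t.interval(0.95, n - 1, loc=mean, scale=standard_error)
print((lcb, ucb), (lcb_scipy, ucb_scipy))
```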

### Two Independent Means Difference
**Example**: Difference between the mean blood pressure of women and of men.   
**Assumptions**: 
1. Samples can be considered simple random samples
2. Samples can be considered independent of one another
3. Both populations are normal or the samples are large enough

**Approaches**: *Pooled* if the population variances can be assumed equal, otherwise *Unpooled*.  

**Equation (Unpooled)**: $(\mu_1 - \mu_2) \pm t\sqrt{(\frac{s_1}{\sqrt{n_1}})^2 + (\frac{s_2}{\sqrt{n_2}})^2}$   
**Equation (Pooled)**: $(\mu_1 - \mu_2) \pm t\sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2} {n_1 + n_2 - 2}} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
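
Both equations can be sketched in a few lines. The two samples are simulated (hypothetical data). For the unpooled case the degrees of freedom are an assumption on my part: the Welch-Satterthwaite approximation is used here, since this document only states df = n1 + n2 - 2, which belongs to the pooled case.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated blood pressure samples for two groups (hypothetical data)
a1 = rng.normal(loc=118, scale=12, size=40)
a2 = rng.normal(loc=124, scale=14, size=55)

n1, n2 = len(a1), len(a2)
s1, s2 = np.std(a1, ddof=1), np.std(a2, ddof=1)
best_estimate = np.mean(a1) - np.mean(a2)

# Unpooled standard error
se_unpooled = np.sqrt(s1**2 / n1 + s2**2 / n2)
# Assumption: Welch-Satterthwaite degrees of freedom for the unpooled case
df_unpooled = se_unpooled**4 / (
    (s1**2 / n1) ** 2 / (n1 - 1) + (s2**2 / n2) ** 2 / (n2 - 1)
)

# Pooled standard error, df = n1 + n2 - 2
df_pooled = n1 + n2 - 2
pooled_sd = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df_pooled)
se_pooled = pooled_sd * np.sqrt(1 / n1 + 1 / n2)

for label, se, df in [("unpooled", se_unpooled, df_unpooled),
                      ("pooled", se_pooled, df_pooled)]:
    t_multiplier = stats.t.ppf(0.975, df)  # 95% confidence
    print(label, best_estimate - t_multiplier * se, best_estimate + t_multiplier * se)
```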


What if normality doesn't hold? **Mann-Whitney test**.
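
A minimal sketch of running the test with `scipy.stats.mannwhitneyu`. The skewed samples are simulated (hypothetical data), and the two-sided alternative is an assumption made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Simulated right-skewed samples for two independent groups (hypothetical data)
a1 = rng.exponential(scale=1.0, size=30)
a2 = rng.exponential(scale=1.5, size=30)

statistic, p_value = stats.mannwhitneyu(a1, a2, alternative="two-sided")
print(statistic, p_value)
```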

### Two Paired Means Difference
**Example**: The two datasets must be linked, for example measurements taken from the same people before and after an event, or the difference in finger length between the dominant and non-dominant hand of a piano player.  
**Assumptions**:  
1. Each data point in one data set is related to one, and only one, data point in the other data set
2. Random sampling
3. Population of differences is normal or sample is large enough

**Equation**: $\mu_{dif} \pm t\frac{s_{dif}}{\sqrt{n}}$   
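
A paired interval is a one-mean interval computed on the differences. A minimal sketch with simulated before/after measurements (hypothetical data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Simulated measurements for the same 30 people (hypothetical data)
before = rng.normal(loc=130, scale=10, size=30)
after = before - rng.normal(loc=3, scale=5, size=30)

dif = after - before             # one difference per person
n = len(dif)
mean_dif = np.mean(dif)
s_dif = np.std(dif, ddof=1)      # sample standard deviation of the differences

t_multiplier = stats.t.ppf(0.975, df=n - 1)  # 95% confidence
margin_of_error = t_multiplier * s_dif / np.sqrt(n)
lcb = mean_dif - margin_of_error
ucb = mean_dif + margin_of_error
print(lcb, ucb)
```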

What if normality doesn't hold? **Wilcoxon Signed Rank test**. 
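
A minimal sketch of the paired test with `scipy.stats.wilcoxon`, again on simulated data (hypothetical). The function's default two-sided alternative is used.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Simulated skewed paired measurements (hypothetical data)
before = rng.exponential(scale=2.0, size=25)
after = before + rng.exponential(scale=0.5, size=25) - 0.3

# The test works on the paired differences before - after
statistic, p_value = stats.wilcoxon(before, after)
print(statistic, p_value)
```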

# Hypothesis Testing
-------------
Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution. First, a tentative assumption is made about the parameter or distribution. This assumption is called the null hypothesis and is denoted by $H_0$. An alternative hypothesis (denoted $H_a$), which contradicts what is stated in the null hypothesis, is then defined. The hypothesis-testing procedure involves using sample data to determine whether or not $H_0$ can be rejected. If $H_0$ is rejected, the statistical conclusion is that the data support the alternative hypothesis $H_a$.

### One Proportion Example
In previous years, 52% of parents believed that electronics and social media were the cause of their teenager's lack of sleep. Do more parents today believe that their teenager's lack of sleep is caused by electronics and social media?  

Hypotheses:  
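$H_0: p = 0.52$  
$H_a: p > 0.52$  
Why is the alternative hypothesis ">"? Because the question asks "Do **more** parents today believe ...". Note that the hypotheses are statements about the population proportion $p$, not about the sample proportion $\hat{p}$.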

Significance level:  
$\alpha = 0.05$  (standard is 5%, but you can change it)

In [23]:
# Null hypothesis proportion
p_null = 0.52

# Suppose the observed sample proportion is 0.56
p_actual = 0.56
n = 1018  # sample size

Is the difference between the null and observed proportions significant? Can we reject the null hypothesis ($H_0$)?  
We can answer these questions after calculating the **Z-test** statistic and the **P-value**.

In [24]:
# Standard error computed under the null hypothesis
standard_error = np.sqrt(p_null * (1 - p_null) / n)
z_test = (p_actual - p_null) / standard_error
z_test

2.5545334262132955

That means that our observed sample proportion is 2.555 null standard errors above our hypothesized population proportion.  
Now we need to calculate the **P-value**. There are two approaches:
1. Look it up in a table or use an online p-value calculator
2. Use the scipy library

In [25]:
import scipy.stats.distributions as dist

# One-tailed p-value: probability of a z at least this extreme under H_0
# (for a two-tailed test, multiply by 2)
p_value = dist.norm.cdf(-np.abs(z_test))
p_value

0.005316510991822442
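
How the p-value is computed depends on the sign in $H_a$. A minimal sketch of the three cases, using `scipy.stats.norm` and the numbers from this example (the variable names for the three p-values are my own):

```python
import numpy as np
from scipy.stats import norm

p_null, p_actual, n = 0.52, 0.56, 1018   # values from the example above
standard_error = np.sqrt(p_null * (1 - p_null) / n)
z_test = (p_actual - p_null) / standard_error

p_larger = norm.sf(z_test)                 # H_a: p > p_null (upper tail)
p_smaller = norm.cdf(z_test)               # H_a: p < p_null (lower tail)
p_two_sided = 2 * norm.sf(np.abs(z_test))  # H_a: p != p_null (both tails)
print(p_larger, p_smaller, p_two_sided)
```

Here `p_larger` is the case that matches this example, and the two-sided value is exactly twice as large.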

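The easiest approach is to get the P-value from the Python statsmodels library. The statistic differs slightly from the manual calculation because the count is truncated to a whole number of respondents (570 of 1018).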

In [26]:
z_test, p_value = sm.stats.proportions_ztest(int(n * p_actual),  # positive part
                                             n,  # sample size
                                             p_null,  # null hypothesis
                                             alternative='larger',  # alternative hypothesis
                                             prop_var=0.52)
z_test, p_value

(2.549514696495784, 0.0053936482172321255)

In [27]:
if p_value < 0.05:
    print('Reject the null hypothesis!')
else:
    print("Can't reject the null hypothesis!")

Reject the null hypothesis!


There is sufficient evidence to conclude that the population proportion of parents with a teenager who believe that electronics and social media are the cause of lack of sleep is greater than 52%.

### Two Proportions Difference Example
Is there a significant difference between the population proportions of parents of black children and parents of hispanic children who report that their child has had some swimming lessons?

Hypotheses:  
$H_0: p_1 - p_2 = 0$  
$H_a: p_1 - p_2 \neq 0$  
The alternative allows the difference to be either greater or less than 0, so this is a **two-tailed test**, which needs more evidence against the null hypothesis to reject it!

Significance level:  
$\alpha = 0.10$  
Since we have a two-tailed test, let's increase our significance level (this is optional).

In [43]:
n1 = 247
y1 = 91
p1 = y1 / n1  # 0.37

n2 = 308
y2 = 120
p2 = y2 / n2  # 0.39

# Calculate combined population proportion
p = (y1 + y2) / (n1 + n2)  # 0.38

# Estimate of the variance of the combined population proportion
va = p * (1 - p)  # 0.24

# Standard error of the combined population proportion
se = np.sqrt(va * (1 / n1 + 1 / n2))  # 0.041

z_test = (p1 - p2) / se
z_test

-0.5110545335044571

That means that our observed difference in sample proportions is 0.51 estimated standard errors below the hypothesized difference of zero between the population proportions.

In [44]:
p_value = 2 * dist.norm.cdf(-np.abs(z_test))
p_value

0.6093128715165157

In [45]:
if p_value < 0.10:  # significance level chosen for this test
    print('Reject the null hypothesis!')
else:
    print("Can't reject the null hypothesis!")

Can't reject the null hypothesis!


Formally, based on our sample and our p-value, we fail to reject the null hypothesis. There is not enough evidence to conclude that the two population proportions differ.

*Note: All data assumptions and examples stay the same (see the Confidence Intervals part).*

## Proportions Test
$\hat{p}$ - best estimate (proportion $\frac{part}{all}$)   
$p_{null}$ - null hypothesis proportion  
$n$ - sample size  
$y$ - positive part

### One Proportion 
**Equation**: $test = \frac{\hat{p} - p_{null}} {\sqrt{\frac{p_{null}(1 - p_{null})}{n}}}$  
**Python**: statsmodels.api.stats.proportions_ztest($y, n, p_{null},$ alternative=(depends on your $H_a$ sign))

### Two Proportions 
**Equation**: 

$p_{comb} = \frac{y_1 + y_2}{n_1 + n_2}$  
$test = \frac{p_1 - p_2}{\sqrt{ (p_{comb}(1 - p_{comb})) (\frac{1}{n_1} + \frac{1}{n_2})}}$  
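
The two formulas fit into a small helper. `two_proportion_ztest` is a hypothetical function written for this sketch, not a library call. It assumes a two-sided alternative and is run on the counts from the swimming-lessons example.

```python
import numpy as np
from scipy.stats import norm


def two_proportion_ztest(y1, n1, y2, n2):
    """Hypothetical helper: pooled two-sided z-test for H_0: p1 - p2 = 0."""
    p1, p2 = y1 / n1, y2 / n2
    p_comb = (y1 + y2) / (n1 + n2)  # combined proportion
    standard_error = np.sqrt(p_comb * (1 - p_comb) * (1 / n1 + 1 / n2))
    z_test = (p1 - p2) / standard_error
    p_value = 2 * norm.sf(abs(z_test))  # both tails
    return z_test, p_value


z_test, p_value = two_proportion_ztest(91, 247, 120, 308)
print(z_test, p_value)
```

This reproduces the statistic and p-value obtained step by step in the example.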


## Means Test
$a$ - array of sample observations    
$\mu$ - best estimate (sample mean)  
$\mu_{null}$ - null hypothesis mean  
$n$ - sample size    
$s$ - sample standard deviation   

### One Mean
**Equation**: $test = \frac{\mu - \mu_{null}}{\frac{s}{\sqrt{n}}}$  
**Python**: statsmodels.api.stats.ztest($a,$ value=$\mu_{null},$ alternative=(depends on your $H_a$ sign))
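
A minimal sketch of the equation computed by hand. The data and the null value of 120 are made up for illustration. The p-value uses the t distribution, in line with the multiplier this document uses for means, and `scipy.stats.ttest_1samp` serves as a cross-check.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Simulated blood pressure readings (hypothetical data)
a = rng.normal(loc=122, scale=15, size=60)
mu_null = 120  # hypothetical null hypothesis mean

n = len(a)
test = (np.mean(a) - mu_null) / (np.std(a, ddof=1) / np.sqrt(n))
p_value = 2 * stats.t.sf(abs(test), df=n - 1)  # two-sided alternative

# Cross-check with scipy's one-sample t-test
result = stats.ttest_1samp(a, popmean=mu_null)
print(test, p_value)
print(result.statistic, result.pvalue)
```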

### Two Independent Means Difference
**Equation (Pooled)**: $test = \frac{(\mu_1 - \mu_2) - 0}{\sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$   
**Python Z-test**: statsmodels.api.stats.ztest($a_1, a_2$)  
**Python T-test**: statsmodels.api.stats.ttest_ind($a_1, a_2$)
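
A minimal sketch of the pooled statistic computed by hand on simulated samples (hypothetical data), assuming a two-sided alternative and df = n1 + n2 - 2. `scipy.stats.ttest_ind`, which pools the variances by default, is used as a cross-check.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
# Simulated samples for two independent groups (hypothetical data)
a1 = rng.normal(loc=118, scale=12, size=45)
a2 = rng.normal(loc=123, scale=12, size=50)

n1, n2 = len(a1), len(a2)
s1, s2 = np.std(a1, ddof=1), np.std(a2, ddof=1)

df = n1 + n2 - 2
pooled_sd = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
standard_error = pooled_sd * np.sqrt(1 / n1 + 1 / n2)

test = (np.mean(a1) - np.mean(a2) - 0) / standard_error
p_value = 2 * stats.t.sf(abs(test), df)  # two-sided alternative

# Cross-check with scipy's pooled two-sample t-test
result = stats.ttest_ind(a1, a2)
print(test, p_value)
print(result.statistic, result.pvalue)
```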

### Two Paired Means Difference
**Equation**: $test = \frac{\mu_{diff} - \mu_{null}}{\frac{s_{diff}}{\sqrt{n}}}$     
**Python Z-test**: statsmodels.api.stats.ztest($a_{diff}$), where $a_{diff}$ is the array of paired differences 
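
A minimal sketch of the paired statistic on simulated before/after data (hypothetical). It assumes $\mu_{null} = 0$ and a two-sided alternative, takes the p-value from the t distribution, and uses `scipy.stats.ttest_rel` as a cross-check.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Simulated measurements for the same 35 people (hypothetical data)
before = rng.normal(loc=130, scale=10, size=35)
after = before - rng.normal(loc=2, scale=6, size=35)

diff = after - before  # one difference per person
n = len(diff)
mu_null = 0            # assumed null: no mean difference

test = (np.mean(diff) - mu_null) / (np.std(diff, ddof=1) / np.sqrt(n))
p_value = 2 * stats.t.sf(abs(test), df=n - 1)  # two-sided alternative

# Cross-check with scipy's paired t-test (based on after - before)
result = stats.ttest_rel(after, before)
print(test, p_value)
print(result.statistic, result.pvalue)
```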