Hypothesis Test

1) Hypothesis Testing Framework
* criteria for conducting hypothesis testing:
    * interested in comparison of some population parameters that you don't have access to (but you can collect sample data)
    * have competing hypotheses to test using the sample data
        * example: $H_0$ (Null): men and women make the same amount of money
        * $H_A$ (Alternative): men make more money than women
* setup: **innocent ($H_0$) until proven guilty ($H_A$)**
    * assume null hypothesis is true and look for evidence to the contrary
* hypothesis testing doesn't establish causation; it only supports inference based on the evidence, or lack thereof, against $H_0$
    * conclusions you **can** draw:
        * there is sufficient evidence that John Smith is guilty of murder
        * there is sufficient evidence to reject the null hypothesis that men and women make the same amount of money
    * conclusion you **can't** draw:
        * John Smith is innocent of murder
        * men and women make the same amount of money
* tests: two-sample comparison of means, one-sample proportion test, two-sample comparison of proportions, etc.
* steps:
    1. state the null hypothesis ($H_0$) and the alternative hypothesis ($H_A$)
    2. choose a significance level, $\alpha$ (typically $\alpha=0.05$)
    3. select statistical test, and compute appropriate **test statistic**
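        * the test statistic is often a standardized sample mean (or difference of two sample means); by the CLT the sample mean is approximately normal, and because $\sigma$ is estimated by $s$ we compare the statistic to the t-distribution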
    4. compute **p-value** based on test statistic:
        * if p-value < $\alpha \Rightarrow$ reject $H_0$ in favor of $H_A$
        * if p-value $\geq \alpha \Rightarrow$ fail to reject $H_0$
* hypothesis test example:
* $n=30$, sample mean: $\bar{x}=102$, sample standard deviation: $s=7$
    1. $H_0: \mu = 100$ and $H_A: \mu \neq 100$
    2. $\alpha = 0.05$
    3. $t = \frac{\bar{x}-\mu_0}{\frac{s}{\sqrt{n}}}$ where $s=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}$
        * $t = \frac{102-100}{\frac{7}{\sqrt{30}}} = 1.565$
    4. since we are doing a two-sided test, we take area to the left of $t=-1.565$ and right of $t=+1.565$
        * computed p-value based on $t$: p-value $\approx$ 0.1284 > 0.05 $\Rightarrow$ fail to reject null $H_0$
* $n=\textbf{100}$, sample mean: $\bar{x}=102$, sample standard deviation: $s=7$
    1. $H_0: \mu = 100$ and $H_A: \mu \neq 100$
    2. $\alpha = 0.05$
    3. $t = \frac{\bar{x}-\mu_0}{\frac{s}{\sqrt{n}}}$ where $s=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}$
        * $t = \frac{102-100}{\frac{7}{\sqrt{\textbf{100}}}} = \textbf{2.857}$
    4. since we are doing a two-sided test, we take area to the left of $t=-2.857$ and right of $t=+2.857$
        * computed p-value based on $t$: p-value $\approx$ 0.0052 < 0.05 $\Rightarrow$ reject null $H_0$ in favor of $H_A$
* what is a t-distribution? it is essentially a fat-tailed Normal distribution that approaches normal as degrees of freedom $\rightarrow \infty$
![t_dist](http://ci.columbia.edu/ci/premba_test/c0331/images/s7/6317178747.gif)
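    * why does the $t = \frac{\bar{x}-\mu_0}{\frac{s}{\sqrt{n}}}$ test statistic follow a t-distribution? by the **Central Limit Theorem (CLT)**, $\bar{x}$ is approximately normal; replacing the unknown $\sigma$ with the sample estimate $s$ adds extra variability, which gives the fatter-tailed t-distribution with $n-1$ degrees of freedom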
* constructing confidence interval 
    * example: $n=30$, sample mean: $\bar{x}=102$, sample standard deviation: $s=7$
        * CI for $\mu$ (population mean): $\begin{aligned} \Big(\bar{x}-t_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}},\ \bar{x}+t_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}}\Big) 
            & \approx 102 \pm 2\cdot\frac{7}{\sqrt{30}} \\ 
            & = (99.44, 104.56) \end{aligned}$ at $\alpha = 0.05$ (i.e. $\frac{\alpha}{2} = 0.025$ in each tail)
        * $qt(0.975, df=10) = 2.23$
        * $qt(0.975, df=30) = 2.04$
        * appropriate conclusion: we are 95% confident that $\mu$ lies in the interval $(99.44, 104.56)$, meaning 95% of intervals constructed this way would contain $\mu$
        * **not** appropriate conclusion: the probability that the true population mean $\mu$ is in the range $(99.44, 104.56)$ is 95% (this is a Bayesian statement, which this frequentist procedure does not support)
    * all equations necessary to get CI (example)
        * get sample data, find t-statistic 
            * $\bar{x}=\frac{x_1+\dots+x_n}{n}$
            * $s=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}$
                * $\downarrow$
            * $t=\frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}$
        * find $c$ such that the t-statistic falls in $(-c, c)$ 95% of the time ($c \approx 2$), then rearrange the inequality to isolate the population mean, $\mu$
            * $P(-c\leq t \leq c)=0.95$
            * $P(\bar{x}-\frac{cs}{\sqrt{n}}\leq \mu \leq \bar{x}+\frac{cs}{\sqrt{n}})=0.95$
            * $(\bar{x}-\frac{cs}{\sqrt{n}},\bar{x}+\frac{cs}{\sqrt{n}})$
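
The one-sample test and confidence interval above can be reproduced with a short standard-library sketch. The helper names (`t_pdf`, `t_cdf`, `t_quantile`, `one_sample_t`) are made up for illustration, and the t-distribution CDF is approximated by numerical integration rather than taken from a statistics package, so treat this as a check on the arithmetic, not a reference implementation.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x): integrate the density over [0, |x|] with Simpson's rule."""
    a = abs(x)
    if a == 0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area

def t_quantile(p, df):
    """Invert t_cdf by bisection (slow but simple; illustration only)."""
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def one_sample_t(n, xbar, s, mu0, alpha=0.05):
    """Two-sided one-sample t-test plus the matching (1 - alpha) CI for mu."""
    se = s / math.sqrt(n)
    df = n - 1
    t = (xbar - mu0) / se
    p_value = 2 * (1 - t_cdf(abs(t), df))
    c = t_quantile(1 - alpha / 2, df)  # critical value, roughly 2
    return t, p_value, (xbar - c * se, xbar + c * se)

# The two worked examples: same mean and s, different n.
results = {n: one_sample_t(n, xbar=102, s=7, mu0=100) for n in (30, 100)}
for n, (t, p, ci) in results.items():
    print(f"n={n}: t={t:.3f}, p={p:.4f}, CI=({ci[0]:.2f}, {ci[1]:.2f})")
```

Because the sketch uses the exact critical value for $df = n-1$ instead of the rounded $c \approx 2$, the $n=30$ interval comes out slightly wider than $(99.44, 104.56)$.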

2) Type I and Type II Errors
* **p-value** - the probability of observing the data we observed, or more extreme, given the null hypothesis is true
* **Type I error** - rejecting $H_0$ when $H_0$ is true (false positive); $\alpha = P($reject $H_0$ $\big|$ $H_0$ is true$)$
* **Type II error** - failing to reject $H_0$ when $H_0$ is false (false negative); $\beta = P($fail to reject $H_0$ $\big|$ $H_0$ is false$)$
* look at the tail end(s) of the distribution for the sample mean under the null hypothesis
* we can always be wrong due to random variation
* by choosing $\alpha$ up front, we accept a Type I error rate of $\alpha$
* questions to ask:
    1. What happens when we increase the sample size?
    
| Ground truth $\rightarrow$<br/>$\downarrow$ Hypothesis Test             | $H_0$ is true                      | $H_0$ is false                    |
|--------------:|:----------------------------------:|:---------------------------------:|
| Fail to reject $H_0$ | Correct Decision<br/> ($1-\alpha$) | Type II Error<br/> ($\beta$)      |
| Reject $H_0$ | Type I Error<br/> ($\alpha$)       | Correct Decision<br/> ($1-\beta$) |
![typei_ii_errors](http://www.avance.ch/newsletter/Avance_on_statistics/errors.png)
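
To explore the question of what happens when the sample size increases, here is a minimal sketch that computes the rejection probability of a two-sided test analytically. It makes two simplifying assumptions that are not in the notes: $\sigma = 7$ is treated as known (so a z-test stands in for the t-test), and the true mean under $H_A$ is taken to be 102. The function name `z_test_power` is hypothetical.

```python
import math
from statistics import NormalDist

def z_test_power(mu0, mu_true, sigma, n, alpha=0.05):
    """P(reject H0: mu = mu0) for a two-sided z-test when the true mean is mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # How far the sampling distribution of the z-statistic is shifted from 0.
    shift = (mu_true - mu0) * math.sqrt(n) / sigma
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail

# H0 true: the rejection probability is the Type I error rate, alpha.
type_i = z_test_power(mu0=100, mu_true=100, sigma=7, n=30)

# H0 false (assumed true mean 102): rejection probability is the power, 1 - beta.
power_30 = z_test_power(mu0=100, mu_true=102, sigma=7, n=30)
power_100 = z_test_power(mu0=100, mu_true=102, sigma=7, n=100)

print(f"Type I error rate: {type_i:.3f}")
print(f"power at n=30:  {power_30:.3f} (beta = {1 - power_30:.3f})")
print(f"power at n=100: {power_100:.3f} (beta = {1 - power_100:.3f})")
```

Under these assumptions $\alpha$ stays fixed at 0.05 regardless of $n$, while $\beta$ shrinks as $n$ grows.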

3) **Two-sample t-test for Comparison of Means**
* assumptions:
    * population distribution is normal (often not true)
        * if the population distributions are not too far from normal and the sample sizes are large, the CLT makes the sample means approximately normal, which allows this comparison
    * standard deviations are equal (often not true)
        * can apply **Welch's t-test**, which works with equal or unequal sample sizes, and unequal variances
        * possible variations:
            * equal sample sizes, equal variance
            * equal or unequal sample sizes, equal variance
            * equal or unequal sample sizes, unequal variance
* hypothesis test example:
* $n_1=20, n_2=30$, sample means: $\bar{x}_1=101,\bar{x}_2=95$, sample standard deviations: $s_1=7,s_2=5$
    1. $H_0: \mu_1 = \mu_2$ and $H_A: \mu_1 \gt \mu_2$
    2. $\alpha = 0.05$
    3. $t = \frac{\bar{x}_1-\bar{x}_2}{s_{\bar{x}_1-\bar{x}_2}}$ where $s_{\bar{x}_1-\bar{x}_2}=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$
        * $t = \frac{101-95}{\sqrt{\frac{49}{20}+\frac{25}{30}}} = 3.311$
    4. since we are doing a one-sided test, we take area to the right of $t=+3.311$
        * computed p-value based on $t$: p-value $\approx$ 0.00116 < 0.05 $\Rightarrow$ reject null $H_0$ in favor of $H_A$
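
The test statistic in this example is the unequal-variance (Welch) form. A small sketch of the calculation, including the Welch-Satterthwaite approximation for the degrees of freedom, which the example above does not state explicitly; the function name `welch_t` is made up for illustration.

```python
import math

def welch_t(n1, xbar1, s1, n2, xbar2, s2):
    """Welch's t-statistic and its approximate degrees of freedom."""
    v1 = s1 ** 2 / n1  # estimated variance of the first sample mean
    v2 = s2 ** 2 / n2  # estimated variance of the second sample mean
    t = (xbar1 - xbar2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation; generally not an integer.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

t_stat, df = welch_t(n1=20, xbar1=101, s1=7, n2=30, xbar2=95, s2=5)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
# The one-sided p-value is the area to the right of t_stat under a
# t-distribution with df degrees of freedom; if SciPy is installed,
# scipy.stats.t.sf(t_stat, df) computes it.
```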

4) **Two-sample z-test for Comparison of Proportions**
* this hypothesis testing for proportions isn't much different from comparisons of means
    * we are still averaging $x_1,x_2,\dots,x_n$ except instead of taking on any values, they only take 0 or 1
    * means: $X$~$?(\mu,\sigma^2)$ $\rightarrow$(by CLT) $\rightarrow$ $\bar{X}$~$Normal(\mu,\frac{\sigma^2}{n})$
    * proportions: $X$~$Bernoulli(p)$ $\rightarrow$(by CLT) $\rightarrow$ $\bar{X}$~$Normal(p,\frac{p(1-p)}{n})$
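* why do we use a $z$-statistic instead of $t$?
    * the t-test was needed because $\sigma$ had to be estimated separately with the sample standard deviation, $s$
    * in this case, we just have a single parameter, $p$, not $(\mu,\sigma)$: the variance $p(1-p)$ is determined by $p$ itself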
* hypothesis test example:
* $n_1=300, n_2=1000$, sample proportion: $\hat{p}_1=0.05,\hat{p}_2=0.03$
    1. $H_0: p_1 = p_2$ and $H_A: p_1 \gt p_2$ (hypotheses are about the population proportions, not the sample estimates $\hat{p}_1, \hat{p}_2$)
    2. $\alpha = 0.05$
    3. $z = \frac{\hat{p}_1-\hat{p}_2-0}{\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}} = \frac{0.05-0.03}{\sqrt{\frac{0.05*0.95}{300}+\frac{0.03*0.97}{1000}}} = 1.46$
    4. since we are doing a one-sided test, we take area to the right of $z=+1.46$
        * computed p-value based on $z$: p-value $\approx$ 0.072 > 0.05 $\Rightarrow$ fail to reject null $H_0$
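
The proportions example can be checked with the standard library's `statistics.NormalDist`. This is a sketch of the unpooled standard error used in step 3; the function name `two_proportion_z` is hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z(p1_hat, n1, p2_hat, n2):
    """One-sided (H_A: p1 > p2) two-sample z-test with an unpooled standard error."""
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    z = (p1_hat - p2_hat) / se
    p_value = 1 - NormalDist().cdf(z)  # area to the right of z
    return z, p_value

z, p_value = two_proportion_z(p1_hat=0.05, n1=300, p2_hat=0.03, n2=1000)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.3f}")
```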

5) Multiple Comparisons Problem
* Running a hypothesis test and setting $\alpha=0.05$
    * 1st time we run a test, 5% chance of getting Type I error (with 95% chance of not getting a false positive)
    * 2nd time we run a test, additional 5% chance of Type I error (probability of no Type I error for both tests at $0.95^2 = 0.9025$)
    * after $n$ tests, probability of no Type I error for any of $n$ tests is $0.95^n$
* example: we want to try 100 variations of original layout of website with small tweaks such as magenta button color, panda icon, etc.
    * even if none of the changes made a difference, expect ~5 variations to appear "successful" by chance
    * there are various methods to counteract multiple comparisons problem:
        1. **Bonferroni adjustment**
        2. Fisher's least-significant-difference
        3. Duncan's test
        4. Scheffe's test
        5. Tukey's test
        6. Dunnett's test
* **Bonferroni correction**
    * hypotheses $H_1,\dots,H_m$ with corresponding p-values $p_1,\dots,p_m$
        * $m$ is the total # of null hypotheses
        * $m_0$ is the number of true null hypotheses
    * **Boole's inequality (aka union bound)** applies here saying that for any finite set of events, the probability that at least one of the events happens is no greater than the sum of probabilities of the individual events
        * for set of events: $A_1,A_2,A_3,\dots,A_n$
        * $P(\bigcup_i A_i) \leq \sum_i P(A_i)$
    * **Familywise error rate (FWER)** - the probability of rejecting at least one true $H_i$, i.e. of making at least one Type I error
        * $FWER = P\Big\{\bigcup_{i=1}^{m_0}\big(p_i \leq \frac{\alpha}{m}\big) \Big\} \leq \sum_{i=1}^{m_0} \big\{ P\big(p_i \leq \frac{\alpha}{m}\big)\big\} = m_0 \frac{\alpha}{m} \leq m \frac{\alpha}{m} = \alpha$
        * this doesn't require any assumptions about dependence among the p-values or about how many of the null hypotheses are true
    * use $\frac{\alpha}{m}$ instead of $\alpha$ when we examine the resulting p-values, $p_1,\dots,p_m$
    * example: testing 10 hypotheses, $m=10$ where $A_1,A_2,A_3,\dots,A_{10}$
        * if $\alpha=0.05$, we want the overall Type I error to be bounded by 5%
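        * the bound holds whether or not the tests are independent (having nothing to do with each other); if they are independent, the FWER is $1-(1-0.005)^{10} \approx 0.049$, just under the bound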
        * it's conservative to measure each hypothesis against an **adjusted significance level**, $\frac{0.05}{10}=0.005$
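
A short sketch of both ideas in this section: how quickly the chance of at least one false positive grows across independent tests, and how the Bonferroni threshold $\frac{\alpha}{m}$ is applied. The function names and the list of p-values are invented purely for illustration.

```python
def prob_any_false_positive(alpha, n_tests):
    """P(at least one Type I error) across n independent tests with all nulls true."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_i only when p_i <= alpha / m, where m is the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

for n in (1, 2, 10, 100):
    print(f"{n:>3} tests: P(any false positive) = {prob_any_false_positive(0.05, n):.4f}")

# Hypothetical p-values from m = 10 tests (made-up numbers).
p_values = [0.001, 0.004, 0.006, 0.03, 0.04, 0.2, 0.35, 0.5, 0.7, 0.9]
naive = [p <= 0.05 for p in p_values]
adjusted = bonferroni_reject(p_values, alpha=0.05)
print("rejections at alpha = 0.05:     ", sum(naive))
print("rejections at alpha / m = 0.005:", sum(adjusted))
```

With these made-up p-values the unadjusted threshold rejects five hypotheses while the Bonferroni threshold rejects two, which shows how conservative the adjustment is.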

6) Chi-square Test (related to Comparison of Proportions)
* this test is a general method for comparing fact (observed counts) with theory (the counts a hypothesis predicts)
* this approach assumes **sampled units fall randomly into cells**, and that the chance of a unit falling into particular cell can be estimated from the theory we're testing (not unlike $H_0$)
    * assume a hypothesis ($H_0$), collect some data, and see if test statistic leads one to want to reject that assumption
    * example: is there a relationship between age and investment preference?
        * $\chi^2 = \sum \frac{(observed-expected)^2}{expected}$

|           | Stocks | Bonds | Cash |     |
|:---------:|:------:|:-----:|:----:|:---:|
| Age 25-34 |   30   |   10  |   1  |  41 |
| Age 35-44 |   35   |   25  |   2  |  62 |
| Age 45-54 |   38   |   35  |   4  |  77 |
| Age 55-70 |   22   |   30  |   4  |  56 |
|           |   125  |  100  |  11  | **236** |

* if $Z_1,\dots,Z_k$ are independent, standard normal random variables, then the sum of their squares
    * $Q = \sum_{i=1}^k Z_i^2$
    * is distributed according to the chi-squared distribution with $k$ degrees of freedom:
    * $Q$ ~ $\chi^2(k)$ or $Q$ ~ $\chi_k^2$
![chi_square_dist](https://upload.wikimedia.org/wikipedia/commons/thumb/3/35/Chi-square_pdf.svg/321px-Chi-square_pdf.svg.png)
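
The definition of $Q$ can be checked by simulation. The sketch below draws sums of $k$ squared standard normals and compares the sample mean and variance with the known values for a $\chi^2_k$ distribution (mean $k$, variance $2k$). The choices $k=3$, 20,000 draws, and the seed are arbitrary.

```python
import random

def simulate_q(k, n_draws, seed=0):
    """Draw Q = Z_1^2 + ... + Z_k^2 repeatedly, with Z_i independent standard normals."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0, 1) ** 2 for _ in range(k)) for _ in range(n_draws)]

k = 3
draws = simulate_q(k, n_draws=20000)
mean_q = sum(draws) / len(draws)
var_q = sum((q - mean_q) ** 2 for q in draws) / (len(draws) - 1)

print(f"sample mean     = {mean_q:.2f} (theory: {k})")
print(f"sample variance = {var_q:.2f} (theory: {2 * k})")
```
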
* **Chi-Square Test of Independence** - hypothesis tests where the assumption is that there is no relationship between the variables
    1. compute the expected table under the assumption of *no relationship between Race of Victim and Death Penalty* (expected count = row total $\times$ column total / grand total)
    2. compute $\chi^2$ test statistic
        * Yes/White: $\frac{(130)(59)}{362} = 21.19$
        * No/Black: $\frac{(232)(303)}{362} = 194.19$
        * $\chi^2 = \frac{(45-21.19)^2}{21.19} + \cdots + \frac{(218-194.19)^2}{194.19} = 49.89$
    3. with $df = (2-1)(2-1) = 1$, p-value = $P(\chi^2>49.89) = 1.626e-12 < 0.0001 \rightarrow$ reject null $H_0$ (there is evidence of a relationship between race of victim and death penalty)

| Death Penalty$\rightarrow$<br/>$\downarrow$Race of Victim | Yes |  No | Totals |
|--------------------------------:|:---:|:---:|:------:|
|               White              |  45 |  85 |   **130**  |
|               Black              |  14 | 218 |   232  |
|              Totals              |  **59** | 303 |   **362**  |

| Expected Table      |  Yes  |   No   |
|:-----:|:-----:|:------:|
| White | 21.19 | 108.81 |
| Black | 37.81 | 194.19 |
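
The expected table and $\chi^2$ statistic above follow mechanically from the observed counts. A minimal sketch, with a hypothetical helper `chi_square_independence`; the p-value line uses the identity $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$, which holds only for one degree of freedom.

```python
import math

def chi_square_independence(observed):
    """Chi-square statistic, degrees of freedom, and expected counts for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Under independence: expected count = row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

# Race of victim (rows: White, Black) by death penalty (columns: Yes, No).
race = [[45, 85], [14, 218]]
chi2, df, expected = chi_square_independence(race)
p_value = math.erfc(math.sqrt(chi2 / 2))  # valid for df = 1 only
print(f"chi2 = {chi2:.2f}, df = {df}, p-value = {p_value:.3e}")

# The same helper applies to the age by investment table from earlier in this
# section (rows: age groups; columns: Stocks, Bonds, Cash).
age = [[30, 10, 1], [35, 25, 2], [38, 35, 4], [22, 30, 4]]
chi2_age, df_age, _ = chi_square_independence(age)
print(f"age table: chi2 = {chi2_age:.2f}, df = {df_age}")
```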

* **Chi-Square Goodness of Fit Test** - hypothesis tests where the assumption is that the data is consistent with the specified distribution
    * uses similar methodology as two-sample comparison of proportions tests (specify the proportion of observations at each level of categorical variable)
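    * example: race and death penalty dataset, tested as a two-sample comparison of proportions (note that $z^2 = 7.063^2 \approx 49.89$ matches the $\chi^2$ statistic from the test of independence)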
    * $n_{white}=130, n_{black}=232, \hat{p}_w=\frac{45}{130}, \hat{p}_b=\frac{14}{232}$
        1. hypothesis: $H_0: p_{white} = p_{black}$ and $H_A: p_{white} \neq p_{black}$
            * let $\alpha=0.05$
        2. $Z = \frac{(\hat{p}_w - \hat{p}_b)-0}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_{white}}+\frac{1}{n_{black}})}} = 7.063152$ where $\hat{p} = \frac{45+14}{130+232}$ is the pooled proportion
        3. since we are doing a two-sided test, we take area to the left of $z = -7.063$ and right of $z = +7.063$
        4. compute p-value $\approx 1.628e-12 \Rightarrow$ reject null $H_0$ in favor of $H_A$ 
    * example: testing whether the owner's expected distribution of customers over 6 days is consistent with the observed data
        * (observed): some actual data on customer flow
        * $H_0$: owner's distribution is correct
        * $H_A$: owner's distribution is not correct
        * resulting chi-squared test statistic: $\chi^2 = \frac{(30-20)^2}{20} + \cdots + \frac{(20-30)^2}{30} = 11.44$
        * compare test statistic to $\chi^2$ distribution with $df = 6 - 1 = 5$
        * p-value = $P(\chi^2>11.44)=0.0433 < 0.05 \Rightarrow$ reject the null $H_0$ that the owner's distribution is correct
        
|     Day    |  M | Tu | W  | Th | F  | S  | Total |
|:----------:|:--:|:--:|----|----|----|----|-------|
| Expected % | 10 | 10 | 15 | 20 | 30 | 15 | 100   |
|  Observed  | 30 | 14 | 34 | 45 | 57 | 20 | 200   |
| Expected   | 20 | 20 | 30 | 40 | 60 | 30 | 200   |
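
The customer-flow example can be reproduced from the table. In this sketch the helper `chi2_sf` (a made-up name) evaluates the upper-tail probability of the chi-square distribution using the finite-series forms that exist for integer degrees of freedom, so it avoids any third-party dependency; it is meant as a check on the arithmetic only.

```python
import math

def chi2_sf(x, df):
    """P(chi-square with integer df > x), using closed-form finite series."""
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        return math.exp(-half) * sum(half ** j / math.factorial(j)
                                     for j in range(df // 2))
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_j (x/2)^(j + 1/2) / Gamma(j + 3/2)
    series = sum(half ** (j + 0.5) / math.gamma(j + 1.5)
                 for j in range((df - 1) // 2))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * series

expected_pct = [10, 10, 15, 20, 30, 15]  # owner's claimed distribution, M through S
observed = [30, 14, 34, 45, 57, 20]      # observed customers per day

n = sum(observed)
expected = [n * pct / 100 for pct in expected_pct]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
p_value = chi2_sf(chi2, df)

print(f"expected counts = {expected}")
print(f"chi2 = {chi2:.2f}, df = {df}, p-value = {p_value:.4f}")
```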



7) Experimental Design For A/B Testing
* Experimental vs Observational
    * **Experimental** - *apply treatments* to experimental units (e.g. people, animals, land, etc.) and observe effect of treatment
        * conclusion: **establishes causality**
        * example: randomly assigning students to do or not do homework and comparing the performance of the two groups
    * **Observational** - observe subjects and measure variables of interest *without assigning treatments* to subjects
        * conclusion: **can't establish causality**
        * example: comparing the grades of students who did and didn't do their homework, without controlling who does it
* **Experimental Design**
    * randomization into groups of (roughly) equal sizes
        * randomly generate a number from 0 to 1 for each student
        * if $\leq 0.5$, assign the student to the homework group; otherwise assign to the no-homework group
    * assume independent observations
        * assume the students don't know if the other students have to do homework or not
        * otherwise, that knowledge might affect their performance
* **Confounding factor** - an extraneous attribute that correlates with the dependent variable (e.g. performance) and the independent variable (e.g. did homework or didn't do homework)
    * what are possible confounding factors?
        * how hard-working the student is
        * more hard-working students might perform better and are more likely to do their homework
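
The randomization rule described under Experimental Design can be sketched as follows. The first function follows the "random number $\leq 0.5$" rule from the notes, which gives only roughly equal groups; the second is an alternative (an assumption, not from the notes) that shuffles the units and splits them in half when exactly equal sizes are wanted. Student IDs and function names are hypothetical.

```python
import random

def assign_by_coin_flip(units, rng):
    """Assign each unit independently: random number <= 0.5 means homework group."""
    groups = {"homework": [], "no_homework": []}
    for unit in units:
        key = "homework" if rng.random() <= 0.5 else "no_homework"
        groups[key].append(unit)
    return groups

def assign_equal_groups(units, rng):
    """Shuffle the units and split down the middle for exactly equal group sizes."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"homework": shuffled[:half], "no_homework": shuffled[half:]}

rng = random.Random(42)  # fixed seed so the assignment is reproducible
students = [f"student_{i:03d}" for i in range(100)]  # hypothetical IDs

coin = assign_by_coin_flip(students, rng)
equal = assign_equal_groups(students, rng)
print("coin flip sizes:", {k: len(v) for k, v in coin.items()})
print("shuffle sizes:  ", {k: len(v) for k, v in equal.items()})
```

Because assignment depends only on the random draw, a trait such as how hard-working a student is cannot systematically differ between groups, which is what guards against confounding.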