# Hypothesis testing 

**What is statistical hypothesis testing?**

When we perform experiments, we typically do not have access to all the members of a population, and need to take **samples** of measurements. 

A statistical hypothesis test is a method for testing a hypothesis about a parameter in a population using data measured in a sample. 

We test a hypothesis by determining how likely it would be to obtain a sample statistic at least as extreme as the one observed if the hypothesis about the population parameter were true. 

> The goal of hypothesis testing is to make a decision about the value of a population parameter based on sample data.

**Why do we care about hypothesis testing?**

Scenarios: 
* Chemistry - do inputs from two different barley fields produce different yields?
* Astrophysics - do star systems with near-orbiting gas giants have hotter stars?
* Economics - demography, surveys, etc.
* Medicine - BMI vs. Hypertension, etc.
* Business - which ad is more effective given engagement?



**Intuition** 

Suppose you have a large dataset for a population. The data is normally distributed with mean 0 and standard deviation 1.

Along comes a new sample with a sample mean of 2.9.

> The idea behind hypothesis testing is a desire to quantify our belief as to whether our sample of observations came from the same population as the original dataset. 

According to the empirical rule for normal distributions, only about 0.3% of values from this population fall more than 3 standard deviations from the mean. A value of 2.9 is roughly 3 standard deviations above the mean, so it would be very unusual if it came from the same population. 
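The empirical-rule figure can be checked against the standard normal distribution directly. This is a minimal sketch, assuming the population really is standard normal and treating 2.9 as a single draw from it; the variable names are illustrative.

```python
from scipy import stats

value = 2.9  # observed value, measured in standard deviations from the mean

# Probability of landing at least this far from the mean, in either direction
two_sided_tail = 2 * stats.norm.sf(abs(value))

# The same quantity at exactly 3 standard deviations, for comparison
three_sd_tail = 2 * stats.norm.sf(3)

print(round(two_sided_tail, 4), round(three_sd_tail, 4))
```

Both probabilities are well below 1%, which is what makes the new value look surprising under the original population.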


<img src="images/normal_sd_new.png" width="500">
 
To formalize this intuition, we define a threshold value for deciding whether we believe that the sample is from the same underlying population or not. This threshold is $\alpha$, the **significance threshold**.  

This serves as the foundation for hypothesis testing where we will reject or fail to reject the null hypothesis.


In [1]:
# imports 
import numpy as np
import scipy.stats as stats 
import pandas as pd 

## Let's kick off our discussion with an example! 

Suppose that African elephants have weights distributed normally around a mean of 9000 lbs with a standard deviation of 900 lbs. _Pachyderm Adventures_ has recently measured the weights of **35** Gabonese elephants and has calculated their average weight at 8637 lbs. 

Is the weight of Gabonese elephants different from the weight of African elephants? Use significance level $\alpha = 0.05$. 

**What are the null and alternative hypotheses? What is the significance level?**

* Null hypothesis
    * The average weight of Gabonese elephants is the same as the average weight of African elephants.

* Alternative hypothesis
    * The average weight of Gabonese elephants is different than the average weight of African elephants.

Significance level: $\alpha = 0.05$

**What should be our test statistic? Are we running an upper, lower, or two-tailed test? Why?**

* The test statistic for this test is the z-statistic. We are running a two-tailed one-sample z-test. 
    * The sample size $n \geq 30$, and we know the population mean and standard deviation. We assume that the data is normally distributed. 
    * It's a two-tailed test because we want to know if the weights are _different_. 
    


**What's the critical test statistic value we should use?**

**Note to instructors**:

At this point, you may want to clarify what decision rules look like for lower, upper, and two-tailed tests.

**Decision Rule**: 

The decision rule tells us when we can reject the null hypothesis. 

It depends on 3 factors: 
1. The alternative hypothesis 
    * Is this an upper-tailed, lower-tailed, or two-tailed test?
2. The test statistic 
3. The level of significance 

Upper-tailed test (right-tailed test): 
* The null hypothesis is rejected if the test statistic is greater than the critical value. 

Lower-tailed test (left-tailed test): 
* The null hypothesis is rejected if the test statistic is smaller than the critical value.

Two-tailed test:
* The null hypothesis is rejected if the test statistic is either larger than an upper critical value or smaller than a lower critical value.
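The three decision rules can be sketched in code for a z-test. The helper names `critical_values` and `reject_null`, and the example statistic `z = 2.1`, are hypothetical and only serve to illustrate the rules above.

```python
from scipy import stats

def critical_values(alpha, tail):
    """Return the critical z value(s) for a 'lower', 'upper', or 'two' tailed test."""
    if tail == "lower":
        return (stats.norm.ppf(alpha),)
    if tail == "upper":
        return (stats.norm.ppf(1 - alpha),)
    if tail == "two":
        # Split alpha evenly between the two tails
        return (stats.norm.ppf(alpha / 2), stats.norm.ppf(1 - alpha / 2))
    raise ValueError("tail must be 'lower', 'upper', or 'two'")

def reject_null(z, alpha, tail):
    """Apply the decision rule for the given tail type."""
    crit = critical_values(alpha, tail)
    if tail == "lower":
        return bool(z < crit[0])
    if tail == "upper":
        return bool(z > crit[0])
    return bool(z < crit[0] or z > crit[1])

z = 2.1  # made-up test statistic for illustration
for tail in ("lower", "upper", "two"):
    print(tail, critical_values(0.05, tail), reject_null(z, 0.05, tail))
```

The same statistic can lead to different decisions depending on which tail the alternative hypothesis points to.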

<img src="images/hypothesis_test.png" width="500">

In [2]:
# critical z-statistic
alpha = 0.05
stats.norm.ppf(alpha/2), stats.norm.ppf(1-alpha/2)

(-1.9599639845400545, 1.959963984540054)

> Since we are performing a two-tailed one-sample z-test and $\alpha = 0.05$, if the z-score we compute is greater than 1.96 or smaller than -1.96, then we can reject the null hypothesis at significance level 0.05 in favor of the alternative hypothesis. 

**Perform the test.**

Compute the z-statistic. 

In [3]:
n = 35
sigma = 900

x_bar = 8637
mu = 9000

se = sigma/np.sqrt(n)
z = (x_bar - mu)/se
print(z)

-2.386152179183512


**Make a decision: do we reject the null hypothesis or not?**

> z = -2.39 is smaller than -1.96, thus we can reject the null hypothesis in favor of the alternative hypothesis at significance level $\alpha = 0.05$. 
- - -

Another way of getting to the same answer: 

In [4]:
stats.norm.cdf(z)

0.008512852080791552

> The area of the lower tail corresponding to this z-score is 0.0085, which is below $\alpha/2 = 0.025$. Thus we reject the null hypothesis in favor of the alternative at significance level $\alpha = 0.05$. 

**Would we be able to reject the null hypothesis if our significance threshold was $\alpha = 0.01$?**

In [5]:
alpha = 0.01
print(alpha/2)

0.005


The area of the tail corresponding to z = -2.386 is `stats.norm.cdf(z) = 0.0085`. 

> Since the area of the tail corresponding to the z-score we obtained is 0.0085, which is greater than 0.005, we cannot reject the null hypothesis in favor of the alternative at a significance level of $\alpha = 0.01$. 
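Comparing one tail area with $\alpha/2$ is equivalent to doubling that tail area and comparing it with $\alpha$ directly. A minimal sketch of that two-sided p-value for the elephant data (the variable name `p_two_sided` is illustrative):

```python
import numpy as np
from scipy import stats

# z-statistic for the Gabonese elephant sample
z = (8637 - 9000) / (900 / np.sqrt(35))

# Two-sided p-value: area in both tails beyond |z|
p_two_sided = 2 * stats.norm.cdf(-abs(z))
print(p_two_sided)

for alpha in (0.05, 0.01):
    print(alpha, "reject" if p_two_sided < alpha else "fail to reject")
```

The p-value is roughly 0.017, so the null hypothesis is rejected at $\alpha = 0.05$ but not at $\alpha = 0.01$, matching the decisions above.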

What if we wanted to test if the average weight of Gabonese elephants was _less_ than the average weight of African elephants at a significance level of 0.05?

**What are the null and alternative hypotheses in this case?**

**What kind of test do we need to run?**

**What's the critical test statistic value we should use?**

**Perform the test and make a decision regarding the null hypothesis.**

Would we be able to reject the null hypothesis in this case if the significance level was $\alpha = 0.01$?

* Null hypothesis
    * The weight of Gabonese elephants is not different than the weight of African elephants.

* Alternative hypothesis
    * The weight of Gabonese elephants is less than the weight of African elephants.
- - - 

We need to run a lower-tailed one-sample z-test. 

In [6]:
z_critical = stats.norm.ppf(0.05)
z_critical

-1.6448536269514729

In [7]:
# compute the z-statistic
n = 35
sigma = 900

x_bar = 8637
mu = 9000

se = sigma/np.sqrt(n)
z = (x_bar - mu)/se
print(z)

-2.386152179183512


> Our z-statistic is smaller than the critical z-statistic, thus we can reject the null hypothesis in favor of the alternative, at a significance level of 0.05. 

Alternatively:

In [8]:
stats.norm.cdf(z)

0.008512852080791552

> The area under the tail corresponding to this z-score is below 0.05. Thus we can reject the null in favor of the alternative hypothesis at $\alpha =0.05$. 

> We would be able to reject the null hypothesis at $\alpha = 0.01$, since the area under the tail corresponding to our z-statistic is smaller than 0.01 (our test statistic lies in the rejection region). 

**Note to instructors:**

Thus far, students have performed a lower-tail one-sample z-test and a two-tailed one-sample z-test, for significance levels $\alpha = 0.05$ and $\alpha = 0.01$. 

The only one of these tests where they should have failed to reject the null hypothesis is for the two-tailed one-sample z-test at $\alpha=0.01$. 

> We can have the same sample data, but if the hypothesis test we're performing is different and/or if the significance level we want to test is different, we may make different decisions. 

> Lower values for the significance threshold $\alpha$ place more stringent requirements on the evidence we need to reject the null hypothesis.

# Hypothesis testing 

Regardless of the type of statistical hypothesis test you're performing, there are five main steps to executing them:

1. Set up a null and alternative hypothesis 

2. Choose a significance level $\alpha$ (or use the one assigned). 

3. Determine the critical test statistic value or p-value. Find the rejection region.

4. Calculate the value of the test statistic. 

5. Compare the test statistic to the critical value (or the p-value to $\alpha$) and decide whether to reject the null hypothesis.
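The five steps can be sketched as a single function for a two-tailed one-sample z-test. The function name `one_sample_z_test` and its return format are hypothetical, chosen only for illustration; the numbers come from the elephant example.

```python
import numpy as np
from scipy import stats

def one_sample_z_test(x_bar, mu, sigma, n, alpha=0.05):
    """Two-tailed one-sample z-test, written out step by step."""
    # Step 1: H0: population mean == mu; H1: population mean != mu
    # Step 2: the significance level alpha is passed in
    # Step 3: critical values that bound the rejection region
    lower = stats.norm.ppf(alpha / 2)
    upper = stats.norm.ppf(1 - alpha / 2)
    # Step 4: test statistic
    z = (x_bar - mu) / (sigma / np.sqrt(n))
    # Step 5: decision
    reject = bool(z < lower or z > upper)
    return z, (lower, upper), reject

z, bounds, reject = one_sample_z_test(x_bar=8637, mu=9000, sigma=900, n=35, alpha=0.05)
print(round(z, 3), bounds, reject)
```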

# Language of Hypothesis testing 

**Significance Level $\alpha$**

The significance level $\alpha$ is the threshold at which you're okay with rejecting the null hypothesis. It is the probability of rejecting the null hypothesis when it is true. 

The most commonly used $\alpha$ in science is $\alpha = 0.05$. When you set $\alpha = 0.05$, you're saying "I'm okay with rejecting the null hypothesis if, assuming the null hypothesis is true, there is less than a 5% chance of seeing results at least as extreme as mine through randomness alone". 

**p-values**

The p-value is the probability of observing a test statistic at least as extreme as the one observed, by random chance, assuming that the null hypothesis is true. 

If $p \lt \alpha$, we reject the null hypothesis. 

If $p \geq \alpha$, we fail to reject the null hypothesis.
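What counts as "extreme" depends on the alternative hypothesis. A minimal sketch for a z-statistic, where the helper `z_p_value` and the example value `z = 1.5` are hypothetical:

```python
from scipy import stats

def z_p_value(z, alternative="two-sided"):
    """p-value of a z-statistic for a 'less', 'greater', or 'two-sided' alternative."""
    if alternative == "less":
        return stats.norm.cdf(z)            # area to the left of z
    if alternative == "greater":
        return stats.norm.sf(z)             # area to the right of z
    if alternative == "two-sided":
        return 2 * stats.norm.sf(abs(z))    # area in both tails beyond |z|
    raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")

z = 1.5  # made-up test statistic for illustration
alpha = 0.05
for alternative in ("less", "greater", "two-sided"):
    p = z_p_value(z, alternative)
    print(alternative, round(p, 4), "reject" if p < alpha else "fail to reject")
```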

> **We do not accept the alternative hypothesis, we only reject or fail to reject the null hypothesis in favor of the alternative.**


**What if the experiment we perform fails to reject the null hypothesis?**

* We do not throw out failed experiments! 
* We say "this methodology, with this data, does not produce significant results" 
    * Maybe we need more data!

# z-tests vs t-tests

According to the **Central Limit Theorem**, the sampling distribution of a statistic, like the sample mean, will approximately follow a normal distribution _as long as the sample size is sufficiently large_. 

When we know the standard deviation of the population, and the size of our sample is large enough ($n \geq 30$, where $n$ is the sample size), we can compute a z-statistic for our sample and use the normal distribution to evaluate probabilities involving the sample mean.  

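$$\text{z-statistic} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

where $\bar{x}$ is the sample mean, $\mu$ is the population mean, $\sigma$ is the population standard deviation, and $n$ is the sample size.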


However, sometimes sample sizes are small (size < 30), and many times we do not know the standard deviation of the population. 

When sample sizes are small, the sampling distribution of a statistic like the sample mean will follow a _Student's t-distribution_. 

In the image below, we show a t-distribution with one degree of freedom (in blue) plotted alongside a standard normal distribution (dashed black line).  Notice how the t-distribution has heavier tails than the standard normal distribution.

<img src="images/df_01.png" width="500">
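The heavier tails can also be seen numerically by comparing the probability of exceeding the same cutoff under each distribution. A small sketch, where the cutoff of 2 is an arbitrary choice for illustration:

```python
from scipy import stats

cutoff = 2.0

# Probability of a value greater than the cutoff under each distribution
normal_tail = stats.norm.sf(cutoff)
t_tail = stats.t.sf(cutoff, df=1)

print(round(normal_tail, 4), round(t_tail, 4))
```

With one degree of freedom, the t-distribution puts several times more probability beyond the cutoff than the standard normal does.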

When sample sizes are small or we do not know the standard deviation of the population, we compute the t-statistic of the sample, and use the t-distribution to evaluate probabilities with the sample mean.

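$$\text{t-statistic} = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

where $\bar{x}$ is the sample mean, $\mu$ is the population mean, $s$ is the sample standard deviation, and $n$ is the sample size. The statistic is evaluated against a t-distribution with $n - 1$ degrees of freedom.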


When you take smaller samples, you can expect greater variability in the sample statistics. The t-distribution has fatter/heavier tails than the normal distribution, which makes it appropriate for this situation. 

## Should I run a z-test or a t-test? 

<img src="images/z_or_t_test.png" width="500">


## Compare and contrast z-tests and t-tests. 
In both cases, it is assumed that the samples are normally distributed. 

A t-test is like a modified z-test:
1. Penalize for small sample size; use "degrees of freedom" 
2. Use the _sample_ standard deviation $s$ to estimate the population standard deviation $\sigma$. 

T-distributions have more probability in the tails. As the sample size increases, this decreases and the t distribution more closely resembles the z, or standard normal, distribution. By sample size n = 1000 they are virtually indistinguishable from each other. 
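One way to see this convergence is to compare two-tailed critical values at $\alpha = 0.05$ as the degrees of freedom grow. A short sketch, where the list of degrees of freedom is an arbitrary choice:

```python
from scipy import stats

alpha = 0.05
z_critical = stats.norm.ppf(1 - alpha / 2)

# Upper critical t value for increasing degrees of freedom
dfs = [2, 10, 30, 100, 1000]
t_criticals = [stats.t.ppf(1 - alpha / 2, df) for df in dfs]

for df, t_crit in zip(dfs, t_criticals):
    print(df, round(t_crit, 4), "gap to z:", round(t_crit - z_critical, 4))
```

The critical t value shrinks toward the critical z value of about 1.96 as the degrees of freedom increase.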

# More information about t-tests

* One sample and two sample t-tests: what's the difference? 

# Examples

Given the following data, we want to know if the sample is different from the population at $\alpha=0.05$ 

```python
population_mean = 85
sample = [90, 100, 110]
```

In [9]:
population_mean = 85
sample = [90, 100, 110]

**State the null and alternative hypothesis**

> Null hypothesis: $H_0$: The mean of the population the sample was drawn from is the same as the given population mean. 

> Alternative hypothesis: $H_1$: The mean of the population the sample was drawn from is not the same as the given population mean. 

**What type of hypothesis test do we want to perform and why?**

> We want to perform a two-tailed one-sample t-test. We do not know the population standard deviation, so we need to estimate it, and our sample size is small (n = 3). 

**What's the critical test statistic?** 

In [18]:
# degrees of freedom for a one-sample t-test: n - 1
df = len(sample) - 1
t_critical = stats.t.ppf(q=0.975, df=df)
print(t_critical)

4.302652729911275


**Perform the test.**

In [19]:
# Using scipy
stats.ttest_1samp(a=sample, popmean=population_mean)

Ttest_1sampResult(statistic=2.5980762113533156, pvalue=0.12168993434632014)

In [20]:
# By "hand"
mu = 85 
x_bar = np.mean(sample)
n = len(sample)
s = np.std(sample, ddof=1)
df = n-1

t = (x_bar - mu)/(s/n**0.5)
print(t)
print(df)

2.5980762113533156
2


**Can we reject the null hypothesis?**

> No. We fail to reject the null hypothesis at $\alpha=0.05$ because our t-statistic (2.60) is smaller than the critical value (4.30), so it does not lie in the rejection region for the hypothesis test. 
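The p-value reported by `stats.ttest_1samp` can be reproduced from the t-statistic and the degrees of freedom. A minimal sketch using the same sample (the variable name `p_value` is illustrative):

```python
import numpy as np
from scipy import stats

population_mean = 85
sample = [90, 100, 110]

n = len(sample)
df = n - 1

# t-statistic using the sample standard deviation (ddof=1)
t = (np.mean(sample) - population_mean) / (np.std(sample, ddof=1) / np.sqrt(n))

# Two-sided p-value: area in both tails of the t-distribution beyond |t|
p_value = 2 * stats.t.sf(abs(t), df)
print(t, p_value)
```

The p-value is about 0.12, which is greater than $\alpha = 0.05$, so both approaches lead to the same decision.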

# Another example

I'm buying jeans from store A (`store1`) and store B (`store2`). I know nothing about their inventory other than prices. 

``` python
store1 = [20,30,30,50,75,25,30,30,40,80]
store2 = [60,30,70,90,60,40,70,40]
```

Should I go just to one store for a less expensive pair of jeans? I'm pretty apprehensive about my decision, so $\alpha = 0.1$. It's okay to assume the samples have equal variances.

**State the null and alternative hypotheses**

> Null: Store A and B have the same jean prices. 

> Alternative: Store A and B do not have the same jean prices. 

**What kind of test should we run? Why?** 

> Run a two-tailed independent two-sample t-test. Sample sizes are small, and we do not know the population standard deviations. 

**Perform the test.**

In [21]:
store1 = [20,30,30,50,75,25,30,30,40,80]
store2 = [60,30,70,90,60,40,70,40]

stats.ttest_ind(store1, store2)

Ttest_indResult(statistic=-1.70113828065953, pvalue=0.10826653002468378)

**Make decision.**

> We fail to reject the null hypothesis at a significance level of $\alpha = 0.1$, since the p-value (0.108) is greater than $\alpha$. We do not have evidence that jean prices are different in store A and store B. 
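For reference, the statistic that `stats.ttest_ind` returns under the equal-variance assumption can be reproduced by hand with a pooled variance. This is a sketch of that calculation; the intermediate variable names are illustrative.

```python
import numpy as np
from scipy import stats

store1 = [20, 30, 30, 50, 75, 25, 30, 30, 40, 80]
store2 = [60, 30, 70, 90, 60, 40, 70, 40]

n1, n2 = len(store1), len(store2)
df = n1 + n2 - 2

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * np.var(store1, ddof=1) + (n2 - 1) * np.var(store2, ddof=1)) / df

# Standard error of the difference in sample means
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

t = (np.mean(store1) - np.mean(store2)) / se
p_value = 2 * stats.t.sf(abs(t), df)
print(t, p_value)
```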

# More practice!

A rental car company claims the mean time to rent a car on their website is 60 seconds with a standard deviation of 30 seconds. A random sample of 36 customers attempted to rent a car on the website. The mean time to rent was 75 seconds. Is this enough evidence to contradict the company's claim at a significance level of $\alpha = 0.05$? 

We know the population standard deviation and our sample size n $\geq$ 30. 
* We are going to perform a two-tailed one-sample z-test at a significance level of $\alpha=0.05$. 

Null hypothesis: There is no difference in the mean time to rent of the sample and the claim by the rental company. 

Alternative hypothesis: There is a difference in the mean time to rent of the sample and the claim by the rental company. 

For a two-sided one-sample test and $\alpha = 0.05$, the critical z-scores are -1.96 and 1.96. That is, if our computed z-statistic is below -1.96 or above 1.96, we have enough evidence to reject the null hypothesis in favor of the alternative.



In [22]:
# one-sample z-test 
z = (75 - 60)/(30/np.sqrt(36))
print(z)

3.0


The z-statistic is greater than 1.96, thus we can reject the null hypothesis in favor of the alternative hypothesis, at $\alpha = 0.05$.  

# More practice!

A coffee shop relocates from Manhattan to Brooklyn and wants to make sure that all lattes are consistent. They believe each latte has 4 oz of espresso. A random sample of 25 lattes shows a mean of 4.6 oz and standard deviation of 0.22 oz. Are their lattes different now that they've relocated to Brooklyn? Use $\alpha = 0.01$. 

State null and alternative hypotheses
1. Null: the amount of espresso in the lattes is the same as before 
2. Alternative: the amount of espresso in the lattes is different 

What kind of test? 
* two-tailed one-sample t-test
    * small sample size
    * unknown population standard deviation 
    * two-tailed because we want to know if amounts are same or different 

In [23]:
x_bar = 4.6 
mu = 4 
s = 0.22 
n = 25 

df = n-1

t = (x_bar - mu)/(s/n**0.5)
t

13.63636363636363

In [24]:
# critical t-statistic values
stats.t.ppf(0.005, df), stats.t.ppf(1-0.005, df)

(-2.796939504772805, 2.796939504772804)

Can we reject the null hypothesis? 

> Yes. $|t| > t_{critical}$ (13.64 > 2.80), so we can reject the null hypothesis in favor of the alternative at $\alpha = 0.01$. 

# Another example...

You measure the delivery times of ten different restaurants in two different neighborhoods. You want to know if restaurants in the different neighborhoods have similar delivery times. It's okay to assume both samples have equal variances. Set your significance threshold to 0.05. 

``` python
delivery_times_A = [28.4, 23.3, 30.4, 28.1, 29.4, 30.6, 27.8, 30.9, 27.0, 32.8]
delivery_times_B = [26.4, 26.3, 27.4, 30.4, 25.1, 28.4, 23.3, 24.7, 31.8, 24.3]
```

State null and alternative hypotheses. What type of test should we perform? 

> Null hypothesis: The delivery times for restaurants in neighborhood A are equal to delivery times for restaurants in neighborhood B. 

> Alternative hypothesis: Delivery times for restaurants in neighborhood A are not equal to delivery times for restaurants in neighborhood B. 

> Two-sided unpaired two-sample t-test

In [25]:
delivery_times_A = [28.4, 23.3, 30.4, 28.1, 29.4, 30.6, 27.8, 30.9, 27.0, 32.8]
delivery_times_B = [26.4, 26.3, 27.4, 30.4, 25.1, 28.4, 23.3, 24.7, 31.8, 24.3]

In [26]:
stats.ttest_ind(delivery_times_A, delivery_times_B)

Ttest_indResult(statistic=1.7223240113288751, pvalue=0.10214880648482656)

We cannot reject the null hypothesis that restaurants in neighborhoods A and B have equal delivery times, since the p-value (0.102) is greater than $\alpha = 0.05$. 
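The same decision can be reached by comparing the t-statistic with the two-tailed critical value. A minimal sketch, assuming equal variances as stated in the problem (the default for `stats.ttest_ind`):

```python
from scipy import stats

delivery_times_A = [28.4, 23.3, 30.4, 28.1, 29.4, 30.6, 27.8, 30.9, 27.0, 32.8]
delivery_times_B = [26.4, 26.3, 27.4, 30.4, 25.1, 28.4, 23.3, 24.7, 31.8, 24.3]

alpha = 0.05
result = stats.ttest_ind(delivery_times_A, delivery_times_B)

# Degrees of freedom for the pooled two-sample t-test
df = len(delivery_times_A) + len(delivery_times_B) - 2
t_critical = stats.t.ppf(1 - alpha / 2, df)

reject = abs(result.statistic) > t_critical
print(result.statistic, t_critical, reject)
```

The t-statistic of about 1.72 does not exceed the critical value of about 2.10, so it falls outside the rejection region.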

# More practice!

Consider the gain in weight (in grams) of 19 female rats between 28 and 84 days after birth. 

Twelve rats were fed on a high protein diet and seven rats were fed on a low protein diet.

``` python
high_protein = [134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123]
low_protein = [70, 118, 101, 85, 107, 132, 94]
```

Is there any difference in the weight gain of rats fed on a high protein diet vs a low protein diet? It's OK to assume equal variances. Use a significance level of $\alpha = 0.05$. 

Null and alternative hypotheses? 

> null: there is no difference in the weight gain of rats who were fed a high protein diet vs a low protein diet 

> alternative: weight gains differ by kind of diet 

Kind of test and why?

> Two-sided unpaired two-sample t-test. Low sample size.

In [28]:
stats.ttest_ind(high_protein, low_protein) #two-tailed test

Ttest_indResult(statistic=1.89143639744233, pvalue=0.07573012895667763)

We fail to reject the null hypothesis at a significance level of $\alpha = 0.05$. 

**What if we wanted to test if the rats who ate a high protein diet gained more weight than those who ate a low-protein diet?**

Null: weight gain by rats who ate high protein diet same as weight gain of low protein diet rats 

alternative: weight gain by rats who ate high protein diet greater than weight gain of low protein diet rats 

Kind of test? One-sided unpaired two-sample test 

Critical test statistic value? 

In [30]:
stats.t.ppf(q=0.95, df = len(high_protein)+len(low_protein)-2) #critical t-statistic 

1.7396067260750672

The t-statistic (1.89) is greater than the critical value (1.74), so it lies in the rejection region. 
We can reject the null hypothesis in favor of the alternative at $\alpha = 0.05$ (one-sided test). 
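The one-sided comparison can be written out explicitly. This sketch reuses the two-sample statistic from `stats.ttest_ind` and computes the upper-tail p-value by hand; the variable names are illustrative.

```python
from scipy import stats

high_protein = [134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123]
low_protein = [70, 118, 101, 85, 107, 132, 94]

alpha = 0.05
result = stats.ttest_ind(high_protein, low_protein)  # statistic is the same for one- and two-sided tests

df = len(high_protein) + len(low_protein) - 2
t_critical = stats.t.ppf(1 - alpha, df)

# Upper-tail p-value for H1: high protein gain > low protein gain
p_one_sided = stats.t.sf(result.statistic, df)

print(result.statistic, t_critical, p_one_sided)
print("reject" if result.statistic > t_critical else "fail to reject")
```

Because the statistic is positive, the one-sided p-value is half of the two-sided one, which is why the same data can be significant in the one-sided test but not in the two-sided test.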