# Hypothesis testing 

## Intuition for Hypothesis Testing Example

Cristian has recently claimed that his lucky quarter is distinctly different from every other quarter. Because of the unique weight distribution of the quarter's design, he says it is more likely to land tails than a fair coin would be.

Do we believe him?

I sure don't. But let's be good data scientists and put this claim to the test.

Let's flip the coin once and if it comes up tails then I'll change my mind.

Would you change your mind?

How many tails would I have to flip in order to convince you that this coin actually isn't fair? How many to know for sure that it isn't fair?

What is a reasonable threshold to set?

In [11]:
# imports
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

### High Level Hypothesis Testing
1. Start with a Scientific Question (yes/no)
2. Take the skeptical stance (Null hypothesis) 
3. State the complement (Alternative)
4. Create a model of the situation Assuming the Null Hypothesis is True!
5. Decide how surprised you would need to be in order to change your mind
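
The steps above can be sketched for the coin example. This is only a minimal illustration: the choice of 10 flips and a 5% "surprise" threshold are assumptions made here, not part of Cristian's claim.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more tails in n flips."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_flips = 10       # assumed number of flips
threshold = 0.05   # assumed "surprise" threshold

# Model of the situation under the null hypothesis: the coin is fair (p = 0.5).
for k in range(6, n_flips + 1):
    print(f"P(at least {k} tails in {n_flips} flips) = {prob_at_least(k, n_flips):.4f}")

# Smallest number of tails that is surprising enough to change our mind.
min_tails = min(k for k in range(n_flips + 1) if prob_at_least(k, n_flips) < threshold)
print("Tails needed to doubt a fair coin:", min_tails)
```

Note that `prob_at_least(1, 1)` is 0.5: a single flip landing tails happens half the time with a fair coin, so one flip should not change anyone's mind.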

## Definitions

**What is statistical hypothesis testing?**

When we perform experiments, we typically do not have access to all the members of a population, and need to take **samples** of measurements to make inferences about the population. 

A statistical hypothesis test is a method for testing a hypothesis about a parameter in a population using data measured in a sample. 

We test a hypothesis by determining the chance of obtaining a sample statistic if the null hypothesis regarding the population parameter is true. 

> The goal of hypothesis testing is to make a decision about the value of a population parameter based on sample data.



**Why do we care about hypothesis testing?**

Scenarios: 
* Chemistry - do inputs from two different barley fields produce different yields?
* Astrophysics - do star systems with near-orbiting gas giants have hotter stars?
* Economics - demography, surveys, etc.
* Medicine - BMI vs. Hypertension, etc.
* Business - which ad is more effective given engagement?

**Intuition** 

Suppose you have a large dataset for a population. The data is normally distributed with mean 0 and standard deviation 1.

Along comes a new sample with a sample mean of 2.9.

> The idea behind hypothesis testing is a desire to quantify our belief as to whether our sample of observations came from the same population as the original dataset. 

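According to the empirical (68–95–99.7) rule for normal distributions, only about 0.3% of values drawn from this population land 3 or more standard deviations away from the mean. A value of 2.9 is roughly 3 standard deviations above the mean, so it would be a very surprising result if the sample really came from the same population.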
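
To check the empirical-rule figures numerically, here is a small sketch using `scipy.stats.norm`. It treats 2.9 as a single draw from the standard normal population, which is a simplifying assumption made for illustration.

```python
from scipy import stats

# Two-sided tail probability for the new observation and for exactly 3 SDs.
p_29 = 2 * stats.norm.sf(2.9)   # sf(z) = 1 - cdf(z), the upper-tail area
p_3 = 2 * stats.norm.sf(3)

# Probability of landing at least z standard deviations from the mean (both tails).
for z in (1, 2, 2.9, 3):
    print(f"P(|Z| >= {z}) = {2 * stats.norm.sf(z):.4f}")
```

The values for 1, 2, and 3 standard deviations are the complements of the 68%, 95%, and 99.7% figures in the rule.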

<img src="images/normal_sd_new.png" width="500">
 
To formalize this intuition, we define a threshold value for deciding whether we believe that the sample is from the same underlying population or not. This threshold is $\alpha$, the **significance threshold**.  

This serves as the foundation for hypothesis testing where we will reject or fail to reject the null hypothesis.


# Hypothesis testing 

Regardless of the type of statistical hypothesis test you're performing, there are five main steps to executing them:

1. Set up a null and alternative hypothesis 

2. Choose a significance level $\alpha$ (or use the one assigned). 

3. Determine the critical test statistic value or p-value. **(Find the rejection region for the null hypothesis.)**

4. Calculate the value of the test statistic. 

5. Compare the test statistic value to the critical test statistic value to reject the null hypothesis or not.

<img src="images/hypothesis_test.png" width="500">

**Decision Rule**: 

The decision rule tells us when we can reject the null hypothesis. 

It depends on 3 factors: 
1. The alternative hypothesis 
    * Is this an upper-tailed, lower-tailed, or two-tailed test?
2. The test statistic 
3. The level of significance $\alpha$. 


Upper-tailed test (right-tailed test): 
* The null hypothesis is rejected if the test statistic is greater than the critical value. 

Lower-tailed test (left-tailed test): 
* The null hypothesis is rejected if the test statistic is smaller than the critical value.

Two-tailed test:
* The null hypothesis is rejected if the test statistic is either larger than an upper critical value or smaller than a lower critical value.
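
The three decision rules can be sketched in code for a z-statistic. The function names and the `'upper'`/`'lower'`/`'two'` labels below are made up for this illustration, and the observed statistic is hypothetical.

```python
from scipy import stats

def critical_values(alpha, tail):
    """Critical z value(s) for a test at significance level alpha.

    tail is 'upper', 'lower', or 'two' (labels chosen for this sketch).
    """
    if tail == "upper":
        return (stats.norm.ppf(1 - alpha),)
    if tail == "lower":
        return (stats.norm.ppf(alpha),)
    if tail == "two":
        return (stats.norm.ppf(alpha / 2), stats.norm.ppf(1 - alpha / 2))
    raise ValueError("tail must be 'upper', 'lower', or 'two'")

def reject_null(z, alpha, tail):
    """Apply the decision rule to an observed z-statistic."""
    crit = critical_values(alpha, tail)
    if tail == "upper":
        return z > crit[0]
    if tail == "lower":
        return z < crit[0]
    return z < crit[0] or z > crit[1]

z_observed = 1.8   # hypothetical test statistic
for tail in ("upper", "lower", "two"):
    print(tail, critical_values(0.05, tail), reject_null(z_observed, 0.05, tail))
```

The same statistic can lead to different decisions depending on the alternative hypothesis: here 1.8 clears the upper-tailed critical value but not the two-tailed one, because the two-tailed test splits $\alpha$ across both tails.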

# Language of Hypothesis testing 

**Significance Level $\alpha$**

The significance level $\alpha$ is the threshold at which you're okay with rejecting the null hypothesis. It is the probability of rejecting the null hypothesis when it is true. 

The most commonly used $\alpha$ in science is $\alpha = 0.05$. When you set $\alpha = 0.05$, you're saying "I'm okay with rejecting the null hypothesis if results at least as extreme as mine would show up less than 5% of the time when the null hypothesis is true". 

**p-values**

The p-value is the probability of observing a test statistic at least as extreme as the one observed, by random chance, assuming that the null hypothesis is true. 

If $p \lt \alpha$, we reject the null hypothesis. 

If $p \geq \alpha$, we fail to reject the null hypothesis.

> **We never "accept" a hypothesis outright. We either reject the null hypothesis in favor of the alternative, or we fail to reject the null hypothesis.**
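
One way to make the p-value definition concrete is to simulate it. The sketch below assumes the test statistic is a z-statistic (standard normal under the null) and uses a hypothetical observed value of 2.0; the seed and number of simulations are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable

z_observed = 2.0                 # hypothetical observed z-statistic
alpha = 0.05

# Simulate many test statistics in a world where the null hypothesis is true.
z_null = rng.standard_normal(200_000)

# Two-sided p-value: fraction of null statistics at least as extreme as ours.
p_simulated = np.mean(np.abs(z_null) >= abs(z_observed))
p_exact = 2 * stats.norm.sf(abs(z_observed))

print(f"simulated p = {p_simulated:.4f}, exact p = {p_exact:.4f}")
print("reject null" if p_exact < alpha else "fail to reject null")
```

The simulated fraction and the exact tail area should agree closely, which is the point: the p-value is just the share of "null worlds" that look at least as extreme as what we saw.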


**What if the experiment we perform fails to reject the null hypothesis?**

* We do not throw out failed experiments! 
* We say "this methodology, with this data, does not produce significant results" 
    * Maybe we need more data!

## Type 1 Errors (False Positives) and Type 2 Errors (False Negatives)
Most tests for the presence of some factor are imperfect. And in fact most tests are imperfect in two ways: They will sometimes fail to predict the presence of that factor when it is after all present, and they will sometimes predict the presence of that factor when in fact it is not. Clearly, the lower these error rates are, the better, but it is not uncommon for these rates to be between 1% and 5%, and sometimes they are even higher than that. (Of course, if they're higher than 50%, then we're better off just flipping a coin to run our test!)

Predicting the presence of some factor (i.e. counter to the null hypothesis) when in fact it is not there (i.e. the null hypothesis is true) is called a "false positive". Failing to predict the presence of some factor (i.e. in accord with the null hypothesis) when in fact it is there (i.e. the null hypothesis is false) is called a "false negative".


How does changing our alpha value change the rate of type 1 and type 2 errors?
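
One way to explore this question is to compute both error rates for a two-tailed z-test. The sketch below assumes the true mean sits 2.5 standard errors away from the null mean; that effect size is hypothetical and chosen only to make the trade-off visible.

```python
from scipy import stats

def error_rates(alpha, shift):
    """Type 1 and type 2 error rates for a two-tailed z-test.

    shift is the true mean's distance from the null mean, measured in
    standard errors (a hypothetical effect size chosen for this sketch).
    """
    z_crit = stats.norm.ppf(1 - alpha / 2)
    type_1 = alpha   # chance of rejecting a true null, by construction
    # Type 2: the statistic lands inside (-z_crit, z_crit) even though
    # its distribution is centered at `shift` rather than 0.
    type_2 = stats.norm.cdf(z_crit - shift) - stats.norm.cdf(-z_crit - shift)
    return type_1, type_2

for alpha in (0.10, 0.05, 0.01):
    t1, t2 = error_rates(alpha, shift=2.5)
    print(f"alpha = {alpha:.2f}: type 1 = {t1:.2f}, type 2 = {t2:.3f}")
```

With the sample size and effect held fixed, lowering $\alpha$ reduces false positives but raises false negatives.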

# Let's continue our discussion of hypothesis tests with an example.

Suppose that African elephants have weights distributed normally around a mean of 9000 lbs with a standard deviation of 900 lbs. _Pachyderm Adventures_ has recently measured the weights of **35** Gabonese elephants and has calculated their average weight at 8637 lbs. 

Is the average weight of Gabonese elephants different from the average weight of African elephants? Use significance level $\alpha = 0.05$. 

**What are the null and alternative hypotheses? What is the significance level of the test?**

* Null hypothesis
    * The average weight of Gabonese elephants is the same as the average weight of African elephants.

* Alternative hypothesis
    * The average weight of Gabonese elephants is different than the average weight of African elephants.

The significance level of our test is $\alpha = 0.05$. 

**What should be our test statistic? Are we running an upper, lower, or two-tailed test? Why?**

Since we know the population standard deviation, the size of our sample is greater than 30, and we are comparing the sample mean to the population mean, we are going to run a one-sample z-test. 

Since we want to know if the sample mean is **different** from the population mean, we are running a two-tailed test. 

**What's the value of the critical test statistic that we should use for our test?**

In [3]:
# critical z-statistic
alpha = 0.05

# the percent point function (ppf) is the inverse of the cumulative distribution function (cdf), i.e. the quantile function
stats.norm.ppf(alpha/2), stats.norm.ppf(1-alpha/2)

(-1.9599639845400545, 1.959963984540054)

> Since we are performing a two-tailed one-sample z-test and $\alpha = 0.05$, if the z-score we compute is greater than 1.96 or smaller than -1.96, then we can reject the null hypothesis at significance level 0.05 in favor of the alternative hypothesis. 

**Perform the test.**

Compute the relevant test statistic for the sample, which here is the z-statistic.  

$$\text{z-statistic} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}, $$ where $\bar x$ is the sample mean, $\mu$ is the population mean, $\sigma$ is the population standard deviation, and $n$ is the sample size. 

In [5]:
n = 35
sigma = 900

x_bar = 8637
mu = 9000

se = sigma/np.sqrt(n)
z = (x_bar - mu)/se
print(z)

-2.386152179183512


**Make a decision: do we reject the null hypothesis or not?**

> z = -2.39 is smaller than -1.96, thus we can reject the null hypothesis in favor of the alternative hypothesis at significance level $\alpha = 0.05$. 
- - -

Another way of getting to same answer: 

In [8]:
stats.norm.cdf(z)

0.008512852080791552

> The area of the tail corresponding to this z-score is 0.0085. This is below 0.025. Thus we reject the null hypothesis in favor of the alternative at significance level $\alpha = 0.05$. 

**Would we be able to reject the null hypothesis if our significance threshold was $\alpha = 0.01$?**

The area of the tail corresponding to the calculated z-statistic z = -2.386 is `stats.norm.cdf(z) = 0.0085`. 

> Since the area of the tail corresponding to the z-score we obtained is 0.0085, which is greater than 0.005, we fail to reject the null hypothesis in favor of the alternative at a significance level of $\alpha = 0.01$. 

In [5]:
# critical z-statistic
alpha = 0.01
stats.norm.ppf(alpha/2), stats.norm.ppf(1-alpha/2)

(-2.575829303548901, 2.5758293035489004)

> Alternatively, since we are performing a two-tailed one-sample z-test and $\alpha = 0.01$, if the z-score we compute is greater than 2.58 or smaller than -2.58, then we can reject the null hypothesis at significance level 0.01 in favor of the alternative hypothesis. 

>Since the calculated z-statistic is -2.386, we fail to reject the null hypothesis in favor of the alternative at a significance level of $\alpha = 0.01$. 
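
Both decisions can also be read off a single two-sided p-value. The sketch below reuses the numbers from the elephant example; doubling the tail area is the usual convention for a two-tailed test with a symmetric distribution.

```python
import numpy as np
from scipy import stats

n, sigma = 35, 900
x_bar, mu = 8637, 9000

z = (x_bar - mu) / (sigma / np.sqrt(n))

# Two-sided p-value: area in both tails beyond |z|.
p_value = 2 * stats.norm.sf(abs(z))
print(f"z = {z:.3f}, two-sided p-value = {p_value:.4f}")

for alpha in (0.05, 0.01):
    decision = "reject" if p_value < alpha else "fail to reject"
    print(f"alpha = {alpha}: {decision} the null hypothesis")
```

Comparing the one-tail area to $\alpha/2$ and comparing the two-sided p-value to $\alpha$ are the same check, so the p-value lands between 0.01 and 0.05.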

# z-tests vs t-tests

According to the **Central Limit Theorem**, the sampling distribution of a statistic, like the sample mean, will follow a normal distribution _as long as the sample size is sufficiently large_. 

__What if we don't have large sample sizes?__

When we do not know the population standard deviation or we have a small sample size, we estimate the standard deviation from the sample, and the resulting standardized sample mean follows a t-distribution.  
* Smaller sample sizes have larger variance, and t-distributions account for that by having heavier tails than the normal distribution.
* t-distributions are parameterized by degrees of freedom: the fewer the degrees of freedom, the fatter the tails. As the degrees of freedom grow large, the t-distribution converges to the normal distribution.

# One-sample z-tests and one-sample t-tests

One-sample z-tests and one-sample t-tests are hypothesis tests for the population mean $\mu$. 

How do we know whether we need to use a z-test or a t-test? 

<img src="images/z_or_t_test.png" width="500">


**When we perform a hypothesis test for the population mean, we want to know how likely it is to obtain the test statistic for the sample mean given the null hypothesis that the sample mean and population mean are not different.** 

The test statistic for the sample mean summarizes our sample observations. How do test statistics differ for one-sample z-tests and t-tests? 

A t-test is like a modified z-test. 

* Penalize for small sample size: "degrees of freedom"

* Use sample standard deviation $s$ to estimate the population standard deviation $\sigma$.

<img src="images/img5.png" width="500">



A one-sample t-test estimates the population mean (one parameter). A sample with size $n$ provides $n$ pieces of information, or degrees of freedom, for estimating the population mean and its variability. 

One degree of freedom is used to estimate the mean, and the remaining $n-1$ degrees of freedom are used to estimate variability. 

>The one-sample t-test for samples of size $n$ has $n-1$ degrees of freedom.

<img src="images/img4.png" width="500">


## One-sample z-test

* For large enough sample sizes $n$ with known population standard deviation $\sigma$, the test statistic of the sample mean $\bar x$ is given by the **z-statistic**, 
$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$ where $\mu$ is the population mean.  

* Our hypothesis test tries to answer the question of how likely we are to observe a z-statistic as extreme as our sample's given the null hypothesis that the sample and the population have the same mean, given a significance threshold of $\alpha$. This is a one-sample z-test.  

## One-sample t-test

* For small sample sizes or samples with unknown population standard deviation, the test statistic of the sample mean is given by the **t-statistic**, 
$$ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} $$ Here, $s$ is the sample standard deviation, which is used to estimate the population standard deviation, and $\mu$ is the population mean.  

* Our hypothesis test tries to answer the question of how likely we are to observe a t-statistic as extreme as our sample's given the null hypothesis that the sample and population have the same mean, given a significance threshold of $\alpha$. This is a one-sample t-test.
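
As a rough check of the t-statistic formula, the sketch below computes it by hand and compares it with `scipy.stats.ttest_1samp`. The sample values and the hypothesized mean are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample, invented for illustration only.
sample = np.array([4.9, 5.3, 5.1, 4.7, 5.6, 5.2, 5.0, 5.4])
mu = 5.0   # hypothesized population mean

n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)   # sample standard deviation (n - 1 in the denominator)

t_manual = (x_bar - mu) / (s / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)   # two-sided p-value

result = stats.ttest_1samp(sample, popmean=mu)
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```

Note the `ddof=1` argument: NumPy's default standard deviation divides by $n$, while the t-statistic uses the sample standard deviation, which divides by $n-1$.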

## Compare and contrast z-tests and t-tests. 
In both cases, it is assumed that the samples are normally distributed. 

A t-test is like a modified z-test:
1. Penalize for small sample size; use "degrees of freedom" 
2. Use the _sample_ standard deviation $s$ to estimate the population standard deviation $\sigma$. 

T-distributions have more probability in the tails. As the sample size increases, the extra tail probability shrinks and the t-distribution more closely resembles the z, or standard normal, distribution. By sample size n = 1000 they are virtually indistinguishable from each other. 
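
A quick way to see the heavier tails is to compare two-tailed critical values at $\alpha = 0.05$. The degrees of freedom below are arbitrary picks for illustration.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)   # two-tailed critical value at alpha = 0.05
print(f"normal: {z_crit:.4f}")

for df in (5, 30, 1000):
    t_crit = stats.t.ppf(0.975, df)
    print(f"t, df = {df:>4}: {t_crit:.4f} (excess over normal: {t_crit - z_crit:.4f})")
```

The t critical value is always larger than the normal one, so a t-test demands a more extreme statistic before rejecting, and the gap closes as the degrees of freedom grow.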

## Here's an example: 

A coffee shop relocates from Manhattan to Brooklyn and wants to make sure that all lattes are consistent before and after their move. They buy a new machine and hire a new barista. In Manhattan, lattes are made with 4 oz of espresso. A random sample of 25 lattes made in their new store in Brooklyn shows a mean of 4.6 oz and standard deviation of 0.22 oz. Are their lattes different now that they've relocated to Brooklyn?

**What are the null and alternative hypotheses to test in this case? What kind of test should we run? Why?** 

> $H_0$: Lattes are the same. 

> $H_1$: Lattes are different. 

>> We should run a one-sample t-test, because the population standard deviation is unknown and the sample size is small. 

## Two-sample t-tests 

Sometimes, we are interested in determining whether two population means are equal. In this case, we use two-sample t-tests.

There are two types of two-sample t-tests: **paired** and **independent** (unpaired) tests. 

What's the difference?  

**Paired tests**: How is a sample affected by a certain treatment? The individuals in the sample remain the same and you compare how they change after treatment. 

**Independent tests**: When we compare two different, unrelated samples to each other, we use an independent (or unpaired) two-sample t-test.
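
To illustrate the difference, here is a sketch with made-up before/after measurements on the same six individuals. `scipy.stats.ttest_rel` runs the paired test, which is equivalent to a one-sample t-test on the differences.

```python
import numpy as np
from scipy import stats

# Hypothetical before/after measurements for the same six individuals.
before = np.array([72.0, 80.5, 68.2, 75.1, 90.3, 77.7])
after = np.array([70.1, 78.9, 69.0, 72.4, 86.8, 76.5])

# Paired test: each "after" value is matched to the same individual's "before" value.
paired = stats.ttest_rel(before, after)

# Equivalent view: a one-sample t-test on the differences against a mean of 0.
diffs = before - after
one_sample = stats.ttest_1samp(diffs, popmean=0)

# Independent test: ignores the pairing and treats the groups as unrelated.
independent = stats.ttest_ind(before, after)

print("paired:     ", paired.statistic, paired.pvalue)
print("differences:", one_sample.statistic, one_sample.pvalue)
print("independent:", independent.statistic, independent.pvalue)
```

Treating paired data as independent throws away the matching, so person-to-person variation can drown out a consistent within-person change.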

The test statistic for an unpaired two-sample t-test is slightly different than the test statistic for the one-sample t-test. 

Assuming equal variances, the test statistic for a two-sample t-test is given by: 

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$$

where $s^2$ is the pooled sample variance, 

$$ s^2 = \frac{\sum_{i=1}^{n_1} \left(x_{1,i} - \bar{x}_1\right)^2 + \sum_{j=1}^{n_2} \left(x_{2,j} - \bar{x}_2\right)^2 }{n_1 + n_2 - 2} $$

Here, $x_{1,i}$ and $x_{2,j}$ are the observations in sample 1 and sample 2, $\bar{x}_1$ and $\bar{x}_2$ are the sample means, $n_1$ is the sample size of sample 1, and $n_2$ is the sample size of sample 2. 

An independent two-sample t-test for samples of size $n_1$ and $n_2$ has $(n_1 + n_2 - 2)$ degrees of freedom. 
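
The pooled-variance formula can be checked against `scipy.stats.ttest_ind` with `equal_var=True`. The two samples below are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical samples, invented for illustration only.
sample_1 = np.array([12.1, 14.3, 11.8, 13.5, 12.9, 15.0])
sample_2 = np.array([10.2, 11.9, 12.4, 10.8, 11.1])

n_1, n_2 = len(sample_1), len(sample_2)
df = n_1 + n_2 - 2

# Pooled sample variance: combined squared deviations over the degrees of freedom.
ss_1 = np.sum((sample_1 - sample_1.mean()) ** 2)
ss_2 = np.sum((sample_2 - sample_2.mean()) ** 2)
s2_pooled = (ss_1 + ss_2) / df

t_manual = (sample_1.mean() - sample_2.mean()) / np.sqrt(s2_pooled * (1 / n_1 + 1 / n_2))
p_manual = 2 * stats.t.sf(abs(t_manual), df)   # two-sided p-value

result = stats.ttest_ind(sample_1, sample_2, equal_var=True)
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```

If the equal-variance assumption is doubtful, `ttest_ind` also accepts `equal_var=False`, which runs Welch's t-test instead of the pooled version shown here.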

## Sample problem: Unpaired two-sample t-test 

You measure the delivery times of ten different restaurants in each of two different neighborhoods, A and B. You want to know if restaurants in the different neighborhoods have the same delivery times. It's okay to assume both samples have equal variances. 

``` python
delivery_times_A = [28.4, 23.3, 30.4, 28.1, 29.4, 30.6, 27.8, 30.9, 27.0, 32.8]
delivery_times_B = [26.4, 26.3, 27.4, 30.4, 25.1, 28.4, 23.3, 24.7, 31.8, 24.3]
```

# Let's practice solving hypothesis test problems!

## Example 1
Let's revisit our Gabonese elephant weight example. 

Suppose that African elephants have weights distributed normally around a mean of 9000 lbs with a standard deviation of 900 lbs. _Pachyderm Adventures_ has recently measured the weights of **35** Gabonese elephants and has calculated their average weight at 8637 lbs. 

Is the average weight of Gabonese elephants _less_ than the average weight of African elephants? Use significance level $\alpha = 0.05$. 

**What are the null and alternative hypotheses in this case?**

**What kind of test do we need to run?**

**What's the critical test statistic value we should use?**

**Perform the test and make a decision regarding the null hypothesis.**

* Null hypothesis
    * The average weight of Gabonese elephants is the same as the average weight of African elephants.

* Alternative hypothesis
    * The average weight of Gabonese elephants is less than the average weight of African elephants.
- - - 

We need to run a lower-tailed one-sample z-test. 
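
One possible way to carry out the remaining steps is sketched below, reusing the numbers from the problem statement. For a lower-tailed test all of $\alpha$ sits in the left tail, so the critical value is `stats.norm.ppf(alpha)` rather than `stats.norm.ppf(alpha/2)`.

```python
import numpy as np
from scipy import stats

n, sigma = 35, 900
x_bar, mu = 8637, 9000
alpha = 0.05

z = (x_bar - mu) / (sigma / np.sqrt(n))
z_crit = stats.norm.ppf(alpha)    # lower-tailed: all of alpha sits in the left tail
p_value = stats.norm.cdf(z)       # area to the left of the observed z

print(f"z = {z:.3f}, critical value = {z_crit:.3f}, p-value = {p_value:.4f}")
print("reject null" if z < z_crit else "fail to reject null")
```

Since the z-statistic falls below the critical value, the data support rejecting the null hypothesis in favor of the alternative that Gabonese elephants weigh less on average.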

## Example 2
Next, let's finish working through our coffee shop example...  

A coffee shop relocates from Manhattan to Brooklyn and wants to make sure that all lattes are consistent before and after their move. They buy a new machine and hire a new barista. In Manhattan, lattes are made with 4 oz of espresso. A random sample of 25 lattes made in their new store in Brooklyn shows a mean of 4.6 oz and standard deviation of 0.22 oz. Are their lattes different now that they've relocated to Brooklyn? Use a significance level of $\alpha = 0.01$. 

State null and alternative hypothesis
1. Null: the amount of espresso in the lattes is the same as before the move.
2. Alternative: the amount of espresso in the lattes is different before and after the move. 

What kind of test? 
* two-tailed one-sample t-test
    * small sample size
    * unknown population standard deviation 
    * two-tailed because we want to know if amounts are same or different 

In [10]:
x_bar = 4.6 
mu = 4 
s = 0.22 
n = 25 

df = n-1

t = (x_bar - mu)/(s/n**0.5)
print("The t-statistic for our sample is {}.".format(round(t, 2)))

The t-statistic for our sample is 13.64.


In [11]:
# critical t-statistic values
stats.t.ppf(0.005, df), stats.t.ppf(1-0.005, df)

(-2.796939504772805, 2.796939504772804)

Can we reject the null hypothesis? 

> Yes. |t| > t_critical (13.64 > 2.80), so we can reject the null hypothesis in favor of the alternative at $\alpha = 0.01$. 

## Example 3

I'm buying jeans from store A and store B. I know nothing about their inventory other than prices. 

``` python
store1 = [20,30,30,50,75,25,30,30,40,80]
store2 = [60,30,70,90,60,40,70,40]
```

Should I go to just one store for a less expensive pair of jeans? I'm pretty apprehensive about my decision, so $\alpha = 0.1$. It's okay to assume the samples have equal variances.

**State the null and alternative hypotheses**

> Null: Store A and B have the same jean prices. 

> Alternative: Store A and B do not have the same jean prices. 

**What kind of test should we run? Why?** 

> Run a two-tailed independent two-sample t-test. Sample sizes are small. 

**Perform the test.**

In [12]:
store1 = [20,30,30,50,75,25,30,30,40,80]
store2 = [60,30,70,90,60,40,70,40]

stats.ttest_ind(store1, store2)

Ttest_indResult(statistic=-1.70113828065953, pvalue=0.10826653002468378)

**Make decision.**

> We fail to reject the null hypothesis at a significance level of $\alpha = 0.1$. We do not have evidence to support that jean prices are different in store A and store B. 
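
The same decision can be reached by comparing the t-statistic to the critical value instead of comparing the p-value to $\alpha$. This is a sketch of that alternative route using the same price data.

```python
from scipy import stats

store1 = [20, 30, 30, 50, 75, 25, 30, 30, 40, 80]
store2 = [60, 30, 70, 90, 60, 40, 70, 40]
alpha = 0.1

result = stats.ttest_ind(store1, store2)   # equal variances assumed by default
df = len(store1) + len(store2) - 2

# Two-tailed critical value: alpha is split between the two tails.
t_crit = stats.t.ppf(1 - alpha / 2, df)

print(f"t = {result.statistic:.3f}, critical values = ±{t_crit:.3f}")
print("reject null" if abs(result.statistic) > t_crit else "fail to reject null")
```

The statistic falls just inside the critical values, which matches a p-value sitting just above 0.1.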

## Example 4 

Next, let's finish working through the restaurant delivery times problem. 

You measure the delivery times of ten different restaurants in each of two different neighborhoods. You want to know if restaurants in the different neighborhoods have the same delivery times. It's okay to assume both samples have equal variances. Set your significance threshold to 0.05. 

``` python
delivery_times_A = [28.4, 23.3, 30.4, 28.1, 29.4, 30.6, 27.8, 30.9, 27.0, 32.8]
delivery_times_B = [26.4, 26.3, 27.4, 30.4, 25.1, 28.4, 23.3, 24.7, 31.8, 24.3]
```

State null and alternative hypothesis. What type of test should we perform? 

> Null hypothesis: The delivery times for restaurants in neighborhood A are equal to delivery times for restaurants in neighborhood B. 

> Alternative hypothesis: Delivery times for restaurants in neighborhood A are not equal to delivery times for restaurants in neighborhood B. 

> Two-sided unpaired two-sample t-test

In [13]:
delivery_times_A = [28.4, 23.3, 30.4, 28.1, 29.4, 30.6, 27.8, 30.9, 27.0, 32.8]
delivery_times_B = [26.4, 26.3, 27.4, 30.4, 25.1, 28.4, 23.3, 24.7, 31.8, 24.3]

In [14]:
stats.ttest_ind(delivery_times_A, delivery_times_B)

Ttest_indResult(statistic=1.7223240113288751, pvalue=0.10214880648482656)

> We cannot reject the null hypothesis that restaurants in neighborhoods A and B have equal delivery times, because the p-value (0.102) is greater than $\alpha = 0.05$. 

# Level Up: More practice problems!

A rental car company claims the mean time to rent a car on their website is 60 seconds with a standard deviation of 30 seconds. A random sample of 36 customers attempted to rent a car on the website. The mean time to rent was 75 seconds. Is this enough evidence to contradict the company's claim at a significance level of $\alpha = 0.05$? 

Null hypothesis:

Alternative hypothesis:


In [None]:
# one-sample z-test 


Reject?:

Consider the gain in weight (in grams) of 19 female rats between 28 and 84 days after birth. 

Twelve rats were fed on a high protein diet and seven rats were fed on a low protein diet.

``` python
high_protein = [134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123]
low_protein = [70, 118, 101, 85, 107, 132, 94]
```

Is there any difference in the weight gain of rats fed on high protein diet vs low protein diet? It's OK to assume equal sample variances. 

Null and alternative hypotheses? 

> null: 

> alternative: 

What kind of test should we perform and why? 

> Test:

We fail to reject the null hypothesis at a significance level of $\alpha = 0.05$. 

**What if we wanted to test if the rats who ate a high protein diet gained more weight than those who ate a low-protein diet?**

Null:

alternative:

Kind of test? 

Critical test statistic value? 

Can we reject?

# Summary 

Key Takeaways:

* A statistical hypothesis test is a method for testing a hypothesis about a parameter in a population using data measured in a sample. 
* Hypothesis tests consist of a null hypothesis and an alternative hypothesis.
* We test a hypothesis by determining the chance of obtaining a sample statistic if the null hypothesis regarding the population parameter is true. 
* One-sample z-tests and one-sample t-tests are hypothesis tests for the population mean $\mu$. 
* We use a one-sample z-test for the population mean when the population standard deviation is known and the sample size is sufficiently large. We use a one-sample t-test for the population mean when the population standard deviation is unknown or when the sample size is small. 
* Two-sample t-tests are hypothesis tests for differences in two population means. 