# Lecture 5.3: Hypothesis Testing II - Inference for Categorical Data

## Outline

* Hypothesis testing - one sample proportion
* Hypothesis testing for difference in means
    * Large sample
    * Small sample
* Hypothesis testing for difference in proportions


## Objectives

Be able to:

* Construct confidence interval for difference in means
* Carry out hypothesis testing for difference in means
* Construct confidence interval for difference in proportions
* Carry out hypothesis testing for difference in proportions

In [None]:
%pylab inline
import pandas as pd
import yaml

from scipy import stats
from sqlalchemy import create_engine
from statsmodels.stats.weightstats import ttest_ind

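# Load the student credentials; safe_load avoids constructing arbitrary objects from YAML tags
with open('../../pg_creds.yaml') as f:
    pg_creds = yaml.safe_load(f)['student']

# Connection to the course PostgreSQL database
engine = create_engine('postgresql://{user}:{password}@{host}:{port}/{dbname}'.format(**pg_creds))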

## Hypothesis Testing for Proportions

So far, we have been talking about hypothesis testing for the mean. We can also conduct hypothesis tests for proportions.

**Example**

We want to see if there is any evidence that the percentage of female patrons of the HairCare shop differs from 70%.  

In a simple random sample of 200 patrons, 133 are female.

We want to test:

$H_0: p = 0.7$    
$H_a: p \neq 0.7$

We know:

$x = 133$  
$n = 200$  
$\hat{p} = \frac{x}{n} = \frac{133}{200} = 0.665$  

By the CLT,  

$$ \hat{p} \sim N(p, \frac{p(1-p)}{n}) $$

or

$$ \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \sim N(0, 1) $$  


The 95% confidence interval for $p$ is given by  

$$ (\hat{p} - 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \text{, } \hat{p} + 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}) $$


**Confidence interval method**  

The 95% confidence interval for $p$ is  

$$ (0.665 - 1.96 \sqrt{\frac{0.665(1 - 0.665)}{200}} \text{, } 0.665 + 1.96 \sqrt{\frac{0.665(1 - 0.665)}{200}}) $$

$$ \Rightarrow (0.60, 0.73) $$  

Since 0.7 is within the 95% confidence interval, we fail to reject the null at the 0.05 significance level.
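The interval above can be checked numerically. The following is a minimal sketch that uses the standard library's `statistics.NormalDist` (an assumption made here so the snippet runs without extra packages; `stats.norm.ppf` from SciPy should give the same critical value). The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x, n = 133, 200            # female patrons and sample size from the example
p_hat = x / n              # sample proportion

# Two-sided 95% interval: critical value is the 97.5th percentile of N(0, 1)
z_crit = NormalDist().inv_cdf(0.975)
se = sqrt(p_hat * (1 - p_hat) / n)   # standard error estimated from p_hat

ci_low, ci_high = p_hat - z_crit * se, p_hat + z_crit * se
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
print("0.7 inside interval:", ci_low <= 0.7 <= ci_high)
```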

**The p-value approach**  

The test statistic is

$$z_{stat} = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} = \frac{0.665 - 0.7}{\sqrt{0.7(1 - 0.7)/200}} = -1.08 $$  

Note: $ z_{stat} \sim N(0, 1)$ if the null hypothesis is true

<B>Pop Quiz:</B> <details><summary>Why is this a $z_{stat}$ and not a $t_{stat}$?</summary>Because the denominator is the hypothesized population $\sigma$, not the sample $s$</details>

The p-value is

In [None]:
stats.norm.cdf?

$$ \text{p-value } = P(Z < -1.08 \text{ or } Z > 1.08) = 0.28 > 0.05 $$  

We fail to reject the null. We don't have sufficient evidence to conclude that the true proportion of female patrons is different from 70%.  


This is often called the **one sample z-test** for proportions.
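As a rough check of the numbers above, here is a minimal sketch of the one sample z-test. It assumes the standard library's `statistics.NormalDist` in place of `stats.norm.cdf` (both evaluate the standard Normal CDF); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x, n, p0 = 133, 200, 0.7   # data and hypothesized proportion from the example
p_hat = x / n

# Under H0 the standard error uses the hypothesized p0, not p_hat
z_stat = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Two-sided p-value: probability in both tails beyond |z_stat|
p_value = 2 * NormalDist().cdf(-abs(z_stat))
print(f"z = {z_stat:.2f}, p-value = {p_value:.2f}")
```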

#### One-Sided Intervals for Proportions 

* 95% upper one-sided CI   

$$ (- \infty, \hat{p} + 1.64 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}) $$  

* 95% lower one-sided CI  

$$ (\hat{p} - 1.64 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}, \infty) $$  

* For 95% one-sided intervals, $z_{\alpha} \approx 1.64$ (the 95th percentile of the standard Normal); we don’t use 1.96 because all 5% of the error probability sits in a single tail
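A small sketch of the one-sided bounds, reusing the HairCare numbers purely for illustration (the lecture does not compute these bounds itself). It assumes `statistics.NormalDist` for the Normal percentile.

```python
from math import sqrt
from statistics import NormalDist

x, n = 133, 200
p_hat = x / n
se = sqrt(p_hat * (1 - p_hat) / n)

# One-sided 95% bound: all 5% of the error goes in one tail,
# so the critical value is the 95th percentile (about 1.645)
z_one_sided = NormalDist().inv_cdf(0.95)

upper_bound = p_hat + z_one_sided * se   # interval is (-inf, upper_bound)
lower_bound = p_hat - z_one_sided * se   # interval is (lower_bound, inf)
print(f"z = {z_one_sided:.3f}, upper = {upper_bound:.3f}, lower = {lower_bound:.3f}")
```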

## Comparing Two Means (Large Sample)

**Example** We want to compare the average age among two populations:

<img src="images/age_groups.png" width="700">


### Confidence Interval for $\mu_1 - \mu_2$

We have two **independent** samples:  


$X_1, X_2, \dots, X_{n_1}$ from population 1  

$Y_1, Y_2, \dots, Y_{n_2}$ from population 2  


Assuming both $n_1$ and $n_2$ are **bigger** than 30,  
the 95% confidence interval for $\mu_1 - \mu_2$ is given by  

$$ \bar{X} - \bar{Y} \pm 1.96 \sqrt{\frac{s_X^2}{n_1} + \frac{s_Y^2}{n_2}} $$  

**Note**: here we are using 1.96 since the t-distribution is very close to the standard Normal distribution when the sample size is large. In general, the computer uses the t-distribution.

**Example**  

* A survey found that the average hotel room rate in New Orleans is \$88.42 and the average room rate in Phoenix is \$80.61.
* Assume that the data were obtained from two samples of 50 hotels each and that the standard deviations are \$5.62 and \$4.83, respectively.
* <details><summary>Construct a confidence interval for the difference in the rates.
</summary>$$ \begin{align*}
     \bar{X} - \bar{Y} \pm 1.96 \sqrt{\frac{s_X^2}{n_1} + \frac{s_Y^2}{n_2}} 
     &= (88.42 - 80.61) \pm 1.96 \sqrt{\frac{5.62^2}{50} + \frac{4.83^2}{50}} \\
     &= 7.81 \pm 1.96(1.047) \\
     &= (5.75, 9.86)
   \end{align*} $$</details>
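The hotel interval can be reproduced from the summary statistics alone. This is a minimal sketch with illustrative variable names; it uses the exact 97.5th Normal percentile rather than the rounded 1.96, so the last digit may differ slightly from a hand calculation.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics from the example: New Orleans (x) and Phoenix (y)
mean_x, sd_x, n_x = 88.42, 5.62, 50
mean_y, sd_y, n_y = 80.61, 4.83, 50

diff = mean_x - mean_y
se = sqrt(sd_x**2 / n_x + sd_y**2 / n_y)   # unpooled standard error
z_crit = NormalDist().inv_cdf(0.975)       # about 1.96

ci = (diff - z_crit * se, diff + z_crit * se)
print(f"difference = {diff:.2f}, SE = {se:.3f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```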

### Hypothesis Testing for Difference in Means (Large Sample)

We can use two methods to test whether there is a difference in two population means.  

* Confidence interval
* p-value

#### Confidence Interval Method  

First, we need to state our null and alternative hypotheses:

$H_0: \mu_1 = \mu_2 \text{ or } \mu_1 - \mu_2 = 0$  

$H_a: \mu_1 \neq \mu_2 \text{ or } \mu_1 - \mu_2 \neq 0$


Then, we construct the confidence interval  

$$ \bar{X} - \bar{Y} \pm 1.96 \sqrt{\frac{s_X^2}{n_1} + \frac{s_Y^2}{n_2}} $$  

We make our conclusion by checking if 0 falls within the interval - if it does, we fail to reject the null, otherwise we reject it.

**Example**

* A survey found that the average hotel room rate in New Orleans is \$88.42 and the average room rate in Phoenix is \$80.61.
* Assume that the data were obtained from two samples of 50 hotels each and that the standard deviations are \$5.62 and \$4.83, respectively.
* <details><summary>Test if the true average hotel room rates are different between the two cities.
</summary>$H_0: \mu_1 = \mu_2$  
$H_a: \mu_1 \neq \mu_2$  
We have calculated the confidence interval: (5.75, 9.86).  
The interval does not span 0, so we reject the null hypothesis. There is sufficient evidence to conclude that the average room rates in New Orleans and Phoenix are different.
</details>

#### Using the P-value  

The test statistic is given by  

$$ t_{stat} = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{s_X^2}{n_1} + \frac{s_Y^2}{n_2}}} $$  

<!--
Since we are assuming large sample sizes, we will find the p-value from the standard Normal distribution (as an approximation).

$$ \text{p-value} = P(Z < -|t_{stat}| \text{ or } Z > |t_{stat}|) = 2 \times P(Z < -|t_{stat}|) $$  
--
**Note**: 
* the computer always uses the t-distribution regardless of the sample size.
* It is a **t-based** test, but we are using the standard Normal, $Z$, as an approximation
-->

<!--
In the case of one sided test, we obtain the p-value as follows:

* For $H_a: \mu_1 > \mu_2 \text{ or } \mu_1 - \mu_2 > 0$

$$ \text{p-value} = P(Z > t_{stat}) $$  

* For $H_a: \mu_1 < \mu_2 \text{ or } \mu_1 - \mu_2 < 0$

$$ \text{p-value} = P(Z < t_{stat}) $$
-->

Going back to our hotel example,

$$ t_{stat} = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{s_X^2}{n_1} + \frac{s_Y^2}{n_2}}} = \frac{88.42 - 80.61}{\sqrt{\frac{5.62^2}{50} + \frac{4.83^2}{50}}} = 7.45 $$  

In [None]:
stats.ttest_ind_from_stats?

$$ \text{p-value} = P(t < -7.45 \text{ or } t > 7.45) = 2 \times P(t < -7.45) = 3.67 \times 10^{-11} < 0.05 $$

We have extremely strong evidence to conclude that the average hotel room rates in New Orleans and Phoenix are different.
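One way to get the test statistic and p-value from summary statistics is `stats.ttest_ind_from_stats`. The sketch below passes `equal_var=False` (an assumption on my part, chosen because the formula above uses the unpooled standard error); with equal sample sizes the statistic is the same either way, and only the degrees of freedom, and hence the exact p-value, shift slightly.

```python
from scipy import stats

# Summary statistics from the hotel example
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=88.42, std1=5.62, nobs1=50,   # New Orleans
    mean2=80.61, std2=4.83, nobs2=50,   # Phoenix
    equal_var=False,                    # unpooled (Welch) standard error
)
print(f"t = {t_stat:.2f}, two-sided p-value = {p_value:.2e}")
```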

## Comparing Two Means (Small Sample)

If our sample sizes are small, we can no longer use the standard Normal distribution as an approximation, because the t-distribution looks quite different from the standard Normal when the degrees of freedom are small.

### Confidence Intervals for Difference in Means  

When sample sizes are small, the 95% confidence interval for $\mu_1 - \mu_2$ is given by  

$$ \bar{X} - \bar{Y} \pm t \sqrt{\frac{s_X^2}{n_1} + \frac{s_Y^2}{n_2}} $$

where $t$ is the 97.5th percentile of the corresponding t-distribution.  

What is the $df$ of the $t$-distribution?

Finding the **degrees of freedom** of the t-distribution gets a bit tricky:  

* One way is to take $df = \min(n_1 - 1, n_2 - 1)$
    * We will use this if we have to calculate the CI by hand   
    
    
* Another way is to use the data to approximate the degrees of freedom (the computer usually uses this) 

$$ df = \frac{\left(\frac{s_X^2}{n_1} + \frac{s_Y^2}{n_2} \right)^2}{\frac{\left(\frac{s_X^2}{n_1}\right)^2}{n_1 - 1} + \frac{\left(\frac{s_Y^2}{n_2}\right)^2}{n_2 - 1}} \text{ Yikes!}$$

Thankfully, `statsmodels` will do this for us.
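For intuition, the two choices of degrees of freedom can be compared directly. This is a minimal sketch with made-up summary statistics (the standard deviations and sample sizes are hypothetical, not course data), and `welch_df` is an illustrative helper name, not a library function.

```python
def welch_df(s_x, n_x, s_y, n_y):
    """Approximate degrees of freedom from the formula above."""
    a = s_x**2 / n_x          # s_X^2 / n_1
    b = s_y**2 / n_y          # s_Y^2 / n_2
    return (a + b)**2 / (a**2 / (n_x - 1) + b**2 / (n_y - 1))

# Hypothetical small samples
s_x, n_x = 5.0, 10
s_y, n_y = 3.0, 12

df_conservative = min(n_x - 1, n_y - 1)     # the by-hand rule
df_approx = welch_df(s_x, n_x, s_y, n_y)    # the data-driven approximation
print(df_conservative, round(df_approx, 2))
```

The approximate value always lies between the conservative choice and $n_1 + n_2 - 2$, which is why the by-hand rule gives slightly wider intervals.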

### Using the P-value

$H_0: \mu_1 = \mu_2 \text{ or } \mu_1 - \mu_2 = 0$  

$H_a: \mu_1 \neq \mu_2 \text{ or } \mu_1 - \mu_2 \neq 0$

The test statistic is the same:

$$ t_{stat} = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{s_X^2}{n_1} + \frac{s_Y^2}{n_2}}} $$  

When we calculate the p-value, instead of finding the approximate probability from a standard Normal, we need to find the exact probability from the t-distribution with the right degrees of freedom. (**t-based**) 


$$ \text{p-value} = P(t < -|t_{stat}| \text{ or } t > |t_{stat}|) = 2 \times P(t < -|t_{stat}|) $$  

where the t-distribution has degrees of freedom defined above.

For a one-sided test, we obtain the p-value as follows:

* For $H_a: \mu_1 > \mu_2 \text{ or } \mu_1 - \mu_2 > 0$

$$ \text{p-value} = P(t > t_{stat}) $$  

* For $H_a: \mu_1 < \mu_2 \text{ or } \mu_1 - \mu_2 < 0$

$$ \text{p-value} = P(t < t_{stat}) $$

**Example**

* The residents of Cambridge complain that traffic speeding fines given in their city are higher than the traffic speeding fines that are given in nearby Sommerville.
* The assistant to the county manager agreed to study the problem and to indicate if the complaints were reasonable. Independent random samples of the amounts paid by residents for speeding tickets in each of the two cities over the last three months were obtained.

In [None]:
cambridge = pd.read_sql("SELECT fine FROM fines WHERE city = 'Cambridge'", engine)['fine']
sommerville = pd.read_sql("SELECT fine FROM fines WHERE city = 'Sommerville'", engine)['fine']

First, we need to state our hypotheses:

$H_0: \mu_1 = \mu_2$  

$H_a: \mu_1 > \mu_2$

In [None]:
ttest_ind?

So our p-value $= 4.60 \times 10^{-6} < 0.05$, and we conclude that there is sufficient evidence to support the Cambridge residents' complaint about the difference in speeding fines between the two cities.
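The fines data live in the course database, so the following sketch uses made-up numbers to show the mechanics of the one-sided test by hand. The two lists are hypothetical and will not reproduce the p-value reported above. I believe `ttest_ind(cambridge, sommerville, alternative='larger', usevar='unequal')` from `statsmodels` performs the same calculation, but treat that as an assumption to verify against its help text.

```python
from math import sqrt
from statistics import mean, variance
from scipy import stats

# Hypothetical fines in dollars, NOT the contents of the `fines` table
cambridge = [120, 135, 150, 110, 145, 160, 130, 155]
sommerville = [100, 95, 115, 105, 90, 110, 120]

n1, n2 = len(cambridge), len(sommerville)
v1 = variance(cambridge) / n1      # s_X^2 / n_1 (sample variance, n - 1 divisor)
v2 = variance(sommerville) / n2    # s_Y^2 / n_2

t_stat = (mean(cambridge) - mean(sommerville)) / sqrt(v1 + v2)
df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# H_a: mu_1 > mu_2, so the p-value is the upper tail of the t-distribution
p_value = stats.t.sf(t_stat, df)
print(f"t = {t_stat:.2f}, df = {df:.1f}, one-sided p-value = {p_value:.4f}")
```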

### Matched Pairs Test

* Another two sample problem is matched pairs.
* The easiest example is a group of people who decided to try WeightWatchers.
* You have their before and after weights, say, after two months of dieting.
* In this case we again have two different samples, but they are not independent - they are matched.

First, take the difference $d = X_1 - X_2$ for each pair.  

Then carry out a one sample hypothesis test on the mean difference $\mu_d$. (**t-based**)  

$H_0: \mu_d = 0$  

$H_a: \mu_d \neq 0$
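A minimal sketch of the matched pairs procedure, using hypothetical before and after weights (invented for illustration). It runs a one sample t-test on the differences with `stats.ttest_1samp`; `stats.ttest_rel` on the two original lists should give the same result.

```python
from scipy import stats

# Hypothetical weights (kg) for the same six people, before and after the diet
before = [82.0, 95.5, 78.2, 101.3, 88.7, 91.0]
after = [80.1, 93.0, 78.5, 97.8, 86.9, 89.2]

# Step 1: one difference per matched pair
d = [b - a for b, a in zip(before, after)]

# Step 2: one sample t-test of H0: mean difference = 0 (two-sided)
t_stat, p_value = stats.ttest_1samp(d, popmean=0)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
```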

## Testing Difference in Proportions

Just like the one sample case we saw yesterday, instead of testing the difference in means, we might want to test the difference in population proportions:

* The difference in unemployment rates between Republicans and Democrats  

* The difference in percentage of households that plan to buy a car during the next year in northeastern states versus southeastern states.

**Example**  

* We have two **independent** samples from populations of interest.

* Ask them the same question;
    * Do you like pandas? (for example)
    
* From each sample we calculate the proportion of yes’s:

<img src="images/groups_prop.png" width="700">

### The Confidence Interval for Difference in Proportions

By the CLT, the 95% confidence interval for $p_1 - p_2$ is given by  

$$ (\hat{p}_1 - \hat{p}_2) \pm 1.96 \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} $$
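To make the formula concrete, here is a minimal sketch that plugs in the survey numbers from the death penalty example later in this lecture (70% of 1500 in 1982, 61% of 1500 in 1987). The lecture does not compute this interval itself, so treat the result as an illustration. Note that the interval uses the unpooled standard error.

```python
from math import sqrt
from statistics import NormalDist

p1, n1 = 0.70, 1500   # 1982 sample
p2, n2 = 0.61, 1500   # 1987 sample

diff = p1 - p2
# Unpooled standard error: each sample contributes its own p_hat * (1 - p_hat) / n
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_crit = NormalDist().inv_cdf(0.975)

ci = (diff - z_crit * se, diff + z_crit * se)
print(f"95% CI for p1 - p2: ({ci[0]:.3f}, {ci[1]:.3f})")
```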

### Testing Two Proportions

First, state the hypotheses:

$H_0: p_1 = p_2$ or $p_1 - p_2 = 0$  

$H_a: p_1 \neq p_2$ or $p_1 - p_2 \neq 0$

The test statistic for testing two proportions is given by

$$ z_{stat} = \frac{(\hat{p}_1 - \hat{p}_2)}{\sqrt{\hat{p}(1 - \hat{p}) \left(\frac{1}{n_1} + \frac{1}{n_2} \right)}} $$

where $\hat{p}$ is the pooled sample proportion

$$ \hat{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} $$

We calculate the p-value by comparing the test statistic to a standard Normal distribution (**z-based**),  

$$ \text{p-value} = P(Z < -|z_{stat}| \text{ or } Z > |z_{stat}|) = 2 \times P(Z < -|z_{stat}|) $$ 

For a one-sided test, we obtain the p-value as follows:

* For $H_a: p_1 > p_2 \text{ or } p_1 - p_2 > 0$

$$ \text{p-value} = P(Z > z_{stat}) $$  

* For $H_a: p_1 < p_2 \text{ or } p_1 - p_2 < 0$

$$ \text{p-value} = P(Z < z_{stat}) $$

**Example**

In July 1987 the Canadian parliament debated the reinstatement of the death penalty. One of the factors in this debate was the amount of public support for the death penalty. In 1982, a sample of 1500 Canadians revealed that 70% favored the death penalty. In 1987, 61% in a sample of 1500 supported the death penalty. Do these data provide sufficient evidence at the 5% significance level to indicate that support has fallen between 1982 and 1987?

We want to test

$$ H_0: p_{82} = p_{87} \text{ vs } H_a: p_{82} > p_{87} $$

The test statistic is 

$$ z_{stat} = \frac{(0.70 - 0.61)}{\sqrt{0.655 (1 - 0.655) \left( \frac{1}{1500} + \frac{1}{1500} \right)}} = 5.18 $$

where

$$ \hat{p} = \frac{1500 (0.70) + 1500 (0.61)}{1500 + 1500} = 0.655 $$

$$ \text{p-value} = P(Z > 5.18) = 1.1 \times 10^{-7} < 0.05 $$

In [None]:
1 - stats.norm.cdf(5.18)

Hence, at the 5% significance level, we conclude that support for the death penalty declined from 1982 to 1987.
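The whole calculation for this example can be sketched end to end from the sample sizes and proportions. The variable names are illustrative, and `statistics.NormalDist` is used here as a stand-in for `stats.norm`. Evaluating the lower tail at `-z_stat` is equivalent to `1 - cdf(z_stat)` but avoids subtracting two nearly equal numbers.

```python
from math import sqrt
from statistics import NormalDist

n_82, p_82 = 1500, 0.70   # 1982 sample
n_87, p_87 = 1500, 0.61   # 1987 sample

# Pooled proportion: under H0 both samples estimate the same p
p_pool = (n_82 * p_82 + n_87 * p_87) / (n_82 + n_87)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_82 + 1 / n_87))

z_stat = (p_82 - p_87) / se
# H_a: p_82 > p_87, so the p-value is the upper tail P(Z > z_stat)
p_value = NormalDist().cdf(-z_stat)
print(f"pooled p = {p_pool:.3f}, z = {z_stat:.2f}, p-value = {p_value:.1e}")
```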

## A/B Testing

A/B testing (sometimes called split testing) compares two versions of a web page to see which one performs better.   

* You compare two web pages by showing the two variants (let's call them A and B) to similar visitors at the same time.  

* The one that gives a better conversion rate wins.

<img src="images/ab_test.png" width="600">

A/B testing is essentially a form of two sample hypothesis testing.
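To illustrate the connection, here is a minimal sketch of an A/B test framed as a two-sided test of two proportions. The conversion counts are hypothetical, and `two_proportion_ztest` is an illustrative helper written for this example, not a library function.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(x_a, n_a, x_b, n_b):
    """Two-sided pooled z-test of H0: p_a = p_b. Returns (z, p_value)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)            # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se                          # positive when B converts better
    return z, 2 * NormalDist().cdf(-abs(z))

# Hypothetical experiment: conversions out of visitors for each variant
z, p_value = two_proportion_ztest(x_a=200, n_a=5000, x_b=260, n_b=5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value here would suggest the difference in conversion rates is unlikely to be due to chance alone; whether the difference is large enough to matter is a separate, practical question.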