In [10]:
from datascience import *
import numpy as np
from math import *
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

## Lesson 24: Hypothesis Testing Errors & Power

Throughout this block, we have been studying hypothesis tests. We have covered the four basic steps of any hypothesis test, and we have practiced various methods for obtaining the distribution of our test statistic under the null hypothesis. 

After we have reached a conclusion (reject or fail to reject), we must consider possible errors. 

### Type I error 

Type I error is the event that we reject the null hypothesis when the null hypothesis is actually true. Type I error is also known as a false positive. The probability of a Type I error is set by the threshold used for rejection (the significance level). A common threshold is 0.05. Those of you who have taken statistics before may recognize this value as $\alpha$. 

### Type II error

Type II error is the event that we fail to reject the null hypothesis when the null hypothesis is actually false. This is otherwise known as a false negative. The probability of a Type II error is harder to find and requires a more in-depth analysis of a hypothesis test. It is often denoted $\beta$, and $1-\beta$ is referred to as **power**. The power of a test is the probability that we reject the null hypothesis when it is in fact false. 

Which one of these errors is more serious? It depends on the context of the problem. 
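
To make $\alpha$, $\beta$, and power concrete, here is a minimal sketch using a hypothetical coin-flip test that is not part of this lesson's example. It assumes we flip a coin 100 times and reject $H_0: p = 0.5$ whenever the number of heads is at least 10 away from 50. The cutoff of 10 and the alternative $p = 0.6$ are illustrative choices only.

```python
from scipy import stats

n = 100        # number of flips (hypothetical)
cutoff = 10    # reject H0 when |heads - 50| >= cutoff (illustrative choice)

def reject_prob(p):
    """P(|heads - 50| >= cutoff) when the true probability of heads is p."""
    lower = stats.binom.cdf(50 - cutoff, n, p)      # P(heads <= 40)
    upper = stats.binom.sf(50 + cutoff - 1, n, p)   # P(heads >= 60)
    return lower + upper

alpha = reject_prob(0.5)   # Type I error probability: H0 is true
power = reject_prob(0.6)   # power: H0 is false, true p = 0.6 (assumed)
beta = 1 - power           # Type II error probability

print(alpha, beta, power)
```

The same rejection rule produces both numbers: $\alpha$ comes from evaluating it when $H_0$ is true, and power comes from evaluating it under a specific alternative.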

### Example: Golf Balls

Joe has a summer job at a golf course, and one of his tasks is to fish golf balls out of the water traps. He has a theory that certain types of golf ball are more likely to end up in the water than others. Let's assume there are four brands of golf ball, and that all four are used equally at this golf course. He fishes out 100 golf balls and counts each brand. He finds 30 of brand A, 30 of brand B, 20 of brand C, and 20 of brand D. Conduct a hypothesis test to determine whether certain types of golf ball are more likely than others to end up in the water.

Step 1: Hypotheses

$H_0$: $P(A) = P(B) = P(C) = P(D)$  
$H_a$: at least one $P$ is different

Step 2: Test statistic

There are many valid choices, but let's use the sum of the absolute differences between the observed and expected counts under $H_0$. To do this, we need to find the expected counts. If each brand were equally likely, how many balls of each brand should we expect to find among 100 golf balls? 

In [11]:
# expected count for each brand under H0: 100 * 0.25 = 25

Step 3: $p$-value

We need the distribution of the test statistic under $H_0$. 

In [12]:
x = abs(25-30)+abs(25-30)+abs(25-20)+abs(25-20)
x

20

In [13]:
# multinomial
my_multi=stats.multinomial(n=100,p=[.25,.25,.25,.25])

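In [24]:
# want the probability of a result at least as extreme as ours, assuming H0 is true (p-value)
# test statistic from Step 2 (this definition is assumed from that description)
def test_stat(o, e):
    return np.sum(np.abs(np.array(o) - e))

e = 25           # expected count per brand under H0
obs = 20         # observed test statistic
num_sim = 10000  # assumed number of simulations

In [25]:
temp = my_multi.rvs(num_sim)  # each row is one simulated sample of 100 balls under H0
results = np.array(list(map(lambda x:test_stat(x,e),temp)))

In [26]:
# note obs=20
np.count_nonzero(results>=obs)/num_sim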

Step 4: Conclude

0.18383
The p-value is high compared to an alpha of 0.05, so there is insufficient evidence against our null hypothesis. Therefore, we fail to reject the null.

What kind of error could we have made in this case? 

Since we failed to reject the null hypothesis, the only error we could have made is a Type II (false negative) error.

#### Power 
Suppose that, in truth, 30% of the balls found in the water were brand A, 30% were brand B, 20% were brand C and 20% were brand D. In this case, our collected sample reflected this truth perfectly. However, our hypothesis test failed to recognize this deviation from equal proportions. We made a Type II error. This happened because the test has fairly low power. Use simulation to determine the power of this test. 

I am looking for the probability that I reject the null hypothesis given the true proportions laid out above. Well, first I need to figure out for what values of my test statistic I would reject $H_0$. 

In [None]:
# figure out for what values reject null
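
One way to find the rejection region is to simulate the test statistic under $H_0$ and take the smallest value whose upper-tail probability is at most $\alpha$. The sketch below assumes $\alpha = 0.05$, 10,000 simulations, and a fixed seed. The names `null_stats` and `crit` are illustrative and do not appear in the original cells.

```python
import numpy as np
from scipy import stats

alpha = 0.05
num_sim = 10000

# Simulate samples of 100 balls when all four brands are equally likely (H0).
null_counts = stats.multinomial(100, p=[0.25, 0.25, 0.25, 0.25]).rvs(
    size=num_sim, random_state=29
)
# Test statistic for each simulated sample: sum of |observed - 25|.
null_stats = np.abs(null_counts - 25).sum(axis=1)

# For each attainable value c, estimate P(statistic >= c) under H0.
candidates = np.unique(null_stats)
tails = np.array([np.mean(null_stats >= c) for c in candidates])

# Critical value: smallest c with estimated tail probability <= alpha.
crit = candidates[tails <= alpha].min()
print(crit, np.mean(null_stats >= crit))
```

We reject $H_0$ whenever the observed statistic is at least `crit`. Because the counts always sum to 100, the statistic only takes even values, so the estimated tail probability at `crit` can fall noticeably below $\alpha$.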


Next, I need to simulate from the true population and determine how often my test statistic would have met this threshold. 

In [None]:
my_multinomial = stats.multinomial(100, p=[.3,.3,.2,.2])

In [None]:
np.random.seed(29)
num_sim = 10000
results_alt = list(map(lambda x:test_stat(x,e),my___.rvs(num_sim)))

In [None]:
np.count_nonzero(np.array(results_alt)>=26)/num_sim
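
For reference, the power calculation can also be sketched as one self-contained block. It takes the critical value of 26 used in this lesson as given and assumes the true proportions stated earlier. The variable names `alt_counts`, `alt_stats`, and `power` are illustrative.

```python
import numpy as np
from scipy import stats

num_sim = 10000
crit = 26   # critical value used in this lesson for n = 100

# Simulate samples of 100 balls from the assumed true proportions.
true_dist = stats.multinomial(100, p=[0.3, 0.3, 0.2, 0.2])
alt_counts = true_dist.rvs(size=num_sim, random_state=29)

# Test statistic still compares each count with 25, the expected count under H0.
alt_stats = np.abs(alt_counts - 25).sum(axis=1)

# Power: fraction of simulated samples that would lead us to reject H0.
power = np.mean(alt_stats >= crit)
print(power)
```

Note that the samples come from the true proportions, while the expected counts and the critical value still come from $H_0$.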

What do you think about this power? 

...

Repeat this power calculation, but assume Joe collects 500 balls instead of 100. Note that you will have to obtain a new critical value. What does this tell you about power and sample size?

In [None]:
# Power = 1 - P(Type II error)

In [None]:
my_multinomial_2 = stats.multinomial(500, p=[.3,.3,.2,.2])

In [27]:
np.random.seed(29)
num_sim = 10000
# e must now be the expected count for 500 balls (125 per brand under H0)
results_alt = list(map(lambda x:test_stat(x,e),my_multinomial_2.rvs(num_sim)))

In [28]:
# 26 is the critical value for n = 100; replace it with the new critical value for n = 500
np.count_nonzero(np.array(results_alt)>=26)/num_sim

The power is a lot higher now. With the significance level and the true proportions held fixed, power increases as the sample size increases.
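
The comparison between 100 and 500 balls can be sketched end to end with one helper that recomputes the critical value for each sample size. The function name `simulated_power`, its parameters, and the seeds are assumptions made for illustration. The test statistic is the same sum of absolute differences used throughout this lesson.

```python
import numpy as np
from scipy import stats

def simulated_power(n, p_true, alpha=0.05, num_sim=10000, seed=29):
    """Estimate the critical value and power of the test by simulation."""
    k = len(p_true)
    expected = n / k   # expected count per brand under H0 (equal proportions)

    # Null distribution of the statistic, used to find the critical value.
    null_counts = stats.multinomial(n, [1 / k] * k).rvs(size=num_sim, random_state=seed)
    null_stats = np.abs(null_counts - expected).sum(axis=1)
    candidates = np.unique(null_stats)
    tails = np.array([np.mean(null_stats >= c) for c in candidates])
    crit = candidates[tails <= alpha].min()

    # Distribution of the statistic under the assumed true proportions.
    alt_counts = stats.multinomial(n, p_true).rvs(size=num_sim, random_state=seed + 1)
    alt_stats = np.abs(alt_counts - expected).sum(axis=1)
    return crit, np.mean(alt_stats >= crit)

p_true = [0.3, 0.3, 0.2, 0.2]
crit_100, power_100 = simulated_power(100, p_true)
crit_500, power_500 = simulated_power(500, p_true)
print(crit_100, power_100)
print(crit_500, power_500)
```

Recomputing the critical value inside the function matters: the statistic grows with the sample size, so a cutoff calibrated for 100 balls would not give a 5% Type I error rate for 500 balls.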