In [1]:
from datascience import *
import numpy as np
from math import *
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

## Lesson 24: Hypothesis Testing Errors & Power

Dr. Warner Notes:

Court: Not guilty = fail to reject

type 1: null is true, but you reject it. Send an innocent person to the big house.
probability of type 1 is alpha (5%), or beyond a reasonable doubt

type 2: null is false, but you fail to reject it. OJ Simpson.


Power is 1 minus the probability of a type 2 error. You want power to be high, because it is the probability that we reject the null when the alternative is true.

For a fixed sample size, making a type 1 error less likely makes a type 2 error more likely. It is a zero sum game.

Shift tab brings up documentation for commands

Throughout this block, we have been studying hypothesis tests. We have covered the four basic steps of any hypothesis test, and we have practiced various methods for obtaining the distribution of our test statistic under the null hypothesis. 

After we have reached a conclusion (reject or fail to reject), we must consider possible errors. 

### Type I error 

Type I error is the event that we rejected the null hypothesis when the null hypothesis was actually true. Type I error is also known as a false positive. The probability of a Type I error is usually defined by the threshold used for rejection. A common threshold is 0.05. Those of you who have taken statistics before may recognize this value as $\alpha$. 

### Type II error

Type II error is the event that we failed to reject the null hypothesis when the null hypothesis was actually false. This is otherwise known as a false negative. The probability of a Type II error is harder to find and requires a more in-depth analysis of a hypothesis test. The probability of a Type II error is often given as $\beta$, and $1-\beta$ is referred to as **Power**. The power of a test is the probability that we will reject the null hypothesis when it is actually false.
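The definitions above can be written compactly as conditional probabilities. This is only a restatement in symbols; "$H_0$ false" stands in for whichever specific alternative is assumed to be true.

$$
\alpha = P(\text{reject } H_0 \mid H_0 \text{ true}), \qquad
\beta = P(\text{fail to reject } H_0 \mid H_0 \text{ false}), \qquad
\text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ false})
$$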

Which one of these errors is more serious? It depends on the context of the problem. 

### Example: Golf Balls

Joe has a summer job at a golf course and one of his jobs is to fish out golf balls from the water traps. He has a theory that certain types of golf ball are more likely to end up in the water than others. Let's assume there are four brands of golf ball, and let's assume that all four are used equally at this golf course. He fishes out 100 golf balls and counts each brand. He finds 30 of brand A, 30 of brand B, 20 of brand C and 20 of brand D. Conduct a hypothesis test to determine whether certain types of golf ball are more likely than others to end up in the water.

Step 1: Hypotheses

Null: All brands are equally likely to end up in the pond. (We would expect to find 25 of each.)

Alternative: Some of the brands are found more than others. At least one is different from the rest.


Step 2: Test statistic

There are many correct answers, but let's go with the sum of the absolute differences between the observed and expected counts under $H_0$. To do this, we need to find the expected counts. If each ball was equally likely, how many should we expect to find of each if we select 100 golf balls? 

In [2]:
25

25

Step 3: $p$-value

We need the distribution of the test statistic under $H_0$. 

The observed test statistic, the sum of the absolute differences, is $2|30-25| + 2|20-25| = 20$.
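As a quick check on the arithmetic, the same statistic can be computed with NumPy. This is a minimal sketch; the names `observed`, `expected`, and `observed_stat` are illustrative and do not appear elsewhere in these notes.

```python
import numpy as np

# Observed counts for brands A, B, C, D (from the problem statement).
observed = np.array([30, 30, 20, 20])

# Under H0 every brand is equally likely, so each expected count is n / 4.
n = observed.sum()
expected = np.full(4, n / 4)

# Test statistic: sum of absolute differences between observed and expected.
observed_stat = np.abs(observed - expected).sum()
print(expected)
print(observed_stat)
```

The expected counts come out to 25 each and the statistic to 20, matching the hand calculation.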

In [3]:
#This is how you set up a simulation. This is not a binomial, so I went with the simulation method.
Brand = ['A','B','C','D']
T = []

for n in np.arange(10000):
    X = np.random.choice(Brand,100,replace=True)
    T = np.append(T,abs(sum(X == 'A')-25) + abs(sum(X == 'B')-25) + abs(sum(X == 'C')-25) + abs(sum(X == 'D')-25))
    
T

array([ 8.,  8.,  8., ..., 14., 16., 12.])

You can clean up the code by looking at his functions. Note: stats.multinomial is like the binomial, but with 4 possible outcomes instead of 2.
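One possible way to tidy the loop above is to draw all of the brand counts at once from a multinomial distribution. The sketch below uses NumPy's `Generator.multinomial` rather than `stats.multinomial`, so it is an alternative and not necessarily the instructor's version; the seed and the names `rng`, `counts`, and `T_null` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(24)  # seed is arbitrary, chosen for reproducibility

n_balls = 100
n_sims = 10_000
null_probs = [0.25, 0.25, 0.25, 0.25]  # H0: all four brands equally likely

# Each row holds the counts of brands A-D in one simulated sample of 100 balls.
counts = rng.multinomial(n_balls, null_probs, size=n_sims)

# Sum of absolute differences from the expected count of 25, one value per row.
T_null = np.abs(counts - n_balls / 4).sum(axis=1)

# Estimated p-value: share of simulated statistics at least as large as the observed 20.
p_value = np.mean(T_null >= 20)
print(p_value)
```

Because every row is generated in a single call, there is no need to grow an array with `np.append` inside a loop. The estimated p-value should land close to the one found with the loop, though it will not match exactly because the random draws differ.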

In [4]:
# Now the problem is to determine what the probability is. It probably should not be that it is exactly 20, but
#that it is 20 or greater, because it is my data or more extreme in all cases.
#The abs difference has turned a two-sided problem into a one-sided one

p = sum(T>19)/10000
p

0.1879

Step 4: Conclude

We fail to reject $H_0$. The p-value of about 0.19 is well above the usual 0.05 threshold, so data like ours are too likely to occur when all brands are equally likely.

What kind of error could we have made in this case? 

We failed to reject, so we could have made a Type 2 error.

#### Power 
Suppose that, in truth, 30% of the balls found in the water were brand A, 30% were brand B, 20% were brand C and 20% were brand D. In this case, our collected sample reflected this truth perfectly. However, our hypothesis test failed to recognize this deviation from equal proportions. We made a type II error. This is because this test has fairly low power. Use simulation to determine the power of this test. 

Dr. Warner in class notes:

He gave us the actual distribution given the alternative is true: stats.multinomial(n=100, p=[.3,.3,.2,.2]).
When power comes out too low and we fail to reject, we simply need more data.

I am looking for the probability that I reject the null hypothesis given the true proportions laid out above. Well, first I need to figure out for what values of my test statistic I would reject $H_0$. 

In [19]:
# Just pull down the T data from above.
def P_val(X):
    return sum(T>=X)/len(T)

print (P_val(25))
#play around with the value until you get the first output less than 0.05 (our alpha)
print('All values greater than or equal to 25 would cause us to reject the null')

0.0421
All values greater than or equal to 25 would cause us to reject the null
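Instead of trying values by hand, the cutoff can be searched for directly. The sketch below assumes a significance level of 0.05 and re-simulates the null distribution so that it runs on its own; `critical_value` is a hypothetical helper, not something defined elsewhere in these notes.

```python
import numpy as np

def critical_value(null_stats, alpha=0.05):
    """Smallest simulated statistic c with estimated P(T >= c) <= alpha."""
    for c in np.unique(null_stats):  # np.unique returns the values sorted
        if np.mean(null_stats >= c) <= alpha:
            return c
    return None  # no simulated value is extreme enough

rng = np.random.default_rng(24)  # arbitrary seed
counts = rng.multinomial(100, [0.25] * 4, size=10_000)
T_null = np.abs(counts - 25).sum(axis=1)

cutoff = critical_value(T_null)
print(cutoff, np.mean(T_null >= cutoff))
```

Because the deviations from 25 always sum to zero, this test statistic can only take even values. A cutoff of 25 and a cutoff of 26 therefore describe the same rejection region.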


Next, I need to simulate from the true population and determine how often my test statistic would have met this threshold. 

In [14]:
#true_pop = stats.multinomial(100,[.3, .3, .2, .2]) I am not sure how this works

#Here I will try to simulate again, this time drawing from a pool of 30 A, 30 B, 20 C 20 D

Weighted_Brand = np.repeat('A',30)
Weighted_Brand = np.append(Weighted_Brand,np.repeat('B',30))
Weighted_Brand = np.append(Weighted_Brand,np.repeat('C',20))
Weighted_Brand = np.append(Weighted_Brand,np.repeat('D',20))
stat = []
for n in np.arange(10000):
    dist = np.random.choice(Weighted_Brand,100)
    stat = np.append(stat,abs(sum(dist == 'A')-25) + abs(sum(dist == 'B')-25) + abs(sum(dist == 'C')-25) + abs(sum(dist == 'D')-25))

sum(stat>=25)/10000

0.3423

$\beta$ is the probability of a type 2 error, that we would fail to reject even though the null is not true. $\beta$ is the complement of power, so $\beta = 1 - \text{power}$. Above, we found the power, the probability that we would reject given that the null was false (here described by the given distribution). Our power is about 34.2%, so $\beta$ is about 65.8%.
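The power calculation can also be sketched with a direct multinomial draw from the assumed true proportions. The cutoff of 25 is taken from the null simulation earlier in these notes; the seed and the names `alt_counts`, `T_alt`, `power`, and `beta` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(24)  # arbitrary seed
n_balls, n_sims = 100, 10_000
cutoff = 25  # rejection threshold found from the null simulation in these notes

# Simulate samples from the assumed true proportions (the alternative).
alt_counts = rng.multinomial(n_balls, [0.3, 0.3, 0.2, 0.2], size=n_sims)

# The statistic still compares against the counts expected under H0 (25 each).
T_alt = np.abs(alt_counts - n_balls / 4).sum(axis=1)

power = np.mean(T_alt >= cutoff)  # estimated P(reject H0 | alternative true)
beta = 1 - power                  # estimated P(type 2 error)
print(power, beta)
```

The estimate should come out near the value from the loop, roughly one in three, with the exact figure depending on the random draws.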

What do you think about this power? 

This power is low. Given a random sample from this distribution, we would fail to reject the null about two-thirds of the time, even though the null is false.

Repeat this power calculation, but assume Joe collects 500 balls instead of 100. Note that you will have to obtain a new critical value. What does this tell you about power and sample size?

In [7]:
#This will give us the new value assuming the null is true with a sample size of 500, find the p value
#Note, Brand is the same as last time, equal probability, simulation:
statistic = []
for n in np.arange(10000):
    X = np.random.choice(Brand,500)
    statistic = np.append(statistic,abs(sum(X == 'A')-125) + abs(sum(X == 'B')-125) + abs(sum(X == 'C')-125) + abs(sum(X == 'D')-125))
    
# Just for curiosity's sake, what is the probability that we would have the same distribution given that the null was true:
#Note, the differences would be 150-125 and 125-100, resulting in a test statistic of 2(25) + 2(25) = 100.
sum(statistic>=100)/10000
#This came out at practically 0 percent

0.0

In [22]:
#Find the new cutoff
print(sum(statistic>=55)/10000)
print('55 is the new cutoff for our test statistic where we would reject the null')

0.0441
55 is the new cutoff for our test statistic where we would reject the null


In [None]:
#Assuming a 30-30-20-20, what is the probability we would reject?
Weighted_Brand = np.repeat('A',150)
Weighted_Brand = np.append(Weighted_Brand,np.repeat('B',150))
Weighted_Brand = np.append(Weighted_Brand,np.repeat('C',100))
Weighted_Brand = np.append(Weighted_Brand,np.repeat('D',100))
stat = []
for n in np.arange(10000):
    dist = np.random.choice(Weighted_Brand,500)
    stat = np.append(stat,abs(sum(dist == 'A')-125) + abs(sum(dist == 'B')-125) + abs(sum(dist == 'C')-125) + abs(sum(dist == 'D')-125))

Power = sum(stat>=55)/10000
Power

Note that our power has increased to near 100%. A type 2 error would be extremely unlikely given a sample size of 500 and a 30-30-20-20 distribution. With the type 1 error rate held at about 5%, collecting more data makes a type 2 error less likely, so power grows with sample size.
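To see the relationship between power and sample size more directly, the two steps (find the cutoff under the null, then apply it under the alternative) can be wrapped in a function and run for several sample sizes. This is a rough sketch assuming $\alpha = 0.05$; `simulate_stats` and `estimate_power` are hypothetical helpers, and the sample size of 200 is added only for illustration.

```python
import numpy as np

def simulate_stats(rng, n_balls, probs, n_sims=10_000):
    """Simulated sums of absolute deviations from equal expected counts."""
    counts = rng.multinomial(n_balls, probs, size=n_sims)
    return np.abs(counts - n_balls / len(probs)).sum(axis=1)

def estimate_power(rng, n_balls, alt_probs, alpha=0.05):
    """Pick the cutoff under H0, then see how often the alternative exceeds it."""
    k = len(alt_probs)
    T_null = simulate_stats(rng, n_balls, [1 / k] * k)
    # Smallest cutoff whose estimated type 1 error rate is at most alpha.
    cutoff = min(c for c in np.unique(T_null) if np.mean(T_null >= c) <= alpha)
    T_alt = simulate_stats(rng, n_balls, alt_probs)
    return np.mean(T_alt >= cutoff)

rng = np.random.default_rng(24)  # arbitrary seed
powers = {n: estimate_power(rng, n, [0.3, 0.3, 0.2, 0.2]) for n in (100, 200, 500)}
for n, pw in powers.items():
    print(n, pw)
```

The estimated power should rise as the sample size goes from 100 to 200 to 500, while the cutoff is re-chosen each time so that the type 1 error rate stays at or below 5%.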