In [1]:
import random
import numpy as np

In [2]:
from scipy.stats import ttest_ind_from_stats
from scipy.stats import ttest_ind
import statsmodels.stats.api as sms
import scipy.stats as stats

In [3]:
from bokeh.io import show
from bokeh.plotting import figure
from bokeh.io import output_notebook
from bokeh.layouts import gridplot
output_notebook()

In [4]:
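def hist(l, density = True, bins = 10):
    # Count how many values fall into each bin.
    counts, edges = np.histogram(l, bins=bins)
    # With density=True, plot the fraction of values in each bin; otherwise plot raw counts.
    top = counts / counts.sum() if density else counts
    p = figure()
    p.quad(top = top, bottom=0, left=edges[:-1], right=edges[1:], alpha = .4)
    p.xaxis.axis_label = "P value"
    p.yaxis.axis_label = "Frequency"
    return p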

## Quantifying a p value, and type 1 errors

Recall that the p value is the probability of seeing a result at least as extreme as the one we observed, assuming the null hypothesis is true. Let's think about this another way.

A scientist conducts the following experiment. She has two groups, each with 30 people. She gives each group the exact same pill, a placebo. The placebo has no medicinal properties (it is just a sugar pill).

Because the pill does nothing, we should not see a real difference between the two groups. In other words, any difference we do see is due to chance.

After one week, the scientist measures the heart rate of each group. She records the results for each group, and then does resampling. She uses a confidence level of 95%, which means a p value threshold of .05. If the scientist did this experiment 100 times, how often would she find a significant difference? On average, about 5 times out of a hundred. In other words, the scientist would be fooled into thinking there was a real difference about 5 times.

If the scientist used a confidence level of 99% (a threshold of .01), she would be fooled about 1 time in a hundred. These errors are known as type 1 errors: concluding that there is a significant difference when there is not.

Let's see this principle in action:

In [5]:
"""Skip to below"""
def make_samples(samp_size = 100):
    # Both samples are drawn from the same normal distribution (mean 0, standard deviation 1).
    samp1 = [random.gauss(0, 1) for x in range(samp_size)]
    samp2 = [random.gauss(0, 1) for x in range(samp_size)]
    # Two-sample t-test; the null hypothesis is that the two means are equal.
    info = stats.ttest_ind(samp1, samp2)
    return info.pvalue

The above function creates two completely random sets of numbers, both drawn from the same distribution. It then tests whether the difference between them is significant. (Actually we are using a t-test here instead of resampling because it is quicker and easier; if you resampled, you would get nearly the same results.)
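To check the claim that resampling gives about the same answer as the math, here is a minimal sketch. It assumes that "resampling" means a permutation test on the difference in means; the function name `permutation_pvalue`, the number of shuffles, and the fixed seed are choices made for this illustration, not part of the notebook above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the run is repeatable

def permutation_pvalue(a, b, num_shuffles=5000):
    """Two-sided p value for a difference in means, found by shuffling group labels."""
    a = np.asarray(a)
    b = np.asarray(b)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    extreme = 0
    for _ in range(num_shuffles):
        # Shuffle the pooled values and split them back into two groups.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        if diff >= observed:
            extreme += 1
    # Fraction of shuffles with a difference at least as large as the observed one.
    return extreme / num_shuffles

samp1 = rng.normal(0, 1, 100)
samp2 = rng.normal(0, 1, 100)

p_resampled = permutation_pvalue(samp1, samp2)
p_ttest = stats.ttest_ind(samp1, samp2).pvalue
print("resampled:", p_resampled, "t-test:", p_ttest)
```

The two numbers will not match exactly, because the resampled value depends on which shuffles happen to be drawn, but they should be close.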

We should see no difference, since both samples are random draws from the same distribution. But sometimes we will. Run the code below until you get a p value <= .05.

In [6]:
make_samples()


0.5089350161784283

On average, you should have to run the function about 20 times. If you ran it until you got a p value <= .01, you would have to run it about 100 times on average.
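The waiting times come from the fact that each run has a chance of about .05 of landing at or below .05, so the average number of runs is about 1/.05 = 20 (and 1/.01 = 100 for the stricter cutoff). The sketch below checks the first figure by simulation. The helper `runs_until_significant`, the seed, and the 300 repetitions are assumptions made for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed (an assumption) so the run is repeatable

def runs_until_significant(alpha, samp_size=100):
    """Count how many random comparisons it takes to get a p value <= alpha."""
    runs = 0
    while True:
        runs += 1
        samp1 = rng.normal(0, 1, samp_size)
        samp2 = rng.normal(0, 1, samp_size)
        if stats.ttest_ind(samp1, samp2).pvalue <= alpha:
            return runs

# Repeat the "run until significant" game 300 times and average the waits.
waits = [runs_until_significant(0.05) for _ in range(300)]
mean_wait = np.mean(waits)
print("average runs until p <= .05:", mean_wait)
```

Any single wait can be much shorter or longer than 20; only the average settles near it.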

Think about this. Even though the sets are random, you will still sometimes get significance. If a drug company tests a drug, it should use a stricter threshold such as .01. Otherwise, it is more likely to sell a drug that has no use, which could harm public health.
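The placebo experiment from the start of this section can be simulated to see how often each threshold is fooled. This is a rough sketch: the heart rate values (mean 70, standard deviation 10) are made up for illustration, and a t-test stands in for resampling. It runs 10,000 experiments with 30 people per group.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # fixed seed (an assumption) so the run is repeatable

num_experiments = 10000
group_size = 30

# Hypothetical heart rates: both groups come from the same distribution,
# because the placebo does nothing.
group_a = rng.normal(70, 10, size=(num_experiments, group_size))
group_b = rng.normal(70, 10, size=(num_experiments, group_size))

# One t-test per row, so we get one p value per experiment.
pvalues = stats.ttest_ind(group_a, group_b, axis=1).pvalue

rate_05 = np.mean(pvalues <= 0.05)  # fraction of experiments fooled at .05
rate_01 = np.mean(pvalues <= 0.01)  # fraction of experiments fooled at .01
print("fooled at .05:", rate_05)
print("fooled at .01:", rate_01)
```

The two fractions should land near .05 and .01, which is the type 1 error rate for each threshold.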

Think back to our first example comparing the heights of people with and without freckles. The p value was greater than .2. If we treated a p value that large as significant, we would make a type 1 error at least one out of every five times when there is no real difference. It is for this reason that we generally only accept results that have a p value <= .05.

The following code helps you visualize the p value. The function creates two random samples and finds the p value. It does this 100 times by default (below we run it 1000 times). We then make a histogram to see how often each range of p values occurs.

In [7]:
def is_sig(num_iter = 100):
    final = []
    for i in range(num_iter):
        final.append(make_samples())
    return final


l = is_sig(1000)
show(hist(l, True, bins = 5))

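You can see that a p value between 0 and .2 occurs about 20% of the time, between .2 and .4 about 20%, and so on. In the same way, a p value of .05 or less occurs about 5% of the time. Just as expected. This is not magic. It follows from how the p value is defined: when there is no real difference, every p value between 0 and 1 is equally likely.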
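The flat shape of the histogram can also be checked with numbers instead of a plot. The sketch below uses 5,000 random comparisons and five equal bins from 0 to 1; the seed and the number of comparisons are assumptions made for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # fixed seed (an assumption) so the run is repeatable

# 5,000 pairs of random samples, 100 values each, all from the same distribution.
samp1 = rng.normal(0, 1, size=(5000, 100))
samp2 = rng.normal(0, 1, size=(5000, 100))
pvalues = stats.ttest_ind(samp1, samp2, axis=1).pvalue

# Fraction of p values in each of five equal bins between 0 and 1.
counts, edges = np.histogram(pvalues, bins=5, range=(0, 1))
fractions = counts / counts.sum()
for left, right, frac in zip(edges[:-1], edges[1:], fractions):
    print(f"{left:.1f} to {right:.1f}: {frac:.3f}")

# Fraction of p values at or below the usual .05 cutoff.
below_05 = np.mean(pvalues <= 0.05)
print("p <= .05:", below_05)
```

Each bin should hold roughly a fifth of the p values, and roughly one in twenty should fall at or below .05.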