# Bootstrapped p values for the two-sample t test

This Jupyter notebook has two purposes:

1. Explain how to get bootstrap p values

2. Show how blazingly fast Python is for this kind of task

## The bootstrap

The bootstrap is a resampling technique introduced by Bradley Efron in 1979 as a tool to find standard errors and confidence intervals in settings where we 

- do not have a standard error at hand for the given estimator (because it is too complex) or

- think that assumptions behind the usual standard error are grossly violated.

If you are not yet familiar with this technique, check <a href="https://en.wikipedia.org/wiki/Bootstrapping_(statistics)" >Wikipedia</a>. 

How do you calculate, for example, the bootstrap standard error of the sample mean of $n$ observations $X_1, \dots, X_n$?

1. Calculate the sample mean $\bar X$.
2. Draw $n$ observations with replacement from the sample $X_1, \dots, X_n$. This is your first bootstrap sample. Repeat this over and over to end up with $B$ bootstrap samples. Choose $B$ as large as is computationally feasible, e.g. 10'000. The whole point of the bootstrap is that each bootstrap sample is to the original sample as the sample is to the population.
3. For each bootstrap sample, calculate the sample mean $\bar X_i^b$. These values form your bootstrap sampling distribution.

The sample standard deviation $S$ of $\bar X_1^b, \dots, \bar X_B^b$ is often a reasonable guess for the true standard error of the mean. The interval $[\bar X \pm 1.96 S]$ is an approximate 95% confidence interval for the true population mean $\mu$. Alternatively, the percentile bootstrap confidence interval can be calculated, which is just the range from the 2.5% quantile up to the 97.5% quantile of the values $\bar X_i^b$. These are just the simplest bootstrap confidence intervals. There are clearly better versions around, no question. But our focus today is on something more delicate: p values.
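The three steps and both simple intervals can be sketched in a few lines of NumPy. This is only an illustration: the data, the seed and all variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, only for reproducibility
x = rng.normal(loc=10, scale=2, size=50)  # made-up sample, purely illustrative

B = 10_000
n = len(x)

# Step 1: sample mean of the original data
xbar = x.mean()

# Step 2: draw B bootstrap samples of size n (with replacement), one per row
idx = rng.integers(0, n, size=(B, n))

# Step 3: one sample mean per bootstrap sample
boot_means = x[idx].mean(axis=1)

# Bootstrap standard error S and the two simple 95% intervals
se_boot = boot_means.std(ddof=1)
ci_normal = (xbar - 1.96 * se_boot, xbar + 1.96 * se_boot)
ci_percentile = tuple(np.percentile(boot_means, [2.5, 97.5]))

# Textbook standard error of the mean, for comparison
se_classic = x.std(ddof=1) / np.sqrt(n)

print(se_boot, se_classic)
print(ci_normal, ci_percentile)
```

For the sample mean, the bootstrap standard error should land very close to the textbook value $s/\sqrt{n}$, which makes this a handy sanity check before moving on to estimators without a closed-form standard error.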

### p values via bootstrap

Because the bootstrap is so simple, researchers are tempted to use it for finding p values as well. This is often a bad idea:
- Almost always, permutation tests are the way to go if we do not trust classic tests. An exception is when the observations cannot be assumed to be exchangeable under the null, e.g. in a two-sample t-test situation with unequal variances.

- Often, classic tests (e.g. Welch's two-sample t test) are quite robust even to clear violations of assumptions.

- Many researchers do it wrongly.

What do I mean by the last bullet point? Look at the following algorithm:

1. Calculate the original value of the test statistic $T_0$.

2. Draw $B$ bootstrap samples.

3. For each bootstrap sample, calculate the test statistic $T_i$. The values $T_1, \dots, T_B$ form the (bootstrap) sampling distribution used in subsequent steps.

4. Calculate the proportion $\hat p$ of values in $T_1, \dots, T_B$ at least as large as $T_0$. In a two-sided setting, the p value is $2 \cdot \min\{\hat p, 1-\hat p\}$.

My completely unqualified guess is that 80% of all bootstrap p values are calculated like this. The problem is hidden in Step 3, and this is the only step we need to modify: the bootstrap sampling distribution is centered approximately around $T_0$ instead of 0 (or whatever value is associated with the null hypothesis). We are not interested in the sampling distribution around our specific value of the test statistic. Instead, we need the sampling distribution under the null hypothesis. In a two-sample t test setting, you would just subtract the respective group mean from each value before resampling, to end up with two samples with equal mean (0). The same works in a k-sample comparison setting: just subtract the location estimate of interest from each group. An excellent reference is [Boos & Brownie (1988)](https://pdfs.semanticscholar.org/ba4e/96f388ee8fc03e78779ba1d1a303174e420c.pdf). 

- It explains in detail how to find p values by the Bootstrap
- It uses awesome tricks in Monte-Carlo simulation
- It compares different two- and k-sample tests among others.
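To make the centring step concrete, here is a minimal sketch for the simplest statistic, the plain mean difference. The data, the seed and the helper names `boot_mean_diff` and `two_sided_p` are assumptions made up for this illustration. It resamples once from the raw groups and once from the mean-centred groups and compares where the two bootstrap distributions sit.

```python
import numpy as np

rng = np.random.default_rng(1)             # arbitrary seed
x = rng.normal(loc=11, scale=20, size=30)  # made-up data
y = rng.normal(loc=15, scale=20, size=20)
B = 10_000


def boot_mean_diff(x, y, B, rng):
    """Bootstrap distribution of mean(x) - mean(y) (hypothetical helper)."""
    xb = x[rng.integers(0, len(x), size=(B, len(x)))].mean(axis=1)
    yb = y[rng.integers(0, len(y), size=(B, len(y)))].mean(axis=1)
    return xb - yb


def two_sided_p(t, t0):
    """Two-sided p value 2 * min(p, 1 - p) from a bootstrap distribution t."""
    p = np.mean(t >= t0)
    return 2 * min(p, 1 - p)


t0 = x.mean() - y.mean()  # observed mean difference

# Resampling the raw data: distribution sits around t0, not around 0
naive = boot_mean_diff(x, y, B, rng)

# Resampling the centred data: distribution sits around 0, as the null demands
null = boot_mean_diff(x - x.mean(), y - y.mean(), B, rng)

p_naive = two_sided_p(naive, t0)
p_null = two_sided_p(null, t0)

print(t0, naive.mean(), null.mean())
print(p_naive, p_null)
```

The uncentred version puts roughly half of its mass on either side of the observed value, so its "p value" is close to 1 no matter what the data look like.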

We will focus now on the comparison of two means. We can bootstrap one of the following test statistics: 

1. The mean difference
2. The classic t-test statistic, i.e. the mean difference normalized
3. <a href = "https://en.wikipedia.org/wiki/Welch%27s_t-test"> Welch's t test statistic</a>, which uses a different normalization than the classic t-test.

In the paper cited above, the authors recommend the third option. Among the procedures considered, it performed best with respect to type I and type II errors.
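For reference, the three statistics differ only in how (and whether) the mean difference is normalized. The following sketch computes them by hand on made-up data and checks the two t statistics against `scipy.stats.ttest_ind`. All variable names are chosen for this illustration only.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(2)             # arbitrary seed
x = rng.normal(loc=11, scale=20, size=30)  # made-up data
y = rng.normal(loc=15, scale=10, size=20)

nx, ny = len(x), len(y)
vx, vy = x.var(ddof=1), y.var(ddof=1)  # unbiased sample variances

# 1. The mean difference
mean_diff = x.mean() - y.mean()

# 2. Classic t statistic: normalized with the pooled variance
pooled_var = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
t_classic = mean_diff / np.sqrt(pooled_var * (1 / nx + 1 / ny))

# 3. Welch's t statistic: each group contributes its own variance
t_welch = mean_diff / np.sqrt(vx / nx + vy / ny)

# Cross-check against SciPy
print(t_classic, stats.ttest_ind(x, y, equal_var=True)[0])
print(t_welch, stats.ttest_ind(x, y, equal_var=False)[0])
```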

So let's implement the Bootstrap p value for Welch's t-test statistic. 

## Python for calculating Bootstrap p values

Usually I do statistics in R. But it is difficult to beat Python in this setting. No loop, no comprehension... incredible. I will buy you a coffee if you manage to provide a faster version in R ;).

Let's have a look at the functions.

1. `boot_matrix`: It creates $B$ bootstrap samples of a vector and returns the result in a matrix

2. `bootstrap_t_pvalue`: It takes two samples and returns the bootstrap p value along with the original two-sample t test statistic and its parametric p value.

In [13]:
import numpy as np
import scipy.stats as stats

def boot_matrix(z, B):
    """Bootstrap sample
    
    Vector z is bootstrapped B times and the samples are returned as the rows of a (B, n) matrix"""
    
    n = len(z)  # sample size
    idz = np.random.randint(0, n, (B, n))  # indices to pick for all bootstrap samples
    return z[idz]

def bootstrap_t_pvalue(x, y, B=1000, equal_var=False):
    """ Bootstrap p values for two-sample t test
    
    Returns tuple with bootstrapped p value, test statistic and parametric p value"""
    
    # Original t test statistic (Welch's version unless equal_var=True)
    orig = stats.ttest_ind(x, y, equal_var=equal_var)
    
    # Generate bootstrap distribution of the t statistic
    xboot = boot_matrix(x - x.mean(), B=B) # important centring step to get sampling distribution under the null
    yboot = boot_matrix(y - y.mean(), B=B)
    t = stats.ttest_ind(xboot, yboot, axis=1, equal_var=equal_var)[0]

    # Proportion of bootstrap statistics at least as large as the original one
    p = np.mean(t >= orig[0])
    
    # Two-sided p value, followed by the original statistic and its parametric p value
    return (2*min(p, 1-p), *orig)

### Example 1

Let us first have a look at a standard situation: We sample from two shifted normal distributions. In such a setting, the classic t-test is definitely the test to apply, especially because we have equal variances.

In [14]:
np.random.seed(984564) # for reproducibility
x = np.random.normal(loc=11, scale=20, size=30)
y = np.random.normal(loc=15, scale=20, size=20)
%time bootstrap_t_pvalue(x, y, B=10000)

Wall time: 15.6 ms


(0.21300000000000008, -1.2299392326284944, 0.22753807122611611)

Okay. Our bootstrap was super fast (15 milliseconds for 10'000 bootstrap runs? C'mon) and returned almost the same p value as Welch's t test for unequal variances (the default t test in R). Compare the first value with the last value in the tuple.

What would we get in this situation with the wrong approach that is used so frequently? We only have to modify one of our two functions:

In [15]:
def the_wrong_way(x, y, B=1000, equal_var=False):
    """ Naive (wrong) bootstrap p values for two-sample t test
    
    Returns tuple with bootstrapped p value, test statistic and parametric p value"""
    
    # Original t test statistic (Welch's version unless equal_var=True)
    orig = stats.ttest_ind(x, y, equal_var=equal_var)
    
    # Generate bootstrap distribution of the t statistic
    xboot = boot_matrix(x, B=B) # error: no centring, so we do not resample under the null
    yboot = boot_matrix(y, B=B) # error: no centring, so we do not resample under the null
    t = stats.ttest_ind(xboot, yboot, axis=1, equal_var=equal_var)[0]

    # Proportion of bootstrap statistics at least as large as the original one
    p = np.mean(t >= orig[0])
    
    return (2*min(p, 1-p), *orig)

In [16]:
the_wrong_way(x, y, B=10000)

(0.98040000000000005, -1.2299392326284944, 0.22753807122611611)

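Just look at this p value. So wrong. Because the uncentred bootstrap distribution sits approximately around the observed statistic, about half of the bootstrap values exceed it. Hence $\hat p \approx 0.5$ and the two-sided p value ends up close to 1, whatever the data say.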

### Example 2

Now, just to show how well Welch's t-test for unequal variances works:

In [17]:
np.random.seed(345244) # for reproducibility
x = np.random.normal(loc=11, scale=20, size=30)
y = np.random.normal(loc=15, scale=10, size=20)
bootstrap_t_pvalue(x, y, B=10000)

(0.215, 1.2929322360150164, 0.20279745745876884)

### Example 3

Let's move on to the nasty situation: unequal variances and non-normality.

In [18]:
np.random.seed(399888) # for reproducibility
x = np.random.exponential(scale=20, size=30)
y = np.random.exponential(scale=10, size=20)
bootstrap_t_pvalue(x, y, B=10000)

(0.028799999999999999, 2.0977748141544903, 0.041263608229049308)

Here, we can see quite a difference. We don't know the "correct" p value. To find out, we would need to do a full-fledged Monte Carlo study. Maybe I will do that in the next notebook?
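Just to indicate what such a study could look like, here is a rough sketch, not a finished simulation. It assumes a null scenario built by shifting both exponential samples to mean 0, re-implements the centred bootstrap p value in a hypothetical helper `boot_p`, and estimates how often each test rejects at the 5% level. The number of replications is deliberately tiny, so the printed rates are far too noisy to draw conclusions from.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(3)  # arbitrary seed


def boot_p(x, y, B, rng):
    """Centred bootstrap p value for Welch's t statistic (hypothetical helper)."""
    t0 = stats.ttest_ind(x, y, equal_var=False)[0]
    xc, yc = x - x.mean(), y - y.mean()  # centring: resample under the null
    xb = xc[rng.integers(0, len(x), size=(B, len(x)))]
    yb = yc[rng.integers(0, len(y), size=(B, len(y)))]
    t = stats.ttest_ind(xb, yb, axis=1, equal_var=False)[0]
    p = np.mean(t >= t0)
    return 2 * min(p, 1 - p)


n_sim, B, alpha = 200, 1000, 0.05  # small values, chosen only to keep this quick
reject_boot = 0
reject_welch = 0

for _ in range(n_sim):
    # Null scenario (an assumption of this sketch): both population means are 0,
    # but the variances differ and the distributions are skewed
    x = rng.exponential(scale=20, size=30) - 20
    y = rng.exponential(scale=10, size=20) - 10
    reject_boot += int(boot_p(x, y, B, rng) <= alpha)
    reject_welch += int(stats.ttest_ind(x, y, equal_var=False)[1] <= alpha)

# Estimated type I error rates; a well-calibrated test should be near alpha
rate_boot = reject_boot / n_sim
rate_welch = reject_welch / n_sim
print(rate_boot, rate_welch)
```

A serious study would use many more replications, several sample sizes and distributions, and would also look at power under alternatives.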