# Bootstrapped p values for the two-sample t test

This short Python notebook has two purposes:

1. Explain how to get bootstrap p values.

2. Show how blazingly fast Python is for this kind of task.

## The bootstrap

The bootstrap is a resampling technique introduced by Bradley Efron in 1979, mainly as a tool to find standard errors and confidence intervals in complex settings where we

- do not have a standard error at hand for the given estimator (because it is too complex) or

- know that assumptions behind the usual standard error are grossly violated.

If you are not yet familiar with this technique, check [Wikipedia](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)).
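As a quick illustration of the basic idea, here is a minimal sketch that bootstraps the standard error and a percentile confidence interval for a sample median. The data, the seed, and the choice of the median as estimator are made up for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
sample = rng.exponential(scale=10, size=50)  # made-up skewed data

B = 2000  # number of bootstrap samples
n = len(sample)

# Draw B resamples of size n with replacement: one resample per row
idx = rng.integers(0, n, size=(B, n))
boot_medians = np.median(sample[idx], axis=1)

# Bootstrap standard error: standard deviation of the resampled estimates
se_median = boot_medians.std(ddof=1)

# Simple 95% percentile confidence interval
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(se_median, ci_low, ci_high)
```

No formula for the standard error of the median is needed here; the spread of the resampled medians stands in for it.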

### p values via bootstrap

Researchers are often also interested in finding p values by the bootstrap. This is often a bad idea:

- Almost always, permutation tests are the way to go if we do not trust classic tests. An exception is when the observations cannot be assumed to be exchangeable under the null, e.g. in a two-sample t test situation with unequal variances.

- Often, classic tests (e.g. Welch's two-sample t test) are quite robust even to clear violations of assumptions.

- Researchers often do it wrongly.

What do I mean by the last bullet point? Look at the following algorithm:

1. Calculate the original test statistic $T$.

2. Take $B$ bootstrap samples.

3. For each bootstrap sample, calculate the test statistic. The $B$ values form the (bootstrap) sampling distribution.

4. Calculate the proportion $\hat p$ of values in the sampling distribution at least as large as $T$. In a two-sided setting (the one we are focusing on), the p value is $2 \cdot \min\{\hat p, 1-\hat p\}$.

My completely random guess is that 80% of all bootstrap p values are calculated like this. The problem is hidden in Step 3: the bootstrap sampling distribution is centered approximately around $T$ instead of 0 (or whatever value is associated with the null hypothesis). We are not interested in the sampling distribution around our specific value of the test statistic. Instead, we need the sampling distribution under the null hypothesis. This is not always easy to find. But in a two-sample t test setting, you would just subtract each group's mean from the values of that group to end up with two samples with equal mean (0). Similarly, in a k-sample comparison setting, just subtract the location estimate of interest from each group.

An excellent reference is [Boos & Brownie (1988)](https://pdfs.semanticscholar.org/ba4e/96f388ee8fc03e78779ba1d1a303174e420c.pdf).

- It explains in detail how to find p values by the bootstrap.
- It contains some awesome tricks for Monte Carlo comparisons of statistical hypothesis tests.
- It compares different two- and k-sample tests.
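To see the centering problem from Step 3 in numbers, here is a small sketch that contrasts the naive bootstrap with the null-centered one on simulated data with a clear mean difference. The helper names `resample` and `two_sided`, the seed, and the simulated samples are my own illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
x = rng.normal(loc=0, scale=1, size=40)
y = rng.normal(loc=1, scale=1, size=40)  # true means clearly differ

def resample(z, B, rng):
    """Return a (B, n) matrix of bootstrap resamples of the 1-D array z."""
    idx = rng.integers(0, len(z), size=(B, len(z)))
    return z[idx]

def two_sided(t_boot, t_obs):
    """Two-sided p value 2 * min(p_hat, 1 - p_hat) from Step 4."""
    p_hat = np.mean(t_boot >= t_obs)
    return 2 * min(p_hat, 1 - p_hat)

B = 5000
t_orig = stats.ttest_ind(x, y, equal_var=False).statistic

# Naive version: resample the raw data, so the distribution sits around t_orig
t_naive = stats.ttest_ind(resample(x, B, rng), resample(y, B, rng),
                          axis=1, equal_var=False).statistic

# Null-centered version: remove each group's mean before resampling
t_null = stats.ttest_ind(resample(x - x.mean(), B, rng),
                         resample(y - y.mean(), B, rng),
                         axis=1, equal_var=False).statistic

print("original t:", t_orig)
print("median of naive bootstrap t:", np.median(t_naive))
print("median of centered bootstrap t:", np.median(t_null))
print("naive p:", two_sided(t_naive, t_orig))
print("centered p:", two_sided(t_null, t_orig))
```

The naive distribution is located near the observed statistic, so roughly half of its values exceed $T$ and the resulting "p value" is large no matter how strong the effect is. The centered distribution is located near 0 and gives a small p value for this clearly shifted data.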

We will now focus on the comparison of two means. There are different possible test statistics to bootstrap:

1. The mean difference
2. The studentized mean difference
3. Welch's t test statistic

In the paper cited above, the authors recommend the third option based on simulations. It performed best among these procedures with regard to type I and type II errors.

Now we are going to implement the bootstrap p value for Welch's t test statistic.
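For orientation, the three candidates could be written as follows. This is a sketch under an assumption: I read "studentized mean difference" as the mean difference divided by the pooled standard error (the classic equal-variance t statistic), which may not match the exact definition used in the paper. The function names and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def mean_difference(x, y):
    """Option 1: raw difference of the sample means."""
    return x.mean() - y.mean()

def pooled_t(x, y):
    """Option 2 (assumed reading): mean difference over the pooled standard error."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / nx + 1 / ny))

def welch_t(x, y):
    """Option 3: Welch's statistic with separate variance estimates per group."""
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return (x.mean() - y.mean()) / se

rng = np.random.default_rng(2)  # arbitrary seed
x = rng.normal(11, 20, size=30)
y = rng.normal(15, 10, size=20)
print(mean_difference(x, y), pooled_t(x, y), welch_t(x, y))
```

The last two agree with `scipy.stats.ttest_ind` for `equal_var=True` and `equal_var=False`, respectively.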

## Python for calculating bootstrap p values

Usually I do statistics in R, but it is difficult to beat Python in this setting: no loop, no comprehension... incredible. It will be hard to beat the following code in R regarding execution speed. I will buy you a coffee if you manage to do so ;).

Let's have a look at the functions.

1. `boot_matrix`: It creates $B$ bootstrap samples of a vector and returns the result in a matrix

2. `bootstrap_t_pvalue`: It takes two samples and returns the bootstrap p value along with the original two-sample t test statistic and its parametric p value.

In [21]:
import numpy as np
import scipy.stats as stats

def boot_matrix(z, B):
    """Bootstrap sample

    Vector z is bootstrapped B times and organized in a (B, n) matrix"""

    n = len(z)  # sample size
    idz = np.random.randint(0, n, (B, n))  # indices to pick for all bootstrap samples
    return z[idz]

def bootstrap_t_pvalue(x, y, B=1000, equal_var=False):
    """Bootstrap p values for two-sample t test

    Returns tuple with bootstrapped p value, test statistic and parametric p value"""

    # Original t test statistic (Welch's by default, since equal_var=False)
    orig = stats.ttest_ind(x, y, equal_var=equal_var)

    # Generate bootstrap distribution of the t statistic under the null:
    # each sample is centered at mean 0 before resampling
    xboot = boot_matrix(x - x.mean(), B=B)
    yboot = boot_matrix(y - y.mean(), B=B)
    t = stats.ttest_ind(xboot, yboot, axis=1, equal_var=equal_var)[0]

    # Proportion of bootstrap statistics at least as large as the original one
    p = np.mean(t >= orig[0])

    return (2*min(p, 1-p), *orig)
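The trick that avoids any loop in `boot_matrix` is NumPy's integer-array indexing: indexing a 1-D array with a 2-D array of indices returns an array with the shape of the index array. A tiny example with hand-picked indices (instead of random ones) makes this visible:

```python
import numpy as np

z = np.array([10.0, 20.0, 30.0])

# Two hand-picked "bootstrap" index rows instead of random ones
idz = np.array([[0, 0, 2],
                [1, 2, 2]])

# The result has the shape of idz; each row is one resample of z
print(z[idz])
# [[10. 10. 30.]
#  [20. 30. 30.]]
```

Note that this only works if `z` is a NumPy array; a plain Python list would have to be converted with `np.asarray` first.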

### Example 1

Let us first have a look at a standard situation: we sample from two shifted normal distributions. In such a setting, the classic t test is definitely the test to apply, especially because we have equal variances.

In [22]:
x = np.random.normal(loc=11, scale=20, size=30)
y = np.random.normal(loc=15, scale=20, size=20)
%time bootstrap_t_pvalue(x, y, B=10000)

Wall time: 15.6 ms


(0.6956, -0.38840021321462054, 0.69944437801476789)

Okay. Our bootstrap was super fast (about 15 milliseconds for 10'000 bootstrap runs? Come on) and returned almost the same p value as Welch's t test for unequal variances (the default t test in R). Compare the first value with the last value in the tuple.
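To get a feeling for what the vectorized call saves, the following sketch computes the same bootstrap statistics once with a single `ttest_ind` call over a matrix and once with an explicit loop over the rows, using identical resampling indices. The seed and the smaller `B` are arbitrary choices to keep the loop short; timings depend on the machine, so none are quoted here.

```python
import time
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # arbitrary seed
x = rng.normal(11, 20, size=30)
y = rng.normal(15, 20, size=20)
B = 1000

# Null-centered samples and one fixed set of bootstrap indices for both versions
xc, yc = x - x.mean(), y - y.mean()
ix = rng.integers(0, len(xc), size=(B, len(xc)))
iy = rng.integers(0, len(yc), size=(B, len(yc)))

# Vectorized: one call on (B, n) matrices
start = time.perf_counter()
t_vec = stats.ttest_ind(xc[ix], yc[iy], axis=1, equal_var=False).statistic
time_vec = time.perf_counter() - start

# Explicit loop: one call per bootstrap sample
start = time.perf_counter()
t_loop = np.array([
    stats.ttest_ind(xc[ix[b]], yc[iy[b]], equal_var=False).statistic
    for b in range(B)
])
time_loop = time.perf_counter() - start

print("same statistics:", np.allclose(t_vec, t_loop))
print(f"vectorized: {time_vec:.4f} s, loop: {time_loop:.4f} s")
```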

### Example 2

Now, just to show how well Welch's t test for unequal variances works:

In [19]:
x = np.random.normal(loc=11, scale=20, size=30)
y = np.random.normal(loc=15, scale=10, size=20)
bootstrap_t_pvalue(x, y, B=10000)

(0.2914000000000001, -1.0449868988384667, 0.30210339327696162)

### Example 3

Let's move on to the nasty situation: unequal variances and non-normality.

In [20]:
np.random.seed(399888)
x = np.random.exponential(scale=20, size=30)
y = np.random.exponential(scale=10, size=20)
bootstrap_t_pvalue(x, y, B=10000)

(0.028799999999999999, 2.0977748141544903, 0.041263608229049308)

Here, we can see quite some difference. We don't know the "correct" p value. To judge which test performs better, we would need to run a full-fledged Monte Carlo study. Maybe I will do that in the next notebook.
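As a rough idea of what such a study could look like, here is a heavily simplified sketch that estimates the rejection rate of both tests at the 5% level when the null hypothesis is true. Everything in it is an assumption for illustration: the helper `boot_p` re-implements the centered bootstrap in compact form, the exponential samples are shifted to mean 0 so that the null holds, and the number of simulations is far too small for serious conclusions.

```python
import numpy as np
from scipy import stats

def boot_p(x, y, B, rng):
    """Null-centered bootstrap p value for Welch's t statistic (compact version)."""
    t_obs = stats.ttest_ind(x, y, equal_var=False).statistic
    xb = (x - x.mean())[rng.integers(0, len(x), size=(B, len(x)))]
    yb = (y - y.mean())[rng.integers(0, len(y), size=(B, len(y)))]
    t = stats.ttest_ind(xb, yb, axis=1, equal_var=False).statistic
    p = np.mean(t >= t_obs)
    return 2 * min(p, 1 - p)

rng = np.random.default_rng(4)  # arbitrary seed
n_sim, B, alpha = 200, 500, 0.05  # deliberately small to keep the run short

reject_boot = 0
reject_welch = 0
for _ in range(n_sim):
    # Both populations have mean 0 but different spread, so the null is true
    x = rng.exponential(scale=20, size=30) - 20
    y = rng.exponential(scale=10, size=20) - 10
    reject_boot += boot_p(x, y, B, rng) < alpha
    reject_welch += stats.ttest_ind(x, y, equal_var=False).pvalue < alpha

# Estimated type I error rates; ideally both are close to alpha
print("bootstrap:", reject_boot / n_sim)
print("Welch:", reject_welch / n_sim)
```

A real study would use many more simulation runs and would also look at power under different alternatives.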