# Bootstrap Tests and Confidence Intervals



Let's begin by testing our two statistical questions using NHST with bootstrap resampling. The two questions are:
1. Does graduate education correspond to an increase in the median net family wealth in comparison to undergraduate education?
2. Is the relative frequency of millionaires higher among those with a graduate education in comparison to those with an undergraduate education?

We will use the NLSY79 data set to answer these questions, and we present two common approaches for determining and reporting the statistical significance of the observed differences in summary statistics. We start with the most common approach in NHST:

## Statistical Significance in NHST

The most common approach to determining statistical significance in NHST is to compare the $p$-value to a fixed threshold $\alpha$, where $\alpha>0$ and $\alpha \ll 1$ (that is, $\alpha$ is much less than 1). Recall that $p$ is the probability of observing a value of the test statistic at least as extreme as the one observed **under the null hypothesis, $H_0$**. The value of $\alpha$ should be specified before the experiment or post-hoc analysis is conducted. A value of $\alpha=0.05$ is commonly used. If $p < \alpha$, then we say that we **reject the null hypothesis**; otherwise, we only say that we **fail to reject the null hypothesis**. Except where otherwise noted, we will use a $p$-value threshold of $\alpha=0.05$ in this book.

Some important points about testing statistical significance using $p$-values:
1. **The $p$-value is not the probability that the null hypothesis is true, given the data.** The $p$-value is calculated by *assuming that $H_0$ is true.*  Thus, the $p$-value can only tell us about the probability of seeing such an extreme value of the test statistic under the assumption that $H_0$ is true; it **cannot** tell us about the probability that $H_0$ is true.
1. **A value $p > \alpha$ does not indicate the null hypothesis is true.** From the previous point, the $p$-value is calculated assuming that $H_0$ is true, so it cannot provide evidence that $H_0$ holds.
1. **A value $p > \alpha$ does not mean that the null hypothesis is accepted.** Such a value only indicates that the data is not sufficient to determine that the observed difference is unusual. When the samples are small, there is more variation in the value of the test statistic. This is why we say we **fail to reject the null hypothesis**.  This result may just indicate that more data is needed to determine whether the observed value occurs with low probability.
1. **A value  $p< \alpha$ does not indicate that the alternative hypothesis is true.** Since the $p$-value is calculated under the assumption that $H_0$ is true, it cannot actually tell us anything about the alternative hypothesis.
1. **A small value of $p$ does not indicate that the effect under the alternative hypothesis is strong.** For instance, consider an experiment comparing the performance of a medication to a placebo. The null hypothesis is that there is no difference in outcome between the placebo and the medication, and the alternative hypothesis is probably something related to an improvement in outcomes from the medication. Then a low $p$-value does **not** indicate that the medication has a particularly strong effect on the outcome. Because $p$-values are calculated under the assumption that the null hypothesis is true, they **cannot tell us anything about the alternative hypothesis**.
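
The last point can be illustrated with a small simulation. The sketch below uses synthetic, normally distributed data rather than real experimental data; the group size, the shift of 0.1 standard deviations, the seed, and all variable names are assumptions chosen only for illustration. With enough data, even this weak effect should produce a very small $p$-value in a one-sided bootstrap test:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic data (an assumption for illustration): the "treated" group is
# shifted by only 0.1 standard deviations, which is a weak effect.
n = 5_000
control = rng.normal(loc=0.0, scale=1.0, size=n)
treated = rng.normal(loc=0.1, scale=1.0, size=n)

observed = treated.mean() - control.mean()
pooled = np.concatenate([control, treated])

# One-sided bootstrap test under H0: both groups are drawn from the pooled data
num_sims = 2_000
count = 0
for _ in range(num_sims):
    control_sample = rng.choice(pooled, n)
    treated_sample = rng.choice(pooled, n)
    if treated_sample.mean() - control_sample.mean() >= observed:
        count += 1

p_value = count / num_sims
print(f'Observed difference in means: {observed:.3f}')
print(f'Estimated p-value: {p_value}')
```

The small $p$-value here reflects the large sample size, not the size of the effect, which remains a small fraction of a standard deviation.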

For a more detailed and extended discussion of misconceptions, see:
* Goodman, Steven, "A dirty dozen: twelve p-value misconceptions," *Seminars in Hematology*, vol. 45, no. 3, 2008.

Let's carry out some statistical tests using $p$-values. We start by importing the necessary libraries and loading the data:

In [120]:
import numpy as np
import numpy.random as npr
import pandas as pd

import matplotlib.pyplot as plt



In [121]:
#df = pd.read_csv('https://raw.githubusercontent.com/jmshea/Foundations-of-Data-Science-with-Python/main/05-binary-hypothesis-testing/nls/nls.csv')
df = pd.read_csv('nls/nls.csv')


remap = {'R0000100':'CASE_ID',
         'T5597600': 'GENDER',
         'T5684500': 'NET_WEALTH',
         'T9900000': 'HIGHEST_GRADE_EVER'
        }
df.rename(columns=remap, inplace=True)
df2=df.query('HIGHEST_GRADE_EVER > 0 & NET_WEALTH>0') 
undergrad = df2.query('HIGHEST_GRADE_EVER >= 16 & HIGHEST_GRADE_EVER <=17')['NET_WEALTH']                
grad = df2.query('HIGHEST_GRADE_EVER >= 18')['NET_WEALTH']

pooled = df2.query('HIGHEST_GRADE_EVER >= 16')['NET_WEALTH']

In [122]:
print('The number of data points in each group is:')
print(f'\tUndergrad: {len(undergrad)}')
print(f'\tGrad: {len(grad)}')
print(f'\tPooled: {len(pooled)}')

The number of data points in each group is:
	Undergrad: 821
	Grad: 473
	Pooled: 1294


## Testing Whether Graduate Education Increases Median Net Family Wealth

The median values of net family wealth for the undergraduate and post-baccalaureate groups are

In [123]:
undergrad.median()

427000.0

In [124]:
grad.median()

484400.0

Then our test statistic is the difference  between these, and the observed value of the test statistic is  

In [125]:
diff1 = grad.median() - undergrad.median()
diff1

57400.0

Let's start with a standard NHST. We conduct a simulation in which, in each iteration, we create two new sample groups by bootstrap sampling from the pooled data. We then compute the median of each group and calculate the sample test statistic by subtracting the median of the `undergrad_sample` group from the median of the `grad_sample` group. We are evaluating whether post-baccalaureate education increases net family wealth, so we use a one-sided test: we increment a counter if the test statistic is at least as large as the observed value. At the end of the iterations, we calculate the relative frequency with which the test statistic was at least as large as the observed difference in median wealth.

In [55]:
num_sims = 100_000
count1 = 0

for sim in range(num_sims):
  # Bootstrap sampling
  undergrad_sample = npr.choice(pooled, len(undergrad))
  grad_sample = npr.choice(pooled, len(grad))
  
  # Compute value of test statistic
  diff_sample = np.median(grad_sample) - np.median(undergrad_sample) 
  
  # Compare test statistic to observed value and count
  if diff_sample >= diff1:
    count1+=1
  
print(f'The relative frequency of observing a difference in medians as large as')
print(f'the difference in the original data (i.e., the p-value) is {count1/num_sims}')
  
  

The relative frequency of observing a difference in medians as large as
the difference in the original data (i.e., the p-value) is 0.08963


Recall what this value means. If the net wealth data for the two groups come from the same distribution (i.e., there is no real difference between the undergraduate and graduate groups in terms of the probability of having a certain family wealth), then we will still see a difference as large as the one we observed approximately 9.0% of the time when we have samples of these sizes. Since 9.0% is not a small probability, we cannot be confident that the observed difference in median net family wealth is not simply the result of random sampling. We say that "we fail to reject the null hypothesis" because $p\approx 0.09>0.05$.
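
Because the $p$-value is itself estimated by simulation, repeating the simulation gives slightly different values. As a rough sketch, if we treat each iteration as an independent trial, the standard error of an estimated relative frequency $\hat{p}$ from $N$ iterations is approximately $\sqrt{\hat{p}(1-\hat{p})/N}$. The variable names below are chosen for illustration:

```python
import numpy as np

p_hat = 0.08963      # estimated p-value reported by the simulation above
num_sims = 100_000   # number of bootstrap iterations used

# Standard error of a relative frequency estimated from num_sims independent trials
std_err = np.sqrt(p_hat * (1 - p_hat) / num_sims)
print(f'Approximate standard error of the estimated p-value: {std_err:.4f}')
```

The result is about 0.0009, so different runs of the simulation can be expected to differ in the third decimal place. That variation is far too small to change the comparison with the 0.05 threshold.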

Failing to reject the null hypothesis does not mean that the null hypothesis is true. In fact, the observed difference in the median net family wealth may well reflect a real effect. However, the data is not sufficient to reject the null hypothesis because we are using a $p$-value threshold of 0.05. A researcher who is interested in pursuing this particular statistical difference could consider collecting additional data to further aid in testing the hypothesis that graduate education increases median net family wealth.

## Testing Whether Graduate Education Increases Probability of Becoming a Millionaire

Now consider whether post-baccalaureate education increases the probability of a family obtaining a net worth of over \$1 million. We first determine the proportions of families with net wealth over \$1 million in each group. If we compare each of `undergrad` and `grad` to a threshold of `1_000_000`, the output will be a Pandas Series with True and False values. Because NumPy treats True as 1 and False as 0, if we pass these to `np.sum()`, we will get a count of the number of values over 1 million:

In [126]:
np.sum(undergrad > 1_000_000)

201

In [127]:
np.sum(grad > 1_000_000)

141

Since these underlying groups have different cardinalities, we determine the relative frequencies of millionaires in each group by dividing by the number of data points in the group:

In [128]:
rf_undergrad = np.sum(undergrad > 1_000_000) / len(undergrad)
rf_undergrad

0.24482338611449453

In [129]:
rf_grad = np.sum(grad > 1_000_000) / len (grad)
rf_grad

0.29809725158562367
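
The counting-and-dividing steps above can also be checked on a tiny example. The sketch below uses a hypothetical four-element Series (not values from the NLSY79 data) to show that summing a Boolean Series counts the True values and that taking its mean gives the relative frequency directly:

```python
import pandas as pd

# Hypothetical toy data, not taken from the NLSY79 data set
toy_wealth = pd.Series([250_000, 1_200_000, 3_000_000, 900_000])

over_threshold = toy_wealth > 1_000_000   # Series of True/False values
count = over_threshold.sum()              # True counts as 1, False as 0
rel_freq = over_threshold.mean()          # same as count / len(toy_wealth)

print(count, rel_freq)
```

Two of the four toy values exceed the threshold, so the count is 2 and the relative frequency is 0.5.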

We can see that the relative frequency of millionaires in the graduate group is greater by

In [130]:
diff2 = rf_grad - rf_undergrad
diff2

0.05327386547112914

Let's conduct an NHST using bootstrap resampling to determine if the observed difference of approximately 5.3% is statistically significant at the $\alpha=0.05$ level. The simulation is very similar to the one for median wealth; we just need to replace the use of the median as the summary statistic with the relative frequency of exceeding 1 million:

In [131]:
num_sims = 100_000
wealth_threshold = 1_000_000
count2 = 0

for sim in range(num_sims):
  # Bootstrap sampling
  undergrad_sample = npr.choice(pooled, len(undergrad))
  grad_sample = npr.choice(pooled, len(grad))

  
  # Compute value of test statistic
  diff_sample = np.sum(grad_sample > wealth_threshold)/len(grad) - np.sum(undergrad_sample > wealth_threshold)/len(undergrad)
  # Compare test statistic to observed value and count
  if diff_sample >= diff2:
    count2+=1
  
print('The relative frequency of observing a difference in proportions as large as')
print(f'the difference in the original data (i.e., the p-value) is {count2/num_sims}')

The relative frequency of observing a difference in proportions as large as
the difference in the original data (i.e., the p-value) is 0.01883


For this result, $p \approx 0.019 < 0.05$, so we reject the null hypothesis at the $\alpha = 0.05$ level. Our conclusion is that there is a statistically significant association between having a graduate education and the probability of having a net family worth over \$1 million.

Recall again what $p \approx 0.019$ means: even if there is no statistical difference between the two groups in terms of the proportion of millionaires, we will see a difference in relative frequencies as large as the one observed in the data with probability approximately equal to 0.019. If the null hypothesis were true and we conducted a similar survey many times, we would expect to see this large a difference in approximately 1.9% of the surveys.

Note that it is best practice to report the value of $p$ found because smaller values of $p$ provide stronger evidence against the null hypothesis. Note again that $p \approx 0.019$ **does not** mean that the probability that the null hypothesis is true is approximately 0.019. That is because the $p$-value is determined **under the assumption that the null hypothesis is true.**
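
The two simulations in this section differ only in the summary statistic, so they can be sketched as a single function. The helper `bootstrap_p_value` below is hypothetical (it is not part of NumPy or Pandas), and the usage example runs on synthetic exponential data rather than the NLSY79 data; the scales, group sizes, and seed are assumptions for illustration:

```python
import numpy as np

def bootstrap_p_value(group_a, group_b, statistic, num_sims=10_000, rng=None):
    """Estimate a one-sided p-value for statistic(group_b) - statistic(group_a).

    Hypothetical helper: under H0, both groups are resampled from the pooled data.
    """
    rng = np.random.default_rng() if rng is None else rng
    group_a = np.asarray(group_a)
    group_b = np.asarray(group_b)
    pooled = np.concatenate([group_a, group_b])
    observed = statistic(group_b) - statistic(group_a)

    count = 0
    for _ in range(num_sims):
        # Bootstrap sampling (with replacement) from the pooled data
        sample_a = rng.choice(pooled, len(group_a))
        sample_b = rng.choice(pooled, len(group_b))
        if statistic(sample_b) - statistic(sample_a) >= observed:
            count += 1
    return count / num_sims


# Usage on synthetic data (an assumption; these are not the NLSY79 values)
rng = np.random.default_rng(1)
low = rng.exponential(scale=400_000, size=800)
high = rng.exponential(scale=800_000, size=500)

p_median = bootstrap_p_value(low, high, np.median, num_sims=2_000, rng=rng)
p_millionaire = bootstrap_p_value(
    low, high, lambda x: np.mean(x > 1_000_000), num_sims=2_000, rng=rng
)
print(p_median, p_millionaire)
```

Passing the statistic as a function keeps the resampling logic in one place, so switching from the median to the proportion of millionaires only requires changing one argument.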

## An Alternative to $p$-values: Confidence Intervals

The use of $p$-values and fixed thresholds for determining statistical significance has come under attack over the past decade.