## Bootstrap estimation and sampling: In-class exercise

### *We will use data on Hodgkin's disease to examine a decline in lung health due to treatment, based on Inferential Thinking 13.4. We will test the hypothesis that the treatment causes a drop in lung health.*

## Comparing Baseline and Post-Treatment Scores: Hodgkins
From [*Inferential Thinking 13.4*](https://inferentialthinking.com/chapters/13/4/Using_Confidence_Intervals.html?highlight=hodgkins#comparing-baseline-and-post-treatment-scores)

We will study this in the context of data that are a subset of the information gathered in a randomized controlled trial about treatments for Hodgkin's disease. Hodgkin's disease is a cancer that typically affects young people. The disease is curable but the treatment can be very harsh. The purpose of the trial was to come up with a dosage that would cure the cancer but minimize the adverse effects on the patients.

The table ``hodgkins`` contains data on the effect that the treatment had on the lungs of 22 patients. The columns are:

- Height in cm
- A measure of radiation to the mantle (neck, chest, under arms)
- A measure of chemotherapy
- A score of the health of the lungs at baseline, that is, at the start of the treatment; higher scores correspond to more healthy lungs
- The same score of the health of the lungs, 15 months after treatment

In [None]:
from datascience import *
%matplotlib inline
path_data = 'data/'
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import numpy as np

In [None]:
hodgkins = Table.read_table(path_data + 'hodgkins.csv')

In [None]:
hodgkins.show(3)

We will compare the baseline and 15-month scores. As each row corresponds to one patient, we say that the sample of baseline scores and the sample of 15-month scores are *paired* - they are not just two sets of 22 values each, but 22 pairs of values, one for each patient.

At a glance, you can see that the 15-month scores tend to be lower than the baseline scores – the sampled patients' lungs seem to be doing worse 15 months after the treatment. This is confirmed by the mostly positive values in the column `drop`, the amount by which the score dropped from baseline to 15 months.
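
To make the idea of paired samples concrete, here is a minimal sketch using only plain Python lists. The names `toy_base`, `toy_month15`, and `toy_drop` and all the values are hypothetical and are not taken from `hodgkins.csv`; the point is only that each difference is computed within one patient.

```python
# Hypothetical paired scores for five patients (illustrative values only,
# not taken from hodgkins.csv).
toy_base = [80.0, 95.5, 70.0, 88.0, 102.5]
toy_month15 = [62.0, 71.5, 74.0, 60.0, 80.5]

# Pair by position: element i of each list belongs to the same patient,
# so the drop is computed patient by patient.
toy_drop = [before - after for before, after in zip(toy_base, toy_month15)]

# A positive drop means the score fell; a negative drop means it rose.
mean_drop = sum(toy_drop) / len(toy_drop)
print(toy_drop)
print(mean_drop)
```

Because the pairing is by position, shuffling one list independently of the other would destroy the structure that the comparison relies on.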

## *<font color='green'>Create a new column labelled 'drop', which is the difference between the 'base' column and the 'month15' column*

In [None]:
hodgkins = hodgkins.with_columns(
    'drop', hodgkins.column('base') - ... )

In [None]:
hodgkins

## *<font color='green'>Create a histogram of the 'drop' column*

In [None]:
...

## *<font color='green'>What is the mean drop?*

In [None]:
print("Average: %4.2f" % np.mean(...))

### Average Drop Hypothesis
In the sample, the average drop is about 28.6. But could this be the result of chance variation? <br>The data are from a random sample. Could it be that in the entire population of patients, the average drop is just 0?

To answer this, we can set up two hypotheses:

**Hypothesis:** <font color='blue'>In the population, cumulative treatment over 15 months leads to a drop in lung function.

**Null hypothesis:** <font color='blue'>In the population, the average drop is 0; the drop observed in the sample is due to chance variation.



To test this hypothesis with a 5% cutoff for the p-value, let's construct an approximate 95% confidence interval for the average drop in the population.
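
The connection between a 95% confidence interval and a test with a 5% cutoff can be sketched as a simple decision rule: if the interval does not contain the value named in the null hypothesis, the data are inconsistent with the null at that cutoff. The function names `interval_contains` and `reject_null_at_5_percent` below are hypothetical helpers written for this illustration, and the endpoints in the usage lines are made-up numbers, not results from the data.

```python
def interval_contains(left, right, value=0.0):
    """Return True if value lies within the closed interval [left, right]."""
    return left <= value <= right


def reject_null_at_5_percent(left, right, null_value=0.0):
    """Decision rule, assuming (left, right) is an approximate 95% interval.

    The null is rejected exactly when the hypothesized value falls
    outside the interval.
    """
    return not interval_contains(left, right, null_value)


# Made-up endpoints, for illustration only.
print(reject_null_at_5_percent(4.0, 21.0))    # interval excludes 0
print(reject_null_at_5_percent(-3.0, 12.0))   # interval includes 0
```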

## *<font color='green'>Now we resample the table with replacement to see how much the sample mean varies from one random sample to another*

In [None]:
hodgkins.sample()

In [None]:
def one_bootstrap_mean():
    resample = hodgkins.sample()
    return np.mean(resample.column('drop'))

In [None]:
# Generate 10,000 bootstrap means
num_repetitions = ...
bstrap_means = make_array()
for i in np.arange(num_repetitions):
    bstrap_means = np.append(bstrap_means, one_bootstrap_mean())
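
The loop above can also be sketched without the `datascience` library, which may help clarify what each resample does. This version uses only the Python standard library; the function name `bootstrap_means`, the `seed` parameter, and the values in `toy_drops` are assumptions made for illustration and do not come from `hodgkins.csv`.

```python
import random
import statistics


def bootstrap_means(values, num_repetitions, seed=0):
    """Return a list of means, one per bootstrap resample of `values`.

    Each resample draws len(values) items with replacement, which mirrors
    what Table.sample() does when called with no arguments.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n = len(values)
    means = []
    for _ in range(num_repetitions):
        resample = rng.choices(values, k=n)  # sampling with replacement
        means.append(statistics.mean(resample))
    return means


# Hypothetical drops, for illustration only.
toy_drops = [12.0, 35.5, -4.0, 20.0, 41.0, 8.5, 27.0, 15.0]
toy_means = bootstrap_means(toy_drops, num_repetitions=1000)
print(len(toy_means))
```

Collecting the means in a list and converting once at the end avoids copying the whole array on every repetition, which is what repeated calls to `np.append` do; for a classroom-sized simulation either approach works.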

In [None]:
plt.hist(bstrap_means)

### Get the endpoints of the 95% confidence interval
Use the `percentile` function from the `datascience` library.

In [None]:
percentile(2.5, bstrap_means)

In [None]:
left = percentile(2.5, bstrap_means)
right = percentile(97.5, bstrap_means)

make_array(left, right)
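
For reference, here is a sketch of the percentile definition used in *Inferential Thinking*: the pth percentile is the smallest value in the collection that is at least as large as p% of all the values. The function name `smallest_at_least_percentile` and the values in `toy_means` are hypothetical, and the code reflects my reading of that definition rather than the library's source. Note that `np.percentile` interpolates between values by default, so it can return slightly different endpoints.

```python
import math


def smallest_at_least_percentile(p, values):
    """Smallest element that is at least as large as p% of the values.

    Assumes 0 <= p <= 100 and a non-empty collection.
    """
    ordered = sorted(values)
    if p == 0:
        return ordered[0]
    # Position (1-based) of the first element covering p% of the data.
    position = math.ceil(p / 100 * len(ordered))
    return ordered[position - 1]


# Hypothetical bootstrap means, for illustration only.
toy_means = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
toy_left = smallest_at_least_percentile(2.5, toy_means)
toy_right = smallest_at_least_percentile(97.5, toy_means)
print(toy_left, toy_right)
```

With only ten values the middle 95% spans the whole collection; with thousands of bootstrap means the two endpoints cut off roughly 2.5% of the values in each tail.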

In [None]:
resampled_means = Table().with_columns(
    'Bootstrap Sample Mean', bstrap_means
)
resampled_means.hist(bins=np.arange(0,50,2.5))
plt.plot([left, right], [0, 0], color='yellow', lw=8)
plt.xlim(0,50)
plt.tight_layout()
plt.savefig('bootstrap_CI.png')

## *<font color='green'>Does the 95% confidence interval include a drop of 0? What does this imply about support for the hypothesis and about whether we can reject the null hypothesis?*

## *<font color='green'>Use Google or Generative AI to come up with a description of Hodgkin's disease*