# STA130 Tutorial 3 (with \<Your Favorite TA\>):<br> Bootstrapping and Bootstrapped Confidence Intervals

Today we're interested in estimating parameters based on a sample<br>in a way that automatically suggests how accurate an estimate is for that sample!

|![](im/4/garfield_confint.png)|
|-|


### Language of ~Love~ 💔 Statistical Confidence (5 Minutes) [Click "down" next] 
The chance that a ***RANDOMLY*** drawn sample is sufficiently representative of the population...<br>$\quad$so that it produces a "[...80/90/95/99...]% confidence interval"... <br>$\quad\quad$which will have captured the true parameter value it estimates is (approximately) [...80/90/95/99...]% 

- *The sample **MUST** be drawn **RANDOMLY** for this to be true... ["(approximately)" because: 'advanced reasons']*

Saying, **"We have X% confidence that the true parameter value being estimated lies within the [a,b] interval"** is the "statistical language" to convey the above meaning. 

> Of course, either the true parameter value is or is not contained in the given confidence interval... **we never know for sure...** so...

| | |
|-|:--|
|<img src='https://s3-screenshots.int-cdn.lcpdfrusercontent.com/monthly_2022_01/gp_cop.png.ee8424dbd8a00715888e2aba49e29a78.png' style="width:600px" text="Grammar Police">|**NEVER SAY<br>"The 'probability' or 'chance' the true parameter value lies in the [a,b] confidence interval is X%"**<ul><li>or statistical language nitpickers will accuse you of meaning:<br>"Sometimes the true parameter value is randomly in this [a,b] interval, but other times it won't be..."</li><li>**when (of course) it can always only be *in* or *not in* a specific interval...**</li></ul>|

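The "X% of randomly drawn samples produce an interval that captures the parameter" reading can be checked by simulation. The sketch below is only an illustration under stated assumptions: the `population` is synthetic (normally distributed values invented for this example, not the tutorial data), and `bootstrap_ci` and `coverage` are hypothetical helper names. It draws many random samples, builds one 80% bootstrap interval per sample, and counts how often the known true mean is captured.

```python
import numpy as np

rng = np.random.default_rng(130)

# Hypothetical population (an assumption for illustration, not the tutorial data)
population = rng.normal(loc=27.0, scale=6.0, size=100_000)
true_mean = population.mean()

def bootstrap_ci(sample, confidence=0.80, reps=500):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    n = len(sample)
    # Each row is one bootstrap sample: n draws WITH replacement from `sample`
    boot_means = rng.choice(sample, size=(reps, n), replace=True).mean(axis=1)
    alpha = 1 - confidence
    return np.quantile(boot_means, (alpha / 2, 1 - alpha / 2))

def coverage(n_samples=200, n=100, confidence=0.80):
    """Fraction of intervals (one per fresh random sample) that capture the true mean."""
    captured = 0
    for _ in range(n_samples):
        sample = rng.choice(population, size=n, replace=False)  # a new random sample
        low, high = bootstrap_ci(sample, confidence)
        captured += int(low <= true_mean <= high)
    return captured / n_samples

rate = coverage()
print(f"{rate:.0%} of the 80% intervals captured the true mean {true_mean:.2f}")
```

Each individual interval either contains `true_mean` or it does not; the confidence level describes how often the procedure succeeds across random samples, which is why the printed rate should land near (not exactly at) 80%.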
### Mothers Average Age Giving Birth: Coding Review (<15 Minutes) [Click "down" next] 

- Let's look at data from the Child Health and Development Studies. 

> Birthweight, date, and gestational age were collected as part of the Child Health and Development Studies in 1961 and 1962 for a sample of 400 mothers who had babies in these two years. Information about the baby’s parents -- age, education, height, weight, and whether the mother smoked -- was also recorded.

Let's estimate the average age of mothers giving birth in "the population" using this sample of 400 mothers.


- What is "the population"? *["double click" for answer...]*
<!-- Ages of mothers having births in 1961 and 1962--> 

> There is an actual true average age of mothers giving birth in "the population"; so, any confidence interval we create either will or will not contain that true average age.

- Discuss, Fix, and Run ***bootstrap confidence interval*** estimation code in the next slide...
<!-- seed needs to be set outside of the for loop; bootstrapping uses `replace=True`! -->
<!-- increasing `reps` refines the bootstrap sampling distribution representation -->

## Mothers Age Giving Birth in 1961-62: Coding Review (Continued...) 


```python
import pandas as pd; import numpy as np; import plotly.express as px
gestation = pd.read_csv('data/4/gestation.csv')
confidence = 0.99 # Once code below is fixed and is working... What happens to the interval width when this changes?

n,reps = len(gestation.age),2000 # What is `n`? And once the code is working... What is `reps`? What does changing `reps` do?
boot_means = np.array([0.0]*reps, dtype='float') # `dtype='int'` won't work: `boot_means.dtype` is 'float64' to store decimals
for i in range(reps):
    np.random.seed(130) # FIX this placement, then: What does changing the seed do to the pseudorandom number generation?
    boot_sample = np.random.choice(gestation['age'], n, replace=False) # FIX to `True`, then: Does `n` change? Why or why not?
    boot_means[i] = boot_sample.mean()

bootstrap_CI = np.quantile(boot_means, ((1-confidence)/2, 1-(1-confidence)/2)) # These will be marked on the figure below
print(str(100*confidence)+"% confidence interval is", bootstrap_CI, "(based on using", np.round((1-confidence)/2,3),"and", np.round(1-(1-confidence)/2,3), 'quantiles which have a', confidence, 'difference)')
fig = px.histogram(pd.DataFrame({'Bootstrapped Means': boot_means}), x='Bootstrapped Means', title="If the sample was representative of the population, this demonstrates the variability of the means of (bootstrapped) samples from the (sample approximated) 'population':<br>This gives us an idea of how much variability there is in estimating a mean from a sample of size n=400, which is how we determine the confidence interval width!<br>The confidence interval then centers on the original sample mean which might be higher or lower than the true population mean with a "+str(100*confidence)+"% chance of 'covering' it.<br>Why is it INCORRECT TO say 'There is "+str(100*confidence)+"% chance that the true average age of mothers when giving birth (1961-62) is between 26.65... and 28.23... years old.'?"); fig.add_vline(bootstrap_CI[0]); fig.add_vline(bootstrap_CI[-1])
```

99.0% confidence interval is [27.445 27.445] (based on using 0.005 and 0.995 quantiles which have a 0.99 difference)

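An interval whose two endpoints are identical, as in the output above, is what the two unfixed lines produce. The sketch below illustrates why, under stated assumptions: `ages` is a synthetic stand-in for the `age` column (made-up integer ages, not the real data), and it uses `np.random.default_rng` for most of the draws rather than the legacy `np.random.seed` interface used in the slide.

```python
import numpy as np

rng = np.random.default_rng(130)
# Hypothetical stand-in for the `age` column (synthetic values, not the real data)
ages = rng.integers(18, 45, size=400).astype(float)
n, reps = len(ages), 1000

# Without replacement at size n, every "resample" is just a reshuffle of the sample,
# so every mean equals the original sample mean
means_no_replace = np.array([rng.choice(ages, n, replace=False).mean() for _ in range(reps)])

# With replacement, values can repeat or be left out, so the means vary
means_replace = np.array([rng.choice(ages, n, replace=True).mean() for _ in range(reps)])

# Re-seeding inside the loop restarts the pseudorandom sequence each time,
# so even with replacement the same bootstrap sample is drawn over and over
same_seed_means = []
for _ in range(3):
    np.random.seed(130)
    same_seed_means.append(np.random.choice(ages, n, replace=True).mean())

print("replace=False:", np.quantile(means_no_replace, (0.005, 0.995)))
print("replace=True: ", np.quantile(means_replace, (0.005, 0.995)))
print("seed reset inside loop:", same_seed_means)
```

Either problem alone is enough to collapse the bootstrap distribution to a single value, so both need fixing before the interval has any width.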

# Hedging: Video (<10 mins) [Click "down" next] 

> - So what was the average age at which women had children in 1961-62?

Hedging is helpful whenever you can’t say something is 100% one way or another, as is often the case. In statistics, hedging should be used based on the limitations of data and the strength and generalizability of the conclusions.

Play this video for students: [https://play.library.utoronto.ca/watch/3b3e6a4c6b633e04a19da8a261be2434](https://play.library.utoronto.ca/watch/3b3e6a4c6b633e04a19da8a261be2434)

> We hope a **sample** is representative of a **population**; but, small sample sizes mean generalizations -- such as the accuracy of **sample statistics** estimating **population parameters** -- should be viewed cautiously and not be used overconfidently...

The chance that a ***RANDOMLY*** drawn sample is sufficiently representative of the population...<br>$\quad$such that it produces a "[...80/90/95/99...]% confidence interval"... <br>$\quad\quad$which will have captured the true parameter value it estimates is (approximately) [...80/90/95/99...]%  


### Class Discussion (10 minutes) ["Double Click"/"Exit Slide Mode" if example answers are needed]

- How do **confidence intervals** quantitatively facilitate **hedging** regarding a statistic estimating a parameter? 
<!-- A statistic is a point estimate of a parameter; but, depending on the confidence level, and
the sample size (as will be demonstrated a little later), the confidence interval will be narrower or wider -->
<!-- which allows us to naturally "hedge" on how relatively accurate and precise we think our estimate might be in a quantitative way based on statistical "confidence". -->

- Is **hedging** relevant when discussing a **confidence interval** based on a sample that was not quite **random**?
<!-- Yes, insofar as the sample is representative of the population the estimate based on the sample might be fairly good; but, we won't have the confidence interval guarantees, and we won't know if there's some systematic bias in the sample that might confound the estimate and make it bad. -->
<!-- So, we need to be cautious in how strongly we present our conclusions if the sample is not totally random -->

- From the perspective of **bootstrapping** <br>why is (or isn't) it important that a (random) **sample** is representative of a **population**? 
<!-- We bootstrap samples from the original sample in a manner that acts as if the original sample was the actual population; so, the more closely the sample resembles the population, the better the bootstrap samples are at acting like samples from the population! -->
<!-- Now of course, the more similar to the population the sample is, the more accurate the point estimate will be (meaning that we will indeed have an interval that will capture the true parameter value it's estimating). -->
<!-- We can nonetheless rely on the confidence level regardless of this (assuming we have a random sample), though, because the confidence level has to do with the chance that the sample will indeed be representative of the population! --> 

#### Tutorial Activity: Quiz (15 minutes) / Review (10 minutes) ["Double Click"/"Exit Slide Mode" if example answers are needed]

- *Turn in your answers for your Tutorial Activity mark (along with the group's answers from the next group activity)*

> - Question credit for attempting to provide an answer:<br>answers will not be reviewed in detail during marking   

| | |
|:-|:-|
|0. What's your name? | 5. Write code to return the 80% confidence interval for the list of bootstrapped medians in `bootstrapped_medians`. <!-- `np.quantile(bootstrapped_medians, (0.1, 0.9))` or `np.percentile(bootstrapped_medians, (10, 90))` --> |
|1. What is the difference between a statistic and a parameter?  <!-- a parameter is a statistic if it was calculated for every possible data point in a population... a statistic is just calculated for a sample! --> | 6. Interpret what it means if the 80% confidence interval for the code for the question above gives `(0.62, 0.93)`. <!-- There is an 80% chance that the random sample drawn resulted in a confidence interval construction procedure that captured the true median in the population --> |
|2. What is the purpose of bootstrapping? <!-- ACCEPTABLE: to create confidence intervals! BETTER: to simulate the sampling distribution of a statistic to see how much variability there is in statistic calculations at a given sample size! --> | 7. If we want to be more confident in capturing the true median, should we use a larger or smaller confidence interval? <!-- Larger --> |
| 3. When bootstrapping a sample, what sample size do we use and should we sample with replacement? <!-- the same sample size as the sample (so that we understand the variability of the statistic under the sample size of the original samples); and, yes(!) we sample with replacement (because of the answer to the next question; and, so that we don't just get the same sample every time)! --> | 8. If we made 100 bootstrapped confidence intervals based on 100 random samples, approximately how many of the 100 bootstrapped confidence intervals would we expect to capture the true median? <!-- Around 80 --> | 
| 4. True or False: "bootstrapping acts as if a sample is a population" <!-- True! --> | 9. Describe the difference between the two lines of code below. Assume `df_a_data_frame` is a pandas data frame and `x_a_list_of_numbers` is a list (or numpy array) of numbers. |

```python
df_a_data_frame.sample(n=500)
np.random.choice(x_a_list_of_numbers, 500)
```
<!-- `df.sample(n=500)` makes a new data frame out of 500 unique randomly chosen rows (with default `replace=False`); whereas, `np.random.choice` uses `replace=True` by default to create a bootstrap sample from a list (or `np.array`, etc.) of numbers -->



## Confidence Interval Widths (10 mins) 
- *Break into groups of three or four or so: **turn in your group's answers with your Quiz answers**.*

|![](im/4/garfield_confint.png)|
|-|

Compare the following two statements, and discuss and answer the following questions:
1. "We have 99.99% confidence that tomorrow's high temperature will be between -40 and 200 degrees!"
2. "We have 80% confidence that tomorrow's high temperature will be between 10 and 20 degrees!"


- What does it mean when we say we have "99.99% confidence" or "80% confidence"?
- Is it always desirable to have higher confidence levels in predictions? 
    - Which one of the intervals above is more helpful deciding what clothing to wear tomorrow?
- What's the risk with making confidence interval statements and what tradeoff can be used to address this?

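The trade-off between confidence level and interval width can be seen by computing several intervals from one set of bootstrapped means. This is a minimal sketch, assuming a synthetic `sample` (invented values standing in for any observed data); only the confidence level changes between intervals.

```python
import numpy as np

rng = np.random.default_rng(130)
# Hypothetical sample (synthetic; stands in for any observed data)
sample = rng.normal(loc=15.0, scale=5.0, size=400)

reps = 2000
# One set of bootstrapped means, reused for every confidence level
boot_means = rng.choice(sample, size=(reps, len(sample)), replace=True).mean(axis=1)

widths = {}
for confidence in (0.80, 0.90, 0.95, 0.99):
    low, high = np.quantile(boot_means, ((1 - confidence) / 2, 1 - (1 - confidence) / 2))
    widths[confidence] = high - low
    print(f"{confidence:.0%} interval: [{low:.2f}, {high:.2f}]  width = {high - low:.2f}")
```

Because the quantile pairs are nested, asking for more confidence from the same data can only widen the interval: higher confidence is paid for with a less specific statement.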


### Confidence Interval Widths: Code Demonstration (<5 Minutes) 
## How Do We Get Narrower Confidence Intervals?
> At a fixed confidence level, narrower intervals are more interpretable, meaningful, and understandable; **therefore, more likely to be helpful, usable, and actionable!**

```python
import pandas as pd; import numpy as np; import plotly.express as px; data = pd.read_csv('data/4/gestation.csv'); np.random.seed(130); reps = 2000
confidence, intervals = 0.9, pd.DataFrame({'Confidence Interval': [], 'n': []})
# trying some different sizes of bootstrap samples!
for n in [5, 10, 15, 30, 100, 250, 400]:
    boot_means = np.array([0.0]*reps, dtype='float')
    for i in range(reps):
        boot_means[i] = np.random.choice(data['age'], n).mean() # `replace=True` is the default
    intervals = pd.concat([intervals,pd.DataFrame({'Confidence Interval': np.quantile(boot_means, ((1-confidence)/2, 1-(1-confidence)/2)), 'n': 2*[n]})])
fig = px.line(intervals, x='n', y='Confidence Interval', color='n', title='Diminishing Returns: notice how these '+str(100*confidence)+'% confidence intervals get narrower at larger sample sizes, but they do this more and more slowly the larger n is...<br>Why should we only use n=400 in this context? Why not n=500? Or n=1000? Or...? [The original sample size was n=400 and these intervals show the variability<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;in bootstrap means at different bootstrap sample sizes (where we "bootstrap population simulate" from the original sample to examine sample size behavior)]<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[What we care about is the variability of the means when the sample size is n=400 since that\'s the sample size that we actually have...]', height=400); fig.update_traces(line={'width': 10}); fig.update_xaxes(range=[0, 425])
```

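The diminishing returns follow from the variability of a sample mean shrinking roughly in proportion to $1/\sqrt{n}$, so quadrupling the sample size only about halves the interval width. The sketch below checks this numerically under stated assumptions: `sample` is synthetic (not the gestation data) and `interval_width` is a hypothetical helper written for this illustration.

```python
import numpy as np

rng = np.random.default_rng(130)
# Hypothetical sample of size 400 (synthetic values, not the gestation data)
sample = rng.normal(loc=27.0, scale=6.0, size=400)
confidence, reps = 0.90, 2000

def interval_width(n):
    """Width of the percentile interval of `reps` bootstrap means, each from n draws."""
    boot_means = rng.choice(sample, size=(reps, n), replace=True).mean(axis=1)
    low, high = np.quantile(boot_means, ((1 - confidence) / 2, 1 - (1 - confidence) / 2))
    return high - low

# Each step quadruples n, so each width should be roughly half the previous one
widths = {n: interval_width(n) for n in (25, 100, 400)}
for n, w in widths.items():
    print(f"n={n:3d}  width={w:.2f}  width*sqrt(n)={w * np.sqrt(n):.1f}")
```

If the $1/\sqrt{n}$ scaling holds, the last printed column stays roughly constant across rows, which is another way of saying that each additional observation buys less narrowing than the one before it.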
### Confidence Interval Widths: Code Demonstration (<5 Minutes)
## How Do We Get Narrower Confidence Intervals?
> At a fixed confidence level, narrower intervals are more interpretable, meaningful, and understandable; **therefore, more likely to be helpful, usable, and actionable!**

```python
import pandas as pd; import numpy as np; import plotly.express as px; data = pd.read_csv('data/4/gestation.csv'); np.random.seed(130); bootstrap_reps,confidence_interval_reps = 100,100
confidence, intervals = 0.9, pd.DataFrame({'Confidence Interval Length': [], 'n': []})
for n in np.arange(5,101,5, dtype='int'):
    for i in range(confidence_interval_reps):
        boot_means = np.array([0.0]*bootstrap_reps, dtype='float')
        for j in range(bootstrap_reps):
            boot_means[j] = np.random.choice(data['age'], n, replace=False).mean() # random (sub)samples
        intervals = pd.concat([intervals,pd.DataFrame({'Confidence Interval Length': np.diff(np.quantile(boot_means, ((1-confidence)/2, 1-(1-confidence)/2))), 'n': [n]})])
px.box(intervals, x='n', y='Confidence Interval Length', title='Boxplots of the distributions of lengths of '+str(confidence_interval_reps)+' '+str(100*confidence)+'% bootstrap confidence intervals each created from random (sub)samples (of size n=5, 10, 15, ..., up to 100)', height=400)
```

## Tutorial Assignment (get started...) *[Click "down" next]*

- Submit your work for the assignment through Quercus

**Marking is based on clarity and correctness of your written submission as judged from the perspective of a nonstatistical audience:** *don't be afraid to use technical statistical terms, but just be sure to explain their meaning in simple and understandable ways that would help a nonstatistical audience make sense of what you're talking about when you do!*

- Don't spend more than 60 minutes on this assignment (unless really needed...)

    - Aim for something close to 200 to 500 words
    - Grammar is not the main focus of the assessment, but it is important that you communicate in a clear and professional manner; so,
        - use full sentences (without slang or emojis)




## Tutorial Assignment (complete at home if needed) *[Click "down" next]*

You are once again chatting on the phone with your friend who has not taken STA130. Your friend enjoyed your previous conversation about data visualization, so they've followed up by asking you if you've learned anything new in your STA130 course. You decide you'll tell them about the new technique you just learned called bootstrapping(!); but, since Halloween is coming up you tell them you'll explain bootstrapping by using it to estimate what proportion of all the candy bought for Halloween is their favorite candy based on the contents of their trick-or-treat bag after trick-or-treating. 

On Halloween you, your friend, and a whole bunch of your classmates go trick-or-treating and each acquire a bag of candy from various houses. Explain to your friend what bootstrapping is and how you can use it to estimate the proportion of their favorite candy based on their own bag of Halloween candy. In doing so, you should carefully discuss the following.
- What is the population and parameter of interest?
- What is the role of random sampling (and how could eating candy during trick-or-treating affect this)?
- How is the parameter estimated and what are the steps in the process for doing so?

Make sure your friend understands how to interpret their bootstrap confidence interval by considering all the additional hypothetical confidence intervals that could be made from your and your classmates' trick-or-treat bags. 

## Tutorial Assignment (complete at home if needed)

Now that you've told your friend a little bit about the considerations surrounding bootstrap confidence intervals, as well as the process for making them, test your own knowledge by answering the following question:
- Suppose you, your friend, and your classmates all have exactly the same number of pieces of candy in your trick-or-treat bags; and, that this number is the same as the number of people who went trick-or-treating together (you, your friend, and your classmates); and, that each of you has recorded the proportion of the candy in your own trick-or-treat bags that is your friend's favorite candy. What is the difference between  
    - your friend's bootstrap sampling distribution of the proportion of their favorite candy in their trick-or-treat bag
    
    and 
    
    - the sampling distribution of the proportion of their favorite candy observed across all trick-or-treat bags?

> Hint: the proportion in each bag is an estimate of the population parameter; so, the average of the proportions of all trick-or-treat bags would be a better estimate than the proportion of your friend's bag, right? Is the location that the bootstrap sampling distribution is centered on perhaps different from this? Where is the bootstrap sampling distribution centered compared to this "average of proportions"?
