# Lab 9: Statistical Inference (part a: sampling distributions)

### Acknowledgment

This lab has been adapted from the worksheets associated with the online textbook [Data Science: A First Introduction (Python Edition)](https://python.datasciencebook.ca/index.html) by [Campbell, Timbers, Lee, Heagy, and Ostblom](https://python.datasciencebook.ca/authors.html) and shared under an [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.

### Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:
- Describe real-world examples of questions that can be answered with statistical inference methods.
- Name common population parameters (e.g., mean, proportion, median, variance, standard deviation) that are often estimated using sample data, and use computation to estimate these.
- Define the following statistical sampling terms (population, sample, population parameter, point estimate, sampling distribution).
- Explain the difference between a population parameter and sample point estimate.
- Use computation to draw random samples from a finite population.
- Use computation to create a sampling distribution from a finite population.
- Describe how sample size influences the sampling distribution.

This worksheet covers parts of [Chapter 6](https://hyosubkim.github.io/datasci-for-kin/6-Inference/inference.html) of the online textbook. You should read this chapter before attempting this assignment. Any place you see `___`, you must fill in the function, variable, or data to complete the code. 

In [None]:
### Run this cell before continuing.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Make our figures classy
sns.set_theme(style='darkgrid')

**Question 1.1** Matching:

Read the mixed-up table below and assign each variable in the code cell below a number that matches the term to its correct definition. Do not put quotation marks around the number or include words in the answer; we are expecting the assigned values to be numbers.

| Terms |  Definitions |
|----------------|------------|
| <p align="left">point estimate | <p align="left">1. the entire set of entities/objects of interest |
| <p align="left">population | <p align="left">2. selecting a subset of observations from a population where each observation is equally likely to be selected at any point during the selection process|
| <p align="left">random sampling | <p align="left">3. a numerical summary value about the population |
| <p align="left">population parameter | <p align="left"> 4. a distribution of point estimates, where each point estimate was calculated from a different random sample from the same population |
| <p align="left">sample | <p align="left">5. a collection of observations from a population |
| <p align="left">observation | <p align="left">6. a single number calculated from a random sample that estimates an unknown population parameter of interest |
| <p align="left">sampling distribution | <p align="left">7. a quantity or a quality (or set of these) we collect from a given entity/object |


In [None]:
# ENTER BELOW
point_estimate = ...
population = ...
random_sampling = ...
population_parameter = ...
sample = ...
observation = ...
sampling_distribution = ...


###  Virtual sampling simulation

In real life, we rarely, if ever, have measurements for our entire population. Here, however, we will pretend that we were somehow able to ask every single Canadian senior what their age is. We will do this so that we can experiment to learn about sampling and how it relates to estimation.

Here we make a simulated dataset of ages for our population (all Canadian seniors) bounded by realistic values ($\geq$ 65 and $\leq$ 118):

In [None]:
# Run this cell to simulate a large finite population
# Don't change the seed!
np.random.seed(4321)

can_seniors = pd.DataFrame({
    'age': np.random.exponential(1 / 0.1, 2000000) ** 2 + 65,
}).query(
    "65 <= age <= 118"
)

can_seniors

**Question 1.2** 

A distribution defines all the possible values (or intervals) of the data and how often they occur. Visualize the distribution of the population (`can_seniors`) that was just created by plotting a histogram using `seaborn` with the `bins` argument set to 30. Give the x-axis a descriptive label.

In [None]:
# YOUR ANSWER HERE
fig, ax = plt.subplots()
sns.histplot()
ax.set_xlabel()
ax.set_title()
plt.show()


**Question 1.3** 

We often want to represent distributions by a single value or a small number of values. Common values used for this include the mean, median, and standard deviation.

Use the `agg` method to calculate the following population parameters from the `can_seniors` population:

- mean (`mean`)
- median (`median`)
- standard deviation (`std`)

*Name the resulting data frame `pop_parameters` (it should have only one column, called `age`, and one population parameter per row).*
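
As a rough illustration of the shape this produces, here is a minimal sketch of `agg` on a hypothetical toy data frame (the `height` column and its values are made up for this example and are not part of the lab data). Note that pandas' `std` divides by n - 1 by default.

```python
import pandas as pd

# Hypothetical toy "population" of five heights (cm); not the lab's data
toy_population = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0, 190.0]})

# Passing a list of function names returns one row per summary statistic
toy_parameters = toy_population.agg(["mean", "median", "std"])
print(toy_parameters)
```

The result keeps the original column name and uses the function names as row labels.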

In [None]:
# ENTER YOUR ANSWER HERE
pop_parameters = ...

**Question 1.4** 

In real life, we are usually able to collect only a single sample from the population. We use that sample to try to infer what the population looks like.

Take a single random sample of 40 observations using `sample` from the Canadian seniors population (`can_seniors`). Name it `sample_1`. Use 4321 as your `random_state`. `random_state` is another parameter you can assign when using `sample` (i.e., you don't have to set the random seed outside of `sample` as we've done in the textbook and lecture).
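
The following sketch shows how `random_state` behaves, using a hypothetical toy data frame (`toy_population` and its `value` column are assumptions for illustration only). Two draws with the same `random_state` should return the same rows, and by default `sample` draws without replacement.

```python
import pandas as pd

# Hypothetical toy population with 100 distinct values; not the lab's data
toy_population = pd.DataFrame({"value": range(100)})

# Same random_state twice, then a different random_state
draw_a = toy_population.sample(n=5, random_state=1)
draw_b = toy_population.sample(n=5, random_state=1)
draw_c = toy_population.sample(n=5, random_state=2)

print(draw_a.equals(draw_b))      # do the two seeded draws match?
print(draw_a["value"].is_unique)  # are rows drawn without repeats?
print(draw_c)
```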

In [None]:
# ENTER YOUR ANSWER HERE


**Question 1.5** 

Visualize the distribution of the random sample you just took (`sample_1`) by plotting a histogram using `bins=30`. Just as in the population histogram we created above, give the plot a title; a suitable choice could be `"Sample 1 distribution"`.


In [1]:
# ENTER YOUR ANSWER HERE



**Question 1.6** 

Use `agg` to calculate the following point estimates from the random sample you just took (`sample_1`):

- mean 
- median 
- standard deviation 

*Name this data frame `sample_1_estimates`.*

In [2]:
# ENTER YOUR ANSWER HERE


Let's now compare our random sample to the population from which it was drawn with `seaborn` histograms.

In [None]:
# run this code cell
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(6, 8), sharex=True)
sns.histplot(data=can_seniors, bins=30, ax=ax[0], legend=False)
ax[0].set_title("Population")

sns.histplot(data=sample_1, bins=30, ax=ax[1], legend=False)
ax[1].set_xlabel("Age (years)")
ax[1].set_title("Sample 1 distribution")

plt.show()

And now let's compare the point estimates (mean, median and standard deviation) with the true population parameters we were trying to estimate:

In [None]:
# run this cell
pop_parameters

In [None]:
# run this cell
sample_1_estimates

**Question 1.7** Multiple Choice

After comparing the population and sample distributions above, and the true population parameters and the sample point estimates, which statement below **is not** correct:

A. The sample point estimates are close to the values for the true population parameters we are trying to estimate

B. The sample distribution is of a similar shape to the population distribution

C. The sample point estimates are identical to the values for the true population parameters we are trying to estimate

*Assign your answer to an object called `answer1_7`. Your answer should be a single character surrounded by quotes.*

In [None]:
# ENTER YOUR ANSWER HERE


**Question 1.8.0** 

What if we took another sample? What would we expect? Let's try! Take another random sample of size 40 from the population (using the different random seed `2020` so that you get a different sample), visualize its distribution with the title `"Sample 2 distribution"`, and calculate the point estimates for the sample mean, median, and standard deviation.

*Name your random sample of data `sample_2`, and name your estimates `sample_2_estimates`.*

In [None]:
# ENTER YOUR ANSWER HERE


In [None]:
# Run this cell
sample_2_estimates

**Question 1.8.1** 

After comparing the distribution and point estimates of this second random sample from the population with that of the first random sample and the population, which of the following statements below **is not** correct:

A. The sample distributions from different random samples are of a similar shape to the population distribution, but they vary a bit depending on which values are captured in the sample

B. The sample point estimates from different random samples are close to the values for the true population parameters we are trying to estimate, but they vary a bit depending on which values are captured in the sample

C. Every random sample from the same population should have an identical set of values and yield identical point estimates.

*Assign your answer to an object called `answer1_8_1`. Your answer should be a single character surrounded by quotes.*

In [None]:
# ENTER YOUR ANSWER HERE


### Exploring the sampling distribution of an estimate

Just how much should we expect the point estimates of our random samples to vary? To build an intuition for this, let's experiment a little more with our population of Canadian seniors. To do this we will take 1000 random samples, and then calculate the point estimate we are interested in (let's choose the mean for this example) for each sample. Finally, we will visualize the distribution of the sample point estimates. This distribution will tell us how much we would expect the point estimates of our random samples to vary for this population for samples of size 40 (the size of our samples).
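
The procedure described above can be sketched on a hypothetical toy population (the uniform values, 200 replicates, and sample size of 25 are arbitrary choices for this illustration, not the settings used in the questions below). It follows the same `pd.concat` and `groupby` pattern as the textbook.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed for this sketch

# Hypothetical toy population of 10,000 values; not the lab's data
toy_population = pd.DataFrame({"value": np.random.uniform(0, 100, 10_000)})

# Draw 200 samples of size 25, tagging each row with its replicate number
toy_samples = pd.concat([
    toy_population.sample(25).assign(replicate=n)
    for n in range(200)
])

# One point estimate (the mean) per replicate
toy_estimates = (
    toy_samples.groupby("replicate")
    .mean()
    .reset_index()
    .rename(columns={"value": "mean_value"})
)
print(toy_estimates.head())
```

A histogram of `mean_value` would then show the sampling distribution of the sample mean for this toy population.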

**Question 1.9** 

Draw 1000 random samples from our population of Canadian seniors (`can_seniors`). Each sample should have 40 observations. Use a list comprehension wrapped in `pd.concat` as in the textbook, and name the resulting data frame `samples`.

In [None]:
np.random.seed(4321) # DO NOT CHANGE

# ENTER YOUR ANSWER HERE




**Question 2.0** 

Group by the sample replicate number, and then for each sample, calculate the mean as the point estimate. Name the data frame `sample_estimates`. Use `reset_index` and `rename(columns=___)`, so that the final data frame has the column names `replicate` and `mean_age`.

In [None]:
# ENTER YOUR ANSWER HERE



**Question 2.1** 

Visualize the distribution of the sample estimates (`sample_estimates`) you just calculated by plotting a histogram using `bins=30`. Title the plot `"Sampling distribution of the sample means"` and give the x-axis a descriptive label.

In [None]:
# ENTER YOUR ANSWER HERE



**Question 2.2** 

Let's refresh our memories: what is the mean age of the whole population (we calculated this above)? *Assign your answer to an object called `answer2_2`. Your answer should be a single number reported to two decimal places.*


In [None]:
# ENTER YOUR ANSWER HERE


**Question 2.3** Multiple Choice

Considering the true value for the population mean, and the sampling distribution you created and visualized in **question 2.1**, which of the following statements below **is not** correct:

A. The sampling distribution is centered at the true population mean

B. All the sample means are the same value as the true population mean

C. Most sample means are at or very near the true population mean

D. A few sample means are far away from the true population mean

*Assign your answer to an object called `answer2_3`. Your answer should be a single character surrounded by quotes.*

In [None]:
# ENTER YOUR ANSWER HERE


**Question 2.4** True/False

Taking a random sample and calculating a point estimate is a good way to get a "best guess" of the population parameter you are interested in. True or False?

*Assign your answer to an object called `answer2_4`. Your answer should be a boolean, i.e., `True` or `False`.*

In [None]:
# ENTER YOUR ANSWER HERE


### The influence of sample size on the sampling distribution

What happens to our point estimate when we change the sample size? Let's answer this question by experimenting! We will create 3 different sampling distributions of sample means, each using a different sample size. As we did above, we will draw samples from our Canadian seniors population. We will visualize these sampling distributions and see if we can see a pattern when we vary the sample size.
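
Besides comparing histograms by eye, one way to put a number on the width of a sampling distribution is the standard deviation of the sample means. The sketch below does this for a hypothetical toy population; `sampling_spread` is a helper invented for this example, and the sample sizes and replicate count are arbitrary.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed for this sketch

# Hypothetical skewed toy population; not the lab's data
toy_population = pd.DataFrame({"value": np.random.exponential(10, 10_000)})


def sampling_spread(sample_size, n_replicates=300):
    """Standard deviation of the sample means across replicates."""
    means = [
        toy_population["value"].sample(sample_size).mean()
        for _ in range(n_replicates)
    ]
    return float(np.std(means))


# Compare the spread of the sample means for two sample sizes
spread_by_size = {n: sampling_spread(n) for n in (10, 100)}
print(spread_by_size)
```

Comparing the two printed values gives a numeric counterpart to what the histograms in the questions below show.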

**Question 2.5** 

Using the same strategy as you did above, draw 1000 random samples from the Canadian seniors population (`can_seniors`), each of size **20**. For each sample, calculate the mean age and assign this data frame to an object called `sample_estimates_20`. As previously, make sure you use `reset_index` so that the data frame has the columns `replicate` and `mean_age`. 

Then, visualize the distribution of the sample estimates (means) you just calculated by plotting a histogram using `bins=30`. Give the x-axis a descriptive label. Give the plot the title `"Sampling distribution with sample size n=20"`.

In [None]:
np.random.seed(4321)  # DO NOT CHANGE

# ENTER YOUR ANSWER HERE



In [None]:
# ENTER YOUR ANSWER HERE


**Question 2.6** 

Using the same strategy as you did above, draw 1000 random samples from the Canadian seniors population (`can_seniors`), each of size **100**. For each sample, calculate the mean age and assign this data frame to an object called `sample_estimates_100`. As previously, make sure you use `reset_index` so that the data frame has the columns `replicate` and `mean_age`. 

Then, visualize the distribution of the sample estimates (means) you just calculated by plotting a histogram using `bins=30`. Give the x-axis a descriptive label. Give the plot the title `"Sampling distribution with sample size n=100"`.

In [None]:
np.random.seed(4321)  # DO NOT CHANGE

# ENTER YOUR ANSWER HERE



**Question 2.7** 

Next, let's compare the three sampling distributions together. To do this more effectively, we need to change the histograms' x-axes to span the same range (hint: remember the `sharex` argument?). You can copy most of the code you've already written for this. Just make sure you have 3 sets of axes to work with.

In [None]:
# ENTER YOUR ANSWER HERE



**Question 2.8** Multiple Choice

Considering the panel figure you created above in **question 2.7**, which of the following statements below **is not** correct:

A. As the sample size increases, the sampling distribution of the point estimate becomes narrower.

B. As the sample size increases, more sample point estimates are closer to the true population mean.

C. As the sample size decreases, the sample point estimates become more variable.

D. As the sample size increases, the sample point estimates become more variable.

*Assign your answer to an object called `answer2_8`. Your answer should be a single character surrounded by quotes.*

In [None]:
# ENTER YOUR ANSWER HERE


**Question 2.9** True/False

Given what you observed above, and considering the real life scenario where you will only have one sample, answer the True/False question below:

The smaller your random sample, the better your sample point estimate reflects the true population parameter you are trying to estimate. True or False?

*Assign your answer to an object called `answer2_9`. Your answer should be a boolean, i.e., `True` or `False`.*

In [None]:
# ENTER YOUR ANSWER HERE


---
# Lab 9: Statistical Inference (part b: bootstrapped distributions)

### Acknowledgment

This lab has been adapted from the worksheets associated with the online textbook [Data Science: A First Introduction (Python Edition)](https://python.datasciencebook.ca/index.html) by [Campbell, Timbers, Lee, Heagy, and Ostblom](https://python.datasciencebook.ca/authors.html) and shared under an [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.

### Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:
- Explain why we don't have a sampling distribution in practice/real life.
- Define bootstrapping.
- Use Python to create a bootstrap distribution to approximate a sampling distribution.
- Contrast bootstrap and sampling distributions.

This worksheet covers parts of [Chapter 6](https://hyosubkim.github.io/datasci-for-kin/6-Inference/inference.html) of the online textbook. You should read this chapter before attempting this assignment. Any place you see `___`, you must fill in the function, variable, or data to complete the code. 

In [None]:
### Run this cell before continuing.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Make our figures classy
sns.set_theme(style='darkgrid')

**Question 1.1** True/False:

In real life, we typically take many samples from the population and create a sampling distribution when we perform estimation. True or false?

*Assign your answer to an object called `answer1_1`. Your answer should be a boolean, i.e., `True` or `False`.*

In [None]:
# ENTER YOUR ANSWER HERE



**Question 1.2** Ordering

Correctly re-order the steps for creating a bootstrap sample from those listed below. 

1. record the observation's value
2. repeat the above the same number of times as there are observations in the original sample 
3. return the observation to the original sample
4. randomly draw an observation from the original sample (which was drawn from the population)

Create your answer as a list called `answer1_2` that contains the step numbers above in the correct order for creating a bootstrap sample.

In [None]:
# ENTER YOUR ANSWER HERE


**Question 1.3** Multiple choice

From the list below, choose the correct description of a bootstrap distribution for a point estimate:

A. a list of point estimates calculated from many samples drawn with replacement from the population

B. a list of point estimates calculated from many samples drawn without replacement from the population

C. a list of point estimates calculated from bootstrap samples drawn with replacement from a single sample (that was drawn from the population)

D. a list of point estimates calculated from bootstrap samples drawn without replacement from a single sample (that was drawn from the population)

*Assign your answer to an object called `answer1_3`. Your answer should be an uppercase letter surrounded by quotes (e.g., `"F"`).*

In [None]:
# ENTER YOUR ANSWER HERE


**Question 1.4** Multiple choice

From the list below, choose the correct explanation of why, when performing estimation, we want to report a plausible **range** for the true population quantity we are trying to estimate along with the point estimate:

A. The point estimate is our best guess at the true population quantity we are trying to estimate

B. The point estimate will often not be the exact value of the true population quantity we are trying to estimate

C. The value of a point estimate from one sample might very well be different than the value of a point estimate from another sample.

D. B & C

E. A & C

F. None of the above

*Assign your answer to an object called `answer1_4`. Your answer should be an uppercase letter surrounded by quotes (e.g., `"F"`).*

In [None]:
# ENTER YOUR ANSWER HERE


###  Continuing with our virtual population of Canadian seniors from last worksheet

Here we re-create the virtual population (ages of all Canadian seniors) we used in the last worksheet. It was bounded by realistic values ($\geq$ 65 and $\leq$ 118):

In [None]:
# Run this cell to simulate a large finite population
# Don't change the seed!
np.random.seed(4321)

can_seniors = pd.DataFrame({
    'age': np.random.exponential(1 / 0.1, 2000000) ** 2 + 65,
}).query(
    "65 <= age <= 118"
)
can_seniors

Let's remind ourselves of what this population looks like:

In [None]:
# Run this cell
fig, ax = plt.subplots()
sns.histplot(data=can_seniors, x="age", bins=30, ax=ax)
ax.set_xlabel("Age (years)")
ax.set_title("Population of Canadian seniors", fontsize=16)
plt.show()

### Estimate the mean age of Canadian Seniors

Let's say we are interested in estimating the mean age of Canadian seniors. Given that we have the population (we created it), we could just calculate the mean age from this population data. However, in real life, we usually only have one small-ish sample from the population. Also, from our experimentation with sampling distributions, we know that different random samples will give us different point estimates. We also know from these experiments that the point estimates from different random samples will mostly be close to the true population quantity we are trying to estimate, and how close depends on the sample size.

What about in real life though, when we only have one sample? Can we say how close? Or at least give some plausible range of where we would expect the population quantity we are trying to estimate to fall? Yes! We can do this using a method called bootstrapping! Let's explore how to create a bootstrap distribution from a single sample using Python, and then we will discuss how the bootstrap distribution relates to the sampling distribution and what it can tell us about the true population quantity we are trying to estimate.

Let's draw a single sample of size 40 from the population and visualize it:

In [None]:
# Run this cell
one_sample = can_seniors.sample(40, random_state=12345)
one_sample

In [None]:
# Run this cell
fig, ax = plt.subplots()
sns.histplot(data=one_sample, x="age", bins=30, ax=ax)
ax.set_xlabel("Age (years)")
fig.suptitle("Distribution of one sample")
plt.show()

**Question 1.5** 

Calculate the mean age (our point estimate of interest) from the random sample you just took (`one_sample`). Assign the result to a variable called `one_sample_estimates`.

In [None]:
# ENTER YOUR ANSWER HERE


**Question 1.6** 

To generate a single bootstrap sample in Python, we can use the `sample` method with `frac=1` to indicate that the bootstrap sample size is the same as the original sample. In contrast to when we created a sampling distribution from a population, we will set `replace=True` to ensure we don't end up with the exact same sample each time when performing the bootstrap.
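
To see what these two arguments do, here is a minimal sketch on a hypothetical six-value sample (`toy_sample` and its values are made up for illustration). With `replace=True` some values can appear more than once and others not at all, whereas `frac=1` without replacement only shuffles the original rows.

```python
import pandas as pd

# Hypothetical toy sample of six observations; not the lab's data
toy_sample = pd.DataFrame({"value": [3, 8, 15, 22, 31, 47]})

# A bootstrap sample: same size as the original, drawn with replacement
toy_boot = toy_sample.sample(frac=1, replace=True, random_state=0)

# For contrast: same size, drawn without replacement (a reshuffle only)
toy_shuffle = toy_sample.sample(frac=1, random_state=0)

print(toy_boot["value"].value_counts())
print(sorted(toy_shuffle["value"]) == sorted(toy_sample["value"]))
```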

Use `sample` to take a single bootstrap sample from the sample you drew from the population. Use 4321 as the `random_state` (again, to set the seed for our "random" number generator, and so that we can reproduce these results exactly) and name this bootstrap sample `boot1`.

In [None]:
# ENTER YOUR ANSWER HERE


**Question 1.7** Multiple choice

Why do we set `replace` to `True`?

A. Taking a bootstrap sample involves drawing observations from the original population without replacement

B. Taking a bootstrap sample involves drawing observations from the original population with replacement

C. Taking a bootstrap sample involves drawing observations from the original sample without replacement

D. Taking a bootstrap sample involves drawing observations from the original sample with replacement

*Assign your answer to an object called `answer1_7`. Your answer should be an uppercase letter surrounded by quotes (e.g., `"F"`).*

In [None]:
# ENTER YOUR ANSWER HERE


**Question 1.8** 

Visualize the distribution of the bootstrap sample you just took (`boot1`). Set `bins=30`, give the plot a descriptive title, and give the x-axis a descriptive label.

In [None]:
# ENTER YOUR ANSWER HERE


Let's now compare our bootstrap sample to the original random sample that we drew from the population:

In [None]:
# Run this code cell
fig, ax = plt.subplots(2, 1, figsize=(6, 8), sharex=True)
sns.histplot(data=one_sample, x="age", bins=30, ax=ax[0])
ax[0].set_xlabel("Age (years)")
ax[0].set_title("Distribution of one sample", fontsize=16)

sns.histplot(data=boot1, x="age", bins=30, ax=ax[1])
ax[1].set_xlabel("Age (years)")
ax[1].set_title("Single bootstrap sample", fontsize=16)
plt.show()


Earlier we calculated the mean of our original sample to be about 79.6 years. What is the mean of our bootstrap sample?

In [None]:
# Run this cell
boot1.mean()

We see that the original sample distribution and the bootstrap sample distribution are of similar shape, but not identical. They also have different means. The difference in the frequency of the values in the bootstrap sample (and the difference in the value of the mean) comes from sampling from the original sample with replacement. Why sample with replacement? If we didn't, we would end up with the original sample again. **What we are trying to do with bootstrapping is to mimic drawing another sample from the population, without actually doing that.**

Why are we doing this? As mentioned earlier, in real life we typically only have one sample and thus we cannot create a sampling distribution that we can use to tell us about how we might expect our point estimate to behave if we took another sample. What we can do instead, is to use our sample as an estimate of our population, and sample from that with replacement (i.e., bootstrapping) many times to create many bootstrap samples. We can then calculate point estimates for each bootstrap sample and create a bootstrap distribution of our point estimates and use this as a proxy for a sampling distribution. We can finally use this bootstrap distribution of our point estimates to suggest how we might expect our point estimate to behave if we took another sample.
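
The whole procedure described above can be sketched on a hypothetical toy sample (the normal values, the sample size of 30, and the 500 replicates are arbitrary choices for this illustration). The pattern mirrors the one used for sampling distributions, except that we resample from the sample itself with `frac=1` and `replace=True`.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed for this sketch

# Hypothetical toy sample of 30 observations; not the lab's data
toy_sample = pd.DataFrame({"value": np.random.normal(50, 10, 30)})

# 500 bootstrap samples, each the same size as the original sample
toy_boots = pd.concat([
    toy_sample.sample(frac=1, replace=True).assign(replicate=n)
    for n in range(500)
])

# One point estimate (the mean) per bootstrap sample
toy_boot_means = (
    toy_boots.groupby("replicate")
    .mean()
    .reset_index()
    .rename(columns={"value": "mean_value"})
)

print(toy_sample["value"].mean())
print(toy_boot_means["mean_value"].mean())
```

The two printed means let you see where the bootstrap distribution sits relative to the original sample's point estimate.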

**Question 1.9** 

What do 6 different bootstrap samples look like? Use the `sample` method to create a single data frame with 6 bootstrap samples drawn from the original sample we drew from the population, `one_sample`. Assign a new column called `replicate` to mark the sample number `(from 0 to 5)`. Name the data frame `boot6`.

Set the seed as `1234`.

In [None]:
np.random.seed(1234)  # DO NOT CHANGE!

# ENTER YOUR ANSWER HERE



**Question 2.0** 

Now visualize the six bootstrap sample distributions from `boot6` by using a `seaborn` `FacetGrid`. To facilitate comparing the distributions, lay the plots out in a single column. Give the plot a descriptive title and the x-axis a descriptive label. Replace the ellipses with your code.

In [None]:
# ENTER YOUR ANSWER HERE
g = sns.FacetGrid(data=boot6, row="replicate", aspect=1.5)  # This is creating 6 subplots split by replicate number
g.map(..., ..., bins=30)  # What plotting function do you want to use in each subplot (facet)?
g.set_xlabels(...)
g.set_titles(row_template="Replicate {row_name}")  # facets are split by row, so use the row template

g.fig.suptitle("Six bootstrap samples")

plt.tight_layout()
plt.show()

**Question 2.1** 

Calculate the mean of each of these 6 bootstrap samples using `groupby` and `mean`, and rename the resulting column to `mean_age`. Use `reset_index` so that the resulting data frame has two columns: `replicate` and `mean_age`. Name the data frame `boot6_means`.

In [None]:
# ENTER YOUR ANSWER HERE



**Question 2.2** 

Let's now take 1000 bootstrap samples from the original sample we drew from the population (`one_sample`). As previously, assign a new column called `replicate` to mark the sample number `(from 0 to 999)`. Name the data frame `boot1000`.

Set the seed as `1234`.

In [None]:
np.random.seed(1234)  # DO NOT CHANGE!

# ENTER YOUR ANSWER HERE



**Question 2.3** 

Calculate the mean of each of these 1000 bootstrap samples using `groupby` and `mean`, and rename the resulting column to `mean_age`. Use `reset_index` so that the resulting data frame has two columns: `replicate` and `mean_age`. Name the data frame `boot1000_means`.

In [None]:
# ENTER YOUR ANSWER HERE



**Question 2.4** 

Visualize the distribution of the bootstrap sample point estimates (`boot1000_means`) you just calculated by plotting a histogram with `bins=30`. Give the plot a descriptive title and the x-axis a descriptive label.

In [None]:
# ENTER YOUR ANSWER HERE



How does the bootstrap distribution above compare to the sampling distribution? Let's visualize them side by side:

In [None]:
# Run this cell

# Create sampling distribution from the population
np.random.seed(4321)
samples = pd.concat([
    can_seniors.sample(40).assign(replicate=n)
    for n in range(1000)
])

sample_estimates = (
    samples.groupby("replicate")
    .mean()
    .reset_index()
    .rename(columns={"age": "mean_age"})
)

In [None]:
# Visualize the sampling distribution
fig, ax = plt.subplots(2, 1, figsize=(6, 9), sharex=True)
sns.histplot(data=sample_estimates, x="mean_age", ax=ax[0])
ax[0].set_xlabel("Mean age (years)")
ax[0].set_title("Sampling distribution", fontsize=16)

sns.histplot(data=boot1000_means, x="mean_age", ax=ax[1])
ax[1].set_xlabel("Mean age (years)")
ax[1].set_title("Bootstrapped distribution", fontsize=16)

plt.tight_layout()
plt.show()


Reminder: the true population quantity we are trying to estimate, the population mean, is about 79 years. We know this because we created this population and calculated this value. In real life we wouldn't know this value.

**Question 2.5** True/False

The mean of the bootstrap distribution is the same value as the mean of the sampling distribution of the sample means. True or false?

*Assign your answer to an object called `answer2_5`. Your answer should be a boolean, i.e., `True` or `False`.*

In [None]:
# ENTER YOUR ANSWER HERE


**Question 2.6** True/False

The mean of the bootstrap distribution is not the same value as the mean of the sampling distribution because the bootstrap distribution was created from samples drawn from a single sample, whereas the sampling distribution was created from samples drawn from the population. True or false?

*Assign your answer to an object called `answer2_6`. Your answer should be a boolean, i.e., `True` or `False`.*

In [None]:
# ENTER YOUR ANSWER HERE



**Question 2.7** True/False

The shape and spread (i.e. width) of the distribution of the bootstrap sample means is a poor approximation of the shape and spread of the sampling distribution of the sample means. True or false?

*Assign your answer to an object called `answer2_7`. Your answer should be a boolean, i.e., `True` or `False`.*

In [None]:
# ENTER YOUR ANSWER HERE



**Question 2.8** True/False

In real life, where we only have one sample and cannot create a sampling distribution, the distribution of the bootstrap sample estimates (here means) can suggest how we might expect our point estimate to behave if we took another sample. True or false?

*Assign your answer to an object called `answer2_8`. Your answer should be a boolean, i.e., `True` or `False`.*

In [None]:
# ENTER YOUR ANSWER HERE



### Using the bootstrap distribution to calculate a plausible range for point estimates

Once we have created a bootstrap distribution, we can use it to suggest a plausible range where we might expect the true population quantity to lie. The formal name for one commonly used plausible range is a confidence interval. Confidence intervals can be set at different levels; a commonly used level is 95%. When we report a point estimate with a 95% confidence interval as the plausible range, formally we are saying that if we repeated this process of building confidence intervals many more times with new samples, we’d expect about 95% of them to contain the value of the population quantity.
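
The "repeat the process many times" interpretation can be checked by simulation when the population is known. The sketch below is an illustration under stated assumptions: a hypothetical exponential toy population, 200 repeated samples of size 40, and 500 bootstrap replicates each (all arbitrary). It uses NumPy's `Generator` instead of pandas `sample` only to keep the loop fast.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch

# Hypothetical toy population with a known mean; not the lab's data
toy_population = rng.exponential(10, 100_000)
true_mean = toy_population.mean()

n_trials, sample_size, n_boot = 200, 40, 500
covered = 0
for _ in range(n_trials):
    # One fresh sample from the population (without replacement)
    sample = rng.choice(toy_population, size=sample_size, replace=False)
    # Each row is one bootstrap sample drawn with replacement from `sample`
    boot_means = rng.choice(
        sample, size=(n_boot, sample_size), replace=True
    ).mean(axis=1)
    lower, upper = np.percentile(boot_means, [2.5, 97.5])
    covered += int(lower <= true_mean <= upper)

# Proportion of intervals that contain the true population mean
coverage = covered / n_trials
print(coverage)
```

The printed proportion is itself an estimate from a finite number of trials, so it need not equal 0.95 exactly.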

> How do you choose a level for a confidence interval? You have to consider the downstream application of your estimation and what the cost/consequence of an incorrect estimate would be. The higher the cost/consequence, the higher a confidence level you would want to use. 

To calculate an approximate 95% confidence interval using bootstrapping, we essentially order the values in our bootstrap distribution and then take the value at the 2.5th percentile as the lower bound of the plausible range, and the 97.5th percentile as the upper bound of the plausible range. 
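
As a small check on the terminology, the sketch below computes the same bounds two ways on a hypothetical series of estimates (`toy_estimates` is a made-up stand-in for a bootstrap distribution). pandas `quantile` takes proportions between 0 and 1, while `np.percentile` takes values between 0 and 100; with their default settings both interpolate linearly.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a bootstrap distribution: 0.1, 0.2, ..., 100.0
toy_estimates = pd.Series(np.arange(1, 1001) / 10.0)

# 0.025 and 0.975 quantiles (proportions) ...
bounds_quantile = toy_estimates.quantile([0.025, 0.975])

# ... are the 2.5th and 97.5th percentiles (percentages)
bounds_percentile = np.percentile(toy_estimates, [2.5, 97.5])

print(bounds_quantile.to_list())
print(bounds_percentile)
```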

In [None]:
# Run this cell
# Quantiles are percentiles expressed as proportions: the 0.025 quantile is the 2.5th percentile
boot1000_means["mean_age"].quantile([0.025, 0.975])

Thus, to finish our estimation of the population quantity that we are trying to estimate, we would report the point estimate and the lower and upper bounds of our confidence interval. We would say something like this:

Our sample mean age for Canadian seniors was measured to be 83.7 years, and we’re 95% "confident" that the true population mean for Canadian seniors is between 78.8 and 89.2. 

Here our 95% confidence interval does contain the true population mean for Canadian seniors, 79 years - pretty neat! However, in real life we would never be able to know this because we only have observations from a single sample, not the whole population.

**Question 2.9** True/False

For any sample we take, if we use bootstrapping to calculate the 95% confidence intervals, the true population quantity we are trying to estimate would always fall within the lower and upper bounds of the confidence interval. True or false?

*Assign your answer to an object called `answer2_9`. Your answer should be a boolean, i.e., `True` or `False`.*

In [None]:
# ENTER YOUR ANSWER HERE
