In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw08.ipynb")

# Homework 8: Confidence Intervals

**Helpful Resource:**

- [Python Reference](http://data8.org/sp22/python-reference.html): Cheat sheet of helpful array & table methods!

**Recommended Reading**: 

* [Estimation](https://www.inferentialthinking.com/chapters/13/Estimation)

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load the provided tests. Each time you start your server, you will need to execute this cell again to load the tests.

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged.

For all problems that ask you to write explanations and sentences, you **must** provide your answer in the designated space. Moreover, throughout this homework and all future ones, please do not reassign variables anywhere in the notebook! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on.

In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

## 1. Thai Restaurants in Santa Barbara

Daniela and Emily want to find out which Thai restaurant in Santa Barbara SBCC students prefer. They survey 1500 SBCC students selected uniformly at random and ask each student which Thai restaurant is the best. The choices are Zen Yai, Galanga, Thai Bistro, and Your Place. (*Note: this data is fabricated for the purposes of this homework; all 4 restaurants are great choices.*) After compiling the results, Daniela and Emily release the following percentages from their sample:

|Thai Restaurant  | Percentage|
|:------------:|:------------:|
|Zen Yai | 8% |
|Galanga | 52% |
|Thai Bistro | 25% |
|Your Place | 15% |

These percentages come from a uniform random sample of the population of SBCC students. We will attempt to estimate the corresponding *parameters*, that is, the percentage of the votes that each restaurant would receive from the entire population (all SBCC students). We will use confidence intervals to compute a range of values that reflects the uncertainty of our estimates.

The table `votes` contains the results of the survey (*this data is fabricated for the purposes of this HW*).

In [None]:
# Just run this cell
votes = Table.read_table('votes.csv')
votes

**Question 1.1.** Complete the function `one_resampled_percentage` below. It should return Galanga's ***percentage*** of votes after taking the original table (`tbl`) and performing one bootstrap sample of it. As a reminder, a percentage is between 0 and 100.

*Note:* `tbl` will always be in the same format as `votes`.

*Hint:* Given a table of votes, how can you figure out what percentage of the votes are for a certain restaurant? **Be sure to use percentages, not proportions, for this question!**
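
As a minimal sketch of the idea on toy data (not the `votes` table): the array `answers`, the helper `percent_of`, and the seed below are all assumptions made for illustration, and NumPy's `Generator.choice` stands in for a table's `sample` method. A bootstrap resample draws as many items as the original sample, with replacement, and the statistic is then recomputed on the resample.

```python
import numpy as np

# Toy data (hypothetical): 10 survey answers, 6 of them "yes".
answers = np.array(["yes"] * 6 + ["no"] * 4)

def percent_of(values, label):
    """Return the percentage (0 to 100) of entries in `values` equal to `label`."""
    return 100 * np.count_nonzero(values == label) / len(values)

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# One bootstrap resample: same size as the original, drawn with replacement.
resample = rng.choice(answers, size=len(answers), replace=True)

original_pct = percent_of(answers, "yes")
resampled_pct = percent_of(resample, "yes")
print(original_pct)   # 60.0
print(resampled_pct)  # depends on the seed; always between 0 and 100
```

Note the factor of 100: without it the function would return a proportion between 0 and 1 rather than a percentage.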


In [None]:
def one_resampled_percentage(tbl):
    ...

one_resampled_percentage(votes)

In [None]:
grader.check("q1_1")

**Question 1.2.** Complete the `percentages_in_resamples` function such that it simulates and returns an array of 2022 elements, where each element represents a bootstrapped estimate of the percentage of voters who will vote for Galanga. You should use the `one_resampled_percentage` function you wrote above.
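
The general pattern of repeating a one-value simulation and collecting the results can be sketched as follows. Everything here is hypothetical and unrelated to the survey: `one_simulated_mean` is a stand-in statistic (the mean of 50 die rolls), and `np.append` plays the role of growing an array one element at a time.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, chosen arbitrarily

def one_simulated_mean():
    """Hypothetical statistic: the mean of 50 fair-die rolls."""
    return rng.integers(1, 7, size=50).mean()

def simulate_means(repetitions):
    """Call the one-value function repeatedly and collect the results."""
    results = np.array([])
    for _ in range(repetitions):
        results = np.append(results, one_simulated_mean())
    return results  # without this line the function would return None

means = simulate_means(200)
print(len(means))  # 200
```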


In [None]:
def percentages_in_resamples():
    percentage_galanga = make_array()
    ...

In [None]:
grader.check("q1_2")

In the following cell, we run the function you just defined, `percentages_in_resamples`, and create a histogram of the calculated statistic for the 2022 bootstrap estimates of the percentage of voters who voted for Galanga. Based on what the original Thai restaurant percentages were, does the graph seem reasonable? Talk to a friend, ask your instructor or ask a tutor if you are unsure!


In [None]:
resampled_percentages = percentages_in_resamples()
Table().with_column('Estimated Percentage', resampled_percentages).hist("Estimated Percentage")

**Question 1.3.** Using the array `resampled_percentages`, find the values at the two edges of the middle 95% of the bootstrapped percentage estimates. Compute the lower and upper ends of the interval, named `galanga_lower_bound` and `galanga_upper_bound` respectively.

*Hint:* If you are stuck on this question, try looking over [Chapter 13](https://inferentialthinking.com/chapters/13/Estimation.html) of the textbook.
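
To illustrate what "the middle 95%" means, here is a small sketch on a made-up array of estimates. It uses NumPy's `np.percentile`, which interpolates between values; the `percentile` function from the `datascience` library always returns an element of the array, so its results can differ slightly. The array `estimates` is an assumption for illustration only.

```python
import numpy as np

# Hypothetical bootstrap estimates: the integers 1 through 1000, shuffled.
rng = np.random.default_rng(2)
estimates = rng.permutation(np.arange(1, 1001))

# The middle 95% leaves 2.5% of the estimates in each tail.
lower = np.percentile(estimates, 2.5)
upper = np.percentile(estimates, 97.5)

inside = np.count_nonzero((estimates >= lower) & (estimates <= upper))
print(inside / len(estimates))  # 0.95
```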


In [None]:
galanga_lower_bound = ...
galanga_upper_bound = ...
print(f"Bootstrapped 95% confidence interval for the percentage of Galanga voters in the population: [{galanga_lower_bound:.2f}, {galanga_upper_bound:.2f}]")

In [None]:
grader.check("q1_3")

**Question 1.4.** The survey results seem to indicate that Galanga is beating all the other Thai restaurants among the voters. We would like to use confidence intervals to determine a range of likely values for Galanga's true lead over all the other restaurants combined. The calculation for Galanga's lead over Zen Yai, Thai Bistro, and Your Place combined is:

$$\text{Galanga's percent of the vote} - (\text{100 percent} - \text{Galanga's percent of the vote})$$
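
As a quick arithmetic check of this formula with made-up percentages (the helper name `lead_from_percentage` is hypothetical), note that the expression simplifies to twice the percentage minus 100:

```python
def lead_from_percentage(percentage):
    """Lead of one option over all others combined, given its percentage (0 to 100)."""
    return percentage - (100 - percentage)

print(lead_from_percentage(60))  # 20
print(lead_from_percentage(50))  # 0
print(lead_from_percentage(45))  # -10
```

A percentage below 50 gives a negative lead, which is why the lead can be negative.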

Define the function `one_resampled_difference` that returns **exactly one value** of Galanga's percentage lead over Zen Yai, Thai Bistro, and Your Place combined from one bootstrap sample of `tbl`.

*Hint 1:* Galanga's lead can be negative.

*Hint 2:* Given a table of votes, how can you figure out what percentage of the votes are for a certain restaurant? **Be sure to use percentages, not proportions, for this question!**

*Note:* If the skeleton code provided within the function is not helpful for you, feel free to approach the question using your own variables.


In [None]:
def one_resampled_difference(tbl):
    bootstrap = ...
    galanga_percentage = ...
    ...

In [None]:
grader.check("q1_4")

<!-- BEGIN QUESTION -->

**Question 1.5.** Write a function called `leads_in_resamples` that returns an array of 2022 elements representing the bootstrapped estimates (the result of calling `one_resampled_difference`) of Galanga's lead over Zen Yai, Thai Bistro, and Your Place combined. Afterwards, run the cell to plot a histogram of the resulting samples. **Note:** Galanga's lead can be negative.

*Hint:* If you see an error involving `NoneType`, consider what components a function needs to have!


In [None]:
def leads_in_resamples():
    ...

sampled_leads = leads_in_resamples()
Table().with_column('Estimated Lead', sampled_leads).hist("Estimated Lead")

<!-- END QUESTION -->

**Question 1.6.** Use the simulated data in `sampled_leads` from Question 1.5 to compute an approximate 95% confidence interval for Galanga's true lead over Zen Yai, Thai Bistro, and Your Place combined.


In [None]:
diff_lower_bound = ...
diff_upper_bound = ...
print("Bootstrapped 95% confidence interval for Galanga's true lead over Zen Yai, Thai Bistro, and Your Place combined: [{:f}%, {:f}%]".format(diff_lower_bound, diff_upper_bound))

In [None]:
grader.check("q1_6")

## 2. Interpreting Confidence Intervals 

The staff computed the following 95% confidence interval for the percentage of Galanga voters: 

$$[50.53, 55.53]$$

(Your answer may have been a bit different due to randomness; that doesn't mean it was wrong!)

<!-- BEGIN QUESTION -->

**Question 2.1.** The staff also created 70%, 90%, and 99% confidence intervals from the same sample, but forgot to label which interval corresponds to which confidence level! **First**, match each confidence level (70%, 90%, 99%) with its corresponding interval in the cell below (e.g. __ % CI: [52.1, 54] $\rightarrow$ replace the blank with one of the three confidence levels). **Then**, explain your thought process and how you came up with your answers.

The intervals are below:

* [50.03, 55.94]
* [52.1, 54]
* [50.97, 54.99]



_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 2.2.** Suppose we produced 6,000 new samples (each one a new/distinct uniform random sample of 1,500 students) from the population and created a 95% confidence interval from each one. Roughly how many of those 6,000 intervals do you expect will actually contain the true percentage of the population? 

Assign your answer to `true_percentage_intervals`.
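
What a confidence level promises can be checked by simulation when the true parameter is known. The sketch below is only an illustration under made-up settings (a population percentage of 40, samples of 300, 200 repetitions, 90% intervals), none of which come from this homework. It draws many fresh samples, builds a bootstrap percentile interval from each, and counts how often the interval contains the true percentage.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed, chosen arbitrarily

true_pct = 40        # hypothetical population percentage
sample_size = 300    # hypothetical sample size
num_samples = 200    # number of fresh samples, one interval each
num_resamples = 400  # bootstrap resamples per interval
level = 90           # confidence level, in percent

tail = (100 - level) / 2
hits = 0
for _ in range(num_samples):
    # Draw a fresh sample of True/False votes from the population.
    sample = rng.random(sample_size) < true_pct / 100
    # Bootstrap: resample with replacement and record each resampled percentage.
    resamples = rng.choice(sample, size=(num_resamples, sample_size), replace=True)
    resampled_pcts = 100 * resamples.mean(axis=1)
    lower, upper = np.percentile(resampled_pcts, [tail, 100 - tail])
    if lower <= true_pct <= upper:
        hits += 1

coverage = hits / num_samples
print(coverage)  # expected to land near 0.90; the exact value depends on the seed
```

The fraction of intervals that capture the true value should be close to the confidence level, though it will not match it exactly in any single run.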


In [None]:
true_percentage_intervals = ...

In [None]:
grader.check("q2_2")

Recall the second bootstrap confidence interval you created, which estimated Galanga's lead over Zen Yai, Thai Bistro, and Your Place combined. Among voters in the sample, Galanga's lead was 6%. The staff's 95% confidence interval for the true lead (in the population of all voters) was:

$$[1.2, 11.2]$$

Suppose we are interested in testing a simple yes-or-no question:

> "Is the percentage of votes for Galanga equal to the percentage of votes for Zen Yai, Thai Bistro, and Your Place combined?"

Our null hypothesis is that the percentages are equal, or equivalently, that Galanga's lead is exactly 0. Our alternative hypothesis is that Galanga's lead is not equal to 0.  In the questions below, don't compute any confidence interval yourself—use only the staff's 95% confidence interval.

**Question 2.3.** Say we use a 5% p-value cutoff. Do we reject the null, fail to reject the null, or are we unable to tell using the staff's confidence interval? 

Assign `cutoff_five_percent` to the number corresponding to the correct answer.

1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using our staff confidence interval

*Hint:* Consider the relationship between the p-value cutoff and confidence. If you're confused, take a look at [this chapter](https://inferentialthinking.com/chapters/13/4/Using_Confidence_Intervals.html) of the textbook.
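
The connection between a confidence interval and a test can be sketched with a tiny helper. The function `reject_null` and the intervals below are hypothetical and are not the staff's interval; the matching p-value cutoff is 100 minus the confidence level, so a 95% interval pairs with a 5% cutoff.

```python
def reject_null(lower, upper, null_value):
    """Reject when the null value lies outside the confidence interval.

    With a 95% interval this corresponds to a 5% p-value cutoff.
    """
    return not (lower <= null_value <= upper)

# Hypothetical 95% intervals for a difference; the null value is 0.
print(reject_null(-3.0, 4.5, 0))  # False: 0 is inside, so fail to reject
print(reject_null(2.0, 9.0, 0))   # True: 0 is outside, so reject
```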


In [None]:
cutoff_five_percent = ...

In [None]:
grader.check("q2_3")

**Question 2.4.** What if, instead, we use a p-value cutoff of 1%? Do we reject the null, fail to reject the null, or are we unable to tell using our staff confidence interval? 

Assign `cutoff_one_percent` to the number corresponding to the correct answer.

1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using our staff confidence interval


In [None]:
cutoff_one_percent = ...

In [None]:
grader.check("q2_4")

**Question 2.5.** What if we use a p-value cutoff of 10%? Do we reject the null, fail to reject the null, or are we unable to tell using our staff confidence interval?

Assign `cutoff_ten_percent` to the number corresponding to the correct answer.

1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using our staff confidence interval


In [None]:
cutoff_ten_percent = ...

In [None]:
grader.check("q2_5")

## Congratulations! You're done with Homework 8! 

Be sure to run the tests and verify that they all pass, then choose Download as PDF from the File menu and submit the .pdf file on Canvas.