In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw07.ipynb")

# Homework 07: Confidence Intervals

**Reading**: 
* [Estimation](https://inferentialthinking.com/chapters/13/Estimation.html)

For all problems that ask you to write explanations or sentences, you **must** provide your answer in the designated space. Moreover, throughout this homework and all future ones, please do not reassign variables in the notebook! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!

**Note: This homework has hidden tests on it. Additional tests will be run once your homework is submitted for grading. While you may pass all the tests you have access to before submission, you may not earn full credit if you do not pass the hidden tests as well.**

Many of the tests you have access to before submitting only check that you have given an answer that is formatted correctly and/or that *could* make sense in context. For example, a test you have access to while completing the assignment may check that you selected a valid choice for a multiple choice problem (1, 2, or 3) or that your answer is an integer between 0 and 50 if asked to count a subset of states in the United States. The tests that are run after submission will evaluate your work for accuracy. **Do not assume that your answers are correct just because all your tests pass before submission!**

Consult with your teacher and course syllabus for information and policies regarding appropriate collaboration with other students, appropriate use of AI tools, and submission of late work.

In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## 1. Thai Restaurants


Max and Aesha are trying to see what the best Thai restaurant in Durham, NC is. They survey 1500 Durham residents selected uniformly at random, and ask each of them what Thai restaurant is the best (*Note: this data is fabricated for the purposes of this homework, but these restaurants are real if you wanted to try them out!*). 

The choices of Thai restaurant are:
* Bua Thai,
* Thai Cafe,
* Pad Thai, and
* Thai Spoon.

After compiling the results, Max and Aesha compute the following percentages from their single sample:

|Thai Restaurant  | Percent of vote|
|:------------:|:------------:|
|Bua Thai | 8% |
|Thai Cafe | 52% |
|Pad Thai | 25% |
|Thai Spoon | 15% |

These percentages show that based on this sample, Thai Cafe has the largest share of votes for best Thai restaurant in Durham, winning 52% of the votes. We will attempt to estimate the corresponding *parameter* in the population, which in this case would be the percentage of the votes that Thai Cafe would receive if every resident of the city of Durham were able to vote in the survey.

The table `votes` contains the results of the survey. Run the cell below to load the data set.

In [None]:
# Just run this cell
votes = Table.read_table('votes.csv')
votes

### Question 1.1

To start the investigation, complete the function `one_resampled_percentage` so it creates a new bootstrap sample (also known as a resample) from the `votes` data set and returns Thai Cafe's **percentage** (not proportion) of votes from that resample.

**Note:** You can assume that input to the `one_resampled_percentage` function will be a Table that is formatted the same way as `votes`.
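
As a rough illustration of what a single bootstrap resample involves, the sketch below uses plain NumPy on a small made-up array instead of the `votes` table. The array `toy_votes`, the helper name `resample_percentage`, and the seed are assumptions made for this example only; your own function should work with a Table as described above.

```python
import numpy as np

def resample_percentage(labels, target, rng):
    """Draw one bootstrap resample (same size as the original, with
    replacement) and return the percentage of entries equal to `target`."""
    resample = rng.choice(labels, size=len(labels), replace=True)
    return 100 * np.mean(resample == target)

# Hypothetical toy data: 10 votes, 6 of them for option "A".
toy_votes = np.array(["A"] * 6 + ["B"] * 4)
rng = np.random.default_rng(0)
print(resample_percentage(toy_votes, "A", rng))
```

Each call draws a different resample, so the printed percentage varies from run to run unless the random generator is seeded.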

In [None]:
def one_resampled_percentage(tbl):
    ...

# Run your function a few times and inspect the output
# to ensure it is working as you intend it to!
one_resampled_percentage(votes)

In [None]:
grader.check("q1_1")

### Question 1.2.
Complete the `percentages_from_resamples` function such that it returns an array of 2,500 bootstrapped estimates of the percentage of voters who will vote for Thai Cafe. While computing the bootstrapped estimates, `percentages_from_resamples` should assign the percentages to the array named `percentage_thai_cafe` that is created for you.

Once returned, the array of percentages will be assigned to `resampled_percentages` for use in the remainder of this assignment.
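
The general pattern of repeating a resampling step and collecting each result in an array can be sketched as follows. This uses plain NumPy and a made-up array of 0/1 outcomes; `toy_sample`, the repetition count of 500, and the seed are assumptions for illustration and are not the values this question asks for.

```python
import numpy as np

rng = np.random.default_rng(7)
toy_sample = np.array([1, 0, 1, 1, 0, 1, 0, 1])  # hypothetical 0/1 outcomes

# Start with an empty array and append one statistic per resample.
estimates = np.array([])
for _ in range(500):
    resample = rng.choice(toy_sample, size=len(toy_sample), replace=True)
    estimates = np.append(estimates, 100 * np.mean(resample))

print(len(estimates), estimates.mean())
```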

In [None]:
def percentages_from_resamples():
    percentage_thai_cafe = make_array()
    ...

resampled_percentages = percentages_from_resamples()
resampled_percentages

In [None]:
grader.check("q1_2")

Run the following cell to create a histogram to visualize the distribution of the statistic for the 2,500 bootstrap samples. Based on what the original Thai restaurant percentages were, does the graph seem reasonable? You don't need to provide a written response, but this is a good opportunity to ensure your resampling procedure makes reasonable sense based on the original sample.

In [None]:
Table().with_column('Estimated Percentage', resampled_percentages).hist("Estimated Percentage")

### Question 1.3.

Compute the values for the two endpoints of the 95% confidence interval for the percentage of votes for Thai Cafe, based on your bootstrapped estimates. Name the lower and upper ends of the interval, `thai_cafe_lower_bound` and `thai_cafe_upper_bound`, respectively.
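
For reference, the percentile method takes the middle portion of the bootstrapped estimates as the interval. The sketch below shows the idea for a 90% interval on made-up numbers using NumPy's `np.percentile`; the array `fake_estimates` is an assumption for illustration. Note that `np.percentile` interpolates between values by default, so its results can differ slightly from other percentile definitions.

```python
import numpy as np

# Hypothetical bootstrap statistics; in practice these come from resampling.
rng = np.random.default_rng(42)
fake_estimates = rng.normal(loc=30, scale=2, size=2000)

# Middle 90%: leave 5% of the estimates in each tail.
lower = np.percentile(fake_estimates, 5)
upper = np.percentile(fake_estimates, 95)
print(lower, upper)
```

A different confidence level only changes the two percentiles: the tails must always split the leftover percentage evenly.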

In [None]:
thai_cafe_lower_bound = ...
thai_cafe_upper_bound = ...

# The following code will print the results of your calculations.
# Don't change it, just run it!
print("Bootstrapped 95% confidence interval for the percentage of Thai Cafe voters in the population: [{:f}, {:f}]".format(thai_cafe_lower_bound, thai_cafe_upper_bound))

In [None]:
grader.check("q1_3")

### Question 1.4.
The survey results seem to indicate that Thai Cafe is beating all the other Thai restaurants in Durham combined. We would like to use confidence intervals to determine a range of likely values for Thai Cafe's true lead over all the other restaurants combined. To compute Thai Cafe's lead over Bua Thai, Pad Thai, and Thai Spoon combined, you could use the following formula:

$$\text{Thai Cafe's lead} = \text{Thai Cafe's \% of the vote} - ( 100 - \text{Thai Cafe's \% of the vote} )$$
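
To see the formula in action, here is a small sketch that applies it to a couple of percentages. The helper name `lead_over_rest` is hypothetical and is not part of the assignment.

```python
def lead_over_rest(percent):
    """Lead of one option over all the others combined, in percentage points."""
    return percent - (100 - percent)

print(lead_over_rest(52))  # 52 - 48 = 4
print(lead_over_rest(49))  # 49 - 51 = -2
```

A percentage below 50 produces a negative lead, and exactly 50 produces a lead of 0.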

Complete the function `one_resampled_difference` that computes and returns **exactly one estimate** of Thai Cafe's percentage lead over Bua Thai, Pad Thai, and Thai Spoon combined from a single bootstrapped sample.

**Note:** You can assume that input to the `one_resampled_difference` function will be a Table that is formatted the same way as `votes`.

In [None]:
def one_resampled_difference(tbl):
    ...

# Run your function a few times and inspect the output
# to ensure it is working as you intend it to!
one_resampled_difference(votes)

In [None]:
grader.check("q1_4")

<!-- BEGIN QUESTION -->

### Question 1.5.
Write a function called `leads_from_resamples` that finds 2,500 bootstrapped estimates of Thai Cafe's lead over Bua Thai, Pad Thai, and Thai Spoon combined. The provided code will plot a histogram of the distribution of the resulting estimates from the resamples.

**Note:** Thai Cafe's lead can be negative if the resample results in Thai Cafe receiving less than 50 percent of the votes. That's okay!

In [None]:
def leads_from_resamples():
    ...

sampled_leads = leads_from_resamples()
Table().with_column('Estimated Lead', sampled_leads).hist("Estimated Lead")

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 1.6.
Using only the histogram that the code below generates, create an *approximate* 95% confidence interval for the estimated lead that Thai Cafe would have over the other restaurants. Since you should not compute exact values for the ends of the confidence interval, choose integer endpoints that are reasonable and explain how you came up with the endpoints using the histogram.

**Note**: The bins of the histogram are all 1 unit wide.

In [None]:
Table().with_column('Estimated Lead', sampled_leads).hist("Estimated Lead", bins=np.arange(-6,15,1))

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 2. Interpreting Confidence Intervals


When generating the solutions for this assignment, the teaching team computed the following 95% confidence interval for the percentage of voters who voted Thai Cafe as the best in Durham: 

$$[49.40, 54.47]$$

Your answer is almost certainly a bit different; that doesn't mean it was wrong! Don't forget that resampling involves ✨randomness✨!

Use this provided confidence interval, **not your confidence interval** from earlier, to answer the following questions.

<!-- BEGIN QUESTION -->

### Question 2.1:
Can we say there is a 95% probability that the interval [49.40, 54.47] contains the true percentage of the population that votes for Thai Cafe as the best Thai restaurant in Durham? Answer "yes" or "no" and explain your reasoning. 

*Note:* ambiguous answers using language like "sometimes" or "maybe" will not receive credit.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.2:
The teaching staff also created 70%, 90%, and 99% confidence intervals from the same sample, but we forgot to label which confidence interval represented which percentages! Match each confidence level (70%, 90%, 99%) with its corresponding interval in the cell below (e.g. __ % CI: [49.87, 54.0] $\rightarrow$ replace the blank with one of the three confidence levels). Please list them in order: 70% CI, then 90% CI, then 99% CI. Your teacher thanks you for making this a bit easier to grade! **Then**, explain your thought process and how you came up with your answers.

The intervals are below:

* [49.87, 54.00]
* [50.67, 53.27]
* [48.80, 55.40]


_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Question 2.3.
Suppose you are so passionate about the Thai food scene in Durham that you applied for and received a grant that allowed you to pay to create 5,000 new *real* samples from the population, not bootstrapped samples. Each new sample is collected the same way the original sample was: as a uniform random sample of 1,500 Durham residents. 

If you created a 95% confidence interval for each of the 5,000 new samples, roughly how many of those 5,000 intervals would you expect to actually contain the true percentage of the population that would vote for Thai Cafe? Assign your answer to `true_percentage_intervals`.

**Note**: Don't forget, there's no way for us to actually know the true value of the population parameter we're trying to estimate, so we won't know which of the confidence intervals actually contain it or not, but you should know the expected number of intervals that will and won't contain the parameter.

In [None]:
true_percentage_intervals = ...

In [None]:
grader.check("q2_3")

### Using confidence intervals to test hypotheses

Recall the second bootstrap confidence interval you estimated earlier, which estimated Thai Cafe's lead over Bua Thai, Pad Thai, and Thai Spoon combined. Among voters in the sample, Thai Cafe's lead was 4%. Suppose the teaching staff's 95% confidence interval for the true lead (in the population of all voters) was:

$$[-0.80, 8.80]$$

Suppose we are interested in testing a simple yes-or-no question:

> "Is the percentage of votes for Thai Cafe tied with the percentage of votes for Bua Thai, Pad Thai, and Thai Spoon combined?"

Our null hypothesis is that the percentages are equal, or equivalently, that Thai Cafe's lead is exactly 0. Our alternative hypothesis is that Thai Cafe's lead is not equal to 0.  

In the remaining questions below, don't compute any confidence intervals yourself - use only the provided 95% confidence interval of $[-0.80, 8.80]$.
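
As a reminder of the general idea from the textbook, a confidence interval can stand in for a two-sided test when the confidence level and the P-value cutoff add up to 100% (for example, a 95% interval with a 5% cutoff). The sketch below shows that rule on made-up intervals; the function name `decide_at_matching_cutoff` and the intervals are assumptions for illustration, and the rule says nothing on its own about cutoffs that do not match the interval's level.

```python
def decide_at_matching_cutoff(null_value, ci_lower, ci_upper):
    """Two-sided test using a confidence interval whose level matches
    the cutoff (e.g. a 95% interval with a 5% cutoff)."""
    if ci_lower <= null_value <= ci_upper:
        return "fail to reject the null"
    return "reject the null"

# Hypothetical intervals, not the staff interval above.
print(decide_at_matching_cutoff(0, 1.2, 6.5))   # 0 is outside the interval
print(decide_at_matching_cutoff(0, -2.0, 3.0))  # 0 is inside the interval
```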

### Question 2.4 

Say we use a 5% P-value cutoff.  Do we reject the null, fail to reject the null, or are we unable to tell using the provided confidence interval?

Assign `restaurants_tied` to the number corresponding to the correct answer.

1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using our staff confidence interval

*Hint:* If you're confused, take a look at [this chapter](https://inferentialthinking.com/chapters/13/4/Using_Confidence_Intervals.html) of the textbook.


In [None]:
restaurants_tied = ...

In [None]:
grader.check("q2_4")

### Question 2.5
What if, instead, we use a P-value cutoff of 1%? Do we reject the null, fail to reject the null, or are we unable to tell using the provided confidence interval?

Assign `cutoff_one_percent` to the number corresponding to the correct answer.

1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using our staff confidence interval


In [None]:
cutoff_one_percent = ...

In [None]:
grader.check("q2_5")

### Question 2.6
What if we use a P-value cutoff of 10%? Do we reject, fail to reject, or are we unable to tell using the provided confidence interval?

Assign `cutoff_ten_percent` to the number corresponding to the correct answer.

1. Reject the null / Data is consistent with the alternative hypothesis
2. Fail to reject the null / Data is consistent with the null hypothesis
3. Unable to tell using our staff confidence interval


In [None]:
cutoff_ten_percent = ...

In [None]:
grader.check("q2_6")

# Submitting your work
You're done with this assignment! Assignments should be turned in using the following best practices:
1. Save your notebook.
2. Restart the kernel and run all cells up to this one.
3. Run the cell below with the code `grader.export(...)`. This will re-run all the tests. Make sure they are passing as you expect them to.
4. Download the file named `hw07_<date-time-stamp>.zip`, found in the explorer pane on the left side of the screen. **Note**: Clicking on the link in this notebook may result in an error, so it's best to download from the file explorer panel.
5. Upload `hw07_<date-time-stamp>.zip` to the corresponding assignment on Canvas.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit.

In [None]:
grader.export(pdf=False, force_save=True)