In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw06.ipynb")

# Homework 06: Testing Hypotheses

**Reading**: 
* [Testing Hypotheses](https://inferentialthinking.com/chapters/11/Testing_Hypotheses.html)

For all problems that you must write explanations and sentences for, you **must** provide your answer in the designated space. Moreover, throughout this homework and all future ones, please be sure to not re-assign variables throughout the notebook! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!

**Note: This homework has hidden tests on it. Additional tests will be run once your homework is submitted for grading. While you may pass all the tests you have access to before submission, you may not earn full credit if you do not pass the hidden tests as well.**

Many of the tests you have access to before submitting only check that your answer is formatted correctly and/or that it *could* make sense in context. For example, a test you have access to while completing the assignment may check that you selected a valid choice for a multiple choice problem (1, 2, or 3) or that your answer is an integer between 0 and 50 if asked to count a subset of states in the United States. The tests that are run after submission will evaluate your work for accuracy. **Do not assume that your answers are correct just because all your tests pass before submission!**

Consult with your teacher and course syllabus for information and policies regarding appropriate collaboration with other students, appropriate use of AI tools, and submission of late work.

In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## Part 1. Spam Calls

Yanay gets a lot of spam calls. An area code is defined to be a three-digit number from 200-999 inclusive. In reality, many of these area codes are not in use, but for this question we'll simplify things and assume they all are. **Throughout these questions, you should assume that Yanay's area code is 781.**

### Question 1.1

**Assuming each area code is just as likely as any other**, what is the probability that two back-to-back spam calls both have the area code 781?

Assign this probability as a *proportion* between 0 and 1 to `prob_781`, formatted as either an exact fraction or an equivalent arithmetic expression. For example, your assignment statement could be formatted as either `prob_781 = 1/9` or `prob_781 = (1/3) ** 2`, but should not be written as `prob_781 = 0.33`, since that value is not exact.

In [None]:
prob_781 = ...
prob_781

In [None]:
grader.check("q1_1")

### Question 1.2.

Rohan already knows that Yanay's area code is 781. Rohan randomly guesses the last 7 digits (0-9 inclusive) of Yanay's phone number. What's the probability that Rohan correctly guesses Yanay's number, assuming he’s equally likely to choose any digit? *Note: A US phone number contains a 3-digit area code followed by 7 additional digits, i.e. xxx-xxx-xxxx*

Assign this probability as a *proportion* between 0 and 1 to `prob_yanay_num`, formatted as either an exact fraction or an equivalent arithmetic expression.

In [None]:
prob_yanay_num = ...
prob_yanay_num

In [None]:
grader.check("q1_2")

### Suspicious Spam

Yanay suspects that there's a higher chance that the spammers are using his area code (781) to trick him into thinking it's someone from his area calling him. Ashley thinks that this is not the case, and that spammers are just choosing area codes of the spam calls at random from all possible area codes (*Remember, for this question we’re assuming the possible area codes are 200-999, inclusive*). Yanay wants to test his claim using the 50 spam calls he received in the past month.

Run the cell below to load a dataset that contains the area codes of the 50 spam calls he received last month.

In [None]:
# Just run this cell
spam = Table().read_table('spam.csv')
spam

<!-- BEGIN QUESTION -->

### Question 1.3.

Define the null hypothesis and alternative hypothesis for this investigation. Your null hypothesis should fully describe a probability model that you can use as part of a simulation later.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 1.4.

Yanay would like to use the number of times you see the area code 781 in 50 spam calls as the statistic to test his hypothesis. List at least one additional reasonable choice for a statistic that could be used to test the hypothesis.

*Hint*: For a refresher on choosing test statistics, check out the textbook section on [Test Statistics](https://inferentialthinking.com/chapters/11/3/Decisions_and_Uncertainty.html).

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Question 1.5.

Suppose you decide to use **the number of times you see the area code 781 in 50 spam calls** as your test statistic. Calculate the observed value of this statistic based on 50 actual spam calls that Yanay received. Assign this value to `observed_val`.

In [None]:
observed_val = ...

In [None]:
grader.check("q1_5")

### Question 1.6

Write a function called `simulate` that generates exactly one simulated value of your test statistic. The function should take no arguments and simulate receiving 50 spam calls by creating an array of 50 area codes under the assumptions of the null hypothesis. The function should return the number of times the 781 area code was listed in the 50 random spam calls.
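As a loose analogy only (not the solution to this question), here is a minimal sketch of a function that simulates one value of a count statistic under a model where every outcome is equally likely. The die scenario and the name `simulate_sixes` are hypothetical and chosen purely for illustration.

```python
import numpy as np

def simulate_sixes(num_rolls=20):
    """Return the number of sixes in `num_rolls` rolls of a fair die (hypothetical example)."""
    faces = np.arange(1, 7)                     # the six equally likely outcomes
    rolls = np.random.choice(faces, num_rolls)  # sample with replacement, uniform by default
    return np.count_nonzero(rolls == 6)         # the statistic: how many rolls were a six

example_stat = simulate_sixes()
```

Each call draws a fresh random sample, so repeated calls will usually return different counts.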

In [None]:
def simulate():
    possible_area_codes = np.arange(200,1000)
    ...
    
# Call your function a few times to make sure it works and
# observe what types of outputs it produces

simulate()

In [None]:
grader.check("q1_6")

### Question 1.7.

Generate 20,000 simulations of receiving 50 random spam calls under the assumptions of the null hypothesis, recording the number of times the 781 area code appears in the 50 calls for each simulation. Assign `test_statistics_under_null` to an array that contains the test statistics for these trials. 

*Hint*: Use the function you defined in the previous question to generate the statistics.
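One common pattern for collecting many simulated statistics is to start with an empty array and append one value per trial. The sketch below applies it to a made-up coin-flip statistic; `one_heads_count`, `num_trials`, and `heads_counts` are hypothetical names, and the trial count is deliberately not the one this question asks for.

```python
import numpy as np

def one_heads_count():
    """Hypothetical statistic: number of heads in 10 fair coin flips."""
    flips = np.random.choice(np.array(['H', 'T']), 10)
    return np.count_nonzero(flips == 'H')

num_trials = 1000            # illustrative value only
heads_counts = np.array([])  # start empty and grow by one value per trial

for _ in np.arange(num_trials):
    heads_counts = np.append(heads_counts, one_heads_count())
```

After the loop, `heads_counts` holds one simulated statistic per trial and can be passed to a histogram or used to compute a p-value.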

In [None]:
test_statistics_under_null = ...
repetitions = ...

...
    
test_statistics_under_null

In [None]:
grader.check("q1_7")

<!-- BEGIN QUESTION -->

### Question 1.8.

Using the results from the simulations, generate a histogram of the empirical distribution of the number of times you saw the area code 781 appear out of 50 spam calls. **NOTE: Use the provided bins when making the histogram**

In [None]:
our_bins = np.arange(0, 5, 1) # Use these provided bins
...

<!-- END QUESTION -->

### Question 1.9.

Yanay is ready to make a decision about the null and alternative hypotheses based on the results of the simulation. Compute an empirical p-value using the observed statistic value and array of statistics from the simulation to help Yanay decide between the models.
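As a reminder of the general idea, an empirical p-value is the proportion of simulated statistics that are at least as extreme as the observed one, in the direction of the alternative hypothesis. The sketch below uses made-up numbers; `simulated_stats`, `observed_stat`, and `example_p_value` are hypothetical and are not related to the spam data.

```python
import numpy as np

# Hypothetical simulated statistics and observed value, for illustration only.
simulated_stats = np.array([0, 1, 0, 2, 1, 0, 3, 1, 0, 1])
observed_stat = 2

# In this made-up example the alternative says "larger than chance predicts",
# so "at least as extreme" means greater than or equal to the observed value.
example_p_value = np.count_nonzero(simulated_stats >= observed_stat) / len(simulated_stats)
# 2 of the 10 simulated values are >= 2, so example_p_value is 0.2
```

Which direction counts as "extreme" depends on the alternative hypothesis, so decide that before writing the comparison.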

In [None]:
p_value = ...
p_value

In [None]:
grader.check("q1_9")

<!-- BEGIN QUESTION -->

### Question 1.10.

Suppose you use a relatively strict p-value cutoff of 1%. What should Yanay conclude from the hypothesis test? Why?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Part 2: Multiple area codes

Yanay now thinks that the spammers may be using an even more complex scheme than he originally thought. He now believes that instead of trying to match his area code, the spammers are tracking his phone and using the area codes of places he has visited recently. Yanay decides to check whether the area codes from spam calls match the area code of one of the 8 places he's been to recently. He wants to test whether a spam call is more likely to have an area code from one of those 8 places than chance alone would suggest.

These are the area codes of the places he's been to recently: 781, 617, 509, 510, 212, 858, 339, 626.

<!-- BEGIN QUESTION -->

### Question 2.1

Define the null hypothesis and alternative hypothesis for this investigation.

*Reminder: Don’t forget that your null hypothesis should fully describe a probability model that we can use for simulation later.*


_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Question 2.2

Suppose Yanay selects **the count of how many of 50 spam calls came from area codes that were "recently visited"** as the test statistic. Compute the observed value of this statistic from the 50 actual spam calls that Yanay received (remember, these are already stored in the `spam` Table). Assign the test statistic to `visited_observed_value`.

**Hint:** A `Table` predicate should make it easy to create a Table from `spam` that only contains the rows with the area codes that Yanay visited. Check your Python reference sheet!

In [None]:
visited_area_codes = make_array(781, 617, 509, 510, 212, 858, 339, 626)
visited_observed_value = ...
visited_observed_value

In [None]:
grader.check("q2_2")

### Question 2.3
Write a function called `simulate_visited_area_codes` that generates exactly one simulated value of your test statistic under the null hypothesis. The function should take no arguments and simulate receiving 50 spam calls under the assumption that each call's area code is sampled from the range 200-999 with equal probability. To match Yanay's hypothesis, your simulation should assume there are 8 area codes that were "recently visited" and the remaining area codes have not been "recently visited" when computing the number of spam calls in each category (recently visited, not recently visited).

The function should return the test statistic decided upon earlier in this question: **the count of how many of 50 spam calls came from area codes that were "recently visited"**.

*Hint*: You may find [the textbook section](https://inferentialthinking.com/chapters/11/1/Assessing_a_Model.html#the-prediction-under-the-model-of-random-selection) on the `sample_proportions` function to be useful.
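To illustrate the kind of sampling the hint refers to, here is a minimal NumPy sketch that draws a random sample from a categorical model and reports the proportion in each category. It assumes that `sample_proportions(sample_size, probabilities)` behaves like the multinomial draw shown here; that equivalence is an assumption, and the sample size and model proportions are hypothetical values unrelated to this question.

```python
import numpy as np

sample_size = 40                            # hypothetical sample size
model_proportions = np.array([0.25, 0.75])  # hypothetical two-category model

# Draw category counts from the model, then convert them to proportions.
counts = np.random.multinomial(sample_size, model_proportions)
simulated_proportions = counts / sample_size

# Multiplying a proportion by the sample size recovers the count in that category.
count_in_first_category = simulated_proportions.item(0) * sample_size
```

Because the result is a set of proportions, a count statistic requires the final multiplication by the sample size.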

In [None]:
def simulate_visited_area_codes():
    ...
    
# Call your function a few times to make sure it works and to 
# get a feel for the possible outputs it may generate
simulate_visited_area_codes()

In [None]:
grader.check("q2_3")

### Question 2.4.

Generate 20,000 simulated values of the number of times you see any of the area codes of the places Yanay has been to in 50 random spam calls. Assign `visited_test_statistics_under_null` to an array that stores the result of each of these trials.

**Hint**: Use the function you defined in Question 2.3.


In [None]:
visited_test_statistics_under_null = ...
repetitions = ...

...
visited_test_statistics_under_null

In [None]:
grader.check("q2_4")

<!-- BEGIN QUESTION -->

### Question 2.5.

Using the simulated statistics assigned to `visited_test_statistics_under_null`, generate a histogram of the empirical distribution of the number of times you saw any of the visited area codes in 50 spam phone calls under the assumptions of the null hypothesis.

**NOTE: Use the provided bins when making the histogram**

In [None]:
bins_visited = np.arange(0,6,1) # Use these provided bins
...

<!-- END QUESTION -->

### Question 2.6

Yanay is ready to make a decision about the hypotheses based on the results of the simulation. Compute an empirical p-value using the observed value of the statistic and the array of statistics from the simulation to help Yanay decide between the models.

In [None]:
p_value = ...
p_value

In [None]:
grader.check("q2_6")

<!-- BEGIN QUESTION -->

### Question 2.7.

Suppose Yanay had decided to use a p-value cutoff of 1 percent. What can he conclude, if anything, about the null hypothesis?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2.8

Describe what the empirical p-value assigned to `p_value` represents in the context of this experiment. Rather than giving only a generic definition of a p-value, use as many specifics of this problem as you can in your response.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Question 2.9.

When you reject the null hypothesis, you are claiming that the observed statistic is different enough from what you would expect to see from chance alone that you believe some other factor is responsible. In this particular case, you are accusing the spam callers of favoring the area codes that a person has visited, which means they are tracking the person. This is a serious accusation! Sometimes, you'll make this claim incorrectly.

Suppose you run this same test for 200 different people, each with a p-value cutoff of 1 percent. If the spam callers were **not** actually favoring area codes that people have visited (the null hypothesis is true), in how many of the 200 tests would spam callers be incorrectly accused of favoring area codes that people have visited? Assign your answer as a numerical value to `incorrectly_accused`. You can assign an arithmetic expression that represents the answer (e.g. `incorrectly_accused = 0.5 * 10` instead of `incorrectly_accused = 5`) as long as it evaluates to the correct value.

In [None]:
incorrectly_accused = ...

In [None]:
grader.check("q2_9")

## Part 3: Do you pick up the call?

Yanay now wants to determine how effective these spam techniques actually are. He goes through the 50 spam calls he received and records whether or not he picked up the call because he believed it was a legitimate phone call.

The Table `spam_with_labels` contains:
* a column labeled `Area Code Visited`, which contains either the string `"Yes"` or the string `"No"` to indicate whether the spam call used an area code from a place Yanay has recently visited.
* a column labeled `Picked Up`, which contains either the integer `1` if Yanay picked up or the integer `0` if he did not pick up.

In [None]:
# Just run this cell
spam_with_labels = Table().read_table("spam_picked_up.csv")
spam_with_labels

Yanay suspects that he is more likely to pick up phone calls from area codes that he's recently visited.

His **null hypothesis** is that there is no difference in the distribution of calls he picked up between area codes he has visited and area codes he has not visited, with any observed differences due to chance. 

His **alternative hypothesis** is that there is a difference between the two categories, specifically that he thinks that he is more likely to pick up if he has visited the area code. 

Conduct an [A/B Test](https://inferentialthinking.com/chapters/12/1/AB_Testing.html#permutation-test) to test his hypothesis. Use the difference in proportion of calls picked up between the area codes Yanay visited and the area codes he did not visit as the test statistic.

### Question 3.1.
Write a function named `difference_in_proportion` to calculate the test statistic that will be used to complete this test. The function should take in a Table, which will be assigned to `sample` and which can be assumed to have the labels "Area Code Visited" and "Picked Up", like the Table of spam calls that Yanay has collected.

Then, use the function to compute the observed value of the test statistic and assign it to `observed_diff_proportion`.

In [None]:
def difference_in_proportion(sample):
    ...

observed_diff_proportion = ...
observed_diff_proportion

In [None]:
grader.check("q3_1")

### Question 3.2.

The labels in this case are the `"Yes"` and `"No"` values found in the `"Area Code Visited"` column of the sample of spam calls found in `spam_with_labels`. Write a function `simulate_one_stat` that takes in a Table formatted like `spam_with_labels`, assigns it to `table`, shuffles the labels in `table`, and returns the appropriate test statistic. Your code should *overwrite* the labels found in the `"Area Code Visited"` column, since `difference_in_proportion` expects the Table it works with to have the label `"Area Code Visited"` when computing the test statistic.
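The idea of shuffling labels can be sketched with plain NumPy arrays instead of Tables. Everything below is hypothetical and for illustration only: the data, the group names `'A'` and `'B'`, and the helper names `group_difference` and `one_shuffled_difference`. It is not the Table-based function this question asks for.

```python
import numpy as np

# Hypothetical data: a group label and a 0/1 outcome for each of 8 observations.
labels = np.array(['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'])
outcomes = np.array([1, 1, 0, 0, 1, 0, 0, 0])

def group_difference(group_labels, values):
    """Proportion of 1s in group A minus the proportion of 1s in group B."""
    return values[group_labels == 'A'].mean() - values[group_labels == 'B'].mean()

def one_shuffled_difference(group_labels, values):
    """Shuffle the labels (without replacement) and recompute the statistic."""
    shuffled = np.random.permutation(group_labels)  # same labels, random order
    return group_difference(shuffled, values)

observed = group_difference(labels, outcomes)
simulated = one_shuffled_difference(labels, outcomes)
```

Shuffling keeps the number of observations in each group fixed while breaking any link between label and outcome, which is what the null hypothesis of an A/B test describes.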

In [None]:
def simulate_one_stat(table):
    ...

one_simulated_test_stat = simulate_one_stat(spam_with_labels) 
one_simulated_test_stat

In [None]:
grader.check("q3_2")

### Question 3.3.

Generate 1,000 simulated test statistic values using the `simulate_one_stat` function and store them to the array `test_stats`.

The provided code will generate a histogram to display the distribution of the test statistics generated by the simulation, and plot the value assigned to `observed_diff_proportion` on the same axes.

In [None]:
test_stats = make_array()
repetitions = ...

...

# Here's code to generate a histogram of the simulated values; the red triangle marks the observed value
Table().with_column("Simulated Proportion Difference", test_stats).hist("Simulated Proportion Difference");
plt.plot(observed_diff_proportion, -0.01, 'r^', markersize=10);

In [None]:
grader.check("q3_3")

### Question 3.4.

Compute the empirical p-value for this test, and assign it to `p_value_ab`.

In [None]:
p_value_ab = ...
p_value_ab

In [None]:
grader.check("q3_4")

<!-- BEGIN QUESTION -->

### Question 3.5.

Yanay decides to use a p-value cutoff of 1 percent. Is there sufficient evidence to reject the null hypothesis based on the empirical p-value? Why or why not, and what does that mean about Yanay's suspicion that you are more likely to pick up a spam call that is coming from an area code from an area that you've recently visited?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

# Submitting your work
You're done with this assignment! Assignments should be turned in using the following best practices:
1. Save your notebook.
2. Restart the kernel and run all cells up to this one.
3. Run the cell below with the code `grader.export(...)`. This will re-run all the tests. Make sure they are passing as you expect them to.
4. Download the file named `hw06_<date-time-stamp>.zip`, found in the explorer pane on the left side of the screen. **Note**: Clicking on the link in this notebook may result in an error, so it's best to download from the file explorer panel.
5. Upload `hw06_<date-time-stamp>.zip` to the corresponding assignment on Canvas.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit.

In [None]:
grader.export(pdf=False, force_save=True)