In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw10.ipynb")

# Homework 10: Testing Hypotheses

Please complete this notebook by filling in the cells provided. Before you begin, execute the previous cell to load the provided tests.

**Helpful Resource:**
- [Python Reference](https://pages.mtu.edu/~lebrown/data1202-s24/reference/index.html): Cheat sheet of helpful array & table methods used in DATA 1202!

**Recommended Readings**: 

* [Sampling and Empirical Distributions](https://www.inferentialthinking.com/chapters/10/Sampling_and_Empirical_Distributions.html)
* [Testing Hypotheses](https://www.inferentialthinking.com/chapters/11/Testing_Hypotheses.html)
* [A/B Testing](https://inferentialthinking.com/chapters/12/1/AB_Testing.html)

Before you begin, execute the following cell to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute this cell again.

For all problems that ask for written explanations, you **must** provide your answer in the designated space. **Moreover, throughout this homework and all future ones, please be sure not to re-assign variables throughout the notebook!** For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!


**Note: This homework has hidden tests on it. That means even though the tests may say 100% passed, it doesn't mean your final grade will be 100%. We will be running more tests for correctness once everyone turns in the homework.**


Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

You should start early so that you have time to get help if you're stuck.

In [None]:
# Run this cell to set up the notebook, but please don't change it.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *
import d8error

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

## 1. Vaccinations Across The Nation - P-Values

Let's review the results you got in HW 09, but now use our new knowledge of how to calculate a p-value to draw a conclusion about the hypotheses.


### Problem Setup

A vaccination clinic has two types of vaccines against a disease. Each person who comes in to be vaccinated gets either Vaccine 1 or Vaccine 2. One week, everyone who came in on Monday, Wednesday, and Friday was given Vaccine 1. Everyone who came in on Tuesday and Thursday was given Vaccine 2. The clinic is closed on weekends.

Doctor McCoy at the clinic said, "Oh wow, it's just like tossing a coin that lands heads with chance $\frac{3}{5}$. Heads you get Vaccine 1 and Tails you get Vaccine 2."

But Doctor Strange said, "No, it's not. We're not doing anything like tossing a coin."

That week, the clinic gave Vaccine 1 to 211 people and Vaccine 2 to 107 people. Conduct a test of hypotheses to see which doctor's position is better supported by the data.

#### Hypotheses

**Null Hypothesis:** The assignment of vaccines is like tossing a coin that lands heads with chance 3/5.

**Alternative Hypothesis:** The assignment of vaccines is not like tossing a coin.

#### Statistic 

With the null and alternative hypotheses defined, you can describe the test statistic. 

*Statistic* = |percent of heads - 60|
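
As a quick arithmetic check of this formula, here is a minimal sketch using made-up counts (70 heads in 100 tosses). The names `hypothetical_heads`, `hypothetical_tosses`, and `example_statistic` are illustrative only and are not part of the homework.

```python
# Hypothetical counts, chosen only to illustrate the formula.
hypothetical_heads = 70
hypothetical_tosses = 100

# Convert the count to a percent, then measure its distance from 60.
example_percent_heads = 100 * hypothetical_heads / hypothetical_tosses
example_statistic = abs(example_percent_heads - 60)
print(example_statistic)  # 10.0
```

Because of the absolute value, a sample with 50% heads would give the same statistic of 10; the statistic measures distance from 60 in either direction.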

#### Simulate 

Let's now set up the problem and run the simulation again. 

Simulate 20,000 values of the test statistic under the assumption that the null hypothesis is true. 


In [None]:
sample_size = 211+107

def one_simulated_statistic():
    percent_heads = 100 * sample_proportions(
        sample_size, make_array(0.6, 0.4)).item(0)
    return abs(percent_heads - 60)

num_simulations = 20000

simulated_statistics = make_array()
for i in np.arange(num_simulations):
    simulated_statistics = np.append(simulated_statistics, 
                                     one_simulated_statistic())

**Question 1.1.** Calculate the `observed_statistic`, then using this value and `simulated_statistics` and `num_simulations`, find the empirical p-value based on the simulation. **(8 points)**
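
As a reminder of the general idea, an empirical p-value is the proportion of simulated statistics that are at least as extreme as the observed one. The sketch below illustrates this with a tiny made-up array; `toy_simulated` and `toy_observed` are hypothetical stand-ins, and the sketch assumes that large values of the statistic favor the alternative.

```python
import numpy as np

# Made-up values for illustration only; not the homework data.
toy_simulated = np.array([0.5, 1.2, 3.0, 0.1, 2.4])
toy_observed = 2.0

# Count the simulated values at least as large as the observed one,
# then divide by the number of simulations.
toy_p_value = np.count_nonzero(toy_simulated >= toy_observed) / len(toy_simulated)
print(toy_p_value)  # 0.4
```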


In [None]:
observed_statistic = ...

p_value = ...
p_value

In [None]:
grader.check("q1_1")

**Question 1.2.** Assign `correct_doctor` to the number corresponding to the correct statement below. Use the 5% cutoff for the p-value. **(4 points)**

1. The data support Dr. McCoy's position more than they support Dr. Strange's.
2. The data support Dr. Strange's position more than they support Dr. McCoy's.

As a reminder, here are the two claims made by Dr. McCoy and Dr. Strange:
> **Doctor McCoy:** "Oh wow, it's just like tossing a coin that lands heads with chance $\frac{3}{5}$. Heads you get Vaccine 1 and Tails you get Vaccine 2."

>**Doctor Strange:** "No, it's not. We're not doing anything like tossing a coin."


In [None]:
correct_doctor = ...
correct_doctor

In [None]:
grader.check("q1_2")

## 2. Using TVD as a Test Statistic - p-Value

Let's again review the question from HW 09 using our knowledge of p-values.



In this part of the homework, we'll look at how we can use TVD to determine the effect that different factors have on happiness. 

We will be working with data from the [Gallup World Poll](https://www.gallup.com/analytics/349487/gallup-global-happiness-center.aspx#:~:text=World%20Happiness%20Report&text=Using%20the%20Gallup%20World%20Poll,about%20the%20World%20Happiness%20Report.) that is presented in the World Happiness Report, a survey of the state of global happiness. The survey ranked 155 countries by overall happiness and estimated the influence that economic production, social support, life expectancy, freedom, absence of corruption, and generosity had on population happiness. The study has been repeated for several years, but we'll be looking at data from the 2016 survey.

Run the cell below to load in the `happiness_scores` table.

In [None]:
happiness_scores = Table.read_table("happiness_scores.csv")
happiness_scores.show(5)

Participants in the study were asked to evaluate their life satisfaction on a scale from 0 (worst possible life) to 10 (best possible life). The responses for each country were averaged to create the `Happiness Score`.

The columns `Economy (GDP per Capita)`, `Family`, `Health (Life Expectancy)`, `Freedom`, `Trust (Government Corruption)`, and `Generosity` estimate the extent to which each factor influences happiness, for better or for worse. The happiness score is the sum of these factors; the larger a factor is, the more it contributes to overall happiness. [In other words, if you add up all the factors (in addition to a "Difference from Dystopia" value we excluded in the dataset), you get the happiness score.]

Let's look at the different factors that affect happiness in the United States. Run the cell below to view the row in `us_happiness` that contains data for the United States.

In [None]:
us_happiness = happiness_scores.where("Country", "United States")
us_happiness

**To compare the different factors, we'll look at the proportion of the happiness score that is attributed to each variable. 
You can find these proportions in the table `us_happiness_factors` after running the cell below.**

*Note:* The factors shown in `us_happiness` don't add up exactly to the happiness score, so we adjusted the proportions to only account for the data we have access to. The proportions were found by dividing each Happiness Factor value by the sum of all Happiness Factor values in `us_happiness`.
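
The normalization described in the note can be sketched as follows. The factor values here are made up for illustration and are not the values in `us_happiness`; `toy_factor_values` and `toy_proportions` are hypothetical names.

```python
import numpy as np

# Made-up factor values (six factors), for illustration only.
toy_factor_values = np.array([1.5, 1.0, 0.5, 0.5, 0.25, 0.25])

# Divide each value by the total so the results sum to 1.
toy_proportions = toy_factor_values / toy_factor_values.sum()
print(toy_proportions)
print(toy_proportions.sum())  # 1.0
```

Dividing by the sum of the available factors is what guarantees the proportions form a valid distribution, even though the factors do not add up to the full happiness score.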

In [None]:
us_happiness_factors = Table.read_table("us_happiness_factors.csv")
us_happiness_factors

### Problem Setup

Suppose we want to test whether or not each factor contributes the same amount to the overall Happiness Score. Define the null hypothesis, alternative hypothesis, and test statistic. 

- *Null Hypothesis:* Each factor contributes an equal amount to the happiness score. Any deviation is due to random chance.

- *Alternative Hypothesis:* Some factors contribute more to the happiness score than other factors. 

- *Test Statistic:* the total variation distance (TVD) between the observed score proportions and the expected score proportions under the null. 
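
As a refresher, the total variation distance between two distributions is half the sum of the absolute differences between their proportions. Here is a minimal sketch with two made-up three-category distributions; `toy_observed_dist` and `toy_null_dist` are hypothetical and unrelated to the happiness data.

```python
import numpy as np

# Two made-up distributions over the same three categories.
toy_observed_dist = np.array([0.5, 0.25, 0.25])
toy_null_dist = np.array([0.25, 0.5, 0.25])

# Half the sum of absolute differences between the proportions.
toy_tvd = np.sum(np.abs(toy_observed_dist - toy_null_dist)) / 2
print(toy_tvd)  # 0.25
```

A TVD of 0 means the two distributions are identical, and larger values indicate distributions that are further apart.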

#### Calculate Test Statistic 

We wrote a function `calculate_tvd` that takes in the observed distribution (`obs_dist`) and expected distribution under the null hypothesis (`null_dist`) and calculates the total variation distance. Define `observed_tvd` to be equal to the observed test statistic. 


In [None]:
null_distribution = np.ones(6) * (1/6)

def calculate_tvd(obs_dist, null_dist):
    return sum(abs(obs_dist - null_dist))/2
    
observed_tvd = calculate_tvd(
    us_happiness_factors.column("Proportion of Happiness Score"), 
    null_distribution) 
observed_tvd

#### Simulate 

Create an array called `simulated_tvds` that contains 10,000 simulated values under the null hypothesis. Assume that the original sample consisted of 1,000 individuals. 



In [None]:
simulated_tvds = make_array() 

for i in np.arange(10000):
    simulated_proportions = sample_proportions(1000, null_distribution)
    one_tvd = calculate_tvd(simulated_proportions, null_distribution)
    simulated_tvds = np.append(simulated_tvds, one_tvd)


Run the cell below to plot a histogram of your simulated test statistics, as well as a red dot representing the observed value of the test statistic.

In [None]:
Table().with_column("Simulated TVDs", simulated_tvds).hist()
plt.scatter(observed_tvd, 0, color='red', s=70, zorder=2);
plt.show();

**Question 2.1.** Use your simulated statistics to calculate the p-value of your test. Make sure that this number is consistent with what you observed in the histogram above. **(4 points)**


In [None]:
p_value_tvd = ...
p_value_tvd

In [None]:
grader.check("q2_1")

## 3. Who is Older?

Data scientists have drawn a simple random sample of size 500 from a large population of adults. Each member of the population happened to identify as either "male" or "female". Data was collected on several attributes of the sampled people, including age. The table `sampled_ages` contains one row for each person in the sample, with columns containing the individual's gender identity and age.

In [None]:
sampled_ages = Table.read_table('age.csv')
sampled_ages.show(5)

**Question 3.1.** How many females were there in our sample? Please use the provided skeleton code. **(6 points)**

*Hint:* Keep in mind that `.group` sorts categories in alphabetical order!


In [None]:
num_females = sampled_ages.group(...)...
num_females

In [None]:
grader.check("q3_1")

**Question 3.2.** Complete the cell below so that `avg_male_vs_female` evaluates to `True` if the sampled males are older than the sampled females on average, and `False` otherwise. Use Python code to achieve this. **(6 points)**


In [None]:
group_mean_tbl = sampled_ages.group(...)
group_means = group_mean_tbl...       # array of mean ages
avg_male_vs_female = group_means... > group_means...
avg_male_vs_female

In [None]:
grader.check("q3_2")

**Question 3.3.** The data scientists want to use the data to test whether males are older than females, or whether the ages of the two groups instead have the same distribution. One of the following statements is their null hypothesis and another is their alternative hypothesis. Assign `null_statement_number` and `alternative_statement_number` to the numbers of the correct statements in the code cell below. **(6 points)**

1. In the sample, the males and females have the same distribution of ages; the sample averages of the two groups are different due to chance.
2. In the population, the males and females have the same distribution of ages; the sample averages of the two groups are different due to chance.
3. The age distributions of males and females in the population are different due to chance.
4. The males in the sample are older than the females, on average.
5. The males in the population are older than the females, on average.
6. The average ages of the males and females in the population are different.


In [None]:
null_statement_number = ...
alternative_statement_number = ...

In [None]:
grader.check("q3_3")

**Question 3.4.** The data scientists have decided to use a permutation test. Assign `permutation_test_reason` to the number corresponding to the reason they made this choice. **(6 points)**

1. Since a person's age shouldn't be related to their gender, it doesn't matter who is labeled "male" and who is labeled "female", so you can use permutations.
2. Under the null hypothesis, permuting the labels in the `sampled_ages` table is equivalent to drawing a new random sample with the same number of males and females as in the original sample.
3. Under the null hypothesis, permuting the rows of the `sampled_ages` table is equivalent to drawing a new random sample with the same number of males and females as in the original sample.


In [None]:
permutation_test_reason = ...
permutation_test_reason

In [None]:
grader.check("q3_4")

**Question 3.5.** To test their hypotheses, the data scientists have followed our textbook's advice and chosen a test statistic where the following statement is true: Large values of the test statistic favor the alternative hypothesis.

The data scientists' test statistic is one of the two options below. Which one is it? Assign the appropriate number to the variable `correct_test_stat`. **(4 points)**

1. "male age average - female age average" in a sample created by randomly shuffling the male/female labels
2. "|male age average - female age average|" in a sample created by randomly shuffling the male/female labels


In [None]:
correct_test_stat = ...
correct_test_stat

In [None]:
grader.check("q3_5")

**Question 3.6.** Complete the cell below so that `observed_statistic_ab` evaluates to the observed value of the data scientists' test statistic. Use as many lines of code as you need, and remember that you can use any quantity, table, or array that you created earlier. **(4 points)**


In [None]:
observed_statistic_ab = ...
observed_statistic_ab

In [None]:
grader.check("q3_6")

**Question 3.7.** Assign `shuffled_labels` to an array of shuffled male/female labels. The rest of the code puts the array in a table along with the data in `sampled_ages`. **(6 points)**

*Note:* Check out [12.1](https://inferentialthinking.com/chapters/12/1/AB_Testing.html#predicting-the-statistic-under-the-null-hypothesis) for a refresher on random permutations.
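
To illustrate what a random permutation of labels does, here is a minimal sketch on a made-up five-person dataset. It uses NumPy's `Generator.permutation` as a stand-in, which is an assumption and not necessarily the approach the textbook or the `datascience` library takes; all `toy_` names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Made-up labels and ages, for illustration only.
toy_labels = np.array(['female', 'male', 'female', 'male', 'male'])
toy_ages = np.array([34, 51, 29, 47, 62])

# Shuffle the labels while leaving the ages in their original order.
toy_shuffled = rng.permutation(toy_labels)

# Recompute the group means using the shuffled labels.
toy_male_mean = toy_ages[toy_shuffled == 'male'].mean()
toy_female_mean = toy_ages[toy_shuffled == 'female'].mean()
print(toy_shuffled)
print(toy_male_mean - toy_female_mean)
```

Running the shuffle repeatedly with different seeds gives different differences in means, which is the variation a permutation test relies on.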


In [None]:
shuffled_labels = ...
original_with_shuffled_labels = sampled_ages.with_columns('Shuffled Label', shuffled_labels)
original_with_shuffled_labels

In [None]:
grader.check("q3_7")

**Question 3.8.** The comparison below uses the array `shuffled_labels` from Question 3.7 and the count `num_females` from Question 3.1.

For this comparison, assign the correct number from one of the following options to the variable `correct_q8`. **Pretend this is a midterm problem and solve it without doing the calculation in a code cell.** **(6 points)**

`comp = np.count_nonzero(shuffled_labels == 'female') == num_females`

1. `comp` is set to `True`.
2. `comp` is set to `False`.
3. `comp` is set to `True` or `False`, depending on how the shuffle came out.


In [None]:
correct_q8 = ...
correct_q8

In [None]:
grader.check("q3_8")

**Question 3.9.** Define a function `simulate_one_statistic` that takes no arguments and returns one simulated value of the test statistic. We've given you a skeleton, but feel free to approach this question in a way that makes sense to you. Use as many lines of code as you need. Refer to the code you have previously written in this problem, as you might be able to re-use some of it. **(6 points)**


In [None]:
def simulate_one_statistic():
    "Returns one value of our simulated test statistic"
    shuffled_labels = ...
    shuffled_tbl = ...
    group_means = ...
    ...

In [None]:
grader.check("q3_9")

After you have defined your function, run the following cell a few times to see how the statistic varies.

In [None]:
simulate_one_statistic()

**Question 3.10.** Complete the cell to simulate 4,000 values of the statistic. We have included the code that draws the empirical distribution of the statistic and shows the value of `observed_statistic_ab` from Question 3.6. Feel free to use as many lines of code as you need. **(6 points)**

*Note:* This cell will take around a minute to run.


In [None]:
repetitions = 4000

simulated_statistics_ab = make_array()
...
    simulated_statistics_ab = ...

# Do not change these lines
Table().with_columns('Simulated Statistic', simulated_statistics_ab).hist()
plt.scatter(observed_statistic_ab, -0.002, color='red', s=70);

In [None]:
grader.check("q3_10")

**Question 3.11.** Use the simulation to find an empirical approximation to the p-value. Assign `p_val` to the appropriate p-value from this simulation. Then, assign `conclusion` to either `null_hyp` or `alt_hyp`. **(6 points)** 

*Note:* Assume that we use the 5% cutoff for the p-value.
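
The cutoff rule can be sketched as a simple comparison. The p-value below is made up and is not the answer to this question; `toy_p_val`, `toy_cutoff`, and `toy_conclusion` are hypothetical names.

```python
# Made-up p-value, for illustration only.
toy_p_val = 0.03
toy_cutoff = 0.05

# A p-value below the cutoff is considered small, so the data favor
# the alternative; otherwise the data are consistent with the null.
if toy_p_val < toy_cutoff:
    toy_conclusion = 'The data support the alternative more than the null.'
else:
    toy_conclusion = 'The data are consistent with the null hypothesis.'
print(toy_conclusion)
```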


In [None]:
# These are variables provided for you to use.
null_hyp = 'The data are consistent with the null hypothesis.'
alt_hyp = 'The data support the alternative more than the null.'

p_val = ...
conclusion = ...

p_val, conclusion # Do not change this line

In [None]:
grader.check("q3_11")

You're done with Homework 10!  

**Important submission steps:** 
1. Run the tests and verify that they all pass.
2. Choose **Save Notebook** from the **File** menu, then **run the final cell**. 
3. Click the link to download the zip file.
4. Then submit the zip file to the corresponding assignment according to your instructor's directions. 

**It is your responsibility to make sure your work is saved before running the last cell.**

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)