# Homework 5: Simulation, Sampling, and Hypothesis Testing

## Due Saturday, Aug 13th at 11:59PM PST

Welcome to Homework 5! This homework will cover:
- Simulations (see [Note 18](https://notes.dsc10.com/04-probability_and_simulation/probability_and_simulation.html))
- Populations and samples (see [Note 19](https://notes.dsc10.com/04-probability_and_simulation/1_populations_and_samples.html))
- Parameters and statistics (see [Note 20](https://notes.dsc10.com/04-probability_and_simulation/2_parameters_and_statistics.html))
- Models and Hypothesis Testing (see [Note 21](https://notes.dsc10.com/05-hypothesis_testing/1_hypothesis_tests.html) and [CIT 11.2](https://inferentialthinking.com/chapters/11/2/Multiple_Categories.html))

### Instructions

This assignment is due on Saturday, Aug 13th at 11:59pm. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

**Important**: For homeworks, the `otter` tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (see the schedule on the [Calendar](https://dsc10.com/calendar)) or Campuswire. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged.

In [1]:
# Please don't change this cell, but do make sure to run it.
import babypandas as bpd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

import otter
grader = otter.Notebook()

%reload_ext pandas_tutor

## 1. Lucky Triton Lotto, Continued  🔱 🎱 🧜

In the last homework, we calculated the probability of winning the grand prize (free housing) on a Lucky Triton Lotto lottery ticket, and found that it was quite low 😭.

In [2]:
free_housing_chance = (1 / 62) * (1 / 61) * (1 / 60) * (1 / 59) * (1 / 58) * (1 / 16)
free_housing_chance

In this question, we'll approach the same question not using math, but using simulation. 

It's important to remember how this lottery game works:
- When you buy a Lucky Triton Lotto ticket, you first pick five different numbers, one at a time, from 1 to 62. Then you separately pick a number from 1 to 16, which may or may not be the same as one of the first five. These are **your numbers**. For example, you may select (15, 1, 13, 3, 61, 8). This is a sequence of six numbers - **order matters**!
- The **winning numbers** are chosen by King Triton drawing five balls, one at a time, **without replacement**, from a pot of white balls numbered 1 to 62. Then, he draws a gold ball, the Tritonball, from a pot of gold balls numbered 1 to 16. Both pots are completely separate, hence the different ball colors. For example, maybe the winning numbers are (13, 15, 62, 3, 5, 8).

We’ll assume for this problem that in order to win the grand prize (free housing), all six of your numbers need to match the winning numbers and be in the **exact same positions**. In other words, your entire sequence of numbers must be exactly the same. However, if some numbers in your sequence match up with the corresponding number in the winning sequence, you will still win some Triton Cash. 

Suppose again that your numbers are (15, 1, 13, 3, 61, 8) and the winning numbers are (13, 15, 62, 3, 5, 8). In this case, two of your numbers are considered to match two of the winning numbers. Notice that although both sequences include the number 15 within the first five numbers (representing a white ball), since they are in different positions, that's not considered a match.

- Your numbers: (15, 1, 13, **3**, 61, **8**)
- Winning numbers: (13, 15, 62, **3**, 5, **8**)
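
As a minimal sketch, the position-by-position comparison in this example could be carried out as follows, assuming both sequences are stored as NumPy arrays. The variable names are illustrative and not part of the assignment.

```python
import numpy as np

# The two sequences from the example above.
your_numbers = np.array([15, 1, 13, 3, 61, 8])
winning_numbers = np.array([13, 15, 62, 3, 5, 8])

# Elementwise comparison: True only where the same number
# appears in the same position of both sequences.
position_matches = your_numbers == winning_numbers

# Count how many positions matched.
num_matches = np.count_nonzero(position_matches)
print(num_matches)  # prints 2
```

Because the comparison is elementwise, the 15 that appears in both sequences at different positions is not counted.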

**Question 1.1.** Write a function called `simulate_one_ticket`. It should take no arguments, and it should return an array with 6 random numbers, simulating how the numbers are selected for a single Lucky Triton Lotto ticket. The first five numbers should all be randomly chosen without replacement, from 1 to 62. The last number should be between 1 and 16.

In [3]:
def simulate_one_ticket():
    """Simulate one Lucky Triton Lotto ticket."""
    ...

In [None]:
grader.check("q1_1")

**Question 1.2.** It's draw day. You checked the winning numbers King Triton drew, which happened to be **(55, 12, 3, 51, 23, 5)**. You didn't win free housing, and you are quite sad.

Suppose you want to remind yourself how unlikely it is to win the grand prize. Call the function `simulate_one_ticket` 100,000 times. (It would cost a fortune if you were to buy that many! It's pretty nice to be able to simulate this experiment instead of doing it in real life). In your 100,000 tickets, **how many times did you win the grand prize (free housing)?** Assign your answer to `count_free_housing`.

**_Hint 1:_** Try it first with only buying 10 tickets. Once you are sure you have that figured out, change it to 100,000 tickets. It may take a little while (up to a minute) for Python to perform the calculations when you are buying 100,000 tickets.

**_Hint 2:_** You'll have to count how many of the numbers you chose match the numbers that were drawn. One way to do this involves [`np.count_nonzero`](https://numpy.org/doc/stable/reference/generated/numpy.count_nonzero.html). Remember you need **all** the numbers to match to win the grand prize.

In [8]:
count_free_housing = ...
...
count_free_housing

In [None]:
grader.check("q1_2")

Remember, the mathematical probability of winning free housing is quite low, on the order of $10^{-11}$. That's a lot lower than 1 in 100,000, which is $10^{-5}$.
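
To make this comparison concrete, here is a small sketch that recomputes the probability exactly with Python's `fractions` module and multiplies it by 100,000 tickets. The names `exact_chance` and `expected_wins` are illustrative assumptions, not variables the assignment asks for.

```python
from fractions import Fraction
from math import prod

# Chance of matching the five white balls in order: 1/62 * 1/61 * 1/60 * 1/59 * 1/58.
white_ball_chance = prod(Fraction(1, n) for n in range(58, 63))

# Multiply by the chance of matching the Tritonball (1 in 16).
exact_chance = white_ball_chance * Fraction(1, 16)

# Expected number of grand prizes among 100,000 independent random tickets.
expected_wins = 100_000 * float(exact_chance)

print(exact_chance)         # an exact fraction
print(float(exact_chance))  # roughly 8e-11
print(expected_wins)        # far below 1
```

Since the expected number of wins is far below 1, a simulation of 100,000 tickets will almost always contain no grand prize winners.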

**Question 1.3.** As we've seen, you would need to be extremely lucky to win the grand prize. To encourage more students to buy Lucky Triton Lotto tickets, students can win Triton Cash if some of their numbers match the corresponding winning numbers, as described above. Again simulate buying 100,000 tickets, but this time find **the greatest number of matches achieved by any of your tickets**, and assign this to `most_matches`. 

The winning numbers are the same from the previous part: **(55, 12, 3, 51, 23, 5)**

For example, if 90,000 of your tickets matched 1 winning number and 10,000 of your tickets matched 2 winning numbers, then you would set `most_matches` to 2. If 99,999 of your tickets matched 1 winning number and one of your tickets matched 4 winning numbers, you would set `most_matches` to 4. If you happened to win the grand prize on one of your tickets, you would set `most_matches` to 6. Remember, order matters.

**_Hint:_** There are several ways to approach this; one way involves storing the number of matches per ticket in an array and finding the largest number in that array. 

In [11]:
most_matches = ...
...
most_matches

In [None]:
grader.check("q1_3")

**Question 1.4.** Suppose one Lucky Triton Lotto ticket costs $2.

The Lucky Triton Lotto advertisement on Instagram promises you will never lose money because of the following generous prizes:

- Win $10 with a 1-number match

- Win $25 with a 2-number match

- Win $100 with a 3-number match

- Win $1,000 with a 4-number match

- Win $5,000 with a 5-number match

- Win $20,000 with a 6-number match (Free Housing!)

If you had the money to buy 100,000 tickets, what would be your net winnings from buying these tickets? Since this is net winnings, this should account for the prizes you win and the cost of buying the tickets. Assign the amount to `net_winnings`. Note that a positive value means you won money overall, and a negative value means you lost money overall. Do you believe the advertisement's claims?

The winning numbers are the same from the previous part: **(55, 12, 3, 51, 23, 5)**

**_Hint:_** Again, there are a few ways you could approach this problem. One way involves generating another 100,000 random tickets and counting the amount earned per ticket, adding to a running total. Alternatively, if you created an array of the number of matches per ticket in Question 1.3, you could loop through that array.
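
As a worked example of the arithmetic only, the sketch below computes net winnings for four hypothetical tickets with made-up match counts. It assumes a ticket with zero matches wins nothing, which the advertisement does not state explicitly. All names are illustrative.

```python
# Prize table from the advertisement; the 0-match entry is an assumption.
prizes = {0: 0, 1: 10, 2: 25, 3: 100, 4: 1000, 5: 5000, 6: 20000}
ticket_cost = 2

# Hypothetical match counts for four tickets (not simulated data).
example_matches = [0, 1, 0, 2]

total_prizes = sum(prizes[m] for m in example_matches)  # 0 + 10 + 0 + 25
total_cost = ticket_cost * len(example_matches)         # 4 tickets at $2 each
example_net = total_prizes - total_cost

print(example_net)  # prints 27
```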

In [14]:
net_winnings = ...
...
net_winnings

In [None]:
grader.check("q1_4")

## 2. Sampling with Netflix 🍿

In this question, we will use a dataset consisting of information about all Netflix Original movies to get some practice with sampling. Run the cell below to load the data into a DataFrame, indexed by title.

In [17]:
# Just run this cell, do not change it!
movie_data = bpd.read_csv('data/netflix_originals.csv').set_index('Title')
movie_data

We've provided a function called `compute_statistics` that takes as input a DataFrame with two columns, `'Runtime'` and `'IMDB Score'`, and then:
- draws a histogram of `'Runtime'`,
- draws a histogram of `'IMDB Score'`, and
- returns a two-element array containing the mean `'Runtime'` and mean `'IMDB Score'`.

Run the cell below to define the `compute_statistics` function, and a helper function called `histograms`. Don't worry about how this code works, and please don't change anything.

In [18]:
# Don't change this cell, just run it.
def histograms(df):
    runtimes = df.get('Runtime').values
    ratings = df.get('IMDB Score').values
    
    plt.subplots(1, 2, figsize=(15, 4), dpi=100)

    plt.subplot(1, 2, 1)
    plt.hist(runtimes, density=True, alpha=0.5, color='blue', ec='w', bins=np.arange(0, 250, 10))
    plt.title('Distribution of Runtimes')

    plt.subplot(1, 2, 2)
    plt.hist(ratings, density=True, alpha=0.5, color='blue', ec='w', bins=np.arange(0, 10, 0.4))
    plt.title('Distribution of IMDB Scores')
    
def compute_statistics(runtimes_and_ratings_data, draw=True):
    if draw:
        histograms(runtimes_and_ratings_data)
    avg_runtime = np.average(runtimes_and_ratings_data.get('Runtime').values)
    avg_rating = np.average(runtimes_and_ratings_data.get('IMDB Score').values)
    avg_array = np.array([avg_runtime, avg_rating]) 
    return avg_array

We can use this `compute_statistics` function to show the distribution of `'Runtime'` and `'IMDB Score'` and compute their means, for any collection of movies. 

Run the next cell to show these distributions and compute the means for all Netflix Original movies. Notice that an array containing the mean `'Runtime'` and mean `'IMDB Score'` values is displayed before the histograms.

In [19]:
movie_stats = compute_statistics(movie_data)
movie_stats

Now, imagine that instead of having access to the full *population* of movies, we had only gotten data on a smaller subset of the movies, or a *sample*.  For 584 movies, it's not so unreasonable to expect to see all the data, but usually we aren't so lucky.  Instead, we often make *statistical inferences* about a large underlying population using a smaller sample.

A **statistical inference** is a statement about some characteristic of the underlying population, such as "the average IMDB rating for Netflix Original movies is 6.3". You may have heard the word "inference" used in other contexts.  It's important to keep in mind that statistical inferences can be wrong.

A common strategy for inference using samples is to estimate parameters of the population by computing the same statistics on a sample.  This strategy sometimes works well and sometimes doesn't.  The degree to which it gives us useful answers depends on several factors.
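
The sketch below illustrates this strategy on a synthetic population of 1,000 made-up numbers (not the Netflix data): the population mean plays the role of the parameter, and the mean of a random sample of 50 plays the role of the statistic. The seed and all names are arbitrary choices for illustration.

```python
import random
from statistics import mean

random.seed(10)  # only so the example is reproducible

# A synthetic "population" of 1,000 values.
population = [random.gauss(100, 15) for _ in range(1000)]

# A random sample of 50 values, drawn without replacement.
sample = random.sample(population, 50)

parameter = mean(population)  # usually unknown in practice
statistic = mean(sample)      # what we can actually compute

print(parameter, statistic)
```

The two values are typically close but not equal, and a different sample would give a different statistic.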

One very important factor in the utility of samples is how they were gathered. Let's look at some different sampling strategies.

### Convenience sampling
One sampling methodology, which is **generally a bad idea**, is to choose movies which are somehow convenient to sample.  For example, you might choose movies that you have personally watched, since it's easier to collect information about them.  This is called, somewhat pejoratively, *convenience sampling*.

**Question 2.1.**  Suppose you love scary movies 👻 and you decide to manually look up information on all Netflix Original movies in the following genres:
- `'Horror'`
- `'Thriller'`
- `'Horror thriller'`

Assign `convenience_sample` to a subset of `movie_data` that contains only the rows for movies that are in one of these genres.

In [20]:
convenience_sample = ...
convenience_sample

In [None]:
grader.check("q2_1")

**Question 2.2.** Assign `convenience_stats` to an array of the mean `'Runtime'` and mean `'IMDB Score'` of your convenience sample.  Since they're computed on a sample, these are called *sample means*. 

**_Hint_**: Use the function `compute_statistics`; it's okay if histograms are displayed as well.

In [23]:
convenience_stats = ...
convenience_stats

In [None]:
grader.check("q2_2")

Next, we'll compare the distribution of runtimes in our convenience sample with the distribution of runtimes for all the movies in our dataset.

In [27]:
# Just run this cell, do not change it!
def compare_runtimes(first, second, first_title, second_title):
    """Compare the runtimes in two DataFrames."""
    bins = np.arange(0, 250, 10)
    
    plt.subplots(1, 2, figsize=(15, 4), dpi=85)

    plt.subplot(1, 2, 1)
    plt.hist(first.get('Runtime'), bins=bins, density=True, ec='w', color='blue', alpha=0.5)
    plt.title(f'Runtimes ({first_title})')
    
    plt.subplot(1, 2, 2)
    plt.hist(second.get('Runtime'), bins=bins, density=True, ec='w', color='blue', alpha=0.5)
    plt.title(f'Runtimes ({second_title})')

compare_runtimes(movie_data, convenience_sample, 'All Movies', 'Convenience Sample')

**Question 2.3.** 

From what you see in the histograms above, did the convenience sample give us an accurate picture of the runtimes for the full population of movies?  Why or why not?

Assign either 1, 2, 3, or 4 to the variable `sampling_q3` below. 
1. Yes. The sample is large enough, so it is an accurate representation of the population.
2. No. Normally convenience samples give us an accurate representation of the population, but only if the sample size is large enough. Our convenience sample here was too small.
3. No. Normally convenience samples give us an accurate representation of the population, but we just got unlucky.
4. No. Convenience samples generally don't give us an accurate representation of the population.

In [28]:
sampling_q3 = ...

In [None]:
grader.check("q2_3")

### Simple random sampling
A more principled approach is to sample uniformly at random from the movies.  If we ensure that each movie is selected at most once, this is a **random sample without replacement**, sometimes abbreviated to "**simple random sample**" or "**SRS**".  Imagine writing down each movie's title on a card, putting the cards in a hat, and shuffling the hat.  To sample, pull out cards one by one and set them aside, stopping when the specified *sample size* is reached.
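
The cards-in-a-hat procedure can be sketched with `np.random.choice` and `replace=False`. The titles below are hypothetical placeholders, not rows of the dataset.

```python
import numpy as np

# Hypothetical "cards in the hat".
titles = np.array(['Movie A', 'Movie B', 'Movie C', 'Movie D', 'Movie E', 'Movie F'])

# Pull out 3 cards without putting any back, so no title can repeat.
srs_titles = np.random.choice(titles, size=3, replace=False)
print(srs_titles)
```

With `replace=True`, the same title could be drawn more than once, which would no longer be a simple random sample.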

We've produced two simple random samples of `movie_data`: the variable `small_srs_data` contains an SRS of size 70, and the variable `large_srs_data` contains an SRS of size 180.

Now we'll run the same analyses on the small simple random sample, the large simple random sample, and the convenience sample. The subsequent code draws the histograms and computes the means for `'Runtime'` and `'IMDB Score'`.

In [31]:
# Don't change this cell, but do run it.

small_srs_data = bpd.read_csv('data/small_srs_rating.csv').set_index('Title')
large_srs_data = bpd.read_csv('data/large_srs_rating.csv').set_index('Title')

small_stats = compute_statistics(small_srs_data, draw=False);
large_stats = compute_statistics(large_srs_data, draw=False);
convenience_stats = compute_statistics(convenience_sample, draw=False);

print('Full data stats:                 ', movie_stats)
print('Small SRS stats:', small_stats)
print('Large SRS stats:', large_stats)
print('Convenience sample stats:        ', convenience_stats)

color_dict = {
    'small SRS': 'blue',
    'large SRS': 'green',
    'convenience sample': 'orange'
}

plt.subplots(3, 2, figsize=(15, 15), dpi=100)
i = 1

for df, name in zip([small_srs_data, large_srs_data, convenience_sample], color_dict.keys()):
    plt.subplot(3, 2, i)
    i += 2
    plt.hist(df.get('Runtime'), density=True, alpha=0.5, color=color_dict[name], ec='w', 
             bins=np.arange(0, 250, 10))
    plt.title(f'Runtimes ({name})');

i = 2
for df, name in zip([small_srs_data, large_srs_data, convenience_sample], color_dict.keys()):
    plt.subplot(3, 2, i)
    i += 2
    plt.hist(df.get('IMDB Score'), density=True, alpha=0.5, color=color_dict[name], ec='w', 
             bins=np.arange(0, 10, 0.4))
    plt.title(f'IMDB Ratings ({name})');

### Producing simple random samples
Often it's useful to take random samples even when we have a larger dataset available.  One reason is that doing so can help us understand how inaccurate other samples are.

DataFrames provide the method `.sample` for producing simple random samples.  Note that its default is to sample **without** replacement. 

**Question 2.4.** Produce a simple random sample *without replacement* of size 70 from `movie_data`. Store an array containing the mean `'Runtime'` and mean `'IMDB Score'` of your SRS in `my_small_stats`. Again, it's fine if histograms are displayed.

In [32]:
my_small_stats = ...
my_small_stats

Run the cell in which `my_small_stats` is defined many times, to collect new samples and compute their sample means.

<br>

Now, recall, `small_stats` is an array containing the mean `'Runtime'` and mean `'IMDB Score'` for the one small SRS that we provided you with:

In [33]:
small_stats

Answer the following two-fold question:
- Are the values in `my_small_stats` (the mean `'Runtime'` and `'IMDB Score'` for **your** small SRS) similar to the values in `small_stats` (the mean `'Runtime'` and `'IMDB Score'` for the small SRS **we provided you with**)? 
- Each time you collect a new sample – i.e. each time you re-run the cell where `my_small_stats` is defined – do the values in `my_small_stats` change a lot?

Assign either 1, 2, 3, or 4 to the variable `sampling_q4` below.
1. The values in `my_small_stats` are very different from the values in `small_stats`, and don't change at all each time a new sample is collected.
2. The values in `my_small_stats` are identical to the values in `small_stats`, and change a bit each time a new sample is collected.
3. The values in `my_small_stats` are slightly different from the values in `small_stats`, and change a bit each time a new sample is collected.
4. The values in `my_small_stats` are identical to the values in `small_stats`, and don't change at all each time a new sample is collected.


In [34]:
sampling_q4 = ...

In [None]:
grader.check("q2_4")

**Question 2.5.** Similarly, create a simple random sample of size 180 from `movie_data` and store an array of the sample's mean `'Runtime'` and mean `'IMDB Score'` in `my_large_stats`.

In [37]:
my_large_stats = ...
my_large_stats

Run the cell in which `my_large_stats` is defined many times. Do the histograms and mean statistics (mean `'Runtime'` and mean `'IMDB Score'`) seem to change more or less across samples of size 180 than across samples of size 70?

Assign either 1, 2, or 3 to the variable `sampling_q5` below. 

1. The statistics change *less* across samples of size 180 than across samples of size 70.
2. The statistics change an *equal amount* across samples of size 180 and across samples of size 70.
3. The statistics change *more* across samples of size 180 than across samples of size 70.

In [38]:
sampling_q5 = ...

In [None]:
grader.check("q2_5")

## 3. COVID Politics 🐘 🐎

In Section 8 of the Midterm Project, we analyzed COVID positivity rates for different states based on the party affiliation of voters in that state, as determined by their votes in the 2020 presidential election. We have the relevant data in the DataFrame `covid_politics` below.

In [41]:
# Run this cell to load the data
covid_politics = bpd.read_csv('data/covid_politics.csv')
covid_politics 

As a reminder, each row in the DataFrame represents a state in the United States. The columns are:
- `state`,
- `endPositiveRate` (the total number of positive tests per 100,000 people for that state as of December 31, 2020), and 
- `popParty` (the popular political party, according to votes in the 2020 election).

In this question we'll think of the dataset of 50 states as a _population_ and see what we can learn (infer) about the population by looking at data in a _sample_.

In the project you calculated a variable called `difference_by_residents`, defined as the difference in mean COVID positivity rates between `'Republican'` and `'Democratic'` states (in the order `'Republican'` minus `'Democratic'`). We've recalculated it below. 

In [42]:
republican_residents = covid_politics[covid_politics.get('popParty')=='Republican'] 
democratic_residents = covid_politics[covid_politics.get('popParty')=='Democratic']
difference_by_residents = republican_residents.get('endPositiveRate').mean() - democratic_residents.get('endPositiveRate').mean() 
difference_by_residents

**Question 3.1.** Create a function called `mean_diff` that takes as input a DataFrame of states with columns `'endPositiveRate'` and `'popParty'`, and returns the difference between the mean COVID positivity rate for `'Republican'` states and the mean COVID positivity rate for `'Democratic'` states (again, calculate `'Republican'` minus `'Democratic'`).

When called on the input `covid_politics`, the output should be the same as `difference_by_residents`. However, this function should work on *any* DataFrame of states, provided there are at least some `'Republican'` states and some `'Democratic'` states in the DataFrame.

In [43]:
def mean_diff(state_df):
    ...

# This should be the same as difference_by_residents. It's okay if the last few decimal places are off.
mean_diff(covid_politics)

In [None]:
grader.check("q3_1")

**Question 3.2.** The value of `difference_by_residents` uses data from all 50 states in the `covid_politics` dataset. Let's suppose, as is often the case in reality, that you couldn't access information about all of the states in the dataset at once, but instead you could **only look at 15 states at a time**. You want to look at **15 random states, sampled without replacement**, to get a representative sample of the full dataset. Write a function called `pick_15` that simulates this. Specifically, the function should take *no* arguments and should return a DataFrame of 15 randomly selected states from `covid_politics`.

In [46]:
def pick_15():
    """Randomly select 15 different states from covid_politics."""
    ...
pick_15()

In [None]:
grader.check("q3_2")

Now, even without access to the full `covid_politics` dataset, you can get an idea of the difference between mean COVID positivity rates of `'Republican'` and `'Democratic'` states, based on the 15 states in a random sample. The `mean_diff` function you wrote should be able to calculate the difference in mean COVID positivity rates for a random sample of states:

In [50]:
mean_diff(pick_15())

But what if you'd picked a different random 15 states for your sample? Surely, you'd get a different answer, but how different? Run the cell above a few times. You should get different results each time. If not, check for a mistake in your `mean_diff` function or your `pick_15` function.

To answer this question of how the mean difference changes as our sample changes, let's repeat our experiment.

**Question 3.3.** 500 times, randomly select 15 states and calculate the difference of mean COVID positivity rates between `'Republican'` and `'Democratic'` states (do `'Republican'` minus `'Democratic'`). Record the 500 differences of mean COVID positivity rates in an array called `experiment_differences`.

**_Hint:_** Feel free to use previously defined functions. First try simulating 10 trials. Once you are sure you have that figured out, change it to 500 trials. It may take about a minute to run with 500 trials.

In [51]:
experiment_differences = ...
experiment_differences

In [None]:
grader.check("q3_3")

**Question 3.4.** When you ran your experiment 500 times, you got 500 different estimates for the difference of mean COVID positivity rates between `'Republican'` and `'Democratic'` states, and you stored those estimates in `experiment_differences`. These estimates are statistics because they come from samples. Create a density histogram showing the distribution of these statistics.

<!-- BEGIN QUESTION -->


In [54]:
# Create your histogram here.

<!-- END QUESTION -->



**Question 3.5.** Compute the average value of the 500 statistics in `experiment_differences` and store your average in `approximate_difference`. This average is **also** an estimate of the difference in mean COVID positivity rates for the full data set, which is the population parameter you're trying to estimate here. Further, it's probably a better estimate than any individual statistic, because it comes from an average, which balances out the statistics that are too high with the statistics that are too low.

In [55]:
approximate_difference = ...
approximate_difference

In [None]:
grader.check("q3_5")

**Question 3.6.**  Now you have an estimate for the difference in mean COVID positivity rates between `'Republican'` and `'Democratic'` states, but you'd like to know how good of an estimate it is. How far is `approximate_difference`, calculated from your sample statistics, from `difference_by_residents`, the parameter calculated from the full `covid_politics` population? Compute the absolute difference between the two values and store it in the variable `error`. 

In [58]:
error = ...
error

In [None]:
grader.check("q3_6")

If you'd like to explore this some more, try taking samples of different sizes, and calculating the error in your corresponding estimates. Do estimates derived from bigger samples tend to be more accurate?

## 4. Sentiment Analysis 📰

Using text analysis, data scientists can identify positive, negative, and neutral expressions in text. This is known as sentiment analysis.

Suppose that you consider UCSD's student-run newspaper, [The Guardian](https://ucsdguardian.org/), to be a fairly positive publication. You estimate that about 10% of sentences are negative, 35% are neutral, and 55% are positive. We'll represent that as an array called `guardian_model`.

In [61]:
guardian_model = np.array([0.1, 0.35, 0.55]) 
guardian_model

Let's do a hypothesis test to check if our model is accurate. Suppose you run a sentiment analysis program on the most recent issue of The Guardian, and find that out of 6600 sentences, 3570 are positive. 

**Question 4.1.** Complete the implementation of the function `one_simulation`, which has no arguments and returns the proportion of sentences with a positive sentiment, out of 6600 sentences whose sentiments are **randomly generated** according to our model.

**_Hint:_** Use `np.random.multinomial`.
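
As a reminder of how `np.random.multinomial` behaves, here is a small sketch with a made-up three-category model and only 10 draws. The probabilities and names are hypothetical and deliberately different from `guardian_model`.

```python
import numpy as np

# A hypothetical model with three categories.
toy_model = np.array([0.2, 0.3, 0.5])

# Randomly generate 10 items according to the model.
# The result is an array of counts, one per category, that sums to 10.
toy_counts = np.random.multinomial(10, toy_model)

# Dividing by the number of draws turns counts into proportions.
toy_proportions = toy_counts / 10

print(toy_counts, toy_proportions)
```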

In [62]:
def one_simulation():
    ...

one_simulation()

In [None]:
grader.check("q4_1")

**Question 4.2.** The test statistic for our hypothesis test will be the **absolute difference between the proportion of positive sentences in a given simulation and the expected proportion of positive sentences in our model**, i.e.

$$| \text{proportion of positive sentences in simulated sample} - 0.55 |$$
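
For reference, the observed value of this test statistic follows directly from the numbers given above (3570 positive sentences out of 6600). The variable names in this sketch are illustrative.

```python
observed_positive = 3570
total_sentences = 6600
model_positive = 0.55  # proportion of positive sentences under the model

observed_proportion = observed_positive / total_sentences
observed_statistic = abs(observed_proportion - model_positive)

print(round(observed_proportion, 4))  # prints 0.5409
print(round(observed_statistic, 4))   # prints 0.0091
```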


Let's conduct 5000 simulations. Create an array named `proportion_diffs` containing 5000 simulated values of the test statistic described above. Utilize the function created in the previous question to perform this task.

In [65]:
proportion_diffs = ...

# Visualize with a histogram
bpd.DataFrame().assign(absolute_differences=proportion_diffs).plot(kind='hist', bins=np.arange(0, 0.03, 0.002), density=True, ec='w', figsize=(10, 5));
plt.axvline(x=abs(3570 / 6600 - 0.55), color='red', label='observed statistic')
plt.legend();

In [None]:
grader.check("q4_2")

**Question 4.3.** Recall that our null hypothesis is that the proportion of positive sentences is 0.55, and that our sentiment analysis program found 3570 out of 6600 sentences to be positive. Use this information to calculate the p-value for this hypothesis test, which is the **proportion of times in our simulation that we saw a test statistic as or more extreme than our observed test statistic**. Assign the result to `guardian_p`.

**_Hint:_** Do large values of our test statistic favor the alternative hypothesis, or do small values of our test statistic favor the alternative hypothesis?
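
To illustrate the definition of a p-value in a different setting, the sketch below tests whether a hypothetical coin is fair after seeing 60 heads in 100 flips, using the absolute difference from 50 heads as the test statistic. The scenario, the seed, and all names are assumptions made for this example. Which direction counts as "more extreme" depends on the test statistic and the alternative hypothesis, so think it through separately for your own test.

```python
import numpy as np

np.random.seed(0)  # only so the example is reproducible

num_flips = 100
observed_heads = 60
observed_stat = abs(observed_heads - 50)

# Simulate 5000 experiments of 100 fair coin flips each.
simulated_heads = np.random.binomial(num_flips, 0.5, size=5000)
simulated_stats = np.abs(simulated_heads - 50)

# Proportion of simulated statistics at least as far from 50 as the observed one.
coin_p_value = np.count_nonzero(simulated_stats >= observed_stat) / len(simulated_stats)
print(coin_p_value)
```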

In [68]:
guardian_p = ...
guardian_p

In [None]:
grader.check("q4_3")

**Question 4.4.** Assign `guardian_conclusion` to the number (1, 2, or 3) of the best conclusion of this hypothesis test, based on the standard 0.05 significance level.
   
   1. We should reject the null hypothesis because it is unlikely that we'd see the observed number of positive sentences if our model were correct. 
    
   2. We should accept the null hypothesis because our observed data is consistent with our model.
    
   3. We should neither reject nor accept the null hypothesis because we haven't seen any evidence that our model is wrong, but we also don't know that it's accurate.
    

In [71]:
guardian_conclusion = ...

In [None]:
grader.check("q4_4")

## 5. Cracking Wordle 🟨 ⬛ 🟨 🟩 ⬛

Suppose you're a really competitive [Wordle](https://www.nytimes.com/games/wordle) player and you're looking for some tips to guess the answer word more quickly. Online, you find a _model_ for the proportion of times each letter in the alphabet is the first letter of the answer word in Wordle. (For example, in the words `"ALOOF"`, `"TRACE"`, and `"POINT"`, the letters in the first position of the word are `"A"`, `"T"`, and `"P"`, respectively.)

The model you found is:

<table>
    <tr><th>Letter</th><th>Estimated Chance of Being First Letter</th></tr>
    <tr><td>A</td><td>7%</td></tr>
    <tr><td>B</td><td>9%</td></tr>
    <tr><td>C</td><td>10%</td></tr>
    <tr><td>D</td><td>4%</td></tr>
    <tr><td>E</td><td>2%</td></tr>
    <tr><td>F</td><td>5%</td></tr>
    <tr><td>G</td><td>3%</td></tr>
    <tr><td>H</td><td>5%</td></tr>
    <tr><td>I</td><td>2%</td></tr>
    <tr><td>J</td><td>1%</td></tr>
    <tr><td>K</td><td>3%</td></tr>
    <tr><td>L</td><td>2%</td></tr>
    <tr><td>M</td><td>2%</td></tr>
    <tr><td>N</td><td>4%</td></tr>
    <tr><td>O</td><td>3%</td></tr>
    <tr><td>P</td><td>7%</td></tr>
    <tr><td>Q</td><td>0.25%</td></tr>
    <tr><td>R</td><td>2%</td></tr>
    <tr><td>S</td><td>16%</td></tr>
    <tr><td>T</td><td>8%</td></tr>
    <tr><td>U</td><td>1%</td></tr>
    <tr><td>V</td><td>2%</td></tr>
    <tr><td>W</td><td>1%</td></tr>
    <tr><td>X</td><td>0.25%</td></tr>
    <tr><td>Y</td><td>0.25%</td></tr>
    <tr><td>Z</td><td>0.25%</td></tr>
</table>

Let's store these values in an array called `wordle_distribution`.

In [74]:
# Just run this cell, do not change it!
wordle_distribution = np.array([0.07, 0.09, 0.10, 0.04, 0.02, 0.05, 0.03, 0.05, 0.02, 0.01, 0.03, 0.02, 0.02, 0.04, 0.03, 0.07, 0.0025, 0.02, 0.16, 0.08, 0.01, 0.02, 0.01, 0.0025, 0.0025, 0.0025])
wordle_distribution

You notice that you have seen the letter `"S"` as the first letter of the Wordle quite often. The model you found estimates that there is a 16% chance of `"S"` being the first letter of the answer word in Wordle. You decide to play Wordle for 100 straight days, and `"S"` is the first letter of the answer word exactly 9 times. You start to suspect that 16% might be **too high** of an estimate, and that the model is wrong. 

**Question 5.1.** Using the model in which there is a 16% chance of `"S"` being the first letter, write a simulation that runs 100 games and keeps track of the **difference** between: 
- the number of Wordles in which `"S"` is the first letter, and 
- the number of times you'd expect `"S"` to be the first letter in 100 Wordles according to the model.

In other words, you will be calculating the observed (empirical) minus expected (theoretical) number of times `"S"` is the first letter in 100 Wordles. Note that there are no absolute values involved, unlike in Question 4.

Run your simulation 5000 times. Keep track of the differences in an *array* called `wordle_differences`.

**_Hint:_** If A is the 1st letter in the alphabet, then S is the 19th.
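
As a rough illustration of the mechanics (not a solution to this question), the sketch below simulates a made-up three-category model with `np.random.multinomial` and records the observed minus expected count for one category. The names `toy_distribution`, `draws_per_trial`, `num_trials`, and `category`, along with all of their values, are hypothetical and chosen only for this example.

```python
import numpy as np

np.random.seed(0)  # only so that this example is reproducible

# Hypothetical model: three categories with these probabilities.
toy_distribution = np.array([0.5, 0.3, 0.2])
draws_per_trial = 50   # hypothetical sample size
num_trials = 1000      # hypothetical number of repetitions
category = 1           # 0-indexed position of the category of interest

# Number of times the model says we should see this category.
expected_count = toy_distribution[category] * draws_per_trial

toy_differences = np.array([])
for _ in np.arange(num_trials):
    # One simulated sample: how many of the draws landed in each category.
    counts = np.random.multinomial(draws_per_trial, toy_distribution)
    # Signed difference (observed minus expected), with no absolute value.
    toy_differences = np.append(toy_differences, counts[category] - expected_count)
```

Note that array positions start at 0, so the position of a category in the array is one less than its place in the list.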

In [75]:
wordle_differences = ...

# Visualize with a histogram
bpd.DataFrame().assign(differences=wordle_differences).plot(kind='hist', density=True, bins=np.arange(-15, 15, 1), ec='w', figsize=(10, 5));
plt.axvline(x=-7, color='red', label='observed statistic')
plt.legend();

In [None]:
grader.check("q5_1")

**Question 5.2.** Recall, your null hypothesis was that there is a 16% chance of `"S"` being the first letter of the Wordle, but you observed `"S"` being the first letter 9 times out of 100. Compute the p-value for this hypothesis test, and save the result to `wordle_p_value`.

**_Hint:_** Remember, the reason you ran a hypothesis test at all was that you thought 16% was too high an estimate.
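
To see how a one-sided p-value is computed from simulated statistics, here is a small sketch using made-up numbers. The arrays and the observed value below are hypothetical. The direction of the comparison is an assumption that fits an alternative hypothesis claiming the statistic is unusually low.

```python
import numpy as np

# Hypothetical simulated statistics and observed value, for illustration only.
toy_simulated = np.array([-3, 0, 2, -5, 1, -1, 4, -6, 0, 3])
toy_observed = -5

# The alternative here says the statistic should be unusually LOW,
# so we count the simulated values that are at or below the observed one.
toy_p_value = np.count_nonzero(toy_simulated <= toy_observed) / len(toy_simulated)
toy_p_value
```

Here 2 of the 10 simulated values are at or below the observed value, so the p-value is 0.2.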

In [79]:
wordle_p_value = ...
wordle_p_value

In [None]:
grader.check("q5_2")

**Question 5.3.** Based on the histogram and the p-value, set the variable `wordle_null_hypothesis` below to `True` if you think your model is plausible or `False` if it should be rejected at the standard 0.05 significance level.

In [82]:
wordle_null_hypothesis = ...
wordle_null_hypothesis

In [None]:
grader.check("q5_3")

**Question 5.4.** In this question, we chose as our test statistic the difference (signed, not absolute) between the number of times out of 100 `"S"` was the first letter of the Wordle and the number of times out of 100 you would expect this to happen. But this is not the only statistic we could have chosen; there are many that could have worked here. 

From the options below, choose the test statistic that would **not** have worked for this hypothesis test, and save your choice in the variable `wordle_bad_choice`. 

1. The number of times out of 100 that `"S"` was the first letter of the Wordle.
2. The proportion of times that `"S"` was the first letter of the Wordle.
3. The absolute difference between the number of times out of 100 that `"S"` was the first letter of the Wordle and the theoretical number of times out of 100. ($\text{statistic} = |\text{empirical} - \text{theoretical}|$)
4. The sum of the number of times out of 100 that `"S"` was the first letter of the Wordle and the theoretical number of times out of 100 that `"S"` is the first letter. ($\text{statistic} = \text{empirical} + \text{theoretical}$)

**_Hint:_** Our goal is to find a test statistic that will help us determine whether the number of times `"S"` is the first letter of the Wordle is **less** than the expected number of 16.



In [85]:
wordle_bad_choice = ...

In [None]:
grader.check("q5_4")

## 6. <span style='color:#FF1480'> Surprise Mini Brands!</span>  🍭

When you buy a Surprise Mini Brands toy, you open it up to reveal tiny replicas of branded supermarket products. Here are some of the possible items you may see when opening a Surprise Mini Brands toy:
<img src='data/minibrand.png' width='650'>

No, that is not real pasta sauce!

There are four types of replicas in a Surprise Mini Brands toy: `'Gold'`, `'Metallic'`, `'Glow in the Dark'`, and `'Common'`. The first three are "rare" types, which are made of special materials.

Unfortunately, Zuru, the company behind Surprise Mini Brands, doesn't make public the probability of getting any of the four types of replicas. A DSC 10 tutor proposed the following probability distribution:

| Type | Estimated Probability of Type |
| --- | --- |
| Gold | $\frac{1}{15}$ |
| Metallic | $\frac{1}{15}$ |
| Glow in the Dark | $\frac{1}{30}$ |
| Common | $\frac{5}{6}$ |

We'll store this distribution in an array, in the order `'Gold'`, `'Metallic'`, `'Glow in the Dark'`, and `'Common'`:

In [88]:
# Just run this cell, do not change it!
type_distribution_tutor = np.array([1 / 15, 1 / 15, 1 / 30, 5 / 6])
type_distribution_tutor

To assess the validity of their model, the tutor surveyed many individuals who purchased Surprise Mini Brands toys and asked them for the types of replicas they received. In total, they were given information about 15525 replicas, out of which:
- 818 were `'Gold'`,
- 976 were `'Metallic'`,
- 412 were `'Glow in the Dark'`, and
- the rest were `'Common'`.

We can calculate the **empirical** type distribution using survey data and store it in an array as well (in the same order as before):

In [89]:
# Just run this cell, do not change it!
empirical_type_distribution = np.array([818, 976, 412, (15525 - 818 - 976 - 412)]) / 15525
empirical_type_distribution

**Question 6.1.** Let's perform a hypothesis test to determine whether the tutor's model is accurate. Note that this hypothesis test is different from the ones performed in Questions 4 and 5, since we aren't just looking at one number or one proportion, but rather four proportions – one for each of `'Gold'`, `'Metallic'`, `'Glow in the Dark'`, and `'Common'`.

Which of the following is **not** a reasonable choice of test statistic for this hypothesis test? Save your choice in the variable `unreasonable_test_statistic`. You may only choose one.
1. The total variation distance between the proposed distribution (expected proportion of types) and the empirical distribution (actual proportion of types).
2. The sum of the absolute differences between the proposed distribution (expected proportion of types) and the empirical distribution (actual proportion of types).
3. The absolute difference between the sum of the proposed distribution (expected proportion of types) and the sum of the empirical distribution (actual proportion of types).

In [90]:
unreasonable_test_statistic = ...

In [None]:
grader.check("q6_1")

**Question 6.2.** We'll use the TVD, i.e. **total variation distance**, as our test statistic. Below, complete the implementation of the function `total_variation_distance`, which takes in two distributions (stored as arrays) as arguments and returns the total variation distance between the two arrays.

Then, use the function `total_variation_distance` to determine the TVD between the type distribution proposed by the tutor and the empirical type distribution observed. Assign this TVD to `observed_tvd`.
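
As a reminder, the total variation distance between two categorical distributions $P = (p_1, \dots, p_k)$ and $Q = (q_1, \dots, q_k)$ is half the sum of the absolute differences between corresponding proportions. The worked example below uses two made-up distributions over three categories. They are not related to the Mini Brands data.

$$\text{TVD}(P, Q) = \frac{1}{2} \sum_{i=1}^{k} |p_i - q_i|$$

$$P = (0.5,\ 0.3,\ 0.2), \quad Q = (0.4,\ 0.4,\ 0.2) \quad \Longrightarrow \quad \text{TVD}(P, Q) = \frac{1}{2}\left(|0.1| + |-0.1| + |0|\right) = 0.1$$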

In [93]:
def total_variation_distance(first_distrib, second_distrib):
    '''Computes the total variation distance between two distributions.'''
    ...

observed_tvd = ...
observed_tvd

In [None]:
grader.check("q6_2")

**Question 6.3.** Now, we'll calculate 5000 simulated TVDs to see what a typical TVD between the proposed distribution and an empirical distribution would look like if the tutor's model were accurate. Since our real-life data includes 15525 replicas, in each trial of the simulation, we'll:
- draw 15525 replicas at random from the tutor's proposed distribution, then 
- calculate the TVD between the **type distribution proposed by the tutor** and the **empirical type distribution from the simulated sample**. 

Store these 5000 simulated TVDs in an array called `simulated_tvds`.
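
One trial of such a simulation needs an empirical distribution, that is, proportions rather than counts. The sketch below shows that single step for a made-up three-category model. The names `toy_model` and `toy_sample_size` and their values are hypothetical, and repeating the step and computing a statistic each time is left to you.

```python
import numpy as np

np.random.seed(1)  # only so that this example is reproducible

# Hypothetical model and sample size, for illustration only.
toy_model = np.array([0.6, 0.3, 0.1])
toy_sample_size = 200

# Counts of each category in one random sample drawn from the model...
toy_counts = np.random.multinomial(toy_sample_size, toy_model)
# ...converted to proportions, which together sum to 1.
toy_empirical = toy_counts / toy_sample_size
toy_empirical
```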

In [98]:
simulated_tvds = ...

# Visualize the distribution of TVDs with a histogram
bpd.DataFrame().assign(simulated_tvds=simulated_tvds).plot(kind='hist', density=True, ec='w', figsize=(10, 5));
plt.axvline(x=observed_tvd, color='red', label='observed TVD')
plt.legend();

In [None]:
grader.check("q6_3")

**Question 6.4.** Now, we check the p-value of our test by computing the proportion of times in our simulation that we saw a TVD greater than or equal to our observed TVD. Assign your result to `type_p_value`.

Additionally, conclude whether we should reject the null hypothesis at the standard 0.05 significance level. Set the variable `type_null` below to `True` if you think we should fail to reject the null hypothesis or `False` if you think the null hypothesis should be rejected.

In [101]:
type_p_value = ...
type_null = ...
type_p_value, type_null

In [None]:
grader.check("q6_4")

It looks like our tutor didn't do such a good job at proposing a model!

## Finish Line 🏁

To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.

In [105]:
grader.check_all()