In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("lab.ipynb")

# Lab 4 – Hypothesis and Permutation Testing

## DSC 80, Spring 2023

### Due Date: Monday, May 1st at 11:59 PM

## Instructions
Welcome to the fourth lab assignment in DSC 80 this quarter!

Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and Markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding will be done in an accompanying `lab.py` file that is imported into the current notebook, and **you will only submit that `lab.py` file**, not this notebook!

Some additional guidelines:
- **Unlike in DSC 10, labs will have both public tests and hidden tests.** The bulk of your grade will come from your scores on hidden tests, which you will only see on Gradescope after the assignment deadline.
- **Do not change the function names in the `lab.py` file!** The functions in the `lab.py` file are how your assignment is graded, and they are graded by their name. If you changed something you weren't supposed to, you can find the original code in the [course GitHub repository](https://github.com/dsc-courses/dsc80-2023-wi).
- Notebooks are nice for testing and experimenting with different implementations before designing your function in your `lab.py` file. You can write code here, but make sure that all of your real work is in the `lab.py` file, since that's all you're submitting.
- **To ensure that all of your work to be submitted is in `lab.py`, we've provided an additional uneditable notebook, called `lab-validation.ipynb`, that contains only the tests and their setup. Make sure you are able to run it top-to-bottom without error before submitting!**
- You are encouraged to write your own additional helper functions to solve the lab, as long as they also end up in `lab.py`.

**Importing code from `lab.py`**:

* Below, we import the `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab.py` in the notebook.
    - `autoreload` is necessary because Python caches a module after its first import (in `sys.modules`). Subsequent imports of `lab` reuse the cached module instead of re-executing the updated `lab.py`.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from lab import *

In [4]:
import pandas as pd
import numpy as np
import io
import os

## Part 1: Time Series Data

**Note: You should not use `for`-loops at all in this part!**

Imagine that you own an online store and you'd like to monitor the visits to your site. You've collected information about different login dates and times for different users and stored it in `data/login_table.csv`. Some users are unique, while some visited your store multiple times.

Answer the questions below to better understand the login patterns of your users.

### Question 1 – Prime Time ⏰

Complete the implementation of the function `prime_time_logins`, which takes in a DataFrame like `login` and outputs a DataFrame indexed by `'Login Id'`, counting the number of prime-time logins for each user – that is, the number of logins that were between 4 PM (inclusive) and 8 PM (exclusive) for each user. The DataFrame should have just one column, named `'Time'`.

For example, if a user logs in at 5 PM on Day 1, at 1 PM on Day 2, at 6 PM on Day 2, and at 7 PM on Day 2, then their total number of prime-time logins is 3. Note that the values in your returned DataFrame should only include counts, not timestamp objects.

***Note:*** You do not need to use Python's `datetime` module – instead, use the built-in `pandas` methods for working with times that we introduced in [Lecture 6](https://dsc80.com/resources/lectures/lec06/lec06.html) (though you may need to do a bit more research to fully answer the question).
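
As a rough illustration of the `pandas` time tools mentioned in the note above, the sketch below flags timestamps whose hour falls in a half-open window and counts them per user. The column names `'Login Id'` and `'Time'` come from the question; the rows and the variable names are made up. It shows one possible approach on toy data, not the required implementation.

```python
import pandas as pd

# Hypothetical toy data; the real login table may be formatted differently.
toy = pd.DataFrame({
    'Login Id': [1, 1, 1, 2],
    'Time': ['2023-01-01 17:00:00', '2023-01-02 13:00:00',
             '2023-01-02 19:59:59', '2023-01-03 20:00:00'],
})

# Parse the strings into timestamps so the .dt accessor is available.
times = pd.to_datetime(toy['Time'])

# True when the hour is in [16, 20), i.e. 4 PM inclusive to 8 PM exclusive.
in_window = (times.dt.hour >= 16) & (times.dt.hour < 20)

# Summing booleans per user counts the True values.
counts = in_window.groupby(toy['Login Id']).sum()
print(counts)
```

In this toy table, the 8:00 PM login is excluded because the upper bound is exclusive.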

In [14]:
# don't change this cell -- it is needed for the tests to work
fp = os.path.join('data', 'login_table.csv')
login = pd.read_csv(fp)
q1_result = prime_time_logins(login)
q1_result

Login Id,Time
381,1
393,12
412,13
413,64
419,2
...,...
1302,2
1304,0
1305,1
1306,1


In [15]:
grader.check("q1")

### Question 2 – Return Users 🔁

As a site owner, you would like to find your most enthusiastic users – the ones who return to your site most frequently. You've noticed that there are users who have several logins and users who logged in only once. You are interested in finding the number of logins *per day* for each user.

Complete the implementation of the function `count_frequency`, which takes in a DataFrame like `login` and outputs a Series containing the number of logins per day for each user. Your Series should have `'Login Id'`s in its index, and the frequencies as its values. The order of users in the index is arbitrary.

To do this, you can assume today is January 31, 2023. The first login date of a user is the first day of their membership on the site, and you may assume they are still a member today. For simplicity, you only need to count full days that a user has been a member through the end of today. For example, if a user's first login was 12 days and 5 hours ago, you can say that they have been a user for 12 days.

***Hint:*** Can you write a custom aggregator that allows you to do this with just one `.groupby`?
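
To make the idea of a custom aggregator concrete, here is a minimal sketch on made-up data. The function name `logins_per_day`, the toy rows, and the choice of `2023-01-31 23:59:59` as "the end of today" are all assumptions for illustration; your own interpretation and implementation may differ.

```python
import pandas as pd

# Hypothetical toy data with already-parsed timestamps.
toy = pd.DataFrame({
    'Login Id': [1, 1, 2],
    'Time': pd.to_datetime(['2023-01-21 08:00:00',
                            '2023-01-25 09:30:00',
                            '2023-01-29 23:00:00']),
})

# Assumption: "the end of today" is the last second of January 31, 2023.
end_of_today = pd.Timestamp('2023-01-31 23:59:59')

def logins_per_day(times):
    # `times` is the Series of login timestamps for a single user.
    # .days keeps only the whole days of the elapsed time.
    full_days = (end_of_today - times.min()).days
    return len(times) / full_days

# A single groupby, with the custom function applied to each user's timestamps.
freq = toy.groupby('Login Id')['Time'].agg(logins_per_day)
print(freq)
```

Note that this sketch would divide by zero for a user whose first login is less than one full day old; how to treat that case is left open here.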

In [None]:
# don't change this cell -- it is needed for the tests to work
fp = os.path.join('data', 'login_table.csv')
login = pd.read_csv(fp)
q2_result = count_frequency(login)

In [None]:
grader.check("q2")

## Part 2: Relational Algebra

Recall, in [Lecture 7](https://dsc80.com/resources/lectures/lec07/lec07.html#Relational-algebra), we briefly introduced the concept of relational algebra, which is a mathematical system for describing operations performed on relations. (For now, you can think of relations and DataFrames as being interchangeable – but don't tell your DSC 100 instructor.)

In lecture, we introduced five relational operators, one of which was the _cross product_, $A \times B$, which pairs every row in $A$ with every row in $B$. At the time, the cross product didn't seem all that useful, but in this part we'll explore its utility more.

Let's start by loading in two DataFrames: `customers` and `products`.
- `customers` contains one row for each of several (very familiar sounding) customers at TritonBank.
- `products` contains one row for each of several phones on sale at the UCSD Bookstore.

In [None]:
customers_fp = os.path.join('data', 'customers.csv')
customers = pd.read_csv(customers_fp)
customers.head()

In [None]:
products_fp = os.path.join('data', 'products.csv')
products = pd.read_csv(products_fp)
products.head()

### Question 3 – Most Expensivest 💰📱

Complete the implementation of the function `most_expensive_per_customer`, which takes in two DataFrames: one like `customers` and one like `products`. It should return a new DataFrame that has **one row for every customer in `customers`** and the following columns:
- `'Name'`.
- `'Email'`.
- `'Model'`, the name of the most expensive phone in `products` that the customer can afford, given their bank account balance.
- `'Price'`, the cost of the most expensive phone in `products` that the customer can afford, given their bank account balance.

The order of the rows in the output DataFrame does not matter. 

For example, the most expensive phone that Daniel can buy with his $837 is the iPhone 14, so `most_expensive_per_customer(customers, products)` should contain a row, somewhere, with the following information (though with a possibly different index):

<table border="1" class="dataframe">
  <tbody>
    <tr>
      <th>4</th>
      <td>Li, Daniel</td>
      <td>ddli@ucsd.edu</td>
      <td>iPhone 14</td>
      <td>799</td>
    </tr>
  </tbody>
</table>


We'll allow you to make the following simplifying assumptions:
- Each customer can afford at least one phone.
- Each phone has a unique price, i.e. there are no ties.
- There are no null values in either input DataFrame.
- There are no duplicate rows in either input DataFrame.

***Hint***: In relational algebra, your first step is to execute the following expression:

$$\sigma_{\text{customers.Balance } \geq \text{ products.Price}} \: \: (\text{customers} \times \text{products})$$

You may also want to look at [Lecture 5](https://dsc80.com/resources/lectures/lec05/lec05.html) for a hint.
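
The relational algebra expression in the hint can be sketched in `pandas` roughly as follows. The column names `'Balance'` and `'Price'` appear in the expression above; the two tiny DataFrames and their values are hypothetical, and `how='cross'` assumes pandas 1.2 or later.

```python
import pandas as pd

# Hypothetical stand-ins for DataFrames like `customers` and `products`.
toy_customers = pd.DataFrame({'Name': ['x', 'y'], 'Balance': [500, 900]})
toy_products = pd.DataFrame({'Model': ['p', 'q'], 'Price': [400, 800]})

# Cross product: every customer row paired with every product row.
crossed = toy_customers.merge(toy_products, how='cross')

# Selection: keep only the pairs where the customer can afford the product.
selected = crossed[crossed['Balance'] >= crossed['Price']]
print(selected)
```

With 2 customers and 2 products the cross product has 4 rows, and the selection keeps the 3 affordable pairs. This covers only the first step; choosing one row per customer is a separate step.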

***Note***: Don't use a `for`-loop, and don't write a helper function and use the `apply` method. Instead, think about how to use the hint above directly. Also, remember that the hidden tests may test your implementation on DataFrames _like_ `customers` and `products`, in that they contain the same column names and the same types of information, but possibly different values.

In [None]:
# don't change this cell -- it is needed for the tests to work
customers_fp = os.path.join('data', 'customers.csv')
customers = pd.read_csv(customers_fp)
products_fp = os.path.join('data', 'products.csv')
products = pd.read_csv(products_fp)
q3_out = most_expensive_per_customer(customers, products)
q3_out

In [None]:
grader.check("q3")

## Part 3: Hypothesis Testing

In this section, you'll recall the terms and structure of hypothesis testing from DSC 10.

The first step is always to define what you're looking at, create your hypotheses, and set a level of significance (i.e. a p-value cutoff). Once you've done that, you can find a p-value.

If all of these words are foreign, look at the [Lecture 9](https://dsc80.com/resources/lectures/lec09/lec09.html) notebook and the readings, and don't forget to think about the real-world meaning of these terms!  The following example describes a real-world scenario, which should help keep it easy to interpret.

Note that you **can** use `for`-loops to conduct hypothesis and permutation tests in assignments.

### Question 4 – Surf's Up 🏄

In San Diego, students are looking to surf in their free time. There is a pop-up surf store on Library Walk selling wet suits and surfboards to students. Last Saturday, this store sold 250 wet suits to UCSD students. After a surf session, 10 students complained that their wet suits had tears in them, letting cold ocean water rush into their suits. In response to the student dissatisfaction, the store claims that 98% of their wet suits are produced without any manufacturing defects. You think this seems unlikely and decide to investigate.

First, select a significance level for your investigation. You don't need to turn this in anywhere. Then, complete the implementation of the following three functions.

#### `suits_null_hyp`

Complete the implementation of the function `suits_null_hyp`, which has no parameters and returns your answer to the following question **as a list**.

What are reasonable choices for the **null hypothesis** for your investigation? Select all that apply.
1. The store sells wet suits that are approximately 2% defective.
2. The store sells wet suits that are 98% non-defective.
3. The store sells wet suits that are less than 98% non-defective.
4. The store sells wet suits that are at least 2% defective.

<br>

#### `simulate_suits_null`

Complete the implementation of the function `simulate_suits_null`, which simulates a single step of the data generation process under the null hypothesis. The function should return a binary array, i.e. an array of 0s and 1s, of length 250. It is up to you to decide what the 0s and 1s mean.

***Hints:*** `np.random.choice` might be useful in this case.
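
For reference, a minimal sketch of drawing a binary array with `np.random.choice` is shown below. The helper name `simulate_binary` and the parameters used (100 draws, probability 0.3) are hypothetical and unrelated to the wet suit scenario.

```python
import numpy as np

def simulate_binary(n, p_one):
    # Draw n values from {0, 1}, where 1 is chosen with probability p_one.
    return np.random.choice([0, 1], size=n, p=[1 - p_one, p_one])

draw = simulate_binary(100, 0.3)
print(draw.sum())  # number of 1s in this particular draw
```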

<br>

#### `estimate_suits_p_val`

Complete the implementation of the function `estimate_suits_p_val`, which takes in an integer `N` and returns the estimated p-value of your investigation upon simulating the null hypothesis `N` times.

***Note***: Plot the null distribution and your observed statistic to check your work. (If you decide to plot, you may have to run `import matplotlib.pyplot as plt` or `import plotly.express as px`.)
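
The general pattern of estimating a p-value by simulation can be sketched with a different, made-up scenario: a coin assumed fair under the null that landed heads 60 times in 100 flips. The function name, the seed, and all numbers here are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(80)  # seeded only so the sketch is reproducible

def estimate_p_value(observed_heads, n_flips, n_sims):
    # Each simulated statistic is the number of heads in n_flips fair flips.
    sims = rng.binomial(n_flips, 0.5, size=n_sims)
    # Proportion of simulations at least as extreme as the observed statistic.
    return (sims >= observed_heads).mean()

p_val = estimate_p_value(observed_heads=60, n_flips=100, n_sims=10_000)
print(p_val)
```

The direction of the comparison (`>=` here) depends on the alternative hypothesis, so it needs to be chosen to match the investigation at hand.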

In [50]:
import plotly.express as px

In [None]:
grader.check("q4")

Now that we've gotten our feet wet with hypothesis testing, let's take a closer look at how to choose null and alternative hypotheses and test statistics.

### Question 5 – Tires 🚗

A tire manufacturer, TritonTire, claims that their tires are so good, they will bring a Toyota Highlander from 60 mph to a complete stop in under 106 feet, 97% of the time.

Now, you own a Toyota Highlander equipped with TritonTire tires, and you decide to test this claim. You take your car to an empty Vons parking lot, speed up to exactly 60 mph, hit the brakes, and measure the stopping distance. As illegal as it is, you repeat this process 50 times and find that **you stopped in under 106 feet only 47 of the 50 times**.

Livid, you call TritonTire and say that their claim is false. They say, no, that you were just unlucky: your experiment is consistent with their claim. But they didn't realize that they are dealing with a *data scientist* 🧑‍🔬.

To settle the matter, you decide to unleash the power of the hypothesis test. The following three subparts ask you to answer a total of four select-all multiple choice questions.

#### Question 5.1

You will set up a hypothesis test in order to test your suspicion that the tires are actually worse than claimed. Which of the following are valid null and alternative hypotheses for this hypothesis test?

1. The tires will stop your car in under 106 feet exactly 97% of the time.
2. The tires will stop your car in under 106 feet less than 97% of the time.
3. The tires will stop your car in under 106 feet greater than 97% of the time.
4. The tires will stop your car in more than 106 feet exactly 3% of the time.
5. The tires will stop your car in more than 106 feet less than 3% of the time.
6. The tires will stop your car in more than 106 feet greater than 3% of the time.

Complete the implementation of the function `car_null_hypoth`, which takes zero arguments and returns a list of integers, corresponding to the valid null hypotheses above.
Also complete the implementation of the function called `car_alt_hypoth`, which takes zero arguments and returns a list of integers, corresponding to the valid alternative hypotheses above.

<br>

#### Question 5.2

Which of the following are valid test statistics for our question?

1. The number of times the car stopped in under 106 feet in 50 attempts.
2. The average number of feet the car took to come to a complete stop in 50 attempts.
3. The number of attempts it took before the car stopped in under 95 feet.
4. The proportion of attempts in which the car stopped in under 106 feet in 50 attempts.

Complete the implementation of the function `car_test_stat`, which takes zero arguments and returns a list of integers, corresponding to the valid test statistics above.

<br>

#### Question 5.3

The p-value is the probability, under the assumption the null hypothesis is true, of observing a test statistic **equal to our observed statistic, or more extreme in the direction of the alternative hypothesis**.

Why don't we just look at the probability of observing a test statistic equal to our observed statistic? That is, why is the "more extreme in the direction of the alternative hypothesis" part necessary?

1. Because our observed test statistic isn't extreme.
2. Because our null hypothesis isn't suggesting equality.
3. Because our alternative hypothesis isn't suggesting equality.
4. Because the probability of finding our observed test statistic equals the probability of finding something more extreme.
5. Because if we run more and more trials (where a trial is speeding up the car then stopping), the probability of finding *any* particular observed test statistic gets closer and closer to zero, so if we did this we would always reject the null with more trials even if the null is true. For example, flipping a fair coin twice means it’s pretty likely to see 50% heads, but flipping it 10000 times means it’s quite unlikely to see 50% heads.

Complete the implementation of the function `car_p_value`, which takes zero arguments and returns the correct reason as an integer (not a list).
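
The coin-flip figures mentioned in the last option above can be checked exactly with the binomial formula. This is a small side calculation; the function name is hypothetical.

```python
from math import comb

def prob_exactly_half_heads(n_flips):
    # P(exactly n_flips / 2 heads) for a fair coin; n_flips is assumed even.
    return comb(n_flips, n_flips // 2) / 2 ** n_flips

print(prob_exactly_half_heads(2))       # 0.5
print(prob_exactly_half_heads(10_000))  # roughly 0.008
```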

In [None]:
grader.check("q5")

### Question 6 – Superheroes 🦸

In the previous two questions, we ran hypothesis tests that didn't require us to look at stored data. In this next question, we'll return to the `heroes` DataFrame from Lab 2, which is read in from the file `data/superheroes.csv`.

Our goal in this section will be to answer the question:

> Are blond-haired, blue-eyed characters significantly **more** "good" than the general pool of characters?

#### `bhbe_col`

To start, complete the implementation of the function `bhbe_col`, which takes in a DataFrame like `heroes` and returns a Boolean Series that contains `True` for characters that have **both** blond hair and blue eyes, and `False` for all other characters. 

***Note***: If a character's hair color contains the word `'blond'`, uppercase or lowercase, we consider their hair to be blond for the purposes of this question. Similarly, if a character's eye color contains the word `'blue'`, uppercase or lowercase, we consider their eye color to be blue for the purposes of this question.
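
Case-insensitive substring matching of the kind described in the note can be sketched with `Series.str.contains`. The column names `'hair'` and `'eyes'` and the rows below are hypothetical; the actual column names in `heroes` may differ.

```python
import pandas as pd

# Hypothetical toy data, including a missing hair color.
toy = pd.DataFrame({
    'hair': ['Blond', 'Strawberry Blond', 'black', None],
    'eyes': ['blue', 'Green', 'Blue', 'blue'],
})

# case=False ignores capitalization; na=False treats missing values as non-matches.
is_blond = toy['hair'].str.contains('blond', case=False, na=False)
is_blue = toy['eyes'].str.contains('blue', case=False, na=False)

# Element-wise "and" of the two Boolean Series.
both = is_blond & is_blue
print(both.tolist())  # [True, False, False, False]
```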

<br>

Now that you have an easy way of accessing only the blond-haired, blue-eyed characters in `heroes`, you can proceed with a hypothesis test. You choose the following null hypothesis:

> The proportion of "good" characters among blond-haired, blue-eyed characters is equal to the proportion of "good" characters in the overall population.

Fix a significance level (i.e. p-value cutoff) of 1%.

Before proceeding, think about what test statistic to use in this hypothesis test (and to do that, read the initial question carefully). Once you've done that, complete the implementations of the following functions.

***Hint:*** Alternative hypothesis: the proportion of "good" characters among blond-haired, blue-eyed characters is different from the proportion of "good" characters in the overall population.

<br>

#### `superheroes_observed_stat`
Complete the implementation of the function `superheroes_observed_stat`, which takes in the DataFrame `heroes` and returns the observed test statistic.

<br>

#### `simulate_bhbe_null` 
Complete the implementation of the function `simulate_bhbe_null`, which takes in a positive integer `n` and returns an array of length `n`, where each element is a simulated test statistic according to the null hypothesis. You should hard-code the simulation parameter within your function; do not read in any data. (The simulation parameter is a proportion/probability; you can round it to two decimal places.)

***Hint:*** While you're not prohibited from using a `for`-loop, try avoiding one here. You can access columns of a multidimensional array the same way you access columns of a DataFrame using `iloc`.
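
One way to simulate many proportions at once without a `for`-loop is sketched below. The sample size of 50 and success probability of 0.5 are placeholder values, and the function name is hypothetical; none of them are the parameters for this question.

```python
import numpy as np

def simulate_proportions(n_sims, sample_size=50, p_success=0.5):
    # Each row holds [number of successes, number of failures] for one simulated sample.
    counts = np.random.multinomial(sample_size, [p_success, 1 - p_success], size=n_sims)
    # Column 0 is selected with array slicing, much like .iloc[:, 0] on a DataFrame.
    return counts[:, 0] / sample_size

props = simulate_proportions(1000)
print(props[:5])
```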

<br>

#### `superheroes_calc_pval` 
Complete the implementation of the function `superheroes_calc_pval`, which takes in no parameters and returns a list where:
* The first element is the p-value for the hypothesis test (using 100,000 simulations). Please run the code yourself **in your notebook** and hard-code this answer **in your `.py` file**, as actually running the 100,000 simulation hypothesis test will time out on Gradescope.
* The second element is `'Reject'` if you reject the null hypothesis and `'Fail to reject'` if you fail to reject the null hypothesis, at the 1% significance level.

In [43]:
# don't change this cell -- it is needed for the tests to work
superheroes_fp = os.path.join('data', 'superheroes.csv')
heroes = pd.read_csv(superheroes_fp, index_col=0)
bhbe_out = bhbe_col(heroes)

obs_stat_out = superheroes_observed_stat(heroes)

simulate_bhbe_out = simulate_bhbe_null(10)

pval_out = superheroes_calc_pval()
simulate_bhbe_out

array([0.72, 0.69, 0.69, 0.74, 0.71, 0.61, 0.76, 0.65, 0.63, 0.67])

In [44]:
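sims = simulate_bhbe_null(100_000)
# Count of simulated statistics at least as large as the observed one;
# dividing this count by 100_000 gives the estimated p-value.
num_extreme = (sims >= superheroes_observed_stat(heroes)).sum()
num_extreme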

13

In [45]:
grader.check("q6")

## Part 4: Permutation Testing

Recall, hypothesis tests answer questions of the form:

> I have a population distribution, and I have one sample. Does this sample look like it was drawn from the population?

While permutation tests answer questions of the form:

> I have two samples, but no information about any population distributions. Do these samples look like they were drawn from the same population?

Keep this in mind while working on this last part of the lab.

<br>

[Skittles](https://en.wikipedia.org/wiki/Skittles_(confectionery)) 🍬 are made in two locations in the United States: Yorkville, Illinois and Waco, Texas. In these factories, Skittles of different colors are made separately by different machines and combined/packaged into bags for sale. The **tab-separated file** `data/skittles.tsv` contains the contents of 468 bags of Skittles.

Throughout this question, we will compare the color distribution of Skittles between bags made in the Yorkville factory and bags made in the Waco factory. Most people have preferences for their favorite flavor, and there is a surprising amount of variation among the distribution of flavors in each bag.

Look at the variation by bag in the dataset below:

In [46]:
skittles_fp = os.path.join('data', 'skittles.tsv')
skittles = pd.read_csv(skittles_fp, sep='\t')
skittles.head()

Unnamed: 0,red,orange,yellow,green,purple,Factory
0,10,15,11,7,18,Yorkville
1,5,12,17,15,10,Yorkville
2,16,11,15,11,9,Waco
3,15,8,13,16,7,Waco
4,11,14,20,8,7,Waco


In [47]:
skittles.shape

(468, 6)

### Question 7 – Orange Skittles 🟠

First, you will investigate if the machine that mixes together the Skittles of different colors might favor one color over another. Use a permutation test to assess whether, on average, bags made in Yorkville have the same number of orange Skittles as bags made in Waco. Do this by implementing the functions described below.

<br>

#### `diff_of_means`

Complete the implementation of the function `diff_of_means`, which takes in a DataFrame like `skittles` and returns the **absolute difference** between the **mean** number of orange Skittles per bag from Yorkville and the **mean** number of orange Skittles per bag from Waco.

<br>

#### `simulate_null`

Complete the implementation of the function `simulate_null`, which takes in a DataFrame like `skittles` and returns one simulated instance of the test statistic under the null hypothesis. Note that this will involve shuffling the `'Factory'` column!
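
The shuffling step can be illustrated on a tiny made-up dataset. The column names `'value'` and `'group'`, the data, and the helper name are hypothetical, so treat this as a sketch of the general technique and not of the Skittles functions themselves.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two groups of three observations each.
toy = pd.DataFrame({
    'value': [3, 5, 4, 8, 9, 7],
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
})

def abs_diff_of_means(df):
    # Absolute difference between the two group means.
    means = df.groupby('group')['value'].mean()
    return abs(means['A'] - means['B'])

observed = abs_diff_of_means(toy)  # |4 - 8| = 4

# Shuffling the labels breaks any association between group and value,
# which is what the null hypothesis assumes.
shuffled = toy.assign(group=np.random.permutation(toy['group']))
simulated = abs_diff_of_means(shuffled)
print(observed, simulated)
```

Shuffling keeps the group sizes fixed (three `'A'` labels and three `'B'` labels); only the assignment of labels to values changes.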

<br>

#### `pval_color`

Complete the implementation of the function `pval_color`, which takes in a DataFrame like `skittles` and calculates the p-value for the permutation test using 1000 trials.

<br>

Plot the observed statistic, along with the histogram for the simulated distribution, to check your work.

***Note:*** In all functions, the default argument for `col` is `'orange'`. Your functions should still work for any color so that you can call it in later questions.

In [57]:
# don't change this cell -- it is needed for the tests to work
# cell may take about 1-2 minutes to execute to completion
skittles_fp = os.path.join('data', 'skittles.tsv')
skittles = pd.read_csv(skittles_fp, sep='\\t', engine='python')
q7_diff_of_means_out = diff_of_means(skittles)
q7_simulate_null_out = simulate_null(skittles)
q7_pval_out = pval_color(skittles)
q7_pval_out

0.039

In [58]:
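# The histogram needs an array of simulated statistics; q7_pval_out is a single p-value, so pd.DataFrame(q7_pval_out) raises a ValueError.
# sims = np.array([simulate_null(skittles) for _ in range(1000)])
# fig = px.histogram(x=sims, nbins=50, histnorm='probability')
# fig.add_vline(x=diff_of_means(skittles), line_color='red')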

In [59]:
grader.check("q7")

### Question 8 – Generalizing to all colors 🔴🟠🟡🟢🟣

While your `pval_color` function used a default color of `'orange'`, it should also work for all other colors of Skittles, meaning you can run the same permutation test from Question 7 on all colors of Skittles. Call `pval_color` on all colors of Skittles to find which colors differ the most between the two locations on average. 

Then, complete the implementation of the function `ordered_colors`, which returns a list of five ordered pairs, each of the form `('color', p_value)`. For example, your list might look like `[('pink', 0.000), ('brown', 0.025), ...]`. 

The list should be **hard-coded**, meaning that you should run your permutation tests in your notebook, not in your `.py` file. The list should also be sorted in **increasing order of p-value**. Make sure your p-values are rounded to **3 decimal places**.
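
Rounding and sorting a collection of p-values into the required shape can be sketched as follows. The colors `'pink'` and `'brown'` echo the example above, `'teal'` is invented, and all of the p-values are made up.

```python
# Hypothetical p-values keyed by color.
pvals = {'pink': 0.0004, 'brown': 0.0251, 'teal': 0.0102}

# Build (color, rounded p-value) pairs, then sort by the p-value in each pair.
ordered = sorted(
    ((color, round(p, 3)) for color, p in pvals.items()),
    key=lambda pair: pair[1],
)
print(ordered)  # [('pink', 0.0), ('teal', 0.01), ('brown', 0.025)]
```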

Even though there is randomness in the color composition of each bag, the p-values in this list indicate how strong the evidence is that the machines differ systematically in how they blend the colors in each bag: the smaller the p-value, the stronger the evidence.

In [68]:
# don't change this cell -- it is needed for the tests to work
q8_out = ordered_colors()
q8_colors = {'green', 'orange', 'purple', 'red', 'yellow'}
q8_test_colors = [x[0] for x in q8_out]

In [69]:
for color in q8_colors:
    print(color, pval_color(skittles, color))

orange 0.037
yellow 0.0
red 0.242
green 0.479
purple 0.971


In [70]:
grader.check("q8")

### Question 9 – Overall distributions 🏭

Now, suppose you would like to assess whether the two locations make similar amounts of each color overall. That is, suppose we:
* Combine and count up all the Skittles of each color that were made in Yorkville (e.g. 14303 total red Skittles, 9091 total green Skittles, etc.).
* Combine and count up all the Skittles of each color that were made in Waco.

**Are these distributions of colors similar?** Is the variation among the bags due to each factory making different amounts of each color?

Use a permutation test to assess whether the distribution of colors of Skittles made in Yorkville is statistically significantly different than those made in Waco. Set a significance level (i.e. p-value cutoff) of 0.01 and determine whether you can reject a null hypothesis that answers the question above using a permutation test with 1000 trials. For your test statistic, use the **total variation distance (TVD)**.

Refer to [Lecture 10](https://dsc80.com/resources/lectures/lec10/lec10.html) to see an example of a [permutation test](https://www.inferentialthinking.com/chapters/12/Comparing_Two_Samples.html) that uses the [TVD](https://inferentialthinking.com/chapters/11/2/Multiple_Categories.html) as the test statistic. Some guidance:

- Our previous permutation tests have compared the mean number of (say) orange Skittles in Yorkville bags to the mean number of orange Skittles in Waco bags. The role of shuffling was to randomly assign bags to Yorkville and Waco.
- In this permutation test, we are **still** shuffling to randomly assign bags to Yorkville and Waco. The only difference is that after we randomly assign each bag to a factory, we will compute the **distribution** of colors among the two factories and find the TVD between those two distributions.
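
As a reminder of the test statistic, the TVD between two categorical distributions is half the sum of the absolute differences between their proportions. The sketch below uses a hypothetical helper name and made-up three-category distributions.

```python
import numpy as np

def tvd(dist1, dist2):
    # Half the sum of absolute differences between two distributions
    # over the same categories, listed in the same order.
    return np.abs(np.asarray(dist1) - np.asarray(dist2)).sum() / 2

example = tvd([0.5, 0.3, 0.2], [0.4, 0.4, 0.2])
print(example)  # approximately 0.1
```

Each argument should sum to 1, and the result always lies between 0 (identical distributions) and 1 (no overlap).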

**Your job:** Complete the implementation of the function `same_color_distribution`, which takes in no arguments and outputs a hard-coded **tuple** with the p-value and whether you `'Fail to Reject'` or `'Reject'` the null hypothesis.

In [156]:
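def calc_tvd(data):
    # Per-factory totals of every remaining column (this includes 'Unnamed: 0' unless it is dropped first).
    by_factory = data.groupby('Factory').sum()
    # Convert each factory's totals to proportions that sum to 1.
    as_proportions = by_factory.div(by_factory.sum(axis=1), axis=0)
    # Sum of absolute differences between the two factories' proportions.
    # The TVD is conventionally half of this sum; the constant factor does not change the p-value.
    return as_proportions.diff().iloc[-1, :].abs().sum()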

In [157]:
observed = calc_tvd(skittles)
observed

0.040808228055829954

In [158]:
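tvds = []
for _ in range(1000):
    # Shuffle the factory labels to simulate the null hypothesis.
    shuffled = skittles.assign(Factory = np.random.permutation(skittles['Factory']))
    tvds.append(calc_tvd(shuffled))
print(tvds)
# Proportion of simulated statistics at least as large as the observed one.
(np.array(tvds) >= observed).mean()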

[0.021876406950217475, 0.010593075995924212, 0.044119948116523255, 0.013590733850180003, 0.01582204886002586, 0.029934780159481922, 0.030787932154100944, 0.020287800626796387, 0.01613599823869477, 0.012304434584373747, 0.02579750975946346, 0.03692017894984695, 0.03380162116125282, 0.029797178191190765, 0.011294247102480226, 0.010679793634564283, 0.02972438218458545, 0.03347577134654381, 0.014935288915512712, 0.027132400137700463, 0.021588414902385444, 0.02136167376465703, 0.02472990002693934, 0.018875851532758975, 0.023364572687953084, 0.01584473001435266, 0.019872454208638896, 0.01980039845698847, 0.012982822001214644, 0.02604964303078197, 0.01604664527278124, 0.022160872484737998, 0.015450674491710309, 0.00661533324986438, 0.020488817531796044, 0.019933812276488078, 0.027826760077642426, 0.02099005556396169, 0.032714255850929436, 0.034136212537495686, 0.020771473722554862, 0.023278076219267668, 0.03435184984239287, 0.017908138038292, 0.018806282718362072, 0.01660748338208476, 0.02689

0.007

In [159]:
# don't change this cell -- it is needed for the tests to work
q9_out = same_color_distribution()

In [160]:
grader.check("q9")

### Question 10 – Permutation testing vs. hypothesis testing 🧪

In each of the following scenarios, decide whether a permutation test is appropriate to determine if there is a significant difference between the quantities described. If a permutation test is appropriate, mark `'P'`. Otherwise, mark `'H'`.

Record your answers in the function `perm_vs_hyp` that outputs a list of length 5, containing the values `'P'` and `'H'`.

1. Compare the DSC 80 pass rate between second years and third years who take the class.
2. Compare the proportion of Data Science majors who have completed DSC 80 and the proportion of Data Science minors who have completed DSC 80.
3. Compare the proportion of students who have iPhones to the proportion of students who have Android phones (for simplicity, assume that all students either have an iPhone or an Android).
4. In DSC 80, we ask all students whether they liked DSC 40A or DSC 40B more. Compare the proportion of students who preferred DSC 40A to the proportion who preferred DSC 40B.
5. Compare the attendance rate of classes that play music before class vs. classes that do not play music before class.

***Hint:*** Think about the type of data you would collect in each case, and how you would simulate new data under the null hypothesis. It will be useful to refer to the explanation at the start of Part 4.

In [None]:
# don't change this cell -- it is needed for the tests to work
q10_out = perm_vs_hyp()

In [None]:
grader.check("q10")

## Congratulations! You're done with Lab 4! 🏁

As a reminder, all of the work you want to submit needs to be in `lab.py`.

To verify that all of your work is indeed in `lab.py`, and that you didn't accidentally implement a function in this notebook and not in `lab.py`, we've included another notebook in the lab folder, called `lab-validation.ipynb`. `lab-validation.ipynb` is a version of this notebook with only the `grader.check` cells and the code needed to set up the tests. 

### **Go to `lab-validation.ipynb`, and go to Kernel > Restart & Run All.** This will check if all `grader.check` test cases pass using just the code in `lab.py`.

Once you're able to pass all test cases in `lab-validation.ipynb`, including the call to `grader.check_all()` at the very bottom, then you're ready to submit your `lab.py` (and only your `lab.py`) to Gradescope. After submitting to Gradescope, make sure to stick around until all test cases pass.

There is also a call to `grader.check_all()` below in _this_ notebook, but make sure to also follow the steps above.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()