In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw06.ipynb")

# HW 06 - Permutation Testing and Missing Values 

You must submit this assignment to Gradescope by the on-time deadline. **We strongly encourage you to plan to submit your work to Gradescope several hours (ideally days) before the stated deadline.** This way, you will have ample time to reach out to staff for support if you encounter difficulties with submission. While course staff is happy to help guide you with submitting your assignment ahead of the deadline, we will not respond to last-minute requests for assistance (TAs need to sleep, after all!).

Please read the instructions carefully when you are submitting your work to Gradescope.



### Debugging Guide
If you run into any technical issues, we highly recommend checking out the [Debugging Guide](https://mtu.instructure.com/courses/1527249/pages/debugging-guide). This guide covers common questions about Jupyter notebooks / Datahub, Gradescope, and frequent pandas errors.

In [None]:
import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
sns.set()
plt.style.use('fivethirtyeight')

from scipy import stats

import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import plotly.io as pio
pd.options.plotting.backend = 'plotly'


def create_kde_plotly(df, group_col, group1, group2, vals_col, title=''):
    fig = ff.create_distplot(
        hist_data=[df.loc[df[group_col] == group1, vals_col], df.loc[df[group_col] == group2, vals_col]],
        group_labels=[group1, group2],
        show_rug=False, show_hist=False
    )
    return fig.update_layout(title=title)


In [None]:
# set up random number generator 
rng_seed = 42
np.random.seed(rng_seed)

<br/><br/>
<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

# Part 1 - Permutation Testing 

Recall, hypothesis tests answer questions of the form: 

    I have a population distribution, and I have one sample. Does this sample look like it was drawn from the population? 

Permutation tests, by contrast, answer questions of the form: 

    I have two samples, but no information about any population distributions.  Do these samples look like they were drawn from the same population? 

Keep this in mind while working on this assignment. 
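
As a rough illustration of the second kind of question, here is a minimal sketch of a permutation test on two made-up samples. The data, the seed, and the names `sample_a`, `sample_b`, and `abs_diff_of_means` are assumptions for illustration only and are not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; the seed is arbitrary

# Two hypothetical samples (made-up data, not from the assignment).
sample_a = rng.normal(loc=10.0, scale=2.0, size=50)
sample_b = rng.normal(loc=11.0, scale=2.0, size=60)

def abs_diff_of_means(a, b):
    """Test statistic: absolute difference between the two sample means."""
    return abs(a.mean() - b.mean())

observed = abs_diff_of_means(sample_a, sample_b)

# Under the null, group labels are interchangeable, so pool and reshuffle.
pooled = np.concatenate([sample_a, sample_b])
n_a = len(sample_a)
n_trials = 1000
simulated = np.empty(n_trials)
for i in range(n_trials):
    shuffled = rng.permutation(pooled)
    simulated[i] = abs_diff_of_means(shuffled[:n_a], shuffled[n_a:])

# p-value: share of shuffled statistics at least as extreme as the observed one.
p_value = np.mean(simulated >= observed)
print(f"observed = {observed:.3f}, p-value = {p_value:.3f}")
```

Note that no population distribution appears anywhere in this sketch: the null distribution is built entirely from reshuffling the two samples.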

## Question 1 

Skittles are made in two locations in the US: Yorkville, IL and Waco, TX.  In these factories, Skittles of different colors are made separately by different machines and combined/packaged into bags for sale. 

The **tab-separated file** `data/skittles.tsv` contains the contents of 468 bags of Skittles. 

Throughout this question, we will compare the color distribution of Skittles between bags made in the Yorkville factory and bags made in the Waco factory.  Most people have preferences for their favorite flavor/color, and there is a surprising amount of variation among the distribution of flavors in each bag. 

Look at the variation by bag in the dataset below: 

In [None]:
skittles_fp = Path('data') / 'skittles.tsv'
skittles = pd.read_csv(skittles_fp, sep='\t')
skittles.head()

In [None]:
skittles.shape

<br>

--- 

### Question 1a - Orange Skittles 

First, you will investigate if the machine that mixes together Skittles of different colors might favor one color over another.  Use a permutation test to assess whether, on average, bags made in Yorkville have the same number of orange skittles as bags made in Waco.  

You will implement the following functions. 

`diff_of_means`  
The function takes a DataFrame like `skittles` and a column specifying a color (defaults to orange) as input.  The function returns the **absolute difference** between the **mean** number of orange skittles per bag from Yorkville and the **mean** number of orange skittles per bag from Waco. 

`simulate_null`  
The function takes in a DataFrame like `skittles` and a column specifying a color (defaults to orange) as input.  The function returns one simulated instance of the test statistic under the null hypothesis.  This will involve shuffling either the `'Factory'` column or the color column. 

`color_p_value`  
The function takes in a DataFrame like `skittles` and a column specifying a color (defaults to orange) as input.  The function calculates the p-value for the permutation test using 1000 trials.  The function returns a list containing the p-value followed by the Series-like collection of simulated test statistic values. 

`plot_q1a_dist`  
The function takes in a Series like the simulated distribution values, and the observed statistic.  The function plots a histogram of the simulated distribution with the observed statistic marked on it, and returns None. 
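
The shuffling step described above can be sketched on a toy table. This is only an illustration under assumed names: the columns `group` and `count` and the six rows are made up and do not reflect the `skittles` schema.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names (not the skittles schema).
toy = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "count": [10, 12, 11, 14, 15, 13],
})

# Observed group means and their absolute difference.
means = toy.groupby("group")["count"].mean()
observed = abs(means["A"] - means["B"])

# One shuffle: permute the labels and leave the values in place.
rng = np.random.default_rng(1)
shuffled = toy.assign(group=rng.permutation(toy["group"].to_numpy()))
shuffled_means = shuffled.groupby("group")["count"].mean()
one_simulated = abs(shuffled_means["A"] - shuffled_means["B"])

print(observed, one_simulated)
```

Shuffling only one of the two columns is enough, because what matters is breaking the pairing between label and value; group sizes stay the same either way.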



In [None]:
def diff_of_means(df, col='orange'): 
    # Input: a DataFrame like "skittles" and column specifying color to investigate
    # Return: the abs. difference between the mean num. of orange at Yorkville 
    #  and the mean number of orange skittles per bag in Waco. 

    diff = ...
    
    return diff

In [None]:
def simulate_null(df, col='orange'): 
    # Input: a DataFrame like "skittles" and column specifying color to investigate
    # Return: one simulated instance of the test statistic under the null hypothesis

    difference = ...
    
    return difference

In [None]:
def color_p_value(df, col='orange'): 
    # Input: a DataFrame like "skittles" and column specifying color to investigate
    # Return: list of the p-value, Series-like simulated distribution values of 
    #    the test statistic

    differences = ...

    pval = ...
    
    return [pval, differences]

In [None]:
def plot_q1a_dist(diffs, obs_val): 

    return None

In [None]:
# Do not change this cell - it will test your code. 
# It may take about 1-2 minutes to run
q1a_diff_of_means = diff_of_means(skittles)
q1a_simulate_null = simulate_null(skittles)
q1a_pval_out = color_p_value(skittles)
plot_q1a_dist(q1a_pval_out[1], q1a_diff_of_means)

In [None]:
grader.check("q1a")

<br>

---

### Question 1b - Generalize to all colors 

While your `color_p_value` function used a default color of `'orange'`, it should also work for all other colors of Skittles, meaning you can run the same permutation test from Question 1a on all colors of Skittles. Call `color_p_value` on all colors of Skittles to find which colors differ the most between the two locations on average. 

Create `q1b_out`, a list of five ordered pairs, each of the form `('color', p_value)`. For example, your list might look like `[('pink', 0.000), ('brown', 0.025), ...]`.  Try using [list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions). 

The list should also be sorted in **increasing order of p-value**. Make sure your p-values are rounded to **3 decimal places**.

Even though the color composition of each bag is random, this list indicates how strong the evidence is that the machines differ systematically and meaningfully in how they blend the colors: the smaller the p-value, the stronger the evidence.
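
The list-building and sorting pattern can be sketched as follows. The names and p-values in `example_pvals` are made up for illustration and are not results from the data.

```python
# Made-up p-values for illustration only; these are not results from the data.
example_pvals = {"pink": 0.03149, "brown": 0.0004, "blue": 0.51262}

# Build (name, rounded p-value) pairs with a list comprehension...
pairs = [(name, round(p, 3)) for name, p in example_pvals.items()]

# ...then sort by the second element of each pair (the p-value).
pairs_sorted = sorted(pairs, key=lambda pair: pair[1])
print(pairs_sorted)
```

Passing `key=lambda pair: pair[1]` tells `sorted` to compare the p-values rather than the names.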

In [None]:
# Create a list of ordered pairs of the form ('color', pval), sorted by 
#  increasing order of pvalue

colors = ['green', 'orange', 'purple', 'red', 'yellow']
q1b_out = ... 

# Do not change this part of the cell - it is used for testing 

q1b_test_colors = [x[0] for x in q1b_out]

In [None]:
grader.check("q1b")

<br>

--- 

### Question 1c 

Now, suppose you would like to assess whether the two locations make similar amounts of each color overall. That is, suppose we:
* Combine and count up all the Skittles of each color that were made in Yorkville (e.g. 14303 total red skittles, 9091 total green skittles, etc.).
* Combine and count up all the Skittles of each color that were made in Waco.

Then, to compare the two locations on the same scale, convert those counts to proportions. That is, suppose we:
* Calculate the proportion of each Skittles color that were made in Yorkville (e.g. out of the 14704 skittles made, 19.8% of them are red,  18.9% of them are green, etc.).
* Calculate the proportion of each Skittles color that were made in Waco.

**Are these distributions of colors similar?** Is the variation among the bags due to each factory making different amounts of each color?

Use a permutation test to assess whether the distribution of colors of Skittles made in Yorkville is statistically significantly different than those made in Waco. Set a significance level (i.e. p-value cutoff) of 0.01 and determine whether you can reject a null hypothesis that answers the question above using a permutation test with 1000 trials. For your test statistic, use the **total variation distance (TVD)**.

Refer to the end of Lecture 9 - Permutation-testing-meets-TVD to see an example of a [permutation test](https://www.inferentialthinking.com/chapters/12/Comparing_Two_Samples.html) that uses the [TVD](https://inferentialthinking.com/chapters/11/2/Multiple_Categories.html) as the test statistic. Some guidance:

- Our previous permutation tests have compared the mean number of (say) orange Skittles in Yorkville bags to the mean number of orange Skittles in Waco bags. The role of shuffling was to randomly assign bags to Yorkville and Waco.
- In this permutation test, we are **still** shuffling to randomly assign bags to Yorkville and Waco. The only difference is that after we randomly assign each bag to a factory, we will compute the **distribution** of colors among the two factories and find the TVD between those two distributions.
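
For reference, the TVD between two categorical distributions is half the sum of the absolute differences between their proportions. A minimal sketch, using made-up counts and a hypothetical helper name `total_variation_distance`:

```python
import numpy as np

def total_variation_distance(dist1, dist2):
    """Half the sum of absolute differences between two categorical distributions."""
    dist1 = np.asarray(dist1, dtype=float)
    dist2 = np.asarray(dist2, dtype=float)
    return np.abs(dist1 - dist2).sum() / 2

# Hypothetical category counts for two groups (made-up numbers).
counts_x = np.array([30, 20, 50])
counts_y = np.array([25, 25, 50])

# Convert counts to proportions so each distribution sums to 1.
props_x = counts_x / counts_x.sum()
props_y = counts_y / counts_y.sum()

tvd = total_variation_distance(props_x, props_y)
print(tvd)
```

The TVD is 0 when the two distributions are identical and 1 when they share no categories, which makes it a natural single-number summary for comparing whole distributions.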

**Your job**: Complete the implementation of the function `same_color_distribution`, which takes in no arguments and outputs a hard-coded **tuple** with the p-value and whether you `'Reject'` or `'Fail to reject'` the null hypothesis.

In [None]:
def same_color_distribution(): 
    # Returns a tuple (pvalue, 'Reject' or 'Fail to reject')

    ...
    pval = ...

    return None

# You may want to create a helper function tvd_of_groups similar to the lecture 


# Do not change this part of the cell - it is used for testing 

q1c_out = same_color_distribution()

In [None]:
grader.check("q1c")

<br/><br/>
<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

# Part 2 - Missingness 

First, let's recap the different mechanisms of missingness we studied in lecture.

#### Missing by Design (MD)
- The field is deliberately set to null or not collected (hence, "missing by design").
- You can predict exactly when a column will be null using only the other columns; that is, the missingness is a deterministic function of the other values in the row.

#### Missing Completely at Random (MCAR)
- The missingness of a value isn't related to the actual, unreported value itself, nor to the values in any other fields. The missingness is not systematic.
- The missingness is unconditionally uniform across rows. MCAR doesn't bias the observed data.
- There is no relationship between the missing data and any of the other data, observed or missing.

#### Missing at Random (MAR)
- The missingness of the missing value has nothing to do with the value itself, but may be related to another field.
- The missingness is uniform across rows, perhaps conditional on another column. MAR biases the observed data, but is fixable.
- There is a systematic relationship between the missing values and the observed data (but not the missing values themselves).
- Difference between MD and MAR: If you can *exactly/always* determine missingness using the other columns, the missingness is MD. If there is just some sort of systematic relationship between the missing columns/values and other columns/values that may help us predict missingness, the missingness is MAR.

#### Not Missing At Random (NMAR)
- The missingness of the missing value is related to the actual, unreported value.
- NMAR biases the observed data in unobservable ways.
- There is a relationship between the propensity of a value to be missing and its value.
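
To make the last three mechanisms concrete, here is a sketch that simulates each one on synthetic data. The columns `age` and `income`, the thresholds, and the probabilities are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 1000

# Hypothetical data: 'age' is always observed, 'income' may go missing.
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every row has the same 20% chance of losing 'income'.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance depends on another observed column ('age'), not on 'income'.
mar_mask = rng.random(n) < np.where(df["age"] < 30, 0.5, 0.1)

# NMAR: the chance depends on the unreported value itself.
nmar_mask = rng.random(n) < np.where(df["income"] > 60_000, 0.5, 0.1)

# Series.mask replaces entries with NaN wherever the mask is True.
income_mcar = df["income"].mask(mcar_mask)
income_mar = df["income"].mask(mar_mask)
income_nmar = df["income"].mask(nmar_mask)

print(df["income"].mean(), income_mcar.mean(), income_nmar.mean())
```

In this setup, the observed mean under NMAR tends to fall below the true mean because high incomes are removed more often, and nothing in the remaining columns reveals that.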

<br>

---

## Question 2 – Miscellaneous Missingness Questions

In each of the following scenarios, choose the best answer out of the missingness types: `'MD'`, `'MCAR'`, `'MAR'`, and `'NMAR'`. Store your answers in a list of length 5, and complete the implementation of the function `q2_question`, which returns that list.

1. A large state university has recently adopted GrubHub as the food pre-ordering app for campus restaurants, so you can order your food ahead of time and stop by before your next class. In a DataFrame of GrubHub app orders, which contains information such as `'restaurant'`, `'name'`, `'items'`, and `'total'`, the column `'delivery_address'` is often missing for university students. Which is the most likely missingness mechanism for this column?


2. In a database of student records that records student profile data, such as `'first_name'`, `'home_address'`, `'ethnicity'`, etc., sometimes the `'middle_name'` column is missing. Which is the most likely missingness mechanism for this column?


3. A club baseball team creates a signup sheet for potential new members. The sheet contains the columns `'full_name'`, `'year'`, `'email'`, `'favorite_sports'`, `'number_of_sports_played'`, and `'sports_previously_played'`. The team president notices that many students left the `'sports_previously_played'` column blank. Which is the most likely missingness mechanism for this column?


4. After the 2023 Winter Carnival, USG sends out a survey to all students about whether their expectations for the 2023 Winter Carnival were met, with all questions being optional. They notice that many students left the "Were you satisfied with the 2023 Winter Carnival?" question blank. Which is the most likely missingness mechanism for answers to this question?


5. Your university has been using a two-factor authentication system, DUO, since October 16th, 2019. When using DUO, all university accounts are assigned a unique code. The university's Service Desk, who maintains DUO, has a database that stores each user's code and their phone number, which users must provide when they sign up for DUO. They notice that many phone numbers are missing. Which is the most likely missingness mechanism for phone numbers?

In [None]:
def q2_question(): 
    # Return a list with the type of missingness for the 5 examples.
    #  Types are 'MD', 'MCAR', 'MAR', 'NMAR'
    
    return None
q2_ans = q2_question()

In [None]:
grader.check("q2")

<br>

--- 

## Question 3

Let's now focus on deciding whether data in a particular column look MCAR or MAR through permutation tests.

In `data/payment.csv`, you are given a dataset of payment information for purchases made on January 1st, 2019. The dataset contains the customers' `'id'`, `'credit_card_type'`, `'credit_card_number'`, and `'date_of_birth'`.

You'd like to assess whether the missingness of `'credit_card_number'` is dependent on the age of the customer. Here's how you'll proceed:


#### `first_round`

Look at the distributions of ages when `'credit_card_number'` is missing and when it is not, and determine whether the missingness depends on age.

Use the following steps to approach this problem:

- Compute the ages of the customers. To find a customer's age, compute the number of years between their birth year and 2024.
- Perform a permutation test for whether or not the two distributions mentioned above are drawn from the same population distribution. Use a 5% significance level. Use the **absolute difference of means** as your test statistic.

Note that some of the ages themselves are also missing; you don't need to do anything about this.
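
The idea of comparing a column across "missing" and "not missing" groups can be sketched on a toy table. The columns `score` and `code` are hypothetical and stand in for any numeric column and any column with missing entries.

```python
import numpy as np
import pandas as pd

# Toy table with hypothetical columns; NaN / None mark missing entries.
toy = pd.DataFrame({
    "score": [70.0, 85.0, np.nan, 90.0, 65.0, np.nan],
    "code": ["a1", None, "c3", None, "e5", "f6"],
})

# Boolean indicator: True where 'code' is missing.
is_missing = toy["code"].isna()

# Series.mean skips NaN by default, so missing scores need no special handling.
mean_when_missing = toy.loc[is_missing, "score"].mean()
mean_when_present = toy.loc[~is_missing, "score"].mean()
observed = abs(mean_when_missing - mean_when_present)

print(observed)
```

In a permutation test, shuffling the boolean indicator (or the numeric column) and recomputing this statistic gives one draw from the null distribution.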

Complete the implementation of the function `first_round`, which takes in no arguments and returns a **list** with two values:
* The first value is the p-value from your permutation test. 
* The second value is either `'R'` if you reject the null hypothesis, or `'NR'` if you fail to reject the null.

**Does the result match your guess? If not, what might be a problem?**

***HINT:*** Consider how you might adapt your helper functions from Question 1a for this problem. 

#### `second_round`

Repeat the same permutation test as in `first_round`, but this time, use the **Kolmogorov-Smirnov statistic** as your test statistic.

Complete the implementation of the function `second_round` with no arguments that returns a __list__ with three values: 
* The first value is the p-value from your new permutation test.
* The second value is either `'R'` if you reject the null hypothesis, or `'NR'` if you fail to reject the null. 
* The third value is your final conclusion: `'D'` (the missingness of `'credit_card_number'` is dependent on age) or `'ND'` (the missingness of `'credit_card_number'` is not dependent on age).

Note that in Lecture 11, we ran permutation tests using the Kolmogorov-Smirnov test statistic **without `for`-loops**. You can use this same procedure; we have already imported `stats` from `scipy`.
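
As a sketch of why the choice of test statistic matters, the example below compares two made-up samples that share a mean but differ in spread. It uses `scipy.stats.ks_2samp` only to compute the KS statistic; the samples and seed are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Two made-up samples with the same center but different spreads.
narrow = rng.normal(loc=0.0, scale=1.0, size=500)
wide = rng.normal(loc=0.0, scale=3.0, size=500)

# The difference of means is small because both samples are centered at 0...
mean_gap = abs(narrow.mean() - wide.mean())

# ...but the KS statistic (largest gap between the two empirical CDFs) is not.
ks_result = stats.ks_2samp(narrow, wide)
print(f"|diff of means| = {mean_gap:.3f}, KS statistic = {ks_result.statistic:.3f}")
```

The p-value that `ks_2samp` returns comes from the KS distribution rather than from shuffling, so in a permutation test you would typically use only the `statistic` attribute.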

In [None]:
payments_fp = Path('data') / 'payment.csv'
payments = pd.read_csv(payments_fp)
payments.head()

In [None]:
# Helper functions to add 

...

def first_round(): 
    # Returns a list with two values 
    #  pvalue from permutation test 
    #  'R' or 'NR' to reject or fail to reject the null hypothesis 

    ...

    return None


In [None]:
# Helper functions to add 
...

def second_round(): 
    # Returns a list with three values 
    #  pvalue from permutation test 
    #  'R' or 'NR' to reject or fail to reject the null hypothesis 
    # 'D' (the missingness of 'credit_card_number' is dependent on age) or 
    #   'ND' (the missingness of 'credit_card_number' is not dependent on age).
    ...

    return None


In [None]:
# Don't change this cell 
first_pval, first_res = first_round()
second_pval, second_res, second_res1 = second_round()
out1 = first_round()
out2 = second_round()

In [None]:
grader.check("q3")

## Congratulations! You have finished Homework 6! ##

Below, you will see a cell. Running this cell will automatically generate a zip file with your autograded answers.  If you run into any issues when running this cell, feel free to check the [Debugging Guide](https://mtu.instructure.com/courses/1527249/pages/debugging-guide).


### Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)