In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab06.ipynb")

# Lab 6 - Missing Values and Imputation 

To receive credit for a lab, answer all questions correctly and submit before the deadline.

You must submit this assignment to Gradescope by the on-time deadline. We strongly encourage you to plan to submit your work to Gradescope several hours before the stated deadline. This way, you will have ample time to contact staff for submission support.

In [None]:
import pandas as pd
import numpy as np
import zipfile
import matplotlib
import matplotlib.pyplot as plt
from pathlib import Path

from scipy import stats

plt.rcParams['figure.figsize'] = (8, 5)

<br/><br/>
<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />


## Part 1: Missingness Mechanisms

First, let's recap the different mechanisms of missingness we studied in lecture.

#### Missing by Design (MD)
- The field is deliberately set to null or not collected (hence, "missing by design").
- Whether a value is missing can be determined exactly from the other columns; that is, missingness is a deterministic function of the other values in the same row.

#### Missing Completely at Random (MCAR)
- The missingness of a value isn't related to the actual, unreported value itself, nor to the values in any other fields. The missingness is not systematic.
- The missingness is unconditionally uniform across rows. MCAR doesn't bias the observed data.
- There is no relationship between the missing data and any of the other data, observed or missing.

#### Missing at Random (MAR)
- The missingness of the missing value has nothing to do with the value itself, but may be related to another field.
- The missingness is uniform across rows, perhaps conditional on another column. MAR biases the observed data, but is fixable.
- There is a systematic relationship between the missing values and the observed data (but not the missing values themselves).
- Difference between MD and MAR: If you can *exactly/always* determine missingness using the other columns, the missingness is MD. If there is just some sort of systematic relationship between the missing columns/values and other columns/values that may help us predict missingness, the missingness is MAR.

#### Not Missing At Random (NMAR)
- The missingness of the missing value is related to the actual, unreported value.
- NMAR biases the observed data in unobservable ways.
- There is a relationship between the propensity of a value to be missing and its value.
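
To make the four mechanisms concrete, here is a minimal sketch that simulates each one on a made-up dataset. The DataFrame `toy`, its columns `'group'` and `'value'`, and all of the probabilities are assumptions chosen purely for illustration; they are not part of the lab data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility
n = 1000

# Hypothetical dataset: a categorical column and a quantitative column.
toy = pd.DataFrame({
    'group': rng.choice(['A', 'B'], size=n),
    'value': rng.normal(loc=50, scale=10, size=n),
})

# MD: 'value' is never collected for group B, so missingness is an
# exact function of another column.
md_mask = (toy['group'] == 'B').to_numpy()

# MCAR: every row has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.3

# MAR: the chance of being missing depends on an observed column ('group'),
# but not on 'value' itself.
mar_mask = rng.random(n) < np.where(toy['group'] == 'A', 0.5, 0.1)

# NMAR: the chance of being missing depends on the unreported value itself.
nmar_mask = rng.random(n) < np.where(toy['value'] > 55, 0.6, 0.1)

# Series.mask replaces entries with NaN wherever the mask is True.
toy['value_md'] = toy['value'].mask(md_mask)
toy['value_mcar'] = toy['value'].mask(mcar_mask)
toy['value_mar'] = toy['value'].mask(mar_mask)
toy['value_nmar'] = toy['value'].mask(nmar_mask)

# Proportion missing in each simulated column.
print(toy.filter(like='value_').isna().mean())
```

In the NMAR column, large values are removed more often than small ones, so the mean of the observed values understates the true mean, and nothing in the observed columns reveals that.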

<br>

--- 

### Question 1

You run a small e-commerce website and send surveys out to customers after they purchase an item from your store. The survey asks whether the customer is satisfied with their purchase ("Yes" or "No"). Below, you are presented with possible datasets, each of which contains a column `'satisfied'` as described above, as well as a `'customer_id'` number corresponding to the customer and an `'item'` column describing the item that the customer purchased. **The column `'satisfied'` is missing data.**

For each of the following datasets, label the column `'satisfied'` as being `'MD'`, `'MCAR'`, `'MAR'`, or `'NMAR'`.

1. The dataset consists only of the columns `'customer_id'` and `'satisfied'`.
2. The dataset contains the `'customer_id'` of every customer with an account, even if they didn't make a purchase. Also, in this case, you notice everyone who was sent a survey filled it out.
3. The dataset contains a column specifying if the user later returned the item.
4. The dataset contains a column with the serial number for the item purchased.
5. The dataset contains a column with the price of the item purchased.

Complete the implementation of the function `after_purchase`, which records your answers and returns a list of length 5, containing the values `'MD'`, `'MCAR'`, `'MAR'`, or `'NMAR'`. For some questions there may be multiple good answers, but there is generally one answer that is "best". 

***Disclaimer***: It is possible to just look at some of the correct answers by running `grader.check`. This is not a good idea – you should really think about all of the questions here, since similar questions will be on the exams.

In [None]:
def after_purchase(): 
    # Return a list with the type of missingness for the 5 datasets.
    #  Types are 'MD', 'MCAR', 'MAR', 'NMAR'
    
    return None
q1_ans = after_purchase()

In [None]:
grader.check("q1")

<br/><br/>
<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />


## Part 2: Assessing Missingness through data 

Let's now focus on deciding whether data in a particular column look MCAR or MAR through permutation tests.


<br> 

--- 

### Question 2 

In the file `data/missing_heights.csv` are the heights of adult children and their fathers (`'child'` and `'father'`). The additional `'child_X'` columns have values missing in varying proportions; for each X, `'child_X'` is X\% not missing (and hence (100-X)\% missing). **The missingness of these `'child_X'` columns was created as MAR dependent on father's heights (similar to what was done in Lecture 10-11). The missingness of each of these `'child_X'` columns is equally dependent on father's heights.**

You will attempt to **verify** the missingness of the `'child_X'` columns as being dependent on the `'father'` column by using permutation tests. Your permutation tests should use the Kolmogorov-Smirnov test statistic. You can use `scipy.stats`' built-in K-S function to run your permutation tests and compute your p-values; you don't need to simulate manually using a `for`-loop. Instead, you can directly use the `.pvalue` attribute of the result returned by **`ks_2samp`**.

To do this, complete the implementation of the function `verify_child`, which takes in the `heights` DataFrame and returns a Series of p-values from your permutation tests, indexed by the names of the columns in `heights` that are formatted like `'child_X'` (that is, its index should be `'child_95'`, `'child_90'`, ..., `'child_5'`; the order of the Series is not important).

To clarify, for each `child_X` column, you will be running one permutation test comparing it to the `father` column. Your permutation tests should run within your `verify_child` function. You can **only** use a for-loop to loop over the **columns** of `heights`, and you shouldn't need to use a `for`-loop to conduct your permutation tests.
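
As a rough illustration of the mechanics for a single column, the sketch below runs `stats.ks_2samp` on synthetic data. The DataFrame `toy`, its columns `'x'` and `'y'`, and the missingness probabilities are assumptions made up for this example; this is not the lab's dataset or its intended solution.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility
n = 2000

# Hypothetical data: 'y' is much more likely to be missing when 'x' is
# positive, so the missingness of 'y' is MAR dependent on 'x'.
toy = pd.DataFrame({'x': rng.normal(size=n)})
missing = rng.random(n) < np.where(toy['x'] > 0, 0.8, 0.2)
toy['y'] = pd.Series(rng.normal(size=n)).mask(missing)

# Split 'x' by whether 'y' is missing, then compare the two distributions.
x_when_missing = toy.loc[toy['y'].isna(), 'x']
x_when_present = toy.loc[toy['y'].notna(), 'x']

result = stats.ks_2samp(x_when_missing, x_when_present)
print(result.statistic, result.pvalue)
```

Because the dependence here is deliberately strong, the two distributions of `'x'` differ sharply and the p-value comes out very small. With weaker dependence, or with very few missing values, the same test can easily fail to detect the relationship.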

In [None]:
def verify_child(heights): 
    # Return a Series of p-values from the permutation tests
    
    return None

In [None]:
# don't change this cell 
heights = pd.read_csv(Path('data') / 'missing_heights.csv')
q2_out = verify_child(heights.copy())

In [None]:
grader.check("q2")

Let's reflect on the p-values that you found: 

In [None]:
q2_out

Remember, **in all seven columns, the data are truly MAR** – we know this for a fact since we were told in the question:

> The missingness of these <code>'child_X'</code> columns was created as MAR dependent on father's heights (similar to what was done in Lecture 11). The missingness of each of these <code>'child_X'</code> columns is equally dependent on father's heights.

- If our permutation test returned a small $p$-value for a particular column, it means that the distribution of father's heights in rows where the child's height was missing looked significantly different than the distribution of father's heights in rows where the child's height was present. That's evidence that the missingness of that column depends on father's heights.

- If our permutation test returned a large $p$-value for a particular column, that's evidence that the missingness of that column doesn't depend on father's heights.

Despite the fact that the missingness of each `'child_X'` column truly depends on father's heights (by design), it appears that **in all cases except `'child_50'`, we'd conclude that the child's height columns are MCAR** at the 5% significance level! We should be precise – we cannot **prove** that heights are MCAR or MAR, just like we cannot prove either hypothesis in a hypothesis test. Instead, all we can say, for instance, is that two samples don't look like they were drawn from the same population distribution, and hence, the missingness of a particular column **appears** to be dependent on another column.

One thing you'll notice is that when a column contains relatively few missing values, it is exceedingly difficult to conclude that values in that column are missing at random dependent on another column. Think about why this is the case.

<br/><br/>
<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />


## Part 3: Imputation

Now that we have worked with missingness mechanisms and how to detect them in data, let's focus on filling in missing values. 

<br>

--- 

### Question 3

In Lecture 11, you learned how to perform single-valued imputation conditionally on a **categorical** column: impute with the mean for each group. That is, for each distinct value of the **categorical** column, there is a single imputed value.

Here, you will perform single-valued imputation by conditioning on a **quantitative** column. 

You will work with a version of the `heights` DataFrame, `new_heights`, that has a `'father'` column and a single `'child'` column. The `'child'` column has missing values. To impute the `'child'` column, transform the `'father'` column into a categorical column by binning the values of `'father'` into [quartiles](https://en.wikipedia.org/wiki/Quartile). Once this is done, you can impute `'child'` as in lecture (and described above).

<br>

#### `cond_single_imputation`

Complete the implementation of the function `cond_single_imputation`, which takes in a DataFrame with columns `'father'` and `'child'` (where `'child'` has missing values) and performs a single-valued mean imputation of the `'child'` column, conditional on `'father'`. Your function should return a **Series**.

***Hints***:
- `pd.qcut` may be helpful!
- The `transform` method is useful for this question, though it's also possible to do this using the `aggregate` method.
- As a reminder, *loops are not allowed*, and functions mentioned in "Hints" are not required.
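
To show how the hinted tools fit together, here is a small sketch on a made-up DataFrame. The name `toy` and the columns `'x'` and `'y'` are hypothetical; treat this as one possible pattern, not as the required implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'x' is fully observed, 'y' has missing values.
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    'y': [10.0, np.nan, 30.0, np.nan, 50.0, 70.0, np.nan, 90.0],
})

# Bin 'x' into 4 equal-sized groups (quartiles); the result is categorical.
bins = pd.qcut(toy['x'], 4)

# Within each bin, fill missing 'y' values with that bin's observed mean.
# transform returns a Series aligned with the original index.
imputed = (
    toy.groupby(bins, observed=True)['y']
       .transform(lambda s: s.fillna(s.mean()))
)
print(imputed.tolist())
# → [10.0, 10.0, 30.0, 30.0, 50.0, 70.0, 90.0, 90.0]
```

Each missing `'y'` is replaced by the mean of the observed `'y'` values whose `'x'` falls in the same quartile, so every bin contributes a single imputed value.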

In [None]:
def cond_single_imputation(df):
    # Input a DataFrame with columns 'father' and 'child' (with some missing 
    #   values in child) 
    # Return a Series performing a single-valued mean imputation of the 'child' 
    #   column, conditional on 'father'
    
    return None

In [None]:
# don't change this cell, but do run it -- it is needed for the tests to work
heights_fp = Path('data') / 'missing_heights.csv'
new_heights = pd.read_csv(heights_fp)[['father', 'child_50']]
new_heights = new_heights.rename(columns={'child_50': 'child'})
q3_out = cond_single_imputation(new_heights.copy())

In [None]:
# don't change this cell, but do run it -- it is needed for the tests to work
heights_fp = Path('data') / 'missing_heights.csv'
heights_q3 = pd.read_csv(heights_fp)
heights_q3['child'] = heights_q3['child_50']
inp_q3 = heights_q3
out_q3 = cond_single_imputation(inp_q3)
df_q3 = inp_q3.copy()
df_q3['imputed'] = out_q3
gp1_q3 = df_q3.groupby('father')['imputed'].mean()
gp2_q3 = df_q3.groupby('father')['child'].mean()
m_q3 = (pd.concat([gp1_q3, gp2_q3], axis=1)
     .dropna().diff(axis=1).abs().iloc[:, -1])

In [None]:
grader.check("q3")

<br>

--- 

### Question 4

In Lecture 11, you learned how to impute a quantitative column by sampling from the observed values. **One problem with this technique is that the imputation will never generate imputed values that weren't already in the dataset.** For example, 57, 57.5, and 59 are values in the `'child'` column of `new_heights` while 58 is not. Thus, any imputation done by sampling from the observed values in the `'child'` column will not be able to generate a height of 58, even though it's clearly a reasonable value to occur in the dataset.

To keep things simple, you will impute the `'child'` column **unconditionally** from the distribution of `'child'` heights present in the dataset. This means that you will use the values present in `'child'` to impute missing values, without looking at other columns.

An approach to quantitative imputation that overcomes the limitation mentioned above is as follows:
- Create a histogram of observed `'child'` heights, using 10 bins.
    - Note that in your process, you don't actually need to draw a histogram – instead, use `np.histogram`.
- Use the histogram to generate a number within the observed range of `'child'` heights:
    - The likelihood a generated number belongs to a given bin is equal to the area of that bin. (Remember, in histograms, areas are proportions.)
    - Any number within a fixed bin is equally likely to occur.
    
Let's illustrate this approach with an example. Let `demo` be the array of 10 numbers defined below.

```py
demo = np.array([10, 11, 11, 13, 14, 14, 13.5, 14, 15, 16])
```

- The first step is creating a histogram of `demo`. Note that with this small dataset, we will use 3 bins, but you will be using 10 bins in your imputation process.

<img src='imgs/demo_histogram.png' width=300>

- In the histogram above, we see that $2 \cdot 0.15 = 0.3 = 30\%$ of values lie in the [10, 12) bin, $2 \cdot 0.1 = 0.2 = 20\%$ of values lie in the [12, 14) bin, and $2 \cdot 0.25 = 0.5 = 50\%$ of values lie in the [14, 16] bin.
- Next, we need to pick a bin at random. There's a 30\% chance we pick the [10, 12) bin, a 20\% chance we pick the [12, 14) bin, and a 50\% chance we pick the [14, 16] bin. `np.random.choice` will be helpful in picking a bin at random.
- Once we pick a bin, we pick a number **uniformly at random** from within the bin. For instance, suppose we randomly chose the [14, 16] bin in the previous step. We then must select a (real) number between 14 and 16 uniformly at random. `np.random.uniform` can help you here.
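
The steps above can be checked numerically on `demo`. The sketch below is only a walk-through of this one example with 3 bins and a single generated value; the seed is arbitrary and the variable names are chosen for illustration.

```python
import numpy as np

demo = np.array([10, 11, 11, 13, 14, 14, 13.5, 14, 15, 16])

# np.histogram returns the count in each bin and the bin edges.
counts, edges = np.histogram(demo, bins=3)
proportions = counts / counts.sum()
print(counts)       # number of values in each of the 3 bins
print(edges)        # the 4 edges that delimit the 3 bins
print(proportions)  # the area of each bin in the density histogram

# Step 1: pick a bin at random, weighted by the bin proportions.
np.random.seed(0)  # arbitrary seed, for reproducibility
bin_idx = np.random.choice(len(counts), p=proportions)

# Step 2: pick a value uniformly at random within the chosen bin.
value = np.random.uniform(edges[bin_idx], edges[bin_idx + 1])
print(bin_idx, value)
```

The counts come out as 3, 2, and 5 with edges at 10, 12, 14, and 16, which matches the 30\%, 20\%, and 50\% proportions read off the histogram.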

<br>

#### `quantitative_distribution`
    
Complete the implementation of the function `quantitative_distribution`, which takes in a Series, `child`, in which some values are missing, and a positive integer `N`, and returns an **array** of `N` imputed values using the method described above. 

***Note***: You may use a `for`-loop.

<br>

#### `impute_height_quant`

Complete the implementation of the function `impute_height_quant`, which takes in a Series, `child`, in which some values are missing and imputes them using the scheme above. `impute_height_quant` should return a Series that is the same length as `child` but with no missing values. **You should use `quantitative_distribution` to help you do this.**

In [None]:
def quantitative_distribution(child, N):
    # Input: Series, child and a positive integer N 
    # Return: an array of N imputed values 
    ...


def impute_height_quant(child):
    # Input: Series, child with some missing values
    # Return: Series of same length as child with missing values imputed. 
    ...
    

In [None]:
# don't change this cell, but do run it -- it is needed for the tests to work
heights_fp = Path('data') / 'missing_heights.csv'
heights = pd.read_csv(heights_fp)
child = heights['child_50']
q4_quantitative_distribution_out = quantitative_distribution(child.copy(), 100)
q4_impute_height_quant_out = impute_height_quant(child.copy())

In [None]:
grader.check("q4")

## Congrats! You have finished Lab 06! 

**Important**: To make sure the test cases run correctly, click `Kernel>Restart & Run All` and make sure all of the test cases are still passing. 

If your test cases are no longer passing after restarting, it's likely because you're missing a variable, or the modifications that you'd previously made to your DataFrame are no longer taking place (perhaps because you deleted a cell). 

You may submit this assignment as many times as you'd like before the deadline.

**You must restart and run all cells before submitting. Otherwise, you may pass test cases locally, but not on our servers. We will not entertain regrade requests of the form, “my code passed all of my local test cases, but failed the autograder”.**


## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)