# ODSC London 22 September - Missing Data Tutorial

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

## Introduction

In most fields, missing data is so common that it is taken as a given: studies are designed from the outset with missing data in mind. The problem is particularly acute in surveys, longitudinal studies, and clinical medicine.

Handling missing data is so important that Wainer (2010) considers it one of the “six necessary tools” that researchers need to master in order to successfully tackle problems in their fields in the 21st century.

**Aim of this tutorial**
1. To give you an understanding of missing data and the statistical problems it raises
2. To provide you with a comprehensive, A to Z, applied tutorial on handling missing data


### What is missing data? 

**"Values that are not available and that would be meaningful for analysis if they were observed" (Little 2012)**

Missing values in variables that are not of interest for your study can be ignored: for practical purposes, you do not have missing data. The first step is therefore to reduce your dataset to the relevant variables. Only then should you start looking at missing data.
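As a minimal sketch of these two steps, the cell below builds a small, entirely made-up survey extract (the column names and values are assumptions for illustration), keeps only the variables assumed to matter for the analysis, and then counts what is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical survey extract; column names and values are made up for illustration.
survey = pd.DataFrame({
    "age": [34, 51, np.nan, 29, 42],
    "income": [42000, np.nan, 38000, np.nan, 61000],
    "favourite_colour": ["red", None, "blue", None, None],  # assumed irrelevant to the analysis
})

# Step 1: keep only the variables that matter for the analysis.
relevant = survey[["age", "income"]]

# Step 2: count the missing entries per remaining variable.
missing_per_column = relevant.isna().sum()
print(missing_per_column)

# Rows with at least one missing value among the relevant variables.
n_incomplete_rows = int(relevant.isna().any(axis=1).sum())
print(n_incomplete_rows)
```

The gaps in `favourite_colour` never enter the count, because that column was dropped before looking for missing values.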

In [29]:
# Put a dataframe and explain what it means to have missing data - stupid but necessary.
# Add a small note on the "mathematical" description

### Why do we care?

1. **Increase of bias/Underfitting**: 
    In statistics, bias reflects the extent to which the expected value of your estimate differs from the true population parameter that you are trying to estimate. Concretely, bias arises in a dataset with missing data whenever, for a given variable, the missing data differs substantially from the observed data.

    **Longitudinal study** and **dropout**: Take a controlled drug study in medicine, which compares two arms: the treatment arm and the control arm. Bias will depend on the relationship between missingness, treatment, and outcome. It is not uncommon for patients in the treatment arm to drop out because of adverse reactions to the treatment or a lack of improvement. Meanwhile, particularly healthy subjects, or ones who react exceptionally well to the treatment, will have very high completion rates. The patients who drop out become missing data in the final study, and ignoring them would lead to biased estimates of the efficacy of the treatment: we would be estimating how effective the treatment was for a subsample of the population, the “healthier/better reacting” patients who completed the study, as opposed to the entire target population.
    
    **Censoring**: Values for certain variables might fall below certain thresholds, making them inaccurate.
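The dropout scenario can be sketched with a small simulation. Everything here is assumed for illustration: the improvement scores are synthetic, and the logistic dropout rule is an arbitrary choice that makes poorly responding patients more likely to leave. It is not taken from any real trial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trial: improvement scores for 10,000 treated patients.
n = 10_000
improvement = rng.normal(loc=5.0, scale=2.0, size=n)

# Assumed dropout rule: the less a patient improves, the more likely they leave.
p_dropout = 1.0 / (1.0 + np.exp(improvement - 4.0))
completed = rng.random(n) > p_dropout

true_mean = improvement.mean()
completers_mean = improvement[completed].mean()

# Empirical bias of the complete-case mean relative to the full-sample mean.
bias = completers_mean - true_mean
print(f"all patients: {true_mean:.2f}, completers: {completers_mean:.2f}, bias: {bias:+.2f}")
```

Because the completers are the better-responding patients, their mean overstates the average improvement in the full sample.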

In [26]:
# add an example where you show bias increase in missing data, just a simple calculation. 
# Bias(\hat{\mu}) = E[\hat{\mu}] - \mu, i.e. the bias of the estimator, in this case the sample mean.

**2. Reduction of power:**
    In statistics, power, scaled from 0 to 1, is the probability that a hypothesis test correctly rejects the null hypothesis when it is false. Failing to reject a false null hypothesis is called a type 2 error, so power equals 1 minus the probability of a type 2 error. The higher the power of your study, the less likely you are to make a type 2 error.

Statistical power is influenced in particular by two characteristics of your study:

1. Sample size
2. Variability of the outcomes observed

Power increases if the sample size increases, or if the variability of the outcomes observed decreases.

Missing data directly influences the power of your study through two mechanisms:

* If you simply delete the observations with missing values, you reduce your effective sample size and therefore your power.
* If your missing values are more extreme than the observed ones (outliers), which regularly happens in practice, you will underestimate the variability and therefore artificially narrow your confidence intervals (smaller variance, higher bias).
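To make the link between sample size, variability, and power concrete, here is a rough sketch based on the normal approximation for a two-sided, two-sample z-test with equal arms. The function name `two_sample_power` and all the numbers (effect size, standard deviation, arm sizes) are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(n_per_arm, effect, sd, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test with equal arms."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between the two arm means.
    se = sd * sqrt(2 / n_per_arm)
    shift = effect / se
    # Probability that the test statistic lands in either rejection region.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


power_full = two_sample_power(n_per_arm=100, effect=1.0, sd=3.0)
# Deleting 40% of the rows because of missing values leaves 60 subjects per arm.
power_after_deletion = two_sample_power(n_per_arm=60, effect=1.0, sd=3.0)
# Less variable outcomes raise power at the same sample size.
power_low_variability = two_sample_power(n_per_arm=100, effect=1.0, sd=2.0)

print(round(power_full, 2), round(power_after_deletion, 2), round(power_low_variability, 2))
```

Note that the second mechanism is more insidious than this sketch suggests: an underestimated standard deviation makes the computed power look better than it really is.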


In [27]:
# Add an example here as well? 

**3. Most statistical tools, theoretical and applied are designed for complete datasets**

Almost all standard statistical analysis techniques and their implementations in various software packages were developed for complete datasets, and cannot appropriately handle incomplete ones (Schafer and Graham, 2002).
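A tiny illustration of what "designed for complete datasets" means in practice, using made-up values: NumPy's plain mean propagates a missing value, while pandas quietly drops it by default. Neither behaviour is a principled treatment of missing data. One fails loudly, the other silently analyses only the available cases.

```python
import numpy as np
import pandas as pd

values = [2.0, 4.0, np.nan, 6.0]

# NumPy's plain mean assumes complete data and propagates the NaN.
numpy_mean = np.mean(values)

# pandas skips NaN by default (skipna=True), i.e. it silently uses available cases.
pandas_mean = pd.Series(values).mean()

# The explicit NumPy counterpart makes the same choice visible in the code.
numpy_nanmean = np.nanmean(values)

print(numpy_mean, pandas_mean, numpy_nanmean)
```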

**Models** 
**TODO: Add table with models that handle missing data, e.g. DT/RF, and ones that don't, e.g. NN.**

## Description and patterns of missing data

**What do you want to say here?** 

Unit vs item
Graph. 1930s polls.
Item: income

Run me!
![title](img/item-nonresponse.png)
![title](img/unit-nonresponse.png)
![title](img/univariate.png)
![title](img/monotone.png)
![title](img/non-monotone.png)

## Mechanisms of missing data
It is hard to differentiate between the first one (MCAR) and the last one (MNAR) but we'll try to show you some tricks! 

**MCAR: Missing Completely at Random**

If the missingness of the data is *unrelated* to both the observed and the unobserved data, the missing data is said to be missing completely at random (MCAR). In this situation, the observed data is a random subsample of the complete dataset, so discarding incomplete observations loses information but does not by itself bias estimates.

In this specific case, analyzing only the observations with complete data (complete case analysis) would lead to a loss of power/efficiency in the study, but not to a higher bias in the estimates. As a rule of thumb, if fewer than 60% of the features of a data point are missing, we shouldn't remove it.

An important caveat is that MCAR does not rule out correlation between missingness indicators: the missingness of a variable X can be related to the missingness of some other variable Y. If the same subjects in a survey refused to answer both the age and the gender questions, the data may or may not be MCAR.
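The claim that complete case analysis under MCAR costs precision but not accuracy can be checked with a short simulation. The data are synthetic and the 30% missingness rate is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 20_000
x = rng.normal(loc=50.0, scale=10.0, size=n)

# MCAR: every value has the same 30% chance of being missing,
# regardless of x itself or of any other variable.
is_missing = rng.random(n) < 0.30
x_observed = x[~is_missing]

full_mean = x.mean()
complete_case_mean = x_observed.mean()

# The standard error of the mean grows because fewer values remain.
se_full = x.std(ddof=1) / np.sqrt(n)
se_complete_case = x_observed.std(ddof=1) / np.sqrt(x_observed.size)

print(f"means: {full_mean:.2f} vs {complete_case_mean:.2f}")
print(f"standard errors: {se_full:.3f} vs {se_complete_case:.3f}")
```

The two means should agree closely, while the standard error of the complete-case estimate is visibly larger.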

In [None]:
# Add something interesting here

**MAR: Missing at Random** 

If missingness of the data depends only on the observed data and not on the unobserved data, we say that the data is missing at random (MAR). This implies that after taking the observed data into account, there are no systematic differences between items (subjects) with missing data and those without missing data.
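A hedged sketch of what MAR looks like in a simulation: the variables (`age`, `blood_pressure`), their relationship, and the missingness probabilities are all invented for illustration. Missingness of blood pressure depends only on the fully observed age, so the naive complete-case mean is off, but the information needed to correct it is present in the observed data. The correction used here is plain regression-based fill-in, chosen only to make that point, not as a recommended imputation method.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 20_000
age = rng.uniform(20, 70, size=n)                      # always observed
blood_pressure = 90 + 0.8 * age + rng.normal(0, 8, n)  # partly missing

# MAR: the chance that blood pressure is missing depends on age only.
p_missing = np.where(age < 40, 0.6, 0.1)
observed = rng.random(n) > p_missing

true_mean = blood_pressure.mean()
complete_case_mean = blood_pressure[observed].mean()

# Use the observed relationship with age to fill in the gaps.
slope, intercept = np.polyfit(age[observed], blood_pressure[observed], deg=1)
filled = np.where(observed, blood_pressure, intercept + slope * age)
adjusted_mean = filled.mean()

print(f"true {true_mean:.1f}, complete cases {complete_case_mean:.1f}, adjusted {adjusted_mean:.1f}")
```

Filling in predicted values like this understates the variability of the data, so it should not be used to compute standard errors or confidence intervals.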

In [30]:
# Add something interesting here

**MNAR: Missing not at Random**

This occurs when the missingness of the data depends on the values of the missing data themselves. It often occurs for the income variable in surveys, for example: both high-income and low-income earners are less likely to answer the income question than average-income earners. The fact that an answer is missing is therefore related to the value of that missing data point itself. The data in such a study cannot be said to be missing at random.
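The income example can be sketched as follows. The income distribution and the response curve are arbitrary assumptions made for illustration. The only point is that when both tails under-report, the reported incomes look much less spread out than the true ones.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 20_000
income = rng.lognormal(mean=10.3, sigma=0.5, size=n)

# MNAR: the further an income lies from the median, the less likely it is reported.
# The response curve below is an arbitrary choice for illustration.
distance = np.abs(np.log(income) - np.log(np.median(income)))
p_report = np.exp(-2.0 * distance)
reported = rng.random(n) < p_report

std_all = income.std(ddof=1)
std_reported = income[reported].std(ddof=1)

print(f"response rate: {reported.mean():.0%}")
print(f"standard deviation: all {std_all:,.0f}, reported {std_reported:,.0f}")
```

Unlike in the MAR case, nothing in the reported values alone reveals that the tails are missing, which is what makes MNAR so hard to detect and correct.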


Most statistical tools require data to be MAR or MCAR. In the presence of MNAR, however, the same statistical tools remain the best we have; their results are just much less reliable.


In [31]:
# Add something interesting here

## Visualization
As usual, we start by visualizing our missing data. 

We started out dead set on sticking to a pure Python workflow. If something was not implemented in Python, we would rather code it ourselves than revert to R. That’s how we roll.

It became clear over time that R keeps a tremendous advantage in esoteric statistical fields such as handling missing data. At this point in time, there is no robust and comprehensive Python package for handling missing data. A good lesson in data science: be versatile and use whatever is available.

I will show you here the workflow by appealing to the best R packages I could find for each task.

Visualization is key. It is important not only for you, but also for your audience and your publications.


In [15]:
#getwd()
library(VIM)  # provides kNN(), matrixplot() and aggr()
data <- read.csv("../.././data/imputation_dataset_no_censoring_24022018.csv", na.strings = '', header=TRUE)
data_knn_part <- kNN(data)

matrixplot(data, labels=TRUE)


Run me!
![title](img/knn_matrixplot.png)

In [None]:
aggr(data, prop=c(TRUE,FALSE), sortVars=TRUE, sortCombs=TRUE)

Run me! 
![title](img/knn_proportions_missing.png)
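For readers who want a quick look at the same information without leaving Python, the following is a rough pandas sketch of the two summaries such a plot conveys: the share of missing values per variable and the frequency of each missingness pattern. The small frame `toy` is made up and merely stands in for the tutorial's dataset. This is an approximation of the idea, not a replacement for the R plot.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the tutorial's dataset.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [np.nan, np.nan, 1.0, 2.0, 3.0, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

is_missing = toy.isna()

# Share of missing values per variable, largest first.
missing_share = is_missing.mean().sort_values(ascending=False)
print(missing_share)

# How often each combination of missing (True) / observed (False) variables occurs.
pattern_counts = is_missing.value_counts()
print(pattern_counts)
```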

### Interlude: Is the missing data MCAR? (How to test for this)
[Baylor](https://cran.r-project.org/web/packages/BaylorEdPsych/BaylorEdPsych.pdf)
[MCAR Test Youtube](https://www.youtube.com/watch?v=LmyRcu75XEI)

If yes, great. If not, more manual investigation is needed, as there is no statistical test that can differentiate between MAR and MNAR from the observed data alone. **YOU MEAN MCAR AND MNAR?**
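One simple form of manual investigation is to check whether a fully observed variable differs between rows where another variable is missing and rows where it is observed. The sketch below does this with Welch's t statistic on synthetic data. The variables and the age-dependent missingness mechanism are assumptions for the demo, and `welch_t` is a hypothetical helper written here, not part of any package mentioned above.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 5_000
age = rng.normal(45, 12, size=n)  # fully observed
# Assumed mechanism for the demo: older respondents skip the income question more often.
income_missing = rng.random(n) < 1 / (1 + np.exp(-(age - 55) / 5))


def welch_t(a, b):
    """Welch's t statistic for the difference between two group means."""
    var_a, var_b = a.var(ddof=1), b.var(ddof=1)
    return (a.mean() - b.mean()) / np.sqrt(var_a / a.size + var_b / b.size)


# Does the observed variable differ between rows with and without missing income?
t_dependent = welch_t(age[income_missing], age[~income_missing])

# Reference case: missingness assigned by coin flip, unrelated to age.
coin_flip_missing = rng.random(n) < 0.2
t_coin_flip = welch_t(age[coin_flip_missing], age[~coin_flip_missing])

print(f"t (age-dependent missingness): {t_dependent:.1f}")
print(f"t (coin-flip missingness): {t_coin_flip:.1f}")
```

A large absolute t value is evidence against MCAR. A small one does not prove MCAR, and neither outcome says anything about whether the data is MNAR.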

Question: Should this part go here?


________

## R works either with an R kernel or with the magic commands below

In [3]:
%load_ext rpy2.ipython

In [15]:
%R require(ggplot2)
df = pd.DataFrame({'Alphabet': ['a', 'b', 'c', 'd','e', 'f', 'g', 'h','i'],
                   'A': [4, 3, 5, 2, 1, 7, 7, 5, 9],
                   'B': [0, 4, 3, 6, 7, 10,11, 9, 13],
                   'C': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
%R -i df

array([0], dtype=int32)