# ODSC London 22 September - Missing Data Tutorial

## Introduction

In most fields, missing data is so common that it is a given: studies are designed from the outset with missing data in mind. The problem is particularly acute in surveys, longitudinal studies, and clinical medicine.

Handling missing data is so important that Wainer (2010) considers it one of the “six necessary tools” that researchers need to master in order to successfully tackle problems in their fields in the 21st century.

**Aim of this tutorial**
1. To give you an understanding of missing data and the statistical problems it raises
2. To provide you with a comprehensive, A to Z, applied tutorial on handling missing data


### What is missing data? 

**"Values that are not available and that would be meaningful for analysis if they were observed" (Little 2012)**

Missing values in variables that are not relevant to your study do not count as missing data for your purposes. The first step is therefore to reduce your dataset to the relevant variables; only then should you start examining missing data.

### Why do we care?

1. **Increased bias (underfitting)**: 
    In statistics, bias is the extent to which the expected value of an estimator differs from the true population parameter it is meant to estimate. Concretely, bias arises in a dataset with missing data whenever, for a given variable, the missing values differ systematically from the observed ones.

    **Longitudinal study** and **dropout**: Consider a controlled drug study in medicine, which compares two arms: the treatment arm and the control arm. Bias depends on the relationship between missingness, treatment, and outcome. It is not uncommon for patients in the treatment arm to drop out because of adverse reactions to the treatment or lack of improvement. Meanwhile, particularly healthy subjects, or those who respond exceptionally well to the treatment, have very high completion rates. The patients who drop out become missing data in the final study, and ignoring them leads to biased estimates of the treatment's efficacy: we would be estimating how effective the treatment was for a subsample of the population, the “healthier/better reacting” patients who completed the study, rather than for the entire target population.
    
    **Censoring**: Values of certain variables may fall below a threshold (for example, a measuring instrument's detection limit), which makes them inaccurate.

In [26]:
# TODO: add an example showing how missing data increases bias, using a simple calculation.
# Bias of an estimator, here the sample mean: Bias(\hat{\mu}) = E[\hat{\mu}] - \mu
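
The dropout scenario described earlier can be sketched with a small simulation. This is only an illustration under stated assumptions: the outcome distribution, the 50% random-loss rate, and the logistic dropout rule are all hypothetical choices, not values from any real study. It compares the bias of the sample mean when values go missing independently of the outcome with the bias when patients with poor outcomes are more likely to drop out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outcome scores for the full target population sample.
n = 100_000
true_mean = 50.0
y = rng.normal(loc=true_mean, scale=10.0, size=n)

# Case 1 (assumption): each value is lost with probability 0.5,
# independently of the outcome itself.
observed_random = rng.random(n) < 0.5

# Case 2 (assumption): patients with low outcomes are more likely to
# drop out. The logistic rule below is purely illustrative.
p_observed = 1.0 / (1.0 + np.exp(-(y - true_mean) / 5.0))
observed_dropout = rng.random(n) < p_observed

# Bias of the complete-case mean: estimate minus true parameter.
bias_random = y[observed_random].mean() - true_mean
bias_dropout = y[observed_dropout].mean() - true_mean

print(f"Bias, missingness unrelated to outcome: {bias_random:+.3f}")
print(f"Bias, outcome-dependent dropout:        {bias_dropout:+.3f}")
```

In the first case the observed values are a random subsample, so the mean stays close to the truth. In the second, the completers are the "better reacting" patients, and the mean computed on them overestimates the population mean.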

**2. Reduction of power:**
    In statistics, power, which ranges from 0 to 1, is the probability that a hypothesis test correctly rejects the null hypothesis when it is false. Failing to reject a false null hypothesis is called a type 2 error, so power equals 1 minus the probability of a type 2 error. The higher the power of your study, the less likely you are to make a type 2 error.

Besides the effect size and the significance level, statistical power is influenced by two characteristics of your study: 
    1. Sample size 
    2. Variability of the outcomes observed. 
Power increases as the sample size increases, or as the variability of the observed outcomes decreases. 

Missing data directly influences the power of your study through two mechanisms:
    * if you simply delete the observations with missing values, you reduce your effective sample size, and therefore reduce your power
    * if your missing values are more extreme than the observed ones (outliers), which regularly happens in practice, you will underestimate the variability of your data and therefore artificially narrow your confidence intervals (lower variance, but higher bias)
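
The second mechanism can be illustrated with a short simulation. This is a sketch under an assumed missingness rule (every value more than 1.5 standard deviations from the mean goes missing), chosen only to make the effect visible. The helper `ci_half_width` is a hypothetical name introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=0.0, scale=1.0, size=50_000)

# Assumption: the extreme values are the ones that go missing.
observed = y[np.abs(y) <= 1.5]

def ci_half_width(values):
    """Half-width of a normal-approximation 95% CI for the mean."""
    return 1.96 * values.std(ddof=1) / np.sqrt(values.size)

sd_full, sd_observed = y.std(ddof=1), observed.std(ddof=1)
hw_full, hw_observed = ci_half_width(y), ci_half_width(observed)

print(f"sd, full data:     {sd_full:.3f}   CI half-width: {hw_full:.5f}")
print(f"sd, observed only: {sd_observed:.3f}   CI half-width: {hw_observed:.5f}")
```

Even though fewer observations remain, the confidence interval computed on the observed values is narrower than the one from the full data, because the standard deviation is underestimated.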


In [27]:
# Add an example here as well? 
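
One way to illustrate the effect of deletion on power is a normal-approximation power calculation for a two-sided, two-sample test. This is a minimal sketch: the function name `two_sample_power`, the effect size, the standard deviation, and the sample sizes (64 per arm before deletion, 45 after) are all hypothetical values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_power(effect, sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    Assumes equal arm sizes, a common known standard deviation,
    and normally distributed outcomes.
    """
    z = NormalDist()
    se = sd * sqrt(2.0 / n_per_arm)        # standard error of the mean difference
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)  # critical value for the chosen alpha
    shift = effect / se                    # true difference in standard-error units
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Hypothetical study: 64 patients per arm planned, 45 per arm left
# after deleting every patient with a missing value.
power_full = two_sample_power(effect=5.0, sd=10.0, n_per_arm=64)
power_after_deletion = two_sample_power(effect=5.0, sd=10.0, n_per_arm=45)

print(f"Power with all patients:     {power_full:.2f}")
print(f"Power after listwise deletion: {power_after_deletion:.2f}")
```

With the effect size and variability held fixed, the only thing that changes is the sample size, so the drop in power is entirely due to the deleted observations.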

**3. Most statistical tools, theoretical and applied, are designed for complete datasets**

Almost all standard statistical analysis techniques, and their implementations in software packages, were developed for complete datasets and cannot handle incomplete ones appropriately (Schafer and Graham, 2002).
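
As a small illustration of how common tools react to incomplete data, the sketch below uses NumPy and pandas on made-up values. It shows three typical behaviours: propagating the missing value, silently skipping it, and deleting incomplete rows. It is not meant as a survey of all software.

```python
import numpy as np
import pandas as pd

values = np.array([2.0, np.nan, 4.0, 6.0])

# A plain NumPy mean propagates the missing value: the result is NaN.
plain_mean = np.mean(values)

# The NaN-aware variant ignores missing entries.
nan_aware_mean = np.nanmean(values)

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0],
                   "y": [2.0, np.nan, 6.0, 8.0]})

# pandas skips missing values by default (skipna=True), without warning.
pandas_mean = df["x"].mean()

# Listwise (complete-case) deletion keeps only fully observed rows.
complete_cases = df.dropna()

print(plain_mean, nan_aware_mean, pandas_mean)
print(f"{len(complete_cases)} of {len(df)} rows are complete")
```

None of these behaviours accounts for why the values are missing, which is why the result can still be biased or less powerful even when the code runs without error.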

**Models** 

TODO: Add a table of models that handle missing data, such as decision trees (DT) and random forests (RF), and models that do not, such as neural networks (NN). 

## Description and patterns of missing data

**What do you want to say here?** 

* Unit vs item
* Graph. 1930s polls.
* Item: income

![title](img/item-nonresponse.png)
![title](img/unit-nonresponse.png)
![title](img/monotone.png)
![title](img/non-monotone.png)
![title](img/univariate.png)
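
The patterns pictured above can also be inspected in code. The following is a minimal sketch on a hypothetical survey table (the column names, values, and respondent labels are invented). It assumes that unit nonresponse means a respondent provided no values at all, that item nonresponse means only some values are missing, and that a pattern is monotone when, with the columns taken in collection order, a missing variable implies that all later variables are missing too.

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses; NaN marks a missing value.
survey = pd.DataFrame(
    {
        "age": [34, 51, np.nan, 28, 45],
        "education_years": [16, 12, np.nan, 18, 14],
        "income": [52_000, np.nan, np.nan, np.nan, 61_000],
    },
    index=["r1", "r2", "r3", "r4", "r5"],
)

missing = survey.isna()

# Unit nonresponse: the respondent provided nothing at all.
unit_nonresponse = missing.all(axis=1)

# Item nonresponse: some, but not all, items are missing.
item_nonresponse = missing.any(axis=1) & ~unit_nonresponse

# Share of missing values per variable.
missing_share = missing.mean()

# Monotone pattern (assuming columns are in collection order):
# once a value is missing in a row, every later value is missing too.
mask = missing.to_numpy()
is_monotone = bool(np.array_equal(np.logical_or.accumulate(mask, axis=1), mask))

print(missing_share)
print("Unit nonresponse:", list(survey.index[unit_nonresponse]))
print("Item nonresponse:", list(survey.index[item_nonresponse]))
print("Monotone pattern:", is_monotone)
```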

## Mechanisms of missing data

In [1]:
import pandas as pd

from pathlib import Path

In [9]:
import warnings
warnings.filterwarnings('ignore')

________

## Running R: use an R kernel or the magic commands below

In [3]:
%load_ext rpy2.ipython

In [15]:
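# The R package is named ggplot2; require() returns FALSE (0) if it is not installed
%R require(ggplot2)
# Small example DataFrame to pass to R
df = pd.DataFrame({'Alphabet': ['a', 'b', 'c', 'd','e', 'f', 'g', 'h','i'],
                   'A': [4, 3, 5, 2, 1, 7, 7, 5, 9],
                   'B': [0, 4, 3, 6, 7, 10,11, 9, 13],
                   'C': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
# Push df into the embedded R session
%R -i df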

array([0], dtype=int32)