## Using AutoImpute to Explore Missing Data
---
This notebook introduces users to the `autoimpute` package. The tutorial includes:
* Getting started
* Explorations of data with missing values

### Getting Started
---
First, let's examine what the `autoimpute` package has to offer:

In [1]:
import autoimpute.utils as au

def module_explore(m):
    methods = [f for f in dir(m) if not f.startswith("_")]
    statement = f"Available from {m.__name__}"
    print(f"{statement}\n{'-'*len(statement)}\n{methods}\n")

module_explore(au)

Available from autoimpute.utils
-------------------------------
['check_data_structure', 'check_missingness', 'check_nan_columns', 'checks', 'feature_corr', 'feature_cov', 'flux', 'helpers', 'inbound', 'influx', 'md_locations', 'md_pairs', 'md_pattern', 'outbound', 'outflux', 'patterns', 'proportions']



### The Utils Module
---
The `utils` module contains checks to ensure datasets play nicely with imputation methods and functions to explore patterns in missing data. First, note that `check_data_structure`, `check_missingness`, and `check_nan_columns` are used to build custom `Imputers`. This is a more advanced topic covered in a future tutorial. For now, we'll explore the following methods to get started:
* `feature_cov` and `feature_corr`
* `proportions`, `md_locations`, `md_pattern`, and `md_pairs`
* `inbound`, `outbound`, `influx`, `outflux`, and `flux`

This tutorial explains how the methods above work in `autoimpute`. They follow Van Buuren's (VB) Flexible Imputation of Missing Data, 2nd Edition, Section 4.1 closely. For a deeper understanding of the formulas behind each method, refer to his excellent text. Let's start with an example dataframe that mimics the structure from VB Section 4.1:

In [2]:
import numpy as np
import pandas as pd

missing_data = pd.DataFrame({
    "A": [1, 5, 9, 6, 12, 11, np.nan, np.nan],
    "B": [2, 4, 3, 6, 11, np.nan, np.nan, np.nan],
    "C": [-1, 1, np.nan, np.nan, np.nan, -1, 1, 0]
})

missing_data

Unnamed: 0,A,B,C
0,1.0,2.0,-1.0
1,5.0,4.0,1.0
2,9.0,3.0,
3,6.0,6.0,
4,12.0,11.0,
5,11.0,,-1.0
6,,,1.0
7,,,0.0


#### Covariance and Correlation
The `utils` module contains simple methods (`feature_cov` and `feature_corr`) to examine the covariance and correlation matrices. Each method takes a dataframe as an argument. Missing values are **dropped by default**, so these methods return the covariance / correlation of the observed values of each feature.

In [3]:
# Covariance matrix after missing records dropped
au.feature_cov(missing_data)

Unnamed: 0,A,B,C
A,17.066667,11.35,-0.666667
B,11.35,12.7,2.0
C,-0.666667,2.0,1.0


In [4]:
# Correlation matrix after missing records dropped
au.feature_corr(missing_data)

Unnamed: 0,A,B,C
A,1.0,0.765722,-0.114708
B,0.765722,1.0,1.0
C,-0.114708,1.0,1.0
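
The values above are consistent with pairwise deletion, where each matrix entry uses only the rows in which both features are observed. The sketch below reproduces them with plain pandas, whose `DataFrame.cov` and `DataFrame.corr` exclude missing values pairwise. Treating this as equivalent to `feature_cov` and `feature_corr` is an assumption based on the matching output, not a description of autoimpute's internals.

```python
import numpy as np
import pandas as pd

missing_data = pd.DataFrame({
    "A": [1, 5, 9, 6, 12, 11, np.nan, np.nan],
    "B": [2, 4, 3, 6, 11, np.nan, np.nan, np.nan],
    "C": [-1, 1, np.nan, np.nan, np.nan, -1, 1, 0],
})

# pandas excludes NaN pairwise: each entry is computed from the
# rows where both features of the pair are observed.
pairwise_cov = missing_data.cov()
pairwise_corr = missing_data.corr()

# B and C are jointly observed in only two rows, so their
# correlation rests on two points and should be read with care.
both_observed = missing_data[["B", "C"]].dropna()
print(len(both_observed))
print(pairwise_cov)
print(pairwise_corr)
```

The perfect correlation between $B$ and $C$ comes from those two jointly observed rows, which is worth keeping in mind before relying on it.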


#### Locations and Patterns of Missingness
The `utils` module also contains methods to examine the locations and patterns of missingness. These methods help assess where data is missing, how often it is missing, and its co-occurrence with missingness in other features.

The first of these methods is **`proportions`**. It gives us the proportion missing ("poms") and proportion observed ("pobs") for each feature in a dataset. These two columns always sum to 1. Each row of the output is a feature from the original dataset.

In [5]:
au.proportions(missing_data)

Unnamed: 0,pobs,poms
A,0.75,0.25
B,0.625,0.375
C,0.625,0.375
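
As a rough cross-check, the same proportions can be derived directly from pandas missingness masks. This is a minimal sketch that rebuilds the example dataframe; it does not use autoimpute, and the name `proportions_check` is only illustrative.

```python
import numpy as np
import pandas as pd

missing_data = pd.DataFrame({
    "A": [1, 5, 9, 6, 12, 11, np.nan, np.nan],
    "B": [2, 4, 3, 6, 11, np.nan, np.nan, np.nan],
    "C": [-1, 1, np.nan, np.nan, np.nan, -1, 1, 0],
})

# The mean of a boolean mask is the share of True values per column.
proportions_check = pd.DataFrame({
    "pobs": missing_data.notna().mean(),
    "poms": missing_data.isna().mean(),
})
print(proportions_check)
```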


Next is **`md_locations`**, which tells us where data is missing within each feature. Here, 1 = missing and 0 = observed.

In [6]:
au.md_locations(missing_data)

Unnamed: 0,A,B,C
0,0,0,0
1,0,0,0
2,0,0,1
3,0,0,1
4,0,0,1
5,0,1,0
6,1,1,0
7,1,1,0


Next, **`md_pattern`** shows us the row-wise patterns of missingness in our dataset. Note that here 1 = observed and 0 = missing, the reverse of `md_locations`. Let's start with the first row in the output below. There are 2 instances (count = 2) where every feature is observed (1). As a result, this pattern has no missing data (nmis = 0). Now examine the last row in the output below. There are 2 instances (count = 2) where columns $A$ and $B$ have missing values while column $C$ is observed. As a result, this pattern has 2 of 3 features missing (nmis = 2).

In [7]:
au.md_pattern(missing_data)

Unnamed: 0,count,A,B,C,nmis
0,2,1,1,1,0
1,3,1,1,0,1
2,1,1,0,1,1
3,2,0,0,1,2
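
One way to see where these counts come from is to group rows by their observed/missing indicators. The sketch below uses plain pandas rather than autoimpute; the variable names are illustrative, and the row order may differ from the `md_pattern` output above.

```python
import numpy as np
import pandas as pd

missing_data = pd.DataFrame({
    "A": [1, 5, 9, 6, 12, 11, np.nan, np.nan],
    "B": [2, 4, 3, 6, 11, np.nan, np.nan, np.nan],
    "C": [-1, 1, np.nan, np.nan, np.nan, -1, 1, 0],
})

features = list(missing_data.columns)

# 1 = observed, 0 = missing (the convention used by md_pattern).
observed = missing_data.notna().astype(int)

# Count how many rows share each distinct observed/missing pattern.
pattern = observed.groupby(features).size().reset_index(name="count")

# Number of missing features in each pattern.
pattern["nmis"] = len(features) - pattern[features].sum(axis=1)
pattern = pattern.sort_values("nmis").reset_index(drop=True)
print(pattern)
```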


**`md_pairs`** counts the number of missingness pair types between each set of features in a dataset. The pair types are:
1. `rr`: response-response pairs
2. `rm`: response-missing pairs
3. `mr`: missing-response pairs
4. `mm`: missing-missing pairs

The method returns a square matrix for each pair type. In the output below, the name of each pair type is capitalized to remain consistent with the LaTeX matrix notation used in this tutorial. `rr` and `mm` are symmetric, as the number of observed-observed or missing-missing pairs is the same regardless of which feature comes first. For example, $RR_{A,B}$ indicates that there are 5 instances where $A$ and $B$ are both observed. Note that $RR_{A,B} = RR_{B,A} = 5$. As another example, $MR_{A,C}$ indicates that there are 2 instances where $A$ is missing and $C$ is observed. Note that $MR_{A,C} = RM_{C,A} = 2$.

In [8]:
pairs = au.md_pairs(missing_data)
for pair_name, pair_data in pairs.items():
    print(f"{pair_name.upper()}\n{'-'*10}")
    print(f"{pair_data}")

RR
----------
   A  B  C
A  6  5  3
B  5  5  2
C  3  2  5
RM
----------
   A  B  C
A  0  1  3
B  0  0  3
C  2  3  0
MR
----------
   A  B  C
A  0  0  2
B  1  0  3
C  3  3  0
MM
----------
   A  B  C
A  2  2  0
B  2  3  0
C  0  0  3
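
The four pair matrices can also be written as products of indicator matrices, which makes the symmetry of `rr` and `mm` and the relation $MR = RM^T$ easy to see. This is a sketch in plain pandas under that formulation, not autoimpute's implementation; `pair_counts` is an illustrative name.

```python
import numpy as np
import pandas as pd

missing_data = pd.DataFrame({
    "A": [1, 5, 9, 6, 12, 11, np.nan, np.nan],
    "B": [2, 4, 3, 6, 11, np.nan, np.nan, np.nan],
    "C": [-1, 1, np.nan, np.nan, np.nan, -1, 1, 0],
})

r = missing_data.notna().astype(int)  # 1 where a value is observed
m = missing_data.isna().astype(int)   # 1 where a value is missing

# Entry (j, k) of each product counts the rows where feature j is in
# the first state and feature k is in the second state.
pair_counts = {
    "rr": r.T @ r,
    "rm": r.T @ m,
    "mr": m.T @ r,
    "mm": m.T @ m,
}

for name, counts in pair_counts.items():
    print(name.upper())
    print(counts)
```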


#### Missingness Statistics
The `utils` module includes statistics to examine the effect of missing data on potential feature importance. These methods help assess which features may or may not be good candidates to be imputed or to assist in the imputation of other features.

**`inbound`** represents the proportion of usable cases in each column that can be used to impute the feature in each row. The diagonal of the matrix is zero, as a feature cannot be used to impute itself. A high value in an element indicates that the column is useful to impute the row; a low value indicates that it is not. In the output below, we see that $I_{A,B} = 0$, because there are $0$ instances where $B$ is observed while $A$ is missing. This finding suggests that **$B$ is not helpful to impute $A$**. By contrast, $C$ is always observed when $A$ is missing ($I_{A,C} = 1$), so $C$ is useful to impute $A$. For features we are interested in imputing, we want at least one (and preferably all) high values across their inbound row.

In [9]:
au.inbound(missing_data)

Unnamed: 0,A,B,C
A,0.0,0.0,1.0
B,0.333333,0.0,1.0
C,1.0,1.0,0.0
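
The inbound values above match the ratio $I_{jk} = \frac{mr_{jk}}{mr_{jk} + mm_{jk}}$, that is, the share of cases with row feature $j$ missing in which column feature $k$ is observed. The sketch below computes it from indicator matrices in plain pandas. Reading `inbound` this way is an assumption based on the output, and `inbound_check` is an illustrative name.

```python
import numpy as np
import pandas as pd

missing_data = pd.DataFrame({
    "A": [1, 5, 9, 6, 12, 11, np.nan, np.nan],
    "B": [2, 4, 3, 6, 11, np.nan, np.nan, np.nan],
    "C": [-1, 1, np.nan, np.nan, np.nan, -1, 1, 0],
})

m = missing_data.isna().astype(int)  # 1 where a value is missing
r = 1 - m                            # 1 where a value is observed

mr = m.T @ r  # row feature missing, column feature observed
mm = m.T @ m  # both features missing

# mr + mm is the number of missing values in the row feature.
# A feature with no missing values would give 0 / 0 = NaN here.
inbound_check = mr / (mr + mm)
print(inbound_check)
```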


**`outbound`** represents how well the observed values of each row feature connect to the missing values of each column feature. The diagonal of the matrix is zero, as a feature cannot be connected to itself. A high value in an element indicates that the column is missing in a large share of the cases where the row is observed; a low value indicates that it is missing in few of them. For example, $O_{B,C} = 0.6$. $B$ has 5 observed values. In 3 of those 5 cases, $C$ is missing, so outbound = 3/5 = 0.6. This finding suggests that observed data in $B$ is well connected to missing data in $C$, and $B$ may be helpful to impute $C$. We prefer that features have high outbound values in the columns that they are used to impute.

In [10]:
au.outbound(missing_data)

Unnamed: 0,A,B,C
A,0.0,0.166667,0.5
B,0.0,0.0,0.6
C,0.4,0.6,0.0
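
The outbound values above match the ratio $O_{jk} = \frac{rm_{jk}}{rm_{jk} + rr_{jk}}$, that is, the share of cases with row feature $j$ observed in which column feature $k$ is missing. As with inbound, the following is a plain pandas sketch under that assumed reading, with `outbound_check` as an illustrative name.

```python
import numpy as np
import pandas as pd

missing_data = pd.DataFrame({
    "A": [1, 5, 9, 6, 12, 11, np.nan, np.nan],
    "B": [2, 4, 3, 6, 11, np.nan, np.nan, np.nan],
    "C": [-1, 1, np.nan, np.nan, np.nan, -1, 1, 0],
})

m = missing_data.isna().astype(int)  # 1 where a value is missing
r = 1 - m                            # 1 where a value is observed

rm = r.T @ m  # row feature observed, column feature missing
rr = r.T @ r  # both features observed

# rm + rr is the number of observed values in the row feature.
outbound_check = rm / (rm + rr)
print(outbound_check)
```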


**`flux`** collects five statistics in one method. They are:
1. `ainb`: average inbound
2. `aout`: average outbound
3. `influx`: influx coefficient
4. `outflux`: outflux coefficient
5. `pobs`: proportion observed

Of interest here are the **`influx`** and **`outflux`** statistics.

**`influx`** $\rightarrow I_j = \frac{\sum_k mr_{jk}}{\sum_k (mr_{jk} + rr_{jk})}$. For feature $j$, the number of `mr` pairs across all features $k$ divided by the sum of `mr` and `rr` pairs. A completely observed feature has influx 0; a completely missing feature has influx 1. For two features with the same proportion of missing values, the one **with higher influx is "easier" to impute.**

**`outflux`** $\rightarrow O_j = \frac{\sum_k rm_{jk}}{\sum_k (rm_{jk} + mm_{jk})}$. For feature $j$, the number of `rm` pairs across all features $k$ divided by the sum of `rm` and `mm` pairs. A completely missing feature has outflux 0; a completely observed feature has outflux 1. For two features with the same proportion of missing values, the one with **higher outflux is better connected and thus a better imputer.**

In [11]:
au.flux(missing_data)

Unnamed: 0,ainb,aout,influx,outflux,pobs
A,0.5,0.333333,0.125,0.5,0.75
B,0.666667,0.3,0.25,0.375,0.625
C,1.0,0.5,0.375,0.625,0.625
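
The `influx` and `outflux` columns above can be reproduced by summing pair counts per feature. For each feature, the `mr` + `rr` pairs summed over all features equal the total number of observed cells, and the `rm` + `mm` pairs equal the total number of missing cells. The sketch below uses those totals as denominators; it is written in plain pandas with illustrative names and is not autoimpute's implementation.

```python
import numpy as np
import pandas as pd

missing_data = pd.DataFrame({
    "A": [1, 5, 9, 6, 12, 11, np.nan, np.nan],
    "B": [2, 4, 3, 6, 11, np.nan, np.nan, np.nan],
    "C": [-1, 1, np.nan, np.nan, np.nan, -1, 1, 0],
})

m = missing_data.isna().astype(int)  # 1 where a value is missing
r = 1 - m                            # 1 where a value is observed

mr = m.T @ r  # row feature missing, column feature observed
rm = r.T @ m  # row feature observed, column feature missing

total_observed = r.to_numpy().sum()  # observed cells in the dataset
total_missing = m.to_numpy().sum()   # missing cells in the dataset

# Sum each feature's pair counts over all other features.
influx_check = mr.sum(axis=1) / total_observed
outflux_check = rm.sum(axis=1) / total_missing

print(pd.DataFrame({"influx": influx_check, "outflux": outflux_check}))
```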
