# Inverse Probability of Missing Weights

## Single Missing Variable

This tutorial describes inverse probability of missing weights (IPMW) for a single missing variable. I will describe the general usage of IPMW, broadly how they work, and demonstrate their usage within *zEpid*.

In the following example, we will use a simulated data set that comes with *zEpid*. For the example, our question of interest is the proportion of individuals in our sample who died by week 45. First we will load the data and look at the variable ``dead``.

In [1]:
import numpy as np
import pandas as pd

import zepid
from zepid import load_sample_data
from zepid.causal.ipw import IPMW

print(zepid.__version__)

0.9.0


In [2]:
df = load_sample_data(timevary=False).drop(columns=['cd4_wk45'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 547 entries, 0 to 546
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   id      547 non-null    int64  
 1   male    547 non-null    int64  
 2   age0    547 non-null    int64  
 3   cd40    547 non-null    int64  
 4   dvl0    547 non-null    int64  
 5   art     547 non-null    int64  
 6   dead    517 non-null    float64
 7   t       547 non-null    float64
dtypes: float64(2), int64(6)
memory usage: 38.5 KB


From the output we can see that 30 of the 547 individuals in the sample are missing a value for ``dead``. From here we can proceed under two different assumptions: 1) assume that ``dead`` is missing completely at random, or 2) assume that ``dead`` is missing at random.

*Note*: this example uses survival data, where an approach that allows for censoring (e.g. the Kaplan-Meier estimator) is a better approach. For the sake of the example, we will imagine that we don't have survival times (i.e. ``t`` is not included in our data set).

### 1) Missing Completely at Random
The assumption of missing completely at random (MCAR) means that the missingness of ``dead`` is unrelated to any other variable. Under this assumption, the mean in the observed data is an acceptable replacement for the mean in the data *if we had observed* ``dead`` for all 547 individuals. We can use the standard probability estimator

$$\widehat{\Pr}(Y) = \frac{\sum_{i=1}^n I(Y_i=1)}{n}$$

which is a consistent estimator under MCAR (meaning that as the sample size goes to infinity, the estimate converges to the truth). In the above equation, $n$ is the number of individuals with ``dead`` observed (517 here) and $I(\cdot)$ is the indicator function, which takes a value of $1$ when its argument is true and $0$ otherwise. We can easily implement this estimator by taking the mean of ``dead`` with ``numpy.mean``.

In [3]:
# Proportion dead at t=45 assuming dead is Missing Completely at Random
print('MCAR Mean:', np.round(np.mean(df['dead']), 3))

MCAR Mean: 0.168


Under the assumption that ``dead`` is missing completely at random, we estimate that 16.8% of individuals died by week 45.
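
To make the role of the denominator explicit, here is a minimal sketch on a made-up series (the values in ``y`` are hypothetical and not taken from the sample data). It shows that the complete-case mean divides by the number of observed values, and that a plain NumPy array needs ``numpy.nanmean`` because ``numpy.mean`` on an array propagates ``NaN``.

```python
import numpy as np
import pandas as pd

# Hypothetical outcome with two missing values (not the zEpid sample data)
y = pd.Series([1, 0, 0, np.nan, 0, np.nan])

# Number of individuals with the outcome observed
n_obs = y.notnull().sum()

# Estimator written out: count of Y=1 divided by the number observed
manual = (y == 1).sum() / n_obs

# pandas skips missing values by default, so this matches the manual version
pandas_mean = y.mean()

# On a plain ndarray, np.mean returns NaN; np.nanmean ignores the missing values
array_mean = np.mean(y.to_numpy())
nan_mean = np.nanmean(y.to_numpy())

print(manual, pandas_mean, array_mean, nan_mean)
```

The three complete-case versions agree, while the plain array mean is ``NaN``. This is worth checking whenever the data are converted to arrays before averaging.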

MCAR is a strong assumption that is often unlikely to hold when missing data occur. We can make a less restrictive assumption. More specifically, we can assume that the missingness of ``dead`` is random conditional on some set of covariates. This is referred to as missing at random (MAR) in the missing data literature.

### 2) Missing at Random
Under this assumption, the missingness of ``dead`` depends only on some known set of observed variables. To account for the variables related to missingness, we will use IPMW (there are other approaches). First, let's introduce the mathematical notation to calculate IPMW. For individual $i$, their weight is

$$w_i = \frac{1}{\Pr(M_i=1 | L_i)}$$

where $M_i=1$ indicates that ``dead`` was observed for individual $i$, so the denominator is the probability of being observed given their covariates $L_i$. With the addition of IPMW, our estimator becomes

$$\widehat{\Pr_w}(Y) = \frac{\sum_{i=1}^n I(Y_i=1) \, w_i}{\sum_{i=1}^n w_i}$$
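
As a small worked illustration of this estimator, the sketch below uses a hand-built data set where missingness depends on a single binary covariate. The data frame ``toy`` and its columns ``L``, ``Y``, ``M``, and ``w`` are hypothetical, and the probability of being observed is estimated by simple stratum proportions rather than a regression model.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10 people with L=0 (all observed, 2 deaths) and
# 10 people with L=1 (only 5 observed, 3 deaths among them)
toy = pd.DataFrame({
    'L': [0] * 10 + [1] * 10,
    'Y': [1, 1] + [0] * 8 + [1, 1, 1, 0, 0] + [np.nan] * 5,
})

# M = 1 when the outcome is observed
toy['M'] = toy['Y'].notnull().astype(int)

# Pr(M=1 | L), estimated as the proportion observed within each level of L
pr_obs = toy.groupby('L')['M'].transform('mean')

# Weights are only defined for individuals with an observed outcome
toy['w'] = np.where(toy['M'] == 1, 1 / pr_obs, np.nan)

obs = toy.loc[toy['M'] == 1]
naive = obs['Y'].mean()                                      # complete-case mean
weighted = np.sum(obs['Y'] * obs['w']) / np.sum(obs['w'])   # IPMW mean

print('Complete-case:', np.round(naive, 3))
print('Weighted:', np.round(weighted, 3))
```

Here the observed individuals with ``L=1`` each receive a weight of 2, so they also stand in for the 5 individuals with ``L=1`` whose outcome is missing. The complete-case mean under-represents the higher-risk stratum, and the weights correct for that.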

In *zEpid*, inverse probability of missing weights can be calculated using the ``IPMW`` class. 

The first step is to initialize the ``IPMW`` class. We will give the class the following: the data set (``df``), the variable that has missing data (``dead``), and specify that we want unstabilized weights (``stabilized=False``).

Following that, we will specify the regression model we want to use. In the equations above, the variables in this model correspond to $L$. We will make the assumption that ``dead`` is missing at random conditional on the following variables: age (``age0`` modeled with a quadratic term), CD4 T cell count (``cd40`` modeled with a quadratic and cubic term), antiretroviral therapy (``art`` binary), and gender (``male`` binary).

In [4]:
# Creating functional form variables
df['age_sq'] = df['age0']**2
df['cd4_sq'] = df['cd40']**2
df['cd4_cu'] = df['cd40']**3

# Calculating IPMW
ipm = IPMW(df, missing_variable='dead', stabilized=False)
ipm.regression_models(model_denominator='art + male + age0 + age_sq + cd40 + cd4_sq + cd4_cu', 
                     print_results=True)
ipm.fit()  # Calculates the weights after the regression models are fit

Propensity Score Model
                  Generalized Linear Model Regression Results                   
Dep. Variable:     _observed_indicator_   No. Observations:                  547
Model:                              GLM   Df Residuals:                      539
Model Family:                  Binomial   Df Model:                            7
Link Function:                    logit   Scale:                          1.0000
Method:                            IRLS   Log-Likelihood:                -110.74
Date:                  Wed, 30 Dec 2020   Deviance:                       221.48
Time:                          10:49:10   Pearson chi2:                     548.
No. Iterations:                       6                                         
Covariance Type:              nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept

The console output provides the logistic regression model's coefficients and fit statistics. This output can be suppressed by setting ``print_results=False``.

From the ``IPMW`` class, we can now extract the calculated IPMW for each individual and add them to our data set. We can do that with the following code.
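
To give a sense of what a model like this does, here is a minimal sketch that fits a logistic regression for the probability of being observed and turns the predictions into weights. It is not *zEpid*'s internal code. The data are simulated, the names ``age``, ``observed``, ``p_hat``, and ``w`` are hypothetical, and the fit uses a hand-written Newton-Raphson loop (which coincides with IRLS for logistic regression) so that only NumPy is needed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated covariate and a hypothetical missingness mechanism:
# older individuals are more likely to have the outcome observed
age = rng.normal(40, 10, size=n)
true_p = 1 / (1 + np.exp(-(0.5 + 0.05 * (age - 40))))
observed = rng.binomial(1, true_p)

# Design matrix with an intercept and centered age
X = np.column_stack([np.ones(n), age - 40])

# Newton-Raphson updates for the logistic regression coefficients
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (observed - p)                    # score vector
    hess = X.T @ (X * (p * (1 - p))[:, None])      # information matrix
    step = np.linalg.solve(hess, grad)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

# Predicted probability of being observed, Pr(M=1 | L)
p_hat = 1 / (1 + np.exp(-X @ beta))

# Unstabilized weights, defined only for those who were observed
w = np.where(observed == 1, 1 / p_hat, np.nan)

print('Coefficients:', np.round(beta, 3))
print('Largest weight:', np.round(np.nanmax(w), 2))
```

Individuals who were unlikely to be observed but were observed anyway receive the largest weights, so inspecting the largest weight is a quick check for an unstable model.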

In [5]:
# adding weights into the original data set
df['w'] = ipm.Weight

# Calculating the weighted proportion of dead
print('MAR Mean:', np.round(np.sum(df['dead']*df['w']) / np.sum(df['w']), 3))

MAR Mean: 0.17


By week 45, 17.0% of individuals died, assuming that ``dead`` is missing at random, conditional on age, gender, ART status, and CD4 count. While the results from the two approaches are similar here, in practice they may diverge. Missing at random is a weaker assumption than missing completely at random.

Stabilized IPMW are also an option. Below is code to calculate the proportion of deaths with stabilized weights instead.

In [6]:
# Calculating IPMW
ipm = IPMW(df, missing_variable='dead', stabilized=True)
ipm.regression_models(model_denominator='art + male + age0 + age_sq + cd40 + cd4_sq + cd4_cu', 
                     print_results=False)
ipm.fit()  # Calculates the weights after the regression models are fit

# adding weights into the original data set
df['sw'] = ipm.Weight

# Calculating the weighted proportion of dead
print('MAR Mean:', np.round(np.sum(df['dead']*df['sw']) / np.sum(df['sw']), 3))

MAR Mean: 0.17
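
One way to see why the stabilized weights return the same estimate as the unstabilized weights is the short derivation below. It assumes the stabilizing numerator is the marginal probability of being observed, $\Pr(M=1)$, which is the same constant for every individual. That is an assumption about the numerator model, since it is not specified in the code above.

$$sw_i = \frac{\Pr(M_i=1)}{\Pr(M_i=1 | L_i)} = \Pr(M=1) \, w_i$$

$$\frac{\sum_{i=1}^n I(Y_i=1) \, sw_i}{\sum_{i=1}^n sw_i} = \frac{\Pr(M=1) \sum_{i=1}^n I(Y_i=1) \, w_i}{\Pr(M=1) \sum_{i=1}^n w_i} = \widehat{\Pr_w}(Y)$$

Because the constant cancels between the numerator and denominator, the two sets of weights give the same weighted proportion for a single mean. The choice can matter in other analyses, for example when the numerator model includes covariates.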


### Conclusion
I have briefly gone over assumptions regarding missing data, what IPMW are, and how to use them with *zEpid*. While presented in the simple case where we wanted a single mean, these weights are also valuable when used in conjunction with other inverse probability weights in analyses. Additionally, these weights are one way to deal with missing data under the missing at random assumption. Please view other tutorials for more information on functions in *zEpid*.

#### Further Readings
Sun, B., Perkins, N. J., Cole, S. R., Harel, O., Mitchell, E. M., Schisterman, E. F., & Tchetgen Tchetgen, E. J. (2017). Inverse-probability-weighted estimation for monotone and nonmonotone missing data. American Journal of Epidemiology, 187(3), 585-591.

Perkins, N. J., Cole, S. R., Harel, O., Tchetgen Tchetgen, E. J., Sun, B., Mitchell, E. M., & Schisterman, E. F. (2017). Principled approaches to missing data in epidemiologic studies. American Journal of Epidemiology, 187(3), 568-575.

Li, L., Shen, C., Li, X., & Robins, J. M. (2013). On weighting approaches for missing data. Statistical Methods in Medical Research, 22(1), 14-30.

Greenland, S., & Finkle, W. D. (1995). A critical look at methods for handling missing covariates in epidemiologic regression analyses. American Journal of Epidemiology, 142(12), 1255-1264.

Seaman, S. R., & White, I. R. (2013). Review of inverse probability weighting for dealing with missing data. Statistical Methods in Medical Research, 22(3), 278-295.