In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation (Wiki)
- Overview
- A free book on data imputation by the author of the mice package:
Flexible Imputation of Missing Data (2018) Stef van Buuren
- Missing-data imputation (Ch. 25) of Data Analysis Using Regression and Multilevel/Hierarchical Models (2006) Andrew Gelman, Jennifer Hill
- Missing Data: Our View of the State of the Art (2002) Joseph L. Schafer, John W. Graham
- A Review of Methods for Missing Data (2001) Therese D. Pigott
Types of missing data (Wiki)
- Missing completely at random (MCAR)
- Missing at random (MAR)
- Missingness that depends on unobserved variables (missing not at random, MNAR)
- Missingness that depends on the missing value itself (also MNAR)
- Listwise deletion (complete-case analysis)
- Samples (rows) are removed from a dataset if they have missing values. Probably the simplest and most popular approach. Many ML packages do this automatically
- When many variables have missing values, too few samples may remain after deletion
- May lead to biased estimates. A smaller sample size also increases standard errors
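A minimal sketch of listwise deletion in plain Python, assuming missing values are encoded as `None`. The toy rows and the helper name `listwise_delete` are hypothetical and only illustrate how quickly complete cases run out when several variables have gaps.

```python
# Toy dataset: None marks a missing value (hypothetical data for illustration).
rows = [
    {"age": 34, "income": 52000, "score": 0.71},
    {"age": None, "income": 48000, "score": 0.64},
    {"age": 29, "income": None, "score": 0.58},
    {"age": 41, "income": 61000, "score": None},
    {"age": 37, "income": 57000, "score": 0.80},
]


def listwise_delete(rows):
    """Keep only the rows in which every variable is observed."""
    return [row for row in rows if all(v is not None for v in row.values())]


complete_cases = listwise_delete(rows)
print(f"{len(complete_cases)} of {len(rows)} rows survive")
```

Each variable here is missing in only one row out of five, yet three of the five rows are discarded, because a single gap is enough to remove a row.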
- Available-case analysis (complete-variables analysis)
- Excluding variables from data if their missing-values rate is higher than some threshold
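A small sketch of filtering variables by their missing-value rate, again assuming `None` marks a missing entry. The function names and the threshold of 0.5 are hypothetical choices for illustration; a column is dropped when its missing rate exceeds the threshold.

```python
def missing_rate(rows, column):
    """Fraction of rows in which `column` is missing."""
    return sum(row[column] is None for row in rows) / len(rows)


def drop_sparse_columns(rows, threshold):
    """Drop every column whose missing rate exceeds `threshold`."""
    keep = [c for c in rows[0] if missing_rate(rows, c) <= threshold]
    return [{c: row[c] for c in keep} for row in rows]


# Hypothetical data: "a" is complete, "b" is 25% missing, "c" is 75% missing.
rows = [
    {"a": 1, "b": 2.0, "c": None},
    {"a": 2, "b": None, "c": None},
    {"a": 3, "b": 1.5, "c": 7},
    {"a": 4, "b": 0.5, "c": None},
]

reduced = drop_sparse_columns(rows, threshold=0.5)
print(list(reduced[0]))  # names of the columns that were kept
```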
Imputation (Wiki)
Whenever a single imputation strategy is used, the standard errors of estimates tend to be too low. The intuition here is that we have substantial uncertainty about the missing values, but by choosing a single imputation we in essence pretend that we know the true value with certainty (Data Analysis Using Regression and Multilevel/Hierarchical Models)
- Mean/Mode Replacement
- Replaces missing values with the variable's (column's) mean, mode or median
- Distorts the probability distribution of the imputed variable
- Distorts relationships between variables
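The distortion can be seen on a tiny example. The sketch below uses only the standard library, assumes `None` marks a missing value, and uses made-up numbers; `impute_mean` is a hypothetical helper name.

```python
import statistics


def impute_mean(values):
    """Replace every missing entry with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


column = [2.0, None, 4.0, 6.0, None, 8.0]
observed = [v for v in column if v is not None]
imputed = impute_mean(column)

var_observed = statistics.variance(observed)  # sample variance of observed values
var_imputed = statistics.variance(imputed)    # sample variance after imputation
print(var_observed, var_imputed)
```

The mean is unchanged, but the variance after imputation is smaller, because every imputed point sits exactly at the centre of the distribution.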
- LOCF Last Observation Carried Forward imputation (Wiki)
- In time-series data the last observed value before a missing one is "carried forward" to fill in the blank points
- Does analysis using “last observation carried forward” introduce bias in dementia research? Frank J. Molnar, Brian Hutton, Dean Fergusson
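A minimal LOCF sketch for a single series, assuming `None` marks a missing observation. The function name `locf` is an illustrative choice; note that a gap at the very start has no earlier observation to carry forward and is left missing.

```python
def locf(series):
    """Fill each missing point with the most recent observed value."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


measurements = [None, 3.1, None, None, 2.8, None]
print(locf(measurements))  # [None, 3.1, 3.1, 3.1, 2.8, 2.8]
```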
- Indicator variables
- Extra category that indicates missingness of a variable
- Extra binary indicator variable used together with a variable that includes missing data (works with continuous data too)
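Both variants can be sketched in a few lines. The example assumes `None` marks a missing value; the column names, the placeholder fill value of `0.0`, the `"missing"` category label, and the helper name `add_missing_indicator` are all hypothetical.

```python
def add_missing_indicator(rows, column, fill_value):
    """Fill missing entries of `column` and record where they were."""
    result = []
    for row in rows:
        is_missing = row[column] is None
        new_row = dict(row)
        new_row[column] = fill_value if is_missing else row[column]
        new_row[column + "_missing"] = int(is_missing)
        result.append(new_row)
    return result


rows = [
    {"income": 52.0, "sector": "retail"},
    {"income": None, "sector": None},
    {"income": 61.0, "sector": "health"},
]

# Categorical variable: missingness becomes an extra category.
with_category = [
    {**row, "sector": "missing" if row["sector"] is None else row["sector"]}
    for row in rows
]

# Continuous variable: placeholder value plus a binary indicator column.
with_indicator = add_missing_indicator(with_category, "income", fill_value=0.0)
print(with_indicator[1])
```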
- Regression Imputation
- A regression model is fitted with the incomplete variable as the target, then used to predict its missing values
- Deterministic regression imputation uses the original prediction of the regression model to impute missing values
- Stochastic regression imputation adds random error
- Article about the regression imputation method with examples:
Regression Imputation (Stochastic vs. Deterministic & R Example) Statistics Globe
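The difference between the two variants can be sketched with a one-predictor least-squares fit. This is an illustration under simplifying assumptions, not the implementation from the article above: `x` is fully observed, `y` has gaps marked by `None`, the noise is Gaussian with the residual standard deviation, and all names and data are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def regression_impute(xs, ys, stochastic=False, seed=0):
    """Impute missing ys from xs; optionally add random residual noise."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    # Residual standard deviation with n - 2 degrees of freedom.
    sigma = (sum(r * r for r in residuals) / (len(pairs) - 2)) ** 0.5
    rng = random.Random(seed)
    imputed = []
    for x, y in zip(xs, ys):
        if y is None:
            y = intercept + slope * x          # deterministic prediction
            if stochastic:
                y += rng.gauss(0.0, sigma)     # stochastic: add random error
        imputed.append(y)
    return imputed


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, None, 8.2, None, 11.8]

deterministic = regression_impute(xs, ys)
stochastic = regression_impute(xs, ys, stochastic=True, seed=1)
print(deterministic)
print(stochastic)
```

Deterministic imputations fall exactly on the fitted line, which overstates the strength of the relationship; the stochastic variant scatters them around the line.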
- Iterative Regression Imputation
- When multiple variables have missing values, IRI imputes them one at a time: the fully observed variables are used to impute the first incomplete variable, then the imputed variable together with the fully observed predictors is used to impute the second one, and so on
- SRMI Sequential Regression Multivariate Imputation
- A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models (2001) Trivellore E. Raghunathan, James M. Lepkowski, John Van Hoewyk, Peter Solenberger
- Cold-deck Imputation
- Impute missing data using previously collected datasets
- Hot-deck Imputation
- For each sample with a missing value, find a similar complete-case sample in the same dataset and use it for imputation
- Uses a scoring function to measure similarity
- A Review of Hot Deck Imputation for Survey Non-response (2010) Rebecca R. Andridge, Roderick J. A. Little
- KNN k-Nearest Neighbors
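A sketch of donor-based imputation with a user-supplied scoring function, assuming `None` marks a missing value. With `k=1` the value of the single most similar complete case is copied (a nearest-neighbour hot deck); with `k>1` the values of the `k` most similar donors are averaged. The data, the scoring rule, and the function names are hypothetical, and the averaging step assumes a numeric target.

```python
def donor_impute(rows, target, score, k=1):
    """Impute `target` from the k most similar complete cases (lower score = closer)."""
    donors = [row for row in rows if row[target] is not None]
    result = []
    for row in rows:
        if row[target] is None:
            nearest = sorted(donors, key=lambda d: score(row, d))[:k]
            value = sum(d[target] for d in nearest) / len(nearest)
            row = {**row, target: value}
        result.append(row)
    return result


def score(recipient, donor):
    """Hypothetical dissimilarity: age gap plus a penalty for a different region."""
    penalty = 0 if recipient["region"] == donor["region"] else 10
    return abs(recipient["age"] - donor["age"]) + penalty


rows = [
    {"age": 30, "region": "north", "income": 40.0},
    {"age": 35, "region": "north", "income": 44.0},
    {"age": 52, "region": "south", "income": 75.0},
    {"age": 47, "region": "south", "income": 71.0},
    {"age": 33, "region": "north", "income": None},
    {"age": 49, "region": "south", "income": None},
]

hot_deck = donor_impute(rows, "income", score, k=1)
knn = donor_impute(rows, "income", score, k=2)
print([r["income"] for r in hot_deck])
print([r["income"] for r in knn])
```

In practice the scoring function matters a great deal: features on different scales usually need to be standardised before distances are compared.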
- Model-based Imputation
- When something is known about why missing data exist, it's possible to directly model the missingness
- Multiple Imputation (Wiki)
- Imputed values are drawn multiple times from some distribution, and each completed dataset is analysed separately. The results are then pooled across all completed datasets to estimate uncertainty.
- Multiple Imputation for Nonresponse in Surveys (1987) Donald B. Rubin
- Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation (2001) Gary King, James Honaker, Anne Joseph, Kenneth Scheve
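The draw, analyse, pool cycle can be sketched for the simplest possible analysis, estimating a mean. The pooling step follows Rubin's rules: the pooled estimate is the average of the per-dataset estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The imputation model here is a deliberately crude assumption (a normal distribution fitted to the observed values, with parameters treated as known); a proper procedure would also draw the model parameters. All names and data are hypothetical.

```python
import random
import statistics


def pool_rubin(estimates, variances):
    """Pool m estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)    # variance between imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


def multiply_impute_mean(values, m=20, seed=0):
    """Estimate the mean of `values` from m stochastically completed datasets."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu, sigma = statistics.fmean(observed), statistics.stdev(observed)
    estimates, variances = [], []
    for _ in range(m):
        completed = [rng.gauss(mu, sigma) if v is None else v for v in values]
        estimates.append(statistics.fmean(completed))
        variances.append(statistics.variance(completed) / len(completed))
    return pool_rubin(estimates, variances)


values = [4.1, None, 5.3, 6.0, None, 4.8, 5.5, None]
estimate, total_variance = multiply_impute_mean(values)
print(estimate, total_variance ** 0.5)  # pooled mean and its standard error
```

The between-imputation term is what a single imputation leaves out, which is why single-imputation standard errors tend to be too low.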
- MICE Multivariate Imputation by Chained Equations (Homepage, Code, CRAN)
- A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models (2001) Trivellore E. Raghunathan, James M. Lepkowski, John Van Hoewyk, Peter Solenberger
- Multiple imputation of discrete and continuous data by fully conditional specification (2007) Stef van Buuren
- Multiple imputation by chained equations: what is it and how does it work? (2011) Melissa J. Azur, Elizabeth A. Stuart, Constantine Frangakis, Philip J. Leaf
- mice: Multivariate Imputation by Chained Equations in R (2011) Stef van Buuren, Karin Groothuis-Oudshoorn
- MissForest (Code, CRAN)
- MissForest - nonparametric missing value imputation for mixed-type data (2011) Daniel J. Stekhoven, Peter Bühlmann
- Optimal Transport (Code)
- Missing Data Imputation using Optimal Transport (2020) Boris Muzellec, Julie Josse, Claire Boyer, Marco Cuturi
- Autoencoders
- Missing data imputation in the electronic health record using deeply learned autoencoders Brett K. Beaulieu-Jones, Jason H. Moore
- Multiple Imputation for Biomedical Data using Monte Carlo Dropout Autoencoders (2020) Kristian Miok, Dong Nguyen-Doan, Marko Robnik-Šikonja, Daniela Zaharie
- GAIN Missing data imputation with GANs
- GAIN: Missing Data Imputation using Generative Adversarial Nets (2018) Jinsung Yoon, James Jordon, Mihaela van der Schaar
- RNNs
- Modeling Missing Data in Clinical Time Series with RNNs (2016) Zachary C. Lipton, David C. Kale, Randall Wetzel
- Estimating Missing Data in Temporal Data Streams Using Multi-directional Recurrent Neural Networks (2017) Jinsung Yoon, William R. Zame, Mihaela van der Schaar
- BRITS: Bidirectional Recurrent Imputation for Time Series (2018) Wei Cao, Dong Wang, Jian Li, Hao Zhou, Yitan Li, Lei Li
- GPs
- GP-VAE: Deep Probabilistic Time Series Imputation (2019) Vincent Fortuin, Dmitry Baranchuk, Gunnar Rätsch, Stephan Mandt