hdImpute in Python

A Python implementation of the batch-based hdImpute algorithm for high dimensional missing data problems.

This module is built in the same spirit as the more mature R version, which is based on the recent paper introducing the hdImpute method for high dimensional missing data problems. There are a few important distinctions, at least in the present version, listed below.

Note: This module is a work in progress. At present, the software includes the "individual" approach to the algorithm, which proceeds in three stages:

  1. Build the cross-feature correlation matrix (feature_cor)
  2. Flatten the matrix and rank features based on absolute correlations (flatten_mat)
  3. Impute batches of features based on correlation structure, of sizes determined by the user (impute_batches)

The current approach differs from the R implementation in the following ways (continued development will address these and other issues in time):

  • Only numeric features are supported. The algorithm will skip over any non-numeric features (e.g., strings, dates, times, etc.). These columns are appended after the final stage to return a data matrix of the same dimensions as the input data frame.
  • Instead of chained random forests, a similar algorithm in the same spirit is used: the imputation engine under the hood is IterativeImputer, which originated in fancyimpute and is now primarily maintained in scikit-learn. IterativeImputer is "a strategy for imputing missing values by modeling each feature with missing values as a function of other features in a round-robin fashion." A minimal sketch of the engine on its own follows this list.
  • As noted above, instead of a single wrapper function as in hdImpute in R, users must proceed in sequence through the three stages: (1) build the correlation matrix, (2) flatten and rank the features, and (3) impute the batches and join.
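
To make the second point concrete, here is a minimal sketch of scikit-learn's IterativeImputer on its own, outside of hdImpute's batching logic. It also mirrors the numeric-only behavior described in the first point: numeric columns are imputed, and non-numeric columns are carried through unchanged. The frame and column names here are made up for illustration.

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    'x1': [1.0, np.nan, 3.0, 4.0],
    'x2': [2.0, 2.5, np.nan, 4.5],
    'label': ['a', 'b', 'c', 'd'],  # non-numeric, so skipped by the imputer
})

# impute only the numeric columns
numeric = df.select_dtypes(include=np.number)
imputed = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(numeric),
    columns=numeric.columns, index=numeric.index
)

# re-attach the non-numeric columns, preserving the original column order
completed = pd.concat([imputed, df.select_dtypes(exclude=np.number)], axis=1)[df.columns]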

Usage

See the code, docs, and tests for more detail on the functions and process. A simple demonstration of usage with some synthetic data might look like this:

Define the data (with a few missing values).

import numpy as np
import pandas as pd
# feature_cor, flatten_mat, and impute_batches are assumed to be in scope
# (imported from this module)

data = pd.DataFrame({
    'Feature1': [1.0, 2.0, np.nan, 4.0, 5.0],
    'Feature2': [np.nan, 2.0, 3.0, np.nan, 5.0],
    'Feature3': [1.0, 2.0, 7.0, 4.0, 5.0],
    'Feature4': [1.0, np.nan, 10.0, 4.0, 5.0],
    'Feature5': ["a", "b", "c", "d", "e"]
})
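
A quick check of where the missing values sit before imputing:

data.isna().sum()  # Feature1: 1, Feature2: 2, Feature3: 0, Feature4: 1, Feature5: 0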

Build the cross-feature correlation matrix.

cor_out = feature_cor(data, return_cor=True) # (optional) returning to inspect

Flatten the matrix and rank features based on absolute correlations.

flat_out = flatten_mat(cor_out)
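
Conceptually, this stage unstacks the correlation matrix and sorts feature pairs by absolute correlation. A rough pandas illustration of the idea (not the module's actual implementation, and assuming cor_out is a pandas DataFrame of pairwise correlations):

# illustrative only -- flatten_mat handles this internally
ranked = (cor_out.abs()      # absolute correlations
          .unstack()         # long format: one row per feature pair
          .sort_values(ascending=False))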

Impute batches of features based on correlation structure.

# either store and inspect the object
imputed_data = impute_batches(data, flat_out, batch=2, decimal_places=2)
imputed_data

# or run directly to print the output
impute_batches(data, flat_out, batch=2, decimal_places=2)
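
After imputing, a quick sanity check that no missing values remain and the dimensions match the input:

assert imputed_data.isna().sum().sum() == 0  # all cells complete
assert imputed_data.shape == data.shape      # same dimensions as the input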

Importantly, users should always closely inspect the output to ensure that missingness is not only addressed (complete cases), but addressed in a reliable and reasonable way. For more on checking the quality of imputations, take a look at the mad() function in the R version. Development of a similar function for this Python module is forthcoming.
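
Until then, a rough stand-in when a complete reference version of the data exists (e.g., in a simulation where missingness was induced) is to compare imputed values against the truth column by column. A minimal sketch, where mad_check is a hypothetical helper (not part of this module) and complete_data is assumed to be a fully observed frame with the same shape:

import numpy as np

def mad_check(complete_data, imputed_data):
    # median absolute deviation between true and imputed values, per
    # numeric column; loosely modeled on mad() in the R version
    out = {}
    for col in imputed_data.select_dtypes(include=np.number).columns:
        out[col] = float((complete_data[col] - imputed_data[col]).abs().median())
    return out

mad_check(complete_data, imputed_data)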

Contribute

As mentioned, this Python version of hdImpute is very much under active development. Contributions in any form are appreciated: for example, opening an issue to report a bug or suggest a feature, or submitting a pull request.

Thanks!