ht-diva/mmid

mmid

A Python package for Multi-Modal Integration and Downstream analyses for healthcare analytics

Getting started

Requirements

Python 3.7

Installation

# Create a conda environment (recommended)
conda create -n mmid python=3.7
conda activate mmid

# Install mmid
pip install git+https://github.com/andmarver/mmid

Run integration and downstream analysis

mmid runs in Python with a single function call. The datasets to analyze and the selected integration and downstream algorithms are passed to the function through configuration files.

Here is an example:

# Import mmid
from mmid.mmid import mmid

# Run analysis
mmid(
    config_data="/path/to/data_config/data_config.yml",
    config_model="/path/to/model_config/model_config.yaml",
    cohort_path="/path/to/cohort/",
    cohort_file="cohort.csv",
    end_study_date="2025-12-31",
    out_path="/path/to/output/",
)

The function has the following parameters:

| Parameter name | Format | Default | Description | Example |
|---|---|---|---|---|
| config_data | Path | | Fully-qualified path of the YAML data configuration file | "/path/to/data_config/data_config.yml" |
| config_model | Path | | Fully-qualified path of the YAML model configuration file | "/path/to/model_config/model_config.yaml" |
| cohort_path | Path | | Path of the directory containing cohort_file | "/path/to/cohort/" |
| cohort_file | Filename | | Cohort file name (csv); if the file is inside a directory, the directory name is interpreted as the analyzed disease | "disease/cohort.csv" |
| end_study_date | yyyy-mm-dd | | Last observation date (i.e., end of the study) | "2025-12-31" |
| out_path | Path | | Path where the results are stored | "/path/to/output/" |
| cohort_cov | String | "" | Comma-separated covariates from cohort_file to include in the downstream analysis; leave blank to include none; use "Age_at_baseline" to include age | "CAT_Sex,Age_at_baseline" |
| baseline_field | String | "Assessment_date" | Field holding the initial observation date of the study: a single field (the default) or multiple comma-separated fields; in the latter case, optimal_baseline must be set to True to keep the most recent baseline | "Assessment_date,Second_assessment_date" |
| optimal_baseline | Boolean | False | Whether to use the optimal (i.e., most recent) subject-dependent initial observation date; must be set to True when multiple baselines are specified in baseline_field | True |
| latent_impute | Boolean | False | True to resolve missingness in the latent space, if possible | False |
| years_risk_classification | Integer | 5 | N for N-year risk classification (ignored if the downstream task is not classification) | 5 |
| feature_selection_classification | Boolean | False | True to use Sequential Forward Feature Selection based on logistic regression to select features for classification (ignored if the downstream task is not classification) | False |
| withdrawals | Path | None | Fully-qualified path of the withdrawals file (csv) | "/path/to/withdrawals/withdrawals.csv" |
| genetic_kinship | Path | None | Fully-qualified path of the genetic kinship file (csv); None to ignore genetic kinship in the analyses | "/path/to/genetic_kinship/genetic_kinship.csv" |
| genetic_kinship_exclude | Boolean | False | True to exclude subjects with any genetic kinship from cross-validation and test, according to the genetic_kinship file; must be False if genetic_kinship is None | True |
| scaling | Boolean | True | Whether to scale features | True |
| n_folds | Integer | 5 | Number of folds for the k-fold cross-validation used to assess the downstream model | 5 |
| test_size | Float | 0.2 | Proportion of the dataset to include in the held-out test split for downstream analyses | 0.2 |
| analysis_tag | String | "test" | Tag to identify the run/analysis | "test" |
| log_path | Path | None | Fully-qualified path of the log file to be saved to a file in out_path ("" or None to skip this operation) | "" |
| dpi | Integer | 300 | Figure quality (dpi) | 300 |
| seed | Integer | 42 | Random state (seed) | 42 |
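
Two of the parameters above are constrained by other parameters: multiple fields in baseline_field require optimal_baseline=True, and genetic_kinship_exclude=True requires a genetic_kinship file. The following is a minimal sketch of a check that could be run before calling mmid. The helper check_mmid_kwargs is hypothetical and not part of the package; it only encodes the two rules and the defaults stated in the parameter list, and says nothing about how mmid itself reacts to invalid combinations.

```python
def check_mmid_kwargs(**kwargs):
    """Return a list of problems found in keyword arguments intended for mmid().

    Hypothetical helper (not part of mmid): it only encodes the
    cross-parameter rules stated in the parameter list.
    """
    problems = []

    # Documented defaults, used when the caller omits a parameter.
    baseline_field = kwargs.get("baseline_field", "Assessment_date")
    optimal_baseline = kwargs.get("optimal_baseline", False)
    genetic_kinship = kwargs.get("genetic_kinship", None)
    genetic_kinship_exclude = kwargs.get("genetic_kinship_exclude", False)

    # Multiple comma-separated baselines require optimal_baseline=True.
    n_baselines = len([f for f in baseline_field.split(",") if f.strip()])
    if n_baselines > 1 and not optimal_baseline:
        problems.append("multiple baseline fields require optimal_baseline=True")

    # Kinship-based exclusion needs a kinship file to read from.
    if genetic_kinship is None and genetic_kinship_exclude:
        problems.append("genetic_kinship_exclude=True requires genetic_kinship")

    return problems
```

An empty list means both rules are satisfied. For instance, passing baseline_field="Assessment_date,Second_assessment_date" alone yields one problem, because optimal_baseline defaults to False.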

Data format

Configuration files

mmid relies on two YAML configuration files:

  • A data configuration file, where the user specifies details of the modality datasets (features to consider, number of factors in the integration, ...)
  • A model configuration file, where the user selects the integration and downstream algorithms (and their hyperparameters) for the analysis

Input datasets

mmid expects the following main input datasets:

  • One or more modality datasets (csv), each one containing features from a specific data source
    • One row per subject
    • One column per feature (feature name must not contain the substring _target)
      • Categorical feature names must start with the prefix CAT_ and use numeric values (e.g., 0/1 allowed, A/B/C not allowed)
    • uid feature (mandatory) for unique integer subject identifiers
  • A cohort dataset (csv) describing the cohort for downstream analyses, with the following structure (feature names must not contain the substring _target and must start with the prefix CAT_ when categorical):
| Feature | Format | Description | Example |
|---|---|---|---|
| uid | Integer | Unique subject identifier | 123456 |
| DOB | yyyy-mm | Date of birth | 1998-06 |
| Assessment_date (feature name can be changed) | yyyy-mm-dd | Initial observation date | 2026-03-20 |
| CAT_baselineEvent | Binary | 1 if the subject experienced a baseline event (e.g., a cardiovascular disease), which implies exclusion from downstream analyses if it happened before baseline; 0 otherwise | 1 |
| baselineEvent_date | yyyy-mm-dd | Date of the baseline event (blank when CAT_baselineEvent is 0) | 2026-01-01 |
| CAT_endpointEvent | Binary | 1 if the subject experienced the target event (e.g., coronary artery disease), 0 otherwise | 1 |
| endpointEvent_date | yyyy-mm-dd | Date of the target event (blank when CAT_endpointEvent is 0) | 2026-03-27 |
| CAT_Exit | Binary | 1 if the subject exited the observation study early, 0 if observed until the end of the study | 0 |
| Exit_date | yyyy-mm-dd | Date when the subject exited the observation study (blank when CAT_Exit is 0) | |
| CAT_Death | Binary | 1 if the subject died, 0 if alive | 0 |
| Death_date | yyyy-mm-dd | Date of the subject's death (blank when CAT_Death is 0) | |
| CAT_Sex (optional) | Binary | 1 for males, 0 for females | 0 |
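
The naming rules above (mandatory integer uid, no _target substring in feature names, numeric values in CAT_ columns) can be checked before a run. The sketch below uses only the Python standard library. The helper check_dataset_csv is hypothetical and not part of mmid, and treating blank CAT_ cells as missing values rather than errors is an assumption, not something the rules above state.

```python
import csv
import io


def check_dataset_csv(csv_text):
    """Check CSV text against the naming rules for modality and cohort datasets.

    Hypothetical helper, not part of mmid. Returns a list of problem strings.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    problems = []

    # Header-level rules: uid is mandatory, "_target" is reserved.
    if "uid" not in columns:
        problems.append("missing mandatory 'uid' column")
    for name in columns:
        if "_target" in name:
            problems.append(f"column '{name}' contains the substring '_target'")

    seen_uids = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if "uid" in columns:
            raw_uid = (row.get("uid") or "").strip()
            try:
                uid = int(raw_uid)
            except ValueError:
                problems.append(f"line {line_no}: uid '{raw_uid}' is not an integer")
            else:
                if uid in seen_uids:
                    problems.append(f"line {line_no}: duplicate uid {uid}")
                seen_uids.add(uid)

        # Categorical columns must hold numeric codes (e.g., 0/1, not A/B/C).
        for name in columns:
            if not name.startswith("CAT_"):
                continue
            value = (row.get(name) or "").strip()
            if not value:
                continue  # assumption: a blank cell is a missing value
            try:
                float(value)
            except ValueError:
                problems.append(f"line {line_no}: {name} value '{value}' is not numeric")

    return problems
```

To check a file on disk, pass its contents, e.g. check_dataset_csv(open("modality.csv").read()), where the file name is a placeholder.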

Optional inputs

Users can optionally specify:

  • Subjects who withdrew from the study and should thus be excluded from analyses (text file with one unique integer subject identifier per line, and no header)
  • Genetic kinships between study participants (csv), with the following structure:
| Feature | Format | Description | Example |
|---|---|---|---|
| uid | Integer | Unique subject identifier | 123456 |
| Genetic_kinship_to_other_participants | Binary | 1 if the subject has genetic kinship with at least one other study participant, 0 otherwise | 1 |
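
As an illustration of the two optional file layouts described above, the following sketch parses withdrawals content (one integer identifier per line, no header) and genetic kinship content (csv with the two columns listed). Both helper functions are hypothetical and only mirror the described formats; how mmid itself reads these files is not shown here.

```python
import csv
import io


def read_withdrawn_uids(text):
    """Parse withdrawals content: one integer uid per line, no header."""
    return {int(line) for line in text.splitlines() if line.strip()}


def read_related_uids(csv_text):
    """Return the uids flagged with genetic kinship (flag value 1)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        int(row["uid"])
        for row in reader
        if row["Genetic_kinship_to_other_participants"].strip() == "1"
    }
```

Both functions return Python sets, so a list of cohort identifiers can be filtered with an expression such as [u for u in cohort_uids if u not in withdrawn], where cohort_uids and withdrawn are placeholder names.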
