mmid

A Python package for Multi-Modal Integration and Downstream analyses for healthcare analytics

mmid
- Getting started
- Data format

Getting started

Requirements

Python 3.7

Installation

# Create a conda environment (recommended)
conda create -n mmid python=3.7
conda activate mmid

# Install mmid
pip install git+https://github.com/andmarver/mmid

Run integration and downstream analysis

mmid runs in Python with a single function call. The datasets to analyze and the selected integration and downstream algorithms are passed to the function through configuration files.

Here is an example:

# Import mmid
from mmid.mmid import mmid

# Run analysis
mmid(config_data="/path/to/data_config/data_config.yml", 
    config_model="/path/to/model_config/model_config.yaml", 
    cohort_path="/path/to/cohort/", 
    cohort_file="cohort.csv", 
    end_study_date="2025-12-31", 
    out_path="/path/to/output/")

The function has the following parameters:

Parameter name	Format	Default	Description	Example
`config_data`	Path		Fully-qualified path of the YAML data configuration file	`"/path/to/data_config/data_config.yml"`
`config_model`	Path		Fully-qualified path of the YAML model configuration file	`"/path/to/model_config/model_config.yaml"`
`cohort_path`	Path		Path of the `cohort_file`'s directory	`"/path/to/cohort/"`
`cohort_file`	Filename		Cohort file name (csv); if inside a directory, the directory name will be interpreted as the analyzed disease	`"disease/cohort.csv"`
`end_study_date`	yyyy-mm-dd		Last observation date (i.e., end of the study)	`"2025-12-31"`
`out_path`	Path		Path where to store the results	`"/path/to/output/"`
`cohort_cov`	String	`""`	Comma-separated covariates of `cohort_file` considered for the downstream analysis: blank for not considering covariates from `cohort_file` for the downstream analysis; `"Age_at_baseline"` for considering age	`"CAT_Sex,Age_at_baseline"`
`baseline_field`	String	`"Assessment_date"`	Field holding the initial observation date of the study: a single field (the default) or multiple comma-separated fields; in the latter case, `optimal_baseline` must be set to `True` to keep the most recent baseline	`"Assessment_date,Second_assessment_date"`
`optimal_baseline`	Boolean	`False`	Whether to use the optimal (i.e., most recent) subject-dependent initial observation date; it must be set to `True` when multiple baselines are specified in `baseline_field`	`True`
`latent_impute`	Boolean	`False`	`True` to solve missingness in the latent space, if possible	`False`
`years_risk_classification`	Integer	`5`	N for N-years risk classification (ignored if the downstream task is not classification)	`5`
`feature_selection_classification`	Boolean	`False`	`True` to use Sequential Forward Feature Selection based on logistic regression to select features for classification (ignored if the downstream task is not classification)	`False`
`withdrawals`	Path	`None`	Fully-qualified path of the withdrawals file (csv)	`"/path/to/withdrawals/withdrawals.csv"`
`genetic_kinship`	Path	`None`	Fully-qualified path of the genetic kinship file (csv); `None` for ignoring genetic kinship in the analyses	`"/path/to/genetic_kinship/genetic_kinship.csv"`
`genetic_kinship_exclude`	Boolean	`False`	`True` to exclude subjects with any genetic kinship from cross-validation and test, according to the `genetic_kinship` file; if `genetic_kinship` is `None`, `genetic_kinship_exclude` must be set to `False`	`True`
`scaling`	Boolean	`True`	Whether to scale features or not	`True`
`n_folds`	Integer	`5`	Number of folds for k-fold cross-validation downstream model assessment	`5`
`test_size`	Float	`0.2`	Proportion of the dataset to include in the held-out test split for downstream analyses	`0.2`
`analysis_tag`	String	`"test"`	Tag to identify the run/analysis	`"test"`
`log_path`	Path	`None`	Fully-qualified path of the log file to be saved to a file in `out_path` (`""` or `None` to skip this operation)	`""`
`dpi`	Integer	`300`	Figure quality (dpi)	`300`
`seed`	Integer	`42`	Random state (seed)	`42`
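
Some parameters in the table constrain each other: multiple comma-separated `baseline_field` values require `optimal_baseline=True`, and `genetic_kinship_exclude=True` requires a `genetic_kinship` file. The following is a minimal sketch of a check that could be run before calling `mmid`. Note that `check_mmid_args` is a hypothetical helper written for illustration; it is not part of the package and only encodes the constraints stated in the table.

```python
from datetime import datetime


def check_mmid_args(end_study_date, baseline_field="Assessment_date",
                    optimal_baseline=False, genetic_kinship=None,
                    genetic_kinship_exclude=False):
    """Return a list of problems found in the arguments (empty if none).

    Hypothetical helper: mmid itself may validate more, or differently.
    """
    problems = []

    # end_study_date must follow the yyyy-mm-dd format.
    try:
        datetime.strptime(end_study_date, "%Y-%m-%d")
    except ValueError:
        problems.append("end_study_date must be formatted as yyyy-mm-dd")

    # Multiple comma-separated baselines require optimal_baseline=True.
    baselines = [f.strip() for f in baseline_field.split(",") if f.strip()]
    if len(baselines) > 1 and not optimal_baseline:
        problems.append("multiple baseline fields require optimal_baseline=True")

    # Kinship-based exclusion needs a kinship file to read from.
    if genetic_kinship is None and genetic_kinship_exclude:
        problems.append("genetic_kinship_exclude=True requires a genetic_kinship file")

    return problems


# Two baseline fields without optimal_baseline=True are reported as a problem.
print(check_mmid_args("2025-12-31",
                      baseline_field="Assessment_date,Second_assessment_date"))
```

If the returned list is empty, the same keyword arguments can be passed on to `mmid(...)` together with the required paths.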

Data format

Configuration files

mmid relies on two YAML configuration files:

- A data configuration file, where the user specifies details of the modality datasets (features to consider, number of factors in the integration, ...)
  - Data configuration example
- A model configuration file, where the user selects the integration and downstream algorithms (and their hyperparameters) for the analysis
  - Model configuration example

Input datasets

mmid expects the following main input datasets:

- One or more modality datasets (csv), each containing features from a specific data source
  - One row per subject
  - One column per feature (feature names must not contain the substring _target)
    - Categorical feature names must start with the prefix CAT_ and their values must be numeric (e.g., 0/1 allowed, A/B/C not allowed)
  - uid feature (mandatory) for unique integer subject identifiers
- A cohort dataset (csv) describing the cohort for downstream analyses, with the following structure (feature names must not contain the substring _target and must start with the prefix CAT_ when categorical):

Feature	Format	Description	Example
`uid`	Integer	Unique subject identifier	123456
`DOB`	yyyy-mm	Date of birth	1998-06
`Assessment_date` (feature name can be changed)	yyyy-mm-dd	Initial observation date	2026-03-20
`CAT_baselineEvent`	Binary	1 if the subject experienced a baseline event (e.g., a cardiovascular disease), which implies exclusion from downstream analyses if it occurred before baseline; 0 otherwise	1
`baselineEvent_date`	yyyy-mm-dd	Date of the baseline event (blank when `CAT_baselineEvent` is 0)	2026-01-01
`CAT_endpointEvent`	Binary	1 if the subject experienced the target event (e.g., coronary artery disease), 0 otherwise	1
`endpointEvent_date`	yyyy-mm-dd	Date of the target event (blank when `CAT_endpointEvent` is 0)	2026-03-27
`CAT_Exit`	Binary	1 if the subject exited the observation study early, 0 if observed until the end of the study	0
`Exit_date`	yyyy-mm-dd	Date when the subject exited the observation study (blank when `CAT_Exit` is 0)
`CAT_Death`	Binary	1 if the subject died, 0 if alive	0
`Death_date`	yyyy-mm-dd	Date of subject's death (blank when `CAT_Death` is 0)
`CAT_Sex` (optional)	Binary	1 for males, 0 for females	0
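
The cohort rules above (date formats, binary flags, date columns left blank when the matching flag is 0) can be checked before running an analysis. Below is a minimal sketch using only the standard library. `check_cohort` and `cell` are hypothetical helpers, not part of mmid, and the requirement that a date is present when its flag is 1 is an assumption that the table does not state explicitly.

```python
import csv
import io
from datetime import datetime

# Flag column -> date column that must be blank when the flag is 0.
FLAG_DATE_PAIRS = {
    "CAT_baselineEvent": "baselineEvent_date",
    "CAT_endpointEvent": "endpointEvent_date",
    "CAT_Exit": "Exit_date",
    "CAT_Death": "Death_date",
}


def is_date(value, fmt="%Y-%m-%d"):
    """True if value parses with the given strptime format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def cell(row, name):
    """Value of a column as a stripped string ('' if missing)."""
    return (row.get(name) or "").strip()


def check_cohort(csv_text, baseline_field="Assessment_date"):
    """Return a list of (uid, message) problems found in a cohort csv."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    # Header-level rule: no feature name may contain "_target".
    for name in reader.fieldnames or []:
        if "_target" in name:
            problems.append((None, f"column {name!r} contains '_target'"))
    for row in reader:
        uid = cell(row, "uid")
        if not uid.isdigit():
            problems.append((uid, "uid must be an integer"))
        if not is_date(cell(row, "DOB"), "%Y-%m"):
            problems.append((uid, "DOB must be yyyy-mm"))
        if not is_date(cell(row, baseline_field)):
            problems.append((uid, f"{baseline_field} must be yyyy-mm-dd"))
        for flag, date_col in FLAG_DATE_PAIRS.items():
            flag_value, date_value = cell(row, flag), cell(row, date_col)
            if flag_value not in ("0", "1"):
                problems.append((uid, f"{flag} must be 0 or 1"))
            elif flag_value == "0" and date_value:
                problems.append((uid, f"{date_col} must be blank when {flag} is 0"))
            elif flag_value == "1" and not is_date(date_value):
                # Assumption: a flagged event always carries a date.
                problems.append((uid, f"{date_col} must be yyyy-mm-dd when {flag} is 1"))
    return problems


# One subject, built from the example values in the table above.
EXAMPLE = (
    "uid,DOB,Assessment_date,CAT_baselineEvent,baselineEvent_date,"
    "CAT_endpointEvent,endpointEvent_date,CAT_Exit,Exit_date,"
    "CAT_Death,Death_date,CAT_Sex\n"
    "123456,1998-06,2026-03-20,1,2026-01-01,1,2026-03-27,0,,0,,0\n"
)
print(check_cohort(EXAMPLE))
```

To check a file on disk, read it into a string first (for example with `open(path).read()`) and pass the name of the baseline column if it differs from `Assessment_date`.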

Optional inputs

Users can optionally specify:

- Subjects who withdrew from the study and should thus be excluded from analyses (text file with one unique integer subject identifier per line, and no header)
- Genetic kinships between study participants (csv), with the following structure:

Feature	Format	Description	Example
`uid`	Integer	Unique subject identifier	123456
`Genetic_kinship_to_other_participants`	Binary	1 if subject has genetic kinship with at least another study participant, 0 otherwise	1
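
As an illustration of the two optional file layouts, here is a minimal sketch that writes both files with the standard library. `write_optional_inputs` is a hypothetical helper, not part of mmid; the file names are taken from the parameter examples above, and the identifiers are made up.

```python
import csv
import os
import tempfile


def write_optional_inputs(directory, withdrawn_uids, kinship_flags):
    """Write the withdrawals and genetic kinship files; return their paths.

    withdrawn_uids: iterable of integer subject identifiers.
    kinship_flags: dict mapping uid -> 0 or 1.
    """
    withdrawals_path = os.path.join(directory, "withdrawals.csv")
    kinship_path = os.path.join(directory, "genetic_kinship.csv")

    # Withdrawals: one unique integer identifier per line, no header.
    with open(withdrawals_path, "w", newline="") as handle:
        for uid in sorted(set(withdrawn_uids)):
            handle.write(f"{int(uid)}\n")

    # Kinship: header row followed by one (uid, binary flag) row per subject.
    with open(kinship_path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["uid", "Genetic_kinship_to_other_participants"])
        for uid, flag in sorted(kinship_flags.items()):
            if flag not in (0, 1):
                raise ValueError(f"kinship flag for uid {uid} must be 0 or 1")
            writer.writerow([int(uid), flag])

    return withdrawals_path, kinship_path


# Write both files to a temporary directory with made-up identifiers.
out_dir = tempfile.mkdtemp()
withdrawals_file, kinship_file = write_optional_inputs(
    out_dir, withdrawn_uids=[123457, 123456], kinship_flags={123456: 1, 123458: 0}
)
```

The returned paths can then be passed to `mmid` through the `withdrawals` and `genetic_kinship` parameters.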
