A Python package for Multi-Modal Integration and Downstream analyses for healthcare analytics
Python 3.7
# Create a conda environment (recommended)
conda create -n mmid python=3.7
conda activate mmid
# Install mmid
pip install git+https://github.com/andmarver/mmid

mmid runs in Python with a single function call. The datasets to analyze and the selected integration and downstream algorithms are passed to the function through configuration files.
Here is an example:
# Import mmid
from mmid.mmid import mmid
# Run analysis
mmid(config_data="/path/to/data_config/data_config.yml",
     config_model="/path/to/model_config/model_config.yaml",
     cohort_path="/path/to/cohort/",
     cohort_file="cohort.csv",
     end_study_date="2025-12-31",
     out_path="/path/to/output/")

The function has the following parameters:

| Parameter name | Format | Default | Description | Example |
|---|---|---|---|---|
| `config_data` | Path | | Fully-qualified path of the YAML data configuration file | "/path/to/data_config/data_config.yml" |
| `config_model` | Path | | Fully-qualified path of the YAML model configuration file | "/path/to/model_config/model_config.yaml" |
| `cohort_path` | Path | | Path of the directory containing `cohort_file` | "/path/to/cohort/" |
| `cohort_file` | Filename | | Cohort file name (csv); if inside a directory, the directory name is interpreted as the analyzed disease | "disease/cohort.csv" |
| `end_study_date` | yyyy-mm-dd | | Last observation date (i.e., end of the study) | "2025-12-31" |
| `out_path` | Path | | Path where the results are stored | "/path/to/output/" |
| `cohort_cov` | String | "" | Comma-separated covariates from `cohort_file` to consider in the downstream analysis: leave blank to use no covariates from `cohort_file`; use "Age_at_baseline" to consider age | "CAT_Sex,Age_at_baseline" |
| `baseline_field` | String | "Assessment_date" | Field holding the initial observation date of the study: single (default) or multiple (comma-separated); in the latter case, `optimal_baseline` must be set to True to keep the most recent baseline | "Assessment_date,Second_assessment_date" |
| `optimal_baseline` | Boolean | False | Whether to use the optimal (i.e., most recent) subject-dependent initial observation date; must be set to True when multiple baselines are specified in `baseline_field` | True |
| `latent_impute` | Boolean | False | True to resolve missingness in the latent space, if possible | False |
| `years_risk_classification` | Integer | 5 | N for N-year risk classification (ignored if the downstream task is not classification) | 5 |
| `feature_selection_classification` | Boolean | False | True to use Sequential Forward Feature Selection based on logistic regression to select features for classification (ignored if the downstream task is not classification) | False |
| `withdrawals` | Path | None | Fully-qualified path of the withdrawals file (csv) | "/path/to/withdrawals/withdrawals.csv" |
| `genetic_kinship` | Path | None | Fully-qualified path of the genetic kinship file (csv); None to ignore genetic kinship in the analyses | "/path/to/genetic_kinship/genetic_kinship.csv" |
| `genetic_kinship_exclude` | Boolean | False | True to exclude subjects with any genetic kinship from cross-validation and test, according to the `genetic_kinship` file; if `genetic_kinship` is None, `genetic_kinship_exclude` must be set to False | True |
| `scaling` | Boolean | True | Whether to scale features | True |
| `n_folds` | Integer | 5 | Number of folds for k-fold cross-validation in downstream model assessment | 5 |
| `test_size` | Float | 0.2 | Proportion of the dataset to include in the held-out test split for downstream analyses | 0.2 |
| `analysis_tag` | String | "test" | Tag to identify the run/analysis | "test" |
| `log_path` | Path | None | Fully-qualified path of the log file to be saved to a file in `out_path` ("" or None to skip this operation) | "" |
| `dpi` | Integer | 300 | Figure quality (dpi) | 300 |
| `seed` | Integer | 42 | Random state (seed) | 42 |

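
Some parameters constrain one another: multiple baselines require `optimal_baseline` to be True, and `genetic_kinship_exclude` requires a `genetic_kinship` file. The sketch below checks these rules before a run is launched. It is not part of mmid: `check_mmid_kwargs` is a hypothetical helper, and the range check on `test_size` is an assumption based on its description as a proportion.

```python
from datetime import datetime


def check_mmid_kwargs(kwargs):
    """Hypothetical pre-flight check (not part of mmid); returns a list of problems."""
    problems = []

    # end_study_date is documented as yyyy-mm-dd
    try:
        datetime.strptime(kwargs["end_study_date"], "%Y-%m-%d")
    except (KeyError, TypeError, ValueError):
        problems.append("end_study_date must be given as yyyy-mm-dd")

    # Multiple comma-separated baselines require optimal_baseline=True
    baselines = kwargs.get("baseline_field", "Assessment_date").split(",")
    if len(baselines) > 1 and not kwargs.get("optimal_baseline", False):
        problems.append("optimal_baseline must be True with multiple baseline fields")

    # genetic_kinship_exclude needs a genetic kinship file to act on
    if kwargs.get("genetic_kinship_exclude", False) and kwargs.get("genetic_kinship") is None:
        problems.append("genetic_kinship_exclude must be False when genetic_kinship is None")

    # Assumption: test_size is a proportion strictly between 0 and 1
    if not 0.0 < kwargs.get("test_size", 0.2) < 1.0:
        problems.append("test_size should be a proportion between 0 and 1")

    return problems


run_kwargs = dict(
    config_data="/path/to/data_config/data_config.yml",
    config_model="/path/to/model_config/model_config.yaml",
    cohort_path="/path/to/cohort/",
    cohort_file="cohort.csv",
    end_study_date="2025-12-31",
    out_path="/path/to/output/",
    cohort_cov="CAT_Sex,Age_at_baseline",
    baseline_field="Assessment_date,Second_assessment_date",
    optimal_baseline=True,
    n_folds=5,
    test_size=0.2,
    seed=42,
)

problems = check_mmid_kwargs(run_kwargs)
print(problems)

# With mmid installed and real paths in place, the run itself would be:
# from mmid.mmid import mmid
# if not problems:
#     mmid(**run_kwargs)
```

Keeping the arguments in a dictionary makes it easy to store them next to the results in `out_path` for later reference.
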
mmid relies on two YAML configuration files:
- A data configuration file, where the user inserts details on the modality datasets (features to consider, number of factors in the integration, ...)
- A model configuration file, where the user selects the integration and downstream algorithms (and their hyperparameters) for the analysis

mmid expects the following main input datasets:
- One or more modality datasets (csv), each one containing features from a specific data source
  - One row per subject
  - One column per feature (feature names must not contain the substring `_target`)
    - Categorical feature names must start with the prefix `CAT_`, and their values must be in numeric format (e.g., 0/1 allowed, A/B/C not allowed)
  - `uid` feature (mandatory) for unique integer subject identifiers
- A cohort dataset (csv) describing the cohort for downstream analyses, with the following structure (feature names must not contain the substring `_target` and must start with the prefix `CAT_` when categorical):

| Feature | Format | Description | Example |
|---|---|---|---|
| `uid` | Integer | Unique subject identifier | 123456 |
| `DOB` | yyyy-mm | Date of birth | 1998-06 |
| `Assessment_date` (feature name can be changed) | yyyy-mm-dd | Initial observation date | 2026-03-20 |
| `CAT_baselineEvent` | Binary | 1 if the subject experienced a baseline event (e.g., a cardiovascular disease) that implies exclusion from downstream analyses if it happened before baseline, 0 otherwise | 1 |
| `baselineEvent_date` | yyyy-mm-dd | Date of the baseline event (blank when `CAT_baselineEvent` is 0) | 2026-01-01 |
| `CAT_endpointEvent` | Binary | 1 if the subject experienced the target event (e.g., coronary artery disease), 0 otherwise | 1 |
| `endpointEvent_date` | yyyy-mm-dd | Date of the target event (blank when `CAT_endpointEvent` is 0) | 2026-03-27 |
| `CAT_Exit` | Binary | 1 if the subject exited the observation study early, 0 if observed until the end of the study | 0 |
| `Exit_date` | yyyy-mm-dd | Date when the subject exited the observation study (blank when `CAT_Exit` is 0) | |
| `CAT_Death` | Binary | 1 if the subject died, 0 if alive | 0 |
| `Death_date` | yyyy-mm-dd | Date of the subject's death (blank when `CAT_Death` is 0) | |
| `CAT_Sex` (optional) | Binary | 1 for males, 0 for females | 0 |

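
The cohort rules above (unique integer `uid`, date formats, event dates left blank when the matching flag is 0, no `_target` in feature names) can be checked before a run. The following is a minimal sketch using only the standard library. `check_cohort` is a hypothetical helper, not part of mmid, and it assumes that a date is expected whenever the corresponding flag is 1.

```python
import csv
import io
import re
from datetime import datetime

# Binary flag columns paired with the date column they control
FLAG_TO_DATE = {
    "CAT_baselineEvent": "baselineEvent_date",
    "CAT_endpointEvent": "endpointEvent_date",
    "CAT_Exit": "Exit_date",
    "CAT_Death": "Death_date",
}


def is_date(value):
    """True if value parses as yyyy-mm-dd."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


def check_cohort(csv_text, baseline_field="Assessment_date"):
    """Hypothetical checker (not part of mmid); returns a list of problems."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []

    required = ["uid", "DOB", baseline_field, *FLAG_TO_DATE, *FLAG_TO_DATE.values()]
    missing = [c for c in required if c not in columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = [f"column {c!r} contains '_target'" for c in columns if "_target" in c]
    seen = set()
    for i, row in enumerate(rows, start=1):
        uid = row["uid"] or ""
        if not uid.isdigit() or uid in seen:
            problems.append(f"row {i}: uid must be a unique integer")
        seen.add(uid)
        if not re.fullmatch(r"\d{4}-\d{2}", row["DOB"] or ""):
            problems.append(f"row {i}: DOB must be yyyy-mm")
        if not is_date(row[baseline_field]):
            problems.append(f"row {i}: {baseline_field} must be yyyy-mm-dd")
        for flag, date_col in FLAG_TO_DATE.items():
            date_value = row[date_col] or ""
            if row[flag] not in ("0", "1"):
                problems.append(f"row {i}: {flag} must be 0 or 1")
            elif row[flag] == "0" and date_value != "":
                problems.append(f"row {i}: {date_col} must be blank when {flag} is 0")
            elif row[flag] == "1" and not is_date(date_value):
                # Assumption: a yyyy-mm-dd date is expected when the flag is 1
                problems.append(f"row {i}: {date_col} must be yyyy-mm-dd when {flag} is 1")
    return problems


# One subject, using the example values from the table above
sample = (
    "uid,DOB,Assessment_date,CAT_baselineEvent,baselineEvent_date,"
    "CAT_endpointEvent,endpointEvent_date,CAT_Exit,Exit_date,"
    "CAT_Death,Death_date,CAT_Sex\n"
    "123456,1998-06,2026-03-20,1,2026-01-01,1,2026-03-27,0,,0,,0\n"
)
print(check_cohort(sample))
```

If a different baseline column is used, its name can be passed through `baseline_field`, mirroring the mmid parameter of the same name.
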
Users can optionally specify:
- Subjects who withdrew from the study and should thus be excluded from analyses (text file with one unique integer subject identifier per line, and no header)
- Genetic kinships between study participants (csv), with the following structure:

| Feature | Format | Description | Example |
|---|---|---|---|
| `uid` | Integer | Unique subject identifier | 123456 |
| `Genetic_kinship_to_other_participants` | Binary | 1 if the subject has genetic kinship with at least one other study participant, 0 otherwise | 1 |
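
As an illustration of the two optional files, the sketch below writes a withdrawals file (one integer identifier per line, no header) and a genetic kinship csv with the two columns described above. The identifiers, the temporary directory, and the file names are made up for the example.

```python
import csv
import os
import tempfile

withdrawn_uids = [123456, 234567]   # hypothetical subject identifiers
kinship = {123456: 1, 345678: 0}    # uid -> 1 if related to another participant

out_dir = tempfile.mkdtemp()

# Withdrawals: one unique integer uid per line, no header
withdrawals_path = os.path.join(out_dir, "withdrawals.csv")
with open(withdrawals_path, "w", newline="") as fh:
    for uid in withdrawn_uids:
        fh.write(f"{uid}\n")

# Genetic kinship: csv with a header row and one row per subject
kinship_path = os.path.join(out_dir, "genetic_kinship.csv")
with open(kinship_path, "w", newline="") as fh:
    writer = csv.writer(fh, lineterminator="\n")
    writer.writerow(["uid", "Genetic_kinship_to_other_participants"])
    for uid, flag in kinship.items():
        writer.writerow([uid, flag])

print(withdrawals_path)
print(kinship_path)
```

The resulting paths are what the `withdrawals` and `genetic_kinship` parameters of the function expect.
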