Multimodal Cancer Survival Analysis via Hypergraph Learning with Cross-Modality Rebalance, IJCAI 2025.
[arxiv]
Mingcheng Qu, Guang Yang, Donglin Di, Tonghua Su, Yue Gao, Yang Song, Lei Fan*
```bibtex
@misc{qu2025multimodalcancersurvivalanalysis,
      title={Multimodal Cancer Survival Analysis via Hypergraph Learning with Cross-Modality Rebalance},
      author={Mingcheng Qu and Guang Yang and Donglin Di and Tonghua Su and Yue Gao and Yang Song and Lei Fan},
      year={2025},
      eprint={2505.11997},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.11997},
}
```

Summary: We propose a multimodal survival prediction framework that incorporates hypergraph learning to effectively capture both contextual and hierarchical details from pathology images. Moreover, it employs a modality rebalance mechanism and an interactive alignment fusion strategy to dynamically reweight the contributions of the two modalities, thereby mitigating the pathology-genomics imbalance.

To download diagnostic WSIs (formatted as .svs files), molecular feature data and other clinical metadata, please refer to the NIH Genomic Data Commons Data Portal and the cBioPortal. WSIs for each cancer type can be downloaded using the GDC Data Transfer Tool.
- Linux (Tested on Ubuntu 22.04)
- NVIDIA GPU (Tested on Nvidia GeForce RTX 4090 Ti) with CUDA 12.1
- Python (3.10.7)
To process WSIs, we follow a procedure similar to CLAM. First, tissue regions in each biopsy slide are segmented using Otsu's thresholding on a downsampled WSI via OpenSlide. Then, non-overlapping 256 × 256 patches are extracted from the segmented tissue regions at the desired magnification. A pretrained truncated ResNet50 is used to encode these raw image patches into 1024-dimensional feature vectors, which are saved as .h5 files for each WSI. These extracted features serve as input (in .h5 format) for subsequent graph construction. The following folder structure is assumed for the extracted feature vectors.
```
DATA_ROOT_DIR/
├── TCGA_BLCA/
│   └── h5_files/
│       ├── slide_1.h5
│       ├── slide_2.h5
│       └── ...
├── TCGA_BRCA/
│   └── h5_files/
│       ├── slide_1.h5
│       ├── slide_2.h5
│       └── ...
└── ...
```

DATA_ROOT_DIR is the base directory for all datasets / cancer types (e.g., the directory on your SSD). Within DATA_ROOT_DIR, each folder contains the .h5 feature files for that dataset / cancer type.
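As a quick sanity check, the extracted features can be inspected with `h5py`. The snippet below writes and reads a toy file mimicking the assumed CLAM-style layout; the dataset keys `features` and `coords` are an assumption based on CLAM's conventions, not verified against this repo's extractor:

```python
import h5py
import numpy as np

# Build a toy .h5 file with the assumed layout: per-patch feature
# vectors under "features" and patch locations under "coords".
with h5py.File("slide_demo.h5", "w") as f:
    f.create_dataset("features", data=np.random.rand(8, 1024).astype("float32"))
    f.create_dataset("coords", data=np.random.randint(0, 4096, size=(8, 2)))

# Reading back: one row per 256 x 256 patch, 1024-dim ResNet50 features.
with h5py.File("slide_demo.h5", "r") as f:
    feats = f["features"][:]   # shape (num_patches, 1024)
    coords = f["coords"][:]    # shape (num_patches, 2)
```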
Processed molecular profile features containing mutation status, copy number variation, and RNA-Seq abundance can be downloaded from the cBioPortal, which we include as CSV files in the following directory. For ordering gene features into gene embeddings, we used the following categorization of gene families (categorized via common features such as homology or biochemical activity) from MSigDB. Gene sets for homeodomain proteins and translocated cancer genes were not used due to overlap with transcription factors and oncogenes, respectively. The curation of "genomic signatures" can be modified to produce genomic embeddings that reflect unique biological functions.
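The grouping step above can be sketched as follows. The family names, genes, and expression values here are purely illustrative, and the simple "ordered vector per family" construction is a minimal sketch of the idea, not this repo's exact pipeline:

```python
# Hypothetical gene-family categorization in the spirit of the
# MSigDB gene families; names and values are illustrative only.
gene_families = {
    "tumor_suppressors": ["TP53", "RB1"],
    "oncogenes": ["KRAS", "MYC"],
}
rna_seq = {"TP53": 0.8, "RB1": 1.2, "KRAS": 2.5, "MYC": 0.3}

# One genomic embedding per family: genes ordered within their
# family, features concatenated into a fixed-order vector.
embeddings = {
    fam: [rna_seq[g] for g in genes]
    for fam, genes in gene_families.items()
}
```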
For evaluating the algorithm's performance, we partitioned each dataset using 5-fold cross-validation (stratified by the site of histology slide collection). Splits for each cancer type are found in the splits folder, each of which contains splits_{k}.csv for k = 1 to 5. In each splits_{k}.csv, the first column corresponds to the TCGA Case IDs used for training, and the second column corresponds to the TCGA Case IDs used for validation. Slides from one case are not distributed across training and validation sets. Alternatively, one could define their own splits; however, the files would need to follow this format. The dataset loader for these train-val splits is defined in the return_splits function in the SurvivalDatasetFactory.
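A custom split file in this two-column layout can be written and validated with pandas, for example. The column names and case IDs below are made up for illustration; only the column order (train first, validation second) follows the description above:

```python
import pandas as pd

# Toy splits file: column one holds training case IDs, column two
# validation case IDs (shorter columns are padded with blanks).
pd.DataFrame({
    "train": ["TCGA-XX-0001", "TCGA-XX-0002", "TCGA-XX-0003"],
    "val":   ["TCGA-XX-0004", None, None],
}).to_csv("splits_demo.csv", index=False)

splits = pd.read_csv("splits_demo.csv")
train_ids = splits["train"].dropna().tolist()
val_ids = splits["val"].dropna().tolist()

# No case may appear in both sets (cases are never split).
assert not set(train_ids) & set(val_ids)
```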
Refer to the scripts folder for source files to train SurvPath and the baselines presented in the paper. Refer to the paper for the hyperparameters required for training.
To initiate the graph-building process for WSI patches, use the following command. This approach first stores binary edges, which allows for random sampling, and then generates hyperedges based on these stored binary edges:

```bash
python extract_graph.py --h5_path H5_PATH --graph_save_path GRAPH_SAVE_PATH
```

- `--h5_path H5_PATH`: the path to your HDF5 file containing WSI patch data. Replace `H5_PATH` with the actual path to your data file.
- `--graph_save_path GRAPH_SAVE_PATH`: where the generated graph structures will be saved. Replace `GRAPH_SAVE_PATH` with your desired output directory or file path.
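The two-stage idea described above (binary edges first, hyperedges derived from them) can be sketched on toy data as follows. The k-NN construction, the value of k, and the "node plus its neighbours" hyperedge rule are assumptions for illustration, not this repo's exact implementation:

```python
import numpy as np

# Toy patch features: 6 patches, 4-dim vectors for brevity.
rng = np.random.default_rng(0)
feats = rng.random((6, 4))
k = 2

# Pairwise Euclidean distances; mask self-distances.
d = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
np.fill_diagonal(d, np.inf)
neighbours = np.argsort(d, axis=1)[:, :k]  # k nearest per node

# Stage 1: store binary edges (node -> neighbour pairs), which can
# later be randomly sampled.
edges = [(i, int(j)) for i in range(len(feats)) for j in neighbours[i]]

# Stage 2: each node together with its k neighbours forms one
# hyperedge over the stored binary edges.
hyperedges = [sorted({i, *map(int, neighbours[i])}) for i in range(len(feats))]
```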
For a quick start, you can also run the graph.sh script to generate WSI graph structures:
```bash
bash scripts/graph.sh
```

This script automates the processing of your data and creates graph structures suitable for training. Check and adjust the parameter settings within the script as needed to ensure compatibility with your specific data format.
Next, use the run.sh script to start training the SurvPath model and other baseline models.
```bash
bash scripts/run.sh
```

Before running this script, refer to the hyperparameter settings recommended in the paper, and adjust the relevant configurations in the script according to your experimental needs.