This repository contains the implementation for CHMR, a structure-aware molecular representation model.
Molecular perturbations trigger cellular or genetic changes, but in practice one or more of these modalities is often missing. CHMR addresses this by progressively organizing multi-modal representations through augmentation, alignment, and hierarchical modeling, which forms the core motivation of this work.
- [2025.11] 🎉 Our paper Learning Cell-Aware Hierarchical Multi-Modal Representations for Robust Molecular Modeling has been accepted as an Oral Presentation at AAAI 2026! 🎤 [arXiv]
1. Download the dataset from Zenodo.
2. Create a folder named `raw_data/`.
3. Unzip `pretrain.zip` and place all of its contents inside the `raw_data/` directory.
For more information about the raw dataset, please refer to the InfoAlign project.
Make sure to install all required dependencies.
# 1️⃣ Upgrade pip
pip install --upgrade pip
# 2️⃣ Install PyTorch 2.0.1 (CUDA 11.8)
pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 --index-url https://download.pytorch.org/whl/cu118
# 3️⃣ Install PyTorch Geometric dependencies
pip install torch-scatter==2.1.1 torch-sparse==0.6.17 torch-cluster==1.6.3 torch-spline-conv==1.2.2 \
-f https://data.pyg.org/whl/torch-2.0.1+cu118.html
# 4️⃣ Install torch_geometric
pip install torch_geometric==2.5.2
# 5️⃣ Install remaining dependencies
pip install -r requirements.txt

To start the pretraining process, run:
python pretrain.py \
--model-path CHMR \
--lr 1e-4 \
--wdecay 1e-8 \
--epoch 50 \
--batch-size 3072 \
--lambda_1 10 \
--lambda_2 10

| Argument | Description |
|---|---|
| `--model-path` | Path to save the pretrained model (`.pt` file) |
| `--lr` | Learning rate |
| `--wdecay` | Weight decay |
| `--epoch` | Number of pretraining epochs |
| `--batch-size` | Training batch size |
| `--lambda_1` | Weight for the SCA module |
| `--lambda_2` | Weight for the Tree-VQ module |
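As a rough sketch of how the two weights enter the objective (the actual loss terms live in `pretrain.py`; the function and variable names below are illustrative placeholders, not this repository's API), `--lambda_1` and `--lambda_2` scale the SCA and Tree-VQ losses in a weighted sum:

```python
# Illustrative sketch only: `base_loss`, `sca_loss`, and `treevq_loss`
# are hypothetical placeholders for the losses computed in pretrain.py.

def total_loss(base_loss: float, sca_loss: float, treevq_loss: float,
               lambda_1: float = 10.0, lambda_2: float = 10.0) -> float:
    """Weighted sum of the base objective and the two auxiliary module losses."""
    return base_loss + lambda_1 * sca_loss + lambda_2 * treevq_loss

# With the defaults above (lambda_1 = lambda_2 = 10):
print(total_loss(1.0, 0.2, 0.1))  # 1.0 + 10*0.2 + 10*0.1 = 4.0
```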
After pretraining, we fine-tune the model on 9 molecular property prediction datasets.
For BACE, ClinTox, SIDER, and HIV, we apply a structure-aware ensemble strategy, which integrates CHMR predictions with a random forest baseline trained on the same split.
Generate random forest predictions:
python random_forest/extract_fingerprint.py --dataset [ogbg-molbace | ogbg-molclintox | ogbg-molsider | ogbg-molhiv]
python random_forest/random_forest.py --dataset [ogbg-molbace | ogbg-molclintox | ogbg-molsider | ogbg-molhiv]
The output `rf_pred.npy` will be saved to `raw_data/[dataset]/raw/rf_pred.npy`.
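The combination rule for the structure-aware ensemble is not spelled out above. As a minimal sketch (assuming both models emit per-sample probabilities and that a simple convex blend is used; the repository's actual rule may weight the two models differently), CHMR predictions could be merged with the saved `rf_pred.npy` like this:

```python
import numpy as np

# Hypothetical sketch: blend CHMR probabilities with the random-forest
# predictions saved as rf_pred.npy. `alpha` and the averaging rule are
# illustrative assumptions, not the repository's confirmed method.

def ensemble(chmr_pred: np.ndarray, rf_pred: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Convex combination of CHMR and random-forest probabilities."""
    return alpha * chmr_pred + (1.0 - alpha) * rf_pred

chmr = np.array([0.9, 0.2, 0.6])
rf = np.array([0.7, 0.4, 0.8])   # in practice: np.load("raw_data/[dataset]/raw/rf_pred.npy")
print(ensemble(chmr, rf))        # [0.8 0.3 0.7]
```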
| Dataset | LR | Dropout | γ | Hidden | Batch Size | Patience | Epoch | Norm | λ₁ | λ₂ |
|---|---|---|---|---|---|---|---|---|---|---|
| ChEMBL | 1e-3 | 0.9 | – | 1800 | 5120 | 50 | 300 | LayerNorm | 10 | 0.1 |
| ToxCast | 1e-2 | 0.8 | – | 1800 | 5120 | 50 | 300 | LayerNorm | 1 | 0.1 |
| Broad | 2e-3 | 0.8 | – | 1800 | 5120 | 50 | 300 | LayerNorm | 10 | 10 |
| BBBP | 1e-3 | 0.5 | – | 2400 | 5120 | 50 | 300 | LayerNorm | 10 | 10 |
| BACE | 5e-4 | 0.5 | 0.005 | 1800 | 16 | 5 | 100 | BatchNorm | 0.01 | 0.1 |
| ClinTox | 5e-3 | 0.9 | 0.01 | 2400 | 32 | 5 | 100 | LayerNorm | 0.1 | 10 |
| SIDER | 5e-4 | 0.2 | 0.1 | 1200 | 5120 | 20 | 30 | BatchNorm | 10 | 10 |
| HIV | 1e-3 | 0.8 | 0.001 | 1800 | 10240 | 50 | 300 | BatchNorm | 10 | 10 |
| Biogen | 2e-3 | 0.8 | – | 1200 | 5120 | 50 | 300 | LayerNorm | 10 | 1 |
python finetune.py \
--model-path CHMR \
--dataset finetune-biogenadme \
--lr 2e-3 \
--hidden 4 \
--batch-size 5120 \
--task_dropout 0.8

You can also substitute `--dataset` with:

- `finetune-chembl2k`
- `finetune-moltoxcast`
- `finetune-board6k`
- `finetune-molbbbp`
python finetune.py \
--model-path CHMR \
--dataset finetune-molbace \
--lr 5e-4 \
--gamma 0.005 \
--hidden 6 \
--batch-size 16 \
--task_dropout 0.5

You can also replace the dataset with:

- `finetune-molclintox`
- `finetune-molsider`
- `finetune-molhiv`
If you find this work useful, please cite:
Mengran Li, Zelin Zang, Wenbin Xing, Junzhou Chen, Ronghui Zhang, Jiebo Luo & Stan Z. Li. “Learning Cell-Aware Hierarchical Multi-Modal Representations for Robust Molecular Modeling”. arXiv:2511.21120, 2025.
BibTeX:
@article{Li2025Learning,
title = {Learning Cell-Aware Hierarchical Multi-Modal Representations for Robust Molecular Modeling},
author = {Li, Mengran and Zang, Zelin and Xing, Wenbin and Chen, Junzhou and Zhang, Ronghui and Luo, Jiebo and Li, Stan Z.},
journal = {arXiv preprint arXiv:2511.21120},
year = {2025}
}