RobustRDP: Advancing Reaction Diagram Parsing via
Synthetic-to-Real Data Scaling and Robustness-Oriented Training
This repository contains the official training and evaluation code for RobustRDP, a robust approach for chemical reaction diagram parsing that leverages:
- Synthetic-to-real data scaling: large-scale synthetic pretraining (60k images) followed by real-world fine-tuning
- Robustness-oriented training: multi-task SFT with region-guided and prefix-perturbed objectives, plus DPO alignment
The model is built upon Qwen2.5-VL-3B-Instruct and trained using LLaMA-Factory.
- Environment Setup
- Quick Evaluation
- Training RobustRDP
- Training Data Generation
- Evaluation Data Generation
- Efficient Annotation Platform
- Citation

    conda create -n robustrdp python=3.11.13
    conda activate robustrdp

    git clone https://github.com/jaydetang/RobustRDP.git
    cd RobustRDP

Note: This repository uses Git LFS to track large files (e.g., pretrain raw data under pretrain_data_process/raw_data/). Run git lfs pull after cloning.

    pip install -r requirements.txt

    git clone https://github.com/hiyouga/LLaMA-Factory.git
    cd LLaMA-Factory
    git checkout v0.9.4
    pip install -e .
    git apply ../llamafactory_patch/patch.diff
    cd ..

The patch registers the custom datasets (stage1_pretrain, stage2_sft, stage3_dpo) in LLaMA-Factory's dataset_info.json and adds support for the disturb_rxns field used in perturbed reaction parsing during SFT.
Evaluate the trained RobustRDP model on both the RxnScribe test and RobustRDP test sets.
Download the trained model checkpoint from Hugging Face and place it under the eval/ directory:

    huggingface-cli download Jingcz/RobustRDP --local-dir ./eval/SFT_Model

Download the processed validation data from Hugging Face:

    huggingface-cli download Jingcz/RobustRDP-ProcessedValData --local-dir ./processed_val_data --repo-type dataset

The validation data includes two test sets:
- RxnScribe_test/: test set converted from the RxnScribe benchmark
- RobustRDP_test/: test set constructed for RobustRDP with diverse layouts and reaction types

See processed_val_data/README.md for detailed data specifications.

    # Make sure you have updated the model path in eval/eval.sh to point to your downloaded checkpoint
    sh eval/eval.sh

The script evaluates both test sets sequentially using 8 GPUs with distributed data-parallel inference. Results (predictions and scores) are saved to eval/dpo_results/.

Note: The eval.sh script assumes 8 GPUs are available. Adjust CUDA_VISIBLE_DEVICES and nproc_per_node in the script if using fewer GPUs.
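When the GPU count changes, each process simply receives a different slice of the test set. The sketch below illustrates one common way to split samples across ranks (a strided split). It is an illustration only: the function name shard_indices is hypothetical, and the actual partitioning logic in eval/eval_multigpu.py may differ.

```python
def shard_indices(num_samples: int, rank: int, world_size: int) -> list:
    """Return the sample indices handled by `rank` under a strided split.

    Hypothetical helper for illustration; not taken from the repository.
    """
    if world_size <= 0 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Rank r takes samples r, r + world_size, r + 2 * world_size, ...
    return list(range(rank, num_samples, world_size))


def all_shards(num_samples: int, world_size: int) -> list:
    """Collect the shard of every rank, e.g. to check full coverage."""
    return [shard_indices(num_samples, r, world_size) for r in range(world_size)]


if __name__ == "__main__":
    # The same 10 samples split for 8 processes and for 2 processes.
    for world_size in (8, 2):
        shards = all_shards(10, world_size)
        covered = sorted(i for shard in shards for i in shard)
        assert covered == list(range(10))  # every sample is evaluated exactly once
        print(world_size, shards)
```

With a split like this, lowering nproc_per_node only changes how many samples each process handles, not which samples are evaluated overall.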
Training consists of three sequential stages, each building upon the previous one. The original experiments were conducted on 2 × NVIDIA H100 (80GB) GPUs.

Important: All training commands below must be executed from the RobustRDP/ project root directory (i.e., the directory containing this README.md). The train_scripts/, PLMs/, processed_train_data/, and saves/ paths are all relative to this root.

    # Download Qwen2.5-VL-3B-Instruct from Hugging Face
    huggingface-cli download Qwen/Qwen2.5-VL-3B-Instruct --local-dir ./PLMs/Qwen2.5-VL-3B-Instruct

    # Download the processed training data from Hugging Face
    huggingface-cli download Jingcz/RobustRDP-ProcessedTrainData --local-dir ./processed_train_data --repo-type dataset

The training data includes three subsets:
- pretrain_data/: 60,000 synthetic reaction diagrams (single-line, multi-line, branch, cycle)
- sft_data/: multi-task SFT data with real-world diagrams and augmentations
- dpo_data/: DPO preference pairs

See processed_train_data/README.md for detailed data specifications.

    llamafactory-cli train train_scripts/qwen2_5vl_3b_pretrain.yaml

- Base model: ./PLMs/Qwen2.5-VL-3B-Instruct
- Dataset: 60k synthetic reaction diagrams
- Training config: Full fine-tuning of language model, frozen vision tower & projector
- Output: saves/stage1_pretrain/qwen2_5vl-3b/pretrainllm_lr1e-6_bs16_cosine_6w/
- Special tokens: <rxn>, <rct>, <cnd>, <prd>, <txt>, <mol>

Select the final checkpoint (e.g., .../final) for the next stage.
Update model_name_or_path in train_scripts/qwen2_5vl_3b_sft.yaml to point to the final checkpoint from Stage 1, then run:

    llamafactory-cli train train_scripts/qwen2_5vl_3b_sft.yaml

- Dataset: Multi-task SFT data including:
  - Vanilla Reaction Parsing (VRP): standard parsing with augmentation
  - Region-Guided Reaction Parsing (RGRP): parse within specified bounding-box regions
  - Prefix-Perturbed Reaction Parsing (PPRP): parsing with some equations perturbed
- Training config: Full fine-tuning of all parameters
- Output: saves/stage2_sft/qwen2_5vl-3b/pretrainllm_sftall_lr1e-5_bs4_cosine_decouple_disturb_15d/

Select the final checkpoint (e.g., .../checkpoint-47700) for the next stage.
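Handing a checkpoint from one stage to the next is a manual step: pick a checkpoint directory, then edit model_name_or_path in the next stage's YAML. The following is a minimal sketch of how that could be scripted, assuming checkpoints are saved as checkpoint-<step> directories and that model_name_or_path appears as a top-level key on its own line. Both helper names (latest_checkpoint, set_model_path) are hypothetical and not part of the repository or of LLaMA-Factory.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme for step checkpoints: checkpoint-<step>
CHECKPOINT_DIR = re.compile(r"^checkpoint-(\d+)$")
# Matches a top-level `model_name_or_path: ...` line in a YAML file
MODEL_PATH_LINE = re.compile(r"^(model_name_or_path:[ \t]*).*$", flags=re.MULTILINE)


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> subdirectory with the largest step, if any."""
    best_step, best_path = -1, None
    for child in Path(output_dir).iterdir():
        match = CHECKPOINT_DIR.match(child.name)
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


def set_model_path(yaml_text: str, new_path: str) -> str:
    """Return yaml_text with the first model_name_or_path value replaced.

    Note: anything after the key on that line (e.g. a trailing comment) is dropped.
    """
    updated, count = MODEL_PATH_LINE.subn(
        lambda m: m.group(1) + new_path, yaml_text, count=1
    )
    if count == 0:
        raise KeyError("model_name_or_path not found in config")
    return updated
```

A typical use would be to call latest_checkpoint on the stage's output directory, pass the result to set_model_path together with the text of the next stage's YAML, and write the returned text back to the file. Choosing the highest step is only a heuristic for "final checkpoint"; if a different checkpoint performs better on validation, pass that path instead.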
Update model_name_or_path in train_scripts/qwen2_5vl_3b_dpo.yaml to point to the selected checkpoint from Stage 2, then run:

    llamafactory-cli train train_scripts/qwen2_5vl_3b_dpo.yaml

- Dataset: DPO preference pairs (14,169 samples); chosen = ground truth, rejected = model prediction with F1 < 0.8
- Training config: Frozen vision tower & projector, language model only
- DPO hyperparameters: pref_beta=0.1, pref_ftx=0.5, pref_loss=sigmoid
- Output: saves/stage3_dpo/qwen2_5vl-3b/dpollm_lr3e-7_bs64_cosine_beta01_ftx05/
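To make the Stage 3 hyperparameters concrete, the sketch below computes the standard sigmoid DPO loss for a single preference pair in plain Python, with beta playing the role of pref_beta and ftx the role of pref_ftx (a weight on an auxiliary supervised loss over the chosen response). It is a didactic illustration under those assumptions, not LLaMA-Factory's implementation; in particular, how the supervised term is normalized (here a caller-supplied chosen_nll) is an assumption.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_sigmoid_loss(pi_chosen: float, pi_rejected: float,
                     ref_chosen: float, ref_rejected: float,
                     beta: float = 0.1) -> float:
    """Sigmoid DPO loss for one pair, given sequence log-probabilities.

    pi_* are log-probs under the policy being trained, ref_* under the
    frozen reference model.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) == softplus(-beta * margin)
    return softplus(-beta * margin)


def stage3_loss(pi_chosen: float, pi_rejected: float,
                ref_chosen: float, ref_rejected: float,
                chosen_nll: float, beta: float = 0.1, ftx: float = 0.5) -> float:
    """DPO loss plus an ftx-weighted supervised term on the chosen response."""
    dpo = dpo_sigmoid_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta)
    return dpo + ftx * chosen_nll
```

When the policy equals the reference model the margin is zero and the DPO term equals log 2; it shrinks as the policy raises the chosen response's likelihood relative to the rejected one.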
Scripts to synthesize 60,000 chemical reaction diagrams with four layout types:
| Layout Type | Count | Description |
|---|---|---|
| Single-line | 10,000 | Single-line chain-style reactions |
| Multi-line | 10,000 | Multi-line chain-style reactions |
| Branch | 20,000 | Branching reactions |
| Cycle | 20,000 | Cyclic reactions |
See pretrain_data_process/README.md for full instructions.
Generates multi-task training data from 4,240 real-world reaction diagrams with three task variants:
| Task | Samples | Description |
|---|---|---|
| Vanilla Reaction Parsing (VRP) | 127,200 | Standard parsing with augmentation (x15) |
| Region-Guided Reaction Parsing (RGRP) | 31,800 | Parse reactions within a given bounding-box region |
| Prefix-Perturbed Reaction Parsing (PPRP) | 31,800 | Parse with some equations having perturbed boxes |
See sft_data_process/README.md for full instructions.
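As a quick check on the task mix, the per-task counts from the table above can be totaled and converted to shares. This is plain arithmetic on the published numbers; the variable names are illustrative.

```python
# Sample counts per SFT task, copied from the table above.
sft_samples = {"VRP": 127_200, "RGRP": 31_800, "PPRP": 31_800}

total = sum(sft_samples.values())
shares = {task: count / total for task, count in sft_samples.items()}

print(f"total SFT samples: {total:,}")  # total SFT samples: 190,800
for task, share in shares.items():
    # VRP: 66.7%, RGRP: 16.7%, PPRP: 16.7%
    print(f"{task}: {share:.1%}")
```

Vanilla parsing therefore makes up two thirds of the SFT mixture, with the two robustness-oriented tasks splitting the remaining third evenly.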
Generates 14,169 preference pairs by running the Stage 2 SFT model on the VRP training data and filtering samples where model predictions have an overall F1 < 0.8 compared to ground truth.
See dpo_data_process/README.md for full instructions.
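The filtering rule described above can be sketched as follows. This is a simplified illustration, not the code in dpo_data_process/: the record fields (prompt, ground_truth, prediction, f1) and the output keys are hypothetical, and the F1 score is assumed to have been computed beforehand by the evaluation metrics.

```python
from typing import Dict, Iterable, List

F1_THRESHOLD = 0.8  # samples scoring below this become preference pairs


def build_preference_pairs(records: Iterable[Dict],
                           threshold: float = F1_THRESHOLD) -> List[Dict]:
    """Keep samples whose prediction scores below `threshold`.

    The ground truth becomes the chosen response and the low-scoring
    model prediction becomes the rejected response.
    """
    pairs = []
    for rec in records:
        if rec["f1"] < threshold:
            pairs.append({
                "prompt": rec["prompt"],
                "chosen": rec["ground_truth"],
                "rejected": rec["prediction"],
            })
    return pairs


if __name__ == "__main__":
    demo = [
        {"prompt": "p1", "ground_truth": "gt1", "prediction": "pred1", "f1": 0.95},
        {"prompt": "p2", "ground_truth": "gt2", "prediction": "pred2", "f1": 0.42},
    ]
    # Only the second record falls below the threshold.
    print(build_preference_pairs(demo))
```

Samples the SFT model already parses well (F1 of 0.8 or higher) contribute no pair, so the preference data concentrates on diagrams the model still gets wrong.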
Scripts to process raw validation data into the evaluation format used by RobustRDP:

    # Step 1: Download raw validation data
    huggingface-cli download Jingcz/RobustRDP-RawValData --local-dir ./raw_val_data --repo-type dataset

    # Step 2: Process RxnScribe test data
    python raw_val_data/gen_processed_val_data_rxnscribe_test.py

    # Step 3: Process RobustRDP test data
    python raw_val_data/gen_processed_val_data_robustrdp_test.py

See raw_val_data/README.md for full instructions.
The Efficient Annotation Platform used to annotate the 3,500 raw reaction diagrams (the foundation of the SFT and DPO data) is available at:
- Repository: RxnLabel (https://github.com/jaydetang/RxnLabel)
- Raw Annotated Data: Jingcz/RxnLabelData on Hugging Face
The platform enables efficient bounding-box and reaction structure annotation for chemical diagrams, supporting the data generation pipeline described in the paper.
    RobustRDP/
    ├── LLaMA-Factory/                 # Cloned & patched LLaMA-Factory (v0.9.4)
    ├── llamafactory_patch/            # Patch to register custom datasets in LLaMA-Factory
    │   └── patch.diff
    ├── PLMs/                          # Pre-trained language models (Qwen2.5-VL-3B-Instruct)
    ├── saves/                         # Training checkpoints (generated during training)
    ├── train_scripts/                 # LLaMA-Factory training configs (YAML)
    │   ├── qwen2_5vl_3b_pretrain.yaml
    │   ├── qwen2_5vl_3b_sft.yaml
    │   └── qwen2_5vl_3b_dpo.yaml
    ├── eval/                          # Evaluation scripts & model checkpoint
    │   ├── eval.sh                    # Main evaluation script
    │   ├── eval_multigpu.py           # Distributed evaluation
    │   ├── evaluater.py               # Evaluation metrics
    │   └── data.py                    # Data loading for evaluation
    ├── processed_train_data/          # Training data (download from Hugging Face)
    │   ├── pretrain_data/             # 60k synthetic diagrams
    │   ├── sft_data/                  # Multi-task SFT data
    │   └── dpo_data/                  # DPO preference pairs
    ├── processed_val_data/            # Validation data (download from Hugging Face)
    │   ├── RxnScribe_test/
    │   └── RobustRDP_test/
    ├── raw_val_data/                  # Raw validation data processing
    │   ├── gen_processed_val_data_rxnscribe_test.py
    │   ├── gen_processed_val_data_robustrdp_test.py
    │   ├── RobustRDP_test/            # Raw RobustRDP test images
    │   ├── RxnScribe_test/            # Raw RxnScribe test images
    │   └── utils/
    ├── pretrain_data_process/         # Synthetic pretrain data generation
    │   ├── gen_single_line.py
    │   ├── gen_multi_line.py
    │   ├── gen_branch.py
    │   ├── gen_cycle.py
    │   ├── post_process.py
    │   ├── utils.py
    │   ├── indigo/                    # Indigo cheminformatics wrapper
    │   └── raw_data/
    ├── sft_data_process/              # SFT data generation
    │   ├── gen_vanilla_reaction_parsing.py
    │   ├── gen_region_guided_reaction_parsing.py
    │   ├── gen_prefix_perturbed_reaction_parsing.py
    │   ├── post_process.py
    │   ├── raw_data/
    │   └── utils/
    └── dpo_data_process/              # DPO data generation
        ├── pre_process.py
        ├── gen_dpo.py
        └── gen_dpo.sh

If you find this work useful in your research, please consider citing our paper:

    @article{robustrdp2025,
      title={RobustRDP: Advancing Reaction Diagram Parsing via Synthetic-to-Real Data Scaling and Robustness-Oriented Training},
      author={...},
      journal={...},
      year={2025}
    }

This project is licensed under the MIT License. See the LICENSE file for details.
