TL;DR: We unify reward-based fine-tuning algorithms for diffusion and flow generative models. This allows us to distinguish the fundamental design choices from others.
Jeongjae Lee*, Jinho Chang*, Jeongsol Kim†, Jong Chul Ye†.
KAIST
- [2026.05.21] Code released on GitHub!
- [2026.05.07] Preprint updated on arXiv!
- [2026.04.19] Preprint released on arXiv!
| Directory | Purpose | Model family |
|---|---|---|
| sd35_zeroth_order/ | Zeroth-order experiments against TempFlow-GRPO baseline for Figure 4(a). Most users should start here. | Stable Diffusion 3.5 Medium |
| sd15_zeroth_order/ | Zeroth-order experiments against PCPO baseline for Figure 4(b, c). | Stable Diffusion 1.5 |
| sd35_first_order/ | First-order experiments against VGG-Flow baseline for Figure 5(a, b). | Stable Diffusion 3.5 Medium |
| sd15_first_order/ | First-order experiments against Nabla-GFlowNet baseline for Figure 5(c, d). | Stable Diffusion 1.5 |
Each directory is an independent component with its own setup notes, configs, and launch scripts.
Install dependencies inside the component you want to run. Model access, reward-model setup, and component-specific packages are described in each subdirectory README and config files.
Weights & Biases logging is expected by the current code. Set WANDB_API_KEY or replace the placeholder fields in component configs before running.
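A pre-flight check can catch a missing key before a run starts. The sketch below is a minimal example under stated assumptions: `require_wandb_key` is a hypothetical helper, not part of this repository, and it only inspects the `WANDB_API_KEY` environment variable (it does not look at the placeholder fields in the component configs).

```shell
# Hypothetical helper (not part of the repository): fail early if the
# W&B key is missing from the environment.
require_wandb_key() {
  if [ -z "${WANDB_API_KEY:-}" ]; then
    echo "WANDB_API_KEY is not set; export it or edit the component config." >&2
    return 1
  fi
}

# Example usage (the key value is a placeholder you must replace):
#   export WANDB_API_KEY="<your-key>"
#   require_wandb_key && echo "ok to launch"
```

If you set the key in a component config instead, this check does not apply.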
Review cache, checkpoint, and output paths before launching experiments. Some configs/scripts have hardcoded paths and may need local path edits.
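One way to locate candidate hardcoded paths is a recursive search over configs and scripts. This is only a sketch: `find_hardcoded_paths` is a hypothetical helper, and the path prefixes (`/home/`, `/data/`, `/mnt/`, `/scratch/`) and file extensions are assumptions about what such paths typically look like, not a list taken from this repository.

```shell
# Hypothetical helper (not part of the repository): list lines in YAML,
# Python, and shell files that mention common absolute path prefixes.
# The prefixes are assumptions; adjust them to your environment.
find_hardcoded_paths() {
  # $1: component directory to scan (defaults to the current directory)
  grep -rnE --include='*.yaml' --include='*.py' --include='*.sh' \
    '/(home|data|mnt|scratch)/' "${1:-.}"
}

# Example usage:
#   find_hardcoded_paths sd35_zeroth_order
```

Each hit is printed as `file:line:content`, so you can review and edit the paths before launching.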
SD3.5-M experiments were tested on CUDA 12.8 with 4 x H200 GPUs. These runs can nearly saturate 140GB of H200 VRAM when hosting the GenEval server simultaneously. They can be adapted to lower-VRAM GPUs (down to a single 24GB GPU) by increasing gradient accumulation. BatchSampler is flexible, so you can still train with the same effective batch size.
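To illustrate how gradient accumulation preserves the effective batch size, the arithmetic can be sketched as follows. All numbers and variable names here are hypothetical and are not taken from the repository configs. The sketch assumes the usual relation: effective batch size = number of GPUs x per-GPU batch size x accumulation steps.

```shell
# Hypothetical reference setup (NOT values from the repository configs).
ref_gpus=4
ref_per_gpu_batch=8
ref_accum=1

# Hypothetical lower-VRAM target setup.
target_gpus=1
target_per_gpu_batch=2

# Effective batch size of the reference run.
effective=$((ref_gpus * ref_per_gpu_batch * ref_accum))

# Samples processed per optimizer micro-step on the target setup.
per_step=$((target_gpus * target_per_gpu_batch))

# Ceiling division: accumulation steps needed to reach the same effective
# batch size (it overshoots slightly if the sizes do not divide evenly).
target_accum=$(( (effective + per_step - 1) / per_step ))

echo "effective batch: $effective, accumulation steps on target: $target_accum"
```

With these example numbers, the single-GPU run needs 16 accumulation steps to match an effective batch size of 32.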
SD1.5 experiments were tested on CUDA 12.x with RTX 4090 GPUs (24GB VRAM).
Package versions may need adjustment for other CUDA versions (including CUDA 13.x), PyTorch wheels, xformers builds, or GPU microarchitectures.
cd sd35_zeroth_order
bash scripts/single_node/run_sd3.sh --profile lowsnr2 --sampler branch --reward geneval --loss matching --reweight fairclip2 --num-processes 4

cd sd15_zeroth_order
accelerate launch train_grpo_pr.py --config configs/train/base_grpo_pr_uwsigma_lora10.yaml

cd sd15_zeroth_order
accelerate launch train_grpo.py --config configs/train/hpsv2_1_lora_grpo.yaml

cd sd35_first_order
torchrun --standalone --nproc_per_node=4 train_vggflow.py \
    --config=config/hpsv2_geneval_ours.py \
    --exp_name=OURS

cd sd35_first_order
torchrun --standalone --nproc_per_node=4 train_vggflow.py \
    --config=config/hpsv2_geneval.py \
    --exp_name=PRUNED_BASELINE

cd sd15_first_order
torchrun --nproc_per_node=4 --master_port=29501 simple_res-nabladb.py --config config/simple_res-nabladb_sd_hps_usez0_basesnr.yaml

cd sd15_first_order
torchrun --nproc_per_node=4 --master_port=29501 simple_res-nabladb.py --config config/simple_res-nabladb_sd_hps.yaml

If you find this repository useful, please cite:
@misc{lee2026rewardscorematchingunifying,
title={Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models},
author={Jeongjae Lee and Jinho Chang and Jeongsol Kim and Jong Chul Ye},
year={2026},
eprint={2604.17415},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2604.17415},
}

This repo is based on ddpo-pytorch, flow_grpo, pcpo, nabla-gfn, vggflow, and TempFlow-GRPO.
This project is released under the MIT License. See LICENSE.