GitHub - jaylee2000/rsm: An official implementation of Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models

Reward Score Matching

TL;DR: We unify reward-based fine-tuning algorithms for diffusion and flow generative models. This unified view lets us separate the fundamental design choices from the incidental ones.

Jeongjae Lee*, Jinho Chang*, Jeongsol Kim†, Jong Chul Ye†.

KAIST

🔥 News

[2026.05.21] Code released on GitHub!
[2026.05.07] Preprint updated on arXiv!
[2026.04.19] Preprint released on arXiv!

Repository Layout

Directory	Purpose	Model family
`sd35_zeroth_order/`	Zeroth-order experiments against TempFlow-GRPO baseline for Figure 4(a). Most users should start here.	Stable Diffusion 3.5 Medium
`sd15_zeroth_order/`	Zeroth-order experiments against PCPO baseline for Figure 4(b, c).	Stable Diffusion 1.5
`sd35_first_order/`	First-order experiments against VGG-Flow baseline for Figure 5(a, b).	Stable Diffusion 3.5 Medium
`sd15_first_order/`	First-order experiments against Nabla-GFlowNet baseline for Figure 5(c, d).	Stable Diffusion 1.5

Each directory is an independent component with its own setup notes, configs, and launch scripts.

Setup Notes

Install dependencies inside the component you want to run. Model access, reward-model setup, and component-specific packages are described in each subdirectory README and config files.

Weights & Biases logging is expected by the current code. Set WANDB_API_KEY or replace the placeholder fields in component configs before running.
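A pre-launch check along these lines can catch a missing key before a long run starts. It is only a minimal sketch: `wandb_preflight` is a hypothetical helper that is not part of this repository, and it covers only the environment-variable route, not the placeholder fields in the component configs.

```python
import os


def wandb_preflight(env=None):
    """Return a list of problems with the W&B environment (empty if none).

    Hypothetical helper, not part of this repository. It only inspects
    WANDB_API_KEY; credentials placed in config files are not checked.
    """
    env = os.environ if env is None else env
    problems = []
    if not env.get("WANDB_API_KEY", "").strip():
        problems.append("WANDB_API_KEY is not set")
    return problems


# Example with an explicit mapping instead of the real environment.
print(wandb_preflight({"WANDB_API_KEY": "dummy"}))  # prints []
```

Calling `wandb_preflight()` with no argument inspects the real process environment.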

Review cache, checkpoint, and output paths before launching experiments. Some configs/scripts have hardcoded paths and may need local path edits.

Hardware Notes

SD3.5-M experiments were tested on CUDA 12.8 with 4 x H200 GPUs. These runs can nearly saturate the 140GB of H200 VRAM when the GenEval server is hosted simultaneously. They can be adapted to lower-VRAM GPUs (down to a single 24GB GPU) by increasing gradient accumulation. BatchSampler is flexible, so you can still train with the same effective batch size.
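The trade-off between GPU count and gradient accumulation is plain arithmetic: effective batch size = per-GPU batch size x number of GPUs x accumulation steps. The sketch below works through it. The function name and every number in it are illustrative assumptions, not values taken from this repository's configs, and the repository's BatchSampler may organize batches differently.

```python
def grad_accum_steps(target_effective_batch, per_gpu_batch, num_gpus):
    """Accumulation steps needed to reach a target effective batch size.

    Hypothetical helper for illustration only. Assumes:
    effective batch = per_gpu_batch * num_gpus * accumulation steps.
    """
    per_step = per_gpu_batch * num_gpus
    if per_step <= 0:
        raise ValueError("per_gpu_batch and num_gpus must be positive")
    if target_effective_batch % per_step != 0:
        raise ValueError(
            "target_effective_batch must be a multiple of "
            "per_gpu_batch * num_gpus"
        )
    return target_effective_batch // per_step


# Illustrative numbers only: a 4-GPU run with per-GPU batch 8 and
# 2 accumulation steps has an effective batch size of 4 * 8 * 2 = 64.
print(grad_accum_steps(64, per_gpu_batch=8, num_gpus=4))  # prints 2

# The same effective batch on one GPU that fits a single sample per step.
print(grad_accum_steps(64, per_gpu_batch=1, num_gpus=1))  # prints 64
```

Fewer or smaller GPUs therefore cost wall-clock time rather than changing the optimization batch, provided the accumulation setting in the relevant config is raised accordingly.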

SD1.5 experiments were tested on CUDA 12.x with RTX 4090 GPUs (24GB VRAM).

Package versions may need adjustment for different CUDA versions (e.g., CUDA 13.x), PyTorch wheels, xformers builds, or GPU microarchitectures.

Reproducing Paper Runs

Figure 4(a), Ours

cd sd35_zeroth_order
bash scripts/single_node/run_sd3.sh --profile lowsnr2 --sampler branch --reward geneval --loss matching --reweight fairclip2 --num-processes 4

Figure 4(b), Ours

cd sd15_zeroth_order
accelerate launch train_grpo_pr.py --config configs/train/base_grpo_pr_uwsigma_lora10.yaml

Figure 4(b, c), Baseline

cd sd15_zeroth_order
accelerate launch train_grpo.py --config configs/train/hpsv2_1_lora_grpo.yaml

Figure 5(a, b), Ours

cd sd35_first_order
torchrun --standalone --nproc_per_node=4 train_vggflow.py \
    --config=config/hpsv2_geneval_ours.py \
    --exp_name=OURS

Figure 5(a, b), Pruned Baseline

cd sd35_first_order
torchrun --standalone --nproc_per_node=4 train_vggflow.py \
    --config=config/hpsv2_geneval.py \
    --exp_name=PRUNED_BASELINE

Figure 5(c, d), Ours

cd sd15_first_order
torchrun --nproc_per_node=4 --master_port=29501 simple_res-nabladb.py --config config/simple_res-nabladb_sd_hps_usez0_basesnr.yaml

Figure 5(c, d), Pruned Baseline

cd sd15_first_order
torchrun --nproc_per_node=4 --master_port=29501 simple_res-nabladb.py --config config/simple_res-nabladb_sd_hps.yaml

Citation

If you find this repository useful, please cite:

@misc{lee2026rewardscorematchingunifying,
      title={Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models}, 
      author={Jeongjae Lee and Jinho Chang and Jeongsol Kim and Jong Chul Ye},
      year={2026},
      eprint={2604.17415},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2604.17415}, 
}

Acknowledgements

This repo is based on ddpo-pytorch, flow_grpo, pcpo, nabla-gfn, vggflow, and TempFlow-GRPO.

License

This project is released under the MIT License. See LICENSE.
