Masked Diffusion Models for Code Generation


This software project accompanies the research paper, DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation.

Motivation

By scaling up masked diffusion models (MDMs), diffusion LLMs (dLLMs) such as LLaDA and Dream have achieved performance on par with similarly sized autoregressive (AR) LLMs across many benchmarks. Recent commercial-scale dLLMs such as Mercury and Gemini Diffusion further demonstrate that diffusion-based code generators can rival top AR code models on programming tasks while offering faster text generation.

However, the generation pattern and post-training strategies of dLLMs remain under-explored. In this work, we investigate the following questions:

  • How does the generation pattern of dLLMs differ from AR models?
  • What is the difference in modeling different data modalities, such as code vs. math?
  • How diverse can dLLMs be, and how should post-training be designed?

We train DiffuCoder using the adaptation approach of DiffuLLaMA and introduce a new metric, the autoregressiveness score, to quantify the causal pattern in dLLM generation. The key findings are listed below.
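To make the idea concrete, here is a minimal sketch of one way to score how left-to-right a dLLM's decoding trajectory is: the fraction of decoding steps that unmask the leftmost remaining masked position. This simplified definition and the function name are ours for illustration; the paper's autoregressiveness score distinguishes local and global AR-ness and differs in detail.

```python
def ar_ness(decode_order):
    """Fraction of decoding steps that unmask the leftmost remaining
    masked position; 1.0 means strictly left-to-right (AR-like) decoding.

    decode_order[i] is the sequence position unmasked at step i.
    Simplified sketch, not the paper's exact metric.
    """
    remaining = set(decode_order)  # positions still masked
    hits = 0
    for pos in decode_order:
        if pos == min(remaining):  # did this step pick the leftmost one?
            hits += 1
        remaining.discard(pos)
    return hits / len(decode_order)

print(ar_ness([0, 1, 2, 3]))  # strictly left-to-right -> 1.0
print(ar_ness([3, 2, 1, 0]))  # right-to-left -> 0.25
```

A purely AR decoder scores 1.0 on this toy metric, while out-of-order generation pulls the score toward 1/n.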

Findings

  • dLLMs still exhibit a left-to-right bias due to the sequential nature of text, but unlike AR models they can also break this strict order.

  • After pre-training, we show that code tasks induce less global AR-ness than math tasks.

  • In dLLMs, changing the sampling temperature not only affects sampled tokens (as in AR models), but also alters the generation order itself.

For more interesting findings, please refer to our original paper!

We propose Coupled-GRPO, a post-training method to improve DiffuCoder's performance.


Coupled-GRPO

In diffusion LLMs, the per-timestep loss $\mathcal{L}_{t}$ typically computes log-probabilities only at masked token positions, which leads to inefficiency and high variance when sampling is limited. To address this, Coupled-GRPO introduces a coupled-sampling scheme:

  • For each training example, we select $\lambda$ pairs of timesteps $(t, \hat{t})$ such that $t + \hat{t} = T$.
  • We apply two complementary token masks — each mask hides part of the tokens, and together they cover the entire set of target tokens.
  • As a result, every token is unmasked in exactly one of the two forward passes.

This ensures that:

  1. Every token's log-probability is computed at least once, providing a non-zero learning signal for all tokens.
  2. The probability estimates are more accurate, since each token is evaluated in a realistic partially-masked context (rather than always being fully masked).
  3. The scheme effectively uses $2\lambda$ times more sampling passes than the baseline (we choose $\lambda=1$), improving estimation with modest computational overhead.
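The coupled-sampling scheme above can be sketched as follows. This is a minimal illustration with $\lambda = 1$; the function name is ours, and the actual implementation in coupled_grpo.py operates on token tensors inside the trainer.

```python
import random

def coupled_masks(seq_len, T=100):
    """Sample a coupled timestep pair (t, T - t) and two complementary
    masks over the target tokens. Each position is masked in exactly one
    of the two forward passes, so every token's log-probability receives
    a learning signal while being scored in a partially-masked context.
    """
    t = random.randint(1, T - 1)
    t_hat = T - t  # coupled timestep: t + t_hat = T
    # First pass: mask each position independently with probability t / T.
    mask1 = [random.random() < t / T for _ in range(seq_len)]
    # Second pass: the exact complement, covering the remaining tokens.
    mask2 = [not m for m in mask1]
    return t, t_hat, mask1, mask2

t, t_hat, m1, m2 = coupled_masks(16)
assert t + t_hat == 100                       # timesteps are coupled
assert all(a != b for a, b in zip(m1, m2))    # masks are complementary
```

Together the two masks cover every target token exactly once, which is what gives the $2\lambda$-pass variance reduction at modest extra cost.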

In this repository, we release our implementation of Coupled-GRPO, built upon open-r1.

Getting Started

├── run.sh # start training
├── setup.py # modified open-r1/setup.py
├── src/open_r1/ #  our code based on open-r1
│   ├── configs.py # with diffusion related params
│   ├── coupled_grpo.py # inherits trl GRPOTrainer 
│   ├── grpo.py # main training script
│   ├── rewards.py # rewritten code reward and code_format reward
│   ├── utils/code_providers.py # rewrite pass rate extraction for E2B
├── recipes/process_data.py # prepare grpo training data
├── recipes/config_coupled_code.yaml # training config
├── tests/test_code_reward.py # test sandbox execution for code

1. Prepare code and environment

Clone Open-R1 with git clone https://github.com/huggingface/open-r1, then merge our files into the Open-R1 tree, replacing any duplicates (including setup.py).

Setup the environment and dependencies following Open-R1:

env=openr1
conda create -n $env python=3.11 -y -c anaconda
conda activate $env

pip install vllm==0.8.4
pip install setuptools
pip install flash-attn==2.8.0.post1 --no-build-isolation
pip install -e ".[code]"

Set up a code sandbox at E2B and export your E2B token as the E2B_API_KEY environment variable. Log in to wandb and export your WANDB_ENTITY.
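For example (placeholder values; substitute your own token and entity):

```shell
export E2B_API_KEY="e2b_your_token_here"    # from the E2B dashboard
wandb login                                 # authenticate with Weights & Biases
export WANDB_ENTITY="your-wandb-entity"     # user or team used for run logging
```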

2. Data preparation

We prepare the hard split of the GRPO training data from AceCode-89K.

cd recipes
python process_data.py --dataset_path "TIGER-Lab/AceCode-89K" --output_path "./acecode_hard.json" --difficulty "hard"

3. Start GRPO training

cd ..
bash run.sh 
# `run.sh` starts the e2b server locally, but you can also serve it on CPU clusters.

Inference

The DiffuCoder models (Base, Instruct, and cpGRPO) are now available on HuggingFace. We'll be adding usage examples here shortly.

Evaluation

The diffusion inference algorithm is based on Dream-7B. The code evaluation is based on Qwen2.5-Coder.

Acknowledgements

We sincerely appreciate the open-source works that DiffuCoder builds on, including DiffuLLaMA, Dream, Open-R1, and Qwen2.5-Coder.

Citation

@article{gong2025diffucoder,
  title={DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation},
  author={Gong, Shansan and Zhang, Ruixiang and Zheng, Huangjie and Gu, Jiatao and Jaitly, Navdeep and Kong, Lingpeng and Zhang, Yizhe},
  year={2025},
  eprint={2506.20639},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.20639},
}
