Code for the paper.
Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute.
We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic.
Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to 4.7x speedup over autoregressive decoding, and up to 1.57x over a tuned dynamic decoding baseline while improving accuracy by up to 4.5 points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is 4.4x faster than the static baseline with slightly higher accuracy.
This codebase covers four block-diffusion model families. For each model, we copy the core functionality files into separate subfolders with our modifications. To run experiments, cd into the corresponding subfolder.
S2D2/
├── SDAR/ # SDAR-8B-Chat
├── Fast-dLLM-v2/ # Fast-dLLM v2
├── LLaDA2/ # LLaDA2.1-Mini
└── D2F/ # Discrete Diffusion Forcing
Please follow each model's official instructions to install required packages. We highlight the following version requirements:
- SDAR requires
transformers==4.52.4. Flash Attention must be installed as:pip install "flash-attn==2.7.4.post1" --no-build-isolation --no-cache-dir - Other models generally work with a wider range of dependency versions.
The following arguments control S2D2's routing policies and are shared across all example and evaluation scripts:
| Argument | Description |
|---|---|
--do_verify_policy |
Routing policy: mask_span_length, score_threshold, score_hysteresis, contextual_bandit_ucb
|
--do_verify_score_threshold |
Score threshold |
--hysteresis_threshold_on |
Hysteresis upper threshold |
--hysteresis_threshold_off |
Hysteresis lower threshold |
--do_verify_score_type |
Score type: difference_static (static) or difference_dynamic (dynamic) |
--score_penalty_coef |
Penalty coefficient |
--token_acceptance_estimator |
Estimator type: soft_entropy_negexp (soft entropy-based) or hard_margin_threshold (hard margin-based) |
cd SDAR
# Standard diffusion decoding
python generate.py
# S2D2 (ours)
CUDA_VISIBLE_DEVICES=0 python generate_ssd_policy.py
# Append --forward_stats to print decoding order and decoded tokens at each step
# Append --draft_ver --cache_ver to enable AR-like caching
CUDA_VISIBLE_DEVICES=0 python generate_ssd_policy.py --forward_stats --draft_ver --cache_vercd Fast-dLLM-v2
# Standard diffusion decoding
python example_v2.py --generate_fn='fast'
# S2D2 (ours)
python example_v2.py --generate_fn='ssd_policy'cd LLaDA2
# No-cache version (sample code from the model card, without KV caching)
python example_llada.py --generate_fn='nocache'
# Cached version (our implementation of KV-cached diffusion decoding)
python example_llada.py --generate_fn='cached'
# S2D2 (ours)
python example_llada.py --generate_fn='ssd_policy'cd D2F
python example_d2f.py --generate_fn='d2f'More examples see D2F/README.md.
We use a forked version of lm-evaluation-harness. Clone the fork and switch to the more-eval branch:
git clone https://github.com/phymhan/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout more-eval
pip install -e .Use custom_model_class to specify a custom modeling file (needed when features like AR-like caching are not supported by the original model), and custom_generate to specify the generation function.
SDAR (e.g., IFEval):
cd SDAR
lm_eval --model hf \
--model_args pretrained=JetLM/SDAR-8B-Chat,trust_remote_code=True,custom_generate=./generate_ssd_policy.py:block_diffusion_generate,block_length=16,denoising_steps=16,remasking_strategy=low_confidence_dynamic,min_ssd_span_length=1,confidence_threshold=0.85,cache_ver=true,draft_ver=true \
--batch_size 1 \
--tasks ifevalFast-dLLM-v2 (e.g., HumanEval):
cd Fast-dLLM-v2
HF_ALLOW_CODE_EVAL=1 lm_eval --model hf \
--model_args pretrained=Efficient-Large-Model/Fast_dLLM_v2_7B,trust_remote_code=True,custom_model_class=./modeling_fast.py:Fast_dLLM_QwenForCausalLM,custom_generate=./generate_policy_utils.py:generate_ssd_policy,block_size=32,small_block_size=32,use_block_cache=false,use_ssd_cache=false,threshold=0.9,cache_ver=false,draft_ver=false,max_new_tokens=512,do_verify_policy=mask_span_length,min_ssd_span_length=8 \
--batch_size 1 \
--tasks humaneval \
--confirm_run_unsafe_codeComing soon.
Our code borrows heavily from the original codebases:
- SDAR: https://github.com/JetAstra/SDAR
- Fast-dLLM-v2: https://github.com/NVlabs/Fast-dLLM
- LLaDA2.x: https://huggingface.co/inclusionAI/LLaDA2.1-mini
- D2F: https://github.com/SJTU-DENG-Lab/Discrete-Diffusion-Forcing
We thank the authors for generously open-sourcing their work!
