Lossless token vocabulary compression for fast CFG-constrained decoding.
CFGzip is an offline pre-computation technique that pairs with a constrained decoding engine (e.g. XGrammar2), to massively speed up the engine's inference compute time: generation with CFGzip+XGrammar2 is up to 7.5x faster than the SoTA XGrammar2 alone. CFGzip compression is also lossless: outputs are byte-identical to the unmodified grammar engine.
Within a given context-free grammar (CFG), many tokens are interchangeable: at any point where one is valid, so are
the others, and vice versa. During the pre-compute phase, CFGzip detects these tokens and compresses them into a single
equivalence class, keeping one representative per class. These equivalence classes are then cached — in memory or on
disk (via EquivalenceClassData.save()) — to be used during generation.
At inference time, the grammar engine produces an allowed/disallowed mask over just the class representatives: this
shrinks the per-step search space from 100-200k tokens to (typically) 1-3k equivalence class representatives. A
MaskTranslator losslessly decompresses that class-level mask back to the full token vocabulary at each generation
step.
See our paper for a description of the compression algorithm and correctness proofs.
CFGzip is intended for static, large, complex grammars that are reused across multiple requests: code generation, fixed schemas, structured DSLs. The offline compression step can take a few minutes, so it typically only pays off when one grammar serves many generations. CFGzip is not intended for dynamic, per-request schemas where you'd have to pay the precompute cost every time.
pip install cfgzip # core: offline preprocessing
pip install "cfgzip[xgrammar]" # + the XGrammar generation backendRequires Python ≥ 3.10.
CFGzip has two phases: an offline step that computes and caches equivalence classes for a (grammar, tokenizer)
pair, and an online step, where a third-party grammar engine does the heavy lifting — CFGzip just speeds things up.
from transformers import AutoTokenizer
from cfgzip import preprocess
tokenizer = AutoTokenizer.from_pretrained('gpt2')
grammar = """\
root ::= expr
expr ::= term (("+" | "-") term)*
term ::= factor (("*" | "/") factor)*
factor ::= [0-9]+ | "(" expr ")"
"""
# eq is an EquivalenceClassData object: we can save it
# to disk (like we did here) and/or just use it directly
eq = preprocess(grammar, tokenizer, num_workers=4)
eq.save('cfgzip_data/arithmetic')XgrammarProcessor is a standard transformers LogitsProcessor, so it drops
straight into model.generate:
from transformers import AutoModelForCausalLM, AutoTokenizer, LogitsProcessorList
from cfgzip import XgrammarProcessor
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')
processor = XgrammarProcessor.auto_pipeline(
'cfgzip_data/arithmetic', tokenizer, grammar, device=model.device
)
inputs = tokenizer('Calculator: ', return_tensors='pt').to(model.device)
out = model.generate(
**inputs,
max_new_tokens=16,
logits_processor=LogitsProcessorList([processor]),
)
print(tokenizer.decode(out[0]))For the common case (one grammar, many batches) compile once and rebuild only the lightweight per-batch processor, skipping redundant disk I/O and grammar compilation:
from cfgzip import XgrammarProcessor
mask_translator, compiled_grammar = XgrammarProcessor.load_and_compile(
'cfgzip_data/arithmetic', tokenizer, grammar, device=model.device
)
for batch in batches:
processor = XgrammarProcessor.from_compiled(mask_translator, compiled_grammar, tokenizer)
output = model.generate(**batch, logits_processor=LogitsProcessorList([processor]))
# ... do somethingauto_pipeline is just load_and_compile + from_compiled; use the split form above whenever you generate more than
once with the same grammar.
Grammars are written in GBNF (GGML BNF) — the grammar notation used by
llama.cpp and XGrammar2. The same grammar
string drives both the offline preprocess step and online generation; see the linked guide for the full syntax.
Two CFGzip-specific constraints:
- CFGzip parses core GBNF; a few advanced constructs (e.g.
{m,n}repetition counts) aren't supported yet. - The start rule must be named
root(or passstart_symbol=...topreprocess), and the start symbol may not appear in any rule body — keep the recursion on an inner non-terminal, as in theroot ::= exprquickstart grammar above.
- Engine backend: v0.1.0 supports XGrammar2 only.
BaseProcessordefines the contract for adding support for additional engines. We plan to support llguidance and transformers-cfg in later versions. - CFG notation: similarly,
preprocess()only supports the GBNF grammar specification notation used by XGrammar2. Support for additional specification notations (e.g. Lark) is planned alongside support for decoding engines that use them.
| Name | Description |
|---|---|
preprocess |
Computes equivalence classes for a (grammar, tokenizer) pair (offline). |
EquivalenceClassData |
The precomputed equivalence class data; .save / .load / .to. |
XgrammarProcessor |
XGrammar wrapper and LogitsProcessor; .auto_pipeline / .load_and_compile / .from_compiled. |
MaskTranslator |
Expands a class-level mask back to the full token vocabulary in-place. |
BaseProcessor |
Abstract base / extension point for adding a new grammar engine. |
The main algorithm and functions were written by hand. We used Claude Code to port our research repository to a pip-installable module, including writing tests, docstrings, and portions of this README.
If you use CFGzip in your research, please cite:
@article{sullivan2026accelerating,
title = {Accelerating Constrained Decoding with Token Space Compression},
author = {Sullivan, Michael and Koller, Alexander},
journal = {arXiv preprint arXiv:2605.29986},
year = {2026},
url = {https://arxiv.org/abs/2605.29986}
}