Add hrm text #46025
Merged
`L_bp_cycles_padded` is indexed by `high_cycle_idx ∈ [0, H_cycles)` inside the recurrent forward, but `HrmTextModel.__init__` was left-padding it to length `config.L_cycles` instead of `config.H_cycles`. With the upstream defaults (`H_cycles=2`, `L_cycles=3`, `L_bp_cycles=[2]`) this silently produced `L_bp_cycles_padded=[1, 1, 2]`, so the index-1 read in the second H-cycle picked up the leading pad value (1) and the trailing 2 was never reached. Inference is unaffected (the value is only consulted under autograd in training); training-time gradient propagation through the last H-cycle was capped at 1 L-iteration instead of `raw_bp[-1]` (default 2).
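The wrong-axis padding can be reproduced in isolation. The sketch below is only an illustration, not the code from `HrmTextModel.__init__`: `left_pad` is a hypothetical helper, and the pad value of 1 is inferred from the `[1, 1, 2]` result quoted above.

```python
def left_pad(raw_bp, length, pad_value=1):
    """Left-pad raw_bp with pad_value up to `length` (hypothetical helper)."""
    return [pad_value] * (length - len(raw_bp)) + list(raw_bp)


# Upstream defaults quoted in the commit message.
H_cycles, L_cycles, raw_bp = 2, 3, [2]

buggy = left_pad(raw_bp, L_cycles)  # padded along the wrong axis
fixed = left_pad(raw_bp, H_cycles)  # one entry per H-cycle

last_h = H_cycles - 1               # index read in the last H-cycle
print(buggy, "->", buggy[last_h])   # the pad value is read, the trailing 2 is never reached
print(fixed, "->", fixed[last_h])   # raw_bp[-1] is read
```

With the defaults, the last H-cycle reads the pad value from the buggy list and `raw_bp[-1]` from the fixed one, which is the difference the fix restores.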
vasqu (Contributor) approved these changes on May 18, 2026 and left a comment:
Just checking the last details 🫡 pretty much ready for merge
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Contributor
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, hrm_text
Contributor
run-slow: hrm_text
Contributor
This comment contains models: ["models/hrm_text"]
abcd1927 added a commit to abcd1927/vllm that referenced this pull request on May 19, 2026
HRM-Text was merged into huggingface/transformers main yesterday (huggingface/transformers#46025). The model performs a hierarchical recurrent forward over two transformer stacks (`H` slow, `L` fast) inside nested H/L cycle loops; each recurrence step consumes a distinct KV cache slot, matching transformers' `cycle_offset` formula. Attention uses a sigmoid gate (Qwen3Next-style) and a fused gate+q+k+v projection on disk.

This PR is fully scoped to the new model file, the registry, and the supported-models doc; there are no changes to attention backends, the AttentionType enum, or Attention itself. Pipeline parallelism and LoRA are deferred to follow-up PRs (each currently raises at construction).

## Architecture

- `HrmTextAttention` keeps `gate_proj` (`ColumnParallelLinear`) separate from `qkv_proj` (`QKVParallelLinear`) so both shard cleanly along the head axis under TP. The HF on-disk schema fuses them into a single `attn.gqkv_proj.weight` (rows concatenated as `[gate | q | k | v]`); `HrmTextForCausalLM.load_weights` splits that tensor at load time before passing it to `AutoWeightsLoader`. Each recurrence step gets its own `Attention` instance under a `nn.ModuleDict`, with a unique `prefix` (`...layers.{global_idx}.attn`), so vLLM allocates a distinct KV cache slot per cycle step. Total slots match `H_cycles * (L_cycles + 1) * num_layers_per_stack`, equal to `config.num_hidden_layers` after HF's `__post_init__` inflation.
- `_create_hrm_attention_backend` wraps an attention backend's metadata builder so `causal=False` is set unconditionally on every build. Mirrors `EncoderOnlyAttention`'s `subclass_attention_backend` pattern, but keeps the KV cache (the recurrent forward needs to reuse it).
- `HrmTextModel.forward` matches HF main exactly: nested H-outer L-inner loop, `z_L = L(z_L + z_H)` then `z_H = H(z_H + z_L)`. The Python loop is unrolled by `@support_torch_compile` since cycle counts are config constants.
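The slot accounting in the Architecture notes can be checked with a toy version of the nested loop. This is a sketch under assumptions: the stacks are stand-in scalar functions, `hrm_forward` and `run_stack` are hypothetical names, the initial states and `num_layers_per_stack=4` are toy values, and handing out slots in execution order is only an illustration, not the actual `cycle_offset` formula.

```python
def hrm_forward(x, H_cycles, L_cycles, num_layers_per_stack):
    """Toy nested H-outer / L-inner recurrence that records KV slot usage."""
    slots = []  # global layer indices, in the order they are consumed

    def run_stack(z, scale):
        # Stand-in for a transformer stack: each layer takes the next free slot.
        for _ in range(num_layers_per_stack):
            slots.append(len(slots))
            z = z * scale
        return z

    z_H, z_L = x, 0.0  # assumption: toy initial states
    for _ in range(H_cycles):
        for _ in range(L_cycles):
            z_L = run_stack(z_L + z_H, 0.5)  # z_L = L(z_L + z_H)
        z_H = run_stack(z_H + z_L, 0.5)      # z_H = H(z_H + z_L)
    return z_H, slots


z, slots = hrm_forward(1.0, H_cycles=2, L_cycles=3, num_layers_per_stack=4)
print(len(slots))  # H_cycles * (L_cycles + 1) * num_layers_per_stack
```

Each H-cycle runs the L stack `L_cycles` times and the H stack once, which is where the `(L_cycles + 1)` factor in the slot count comes from.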
## Why unconditional causal=False

vLLM v1 schedules continuous batching. Even without chunked prefill, a single `build()` call can see a *mixed* batch where some requests are decoding and others are entering prefill on this step. Gating on `is_prefilling.all()` keeps `causal=True` in that mixed case, silently running the newly-prefilling requests as pure causal, which diverges from HRM-Text's PrefixLM training distribution.

Unconditional `causal=False` is correct because:

- Prefill rows (query_len=N): bidirectional, matching HF main with `token_type_ids=ones_like(input_ids)`.
- Decode rows (query_len=1): `causal=True` and `causal=False` are mathematically identical (no future tokens to mask).
- FlashAttention varlen kernels apply `causal` per sub-sequence via `cu_seqlens_q`, so a single global flag does not cause cross-prompt contamination.

Empirical validation on `sapientinc/HRM-Text-1B` GSM8K: vLLM matches HF transformers main with `token_type_ids=1` to within statistical noise (82.0% / 82.0% on the first 50 questions; 84.99% on full 1319; identical between `tensor_parallel_size=1` and `tensor_parallel_size=2`).

## On-disk weight schema

HF transformers writes HrmText checkpoints with these keys (per the "hrm_text" entry in `transformers/conversion_mapping.py`):

    model.embed_tokens.weight
    model.z_L_init
    model.{H,L}_module.layers.{i}.attn.gqkv_proj.weight   ([gate|q|k|v] dim 0)
    model.{H,L}_module.layers.{i}.attn.o_proj.weight
    model.{H,L}_module.layers.{i}.mlp.{gate_up,down}_proj.weight

Our model splits `attn.gqkv_proj.weight` into `attn.gate_proj.weight` (first `num_heads * head_dim` rows) and `attn.qkv_proj.weight` (remaining `[q | k | v]` rows) at load time, so vLLM's standard `QKVParallelLinear` weight loader handles TP partitioning of q/k/v without unfusing the gate.

Signed-off-by: Wuyifei <wuyifei@me.com>
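The load-time split described in the weight schema notes amounts to one slice along dim 0. The following is a minimal sketch that uses a plain Python list with one label per row; `split_gqkv` is a hypothetical helper, not the code in `HrmTextForCausalLM.load_weights`, and the toy sizes (including `num_kv_heads`) are assumptions.

```python
def split_gqkv(rows, num_heads, head_dim):
    """Split fused [gate | q | k | v] rows into (gate rows, [q | k | v] rows)."""
    gate_rows = num_heads * head_dim  # the gate occupies the first rows
    return rows[:gate_rows], rows[gate_rows:]


# Toy sizes (assumptions): 2 query heads, 1 KV head, head_dim of 2.
num_heads, num_kv_heads, head_dim = 2, 1, 2

# Each list element stands for one row of the fused weight.
fused = (
    ["gate"] * (num_heads * head_dim)
    + ["q"] * (num_heads * head_dim)
    + ["k"] * (num_kv_heads * head_dim)
    + ["v"] * (num_kv_heads * head_dim)
)

gate_w, qkv_w = split_gqkv(fused, num_heads, head_dim)
print(gate_w)  # rows destined for attn.gate_proj.weight
print(qkv_w)   # rows destined for attn.qkv_proj.weight, still ordered q, k, v
```

The same two slices work unchanged on a 2-D tensor, since slicing the first axis selects rows.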
What does this PR do?
Adds the HRM-Text architecture (Hierarchical Reasoning Model — autoregressive language-modeling variant) by Sapient Intelligence. HRM-Text uses two transformer stacks (H = high-level / slow, L = low-level / fast) traversed in a nested recurrence over the same input embeddings, giving effectively unbounded compute depth at bounded parameter count.
The 1B base checkpoint is published at sapientinc/HRM-Text-1B.
Companion review thread (5 rounds of feedback addressed): https://github.com/huggingface/new-model-addition-hrm/pull/2.
Architectural traits
- `H_cycles × (L_cycles + 1)` traversals over the same input)
- `token_type_ids` (paligemma pattern; bidirectional inside the prefix block, causal elsewhere)
- `o_proj` (Qwen3-Next-style)
- `NanoChatRMSNorm`)
- `L_bp_cycles` k-step gradient routing: training-time only, no effect at inference
- `num_layers_per_stack × H_cycles × (L_cycles + 1)` slots)
- `config.prefix_lm=True` (FA cannot represent the 4-D mask overlay), allowed when `prefix_lm=False`

Files
- `src/transformers/models/hrm_text/{configuration,modeling,modular}_hrm_text.py`
- `src/transformers/models/hrm_text/__init__.py`
- `src/transformers/conversion_mapping.py` (entry for legacy `attn.gqkv_proj` → `self_attn.{gate,q,k,v}_proj` split + `attn.o_proj` → `self_attn.o_proj` rename + `mlp.gate_up_proj` → `mlp.{gate,up}_proj` split)
- `docs/source/en/model_doc/hrm_text.md`
- `tests/models/hrm_text/test_modeling_hrm_text.py`

Before submitting
- `tests/models/hrm_text/`)
- `docs/source/en/model_doc/hrm_text.md`)
- `make fix-repo` passes (verified locally and on H100×8 devbox)

Test commands run
- `pytest tests/models/hrm_text/` → 146 / 0 failed locally (Mac M4)
- `pytest tests/models/hrm_text/` on H100×8 devbox → 154 / 0 failed (default), 167 / 0 failed (`RUN_SLOW=1`)
- `make fix-repo` → no diff on HRM-Text-owned files; tree-wide ruff catch-up bundled in commit `3f535ed05e`
- `trust_remote_code=True` and via in-tree path; logits bitwise identical):
  - `z_L_init` Parameter `requires_grad=False`, 885 unique trained values in `[-3.02, 3.00]`
  - `<|im_start|><|object_ref_start|>The capital of France is<|im_end|>` → `Paris<|box_end|>`
  - `prefix_lm=False`) top-1 100% match