Add hrm text by abcd1927 · Pull Request #46025 · huggingface/transformers

abcd1927 · 2026-05-18T08:06:44Z

What does this PR do?

Adds the HRM-Text architecture (Hierarchical Reasoning Model — autoregressive language-modeling variant) by Sapient Intelligence. HRM-Text uses two transformer stacks (H = high-level / slow, L = low-level / fast) traversed in a nested recurrence over the same input embeddings, giving effectively unbounded compute depth at bounded parameter count.
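For illustration, the nested recurrence can be sketched with toy scalar "stacks". This is a minimal sketch, not the model's code: the update order follows the `z_L = L(z_L + z_H)` then `z_H = H(z_H + z_L)` description quoted later in this thread, while `hrm_forward`, the scalar states, and the choice to start `z_H` from the input embedding are assumptions made for the example.

```python
def hrm_forward(x, z_l_init, h_stack, l_stack, h_cycles, l_cycles):
    """Toy nested H/L recurrence. Returns (final z_H, number of stack traversals)."""
    # Assumption: z_H starts from the input embedding, z_L from a fixed init state.
    z_h, z_l = x, z_l_init
    traversals = 0
    for _ in range(h_cycles):            # slow, outer loop
        for _ in range(l_cycles):        # fast, inner loop
            z_l = l_stack(z_l + z_h)     # z_L = L(z_L + z_H)
            traversals += 1
        z_h = h_stack(z_h + z_l)         # z_H = H(z_H + z_L)
        traversals += 1
    return z_h, traversals


# Stand-ins for the two transformer stacks (hypothetical, scalar-valued).
def toy_h_stack(v):
    return 0.5 * v


def toy_l_stack(v):
    return 0.25 * v


out, n_traversals = hrm_forward(1.0, 0.0, toy_h_stack, toy_l_stack,
                                h_cycles=2, l_cycles=3)
# Each H-cycle runs the L stack l_cycles times and the H stack once.
assert n_traversals == 2 * (3 + 1)
```

The parameter count depends only on the two stacks, while the traversal count grows with `h_cycles` and `l_cycles`, which is the depth-versus-parameters trade-off the description refers to.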

The 1B base checkpoint is published at sapientinc/HRM-Text-1B.

Companion review thread (5 rounds of feedback addressed): https://github.com/huggingface/new-model-addition-hrm/pull/2.

Architectural traits

Dual H/L transformer stacks, hierarchical recurrence (H_cycles × (L_cycles + 1) traversals over the same input)
PrefixLM mask via token_type_ids (paligemma pattern; bidirectional inside the prefix block, causal elsewhere)
Per-head sigmoid output gate applied to the attention output before o_proj (Qwen3-Next-style)
Parameterless RMSNorm (inherits NanoChatRMSNorm)
L_bp_cycles k-step gradient routing — training-time only, no effect at inference
KV-cache slot expansion across the recurrent invocations (num_layers_per_stack × H_cycles × (L_cycles + 1) slots)
Conditional FlashAttention support: rejected when config.prefix_lm=True (FA cannot represent the 4-D mask overlay), allowed when prefix_lm=False
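The PrefixLM rule from the list above (bidirectional inside the prefix block, causal elsewhere) can be sketched as a boolean mask. This is a hedged, pure-Python illustration for a single unpadded sequence; `prefix_lm_mask` is a hypothetical helper, and treating `token_type_ids == 1` as "prefix" is an assumption consistent with the `token_type_ids=ones_like(input_ids)` usage mentioned later in this thread.

```python
def prefix_lm_mask(token_type_ids):
    """Return mask[i][j] == True when query position i may attend to key position j.

    Assumption: token_type_ids[i] == 1 marks a prefix token, 0 marks a normal
    (causal) token. No padding or batching is handled here.
    """
    n = len(token_type_ids)
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            causal = j <= i  # standard lower-triangular rule
            both_prefix = token_type_ids[i] == 1 and token_type_ids[j] == 1
            row.append(causal or both_prefix)
        mask.append(row)
    return mask


# Two prefix tokens followed by one causal token.
example_mask = prefix_lm_mask([1, 1, 0])
```

With all-zero `token_type_ids` the mask reduces to plain causal attention; with all ones it is fully bidirectional.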

Files

src/transformers/models/hrm_text/{configuration,modeling,modular}_hrm_text.py
src/transformers/models/hrm_text/__init__.py
src/transformers/conversion_mapping.py (entry for legacy attn.gqkv_proj → self_attn.{gate,q,k,v}_proj split + attn.o_proj → self_attn.o_proj rename + mlp.gate_up_proj → mlp.{gate,up}_proj split)
docs/source/en/model_doc/hrm_text.md
tests/models/hrm_text/test_modeling_hrm_text.py

Before submitting

Read the contributor guideline
Tests added / updated (tests/models/hrm_text/)
Documentation added (docs/source/en/model_doc/hrm_text.md)
Pre-commit / make fix-repo passes (verified locally and on H100×8 devbox)

Test commands run

pytest tests/models/hrm_text/ → 146 passed / 0 failed locally (Mac M4)
pytest tests/models/hrm_text/ on H100×8 devbox → 154 passed / 0 failed (default), 167 passed / 0 failed (RUN_SLOW=1)
make fix-repo → no diff on HRM-Text-owned files; tree-wide ruff catch-up bundled in commit 3f535ed05e
End-to-end smoke on the released 1.2B base checkpoint (loaded via trust_remote_code=True and via in-tree path; logits bitwise identical):
- z_L_init Parameter requires_grad=False, 885 unique trained values in [-3.02, 3.00]
- Greedy generation: <|im_start|><|object_ref_start|>The capital of France is<|im_end|> → Paris<|box_end|>
- SDPA vs FA-2 (prefix_lm=False) top-1 100% match

`L_bp_cycles_padded` is indexed by `high_cycle_idx ∈ [0, H_cycles)` inside the recurrent forward, but `HrmTextModel.__init__` was left-padding it to length `config.L_cycles` instead of `config.H_cycles`. With the upstream defaults (`H_cycles=2`, `L_cycles=3`, `L_bp_cycles=[2]`) this silently produced `L_bp_cycles_padded=[1, 1, 2]`, so the index-1 read in the second H-cycle picked up the leading pad value (1) and the trailing 2 was never reached. Inference is unaffected (the value is only consulted under autograd in training); training-time gradient propagation through the last H-cycle was capped at 1 L-iteration instead of `raw_bp[-1]` (default 2).
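The off-by-length behaviour described above can be reproduced with plain lists. This is a minimal sketch: `left_pad` is a hypothetical helper, and the pad value of 1 is inferred from the `[1, 1, 2]` result quoted above rather than taken from the model code.

```python
def left_pad(raw_bp, length, pad_value=1):
    """Left-pad raw_bp with pad_value until it has `length` entries."""
    if len(raw_bp) > length:
        raise ValueError("raw_bp is longer than the target length")
    return [pad_value] * (length - len(raw_bp)) + list(raw_bp)


H_cycles, L_cycles, raw_bp = 2, 3, [2]   # upstream defaults quoted above

buggy = left_pad(raw_bp, L_cycles)   # padded to the wrong length
fixed = left_pad(raw_bp, H_cycles)   # padded to the length that is indexed

last_h_cycle = H_cycles - 1          # high_cycle_idx of the final H-cycle
assert buggy == [1, 1, 2] and buggy[last_h_cycle] == 1   # pad value is read
assert fixed == [1, 2] and fixed[last_h_cycle] == raw_bp[-1]
```

Because the list is indexed by `high_cycle_idx`, only the first `H_cycles` entries are ever read, so any padding beyond that length pushes the configured values out of reach.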

vasqu

Just checking the last details 🫡 pretty much ready for merge

HuggingFaceDocBuilderDev · 2026-05-18T08:34:39Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

github-actions · 2026-05-18T08:42:59Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, hrm_text

vasqu · 2026-05-18T08:50:11Z

run-slow: hrm_text

github-actions · 2026-05-18T08:51:30Z

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/hrm_text"]
quantizations: []

github-actions · 2026-05-18T09:02:28Z

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit | Description |
| --- | --- | --- |
| RUN | 7bc14b27 | workflow commit (merge commit) |
| PR | 8895dcb0 | branch commit (from PR) |
| main | 461d4288 | base commit (on `main`) |

✅ No failing test specific to this PR 🎉 👏 !

HRM-Text was merged into huggingface/transformers main yesterday (huggingface/transformers#46025). The model performs a hierarchical recurrent forward over two transformer stacks (`H` slow, `L` fast) inside nested H/L cycle loops; each recurrence step consumes a distinct KV cache slot, matching transformers' `cycle_offset` formula. Attention uses a sigmoid gate (Qwen3Next-style) and a fused gate+q+k+v projection on disk.

This PR is fully scoped to the new model file, the registry, and the supported-models doc: no changes to attention backends, the AttentionType enum, or Attention itself. Pipeline parallelism and LoRA are deferred to follow-up PRs (each currently raises at construction).

## Architecture

- `HrmTextAttention` keeps `gate_proj` (`ColumnParallelLinear`) separate from `qkv_proj` (`QKVParallelLinear`) so both shard cleanly along the head axis under TP. The HF on-disk schema fuses them into a single `attn.gqkv_proj.weight` (rows concatenated as `[gate | q | k | v]`); `HrmTextForCausalLM.load_weights` splits that tensor at load time before passing it to `AutoWeightsLoader`. Each recurrence step gets its own `Attention` instance under a `nn.ModuleDict`, with a unique `prefix` (`...layers.{global_idx}.attn`), so vLLM allocates a distinct KV cache slot per cycle step. Total slots match `H_cycles * (L_cycles + 1) * num_layers_per_stack`, equal to `config.num_hidden_layers` after HF's `__post_init__` inflation.
- `_create_hrm_attention_backend` wraps an attention backend's metadata builder so `causal=False` is set unconditionally on every build. Mirrors `EncoderOnlyAttention`'s `subclass_attention_backend` pattern, but keeps the KV cache (the recurrent forward needs to reuse it).
- `HrmTextModel.forward` matches HF main exactly: nested H-outer L-inner loop, `z_L = L(z_L + z_H)` then `z_H = H(z_H + z_L)`. The Python loop is unrolled by `@support_torch_compile` since cycle counts are config constants.
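The slot bookkeeping described above can be sketched as a flat index over (H-cycle, step, layer). The exact `cycle_offset` formula is not reproduced in this thread, so the ordering below (H-cycle major, then step, then layer) is an assumption, and `global_layer_index` / `enumerate_slots` are hypothetical names. The sketch only demonstrates that one unique index per recurrence step yields `H_cycles * (L_cycles + 1) * num_layers_per_stack` slots.

```python
def global_layer_index(h_idx, step_idx, layer_idx, l_cycles, layers_per_stack):
    """Flatten (H-cycle, step within the cycle, layer within the stack) to one index.

    Assumption: steps 0..l_cycles-1 are L-stack passes and step l_cycles is the
    H-stack pass that closes the H-cycle.
    """
    steps_per_h_cycle = l_cycles + 1
    return (h_idx * steps_per_h_cycle + step_idx) * layers_per_stack + layer_idx


def enumerate_slots(h_cycles, l_cycles, layers_per_stack):
    """List (global index, stack name, illustrative prefix suffix) for every slot."""
    slots = []
    for h_idx in range(h_cycles):
        for step_idx in range(l_cycles + 1):
            stack = "L" if step_idx < l_cycles else "H"
            for layer_idx in range(layers_per_stack):
                idx = global_layer_index(h_idx, step_idx, layer_idx,
                                         l_cycles, layers_per_stack)
                slots.append((idx, stack, f"layers.{idx}.attn"))
    return slots


slots = enumerate_slots(h_cycles=2, l_cycles=3, layers_per_stack=4)
indices = [idx for idx, _, _ in slots]
# Every recurrence step gets its own slot, so indices are unique and dense.
assert indices == list(range(2 * (3 + 1) * 4))
```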
## Why unconditional causal=False

vLLM v1 schedules continuous batching. Even without chunked prefill, a single `build()` call can see a *mixed* batch where some requests are decoding and others are entering prefill on this step. Gating on `is_prefilling.all()` keeps `causal=True` in that mixed case, silently running the newly-prefilling requests as pure causal, which diverges from HRM-Text's PrefixLM training distribution.

Unconditional `causal=False` is correct because:

- Prefill rows (query_len=N): bidirectional, matching HF main with `token_type_ids=ones_like(input_ids)`.
- Decode rows (query_len=1): `causal=True` and `causal=False` are mathematically identical (no future tokens to mask).
- FlashAttention varlen kernels apply `causal` per sub-sequence via `cu_seqlens_q`, so a single global flag does not cause cross-prompt contamination.

Empirical validation on `sapientinc/HRM-Text-1B` GSM8K: vLLM matches HF transformers main with `token_type_ids=1` to within statistical noise (82.0% / 82.0% on the first 50 questions; 84.99% on full 1319; identical between `tensor_parallel_size=1` and `tensor_parallel_size=2`).

## On-disk weight schema

HF transformers writes HrmText checkpoints with these keys (per the "hrm_text" entry in `transformers/conversion_mapping.py`):

    model.embed_tokens.weight
    model.z_L_init
    model.{H,L}_module.layers.{i}.attn.gqkv_proj.weight   ([gate|q|k|v] dim 0)
    model.{H,L}_module.layers.{i}.attn.o_proj.weight
    model.{H,L}_module.layers.{i}.mlp.{gate_up,down}_proj.weight

Our model splits `attn.gqkv_proj.weight` into `attn.gate_proj.weight` (first `num_heads * head_dim` rows) and `attn.qkv_proj.weight` (remaining `[q | k | v]` rows) at load time, so vLLM's standard `QKVParallelLinear` weight loader handles TP partitioning of q/k/v without unfusing the gate.

Signed-off-by: Wuyifei <wuyifei@me.com>
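The load-time split of `attn.gqkv_proj.weight` described above amounts to slicing along dim 0. The following is a minimal sketch using nested Python lists in place of tensors; `split_gqkv` and the toy dimensions are hypothetical, and the real loader presumably operates on tensors and handles TP sharding, which is not shown.

```python
def split_gqkv(weight_rows, num_heads, head_dim):
    """Split a fused [gate | q | k | v] weight (list of rows) along dim 0.

    Returns (gate_rows, qkv_rows): the first num_heads * head_dim rows are the
    gate projection; the remaining rows stay fused as [q | k | v].
    """
    gate_size = num_heads * head_dim
    if len(weight_rows) <= gate_size:
        raise ValueError("fused weight has no rows left for q/k/v")
    return weight_rows[:gate_size], weight_rows[gate_size:]


# Toy dimensions (assumptions for the example only).
num_heads, num_kv_heads, head_dim, hidden = 2, 1, 2, 3
sizes = {
    "gate": num_heads * head_dim,
    "q": num_heads * head_dim,
    "k": num_kv_heads * head_dim,
    "v": num_kv_heads * head_dim,
}

# Build a fused weight whose rows are tagged by origin so the split is visible.
fused = [[name] * hidden for name, rows in sizes.items() for _ in range(rows)]

gate_w, qkv_w = split_gqkv(fused, num_heads, head_dim)
assert all(row[0] == "gate" for row in gate_w)
assert [row[0] for row in qkv_w] == ["q"] * 4 + ["k"] * 2 + ["v"] * 2
```

Leaving `[q | k | v]` fused after the split is what allows a standard fused-QKV loader to take over from that point.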

vasqu and others added 6 commits May 15, 2026 15:13

make new branch because other branch has messed up diff

051fb71

fix

76aaff5

ignore trf rule here, special case where we set requires grad

db922b5

forgot the conversion mapping

ca028db

fixes

6506e7f

vasqu approved these changes May 18, 2026


test

b59c1d1

skip TP tests for now

eb8e234

vasqu added 2 commits May 18, 2026 10:45

style

bacbfa1

last skip

8895dcb

vasqu added the New model label May 18, 2026

vasqu merged commit ca80e95 into huggingface:main May 18, 2026
25 checks passed

abcd1927 mentioned this pull request May 19, 2026

[Model] Add HrmTextForCausalLM (Hierarchical Reasoning Model — Text) vllm-project/vllm#43098

Open

vasqu merged 10 commits into huggingface:main from abcd1927:add-hrm-text

3 participants
