Add hrm text#46025

Merged
vasqu merged 10 commits into
huggingface:mainfrom
abcd1927:add-hrm-text
May 18, 2026

@abcd1927
Contributor

What does this PR do?

Adds the HRM-Text architecture (Hierarchical Reasoning Model, autoregressive language-modeling variant) by Sapient Intelligence. HRM-Text uses two transformer stacks (H = high-level / slow, L = low-level / fast) traversed in a nested recurrence over the same input embeddings, so compute depth grows with the cycle counts (H_cycles, L_cycles) while the parameter count stays fixed.

The 1B base checkpoint is published at sapientinc/HRM-Text-1B.

Companion review thread (5 rounds of feedback addressed): https://github.com/huggingface/new-model-addition-hrm/pull/2.

Architectural traits

  • Dual H/L transformer stacks, hierarchical recurrence (H_cycles × (L_cycles + 1) traversals over the same input)
  • PrefixLM mask via token_type_ids (paligemma pattern; bidirectional inside the prefix block, causal elsewhere)
  • Per-head sigmoid output gate applied to the attention output before o_proj (Qwen3-Next-style)
  • Parameterless RMSNorm (inherits NanoChatRMSNorm)
  • L_bp_cycles k-step gradient routing — training-time only, no effect at inference
  • KV-cache slot expansion across the recurrent invocations (num_layers_per_stack × H_cycles × (L_cycles + 1) slots)
  • Conditional FlashAttention support: rejected when config.prefix_lm=True (FA cannot represent the 4-D mask overlay), allowed when prefix_lm=False
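
The PrefixLM mask in the list above can be illustrated with a small sketch. This is not the PR's implementation: `prefix_lm_mask` is a hypothetical helper, it works on plain Python lists rather than tensors, and it assumes `token_type_ids == 1` marks prefix tokens (consistent with the `token_type_ids=ones_like(input_ids)` prefill case mentioned later in this thread).

```python
def prefix_lm_mask(token_type_ids):
    """Return mask[i][j] == True when query position i may attend to key j.

    Assumption: token_type_ids[i] == 1 marks a prefix token, 0 anything else.
    """
    n = len(token_type_ids)
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            causal = j <= i  # standard lower-triangular rule
            both_in_prefix = token_type_ids[i] == 1 and token_type_ids[j] == 1
            row.append(causal or both_in_prefix)  # overlay the prefix block
        mask.append(row)
    return mask


# Three prefix tokens followed by two non-prefix tokens.
for row in prefix_lm_mask([1, 1, 1, 0, 0]):
    print("".join("x" if allowed else "." for allowed in row))
```

The prefix rows see the whole prefix block in both directions, while the remaining rows stay lower-triangular. With all-zero `token_type_ids` the mask reduces to a plain causal mask.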

Files

  • src/transformers/models/hrm_text/{configuration,modeling,modular}_hrm_text.py
  • src/transformers/models/hrm_text/__init__.py
  • src/transformers/conversion_mapping.py (entry for legacy attn.gqkv_proj → self_attn.{gate,q,k,v}_proj split + attn.o_proj → self_attn.o_proj rename + mlp.gate_up_proj → mlp.{gate,up}_proj split)
  • docs/source/en/model_doc/hrm_text.md
  • tests/models/hrm_text/test_modeling_hrm_text.py

Before submitting

  • Read the contributor guideline
  • Tests added / updated (tests/models/hrm_text/)
  • Documentation added (docs/source/en/model_doc/hrm_text.md)
  • Pre-commit / make fix-repo passes (verified locally and on H100×8 devbox)

Test commands run

  • pytest tests/models/hrm_text/ → 146 passed / 0 failed locally (Mac M4)
  • pytest tests/models/hrm_text/ on H100×8 devbox → 154 passed / 0 failed (default), 167 passed / 0 failed (RUN_SLOW=1)
  • make fix-repo → no diff on HRM-Text-owned files; tree-wide ruff catch-up bundled in commit 3f535ed05e
  • End-to-end smoke on the released 1.2B base checkpoint (loaded via trust_remote_code=True and via in-tree path; logits bitwise identical):
    • z_L_init Parameter requires_grad=False, 885 unique trained values in [-3.02, 3.00]
    • Greedy generation: <|im_start|><|object_ref_start|>The capital of France is<|im_end|>Paris<|box_end|>
    • SDPA vs FA-2 (prefix_lm=False) top-1 100% match

vasqu and others added 6 commits May 15, 2026 15:13
`L_bp_cycles_padded` is indexed by `high_cycle_idx ∈ [0, H_cycles)`
inside the recurrent forward, but `HrmTextModel.__init__` was
left-padding it to length `config.L_cycles` instead of
`config.H_cycles`. With the upstream defaults (`H_cycles=2`,
`L_cycles=3`, `L_bp_cycles=[2]`) this silently produced
`L_bp_cycles_padded=[1, 1, 2]`, so the index-1 read in the second
H-cycle picked up the leading pad value (1) and the trailing 2 was
never reached. Inference is unaffected (the value is only consulted
under autograd in training); training-time gradient propagation
through the last H-cycle was capped at 1 L-iteration instead of
`raw_bp[-1]` (default 2).
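
The off-by-length padding described in this commit message can be reproduced in a few lines. This is a minimal sketch, not the model code: `left_pad` is a hypothetical helper, and the pad value of 1 is taken from the message's description of the leading pad.

```python
def left_pad(values, length, pad_value=1):
    """Left-pad `values` with `pad_value` up to `length` entries (hypothetical helper)."""
    return [pad_value] * (length - len(values)) + list(values)


# Upstream defaults quoted in the commit message.
H_cycles, L_cycles, L_bp_cycles = 2, 3, [2]

buggy = left_pad(L_bp_cycles, L_cycles)  # padded to the wrong length
fixed = left_pad(L_bp_cycles, H_cycles)  # padded to the length that is indexed

last_h = H_cycles - 1  # index read in the final H-cycle
print(buggy, "->", buggy[last_h])
print(fixed, "->", fixed[last_h])
```

Because `high_cycle_idx` never exceeds `H_cycles - 1`, the trailing entry of the over-long list is unreachable, which is why the last H-cycle read the pad value instead of `raw_bp[-1]`.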

@vasqu vasqu left a comment

Just checking the last details 🫡 pretty much ready for merge

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, hrm_text

@vasqu
Contributor

vasqu commented May 18, 2026

run-slow: hrm_text

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/hrm_text"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit | Description |
|---------|----------|-------------------------------|
| RUN | 7bc14b27 | workflow commit (merge commit) |
| PR | 8895dcb0 | branch commit (from PR) |
| main | 461d4288 | base commit (on main) |

✅ No failing test specific to this PR 🎉 👏 !

@vasqu vasqu merged commit ca80e95 into huggingface:main May 18, 2026
25 checks passed
abcd1927 added a commit to abcd1927/vllm that referenced this pull request May 19, 2026
HRM-Text was merged into huggingface/transformers main yesterday
(huggingface/transformers#46025). The model
performs a hierarchical recurrent forward over two transformer stacks
(`H` slow, `L` fast) inside nested H/L cycle loops; each recurrence
step consumes a distinct KV cache slot, matching transformers'
`cycle_offset` formula. Attention uses a sigmoid gate (Qwen3Next-style)
and a fused gate+q+k+v projection on disk.

This PR is fully scoped to the new model file, the registry, and the
supported-models doc — no changes to attention backends, the
AttentionType enum, or Attention itself. Pipeline parallelism and LoRA
are deferred to follow-up PRs (each currently raises at construction).

## Architecture

- `HrmTextAttention` keeps `gate_proj` (`ColumnParallelLinear`) separate
  from `qkv_proj` (`QKVParallelLinear`) so both shard cleanly along the
  head axis under TP. The HF on-disk schema fuses them into a single
  `attn.gqkv_proj.weight` (rows concatenated as `[gate | q | k | v]`);
  `HrmTextForCausalLM.load_weights` splits that tensor at load time
  before passing it to `AutoWeightsLoader`. Each recurrence step gets
  its own `Attention` instance under a `nn.ModuleDict`, with a unique
  `prefix` (`...layers.{global_idx}.attn`), so vLLM allocates a
  distinct KV cache slot per cycle step. Total slots match
  `H_cycles * (L_cycles + 1) * num_layers_per_stack`, equal to
  `config.num_hidden_layers` after HF's `__post_init__` inflation.

- `_create_hrm_attention_backend` wraps an attention backend's metadata
  builder so `causal=False` is set unconditionally on every build.
  Mirrors `EncoderOnlyAttention`'s `subclass_attention_backend` pattern,
  but keeps the KV cache (the recurrent forward needs to reuse it).

- `HrmTextModel.forward` matches HF main exactly: nested H-outer
  L-inner loop, `z_L = L(z_L + z_H)` then `z_H = H(z_H + z_L)`. The
  Python loop is unrolled by `@support_torch_compile` since cycle counts
  are config constants.
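
To make the slot arithmetic above concrete, here is a rough sketch of the traversal schedule. It assumes each H-cycle runs `L_cycles` traversals of the L stack followed by one traversal of the H stack (which is how the `L_cycles + 1` factor reads), and it numbers slots in execution order. The actual `cycle_offset` formula is not shown in this message and may differ. `traversal_schedule` and `num_layers_per_stack=4` are illustrative only.

```python
def traversal_schedule(H_cycles, L_cycles, num_layers_per_stack):
    """List (stack, h_idx, step_idx, first_slot) for every stack traversal.

    Assumption: KV cache slots are numbered in execution order.
    """
    schedule = []
    slot = 0
    for h in range(H_cycles):
        for l in range(L_cycles):
            schedule.append(("L", h, l, slot))  # z_L = L(z_L + z_H)
            slot += num_layers_per_stack
        schedule.append(("H", h, L_cycles, slot))  # z_H = H(z_H + z_L)
        slot += num_layers_per_stack
    return schedule, slot


schedule, total_slots = traversal_schedule(
    H_cycles=2, L_cycles=3, num_layers_per_stack=4
)
for entry in schedule:
    print(entry)
print("total slots:", total_slots)
```

Under these assumptions the total equals `H_cycles * (L_cycles + 1) * num_layers_per_stack`, matching the count stated in the first bullet.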

## Why unconditional causal=False

vLLM v1 schedules requests with continuous batching. Even without chunked prefill, a
single `build()` call can see a *mixed* batch where some requests are
decoding and others are entering prefill on this step. Gating on
`is_prefilling.all()` keeps `causal=True` in that mixed case, silently
running the newly-prefilling requests as pure causal — which diverges
from HRM-Text's PrefixLM training distribution.

Unconditional `causal=False` is correct because:
- Prefill rows (query_len=N): bidirectional, matching HF main with
  `token_type_ids=ones_like(input_ids)`.
- Decode rows (query_len=1): `causal=True` and `causal=False` are
  mathematically identical (no future tokens to mask).
- FlashAttention varlen kernels apply `causal` per sub-sequence via
  `cu_seqlens_q`, so a single global flag does not cause cross-prompt
  contamination.
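
The decode-row argument can be checked with a toy mask enumeration. This is a sketch on plain Python lists, not an attention kernel: `allowed_keys` is a hypothetical helper, and it assumes queries are the last `query_len` positions of the sequence (bottom-right aligned causal masking).

```python
def allowed_keys(query_len, kv_len, causal):
    """For each query row, list the key positions it may attend to.

    Assumption: queries occupy the last `query_len` positions of a
    `kv_len`-long sequence, as in decode with a KV cache.
    """
    rows = []
    for q in range(query_len):
        q_pos = kv_len - query_len + q  # absolute position of this query
        if causal:
            rows.append([k for k in range(kv_len) if k <= q_pos])
        else:
            rows.append(list(range(kv_len)))
    return rows


# Decode: one new token against 5 positions. The flag makes no difference.
decode_same = allowed_keys(1, 5, causal=True) == allowed_keys(1, 5, causal=False)
# Prefill: 4 queries over 4 positions. The flag changes the mask.
prefill_same = allowed_keys(4, 4, causal=True) == allowed_keys(4, 4, causal=False)
print(decode_same, prefill_same)
```

The single decode query sits at the last position, so the causal rule `k <= q_pos` already admits every key.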

Empirical validation on `sapientinc/HRM-Text-1B` GSM8K: vLLM matches HF
transformers main with `token_type_ids=1` to within statistical noise
(82.0% / 82.0% on the first 50 questions; 84.99% on full 1319;
identical between `tensor_parallel_size=1` and `tensor_parallel_size=2`).

## On-disk weight schema

HF transformers writes HrmText checkpoints with these keys (per the
"hrm_text" entry in `transformers/conversion_mapping.py`):

  model.embed_tokens.weight
  model.z_L_init
  model.{H,L}_module.layers.{i}.attn.gqkv_proj.weight  ([gate|q|k|v] dim 0)
  model.{H,L}_module.layers.{i}.attn.o_proj.weight
  model.{H,L}_module.layers.{i}.mlp.{gate_up,down}_proj.weight

Our model splits `attn.gqkv_proj.weight` into `attn.gate_proj.weight`
(first `num_heads * head_dim` rows) and `attn.qkv_proj.weight`
(remaining `[q | k | v]` rows) at load time, so vLLM's standard
`QKVParallelLinear` weight loader handles TP partitioning of q/k/v
without unfusing the gate.
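
The row split described above can be sketched as follows. This is an illustration on labelled rows, not the actual `load_weights` code: `split_gqkv`, the toy head counts, and the use of a separate `num_kv_heads` are assumptions made for the example.

```python
def split_gqkv(gqkv_rows, num_heads, head_dim):
    """Split fused [gate | q | k | v] rows into (gate_rows, qkv_rows).

    The gate occupies the first num_heads * head_dim rows; the rest is
    passed on as a fused [q | k | v] block.
    """
    gate_size = num_heads * head_dim
    return gqkv_rows[:gate_size], gqkv_rows[gate_size:]


# Toy sizes (assumed for illustration only).
num_heads, num_kv_heads, head_dim = 2, 1, 2

# Each "row" is just a label standing in for one row of the weight matrix.
fused = (
    [f"gate{i}" for i in range(num_heads * head_dim)]
    + [f"q{i}" for i in range(num_heads * head_dim)]
    + [f"k{i}" for i in range(num_kv_heads * head_dim)]
    + [f"v{i}" for i in range(num_kv_heads * head_dim)]
)

gate_rows, qkv_rows = split_gqkv(fused, num_heads, head_dim)
print(gate_rows)
print(qkv_rows)
```

With real tensors the same split is a slice along dim 0 of `attn.gqkv_proj.weight`.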

Signed-off-by: Wuyifei <wuyifei@me.com>
3 participants