fix(z_image): align cap_pos length with cap_out by passing raw caption length #13576

Open
Anai-Guo wants to merge 2 commits into huggingface:main from Anai-Guo:fix/z-image-cap-pos-len

Summary

Fixes #13574. In transformer_z_image.py (and the # Copied from-marked
controlnet_z_image.py path), the caption call to _pad_with_ids passes
a pre-padded pos_grid_size even though the function pads internally as
well. The position-id tensor is therefore padded twice while the feature
tensor is padded once, so the two end up with different lengths and the
downstream torch.cat yields a cap_pos_ids that is longer than cap_out.

For len(cap_item) = 120 and SEQ_MULTI_OF = 32:

| tensor | length |
| --- | --- |
| caller's pos_grid_size flat | 128 |
| _pad_with_ids pad_len ((-120) % 32) | 8 |
| returned cap_out (120 + 8) | 128 |
| returned pos_ids (128 + 8) | 136 ← bug |
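
The arithmetic in the table can be reproduced with a few lines of plain Python. This is only a sketch of the length bookkeeping described above; the helper name pad_len and the variable names are illustrative assumptions, not identifiers from the diffusers source.

```python
SEQ_MULTI_OF = 32  # value used in the example above


def pad_len(n: int, multiple: int = SEQ_MULTI_OF) -> int:
    """Number of pad tokens needed to round n up to the next multiple."""
    return (-n) % multiple


ori_len = 120                                    # len(cap_item) in the example
inner_pad = pad_len(ori_len)                     # padding added inside _pad_with_ids
caller_grid_len = ori_len + pad_len(ori_len)     # pre-padded length passed by the caller
cap_out_len = ori_len + inner_pad                # features: padded once
buggy_pos_ids_len = caller_grid_len + inner_pad  # position ids: padded twice
fixed_pos_ids_len = ori_len + inner_pad          # position ids when the raw length is passed

print(inner_pad, caller_grid_len, cap_out_len, buggy_pos_ids_len, fixed_pos_ids_len)
```

Python's % returns a non-negative result for a positive modulus, which is why (-120) % 32 evaluates to 8 rather than a negative number.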

_pad_with_ids already rounds ori_len up to SEQ_MULTI_OF and
generates the trailing pad pos-ids itself. The image / siglip call
sites correctly pass the original (F_t, H_t, W_t) / (1, sig_H, sig_W)
grid; only the caption call sites pass the pre-padded length.

Fix

In both patchify_and_embed (basic mode) and patchify_and_embed_omni,
pass (len(cap_feat), 1, 1) / (len(cap_item), 1, 1) to _pad_with_ids
instead of (len(cap_feat) + (-len(cap_feat)) % SEQ_MULTI_OF, 1, 1).
After the fix, cap_pos_ids and cap_out are both total_len
(ori_len + pad_len), so the per-batch torch.cat and downstream
attention shapes line up.

The same _pad_with_ids body is duplicated in
controlnet_z_image.py via a # Copied from marker; the basic-mode
caller there has the same bug and is fixed identically.
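
To make the before/after behavior concrete, here is a simplified, torch-free stand-in for the padding helper. It assumes only what the summary states: the helper derives pad_len from the feature length, builds one position id per grid cell, and appends pad_len extra position ids. The function pad_with_ids_sketch, the (0, 0, 0) pad id, and the list-based tensors are hypothetical simplifications, not the actual diffusers implementation.

```python
from itertools import product

SEQ_MULTI_OF = 32


def pad_with_ids_sketch(features, pos_grid_size, pad_value=0):
    """Hypothetical stand-in for _pad_with_ids, modelling lengths only."""
    ori_len = len(features)
    pad_len = (-ori_len) % SEQ_MULTI_OF
    f, h, w = pos_grid_size
    # One position id per cell of the grid the caller passed in.
    pos_ids = list(product(range(f), range(h), range(w)))
    # The helper appends its own trailing pad position ids (placeholder value).
    pos_ids += [(0, 0, 0)] * pad_len
    padded = list(features) + [pad_value] * pad_len
    return padded, pos_ids


cap_feat = list(range(120))  # dummy caption features of length 120

# Before the fix: the caller rounds the length up before calling the helper.
buggy_grid = (len(cap_feat) + (-len(cap_feat)) % SEQ_MULTI_OF, 1, 1)
# After the fix: the caller passes the raw caption length.
fixed_grid = (len(cap_feat), 1, 1)

buggy_out, buggy_pos = pad_with_ids_sketch(cap_feat, buggy_grid)
fixed_out, fixed_pos = pad_with_ids_sketch(cap_feat, fixed_grid)

print(len(buggy_out), len(buggy_pos))  # 128 136
print(len(fixed_out), len(fixed_pos))  # 128 128
```

Passing the raw length keeps the rounding in one place (inside the helper), which matches how the image and siglip call sites are described above.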

Test plan

  • Existing Z-Image tests still pass.
  • Manually verify with the issue's repro that
    len(all_cap_pos_ids[i]) == len(all_cap_out[i]) for every batch
    element with non-multiple-of-32 caption lengths.
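
The manual check could be approximated without loading the model by asserting the length invariant over a few caption lengths. This sketch models only the lengths described in the summary; the lengths helper and the chosen test values are assumptions for illustration and do not replace running the issue's repro against the real transformer.

```python
SEQ_MULTI_OF = 32


def lengths(ori_len: int, pre_padded: bool):
    """Return (cap_out_len, pos_ids_len) under the length model from the summary."""
    pad_len = (-ori_len) % SEQ_MULTI_OF
    # pre_padded=True mimics the old call site, which rounded up before calling.
    grid_len = ori_len + pad_len if pre_padded else ori_len
    return ori_len + pad_len, grid_len + pad_len


results = {}
for n in (1, 31, 32, 33, 120):
    out_len, pos_len = lengths(n, pre_padded=False)
    # With the fix, both lengths agree and are a multiple of SEQ_MULTI_OF.
    assert out_len == pos_len and out_len % SEQ_MULTI_OF == 0
    bug_out, bug_pos = lengths(n, pre_padded=True)
    # The old call site mismatches exactly when n is not a multiple of 32.
    assert (bug_out != bug_pos) == (n % SEQ_MULTI_OF != 0)
    results[n] = (out_len, pos_len, bug_pos)

print(results)
```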

🤖 Generated with Claude Code

The github-actions bot added the labels fixes-issue, models, and size/S (PR with diff < 50 LOC) on Apr 29, 2026.


Development

Successfully merging this pull request may close these issues.

Z-Image text position encoding length mismatch
