fix(z_image): align cap_pos length with cap_out by passing raw caption length #13576
Open
Anai-Guo wants to merge 2 commits into huggingface:main
Summary
Fixes #13574. In `transformer_z_image.py` (and the `# Copied from`-marked `controlnet_z_image.py` path), the caption call to `_pad_with_ids` passes a pre-padded `pos_grid_size` while the function itself also pads. This double-pads the position-id tensor while the feature tensor only gets single-padded, so they end up with different lengths and the downstream `torch.cat` produces a `cap_pos_ids` longer than `cap_out`.

For `len(cap_item) = 120` and `SEQ_MULTI_OF = 32`:

| `pos_grid_size` flat | `_pad_with_ids` `pad_len` | `cap_out` | `pos_ids` |
| --- | --- | --- | --- |
| 120 + 8 = 128 | (-120) % 32 = 8 | 120 + 8 = 128 | 128 + 8 = 136 |

`_pad_with_ids` already rounds `ori_len` up to `SEQ_MULTI_OF` and generates the trailing pad pos-ids itself. The image / siglip call sites correctly pass the original `(F_t, H_t, W_t)` / `(1, sig_H, sig_W)` grid; only the caption call sites pass the pre-padded length.
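A minimal, self-contained sketch of the mismatch. The `pad_with_ids_sketch` helper below is a hypothetical stand-in that mirrors the padding behavior described above, not the actual diffusers implementation:

```python
import torch

SEQ_MULTI_OF = 32

def pad_with_ids_sketch(feat, grid_size):
    """Pad feat up to a multiple of SEQ_MULTI_OF and build matching pos-ids.

    Simplified stand-in: pos-ids are generated for the full grid, then
    pad_len trailing pad pos-ids are appended (as described in the summary).
    """
    ori_len = feat.shape[0]
    pad_len = (-ori_len) % SEQ_MULTI_OF
    feat = torch.cat([feat, feat.new_zeros(pad_len, feat.shape[1])])
    n_grid = grid_size[0] * grid_size[1] * grid_size[2]
    pos_ids = torch.arange(n_grid + pad_len)  # grid ids + trailing pad ids
    return feat, pos_ids

cap_feat = torch.randn(120, 8)

# buggy caller: pre-pads the grid length, so the pos-ids get padded twice
padded_grid = (120 + (-120) % SEQ_MULTI_OF, 1, 1)  # (128, 1, 1)
cap_out, cap_pos_ids = pad_with_ids_sketch(cap_feat, padded_grid)
print(len(cap_out), len(cap_pos_ids))  # 128 136 -> lengths diverge

# fixed caller: passes the raw caption length
cap_out, cap_pos_ids = pad_with_ids_sketch(cap_feat, (120, 1, 1))
print(len(cap_out), len(cap_pos_ids))  # 128 128 -> lengths match
```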
Fix
In both `patchify_and_embed` (basic mode) and `patchify_and_embed_omni`, pass `(len(cap_feat), 1, 1)` / `(len(cap_item), 1, 1)` to `_pad_with_ids` instead of `(len(cap_feat) + (-len(cap_feat)) % SEQ_MULTI_OF, 1, 1)`. After the fix, `cap_pos_ids` and `cap_out` both have length `total_len` (`ori_len + pad_len`), so the per-batch `torch.cat` and downstream attention shapes line up.
The same `_pad_with_ids` body is duplicated in `controlnet_z_image.py` via a `# Copied from` marker; the basic-mode caller there has the same bug and is fixed identically.
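A before/after sketch of the caption call site. The surrounding code, the `self.` receiver, and the return names are assumptions based on the description, not the exact diffusers source:

```python
# approximate caption call site in patchify_and_embed (basic mode)

# before: the grid length is pre-padded, so _pad_with_ids pads the
# pos-ids a second time and cap_pos_ids ends up longer than cap_out
cap_out, cap_pos_ids = self._pad_with_ids(
    cap_feat, (len(cap_feat) + (-len(cap_feat)) % SEQ_MULTI_OF, 1, 1)
)

# after: pass the raw caption length; _pad_with_ids rounds it up to
# SEQ_MULTI_OF and appends the trailing pad pos-ids itself
cap_out, cap_pos_ids = self._pad_with_ids(cap_feat, (len(cap_feat), 1, 1))
```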
Test plan
Verified that `len(all_cap_pos_ids[i]) == len(all_cap_out[i])` for every batch element with non-multiple-of-32 caption lengths.
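A hypothetical shape check along the lines of the test plan, run here against the simplified stand-in from the summary rather than the real model:

```python
import torch

SEQ_MULTI_OF = 32

def pad_with_ids_sketch(feat, grid_size):
    # same simplified stand-in as in the summary sketch above
    ori_len = feat.shape[0]
    pad_len = (-ori_len) % SEQ_MULTI_OF
    feat = torch.cat([feat, feat.new_zeros(pad_len, feat.shape[1])])
    pos_ids = torch.arange(grid_size[0] * grid_size[1] * grid_size[2] + pad_len)
    return feat, pos_ids

# non-multiple-of-32 caption lengths exercise the padding path
for cap_len in (1, 31, 33, 120, 127):
    cap_feat = torch.randn(cap_len, 8)
    cap_out, cap_pos_ids = pad_with_ids_sketch(cap_feat, (cap_len, 1, 1))
    assert len(cap_pos_ids) == len(cap_out), (cap_len, len(cap_pos_ids), len(cap_out))
print("cap_pos_ids and cap_out aligned for all tested lengths")
```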
🤖 Generated with Claude Code