fix(qwen): fix ghosting artifacts in Qwen Image Edit #9155
Conversation
Qwen Image Edit was applying identical RoPE positions to the noisy and reference latent segments (both packed at the noisy latent's dimensions), so cross-attention couldn't disentangle them: reference content bled into the generation as a faintly offset ghost across the whole frame, outside the masked edit region.

The denoise now keeps reference latents at their own (H, W) and uses those dims in the reference segment of `img_shapes`, matching diffusers' `QwenImageEditPipeline` / `QwenImageEditPlusPipeline`. The reference image fed to `qwen_image_i2l` is resized to ~1024² area preserving aspect ratio (matching diffusers' `VAE_IMAGE_SIZE`) so the reference token sequence stays in the distribution the model was trained on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
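A minimal sketch of the fix's core idea, assuming hypothetical helper and variable names (the real invocation code differs): each latent segment gets its own `(frames, H/2, W/2)` entry in `img_shapes`, so RoPE assigns distinct spatial positions to noisy vs. reference tokens.

```python
def build_img_shapes(noisy_hw: tuple[int, int], ref_hw: tuple[int, int]) -> list[tuple[int, int, int]]:
    """Build per-segment (frames, H/2, W/2) tuples for the RoPE module.

    Sketch only: diffusers' Qwen edit pipelines pass one entry per latent
    segment, each at that segment's own packed dims, so spatial RoPE
    frequencies differ between the noisy and reference segments.
    """
    noisy_h, noisy_w = noisy_hw
    ref_h, ref_w = ref_hw
    return [
        (1, noisy_h // 2, noisy_w // 2),  # noisy latent segment
        (1, ref_h // 2, ref_w // 2),      # reference segment keeps its own dims
    ]

# Buggy behavior: both entries used the noisy dims, e.g.
# [(1, 64, 64), (1, 64, 64)] -> identical RoPE positions -> ghosting.
```

For example, a 128×128 noisy latent with a 96×128 reference latent should yield `[(1, 64, 64), (1, 48, 64)]` rather than two identical tuples.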
One code issue to be resolved prior to functional testing:
The frontend resizes the reference image to ~1024² area before VAE encoding, but direct API callers and older graph JSON can wire qwen_image_i2l → qwen_image_denoise without explicit width/height, sending a native-resolution reference latent into the transformer. Without the clamp, the model receives an out-of-distribution sequence length (artifacts return, VRAM spikes). Mirror diffusers' QwenImageEdit(Plus) `VAE_IMAGE_SIZE` behavior in latent space: bilinear-downscale the reference latent to `calculate_dimensions(1024², aspect_ratio)`, snapped to multiples of 32 in pixel space (= multiples of 4 in latent space, so always packable). In-budget latents pass through untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
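The suggested clamp could look roughly like this sketch, assuming PyTorch and a `(B, C, H, W)` latent at 1/8 pixel scale; `calculate_dimensions` approximates the diffusers helper, and `clamp_ref_latent` is a hypothetical name:

```python
import math

import torch
import torch.nn.functional as F


def calculate_dimensions(target_area: int, ratio: float) -> tuple[int, int]:
    # Approximates diffusers' helper: pick W x H with ~target_area and the
    # given aspect ratio (W/H), snapped to multiples of 32 in pixel space.
    width = math.sqrt(target_area * ratio)
    height = width / ratio
    return round(width / 32) * 32, round(height / 32) * 32


def clamp_ref_latent(latent: torch.Tensor, target_area: int = 1024 * 1024) -> torch.Tensor:
    """Sketch: downscale an out-of-budget reference latent in latent space.

    Pixel area is 64x the latent area (8x scale per axis); multiples of 32 px
    become multiples of 4 in latent space, so the result is always
    2x2-packable.
    """
    _, _, h, w = latent.shape
    if h * w * 64 <= target_area:  # already within the pixel-area budget
        return latent  # in-budget latents pass through untouched
    px_w, px_h = calculate_dimensions(target_area, w / h)
    return F.interpolate(latent, size=(px_h // 8, px_w // 8), mode="bilinear", align_corners=False)
```

A 256×256 reference latent (2048×2048 px, 4× over budget) would come out at 128×128, while anything at or under ~1024² pixel area is returned unchanged.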
Good catch, addressed in eaf116e: added 7 new unit tests. I didn't add a full I2L→denoise integration test because the i2l step requires loading the Qwen VAE model, which is heavy for CI. The unit tests assert the exact same invariant the integration test would: given a too-large reference latent, the dims fed to packing/…
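The invariant those unit tests pin down can be sketched like this (the helper name `_align_ref_latent_dims` comes from the test plan below, but its internals are assumed here, including the exact rounding):

```python
import math


def _align_ref_latent_dims(h: int, w: int, max_area_px: int = 1024 * 1024) -> tuple[int, int]:
    # Hypothetical stand-in for the denoise helper under test: clamp an
    # oversized reference latent's dims to a ~1024^2-pixel-area equivalent,
    # snapped to multiples of 32 in pixel space (= 4 in latent space).
    if h * w * 64 <= max_area_px:
        return h, w  # in-budget latents pass through untouched
    ratio = w / h
    px_w = round(math.sqrt(max_area_px * ratio) / 32) * 32
    px_h = round(math.sqrt(max_area_px / ratio) / 32) * 32
    return px_h // 8, px_w // 8


def test_oversized_ref_latent_is_clamped():
    h, w = _align_ref_latent_dims(256, 256)  # 2048x2048 px input
    assert (h, w) == (128, 128)              # back inside the ~1024^2 px budget
    assert h % 2 == 0 and w % 2 == 0         # always 2x2-packable


def test_in_budget_latent_passes_through():
    assert _align_ref_latent_dims(128, 96) == (128, 96)
```

The point is that the unit tests exercise the same dims-level contract the heavier integration test would, without loading the VAE.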
JPPhoto
left a comment
I was able to test the quantized edit model with and without the Lightning LoRA, and it worked with a source image whose aspect ratio differed from the output's. I think this change is good for the next RC and public testing!
Summary
- Previously, `qwen_image_denoise.py` bilinear-resized the reference latents to the noisy latent's dimensions and used identical `img_shapes` tuples for both segments. `QwenEmbedRope` varies its spatial RoPE frequencies by each segment's H/W, so identical dims gave both segments the same spatial positions and cross-attention couldn't disentangle them.
- The denoise now keeps reference latents at their own `(H, W)`, packs them at those dims, and places `(1, ref_h // 2, ref_w // 2)` in the reference segment of `img_shapes`, matching diffusers' `QwenImageEditPipeline` (`pipeline_qwenimage_edit.py:755-760`) and `QwenImageEditPlusPipeline` (`pipeline_qwenimage_edit_plus.py:743-751`).
- The reference image fed to `qwen_image_i2l` is now resized to ~1024² area preserving aspect ratio (matching diffusers' `VAE_IMAGE_SIZE`) so the reference token sequence stays in the distribution the model was trained on.
- `qwen_image_image_to_latents.py` bumps `multiple_of` from 8 to 16 so VAE-encoded latents always have even spatial dims (required by the 2×2 patch packing); a `ValueError` is raised if a directly-wired reference latent has spatial dims that align to less than 2.
Test plan
- `uv run pytest tests/app/invocations/test_qwen_image_denoise.py`: 15 pass (8 new: `_align_ref_latent_dims` validation incl. zero-dim guard, `_build_img_shapes` distinct-dims regression)
- `pnpm test:no-watch` `buildQwenImageGraph`: 30 pass (6 new: `calculateQwenImageEditRefDimensions` matches diffusers' `calculate_dimensions` for square/landscape/portrait/extreme-ratio inputs; verifies computed dims land on the ref i2l node)
- `pnpm lint:eslint`, `pnpm lint:prettier`, `pnpm lint:knip`, `pnpm lint:tsc`: clean
- `uv run ruff check` / `uv run ruff format --check` on changed files: clean
- The `multiple_of=16` i2l change is a no-op for canvas-composited paths but worth a smoke test.

🤖 Generated with Claude Code