Conversation

@JerryWu-code
Contributor

What does this PR do?

This PR introduces the Z-Image Series into the diffusers library. Z-Image is a powerful and highly efficient 6B-parameter image generation model that is friendly to consumer-grade hardware, with strong capabilities in photorealistic image generation, accurate rendering of complex text in both Chinese and English, and robust adherence to bilingual instructions. The technical report and the Z-Image-Turbo checkpoint will be released very soon.

Thanks to @yiyixuxu and @apolinario for their support.

Collaborator

@yiyixuxu left a comment

thanks so much for the PR!
I left some feedback. I think the most important change is the attention_backend: we should be able to refactor using dispatch_attention_fn so it works out of the box with both naive and flash_varlen, instead of handling both manually.
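For context, a rough sketch of the suggested refactor is below. The tensor layout and the import path are assumptions for this sketch; check diffusers.models.attention_dispatch for the exact signature.

    import torch
    from diffusers.models.attention_dispatch import dispatch_attention_fn

    # Instead of branching on the backend manually inside the attention processor,
    # a single dispatch call routes to whichever backend is active (native SDPA,
    # flash, flash_varlen, ...). The (batch, seq_len, num_heads, head_dim) layout
    # below is assumed for illustration.
    query = torch.randn(1, 16, 8, 64)
    key = torch.randn(1, 16, 8, 64)
    value = torch.randn(1, 16, 8, 64)

    out = dispatch_attention_fn(query, key, value)  # same layout on the way out
    print(out.shape)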

@yiyixuxu
Collaborator

@bot /style

@github-actions
Contributor

github-actions bot commented Nov 24, 2025

Style fix runs successfully without any file modified.

Review context (excerpted lines under discussion):

    # Hardcoded for now because pytorch does not support tuple/int type hints
    window_size = (-1, -1)
    out, lse, *_ = flash_attn_3_func(
    max_seqlen_q = q.shape[2]
Collaborator

ohh, what's the reason to use _flash_attn_forward instead here?

Contributor Author

@JerryWu-code commented Nov 25, 2025

Sorry, that may be due to a flash-attention 3 version mismatch with yours: the older version ("flash-attn-3==3.0.0b1") only returns out from the default flash_attn_func, so it was not compatible with the lse & *_ unpacking. We will fix this in the next commit.

Contributor Author

Partially fixed flash_attn3 in https://github.com/JerryWu-code/diffusers/blob/8a6cb74e7319433126ab09526288ada496a83523/src/diffusers/models/attention_dispatch.py#L651, but it is not included in this pull request, to keep the merge simple.

Review context (excerpted import lines under discussion):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from einops import rearrange
Collaborator

ohh I missed this, can we remove the einops dependency too?
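For reference, a typical rearrange call can usually be replaced with plain torch reshapes. The pattern below is illustrative only and not necessarily the exact one used in the Z-Image code.

    import torch

    batch, seq_len, num_heads, head_dim = 2, 16, 8, 64
    x = torch.randn(batch, seq_len, num_heads * head_dim)

    # einops: rearrange(x, "b s (h d) -> b h s d", h=num_heads)
    y = x.unflatten(-1, (num_heads, head_dim)).permute(0, 2, 1, 3)

    # einops: rearrange(y, "b h s d -> b s (h d)")
    x_back = y.permute(0, 2, 1, 3).reshape(batch, seq_len, -1)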

Contributor Author

Sure, we will fix that ~

Contributor Author

Fixed in 1dd8f3c.

@JerryWu-code
Contributor Author

The newest branch a74a0c4 should be ready to merge; it is already merged with the main branch now~. Please check whether there is anything further to fix 😊

@ChrisLiu6
Contributor

ChrisLiu6 commented Nov 25, 2025

@yiyixuxu
By the way, while testing the _flash_3 and _flash_varlen_3 backends, we noticed that the current implementation in attention_dispatch.py is incompatible with the latest Flash Attention 3 APIs.

The recent FA3 commit (Dao-AILab/flash-attention@203b9b3) introduced a return_attn_probs argument and changed the default behavior. The functions now return a single output tensor by default (instead of a tuple), which causes the current tuple unpacking logic in diffusers to fail:

    out, lse, *_ = flash_attn_3_varlen_func(
        q=query_packed,
        k=key_packed,
        v=value_packed,
        cu_seqlens_q=cu_seqlens_q,
        cu_seqlens_k=cu_seqlens_k,
        max_seqlen_q=max_seqlen_q,
        max_seqlen_k=max_seqlen_k,
        softmax_scale=scale,
        causal=is_causal,
    )

    out, lse, *_ = flash_attn_3_func(
        q=q,
        k=k,
        v=v,
        softmax_scale=softmax_scale,
        causal=causal,
        qv=qv,
        q_descale=q_descale,
        k_descale=k_descale,
        v_descale=v_descale,
        window_size=window_size,
        attention_chunk=attention_chunk,
        softcap=softcap,
        num_splits=num_splits,
        pack_gqa=pack_gqa,
        deterministic=deterministic,
        sm_margin=sm_margin,
    )

We have implemented a fix that handles this while maintaining backward compatibility:

JerryWu-code@de4c6f1#diff-b027e126a86a26981384b125714e0f3bd9923eaa8322f1ae5f6b53fe3e3481c2
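For readers following along, a minimal sketch of the idea is below; the helper name is hypothetical and this is not the exact code in the linked commit.

    # Normalize the FA3 return value so both the old API (a tuple starting with
    # (out, lse, ...)) and the new API (a single output tensor) unpack cleanly.
    def _normalize_fa3_output(result):
        if isinstance(result, tuple):
            return result[0], result[1]
        return result, None

    # Usage at the existing call sites would look like:
    #   out, lse = _normalize_fa3_output(flash_attn_3_func(q=q, k=k, v=v, ...))
    #   out, lse = _normalize_fa3_output(flash_attn_3_varlen_func(q=query_packed, ...))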

Should we include this fix in the current PR, or would you prefer us to open a separate PR for it?

Collaborator

@yiyixuxu left a comment

thanks!

Collaborator

let's move this folder for now!
we can add tests in a follow-up PR!

Contributor Author

Sure, we will add unit tests for all cases in the next commit ~ 😊

Contributor Author

Another PR for the unit tests is available at #12715. Thanks for reviewing, @yiyixuxu 😊 ~

@yiyixuxu
Collaborator

@ChrisLiu6
oh thanks! let's do this in a separate PR since we are eager to get this one in!

Replying to "Should we include this fix in the current PR, or would you prefer us to open a separate PR for it?":
we also need tests & docs in a follow-up PR too :)

@yiyixuxu
Collaborator

also run make fix-copies and make style

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ChrisLiu6
Contributor

Hi @yiyixuxu, ready for merge 😄

@yiyixuxu merged commit 4088e8a into huggingface:main on Nov 25, 2025 (9 of 11 checks passed).
@tin2tin

tin2tin commented Nov 26, 2025

@JerryWu-code
Super cool. I would like to test it. Is there some code example somewhere on how to use it with Diffusers?

I found it on HF.
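For anyone looking for a starting point, a minimal text-to-image sketch is below. The checkpoint id "Tongyi-MAI/Z-Image-Turbo", the step count, and the guidance value are assumptions taken from the model card rather than this thread; double-check them against the official docs.

    import torch
    from diffusers import DiffusionPipeline

    # "Tongyi-MAI/Z-Image-Turbo" is assumed here; verify the actual repo id on the Hub.
    pipe = DiffusionPipeline.from_pretrained(
        "Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16
    )
    pipe.to("cuda")

    image = pipe(
        prompt="A neon street sign that reads 'Z-Image', rainy night, photorealistic",
        num_inference_steps=8,   # turbo/distilled checkpoints are tuned for few steps
        guidance_scale=1.0,      # assumed: distilled models usually run with little or no CFG
    ).images[0]
    image.save("z_image_turbo.png")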

@tin2tin

tin2tin commented Nov 27, 2025

Another question: it doesn't seem to accept image input in the Diffusers implementation, but does the model accept image input? Someone in the ComfyUI community claims it does, but the HF Space doesn't have this feature.

So, is image input possible or not?

@asomoza
Member

asomoza commented Nov 27, 2025

@tin2tin the edit model will be released soon; you can see it in the GitHub repo. Not sure why someone is claiming that the turbo model accepts image inputs, as it is a pure t2i model.
