add Z image turbo support #1018
Conversation
Working on Vulkan. Full black image on ROCm, though. For the proj preview:

diff --git a/stable-diffusion.cpp b/stable-diffusion.cpp
index a32be49..26f74bd 100644
--- a/stable-diffusion.cpp
+++ b/stable-diffusion.cpp
@@ -1334,7 +1334,7 @@ public:
         if (sd_version_is_sd3(version)) {
             latent_rgb_proj = sd3_latent_rgb_proj;
             latent_rgb_bias = sd3_latent_rgb_bias;
-        } else if (sd_version_is_flux(version)) {
+        } else if (sd_version_is_flux(version) || sd_version_is_zimage(version)) {
             latent_rgb_proj = flux_latent_rgb_proj;
             latent_rgb_bias = flux_latent_rgb_bias;
         } else if (sd_version_is_wan(version) || sd_version_is_qwen_image(version)) {
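For readers unfamiliar with that code path: the step preview projects each latent channel to RGB with a small per-model matrix and bias, and the diff simply routes Z-Image latents through the Flux projection (presumably because Z-Image uses the same Flux-style VAE; the example command below loads ae.safetensors). Below is a minimal standalone sketch of that kind of projection; names, layout, and the value-range mapping are assumptions for illustration, not the actual stable-diffusion.cpp code.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration of a latent-to-RGB preview projection, the idea
// behind latent_rgb_proj / latent_rgb_bias above. Layout assumption: the
// latent is stored channel-major as [C][H*W]; proj has one 3-element row per
// latent channel. This is a sketch, not the code used in stable-diffusion.cpp.
std::vector<uint8_t> latent_to_rgb_preview(const std::vector<float>& latent,
                                           const std::vector<std::array<float, 3>>& proj,
                                           const std::array<float, 3>& bias,
                                           int C, int H, int W) {
    std::vector<uint8_t> rgb(static_cast<std::size_t>(H) * W * 3);
    for (int p = 0; p < H * W; ++p) {
        float acc[3] = {bias[0], bias[1], bias[2]};
        for (int c = 0; c < C; ++c) {
            const float v = latent[static_cast<std::size_t>(c) * H * W + p];
            acc[0] += v * proj[c][0];
            acc[1] += v * proj[c][1];
            acc[2] += v * proj[c][2];
        }
        for (int k = 0; k < 3; ++k) {
            // assume the projected values land roughly in [-1, 1]; map to [0, 255]
            const float x = std::clamp((acc[k] + 1.0f) * 0.5f, 0.0f, 1.0f);
            rgb[static_cast<std::size_t>(p) * 3 + k] = static_cast<uint8_t>(x * 255.0f);
        }
    }
    return rgb;
}
```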
@rmatif Once the Flux.2 branch gets merged, there will be a single llm.hpp file that should handle most LLM-based text encoders. I think it would make sense to target the Flux.2 branch instead of master to take advantage of these changes and avoid duplicated code. (I already rebased your implementation and made the switch to llm.hpp for qwen3 here: https://github.com/stduhpf/stable-diffusion.cpp/tree/z-image-turbo)
It seems to work on ROCm on my machine.
Maybe the black image on ROCm or CUDA is a text encoder issue? I must admit I haven't really tried this exact branch, because it's incompatible with the llama.cpp quants of Qwen3 I already had available on my machine; I've only tried the one rebased on the Flux.2 PR, which works consistently for me.
Oh that's good to know, because I had to redownload a non-llama.cpp quant 😄
@stduhpf Thanks! And yeah, I saw the refactoring on the Flux.2 PR. If you don't mind, could you open a PR on my branch since you already did the rebase on your side, so we can get it in here? Regarding the black image issue: I only tested on CUDA, and over ~400 inference runs I never encountered a black output on my side.
I am using https://huggingface.co/jayn7/Z-Image-Turbo-GGUF/blob/main/z_image_turbo-Q3_K_M.gguf, which is a mix of q3_k, q4_k, bf16 and f32. Edit: same issue with @stduhpf's rebase on top of flux2 (and the llama.cpp-based Qwen quant).
auto get_graph = [&]() -> struct ggml_cgraph* {
    return build_graph_patchified(x_patchified, timestep, cap_feats, height, width, H_patches, W_patches);
};
bool free_compute_buffer_immediately = !flash_attn_enabled;
Is there a reason for this line? That seems odd. I was wondering why it was spamming `stable-diffusion.cpp\ggml_extend.hpp:1674 - Z-Image compute buffer size: 2378.82 MB(VRAM)` only when not using flash attention.
It seems to run just fine when setting `free_compute_buffer_immediately` to always false.
No particular reason; this was an artifact from a debugging session where I had numerical instabilities between steps with FA, so I was trying to isolate the issue. It turned out to be an issue with my patchify/unpatchify implementation.
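For context, patchify/unpatchify here refers to the usual DiT-style step of flattening the latent into a sequence of p×p patches before the transformer and folding the output back afterwards; a mistake in that index bookkeeping would plausibly show up as the kind of step-to-step instability described above. Below is a rough sketch of the two operations on a plain CHW float buffer; the layout and names are assumptions, not the ggml-based code in this PR.

```cpp
#include <cstddef>
#include <vector>

// Rough illustration of DiT-style patchify/unpatchify on a CHW float latent
// with patch size p. The real implementation in this PR operates on ggml
// tensors; this only shows the index bookkeeping involved.
// Output layout: [Hp * Wp patches][p * p positions][C channels], row-major.
std::vector<float> patchify(const std::vector<float>& x, int C, int H, int W, int p) {
    const int Hp = H / p, Wp = W / p;  // number of patches per dimension
    std::vector<float> out(static_cast<std::size_t>(Hp) * Wp * p * p * C);
    for (int ph = 0; ph < Hp; ++ph)
        for (int pw = 0; pw < Wp; ++pw)
            for (int dy = 0; dy < p; ++dy)
                for (int dx = 0; dx < p; ++dx)
                    for (int c = 0; c < C; ++c) {
                        const std::size_t src = static_cast<std::size_t>(c) * H * W
                                              + static_cast<std::size_t>(ph * p + dy) * W + (pw * p + dx);
                        const std::size_t dst = (static_cast<std::size_t>(ph * Wp + pw) * p * p
                                              + dy * p + dx) * C + c;
                        out[dst] = x[src];
                    }
    return out;
}

// Exact inverse: fold the patch sequence back into a CHW latent. If this does
// not mirror patchify() exactly, the model's output is corrupted on every
// denoising step, which is why such bugs tend to surface as instability.
std::vector<float> unpatchify(const std::vector<float>& seq, int C, int H, int W, int p) {
    const int Hp = H / p, Wp = W / p;
    std::vector<float> out(static_cast<std::size_t>(C) * H * W);
    for (int ph = 0; ph < Hp; ++ph)
        for (int pw = 0; pw < Wp; ++pw)
            for (int dy = 0; dy < p; ++dy)
                for (int dx = 0; dx < p; ++dx)
                    for (int c = 0; c < C; ++c) {
                        const std::size_t dst = static_cast<std::size_t>(c) * H * W
                                              + static_cast<std::size_t>(ph * p + dy) * W + (pw * p + dx);
                        const std::size_t src = (static_cast<std::size_t>(ph * Wp + pw) * p * p
                                              + dy * p + dx) * C + c;
                        out[dst] = seq[src];
                    }
    return out;
}
```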
Something odd I've noticed: the model seems very bad at text, almost SDXL-level bad, but I've heard it's supposed to be decent at it, and examples online seem to have OK text rendering... I'm using the Q8_0 model; maybe the quantization is hurting its ability to render text disproportionately? Edit: nope, the bf16/f32 model shows the same behavior... Could it also be the text encoder? (I'm using Unsloth's Q6_K_XL)
OK, so at 768x768 a self-made q8_0 works, but q5_k and the quant I linked above still fail at step 3. Text-wise, the q3_k_m looks on par with q8_0 (or closer to the image I pulled the prompt from, step-2 preview anyway). The way the text is always misspelled the same way might point to a tokenization issue... a higher-quality Qwen quant somewhat helps. Sexy text benchmark (SFW): https://civitai.com/images/112002653
With my setup the text is almost always:
I still haven't tested this PR, but on ComfyUI with a q8_0 the text rendering works perfectly, so there may be some issue in the current implementation. I'm using the same text encoder quant that you're using.
That's a very good observation; it's using the Qwen2 tokenizer right now.
I think they trained the text encoder instead of using a frozen off-the-shelf qwen3_4b.
But it seems that's only part of the problem: using the right text encoder makes text rendering better, but it's still broken. @Green-Sky, are you using my fork or this exact PR?
I'm on your fork/PR right now.
Maybe I messed something up when implementing qwen3 in llm.hpp.
I will close this PR once #1020 is merged. I’ll leave it open in the meantime for testing and comparison. Thanks everyone for testing!







WIP Z image turbo support

./build/bin/sd --diffusion-model models/z_image_turbo_bf16.safetensors --vae models/ae.safetensors --qwen3 models/qwen_3_4b.safetensors -p "a red apple on a wooden table" --cfg-scale 1 --steps 8 -W 1024 -H 1024 --seed 42 --type q8_0 --diffusion-fa

Download weights:
https://huggingface.co/Comfy-Org/z_image_turbo/tree/main/split_files