
Conversation

@rmatif (Contributor) commented Nov 29, 2025

WIP: Z-Image Turbo support

 ./build/bin/sd --diffusion-model models/z_image_turbo_bf16.safetensors --vae models/ae.safetensors --qwen3 models/qwen_3_4b.safetensors -p "a red apple on a wooden table" --cfg-scale 1 --steps 8 -W 1024 -H 1024 --seed 42 --type q8_0 --diffusion-fa
[image: z_image_turbo sample output]

Download weights:
https://huggingface.co/Comfy-Org/z_image_turbo/tree/main/split_files

@wbruna (Contributor) commented Nov 29, 2025

Working on Vulkan. Full black image on ROCm, though.

For the latent RGB projection preview:

diff --git a/stable-diffusion.cpp b/stable-diffusion.cpp
index a32be49..26f74bd 100644
--- a/stable-diffusion.cpp
+++ b/stable-diffusion.cpp
@@ -1334,7 +1334,7 @@ public:
                 if (sd_version_is_sd3(version)) {
                     latent_rgb_proj = sd3_latent_rgb_proj;
                     latent_rgb_bias = sd3_latent_rgb_bias;
-                } else if (sd_version_is_flux(version)) {
+                } else if (sd_version_is_flux(version) || sd_version_is_zimage(version)) {
                     latent_rgb_proj = flux_latent_rgb_proj;
                     latent_rgb_bias = flux_latent_rgb_bias;
                 } else if (sd_version_is_wan(version) || sd_version_is_qwen_image(version)) {
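
(The patch assumes a Z-Image version predicate alongside the existing sd_version_is_* helpers. A minimal sketch, following the pattern of the others; the enum value name VERSION_ZIMAGE is an assumption, not necessarily what this PR uses:)

static inline bool sd_version_is_zimage(SDVersion version) {
    // VERSION_ZIMAGE is assumed here; check the PR's model.h for the actual name.
    return version == VERSION_ZIMAGE;
}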

@stduhpf (Contributor) commented Nov 29, 2025

@rmatif Once the Flux.2 branch gets merged, there will be a single llm.hpp file that should handle most LLM-based text encoders. I think it would make sense to target the Flux.2 branch instead of master, to take advantage of those changes and avoid duplicated code. (I already rebased your implementation and made the switch to llm.hpp for Qwen3 here: https://github.com/stduhpf/stable-diffusion.cpp/tree/z-image-turbo)

@stduhpf (Contributor) commented Nov 29, 2025

> Working on Vulkan. Full black image on ROCm, though.

It seems to work on ROCm on my machine.

@Green-Sky (Contributor) commented Nov 29, 2025

I am getting black images on CUDA, odd.

edit: recompiling with the preview patch to see if it starts off black
edit2: the first 2 steps look good, then black.

[previews: step 1/8 and step 2/8]

@stduhpf (Contributor) commented Nov 29, 2025

Maybe the black image on ROCm or CUDA is a text encoder issue? I must admit I haven't really tried this exact branch, because it's incompatible with the llama.cpp quants of Qwen3 I already had available on my machine; I've only tried the one rebased on the Flux.2 PR, which works consistently for me.

@Green-Sky (Contributor)

> Maybe the black image on ROCm or CUDA is a text encoder issue? I must admit I haven't really tried this exact branch, because it's incompatible with the llama.cpp quants of Qwen3 I already had available on my machine; I've only tried the one rebased on the Flux.2 PR, which works consistently for me.

Oh, that's good to know, because I had to redownload a non-llama.cpp quant 😄

@rmatif (Contributor, Author) commented Nov 29, 2025

> @rmatif Once the Flux.2 branch gets merged, there will be a single llm.hpp file that should handle most LLM-based text encoders. I think it would make sense to target the Flux.2 branch instead of master, to take advantage of those changes and avoid duplicated code. (I already rebased your implementation and made the switch to llm.hpp for Qwen3 here: https://github.com/stduhpf/stable-diffusion.cpp/tree/z-image-turbo)

@stduhpf Thanks! And yeah, I saw the refactoring in the Flux.2 PR. If you don't mind, could you open a PR against my branch? Since you already did the rebase on your side, we can get it in here.

Regarding the black image issue: I only tested on CUDA, and over ~400 inference runs I never encountered a black output on my side.

@stduhpf (Contributor) commented Nov 29, 2025

@rmatif rmatif#3

@Green-Sky (Contributor) commented Nov 29, 2025

> Regarding the black image issue: I only tested on CUDA, and over ~400 inference runs I never encountered a black output on my side.

I am using https://huggingface.co/jayn7/Z-Image-Turbo-GGUF/blob/main/z_image_turbo-Q3_K_M.gguf , which is a mix of q3_k, q4_k, bf16 and f32.

edit: same issue with @stduhpf's rebase on top of Flux.2 (and the llama.cpp-based Qwen quant)
edit2: quantized the model myself to q5_k; same issue
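
(For reference, self-quantizing with sd.cpp's convert mode looks roughly like this; the file names are placeholders, and q5_k is assumed to be among the supported --type values:)

./build/bin/sd -M convert -m models/z_image_turbo_bf16.safetensors -o models/z_image_turbo-q5_k.gguf --type q5_k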

auto get_graph = [&]() -> struct ggml_cgraph* {
    return build_graph_patchified(x_patchified, timestep, cap_feats, height, width, H_patches, W_patches);
};
bool free_compute_buffer_immediately = !flash_attn_enabled;
@stduhpf (Contributor) commented on this code Nov 29, 2025

Is there a reason for this line? That seems odd. I was wondering why it was spamming stable-diffusion.cpp\ggml_extend.hpp:1674 - Z-Image compute buffer size: 2378.82 MB(VRAM) only when not using flash attention.

(Contributor) replied:

It seems to run just fine when setting free_compute_buffer_immediately to always false.

@rmatif (Contributor, Author) replied:

No particular reason; this was an artifact from a debugging session where I had numerical instabilities between steps with FA, so I was trying to isolate the issue. It turned out to be a problem with my patchify/unpatchify implementation.
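
(Given that, the follow-up fix is presumably to stop tying the buffer lifetime to flash attention; a minimal sketch of the change, not necessarily the exact commit:)

-bool free_compute_buffer_immediately = !flash_attn_enabled;
+bool free_compute_buffer_immediately = false;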

@stduhpf (Contributor) commented Nov 29, 2025

Something odd I've noticed: the model seems very bad at text, almost SDXL-level bad, but I've heard it's supposed to be decent at it, and examples online seem to have OK text rendering... I'm using the Q8_0 model; maybe the quantization is disproportionately hurting its ability to render text?

Edit: nope, the bf16/f32 model shows the same behavior... Could it also be the text encoder? (I'm using Unsloth's Q6_K_XL)

@Green-Sky (Contributor) commented Nov 29, 2025

Ok, so at 768x768 a self-made q8_0 works, but q5_k and the quant I linked above still fail at step 3.
Same at 1024x1024, but somehow I can't fit q8_0 into my 8 GiB of VRAM (?).

Text-wise, the q3_k_m looks on par with q8_0 (or rather, closer to the image I pulled the prompt from; step-2 preview, anyway).

The way the text is always misspelled the same way might mean a tokenization issue... a higher-quality Qwen quant somewhat helps.

Sexy text benchmark (SFW): https://civitai.com/images/112002653

A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic


With my setup the text is almost always:

THEITY IS A
CIRCUIT BLOOD,
THAD I OM A BROKEN
TRANSISTOR.

@daniandtheweb (Contributor) commented Nov 29, 2025

> Something odd I've noticed: the model seems very bad at text, almost SDXL-level bad, but I've heard it's supposed to be decent at it, and examples online seem to have OK text rendering... I'm using the Q8_0 model; maybe the quantization is disproportionately hurting its ability to render text?
>
> Edit: nope, the bf16/f32 model shows the same behavior... Could it also be the text encoder? (I'm using Unsloth's Q6_K_XL)

I still haven't tested this PR, but on ComfyUI using a q8_0 the text rendering works perfectly, so there may be some issue in the current implementation. I'm using the same text encoder quant that you're using.

@stduhpf (Contributor) commented Nov 29, 2025

> might mean a tokenization issue.

That's a very good observation: it's using the Qwen2 tokenizer right now.
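
(One quick way to check, as a sketch: dump reference token IDs with llama.cpp's llama-tokenize tool, assuming a local llama.cpp build; the model path and prompt are placeholders. Then compare against the IDs sd.cpp produces for the same prompt:)

./build/bin/llama-tokenize -m models/Qwen3-4B-Q6_K.gguf -p "a red apple on a wooden table"

(If the ID sequences differ, the tokenizer rather than the DiT weights is the likely culprit.)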

@wbruna (Contributor) commented Nov 29, 2025

> The way the text is always misspelled the same way might mean a tokenization issue... a higher-quality Qwen quant somewhat helps.

This is with the image and text models from the ComfyUI repo, quantized to q8 with sd.cpp, original VAE, Vulkan+radv (first try):

[image: zimg_1764442461]

> It seems to work on ROCm on my machine.

Odd. Here, even the first preview is fully black already.

@stduhpf (Contributor) commented Nov 29, 2025

I think they trained the text encoder instead of using a frozen off-the-shelf qwen3_4b.

[image grid comparing outputs with four text encoders: Qwen3-4B-UD-Q6_K_XL.gguf, Qwen3-4B-Instruct-2507-UD-Q6_K_XL.gguf, Qwen_3_4b-Q6_K.gguf, qwen_3_4b.safetensors]

But it seems that's only part of the problem: using the right text encoder makes text rendering better, but it's still broken.

@Green-Sky are you using my fork or this exact PR?

@Green-Sky (Contributor)

> @Green-Sky are you using my fork or this exact PR?

I'm on your fork/pr right now.

@stduhpf (Contributor) commented Nov 29, 2025

> I'm on your fork/pr right now.

Maybe I messed something up when implementing Qwen3 in llm.hpp.

@wbruna mentioned this pull request Nov 29, 2025
@rmatif (Contributor, Author) commented Nov 30, 2025

I will close this PR once #1020 is merged. I’ll leave it open in the meantime for testing and comparison. Thanks everyone for testing!
