
Conversation

@rmatif (Contributor) commented Nov 29, 2025

WIP: Z-Image Turbo support

 ./build/bin/sd --diffusion-model models/z_image_turbo_bf16.safetensors --vae models/ae.safetensors --qwen3 models/qwen_3_4b.safetensors -p "a red apple on a wooden table" --cfg-scale 1 --steps 8 -W 1024 -H 1024 --seed 42 --type q8_0 --diffusion-fa
[image: z_image_turbo sample output]

Download weights:
https://huggingface.co/Comfy-Org/z_image_turbo/tree/main/split_files

@wbruna (Contributor) commented Nov 29, 2025

Working on Vulkan. Full black image on ROCm, though.

For the latent RGB projection preview:

diff --git a/stable-diffusion.cpp b/stable-diffusion.cpp
index a32be49..26f74bd 100644
--- a/stable-diffusion.cpp
+++ b/stable-diffusion.cpp
@@ -1334,7 +1334,7 @@ public:
                 if (sd_version_is_sd3(version)) {
                     latent_rgb_proj = sd3_latent_rgb_proj;
                     latent_rgb_bias = sd3_latent_rgb_bias;
-                } else if (sd_version_is_flux(version)) {
+                } else if (sd_version_is_flux(version) || sd_version_is_zimage(version)) {
                     latent_rgb_proj = flux_latent_rgb_proj;
                     latent_rgb_bias = flux_latent_rgb_bias;
                 } else if (sd_version_is_wan(version) || sd_version_is_qwen_image(version)) {
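
(The patch assumes a Z-Image version predicate alongside the existing sd_version_is_* helpers. A minimal sketch, following the pattern of the others; the enum value name VERSION_ZIMAGE is an assumption, not necessarily what this PR uses:)

static inline bool sd_version_is_zimage(SDVersion version) {
    // VERSION_ZIMAGE is assumed here; check the PR's model.h for the actual name.
    return version == VERSION_ZIMAGE;
}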

@stduhpf (Contributor) commented Nov 29, 2025

@rmatif Once the Flux.2 branch gets merged, there will be a single llm.hpp file that should handle most LLM-based text encoders. I think it would make sense to target the Flux.2 branch instead of master, to take advantage of those changes and avoid duplicated code. (I already rebased your implementation and made the switch to llm.hpp for Qwen3 here: https://github.com/stduhpf/stable-diffusion.cpp/tree/z-image-turbo)

@stduhpf (Contributor) commented Nov 29, 2025

> Working on Vulkan. Full black image on ROCm, though.

It seems to work on ROCm on my machine.

@Green-Sky (Contributor) commented Nov 29, 2025

I am getting black images on CUDA, odd.

edit: recompiling with the preview patch to see if it starts off black
edit2: the first 2 steps look good, then black.

[previews: step 1/8 and step 2/8]

@stduhpf (Contributor) commented Nov 29, 2025

Maybe the black image on ROCm or CUDA is a text encoder issue? I must admit I haven't really tried this exact branch, because it's incompatible with the llama.cpp quants of Qwen3 I already had available on my machine; I've only tried the one rebased on the Flux.2 PR, which works consistently for me.

@Green-Sky (Contributor)

> Maybe the black image on ROCm or CUDA is a text encoder issue? I must admit I haven't really tried this exact branch, because it's incompatible with the llama.cpp quants of Qwen3 I already had available on my machine; I've only tried the one rebased on the Flux.2 PR, which works consistently for me.

Oh, that's good to know, because I had to redownload a non-llama.cpp quant 😄

@rmatif (Contributor, Author) commented Nov 29, 2025

> @rmatif Once the Flux.2 branch gets merged, there will be a single llm.hpp file that should handle most LLM-based text encoders. I think it would make sense to target the Flux.2 branch instead of master, to take advantage of those changes and avoid duplicated code. (I already rebased your implementation and made the switch to llm.hpp for Qwen3 here: https://github.com/stduhpf/stable-diffusion.cpp/tree/z-image-turbo)

@stduhpf Thanks! And yeah, I saw the refactoring in the Flux.2 PR. If you don't mind, could you open a PR against my branch? Since you already did the rebase on your side, we can get it in here.

Regarding the black image issue: I only tested on CUDA, and over ~400 inference runs I never encountered a black output on my side.

@stduhpf (Contributor) commented Nov 29, 2025

@rmatif rmatif#3

@Green-Sky (Contributor) commented Nov 29, 2025

> Regarding the black image issue: I only tested on CUDA, and over ~400 inference runs I never encountered a black output on my side.

I am using https://huggingface.co/jayn7/Z-Image-Turbo-GGUF/blob/main/z_image_turbo-Q3_K_M.gguf , which is a mix of q3_k, q4_k, bf16 and f32.

edit: same issue with @stduhpf's rebase on top of Flux.2 (and the llama.cpp-based Qwen quant)
edit2: quantized the model myself to q5_k; same issue
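
(For reference, self-quantizing with sd.cpp's convert mode looks roughly like this; the file names are placeholders, and q5_k is assumed to be among the supported --type values:)

./build/bin/sd -M convert -m models/z_image_turbo_bf16.safetensors -o models/z_image_turbo-q5_k.gguf --type q5_k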

auto get_graph = [&]() -> struct ggml_cgraph* {
    return build_graph_patchified(x_patchified, timestep, cap_feats, height, width, H_patches, W_patches);
};
bool free_compute_buffer_immediately = !flash_attn_enabled;
@stduhpf (Contributor) commented on this code Nov 29, 2025

Is there a reason for this line? That seems odd. I was wondering why it was spamming stable-diffusion.cpp\ggml_extend.hpp:1674 - Z-Image compute buffer size: 2378.82 MB(VRAM) only when not using flash attention.

(Contributor) replied:

It seems to run just fine when setting free_compute_buffer_immediately to always false.

@rmatif (Contributor, Author) replied:

No particular reason; this was an artifact from a debugging session where I had numerical instabilities between steps with FA, so I was trying to isolate the issue. It turned out to be a problem with my patchify/unpatchify implementation.
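
(Given that, the follow-up fix is presumably to stop tying the buffer lifetime to flash attention; a minimal sketch of the change, not necessarily the exact commit:)

-bool free_compute_buffer_immediately = !flash_attn_enabled;
+bool free_compute_buffer_immediately = false;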

@stduhpf (Contributor) commented Nov 29, 2025

Something odd I've noticed: the model seems very bad at text, almost SDXL-level bad, but I've heard it's supposed to be decent at it, and examples online seem to have OK text rendering... I'm using the Q8_0 model; maybe the quantization is disproportionately hurting its ability to render text?

Edit: nope, the bf16/f32 model shows the same behavior... Could it also be the text encoder? (I'm using Unsloth's Q6_K_XL)

@Green-Sky (Contributor) commented Nov 29, 2025

Ok, so at 768x768 a self-made q8_0 works, but q5_k and the quant I linked above still fail at step 3.
Same at 1024x1024, but somehow I can't fit q8_0 into my 8 GiB of VRAM (?).

Text-wise, the q3_k_m looks on par with q8_0 (or rather, closer to the image I pulled the prompt from; step-2 preview, anyway).

The way the text is always misspelled the same way might mean a tokenization issue... a higher-quality Qwen quant somewhat helps.

Sexy text benchmark (SFW): https://civitai.com/images/112002653

A cinematic, melancholic photograph of a solitary hooded figure walking through a sprawling, rain-slicked metropolis at night. The city lights are a chaotic blur of neon orange and cool blue, reflecting on the wet asphalt. The scene evokes a sense of being a single component in a vast machine. Superimposed over the image in a sleek, modern, slightly glitched font is the philosophical quote: 'THE CITY IS A CIRCUIT BOARD, AND I AM A BROKEN TRANSISTOR.' -- moody, atmospheric, profound, dark academic


With my setup the text is almost always:

THEITY IS A
CIRCUIT BLOOD,
THAD I OM A BROKEN
TRANSISTOR.

@daniandtheweb (Contributor) commented Nov 29, 2025

> Something odd I've noticed: the model seems very bad at text, almost SDXL-level bad, but I've heard it's supposed to be decent at it, and examples online seem to have OK text rendering... I'm using the Q8_0 model; maybe the quantization is disproportionately hurting its ability to render text?
>
> Edit: nope, the bf16/f32 model shows the same behavior... Could it also be the text encoder? (I'm using Unsloth's Q6_K_XL)

I still haven't tested this PR, but on ComfyUI using a q8_0 the text rendering works perfectly, so there may be some issue in the current implementation. I'm using the same text encoder quant that you're using.

@stduhpf (Contributor) commented Nov 29, 2025

> might mean a tokenization issue.

That's a very good observation: it's using the Qwen2 tokenizer right now.
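
(One quick way to check, as a sketch: dump reference token IDs with llama.cpp's llama-tokenize tool, assuming a local llama.cpp build; the model path and prompt are placeholders. Then compare against the IDs sd.cpp produces for the same prompt:)

./build/bin/llama-tokenize -m models/Qwen3-4B-Q6_K.gguf -p "a red apple on a wooden table"

(If the ID sequences differ, the tokenizer rather than the DiT weights is the likely culprit.)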

@wbruna (Contributor) commented Nov 29, 2025

> The way the text is always misspelled the same way might mean a tokenization issue... a higher-quality Qwen quant somewhat helps.

This is with the image and text models from the ComfyUI repo, quantized to q8 with sd.cpp, original VAE, Vulkan+radv (first try):

[image: zimg_1764442461]

> It seems to work on ROCm on my machine.

Odd. Here, even the first preview is fully black already.

@stduhpf (Contributor) commented Nov 29, 2025

I think they trained the text encoder instead of using a frozen off-the-shelf qwen3_4b.

[image grid comparing outputs with four text encoders: Qwen3-4B-UD-Q6_K_XL.gguf, Qwen3-4B-Instruct-2507-UD-Q6_K_XL.gguf, Qwen_3_4b-Q6_K.gguf, qwen_3_4b.safetensors]

But it seems that's only part of the problem: using the right text encoder makes text rendering better, but it's still broken.

@Green-Sky are you using my fork or this exact PR?

@Green-Sky (Contributor)

> @Green-Sky are you using my fork or this exact PR?

I'm on your fork/pr right now.

@stduhpf (Contributor) commented Nov 29, 2025

> I'm on your fork/pr right now.

Maybe I messed something up when implementing Qwen3 in llm.hpp.

@wbruna mentioned this pull request Nov 29, 2025
@rmatif (Contributor, Author) commented Nov 30, 2025

I will close this PR once #1020 is merged. I’ll leave it open in the meantime for testing and comparison. Thanks everyone for testing!
