Git commit
N/A
Operating System & Version
Windows 11
GGML backends
CUDA
Command-line arguments used
sd-cli --diffusion-model "Z:\gguf_models\Image\Qwen-Image\qwen-image-2512-Q4_K_M.gguf" --vae "Z:\gguf_models\Image\Qwen-Image\qwen_image_vae.safetensors" --llm "Z:\gguf_models\Image\Qwen-Image\Qwen2.5-VL-7B-Instruct-UD-Q4_K_XL.gguf" -v --diffusion-fa -W 720 -H 1024 --seed 42 --steps 8 --cfg-scale 1 --sampling-method euler --backend all=cuda0 --params-backend diffusion=cpu,te=cpu,vae=cpu --mmap --lora-model-dir "Z:\gguf_models\Image\Qwen-Image\LoRA" -p "a pack of pikachus in a lush forest<lora:Qwen-Image-2512-Lightning-8steps-V1.0-bf16:1>"
Steps to reproduce
- Load a quantized Qwen‑Image GGUF model (Q4_K_M, etc.).
- Load a Lightning LoRA via --lora-model-dir.
- Enable CPU parameter offloading with --offload-to-cpu or --params-backend diffusion=cpu,te=cpu,vae=cpu.
- Start a generation request.
What you expected to happen
An image is generated without errors.
What actually happened
When applying a Lightning LoRA to a quantized Qwen‑Image GGUF base model, the system crashes at the first sampling step only when CPU parameter offloading is enabled (--offload-to-cpu or --params-backend diffusion=cpu).
The crash consistently reports:
ggml_cuda_compute_forward: CONCAT failed
CUDA error: an illegal memory access was encountered
This occurs in both sd-cli and sd-server.
It seems that the backend expects all tensors to be on one device but receives a mix of CPU and GPU tensors instead. The LoRA appears to be applied on the GPU while the diffusion parameters are kept on the CPU, so the model's data ends up split across two devices.
If I force --lora-apply-mode immediately, the system catches the mismatch earlier and reports a type‑combination error instead of crashing: unsupported type combination (f32 to q6_K)
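To make the suspected mismatch concrete, here is a minimal toy sketch in Python. It is not stable-diffusion.cpp or ggml code: the `Tensor` class, the tensor names, and the `stray_inputs` / `check_op` helpers are all hypothetical, and it only models the guess that an op such as CONCAT receives inputs living on different devices.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tensor:
    """Hypothetical stand-in for a tensor: only a name and a device label."""
    name: str
    device: str  # e.g. "cpu" or "cuda0"


def stray_inputs(inputs: List[Tensor], compute_device: str) -> List[str]:
    """Return the names of inputs that do not live on compute_device."""
    return [t.name for t in inputs if t.device != compute_device]


def check_op(op_name: str, inputs: List[Tensor], compute_device: str) -> None:
    """Raise a readable error before the op runs if devices are mixed."""
    stray = stray_inputs(inputs, compute_device)
    if stray:
        raise RuntimeError(
            f"{op_name}: inputs not on {compute_device}: {', '.join(stray)}"
        )


if __name__ == "__main__":
    # Assumed scenario: base weight left on the CPU, LoRA tensor on the GPU.
    base_weight = Tensor("base_weight", "cpu")
    lora_delta = Tensor("lora_delta", "cuda0")
    try:
        check_op("CONCAT", [base_weight, lora_delta], "cuda0")
    except RuntimeError as err:
        print(err)
```

If something like this is what happens internally, a check of this kind before graph execution would turn the illegal memory access into a readable error, similar to what --lora-apply-mode immediately already does for the type mismatch.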
Logs / error messages / stack trace
CLI (sd-cli)
[DEBUG] ggml_extend.hpp:1907 - qwen2.5vl compute buffer size: 9.10 MB(VRAM)
[INFO ] ggml_extend.hpp:2147 - qwen2.5vl offload params (5918.09 MB, 338 tensors) to runtime backend (CUDA0), taking 3.25s
[DEBUG] conditioner.hpp:2030 - computing condition graph completed, taking 3463 ms
[INFO ] stable-diffusion.cpp:3788 - get_learned_condition completed, taking 3.46s
[INFO ] stable-diffusion.cpp:4021 - generating image: 1/1 - seed 42
[DEBUG] ggml_extend.hpp:1907 - qwen_image compute buffer size: 444.05 MB(VRAM)
[INFO ] ggml_extend.hpp:2147 - qwen_image offload params (12631.04 MB, 1933 tensors) to runtime backend (CUDA0), taking 6.17s
[ERROR] ggml_extend.hpp:69 - ggml_cuda_compute_forward: CONCAT failed
[ERROR] ggml_extend.hpp:69 - CUDA error: an illegal memory access was encountered
[ERROR] ggml_extend.hpp:69 - current device: 0, in function ggml_cuda_compute_forward at C:\gitrepo\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:3114
[ERROR] ggml_extend.hpp:69 - err
C:\gitrepo\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:102: CUDA error
Server (sd-server)
[DEBUG] ggml_extend.hpp:1907 - qwen2.5vl compute buffer size: 9.10 MB(VRAM)
[INFO ] ggml_extend.hpp:2147 - qwen2.5vl offload params (5918.09 MB, 338 tensors) to runtime backend (CUDA0), taking 1.55s
[DEBUG] conditioner.hpp:2030 - computing condition graph completed, taking 1767 ms
[INFO ] stable-diffusion.cpp:3788 - get_learned_condition completed, taking 1.77s
[INFO ] stable-diffusion.cpp:4021 - generating image: 1/1 - seed 42
[DEBUG] ggml_extend.hpp:1907 - qwen_image compute buffer size: 444.05 MB(VRAM)
[INFO ] ggml_extend.hpp:2147 - qwen_image offload params (12631.04 MB, 1933 tensors) to runtime backend (CUDA0), taking 4.48s
[ERROR] ggml_extend.hpp:69 - ggml_cuda_compute_forward: CONCAT failed
[ERROR] ggml_extend.hpp:69 - CUDA error: an illegal memory access was encountered
[ERROR] ggml_extend.hpp:69 - current device: 0, in function ggml_cuda_compute_forward at C:\gitrepo\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:3114
[ERROR] ggml_extend.hpp:69 - err
C:\gitrepo\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:102: CUDA error
Additional context / environment details
System Configuration
- GPU: NVIDIA GeForce RTX 4070 Ti SUPER (16 GB VRAM)
- Build: master‑633‑5b0267e (commit 5b0267e)
- Core Models: Qwen-Image-2512 (Q4_K_M GGUF base + Qwen2.5-VL-7B Text Encoder)
- LoRA Model: LightX2V-Qwen-Image-Lightning