Description
Hi everybody.
I tried sd.exe on my new personal computer.
I haven't installed the CUDA drivers yet, but I assume everything is fine as long as I copy all the DLLs from cudart-sd-bin-win-cu12-x64.zip into the sd directory.
In fact, I tried Stable Diffusion 1.4 and it worked well.
Then I tried Flux.1 Dev and got this error:
mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
I don't understand why it works with one model and not with another (it works with sd1.4 and sd2.1, but not with sd3medium).
Thanks a lot for your help!
Here is the log:
C:\Users\obrau\Desktop\sd-master-9578fdc-bin-win-avx2-x64\cuda2>"sd.exe" --diffusion-model "..\Flux.1 Q4 F16\Dev\flux1-dev-q4_0.gguf" --clip_l "..\Flux.1 Q4 F16\clip_l.safetensors" --vae "..\Flux.1 Q4 F16\ae.safetensors" --t5xxl "..\Flux.1 Q4 F16\t5xxl_fp16.safetensors" -p "a cute cat" --sampling-method euler --steps 10 -W 512 -H 512 -s 42 -t 16
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:202 - loading clip_l from '..\Flux.1 Q4 F16\clip_l.safetensors'
[INFO ] model.cpp:888 - load ..\Flux.1 Q4 F16\clip_l.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:216 - loading t5xxl from '..\Flux.1 Q4 F16\t5xxl_fp16.safetensors'
[INFO ] model.cpp:888 - load ..\Flux.1 Q4 F16\t5xxl_fp16.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:223 - loading diffusion model from '..\Flux.1 Q4 F16\Dev\flux1-dev-q4_0.gguf'
[INFO ] model.cpp:885 - load ..\Flux.1 Q4 F16\Dev\flux1-dev-q4_0.gguf using gguf format
[INFO ] stable-diffusion.cpp:230 - loading vae from '..\Flux.1 Q4 F16\ae.safetensors'
[INFO ] model.cpp:888 - load ..\Flux.1 Q4 F16\ae.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:242 - Version: Flux
[INFO ] stable-diffusion.cpp:275 - Weight type: f16
[INFO ] stable-diffusion.cpp:276 - Conditioner weight type: f16
[INFO ] stable-diffusion.cpp:277 - Diffusion model weight type: q4_0
[INFO ] stable-diffusion.cpp:278 - VAE weight type: f32
[INFO ] stable-diffusion.cpp:319 - set clip_on_cpu to true
[INFO ] stable-diffusion.cpp:322 - CLIP: Using CPU backend
[INFO ] flux.hpp:889 - Flux blocks: 19 double, 38 single
|==============> | 413/1440 - 0.00it/s
[INFO ] model.cpp:1868 - unknown tensor 'text_encoders.t5xxl.transformer.encoder.embed_tokens.weight | f16 | 2 [4096, 32128, 1, 1, 1]' in model file
|==============================================> | 1334/1440 - 52.63it/s
[INFO ] stable-diffusion.cpp:516 - total params memory size = 16018.05MB (VRAM 6699.22MB, RAM 9318.83MB): clip 9318.83MB(RAM), unet 6604.64MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:520 - loading model from '' completed, taking 16.90s
[INFO ] stable-diffusion.cpp:537 - running in Flux FLOW mode
[INFO ] stable-diffusion.cpp:682 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1235 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1368 - get_learned_condition completed, taking 14774 ms
[INFO ] stable-diffusion.cpp:1391 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1428 - generating image: 1/1 - seed 42
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\template-instances../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
....
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\template-instances../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
ggml_cuda_compute_forward: CONT failed
CUDA error: unspecified launch failure
current device: 0, in function ggml_cuda_compute_forward at D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2174
err
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:70: CUDA error
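One thing that stands out in the failing log is that the device reports compute capability 8.9, while the error mentions arch 520. Under CUDA's `__CUDA_ARCH__` numbering (major * 100 + minor * 10), 520 corresponds to compute capability 5.2. The sketch below only extracts and compares those two values from the log lines; it does not talk to the GPU, and it does not prove that the mismatch is the cause of the failure. The function names are hypothetical helpers, and the comma-separated handling of the "compiled for" list is an assumption about how several architectures would be printed.

```python
import re

# Two lines copied from the failing log above.
DEVICE_LINE = "Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes"
ERROR_LINE = ("mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code "
              "compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520")


def device_capability(line):
    """Return (major, minor) from a 'compute capability X.Y' log line, or None."""
    m = re.search(r"compute capability (\d+)\.(\d+)", line)
    return (int(m.group(1)), int(m.group(2))) if m else None


def arch_to_capability(arch):
    """Convert an arch number such as 520 to (5, 2).

    Assumes the __CUDA_ARCH__ convention: arch = major * 100 + minor * 10.
    """
    return arch // 100, (arch % 100) // 10


def compiled_archs(line):
    """Return the arch numbers listed after 'was compiled for:' (assumed format)."""
    m = re.search(r"was compiled for:\s*([\d,\s]+)", line)
    return [int(a) for a in re.findall(r"\d+", m.group(1))] if m else []


def summarize(device_line, error_line):
    """Compare the device capability with the capabilities the binary lists."""
    cap = device_capability(device_line)
    built = [arch_to_capability(a) for a in compiled_archs(error_line)]
    return {"device": cap, "compiled_for": built, "exact_match": cap in built}


if __name__ == "__main__":
    print(summarize(DEVICE_LINE, ERROR_LINE))
```

For these two lines the summary reports a device capability of (8, 9), a compiled list containing only (5, 2), and no exact match, which is consistent with the arch 520 wording of the error.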
With Stable Diffusion 1.4, which worked well, here is the log:
C:\Users\obrau\Desktop\sd-master-9578fdc-bin-win-avx2-x64>"cuda\sd.exe" -m "StableDiffusion 1.4 F32\sd-v1-4.ckpt" -p "a cute cat" --sampling-method euler --steps 10 -W 512 -H 512 -s 42 -t 16
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:191 - loading model from 'StableDiffusion 1.4 F32\sd-v1-4.ckpt'
[INFO ] model.cpp:891 - load StableDiffusion 1.4 F32\sd-v1-4.ckpt using checkpoint format
ZIP 0, name = archive/data.pkl, dir = archive/
[INFO ] stable-diffusion.cpp:238 - Version: SD 1.x
[INFO ] stable-diffusion.cpp:271 - Weight type: f32
[INFO ] stable-diffusion.cpp:272 - Conditioner weight type: f32
[INFO ] stable-diffusion.cpp:273 - Diffusion model weight type: f32
[INFO ] stable-diffusion.cpp:274 - VAE weight type: f32
[INFO ] stable-diffusion.cpp:512 - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:516 - loading model from 'StableDiffusion 1.4 F32\sd-v1-4.ckpt' completed, taking 10.77s
[INFO ] stable-diffusion.cpp:546 - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:673 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1199 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1332 - get_learned_condition completed, taking 226 ms
[INFO ] stable-diffusion.cpp:1355 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1359 - generating image: 1/1 - seed 42
|==================================================| 10/10 - 5.70it/s
[INFO ] stable-diffusion.cpp:1395 - sampling completed, taking 1.89s
[INFO ] stable-diffusion.cpp:1403 - generating 1 latent images completed, taking 1.91s
[INFO ] stable-diffusion.cpp:1406 - decoding 1 latents
[INFO ] stable-diffusion.cpp:1416 - latent 1 decoded, taking 0.37s
[INFO ] stable-diffusion.cpp:1420 - decode_first_stage completed, taking 0.37s
[INFO ] stable-diffusion.cpp:1539 - txt2img completed in 2.51s
save result image to 'output.png'
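To narrow down why one model runs and the other does not, it may help to compare the settings the two logs report. The sketch below pulls the labelled fields out of short excerpts of both logs and lists the ones that differ. The diffusion model weight type is q4_0 in the failing Flux run and f32 in the working SD 1.4 run, and the failing kernel is named mul_mat_q, so one guess (not confirmed anywhere in this report) is that only quantized weights reach that kernel. `log_field` and `differing_fields` are hypothetical helper names.

```python
import re

# Excerpts copied from the two logs above (failing Flux run, working SD 1.4 run).
FLUX_LOG = """\
[INFO ] stable-diffusion.cpp:242 - Version: Flux
[INFO ] stable-diffusion.cpp:275 - Weight type: f16
[INFO ] stable-diffusion.cpp:276 - Conditioner weight type: f16
[INFO ] stable-diffusion.cpp:277 - Diffusion model weight type: q4_0
[INFO ] stable-diffusion.cpp:278 - VAE weight type: f32
"""
SD14_LOG = """\
[INFO ] stable-diffusion.cpp:238 - Version: SD 1.x
[INFO ] stable-diffusion.cpp:271 - Weight type: f32
[INFO ] stable-diffusion.cpp:272 - Conditioner weight type: f32
[INFO ] stable-diffusion.cpp:273 - Diffusion model weight type: f32
[INFO ] stable-diffusion.cpp:274 - VAE weight type: f32
"""

LABELS = [
    "Version",
    "Weight type",
    "Conditioner weight type",
    "Diffusion model weight type",
    "VAE weight type",
]


def log_field(log, label):
    """Return the value printed after '- <label>:' in the log, or None if absent."""
    m = re.search(rf"- {re.escape(label)}: *(.+)", log)
    return m.group(1).strip() if m else None


def differing_fields(log_a, log_b, labels):
    """Map each label whose value differs between the logs to (value_a, value_b)."""
    diffs = {}
    for label in labels:
        a, b = log_field(log_a, label), log_field(log_b, label)
        if a != b:
            diffs[label] = (a, b)
    return diffs


if __name__ == "__main__":
    for label, (flux, sd14) in differing_fields(FLUX_LOG, SD14_LOG, LABELS).items():
        print(f"{label}: Flux={flux}, SD 1.4={sd14}")
```

Running the same comparison on a log from the sd3medium run, which also fails, would show whether it shares the quantized weight type with the Flux run.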