[SOLVED] CUDA error: ggml-cuda.cu was compiled for: 520

Hi everybody,
I tried sd.exe on my new personal computer.
I haven't installed the CUDA drivers yet, but I assumed everything would be fine once I copied all the DLLs from **cudart-sd-bin-win-cu12-x64.zip** into the sd directory.
In fact, I tried Stable Diffusion 1.4 and it worked well.
Then I tried Flux.1 Dev and got this error:
`mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520`
I don't understand why it works with one model and not with another (it works with sd1.4 and sd2.1, but not with sd3medium).
Thanks a lot for your help!

Here is the log of the failing Flux run:

> C:\Users\obrau\Desktop\sd-master-9578fdc-bin-win-avx2-x64\cuda2>"sd.exe" --diffusion-model "..\Flux.1 Q4 F16\Dev\flux1-dev-q4_0.gguf" --clip_l "..\Flux.1 Q4 F16\clip_l.safetensors" --vae "..\Flux.1 Q4 F16\ae.safetensors" --t5xxl "..\Flux.1 Q4 F16\t5xxl_fp16.safetensors" -p "a cute cat" --sampling-method euler --steps 10 -W 512 -H 512 -s 42 -t 16
> ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
> ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
> ggml_cuda_init: found 1 CUDA devices:
>   Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
> [INFO ] stable-diffusion.cpp:202  - loading clip_l from '..\Flux.1 Q4 F16\clip_l.safetensors'
> [INFO ] model.cpp:888  - load ..\Flux.1 Q4 F16\clip_l.safetensors using safetensors format
> [INFO ] stable-diffusion.cpp:216  - loading t5xxl from '..\Flux.1 Q4 F16\t5xxl_fp16.safetensors'
> [INFO ] model.cpp:888  - load ..\Flux.1 Q4 F16\t5xxl_fp16.safetensors using safetensors format
> [INFO ] stable-diffusion.cpp:223  - loading diffusion model from '..\Flux.1 Q4 F16\Dev\flux1-dev-q4_0.gguf'
> [INFO ] model.cpp:885  - load ..\Flux.1 Q4 F16\Dev\flux1-dev-q4_0.gguf using gguf format
> [INFO ] stable-diffusion.cpp:230  - loading vae from '..\Flux.1 Q4 F16\ae.safetensors'
> [INFO ] model.cpp:888  - load ..\Flux.1 Q4 F16\ae.safetensors using safetensors format
> [INFO ] stable-diffusion.cpp:242  - Version: Flux
> [INFO ] stable-diffusion.cpp:275  - Weight type:                 f16
> [INFO ] stable-diffusion.cpp:276  - Conditioner weight type:     f16
> [INFO ] stable-diffusion.cpp:277  - Diffusion model weight type: q4_0
> [INFO ] stable-diffusion.cpp:278  - VAE weight type:             f32
> [INFO ] stable-diffusion.cpp:319  - set clip_on_cpu to true
> [INFO ] stable-diffusion.cpp:322  - CLIP: Using CPU backend
> [INFO ] flux.hpp:889  - Flux blocks: 19 double, 38 single
>   |==============>                                   | 413/1440 - 0.00it/s[INFO ] model.cpp:1868 - unknown tensor 'text_encoders.t5xxl.transformer.encoder.embed_tokens.weight | f16 | 2 [4096, 32128, 1, 1, 1]' in model file
>   |==============================================>   | 1334/1440 - 52.63it/s[INFO ] stable-diffusion.cpp:516  - total params memory size = 16018.05MB (VRAM 6699.22MB, RAM 9318.83MB): clip 9318.83MB(RAM), unet 6604.64MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
> [INFO ] stable-diffusion.cpp:520  - loading model from '' completed, taking 16.90s
> [INFO ] stable-diffusion.cpp:537  - running in Flux FLOW mode
> [INFO ] stable-diffusion.cpp:682  - Attempting to apply 0 LoRAs
> [INFO ] stable-diffusion.cpp:1235 - apply_loras completed, taking 0.00s
> [INFO ] stable-diffusion.cpp:1368 - get_learned_condition completed, taking 14774 ms
> [INFO ] stable-diffusion.cpp:1391 - sampling using Euler method
> [INFO ] stable-diffusion.cpp:1428 - generating image: 1/1 - seed 42
> D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\template-instances\../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
> ....
> D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\template-instances\../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
> ggml_cuda_compute_forward: CONT failed
> CUDA error: unspecified launch failure
>   current device: 0, in function ggml_cuda_compute_forward at D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2174
>   err
> D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:70: CUDA error


With Stable Diffusion 1.4, which worked well, here is the log:

> C:\Users\obrau\Desktop\sd-master-9578fdc-bin-win-avx2-x64>"cuda\sd.exe" -m "StableDiffusion 1.4 F32\sd-v1-4.ckpt" -p "a cute cat" --sampling-method euler --steps 10 -W 512 -H 512 -s 42 -t 16
> ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
> ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
> ggml_cuda_init: found 1 CUDA devices:
>   Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
> [INFO ] stable-diffusion.cpp:191  - loading model from 'StableDiffusion 1.4 F32\sd-v1-4.ckpt'
> [INFO ] model.cpp:891  - load StableDiffusion 1.4 F32\sd-v1-4.ckpt using checkpoint format
> ZIP 0, name = archive/data.pkl, dir = archive/
> [INFO ] stable-diffusion.cpp:238  - Version: SD 1.x
> [INFO ] stable-diffusion.cpp:271  - Weight type:                 f32
> [INFO ] stable-diffusion.cpp:272  - Conditioner weight type:     f32
> [INFO ] stable-diffusion.cpp:273  - Diffusion model weight type: f32
> [INFO ] stable-diffusion.cpp:274  - VAE weight type:             f32
> [INFO ] stable-diffusion.cpp:512  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
> [INFO ] stable-diffusion.cpp:516  - loading model from 'StableDiffusion 1.4 F32\sd-v1-4.ckpt' completed, taking 10.77s
> [INFO ] stable-diffusion.cpp:546  - running in eps-prediction mode
> [INFO ] stable-diffusion.cpp:673  - Attempting to apply 0 LoRAs
> [INFO ] stable-diffusion.cpp:1199 - apply_loras completed, taking 0.00s
> [INFO ] stable-diffusion.cpp:1332 - get_learned_condition completed, taking 226 ms
> [INFO ] stable-diffusion.cpp:1355 - sampling using Euler method
> [INFO ] stable-diffusion.cpp:1359 - generating image: 1/1 - seed 42
>   |==================================================| 10/10 - 5.70it/s
> [INFO ] stable-diffusion.cpp:1395 - sampling completed, taking 1.89s
> [INFO ] stable-diffusion.cpp:1403 - generating 1 latent images completed, taking 1.91s
> [INFO ] stable-diffusion.cpp:1406 - decoding 1 latents
> [INFO ] stable-diffusion.cpp:1416 - latent 1 decoded, taking 0.37s
> [INFO ] stable-diffusion.cpp:1420 - decode_first_stage completed, taking 0.37s
> [INFO ] stable-diffusion.cpp:1539 - txt2img completed in 2.51s
> save result image to 'output.png'

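To compare the two runs side by side, here is a minimal sketch that pulls three values out of a log: the GPU's compute capability, the diffusion model weight type, and the arch number from the "was compiled for" error line. `summarize_log` is a hypothetical helper written for this post, not part of stable-diffusion.cpp, and it assumes the log format shown above. It also assumes CUDA's usual numbering, where compute capability 8.9 corresponds to arch 890 and 5.2 to arch 520.

```python
import re

# Patterns match the log lines quoted in this issue (assumed format).
CC_RE = re.compile(r"compute capability (\d+)\.(\d+)")
WEIGHT_RE = re.compile(r"Diffusion model weight type:\s*(\S+)")
COMPILED_RE = re.compile(r"was compiled for: (\d+)")


def summarize_log(log_text: str) -> dict:
    """Hypothetical helper: extract the values that differ between the runs."""
    summary = {"device_arch": None, "weight_type": None, "compiled_arch": None}

    cc = CC_RE.search(log_text)
    if cc:
        # Compute capability X.Y maps to arch number X*100 + Y*10.
        summary["device_arch"] = int(cc.group(1)) * 100 + int(cc.group(2)) * 10

    weight = WEIGHT_RE.search(log_text)
    if weight:
        summary["weight_type"] = weight.group(1)

    compiled = COMPILED_RE.search(log_text)
    if compiled:
        summary["compiled_arch"] = int(compiled.group(1))

    return summary


if __name__ == "__main__":
    # Excerpts copied from the two logs above.
    flux_log = (
        "Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes\n"
        "[INFO ] stable-diffusion.cpp:277  - Diffusion model weight type: q4_0\n"
        "ERROR: CUDA kernel mul_mat_q has no device code compatible with "
        "CUDA arch 520. ggml-cuda.cu was compiled for: 520\n"
    )
    sd14_log = (
        "Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes\n"
        "[INFO ] stable-diffusion.cpp:273  - Diffusion model weight type: f32\n"
    )
    print(summarize_log(flux_log))
    print(summarize_log(sd14_log))
```

On the excerpts above, both runs report the same device (arch 890). Only the Flux run uses a quantized `q4_0` diffusion model, and only that run reports code compiled for arch 520. That suggests the failure may be tied to the quantized `mul_mat_q` path in this particular build rather than to the GPU itself, though the logs alone do not confirm it.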
