[SOLVED] CUDA error : ggml-cuda.cu was compiled for: 520 #547

@olivbrau

Hi everybody.
I've tried sd.exe on my new personal computer.
I haven't installed the CUDA drivers yet, but I assumed everything would be OK once I copied all the DLLs from cudart-sd-bin-win-cu12-x64.zip into the sd directory.
Indeed, I tried Stable Diffusion 1.4 and it worked well.
Then I tried Flux.1 Dev and got this error:
mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
I don't understand why it works with one model and not with another (it works with sd1.4 and sd2.1, but not with sd3medium).
Thanks a lot for your help!

Here is the log of the failing Flux.1 Dev run:

C:\Users\obrau\Desktop\sd-master-9578fdc-bin-win-avx2-x64\cuda2>"sd.exe" --diffusion-model "..\Flux.1 Q4 F16\Dev\flux1-dev-q4_0.gguf" --clip_l "..\Flux.1 Q4 F16\clip_l.safetensors" --vae "..\Flux.1 Q4 F16\ae.safetensors" --t5xxl "..\Flux.1 Q4 F16\t5xxl_fp16.safetensors" -p "a cute cat" --sampling-method euler --steps 10 -W 512 -H 512 -s 42 -t 16
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:202 - loading clip_l from '..\Flux.1 Q4 F16\clip_l.safetensors'
[INFO ] model.cpp:888 - load ..\Flux.1 Q4 F16\clip_l.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:216 - loading t5xxl from '..\Flux.1 Q4 F16\t5xxl_fp16.safetensors'
[INFO ] model.cpp:888 - load ..\Flux.1 Q4 F16\t5xxl_fp16.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:223 - loading diffusion model from '..\Flux.1 Q4 F16\Dev\flux1-dev-q4_0.gguf'
[INFO ] model.cpp:885 - load ..\Flux.1 Q4 F16\Dev\flux1-dev-q4_0.gguf using gguf format
[INFO ] stable-diffusion.cpp:230 - loading vae from '..\Flux.1 Q4 F16\ae.safetensors'
[INFO ] model.cpp:888 - load ..\Flux.1 Q4 F16\ae.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:242 - Version: Flux
[INFO ] stable-diffusion.cpp:275 - Weight type: f16
[INFO ] stable-diffusion.cpp:276 - Conditioner weight type: f16
[INFO ] stable-diffusion.cpp:277 - Diffusion model weight type: q4_0
[INFO ] stable-diffusion.cpp:278 - VAE weight type: f32
[INFO ] stable-diffusion.cpp:319 - set clip_on_cpu to true
[INFO ] stable-diffusion.cpp:322 - CLIP: Using CPU backend
[INFO ] flux.hpp:889 - Flux blocks: 19 double, 38 single
|==============> | 413/1440 - 0.00it/s[INFO ] model.cpp:1868 - unknown tensor 'text_encoders.t5xxl.transformer.encoder.embed_tokens.weight | f16 | 2 [4096, 32128, 1, 1, 1]' in model file
|==============================================> | 1334/1440 - 52.63it/s[INFO ] stable-diffusion.cpp:516 - total params memory size = 16018.05MB (VRAM 6699.22MB, RAM 9318.83MB): clip 9318.83MB(RAM), unet 6604.64MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:520 - loading model from '' completed, taking 16.90s
[INFO ] stable-diffusion.cpp:537 - running in Flux FLOW mode
[INFO ] stable-diffusion.cpp:682 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1235 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1368 - get_learned_condition completed, taking 14774 ms
[INFO ] stable-diffusion.cpp:1391 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1428 - generating image: 1/1 - seed 42
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\template-instances../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
....
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\template-instances../mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
ggml_cuda_compute_forward: CONT failed
CUDA error: unspecified launch failure
current device: 0, in function ggml_cuda_compute_forward at D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2174
err
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:70: CUDA error

For Stable Diffusion 1.4, which worked well, here is the log:

C:\Users\obrau\Desktop\sd-master-9578fdc-bin-win-avx2-x64>"cuda\sd.exe" -m "StableDiffusion 1.4 F32\sd-v1-4.ckpt" -p "a cute cat" --sampling-method euler --steps 10 -W 512 -H 512 -s 42 -t 16
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:191 - loading model from 'StableDiffusion 1.4 F32\sd-v1-4.ckpt'
[INFO ] model.cpp:891 - load StableDiffusion 1.4 F32\sd-v1-4.ckpt using checkpoint format
ZIP 0, name = archive/data.pkl, dir = archive/
[INFO ] stable-diffusion.cpp:238 - Version: SD 1.x
[INFO ] stable-diffusion.cpp:271 - Weight type: f32
[INFO ] stable-diffusion.cpp:272 - Conditioner weight type: f32
[INFO ] stable-diffusion.cpp:273 - Diffusion model weight type: f32
[INFO ] stable-diffusion.cpp:274 - VAE weight type: f32
[INFO ] stable-diffusion.cpp:512 - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:516 - loading model from 'StableDiffusion 1.4 F32\sd-v1-4.ckpt' completed, taking 10.77s
[INFO ] stable-diffusion.cpp:546 - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:673 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1199 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1332 - get_learned_condition completed, taking 226 ms
[INFO ] stable-diffusion.cpp:1355 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1359 - generating image: 1/1 - seed 42
|==================================================| 10/10 - 5.70it/s
[INFO ] stable-diffusion.cpp:1395 - sampling completed, taking 1.89s
[INFO ] stable-diffusion.cpp:1403 - generating 1 latent images completed, taking 1.91s
[INFO ] stable-diffusion.cpp:1406 - decoding 1 latents
[INFO ] stable-diffusion.cpp:1416 - latent 1 decoded, taking 0.37s
[INFO ] stable-diffusion.cpp:1420 - decode_first_stage completed, taking 0.37s
[INFO ] stable-diffusion.cpp:1539 - txt2img completed in 2.51s
save result image to 'output.png'
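
To compare the two runs side by side, the relevant fields can be pulled out of each log. The sketch below is only illustrative: `summarize_log` is a hypothetical helper (not part of sd.exe), and the mapping from compute capability 8.9 to arch 890 (100 × major + 10 × minor) is an assumption about how the number in the error message is encoded. With the lines quoted from the logs above, it suggests that the failing run combines q4_0 weights with a binary that reports only arch 520 on an 8.9 device, while the working run uses f32 weights and never prints the mul_mat_q error.

```python
import re

def summarize_log(log_text):
    """Extract the device arch, diffusion weight type, and the architectures
    named in the 'was compiled for' error from an sd.exe log (hypothetical helper)."""
    cc = re.search(r"compute capability (\d+)\.(\d+)", log_text)
    weight = re.search(r"Diffusion model weight type: (\S+)", log_text)
    compiled = re.search(r"was compiled for: ([\d, ]+)", log_text)

    device_arch = None
    if cc:
        # Assumption: arch number = 100 * major + 10 * minor (8.9 -> 890).
        device_arch = 100 * int(cc.group(1)) + 10 * int(cc.group(2))

    compiled_archs = []
    if compiled:
        compiled_archs = [int(a) for a in re.findall(r"\d+", compiled.group(1))]

    return {
        "device_arch": device_arch,
        "weight_type": weight.group(1) if weight else None,
        "compiled_archs": compiled_archs,
    }

# Lines copied from the two logs above (shortened to the relevant ones).
flux_log = """\
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:277 - Diffusion model weight type: q4_0
mmq.cuh:2589: ERROR: CUDA kernel mul_mat_q has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520
"""
sd14_log = """\
Device 0: NVIDIA GeForce RTX 4070 SUPER, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:273 - Diffusion model weight type: f32
"""

flux = summarize_log(flux_log)
sd14 = summarize_log(sd14_log)
print("Flux.1 Dev:", flux)
print("SD 1.4:    ", sd14)
```

This only summarizes what the logs already say; it does not confirm the cause of the error.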
