
feat: add PiD support#1585

Merged: leejet merged 3 commits into master from PiD on May 31, 2026


@leejet
Owner

@leejet leejet commented May 31, 2026

Summary

  • Add PiD support

Related Issue / Discussion

N/A

Additional Information

Examples

.\bin\Release\sd-cli.exe --diffusion-model ..\..\ComfyUI\models\diffusion_models\pid_flux1_512_to_2048_4step_bf16.safetensors --llm "..\..\ComfyUI\models\text_encoders\gemma_2_2b_it_elm_bf16.safetensors" --vae ..\..\ComfyUI\models\vae\ae.sft --vae-format flux --cfg-scale 1.0  -p "a lovely cat" -r ..\assets\ernie_image\turbo_example.png --diffusion-fa -v --steps 4 -H 2048 -W 2048 --rng cpu

before: [image: turbo_example]

after: [image: output]

Checklist

@Green-Sky
Contributor

In general, it would be great if we could dump the raw latents into a gguf or something. Implementation-wise, it would work like an image input/output path where the VAE is skipped.

@wbruna
Contributor

wbruna commented May 31, 2026

> In general, it would be great if we could dump the raw latents into a gguf or something.

Or embed them as metadata in a preview image. Would be useful for i2i steps, too.

@leejet
Owner Author

leejet commented May 31, 2026

I once considered whether to implement a similar mechanism, but I felt it would expose too much internal content. The loss of detail from reconstructing the latent through VAE encoding is also within an acceptable range.

@leejet leejet merged commit 0982807 into master May 31, 2026
14 checks passed
@wbruna
Contributor

wbruna commented May 31, 2026

For anyone else trying: even with --max-vram, this needs a ton of VRAM at the intended resolutions 🙂 On ROCm, at 320x512 -> 1280x2048, I can't get peak usage below ~11.5G; 512² -> 2048² goes past my 16G of VRAM. Plus, the model can't really deal with anything other than 512p input and 2048p output.

@leejet, would it be possible to split the compute graph even further? It's unfortunately out of reach for Vulkan right now.

@leejet
Owner Author

leejet commented May 31, 2026

Did you not use --offload-to-cpu? In my tests, even with the original bf16 weights, the VRAM usage for 512x512 -> 2048x2048 was only around 6 GB.

[DEBUG] ggml_extend.hpp:1902 - PiD compute buffer size: 3684.03 MB(VRAM)
[INFO ] ggml_extend.hpp:2142 - PiD offload params (2605.55 MB, 456 tensors) to runtime backend (CUDA0), taking 0.50s
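
As a rough cross-check of the "around 6 GB" figure, the two log lines above can simply be added together. This is only a back-of-the-envelope sketch: it assumes the compute buffer and the offloaded parameters dominate VRAM usage and that "MB" in the log means MiB. Other allocations (VAE, conditioner, caches) are ignored.

```python
# Back-of-the-envelope VRAM estimate from the two log lines above.
# Assumption: "MB" in the log means MiB, and these two buffers dominate usage.
compute_buffer_mb = 3684.03    # "PiD compute buffer size"
offloaded_params_mb = 2605.55  # "PiD offload params"

total_mb = compute_buffer_mb + offloaded_params_mb
total_gib = total_mb / 1024

print(f"{total_mb:.2f} MB ~= {total_gib:.2f} GiB")
```

Under those assumptions the sum lands slightly above 6 GiB, which is consistent with the number quoted in the comment.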

@leejet leejet deleted the PiD branch May 31, 2026 17:45
@wbruna
Contributor

wbruna commented May 31, 2026

> Did you not use --offload-to-cpu? In my tests, even with the original bf16 weights, the VRAM usage for 512x512 -> 2048x2048 was only around 6 GB.
>
> [DEBUG] ggml_extend.hpp:1902 - PiD compute buffer size: 3684.03 MB(VRAM)
> [INFO ] ggml_extend.hpp:2142 - PiD offload params (2605.55 MB, 456 tensors) to runtime backend (CUDA0), taking 0.50s

@leejet, I did. On ROCm I get:

./sd-cli --backend ROCm0 --diffusion-model pid_flux1_512_to_2048_4step_bf16.safetensors --llm gemma_2_2b_it_elm_bf16.safetensors --vae ae.safetensors --vae-format flux -r cat.png --cfg-scale 1 --steps 4 -W 2048 -H 2048 -o cat-up.png -p a lovely cat --offload-to-cpu --mmap --max-vram -0.5
(...)
[ERROR] ggml_extend.hpp:69 - ggml_backend_cuda_buffer_type_alloc_buffer: allocating 26026.79 MiB on device 0: cudaMalloc failed: out of memory

full log

$ ./sd-cli --backend ROCm0 --diffusion-model pid_flux1_512_to_2048_4step_bf16.safetensors --llm gemma_2_2b_it_elm_bf16.safetensors --vae ae.safetensors --vae-format flux -r cat.png --cfg-scale 1 --steps 4 -W 2048 -H 2048 -o cat-up.png -p a lovely cat --offload-to-cpu --mmap --max-vram -0.5
[INFO ] ggml_extend.hpp:63 - ggml_cuda_init: found 2 ROCm devices (Total VRAM: 39406 MiB):
[INFO ] ggml_extend.hpp:63 - Device 0: AMD Radeon RX 7600 XT, gfx1102 (0x1102), VMM: no, Wave Size: 32, VRAM: 16368 MiB
[INFO ] ggml_extend.hpp:63 - Device 1: AMD Radeon Vega 11 Graphics, gfx902:xnack- (0x902), VMM: no, Wave Size: 64, VRAM: 23038 MiB
[INFO ] ggml_graph_cut.cpp:123 - --max-vram < 0 auto-detected 15.95 GiB free VRAM (15.98 GiB total), reserving 0.50 GiB; using 15.45 GiB
[INFO ] stable-diffusion.cpp:281 - loading diffusion model from 'pid_flux1_512_to_2048_4step_bf16.safetensors'
[INFO ] model.cpp:219 - load pid_flux1_512_to_2048_4step_bf16.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:328 - loading llm from 'gemma_2_2b_it_elm_bf16.safetensors'
[INFO ] model.cpp:219 - load gemma_2_2b_it_elm_bf16.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:342 - loading vae from 'ae.safetensors'
[INFO ] model.cpp:219 - load ae.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:384 - Version: PiD
[INFO ] stable-diffusion.cpp:412 - Weight type stat: f32: 244 | bf16: 744
[INFO ] stable-diffusion.cpp:413 - Conditioner weight type stat: bf16: 288
[INFO ] stable-diffusion.cpp:414 - Diffusion model weight type stat: bf16: 456
[INFO ] stable-diffusion.cpp:415 - VAE weight type stat: f32: 244
[WARN ] stable-diffusion.cpp:448 - in mode 'immediately', LoRAs will cause extra memory usage with mmap
[INFO ] pid.hpp:707 - PiD params: patch_depth=14, pixel_depth=2, patch_mlp_hidden_dim=4096, lq_latent_channels=16, lq_latent_down_factor=8
[INFO ] stable-diffusion.cpp:826 - using VAE for encoding / decoding
[INFO ] auto_encoder_kl.hpp:525 - vae decoder: ch = 128
[INFO ] model.cpp:803 - using mmap for 'pid_flux1_512_to_2048_4step_bf16.safetensors'
[INFO ] model.cpp:803 - using mmap for 'gemma_2_2b_it_elm_bf16.safetensors'
[INFO ] model.cpp:803 - using mmap for 'ae.safetensors'
[INFO ] model.cpp:825 - model files processing completed in 0.00s
[INFO ] model.cpp:910 - memory-mapped 556 tensors in 3 files (6412.01 MB), taking 0.00s
|=======================> | 456/988 - 238.57MB/s
|=====================================> | 744/988 - 999.30MB/s
|==================================================| 988/988 - 1.06GB/s
[INFO ] model.cpp:1164 - loading tensors completed, taking 1.37s (read: 0.13s, memcpy: 0.00s, convert: 0.29s, copy_to_backend: 0.00s)
[INFO ] stable-diffusion.cpp:1126 - total params memory size = 2465.46MB (VRAM 0.00MB, RAM 2465.46MB): text_encoders 2250.92MB(RAM), diffusion_model 54.79MB(RAM), vae 159.75MB(RAM), controlnet 0.00MB(N/A), pmid 0.00MB(N/A)
[INFO ] stable-diffusion.cpp:1228 - running in FLOW mode
[INFO ] stable-diffusion.cpp:4371 - generate_image 2048x2048
[INFO ] denoiser.hpp:619 - get_sigmas with LCM scheduler
[INFO ] stable-diffusion.cpp:3452 - sampling using LCM method
[INFO ] stable-diffusion.cpp:3955 - EDIT mode
[INFO ] ggml_graph_cut.cpp:731 - vae build cached graph cut plan done (taking 0 ms)
[INFO ] ggml_extend.hpp:2137 - vae offload params (160.00 MB, 244 tensors) to runtime backend (ROCm0), taking 0.34s
[INFO ] stable-diffusion.cpp:4064 - encode_first_stage completed, taking 0.97s
[INFO ] ggml_graph_cut.cpp:731 - gemma2_2b build cached graph cut plan done (taking 1 ms)
[INFO ] ggml_graph_cut.cpp:694 - gemma2_2b graph cut max_vram=15822.00 MB merged 28 segments -> 1 segments
[INFO ] ggml_graph_cut.cpp:702 - gemma2_2b graph cut max_vram budget merge took 8 ms
[INFO ] ggml_extend.hpp:2137 - gemma2_2b offload params (6111.92 MB, 288 tensors) to runtime backend (ROCm0), taking 14.22s
[INFO ] stable-diffusion.cpp:4125 - get_learned_condition completed, taking 14.46s
[INFO ] stable-diffusion.cpp:4405 - generating image: 1/1 - seed 42
[INFO ] ggml_graph_cut.cpp:731 - PiD build cached graph cut plan done (taking 2 ms)
[INFO ] ggml_graph_cut.cpp:702 - PiD graph cut max_vram budget merge took 3 ms
[ERROR] ggml_extend.hpp:69 - ggml_backend_cuda_buffer_type_alloc_buffer: allocating 26026.79 MiB on device 0: cudaMalloc failed: out of memory
[ERROR] ggml_extend.hpp:69 - ggml_gallocr_reserve_n_impl: failed to allocate ROCm0 buffer of size 27291066112
[ERROR] ggml_extend.hpp:1892 - PiD: failed to allocate the compute buffer
[ERROR] ggml_extend.hpp:2418 - PiD alloc compute buffer failed
[ERROR] stable-diffusion.cpp:2108 - diffusion model compute failed
[ERROR] stable-diffusion.cpp:2206 - Diffusion model sampling failed
[ERROR] stable-diffusion.cpp:4444 - sampling for image 1/1 failed after 4.18s
[ERROR] main.cpp:792 - generate failed
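
The two allocation errors in the log above report the same request in different units (MiB and bytes). The sketch below converts the byte count and compares it with the budget printed on the `--max-vram` line. It is only arithmetic on the logged numbers; the variable names are illustrative.

```python
MIB = 1024 ** 2

failed_alloc_bytes = 27291066112  # from the ggml_gallocr_reserve_n_impl error
budget_mib = 15.45 * 1024         # "using 15.45 GiB" from the --max-vram line

failed_alloc_mib = failed_alloc_bytes / MIB
over_budget = failed_alloc_mib / budget_mib

print(f"requested: {failed_alloc_mib:.2f} MiB")
print(f"budget:    {budget_mib:.2f} MiB")
print(f"ratio:     {over_budget:.2f}x")
```

The converted value matches the 26026.79 MiB in the first error line, so a single compute buffer was requested that is well beyond the whole VRAM budget.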

Vulkan is similar. This is with --max-vram 4: it seemingly worked for the conditioner, but not for the diffusion model:

[INFO ] stable-diffusion.cpp:3955 - EDIT mode
[INFO ] ggml_graph_cut.cpp:731 - vae build cached graph cut plan done (taking 0 ms)
[INFO ] ggml_extend.hpp:2137 - vae offload params (160.00 MB, 244 tensors) to runtime backend (Vulkan1), taking 0.10s
[INFO ] stable-diffusion.cpp:4064 - encode_first_stage completed, taking 0.58s
[INFO ] ggml_graph_cut.cpp:731 - gemma2_2b build cached graph cut plan done (taking 1 ms)
[INFO ] ggml_graph_cut.cpp:694 - gemma2_2b graph cut max_vram=4096.00 MB merged 28 segments -> 2 segments
[INFO ] ggml_graph_cut.cpp:702 - gemma2_2b graph cut max_vram budget merge took 5 ms
[INFO ] stable-diffusion.cpp:4125 - get_learned_condition completed, taking 4.72s
[INFO ] stable-diffusion.cpp:4405 - generating image: 1/1 - seed 42
[INFO ] ggml_graph_cut.cpp:731 - PiD build cached graph cut plan done (taking 2 ms)
[INFO ] ggml_graph_cut.cpp:702 - PiD graph cut max_vram budget merge took 3 ms
ggml_vulkan: Device memory allocation of size 26722162176 failed.

full log

$ ./sd-cli --backend Vulkan1 --diffusion-model pid_flux1_512_to_2048_4step_bf16.safetensors --llm gemma_2_2b_it_elm_bf16.safetensors --vae ae.safetensors --vae-format flux -r cat.png --cfg-scale 1 --steps 4 -W 2048 -H 2048 -o cat-up.png -p a lovely cat --offload-to-cpu --mmap --max-vram 4
[INFO ] ggml_extend.hpp:63 - ggml_cuda_init: found 2 ROCm devices (Total VRAM: 39406 MiB):
[INFO ] ggml_extend.hpp:63 - Device 0: AMD Radeon RX 7600 XT, gfx1102 (0x1102), VMM: no, Wave Size: 32, VRAM: 16368 MiB
[INFO ] ggml_extend.hpp:63 - Device 1: AMD Radeon Vega 11 Graphics, gfx902:xnack- (0x902), VMM: no, Wave Size: 64, VRAM: 23038 MiB
[INFO ] stable-diffusion.cpp:281 - loading diffusion model from 'pid_flux1_512_to_2048_4step_bf16.safetensors'
[INFO ] model.cpp:219 - load pid_flux1_512_to_2048_4step_bf16.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:328 - loading llm from 'gemma_2_2b_it_elm_bf16.safetensors'
[INFO ] model.cpp:219 - load gemma_2_2b_it_elm_bf16.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:342 - loading vae from 'ae.safetensors'
[INFO ] model.cpp:219 - load ae.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:384 - Version: PiD
[INFO ] stable-diffusion.cpp:412 - Weight type stat: f32: 244 | bf16: 744
[INFO ] stable-diffusion.cpp:413 - Conditioner weight type stat: bf16: 288
[INFO ] stable-diffusion.cpp:414 - Diffusion model weight type stat: bf16: 456
[INFO ] stable-diffusion.cpp:415 - VAE weight type stat: f32: 244
[WARN ] stable-diffusion.cpp:448 - in mode 'immediately', LoRAs will cause extra memory usage with mmap
[INFO ] pid.hpp:707 - PiD params: patch_depth=14, pixel_depth=2, patch_mlp_hidden_dim=4096, lq_latent_channels=16, lq_latent_down_factor=8
[INFO ] stable-diffusion.cpp:826 - using VAE for encoding / decoding
[INFO ] auto_encoder_kl.hpp:525 - vae decoder: ch = 128
[INFO ] model.cpp:803 - using mmap for 'pid_flux1_512_to_2048_4step_bf16.safetensors'
[INFO ] model.cpp:803 - using mmap for 'gemma_2_2b_it_elm_bf16.safetensors'
[INFO ] model.cpp:803 - using mmap for 'ae.safetensors'
[INFO ] model.cpp:825 - model files processing completed in 0.00s
[INFO ] model.cpp:910 - memory-mapped 556 tensors in 3 files (6412.01 MB), taking 0.00s
|=======================> | 456/988 - 238.57MB/s
|=====================================> | 744/988 - 1.13GB/s
|==================================================| 988/988 - 1.20GB/s
[INFO ] model.cpp:1164 - loading tensors completed, taking 1.22s (read: 0.05s, memcpy: 0.00s, convert: 0.24s, copy_to_backend: 0.00s)
[INFO ] stable-diffusion.cpp:1126 - total params memory size = 2465.46MB (VRAM 0.00MB, RAM 2465.46MB): text_encoders 2250.92MB(RAM), diffusion_model 54.79MB(RAM), vae 159.75MB(RAM), controlnet 0.00MB(N/A), pmid 0.00MB(N/A)
[INFO ] stable-diffusion.cpp:1228 - running in FLOW mode
[INFO ] stable-diffusion.cpp:4371 - generate_image 2048x2048
[INFO ] denoiser.hpp:619 - get_sigmas with LCM scheduler
[INFO ] stable-diffusion.cpp:3452 - sampling using LCM method
[INFO ] stable-diffusion.cpp:3955 - EDIT mode
[INFO ] ggml_graph_cut.cpp:731 - vae build cached graph cut plan done (taking 0 ms)
[INFO ] ggml_extend.hpp:2137 - vae offload params (160.00 MB, 244 tensors) to runtime backend (Vulkan1), taking 0.10s
[INFO ] stable-diffusion.cpp:4064 - encode_first_stage completed, taking 0.58s
[INFO ] ggml_graph_cut.cpp:731 - gemma2_2b build cached graph cut plan done (taking 1 ms)
[INFO ] ggml_graph_cut.cpp:694 - gemma2_2b graph cut max_vram=4096.00 MB merged 28 segments -> 2 segments
[INFO ] ggml_graph_cut.cpp:702 - gemma2_2b graph cut max_vram budget merge took 5 ms
[INFO ] stable-diffusion.cpp:4125 - get_learned_condition completed, taking 4.72s
[INFO ] stable-diffusion.cpp:4405 - generating image: 1/1 - seed 42
[INFO ] ggml_graph_cut.cpp:731 - PiD build cached graph cut plan done (taking 2 ms)
[INFO ] ggml_graph_cut.cpp:702 - PiD graph cut max_vram budget merge took 3 ms
ggml_vulkan: Device memory allocation of size 26722162176 failed.
ggml_vulkan: Requested buffer size exceeds device buffer size limit: ErrorOutOfDeviceMemory
[ERROR] ggml_extend.hpp:69 - ggml_gallocr_reserve_n_impl: failed to allocate Vulkan1 buffer of size 27792539144
[ERROR] ggml_extend.hpp:1892 - PiD: failed to allocate the compute buffer
[ERROR] ggml_extend.hpp:2418 - PiD alloc compute buffer failed
[ERROR] stable-diffusion.cpp:2108 - diffusion model compute failed
[ERROR] stable-diffusion.cpp:2206 - Diffusion model sampling failed
[ERROR] stable-diffusion.cpp:4444 - sampling for image 1/1 failed after 3.72s
[ERROR] main.cpp:792 - generate failed
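
The `merged 28 segments -> 2 segments` lines in the logs above suggest that adjacent graph segments are combined until a VRAM budget is reached. The actual logic in `ggml_graph_cut.cpp` is not shown in this thread, so the following is a hypothetical greedy sketch with made-up segment sizes. The function name `merge_segments` and all numbers are assumptions for illustration, and only parameter sizes are considered (no compute buffers).

```python
def merge_segments(sizes_mb, budget_mb):
    """Greedily merge adjacent segments while each merged group fits the budget.

    Hypothetical illustration only; not the algorithm used by ggml_graph_cut.cpp.
    """
    groups = []
    current = []
    current_total = 0.0
    for size in sizes_mb:
        # Start a new group when adding this segment would exceed the budget.
        if current and current_total + size > budget_mb:
            groups.append(current)
            current, current_total = [], 0.0
        current.append(size)
        current_total += size
    if current:
        groups.append(current)
    return groups


# Made-up per-segment sizes (MB); the real per-layer sizes are not in the logs.
sizes = [1200.0] + [180.0] * 27
groups = merge_segments(sizes, budget_mb=4096.0)
print(len(sizes), "segments ->", len(groups), "groups")
print([round(sum(g), 2) for g in groups])
```

Note that in this sketch a single segment larger than the budget still ends up as its own group: merging can only combine segments, never split one. A scheme like this cannot help when one segment alone needs more memory than is available.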

Edit: running with -v, I see the graph cutting logs, but it immediately tries to allocate everything:

[DEBUG] ggml_extend.hpp:2553 - PiD graph cut executing segment 1/17: pid.patch_blocks.0
[DEBUG] ggml_extend.hpp:2217 - PiD offload partial params (234.60 MB, 75 tensors) to runtime backend (ROCm0)
[ERROR] ggml_extend.hpp:69 - ggml_backend_cuda_buffer_type_alloc_buffer: allocating 26026.79 MiB on device 0: cudaMalloc failed: out of memory

Details

[DEBUG] conditioner.hpp:1811 - parse ' a lovely cat' to [[' a lovely cat', 1], ]
[DEBUG] bpe_tokenizer.cpp:207 - split prompt " a lovely cat" to tokens ["▁a", "▁lovely", "▁cat", ]
[INFO ] ggml_graph_cut.cpp:731 - gemma2_2b build cached graph cut plan done (taking 1 ms)
[INFO ] ggml_graph_cut.cpp:694 - gemma2_2b graph cut max_vram=4096.00 MB merged 28 segments -> 2 segments
[INFO ] ggml_graph_cut.cpp:702 - gemma2_2b graph cut max_vram budget merge took 5 ms
[DEBUG] ggml_extend.hpp:2553 - gemma2_2b graph cut executing segment 1/2: llm.text.prelude..llm.text.layers.10
[DEBUG] ggml_extend.hpp:2217 - gemma2_2b offload partial params (3883.89 MB, 122 tensors) to runtime backend (ROCm0)
[DEBUG] ggml_extend.hpp:1899 - gemma2_2b compute buffer size: 54.46 MB(VRAM)
[DEBUG] ggml_extend.hpp:1986 - gemma2_2b cache backend buffer size = 4.46 MB(VRAM) (1 tensors)
[DEBUG] ggml_extend.hpp:2525 - gemma2_2b execute_graph timing: offload=753 ms alloc=0 ms copy_in=1 ms compute=104 ms cache=0 ms total=859 ms
[DEBUG] ggml_extend.hpp:2553 - gemma2_2b graph cut executing segment 2/2: llm.text.layers.11..ggml_runner.final
[DEBUG] ggml_extend.hpp:2217 - gemma2_2b offload partial params (2228.04 MB, 166 tensors) to runtime backend (ROCm0)
[DEBUG] ggml_extend.hpp:1899 - gemma2_2b compute buffer size: 54.46 MB(VRAM)
[DEBUG] ggml_extend.hpp:2525 - gemma2_2b execute_graph timing: offload=526 ms alloc=0 ms copy_in=1 ms compute=89 ms cache=1 ms total=618 ms
[DEBUG] conditioner.hpp:2178 - computing condition graph completed, taking 1489 ms
[INFO ] stable-diffusion.cpp:4125 - get_learned_condition completed, taking 1.49s
[INFO ] stable-diffusion.cpp:4405 - generating image: 1/1 - seed 42
[INFO ] ggml_graph_cut.cpp:731 - PiD build cached graph cut plan done (taking 1 ms)
[INFO ] ggml_graph_cut.cpp:702 - PiD graph cut max_vram budget merge took 3 ms
[DEBUG] ggml_extend.hpp:2553 - PiD graph cut executing segment 1/17: pid.patch_blocks.0
[DEBUG] ggml_extend.hpp:2217 - PiD offload partial params (234.60 MB, 75 tensors) to runtime backend (ROCm0)
[ERROR] ggml_extend.hpp:69 - ggml_backend_cuda_buffer_type_alloc_buffer: allocating 26026.79 MiB on device 0: cudaMalloc failed: out of memory
[ERROR] ggml_extend.hpp:69 - ggml_gallocr_reserve_n_impl: failed to allocate ROCm0 buffer of size 27291066112
[ERROR] ggml_extend.hpp:1892 - PiD: failed to allocate the compute buffer
[ERROR] ggml_extend.hpp:2418 - PiD alloc compute buffer failed
[ERROR] stable-diffusion.cpp:2108 - diffusion model compute failed
[ERROR] stable-diffusion.cpp:2206 - Diffusion model sampling failed
[ERROR] stable-diffusion.cpp:4444 - sampling for image 1/1 failed after 3.60s
[ERROR] main.cpp:792 - generate failed

@SmallAndSoft
Contributor

> I once considered whether to implement a similar mechanism, but I felt it would expose too much internal content. The loss of detail from reconstructing the latent through VAE encoding is also within an acceptable range.

The paper showcases advantages of decoding directly from the latent:

  • early denoising termination becomes useful (Figure 6)
  • lower latency and VRAM use (Table 3)

Upscaling from pixels is certainly useful, but the original focus is still on replacing the VAE. I am looking forward to having it implemented!

@wbruna
Contributor

wbruna commented Jun 1, 2026

Update: I was missing --diffusion-fa; ROCm generating successfully now. With --max-vram 4 I get peaks around 5.7G.
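
One plausible reason `--diffusion-fa` makes such a difference: without flash attention, a naive attention implementation materializes a full tokens x tokens score matrix, so its size grows with the fourth power of the image side length. The sketch below only illustrates that scaling. The patch size and head count are hypothetical, and how PiD actually lays out its attention is not described in this thread.

```python
def naive_score_bytes(side_px, patch_px, heads, bytes_per_elem=4):
    """Bytes for a full tokens x tokens attention score matrix (all heads)."""
    tokens = (side_px // patch_px) ** 2
    return heads * tokens * tokens * bytes_per_elem


# Hypothetical values: patch size and head count are NOT taken from PiD.
small = naive_score_bytes(512, patch_px=16, heads=16)
large = naive_score_bytes(2048, patch_px=16, heads=16)

print(f"512px:  {small / 1024**2:.0f} MiB")
print(f"2048px: {large / 1024**3:.0f} GiB")
print(f"growth: {large // small}x")
```

Whatever the real constants are, going from a 512-pixel to a 2048-pixel side multiplies the token count by 16 and the naive score matrix by 256. Flash attention avoids storing that matrix in full.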

Vulkan is crashing with a ggml assert, seemingly independent of memory usage. I'll need to debug it further.

ggml/src/ggml-vulkan/ggml-vulkan.cpp:6763: GGML_ASSERT(wg0 <= ctx->device->properties.limits.maxComputeWorkGroupCount[0] && wg1 <= ctx->device->properties.limits.maxComputeWorkGroupCount[1] && wg2 <= ctx->device->properties.limits.maxComputeWorkGroupCount[2]) failed
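
The assertion above checks that the number of workgroups dispatched in each dimension stays within the device's `maxComputeWorkGroupCount`. A minimal sketch of that condition follows. The helper names and the dispatch shapes are hypothetical, and the limits used are the minimum values the Vulkan specification guarantees, not the limits of the device in this thread.

```python
import math


def workgroup_count(elements, local_size):
    """Number of workgroups needed to cover `elements` with `local_size` threads each."""
    return math.ceil(elements / local_size)


def fits_limits(counts, max_counts):
    """Mirror of the asserted condition: every dimension must be within its limit."""
    return all(c <= m for c, m in zip(counts, max_counts))


# 65535 is the minimum the Vulkan spec guarantees per dimension; the limits of
# the device in this thread are not shown, so these are assumed values.
limits = (65535, 65535, 65535)

# Hypothetical dispatch: one workgroup per pixel of a 2048x2048 image, all in x.
counts = (workgroup_count(2048 * 2048, 1), 1, 1)
print(counts, fits_limits(counts, limits))

# Folding the same work into two dimensions stays within the assumed limits.
folded = (2048, 2048, 1)
print(folded, fits_limits(folded, limits))
```

This would explain why the crash can appear independent of memory usage: a dispatch can exceed the workgroup-count limit even when every buffer fits in VRAM.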

Details
[INFO ] stable-diffusion.cpp:4405 - generating image: 1/1 - seed 42
[INFO ] ggml_graph_cut.cpp:731  - PiD build cached graph cut plan done (taking 3 ms)
[INFO ] ggml_graph_cut.cpp:694  - PiD graph cut max_vram=4096.00 MB merged 17 segments -> 5 segments
[INFO ] ggml_graph_cut.cpp:702  - PiD graph cut max_vram budget merge took 8 ms
[DEBUG] ggml_extend.hpp:2553 - PiD graph cut executing segment 1/5: pid.patch_blocks.0..pid.patch_blocks.7
[DEBUG] ggml_extend.hpp:2217 - PiD offload partial params (1400.88 MB, 258 tensors) to runtime backend (Vulkan1)
[DEBUG] ggml_extend.hpp:1899 - PiD compute buffer size: 1692.79 MB(VRAM)
[DEBUG] ggml_extend.hpp:1986 - PiD cache backend buffer size =  97.76 MB(VRAM) (2 tensors)
[DEBUG] ggml_extend.hpp:2525 - PiD execute_graph timing: offload=1006 ms alloc=0 ms copy_in=35 ms compute=6174 ms cache=2 ms total=7239 ms
[DEBUG] ggml_extend.hpp:2553 - PiD graph cut executing segment 2/5: pid.patch_blocks.8..pid.patch_blocks.13
[DEBUG] ggml_extend.hpp:2217 - PiD offload partial params (1014.87 MB, 193 tensors) to runtime backend (Vulkan1)
[DEBUG] ggml_extend.hpp:1899 - PiD compute buffer size: 1515.91 MB(VRAM)
[DEBUG] ggml_extend.hpp:1986 - PiD cache backend buffer size =  96.00 MB(VRAM) (1 tensors)
[DEBUG] ggml_extend.hpp:2525 - PiD execute_graph timing: offload=721 ms alloc=1 ms copy_in=4 ms compute=4659 ms cache=1 ms total=5402 ms
[DEBUG] ggml_extend.hpp:2553 - PiD graph cut executing segment 3/5: pid.pixel_blocks.0
[DEBUG] ggml_extend.hpp:2217 - PiD offload partial params (110.76 MB, 23 tensors) to runtime backend (Vulkan1)
[DEBUG] ggml_extend.hpp:1899 - PiD compute buffer size: 3385.00 MB(VRAM)
ggml/src/ggml-vulkan/ggml-vulkan.cpp:6763: GGML_ASSERT(wg0 <= ctx->device->properties.limits.maxComputeWorkGroupCount[0] && wg1 <= ctx->device->properties.limits.maxComputeWorkGroupCount[1] && wg2 <= ctx->device->properties.limits.maxComputeWorkGroupCount[2]) failed
[New LWP 1176344]
[New LWP 1176343]
[New LWP 1176342]
[New LWP 1176341]
[New LWP 1176338]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56     ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56      in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x00007f2dc9a9b668 in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:49
warning: 49     ./nptl/cancellation.c: No such file or directory
#2  0x00007f2dc9a9b6ad in __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75      in ./nptl/cancellation.c
#3  0x00007f2dc9b067c7 in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30     ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4  0x000055b1238b68bb in ggml_print_backtrace ()
#5  0x000055b1238b6a0e in ggml_abort ()
#6  0x000055b1237c532b in void ggml_vk_dispatch_pipeline<vk_op_gated_delta_net_push_constants>(ggml_backend_vk_context*, std::shared_ptr<vk_context_struct>&, std::shared_ptr<vk_pipeline_struct>&, std::initializer_list<vk::DescriptorBufferInfo> const&, vk_op_gated_delta_net_push_constants const&, std::array<unsigned int, 3ul>) ()
#7  0x000055b123877171 in ggml_vk_mul_mat_q_f16(ggml_backend_vk_context*, std::shared_ptr<vk_context_struct>&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, bool) ()
#8  0x000055b12388897a in ggml_vk_build_graph(ggml_backend_vk_context*, ggml_cgraph*, int, ggml_tensor*, int, bool, bool, bool) [clone .isra.0] ()
#9  0x000055b1238897e1 in ggml_backend_vk_graph_compute(ggml_backend*, ggml_cgraph*) ()
#10 0x000055b1238cd3ee in ggml_backend_graph_compute ()
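
To see which graph-cut segment is the most expensive, the DEBUG lines from the Details log above can be tabulated. The sketch below parses three of those segments (copied verbatim) and sums the offloaded parameters and the compute buffer per segment. The regular expressions are written against the log format as it appears here and may need adjusting for other versions. Cache buffers and anything else resident in VRAM are ignored.

```python
import re

LOG = """\
[DEBUG] ggml_extend.hpp:2553 - PiD graph cut executing segment 1/5: pid.patch_blocks.0..pid.patch_blocks.7
[DEBUG] ggml_extend.hpp:2217 - PiD offload partial params (1400.88 MB, 258 tensors) to runtime backend (Vulkan1)
[DEBUG] ggml_extend.hpp:1899 - PiD compute buffer size: 1692.79 MB(VRAM)
[DEBUG] ggml_extend.hpp:2553 - PiD graph cut executing segment 2/5: pid.patch_blocks.8..pid.patch_blocks.13
[DEBUG] ggml_extend.hpp:2217 - PiD offload partial params (1014.87 MB, 193 tensors) to runtime backend (Vulkan1)
[DEBUG] ggml_extend.hpp:1899 - PiD compute buffer size: 1515.91 MB(VRAM)
[DEBUG] ggml_extend.hpp:2553 - PiD graph cut executing segment 3/5: pid.pixel_blocks.0
[DEBUG] ggml_extend.hpp:2217 - PiD offload partial params (110.76 MB, 23 tensors) to runtime backend (Vulkan1)
[DEBUG] ggml_extend.hpp:1899 - PiD compute buffer size: 3385.00 MB(VRAM)
"""

SEGMENT = re.compile(r"executing segment (\d+)/\d+: (\S+)")
PARAMS = re.compile(r"offload partial params \(([\d.]+) MB")
COMPUTE = re.compile(r"compute buffer size: ([\d.]+) MB")


def summarize(log):
    """Collect params and compute-buffer sizes (MB) for each logged segment."""
    rows = []
    for line in log.splitlines():
        if m := SEGMENT.search(line):
            rows.append({"segment": int(m.group(1)), "name": m.group(2),
                         "params": 0.0, "compute": 0.0})
        elif rows and (m := PARAMS.search(line)):
            rows[-1]["params"] = float(m.group(1))
        elif rows and (m := COMPUTE.search(line)):
            rows[-1]["compute"] = float(m.group(1))
    return rows


for row in summarize(LOG):
    total = row["params"] + row["compute"]
    print(f'{row["segment"]}: {row["name"]:<45} {total:8.2f} MB')
```

In this excerpt the third segment (`pid.pixel_blocks.0`) has few parameters but by far the largest compute buffer, and it is the segment running when the assertion fires.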


4 participants