Same speed as LLAMA.cpp? #1291

xantrk · 2026-02-20T14:17:16Z

Hello all,
Since my system (12 GB VRAM on a 5070 Ti laptop GPU plus 32 GB RAM) is doomed to run hybrid inference, I gave ik_llama a try. I can't seem to get any meaningful difference in speed. Does anyone know whether I should expect a speedup, or whether I can tweak some parameters to get one?

LLAMA.cpp bench:

ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070 Ti Laptop GPU, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from C:\Users\furka\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\ggml-cuda.dll
load_backend: loaded RPC backend from C:\Users\furka\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\ggml-rpc.dll
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = Intel(R) Graphics (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from C:\Users\furka\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\Users\furka\AppData\Local\Microsoft\WinGet\Packages\ggml.llamacpp_Microsoft.Winget.Source_8wekyb3d8bbwe\ggml-cpu-alderlake.dll

| model | size | params | backend | ngl | n_cpu_moe | type_k | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B IQ4_XS - 4.25 bpw | 30.42 GiB | 60.33 B | CUDA,Vulkan | 99 | 40 | q8_0 | 1 | tg128 | 30.27 ± 6.34 |
| qwen3next 80B.A3B IQ4_XS - 4.25 bpw | 30.42 GiB | 60.33 B | CUDA,Vulkan | 99 | 40 | q4_0 | 1 | tg128 | 35.06 ± 4.18 |
| qwen3next 80B.A3B IQ4_XS - 4.25 bpw | 30.42 GiB | 60.33 B | CUDA,Vulkan | 99 | 42 | q8_0 | 1 | tg128 | 22.28 ± 7.91 |
| qwen3next 80B.A3B IQ4_XS - 4.25 bpw | 30.42 GiB | 60.33 B | CUDA,Vulkan | 99 | 42 | q4_0 | 1 | tg128 | 33.10 ± 1.48 |

build: d5dfc3302 (8069)

ik_llama.cpp bench:

.\llama-bench.exe -m "C:\Users\furka\.lmstudio\models\lmstudio-community\Qwen3-Coder-30B-A3B-Instruct-GGUF\Qwen3-Coder-Next-REAM.IQ4_XS.gguf" -p 0  -b 2048  -ngl 99 -fa 1 --n-cpu-moe 40,42 -ctk q8_0,q4_0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070 Ti Laptop GPU, compute capability 12.0, VMM: yes, VRAM: 12226 MiB

| model | size | params | backend | ngl | type_k | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
===================================== llama_init_from_model: f16
| qwen3next 80B.A3B IQ4_XS - 4.25 bpw | 30.42 GiB | 60.33 B | CUDA | 99 | q8_0 | tg128 | 30.21 ± 5.96 |
~ggml_backend_cuda_context: have 41 graphs
===================================== llama_init_from_model: f16
| qwen3next 80B.A3B IQ4_XS - 4.25 bpw | 30.42 GiB | 60.33 B | CUDA | 99 | q4_0 | tg128 | 35.28 ± 0.55 |
~ggml_backend_cuda_context: have 41 graphs
build: `1b60cb1` (1)

ikawrakow · 2026-02-22T06:56:00Z

Maintainer

@xantrk

In your test case, most of the time is spent on the CPU computing matrix-vector multiplications with the MoE tensors left in RAM. These computations are 100% memory-bandwidth bound, so there will not be much of a difference between llama.cpp and ik_llama.cpp (or between these two and the inference framework my grandmother put together over the weekend with the help of Claude Opus).
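To put a rough number on the memory-bandwidth point, here is a minimal Python sketch of the ceiling on token-generation speed. Every input is an assumption for illustration, not a measurement from this thread: roughly 3e9 active parameters per token (read off the "A3B" in the model name), 4.25 bits per weight (from the bench table), a guessed 80% of the active weights sitting in RAM, and a hypothetical 60 GB/s of RAM bandwidth. The function names are made up for this example.

```python
def bytes_read_per_token(active_params: float, bits_per_weight: float,
                         cpu_fraction: float) -> float:
    """Approximate bytes of weights streamed from RAM to generate one token."""
    return active_params * bits_per_weight / 8.0 * cpu_fraction


def bandwidth_bound_tps(bytes_per_token: float,
                        ram_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s when each token must stream bytes_per_token from RAM."""
    return ram_bandwidth_bytes_per_s / bytes_per_token


if __name__ == "__main__":
    # All values below are illustrative assumptions, not measured on this system.
    active_params = 3e9      # guessed from "A3B" in the model name
    bits_per_weight = 4.25   # from the bench table (IQ4_XS - 4.25 bpw)
    cpu_fraction = 0.8       # hypothetical share of active weights left in RAM
    ram_bandwidth = 60e9     # hypothetical RAM bandwidth in bytes/s

    per_token = bytes_read_per_token(active_params, bits_per_weight, cpu_fraction)
    ceiling = bandwidth_bound_tps(per_token, ram_bandwidth)
    print(f"~{per_token / 1e9:.2f} GB per token, ceiling ~{ceiling:.0f} t/s")
```

Whatever the real inputs are, this ceiling is set by bytes moved and bandwidth, not by which framework issues the reads, which is consistent with both builds landing at about the same t/s.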

Prompt processing, which you didn't test, may be slightly faster with ik_llama.cpp, but the difference will not be very big, because in that case processing time is dominated by the time it takes to copy the MoE tensors left in RAM to the GPU. Very large batches would change that, but you cannot really use them with your 12 GB of VRAM because of the large compute buffers they require.
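The batch-size remark can be illustrated with a similar back-of-the-envelope sketch in Python. It assumes a simplified model in which each batch requires one full copy of the RAM-resident MoE tensors to the GPU, and it ignores compute time and any overlap of copy and compute. The 24 GiB RAM-resident size and the 25 GB/s host-to-GPU bandwidth are hypothetical values, and the function names are made up for this example.

```python
GIB = 1024 ** 3


def copy_time_s(ram_resident_bytes: float, pcie_bytes_per_s: float) -> float:
    """Seconds to copy the RAM-resident MoE tensors to the GPU once."""
    return ram_resident_bytes / pcie_bytes_per_s


def pp_tps_bound(batch_tokens: int, ram_resident_bytes: float,
                 pcie_bytes_per_s: float) -> float:
    """Upper bound on prompt tokens/s if each batch needs one full copy."""
    return batch_tokens / copy_time_s(ram_resident_bytes, pcie_bytes_per_s)


if __name__ == "__main__":
    # Illustrative assumptions only, not measured on this system.
    ram_resident = 24 * GIB  # hypothetical part of the 30.42 GiB model left in RAM
    pcie_bandwidth = 25e9    # hypothetical effective host-to-GPU bandwidth, bytes/s

    for batch in (512, 2048, 8192):
        bound = pp_tps_bound(batch, ram_resident, pcie_bandwidth)
        print(f"batch {batch}: copy-limited bound ~{bound:.0f} t/s")
```

In this model the copy cost per batch is fixed, so the bound scales linearly with batch size. That is why only very large batches would let compute speed, rather than the copy, decide the result.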

Sorry, but ik_llama.cpp cannot somehow magically eliminate the hardware limitations of your system.

If you want to observe larger performance differences, pick a model that fits in your 32 GB RAM and run CPU-only with a long context.

1 reply

xantrk Feb 22, 2026
Author

Thank you so much for the detailed explanation, @ikawrakow, really appreciated! I think the root of my confusion was that I expected to see a difference in hybrid inference, but, as you explained, MoE has some quirks that I did not realize.

I will test hybrid (non-MoE) and CPU inference to get a bit more understanding, but again, thank you so much!
