Speed benchmarks of various LLMs #1544
Replies: 6 comments 10 replies
-
I have a simple question: an EPYC 9654 (96 cores) seems cheaper than 10 x RTX 3090. I guess its speed will be close to, or even surpass, the full-offload run with split mode layer?
-
GLM5 smol-IQ2_KL
hardware: 10 x RTX 3090 (the first GPU is x16); split mode: layer
-
Qwen3.5-27B-GGUF/IQ5_KS
hardware: two RTX 3090 EVGA, x16. Dynamic TDP is used to keep the GPUs below 80 °C (hence the zig-zags).
Logs were attached for two split modes: graph and layer.
-
AesSedai/Kimi-K2.5/Q4_X
*This is basically for illustrative purposes.
The previous benchmark ( https://huggingface.co/AesSedai/Kimi-K2.5-GGUF/discussions/7#69c9120d934491df89a986b1 ) with 8x3090 partial offload produced a result that seems very slow. The run below shows that hybrid inference with just the head offloaded works much better, and this is how it becomes more or less usable:
hardware: 2 x EPYC 7B13; NPS=0; DDR4-2933 ECC (2666 overclocked; timings are auto; command rate=1T); single RTX 3090 EVGA
command:
GGML_CUDA_NO_PINNED=1 numactl --interleave=all /opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-sweep-bench \
--warmup-batch \
-f /opt/ik_llama.cpp/wiki.test.raw \
--model /opt/AesSedai/Kimi-K2.5/Q4_X/Kimi-K2.5-Q4_X-00001-of-00014.gguf \
--alias AesSedai/Kimi-K2.5-GGUF \
-b $((1 * 1024)) -ub $((1 * 1024)) \
--ctx-size $((128 * 1024)) \
--mlock \
--temp 0.0 --top-k 0 --top-p 1.0 \
-ctk f16 \
-ctv f16 \
-khad -vhad \
-amb 256 \
-muge \
--merge-qkv \
--split-mode layer \
--cpu-moe \
--graph-reduce-type f16 \
--threads 128 \
--gpu-layers 99 \
--host 0.0.0.0 \
--port 8080 \
--log-enable \
--logdir /var/log/ \
--jinja \
--chat-template-file /opt/AesSedai/Kimi-K2.5/Q4_X/chat_template.jinja \
--special \
--verbose-prompt --verbosity 2 \
--prompt-cache "$HOME/.cache/ik_llama.cpp/prompt-cache.bin" --prompt-cache-all \
--slot-save-path "$HOME/.cache/ik_llama.cpp/slot.bin" \
--lookup-cache-dynamic "$HOME/.cache/ik_llama.cpp/slot.bin" \
--keep -1 \
--slot-prompt-similarity 0.35 \
--metrics \
-cuda fusion=1
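The command above writes its sizes as shell arithmetic expansions. As a small sketch (the variable names here are illustrative and not part of the original command), this is what those expressions evaluate to, plus how many 1024-token batches it takes to cover the full context:

```shell
#!/usr/bin/env bash
# Illustrative variable names; only the arithmetic comes from the command above.
BATCH_SIZE=$((1 * 1024))     # value passed to -b and -ub
CTX_SIZE=$((128 * 1024))     # value passed to --ctx-size

# How many batches of BATCH_SIZE tokens are needed to cover CTX_SIZE tokens.
BATCHES_PER_CTX=$((CTX_SIZE / BATCH_SIZE))

echo "batch=${BATCH_SIZE} ctx=${CTX_SIZE} batches_to_cover_ctx=${BATCHES_PER_CTX}"
```

Writing sizes as `$((128 * 1024))` keeps the intent (128K tokens) readable while the shell passes the plain integer to the binary.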
-
bartowski Qwen3.6-27b Q8_0
hardware: 2 x RTX 3090
-
// I went through the repos of vllm, sglang etc. (not for the first time) and was unable to find any info related to speed benchmarks. So I am creating a separate discussion to keep the interesting performance data here.
Qwen3.5 397B IQ4_KSS
hardware: RTX 3090 FE x10, TDP 420W (originally listed as 350W), DDR4 2666 MT/s ECC, AMD THREADRIPPER PRO 3995WX; ASROCK WRX80 Creator 2.0
UPDATE: Moved the GPUs to the GIGABYTE MC62-G40. That made it possible to get a config with one x16 GPU and the rest at x8, which improved the prefill (see the green graph below). So the GIGABYTE motherboards are recommended, not ASROCK.
UPDATE2: Installed the PLX 88096 switch to have more x16 GPUs.
hybrid inference (single GPU, layer) / full offload (layer vs graph)
llama-sweep-bench run command
/usr/share/doc/nvidia-cuda-toolkit/examples/bin/x86_64/linux/release/p2pBandwidthLatencyTest --p2p_read --sm_copy
updated for one x16 GPU and the rest at x8:
updated for five x16 GPUs and the rest at x8 via the PLX 88096 switch:
File: /root/p2p-11gpu.log
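As a rough reference point when reading p2p bandwidth numbers for x16 versus x8 links, the theoretical one-direction link bandwidth can be computed from the lane count. This is only a sketch: it assumes the links negotiate PCIe 4.0 (16 GT/s per lane with 128b/130b encoding), which the post does not state, and the function name `pcie_gbps` is made up for illustration. Measured p2p throughput will be lower than these ceilings.

```shell
#!/usr/bin/env bash
# Theoretical one-direction PCIe bandwidth in GB/s.
# Arguments: transfer rate per lane in GT/s, number of lanes.
# ASSUMPTION: 128b/130b line encoding (PCIe 3.0 and newer).
pcie_gbps() {
  local gts="$1" lanes="$2"
  awk -v gts="$gts" -v lanes="$lanes" \
    'BEGIN { printf "%.2f\n", gts * 128 / 130 / 8 * lanes }'
}

# ASSUMPTION: PCIe 4.0 links, i.e. 16 GT/s per lane.
X16_GBPS="$(pcie_gbps 16 16)"
X8_GBPS="$(pcie_gbps 16 8)"

echo "x16: ${X16_GBPS} GB/s, x8: ${X8_GBPS} GB/s"
```

An x8 link has half the ceiling of an x16 link, which is consistent with the prefill improvement reported after moving more GPUs to x16.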
logs
File: qwen3.5-397b-iq4_kss-full-offload-graph.log
File: qwen3.5-397b-iq4_kss-full-offload-layer.log
File: qwen3.5-397b-iq4_kss-hybrid-layer.log
File: /root/utils/bench-10gpu-mist-3k_b-5x(x16)-5x(x8).log
generate_svgs.sh (a dodgy script to generate SVGs out of llama-sweep-bench.sh)
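The script itself is not shown here, so the following is not the author's generate_svgs.sh. It is a minimal sketch of a likely first step for any plotting script: pulling the result rows out of a benchmark log into CSV. It assumes the log contains a pipe-delimited table with seven columns (PP, TG, N_KV, T_PP s, S_PP t/s, T_TG s, S_TG t/s); the function name `sweep_to_csv` and the CSV header names are hypothetical.

```shell
#!/usr/bin/env bash
# Convert pipe-delimited benchmark table rows read from stdin into CSV.
# ASSUMPTION: data rows look like "| 512 | 128 | 0 | 1.0 | 512.0 | 4.0 | 32.0 |".
sweep_to_csv() {
  awk -F'|' '
    BEGIN { print "pp,tg,n_kv,t_pp_s,s_pp_tps,t_tg_s,s_tg_tps" }
    # Keep only rows with seven cells whose first cell is an integer;
    # this skips the table header, the separator row, and other log lines.
    NF >= 9 && $2 ~ /^[ \t]*[0-9]+[ \t]*$/ {
      line = ""
      for (i = 2; i <= 8; i++) {
        cell = $i
        gsub(/[ \t]/, "", cell)          # strip padding inside the cell
        line = line (i > 2 ? "," : "") cell
      }
      print line
    }'
}
```

Usage would be along the lines of `sweep_to_csv < qwen3.5-397b-iq4_kss-full-offload-graph.log > graph.csv`, after which the CSV can be fed to any plotting tool.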