Information retention difference between bf16 and q8_0 Hadamard rotation KV cache #1557
-
Hi, the title says it all. Basically, where I'm coming from is that I heard the recommendation to use a bf16 KV cache with Qwen 3.5 on very long contexts, and I was wondering how well a KV cache quantized to q8_0 compares to that when using Hadamard rotation as available in ik_llama.cpp. Thanks!
Replies: 6 comments 6 replies
-
Let me guess: you saw this recommendation on Reddit, right?
-
AFAIK, there is no point in using Hadamard for anything higher than
-
If I am not mistaken, the
-
Below is an example of testing the perplexity with Kimi-K2.5 at 64k ctx. I can't see any difference between
AesSedai Kimi-K2.5 Q4_X
ayn-rand-atlas-shrugged.txt:
wget -nc 'https://drive.usercontent.google.com/download?id=0B_jOTiI5YWpPNDlsejBFRThzOVk&export=download' -O 'ayn-rand-atlas-shrugged.pdf'
hash=$(sha256sum ayn-rand-atlas-shrugged.pdf | cut -d' ' -f1)
if [[ "$hash" != "6f90245574a4a558d2bc9aa48292a1d415e2a443034b0a1d7d95565b842eaf82" ]]; then rm -f ayn-rand-atlas-shrugged.pdf; echo "Woops!"; fi
apt install poppler-utils
pdftotext ./ayn-rand-atlas-shrugged.pdf
ls -lah ayn-rand-atlas-shrugged.txt
perplexity calc:
GGML_CUDA_NO_PINNED=1 numactl --interleave=all /opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-perplexity \
-f ~/ayn-rand-atlas-shrugged.txt \
--model /opt/AesSedai/Kimi-K2.5/Q4_X/Kimi-K2.5-Q4_X-00001-of-00014.gguf \
--alias AesSedai/Kimi-K2.5-GGUF \
-b $((2048)) -ub $((512)) \
--ctx-size $((64 * 1024)) \
--chunks 8 \
--mlock \
--temp 0.0 --top-k 0 --top-p 1.0 \
-ctk f16 \
-ctv f16 \
--seed 42 \
-amb 256 \
-muge \
--merge-qkv \
--split-mode layer \
--cpu-moe \
--graph-reduce-type f16 \
--threads 128
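For anyone who wants to repeat this comparison across cache types, here is a minimal sketch that reruns the measurement once per type and keeps only the summary line. It reuses only the paths and flags from the command above (`-f`, `--model`, `--ctx-size`, `--chunks`, `--seed`, `-ctk`, `-ctv`) and leaves out the hardware-specific ones. The helper names `run_ppl` and `sweep` are made up for illustration. It also assumes the tool prints a line containing "Final estimate", as upstream llama.cpp does. The Hadamard option is deliberately not included, since its exact flag name is not shown in this thread; check `--help` and add it yourself.

```shell
# Sketch: rerun the same perplexity measurement once per KV cache type.
# LLAMA_PPL, MODEL and TEXT default to the paths used in the command above;
# run_ppl and sweep are hypothetical helper names.
LLAMA_PPL="${LLAMA_PPL:-/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-perplexity}"
MODEL="${MODEL:-/opt/AesSedai/Kimi-K2.5/Q4_X/Kimi-K2.5-Q4_X-00001-of-00014.gguf}"
TEXT="${TEXT:-$HOME/ayn-rand-atlas-shrugged.txt}"

# run_ppl CACHE_TYPE: one run with the same type for the K and V cache.
run_ppl() {
  "$LLAMA_PPL" \
    -f "$TEXT" \
    --model "$MODEL" \
    --ctx-size $((64 * 1024)) \
    --chunks 8 \
    --seed 42 \
    -ctk "$1" \
    -ctv "$1"
}

# sweep TYPE...: print each cache type followed by its summary line.
# Assumption: the tool prints a line containing "Final estimate".
sweep() {
  for cache_type in "$@"; do
    printf '== cache type: %s ==\n' "$cache_type"
    run_ppl "$cache_type" 2>&1 | grep 'Final estimate'
  done
}
```

Calling `sweep f16 bf16 q8_0` would then print one header and one summary line per cache type, which makes the runs easy to compare side by side.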
-
Interesting. Getting a segfault with Qwen3.5. [EDIT]: Oh! I probably should not use
-
See the PPL values for the long context above, Qwen3.5 for example. As far as I can see, there is no difference between bf16 and q8_0. Like, at all. The same goes for Kimi-K2.5: Here is bf16: and here is q8_0: I do believe there will be about a 0.01 delta, so no difference. [EDIT]: Okay, I added two more points and am cancelling this stupid test. :) There is no difference between q8_0 and bf16/f16 cache.
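To put a number on "no difference", one option is to compare the two final PPL values together with the error the tool reports next to them. The sketch below is only an illustration: `ppl_delta` and `within_error` are hypothetical helper names, and treating the two reported errors as independent and combining them in quadrature is a simplifying assumption, not something the tool does itself.

```shell
# ppl_delta A B: print the absolute and relative difference of two PPL values.
ppl_delta() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    d = b - a
    printf "delta=%.4f rel=%.3f%%\n", d, 100 * d / a
  }'
}

# within_error A ERR_A B ERR_B: succeed (exit 0) when |A - B| is no larger
# than the two reported errors combined in quadrature.
# Assumption: the errors are treated as independent, which is a rough test.
within_error() {
  awk -v a="$1" -v ea="$2" -v b="$3" -v eb="$4" 'BEGIN {
    d = a - b
    if (d < 0) d = -d
    exit !(d <= sqrt(ea * ea + eb * eb))
  }'
}
```

Usage would look like `ppl_delta "$PPL_BF16" "$PPL_Q8"` followed by `within_error "$PPL_BF16" "$ERR_BF16" "$PPL_Q8" "$ERR_Q8" && echo "within noise"`, with the variables filled in from your own runs.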
@milpster
Qwen3.5-397B-IQ4_KSS
8k ctx, q8_0 kv: