Information retention difference between bf16 and q8_0 hadamard rotation kv cache #1557

milpster · 2026-03-30T14:12:10Z

milpster
Mar 30, 2026

Hi,

The title says it all. Basically, where I'm coming from is that I heard the recommendation to use a bf16 KV cache with Qwen 3.5 on very long contexts, and I was wondering how well a KV cache quantized to q8_0 compares to that when using Hadamard rotation, as available in ik_llama.cpp.

Thanks!

Answered by magikRUKKOLA

ikawrakow · 2026-03-30T14:28:24Z

ikawrakow
Mar 30, 2026
Maintainer

Let me guess: you saw this recommendation on Reddit, right?

5 replies

Dampfinchen Mar 30, 2026

I have heard of this as well... on Reddit. It would be interesting to explore whether there's any truth to it, especially on long-context multi-agent tasks (100K+ context).

milpster Mar 30, 2026
Author

> Let me guess: you saw this recommendation on Reddit, right?

Among other places, which I ignored at first, but then Unsloth recommended it:

https://unsloth.ai/docs/models/qwen3.5

"If you're getting gibberish, your context length might be set too low. Or try using --cache-type-k bf16 --cache-type-v bf16 which might help."

ikawrakow Mar 31, 2026
Maintainer

Well, if Unsloth recommended it, then it must be true /s

milpster Mar 31, 2026
Author

Oh, I see we are being helpful here. What did I do to you?

magikRUKKOLA Mar 31, 2026

@Dampfinchen

> Would be interesting to explore if there's any truth to that, especially on long context multi agentic tasks (100K context+)

Are you positive?
If so, let's start with a case where sending a 100k+ prompt (with a fixed seed for a given model) fails with q8_0 KV but works with bf16 KV. :) If no such data exists, then what do you suggest? Does someone have to find such data first? It's unclear.

magikRUKKOLA · 2026-03-31T21:22:29Z

magikRUKKOLA
Mar 31, 2026

@milpster

> i was wondering how well kv cache quantized to q8_0 compared to that when using hadamard rotation as available in ik_llama.cpp.

AFAIK, there is no point in using Hadamard for anything higher than q5_k.
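
As a rough illustration of why a rotation tends to matter more at low bit widths, here is a toy sketch in Python. It assumes a simplified symmetric absmax quantizer on a single block of 32 values and a normalized Walsh-Hadamard transform. This is not the actual q8_0 layout or the ik_llama.cpp implementation, and the names `fwht`, `quant_roundtrip` and `compare` are made up for the example.

```python
import math

def fwht(v):
    """Normalized fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(v)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def quant_roundtrip(block, bits):
    """Toy symmetric absmax quantization of one block, then dequantization."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    d = amax / qmax  # one scale per block
    return [max(-qmax, min(qmax, round(x / d))) * d for x in block]

def rms_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Toy data (assumption): moderate values plus a single large outlier.
BLOCK = [math.sin(i) for i in range(32)]
BLOCK[5] = 20.0

def compare(bits):
    """Return (error without rotation, error with rotation) for one block."""
    plain = rms_error(BLOCK, quant_roundtrip(BLOCK, bits))
    rotated = fwht(BLOCK)
    # The normalized transform is its own inverse, so apply it again to undo it.
    restored = fwht(quant_roundtrip(rotated, bits))
    return plain, rms_error(BLOCK, restored)

for bits in (4, 8):
    plain, had = compare(bits)
    print(f"{bits}-bit: plain rms error {plain:.4f}, rotated rms error {had:.4f}")
```

In this toy setup the rotated 4-bit round trip has a lower error than the plain one, while both 8-bit errors are already small compared with the values themselves. Real KV cache data and real quantization types may behave differently.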

0 replies

magikRUKKOLA · 2026-03-31T23:18:03Z

magikRUKKOLA
Mar 31, 2026

@milpster

> Basically where i'm coming from is that i heard the recommendation to use bf16 kv cache

If I am not mistaken, bf16 gets converted to f16 when using flash attention anyway.
Now guess what the default KV cache type is.
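
For context on what such a conversion would mean numerically, here is a small Python sketch of the two 16-bit formats. It only illustrates the number formats (f16 keeps more mantissa bits, bf16 keeps more exponent range). It says nothing about what ik_llama.cpp actually does internally, and the helper names `to_f16` and `to_bf16` are made up for the example.

```python
import struct

def to_f16(x):
    """Round to IEEE half precision and back; out-of-range values become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x):
    """Round to bfloat16 and back: keep the top 16 bits of the float32 pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, then drop the low 16 bits.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

for x in (1.001, 3.14159, 1.0e5):
    b = to_bf16(x)
    print(f"x={x}: bf16={b}, f16={to_f16(x)}, bf16->f16={to_f16(b)}")
```

For values inside the f16 range, a bf16 number converts to f16 without further rounding, because f16 has more mantissa bits. Values beyond the f16 maximum (65504) do not survive the conversion.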

0 replies

magikRUKKOLA · 2026-04-01T02:41:16Z

magikRUKKOLA
Apr 1, 2026

@milpster

Below is an example of testing perplexity with Kimi-K2.5 at 4k ctx (the logs report n_ctx=4096, not 64k). I can't see any difference between q8_0 and f16.

AesSedai Kimi-K2.5 Q4_X

ayn-rand-atlas-shrugged.txt

wget -nc 'https://drive.usercontent.google.com/download?id=0B_jOTiI5YWpPNDlsejBFRThzOVk&export=download' -O 'ayn-rand-atlas-shrugged.pdf'
hash=$(sha256sum ayn-rand-atlas-shrugged.pdf | cut -d' ' -f1)
if [[ "$hash" != "6f90245574a4a558d2bc9aa48292a1d415e2a443034b0a1d7d95565b842eaf82" ]]; then rm -f ayn-rand-atlas-shrugged.pdf; echo "Woops!"; fi
apt install poppler-utils
pdftotext ./ayn-rand-atlas-shrugged.pdf
ls -lah ayn-rand-atlas-shrugged.txt

perplexity calc:

GGML_CUDA_NO_PINNED=1 numactl --interleave=all /opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-perplexity \
    -f ~/ayn-rand-atlas-shrugged.txt \
    --model /opt/AesSedai/Kimi-K2.5/Q4_X/Kimi-K2.5-Q4_X-00001-of-00014.gguf \
    --alias AesSedai/Kimi-K2.5-GGUF \
    -b $((2048)) -ub $((512)) \
    --ctx-size $((64 * 1024)) \
    --chunks 8 \
    --mlock \
    --temp 0.0 --top-k 0 --top-p 1.0 \
    -ctk f16 \
    -ctv f16 \
    --seed 42 \
    -amb 256 \
    -muge \
    --merge-qkv \
    --split-mode layer \
    --cpu-moe \
    --graph-reduce-type f16 \
    --threads 128
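
To repeat the same run for several cache types without editing the command by hand, something like the following Python sketch could be used. It only assembles the argument lists and uses a subset of the flags that appear in the command above. The helper name `perplexity_cmd` is hypothetical, and the paths are copied from the command as posted.

```python
import shlex

def perplexity_cmd(binary, model, text, cache_type, ctx):
    """Build an argv list; only flags from the command above are used."""
    return [
        binary,
        "-f", text,
        "--model", model,
        "-b", "2048", "-ub", "512",
        "--ctx-size", str(ctx),
        "--chunks", "8",
        "-ctk", cache_type,
        "-ctv", cache_type,
        "--seed", "42",
    ]

BINARY = "/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-perplexity"
MODEL = "/opt/AesSedai/Kimi-K2.5/Q4_X/Kimi-K2.5-Q4_X-00001-of-00014.gguf"
TEXT = "ayn-rand-atlas-shrugged.txt"

# One command per KV cache type to compare.
COMMANDS = {
    ct: perplexity_cmd(BINARY, MODEL, TEXT, ct, 64 * 1024)
    for ct in ("f16", "bf16", "q8_0")
}

for ct, argv in COMMANDS.items():
    print(f"# {ct}")
    print(shlex.join(argv))
```

The printed lines can be pasted into a shell, or each list can be passed to `subprocess.run` if the runs should be driven from Python.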

4k ctx (corrected from 64k), q8_0:

system_info: n_threads = 128 / 256 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 1153.6 ms
perplexity: calculating perplexity over 8 chunks, n_ctx=4096, batch_size=2048, n_seq=1
perplexity: 47.58 seconds per pass - ETA 6.33 minutes
[1]1.1022,[2]1.0840,[3]1.0996,[4]1.1049,[5]1.1149,[6]1.1213,[7]1.1416,[8]1.1639,
Final estimate: PPL over 8 chunks for n_ctx=4096 = 1.1639 +/- 0.00613

llama_print_timings:        load time =  395928.03 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  372111.95 ms / 32768 tokens (   11.36 ms per token,    88.06 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  374354.33 ms / 32769 tokens
~ggml_backend_cuda_context: have 120 graphs

4k ctx (corrected from 64k), f16:

system_info: n_threads = 128 / 256 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 1141.39 ms
perplexity: calculating perplexity over 8 chunks, n_ctx=4096, batch_size=2048, n_seq=1
perplexity: 48.21 seconds per pass - ETA 6.42 minutes
[1]1.1031,[2]1.0840,[3]1.1003,[4]1.1049,[5]1.1149,[6]1.1216,[7]1.1418,[8]1.1644,
Final estimate: PPL over 8 chunks for n_ctx=4096 = 1.1644 +/- 0.00616

llama_print_timings:        load time =  439216.94 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  378356.82 ms / 32768 tokens (   11.55 ms per token,    86.61 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  380746.52 ms / 32769 tokens
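
A quick way to put the two Kimi-K2.5 "Final estimate" lines side by side is to compare their difference with the reported uncertainties. The Python sketch below treats the two error estimates as independent, which is a simplifying assumption: both runs use the same text, so they are in fact correlated. The function name `z_score` is made up for the example.

```python
import math

def z_score(ppl_a, err_a, ppl_b, err_b):
    """Difference of two estimates in units of their combined uncertainty."""
    return abs(ppl_a - ppl_b) / math.sqrt(err_a ** 2 + err_b ** 2)

# Values copied from the two "Final estimate" lines above (n_ctx=4096).
q8_0 = (1.1639, 0.00613)
f16 = (1.1644, 0.00616)

z = z_score(*q8_0, *f16)
print(f"difference = {abs(q8_0[0] - f16[0]):.4f}, z = {z:.3f}")
```

A z value far below 1 means the difference is much smaller than the quoted error bars under this simplified reading.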

1 reply

magikRUKKOLA Apr 1, 2026

@milpster

Qwen3.5-397B-IQ4_KSS

8k ctx, q8_0 kv:

system_info: n_threads = 1 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity_v2: tokenizing the input ..
perplexity_v2: have 772507 tokens. Calculation chunk = 8448
perplexity_v2: calculating perplexity over 8 chunks, batch_size=2048
perplexity_v2: 15.52 seconds per pass - ETA 2.07 minutes
[1]4.6632,[2]5.2037,[3]5.1015,[4]5.0195,[5]5.1195,[6]5.2209,[7]5.2348,[8]5.2971,

llama_print_timings:        load time =   56825.87 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   94864.91 ms / 67584 tokens (    1.40 ms per token,   712.42 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  119392.18 ms / 67585 tokens
~ggml_backend_cuda_context: have 72 graphs

8k ctx, f16 kv:

system_info: n_threads = 1 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity_v2: tokenizing the input ..
perplexity_v2: have 772507 tokens. Calculation chunk = 8448
perplexity_v2: calculating perplexity over 8 chunks, batch_size=2048
perplexity_v2: 15.23 seconds per pass - ETA 2.02 minutes
[1]4.7001,[2]5.2141,[3]5.1042,[4]5.0233,[5]5.1266,[6]5.2252,[7]5.2419,[8]5.3031,

llama_print_timings:        load time =   56622.80 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   94976.51 ms / 67584 tokens (    1.41 ms per token,   711.59 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  119632.90 ms / 67585 tokens

8k ctx, bf16 kv:

system_info: n_threads = 1 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity_v2: tokenizing the input ..
perplexity_v2: have 772507 tokens. Calculation chunk = 8448
perplexity_v2: calculating perplexity over 8 chunks, batch_size=2048
perplexity_v2: 15.13 seconds per pass - ETA 2.02 minutes
[1]4.6840,[2]5.2134,[3]5.1015,[4]5.0252,[5]5.1239,[6]5.2207,[7]5.2350,[8]5.2936,

llama_print_timings:        load time =   56576.28 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   94737.24 ms / 67584 tokens (    1.40 ms per token,   713.38 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  119212.53 ms / 67585 tokens
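
The three 8k runs above print only the running per-chunk values and no "Final estimate" line, so here is a small Python sketch that parses those lines and compares the last values. The strings are copied from the logs above, and the helper name `running_ppl` is made up for the example.

```python
import re

# Running perplexity lines copied from the three 8k logs above.
RUNS = {
    "q8_0": "[1]4.6632,[2]5.2037,[3]5.1015,[4]5.0195,[5]5.1195,[6]5.2209,[7]5.2348,[8]5.2971,",
    "f16": "[1]4.7001,[2]5.2141,[3]5.1042,[4]5.0233,[5]5.1266,[6]5.2252,[7]5.2419,[8]5.3031,",
    "bf16": "[1]4.6840,[2]5.2134,[3]5.1015,[4]5.0252,[5]5.1239,[6]5.2207,[7]5.2350,[8]5.2936,",
}

def running_ppl(line):
    """Extract the values from a '[1]4.66,[2]5.20,...' log line."""
    return [float(v) for v in re.findall(r"\[\d+\]([0-9.]+)", line)]

FINAL = {name: running_ppl(line)[-1] for name, line in RUNS.items()}
spread = max(FINAL.values()) - min(FINAL.values())

for name, value in FINAL.items():
    print(f"{name}: {value:.4f}")
print(f"spread = {spread:.4f} ({100 * spread / FINAL['f16']:.2f}% of f16)")
```

The spread of the eighth-chunk values is 0.0095, under 0.2% of the f16 value. These logs do not include error bars, so the sketch cannot say by itself whether that spread is statistically meaningful.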

64k ctx, q8_0:


system_info: n_threads = 1 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 1447.79 ms
perplexity: calculating perplexity over 8 chunks, n_ctx=65536, batch_size=2048, n_seq=1
perplexity: 101.25 seconds per pass - ETA 13.48 minutes
[1]6.1918,[2]6.2837,[3]6.3524,[4]6.4525,[5]6.2666,[6]6.2157,[7]6.2326,[8]6.3342,
Final estimate: PPL over 8 chunks for n_ctx=65536 = 6.3342 +/- 0.02744

llama_print_timings:        load time =   56552.10 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  773645.54 ms / 524288 tokens (    1.48 ms per token,   677.69 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  796015.71 ms / 524289 tokens

64k ctx, bf16 kv:

system_info: n_threads = 1 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 1459.04 ms
perplexity: calculating perplexity over 8 chunks, n_ctx=65536, batch_size=2048, n_seq=1
perplexity: 100.08 seconds per pass - ETA 13.33 minutes
[1]6.1903,[2]6.2829,[3]6.3512,[4]6.4510,[5]6.2660,[6]6.2157,[7]6.2320,[8]6.3332,
Final estimate: PPL over 8 chunks for n_ctx=65536 = 6.3332 +/- 0.02743

llama_print_timings:        load time =   56649.36 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =  772975.81 ms / 524288 tokens (    1.47 ms per token,   678.27 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  795398.08 ms / 524289 tokens
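
Since perplexity is the exponential of the mean negative log-likelihood per token, the two 64k results can also be compared in log space. The Python sketch below does that with the values from the two "Final estimate" lines above; the variable names are made up for the example.

```python
import math

# Values copied from the two 64k "Final estimate" lines above.
ppl_q8_0, err_q8_0 = 6.3342, 0.02744
ppl_bf16, err_bf16 = 6.3332, 0.02743

# PPL = exp(mean NLL), so the gap in mean NLL is the log of the PPL ratio.
delta_nll = math.log(ppl_q8_0) - math.log(ppl_bf16)

# Relative uncertainty of one estimate, for scale.
rel_err = err_q8_0 / ppl_q8_0

print(f"mean NLL gap: {delta_nll:.2e} nats per token")
print(f"relative error of the q8_0 estimate: {rel_err:.2e}")
```

The gap in mean negative log-likelihood is on the order of 1e-4 nats per token, well below the relative uncertainty of either estimate.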

Answer selected by milpster

magikRUKKOLA · 2026-04-01T05:07:51Z

magikRUKKOLA
Apr 1, 2026

@ikawrakow

Interesting. Getting a Segfault with Qwen3.5.

[EDIT]: Oh! I probably should not use --ppl-stride 512 then?
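
The numbers in the backtrace below (n_ctx = 65792, n_vocab = 248320, j = 65279, logits of length 16337469440) allow a quick arithmetic check, sketched here in Python. One possible explanation is that the offset `j * n_vocab` no longer fits into a signed 32-bit integer at this context size. Whether perplexity.cpp really computes that offset in 32-bit arithmetic is an assumption; the thread does not confirm the cause. The helper `wrap_int32` is made up for the example.

```python
INT32_MAX = 2 ** 31 - 1

def wrap_int32(x):
    """Value a signed 32-bit integer would hold after wrapping around."""
    return (x + 2 ** 31) % 2 ** 32 - 2 ** 31

# Values copied from the gdb output below.
n_vocab = 248320
n_ctx_64k = 65792   # 65536 + 256
j = 65279           # loop index at the time of the crash

# Calculation chunk printed by the 8k perplexity_v2 runs earlier in the thread.
n_ctx_8k = 8448     # 8192 + 256

logits_len = n_ctx_64k * n_vocab
offset = j * n_vocab

print(f"logits length : {logits_len} floats ({logits_len * 4 / 2**30:.1f} GiB)")
print(f"j * n_vocab   : {offset} (INT32_MAX = {INT32_MAX})")
print(f"wrapped int32 : {wrap_int32(offset)}")
print(f"8k worst case : {n_ctx_8k * n_vocab} fits in int32: {n_ctx_8k * n_vocab <= INT32_MAX}")
```

The product n_ctx * n_vocab matches the logits length in the dump, and the largest offset at 8k still fits into 32 bits while the offset at j = 65279 does not. That is consistent with the 8k runs succeeding and the 64k run failing, but it does not prove the cause.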

[Thread 0x7fca30fc9000 (LWP 954435) exited]
[Thread 0x7fca347d0000 (LWP 954428) exited]
[Thread 0x7fca327cc000 (LWP 954432) exited]
[Thread 0x7fca36fd5000 (LWP 954423) exited]
[Thread 0x7fca307c8000 (LWP 954436) exited]
[Thread 0x7fca2ffc7000 (LWP 954437) exited]
[Thread 0x7fca2f7c6000 (LWP 954438) exited]
perplexity_v2: 480.11 seconds per pass - ETA 1 hours 4.00 minutes

Thread 1 "llama-perplexit" received signal SIGSEGV, Segmentation fault.
0x00007fffe79717bd in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007fffe79717bd in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00005555555fc356 in std::uninitialized_copy<__gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > >, float*> (__first=<error reading variable: Cannot access memory at address 0x7f8d1f048810>,
    __last=<error reading variable: Cannot access memory at address 0x7f8d1f13b010>, __result=0x55556db43bc0)
    at /usr/include/c++/15/bits/stl_uninitialized.h:273
#2  0x00005555555f9dc4 in std::__uninitialized_copy_a<__gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > >, __gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > >, float*, float> (
    __first=<error reading variable: Cannot access memory at address 0x7f8d1f048810>,
    __last=<error reading variable: Cannot access memory at address 0x7f8d1f13b010>, __result=0x55556db43bc0)
    at /usr/include/c++/15/bits/stl_uninitialized.h:635
#3  0x00005555555f4b8e in std::vector<float, std::allocator<float> >::_M_range_initialize_n<__gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > >, __gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > > > (
    this=0x7fffffffaf80, __first=<error reading variable: Cannot access memory at address 0x7f8d1f048810>,
    __last=<error reading variable: Cannot access memory at address 0x7f8d1f13b010>, __n=248320)
    at /usr/include/c++/15/bits/stl_vector.h:1989
#4  0x00005555555f006e in std::vector<float, std::allocator<float> >::vector<__gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > >, void> (this=0x7fffffffaf80, __first=<error reading variable: Cannot access memory at address 0x7f8d1f048810>,
    __last=<error reading variable: Cannot access memory at address 0x7f8d1f13b010>, __a=...) at /usr/include/c++/15/bits/stl_vector.h:746
#5  0x00005555555dc52e in perplexity_v2 (ctx=0x5555638c0530, params=...)
    at /opt/ik_llama.cpp/ik_llama.cpp/examples/perplexity/perplexity.cpp:458
#6  0x00005555555dca51 in perplexity (ctx=0x5555638c0530, params=..., n_ctx=65536)
    at /opt/ik_llama.cpp/ik_llama.cpp/examples/perplexity/perplexity.cpp:482
#7  0x00005555555e74e9 in main (argc=83, argv=0x7fffffffdb98) at /opt/ik_llama.cpp/ik_llama.cpp/examples/perplexity/perplexity.cpp:2065

(gdb) bt full
#0  0x00007fffe79717bd in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1  0x00005555555fc356 in std::uninitialized_copy<__gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > >, float*> (__first=<error reading variable: Cannot access memory at address 0x7f8d1f048810>,
    __last=<error reading variable: Cannot access memory at address 0x7f8d1f13b010>, __result=0x55556db43bc0)
    at /usr/include/c++/15/bits/stl_uninitialized.h:273
        __n = 248320
#2  0x00005555555f9dc4 in std::__uninitialized_copy_a<__gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > >, __gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > >, float*, float> (
    __first=<error reading variable: Cannot access memory at address 0x7f8d1f048810>,
    __last=<error reading variable: Cannot access memory at address 0x7f8d1f13b010>, __result=0x55556db43bc0)
    at /usr/include/c++/15/bits/stl_uninitialized.h:635
No locals.
#3  0x00005555555f4b8e in std::vector<float, std::allocator<float> >::_M_range_initialize_n<__gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > >, __gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > > > (
    this=0x7fffffffaf80, __first=<error reading variable: Cannot access memory at address 0x7f8d1f048810>,
    __last=<error reading variable: Cannot access memory at address 0x7f8d1f13b010>, __n=248320)
    at /usr/include/c++/15/bits/stl_vector.h:1989
        __start = 0x55556db43bc0
#4  0x00005555555f006e in std::vector<float, std::allocator<float> >::vector<__gnu_cxx::__normal_iterator<float*, std::vector<float, std::allocator<float> > >, void> (this=0x7fffffffaf80, __first=<error reading variable: Cannot access memory at address 0x7f8d1f048810>,
    __last=<error reading variable: Cannot access memory at address 0x7f8d1f13b010>, __a=...) at /usr/include/c++/15/bits/stl_vector.h:746
        __n = 248320
        __n = <optimized out>
#5  0x00005555555dc52e in perplexity_v2 (ctx=0x5555638c0530, params=...)
    at /opt/ik_llama.cpp/ik_llama.cpp/examples/perplexity/perplexity.cpp:458
        tok_logits = std::vector of length 0, capacity 248320
        prob = 0
        j = 65279
        end = 65792
        num_batches = 33
        logits = std::vector of length 16337469440, capacity 32547799040 = {5.37177658, 8.03621674, 7.52868605, 6.32745934, 3.85191226,
          3.70060468, 6.30780315, 7.26194143, 7.83440351, 7.25559473, 6.4135375, 5.93045712, 7.39800024, 8.20700645, 6.16542435,
          7.03815937, 8.46436214, 7.66030645, 6.89105511, 6.34279633, 6.28064299, 5.48187876, 5.21680737, 5.87006235, 6.13227797,
          7.6898365, 6.45092106, 5.17230225, 5.89929104, 5.32410574, 3.22509813, 5.20883751, 8.64043713, 7.38054466, 7.05490446,
          7.74955273, 7.87623692, 7.75324965, 7.04416466, 5.84781837, 8.00580692, 6.40156603, 6.41605711, 7.91536188, 6.35764265,
          6.81698513, 5.80734587, 7.14266682, 4.70078087, 6.91059542, 7.29101944, 7.67421389, 6.40587234, 6.9440527, 6.86338758,
          5.25766563, 5.41511488, 5.38861704, 7.17822456, 7.10878134, 6.22065496, 4.76229525, 5.54861593, 2.96887875, 7.64295197,
          5.390872, 4.79868555, 4.43484497, 6.77907705, 6.19098854, 5.00169659, 5.83159208, 5.76409054, 5.1004591, 4.41728163, 7.57873154,
          5.7718482, 7.10335636, 5.06527328, 5.86280537, 3.49555683, 7.37776375, 7.63828278, 6.0880599, 5.86679792, 4.94591808,
          5.76158524, 6.2307601, 6.02112341, 5.8106389, 5.45629787, 4.34285402, 6.5829587, 2.23821115, 0.324911207, 0.143318042,
          -0.105972864, -0.880836725, -1.95416069, -0.687838972, -0.427717984, -0.967862964, 1.2163142, 0.0740813538, -0.483913928,
          -1.32869864, -0.880546868, -1.33325398, -0.0112506533, -1.213135, -1.18345261, -0.0453275628, 0.115652077, -1.09044075,
          -0.78650862, 0.175811827, 0.240634903, -1.24453056, -0.180296943, -2.08426332, -0.196902946, -0.616328597, -1.95760608,
          0.410465747, -4.97515821, -4.97515821, 3.43419218, 2.94880247, 1.6928575, 3.28403997, 1.11196303, 1.11030197, 0.764343083,
          1.86917937, 2.27176332, 3.31768727, 2.55820489, 0.88756907, 0.969912529, 1.23616421, 0.585639358, -0.488767713, 1.64798081,
          0.686187208, 2.28047442, 1.23707247, 1.01784575, 1.9902643, 0.719053566, 2.85736942, 1.4046452, 2.71138549, 0.226571783,
          -0.747447252, -0.529561341, -1.26806772, -0.271738112, 1.42543685, 2.69376421, 0.206167474, -0.117487341, -1.65945005,
          -0.382600516, -0.301978558, -0.864809334, 2.30226111, 0.314517051, 0.431895941, 0.167901963, -0.587988675, 0.967787087,
          3.03864288, 0.811101973, -2.46478629, -1.03426754, -2.15076613, -1.70969236, -4.97515821, -4.97515821, -4.97515821, -4.97515821,
          -4.97515821, -4.97730875, -4.97730875, -4.97730875, -4.97730875, -4.97730875, -4.97730875, 0.302295774, 1.27664065, 1.86175561,
          1.27425969, 1.15398586, 1.76751149, 0.41406697, 1.87962174, 2.09321332, 5.22243261, 8.90007973, 0.964737713...}
        t_start = std::chrono::sys_time = { 1775019199595409540ns [2026-04-01 04:53:19] }
        start = 0
        t_end = std::chrono::sys_time = { 1775019679708827217ns [2026-04-01 05:01:19] }
        i = 0
        add_bos = false
        __func__ = "perplexity_v2"
        tokens = std::vector of length 772507, capacity 3224968 = {200, 2473, 314, 34672, 198, 3723, 5577, 198, 13155, 5577, 198, 35, 290,
          19669, 198, 35924, 271, 32379, 353, 471, 19950, 12, 5609, 2301, 32667, 3461, 198, 84424, 353, 471, 3067, 83663, 198, 84424,
          7696, 471, 3067, 6612, 6622, 198, 84424, 14016, 471, 3067, 24219, 3449, 3067, 87454, 198, 84424, 16332, 471, 3067, 6314, 63793,
          3380, 11090, 69772, 198, 84424, 629, 471, 3067, 6759, 1728, 2860, 2913, 3067, 414, 13607, 1058, 5609, 30778, 198, 84424, 28699,
          471, 3067, 19950, 7415, 1833, 44, 26545, 6014, 198, 84424, 43469, 471, 3067, 13073, 1537, 922, 4180, 3449, 3067, 13073, 1537,
          75815, 198, 84424, 55316, 471, 3067, 72614, 469, 34643, 29840, 198, 84424, 38397, 471, 3067, 86841, 6301, 3449, 3067, 24705,
          26915, 198, 84424, 1543, 471, 457, 56, 20951, 38290, 67590, 2080, 271, 32379, 7696, 471, 90256, 12, 842, 198, 84424, 353, 471,
          3067, 24915, 37914, 65951, 7301, 1425, 6004, 458, 73741, 198, 84424, 7696, 471, 3067, 6064, 3718, 90274, 42295, 2913, 387, 1377,
          198, 84424, 14016, 471, 41352, 35434, 18917, 198, 84424, 16332, 471, 3067, 57684, 3461, 2913, 3067, 629, 14376, 1728, 198,
          84424, 629, 471, 60256, 34291, 74401, 45, 198, 84424, 28699, 471, 234960, 72160, 35602, 939, 198, 84424, 43469, 471, 3067...}
        n_ctx = 65792
        logit_history = std::vector of length 772507, capacity 772507 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0...}
        prob_history = std::vector of length 772507, capacity 772507 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0...}
        calc_chunk = 65792
        n_chunk_max = 1381
        n_chunk = 8
        n_vocab = 248320
        n_batch = 2048
        count = 0
        nll = 0
#6  0x00005555555dca51 in perplexity (ctx=0x5555638c0530, params=..., n_ctx=65536)
    at /opt/ik_llama.cpp/ik_llama.cpp/examples/perplexity/perplexity.cpp:482
        add_bos = false
        logits_stream = <incomplete type>
        __func__ = "perplexity"
        tim1 = std::chrono::sys_time = { 140737078435110ns [1970-01-02 15:05:37] }
        tokens = std::vector of length 0, capacity 884974128905409727 = {
          <error reading variable tokens (Cannot access memory at address 0x1)>
        tim2 = std::chrono::sys_time = { 140737077952173ns [1970-01-02 15:05:37] }
        logit_history = std::vector of length 35184270646272, capacity 884974128905409748 = {
          <error reading variable logit_history (Cannot access memory at address 0x0)>
        prob_history = std::vector of length 35184269488025, capacity 35184270646466 = {
          <error reading variable prob_history (Cannot access memory at address 0x0)>
        n_chunk_max = -2147483648
        n_chunk = 6
        n_vocab = 0
        n_batch = 2
        count = 32767
        nll = 0
        nll2 = 6.9533558068547781e-310
        num_batches = 1432266718
        n_seq = 0
        batch = {n_tokens = 0, token = 0x30203d2044, embd = 0x7fffffffb240, pos = 0x0, n_seq_id = 0x4d49535f4d534100,
          seq_id = 0x80000030203d2044, logits = 0x7fffffffb260 "BLAS = 1 | ", all_pos_0 = 11, all_pos_1 = 0, all_seq_id = 1396788290}
        logits = std::vector of length 1960060, capacity -35184126838788 = {-3.84800103e-09, 4.59163468e-41, -3.85075793e-09,
          4.59163468e-41, -3.84883947e-09, 4.59163468e-41, -3.91821686e-09, 4.59163468e-41, -3.84884657e-09, 4.59163468e-41,
          -3.84885368e-09, 4.59163468e-41, -3.91854371e-09, 4.59163468e-41, -3.86751253e-09, 4.59163468e-41, -3.91779054e-09,
          4.59163468e-41, -3.84886789e-09, 4.59163468e-41, 0, 0, 0, 0, -3.84799392e-09, 4.59163468e-41, -3.85075083e-09, 4.59163468e-41,
          -3.84881105e-09, 4.59163468e-41, -3.91688104e-09, 4.59163468e-41, -3.84881815e-09, 4.59163468e-41, -3.84882526e-09,
          4.59163468e-41, -3.91720789e-09, 4.59163468e-41, -3.86296506e-09, 4.59163468e-41, -3.91639787e-09, 4.59163468e-41,
          -3.84883236e-09, 4.59163468e-41, 0, 0, 0, 0, -3.84798682e-09, 4.59163468e-41, -3.85074372e-09, 4.59163468e-41, -3.84878263e-09,
          4.59163468e-41, -3.9155168e-09, 4.59163468e-41, -3.84878973e-09, 4.59163468e-41, -3.84879684e-09, 4.59163468e-41,
          -3.91584365e-09, 4.59163468e-41, -3.86580723e-09, 4.59163468e-41, -3.91509047e-09, 4.59163468e-41, -3.84880394e-09,
          4.59163468e-41, 0, 0, 0, 0, -3.84797971e-09, 4.59163468e-41, -3.85073662e-09, 4.59163468e-41, -3.8487471e-09, 4.59163468e-41,
          -3.91417387e-09, 4.59163468e-41, -3.8487542e-09, 4.59163468e-41, -3.84876131e-09, 4.59163468e-41, -3.91450072e-09,
          4.59163468e-41, -3.86850729e-09, 4.59163468e-41, -3.91374755e-09, 4.59163468e-41, -3.84877552e-09, 4.59163468e-41, 0, 0, 0, 0,
          -3.84797261e-09, 4.59163468e-41, -3.85072951e-09, 4.59163468e-41, -3.84871868e-09, 4.59163468e-41, -3.91303701e-09,
          4.59163468e-41, -3.84872578e-09, 4.59163468e-41, -3.84873289e-09, 4.59163468e-41, -3.9131578e-09, 4.59163468e-41,
          -3.86323507e-09, 4.59163468e-41, -3.91280253e-09, 4.59163468e-41, -3.84873999e-09, 4.59163468e-41, 0, 0, 0, 0, -3.8479655e-09,
          4.59163468e-41, -3.85072241e-09, 4.59163468e-41, -3.84869026e-09, 4.59163468e-41, -3.91238331e-09, 4.59163468e-41,
          -3.84869736e-09, 4.59163468e-41, -3.84870447e-09, 4.59163468e-41, -3.9125041e-09, 4.59163468e-41, -3.86431509e-09,
          4.59163468e-41, -3.91214883e-09, 4.59163468e-41, -3.84871157e-09, 4.59163468e-41, 0, 0, 0, 0, -3.8479584e-09, 4.59163468e-41,
          -3.8507153e-09, 4.59163468e-41, -3.84866183e-09, 4.59163468e-41, -3.91144539e-09, 4.59163468e-41, -3.84866894e-09,
          4.59163468e-41, -3.84867604e-09, 4.59163468e-41, -3.91156618e-09, 4.59163468e-41, -3.86094001e-09, 4.59163468e-41,
          -3.91113986e-09, 4.59163468e-41, -3.84868315e-09, 4.59163468e-41, 0, 0, 0, 0, -3.84795129e-09, 4.59163468e-41, -3.8507082e-09,
          4.59163468e-41, -3.84862631e-09, 4.59163468e-41, -3.91043642e-09, 4.59163468e-41, -3.84863341e-09, 4.59163468e-41,
          -3.84864052e-09, 4.59163468e-41, -3.91055721e-09, 4.59163468e-41, -3.86637566e-09, 4.59163468e-41, -3.91020194e-09,
          4.59163468e-41, -3.84865473e-09, 4.59163468e-41, 0, 0, 0, 0, -3.84794419e-09, 4.59163468e-41, -3.85070109e-09, 4.59163468e-41,
          -3.84859078e-09, 4.59163468e-41, -3.9094985e-09, 4.59163468e-41...}
        workers = std::vector of length 0, capacity -289360691350225892
        log_probs = std::vector of length 0, capacity 3255307777713450285
        first = -19024
        ppl = 6.9533558069033941e-310
#7  0x00005555555e74e9 in main (argc=83, argv=0x7fffffffdb98) at /opt/ik_llama.cpp/ik_llama.cpp/examples/perplexity/perplexity.cpp:2065
        params = {devices = "", seed = 3407, n_threads = 1, n_threads_batch = -1, n_predict = -1, n_ctx = 65792, n_batch = 2048,
          n_ubatch = 512, n_keep = -1, n_chunks = 8, n_parallel = 1, n_sequences = 1, p_split = 0.100000001, n_gpu_layers = 99,
          main_gpu = 0, max_gpu = 2, ncmoe = 0, fit_margin = 0, fit = false, worst_graph_tokens = 0, tensor_split = {
            0 <repeats 128 times>}, grp_attn_n = 1, grp_attn_w = 512, n_print = -1, rope_freq_base = 0, rope_freq_scale = 0,
          yarn_ext_factor = -1, yarn_attn_factor = -1, yarn_beta_fast = -1, yarn_beta_slow = -1, yarn_orig_ctx = 0, defrag_thold = -1,
          ban_phrases_bias = -999, max_extra_alloc_MiB = 256, nrep = 1, cb_eval = 0x0, cb_eval_user_data = 0x0,
          numa = GGML_NUMA_STRATEGY_DISABLED, split_mode = LLAMA_SPLIT_MODE_GRAPH,
          rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED, pooling_type = LLAMA_POOLING_TYPE_UNSPECIFIED,
          attention_type = LLAMA_ATTENTION_TYPE_UNSPECIFIED, sparams = {n_prev = 64, n_probs = 0, min_keep = 0, top_k = 0, top_p = 1,
            min_p = 0.0500000007, tfs_z = 1, typical_p = 1, temp = 0, dynatemp_range = 0, dynatemp_exponent = 1, penalty_last_n = 64,
            penalty_repeat = 1, penalty_freq = 0, penalty_present = 0, dry_multiplier = 0, dry_base = 1.75, dry_allowed_length = 2,
            dry_penalty_last_n = 65792, total_context_size = 16840, mirostat = 0, mirostat_tau = 5, mirostat_eta = 0.100000001,
            xtc_probability = 0, xtc_threshold = 1, top_n_sigma = 0, adaptive_target = -1, adaptive_decay = 0.899999976,
            adaptive_updt_w_cur = false, penalize_nl = false, seed = 3407, dry_sequence_breakers = std::vector of length 4, capacity 4 = {
              "\n", ":", "\"", "*"}, samplers_sequence = std::vector of length 11, capacity 11 = {llama_sampler_type::DRY,
              llama_sampler_type::TOP_K, llama_sampler_type::TFS_Z, llama_sampler_type::TYPICAL_P, llama_sampler_type::TOP_P,
              llama_sampler_type::MIN_P, llama_sampler_type::XTC, llama_sampler_type::TOP_N_SIGMA, llama_sampler_type::TEMPERATURE,
              llama_sampler_type::ADAPTIVE_P, llama_sampler_type::DIST}, grammar = "", grammar_lazy = false,
            grammar_triggers = std::vector of length 0, capacity 0, preserved_tokens = std::set with 0 elements, cfg_negative_prompt = "",
            cfg_scale = 1, logit_bias = std::unordered_map with 0 elements, penalty_prompt_tokens = std::vector of length 0, capacity 0,
            use_penalty_prompt_tokens = false}, speculative = {type = COMMON_SPECULATIVE_TYPE_NONE, devices = "", params = "",
            n_threads = -1, n_threads_batch = -1, n_max = 16, n_min = 0, p_split = 0.100000001, p_min = 0.75, ngram_size_n = 12,
            ngram_size_m = 48, ngram_min_hits = 1, ngram_mod = std::shared_ptr<common_ngram_mod> (empty) = {get() = 0x0},
            lookup_cache_static = "", lookup_cache_dynamic = "", mparams_dft = {path = "", url = "", hf_repo = "", hf_file = "",
              docker_repo = ""}, model_dft = 0x0, cparams_dft = {seed = 6, n_ctx = 0, n_batch = 1769108595, n_ubatch = 26478,
              n_seq_max = 18, n_threads = 0, n_threads_batch = 1446630048, max_extra_alloc = 21845, worst_case_tokens = 21,
              rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_NONE, pooling_type = 21, attention_type = LLAMA_ATTENTION_TYPE_CAUSAL,
              rope_freq_base = -1.02669461e+34, rope_freq_scale = 4.59163468e-41, yarn_ext_factor = 5.10787039e+13,
              yarn_attn_factor = 3.0611365e-41, yarn_beta_fast = 5.10788381e+13, yarn_beta_slow = 3.0611365e-41,
              yarn_orig_ctx = 1446630112, defrag_thold = 3.0611365e-41, cb_eval = 0x7fffffffd058, cb_eval_user_data = 0x4,
              type_k = 1819047278, type_v = GGML_TYPE_F32, type_reduce = 4154540696, logits_all = 255, embeddings = 127,
              offload_kqv = false, flash_attn = false, mla_attn = -12168, attn_max_batch = 32767, fused_moe_up_gate = 12,
              grouped_expert_routing = false, fused_up_gate = false, fused_mmad = false, rope_cache = false, graph_reuse = false,
              min_experts = 1819635234, thresh_experts = 1.26871591e+31, only_active_experts = 112, k_cache_hadamard = 97,
              v_cache_hadamard = 99, split_mode_graph_scheduling = 101, scheduler_async = false, mtp = 127, mtp_op_type = MTP_OP_NONE,
              abort_callback = 0x0, abort_callback_data = 0x0, offload_policy = 0x7fffffffd0b0, cuda_params = 0x7}, n_ctx = 0,
            n_gpu_layers = -1, model = "", replacements = std::vector of length 0, capacity 0, cache_type_k = "", cache_type_v = ""},
          model = "/opt/ubergarm/Qwen3.5-397B-A17B-GGUF/IQ4_KSS/mist.bin", model_alias = "ubergarm/Qwen3.5-397B-A17B-IQ4_KSS",
          model_url = "", hf_token = "", hf_repo = "", hf_file = "",
          prompt = "\fTable of Contents\nTitle Page\nCopyright Page\nDedication\nIntroduction\n\nPART I - NON-CONTRADICTION\nCHAPTER I - THE THEME\nCHAPTER II - THE CHAIN\nCHAPTER III - THE TOP AND THE BOTTOM\nCHAPTER IV - THE IMMO"..., prompt_file = "/root/atlas.txt",
          prompt_is_binary = false, path_prompt_cache = "", input_prefix = "", input_suffix = "", logdir = "/var/log/",
          lookup_cache_static = "", lookup_cache_dynamic = "/root/.cache/ik_llama.cpp/slot.bin", logits_file = "", rpc_servers = "",
          cuda_params = "fusion=1", in_files = std::vector of length 0, capacity 0, antiprompt = std::vector of length 0, capacity 0,
          ban_phrases = std::vector of length 0, capacity 0, banned_n = 1, n_buffer = 0, can_ban_phrases = true,
          kv_overrides = std::vector of length 0, capacity 0, tensor_buft_overrides = std::vector of length 0, capacity 0,
          offload_policy = std::vector of length 0, capacity 0, lora_init_without_apply = false,
          lora_adapters = std::vector of length 0, capacity 0, control_vectors = std::vector of length 0, capacity 0, verbosity = 2,
          control_vector_layer_start = -1, control_vector_layer_end = -1, ppl_stride = 512, ppl_output_type = 0, hellaswag = false,
          hellaswag_tasks = 400, winogrande = false, winogrande_tasks = 0, multiple_choice = false, multiple_choice_tasks = 0,
          kl_divergence = false, usage = false, use_color = false, special = false, interactive = false, interactive_first = false,
          conversation = false, prompt_cache_all = false, prompt_cache_ro = false, ctx_shift = true, escape = true,
          multiline_input = false, simple_io = false, cont_batching = true, flash_attn = true, mla_attn = 3, attn_max_batch = 256,
          fused_moe_up_gate = true, fused_up_gate = true, fused_mmad = true, grouped_expert_routing = true, rope_cache = false,
          graph_reuse = true, min_experts = -1, thresh_experts = 0, input_prefix_bos = false, ignore_eos = false, logits_all = true,
          use_mmap = true, use_mlock = true, verbose_prompt = true, display_prompt = true, infill = false, dump_kv_cache = false,
          no_kv_offload = false, warmup = true, batch_warmup = true, check_tensors = false, repack_tensors = false, use_thp = false,
          validate_quants = false, only_active_exps = true, merge_qkv = false, merge_up_gate_exps = true, k_cache_hadamard = false,
          v_cache_hadamard = false, split_mode_graph_scheduling = true, scheduler_async = true, fused_delta_net = 0, has_mtp = false,
          cache_type_k = "bf16", cache_type_v = "bf16", reduce_type = "f32", mmproj = {path = "", url = "", hf_repo = "", hf_file = "",
            docker_repo = ""}, mmproj_use_gpu = true, no_mmproj = false, image = std::vector of length 0, capacity 0,
          image_min_tokens = -1, image_max_tokens = -1, embedding = false, embd_normalize = 2, embd_out = "", embd_sep = "\n",
          port = 8080, timeout_read = 600, timeout_write = 600, n_threads_http = -1, send_done = false, hostname = "0.0.0.0",
          public_path = "", chat_template = "", use_jinja = true, use_peg = false, system_prompt = "", enable_chat_template = true,
          reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK, think_tokens = {exclude = false, begin = "<think>", end = "</think>"},
          reasoning_budget = -1, prefill_assistant = true, dry_run = false, api_keys = std::vector of length 0, capacity 0,
          ssl_file_key = "", ssl_file_cert = "", default_template_kwargs = std::map with 0 elements, webui = COMMON_WEBUI_AUTO,
          endpoint_slots = true, endpoint_props = false, endpoint_metrics = true, log_json = false,
          slot_save_path = "/root/.cache/ik_llama.cpp/slot.bin/", sql_save_file = "", sqlite_zstd_ext_file = "",
          slot_prompt_similarity = 0.5, do_checkpoint = false, ctx_checkpoints_n = 1024, ctx_checkpoints_interval = 256,
          ctx_checkpoints_tolerance = 5, cache_ram_mib = 32768, cache_ram_n_min = 0, cache_ram_similarity = 0.5, is_pp_shared = false,
          n_pp = std::vector of length 0, capacity 0, n_tg = std::vector of length 0, capacity 0,
          n_pl = std::vector of length 0, capacity 0, context_files = std::vector of length 0, capacity 0, chunk_size = 64,
          chunk_separator = "\n", n_junk = 250, i_pos = -1, out_file = "imatrix.dat", output_tensor_name = "output.weight",
          n_out_freq = 10, n_save_freq = 0, i_chunk = 0, process_output = false, compute_ppl = true, n_pca_batch = 100,
          n_pca_iterations = 1000, cvector_dimre_method = DIMRE_METHOD_PCA, cvector_outfile = "control_vector.gguf",
          cvector_positive_file = "examples/cvector-generator/positive.txt",
          cvector_negative_file = "examples/cvector-generator/negative.txt", spm_infill = false,
          lora_outfile = "ggml-lora-merged-f16.gguf", sweep_bench_output_jsonl = false, minilog = false}
        n_ctx = 65536
        __func__ = "main"
        ppl = true
        rng = {static state_size = 624, _M_x = {3407, 3092088620, 1311765288, 1948966960, 2486938713, 2235871980, 2757621228, 1127245037,
            3129875748, 2075448839, 4144735848, 1687481154, 326937723, 64675732, 873228658, 661756937, 1821532317, 2679609757, 2348724941,
            3235271358, 3321940389, 365126803, 3941488917, 4115655621, 2283731766, 387182493, 2405666315, 2273893032, 1755922222,
            1809781928, 3450902475, 777906695, 3883668451, 3317086081, 2782460012, 3197034889, 1944018683, 1372209863, 3723644484,
            1416487210, 1762056479, 2981541631, 2716767995, 3199979880, 890329342, 3479833187, 1560508430, 705127706, 1169935730,
            640653712, 2095425026, 4203158114, 3373420921, 3559550807, 3852329050, 3267896404, 2798249355, 3963412550, 99379315,
            961527962, 2260690686, 3095690665, 1194530741, 79932995, 2706021295, 3447324802, 3404610599, 1663630967, 3432221394,
            2738265530, 2398859742, 3363825683, 2786533016, 3031792907, 2730698199, 2333076052, 302611002, 4290528047, 191696042,
            4000512353, 3296712186, 3586174094, 2771814643, 3361996904, 3130682763, 239765602, 2568110336, 3356211233, 2467813314,
            2971524889, 1525381377, 833076571, 3976226883, 2461229469, 2673646361, 273130758, 3468579262, 1660904690, 40755009,
            1218043400, 3670436593, 2216801247, 1368931223, 2450311317, 2793730811, 2303335846, 3549853214, 1985814236, 2622255773,
            3386462248, 1137747045, 403070947, 1191792639, 1517125031, 3170527472, 2274552301, 1437610687, 2765992811, 3831101923,
            2472048087, 2378726017, 3611128104, 150645105, 4255732496, 4207407611, 791162709, 1132603911, 4177644253, 1733810198,
            1693290644, 3291269195, 919702763, 3280165947, 2494913181, 1197791297, 205754823, 4028433163, 3075046065, 2173449257,
            3043530882, 3076565772, 2853174035, 2938185539, 1423787572, 760352889, 1266696526, 2310409149, 2807355886, 4069344944,
            3212292148, 822045668, 683578763, 3422905199, 2823358005, 3634619213, 1282370657, 1066738300, 649000841, 1983838891,
            3986939825, 2856027098, 3186728153, 3946911241, 2009009301, 1143417608, 3535457330, 1548732667, 350098761, 1979807605,
            2343384429, 385318005, 1148303316, 3205596341, 2532330464, 2425873112, 2152278705, 1939645519, 3604779383, 3104339830,
            3161470327, 924038365, 1956356326, 617814745, 3365321044, 2449977099, 1324539974, 902396861, 702609228, 3173858744, 16246047,
            1665400057, 385582743, 241428819, 2324527488, 2321908492, 1787900745, 2259862572, 2496005355, 421503411, 3337131878...},
          _M_p = 624}
        llama_init = {model = 0x5555563a2cb0, context = 0x5555638c0530, lora_adapters = std::vector of length 0, capacity 0}
        model = 0x5555563a2cb0
        ctx = 0x5555638c0530
        n_ctx_train = 262144
        results = {tokens = std::vector of length 0, capacity 0, ppl_value = 4.6355712809931998e-310,
          logits = std::vector of length 0, capacity 0, probs = std::vector of length 0, capacity 0}

0 replies

magikRUKKOLA · 2026-04-01T06:27:51Z

magikRUKKOLA
Apr 1, 2026

@milpster

See the PPL values for the long context above.

Qwen3.5 for example.
6.3342 vs 6.3332

As far as I can see, there is no difference between bf16 and q8_0. Like, at all.
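To put the two Qwen3.5 numbers in proportion, here is a minimal sketch that expresses the gap as a fraction of the mean perplexity. The helper name `relative_gap` is made up for illustration, and the only inputs are the two PPL values quoted above.

```python
def relative_gap(ppl_a: float, ppl_b: float) -> float:
    """Absolute difference between two perplexities, divided by their mean."""
    return abs(ppl_a - ppl_b) / ((ppl_a + ppl_b) / 2.0)


# The two Qwen3.5 long-context PPL values quoted in this post.
qwen_gap = relative_gap(6.3342, 6.3332)
print(f"relative gap: {qwen_gap:.4%}")
```

That works out to a gap on the order of a hundredth of a percent.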

Same goes for Kimi-K2.5:

Here is bf16:

SM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 1140.32 ms
perplexity: calculating perplexity over 8 chunks, n_ctx=65536, batch_size=2048, n_seq=1
perplexity: 1022.65 seconds per pass - ETA 2 hours 16.35 minutes
[1]1.3886,[2]1.4678,[3]1.4577,[4]1.4504,[5]1.4919,[6]1.4848,[7]1.4902,[8]1.5095,
Final estimate: PPL over 8 chunks for n_ctx=65536 = 1.5095 +/- 0.00328

and here goes the q8_0:

system_info: n_threads = 128 / 256 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 1151.29 ms
perplexity: calculating perplexity over 8 chunks, n_ctx=65536, batch_size=2048, n_seq=1
perplexity: 1020.00 seconds per pass - ETA 2 hours 15.98 minutes
[1]1.3896,[2]1.4687,[3]1.4581,[4]1.4506,

I do believe the final delta will be about 0.01, so no difference.

[EDIT]: okay, I added two more points and am cancelling this stupid test. :) There is no difference between q8_0 and bf16/f16 cache.
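For anyone who wants to check the Kimi-K2.5 numbers chunk by chunk, a rough sketch follows. It assumes the bracketed values in the logs are running PPL estimates after each chunk, and it uses the +/- 0.00328 from the bf16 final estimate only as a loose yardstick, since that figure applies to the full 8-chunk result and not to the intermediate points. The variable names are made up for illustration.

```python
# Running PPL values copied from the two Kimi-K2.5 logs above.
bf16_running = [1.3886, 1.4678, 1.4577, 1.4504, 1.4919, 1.4848, 1.4902, 1.5095]
q8_0_running = [1.3896, 1.4687, 1.4581, 1.4506]  # run cancelled after 4 chunks

# Uncertainty reported for the bf16 final estimate (8 chunks).
bf16_stderr = 0.00328

# Compare only the chunks both runs completed; zip stops at the shorter list.
deltas = [round(q - b, 4) for b, q in zip(bf16_running, q8_0_running)]
max_delta = max(abs(d) for d in deltas)

print("per-chunk delta (q8_0 - bf16):", deltas)
print("largest delta:", max_delta)
print("within reported bf16 uncertainty:", max_delta < bf16_stderr)
```

On the four chunks both runs share, the largest gap is smaller than the uncertainty reported for the bf16 run.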

0 replies
