Bug
When feeding the same llama.cpp-native GGUF to both apr run --json and
llama-completion with greedy decoding (--temp 0 --top-k 1), the two runtimes
produce different text for 11/12 test cases. Only SmolLM-135M/completion matches.
Results Matrix (3 models × 4 prompts)
| Model | Prompt | apr Output (first 80 chars) | llama-completion Output (first 80 chars) | Match? |
|-------|--------|-----------------------------|------------------------------------------|--------|
| SmolLM-135M | arithmetic | 2 | 2\n\nQuestion 10.\nWhat is 2+2? Answer: 2\n | DIFFER |
| SmolLM-135M | code | if n:\n return fib(n)\n if n == 0: | \n if n <= 1:\n return n\n return fibonacci(n | DIFFER |
| SmolLM-135M | completion | Paris, the capital of France, the country of France | Paris. It is the largest city in France | DIFFER |
| SmolLM-135M | greeting | Alex, I'm here to help you! I'm here to learn | Alex. I'm 13 years old and I love playing | DIFFER |
| Qwen2-0.5B | arithmetic | The answer to the question "What is 2+2?" is:\nuser | The answer to the question "What is 2+2?" is:\n\n> EOF | DIFFER |
| Qwen2-0.5B | code | The Fibonacci sequence is a series of numbers in | The Fibonacci sequence is a series of numbers in | ALMOST (diverge after ~60 chars) |
| Qwen2-0.5B | completion | The capital of France is Paris.\nThe capital of Fr | The capital of France is Paris.\n\n> EOF by user | DIFFER |
| Qwen2-0.5B | greeting | (empty) | assistant\n\n> EOF by user | DIFFER |
| GPT-2 | arithmetic | I in in in in in in in in in in in in in all | \n\nThe answer is 2+2.\n\nThe answer is 2+2. | DIFFER |
| GPT-2 | code | in in ways ways ways\n\n in in in in in in | \n\nfor n in range(n):\n\nreturn fibonacci(n) | DIFFER |
| GPT-2 | completion | all all all all all all all all all all all all | the capital of the French Republic. The French | DIFFER |
| GPT-2 | greeting | and we we we we we we we we we we we we we we | K. I am a student at the University of California | DIFFER |
Cross-runtime exact text match (pytest xfail, test_cross_runtime_text_match):
- 1 XPASS (claim survived): smollm-135m/completion
- 11 XFAIL (claims falsified): all other combinations
Key Observations
- apr GPT-2 output is degenerate: it produces repetitive single-word sequences ("all all all", "we we we we", "in in in") while llama-completion produces coherent text from the same GGUF. This strongly suggests apr's weight loading or inference for the GPT-2 architecture is broken (a rough repetition metric is sketched after this list).
- Qwen2 diverges after initial agreement: both runtimes start with the same text ("The Fibonacci sequence is a series of numbers in") and then diverge, suggesting tokenizer or sampling differences rather than weight-loading errors.
- SmolLM is closest: SmolLM/completion produces an exact text match. The other SmolLM prompts diverge, but both outputs stay coherent, suggesting the core inference is close with edge-case differences.
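The "degenerate" claim can be made quantitative with a distinct-token ratio over the apr GPT-2 strings copied from the results matrix above; a ratio near 1.0 means varied text, while a ratio near 0.1 means a repetition loop. A minimal sketch (illustrative only):

```python
# Distinct-token ratio for the apr GPT-2 outputs shown in the results matrix.
# Heavy repetition ("all all all ...") pushes the ratio toward zero.
apr_gpt2_outputs = {
    "arithmetic": "I in in in in in in in in in in in in in all",
    "code": "in in ways ways ways\n\n in in in in in in",
    "completion": "all all all all all all all all all all all all",
    "greeting": "and we we we we we we we we we we we we we we",
}

for name, text in apr_gpt2_outputs.items():
    words = text.split()
    ratio = len(set(words)) / len(words)
    print(f"{name}: {len(set(words))}/{len(words)} distinct tokens ({ratio:.2f})")
```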
Reproduction
# Versions
apr --version # apr 0.2.18 (940ef71e)
llama-completion --version # build: 7746 (39173bcac)
# Use llama.cpp-native Q8_0 GGUFs (NOT apr-exported)
# These were created by convert_hf_to_gguf.py + llama-quantize
GGUF=models/smollm-135m-q8_0.gguf
PROMPT="Hello"
# apr inference (greedy via default settings)
apr run "$GGUF" --prompt "$PROMPT" --json --max-tokens 32
# {"text": "Alex, I'm here to help you! I'm here to learn about the basics of the best wishes to all the 1000000", ...}
# llama-completion inference (explicit greedy)
CUDA_VISIBLE_DEVICES="" llama-completion -m "$GGUF" -p "$PROMPT" -n 32 \
--temp 0 --top-k 1 --no-display-prompt -s 42
# Alex. I'm 13 years old and I love playing soccer. Recently, I joined a local soccer club and I really enjoyed it. Howev
# They differ!
Full reproduction script
#!/bin/bash
# Run from tiny-model-ground-truth root
for model in smollm-135m qwen2-0.5b gpt2-124m; do
for prompt_name in arithmetic code completion greeting; do
prompt=$(python3 -c "import json; print(json.load(open('oracle/${model}/${prompt_name}.json'))['prompt'])")
gguf="models/${model}-q8_0.gguf"
apr_text=$(apr run "$gguf" --prompt "$prompt" --json --max-tokens 32 2>/dev/null \
| python3 -c "import sys,json; print(json.load(sys.stdin).get('text',''))" 2>/dev/null)
llama_text=$(CUDA_VISIBLE_DEVICES="" llama-completion -m "$gguf" -p "$prompt" -n 32 \
--temp 0 --top-k 1 --no-display-prompt -s 42 2>/dev/null)
if [ "$apr_text" = "$llama_text" ]; then
echo "MATCH ${model}/${prompt_name}"
else
echo "DIFFER ${model}/${prompt_name}"
echo " apr: ${apr_text:0:100}"
echo " llama: ${llama_text:0:100}"
fi
done
done
Five-Whys Analysis
GPT-2 degenerate output (most severe)
- Why does apr produce "all all all all..." from the GPT-2 GGUF? → The model is stuck in a repetition loop.
- Why is it stuck? → The logits are not being computed correctly, so argmax always returns the same token.
- Why are the logits wrong? → apr's GPT-2 GGUF inference likely misinterprets the weight layout or architecture-specific ops.
- Why would weight interpretation differ? → GPT-2 uses standard LayerNorm + learned position embeddings (not RoPE), which differs from LLaMA. apr may be applying LLaMA-style ops to GPT-2 weights (see the normalization sketch after this list).
- Why wasn't this caught? → apr's GGUF inference was likely only tested against LLaMA-family models, not GPT-2.
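If apr really does apply LLaMA-style normalization to GPT-2 tensors, the corruption is large, not epsilon-level. A minimal NumPy sketch (illustrative only, not apr's actual code) contrasting GPT-2's LayerNorm with LLaMA's RMSNorm on the same weights:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # GPT-2-style LayerNorm: mean-centering plus a learned scale AND bias.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-5):
    # LLaMA-style RMSNorm: no mean-centering, no bias term.
    rms = np.sqrt((x * x).mean(-1, keepdims=True) + eps)
    return x / rms * gamma

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=1.0, size=(1, 8))  # non-zero-mean activations
gamma, beta = np.ones(8), rng.normal(size=8)

correct = layer_norm(x, gamma, beta)
# Wrong: treating the GPT-2 LayerNorm weight as an RMSNorm weight skips the
# mean-centering and silently drops the bias tensor entirely.
as_if_llama = rms_norm(x, gamma)

print("max abs error:", np.abs(correct - as_if_llama).max())
```

An error of this magnitude at every layer is consistent with the observed collapse onto a single repeated token, whereas the SmolLM/Qwen2 divergences look more like accumulation-order noise.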
SmolLM partial match (root cause isolation)
- Why does SmolLM/completion match but other prompts don't? → The completion prompt ("The capital of France is") has very high-confidence continuations where both runtimes agree.
- Why do other prompts diverge? → Lower-confidence continuations expose differences in token probability computation (a margin check is sketched after this list).
- Why would token probabilities differ? → Potential floating-point accumulation differences, a different attention implementation, or tokenizer encoding differences.
- Why would tokenizer encoding differ? → apr and llama.cpp may tokenize the same prompt string differently (BPE merge order, special token handling).
- Why would BPE merges differ? → apr reads merges from the GGUF key tokenizer.ggml.merges, the same as llama.cpp, so this is likely not the cause. An attention/FFN numerical difference is more likely.
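The "high-confidence continuation" hypothesis can be probed directly by comparing the top-1 vs top-2 logit margin per prompt: a wide margin means epsilon-level runtime differences cannot flip greedy decoding, a narrow one means they can. The sketch below uses the HF reference checkpoint HuggingFaceTB/SmolLM-135M as a stand-in for the GGUF (an assumption: the Q8_0 model's margins should track the fp32 reference closely).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# HF reference weights stand in for the GGUF here; margins should be comparable.
model_id = "HuggingFaceTB/SmolLM-135M"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

for prompt in ["The capital of France is", "Hello"]:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    top2 = torch.topk(logits, 2).values
    margin = (top2[0] - top2[1]).item()
    print(f"{prompt!r}: top-1 vs top-2 logit margin = {margin:.3f}")
```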
Qwen2 early divergence
- Why do Qwen2 outputs start the same and then diverge? → The first few tokens are high-confidence (same argmax in both runtimes), but as the context grows, numerical differences accumulate.
- Why do numerical differences accumulate? → Generation is autoregressive: a small logit difference at token N changes the input to token N+1, and the change cascades.
- Why is there any logit difference at all? → Different dequantization precision, a different matrix multiplication order, or different attention scaling.
- Why would dequantization differ? → apr implements its own Q8_0 dequant; llama.cpp uses ggml's. If they use different rounding or SIMD paths, values differ at float32 epsilon level.
- Why does epsilon matter? → With greedy decoding, even a 1e-7 logit difference can flip the argmax when two tokens have nearly equal probability (a numerical sketch follows this list).
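A minimal NumPy sketch (illustrative only; not apr's or ggml's actual kernels) of both halves of this chain: two reasonable Q8_0 dot-product implementations that disagree at epsilon level, and an argmax that flips on a near-tie.

```python
import numpy as np

rng = np.random.default_rng(0)
BLOCK = 32  # Q8_0 stores 32 int8 weights plus one fp16 scale per block

w_fp32 = rng.normal(size=4096).astype(np.float32)
x = rng.normal(size=4096).astype(np.float32)

# Quantize to Q8_0: per-block fp16 scale, int8 quants.
blocks = w_fp32.reshape(-1, BLOCK)
scales = (np.abs(blocks).max(axis=1) / 127.0).astype(np.float16)
quants = np.round(blocks / scales[:, None].astype(np.float32)).astype(np.int8)

# Implementation A: dequantize to fp32, then one large dot product.
deq = quants.astype(np.float32) * scales[:, None].astype(np.float32)
dot_a = np.dot(deq.reshape(-1), x)

# Implementation B: per-block partial sums, accumulated in the reverse order.
x_blocks = x.reshape(-1, BLOCK)
partial = (quants.astype(np.float32) * x_blocks).sum(axis=1) * scales.astype(np.float32)
dot_b = np.float32(0.0)
for p in partial[::-1]:
    dot_b += np.float32(p)

print("A - B =", float(dot_a) - float(dot_b))  # typically tiny but nonzero

# Under greedy decoding, a perturbation of that size flips the argmax on a near-tie:
logits = np.array([10.000001, 10.000002], dtype=np.float64)
print(np.argmax(logits), "->", np.argmax(logits + np.array([3e-6, 0.0])))  # 1 -> 0
```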
Popperian Falsification
Claim: "Given the same GGUF file and greedy decoding, apr and llama.cpp produce identical text output."
Test: Feed 3 models × 4 prompts through both runtimes using llama.cpp-native Q8_0 GGUFs with --temp 0 --top-k 1 (greedy).
Result: FALSIFIED for 11/12 test cases. Only SmolLM-135M/completion survives.
Falsification gradient (from most to least severe):
- GPT-2: Total inference failure. apr produces degenerate repetition ("all all all") while llama.cpp produces coherent text. This is not a numerical precision issue; it is a fundamental correctness bug in apr's GPT-2 GGUF inference.
- Qwen2: Partial agreement followed by divergence. Both produce coherent text, but they diverge after ~10 tokens. Likely a numerical precision or tokenizer issue.
- SmolLM: Nearest to parity. Both produce coherent text, and 1/4 prompts match exactly. The remaining 3/4 diverge but stay semantically similar. Could plausibly be fixed by aligning floating-point accumulation.
Implication: apr's GGUF inference is not a drop-in replacement for llama.cpp. Users
who convert models to GGUF expecting interoperability will get different (and in GPT-2's
case, broken) results.
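For reference, the shape of the falsification test (a hedged sketch: the real test lives in tests/test_llamacpp_parity.py, and the run_apr / run_llama_completion helpers below are hypothetical stand-ins; the CLI flags are the same ones used in the reproduction above):

```python
import json
import os
import subprocess

import pytest

MODELS = ["smollm-135m", "qwen2-0.5b", "gpt2-124m"]
PROMPTS = ["arithmetic", "code", "completion", "greeting"]


def run_apr(gguf: str, prompt: str, n: int = 32) -> str:
    # Greedy decode via apr's defaults, JSON output as in the reproduction section.
    out = subprocess.run(
        ["apr", "run", gguf, "--prompt", prompt, "--json", "--max-tokens", str(n)],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout).get("text", "")


def run_llama_completion(gguf: str, prompt: str, n: int = 32) -> str:
    # Explicit greedy decode with llama-completion, CPU only for determinism.
    out = subprocess.run(
        ["llama-completion", "-m", gguf, "-p", prompt, "-n", str(n),
         "--temp", "0", "--top-k", "1", "--no-display-prompt", "-s", "42"],
        capture_output=True, text=True, check=True,
        env={**os.environ, "CUDA_VISIBLE_DEVICES": ""},
    )
    return out.stdout


@pytest.mark.xfail(reason="apr and llama.cpp currently disagree for 11/12 cases")
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("prompt_name", PROMPTS)
def test_cross_runtime_text_match(model, prompt_name):
    gguf = f"models/{model}-q8_0.gguf"
    with open(f"oracle/{model}/{prompt_name}.json") as f:
        prompt = json.load(f)["prompt"]
    assert run_apr(gguf, prompt) == run_llama_completion(gguf, prompt)
```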
Context
- GGUF files: llama.cpp-native Q8_0 (convert_hf_to_gguf.py + llama-quantize Q8_0)
- Test suite: tiny-model-ground-truth, Layer 4c tests
- Test: tests/test_llamacpp_parity.py::test_cross_runtime_text_match
Acceptance Criteria
- test_cross_runtime_text_match xfail removed, ≥9/12 tests pass green