
apr and llama-completion produce different text from same GGUF with greedy decoding #278

Bug

When the same llama.cpp-native GGUF is fed to both apr run --json and
llama-completion with greedy decoding (--temp 0 --top-k 1), the two runtimes
produce different text for 11/12 test cases. Only SmolLM-135M/completion matches.

Results Matrix (3 models × 4 prompts)

Model Prompt apr Output (first 80 chars) llama-completion Output (first 80 chars) Match?
SmolLM-135M arithmetic 2 2\n\nQuestion 10.\nWhat is 2+2? Answer: 2\n DIFFER
SmolLM-135M code if n:\n return fib(n)\n if n == 0: \n if n <= 1:\n return n\n return fibonacci(n DIFFER
SmolLM-135M completion Paris, the capital of France, the country of France Paris. It is the largest city in France DIFFER
SmolLM-135M greeting Alex, I'm here to help you! I'm here to learn Alex. I'm 13 years old and I love playing DIFFER
Qwen2-0.5B arithmetic The answer to the question "What is 2+2?" is:\nuser The answer to the question "What is 2+2?" is:\n\n> EOF DIFFER
Qwen2-0.5B code The Fibonacci sequence is a series of numbers in The Fibonacci sequence is a series of numbers in ALMOST (diverge after ~60 chars)
Qwen2-0.5B completion The capital of France is Paris.\nThe capital of Fr The capital of France is Paris.\n\n> EOF by user DIFFER
Qwen2-0.5B greeting (empty) assistant\n\n> EOF by user DIFFER
GPT-2 arithmetic I in in in in in in in in in in in in in all \n\nThe answer is 2+2.\n\nThe answer is 2+2. DIFFER
GPT-2 code in in ways ways ways\n\n in in in in in in \n\nfor n in range(n):\n\nreturn fibonacci(n) DIFFER
GPT-2 completion all all all all all all all all all all all all the capital of the French Republic. The French DIFFER
GPT-2 greeting and we we we we we we we we we we we we we we K. I am a student at the University of California DIFFER

Cross-runtime exact text match: pytest xfail test_cross_runtime_text_match

  • 1 XPASS (claim survived): smollm-135m/completion
  • 11 XFAIL (claims falsified): all other combinations
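
For reference, a minimal sketch of how the xfail parity test can be structured is shown below. Only the test name, the oracle prompt layout, and the two CLI invocations come from this issue; the helper functions and the parametrization are illustrative assumptions, not the actual suite.

# Hedged sketch of the cross-runtime parity test (helpers and parametrization assumed).
import json
import os
import subprocess

import pytest

MODELS = ["smollm-135m", "qwen2-0.5b", "gpt2-124m"]
PROMPTS = ["arithmetic", "code", "completion", "greeting"]

def load_prompt(model: str, name: str) -> str:
    # Oracle prompts live in oracle/<model>/<prompt>.json (see the repro script below).
    with open(f"oracle/{model}/{name}.json") as f:
        return json.load(f)["prompt"]

def run_apr(gguf: str, prompt: str) -> str:
    # apr is greedy by default; --json wraps the generated text in a JSON object.
    out = subprocess.run(
        ["apr", "run", gguf, "--prompt", prompt, "--json", "--max-tokens", "32"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)["text"]

def run_llama(gguf: str, prompt: str) -> str:
    # Explicit greedy decoding on CPU; llama.cpp timings go to stderr, text to stdout.
    out = subprocess.run(
        ["llama-completion", "-m", gguf, "-p", prompt, "-n", "32",
         "--temp", "0", "--top-k", "1", "--no-display-prompt", "-s", "42"],
        capture_output=True, text=True, check=True,
        env={**os.environ, "CUDA_VISIBLE_DEVICES": ""},
    )
    return out.stdout.strip()

@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("prompt_name", PROMPTS)
@pytest.mark.xfail(reason="apr and llama.cpp greedy outputs do not yet match")
def test_cross_runtime_text_match(model: str, prompt_name: str) -> None:
    prompt = load_prompt(model, prompt_name)
    gguf = f"models/{model}-q8_0.gguf"
    assert run_apr(gguf, prompt) == run_llama(gguf, prompt)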

Key Observations

  1. apr GPT-2 output is degenerate: apr produces repetitive single-word sequences ("all all all", "we we we we", "in in in") while llama-completion produces coherent text from the same GGUF; a simple heuristic for quantifying this degeneration is sketched after this list. This strongly suggests that apr's weight loading or inference for the GPT-2 architecture is broken.

  2. Qwen2 diverges after initial agreement: Both runtimes start with the same text ("The Fibonacci sequence is a series of numbers in") then diverge, suggesting tokenizer or sampling differences rather than weight loading errors.

  3. SmolLM is closest: SmolLM/completion produces an exact text match. Other SmolLM prompts diverge but both outputs are coherent, suggesting the core inference is close but there are edge-case differences.
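
A crude way to quantify the degeneration called out in observation 1 is the fraction of distinct words in the continuation. The helper below is purely illustrative and not part of the existing test suite:

def repetition_ratio(text: str) -> float:
    # Fraction of whitespace-separated tokens that are distinct; values near 0
    # mean the output has collapsed onto a handful of repeated words.
    words = text.split()
    return len(set(words)) / len(words) if words else 1.0

print(repetition_ratio("all all all all all all all all all all all all"))  # ~0.08 (apr, GPT-2)
print(repetition_ratio("the capital of the French Republic. The French"))   # 0.75 (llama-completion)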

Reproduction

# Versions
apr --version        # apr 0.2.18 (940ef71e)
llama-completion --version  # build: 7746 (39173bcac)

# Use llama.cpp-native Q8_0 GGUFs (NOT apr-exported)
# These were created by convert_hf_to_gguf.py + llama-quantize
GGUF=models/smollm-135m-q8_0.gguf
PROMPT="Hello"

# apr inference (greedy via default settings)
apr run "$GGUF" --prompt "$PROMPT" --json --max-tokens 32
# {"text": "Alex, I'm here to help you! I'm here to learn about the basics of the best wishes to all the 1000000", ...}

# llama-completion inference (explicit greedy)
CUDA_VISIBLE_DEVICES="" llama-completion -m "$GGUF" -p "$PROMPT" -n 32 \
  --temp 0 --top-k 1 --no-display-prompt -s 42
# Alex. I'm 13 years old and I love playing soccer. Recently, I joined a local soccer club and I really enjoyed it. Howev

# They differ!

Full reproduction script

#!/bin/bash
# Run from tiny-model-ground-truth root
for model in smollm-135m qwen2-0.5b gpt2-124m; do
  for prompt_name in arithmetic code completion greeting; do
    prompt=$(python3 -c "import json; print(json.load(open('oracle/${model}/${prompt_name}.json'))['prompt'])")
    gguf="models/${model}-q8_0.gguf"
    
    apr_text=$(apr run "$gguf" --prompt "$prompt" --json --max-tokens 32 2>/dev/null \
      | python3 -c "import sys,json; print(json.load(sys.stdin).get('text',''))" 2>/dev/null)
    llama_text=$(CUDA_VISIBLE_DEVICES="" llama-completion -m "$gguf" -p "$prompt" -n 32 \
      --temp 0 --top-k 1 --no-display-prompt -s 42 2>/dev/null)
    
    if [ "$apr_text" = "$llama_text" ]; then
      echo "MATCH  ${model}/${prompt_name}"
    else
      echo "DIFFER ${model}/${prompt_name}"
      echo "  apr:   ${apr_text:0:100}"
      echo "  llama: ${llama_text:0:100}"
    fi
  done
done

Five-Whys Analysis

GPT-2 degenerate output (most severe)

  1. Why does apr produce "all all all all..." from GPT-2 GGUF? → The model is stuck in a repetition loop
  2. Why is it stuck? → The logits are not being computed correctly, so argmax always returns the same token
  3. Why are logits wrong? → apr's GPT-2 GGUF inference likely misinterprets weight layout or architecture-specific ops
  4. Why would weight interpretation differ? → GPT-2 uses standard LayerNorm (with a learned bias) and learned absolute position embeddings, whereas LLaMA-family models use RMSNorm and RoPE. apr may be applying LLaMA-style ops to GPT-2 weights; a numerical sketch of the LayerNorm/RMSNorm difference follows after this list.
  5. Why wasn't this caught? → apr's GGUF inference was likely only tested against LLaMA-family models, not GPT-2
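
As a concrete illustration of why 4: the two normalizations produce visibly different activations for the same input, so routing GPT-2 weights through the wrong one corrupts every layer. The code below is a minimal numpy sketch of the textbook formulas, not apr's or llama.cpp's actual implementation.

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # GPT-2-style LayerNorm: subtract the mean, divide by the standard deviation,
    # then apply the learned scale and bias.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-5):
    # LLaMA-style RMSNorm: no mean subtraction and no bias term.
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(x, np.ones(4), np.zeros(4)))  # ≈ [-1.342 -0.447  0.447  1.342]
print(rms_norm(x, np.ones(4)))                 # ≈ [ 0.365  0.730  1.095  1.461]
# Running GPT-2 weights through the RMSNorm path (or dropping the bias) shifts
# every residual-stream activation, which would plausibly produce the kind of
# degenerate argmax loop shown above.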

SmolLM partial match (root cause isolation)

  1. Why does SmolLM/completion match but other prompts don't? → The completion prompt ("The capital of France is") has very high confidence continuations where both runtimes agree
  2. Why do other prompts diverge? → Lower-confidence continuations expose differences in token probability computation
  3. Why would token probabilities differ? → Potential floating-point accumulation differences, different attention implementation, or tokenizer encoding differences
  4. Why would tokenizer encoding differ? → apr and llama.cpp may tokenize the same prompt string differently (BPE merge order, special token handling)
  5. Why would BPE merges differ? → apr reads merges from GGUF tokenizer.ggml.merges, same as llama.cpp — so this is likely not the cause. More likely an attention/FFN numerical difference.
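
A quick way to test why 5's conclusion is to compare the token IDs each runtime assigns to the same prompt before comparing any logits. The sketch below assumes a token dump is obtainable from apr (not verified); on the llama.cpp side, the llama-tokenize tool can print the IDs.

def first_divergence(a: list[int], b: list[int]) -> int | None:
    # Index of the first differing token ID, or None if the encodings are identical.
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

# llama.cpp side (flag spelling may vary by build):
#   llama-tokenize -m models/smollm-135m-q8_0.gguf -p "The capital of France is"
# The apr-side dump is an assumption; the IDs below are placeholders.
apr_ids   = [504, 5458, 282, 5237, 314]
llama_ids = [504, 5458, 282, 5237, 314]
print(first_divergence(apr_ids, llama_ids))  # None -> tokenization is not the culprit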

Qwen2 early divergence

  1. Why do Qwen2 outputs start the same then diverge? → The first few tokens are high-confidence (same argmax in both runtimes), but as context grows, numerical differences accumulate
  2. Why do numerical differences accumulate? → Autoregressive generation: a small logit difference at token N changes the input to token N+1, cascading
  3. Why is there any logit difference at all? → Different dequantization precision, different matrix multiplication order, or different attention scaling
  4. Why would dequantization differ? → apr implements its own Q8_0 dequant; llama.cpp uses ggml's. If they use different rounding or SIMD paths, values differ at float32 epsilon level.
  5. Why does epsilon matter? → With greedy decoding, even a 1e-7 logit difference can flip the argmax when two tokens are close in probability
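
Whys 4 and 5 can be illustrated in isolation: merely changing the accumulation order of a float32 dot product perturbs the result in its last bits, and under greedy decoding such a perturbation flips the argmax whenever two candidate logits are nearly tied. The sketch below is illustrative only and does not reproduce apr's or ggml's kernels.

import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(4096).astype(np.float32)   # stand-in hidden state
w = rng.standard_normal(4096).astype(np.float32)   # stand-in output-projection row

# The same mathematical dot product, accumulated in two different orders.
forward = np.float32(0.0)
for a, b in zip(h, w):
    forward = np.float32(forward + a * b)
reverse = np.float32(0.0)
for a, b in zip(h[::-1], w[::-1]):
    reverse = np.float32(reverse + a * b)
print(forward, reverse, forward - reverse)  # typically differs in the last few bits

# Under greedy decoding, a last-bit perturbation flips the argmax when two
# logits are nearly tied:
logits    = np.array([1.0000001, 1.0], dtype=np.float32)
perturbed = logits + np.array([0.0, 2e-7], dtype=np.float32)
print(int(logits.argmax()), int(perturbed.argmax()))  # 0, then 1 -> the greedy token flips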

Popperian Falsification

Claim: "Given the same GGUF file and greedy decoding, apr and llama.cpp produce identical text output."

Test: Feed 3 models × 4 prompts through both runtimes using llama.cpp-native Q8_0 GGUFs with --temp 0 --top-k 1 (greedy).

Result: FALSIFIED for 11/12 test cases. Only SmolLM-135M/completion survives.

Falsification gradient (from most to least severe):

  1. GPT-2: Total inference failure. apr produces degenerate repetition ("all all all") while llama.cpp produces coherent text. This is not a numerical precision issue — it's a fundamental correctness bug in apr's GPT-2 GGUF inference.
  2. Qwen2: Partial agreement followed by divergence. Both produce coherent text, but they diverge after ~10 tokens. Likely a numerical precision or tokenizer issue.
  3. SmolLM: Nearest to parity. Both produce coherent text, 1/4 prompts match exactly. Remaining 3/4 diverge but show semantic similarity. Could be fixed by aligning floating-point accumulation.

Implication: apr's GGUF inference is not a drop-in replacement for llama.cpp. Users
who convert models to GGUF expecting interoperability will get different (and in GPT-2's
case, broken) results.

Context

Acceptance Criteria

  • GPT-2 GGUF inference produces coherent text (not degenerate repetition)
  • SmolLM GGUF: all 4 prompts match llama-completion output exactly
  • Qwen2 GGUF: matches llama-completion for at least the first 10 tokens
  • Root cause identified: weight loading vs tokenizer vs inference ops
  • test_cross_runtime_text_match xfail removed; ≥9/12 tests pass
