
apr and llama-completion produce different text from same GGUF with greedy decoding #278

Bug

When the same llama.cpp-native GGUF is fed to both apr run --json and
llama-completion with greedy decoding (--temp 0 --top-k 1), the two runtimes
produce different text for 11/12 test cases. Only SmolLM-135M/completion matches.

Results Matrix (3 models × 4 prompts)

Model Prompt apr Output (first 80 chars) llama-completion Output (first 80 chars) Match?
SmolLM-135M arithmetic 2 2\n\nQuestion 10.\nWhat is 2+2? Answer: 2\n DIFFER
SmolLM-135M code if n:\n return fib(n)\n if n == 0: \n if n <= 1:\n return n\n return fibonacci(n DIFFER
SmolLM-135M completion Paris, the capital of France, the country of France Paris. It is the largest city in France DIFFER
SmolLM-135M greeting Alex, I'm here to help you! I'm here to learn Alex. I'm 13 years old and I love playing DIFFER
Qwen2-0.5B arithmetic The answer to the question "What is 2+2?" is:\nuser The answer to the question "What is 2+2?" is:\n\n> EOF DIFFER
Qwen2-0.5B code The Fibonacci sequence is a series of numbers in The Fibonacci sequence is a series of numbers in ALMOST (diverge after ~60 chars)
Qwen2-0.5B completion The capital of France is Paris.\nThe capital of Fr The capital of France is Paris.\n\n> EOF by user DIFFER
Qwen2-0.5B greeting (empty) assistant\n\n> EOF by user DIFFER
GPT-2 arithmetic I in in in in in in in in in in in in in all \n\nThe answer is 2+2.\n\nThe answer is 2+2. DIFFER
GPT-2 code in in ways ways ways\n\n in in in in in in \n\nfor n in range(n):\n\nreturn fibonacci(n) DIFFER
GPT-2 completion all all all all all all all all all all all all the capital of the French Republic. The French DIFFER
GPT-2 greeting and we we we we we we we we we we we we we we K. I am a student at the University of California DIFFER

Cross-runtime exact text match: pytest xfail test_cross_runtime_text_match

  • 1 XPASS (claim survived): smollm-135m/completion
  • 11 XFAIL (claims falsified): all other combinations
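
For reference, a minimal sketch of how the xfail parity test can be structured is shown below. Only the test name, the oracle prompt layout, and the two CLI invocations come from this issue; the helper functions and the parametrization are illustrative assumptions, not the actual suite.

# Hedged sketch of the cross-runtime parity test (helpers and parametrization assumed).
import json
import os
import subprocess

import pytest

MODELS = ["smollm-135m", "qwen2-0.5b", "gpt2-124m"]
PROMPTS = ["arithmetic", "code", "completion", "greeting"]

def load_prompt(model: str, name: str) -> str:
    # Oracle prompts live in oracle/<model>/<prompt>.json (see the repro script below).
    with open(f"oracle/{model}/{name}.json") as f:
        return json.load(f)["prompt"]

def run_apr(gguf: str, prompt: str) -> str:
    # apr is greedy by default; --json wraps the generated text in a JSON object.
    out = subprocess.run(
        ["apr", "run", gguf, "--prompt", prompt, "--json", "--max-tokens", "32"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)["text"]

def run_llama(gguf: str, prompt: str) -> str:
    # Explicit greedy decoding on CPU; llama.cpp timings go to stderr, text to stdout.
    out = subprocess.run(
        ["llama-completion", "-m", gguf, "-p", prompt, "-n", "32",
         "--temp", "0", "--top-k", "1", "--no-display-prompt", "-s", "42"],
        capture_output=True, text=True, check=True,
        env={**os.environ, "CUDA_VISIBLE_DEVICES": ""},
    )
    return out.stdout.strip()

@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("prompt_name", PROMPTS)
@pytest.mark.xfail(reason="apr and llama.cpp greedy outputs do not yet match")
def test_cross_runtime_text_match(model: str, prompt_name: str) -> None:
    prompt = load_prompt(model, prompt_name)
    gguf = f"models/{model}-q8_0.gguf"
    assert run_apr(gguf, prompt) == run_llama(gguf, prompt)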

Key Observations

  1. apr GPT-2 output is degenerate: apr produces repetitive single-word sequences ("all all all", "we we we we", "in in in") while llama-completion produces coherent text from the same GGUF; a simple heuristic for quantifying this degeneration is sketched after this list. This strongly suggests that apr's weight loading or inference for the GPT-2 architecture is broken.

  2. Qwen2 diverges after initial agreement: Both runtimes start with the same text ("The Fibonacci sequence is a series of numbers in") then diverge, suggesting tokenizer or sampling differences rather than weight loading errors.

  3. SmolLM is closest: SmolLM/completion produces an exact text match. Other SmolLM prompts diverge but both outputs are coherent, suggesting the core inference is close but there are edge-case differences.
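
A crude way to quantify the degeneration called out in observation 1 is the fraction of distinct words in the continuation. The helper below is purely illustrative and not part of the existing test suite:

def repetition_ratio(text: str) -> float:
    # Fraction of whitespace-separated tokens that are distinct; values near 0
    # mean the output has collapsed onto a handful of repeated words.
    words = text.split()
    return len(set(words)) / len(words) if words else 1.0

print(repetition_ratio("all all all all all all all all all all all all"))  # ~0.08 (apr, GPT-2)
print(repetition_ratio("the capital of the French Republic. The French"))   # 0.75 (llama-completion)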

Reproduction

# Versions
apr --version        # apr 0.2.18 (940ef71e)
llama-completion --version  # build: 7746 (39173bcac)

# Use llama.cpp-native Q8_0 GGUFs (NOT apr-exported)
# These were created by convert_hf_to_gguf.py + llama-quantize
GGUF=models/smollm-135m-q8_0.gguf
PROMPT="Hello"

# apr inference (greedy via default settings)
apr run "$GGUF" --prompt "$PROMPT" --json --max-tokens 32
# {"text": "Alex, I'm here to help you! I'm here to learn about the basics of the best wishes to all the 1000000", ...}

# llama-completion inference (explicit greedy)
CUDA_VISIBLE_DEVICES="" llama-completion -m "$GGUF" -p "$PROMPT" -n 32 \
  --temp 0 --top-k 1 --no-display-prompt -s 42
# Alex. I'm 13 years old and I love playing soccer. Recently, I joined a local soccer club and I really enjoyed it. Howev

# They differ!

Full reproduction script

#!/bin/bash
# Run from tiny-model-ground-truth root
for model in smollm-135m qwen2-0.5b gpt2-124m; do
  for prompt_name in arithmetic code completion greeting; do
    prompt=$(python3 -c "import json; print(json.load(open('oracle/${model}/${prompt_name}.json'))['prompt'])")
    gguf="models/${model}-q8_0.gguf"
    
    apr_text=$(apr run "$gguf" --prompt "$prompt" --json --max-tokens 32 2>/dev/null \
      | python3 -c "import sys,json; print(json.load(sys.stdin).get('text',''))" 2>/dev/null)
    llama_text=$(CUDA_VISIBLE_DEVICES="" llama-completion -m "$gguf" -p "$prompt" -n 32 \
      --temp 0 --top-k 1 --no-display-prompt -s 42 2>/dev/null)
    
    if [ "$apr_text" = "$llama_text" ]; then
      echo "MATCH  ${model}/${prompt_name}"
    else
      echo "DIFFER ${model}/${prompt_name}"
      echo "  apr:   ${apr_text:0:100}"
      echo "  llama: ${llama_text:0:100}"
    fi
  done
done

Five-Whys Analysis

GPT-2 degenerate output (most severe)

  1. Why does apr produce "all all all all..." from GPT-2 GGUF? → The model is stuck in a repetition loop
  2. Why is it stuck? → The logits are not being computed correctly, so argmax always returns the same token
  3. Why are logits wrong? → apr's GPT-2 GGUF inference likely misinterprets weight layout or architecture-specific ops
  4. Why would weight interpretation differ? → GPT-2 uses standard LayerNorm (with a learned bias) and learned absolute position embeddings, whereas LLaMA-family models use RMSNorm and RoPE. apr may be applying LLaMA-style ops to GPT-2 weights; a numerical sketch of the LayerNorm/RMSNorm difference follows after this list.
  5. Why wasn't this caught? → apr's GGUF inference was likely only tested against LLaMA-family models, not GPT-2
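
As a concrete illustration of why 4: the two normalizations produce visibly different activations for the same input, so routing GPT-2 weights through the wrong one corrupts every layer. The code below is a minimal numpy sketch of the textbook formulas, not apr's or llama.cpp's actual implementation.

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # GPT-2-style LayerNorm: subtract the mean, divide by the standard deviation,
    # then apply the learned scale and bias.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-5):
    # LLaMA-style RMSNorm: no mean subtraction and no bias term.
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(x, np.ones(4), np.zeros(4)))  # ≈ [-1.342 -0.447  0.447  1.342]
print(rms_norm(x, np.ones(4)))                 # ≈ [ 0.365  0.730  1.095  1.461]
# Running GPT-2 weights through the RMSNorm path (or dropping the bias) shifts
# every residual-stream activation, which would plausibly produce the kind of
# degenerate argmax loop shown above.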

SmolLM partial match (root cause isolation)

  1. Why does SmolLM/completion match but other prompts don't? → The completion prompt ("The capital of France is") has very high confidence continuations where both runtimes agree
  2. Why do other prompts diverge? → Lower-confidence continuations expose differences in token probability computation
  3. Why would token probabilities differ? → Potential floating-point accumulation differences, different attention implementation, or tokenizer encoding differences
  4. Why would tokenizer encoding differ? → apr and llama.cpp may tokenize the same prompt string differently (BPE merge order, special token handling)
  5. Why would BPE merges differ? → apr reads merges from GGUF tokenizer.ggml.merges, same as llama.cpp — so this is likely not the cause. More likely an attention/FFN numerical difference.
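
A quick way to test why 5's conclusion is to compare the token IDs each runtime assigns to the same prompt before comparing any logits. The sketch below assumes a token dump is obtainable from apr (not verified); on the llama.cpp side, the llama-tokenize tool can print the IDs.

def first_divergence(a: list[int], b: list[int]) -> int | None:
    # Index of the first differing token ID, or None if the encodings are identical.
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

# llama.cpp side (flag spelling may vary by build):
#   llama-tokenize -m models/smollm-135m-q8_0.gguf -p "The capital of France is"
# The apr-side dump is an assumption; the IDs below are placeholders.
apr_ids   = [504, 5458, 282, 5237, 314]
llama_ids = [504, 5458, 282, 5237, 314]
print(first_divergence(apr_ids, llama_ids))  # None -> tokenization is not the culprit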

Qwen2 early divergence

  1. Why do Qwen2 outputs start the same then diverge? → The first few tokens are high-confidence (same argmax in both runtimes), but as context grows, numerical differences accumulate
  2. Why do numerical differences accumulate? → Autoregressive generation: a small logit difference at token N changes the input to token N+1, cascading
  3. Why is there any logit difference at all? → Different dequantization precision, different matrix multiplication order, or different attention scaling
  4. Why would dequantization differ? → apr implements its own Q8_0 dequant; llama.cpp uses ggml's. If they use different rounding or SIMD paths, values differ at float32 epsilon level.
  5. Why does epsilon matter? → With greedy decoding, even a 1e-7 logit difference can flip the argmax when two tokens are close in probability
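
Whys 4 and 5 can be illustrated in isolation: merely changing the accumulation order of a float32 dot product perturbs the result in its last bits, and under greedy decoding such a perturbation flips the argmax whenever two candidate logits are nearly tied. The sketch below is illustrative only and does not reproduce apr's or ggml's kernels.

import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(4096).astype(np.float32)   # stand-in hidden state
w = rng.standard_normal(4096).astype(np.float32)   # stand-in output-projection row

# The same mathematical dot product, accumulated in two different orders.
forward = np.float32(0.0)
for a, b in zip(h, w):
    forward = np.float32(forward + a * b)
reverse = np.float32(0.0)
for a, b in zip(h[::-1], w[::-1]):
    reverse = np.float32(reverse + a * b)
print(forward, reverse, forward - reverse)  # typically differs in the last few bits

# Under greedy decoding, a last-bit perturbation flips the argmax when two
# logits are nearly tied:
logits    = np.array([1.0000001, 1.0], dtype=np.float32)
perturbed = logits + np.array([0.0, 2e-7], dtype=np.float32)
print(int(logits.argmax()), int(perturbed.argmax()))  # 0, then 1 -> the greedy token flips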

Popperian Falsification

Claim: "Given the same GGUF file and greedy decoding, apr and llama.cpp produce identical text output."

Test: Feed 3 models × 4 prompts through both runtimes using llama.cpp-native Q8_0 GGUFs with --temp 0 --top-k 1 (greedy).

Result: FALSIFIED for 11/12 test cases. Only SmolLM-135M/completion survives.

Falsification gradient (from most to least severe):

  1. GPT-2: Total inference failure. apr produces degenerate repetition ("all all all") while llama.cpp produces coherent text. This is not a numerical precision issue — it's a fundamental correctness bug in apr's GPT-2 GGUF inference.
  2. Qwen2: Partial agreement followed by divergence. Both produce coherent text, but they diverge after ~10 tokens. Likely a numerical precision or tokenizer issue.
  3. SmolLM: Nearest to parity. Both produce coherent text, 1/4 prompts match exactly. Remaining 3/4 diverge but show semantic similarity. Could be fixed by aligning floating-point accumulation.

Implication: apr's GGUF inference is not a drop-in replacement for llama.cpp. Users
who convert models to GGUF expecting interoperability will get different (and in GPT-2's
case, broken) results.

Context

Acceptance Criteria

  • GPT-2 GGUF inference produces coherent text (not degenerate repetition)
  • SmolLM GGUF: all 4 prompts match llama-completion output exactly
  • Qwen2 GGUF: matches llama-completion for at least the first 10 tokens
  • Root cause identified: weight loading vs tokenizer vs inference ops
  • test_cross_runtime_text_match xfail removed; ≥9/12 tests pass
