v0.3.0-hybrid: Measured Hybrid Retrieval (8B-judged)

jigangz released this 25 Apr 00:10 · 3 commits to main since this release

Release date: 2026-04-23

Summary

Hybrid retrieval (pgvector + Postgres tsvector + Reciprocal Rank Fusion,
k=60) is now benchmarked end-to-end against a semantic-only baseline on
Railway production. Measured numbers replace the projected values
from earlier Ralph iterations.

Bottom line: hybrid materially improves faithfulness (+13.6pp, meaning
fewer hallucinated/unsupported claims) and substring match on
semantically-phrased queries (+10pp), but reduces answer relevancy
and recall on this 14-15q sample. This is an honest mixed result on a
small benchmark; see "Reading the numbers" below.

Key Metrics (15-query synthetic corpus, measured 2026-04-23)

| Metric                     | Semantic-only (15q) | Hybrid RRF (14q valid) | Δ          |
|----------------------------|---------------------|------------------------|------------|
| RAGAS Faithfulness         | 0.241               | 0.377                  | +13.6pp ✓  |
| RAGAS Answer Relevancy     | 0.729               | 0.596                  | -13.3pp    |
| RAGAS Context Precision    | 0.128               | 0.101                  | -2.7pp     |
| RAGAS Context Recall       | 0.377               | 0.273                  | -10.4pp    |
| Substring Hit (overall)    | 0.333               | 0.357                  | +2.4pp ✓   |
| ↳ Semantic-leaning queries | 0.300               | 0.400                  | +10.0pp ✓  |
| ↳ Keyword-leaning queries  | 0.400               | 0.200                  | -20.0pp    |

Raw data:

Reading the numbers (honest portfolio interpretation)

This is not a clean "hybrid wins everything" story. Three things
are happening:

  1. Hybrid genuinely reduces hallucination. Faithfulness +13.6pp
    means significantly fewer unsupported claims in the answer. The
    broader retrieval pool gives the LLM more anchors to cite from
    instead of guessing.

  2. Hybrid hurts on keyword queries (-20pp substring). RRF can
    bring in tsvector matches that are keyword-relevant but
    semantically distant. On the 8B-instant pipeline, the model
    struggles to synthesize from this less-coherent context. A 70B
    model would likely close most of this gap, but free-tier TPD
    prevented a same-day 70B benchmark.

  3. Sample is small (14-15q). Each per-category bucket has only
    3-5 queries. Standard error on these deltas is wide. Treat any
    single delta as ±5-10pp noise.
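
To put rough numbers on the noise described in item 3, here is a minimal sketch that treats each substring-hit outcome as an independent Bernoulli trial and computes the binomial standard error of a hit rate and of the difference between two runs. The independence assumption and the function names are illustrative and not part of the eval package. The calculation applies only to the binary substring metric, not to the continuous RAGAS scores.

```python
import math


def proportion_se(p: float, n: int) -> float:
    """Binomial standard error of a hit rate p measured on n queries."""
    return math.sqrt(p * (1.0 - p) / n)


def delta_se(p_a: float, n_a: int, p_b: float, n_b: int) -> float:
    """Standard error of (p_b - p_a), assuming the two runs are independent."""
    return math.sqrt(proportion_se(p_a, n_a) ** 2 + proportion_se(p_b, n_b) ** 2)


# Overall substring hit from the table: 0.333 on 15q vs 0.357 on 14q.
se_semantic = proportion_se(0.333, 15)
se_hybrid = proportion_se(0.357, 14)
se_delta = delta_se(0.333, 15, 0.357, 14)
print(f"semantic SE={se_semantic:.3f} hybrid SE={se_hybrid:.3f} delta SE={se_delta:.3f}")
```

Under these assumptions the per-run standard error on the overall substring hit rate is on the order of 0.12, so the +2.4pp overall delta sits well inside the noise. Because both runs use the same queries, a paired analysis would likely give a tighter estimate than this independent-runs approximation.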

The two metrics with the strongest signal are Faithfulness and
Substring hit on semantic-phrased queries; both move in hybrid's
favor. The other deltas are within noise or plausibly reflect 8B
synthesis weakness on heterogeneous context.

Methodology

Pipeline (both runs)

  • Endpoint: https://trustrag-production.up.railway.app
  • Generation model: llama-3.1-8b-instant via Groq free tier
    (70B's 100K TPD was exhausted by 09:48 from earlier benchmark
    attempts; production normally runs 70B-versatile)
  • Prompt style: merged JSON output (answer + in-prompt self-check
    for hallucination flags, HTTP path; SIGN-112 disclosure)
  • Retrieval:
    • Semantic-only: pgvector cosine similarity, top-20 → top-5
    • Hybrid: pgvector + Postgres tsvector plainto_tsquery, fused via
      Reciprocal Rank Fusion with k=60 (Cormack et al. 2009)
  • Cache bypass: nocache=True per SIGN-111 to ensure benchmark
    measures real pipeline, not warm cache
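
For readers unfamiliar with Reciprocal Rank Fusion, the sketch below shows the general technique: each document's fused score is the sum of 1 / (k + rank) over the ranked lists it appears in, with rank starting at 1. This is an illustration, not the production code. The function name `rrf_fuse`, the chunk IDs, the tie-break rule, and the `top_n` cut-off are assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    """Fuse ranked ID lists with RRF: score(d) = sum of 1 / (k + rank), rank from 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score descending; break ties by ID so output is deterministic.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


# Hypothetical chunk IDs, best match first in each list.
semantic_hits = ["a", "b", "c", "d"]   # e.g. from pgvector cosine similarity
keyword_hits = ["c", "a", "e", "f"]    # e.g. from a tsvector full-text match
fused = rrf_fuse([semantic_hits, keyword_hits], k=60, top_n=3)
print(fused)
```

Chunks that appear in both lists ("a" and "c") accumulate two reciprocal-rank terms and rise above chunks found by only one retriever. With k=60 the rank differences within a list carry little weight, so a chunk found by only one retriever can still enter the final context.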

RAGAS judge

  • Judge LLM: groq-llama-3.1-8b-instant (Groq free tier, 500K
    TPD vs 70B's 100K, fast inference for serial RAGAS calls)
  • Embeddings (for answer_relevancy): gemini-embedding-001
  • Settings: max_workers=1 (serial — Groq RPM ~30 chokes parallel
    RAGAS bursts), max_retries=6, max_wait=30s,
    answer_relevancy.strictness=1 (Groq endpoint rejects n>1)

Deterministic metrics (no LLM)

  • Substring hit: did the answer text contain the dataset's
    expected_answer_substring (e.g. "6 feet", "PPE", "permit")?
    This is the real-data signal, because it works on production
    responses.
  • Hit@5 (chunk_id): not used for headline numbers because the
    synthetic dataset uses placeholder chunk IDs (c_042 etc.) that
    don't match real DB UUIDs. It is reported as 0.0 in the raw JSONs
    and should be ignored.
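
A minimal sketch of how a substring-hit rate like this could be computed is shown below. Only the `expected_answer_substring` field name comes from the dataset description above. The `answer` field, the case-insensitive comparison, and the choice to skip rows without an answer are assumptions for illustration and may differ from what trustrag_eval actually does.

```python
def substring_hit(answer: str, expected_substring: str) -> bool:
    """True if the expected substring occurs in the answer (case-insensitive by assumption)."""
    return expected_substring.casefold() in answer.casefold()


def substring_hit_rate(rows: list[dict]) -> float:
    """Mean hit over rows that have an answer; failed rows (answer is None) are skipped."""
    valid = [r for r in rows if r.get("answer") is not None]
    if not valid:
        return 0.0
    hits = sum(substring_hit(r["answer"], r["expected_answer_substring"]) for r in valid)
    return hits / len(valid)


# Hypothetical rows; the last one stands in for a query that failed upstream.
rows = [
    {"answer": "Guardrails are required above 6 feet.", "expected_answer_substring": "6 feet"},
    {"answer": "Workers must wear ppe on site.", "expected_answer_substring": "PPE"},
    {"answer": "No information found.", "expected_answer_substring": "permit"},
    {"answer": None, "expected_answer_substring": "permit"},
]
print(substring_hit_rate(rows))
```

Skipping failed rows rather than counting them as misses is one way a run could end up with a 14-query denominator.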

Why a different judge for these benchmarks vs the design spec

The spec called for Gemini 2.5 Flash Lite as judge. Gemini works for
faithfulness scoring on tiny samples but its AFC (Automatic Function
Calling) loops make 60-job RAGAS runs take 2+ hours on free tier.
Groq 8B is roughly 6x faster (4s vs 26s per metric on a smoke test) and
the binary 0/1 verdicts that RAGAS produces are not very sensitive to
judge quality at this granularity. Documented in
packages/trustrag-eval/src/trustrag_eval/ragas_pipeline.py comments.

Tradeoffs Disclosed

  • Merged-prompt HTTP path has known LLM self-check bias
    (in-prompt, same model checks own claims). RAGAS faithfulness
    (Groq-judged, independent metric instance) IS the bias-free
    reference even though it shares the same model family. See
    SIGN-112 in plans/guardrails.md.
  • 8B-on-8B is harsh-but-consistent. Both pipeline and judge run
    Llama 3.1 8B Instant. Absolute scores will be lower than what 70B
    would produce, but deltas between semantic and hybrid use the
    same setup and are apples-to-apples.
  • Q014 (1 of 15) failed in both runs with a transient 502 (Railway
    → Groq → 429 propagation). RAGAS aggregates over the 13-14 valid
    rows. That is not enough to invalidate trends, but error bars get
    wider.
  • Production reverts to llama-3.3-70b-versatile post-benchmark.

Reproduce

# Set keys (Groq for pipeline + judge, Gemini for embeddings only)
export GROQ_API_KEY="..."
export GOOGLE_API_KEY="..."   # https://aistudio.google.com/app/apikey
pip install -e packages/trustrag-eval

# Semantic baseline
railway variables --set "HYBRID_ENABLED=false" --set "GROQ_MODEL=llama-3.1-8b-instant"
# (or use the dashboard; wait ~2min for redeploy)
curl -X POST https://trustrag-production.up.railway.app/api/query/admin/clear-cache
python -m trustrag_eval.ragas_pipeline \
  --endpoint https://trustrag-production.up.railway.app \
  --dataset eval/synthetic_queries.json \
  --limit 15 --mode semantic \
  --pipeline-model llama-3.1-8b-instant \
  --output eval/results/my-semantic.json

# Hybrid (after UTC day rolls for fresh Groq TPD if needed)
railway variables --set "HYBRID_ENABLED=true"
curl -X POST https://trustrag-production.up.railway.app/api/query/admin/clear-cache
python -m trustrag_eval.ragas_pipeline \
  --endpoint https://trustrag-production.up.railway.app \
  --dataset eval/synthetic_queries.json \
  --limit 15 --mode hybrid \
  --pipeline-model llama-3.1-8b-instant \
  --output eval/results/my-hybrid.json

Each run takes ~30-35 min (10 min collection + 20-25 min serial RAGAS
on Groq 8B judge with retry backoffs).

Changes (since v0.2.0-streaming)

  • feat(backend) merged generation + self-check prompt (fe50642)
  • feat(backend) Postgres-backed query cache (ef908cd)
  • perf(backend) eliminate 6 redundant fastembed calls per query (c31af8f)
  • feat(eval) Gemini wrapper for RAGAS (ebc07a8)
  • feat(eval) Groq 8B judge (7bfa775) + tuning + substring_hit (75f460d)
  • eval semantic 15q + hybrid 15q (75f460d, f262f93)
  • docs(spec) v2-completion design (7c85b69)
  • docs(plan) 11-task implementation plan (f5b7217)

What's next (v0.5.0-mcp + v1.0.0)

  • MCP Claude Desktop end-to-end demo + screenshots (P5-GATE)
  • README full rewrite with hero + badges + integrations
  • v1.0.0 release with consolidated changelog

Future benchmark improvements (post-v1.0.0)

  1. Re-run on llama-3.3-70b-versatile once free-tier TPD is fresh —
    should narrow or close the answer_relevancy / context_recall gaps
    driven by 8B's synthesis weakness on heterogeneous context.
  2. Expand to full 30-query dataset (5+5+5 → 10+10+10) for tighter
    error bars.
  3. Replace placeholder chunk_id ground-truth with real UUIDs once the
    dataset can be built against actual ingested PDFs (was synthesized
    pre-ingestion).
  4. Consider cross-encoder rerank stage between RRF and top-5 to
    address the keyword-query degradation observed here.