v0.3.0-hybrid: Measured Hybrid Retrieval (8B-judged)
Release date: 2026-04-23
Summary
Hybrid retrieval (pgvector + Postgres tsvector + Reciprocal Rank Fusion,
k=60) is now benchmarked end-to-end against semantic-only baseline on
Railway production. Measured numbers replace prior projected values
from earlier Ralph iterations.
Bottom line: hybrid materially improves faithfulness (+13.6pp, meaning
fewer hallucinated/unsupported claims) and substring match on
semantically-phrased queries (+10pp), but reduces answer relevancy
and recall on this 14-15q sample. This is an honest mixed result on a
small benchmark; see "Reading the numbers" below.
Key Metrics (15-query synthetic corpus, measured 2026-04-23)
| Metric | Semantic-only (15q) | Hybrid RRF (14q valid) | Δ |
|---|---|---|---|
| RAGAS Faithfulness | 0.241 | 0.377 | +13.6pp ✓ |
| RAGAS Answer Relevancy | 0.729 | 0.596 | -13.3pp |
| RAGAS Context Precision | 0.128 | 0.101 | -2.7pp |
| RAGAS Context Recall | 0.377 | 0.273 | -10.4pp |
| Substring Hit (overall) | 0.333 | 0.357 | +2.4pp ✓ |
| ↳ Semantic-leaning queries | 0.300 | 0.400 | +10.0pp ✓ |
| ↳ Keyword-leaning queries | 0.400 | 0.200 | -20.0pp |
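The Δ column is the hybrid score minus the semantic-only score, expressed in percentage points. A minimal sketch of that arithmetic, using the values from the table above (the `delta_pp` helper name is hypothetical, not part of trustrag_eval):

```python
def delta_pp(baseline: float, candidate: float) -> float:
    """Difference between two 0-1 scores, in percentage points."""
    return round((candidate - baseline) * 100, 1)

# (semantic-only, hybrid) pairs copied from the Key Metrics table.
scores = {
    "faithfulness": (0.241, 0.377),
    "answer_relevancy": (0.729, 0.596),
    "context_precision": (0.128, 0.101),
    "context_recall": (0.377, 0.273),
    "substring_hit": (0.333, 0.357),
}

deltas = {name: delta_pp(base, hyb) for name, (base, hyb) in scores.items()}

for name, value in deltas.items():
    print(f"{name}: {value:+.1f}pp")
```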
Raw data:
Reading the numbers (honest portfolio interpretation)
This is not a clean "hybrid wins everything" story. Three things
are happening:
- Hybrid genuinely reduces hallucination. Faithfulness +13.6pp
  means noticeably fewer unsupported claims in the answer. The
  broader retrieval pool gives the LLM more anchors to cite from
  instead of guessing.
- Hybrid hurts on keyword queries (-20pp substring). RRF can
  bring in tsvector matches that are keyword-relevant but
  semantically distant. On the 8B-instant pipeline, the model
  struggles to synthesize from this less-coherent context. A 70B
  model would likely close most of this gap, but free-tier TPD
  prevented a same-day 70B benchmark.
- Sample is small (14-15q). Each per-category bucket has only
  3-5 queries. Standard error on these deltas is wide. Treat any
  single delta as carrying ±5-10pp of noise.
The two metrics with the strongest signal are Faithfulness and
Substring hit on semantic-phrased queries; both move in hybrid's
favor. The other deltas are within noise or plausibly explained by
8B's synthesis weakness on heterogeneous context.
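To make the small-sample caveat concrete, here is a rough sketch of the binomial standard error for a hit-rate delta. It applies only to binary per-query metrics such as substring hit. It assumes the two runs are independent samples, which is a simplification because both runs use the same queries. The bucket size of 5 is an assumption taken from the upper end of the 3-5 range above.

```python
import math

def proportion_se(p: float, n: int) -> float:
    """Binomial standard error of a hit rate p measured over n queries."""
    return math.sqrt(p * (1 - p) / n)

def delta_se(p1: float, n1: int, p2: float, n2: int) -> float:
    """Standard error of (p2 - p1), treating the two runs as independent."""
    return math.sqrt(proportion_se(p1, n1) ** 2 + proportion_se(p2, n2) ** 2)

# Keyword-leaning bucket: 0.400 (semantic-only) vs 0.200 (hybrid).
# n=5 per run is an assumed bucket size, not a figure from the raw data.
se = delta_se(0.400, 5, 0.200, 5)
print(f"standard error of the delta: {se * 100:.1f}pp")
```

Under those assumptions the standard error of the keyword-bucket delta is roughly 28pp, which is one way to see why a single bucket-level delta deserves caution.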
Methodology
Pipeline (both runs)
- Endpoint: https://trustrag-production.up.railway.app
- Generation model: llama-3.1-8b-instant via Groq free tier
  (70B's 100K TPD was exhausted by 09:48 from earlier benchmark
  attempts; production normally runs 70B-versatile)
- Prompt style: merged JSON output (answer + in-prompt self-check
  for hallucination flags, HTTP path; SIGN-112 disclosure)
- Retrieval:
  - Semantic-only: pgvector cosine similarity, top-20 → top-5
  - Hybrid: pgvector + Postgres tsvector plainto_tsquery, fused via
    Reciprocal Rank Fusion with k=60 (Cormack et al. 2009)
- Cache bypass: nocache=True per SIGN-111 to ensure the benchmark
  measures the real pipeline, not a warm cache
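For readers unfamiliar with Reciprocal Rank Fusion, the fusion step in the hybrid path can be sketched as follows. Each document scores the sum of 1 / (k + rank) over every ranked list it appears in, with 1-based ranks and k=60. This is an illustrative sketch, not the TrustRAG implementation. The function name, the tie-breaking rule, and the chunk IDs are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives sum(1 / (k + rank)) over the lists that
    contain it, where rank is its 1-based position in that list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical result lists from the two retrievers.
semantic_hits = ["c1", "c2", "c3"]  # pgvector cosine order
keyword_hits = ["c3", "c1", "c4"]   # tsvector match order

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused[:5])
```

Because only ranks enter the score, cosine similarities and text-search scores never need to be put on a common scale, which is the usual reason for choosing RRF over weighted score fusion.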
RAGAS judge
- Judge LLM: groq-llama-3.1-8b-instant (Groq free tier, 500K
  TPD vs 70B's 100K, fast inference for serial RAGAS calls)
- Embeddings (for answer_relevancy): gemini-embedding-001
- Settings: max_workers=1 (serial; Groq RPM ~30 chokes parallel
  RAGAS bursts), max_retries=6, max_wait=30s,
  answer_relevancy.strictness=1 (Groq endpoint rejects n>1)
Deterministic metrics (no LLM)
- Substring hit: did the answer text contain the dataset's
  expected_answer_substring (e.g. "6 feet", "PPE", "permit")?
  This is the real-data signal; it works on production responses.
- Hit@5 (chunk_id): not used for headline numbers because the
  synthetic dataset uses placeholder chunk IDs (c_042 etc.) that
  don't match real DB UUIDs. Reported as 0.0 in the raw JSONs;
  ignore it.
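The substring-hit check described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in trustrag_eval. Case-insensitive matching and the `answer` row key are assumptions. Only `expected_answer_substring` comes from the dataset description, and the sample rows are invented.

```python
def substring_hit(answer: str, expected_substring: str) -> bool:
    """True if the expected substring occurs in the answer.

    Case-insensitive comparison is an assumption made for this sketch.
    """
    return expected_substring.casefold() in answer.casefold()

def substring_hit_rate(rows) -> float:
    """Fraction of rows whose answer contains the expected substring."""
    if not rows:
        return 0.0
    hits = sum(
        substring_hit(row["answer"], row["expected_answer_substring"])
        for row in rows
    )
    return hits / len(rows)

# Hypothetical rows, for illustration only.
rows = [
    {"answer": "Guardrails are required above 6 feet.",
     "expected_answer_substring": "6 feet"},
    {"answer": "Workers must wear protective equipment.",
     "expected_answer_substring": "PPE"},
]
rate = substring_hit_rate(rows)
print(rate)
```

The second row shows the metric's main limitation: a correct paraphrase ("protective equipment") still counts as a miss when the literal substring is absent.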
Why a different judge for these benchmarks vs the design spec
The spec called for Gemini 2.5 Flash Lite as judge. Gemini works for
faithfulness scoring on tiny samples but its AFC (Automatic Function
Calling) loops make 60-job RAGAS runs take 2+ hours on free tier.
Groq 8B is roughly 6x faster (4s vs 26s per metric on a smoke test),
and the binary 0/1 verdicts that RAGAS produces are not very sensitive
to judge quality at this granularity. Documented in
packages/trustrag-eval/src/trustrag_eval/ragas_pipeline.py comments.
Tradeoffs Disclosed
- Merged-prompt HTTP path has known LLM self-check bias
  (in-prompt, same model checks own claims). RAGAS faithfulness
  (Groq-judged, independent metric instance) IS the bias-free
  reference even though it shares the same model family. See
  SIGN-112 in plans/guardrails.md.
- 8B-on-8B is harsh-but-consistent. Both pipeline and judge run
  Llama 3.1 8B Instant. Absolute scores will be lower than what 70B
  would produce, but deltas between semantic and hybrid use the
  same setup and are apples-to-apples.
- Q014 (1 of 15) failed both runs with transient 502 (Railway
  → Groq → 429 propagation). RAGAS aggregates over the 13-14 valid
  rows. Not enough to invalidate trends, but error bars get wider.
- Production reverts to llama-3.3-70b-versatile post-benchmark.
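Aggregating over valid rows, as described for the Q014 failure, means averaging only the rows that produced a score. A minimal sketch of that idea follows. It is not RAGAS's own aggregation code. The helper name and the sample scores are hypothetical, and it assumes failed rows appear as None or NaN.

```python
import math

def mean_over_valid(scores):
    """Average the scores that are present; return (mean, valid_count).

    Rows that failed are assumed to be recorded as None or NaN.
    """
    valid = [s for s in scores if s is not None and not math.isnan(s)]
    if not valid:
        return float("nan"), 0
    return sum(valid) / len(valid), len(valid)

# Hypothetical per-query scores; the third query failed with a 502.
per_query = [1.0, 0.0, None, 0.5]
mean, n_valid = mean_over_valid(per_query)
print(f"mean={mean:.3f} over {n_valid} valid rows")
```

Returning the valid-row count next to the mean makes it visible when two runs were averaged over different denominators.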
Reproduce
# Set keys (Groq for pipeline + judge, Gemini for embeddings only)
export GROQ_API_KEY="..."
export GOOGLE_API_KEY="..." # https://aistudio.google.com/app/apikey
pip install -e packages/trustrag-eval
# Semantic baseline
railway variables --set "HYBRID_ENABLED=false" --set "GROQ_MODEL=llama-3.1-8b-instant"
# (or use the dashboard; wait ~2min for redeploy)
curl -X POST https://trustrag-production.up.railway.app/api/query/admin/clear-cache
python -m trustrag_eval.ragas_pipeline \
--endpoint https://trustrag-production.up.railway.app \
--dataset eval/synthetic_queries.json \
--limit 15 --mode semantic \
--pipeline-model llama-3.1-8b-instant \
--output eval/results/my-semantic.json
# Hybrid (after UTC day rolls for fresh Groq TPD if needed)
railway variables --set "HYBRID_ENABLED=true"
curl -X POST https://trustrag-production.up.railway.app/api/query/admin/clear-cache
python -m trustrag_eval.ragas_pipeline \
--endpoint https://trustrag-production.up.railway.app \
--dataset eval/synthetic_queries.json \
--limit 15 --mode hybrid \
--pipeline-model llama-3.1-8b-instant \
--output eval/results/my-hybrid.json
Each run takes ~30-35 min (10 min collection + 20-25 min serial RAGAS
on Groq 8B judge with retry backoffs).
Changes (since v0.2.0-streaming)
- feat(backend) merged generation + self-check prompt (fe50642)
- feat(backend) Postgres-backed query cache (ef908cd)
- perf(backend) eliminate 6 redundant fastembed calls per query (c31af8f)
- feat(eval) Gemini wrapper for RAGAS (ebc07a8)
- feat(eval) Groq 8B judge (7bfa775) + tuning + substring_hit (75f460d)
- eval semantic 15q + hybrid 15q (75f460d, f262f93)
- docs(spec) v2-completion design (7c85b69)
- docs(plan) 11-task implementation plan (f5b7217)
What's next (v0.5.0-mcp + v1.0.0)
- MCP Claude Desktop end-to-end demo + screenshots (P5-GATE)
- README full rewrite with hero + badges + integrations
- v1.0.0 release with consolidated changelog
Future benchmark improvements (post-v1.0.0)
- Re-run on llama-3.3-70b-versatile once free-tier TPD is fresh;
  this should narrow or close the answer_relevancy / context_recall
  gaps driven by 8B's synthesis weakness on heterogeneous context.
- Expand to full 30-query dataset (5+5+5 → 10+10+10) for tighter
  error bars.
- Replace placeholder chunk_id ground-truth with real UUIDs once the
  dataset can be built against actual ingested PDFs (was synthesized
  pre-ingestion).
- Consider cross-encoder rerank stage between RRF and top-5 to
  address the keyword-query degradation observed here.