v1.0.0 - Production-Grade Trust-Verified RAG

@jigangz jigangz released this 25 Apr 23:33

Release date: 2026-04-24

A 13-night high-intensity sprint took TrustRAG from a single-app demo
to a production-deployed, ecosystem-integrated platform with measured
quality, three published PyPI packages, and Claude Desktop MCP
integration verified end-to-end.

🎯 Highlights

  • WebSocket streaming with cancellation, multi-stage status,
    and error frames (TTFT < 500ms target)
  • Hybrid retrieval (pgvector cosine + Postgres tsvector +
    Reciprocal Rank Fusion k=60), benchmarked vs semantic-only
    baseline — see "Measured Quality" below
  • 3 PyPI packages:
    • trustrag-langchain — Retriever, Tool, LangGraph multi-hop agent with trust budget
    • trustrag-mcp — MCP server, 3 tools (query / upload / audit), stdio
    • trustrag-eval — RAGAS pipeline with Groq / Gemini judge variants, deterministic substring-hit metric, CLI runner
  • MCP in Claude Desktop verified end-to-end with production
    Railway backend — see docs/releases/v0.5.0-mcp.md
  • n8n workflow templates (3) — doc ingestion, Slack trust gate,
    daily low-confidence digest
  • Live deployment: Vercel frontend + Railway backend with
    pgvector + UptimeRobot keep-alive — $0/month infrastructure
  • Latency engineered for free-tier hardware: 30-60s → 5-10s
    (cache miss) / sub-300ms (cache hit), via embedding cleanup, a
    merged generation+self-check prompt, and a Postgres-backed
    query cache
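
The Reciprocal Rank Fusion step named in the hybrid retrieval bullet can be sketched as follows. This is a minimal illustration of the standard RRF formula with k=60, not TrustRAG's actual code: the function name `reciprocal_rank_fusion` and the document ids are hypothetical, and the two input lists stand in for the pgvector cosine ranking and the tsvector ranking.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears
    in, with rank starting at 1. Hypothetical helper, for illustration.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Assumed example inputs: one ranking per retriever.
semantic = ["doc_a", "doc_b", "doc_c"]  # stand-in for pgvector cosine order
keyword = ["doc_c", "doc_a", "doc_d"]   # stand-in for tsvector order
fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)
```

Documents that appear in both lists accumulate two terms, so agreement between the retrievers outranks a single high placement in one of them.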

📊 Measured Quality (15q synthetic, 8B-pipeline + Groq judge)

Metric                     Semantic   Hybrid   Δ
Faithfulness (RAGAS)       0.241      0.377    +13.6pp ✓
Substring Hit (overall)    0.333      0.357    +2.4pp ✓
  ↳ Semantic queries       0.300      0.400    +10pp ✓
  ↳ Keyword queries        0.400      0.200    -20pp
Answer Relevancy           0.729      0.596    -13.3pp
Context Precision          0.128      0.101    -2.7pp
Context Recall             0.377      0.273    -10.4pp

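On this 15-question set, hybrid retrieval improves faithfulness by
13.6pp (fewer unsupported statements) and substring hit on semantic
queries by 10pp. Answer relevancy, context precision, and context
recall regress, and substring hit on keyword queries drops 20pp; the
working explanation is that llama-3.1-8b-instant synthesizes poorly
over the broader RRF context. A 70B re-run is a planned follow-up.
Full methodology and analysis are in docs/releases/v0.3.0-hybrid.md.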
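
For readers who want to reproduce the Δ column, here is a small sketch of how a substring-hit rate and a percentage-point delta might be computed. The exact definition used by trustrag-eval is not given in these notes, so the rule below (a question is a hit if any expected string occurs in the answer, case-insensitively) is an assumption, and `substring_hit`, `hit_rate`, and `delta_pp` are hypothetical names.

```python
def substring_hit(answer, expected_substrings):
    """Return 1.0 if any expected substring occurs in the answer, else 0.0.

    Assumed rule: case-insensitive match after collapsing whitespace.
    """
    text = " ".join(answer.lower().split())
    needles = (" ".join(s.lower().split()) for s in expected_substrings)
    return float(any(needle in text for needle in needles))


def hit_rate(samples):
    """Mean substring hit over (answer, expected_substrings) pairs."""
    return sum(substring_hit(a, e) for a, e in samples) / len(samples)


def delta_pp(baseline, candidate):
    """Difference between two 0-1 scores, in percentage points."""
    return round((candidate - baseline) * 100, 1)


# Scores taken from the table above.
print(delta_pp(0.241, 0.377))  # faithfulness
print(delta_pp(0.729, 0.596))  # answer relevancy
```

Because the metric is a plain string match, it is deterministic and needs no judge model, which makes it a useful cross-check on the LLM-judged RAGAS scores.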

⚡ Latency Profile (Railway production)

Path                      Latency
Cache hit                 ~300ms (p95 < 500ms)
Cache miss, merged HTTP   5-10s
Streaming TTFT            < 500ms (Llama 70B + Groq)
Cold start                0 (UptimeRobot 5-min ping)
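
The gap between the cache-hit and cache-miss rows comes from skipping the pipeline entirely on a repeat question. The sketch below shows that read-through pattern under stated assumptions: the key is a SHA-256 of the normalized question, entries expire after a TTL, and an in-memory dict stands in for the Postgres table. `QueryCache`, `answer_query`, and `run_pipeline` are hypothetical names, not the project's API.

```python
import hashlib
import time


class QueryCache:
    """Read-through cache sketch; a dict stands in for a Postgres table."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._rows = {}  # key -> (expires_at, answer)
        self._ttl = ttl_seconds
        self._clock = clock

    @staticmethod
    def key_for(question):
        # Normalize case and whitespace so trivial variants share a key.
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, question):
        row = self._rows.get(self.key_for(question))
        if row is None or row[0] <= self._clock():
            return None  # missing or expired
        return row[1]

    def put(self, question, answer):
        expires_at = self._clock() + self._ttl
        self._rows[self.key_for(question)] = (expires_at, answer)


def answer_query(question, cache, run_pipeline):
    """Return (answer, cache_hit); run the slow pipeline only on a miss."""
    cached = cache.get(question)
    if cached is not None:
        return cached, True
    answer = run_pipeline(question)
    cache.put(question, answer)
    return answer, False
```

Exact-match keys only help with repeated questions; paraphrases still take the cache-miss path.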

📦 PyPI Packages

pip install trustrag-langchain  # 0.1.0
pip install trustrag-mcp         # 0.1.2
pip install trustrag-eval        # 0.1.0

🌐 Live URLs

🛡️ Architectural Tradeoffs Disclosed

  1. Merged-prompt HTTP path uses in-prompt LLM self-check (single
    Groq call returns {answer, self_check.unsupported_claims}).
    Known ~5-10% bias since the same model checks its own answer.
    RAGAS faithfulness (independent evaluation) is the bias-free
    reference. SIGN-112 in plans/guardrails.md.
  2. Streaming WebSocket path keeps 2-call architecture (separate
    hallucination check) for stricter fact-checking under the
    token-flow UX where the second call's latency is hidden.
  3. Railway free tier (1GB RAM / 0.5 vCPU): UptimeRobot keep-alive
    prevents cold sleeps but doesn't buy more CPU. Embedding query
    stays on the critical path at ~2-5s.
  4. Benchmarks ran on llama-3.1-8b-instant because the 70B daily
    token quota was exhausted. Production reverts to
    llama-3.3-70b-versatile for both pipeline and (when needed)
    trust-verification calls.
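
To make tradeoff 1 concrete, the sketch below parses a merged response of the shape `{answer, self_check.unsupported_claims}` into a typed result. Only those two field names come from these notes; the nesting as a JSON object, the `MergedResult` class, and the `parse_merged_response` helper are assumptions for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class MergedResult:
    """Answer plus the model's own list of claims it could not support."""

    answer: str
    unsupported_claims: list = field(default_factory=list)

    @property
    def self_check_passed(self):
        # Passing means the model flagged nothing; since the same model
        # wrote the answer, treat this as a weak signal.
        return not self.unsupported_claims


def parse_merged_response(raw):
    """Parse the single-call JSON payload (assumed shape) into a result."""
    payload = json.loads(raw)
    claims = payload.get("self_check", {}).get("unsupported_claims", [])
    return MergedResult(answer=payload["answer"], unsupported_claims=list(claims))


raw = '{"answer": "Use a harness above 6 ft.", "self_check": {"unsupported_claims": []}}'
result = parse_merged_response(raw)
print(result.self_check_passed)
```

A missing `self_check` object is treated here as "no claims flagged"; a stricter client could instead treat it as a failed check.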

🔄 Breaking Changes

None. v0.1's API contract is preserved — QueryResponse.hallucination_check.flags,
/api/query/, WebSocket message shapes — all unchanged.

🗺️ Roadmap

  • v1.1: DOCX + HTML ingestion
  • v1.2: Session auth + per-user rate limits
  • v1.3: Cross-encoder rerank between RRF and top-5 (addresses
    keyword-query regression observed in v0.3.0 benchmark)
  • v2.0: Multi-tenant + usage quotas

Commits in this release (since v0.4.0-langchain)

Full git log:

  • Spec & plan: 7c85b69, f5b7217
  • WS1 backend opt: c31af8f (embedding cleanup) + ed9e64b (ruff + /health HEAD)
  • Cache: ef908cd + 2b643f7
  • Merged prompt: fe50642
  • Eval (Gemini): ebc07a8 + 859b05b + ef01285
  • Eval (Groq judge + tuning): 7bfa775 + 75f460d
  • Benchmarks: 75f460d (semantic) + f262f93 (hybrid)
  • v0.3.0-hybrid release: 5f96b12
  • v0.5.0-mcp draft: 07da34f

🙏 Credits

  • Engineering & Architecture: Jigang Zhou (Harry) — github.com/jigangz
  • Pair-programming partner: Claude Code (Anthropic)

Built during the 2026-04 sprint as a portfolio project for SWE, ML
engineer, and founding engineer roles. Production-grade decisions
were made under realistic free-tier constraints; every architecture
choice is documented in docs/superpowers/specs/.


Install now:

pip install trustrag-langchain trustrag-mcp trustrag-eval

Try the demo: https://trustrag.vercel.app