v2.1.0 — MLOps evaluation layer (RAGAS + CI regression gate + GoldenRunner)

@irfanalidv irfanalidv released this 03 Apr 14:40
· 4 commits to main since this release

What's new

ragfallback now ships an MLOps evaluation layer: golden-dataset runs, RAGAS scoring, baseline regression checks, adversarial query simulation, MLflow logging, and Locust load-test generation.


ragfallback/mlops/ — new package

GoldenRunner

Runs your retrieval pipeline against a labeled golden dataset (JSON file or list[dict]), tracks per-sample latency, and computes recall@3, recall@5, and P95 latency across all samples. Fully async via asyncio.gather.
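
To make the GoldenRunner description concrete, here is a minimal sketch of the same idea: run every golden sample concurrently with asyncio.gather, time each one, and aggregate recall@3, recall@5, and P95 latency. It does not use the ragfallback API. The names fake_retrieve, run_sample, run_golden, the sample keys, and the nearest-rank percentile method are all assumptions made for illustration.

```python
import asyncio
import math
import time

# Hypothetical in-memory retriever standing in for a real pipeline.
RANKED = {
    "q1": ["d1", "d9", "d2", "d7", "d8"],
    "q2": ["d5", "d6", "d4", "d3", "d0"],
}


async def fake_retrieve(query: str) -> list[str]:
    await asyncio.sleep(0.001)  # simulate I/O latency
    return RANKED.get(query, [])


def recall_at_k(retrieved, relevant, k):
    """Fraction of distinct relevant ids found in the top-k results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(set(retrieved[:k]) & relevant_set) / len(relevant_set)


def p95(values):
    """Nearest-rank 95th percentile (an assumed method, others exist)."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


async def run_sample(sample, retrieve):
    # Time a single retrieval call and score it against the labels.
    start = time.perf_counter()
    retrieved = await retrieve(sample["query"])
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {
        "recall@3": recall_at_k(retrieved, sample["relevant_ids"], 3),
        "recall@5": recall_at_k(retrieved, sample["relevant_ids"], 5),
        "latency_ms": latency_ms,
    }


async def run_golden(samples, retrieve):
    if not samples:
        raise ValueError("golden dataset is empty")
    # gather schedules all samples concurrently and preserves input order.
    results = await asyncio.gather(*(run_sample(s, retrieve) for s in samples))
    n = len(results)
    return {
        "recall@3": sum(r["recall@3"] for r in results) / n,
        "recall@5": sum(r["recall@5"] for r in results) / n,
        "p95_latency_ms": p95([r["latency_ms"] for r in results]),
    }


SAMPLES = [
    {"query": "q1", "relevant_ids": ["d1", "d2"]},
    {"query": "q2", "relevant_ids": ["d3", "d4"]},
]
```

Calling asyncio.run(run_golden(SAMPLES, fake_retrieve)) returns a dict with the three aggregate metrics. Latency measured inside concurrently scheduled tasks includes event-loop waiting time, which is worth keeping in mind when reading P95 numbers.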

RagasHook

Wraps RAGAS evaluation: faithfulness, answer relevance, context precision, and context recall. If ragas is not installed, it falls back to heuristic scoring and logs a warning instead of crashing.
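
The fallback behaviour can be illustrated with a small sketch of the optional-dependency pattern. This is not RagasHook itself: the function names, the ragas_scorer parameter, and the token-overlap heuristic are assumptions chosen for illustration, and the actual heuristic used by the library may differ.

```python
import importlib.util
import logging
import re

logger = logging.getLogger("ragas_fallback_sketch")


def ragas_available() -> bool:
    # find_spec returns None when a top-level package is not installed.
    return importlib.util.find_spec("ragas") is not None


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def heuristic_faithfulness(answer: str, contexts: list[str]) -> float:
    """Share of answer tokens that also appear in the retrieved contexts."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set().union(*(_tokens(c) for c in contexts))
    return len(answer_tokens & context_tokens) / len(answer_tokens)


def score_faithfulness(answer, contexts, ragas_scorer=None):
    """Use the supplied scorer if there is one, else warn and fall back.

    ragas_scorer is a hypothetical callable wrapping a real RAGAS call;
    it is left to the caller because the RAGAS API varies by version.
    """
    if ragas_scorer is not None and ragas_available():
        return ragas_scorer(answer, contexts)
    logger.warning("ragas unavailable; using heuristic token-overlap score")
    return heuristic_faithfulness(answer, contexts)
```

With no scorer supplied, score_faithfulness("Paris is the capital", ["The capital of France is Paris"]) takes the heuristic path and logs one warning.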

BaselineRegistry

Stores metric snapshots per dataset in a committed JSON file. compare_or_fail() raises RegressionError if any quality metric drops more than 5%, or P95 latency spikes more than 12% vs the stored baseline.
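
A rough sketch of the comparison logic follows. It assumes the thresholds are relative changes against the baseline value and that latency is stored under a key named p95_latency_ms; neither detail is confirmed by these notes, and the function below is a standalone illustration, not the BaselineRegistry method.

```python
class RegressionError(Exception):
    """Raised when a metric regresses past its allowed threshold."""


def compare_or_fail(
    current: dict,
    baseline: dict,
    quality_threshold: float = 0.05,
    latency_threshold: float = 0.12,
    latency_key: str = "p95_latency_ms",  # assumed key name
) -> None:
    failures = []
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue  # nothing to compare against
        change = (current[name] - base) / base  # relative change
        if name == latency_key:
            # Latency regresses when it goes up.
            if change > latency_threshold:
                failures.append(f"{name} rose {change:.1%}")
        elif change < -quality_threshold:
            # Quality metrics regress when they go down.
            failures.append(f"{name} dropped {-change:.1%}")
    if failures:
        raise RegressionError("; ".join(failures))


BASELINE = {"recall@5": 0.80, "p95_latency_ms": 100.0}
```

With this baseline, a run reporting recall@5 of 0.78 and P95 of 110 ms passes (a 2.5% drop and a 10% rise), while recall@5 of 0.70 or P95 of 115 ms raises RegressionError.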

QuerySimulator

Generates adversarial query mixes from any base query set:

  • short_keyword — first 2 content words only
  • long_nl — expanded with verbose instruction prefix
  • ambiguous — proper nouns stripped
  • out_of_domain — completely unrelated topic injection

simulate_unhappy_paths() produces all 4 types for every input query (4× expansion).
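
The four variants above can be approximated with a few string transforms. The sketch below is an assumption-heavy illustration: the stopword list, the verbose prefix, the capitalization test for proper nouns, and the injected off-topic sentence are all invented here and are not taken from QuerySimulator.

```python
import re

# Hypothetical helper data, chosen only for this sketch.
STOPWORDS = {
    "a", "an", "the", "is", "are", "of", "in", "on", "for", "to",
    "what", "how", "who", "does", "do",
}
VERBOSE_PREFIX = (
    "Please answer in detail, citing every relevant passage, "
    "the following question: "
)
OOD_SUFFIX = "Also, how long should sourdough proof?"


def short_keyword(query: str) -> str:
    # Keep only the first two non-stopword tokens.
    words = re.findall(r"\w+", query)
    content = [w for w in words if w.lower() not in STOPWORDS]
    return " ".join(content[:2])


def long_nl(query: str) -> str:
    return VERBOSE_PREFIX + query


def ambiguous(query: str) -> str:
    # Crude proper-noun filter: drop capitalized tokens after the first word.
    tokens = query.split()
    kept = [t for i, t in enumerate(tokens) if i == 0 or not t[:1].isupper()]
    return " ".join(kept)


def out_of_domain(query: str) -> str:
    return f"{query} {OOD_SUFFIX}"


TRANSFORMS = {
    "short_keyword": short_keyword,
    "long_nl": long_nl,
    "ambiguous": ambiguous,
    "out_of_domain": out_of_domain,
}


def simulate_unhappy_paths(queries: list[str]) -> list[dict]:
    # One variant per transform per query, so the output is 4x the input.
    return [
        {"type": name, "query": fn(q), "source": q}
        for q in queries
        for name, fn in TRANSFORMS.items()
    ]
```

For the query "What is the capital of France?", short_keyword yields "capital France" and ambiguous yields "What is the capital of".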

MLflowLogger

Logs all GoldenReport fields as MLflow metrics and params. No-op if mlflow is not installed.

generate_locustfile(output_path, endpoint)

Writes a ready-to-run Locust load test file simulating realistic RAG traffic — short keyword (40%), long NL (20%), out-of-domain (10%).
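
As an illustration of what such a generator might do, the sketch below renders a Locust file from a template and writes it to disk. Everything specific is an assumption: the function names, the /query default endpoint, the JSON payload shape, and the sample queries. The three shares listed above sum to 70%, so the sketch only preserves their 4:2:1 ratio through Locust task weights.

```python
from pathlib import Path

# Template for the generated file. Doubled braces survive str.format
# as literal braces in the output.
LOCUSTFILE_TEMPLATE = '''\
from locust import HttpUser, between, task


class RagUser(HttpUser):
    wait_time = between(1, 3)

    @task(4)
    def short_keyword(self):
        self.client.post("{endpoint}", json={{"query": "refund policy"}})

    @task(2)
    def long_nl(self):
        self.client.post(
            "{endpoint}",
            json={{"query": "Please explain in detail how refunds are processed"}},
        )

    @task(1)
    def out_of_domain(self):
        self.client.post("{endpoint}", json={{"query": "best sourdough recipe"}})
'''


def render_locustfile(endpoint: str = "/query") -> str:
    """Return the locustfile source for a given (hypothetical) endpoint."""
    return LOCUSTFILE_TEMPLATE.format(endpoint=endpoint)


def write_locustfile(output_path: str, endpoint: str = "/query") -> str:
    """Write the rendered locustfile to output_path and return its text."""
    text = render_locustfile(endpoint)
    Path(output_path).write_text(text, encoding="utf-8")
    return text
```

A file produced this way would be run with the Locust CLI, passing the target host on the command line.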


CI regression gate

A new mlops-regression-gate job runs on every push to main:

  1. Builds golden dataset from SQuAD (CC BY-SA 4.0, no API key needed)
  2. Indexes passages in ChromaDB using all-MiniLM-L6-v2 (local, no API key)
  3. Runs GoldenRunner async across 20 samples
  4. Calls compare_or_fail() against committed examples/baselines.json
  5. Exits 0 (pass) or 1 (regression detected)
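
The last two steps amount to turning a raised regression into a process exit code. A minimal sketch of that mapping is shown here; run_gate and the local RegressionError class are hypothetical stand-ins, and examples/ci_regression_gate.py may be structured differently.

```python
import sys


class RegressionError(Exception):
    """Stand-in for the error raised when a baseline check fails."""


def run_gate(check) -> int:
    """Run a zero-argument check and map its outcome to an exit code."""
    try:
        check()
    except RegressionError as exc:
        print(f"REGRESSION: {exc}", file=sys.stderr)
        return 1
    print("PASS: no regression against baseline")
    return 0


# In a CI script the result would be handed to the shell, for example:
#     sys.exit(run_gate(my_check))
```

Returning the code from a function instead of exiting inside it keeps the gate easy to unit test.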

Bug fixes

  • recall_at_k now counts distinct relevant docs in top-k so duplicates cannot push recall above 1.0
  • BaselineRegistry.compare_or_fail accepts a separate latency_threshold parameter (default 0.12) for looser P95 gating in noisy CI environments
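
The recall_at_k fix can be shown with a small before-and-after sketch. The naive version below is a guess at how duplicates could inflate the score, not the library's previous code; the distinct version shows the behaviour the fix describes.

```python
def naive_recall_at_k(retrieved_ids, relevant_ids, k):
    """Counts every hit in the top-k, so repeated ids are counted repeatedly."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Counts distinct relevant ids in the top-k, so the result is at most 1.0."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = set(retrieved_ids[:k]) & relevant
    return len(hits) / len(relevant)


# A retriever that returns the same chunk id three times in a row.
RETRIEVED = ["d1", "d1", "d1", "d2", "d3"]
RELEVANT = ["d1"]
```

With these inputs and k=3, the naive version reports 3.0 while the distinct version reports 1.0.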

Install

pip install "ragfallback[mlops]"
python examples/build_golden_dataset.py
python examples/ci_regression_gate.py

Full changelog

See CHANGELOG.md