docs(v0.2.2): CHANGELOG + migration guide for Context Relevance rubric drift (G4) #39
…c drift (G4)

Cycle 3 PearMedica audit (2026-04-15) observed every case in the 12-sample golden set scoring higher on Context Relevance under v0.2.1's llm-judge than under v0.1.3's (~0.25 on aggregate), with identical telemetry, identical retrieval code, identical judge model. The production pipeline did not change between the two runs. The measurement did.

v0.2.1 shipped this rubric change without documenting it. A tool whose entire purpose is honest measurement cannot silently make one of its metrics more lenient and call the release clean. This commit ships the retroactive disclosure.

Root cause is architectural, not a wording tweak: v0.1.3 scored Context Relevance by per-chunk averaging. The judge rated each retrieved chunk individually 1-5, normalised via mean(ratings)/5. See packages/evaluator/src/rag_forge_evaluator/metrics/context_relevance.py on the v0.1.3 tag. v0.2.0 switched to a holistic 0-1 score across the whole context as part of the combined-pass llm-judge refresh (commit 330465f).

For a hybrid retrieval pipeline returning a mix of relevant and irrelevant chunks (the normal case), these methodologies are not equivalent. Per-chunk averaging penalises irrelevant chunks in the aggregate. Holistic scoring typically rewards the presence of any relevant information. Holistic is systematically more lenient; that is the mechanism behind the +0.25 drift.

Changes:
- CHANGELOG.md: new 0.2.2 entry (unreleased) with a dedicated "Measurement rubric changes" section explaining the drift, the mechanism, the before/after prompt excerpts, and concrete re-baselining guidance. Also adds a retroactive 0.2.1 entry for the partial-publish recovery so the changelog is contiguous.
- apps/docs/content/upgrading-v0-2-x.mdx: user-facing migration page mirroring the CHANGELOG section with a concrete step-by-step re-baselining workflow. Linked from the Nextra sidebar via _meta.ts.
A future --rubric=strict-v1 flag that pins the judge prompt to the v0.1.x per-chunk-averaging shape is noted as out of scope for v0.2.2 and deferred to a later release. The rationale for not fixing it now is kept in the CHANGELOG so downstream users can open a tracking issue if they need it.
Walkthrough
Added version 0.2.2 release notes documenting three bug fixes: RAGAS adapter method compatibility, evaluation skip reporting accuracy, and cross-package version synchronization. Introduced a new upgrade guide documenting v0.1.x to v0.2.x migration with metric measurement changes and recommended workflow steps.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/docs/content/upgrading-v0-2-x.mdx`:
- Line 48: Replace the hardcoded future date in the example annotation string
"Context Relevance gate bumped 0.60 -> 0.80 on 2026-04-17 to account for
RAG-Forge v0.2.x holistic rubric — see docs." with a neutral, non-expiring
placeholder such as "YYYY-MM-DD" or "<today's date>" so the example remains
valid over time; update that exact example text to read e.g. "Context Relevance
gate bumped 0.60 -> 0.80 on YYYY-MM-DD to account for RAG-Forge v0.2.x holistic
rubric — see docs."
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 728a8d22-113d-4b90-b2b5-76d719ce68b1
📒 Files selected for processing (3)
- CHANGELOG.md
- apps/docs/content/_meta.ts
- apps/docs/content/upgrading-v0-2-x.mdx
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Lint, Typecheck & Test
🔇 Additional comments (10)
apps/docs/content/_meta.ts (1)
13-13: LGTM! The navigation entry is correctly configured and matches the new migration guide filename.
apps/docs/content/upgrading-v0-2-x.mdx (5)
1-11: LGTM! The opening section clearly warns users about the rubric drift and provides a concise TL;DR. The strong language about the metric getting "silently more lenient" is appropriate for a measurement tool and aligns with the PR's objective of honest disclosure.
12-40: LGTM! The mechanistic explanation is clear and includes concrete examples. The mathematical calculation in Line 36 is correct, and the side-by-side comparison of the two scoring methods effectively illustrates why holistic scoring produces higher scores.
50-55: LGTM! This section appropriately explains why the other three metrics don't require re-baselining and clarifies that the refusal-aware scoring feature is not a factor in the observed drift.
56-59: LGTM! This section honestly discloses the limitation and provides a workaround with clear tradeoffs. The mention of the future `--rubric=strict-v1` flag aligns with the deferred work noted in the PR objectives.
60-62: LGTM! This closing section effectively articulates the philosophical rationale for retroactive disclosure and provides the timeline context. The strong stance on honest measurement is appropriate for a tool focused on evaluation integrity.
CHANGELOG.md (4)
13-48: LGTM! The measurement rubric changes section is comprehensive and consistent with the migration guide. The technical explanation is clear, includes concrete examples, and provides actionable guidance for users upgrading from v0.1.x. The mathematical example is correct and the prompt excerpts are properly attributed to specific file paths and commits.
55-66: LGTM! The retroactive 0.2.1 entry clearly explains the partial-publish issue and the recovery steps taken. The information aligns with the PR objectives stating this is a retroactive disclosure.
51-51: CHANGELOG line 51 lists non-existent methods; correct to actual implementations.
The CHANGELOG references `embed_text`, `embed_texts`, and `set_run_config` as methods asserted in the contract test, but `RagForgeRagasEmbeddings` only implements four methods:
- `embed_query` / `embed_documents` (sync)
- `aembed_query` / `aembed_documents` (async)
The three methods `embed_text`, `embed_texts`, and `set_run_config` do not exist in the codebase. Additionally, the referenced test files (`tests/test_ragas_adapters_contract.py` and `tests/test_ragas_adapters_e2e.py`) cannot be found; the actual test file is `packages/evaluator/tests/test_ragas_adapters.py`.
Update line 51 to remove the non-existent methods and correct the test file path references.
> Likely an incorrect or invalid review comment.
9-9: CHANGELOG line 9 lists incorrect method names for the ragas 0.4.x interface.
The ragas 0.4.x contract documented in `ragas_adapters.py` (lines 10–22) specifies only `generate_text()` and `agenerate_text()` as the required methods. Line 9 incorrectly claims the missing methods were `.generate()` / `is_finished()` / `get_temperature()` / `set_run_config()`, none of which appear in the ragas 0.4.x BaseRagasLLM interface. The actual missing methods were `generate_text()` and `agenerate_text()`. Correct the method names to match the documented ragas contract.
> Likely an incorrect or invalid review comment.
CodeRabbit on PR #39 noted the example annotation baked in "2026-04-17" — two days after PR creation, chosen to look recent. A hardcoded future date reads as stale the moment a user opens the page more than a week later. Replaced with a generic YYYY-MM-DD placeholder and italicised the example so it reads as template text, not a real log line.
Summary
Ships the retroactive disclosure for the Context Relevance rubric drift introduced silently in v0.2.0.
Release-blocking for v0.2.2. For a tool whose entire job is honest measurement, shipping a silent rubric change without documentation is the exact pattern RAG-Forge exists to catch in other people's pipelines.
What Cycle 3 observed
Every case in the 12-sample PearMedica golden set scored higher on Context Relevance under v0.2.1's llm-judge than under v0.1.3's, with deltas from +0.05 to +0.48 and an aggregate of +0.25. The production retrieval code was unchanged between the two runs. Only the measurement framework moved.
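The per-case deltas and the aggregate described above can be computed with a few lines of Python. This is a minimal sketch, not RAG-Forge code: the function name `relevance_drift`, the case IDs, and all scores are hypothetical placeholders, not the PearMedica data.

```python
from statistics import mean


def relevance_drift(old_scores, new_scores):
    """Return per-case deltas (new - old) and their mean.

    Both arguments map a golden-set case ID to its Context Relevance
    score (0-1) from one evaluator version.
    """
    if old_scores.keys() != new_scores.keys():
        raise ValueError("both runs must cover the same golden-set cases")
    deltas = {case: new_scores[case] - old_scores[case] for case in old_scores}
    return deltas, mean(deltas.values())


# Hypothetical scores for illustration only.
old_run = {"case-01": 0.40, "case-02": 0.55, "case-03": 0.60}
new_run = {"case-01": 0.88, "case-02": 0.80, "case-03": 0.65}

deltas, aggregate = relevance_drift(old_run, new_run)
for case, delta in sorted(deltas.items()):
    print(f"{case}: {delta:+.2f}")
print(f"aggregate drift: {aggregate:+.2f}")
```

If every delta has the same sign while the pipeline under test is unchanged, the shift is more plausibly in the measurement than in the system being measured.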
Root cause — architectural, not a wording tweak
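- v0.1.3 scored Context Relevance by per-chunk averaging: the judge rated each retrieved chunk individually 1-5, normalised via `mean(ratings) / 5`. Source: `packages/evaluator/src/rag_forge_evaluator/metrics/context_relevance.py` on the `v0.1.3` tag.
- v0.2.0 switched to a holistic 0-1 score across the whole context. Source: `packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py` in commit `330465f`.

For a hybrid retrieval pipeline returning a mix of relevant and irrelevant chunks (the normal case), these are not equivalent:
- Per-chunk averaging penalises the irrelevant chunks in the aggregate: 0.52.
- Holistic scoring rewards the presence of any relevant information: 0.80+.

Holistic is systematically more lenient. That's the +0.25 drift.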
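To make the per-chunk arithmetic concrete, here is a minimal sketch of the `mean(ratings) / 5` normalisation. The function name `per_chunk_relevance` and the five ratings are assumptions chosen so the result lands on 0.52; they are not taken from the v0.1.3 source or from the migration guide.

```python
from statistics import mean


def per_chunk_relevance(ratings):
    """v0.1.x-style score: average the 1-5 per-chunk ratings, scale to 0-1."""
    if not ratings:
        raise ValueError("at least one chunk rating is required")
    return mean(ratings) / 5


# Hypothetical judge ratings for five retrieved chunks:
# two relevant, one marginal, two irrelevant.
ratings = [5, 4, 2, 1, 1]
score = per_chunk_relevance(ratings)
print(f"{score:.2f}")
```

The two irrelevant chunks pull the mean down to 2.6, so the normalised score is 0.52. A holistic judge looking at the same context sees two strongly relevant chunks and has no per-chunk term that forces the score down, which is the leniency described above.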
Changes
- `CHANGELOG.md`: new `0.2.2 — Honest-Measurement Repair Release` entry with a dedicated Measurement rubric changes section containing the before/after prompt excerpts, the mechanistic explanation, and concrete re-baselining guidance for users upgrading from v0.1.x. Also adds a retroactive `0.2.1 — Partial-Publish Recovery` entry so the changelog is contiguous.
- `apps/docs/content/upgrading-v0-2-x.mdx`: new, user-facing Nextra migration page mirroring the CHANGELOG section with a step-by-step re-baselining workflow.
- `apps/docs/content/_meta.ts`: adds the migration page to the sidebar.
Deferred to later release
A future `--rubric=strict-v1` flag that pins the judge prompt to v0.1.x per-chunk-averaging is noted as out of scope for v0.2.2 and deferred. The rationale is kept in both the CHANGELOG and the docs page so users who need v0.1.x semantics before then have a clear path to open an issue.
Test plan
Merge order
Can merge in parallel with G2, G3, G5 once G1 lands. Release-blocking for v0.2.2 — the release PR should not go out until this is merged.