
docs(v0.2.2): CHANGELOG + migration guide for Context Relevance rubric drift (G4) #39

Merged

hallengray merged 2 commits into main from docs/v0.2.2-g4-rubric-drift on Apr 15, 2026

Conversation

@hallengray
Owner

Summary

Ships the retroactive disclosure for the Context Relevance rubric drift introduced silently in v0.2.0.

Release-blocking for v0.2.2. For a tool whose entire job is honest measurement, shipping a silent rubric change without documentation is the exact pattern RAG-Forge exists to catch in other people's pipelines.

What Cycle 3 observed

Every case in the 12-sample PearMedica golden set scored higher on Context Relevance under v0.2.1's llm-judge than under v0.1.3's, with deltas from +0.05 to +0.48 and an aggregate of +0.25. The production retrieval code was unchanged between the two runs. Only the measurement framework moved.

Root cause — architectural, not a wording tweak

  • v0.1.3 scored Context Relevance by per-chunk averaging. The judge rated each retrieved chunk individually on a 1-5 scale, normalised via mean(ratings) / 5. Source: packages/evaluator/src/rag_forge_evaluator/metrics/context_relevance.py on the v0.1.3 tag.
  • v0.2.0 switched to a holistic 0-1 score across the whole context as part of the combined-pass llm-judge refresh. Source: packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py in commit 330465f.

For a hybrid retrieval pipeline returning a mix of relevant and irrelevant chunks (the normal case), these are not equivalent:

  • Per-chunk averaging penalises irrelevant chunks in the aggregate. 5 chunks, 2 relevant → roughly 0.52.
  • Holistic scoring rewards the presence of any relevant information. Same 5 chunks → often 0.80+.

Holistic is systematically more lenient. That's the +0.25 drift.
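To make the gap concrete, here is a minimal sketch of the two aggregation shapes, assuming the judge rates a relevant chunk 5/5 and an irrelevant chunk 1/5. The function and variable names are illustrative, not the actual RAG-Forge code; only the mean(ratings) / 5 normalisation and the 2-of-5 worked example come from the sources cited above.

```python
# Illustrative sketch only; not the rag_forge_evaluator implementation.
# Assumption: the judge rates a relevant chunk 5/5 and an irrelevant chunk 1/5.

def per_chunk_score(ratings: list[int]) -> float:
    """v0.1.x shape: rate each retrieved chunk 1-5, then normalise via mean(ratings) / 5."""
    return sum(ratings) / len(ratings) / 5

ratings = [5, 5, 1, 1, 1]          # 5 chunks retrieved, 2 relevant
print(per_chunk_score(ratings))    # 0.52, the "roughly 0.52" figure above

# v0.2.x shape (holistic): the judge sees the whole concatenated context and
# returns a single 0-1 score. Because any relevant information is enough to
# answer the question, the same five chunks typically land at 0.80 or higher,
# which is the systematic leniency behind the +0.25 aggregate drift.
```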

Changes

  • CHANGELOG.md — new 0.2.2 — Honest-Measurement Repair Release entry with a dedicated Measurement rubric changes section containing the before/after prompt excerpts, the mechanistic explanation, and concrete re-baselining guidance for users upgrading from v0.1.x. Also adds a retroactive 0.2.1 — Partial-Publish Recovery entry so the changelog is contiguous.
  • apps/docs/content/upgrading-v0-2-x.mdx — new, user-facing Nextra migration page mirroring the CHANGELOG section with a step-by-step re-baselining workflow; a sketch of the comparison step follows this list.
  • apps/docs/content/_meta.ts — adds the migration page to the sidebar.
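
The re-baselining guidance in both documents reduces to one comparison: re-run the golden set under the new version and diff per-case Context Relevance scores against the stored v0.1.x baseline. Below is a hedged sketch of that step; the file names and the {case_id: score} JSON layout are hypothetical, not a format RAG-Forge actually emits.

```python
import json

# Hypothetical file names and layout; adapt to however your baselines are stored.
# Each file is assumed to map case_id -> Context Relevance score for the golden set.
with open("baseline_v0_1_3.json") as f:
    old = json.load(f)
with open("baseline_v0_2_1.json") as f:
    new = json.load(f)

deltas = {case_id: round(new[case_id] - old[case_id], 2) for case_id in old}
aggregate = sum(deltas.values()) / len(deltas)

print(deltas)
print(f"aggregate delta: {aggregate:+.2f}")

# Cycle 3 saw per-case deltas of +0.05 to +0.48 and an aggregate of +0.25 on the
# 12-sample PearMedica set. Deltas in that band are expected rubric drift, not a
# retrieval regression; re-set any Context Relevance gates against the new numbers
# instead of reading the jump as a pipeline improvement.
```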

Deferred to later release

A future --rubric=strict-v1 flag that pins the judge prompt to v0.1.x per-chunk-averaging is noted as out of scope for v0.2.2 and deferred. The rationale is kept in both the CHANGELOG and the docs page so users who need v0.1.x semantics before then have a clear path to open an issue.

Test plan

  • CHANGELOG renders cleanly on the Nextra docs site
  • Migration page shows up in the sidebar under `Upgrading v0.1.x -> v0.2.x`
  • Prompt excerpts in the CHANGELOG are accurate quotes from the v0.1.3 and v0.2.0 source files
  • The tone matches the LinkedIn v0.2.1 post framing — honest-measurement-first

Merge order

Can merge in parallel with G2, G3, G5 once G1 lands. Release-blocking for v0.2.2 — the release PR should not go out until this is merged.

…c drift (G4)

Cycle 3 PearMedica audit (2026-04-15) observed every case in the
12-sample golden set scoring ~0.25 higher on Context Relevance under
v0.2.1's llm-judge than under v0.1.3's — with identical telemetry,
identical retrieval code, identical judge model. The production
pipeline did not change between the two runs. The measurement did.

v0.2.1 shipped this rubric change without documenting it. A tool whose
entire purpose is honest measurement cannot silently make one of its
metrics more lenient and call the release clean. This commit ships
the retroactive disclosure.

Root cause is architectural, not a wording tweak:

  v0.1.3 scored Context Relevance by per-chunk averaging — the judge
  rated each retrieved chunk individually 1-5, normalised via
  mean(ratings)/5. See packages/evaluator/src/rag_forge_evaluator/
  metrics/context_relevance.py on the v0.1.3 tag.

  v0.2.0 switched to a holistic 0-1 score across the whole context as
  part of the combined-pass llm-judge refresh (commit 330465f).

For a hybrid retrieval pipeline returning a mix of relevant and
irrelevant chunks (the normal case), these methodologies are not
equivalent. Per-chunk averaging penalises irrelevant chunks in the
aggregate. Holistic scoring typically rewards the presence of any
relevant information. Holistic is systematically more lenient — that
is the mechanism behind the +0.25 drift.

Changes:

- CHANGELOG.md: new 0.2.2 entry (unreleased) with a dedicated
  "Measurement rubric changes" section explaining the drift, the
  mechanism, the before/after prompt excerpts, and concrete re-
  baselining guidance. Also adds a retroactive 0.2.1 entry for the
  partial-publish recovery so the changelog is contiguous.
- apps/docs/content/upgrading-v0-2-x.mdx: user-facing migration page
  mirroring the CHANGELOG section with a concrete step-by-step
  re-baselining workflow. Linked from the Nextra sidebar via _meta.ts.

A future --rubric=strict-v1 flag that pins the judge prompt to the
v0.1.x per-chunk-averaging shape is noted as out of scope for v0.2.2
and deferred to a later release. The rationale for not fixing it now
is kept in the CHANGELOG so downstream users can open a tracking
issue if they need it.
@coderabbitai

coderabbitai bot commented Apr 15, 2026

Warning

Rate limit exceeded

@hallengray has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 1 minute and 59 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: fa0859f0-77f4-44f5-96d9-48ace155e35e

📥 Commits

Reviewing files that changed from the base of the PR and between 6d74bc4 and 09f80f5.

📒 Files selected for processing (1)
  • apps/docs/content/upgrading-v0-2-x.mdx

Walkthrough

Added version 0.2.2 release notes documenting three bug fixes: RAGAS adapter method compatibility, evaluation skip reporting accuracy, and cross-package version synchronization. Introduced a new upgrade guide documenting v0.1.x to v0.2.x migration with metric measurement changes and recommended workflow steps.

Changes

  • Release Documentation (CHANGELOG.md): Added v0.2.2 unreleased entry with bug fixes for RAGAS adapter, evaluation skip reporting, and version drift; included measurement rubric clarification for Context Relevance metric changes.
  • Documentation Navigation (apps/docs/content/_meta.ts): Added navigation entry for new upgrade guide: "upgrading-v0-2-x": "Upgrading v0.1.x -> v0.2.x".
  • Upgrade Guide (apps/docs/content/upgrading-v0-2-x.mdx): New documentation page explaining Context Relevance scoring changes, refusal-aware scoring behavior, version compatibility, and recommended upgrade workflow from v0.1.x to v0.2.x.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 Hops through the docs with glee,
A guide for all to see—
From old versions to the new,
We've fixed bugs and changed the view!
Metrics shift with careful measure,
Upgrades bring the rabbit treasure! 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check: ✅ Passed. The title 'docs(v0.2.2): CHANGELOG + migration guide for Context Relevance rubric drift (G4)' accurately and specifically describes the main changes: documentation updates including a CHANGELOG entry and migration guide addressing a Context Relevance metric rubric change.
  • Description check: ✅ Passed. The description comprehensively explains the PR's purpose, root cause of the rubric drift, specific changes made, deferred work, and test plan, all directly related to the changeset.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping docstring coverage check.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/docs/content/upgrading-v0-2-x.mdx`:
- Line 48: Replace the hardcoded future date in the example annotation string
"Context Relevance gate bumped 0.60 -> 0.80 on 2026-04-17 to account for
RAG-Forge v0.2.x holistic rubric — see docs." with a neutral, non-expiring
placeholder such as "YYYY-MM-DD" or "<today's date>" so the example remains
valid over time; update that exact example text to read e.g. "Context Relevance
gate bumped 0.60 -> 0.80 on YYYY-MM-DD to account for RAG-Forge v0.2.x holistic
rubric — see docs."

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 728a8d22-113d-4b90-b2b5-76d719ce68b1

📥 Commits

Reviewing files that changed from the base of the PR and between 18947f4 and 6d74bc4.

📒 Files selected for processing (3)
  • CHANGELOG.md
  • apps/docs/content/_meta.ts
  • apps/docs/content/upgrading-v0-2-x.mdx
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Lint, Typecheck & Test
🔇 Additional comments (10)
apps/docs/content/_meta.ts (1)

13-13: LGTM!

The navigation entry is correctly configured and matches the new migration guide filename.

apps/docs/content/upgrading-v0-2-x.mdx (5)

1-11: LGTM!

The opening section clearly warns users about the rubric drift and provides a concise TL;DR. The strong language about the metric getting "silently more lenient" is appropriate for a measurement tool and aligns with the PR's objective of honest disclosure.


12-40: LGTM!

The mechanistic explanation is clear and includes concrete examples. The mathematical calculation in Line 36 is correct, and the side-by-side comparison of the two scoring methods effectively illustrates why holistic scoring produces higher scores.


50-55: LGTM!

This section appropriately explains why the other three metrics don't require re-baselining and clarifies that the refusal-aware scoring feature is not a factor in the observed drift.


56-59: LGTM!

This section honestly discloses the limitation and provides a workaround with clear tradeoffs. The mention of the future --rubric=strict-v1 flag aligns with the deferred work noted in the PR objectives.


60-62: LGTM!

This closing section effectively articulates the philosophical rationale for retroactive disclosure and provides the timeline context. The strong stance on honest measurement is appropriate for a tool focused on evaluation integrity.

CHANGELOG.md (4)

13-48: LGTM!

The measurement rubric changes section is comprehensive and consistent with the migration guide. The technical explanation is clear, includes concrete examples, and provides actionable guidance for users upgrading from v0.1.x. The mathematical example is correct and the prompt excerpts are properly attributed to specific file paths and commits.


55-66: LGTM!

The retroactive 0.2.1 entry clearly explains the partial-publish issue and the recovery steps taken. The information aligns with the PR objectives stating this is a retroactive disclosure.


51-51: CHANGELOG line 51 lists non-existent methods; correct to actual implementations.

The CHANGELOG references embed_text, embed_texts, and set_run_config as methods asserted in the contract test, but RagForgeRagasEmbeddings only implements four methods:

  • embed_query / embed_documents (sync)
  • aembed_query / aembed_documents (async)

The three methods embed_text, embed_texts, and set_run_config do not exist in the codebase. Additionally, the referenced test files (tests/test_ragas_adapters_contract.py and tests/test_ragas_adapters_e2e.py) cannot be found; the actual test file is packages/evaluator/tests/test_ragas_adapters.py.

Update line 51 to remove the non-existent methods and correct the test file path references.

> Likely an incorrect or invalid review comment.

9-9: CHANGELOG line 9 lists incorrect method names for the ragas 0.4.x interface.

The ragas 0.4.x contract documented in ragas_adapters.py (lines 10–22) specifies only generate_text() and agenerate_text() as the required methods. Line 9 incorrectly claims the missing methods were .generate() / is_finished() / get_temperature() / set_run_config(), none of which appear in the ragas 0.4.x BaseRagasLLM interface. The actual missing methods were generate_text() and agenerate_text(). Correct the method names to match the documented ragas contract.

> Likely an incorrect or invalid review comment.

Comment thread on apps/docs/content/upgrading-v0-2-x.mdx (marked Outdated)
CodeRabbit on PR #39 noted the example annotation baked in
"2026-04-17" — two days after PR creation, chosen to look recent.
A hardcoded future date reads as stale the moment a user opens the
page more than a week later. Replaced with a generic YYYY-MM-DD
placeholder and italicised the example so it reads as template
text, not a real log line.
@hallengray hallengray merged commit 0b83222 into main on Apr 15, 2026
2 checks passed
@hallengray hallengray deleted the docs/v0.2.2-g4-rubric-drift branch on April 15, 2026 at 22:53