
docs(v0.2.2): CHANGELOG + migration guide for Context Relevance rubric drift (G4) #39

Merged

hallengray merged 2 commits into main from docs/v0.2.2-g4-rubric-drift on Apr 15, 2026

Conversation

@hallengray
Owner

Summary

Ships the retroactive disclosure for the Context Relevance rubric drift introduced silently in v0.2.0.

Release-blocking for v0.2.2. For a tool whose entire job is honest measurement, shipping a silent rubric change without documentation is the exact pattern RAG-Forge exists to catch in other people's pipelines.

What Cycle 3 observed

Every case in the 12-sample PearMedica golden set scored higher on Context Relevance under v0.2.1's llm-judge than under v0.1.3's, with deltas from +0.05 to +0.48 and an aggregate of +0.25. The production retrieval code was unchanged between the two runs. Only the measurement framework moved.

Root cause — architectural, not a wording tweak

  • v0.1.3 scored Context Relevance by per-chunk averaging. The judge rated each retrieved chunk individually on a 1-5 scale, normalised via mean(ratings) / 5. Source: packages/evaluator/src/rag_forge_evaluator/metrics/context_relevance.py on the v0.1.3 tag.
  • v0.2.0 switched to a holistic 0-1 score across the whole context as part of the combined-pass llm-judge refresh. Source: packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py in commit 330465f.

For a hybrid retrieval pipeline returning a mix of relevant and irrelevant chunks (the normal case), these are not equivalent:

  • Per-chunk averaging penalises irrelevant chunks in the aggregate. 5 chunks, 2 relevant → roughly 0.52.
  • Holistic scoring rewards the presence of any relevant information. Same 5 chunks → often 0.80+.

Holistic is systematically more lenient. That's the +0.25 drift.
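To make the gap concrete, here is a minimal sketch of the two aggregation shapes, assuming the judge rates a relevant chunk 5/5 and an irrelevant chunk 1/5. The function and variable names are illustrative, not the actual RAG-Forge code; only the mean(ratings) / 5 normalisation and the 2-of-5 worked example come from the sources cited above.

```python
# Illustrative sketch only; not the rag_forge_evaluator implementation.
# Assumption: the judge rates a relevant chunk 5/5 and an irrelevant chunk 1/5.

def per_chunk_score(ratings: list[int]) -> float:
    """v0.1.x shape: rate each retrieved chunk 1-5, then normalise via mean(ratings) / 5."""
    return sum(ratings) / len(ratings) / 5

ratings = [5, 5, 1, 1, 1]          # 5 chunks retrieved, 2 relevant
print(per_chunk_score(ratings))    # 0.52, the "roughly 0.52" figure above

# v0.2.x shape (holistic): the judge sees the whole concatenated context and
# returns a single 0-1 score. Because any relevant information is enough to
# answer the question, the same five chunks typically land at 0.80 or higher,
# which is the systematic leniency behind the +0.25 aggregate drift.
```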

Changes

  • CHANGELOG.md — new 0.2.2 — Honest-Measurement Repair Release entry with a dedicated Measurement rubric changes section containing the before/after prompt excerpts, the mechanistic explanation, and concrete re-baselining guidance for users upgrading from v0.1.x. Also adds a retroactive 0.2.1 — Partial-Publish Recovery entry so the changelog is contiguous.
  • apps/docs/content/upgrading-v0-2-x.mdx — new, user-facing Nextra migration page mirroring the CHANGELOG section with a step-by-step re-baselining workflow; a sketch of the comparison step follows this list.
  • apps/docs/content/_meta.ts — adds the migration page to the sidebar.
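
The re-baselining guidance in both documents reduces to one comparison: re-run the golden set under the new version and diff per-case Context Relevance scores against the stored v0.1.x baseline. Below is a hedged sketch of that step; the file names and the {case_id: score} JSON layout are hypothetical, not a format RAG-Forge actually emits.

```python
import json

# Hypothetical file names and layout; adapt to however your baselines are stored.
# Each file is assumed to map case_id -> Context Relevance score for the golden set.
with open("baseline_v0_1_3.json") as f:
    old = json.load(f)
with open("baseline_v0_2_1.json") as f:
    new = json.load(f)

deltas = {case_id: round(new[case_id] - old[case_id], 2) for case_id in old}
aggregate = sum(deltas.values()) / len(deltas)

print(deltas)
print(f"aggregate delta: {aggregate:+.2f}")

# Cycle 3 saw per-case deltas of +0.05 to +0.48 and an aggregate of +0.25 on the
# 12-sample PearMedica set. Deltas in that band are expected rubric drift, not a
# retrieval regression; re-set any Context Relevance gates against the new numbers
# instead of reading the jump as a pipeline improvement.
```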

Deferred to later release

A future --rubric=strict-v1 flag that pins the judge prompt to v0.1.x per-chunk-averaging is noted as out of scope for v0.2.2 and deferred. The rationale is kept in both the CHANGELOG and the docs page so users who need v0.1.x semantics before then have a clear path to open an issue.

Test plan

  • CHANGELOG renders cleanly on the Nextra docs site
  • Migration page shows up in the sidebar under `Upgrading v0.1.x -> v0.2.x`
  • Prompt excerpts in the CHANGELOG are accurate quotes from the v0.1.3 and v0.2.0 source files
  • The tone matches the LinkedIn v0.2.1 post framing — honest-measurement-first

Merge order

Can merge in parallel with G2, G3, G5 once G1 lands. Release-blocking for v0.2.2 — the release PR should not go out until this is merged.

…c drift (G4)

Cycle 3 PearMedica audit (2026-04-15) observed every case in the
12-sample golden set scoring ~0.25 higher on Context Relevance under
v0.2.1's llm-judge than under v0.1.3's — with identical telemetry,
identical retrieval code, identical judge model. The production
pipeline did not change between the two runs. The measurement did.

v0.2.1 shipped this rubric change without documenting it. A tool whose
entire purpose is honest measurement cannot silently make one of its
metrics more lenient and call the release clean. This commit ships
the retroactive disclosure.

Root cause is architectural, not a wording tweak:

  v0.1.3 scored Context Relevance by per-chunk averaging — the judge
  rated each retrieved chunk individually 1-5, normalised via
  mean(ratings)/5. See packages/evaluator/src/rag_forge_evaluator/
  metrics/context_relevance.py on the v0.1.3 tag.

  v0.2.0 switched to a holistic 0-1 score across the whole context as
  part of the combined-pass llm-judge refresh (commit 330465f).

For a hybrid retrieval pipeline returning a mix of relevant and
irrelevant chunks (the normal case), these methodologies are not
equivalent. Per-chunk averaging penalises irrelevant chunks in the
aggregate. Holistic scoring typically rewards the presence of any
relevant information. Holistic is systematically more lenient — that
is the mechanism behind the +0.25 drift.

Changes:

- CHANGELOG.md: new 0.2.2 entry (unreleased) with a dedicated
  "Measurement rubric changes" section explaining the drift, the
  mechanism, the before/after prompt excerpts, and concrete re-
  baselining guidance. Also adds a retroactive 0.2.1 entry for the
  partial-publish recovery so the changelog is contiguous.
- apps/docs/content/upgrading-v0-2-x.mdx: user-facing migration page
  mirroring the CHANGELOG section with a concrete step-by-step
  re-baselining workflow. Linked from the Nextra sidebar via _meta.ts.

A future --rubric=strict-v1 flag that pins the judge prompt to the
v0.1.x per-chunk-averaging shape is noted as out of scope for v0.2.2
and deferred to a later release. The rationale for not fixing it now
is kept in the CHANGELOG so downstream users can open a tracking
issue if they need it.
@coderabbitai

coderabbitai bot commented Apr 15, 2026

Warning

Rate limit exceeded

@hallengray has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 1 minute and 59 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: fa0859f0-77f4-44f5-96d9-48ace155e35e

📥 Commits

Reviewing files that changed from the base of the PR and between 6d74bc4 and 09f80f5.

📒 Files selected for processing (1)
  • apps/docs/content/upgrading-v0-2-x.mdx

Walkthrough

Added version 0.2.2 release notes documenting three bug fixes: RAGAS adapter method compatibility, evaluation skip reporting accuracy, and cross-package version synchronization. Introduced a new upgrade guide documenting v0.1.x to v0.2.x migration with metric measurement changes and recommended workflow steps.

Changes

  • Release Documentation (CHANGELOG.md): Added v0.2.2 unreleased entry with bug fixes for RAGAS adapter, evaluation skip reporting, and version drift; included measurement rubric clarification for Context Relevance metric changes.
  • Documentation Navigation (apps/docs/content/_meta.ts): Added navigation entry for new upgrade guide: "upgrading-v0-2-x": "Upgrading v0.1.x -> v0.2.x".
  • Upgrade Guide (apps/docs/content/upgrading-v0-2-x.mdx): New documentation page explaining Context Relevance scoring changes, refusal-aware scoring behavior, version compatibility, and recommended upgrade workflow from v0.1.x to v0.2.x.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 Hops through the docs with glee,
A guide for all to see—
From old versions to the new,
We've fixed bugs and changed the view!
Metrics shift with careful measure,
Upgrades bring the rabbit treasure! 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check: ✅ Passed. The title 'docs(v0.2.2): CHANGELOG + migration guide for Context Relevance rubric drift (G4)' accurately and specifically describes the main changes: documentation updates including a CHANGELOG entry and migration guide addressing a Context Relevance metric rubric change.
  • Description check: ✅ Passed. The description comprehensively explains the PR's purpose, root cause of the rubric drift, specific changes made, deferred work, and test plan, all directly related to the changeset.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping docstring coverage check.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/docs/content/upgrading-v0-2-x.mdx`:
- Line 48: Replace the hardcoded future date in the example annotation string
"Context Relevance gate bumped 0.60 -> 0.80 on 2026-04-17 to account for
RAG-Forge v0.2.x holistic rubric — see docs." with a neutral, non-expiring
placeholder such as "YYYY-MM-DD" or "<today's date>" so the example remains
valid over time; update that exact example text to read e.g. "Context Relevance
gate bumped 0.60 -> 0.80 on YYYY-MM-DD to account for RAG-Forge v0.2.x holistic
rubric — see docs."

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 728a8d22-113d-4b90-b2b5-76d719ce68b1

📥 Commits

Reviewing files that changed from the base of the PR and between 18947f4 and 6d74bc4.

📒 Files selected for processing (3)
  • CHANGELOG.md
  • apps/docs/content/_meta.ts
  • apps/docs/content/upgrading-v0-2-x.mdx
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Lint, Typecheck & Test
🔇 Additional comments (10)
apps/docs/content/_meta.ts (1)

13-13: LGTM!

The navigation entry is correctly configured and matches the new migration guide filename.

apps/docs/content/upgrading-v0-2-x.mdx (5)

1-11: LGTM!

The opening section clearly warns users about the rubric drift and provides a concise TL;DR. The strong language about the metric getting "silently more lenient" is appropriate for a measurement tool and aligns with the PR's objective of honest disclosure.


12-40: LGTM!

The mechanistic explanation is clear and includes concrete examples. The mathematical calculation in Line 36 is correct, and the side-by-side comparison of the two scoring methods effectively illustrates why holistic scoring produces higher scores.


50-55: LGTM!

This section appropriately explains why the other three metrics don't require re-baselining and clarifies that the refusal-aware scoring feature is not a factor in the observed drift.


56-59: LGTM!

This section honestly discloses the limitation and provides a workaround with clear tradeoffs. The mention of the future --rubric=strict-v1 flag aligns with the deferred work noted in the PR objectives.


60-62: LGTM!

This closing section effectively articulates the philosophical rationale for retroactive disclosure and provides the timeline context. The strong stance on honest measurement is appropriate for a tool focused on evaluation integrity.

CHANGELOG.md (4)

13-48: LGTM!

The measurement rubric changes section is comprehensive and consistent with the migration guide. The technical explanation is clear, includes concrete examples, and provides actionable guidance for users upgrading from v0.1.x. The mathematical example is correct and the prompt excerpts are properly attributed to specific file paths and commits.


55-66: LGTM!

The retroactive 0.2.1 entry clearly explains the partial-publish issue and the recovery steps taken. The information aligns with the PR objectives stating this is a retroactive disclosure.


51-51: CHANGELOG line 51 lists non-existent methods; correct to actual implementations.

The CHANGELOG references embed_text, embed_texts, and set_run_config as methods asserted in the contract test, but RagForgeRagasEmbeddings only implements four methods:

  • embed_query / embed_documents (sync)
  • aembed_query / aembed_documents (async)

The three methods embed_text, embed_texts, and set_run_config do not exist in the codebase. Additionally, the referenced test files (tests/test_ragas_adapters_contract.py and tests/test_ragas_adapters_e2e.py) cannot be found; the actual test file is packages/evaluator/tests/test_ragas_adapters.py.

Update line 51 to remove the non-existent methods and correct the test file path references.

> Likely an incorrect or invalid review comment.

9-9: CHANGELOG line 9 lists incorrect method names for the ragas 0.4.x interface.

The ragas 0.4.x contract documented in ragas_adapters.py (lines 10–22) specifies only generate_text() and agenerate_text() as the required methods. Line 9 incorrectly claims the missing methods were .generate() / is_finished() / get_temperature() / set_run_config(), none of which appear in the ragas 0.4.x BaseRagasLLM interface. The actual missing methods were generate_text() and agenerate_text(). Correct the method names to match the documented ragas contract.

> Likely an incorrect or invalid review comment.

Comment thread on apps/docs/content/upgrading-v0-2-x.mdx (marked Outdated)
CodeRabbit on PR #39 noted the example annotation baked in
"2026-04-17" — two days after PR creation, chosen to look recent.
A hardcoded future date reads as stale the moment a user opens the
page more than a week later. Replaced with a generic YYYY-MM-DD
placeholder and italicised the example so it reads as template
text, not a real log line.
@hallengray hallengray merged commit 0b83222 into main on Apr 15, 2026
2 checks passed
@hallengray hallengray deleted the docs/v0.2.2-g4-rubric-drift branch on April 15, 2026 at 22:53