
feat(evaluator): v0.2.0 evaluator refresh#33

Merged
hallengray merged 27 commits into main from feat/v0.2.0-evaluator-refresh
Apr 14, 2026

Conversation

@hallengray (Owner)

Summary

Implements the full v0.2.0 evaluator refresh per the design in #32. Rebuilds the RAGAS adapter around injected LLM + embeddings wrappers, ships default-on refusal-aware scoring with a --strict escape hatch, and rewrites the audit report around a single Jinja template with shared browser/print CSS.

Driven by findings from the PearMedica Cycle 2 audit (2026-04-14). Three v0.1.3 bugs fixed at the architectural root, one class of scoring pathology closed, and the report is rebuilt to top-tier standard.

What's in this PR (22 commits)

Workstream A — RAGAS adapter rebuild

  • New engines/ragas_adapters.py with RagForgeRagasLLM (sync + async) and RagForgeRagasEmbeddings (openai / voyage / mock providers)
  • ragas_evaluator.py now injects our wrappers instead of letting ragas pick defaults — version-stable across ragas releases
  • _extract_ragas_score raises ValueError instead of silently coercing to 0.0; exceptions are captured as SkipRecord entries visible in the report
  • Hard cut on the ragas pin: ragas>=0.4,<0.6
  • New optional extra [ragas-voyage] for Claude-judge users

Workstream B — Refusal-aware scoring (default-on)

  • metrics/llm_judge.py injects an inline classification preamble into the scoring prompt. The judge classifies each sample as standard or safety_refusal and scores on the appropriate rubric in one round-trip (no extra API calls)
  • RagForgeRagasLLM prepends a shorter refusal-aware NOTE to every prompt ragas sends
  • AuditConfig.refusal_aware: bool = True + CLI flags --strict, --refusal-aware, --no-refusal-aware
  • Warning banner in the report when the safety-refusal rate exceeds 30%, prompting a spot-check of the judge's classifications
  • MetricResult and SampleResult gain scoring_mode and refusal_justification fields
  • EvaluationResult gains scoring_modes_count for aggregate reporting

Workstream C — Top-tier audit report

  • New single Jinja template report/templates/audit.html.j2 (430 lines) with all 14 sections: header → title → run manifest → RMM hero with SVG sparkline → maturity ladder → executive summary → TL;DR callout → metric breakdown → cost summary → safety refusals → worst case → per-sample detail cards → skipped samples → compliance footer
  • report/generator.py rewritten as a context builder around the new template. Pure helper functions for RMM, TL;DR auto-generation, refusals block, worst case, compliance footer
  • report/pdf.py uses Playwright's emulate_media(media="print") so the template's @media print rules (page breaks, @page rules, forced-open <details>) actually apply
  • JSON report gets additive-only fields: skipped_samples, scoring_modes_count, safety_refusal_rate, per-sample scoring_mode + refusal_justification
  • Compliance footer references NIST AI RMF 1.0 § 4.3, ISO/IEC 42001:2023 § 8.3, EU AI Act Article 15. Authored by RAG-Forge
  • audit.py drops the hard block on --evaluator ragas --judge claude and replaces it with a capability check for voyageai
  • Banner no longer hardcodes gpt-4o-mini (RAGAS internal — judge override ignored) — ragas now honors --judge end-to-end
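
As one concrete piece of the template context the generator builds, the RMM sparkline needs historical scores mapped to an SVG polyline points string. A rough sketch, assuming a hypothetical helper (not the actual generator.py code) and 0-1 scores:

```python
def sparkline_points(history: list[float], width: int = 120,
                     height: int = 32, pad: int = 2) -> str:
    # Map 0-1 scores to SVG coordinates; SVG's y axis grows downward.
    if not history:
        return ""  # guard the empty-history case rather than emit a bad attribute
    if len(history) == 1:
        xs = [width / 2]
    else:
        step = (width - 2 * pad) / (len(history) - 1)
        xs = [pad + i * step for i in range(len(history))]
    ys = [pad + (1.0 - s) * (height - 2 * pad) for s in history]
    return " ".join(f"{x:.1f},{y:.1f}" for x, y in zip(xs, ys))
```

The output drops straight into `<polyline points="...">` inside the template's SVG hero block.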

Workstream D — Release coordination

  • Version bumps to 0.2.0 across rag-forge-core, rag-forge-evaluator, rag-forge-observability, @rag-forge/cli, @rag-forge/mcp
  • CHANGELOG.md with breaking-changes callout, migration guide, and PearMedica Cycle 2 credits
  • docs/evaluators.md with a refusal-aware scoring guide (worked example: PearMedica paediatric metformin refusal)

Breaking changes

  1. Scores will shift on pipelines with safety refusals (default-on refusal-aware). Pass --strict to revert.
  2. ragas pin hard cut to >=0.4,<0.6. Users with pinned 0.2.x must upgrade ragas before upgrading rag-forge-evaluator.
  3. --evaluator ragas --judge claude now works — was previously a hard error.

Tests

  • Full evaluator suite: 256 passed, 4 skipped (the 4 skips are the opt-in ragas_integration and visual marker tests)
  • 22 new unit tests + 4 gated integration/visual tests
  • Ruff: clean
  • Mypy: clean on files this branch touched
  • Turbo build: all 5 packages green
  • Manual smoke audit: rag-forge-eval audit --evaluator llm-judge --judge mock produces HTML + JSON + PDF cleanly

Regression coverage

Migration for existing users

# 1. Upgrade ragas (mandatory)
pip install -U rag-forge-evaluator[ragas]

# 2. If using Claude judge with ragas, add Voyage embeddings
pip install rag-forge-evaluator[ragas-voyage]

# 3. To keep v0.1.x scoring semantics in CI:
rag-forge-eval audit ... --strict

Depends on

Test plan

Captures the four workstreams locked during brainstorming in response to
the PearMedica Cycle 2 audit findings:

- Workstream A: rebuild the RAGAS adapter around injected LLM and
  embeddings wrappers (RagForgeRagasLLM + RagForgeRagasEmbeddings) so the
  evaluator stops depending on ragas's version-fragile default picker.
  Hard cut to ragas>=0.4,<0.6. Adds [ragas-voyage] extra for Claude-judge
  users. Deletes the 0.0-on-exception coercion; skips surface in reports.
- Workstream B: default-on refusal-aware scoring with a --strict escape
  hatch. The judge classifies each sample as standard or safety_refusal
  inline and applies the appropriate rubric. A 30% spot-check banner
  guards against judge misclassification.
- Workstream C: rebuild the audit report to top-tier standard. Single
  Jinja template with a shared browser/print stylesheet, Playwright
  print-media emulation, 14-section information architecture covering
  run manifest, RMM hero with historical sparkline, TL;DR callout, cost
  summary, safety refusals, per-sample detail, and a compliance footer
  aligned with NIST AI RMF / ISO 42001 / EU AI Act Article 15.
- Workstream D: coordinated 0.2.0 version bump across core, evaluator,
  observability, cli, and mcp packages with a migration checklist.

Includes a file-by-file change summary, four-layer testing strategy,
locked decisions from brainstorming, and explicit non-goals.
…k with SkipRecord

RagasEvaluator now accepts a judge, max_tokens, and embeddings_provider
and injects RagForgeRagasLLM + RagForgeRagasEmbeddings into every ragas
evaluate() call — ragas no longer reaches for its own default LLM or
embeddings. _extract_ragas_score raises ValueError instead of silently
returning 0.0 so broken infrastructure surfaces in the report as
SkipRecords rather than producing invisibly-wrong scores (root cause of
the PearMedica Cycle 2 silent failure). Whole-batch ragas crashes are
also caught and converted to SkipRecords per sample×metric.
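
The per-sample×metric fan-out can be sketched like this. The helper name is hypothetical; the four metric names are the ones this PR's combined judge path scores.

```python
METRIC_NAMES = ("faithfulness", "answer_relevance", "context_relevance", "hallucination")

def batch_crash_to_skips(sample_ids: list[str], exc: Exception) -> list[dict]:
    # One skip entry per sample x metric so nothing vanishes from the report
    # when ragas fails on the whole batch at once.
    return [
        {"sample_id": sid, "metric_name": metric,
         "reason": str(exc), "exception_type": type(exc).__name__}
        for sid in sample_ids
        for metric in METRIC_NAMES
    ]
```
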
…onfig fields

- Add _voyageai_installed() module-level probe so tests can monkeypatch
  the voyageai import check without requiring the real SDK installed
- Add three new AuditConfig fields: refusal_aware=True,
  ragas_max_tokens=8192, ragas_embeddings_provider=None
- Replace _validate_config ragas branch: claude is now allowed when
  voyageai is installed; blocks only on missing voyageai dep with a
  clear install hint pointing at [ragas-voyage] extra
- Drop hardcoded 'gpt-4o-mini (RAGAS internal — judge override ignored)'
  banner — ragas now honors the configured judge end-to-end via
  RagForgeRagasLLM wrapper shipped in Wave 1
- Update test_ragas_with_mock_judge_is_blocked → _is_allowed: mock is
  a valid zero-cost judge with the new wrapper; old block premise obsolete
…llm-judge

Adds refusal_aware=True parameter to LLMJudgeEvaluator. When enabled, a
classification preamble is prepended to the combined single-call prompt so
the judge can identify safety-refusal samples (context too thin, response
correctly abstained) and apply the refusal rubric — all in one round-trip,
zero extra API cost. scoring_mode and refusal_justification are threaded
through SampleResult and aggregated into EvaluationResult.scoring_modes_count.

Migrates the default evaluation path to a single combined judge call per
sample (all four metrics in one JSON response). Per-metric path is preserved
for callers who inject custom MetricEvaluator instances (detected via instance
type check so patched test evaluators continue to work correctly).
Replaces the old Jinja FileSystemLoader + audit_report.html.j2 approach
with a module-level generate_html() function that builds a typed context
dict (RMM summary, TL;DR, refusals, worst-case, compliance footer, cost
block, sparkline points) and renders the Task 12 audit.html.j2 template.
ReportGenerator class preserved as a backward-compatible shim for audit.py.
…fusal_rate to JSON output

Add four new additive fields to the JSON audit report (v0.2.0):
- Top-level 'skipped_samples': list of skip records with sample_id, metric_name, reason, exception_type
- Top-level 'scoring_modes_count': dict of scoring mode counts (e.g. {'standard': 11, 'safety_refusal': 1})
- Top-level 'safety_refusal_rate': float ratio of safety_refusal samples to total samples_evaluated
- Per-sample 'scoring_mode' and 'refusal_justification': tracking the scoring context for each result

All existing JSON fields preserved. Computation handles edge cases (empty reports, zero samples).
8 new tests added covering all combinations.
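
The two aggregate fields are straightforward to compute; a minimal sketch with hypothetical helper names, matching the example counts above:

```python
from collections import Counter

def scoring_modes_count(per_sample_modes: list[str]) -> dict[str, int]:
    # e.g. 11x "standard" + 1x "safety_refusal" -> {"standard": 11, "safety_refusal": 1}
    return dict(Counter(per_sample_modes))

def safety_refusal_rate(modes_count: dict[str, int], samples_evaluated: int) -> float:
    # Defined as 0.0 for the empty-report / zero-samples edge cases.
    if samples_evaluated <= 0:
        return 0.0
    return modes_count.get("safety_refusal", 0) / samples_evaluated
```
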
- Add return type annotations to _openai_client() and _voyage_client()
- Suppress TC001 for generator.py (types used in signatures)
- Combine nested with statements in test_audit_config_validation.py
- Remove unused pytest imports in test_refusal_aware_scoring.py
- Fix import ordering and formatting per ruff rules

coderabbitai bot commented Apr 14, 2026

Caution

Review failed

Pull request was closed or merged during review

Summary by CodeRabbit

  • New Features

    • Refusal-aware scoring enabled by default (opt-out via --strict / --no-refusal-aware); new CLI flags: --strict, --refusal-aware, --no-refusal-aware.
    • Enhanced audit reports (HTML/PDF) with safety-refusal sections, skipped-sample details, compliance footer, and richer JSON output.
    • RAGAS integration with adapter support and a new optional extra for Voyage/Claude embeddings; package versions bumped to 0.2.0.
  • Bug Fixes

    • Per-sample failures recorded as skips (not coerced to zero) and handled in reports.
    • Fixed embedding and token/overflow issues.
  • Documentation

    • New evaluator docs and migration/design plans.
  • Tests

    • Many new unit/integration/visual tests and fixtures.

Walkthrough

This release updates RAG-Forge from v0.1.3 to v0.2.0 with default refusal-aware scoring, RAGAS LLM/embeddings adapter wrappers, skip-record error handling (no silent 0.0 coercion), revised evaluator wiring and CLI flags, and a rebuilt Jinja HTML/PDF report plus broad test and fixture additions.

Changes

Cohort / File(s) Summary
Version bumps
packages/cli/package.json, packages/core/pyproject.toml, packages/evaluator/pyproject.toml, packages/mcp/package.json, packages/observability/pyproject.toml
Bumped package versions from 0.1.3 to 0.2.0.
.gitignore & mypy
.gitignore, pyproject.toml
Added .superpowers/ to .gitignore; extended mypy overrides to include PIL.*, voyageai.*, langchain_core.*, and rag_forge_evaluator.*.
Changelog & Docs
CHANGELOG.md, docs/evaluators.md, docs/superpowers/plans/...v0.2.0-evaluator-refresh.md, docs/superpowers/specs/...-evaluator-refresh-design.md
Added v0.2.0 changelog and comprehensive docs/specs/plans describing refusal-aware scoring, RAGAS adapter design, migration steps, and report redesign.
Core engine types
packages/evaluator/src/rag_forge_evaluator/engine.py
Added ScoringMode alias and SkipRecord; extended EvaluationSample, MetricResult, SampleResult, and EvaluationResult with sample IDs, scoring modes, refusal justifications, skipped_samples, and scoring_modes_count.
Audit config & validation
packages/evaluator/src/rag_forge_evaluator/audit.py
Added _voyageai_installed() probe; extended AuditConfig with refusal_aware and ragas_embeddings_provider; relaxed RAGAS judge constraints to allow Claude when VoyageAI is installed; threaded refusal_aware into evaluator creation and adjusted estimation/banner logic.
CLI refusal flags
packages/evaluator/src/rag_forge_evaluator/cli.py
Added mutually exclusive --strict, --refusal-aware, --no-refusal-aware flags and mapped them to AuditConfig.refusal_aware.
Evaluator factory
packages/evaluator/src/rag_forge_evaluator/engines/...
create_evaluator(...) gains refusal_aware: bool = True and forwards it to evaluators; llm-judge and ragas branches updated to accept/forward the flag.
RAGAS adapter wrappers
packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
New module with _JudgeLike protocol, RagForgeRagasLLM (forwards to Judge, supports refusal-aware prompt preamble, sync/async paths, LangChain compatibility fallback) and RagForgeRagasEmbeddings (providers: openai, voyage, mock with deterministic mock vectors).
RAGAS evaluator refactor & error handling
packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py
RagasEvaluator now accepts judge, embeddings_provider, refusal_aware; auto-selects embeddings provider; constructs adapter wrappers; raises on missing judge; raises on extraction failures and records SkipRecords; returns skipped_samples in results and adjusts pass/scoring logic.
LLM-judge refusal-aware scoring
packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py
Added combined single-call scoring path, refusal classification preamble, dual rubrics (standard vs safety_refusal), robust parsing with skip handling, and tracking of scoring_mode, refusal_justification, and scoring_modes_count.
Report generation & template
packages/evaluator/src/rag_forge_evaluator/report/generator.py, packages/evaluator/src/rag_forge_evaluator/report/templates/audit.html.j2
New generate_html(...) building full Jinja context (RMM, TL;DR, safety refusals, skipped samples, compliance/cost, sparkline); switched rendering to audit.html.j2; generate_json extended with skipped_samples, scoring_modes_count, safety_refusal_rate, and per-sample fields.
PDF rendering
packages/evaluator/src/rag_forge_evaluator/report/pdf.py
Playwright PDF rendering now emulates print media before loading and uses prefer_css_page_size=True.
RAGAS metrics extraction
packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py
_extract_ragas_score now raises ValueError on extraction failure instead of returning 0.0, enabling skip recording.
Mock judge
packages/evaluator/src/rag_forge_evaluator/judge/mock_judge.py
Extended mock judge default JSON to include combined metric fields and scoring_mode/refusal_justification.
Pyproject/evaluator metadata
packages/evaluator/pyproject.toml
Bumped evaluator package version; updated ragas optional dependency to ragas>=0.4,<0.6, added datasets>=2.14, added ragas-voyage extra with voyageai>=0.2, included templates in wheel, and added pytest markers ragas_integration and visual.
Fixtures
packages/evaluator/tests/fixtures/ragas_tiny_fixture.json, packages/evaluator/tests/fixtures/ragas_regression/cycle-2-excerpt.jsonl
Added JSON/JSONL fixtures for integration and regression testing (clinical and Q&A samples).
Tests — config/cli/engine types
packages/evaluator/tests/test_audit_config_validation.py, packages/evaluator/tests/test_cli_refusal_flags.py, packages/evaluator/tests/test_engine_types.py, packages/evaluator/tests/test_evaluator_factory.py
Updated/added tests for new AuditConfig defaults, CLI refusal flag parsing, engine datatypes (SkipRecord, scoring fields), and evaluator factory/judge requirements.
Tests — ragas adapters & extractor
packages/evaluator/tests/test_ragas_adapters.py, packages/evaluator/tests/test_ragas_extractor.py
New adapter tests (LLM forwarding, async behavior, embeddings providers including mock determinism), updated extractor tests expecting ValueError instead of silent 0.0.
Tests — LLM-judge & refusal-aware
packages/evaluator/tests/test_llm_judge_aggregation.py, packages/evaluator/tests/test_refusal_aware_scoring.py
Updated aggregation tests to combined single-call format; added refusal-aware scoring tests validating classification, justification propagation, and strict mode behavior.
Integration & regression tests
packages/evaluator/tests/test_ragas_integration.py, packages/evaluator/tests/test_pearmedica_regression.py
Added gated ragas integration tests and clinical regression tests verifying skip behavior, embedding regressions, and long-response handling.
Report & PDF tests
packages/evaluator/tests/test_report.py, packages/evaluator/tests/test_enhanced_report.py, packages/evaluator/tests/test_json_report.py, packages/evaluator/tests/test_pdf_generator.py, packages/evaluator/tests/test_report_generator.py, packages/evaluator/tests/test_report_template.py, packages/evaluator/tests/test_report_visual_regression.py
Updated/added tests for HTML/JSON output fields, Jinja template strict rendering, SVG presence, Playwright print-media emulation, skipped-sample rendering, safety-refusal rate/warnings, and pixel-level visual regression.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant AuditConfig
    participant EvaluatorFactory
    participant LLMJudgeEvaluator
    participant Judge
    participant ReportGenerator

    User->>CLI: audit --refusal-aware / --strict
    CLI->>CLI: parse flags
    CLI->>AuditConfig: create(refusal_aware=...)
    AuditConfig->>EvaluatorFactory: create_evaluator(refusal_aware=...)
    EvaluatorFactory->>LLMJudgeEvaluator: instantiate(refusal_aware=...)
    User->>LLMJudgeEvaluator: evaluate(samples)
    alt refusal_aware=True
        LLMJudgeEvaluator->>Judge: judge(system + classification preamble + dual rubric)
        Judge-->>LLMJudgeEvaluator: combined JSON (scoring_mode, scores, justification)
        LLMJudgeEvaluator->>LLMJudgeEvaluator: apply safety_refusal rubric -> MetricResults with scoring_mode/refusal_justification
    else refusal_aware=False
        LLMJudgeEvaluator->>Judge: judge(system + standard rubric)
        Judge-->>LLMJudgeEvaluator: JSON scores
        LLMJudgeEvaluator->>LLMJudgeEvaluator: produce MetricResults (standard)
    end
    LLMJudgeEvaluator-->>ReportGenerator: EvaluationResult (metrics, skipped_samples, scoring_modes_count)
    ReportGenerator->>ReportGenerator: build Jinja context
    ReportGenerator-->>User: HTML report (safety refusals, skipped samples, metrics)
sequenceDiagram
    participant User
    participant RagasEvaluator
    participant RagForgeRagasLLM
    participant RagForgeRagasEmbeddings
    participant Judge
    participant RagasLib

    User->>RagasEvaluator: evaluate(samples)
    RagasEvaluator->>RagasEvaluator: auto-select embeddings provider
    RagasEvaluator->>RagForgeRagasLLM: create(judge, refusal_aware)
    RagasEvaluator->>RagForgeRagasEmbeddings: create(provider)
    RagasEvaluator->>RagasLib: ragas_evaluate(samples, llm=LLM, embeddings=Embeddings)
    alt refusal_aware=True
        RagForgeRagasLLM->>Judge: generate_text(prompt + refusal note)
        Judge-->>RagForgeRagasLLM: response
    else refusal_aware=False
        RagForgeRagasLLM->>Judge: generate_text(prompt)
        Judge-->>RagForgeRagasLLM: response
    end
    RagasLib-->>RagasEvaluator: raw metric outputs or raise
    alt extraction/exception failure
        RagasEvaluator->>RagasEvaluator: record SkipRecord(sample_id, metric, reason, exception)
        RagasEvaluator-->>User: EvaluationResult with skipped_samples
    else success
        RagasEvaluator-->>User: EvaluationResult with metrics
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Hoppity-hop, evaluator anew—

Refusals heard and skips in view.
Adapters bridge judge and ragas' art,
Reports bloom bright and play their part.
RAG‑Forge hops onward—fresh and true.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 51.16%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅: the PR title 'feat(evaluator): v0.2.0 evaluator refresh' clearly and concisely summarizes the main change and is directly related to the substantial changeset.
  • Description check ✅: the PR description is comprehensive and directly related to the changeset, covering the four main workstreams (RAGAS rebuild, refusal-aware scoring, audit report, release coordination), breaking changes, and test coverage.



Two regressions on PR #33 CI:

1. ruff failures in test_ragas_adapters.py and test_ragas_extractor.py:
   - I001: unsorted import block
   - F401: unused pytest import (top-level), unused SkipRecord import
   - F811: pytest redefined inside test function
   - F821: leftover captured_prompts reference in a dead stub class
   Fix: clean up the imports, remove the unused stub, sort the import block.

2. RAG Quality Gate: rag-forge-eval audit --judge mock scored
   faithfulness 0.0 because the combined-path parser (added in v0.2.0)
   expects {faithfulness, answer_relevance, context_relevance,
   hallucination} at the top level of the judge response, but MockJudge
   still returned the legacy per-metric shape (claims/ratings/score).
   Fix: additively include the combined-path fields in MockJudge's
   default response. Legacy per-metric fields stay so the partial-
   report tests (which monkey-patch _default_metrics) still work.

Local verification: `rag-forge-eval audit --golden-set eval/golden_set.json
--judge mock --evaluator llm-judge` now scores faithfulness 0.90, all four
metrics pass, overall 0.90, RMM Level 3.
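
The additive MockJudge fix described above can be sketched as follows; the helper name and exact values are hypothetical, and only the field names come from this PR.

```python
def mock_judge_default_response() -> dict:
    # Legacy per-metric shape, kept so the partial-report tests that
    # monkey-patch _default_metrics still work.
    legacy = {"claims": [], "ratings": [], "score": 0.9}
    # Combined-path fields the v0.2.0 parser expects at the top level.
    combined = {
        "faithfulness": 0.9,
        "answer_relevance": 0.9,
        "context_relevance": 0.9,
        "hallucination": 0.9,
        "scoring_mode": "standard",
        "refusal_justification": None,
    }
    return {**legacy, **combined}
```

Merging the two dicts keeps the change strictly additive: old consumers see the legacy keys, the combined-path parser finds its top-level fields.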

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 21

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/evaluator/src/rag_forge_evaluator/audit.py (1)

219-225: ⚠️ Potential issue | 🟠 Major

Propagate the new ragas-specific config here.

ragas_max_tokens and ragas_embeddings_provider were added to AuditConfig, but this construction path still only forwards the generic factory args. Any non-default value is silently ignored today, so callers cannot actually raise the ragas token cap or force mock / voyage / openai embeddings.

Proposed fix
-            evaluator = create_evaluator(
-                self.config.evaluator_engine,
-                judge=judge,
-                thresholds=self.config.thresholds,
-                progress=self._progress,
-                refusal_aware=self.config.refusal_aware,
-            )
+            if self.config.evaluator_engine == "ragas":
+                from rag_forge_evaluator.engines.ragas_evaluator import RagasEvaluator
+
+                evaluator = RagasEvaluator(
+                    judge=judge,
+                    thresholds=self.config.thresholds,
+                    max_tokens=self.config.ragas_max_tokens,
+                    embeddings_provider=self.config.ragas_embeddings_provider,
+                    refusal_aware=self.config.refusal_aware,
+                )
+            else:
+                evaluator = create_evaluator(
+                    self.config.evaluator_engine,
+                    judge=judge,
+                    thresholds=self.config.thresholds,
+                    progress=self._progress,
+                    refusal_aware=self.config.refusal_aware,
+                )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/audit.py` around lines 219 - 225,
The create_evaluator call in audit.py currently ignores the new AuditConfig
fields ragas_max_tokens and ragas_embeddings_provider, so non-default values are
dropped; update the create_evaluator invocation (where evaluator is assigned) to
pass through these new config values (e.g.,
ragas_max_tokens=self.config.ragas_max_tokens and
ragas_embeddings_provider=self.config.ragas_embeddings_provider) so the factory
receives them; ensure create_evaluator’s signature (and any downstream evaluator
factory code) accepts and uses these parameters if not already implemented.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CHANGELOG.md`:
- Line 10: Update the migration text that currently says "ragas will upgrade to
0.4.x automatically" to match the constraint used above: replace that phrase
with the exact constraint "ragas>=0.4,<0.6" or with "ragas 0.4.x/0.5.x" so the
statement in the line referencing ragas mirrors the constraint at the top of the
file (the sentence referencing the `ragas` requirement).

In `@docs/superpowers/plans/2026-04-14-v0.2.0-evaluator-refresh.md`:
- Around line 1605-1610: Add a blank line before and after each fenced code
block (e.g., the ```bash block containing "uv pip install 'ragas>=0.4,<0.6'" and
"uv run pytest -m ragas_integration ...") so every ```...``` block is separated
by an empty line above and below; update all similar fenced blocks in the
document to maintain consistent markdown rendering.

In `@docs/superpowers/specs/2026-04-14-v0.2.0-evaluator-refresh-design.md`:
- Line 59: The fenced code block containing the ASCII architecture diagram
should include a language tag to satisfy markdownlint MD040; update the opening
fence for that block by changing the triple backticks to include "text" (i.e.,
```text) for the ASCII diagram block so the diagram is treated as plain text and
the linter passes.

In `@packages/evaluator/src/rag_forge_evaluator/audit.py`:
- Around line 160-174: The validation in the ragas branch rejects
AuditConfig.judge_model when it is still the default None; update the check in
the block that inspects config.evaluator_engine == "ragas" so that it allows
judge to be None (or implicitly treated as "mock") — e.g., treat None as
equivalent to "mock" or skip the allowlist check when judge is None, leaving
normalization to _create_judge() which constructs MockJudge(); ensure references
to judge, config.evaluator_engine, AuditConfig.judge_model, and _create_judge()
are used so the change only affects the ragas-specific validation and preserves
the existing error paths for explicit invalid strings.

In `@packages/evaluator/src/rag_forge_evaluator/engines/__init__.py`:
- Around line 24-26: When engine == "ragas", fail fast if no judge is provided:
before instantiating RagasEvaluator in __init__.py, check that the judge
parameter is not None and raise a descriptive ValueError (e.g., "ragas engine
requires a judge") so we don't defer the error until RagasEvaluator.evaluate();
then proceed to import/return RagasEvaluator(judge=judge, thresholds=thresholds,
refusal_aware=refusal_aware).

In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py`:
- Around line 41-79: The RagForgeRagasLLM adapter currently duck-types ragas and
exposes a narrow generate_text( str )->str; update it to subclass ragas'
BaseRagasLLM and implement the full ragas contract: replace class
RagForgeRagasLLM to inherit BaseRagasLLM, change generate_text to the full
signature (accept PromptValue prompt, n:int, temperature:float, stop:list|None,
callbacks:Optional[...] etc.), return an LLMResult object, and ensure
agenerate_text mirrors the async signature and runs the sync implementation via
asyncio.to_thread; also thread the stored max_tokens into the underlying call
(pass max_tokens to _judge.judge or map it into the judge call parameters) and
keep model_name() forwarding to _judge.model_name() so ragas sees the proper
model metadata.

In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py`:
- Around line 154-205: The failure branch currently appends SkipRecord entries
but returns samples_evaluated=0 and the aggregate branch can mark passed=True
even when some metrics were dropped; update the exception branch to return
samples_evaluated=len(samples) (use the same samples list used for SkipRecord)
and adjust the final passed logic to require that all expected metrics were
produced before marking the run passed (i.e., require len(aggregated) ==
len(_METRIC_NAMES) && all(m.passed for m in aggregated)); ensure you still
compute overall_score the same way and keep scored_count on each MetricResult as
len(samples). Reference: SkipRecord, EvaluationResult, _METRIC_NAMES,
_extract_ragas_score, MetricResult, self._thresholds, _DEFAULT_THRESHOLDS,
aggregated, samples_evaluated, passed.

In `@packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py`:
- Around line 353-387: When all metrics in per_metric_results from
self._evaluate_sample_combined(sample) are skipped, do not build a SampleResult
or increment scoring_modes_count; instead create and append a SkipRecord and
continue. Concretely: after populating sample_metric_scores and computing
sample_skipped, detect if sample_skipped == len(metric_names) and if so import
and use SkipRecord to record the skip (including sample.sample_id,
sample.query/response context and refusal_justification/scoring_mode as
appropriate), do not compute worst_metric or root_cause or update
scoring_modes_count, and do not append to sample_results; otherwise keep the
existing flow (compute worst_metric via min(...), build thresholds_map and
root_cause via _determine_root_cause, update scoring_modes_count, and append
SampleResult). Ensure SkipRecord is imported at top of module.
- Around line 117-128: The guard in _are_standard_metrics uses isinstance()
which is too permissive and lets different instances (or subclassed evaluators
overriding evaluate_sample) pass; instead compare the concrete type multiset so
only the exact default metric types match. Replace the isinstance-based check
with an exact-type comparison, e.g. compare Counter(type(m) for m in metrics) ==
Counter(_STANDARD_METRIC_TYPES) (or an equivalent multiset comparison) inside
_are_standard_metrics so only the exact default metric classes are accepted.

In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py`:
- Around line 673-692: The compatibility shim currently calls generate_html with
hardcoded metadata and drops rmm_level, trends, and sample_results; update the
wrapper so it forwards the caller's arguments into generate_html (e.g., pass
through project_name, evaluator_name, judge_model_display, rmm_level, trends,
and sample_results) or, if you intend a breaking change, make the API explicit
by removing the shim and documenting the new signature; locate the wrapper that
calls generate_html (and references self.output_dir) and ensure it propagates
the original inputs instead of the hardcoded strings or else convert it to a
clearly breaking call.
- Around line 322-328: Update the data_handling_html string in generator.py to
remove the incorrect claims that queries/responses are only processed
in-memory and that the report contains no PII. State explicitly that report
artifacts (HTML/JSON) may persist sample data to disk unless redacted, and
that evaluation inputs may be transmitted to the configured judge endpoint
under the operator's API key and are subject to that provider's data policy.
Keep the message concise: replace the existing sentences in data_handling_html
and clearly warn users about potential persistence of sample data and the need
to redact regulated data before running the audit.
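
One possible replacement string (the wording below is illustrative, not the shipped copy):

```python
# Illustrative replacement copy for generator.py; the exact wording is an
# assumption, only the substance (persistence + judge transmission) is required.
data_handling_html = (
    "<p><strong>Data handling:</strong> report artifacts (HTML/JSON) may "
    "persist sample queries and responses to disk unless redacted. Evaluation "
    "inputs are transmitted to the configured judge endpoint under the "
    "operator's API key and are subject to that provider's data policy. "
    "Redact regulated data before running the audit.</p>"
)

print("persist" in data_handling_html and "judge endpoint" in data_handling_html)
```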

In `@packages/evaluator/src/rag_forge_evaluator/report/templates/audit.html.j2`:
- Around line 1162-1169: The template loops over sample.scores.items() which
will error if sample.scores is None or missing; update the Jinja loop in the
audit.html.j2 template to defensively handle that by using a default/empty
mapping or a conditional check around sample.scores (e.g., use the |default({})
filter or an if sample.scores statement) so the for-loop over
sample.scores.items() only runs when scores exist and avoids runtime failures.
- Around line 933-951: The template currently renders an <svg> with a <polyline>
and <circle> elements based on history_points; add a Jinja guard so when
history_points is empty the sparkline block (the <svg> / <polyline> / <circle>
rendering) is not emitted or a fallback placeholder is shown instead.
Concretely, wrap the existing SVG block (the polyline points generation and the
loop that emits circles) in a Jinja if/else that checks "if history_points" (or
"if history_points|length") and only renders the polyline and circles when
non-empty; otherwise render a simple empty-state element or omit the SVG to
avoid an empty points attribute and missing markers.

In `@packages/evaluator/tests/test_cli_refusal_flags.py`:
- Line 177: Move the pytest import to the top of the module with the other
imports and remove the trailing `import pytest  # noqa: E402` line; ensure any
references to pytest in test functions (used around the tests at lines where
pytest is currently referenced) continue to work and drop the E402 noqa comment
since pytest has no circular dependency here.

In `@packages/evaluator/tests/test_pdf_generator.py`:
- Around line 63-69: The test currently only checks that emulate_media was
called but not that it happened before navigation/content calls; update the
assertions to enforce ordering by inspecting mock_page.method_calls: find the
index of the emulate_media call (from emulate_calls or by scanning for call[0]
== "emulate_media") and the index of the goto and/or set_content calls (call[0]
== "goto" and call[0] == "set_content") and assert emulate_index < goto_index
(and emulate_index < set_content_index if relevant) so the test validates that
emulate_media executed before navigation/content operations.
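
A minimal sketch of the ordering assertion against mock_page.method_calls (the call sequence below is simulated; method names mirror the Playwright page API referenced above):

```python
from unittest import mock

page = mock.Mock()

# Simulate the generator's call sequence: media emulation, then content/navigation.
page.emulate_media(media="print")
page.set_content("<html>...</html>")
page.goto("about:blank")

# Each entry of method_calls unpacks into (name, args, kwargs).
names = [name for name, args, kwargs in page.method_calls]

emulate_index = names.index("emulate_media")
content_index = names.index("set_content")
goto_index = names.index("goto")

assert emulate_index < content_index, "emulate_media must precede set_content"
assert emulate_index < goto_index, "emulate_media must precede goto"
print(names)  # ['emulate_media', 'set_content', 'goto']
```

Indexing into the recorded call names validates ordering, not just presence, which is exactly what the review asks the test to enforce.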

In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 3-8: Remove the duplicate local "import pytest" inside the
tests/test_ragas_adapters.py module so it doesn't shadow the top-level import;
keep only the top-level "import pytest" already present near the top of the file
and delete the inner import (the one inside the test scope that causes
F401/F811), then run ruff/formatter to normalize the import block for the
RagForgeRagasLLM and RagForgeRagasEmbeddings tests.
- Around line 152-175: Define a captured_prompts list and actually exercise
the code path that builds/uses RagForgeRagasLLM, so the test proves
refusal_aware is propagated. Add captured_prompts = [] before
CapturingMockJudge, then have the test invoke the evaluator flow that calls
the judge (e.g., the evaluator method that triggers RagForgeRagasLLM ->
judge), or monkeypatch/replace RagForgeRagasLLM with a small fake that
captures its init args. Assert that the captured judge calls happened and that
the wrapper/fake received refusal_aware=True (or that captured_prompts
contains the expected user_prompt), instead of only checking
evaluator._refusal_aware. Reference symbols:
test_ragas_evaluator_threads_refusal_aware_into_wrapper,
CapturingMockJudge.judge, RagasEvaluator, and RagForgeRagasLLM.

In `@packages/evaluator/tests/test_ragas_extractor.py`:
- Around line 19-23: The import list in the test module unnecessarily includes
SkipRecord which causes a lint error; edit the import statement that currently
imports EvaluationSample and SkipRecord from rag_forge_evaluator.engine and
remove SkipRecord so only used symbols (e.g., EvaluationSample) are imported;
ensure the other imports (RagasEvaluator, _extract_ragas_score) remain unchanged
to avoid breaking references in the tests.

In `@packages/evaluator/tests/test_ragas_integration.py`:
- Around line 41-83: Update the two ragas tests to assert the documented
samples_evaluated / skip semantics. In test_ragas_end_to_end_with_mock_judge,
add an assertion that result.samples_evaluated in {len(samples), 0}, alongside
the existing metrics/skipped checks. In
test_ragas_whole_batch_crash_captured_as_skips, assert
result.samples_evaluated in {len(samples), 0} and, when it is 0, assert that
result.skipped_samples contains the input sample (e.g., sample_id
"broken-001") and that len(result.skipped_samples) == len(samples), so an
empty EvaluationResult no longer passes. Reference the EvaluationResult fields
samples_evaluated and skipped_samples when adding these assertions.
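
A sketch of the intended assertions, using minimal stand-ins for the engine types (field names follow the SkipRecord/EvaluationResult fields described in this review; the stand-in dataclasses are illustrative, not the real definitions):

```python
from dataclasses import dataclass, field

@dataclass
class SkipRecord:
    sample_id: str
    metric_name: str
    reason: str
    exception_type: str

@dataclass
class EvaluationResult:
    samples_evaluated: int
    skipped_samples: list = field(default_factory=list)

# Simulated whole-batch crash: nothing evaluated, everything skipped.
samples = ["broken-001"]
result = EvaluationResult(
    samples_evaluated=0,
    skipped_samples=[
        SkipRecord("broken-001", "faithfulness", "batch crash", "RuntimeError")
    ],
)

# The documented semantics: all-or-nothing, with skips accounting for the gap.
assert result.samples_evaluated in {len(samples), 0}
if result.samples_evaluated == 0:
    assert len(result.skipped_samples) == len(samples)
    assert any(s.sample_id == "broken-001" for s in result.skipped_samples)
print("ok")
```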

In `@packages/evaluator/tests/test_report_visual_regression.py`:
- Around line 131-135: The assertion currently uses a strict less-than check so
an exact 0.5% diff fails; change the assertion to allow the advertised tolerance
by using a less-than-or-equal comparison for diff_ratio (i.e., replace the "<
0.005" check in the assertion that references total_pixels, diff_pixels and
diff_ratio with "<= 0.005") and keep the existing formatted failure message
unchanged.
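
The boundary case motivating the change from < to <=:

```python
# An exact 0.5% pixel diff should pass under the advertised tolerance.
total_pixels = 10_000
diff_pixels = 50  # exactly 0.5% of the image differs
diff_ratio = diff_pixels / total_pixels

print(diff_ratio < 0.005)   # False: strict less-than rejects the boundary
print(diff_ratio <= 0.005)  # True: <= accepts the advertised tolerance
```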

In `@pyproject.toml`:
- Line 36: The pyproject.toml currently suppresses mypy for a first-party
package by including "rag_forge_evaluator.*" in the module ignore list; remove
"rag_forge_evaluator.*" from the module = [...] entry (the line containing
module = ["anthropic.*", ..., "rag_forge_evaluator.*"]) so mypy will type-check
your own package, or alternatively replace the suppression by adding proper type
information (inline typings, py.typed, or stubs) for rag_forge_evaluator so it
can be checked under strict=true.

---

Outside diff comments:
In `@packages/evaluator/src/rag_forge_evaluator/audit.py`:
- Around line 219-225: The create_evaluator call in audit.py currently ignores
the new AuditConfig fields ragas_max_tokens and ragas_embeddings_provider, so
non-default values are dropped; update the create_evaluator invocation (where
evaluator is assigned) to pass through these new config values (e.g.,
ragas_max_tokens=self.config.ragas_max_tokens and
ragas_embeddings_provider=self.config.ragas_embeddings_provider) so the factory
receives them; ensure create_evaluator’s signature (and any downstream evaluator
factory code) accepts and uses these parameters if not already implemented.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: cb95c54b-c1aa-404f-95cd-d939bdf537b5

📥 Commits

Reviewing files that changed from the base of the PR and between 8bb73c7 and 3ef4130.

📒 Files selected for processing (43)
  • .gitignore
  • CHANGELOG.md
  • docs/evaluators.md
  • docs/superpowers/plans/2026-04-14-v0.2.0-evaluator-refresh.md
  • docs/superpowers/specs/2026-04-14-v0.2.0-evaluator-refresh-design.md
  • packages/cli/package.json
  • packages/core/pyproject.toml
  • packages/evaluator/pyproject.toml
  • packages/evaluator/src/rag_forge_evaluator/audit.py
  • packages/evaluator/src/rag_forge_evaluator/cli.py
  • packages/evaluator/src/rag_forge_evaluator/engine.py
  • packages/evaluator/src/rag_forge_evaluator/engines/__init__.py
  • packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
  • packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py
  • packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py
  • packages/evaluator/src/rag_forge_evaluator/report/generator.py
  • packages/evaluator/src/rag_forge_evaluator/report/pdf.py
  • packages/evaluator/src/rag_forge_evaluator/report/templates/__init__.py
  • packages/evaluator/src/rag_forge_evaluator/report/templates/audit.html.j2
  • packages/evaluator/tests/fixtures/ragas_regression/__init__.py
  • packages/evaluator/tests/fixtures/ragas_regression/cycle-2-excerpt.jsonl
  • packages/evaluator/tests/fixtures/ragas_tiny_fixture.json
  • packages/evaluator/tests/fixtures/visual_baseline/.gitkeep
  • packages/evaluator/tests/test_audit_config_validation.py
  • packages/evaluator/tests/test_cli_refusal_flags.py
  • packages/evaluator/tests/test_engine_types.py
  • packages/evaluator/tests/test_enhanced_report.py
  • packages/evaluator/tests/test_evaluator_factory.py
  • packages/evaluator/tests/test_json_report.py
  • packages/evaluator/tests/test_llm_judge_aggregation.py
  • packages/evaluator/tests/test_pdf_generator.py
  • packages/evaluator/tests/test_pearmedica_regression.py
  • packages/evaluator/tests/test_ragas_adapters.py
  • packages/evaluator/tests/test_ragas_extractor.py
  • packages/evaluator/tests/test_ragas_integration.py
  • packages/evaluator/tests/test_refusal_aware_scoring.py
  • packages/evaluator/tests/test_report.py
  • packages/evaluator/tests/test_report_generator.py
  • packages/evaluator/tests/test_report_template.py
  • packages/evaluator/tests/test_report_visual_regression.py
  • packages/mcp/package.json
  • packages/observability/pyproject.toml
  • pyproject.toml
📜 Review details
🧰 Additional context used
🪛 GitHub Actions: CI
packages/evaluator/tests/test_ragas_extractor.py

[error] 19-19: ruff check: rag_forge_evaluator.engine.SkipRecord imported but unused. (F401)

packages/evaluator/tests/test_ragas_adapters.py

[error] 1-8: ruff check: Import block is un-sorted or un-formatted. (Organize imports)


[error] 6-6: ruff check: pytest imported but unused. (F401)


[error] 106-106: ruff check: Redefinition of unused pytest from line 6. (F811)


[error] 159-159: ruff check: Undefined name captured_prompts. (F821)

🪛 LanguageTool
docs/evaluators.md

[style] ~13-~13: Consider an alternative for the overused word “exactly”.
Context: ...-relevance — even though the refusal is exactly what the pipeline was designed to do. ...

(EXACTLY_PRECISELY)

docs/superpowers/specs/2026-04-14-v0.2.0-evaluator-refresh-design.md

[style] ~36-~36: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...rs.** DeepEval engine is untouched. - No core pipeline changes. `packages/core...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~37-~37: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...rsion bump for release consistency. - No offline / air-gapped embeddings. `sen...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~38-~38: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ers` support is deferred to v0.2.1. - No vercel.ts migration or other infrastr...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~39-~39: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...e churn unrelated to the evaluator. - No ragas 0.2.x compatibility. Hard cut t...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~209-~209: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...or aggregate reporting. - SampleScore gains the same two fields. **audit.py / `A...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[grammar] ~267-~267: Use a hyphen to join words.
Context: ...ds with judge justification. 11. Worst performing case callout — sample ID, h...

(QB_NEW_EN_HYPHEN)

🪛 markdownlint-cli2 (0.22.0)
docs/superpowers/specs/2026-04-14-v0.2.0-evaluator-refresh-design.md

[warning] 59-59: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

docs/superpowers/plans/2026-04-14-v0.2.0-evaluator-refresh.md

[warning] 1608-1608: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 2231-2231: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 2324-2324: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 2333-2333: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 2341-2341: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 2348-2348: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 2360-2360: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 2370-2370: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

🔇 Additional comments (38)
.gitignore (1)

71-72: Looks good — scratch workspace is correctly excluded.

Line 72 cleanly prevents accidental commits of local brainstorming artifacts.

packages/evaluator/tests/fixtures/ragas_tiny_fixture.json (1)

1-18: Fixture structure is solid and test-friendly.

Both samples are minimal, deterministic, and include the required evaluation fields.

packages/mcp/package.json (1)

3-3: Version bump is consistent with release scope.

Line 3 aligns MCP package metadata with the v0.2.0 rollout.

packages/core/pyproject.toml (1)

3-3: Version update is correct for coordinated release.

Line 3 cleanly aligns rag-forge-core with v0.2.0.

packages/observability/pyproject.toml (1)

3-3: Version bump looks good.

Line 3 keeps observability package version in sync with the v0.2.0 release train.

packages/cli/package.json (1)

3-3: CLI version alignment is correct.

Line 3 properly reflects the v0.2.0 release.

packages/evaluator/src/rag_forge_evaluator/report/pdf.py (1)

72-79: Print-media PDF rendering update looks correct.

Line 72 and Lines 74-79 correctly force print media and honor CSS page sizing, which matches the new print-oriented template behavior.

packages/evaluator/tests/test_evaluator_factory.py (1)

24-29: Test setup now matches v0.2.0 ragas construction requirements.

Passing MockJudge() on Line 28 is the right fix to keep this test focused on deferred dependency import behavior.

packages/evaluator/tests/test_report.py (2)

33-35: RMM badge assertions are aligned with the new template format.

The "Level" + "Better Trust" checks correctly match the updated header rendering.


40-42: Metric display-name assertion update is correct.

Checking "Faithfulness" (title-cased) reflects the generator’s display transformation and keeps the test contract accurate.

packages/evaluator/tests/test_enhanced_report.py (2)

11-25: SVG assertion update matches the new sparkline implementation.

The revised check is consistent with the template’s <svg> + polyline/circle rendering.


27-43: Backward-compatibility test for trends kwarg is on point.

This keeps API compatibility covered without asserting removed visual behavior.

packages/evaluator/src/rag_forge_evaluator/cli.py (2)

57-59: Use explicit tri-state resolution for refusal flags.

Line 58 currently ignores refusal_aware_flag; today it works by default, but it makes precedence less explicit and future-default changes riskier.

[ suggest_recommended_refactor ]

♻️ Proposed refactor
-    # Compute refusal_aware from mutually-exclusive flags
-    strict = bool(getattr(args, "strict", False) or getattr(args, "no_refusal_aware_flag", False))
-    refusal_aware = not strict
+    # Compute refusal_aware from mutually-exclusive flags (explicit precedence)
+    if bool(getattr(args, "strict", False) or getattr(args, "no_refusal_aware_flag", False)):
+        refusal_aware = False
+    elif bool(getattr(args, "refusal_aware_flag", False)):
+        refusal_aware = True
+    else:
+        refusal_aware = True  # default for v0.2.x

311-329: Mutually-exclusive refusal flag group is implemented correctly.

The parser-level exclusivity cleanly prevents invalid combinations and keeps CLI behavior predictable.
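
The parser setup plus the explicit tri-state resolution suggested above can be sketched as follows (flag dest names are assumptions; the real wiring lives in cli.py):

```python
import argparse

def resolve_refusal_aware(args: argparse.Namespace) -> bool:
    # Explicit precedence: strict-style flags win, then the opt-in flag,
    # then the v0.2.x default (refusal-aware on).
    if getattr(args, "strict", False) or getattr(args, "no_refusal_aware", False):
        return False
    if getattr(args, "refusal_aware", False):
        return True
    return True  # default for v0.2.x

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
group.add_argument("--strict", action="store_true")
group.add_argument("--refusal-aware", dest="refusal_aware", action="store_true")
group.add_argument("--no-refusal-aware", dest="no_refusal_aware", action="store_true")

print(resolve_refusal_aware(parser.parse_args([])))                      # True
print(resolve_refusal_aware(parser.parse_args(["--strict"])))            # False
print(resolve_refusal_aware(parser.parse_args(["--no-refusal-aware"])))  # False
```

The mutually exclusive group makes conflicting combinations (e.g. --strict with --refusal-aware) exit with a parser error, while the resolver keeps the default explicit and easy to flip in a future release.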

packages/evaluator/tests/test_pdf_generator.py (1)

31-62: Good coverage addition for print-media emulation.

This test meaningfully guards the new PDF rendering behavior and mocks the Playwright call chain cleanly.

packages/evaluator/src/rag_forge_evaluator/engines/__init__.py (1)

18-23: Refusal-aware threading into LLMJudgeEvaluator is correct.

The new factory parameter is propagated cleanly and keeps behavior centralized.

packages/evaluator/pyproject.toml (3)

34-35: Optional dependency updates are coherent with the new evaluator paths.

The ragas + datasets pairing and ragas-voyage extra are aligned with the new adapter wiring.


49-50: Template force-include is the right packaging safeguard.

This ensures audit.html.j2 is available in wheel installs where PackageLoader is used.


52-56: Pytest marker registration is a good addition.

Declaring opt-in markers here prevents unknown-marker noise and documents test gating clearly.

packages/evaluator/tests/test_json_report.py (4)

6-13: LGTM: Updated imports align with v0.2.0 data model additions.

The imports correctly include pytest for approx usage and SkipRecord for testing the new skip semantics. These align with the engine type changes.


65-92: LGTM: Comprehensive test for skipped_samples field.

The test correctly validates the new skipped_samples JSON structure, including all four required fields (sample_id, metric_name, reason, exception_type). This aligns with the SkipRecord dataclass definition and the generate_json() implementation.


109-122: LGTM: Good use of pytest.approx for floating-point comparison.

Testing safety_refusal_rate with pytest.approx(1 / 12) is the correct approach for floating-point equality in Python tests, avoiding precision issues.


155-163: LGTM: Edge case coverage for zero samples.

Good defensive test ensuring that when samples_evaluated=0, the report gracefully handles the division-by-zero case by returning safety_refusal_rate=0.0 and empty collections.

packages/evaluator/tests/test_ragas_extractor.py (2)

49-58: LGTM: Test expectations correctly updated for v0.2.0 behavior.

The tests now properly assert ValueError is raised instead of silent 0.0 fallback, matching the updated _extract_ragas_score implementation that eliminates the bug where broken infrastructure was indistinguishable from zero scores.


74-93: LGTM: Good coverage for RagasEvaluator judge injection.

These tests validate the factory threading and the new requirement that RagasEvaluator must have a judge injected. The expected metrics list matches _METRIC_NAMES from the implementation.

packages/evaluator/tests/test_engine_types.py (1)

1-57: LGTM: Solid unit test coverage for new engine types.

The test file correctly validates:

  • SkipRecord dataclass construction with all required fields
  • MetricResult accepts new scoring_mode and refusal_justification fields
  • EvaluationResult defaults skipped_samples to [] and scoring_modes_count to {}
  • SampleResult properly stores scoring mode and allows None justification

These tests align with the dataclass definitions in engine.py.

docs/superpowers/plans/2026-04-14-v0.2.0-evaluator-refresh.md (1)

1-57: LGTM: Comprehensive implementation plan with clear task breakdown.

The plan document is well-structured with:

  • Clear codebase orientation for new engineers
  • File structure plan with responsibilities
  • 20 tasks with TDD approach (failing test → implementation → pass)
  • Spec coverage self-review checklist

This serves as excellent documentation for the v0.2.0 evaluator refresh.

packages/evaluator/tests/fixtures/ragas_regression/cycle-2-excerpt.jsonl (1)

1-3: LGTM: Well-designed regression fixture for v0.1.3 bug coverage.

The three-sample fixture effectively covers:

  1. acs-001-typical-stemi: Standard clinical case for positive scoring
  2. htn-001-emergency-end-organ: Another standard case with multi-organ involvement
  3. t2dm-002-paediatric-dosing-refusal: Refusal case designed to test the new refusal-aware scoring path

The refusal case (line 3) is particularly valuable — it represents the PearMedica scenario where the v0.1.3 evaluator incorrectly penalized valid safety refusals. The sample IDs are unique and follow a consistent {condition}-{number}-{descriptor} pattern.

packages/evaluator/tests/test_audit_config_validation.py (3)

50-62: LGTM: Test correctly updated for v0.2.0 ragas+mock behavior.

The test now validates that ragas + mock judge combination is allowed, reflecting the new RagForgeRagasLLM wrapper that honors the configured judge end-to-end. This replaces the old blocking behavior where mock would silently bypass the cost gate.


128-166: LGTM: Good coverage for voyageai conditional logic.

The tests properly validate:

  • When _voyageai_installed() returns True, ragas+claude construction succeeds
  • When it returns False, a ConfigurationError is raised with "ragas-voyage" in the message

The monkeypatching approach correctly isolates the voyageai availability check.


169-185: LGTM: Default value assertions match AuditConfig definition.

The tests correctly validate:

  • refusal_aware defaults to True
  • ragas_max_tokens defaults to 8192
  • ragas_embeddings_provider defaults to None (auto-select)

These match the AuditConfig dataclass definition in the context snippet.

packages/evaluator/tests/test_cli_refusal_flags.py (3)

9-63: LGTM: Parser reconstruction provides good test isolation.

The _build_parser() function correctly reconstructs the audit subparser with all the refusal-aware flags in a mutually exclusive group. This mirrors the actual CLI implementation and allows testing without invoking the full CLI main().


66-87: LGTM: Config builder correctly mirrors cli.py logic.

The _build_audit_config_from_args() function accurately replicates the refusal-aware computation from cli.py:

strict = bool(getattr(args, "strict", False) or getattr(args, "no_refusal_aware_flag", False))
refusal_aware = not strict

This ensures test assertions reflect actual CLI behavior.


140-174: LGTM: Comprehensive mutual exclusivity tests.

Good coverage of all invalid flag combinations:

  • --strict + --refusal-aware
  • --strict + --no-refusal-aware
  • --refusal-aware + --no-refusal-aware

Using pytest.raises(SystemExit) correctly handles argparse's behavior on conflicting arguments.

packages/evaluator/src/rag_forge_evaluator/report/templates/audit.html.j2 (2)

1-870: LGTM: Well-structured CSS with proper theming and print support.

The CSS architecture is solid:

  • CSS custom properties (--color-*, --font-*) enable consistent theming
  • Responsive breakpoints at 720px handle mobile gracefully
  • Print media rules properly expand <details> elements and prevent page breaks inside cards
  • The @page rules with @bottom-right and @bottom-left content are appropriate for PDF generation

1176-1202: LGTM: Conditional skipped samples section with clear explanation.

The conditional {% if skipped_samples %} block correctly renders only when there are skipped samples, with a helpful note explaining that "Broken infrastructure is visible rather than hidden behind zero scores." This aligns with the v0.2.0 SkipRecord semantics.

packages/evaluator/src/rag_forge_evaluator/engine.py (1)

5-84: Nice backwards-compatible engine model refresh.

Adding the new scoring/skip fields as optional or default-factory-backed extensions keeps existing call sites stable while exposing the richer report state this release needs.

packages/evaluator/tests/test_llm_judge_aggregation.py (1)

37-163: Strong regression coverage for the combined-response path.

These cases exercise both full-sample and per-metric skip behavior, including the PearMedica pollution scenario that motivated the aggregation fix.

Comment thread CHANGELOG.md
Comment on lines +1605 to +1610
```bash
uv pip install 'ragas>=0.4,<0.6'
uv run pytest -m ragas_integration packages/evaluator/tests/test_ragas_integration.py -v
```
Expected: both tests pass.


🧹 Nitpick | 🔵 Trivial

Optional: Add blank lines around fenced code blocks for markdown consistency.

Several fenced code blocks in this document are not surrounded by blank lines (e.g., lines 1608, 2231, 2324, 2333, 2341, 2348, 2360, 2370). While this doesn't affect functionality, adding blank lines improves markdown rendering consistency across parsers.

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 1608-1608: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/superpowers/plans/2026-04-14-v0.2.0-evaluator-refresh.md` around lines
1605 - 1610, Add a blank line before and after each fenced code block (e.g., the
```bash block containing "uv pip install 'ragas>=0.4,<0.6'" and "uv run pytest
-m ragas_integration ...") so every ```...``` block is separated by an empty
line above and below; update all similar fenced blocks in the document to
maintain consistent markdown rendering.

Comment thread docs/superpowers/specs/2026-04-14-v0.2.0-evaluator-refresh-design.md Outdated
Comment on lines 24 to +26
    if engine == "ragas":
        from rag_forge_evaluator.engines.ragas_evaluator import RagasEvaluator
-       return RagasEvaluator(thresholds=thresholds)
+       return RagasEvaluator(judge=judge, thresholds=thresholds, refusal_aware=refusal_aware)

⚠️ Potential issue | 🟠 Major

Fail fast when creating ragas evaluator without a judge.

Line 26 allows judge=None, but ragas now requires a judge and only errors later during evaluate(). This delays failure and weakens diagnostics.

🛠️ Proposed fix
     if engine == "ragas":
         from rag_forge_evaluator.engines.ragas_evaluator import RagasEvaluator
+        if judge is None:
+            raise ValueError(
+                "Ragas evaluator requires a judge. "
+                "Pass --judge (e.g. mock|claude|openai) or provide a JudgeProvider."
+            )
         return RagasEvaluator(judge=judge, thresholds=thresholds, refusal_aware=refusal_aware)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (indentation restored):

    if engine == "ragas":
        from rag_forge_evaluator.engines.ragas_evaluator import RagasEvaluator
        if judge is None:
            raise ValueError(
                "Ragas evaluator requires a judge. "
                "Pass --judge (e.g. mock|claude|openai) or provide a JudgeProvider."
            )
        return RagasEvaluator(judge=judge, thresholds=thresholds, refusal_aware=refusal_aware)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/engines/__init__.py` around lines
24 - 26, When engine == "ragas", fail fast if no judge is provided: before
instantiating RagasEvaluator in __init__.py, check that the judge parameter is
not None and raise a descriptive ValueError (e.g., "ragas engine requires a
judge") so we don't defer the error until RagasEvaluator.evaluate(); then
proceed to import/return RagasEvaluator(judge=judge, thresholds=thresholds,
refusal_aware=refusal_aware).

Comment on lines +41 to +79
class RagForgeRagasLLM:
    """ragas LLM wrapper that forwards to a rag-forge Judge.

    Implements both sync (``generate_text``) and async (``agenerate_text``)
    interfaces that ragas 0.4.x calls internally. The async variant runs
    the sync path in a thread — the underlying Judge implementations are
    sync, and ragas parallelizes metrics at a higher level anyway.
    """

    def __init__(
        self,
        judge: _JudgeLike,
        max_tokens: int = 8192,
        system_prompt: str = "",
        refusal_aware: bool = True,
    ) -> None:
        self._judge = judge
        self.max_tokens = max_tokens
        self._system_prompt = system_prompt
        self._refusal_aware = refusal_aware

    def generate_text(self, prompt: str) -> str:
        """Generate text using the underlying judge."""
        if self._refusal_aware:
            prompt = _REFUSAL_NOTE + prompt
        return self._judge.judge(self._system_prompt, prompt)

    async def agenerate_text(self, prompt: str) -> str:
        """Async variant of generate_text.

        Runs the sync path in a thread to maintain compatibility
        with ragas 0.4.x's async callsites.
        """
        return await asyncio.to_thread(self.generate_text, prompt)

    def model_name(self) -> str:
        """Return the model name from the underlying judge."""
        return self._judge.model_name()
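
The sync-to-async bridge above can be exercised in isolation with a stand-in judge (MockJudge and SyncToAsyncLLM below are illustrative names, not the shipped classes):

```python
import asyncio

class MockJudge:
    # Stand-in for a rag-forge Judge: sync judge() plus model_name().
    def judge(self, system_prompt: str, user_prompt: str) -> str:
        return f"scored:{user_prompt[:10]}"

    def model_name(self) -> str:
        return "mock-judge"

class SyncToAsyncLLM:
    # Mirrors the adapter's pattern: sync generate_text, async variant
    # delegating to a worker thread via asyncio.to_thread.
    def __init__(self, judge: MockJudge) -> None:
        self._judge = judge

    def generate_text(self, prompt: str) -> str:
        return self._judge.judge("", prompt)

    async def agenerate_text(self, prompt: str) -> str:
        return await asyncio.to_thread(self.generate_text, prompt)

llm = SyncToAsyncLLM(MockJudge())
print(asyncio.run(llm.agenerate_text("evaluate faithfulness")))  # scored:evaluate f
```

Running the sync path in a thread keeps the event loop unblocked without requiring the underlying Judge implementations to grow async APIs.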


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

(Verification scripts executed against hallengray/rag-forge: locating
engines/ragas_adapters.py, tracing RagForgeRagasLLM / RagForgeRagasEmbeddings
usage, checking the ragas pin in pyproject.toml, and inspecting the
Judge.judge and _REFUSAL_NOTE call sites; script outputs elided.)

🌐 Web query:

ragas BaseRagasLLM BaseRagasEmbeddings evaluate() method contracts

💡 Result:

The evaluate method in Ragas is the primary function for performing evaluation on a dataset using specified metrics. Its key contracts regarding BaseRagasLLM and BaseRagasEmbeddings are as follows:

- llm: Optional[BaseRagasLLM | LangchainLLM] - the language model used to generate scores for metrics requiring an LLM. If not provided, Ragas uses a default LLM. This can be overridden at the metric level via metric.llm. Custom LLMs must subclass BaseRagasLLM or be wrapped (e.g., LangchainLLMWrapper).
- embeddings: Optional[BaseRagasEmbeddings | BaseRagasEmbedding | LangchainEmbeddings] - the embeddings model for metrics requiring embeddings. If not provided, Ragas uses a default. Custom embeddings must subclass BaseRagasEmbeddings (or the single-text BaseRagasEmbedding) or be wrapped (e.g., LangchainEmbeddingsWrapper).

BaseRagasLLM is an abstract base class for LLMs in Ragas, providing methods like get_temperature and supporting run_config, caching, and multiple completions. It is used by MetricWithLLM subclasses. BaseRagasEmbeddings extends LangChain's Embeddings ABC, adding run_config support, caching (via embed_query, aembed_query, etc.), and methods like embed_text, embed_texts. BaseRagasEmbedding is a modern single-text embedding ABC with embed_text and aembed_text.

These classes ensure compatibility: metrics initialize with the provided llm/embeddings (raising ValueError if missing when required), set run_config, and use them for scoring (sync/async via _score/_ascore). Example usage:

from ragas import evaluate
result = evaluate(
    dataset=dataset,
    metrics=[...],
    llm=your_BaseRagasLLM_instance,
    embeddings=your_BaseRagasEmbeddings_instance,
)

Returns EvaluationResult with metric scores (e.g., {'context_precision': 0.817, ...}).



🌐 Web query:

ragas 0.4.x LLMResult agenerate_text prompt n temperature stop callbacks

💡 Result:

In Ragas 0.4.x, LLMResult is the return type of the text generation methods on BaseRagasLLM instances, imported from langchain_core.outputs (langchain_core.outputs.LLMResult). It contains generations (a list of ChatGeneration or Generation objects with text, generation_info, etc.). Key methods that produce an LLMResult:

- BaseRagasLLM.generate(prompt: PromptValue, n: int = 1, temperature: Optional[float] = 0.01, stop: Optional[List[str]] = None, callbacks: Callbacks = None) -> LLMResult (async wrapper calling agenerate_text)
- generate_text(prompt: PromptValue, n: int = 1, temperature: Optional[float] = 0.01, stop: Optional[List[str]] = None, callbacks: Callbacks = None) -> LLMResult (sync, implemented by subclasses)
- agenerate_text(prompt: PromptValue, n: int = 1, temperature: Optional[float] = 0.01, stop: Optional[List[str]] = None, callbacks: Callbacks = None) -> LLMResult (async, abstract/implemented in wrappers like LangchainLLMWrapper, OCIGenAIWrapper)

These are called internally by prompts (e.g., StringPrompt.generate calls llm.agenerate_text(...)) and higher-level components. There is no "agenerate_text" method on LLMResult itself; it is the output type, and agenerate_text refers to the abstract async implementation in BaseRagasLLM subclasses. Example usage from source:

llm_result = await llm.agenerate_text(StringPromptValue(text=data), n=1, temperature=temperature, stop=stop, callbacks=callbacks)

Access: llm_result.generations[0][0].text, llm_result.flatten for is_finished checks. Ragas v0.4.x docs confirm this API in /en/v0.4.1/references/llms/ and GitHub src/ragas/llms/base.py.



Implement proper ragas base class inheritance instead of relying on duck typing.

The adapter classes use narrower method signatures (generate_text(str) -> str, agenerate_text(str) -> str) than ragas 0.4.x contracts require (generate_text(prompt: PromptValue, n: int, temperature: float, stop: list, callbacks) -> LLMResult). While duck typing currently works because ragas metrics only call the basic subset, this is fragile. ragas can silently break these adapters on a minor version update if it starts passing additional parameters or expecting LLMResult. Additionally, max_tokens is stored but never passed to the underlying judge, limiting token control. Subclass BaseRagasLLM and BaseRagasEmbeddings directly, implement the full interface, and properly thread max_tokens through to the judge calls. (docs.ragas.io/en/latest/references/evaluate/)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py` around
lines 41 - 79, The RagForgeRagasLLM adapter currently duck-types ragas and
exposes a narrow generate_text( str )->str; update it to subclass ragas'
BaseRagasLLM and implement the full ragas contract: replace class
RagForgeRagasLLM to inherit BaseRagasLLM, change generate_text to the full
signature (accept PromptValue prompt, n:int, temperature:float, stop:list|None,
callbacks:Optional[...] etc.), return an LLMResult object, and ensure
agenerate_text mirrors the async signature and runs the sync implementation via
asyncio.to_thread; also thread the stored max_tokens into the underlying call
(pass max_tokens to _judge.judge or map it into the judge call parameters) and
keep model_name() forwarding to _judge.model_name() so ragas sees the proper
model metadata.
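The full-contract adapter the prompt describes can be sketched as follows. This is a minimal, self-contained illustration: Generation, LLMResult, and BaseRagasLLM are stand-ins defined inline so the sketch runs without ragas installed (the real adapter would import BaseRagasLLM from ragas and LLMResult/Generation from langchain_core.outputs, per the contracts quoted above), and FullContractRagasLLM plus the judge's max_tokens keyword are hypothetical names standing in for the reworked RagForgeRagasLLM.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, List, Optional

# Inline stand-ins so this sketch runs without ragas / langchain-core installed.
@dataclass
class Generation:
    text: str

@dataclass
class LLMResult:
    generations: List[List[Generation]]

class BaseRagasLLM:  # placeholder for the ragas ABC
    pass

class FullContractRagasLLM(BaseRagasLLM):
    """Adapter exposing the full ragas 0.4.x text-generation signature
    and threading max_tokens through to the underlying judge."""

    def __init__(self, judge: Any, max_tokens: int = 1024) -> None:
        self._judge = judge
        self._max_tokens = max_tokens

    def model_name(self) -> str:
        # Forward so ragas sees proper model metadata.
        return self._judge.model_name()

    def generate_text(
        self,
        prompt: Any,  # ragas passes a PromptValue; the sketch also accepts str
        n: int = 1,
        temperature: Optional[float] = 0.01,
        stop: Optional[List[str]] = None,
        callbacks: Any = None,
    ) -> LLMResult:
        text = prompt.to_string() if hasattr(prompt, "to_string") else str(prompt)
        # Pass max_tokens through instead of silently dropping it.
        outputs = [self._judge.judge(text, max_tokens=self._max_tokens) for _ in range(n)]
        return LLMResult(generations=[[Generation(text=o)] for o in outputs])

    async def agenerate_text(
        self,
        prompt: Any,
        n: int = 1,
        temperature: Optional[float] = 0.01,
        stop: Optional[List[str]] = None,
        callbacks: Any = None,
    ) -> LLMResult:
        # Mirror the sync path without blocking the event loop.
        return await asyncio.to_thread(self.generate_text, prompt, n, temperature, stop, callbacks)
```

Because ragas reads generations[i][j].text off the result, returning a properly shaped LLMResult keeps the adapter stable if a future ragas release starts passing n > 1, stop sequences, or callbacks.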

Comment on lines +152 to +175
def test_ragas_evaluator_threads_refusal_aware_into_wrapper():
    """RagasEvaluator(refusal_aware=True) should construct RagForgeRagasLLM
    with refusal_aware=True. We verify by capturing what gets passed to the judge."""
    from rag_forge_evaluator.engines.ragas_evaluator import RagasEvaluator

    class CapturingMockJudge:
        def judge(self, system_prompt: str, user_prompt: str) -> str:
            captured_prompts.append(user_prompt)
            # Return a minimal valid JSON that ragas metrics expect
            return '{"faithfulness": 0.5}'

        def model_name(self) -> str:
            return "capture-judge"

    evaluator = RagasEvaluator(
        judge=CapturingMockJudge(),
        thresholds={},
        refusal_aware=True,
        embeddings_provider="mock",
    )

    # To avoid needing ragas installed, just verify that the evaluator
    # was constructed with refusal_aware=True by checking the attribute
    assert evaluator._refusal_aware is True

⚠️ Potential issue | 🔴 Critical

Make this test exercise the behavior it claims to cover.

captured_prompts is undefined in CapturingMockJudge.judge() (ruff F821), and the test never calls the path that constructs RagForgeRagasLLM anyway—it only inspects a private attribute on RagasEvaluator. As written, it will not catch a regression where the evaluator stops passing refusal_aware into the wrapper.

🧰 Tools
🪛 GitHub Actions: CI

[error] 159-159: ruff check: Undefined name captured_prompts. (F821)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/tests/test_ragas_adapters.py` around lines 152 - 175,
Define a captured_prompts list and actually exercise the code path that
builds/uses RagForgeRagasLLM so the test proves refusal_aware is propagated: add
captured_prompts = [] before CapturingMockJudge and have the test invoke the
evaluator flow that calls judge (e.g., call the evaluator method that triggers
RagForgeRagasLLM -> judge) or monkeypatch/replace RagForgeRagasLLM with a small
fake that captures its init args; then assert the captured judge calls happened
and that the wrapper/fake received refusal_aware=True (or that captured_prompts
contains the expected user_prompt), instead of only checking
evaluator._refusal_aware. Reference symbols:
test_ragas_evaluator_threads_refusal_aware_into_wrapper,
CapturingMockJudge.judge, RagasEvaluator, and RagForgeRagasLLM.
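The strengthened test the prompt describes can be illustrated with a self-contained sketch of the capture pattern. Evaluator and Wrapper here are hypothetical stand-ins for RagasEvaluator and RagForgeRagasLLM, and the class-attribute swap plays the role of pytest's monkeypatch.setattr on whatever module attribute the evaluator actually resolves:

```python
from typing import Any

class Wrapper:
    """Stand-in for RagForgeRagasLLM."""
    def __init__(self, judge: Any, refusal_aware: bool = False) -> None:
        self.judge = judge
        self.refusal_aware = refusal_aware

class Evaluator:
    """Stand-in for RagasEvaluator."""
    wrapper_cls = Wrapper  # patchable construction point

    def __init__(self, judge: Any, refusal_aware: bool = True) -> None:
        self._llm = self.wrapper_cls(judge, refusal_aware=refusal_aware)

captured_kwargs: dict = {}

class CapturingWrapper:
    def __init__(self, judge: Any, **kwargs: Any) -> None:
        captured_kwargs.update(kwargs)

def test_refusal_aware_is_forwarded() -> None:
    original = Evaluator.wrapper_cls
    Evaluator.wrapper_cls = CapturingWrapper  # monkeypatch.setattr equivalent
    try:
        Evaluator(judge=object(), refusal_aware=True)
    finally:
        Evaluator.wrapper_cls = original
    # The assertion now covers the forwarding path, not a private attribute.
    assert captured_kwargs == {"refusal_aware": True}
```

The fake never constructs the real wrapper, so the test stays independent of ragas while still failing if the evaluator stops threading the flag through.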

Comment thread packages/evaluator/tests/test_ragas_extractor.py Outdated
Comment on lines +41 to +83
@pytest.mark.ragas_integration
def test_ragas_end_to_end_with_mock_judge():
    """Wire real ragas 0.4.x through our injected wrappers. The exact
    scores depend on ragas's internal prompts, but the contract is:
    - no exception propagates out
    - samples_evaluated matches the input count (or 0 if ragas crashed and everything was skipped)
    - skipped_samples plus metrics cover what ragas tried to do
    """
    samples = _load_samples()
    evaluator = RagasEvaluator(
        judge=MockJudge(),
        thresholds={},
        embeddings_provider="mock",
    )
    result = evaluator.evaluate(samples)

    assert result is not None
    # Either ragas ran and produced metrics OR the whole-batch-crash path
    # populated skipped_samples — both are acceptable outcomes for this
    # contract test. The invariant is that the evaluator did not raise.
    assert len(result.metrics) + len(result.skipped_samples) >= 1


@pytest.mark.ragas_integration
def test_ragas_whole_batch_crash_captured_as_skips():
    """Empty contexts is a known bad input for ragas 0.4.x — assert that
    when it crashes, we emit SkipRecords instead of raising."""
    samples = [
        EvaluationSample(
            query="broken",
            contexts=[],  # ragas rejects empty contexts
            response="ignored",
            sample_id="broken-001",
        ),
    ]
    evaluator = RagasEvaluator(
        judge=MockJudge(),
        thresholds={},
        embeddings_provider="mock",
    )
    # Must not raise. Either ragas tolerated it (unlikely) or we captured skips.
    result = evaluator.evaluate(samples)
    assert result is not None

🧹 Nitpick | 🔵 Trivial

Tighten the contract assertions in these ragas integration tests.

Both docstrings promise more than the assertions enforce. In particular, the crash-path test still passes on an empty EvaluationResult, and the first test never checks the documented samples_evaluated invariant.

Proposed fix
 @pytest.mark.ragas_integration
 def test_ragas_end_to_end_with_mock_judge():
@@
     result = evaluator.evaluate(samples)

     assert result is not None
+    assert result.samples_evaluated in (0, len(samples))
     # Either ragas ran and produced metrics OR the whole-batch-crash path
     # populated skipped_samples — both are acceptable outcomes for this
     # contract test. The invariant is that the evaluator did not raise.
     assert len(result.metrics) + len(result.skipped_samples) >= 1
@@
 @pytest.mark.ragas_integration
 def test_ragas_whole_batch_crash_captured_as_skips():
@@
     # Must not raise. Either ragas tolerated it (unlikely) or we captured skips.
     result = evaluator.evaluate(samples)
     assert result is not None
+    assert result.samples_evaluated in (0, len(samples))
+    assert len(result.metrics) + len(result.skipped_samples) >= 1
+    if result.skipped_samples:
+        assert all(s.sample_id == "broken-001" for s in result.skipped_samples)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@pytest.mark.ragas_integration
def test_ragas_end_to_end_with_mock_judge():
    """Wire real ragas 0.4.x through our injected wrappers. The exact
    scores depend on ragas's internal prompts, but the contract is:
    - no exception propagates out
    - samples_evaluated matches the input count (or 0 if ragas crashed and everything was skipped)
    - skipped_samples plus metrics cover what ragas tried to do
    """
    samples = _load_samples()
    evaluator = RagasEvaluator(
        judge=MockJudge(),
        thresholds={},
        embeddings_provider="mock",
    )
    result = evaluator.evaluate(samples)
    assert result is not None
    # Either ragas ran and produced metrics OR the whole-batch-crash path
    # populated skipped_samples — both are acceptable outcomes for this
    # contract test. The invariant is that the evaluator did not raise.
    assert len(result.metrics) + len(result.skipped_samples) >= 1


@pytest.mark.ragas_integration
def test_ragas_whole_batch_crash_captured_as_skips():
    """Empty contexts is a known bad input for ragas 0.4.x — assert that
    when it crashes, we emit SkipRecords instead of raising."""
    samples = [
        EvaluationSample(
            query="broken",
            contexts=[],  # ragas rejects empty contexts
            response="ignored",
            sample_id="broken-001",
        ),
    ]
    evaluator = RagasEvaluator(
        judge=MockJudge(),
        thresholds={},
        embeddings_provider="mock",
    )
    # Must not raise. Either ragas tolerated it (unlikely) or we captured skips.
    result = evaluator.evaluate(samples)
    assert result is not None
@pytest.mark.ragas_integration
def test_ragas_end_to_end_with_mock_judge():
    """Wire real ragas 0.4.x through our injected wrappers. The exact
    scores depend on ragas's internal prompts, but the contract is:
    - no exception propagates out
    - samples_evaluated matches the input count (or 0 if ragas crashed and everything was skipped)
    - skipped_samples plus metrics cover what ragas tried to do
    """
    samples = _load_samples()
    evaluator = RagasEvaluator(
        judge=MockJudge(),
        thresholds={},
        embeddings_provider="mock",
    )
    result = evaluator.evaluate(samples)
    assert result is not None
    assert result.samples_evaluated in (0, len(samples))
    # Either ragas ran and produced metrics OR the whole-batch-crash path
    # populated skipped_samples — both are acceptable outcomes for this
    # contract test. The invariant is that the evaluator did not raise.
    assert len(result.metrics) + len(result.skipped_samples) >= 1


@pytest.mark.ragas_integration
def test_ragas_whole_batch_crash_captured_as_skips():
    """Empty contexts is a known bad input for ragas 0.4.x — assert that
    when it crashes, we emit SkipRecords instead of raising."""
    samples = [
        EvaluationSample(
            query="broken",
            contexts=[],  # ragas rejects empty contexts
            response="ignored",
            sample_id="broken-001",
        ),
    ]
    evaluator = RagasEvaluator(
        judge=MockJudge(),
        thresholds={},
        embeddings_provider="mock",
    )
    # Must not raise. Either ragas tolerated it (unlikely) or we captured skips.
    result = evaluator.evaluate(samples)
    assert result is not None
    assert result.samples_evaluated in (0, len(samples))
    assert len(result.metrics) + len(result.skipped_samples) >= 1
    if result.skipped_samples:
        assert all(s.sample_id == "broken-001" for s in result.skipped_samples)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/tests/test_ragas_integration.py` around lines 41 - 83,
Update the two ragas tests to assert the documented samples_evaluated / skip
semantics: in test_ragas_end_to_end_with_mock_judge add an assertion that
result.samples_evaluated is either len(samples) or 0 (i.e.,
result.samples_evaluated in {len(samples), 0}) in addition to the existing
metrics/ skipped check; in test_ragas_whole_batch_crash_captured_as_skips assert
result.samples_evaluated is either len(samples) or 0 and, if it is 0, assert
that result.skipped_samples contains the input sample (e.g., sample_id
"broken-001") and that len(result.skipped_samples) == len(samples) so an empty
EvaluationResult no longer passes. Ensure you reference the EvaluationResult
fields samples_evaluated and skipped_samples when adding these assertions.

Comment on lines +131 to +135
total_pixels = baseline_img.width * baseline_img.height
diff_pixels = sum(1 for p in diff.getdata() if p != (0, 0, 0))
diff_ratio = diff_pixels / total_pixels
assert diff_ratio < 0.005, (
f"{diff_pixels}/{total_pixels} pixels differ from baseline ({diff_ratio:.2%}). "

⚠️ Potential issue | 🟡 Minor

Make the advertised 0.5% tolerance inclusive.

The module header says “up to 0.5%” drift is tolerated, but this assertion still fails at exactly 0.5%.

Proposed fix
-    assert diff_ratio < 0.005, (
+    assert diff_ratio <= 0.005, (
         f"{diff_pixels}/{total_pixels} pixels differ from baseline ({diff_ratio:.2%}). "
         f"Re-capture baseline if the template deliberately changed."
     )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
total_pixels = baseline_img.width * baseline_img.height
diff_pixels = sum(1 for p in diff.getdata() if p != (0, 0, 0))
diff_ratio = diff_pixels / total_pixels
assert diff_ratio < 0.005, (
f"{diff_pixels}/{total_pixels} pixels differ from baseline ({diff_ratio:.2%}). "
total_pixels = baseline_img.width * baseline_img.height
diff_pixels = sum(1 for p in diff.getdata() if p != (0, 0, 0))
diff_ratio = diff_pixels / total_pixels
assert diff_ratio <= 0.005, (
f"{diff_pixels}/{total_pixels} pixels differ from baseline ({diff_ratio:.2%}). "
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/tests/test_report_visual_regression.py` around lines 131 -
135, The assertion currently uses a strict less-than check so an exact 0.5% diff
fails; change the assertion to allow the advertised tolerance by using a
less-than-or-equal comparison for diff_ratio (i.e., replace the "< 0.005" check
in the assertion that references total_pixels, diff_pixels and diff_ratio with
"<= 0.005") and keep the existing formatted failure message unchanged.

Comment thread pyproject.toml Outdated
# Optional integrations installed as extras — stubs not available
[[tool.mypy.overrides]]
module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*"]
module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*", "rag_forge_evaluator.*"]

⚠️ Potential issue | 🟠 Major

Do not suppress mypy errors for first-party package imports.

Line 36 adds rag_forge_evaluator.* to ignore_missing_imports; this can hide real typing regressions in your own code while strict=true is enabled.

Suggested fix
-module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*", "rag_forge_evaluator.*"]
+module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*"]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*", "rag_forge_evaluator.*"]
module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` at line 36, The pyproject.toml currently suppresses mypy for
a first-party package by including "rag_forge_evaluator.*" in the module ignore
list; remove "rag_forge_evaluator.*" from the module = [...] entry (the line
containing module = ["anthropic.*", ..., "rag_forge_evaluator.*"]) so mypy will
type-check your own package, or alternatively replace the suppression by adding
proper type information (inline typings, py.typed, or stubs) for
rag_forge_evaluator so it can be checked under strict=true.
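Under that suggestion, the override would list only third-party namespaces, for example (entries abridged from the PR's list; the exact set in the repo may differ):

```toml
[tool.mypy]
strict = true

# Third-party packages without type stubs only; first-party code
# (rag_forge_evaluator) stays under strict checking.
[[tool.mypy.overrides]]
module = ["anthropic.*", "ragas.*", "PIL.*", "voyageai.*"]
ignore_missing_imports = true
```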

The v0.2.0 report generator introduced 24 mypy errors under the
project's strict mode:
- 17x "Missing type arguments for generic type 'dict'" where the new
  helper functions returned bare `dict` / `list[dict]` annotations
- 7x "float / int has incompatible type 'object'" in `_build_cost`
  where `_attr` was typed as returning `object` and mypy rejected
  passing it to numeric constructors

Fix: parameterize every dict annotation as `dict[str, Any]`, retype
`_attr` to return `Any`, and add `typing.Any` to the imports. Also
add `voyageai.*` to the mypy override list so the optional-extra
import in ragas_adapters.py and audit.py doesn't trip missing-stub
errors.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/evaluator/src/rag_forge_evaluator/report/generator.py (1)

780-836: 🧹 Nitpick | 🔵 Trivial

Consider adding v0.2.0 fields to partial JSON for consistency.

The generate_partial_json method doesn't include scoring_mode or refusal_justification in its sample_results array, unlike generate_json. If partial audits can have refusal-classified samples, this creates an inconsistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py` around lines
780 - 836, generate_partial_json omits the scoring_mode and
refusal_justification fields in each sample entry, causing inconsistency with
generate_json; update the sample_results construction in generate_partial_json
to include "scoring_mode": s.scoring_mode and "refusal_justification":
s.refusal_justification (matching the keys and behavior used in generate_json)
so partial reports contain the same per-sample fields as full reports,
preserving null/None values when those attributes are absent.
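A minimal sketch of that fix, with field names taken from the comment; sample_entry and the getattr defaults are illustrative assumptions about how generate_partial_json builds each record:

```python
from types import SimpleNamespace
from typing import Any

def sample_entry(s: Any) -> dict[str, Any]:
    # Mirror generate_json's keys; getattr keeps samples that predate
    # v0.2.0 serializing as null instead of raising AttributeError.
    return {
        "sample_id": s.sample_id,
        "scoring_mode": getattr(s, "scoring_mode", None),
        "refusal_justification": getattr(s, "refusal_justification", None),
    }
```

A pre-v0.2.0 sample such as SimpleNamespace(sample_id="s1") serializes with both new fields set to None, so partial and full reports share one schema.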
♻️ Duplicate comments (2)
pyproject.toml (1)

36-36: ⚠️ Potential issue | 🟠 Major

Remove first-party namespace from ignore_missing_imports override.

Including rag_forge_evaluator.* in this override can hide real typing regressions in your own package while strict = true is enabled.

Suggested patch
-module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*", "voyageai.*", "rag_forge_evaluator.*"]
+module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*", "voyageai.*"]
#!/bin/bash
# Verify that rag_forge_evaluator is a first-party namespace and currently suppressed in mypy overrides.
set -euo pipefail

echo "== pyproject override =="
rg -n -C2 '^\s*module\s*=\s*\[.*rag_forge_evaluator\.\*' pyproject.toml || true

echo
echo "== first-party package location candidates =="
fd -i 'rag_forge_evaluator' packages || true

echo
echo "== python package declarations referencing evaluator =="
rg -n -C2 'name\s*=\s*"rag-forge-evaluator"|packages/evaluator' pyproject.toml packages || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` at line 36, The mypy override in pyproject.toml currently
lists "rag_forge_evaluator.*" under the module ignore_missing_imports override,
which silences type errors in your own package; remove "rag_forge_evaluator.*"
from the module = [...] array so your first-party package is no longer ignored
by the override (leave other third-party entries intact), then run mypy/type
checks to confirm no regressions; locate the entry by searching for the module =
["anthropic.*", ..., "rag_forge_evaluator.*"] line in pyproject.toml and delete
only the rag_forge_evaluator.* item (and any trailing comma if needed).
packages/evaluator/tests/test_ragas_adapters.py (1)

154-168: 🧹 Nitpick | 🔵 Trivial

This test name overstates what is validated.

It currently verifies only internal state (_refusal_aware) and does not prove that RagasEvaluator actually passes the flag into RagForgeRagasLLM construction on the execution path. Consider monkeypatching the wrapper constructor (or wrapper factory point) and asserting the captured refusal_aware argument.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/tests/test_ragas_adapters.py` around lines 154 - 168, The
test test_ragas_evaluator_threads_refusal_aware_into_wrapper only asserts the
internal flag _refusal_aware on RagasEvaluator but doesn't prove the flag is
forwarded to the wrapper; update the test to monkeypatch the wrapper
factory/constructor (the point that creates RagForgeRagasLLM used by
RagasEvaluator) to capture its kwargs and then instantiate RagasEvaluator with
refusal_aware=True and assert the captured call included refusal_aware=True;
locate the creation point in RagasEvaluator (search for where RagForgeRagasLLM
or the wrapper factory is invoked) and replace/patch that constructor within the
test to record the argument instead of actually constructing the real wrapper.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py`:
- Around line 714-719: The computation for safety_refusal_rate is using both the
guarded local variable samples_evaluated and the original
result.samples_evaluated in a redundant conditional; simplify by using the
already-guarded samples_evaluated for the division (i.e., compute refusal_count
= result.scoring_modes_count.get("safety_refusal", 0) and then
safety_refusal_rate = refusal_count / samples_evaluated) and ensure you preserve
the zero-sample semantics by initializing samples_evaluated =
result.samples_evaluated or 1 as currently done.

---

Outside diff comments:
In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py`:
- Around line 780-836: generate_partial_json omits the scoring_mode and
refusal_justification fields in each sample entry, causing inconsistency with
generate_json; update the sample_results construction in generate_partial_json
to include "scoring_mode": s.scoring_mode and "refusal_justification":
s.refusal_justification (matching the keys and behavior used in generate_json)
so partial reports contain the same per-sample fields as full reports,
preserving null/None values when those attributes are absent.

---

Duplicate comments:
In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 154-168: The test
test_ragas_evaluator_threads_refusal_aware_into_wrapper only asserts the
internal flag _refusal_aware on RagasEvaluator but doesn't prove the flag is
forwarded to the wrapper; update the test to monkeypatch the wrapper
factory/constructor (the point that creates RagForgeRagasLLM used by
RagasEvaluator) to capture its kwargs and then instantiate RagasEvaluator with
refusal_aware=True and assert the captured call included refusal_aware=True;
locate the creation point in RagasEvaluator (search for where RagForgeRagasLLM
or the wrapper factory is invoked) and replace/patch that constructor within the
test to record the argument instead of actually constructing the real wrapper.

In `@pyproject.toml`:
- Line 36: The mypy override in pyproject.toml currently lists
"rag_forge_evaluator.*" under the module ignore_missing_imports override, which
silences type errors in your own package; remove "rag_forge_evaluator.*" from
the module = [...] array so your first-party package is no longer ignored by the
override (leave other third-party entries intact), then run mypy/type checks to
confirm no regressions; locate the entry by searching for the module =
["anthropic.*", ..., "rag_forge_evaluator.*"] line in pyproject.toml and delete
only the rag_forge_evaluator.* item (and any trailing comma if needed).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 93235b5a-160a-415f-906b-5f4d6578d355

📥 Commits

Reviewing files that changed from the base of the PR and between 3ef4130 and dd12e1f.

📒 Files selected for processing (5)
  • packages/evaluator/src/rag_forge_evaluator/judge/mock_judge.py
  • packages/evaluator/src/rag_forge_evaluator/report/generator.py
  • packages/evaluator/tests/test_ragas_adapters.py
  • packages/evaluator/tests/test_ragas_extractor.py
  • pyproject.toml
📜 Review details
🔇 Additional comments (19)
packages/evaluator/src/rag_forge_evaluator/judge/mock_judge.py (1)

5-18: Strong backward/forward compatibility in the mock payload.

Including both combined-path fields and legacy per-metric fields here is the right shape for v0.2.0 and keeps both test paths valid.

packages/evaluator/tests/test_ragas_extractor.py (1)

49-93: Behavioral assertions are aligned with v0.2.0 failure semantics.

These tests correctly enforce the new ValueError contract for extraction failures and the evaluate() guard when no judge is provided.

packages/evaluator/tests/test_ragas_adapters.py (1)

79-152: Good adapter coverage across sync/async, provider wiring, and refusal-note prompt behavior.

The patched-client checks and prompt assertions provide solid guardrails for wrapper behavior without external dependencies.

packages/evaluator/src/rag_forge_evaluator/report/generator.py (16)

323-329: Data-handling disclaimer is misleading.

This text claims data is "processed in-memory only" and "no sample data is persisted," but the HTML/JSON reports are written to disk and include raw queries/responses. Additionally, claiming the report "contains no personally-identifiable information" is incorrect if sample data includes PII.

Proposed wording
     data_handling_html = (
-        "Queries, responses, and retrieved contexts are processed in-memory only. "
-        "No sample data is persisted beyond the audit run. "
+        "Queries, responses, and retrieved contexts are processed during the audit run. "
+        "If HTML or JSON reports are generated, included sample content is persisted in those artifacts. "
         "Evaluation inputs are transmitted to the configured judge endpoint "
         "under the operator's API key and subject to that provider's data policy. "
-        "This report contains no personally-identifiable information."
+        "Redact or disable per-sample content before sharing reports that may contain regulated or personally identifiable data."
     )

667-693: Backward-compatibility shim drops caller-provided parameters.

The method signature accepts rmm_level, trends, and sample_results, but the implementation discards all of them and passes hardcoded values to generate_html. This silently produces incorrect metadata (always "RAG-Forge Audit", "llm-judge", empty judge model) regardless of what the caller provides.

Suggested fix: pass through the available parameters
     def generate_html(
         self,
         result: EvaluationResult,
         rmm_level: RMMLevel,
         trends: dict[str, str] | None = None,
         sample_results: list[SampleResult] | None = None,
+        *,
+        project_name: str = "RAG-Forge Audit",
+        evaluator_name: str = "llm-judge",
+        judge_model_display: str = "",
     ) -> Path:
         ...
         html = generate_html(
             result,
-            project_name="RAG-Forge Audit",
-            evaluator_name="llm-judge",
-            judge_model_display="",
+            project_name=project_name,
+            evaluator_name=evaluator_name,
+            judge_model_display=judge_model_display,
         )

31-37: LGTM!

Good use of StrictUndefined to catch missing template variables and select_autoescape for XSS protection. Module-level environment initialization is an appropriate pattern for Jinja2.


82-90: LGTM!

Including both answer_relevance and answer_relevancy variants in _METRIC_DESCRIPTIONS is a good defensive choice for handling different metric naming conventions.


101-112: LGTM!

Clean delegation to RMMScorer keeps scoring logic centralized. Fallbacks for missing lookups are appropriate.


115-170: LGTM!

Good edge-case handling for empty passing/failing lists, and the priority recommendation logic is well-structured with appropriate fallbacks.


173-190: LGTM!

The ladder state logic correctly distinguishes between cleared, current, next, and future levels for visual representation.


193-228: LGTM!

The divide-by-zero guard is correct, and the 30% threshold warning aligns with the PR spec. Case collection logic correctly filters by scoring_mode.


246-256: LGTM!

Good defensive check at lines 247-248 to skip samples with empty metrics before calling min(), and the fallback at lines 254-256 ensures a valid return even when all samples lack metric data.


279-306: LGTM!

Clean transformation functions with appropriate defaults for optional fields.


350-405: LGTM!

Duck-typing approach is pragmatic for optional cost tracking. Division guards with or 1 prevent zero-division errors.


426-456: LGTM!

SVG coordinate mapping is correct: X spans 15→145 with proper single-point centering, and Y inverts correctly for the SVG coordinate system.


469-485: LGTM!

Clean executive summary generation with all relevant metrics included.


493-614: LGTM!

The context dict comprehensively covers the template's expected variables. Using StrictUndefined ensures any mismatch will be caught at render time.


622-648: LGTM!

Legacy helpers retained for backward compatibility with proper edge-case handling.


755-774: LGTM!

The sample_results injection correctly includes per-sample scoring_mode and refusal_justification fields for v0.2.0 compatibility.

Comment on lines +714 to +719
# Compute safety refusal rate — v0.2.0
samples_evaluated = result.samples_evaluated or 1 # guard against divide-by-zero
refusal_count = result.scoring_modes_count.get("safety_refusal", 0)
safety_refusal_rate = (
refusal_count / samples_evaluated if result.samples_evaluated else 0.0
)
🧹 Nitpick | 🔵 Trivial

Simplify the redundant conditional.

The guarded samples_evaluated (line 715) is always ≥ 1, but line 718 then checks result.samples_evaluated (the original unguarded value). While the end result is technically correct (0.0 when no samples exist), the logic is confusing. Consider simplifying:

Cleaner alternative
-        samples_evaluated = result.samples_evaluated or 1  # guard against divide-by-zero
-        refusal_count = result.scoring_modes_count.get("safety_refusal", 0)
-        safety_refusal_rate = (
-            refusal_count / samples_evaluated if result.samples_evaluated else 0.0
-        )
+        refusal_count = result.scoring_modes_count.get("safety_refusal", 0)
+        safety_refusal_rate = (
+            refusal_count / result.samples_evaluated
+            if result.samples_evaluated
+            else 0.0
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py` around lines
714 - 719, The computation for safety_refusal_rate is using both the guarded
local variable samples_evaluated and the original result.samples_evaluated in a
redundant conditional; simplify by using the already-guarded samples_evaluated
for the division (i.e., compute refusal_count =
result.scoring_modes_count.get("safety_refusal", 0) and then safety_refusal_rate
= refusal_count / samples_evaluated) and ensure you preserve the zero-sample
semantics by initializing samples_evaluated = result.samples_evaluated or 1 as
currently done.

Addresses the critical + major review comments on PR #33.

- **ragas_adapters.py** — implement the full ragas 0.4.x BaseRagasLLM
  contract. RagForgeRagasLLM.generate_text / agenerate_text now accept
  PromptValue + n/temperature/stop/callbacks kwargs and return
  LLMResult, so ragas cannot silently break the adapter when it starts
  passing additional parameters on minor version upgrades. Deferred
  langchain_core imports keep the module importable without the optional
  extras. A duck-typed _StringLLMResult fallback lets existing unit
  tests keep passing without pulling langchain as a dev dependency.
  RagForgeRagasEmbeddings also gains aembed_query / aembed_documents to
  match the BaseRagasEmbeddings async contract.

- **audit.py** — when config.judge_model is None on the ragas path,
  normalize to "mock" before the allowlist check so programmatic
  configs that leave judge_model unset no longer trip the guard.

- **engines/__init__.py** — if create_evaluator(engine="ragas") is
  called without a judge, default to MockJudge immediately rather than
  letting the missing-judge error surface mid-evaluate().

- **ragas_evaluator.py** — whole-batch crash path now reports
  samples_evaluated=len(samples) (attempted count) instead of 0, and
  aggregate-extraction path sets passed=False whenever any skip
  record exists, not just when every metric failed its threshold.
  Silent partial success was the v0.1.3 pathology we set out to kill.

- **llm_judge.py** — _are_standard_metrics now does an exact type-set
  match (frozenset of type objects) instead of isinstance, so
  MetricEvaluator subclasses no longer get silently rerouted into the
  combined single-call path that cannot honor their overrides.

- **llm_judge.py** — when _parse_combined_response marks every metric
  skipped for a sample, do not emit a normal SampleResult with
  misleading 0.0 scores or pollute scoring_modes_count with a fake
  "standard" classification. Per-metric outcomes still record the
  skip for aggregation; progress is still reported.

- **generator.py** — compliance footer's data-handling copy no longer
  falsely claims samples stay in-memory only. The text now tells
  operators that HTML/JSON/PDF reports persist sample content and
  does not claim automatic PII redaction.

- **generator.py** — ReportGenerator.generate_html compatibility shim
  now forwards project_name, project_description, evaluator_name,
  judge_model_display, report_number, and report_date to the
  module-level generate_html instead of hardcoding "RAG-Forge Audit"
  and "llm-judge". Callers that rely on the old defaults still work.

- **audit.html.j2** — sparkline SVG always renders the <svg> wrapper
  (preserving the existing "html contains svg" test assertions) but
  guards the polyline/circles behind {% if history_points %}, with a
  "First run — no history yet" text fallback so empty history does
  not produce a malformed SVG. Per-sample scores loop uses
  (sample.scores or {}).items() to tolerate None.

- **pyproject.toml** — add langchain_core.* to the mypy override list
  so the new deferred TYPE_CHECKING imports don't trip missing-stub
  errors on machines where langchain isn't installed.

Local gates:
  uv run ruff check packages/        → All checks passed!
  uv run mypy packages/core/src packages/evaluator/src packages/observability/src
                                     → Success: no issues found in 102 source files
  uv run pytest packages/evaluator/tests -q
                                     → 256 passed, 4 skipped
  rag-forge-eval audit --judge mock  → faithfulness 0.90, overall 0.90, passed: true

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/evaluator/src/rag_forge_evaluator/engines/__init__.py (1)

8-14: ⚠️ Potential issue | 🟠 Major

Plumb the new ragas options through the factory.

RagasEvaluator now accepts max_tokens and embeddings_provider, but create_evaluator() exposes no way to pass either. That leaves the new AuditConfig.ragas_max_tokens / AuditConfig.ragas_embeddings_provider knobs dead on the default path.

Proposed shape
 def create_evaluator(
     engine: str,
     judge: JudgeProvider | None = None,
     thresholds: dict[str, float] | None = None,
     progress: ProgressReporter | None = None,
     refusal_aware: bool = True,
+    ragas_max_tokens: int = 8192,
+    ragas_embeddings_provider: str | None = None,
 ) -> EvaluatorInterface:
@@
     if engine == "ragas":
         from rag_forge_evaluator.engines.ragas_evaluator import RagasEvaluator
         if judge is None:
             from rag_forge_evaluator.judge.mock_judge import MockJudge
             judge = MockJudge()
-        return RagasEvaluator(judge=judge, thresholds=thresholds, refusal_aware=refusal_aware)
+        return RagasEvaluator(
+            judge=judge,
+            thresholds=thresholds,
+            max_tokens=ragas_max_tokens,
+            embeddings_provider=ragas_embeddings_provider,
+            refusal_aware=refusal_aware,
+        )

Also applies to: 24-32

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/engines/__init__.py` around lines
8 - 14, The factory create_evaluator doesn't accept the new Ragas options, so
add optional parameters max_tokens: int | None and embeddings_provider:
EmbeddingsProvider | None to create_evaluator's signature and forward them into
the RagasEvaluator constructor (i.e. where RagasEvaluator(...) is created pass
max_tokens=max_tokens and embeddings_provider=embeddings_provider, and when
applicable fall back to AuditConfig.ragas_max_tokens /
AuditConfig.ragas_embeddings_provider if needed); also make the same
signature/forwarding change for the other create_evaluator variant in this
module so both paths plumb the ragas options through.
♻️ Duplicate comments (3)
packages/evaluator/src/rag_forge_evaluator/report/generator.py (1)

670-713: ⚠️ Potential issue | 🟠 Major

The compatibility wrapper still never supplies sparkline history.

The new template renders from history_points, but this wrapper only accepts trends and then drops it. On the normal audit.py path, the history panel will always fall back to the first-run empty state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py` around lines
670 - 713, The instance method generate_html currently accepts trends but
discards it before calling the module-level generate_html, so the template never
receives the historical sparkline data; update the method to forward the trends
data as the template's history_points (or convert trends into the expected
history_points shape) when calling the module-level generate_html function —
specifically modify the method generate_html in report/generator.py to include a
history_points=... argument (derived from the trends parameter) in the call to
the top-level generate_html(result, ...) so the template can render the
sparkline history.
pyproject.toml (1)

35-37: ⚠️ Potential issue | 🟠 Major

Do not suppress mypy on the repo’s own package.

rag_forge_evaluator.* is first-party code, so this override disables strict type-checking for the package this workspace owns. That hides real regressions instead of isolating optional third-party imports.

#!/bin/bash
set -euo pipefail

echo "mypy override:"
sed -n '34,37p' pyproject.toml

echo
echo "first-party package directories:"
fd -t d '^rag_forge_evaluator$' packages src . || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 35 - 37, The mypy override is incorrectly
silencing type-checks for our own package by including "rag_forge_evaluator.*";
remove "rag_forge_evaluator.*" from the modules list in the
[[tool.mypy.overrides]] block so the repo's package is not subject to
ignore_missing_imports, then run mypy to fix any real type errors surfaced in
the rag_forge_evaluator package (ensure the package name "rag_forge_evaluator"
matches the actual top-level package directory used in the repo).
packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py (1)

346-447: ⚠️ Potential issue | 🟠 Major

Emit SkipRecords for fully skipped combined samples.

Suppressing SampleResult here fixed the misleading worst-case path, but the failure still disappears from EvaluationResult.skipped_samples. HTML/JSON consumers only see an aggregate skip count and lose the sample ID plus parse reason entirely.

Possible fix
 from rag_forge_evaluator.engine import (
     EvaluationResult,
     EvaluationSample,
     EvaluatorInterface,
     MetricResult,
     SampleResult,
+    SkipRecord,
     ScoringMode,
 )
@@
         sample_results = self._partial_sample_results
         scoring_modes_count: dict[str, int] = {}
+        skipped_samples: list[SkipRecord] = []
@@
             all_skipped = sample_skipped == len(metric_names)

-            if not all_skipped:
+            if all_skipped:
+                for name in metric_names:
+                    result = per_metric_results[name]
+                    skipped_samples.append(
+                        SkipRecord(
+                            sample_id=sample.sample_id or sample.query[:40],
+                            metric_name=name,
+                            reason=result.details or "combined judge response could not be parsed",
+                            exception_type="JudgeParseError",
+                        )
+                    )
+            else:
                 worst_metric = min(
                     sample_metric_scores, key=sample_metric_scores.get  # type: ignore[arg-type]
                 )
@@
         return EvaluationResult(
             metrics=aggregated,
             overall_score=round(overall, 4),
             samples_evaluated=len(samples),
             passed=all(m.passed for m in aggregated),
             sample_results=sample_results,
             skipped_evaluations=total_skipped,
             scoring_modes_count=scoring_modes_count,
+            skipped_samples=skipped_samples,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py` around lines
346 - 447, In _evaluate_combined, when a sample is fully skipped (all_skipped
True) emit a SkipRecord that captures sample.sample_id and the parse/refusal
reason (use refusal_justification or a brief "unparseable" message) instead of
silently omitting the sample; append that SkipRecord to whatever collection is
returned as EvaluationResult.skipped_samples (or add a skipped_samples list to
the return payload) so HTML/JSON consumers receive the sample id and reason, and
keep the existing behavior of not creating a SampleResult or mutating
scoring_modes_count.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/evaluator/src/rag_forge_evaluator/audit.py`:
- Around line 160-178: The ragas-specific allowlist wrongly rejects aliases like
"claude-sonnet"; update the validation in the evaluator_engine == "ragas" branch
to use the same alias normalization as _create_judge() (and
_KNOWN_JUDGE_ALIASES) before checking support: map config.judge_model (or the
normalized "mock" default) through _KNOWN_JUDGE_ALIASES or the same
alias-resolution function to obtain the canonical judge name, then validate that
canonical name is one of ("openai","gpt-4o","claude","mock") and perform the
voyageai installation check against the canonical "claude" value so aliases such
as "claude-sonnet" are accepted for ragas.

In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py`:
- Around line 102-105: The adapter stores ragas_max_tokens but never forwards it
to the judge: update the calls to _complete() (and any direct judge invocations
in RagasAdapter methods such as where ragas_max_tokens is set/used) to pass
ragas_max_tokens into the underlying judge call (e.g., as an explicit
max_tokens/max_tokens_override keyword or the third parameter expected by the
judge API) so the judge actually receives and applies the token limit; ensure
both the initial call around the _complete(system_prompt, user_prompt) site and
the other similar call sites (the block covering the later 115-132 region) are
changed to forward ragas_max_tokens consistently.

In `@packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py`:
- Around line 208-239: The combined-metric parsing loop currently accepts any
float (e.g., -1, 2.5, nan, inf) from data[field]; update the logic in the loop
that iterates over _COMBINED_METRIC_FIELDS (where thresholds and
_DEFAULT_THRESHOLDS are used and MetricResult is constructed) so that after
converting with float(data[field]) you validate the value is finite and within
0.0–1.0 (use math.isfinite or equivalent and check 0.0 <= score <= 1.0); if it
is out of range or non-finite, log a warning via logger (including the offending
value and field), set results[field] to a skipped MetricResult with score 0.0,
passed False, and a descriptive details message (similar style to the existing
non-numeric handling), and only create a normal MetricResult when the score is
valid.

In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py`:
- Around line 333-350: The returned report dictionary includes a hardcoded
"page_total_display": "Page 1 of 1" which is invalid for multi-page reports;
remove this hardcoded key from the return value in generator.py (where the dict
with method_html, data_handling_html, issued_date/report_time_utc, etc. is
assembled) or replace it by deriving the string from real pagination metadata
(e.g., a page_count or pagination helper) so that "page_total_display" is either
omitted when unknown or computed like f"Page {current_page} of {total_pages}"
using the actual pagination variables.

---

Outside diff comments:
In `@packages/evaluator/src/rag_forge_evaluator/engines/__init__.py`:
- Around line 8-14: The factory create_evaluator doesn't accept the new Ragas
options, so add optional parameters max_tokens: int | None and
embeddings_provider: EmbeddingsProvider | None to create_evaluator's signature
and forward them into the RagasEvaluator constructor (i.e. where
RagasEvaluator(...) is created pass max_tokens=max_tokens and
embeddings_provider=embeddings_provider, and when applicable fall back to
AuditConfig.ragas_max_tokens / AuditConfig.ragas_embeddings_provider if needed);
also make the same signature/forwarding change for the other create_evaluator
variant in this module so both paths plumb the ragas options through.

---

Duplicate comments:
In `@packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py`:
- Around line 346-447: In _evaluate_combined, when a sample is fully skipped
(all_skipped True) emit a SkipRecord that captures sample.sample_id and the
parse/refusal reason (use refusal_justification or a brief "unparseable"
message) instead of silently omitting the sample; append that SkipRecord to
whatever collection is returned as EvaluationResult.skipped_samples (or add a
skipped_samples list to the return payload) so HTML/JSON consumers receive the
sample id and reason, and keep the existing behavior of not creating a
SampleResult or mutating scoring_modes_count.

In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py`:
- Around line 670-713: The instance method generate_html currently accepts
trends but discards it before calling the module-level generate_html, so the
template never receives the historical sparkline data; update the method to
forward the trends data as the template's history_points (or convert trends into
the expected history_points shape) when calling the module-level generate_html
function — specifically modify the method generate_html in report/generator.py
to include a history_points=... argument (derived from the trends parameter) in
the call to the top-level generate_html(result, ...) so the template can render
the sparkline history.

In `@pyproject.toml`:
- Around line 35-37: The mypy override is incorrectly silencing type-checks for
our own package by including "rag_forge_evaluator.*"; remove
"rag_forge_evaluator.*" from the modules list in the [[tool.mypy.overrides]]
block so the repo's package is not subject to ignore_missing_imports, then run
mypy to fix any real type errors surfaced in the rag_forge_evaluator package
(ensure the package name "rag_forge_evaluator" matches the actual top-level
package directory used in the repo).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 80638afd-ae35-4e76-b876-6f2226fb968d

📥 Commits

Reviewing files that changed from the base of the PR and between dd12e1f and b7a15e2.

📒 Files selected for processing (8)
  • packages/evaluator/src/rag_forge_evaluator/audit.py
  • packages/evaluator/src/rag_forge_evaluator/engines/__init__.py
  • packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
  • packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py
  • packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py
  • packages/evaluator/src/rag_forge_evaluator/report/generator.py
  • packages/evaluator/src/rag_forge_evaluator/report/templates/audit.html.j2
  • pyproject.toml
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: RAG Quality Gate
  • GitHub Check: Lint, Typecheck & Test

Comment thread packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
Comment on lines +102 to +105
passes through from its internal metric prompts. ``max_tokens`` is
retained on the instance so future Judge-protocol extensions can pick
it up; the current Judge interface does not accept it, so the underlying
model's own max-tokens default applies.

⚠️ Potential issue | 🟠 Major

max_tokens is a no-op in the current adapter.

This setting is stored, but _complete() still calls the judge with only (system_prompt, user_prompt). Callers will assume ragas_max_tokens constrains ragas prompts even though provider defaults still apply.

Also applies to: 115-132

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py` around
lines 102 - 105, The adapter stores ragas_max_tokens but never forwards it to
the judge: update the calls to _complete() (and any direct judge invocations in
RagasAdapter methods such as where ragas_max_tokens is set/used) to pass
ragas_max_tokens into the underlying judge call (e.g., as an explicit
max_tokens/max_tokens_override keyword or the third parameter expected by the
judge API) so the judge actually receives and applies the token limit; ensure
both the initial call around the _complete(system_prompt, user_prompt) site and
the other similar call sites (the block covering the later 115-132 region) are
changed to forward ragas_max_tokens consistently.

Comment thread packages/evaluator/src/rag_forge_evaluator/report/generator.py
Comment on lines +333 to +350
return {
"method_html": method_html,
"data_handling_html": data_handling_html,
"regulatory_html": _REGULATORY_ALIGNMENT_HTML,
"authored_by": "RAG-Forge",
"issued_date": report_date,
"issued_utc_time": report_time_utc,
"valid_until": "Next material change",
"limitations_html": (
"This report reflects the pipeline state at the time of the audit. "
"Scores are estimates produced by an LLM judge and should be interpreted "
"alongside human review for high-stakes decisions. "
"RMM level is based on metric thresholds; infrastructure-level requirements "
"(caching, RBAC, drift detection) are not automatically verified."
),
"github_url": "github.com/hallengray/rag-forge",
"page_total_display": "Page 1 of 1",
}

⚠️ Potential issue | 🟡 Minor

Remove the hardcoded Page 1 of 1 attestation text.

Per-sample detail makes multi-page reports common, so this footer becomes false as soon as the report spans more than one page. It should be omitted or derived from real pagination metadata.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py` around lines
333 - 350, The returned report dictionary includes a hardcoded
"page_total_display": "Page 1 of 1" which is invalid for multi-page reports;
remove this hardcoded key from the return value in generator.py (where the dict
with method_html, data_handling_html, issued_date/report_time_utc, etc. is
assembled) or replace it by deriving the string from real pagination metadata
(e.g., a page_count or pagination helper) so that "page_total_display" is either
omitted when unknown or computed like f"Page {current_page} of {total_pages}"
using the actual pagination variables.

Final review pass on PR #33.

- **audit.py** — ragas allowlist now uses `_KNOWN_JUDGE_ALIASES` as the
  single source of truth. Previously rejected the `claude-sonnet` alias
  that `_create_judge()` happily accepted for the llm-judge path. The
  voyageai gate now triggers for both `claude` and `claude-sonnet`.

- **ragas_adapters.py + ragas_evaluator.py** — remove the no-op
  `max_tokens` field from `RagForgeRagasLLM.__init__` and from
  `RagasEvaluator.__init__`. The Judge protocol is
  `(system_prompt, user_prompt) -> str` with no per-call token cap,
  so storing a value that never reached the judge was misleading.
  The underlying judge's own default (ClaudeJudge: 4096, raisable via
  `RAG_FORGE_JUDGE_MAX_TOKENS` env var) applies during ragas runs.
  Proper per-call forwarding is a v0.2.1 follow-up documented in the
  wrapper's docstring.

- **audit.py** — drop `AuditConfig.ragas_max_tokens` for the same
  reason. The field was never wired through the factory, so it was a
  lie in two places.

- **metrics/llm_judge.py** — `_parse_combined_response` now rejects
  non-finite and out-of-range scores (< 0.0 or > 1.0) as malformed
  judge output and marks the metric skipped instead of polluting the
  aggregate. Covers the `-1`, `2.5`, `nan`, `inf` pathologies.

- **report/generator.py** — drop the hardcoded `"Page 1 of 1"` from
  the compliance footer. Per-sample detail makes multi-page reports
  common, and page numbers are handled at print time by the template's
  `@page` CSS counters.

- **report/generator.py** — simplify the redundant `safety_refusal_rate`
  divide-by-zero guard. The `or 1` sentinel and the separate
  `if result.samples_evaluated` branch were double-guarding against
  the same zero case.

- **CHANGELOG.md** — migration step now references the `>=0.4,<0.6`
  pin range instead of claiming ragas "will upgrade to 0.4.x"
  specifically.

- **docs/superpowers/specs/...** — add `text` language tag to the
  architecture diagram fenced block for markdownlint compliance.

- **Tests** — update `test_ragas_adapters.py` and
  `test_pearmedica_regression.py` to drop the now-removed `max_tokens`
  kwarg. Update `test_audit_config_validation.py` to assert the new
  `ragas_embeddings_provider` default instead of the removed
  `ragas_max_tokens` field.

Local gates:
  uv run ruff check packages/        → All checks passed!
  uv run mypy packages/core/src packages/evaluator/src packages/observability/src
                                     → Success: no issues found in 102 source files
  uv run pytest packages/evaluator/tests -q
                                     → 255 passed, 4 skipped
  rag-forge-eval audit --judge mock  → faithfulness 0.90, overall 0.90, passed: true
…-refresh

# Conflicts:
#	docs/superpowers/specs/2026-04-14-v0.2.0-evaluator-refresh-design.md
@hallengray hallengray merged commit 330465f into main Apr 14, 2026
2 checks passed
@hallengray hallengray deleted the feat/v0.2.0-evaluator-refresh branch April 14, 2026 19:30