
feat(evaluator): v0.2.0 evaluator refresh#33

Merged
hallengray merged 27 commits into main from feat/v0.2.0-evaluator-refresh
Apr 14, 2026

Conversation

@hallengray (Owner)

Summary

Implements the full v0.2.0 evaluator refresh per the design in #32. Rebuilds the RAGAS adapter around injected LLM + embeddings wrappers, ships default-on refusal-aware scoring with a --strict escape hatch, and rewrites the audit report around a single Jinja template with shared browser/print CSS.

Driven by findings from the PearMedica Cycle 2 audit (2026-04-14). Three v0.1.3 bugs fixed at the architectural root, one class of scoring pathology closed, and the report is rebuilt to top-tier standard.

What's in this PR (22 commits)

Workstream A — RAGAS adapter rebuild

  • New engines/ragas_adapters.py with RagForgeRagasLLM (sync + async) and RagForgeRagasEmbeddings (openai / voyage / mock providers)
  • ragas_evaluator.py now injects our wrappers instead of letting ragas pick defaults — version-stable across ragas releases
  • _extract_ragas_score raises ValueError instead of silently coercing to 0.0; exceptions are captured as SkipRecord entries visible in the report
  • Hard cut on the ragas pin: ragas>=0.4,<0.6
  • New optional extra [ragas-voyage] for Claude-judge users

Workstream B — Refusal-aware scoring (default-on)

  • metrics/llm_judge.py injects an inline classification preamble into the scoring prompt. The judge classifies each sample as standard or safety_refusal and scores on the appropriate rubric in one round-trip (no extra API calls)
  • RagForgeRagasLLM prepends a shorter refusal-aware NOTE to every prompt ragas sends
  • AuditConfig.refusal_aware: bool = True + CLI flags --strict, --refusal-aware, --no-refusal-aware
  • Warning banner in the report when the safety-refusal rate exceeds 30%, prompting a spot-check of the judge's classifications
  • MetricResult and SampleResult gain scoring_mode and refusal_justification fields
  • EvaluationResult gains scoring_modes_count for aggregate reporting

Workstream C — Top-tier audit report

  • New single Jinja template report/templates/audit.html.j2 (430 lines) with all 14 sections: header → title → run manifest → RMM hero with SVG sparkline → maturity ladder → executive summary → TL;DR callout → metric breakdown → cost summary → safety refusals → worst case → per-sample detail cards → skipped samples → compliance footer
  • report/generator.py rewritten as a context builder around the new template. Pure helper functions for RMM, TL;DR auto-generation, refusals block, worst case, compliance footer
  • report/pdf.py uses Playwright's emulate_media(media="print") so the template's @media print rules (page breaks, @page rules, forced-open <details>) actually apply
  • JSON report gets additive-only fields: skipped_samples, scoring_modes_count, safety_refusal_rate, per-sample scoring_mode + refusal_justification
  • Compliance footer references NIST AI RMF 1.0 § 4.3, ISO/IEC 42001:2023 § 8.3, EU AI Act Article 15. Authored by RAG-Forge
  • audit.py drops the hard block on --evaluator ragas --judge claude and replaces it with a capability check for voyageai
  • Banner no longer hardcodes gpt-4o-mini (RAGAS internal — judge override ignored) — ragas now honors --judge end-to-end
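
As one concrete piece of the template context the generator builds, the RMM sparkline needs historical scores mapped to an SVG polyline points string. A rough sketch, assuming a hypothetical helper (not the actual generator.py code) and 0-1 scores:

```python
def sparkline_points(history: list[float], width: int = 120,
                     height: int = 32, pad: int = 2) -> str:
    # Map 0-1 scores to SVG coordinates; SVG's y axis grows downward.
    if not history:
        return ""  # guard the empty-history case rather than emit a bad attribute
    if len(history) == 1:
        xs = [width / 2]
    else:
        step = (width - 2 * pad) / (len(history) - 1)
        xs = [pad + i * step for i in range(len(history))]
    ys = [pad + (1.0 - s) * (height - 2 * pad) for s in history]
    return " ".join(f"{x:.1f},{y:.1f}" for x, y in zip(xs, ys))
```

The output drops straight into `<polyline points="...">` inside the template's SVG hero block.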

Workstream D — Release coordination

  • Version bumps to 0.2.0 across rag-forge-core, rag-forge-evaluator, rag-forge-observability, @rag-forge/cli, @rag-forge/mcp
  • CHANGELOG.md with breaking-changes callout, migration guide, and PearMedica Cycle 2 credits
  • docs/evaluators.md with a refusal-aware scoring guide (worked example: PearMedica paediatric metformin refusal)

Breaking changes

  1. Scores will shift on pipelines with safety refusals (default-on refusal-aware). Pass --strict to revert.
  2. ragas pin hard cut to >=0.4,<0.6. Users with pinned 0.2.x must upgrade ragas before upgrading rag-forge-evaluator.
  3. --evaluator ragas --judge claude now works — was previously a hard error.

Tests

  • Full evaluator suite: 256 passed, 4 skipped (the 4 skips are the opt-in ragas_integration and visual marker tests)
  • 22 new unit tests + 4 gated integration/visual tests
  • Ruff: clean
  • Mypy: clean on files this branch touched
  • Turbo build: all 5 packages green
  • Manual smoke audit: rag-forge-eval audit --evaluator llm-judge --judge mock produces HTML + JSON + PDF cleanly

Regression coverage

Migration for existing users

# 1. Upgrade ragas (mandatory)
pip install -U rag-forge-evaluator[ragas]

# 2. If using Claude judge with ragas, add Voyage embeddings
pip install rag-forge-evaluator[ragas-voyage]

# 3. To keep v0.1.x scoring semantics in CI:
rag-forge-eval audit ... --strict

Depends on

Test plan

Captures the four workstreams locked during brainstorming in response to
the PearMedica Cycle 2 audit findings:

- Workstream A: rebuild the RAGAS adapter around injected LLM and
  embeddings wrappers (RagForgeRagasLLM + RagForgeRagasEmbeddings) so the
  evaluator stops depending on ragas's version-fragile default picker.
  Hard cut to ragas>=0.4,<0.6. Adds [ragas-voyage] extra for Claude-judge
  users. Deletes the 0.0-on-exception coercion; skips surface in reports.
- Workstream B: default-on refusal-aware scoring with a --strict escape
  hatch. The judge classifies each sample as standard or safety_refusal
  inline and applies the appropriate rubric. A 30% spot-check banner
  guards against judge misclassification.
- Workstream C: rebuild the audit report to top-tier standard. Single
  Jinja template with a shared browser/print stylesheet, Playwright
  print-media emulation, 14-section information architecture covering
  run manifest, RMM hero with historical sparkline, TL;DR callout, cost
  summary, safety refusals, per-sample detail, and a compliance footer
  aligned with NIST AI RMF / ISO 42001 / EU AI Act Article 15.
- Workstream D: coordinated 0.2.0 version bump across core, evaluator,
  observability, cli, and mcp packages with a migration checklist.

Includes a file-by-file change summary, four-layer testing strategy,
locked decisions from brainstorming, and explicit non-goals.
…k with SkipRecord

RagasEvaluator now accepts a judge, max_tokens, and embeddings_provider
and injects RagForgeRagasLLM + RagForgeRagasEmbeddings into every ragas
evaluate() call — ragas no longer reaches for its own default LLM or
embeddings. _extract_ragas_score raises ValueError instead of silently
returning 0.0 so broken infrastructure surfaces in the report as
SkipRecords rather than producing invisibly-wrong scores (root cause of
the PearMedica Cycle 2 silent failure). Whole-batch ragas crashes are
also caught and converted to SkipRecords per sample×metric.
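
The per-sample×metric fan-out can be sketched like this. The helper name is hypothetical; the four metric names are the ones this PR's combined judge path scores.

```python
METRIC_NAMES = ("faithfulness", "answer_relevance", "context_relevance", "hallucination")

def batch_crash_to_skips(sample_ids: list[str], exc: Exception) -> list[dict]:
    # One skip entry per sample x metric so nothing vanishes from the report
    # when ragas fails on the whole batch at once.
    return [
        {"sample_id": sid, "metric_name": metric,
         "reason": str(exc), "exception_type": type(exc).__name__}
        for sid in sample_ids
        for metric in METRIC_NAMES
    ]
```
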
…onfig fields

- Add _voyageai_installed() module-level probe so tests can monkeypatch
  the voyageai import check without requiring the real SDK installed
- Add three new AuditConfig fields: refusal_aware=True,
  ragas_max_tokens=8192, ragas_embeddings_provider=None
- Replace _validate_config ragas branch: claude is now allowed when
  voyageai is installed; blocks only on missing voyageai dep with a
  clear install hint pointing at [ragas-voyage] extra
- Drop hardcoded 'gpt-4o-mini (RAGAS internal — judge override ignored)'
  banner — ragas now honors the configured judge end-to-end via
  RagForgeRagasLLM wrapper shipped in Wave 1
- Update test_ragas_with_mock_judge_is_blocked → _is_allowed: mock is
  a valid zero-cost judge with the new wrapper; old block premise obsolete
…llm-judge

Adds refusal_aware=True parameter to LLMJudgeEvaluator. When enabled, a
classification preamble is prepended to the combined single-call prompt so
the judge can identify safety-refusal samples (context too thin, response
correctly abstained) and apply the refusal rubric — all in one round-trip,
zero extra API cost. scoring_mode and refusal_justification are threaded
through SampleResult and aggregated into EvaluationResult.scoring_modes_count.

Migrates the default evaluation path to a single combined judge call per
sample (all four metrics in one JSON response). Per-metric path is preserved
for callers who inject custom MetricEvaluator instances (detected via instance
type check so patched test evaluators continue to work correctly).
Replaces the old Jinja FileSystemLoader + audit_report.html.j2 approach
with a module-level generate_html() function that builds a typed context
dict (RMM summary, TL;DR, refusals, worst-case, compliance footer, cost
block, sparkline points) and renders the Task 12 audit.html.j2 template.
ReportGenerator class preserved as a backward-compatible shim for audit.py.
…fusal_rate to JSON output

Add four new additive fields to the JSON audit report (v0.2.0):
- Top-level 'skipped_samples': list of skip records with sample_id, metric_name, reason, exception_type
- Top-level 'scoring_modes_count': dict of scoring mode counts (e.g. {'standard': 11, 'safety_refusal': 1})
- Top-level 'safety_refusal_rate': float ratio of safety_refusal samples to total samples_evaluated
- Per-sample 'scoring_mode' and 'refusal_justification': tracking the scoring context for each result

All existing JSON fields preserved. Computation handles edge cases (empty reports, zero samples).
8 new tests added covering all combinations.
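
The two aggregate fields are straightforward to compute; a minimal sketch with hypothetical helper names, matching the example counts above:

```python
from collections import Counter

def scoring_modes_count(per_sample_modes: list[str]) -> dict[str, int]:
    # e.g. 11x "standard" + 1x "safety_refusal" -> {"standard": 11, "safety_refusal": 1}
    return dict(Counter(per_sample_modes))

def safety_refusal_rate(modes_count: dict[str, int], samples_evaluated: int) -> float:
    # Defined as 0.0 for the empty-report / zero-samples edge cases.
    if samples_evaluated <= 0:
        return 0.0
    return modes_count.get("safety_refusal", 0) / samples_evaluated
```
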
- Add return type annotations to _openai_client() and _voyage_client()
- Suppress TC001 for generator.py (types used in signatures)
- Combine nested with statements in test_audit_config_validation.py
- Remove unused pytest imports in test_refusal_aware_scoring.py
- Fix import ordering and formatting per ruff rules

coderabbitai bot commented Apr 14, 2026

Caution

Review failed

Pull request was closed or merged during review

Summary by CodeRabbit

  • New Features

    • Refusal-aware scoring enabled by default (opt-out via --strict / --no-refusal-aware); new CLI flags: --strict, --refusal-aware, --no-refusal-aware.
    • Enhanced audit reports (HTML/PDF) with safety-refusal sections, skipped-sample details, compliance footer, and richer JSON output.
    • RAGAS integration with adapter support and a new optional extra for Voyage/Claude embeddings; package versions bumped to 0.2.0.
  • Bug Fixes

    • Per-sample failures recorded as skips (not coerced to zero) and handled in reports.
    • Fixed embedding and token/overflow issues.
  • Documentation

    • New evaluator docs and migration/design plans.
  • Tests

    • Many new unit/integration/visual tests and fixtures.

Walkthrough

This release updates RAG-Forge from v0.1.3 to v0.2.0 with default refusal-aware scoring, RAGAS LLM/embeddings adapter wrappers, skip-record error handling (no silent 0.0 coercion), revised evaluator wiring and CLI flags, and a rebuilt Jinja HTML/PDF report plus broad test and fixture additions.

Changes

Cohort / File(s) Summary
Version bumps
packages/cli/package.json, packages/core/pyproject.toml, packages/evaluator/pyproject.toml, packages/mcp/package.json, packages/observability/pyproject.toml
Bumped package versions from 0.1.3 to 0.2.0.
.gitignore & mypy
.gitignore, pyproject.toml
Added .superpowers/ to .gitignore; extended mypy overrides to include PIL.*, voyageai.*, langchain_core.*, and rag_forge_evaluator.*.
Changelog & Docs
CHANGELOG.md, docs/evaluators.md, docs/superpowers/plans/...v0.2.0-evaluator-refresh.md, docs/superpowers/specs/...-evaluator-refresh-design.md
Added v0.2.0 changelog and comprehensive docs/specs/plans describing refusal-aware scoring, RAGAS adapter design, migration steps, and report redesign.
Core engine types
packages/evaluator/src/rag_forge_evaluator/engine.py
Added ScoringMode alias and SkipRecord; extended EvaluationSample, MetricResult, SampleResult, and EvaluationResult with sample IDs, scoring modes, refusal justifications, skipped_samples, and scoring_modes_count.
Audit config & validation
packages/evaluator/src/rag_forge_evaluator/audit.py
Added _voyageai_installed() probe; extended AuditConfig with refusal_aware and ragas_embeddings_provider; relaxed RAGAS judge constraints to allow Claude when VoyageAI is installed; threaded refusal_aware into evaluator creation and adjusted estimation/banner logic.
CLI refusal flags
packages/evaluator/src/rag_forge_evaluator/cli.py
Added mutually exclusive --strict, --refusal-aware, --no-refusal-aware flags and mapped them to AuditConfig.refusal_aware.
Evaluator factory
packages/evaluator/src/rag_forge_evaluator/engines/...
create_evaluator(...) gains refusal_aware: bool = True and forwards it to evaluators; llm-judge and ragas branches updated to accept/forward the flag.
RAGAS adapter wrappers
packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
New module with _JudgeLike protocol, RagForgeRagasLLM (forwards to Judge, supports refusal-aware prompt preamble, sync/async paths, LangChain compatibility fallback) and RagForgeRagasEmbeddings (providers: openai, voyage, mock with deterministic mock vectors).
RAGAS evaluator refactor & error handling
packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py
RagasEvaluator now accepts judge, embeddings_provider, refusal_aware; auto-selects embeddings provider; constructs adapter wrappers; raises on missing judge; raises on extraction failures and records SkipRecords; returns skipped_samples in results and adjusts pass/scoring logic.
LLM-judge refusal-aware scoring
packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py
Added combined single-call scoring path, refusal classification preamble, dual rubrics (standard vs safety_refusal), robust parsing with skip handling, and tracking of scoring_mode, refusal_justification, and scoring_modes_count.
Report generation & template
packages/evaluator/src/rag_forge_evaluator/report/generator.py, packages/evaluator/src/rag_forge_evaluator/report/templates/audit.html.j2
New generate_html(...) building full Jinja context (RMM, TL;DR, safety refusals, skipped samples, compliance/cost, sparkline); switched rendering to audit.html.j2; generate_json extended with skipped_samples, scoring_modes_count, safety_refusal_rate, and per-sample fields.
PDF rendering
packages/evaluator/src/rag_forge_evaluator/report/pdf.py
Playwright PDF rendering now emulates print media before loading and uses prefer_css_page_size=True.
RAGAS metrics extraction
packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py
_extract_ragas_score now raises ValueError on extraction failure instead of returning 0.0, enabling skip recording.
Mock judge
packages/evaluator/src/rag_forge_evaluator/judge/mock_judge.py
Extended mock judge default JSON to include combined metric fields and scoring_mode/refusal_justification.
Pyproject/evaluator metadata
packages/evaluator/pyproject.toml
Bumped evaluator package version; updated ragas optional dependency to ragas>=0.4,<0.6, added datasets>=2.14, added ragas-voyage extra with voyageai>=0.2, included templates in wheel, and added pytest markers ragas_integration and visual.
Fixtures
packages/evaluator/tests/fixtures/ragas_tiny_fixture.json, packages/evaluator/tests/fixtures/ragas_regression/cycle-2-excerpt.jsonl
Added JSON/JSONL fixtures for integration and regression testing (clinical and Q&A samples).
Tests — config/cli/engine types
packages/evaluator/tests/test_audit_config_validation.py, packages/evaluator/tests/test_cli_refusal_flags.py, packages/evaluator/tests/test_engine_types.py, packages/evaluator/tests/test_evaluator_factory.py
Updated/added tests for new AuditConfig defaults, CLI refusal flag parsing, engine datatypes (SkipRecord, scoring fields), and evaluator factory/judge requirements.
Tests — ragas adapters & extractor
packages/evaluator/tests/test_ragas_adapters.py, packages/evaluator/tests/test_ragas_extractor.py
New adapter tests (LLM forwarding, async behavior, embeddings providers including mock determinism), updated extractor tests expecting ValueError instead of silent 0.0.
Tests — LLM-judge & refusal-aware
packages/evaluator/tests/test_llm_judge_aggregation.py, packages/evaluator/tests/test_refusal_aware_scoring.py
Updated aggregation tests to combined single-call format; added refusal-aware scoring tests validating classification, justification propagation, and strict mode behavior.
Integration & regression tests
packages/evaluator/tests/test_ragas_integration.py, packages/evaluator/tests/test_pearmedica_regression.py
Added gated ragas integration tests and clinical regression tests verifying skip behavior, embedding regressions, and long-response handling.
Report & PDF tests
packages/evaluator/tests/test_report.py, packages/evaluator/tests/test_enhanced_report.py, packages/evaluator/tests/test_json_report.py, packages/evaluator/tests/test_pdf_generator.py, packages/evaluator/tests/test_report_generator.py, packages/evaluator/tests/test_report_template.py, packages/evaluator/tests/test_report_visual_regression.py
Updated/added tests for HTML/JSON output fields, Jinja template strict rendering, SVG presence, Playwright print-media emulation, skipped-sample rendering, safety-refusal rate/warnings, and pixel-level visual regression.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant AuditConfig
    participant EvaluatorFactory
    participant LLMJudgeEvaluator
    participant Judge
    participant ReportGenerator

    User->>CLI: audit --refusal-aware / --strict
    CLI->>CLI: parse flags
    CLI->>AuditConfig: create(refusal_aware=...)
    AuditConfig->>EvaluatorFactory: create_evaluator(refusal_aware=...)
    EvaluatorFactory->>LLMJudgeEvaluator: instantiate(refusal_aware=...)
    User->>LLMJudgeEvaluator: evaluate(samples)
    alt refusal_aware=True
        LLMJudgeEvaluator->>Judge: judge(system + classification preamble + dual rubric)
        Judge-->>LLMJudgeEvaluator: combined JSON (scoring_mode, scores, justification)
        LLMJudgeEvaluator->>LLMJudgeEvaluator: apply safety_refusal rubric -> MetricResults with scoring_mode/refusal_justification
    else refusal_aware=False
        LLMJudgeEvaluator->>Judge: judge(system + standard rubric)
        Judge-->>LLMJudgeEvaluator: JSON scores
        LLMJudgeEvaluator->>LLMJudgeEvaluator: produce MetricResults (standard)
    end
    LLMJudgeEvaluator-->>ReportGenerator: EvaluationResult (metrics, skipped_samples, scoring_modes_count)
    ReportGenerator->>ReportGenerator: build Jinja context
    ReportGenerator-->>User: HTML report (safety refusals, skipped samples, metrics)
sequenceDiagram
    participant User
    participant RagasEvaluator
    participant RagForgeRagasLLM
    participant RagForgeRagasEmbeddings
    participant Judge
    participant RagasLib

    User->>RagasEvaluator: evaluate(samples)
    RagasEvaluator->>RagasEvaluator: auto-select embeddings provider
    RagasEvaluator->>RagForgeRagasLLM: create(judge, refusal_aware)
    RagasEvaluator->>RagForgeRagasEmbeddings: create(provider)
    RagasEvaluator->>RagasLib: ragas_evaluate(samples, llm=LLM, embeddings=Embeddings)
    alt refusal_aware=True
        RagForgeRagasLLM->>Judge: generate_text(prompt + refusal note)
        Judge-->>RagForgeRagasLLM: response
    else refusal_aware=False
        RagForgeRagasLLM->>Judge: generate_text(prompt)
        Judge-->>RagForgeRagasLLM: response
    end
    RagasLib-->>RagasEvaluator: raw metric outputs or raise
    alt extraction/exception failure
        RagasEvaluator->>RagasEvaluator: record SkipRecord(sample_id, metric, reason, exception)
        RagasEvaluator-->>User: EvaluationResult with skipped_samples
    else success
        RagasEvaluator-->>User: EvaluationResult with metrics
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Hoppity-hop, evaluator anew—

Refusals heard and skips in view.
Adapters bridge judge and ragas' art,
Reports bloom bright and play their part.
RAG‑Forge hops onward—fresh and true.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 51.16%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅: the PR title 'feat(evaluator): v0.2.0 evaluator refresh' clearly and concisely summarizes the main change and is directly related to the substantial changeset.
  • Description check ✅: the PR description is comprehensive and directly related to the changeset, covering the four main workstreams (RAGAS rebuild, refusal-aware scoring, audit report, release coordination), breaking changes, and test coverage.



Two regressions on PR #33 CI:

1. ruff failures in test_ragas_adapters.py and test_ragas_extractor.py:
   - I001: unsorted import block
   - F401: unused pytest import (top-level), unused SkipRecord import
   - F811: pytest redefined inside test function
   - F821: leftover captured_prompts reference in a dead stub class
   Fix: clean up the imports, remove the unused stub, sort the import block.

2. RAG Quality Gate: rag-forge-eval audit --judge mock scored
   faithfulness 0.0 because the combined-path parser (added in v0.2.0)
   expects {faithfulness, answer_relevance, context_relevance,
   hallucination} at the top level of the judge response, but MockJudge
   still returned the legacy per-metric shape (claims/ratings/score).
   Fix: additively include the combined-path fields in MockJudge's
   default response. Legacy per-metric fields stay so the partial-
   report tests (which monkey-patch _default_metrics) still work.

Local verification: `rag-forge-eval audit --golden-set eval/golden_set.json
--judge mock --evaluator llm-judge` now scores faithfulness 0.90, all four
metrics pass, overall 0.90, RMM Level 3.
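
The additive MockJudge fix described above can be sketched as follows; the helper name and exact values are hypothetical, and only the field names come from this PR.

```python
def mock_judge_default_response() -> dict:
    # Legacy per-metric shape, kept so the partial-report tests that
    # monkey-patch _default_metrics still work.
    legacy = {"claims": [], "ratings": [], "score": 0.9}
    # Combined-path fields the v0.2.0 parser expects at the top level.
    combined = {
        "faithfulness": 0.9,
        "answer_relevance": 0.9,
        "context_relevance": 0.9,
        "hallucination": 0.9,
        "scoring_mode": "standard",
        "refusal_justification": None,
    }
    return {**legacy, **combined}
```

Merging the two dicts keeps the change strictly additive: old consumers see the legacy keys, the combined-path parser finds its top-level fields.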

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 21

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/evaluator/src/rag_forge_evaluator/audit.py (1)

219-225: ⚠️ Potential issue | 🟠 Major

Propagate the new ragas-specific config here.

ragas_max_tokens and ragas_embeddings_provider were added to AuditConfig, but this construction path still only forwards the generic factory args. Any non-default value is silently ignored today, so callers cannot actually raise the ragas token cap or force mock / voyage / openai embeddings.

Proposed fix
-            evaluator = create_evaluator(
-                self.config.evaluator_engine,
-                judge=judge,
-                thresholds=self.config.thresholds,
-                progress=self._progress,
-                refusal_aware=self.config.refusal_aware,
-            )
+            if self.config.evaluator_engine == "ragas":
+                from rag_forge_evaluator.engines.ragas_evaluator import RagasEvaluator
+
+                evaluator = RagasEvaluator(
+                    judge=judge,
+                    thresholds=self.config.thresholds,
+                    max_tokens=self.config.ragas_max_tokens,
+                    embeddings_provider=self.config.ragas_embeddings_provider,
+                    refusal_aware=self.config.refusal_aware,
+                )
+            else:
+                evaluator = create_evaluator(
+                    self.config.evaluator_engine,
+                    judge=judge,
+                    thresholds=self.config.thresholds,
+                    progress=self._progress,
+                    refusal_aware=self.config.refusal_aware,
+                )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/audit.py` around lines 219 - 225,
The create_evaluator call in audit.py currently ignores the new AuditConfig
fields ragas_max_tokens and ragas_embeddings_provider, so non-default values are
dropped; update the create_evaluator invocation (where evaluator is assigned) to
pass through these new config values (e.g.,
ragas_max_tokens=self.config.ragas_max_tokens and
ragas_embeddings_provider=self.config.ragas_embeddings_provider) so the factory
receives them; ensure create_evaluator’s signature (and any downstream evaluator
factory code) accepts and uses these parameters if not already implemented.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CHANGELOG.md`:
- Line 10: Update the migration text that currently says "ragas will upgrade to
0.4.x automatically" to match the constraint used above: replace that phrase
with the exact constraint "ragas>=0.4,<0.6" or with "ragas 0.4.x/0.5.x" so the
statement in the line referencing ragas mirrors the constraint at the top of the
file (the sentence referencing the `ragas` requirement).

In `@docs/superpowers/plans/2026-04-14-v0.2.0-evaluator-refresh.md`:
- Around line 1605-1610: Add a blank line before and after each fenced code
block (e.g., the ```bash block containing "uv pip install 'ragas>=0.4,<0.6'" and
"uv run pytest -m ragas_integration ...") so every ```...``` block is separated
by an empty line above and below; update all similar fenced blocks in the
document to maintain consistent markdown rendering.

In `@docs/superpowers/specs/2026-04-14-v0.2.0-evaluator-refresh-design.md`:
- Line 59: The fenced code block containing the ASCII architecture diagram
should include a language tag to satisfy markdownlint MD040; update the opening
fence for that block by changing the triple backticks to include "text" (i.e.,
```text) for the ASCII diagram block so the diagram is treated as plain text and
the linter passes.

In `@packages/evaluator/src/rag_forge_evaluator/audit.py`:
- Around line 160-174: The validation in the ragas branch rejects
AuditConfig.judge_model when it is still the default None; update the check in
the block that inspects config.evaluator_engine == "ragas" so that it allows
judge to be None (or implicitly treated as "mock") — e.g., treat None as
equivalent to "mock" or skip the allowlist check when judge is None, leaving
normalization to _create_judge() which constructs MockJudge(); ensure references
to judge, config.evaluator_engine, AuditConfig.judge_model, and _create_judge()
are used so the change only affects the ragas-specific validation and preserves
the existing error paths for explicit invalid strings.

In `@packages/evaluator/src/rag_forge_evaluator/engines/__init__.py`:
- Around line 24-26: When engine == "ragas", fail fast if no judge is provided:
before instantiating RagasEvaluator in __init__.py, check that the judge
parameter is not None and raise a descriptive ValueError (e.g., "ragas engine
requires a judge") so we don't defer the error until RagasEvaluator.evaluate();
then proceed to import/return RagasEvaluator(judge=judge, thresholds=thresholds,
refusal_aware=refusal_aware).

In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py`:
- Around line 41-79: The RagForgeRagasLLM adapter currently duck-types ragas and
exposes a narrow generate_text( str )->str; update it to subclass ragas'
BaseRagasLLM and implement the full ragas contract: replace class
RagForgeRagasLLM to inherit BaseRagasLLM, change generate_text to the full
signature (accept PromptValue prompt, n:int, temperature:float, stop:list|None,
callbacks:Optional[...] etc.), return an LLMResult object, and ensure
agenerate_text mirrors the async signature and runs the sync implementation via
asyncio.to_thread; also thread the stored max_tokens into the underlying call
(pass max_tokens to _judge.judge or map it into the judge call parameters) and
keep model_name() forwarding to _judge.model_name() so ragas sees the proper
model metadata.

In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py`:
- Around line 154-205: The failure branch currently appends SkipRecord entries
but returns samples_evaluated=0 and the aggregate branch can mark passed=True
even when some metrics were dropped; update the exception branch to return
samples_evaluated=len(samples) (use the same samples list used for SkipRecord)
and adjust the final passed logic to require that all expected metrics were
produced before marking the run passed (i.e., require len(aggregated) ==
len(_METRIC_NAMES) && all(m.passed for m in aggregated)); ensure you still
compute overall_score the same way and keep scored_count on each MetricResult as
len(samples). Reference: SkipRecord, EvaluationResult, _METRIC_NAMES,
_extract_ragas_score, MetricResult, self._thresholds, _DEFAULT_THRESHOLDS,
aggregated, samples_evaluated, passed.

In `@packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py`:
- Around line 353-387: When all metrics in per_metric_results from
self._evaluate_sample_combined(sample) are skipped, do not build a SampleResult
or increment scoring_modes_count; instead create and append a SkipRecord and
continue. Concretely: after populating sample_metric_scores and computing
sample_skipped, detect if sample_skipped == len(metric_names) and if so import
and use SkipRecord to record the skip (including sample.sample_id,
sample.query/response context and refusal_justification/scoring_mode as
appropriate), do not compute worst_metric or root_cause or update
scoring_modes_count, and do not append to sample_results; otherwise keep the
existing flow (compute worst_metric via min(...), build thresholds_map and
root_cause via _determine_root_cause, update scoring_modes_count, and append
SampleResult). Ensure SkipRecord is imported at top of module.
- Around line 117-128: The guard in _are_standard_metrics uses isinstance()
which is too permissive and lets different instances (or subclassed evaluators
overriding evaluate_sample) pass; instead compare the concrete type multiset so
only the exact default metric types match. Replace the isinstance-based check
with an exact-type comparison, e.g. compare Counter(type(m) for m in metrics) ==
Counter(_STANDARD_METRIC_TYPES) (or an equivalent multiset comparison) inside
_are_standard_metrics so only the exact default metric classes are accepted.

In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py`:
- Around line 673-692: The compatibility shim currently calls generate_html with
hardcoded metadata and drops rmm_level, trends, and sample_results; update the
wrapper so it forwards the caller's arguments into generate_html (e.g., pass
through project_name, evaluator_name, judge_model_display, rmm_level, trends,
and sample_results) or, if you intend a breaking change, make the API explicit
by removing the shim and documenting the new signature; locate the wrapper that
calls generate_html (and references self.output_dir) and ensure it propagates
the original inputs instead of the hardcoded strings or else convert it to a
clearly breaking call.
- Around line 322-328: Update the data_handling_html string in generator.py to
remove the incorrect claims that queries/responses are only processed
in-memory and that the report contains no PII. State explicitly that report
artifacts (HTML/JSON) may persist sample data to disk unless redacted, and
that evaluation inputs may be transmitted to the configured judge endpoint
under the operator's API key and are subject to that provider's data policy.
Keep the message concise: replace the existing sentences in data_handling_html
and clearly warn users about potential persistence of sample data and the need
to redact regulated data before running the audit.
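
One possible replacement string (the wording below is illustrative, not the shipped copy):

```python
# Illustrative replacement copy for generator.py; the exact wording is an
# assumption, only the substance (persistence + judge transmission) is required.
data_handling_html = (
    "<p><strong>Data handling:</strong> report artifacts (HTML/JSON) may "
    "persist sample queries and responses to disk unless redacted. Evaluation "
    "inputs are transmitted to the configured judge endpoint under the "
    "operator's API key and are subject to that provider's data policy. "
    "Redact regulated data before running the audit.</p>"
)

print("persist" in data_handling_html and "judge endpoint" in data_handling_html)
```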

In `@packages/evaluator/src/rag_forge_evaluator/report/templates/audit.html.j2`:
- Around line 1162-1169: The template loops over sample.scores.items() which
will error if sample.scores is None or missing; update the Jinja loop in the
audit.html.j2 template to defensively handle that by using a default/empty
mapping or a conditional check around sample.scores (e.g., use the |default({})
filter or an if sample.scores statement) so the for-loop over
sample.scores.items() only runs when scores exist and avoids runtime failures.
- Around line 933-951: The template currently renders an <svg> with a <polyline>
and <circle> elements based on history_points; add a Jinja guard so when
history_points is empty the sparkline block (the <svg> / <polyline> / <circle>
rendering) is not emitted or a fallback placeholder is shown instead.
Concretely, wrap the existing SVG block (the polyline points generation and the
loop that emits circles) in a Jinja if/else that checks "if history_points" (or
"if history_points|length") and only renders the polyline and circles when
non-empty; otherwise render a simple empty-state element or omit the SVG to
avoid an empty points attribute and missing markers.

In `@packages/evaluator/tests/test_cli_refusal_flags.py`:
- Line 177: Move the pytest import to the top of the module with the other
imports and remove the trailing `import pytest  # noqa: E402` line; ensure any
references to pytest in test functions (used around the tests at lines where
pytest is currently referenced) continue to work and drop the E402 noqa comment
since pytest has no circular dependency here.

In `@packages/evaluator/tests/test_pdf_generator.py`:
- Around line 63-69: The test currently only checks that emulate_media was
called but not that it happened before navigation/content calls; update the
assertions to enforce ordering by inspecting mock_page.method_calls: find the
index of the emulate_media call (from emulate_calls or by scanning for call[0]
== "emulate_media") and the index of the goto and/or set_content calls (call[0]
== "goto" and call[0] == "set_content") and assert emulate_index < goto_index
(and emulate_index < set_content_index if relevant) so the test validates that
emulate_media executed before navigation/content operations.
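
A minimal sketch of the ordering assertion against mock_page.method_calls (the call sequence below is simulated; method names mirror the Playwright page API referenced above):

```python
from unittest import mock

page = mock.Mock()

# Simulate the generator's call sequence: media emulation, then content/navigation.
page.emulate_media(media="print")
page.set_content("<html>...</html>")
page.goto("about:blank")

# Each entry of method_calls unpacks into (name, args, kwargs).
names = [name for name, args, kwargs in page.method_calls]

emulate_index = names.index("emulate_media")
content_index = names.index("set_content")
goto_index = names.index("goto")

assert emulate_index < content_index, "emulate_media must precede set_content"
assert emulate_index < goto_index, "emulate_media must precede goto"
print(names)  # ['emulate_media', 'set_content', 'goto']
```

Indexing into the recorded call names validates ordering, not just presence, which is exactly what the review asks the test to enforce.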

In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 3-8: Remove the duplicate local "import pytest" inside the
tests/test_ragas_adapters.py module so it doesn't shadow the top-level import;
keep only the top-level "import pytest" already present near the top of the file
and delete the inner import (the one inside the test scope that causes
F401/F811), then run ruff/formatter to normalize the import block for the
RagForgeRagasLLM and RagForgeRagasEmbeddings tests.
- Around line 152-175: Define a captured_prompts list and actually exercise
the code path that builds/uses RagForgeRagasLLM, so the test proves
refusal_aware is propagated. Add captured_prompts = [] before
CapturingMockJudge, then have the test invoke the evaluator flow that calls
the judge (e.g., the evaluator method that triggers RagForgeRagasLLM ->
judge), or monkeypatch/replace RagForgeRagasLLM with a small fake that
captures its init args. Assert that the captured judge calls happened and that
the wrapper/fake received refusal_aware=True (or that captured_prompts
contains the expected user_prompt), instead of only checking
evaluator._refusal_aware. Reference symbols:
test_ragas_evaluator_threads_refusal_aware_into_wrapper,
CapturingMockJudge.judge, RagasEvaluator, and RagForgeRagasLLM.

In `@packages/evaluator/tests/test_ragas_extractor.py`:
- Around line 19-23: The import list in the test module unnecessarily includes
SkipRecord which causes a lint error; edit the import statement that currently
imports EvaluationSample and SkipRecord from rag_forge_evaluator.engine and
remove SkipRecord so only used symbols (e.g., EvaluationSample) are imported;
ensure the other imports (RagasEvaluator, _extract_ragas_score) remain unchanged
to avoid breaking references in the tests.

In `@packages/evaluator/tests/test_ragas_integration.py`:
- Around line 41-83: Update the two ragas tests to assert the documented
samples_evaluated / skip semantics. In test_ragas_end_to_end_with_mock_judge,
add an assertion that result.samples_evaluated in {len(samples), 0}, alongside
the existing metrics/skipped checks. In
test_ragas_whole_batch_crash_captured_as_skips, assert
result.samples_evaluated in {len(samples), 0} and, when it is 0, assert that
result.skipped_samples contains the input sample (e.g., sample_id
"broken-001") and that len(result.skipped_samples) == len(samples), so an
empty EvaluationResult no longer passes. Reference the EvaluationResult fields
samples_evaluated and skipped_samples when adding these assertions.
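
A sketch of the intended assertions, using minimal stand-ins for the engine types (field names follow the SkipRecord/EvaluationResult fields described in this review; the stand-in dataclasses are illustrative, not the real definitions):

```python
from dataclasses import dataclass, field

@dataclass
class SkipRecord:
    sample_id: str
    metric_name: str
    reason: str
    exception_type: str

@dataclass
class EvaluationResult:
    samples_evaluated: int
    skipped_samples: list = field(default_factory=list)

# Simulated whole-batch crash: nothing evaluated, everything skipped.
samples = ["broken-001"]
result = EvaluationResult(
    samples_evaluated=0,
    skipped_samples=[
        SkipRecord("broken-001", "faithfulness", "batch crash", "RuntimeError")
    ],
)

# The documented semantics: all-or-nothing, with skips accounting for the gap.
assert result.samples_evaluated in {len(samples), 0}
if result.samples_evaluated == 0:
    assert len(result.skipped_samples) == len(samples)
    assert any(s.sample_id == "broken-001" for s in result.skipped_samples)
print("ok")
```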

In `@packages/evaluator/tests/test_report_visual_regression.py`:
- Around line 131-135: The assertion currently uses a strict less-than check so
an exact 0.5% diff fails; change the assertion to allow the advertised tolerance
by using a less-than-or-equal comparison for diff_ratio (i.e., replace the "<
0.005" check in the assertion that references total_pixels, diff_pixels and
diff_ratio with "<= 0.005") and keep the existing formatted failure message
unchanged.
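
The boundary case motivating the change from < to <=:

```python
# An exact 0.5% pixel diff should pass under the advertised tolerance.
total_pixels = 10_000
diff_pixels = 50  # exactly 0.5% of the image differs
diff_ratio = diff_pixels / total_pixels

print(diff_ratio < 0.005)   # False: strict less-than rejects the boundary
print(diff_ratio <= 0.005)  # True: <= accepts the advertised tolerance
```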

In `@pyproject.toml`:
- Line 36: The pyproject.toml currently suppresses mypy for a first-party
package by including "rag_forge_evaluator.*" in the module ignore list; remove
"rag_forge_evaluator.*" from the module = [...] entry (the line containing
module = ["anthropic.*", ..., "rag_forge_evaluator.*"]) so mypy will type-check
your own package, or alternatively replace the suppression by adding proper type
information (inline typings, py.typed, or stubs) for rag_forge_evaluator so it
can be checked under strict=true.

---

Outside diff comments:
In `@packages/evaluator/src/rag_forge_evaluator/audit.py`:
- Around line 219-225: The create_evaluator call in audit.py currently ignores
the new AuditConfig fields ragas_max_tokens and ragas_embeddings_provider, so
non-default values are dropped; update the create_evaluator invocation (where
evaluator is assigned) to pass through these new config values (e.g.,
ragas_max_tokens=self.config.ragas_max_tokens and
ragas_embeddings_provider=self.config.ragas_embeddings_provider) so the factory
receives them; ensure create_evaluator’s signature (and any downstream evaluator
factory code) accepts and uses these parameters if not already implemented.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: cb95c54b-c1aa-404f-95cd-d939bdf537b5

📥 Commits

Reviewing files that changed from the base of the PR and between 8bb73c7 and 3ef4130.

📒 Files selected for processing (43)
  • .gitignore
  • CHANGELOG.md
  • docs/evaluators.md
  • docs/superpowers/plans/2026-04-14-v0.2.0-evaluator-refresh.md
  • docs/superpowers/specs/2026-04-14-v0.2.0-evaluator-refresh-design.md
  • packages/cli/package.json
  • packages/core/pyproject.toml
  • packages/evaluator/pyproject.toml
  • packages/evaluator/src/rag_forge_evaluator/audit.py
  • packages/evaluator/src/rag_forge_evaluator/cli.py
  • packages/evaluator/src/rag_forge_evaluator/engine.py
  • packages/evaluator/src/rag_forge_evaluator/engines/__init__.py
  • packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
  • packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py
  • packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py
  • packages/evaluator/src/rag_forge_evaluator/report/generator.py
  • packages/evaluator/src/rag_forge_evaluator/report/pdf.py
  • packages/evaluator/src/rag_forge_evaluator/report/templates/__init__.py
  • packages/evaluator/src/rag_forge_evaluator/report/templates/audit.html.j2
  • packages/evaluator/tests/fixtures/ragas_regression/__init__.py
  • packages/evaluator/tests/fixtures/ragas_regression/cycle-2-excerpt.jsonl
  • packages/evaluator/tests/fixtures/ragas_tiny_fixture.json
  • packages/evaluator/tests/fixtures/visual_baseline/.gitkeep
  • packages/evaluator/tests/test_audit_config_validation.py
  • packages/evaluator/tests/test_cli_refusal_flags.py
  • packages/evaluator/tests/test_engine_types.py
  • packages/evaluator/tests/test_enhanced_report.py
  • packages/evaluator/tests/test_evaluator_factory.py
  • packages/evaluator/tests/test_json_report.py
  • packages/evaluator/tests/test_llm_judge_aggregation.py
  • packages/evaluator/tests/test_pdf_generator.py
  • packages/evaluator/tests/test_pearmedica_regression.py
  • packages/evaluator/tests/test_ragas_adapters.py
  • packages/evaluator/tests/test_ragas_extractor.py
  • packages/evaluator/tests/test_ragas_integration.py
  • packages/evaluator/tests/test_refusal_aware_scoring.py
  • packages/evaluator/tests/test_report.py
  • packages/evaluator/tests/test_report_generator.py
  • packages/evaluator/tests/test_report_template.py
  • packages/evaluator/tests/test_report_visual_regression.py
  • packages/mcp/package.json
  • packages/observability/pyproject.toml
  • pyproject.toml
📜 Review details
🧰 Additional context used
🪛 GitHub Actions: CI
packages/evaluator/tests/test_ragas_extractor.py

[error] 19-19: ruff check: rag_forge_evaluator.engine.SkipRecord imported but unused. (F401)

packages/evaluator/tests/test_ragas_adapters.py

[error] 1-8: ruff check: Import block is un-sorted or un-formatted. (Organize imports)


[error] 6-6: ruff check: pytest imported but unused. (F401)


[error] 106-106: ruff check: Redefinition of unused pytest from line 6. (F811)


[error] 159-159: ruff check: Undefined name captured_prompts. (F821)

🪛 LanguageTool
docs/evaluators.md

[style] ~13-~13: Consider an alternative for the overused word “exactly”.
Context: ...-relevance — even though the refusal is exactly what the pipeline was designed to do. ...

(EXACTLY_PRECISELY)

docs/superpowers/specs/2026-04-14-v0.2.0-evaluator-refresh-design.md

[style] ~36-~36: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...rs.** DeepEval engine is untouched. - No core pipeline changes. `packages/core...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~37-~37: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...rsion bump for release consistency. - No offline / air-gapped embeddings. `sen...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~38-~38: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ers` support is deferred to v0.2.1. - No vercel.ts migration or other infrastr...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~39-~39: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...e churn unrelated to the evaluator. - No ragas 0.2.x compatibility. Hard cut t...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~209-~209: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...or aggregate reporting. - SampleScore gains the same two fields. **audit.py / `A...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[grammar] ~267-~267: Use a hyphen to join words.
Context: ...ds with judge justification. 11. Worst performing case callout — sample ID, h...

(QB_NEW_EN_HYPHEN)

🪛 markdownlint-cli2 (0.22.0)
docs/superpowers/specs/2026-04-14-v0.2.0-evaluator-refresh-design.md

[warning] 59-59: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

docs/superpowers/plans/2026-04-14-v0.2.0-evaluator-refresh.md

[warning] 1608-1608: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 2231-2231: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 2324-2324: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 2333-2333: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 2341-2341: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 2348-2348: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 2360-2360: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 2370-2370: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

🔇 Additional comments (38)
.gitignore (1)

71-72: Looks good — scratch workspace is correctly excluded.

Line 72 cleanly prevents accidental commits of local brainstorming artifacts.

packages/evaluator/tests/fixtures/ragas_tiny_fixture.json (1)

1-18: Fixture structure is solid and test-friendly.

Both samples are minimal, deterministic, and include the required evaluation fields.

packages/mcp/package.json (1)

3-3: Version bump is consistent with release scope.

Line 3 aligns MCP package metadata with the v0.2.0 rollout.

packages/core/pyproject.toml (1)

3-3: Version update is correct for coordinated release.

Line 3 cleanly aligns rag-forge-core with v0.2.0.

packages/observability/pyproject.toml (1)

3-3: Version bump looks good.

Line 3 keeps observability package version in sync with the v0.2.0 release train.

packages/cli/package.json (1)

3-3: CLI version alignment is correct.

Line 3 properly reflects the v0.2.0 release.

packages/evaluator/src/rag_forge_evaluator/report/pdf.py (1)

72-79: Print-media PDF rendering update looks correct.

Line 72 and Lines 74-79 correctly force print media and honor CSS page sizing, which matches the new print-oriented template behavior.

packages/evaluator/tests/test_evaluator_factory.py (1)

24-29: Test setup now matches v0.2.0 ragas construction requirements.

Passing MockJudge() on Line 28 is the right fix to keep this test focused on deferred dependency import behavior.

packages/evaluator/tests/test_report.py (2)

33-35: RMM badge assertions are aligned with the new template format.

The "Level" + "Better Trust" checks correctly match the updated header rendering.


40-42: Metric display-name assertion update is correct.

Checking "Faithfulness" (title-cased) reflects the generator’s display transformation and keeps the test contract accurate.

packages/evaluator/tests/test_enhanced_report.py (2)

11-25: SVG assertion update matches the new sparkline implementation.

The revised check is consistent with the template’s <svg> + polyline/circle rendering.


27-43: Backward-compatibility test for trends kwarg is on point.

This keeps API compatibility covered without asserting removed visual behavior.

packages/evaluator/src/rag_forge_evaluator/cli.py (2)

57-59: Use explicit tri-state resolution for refusal flags.

Line 58 currently ignores refusal_aware_flag; today it works by default, but it makes precedence less explicit and future-default changes riskier.

[ suggest_recommended_refactor ]

♻️ Proposed refactor
-    # Compute refusal_aware from mutually-exclusive flags
-    strict = bool(getattr(args, "strict", False) or getattr(args, "no_refusal_aware_flag", False))
-    refusal_aware = not strict
+    # Compute refusal_aware from mutually-exclusive flags (explicit precedence)
+    if bool(getattr(args, "strict", False) or getattr(args, "no_refusal_aware_flag", False)):
+        refusal_aware = False
+    elif bool(getattr(args, "refusal_aware_flag", False)):
+        refusal_aware = True
+    else:
+        refusal_aware = True  # default for v0.2.x

311-329: Mutually-exclusive refusal flag group is implemented correctly.

The parser-level exclusivity cleanly prevents invalid combinations and keeps CLI behavior predictable.
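
The parser setup plus the explicit tri-state resolution suggested above can be sketched as follows (flag dest names are assumptions; the real wiring lives in cli.py):

```python
import argparse

def resolve_refusal_aware(args: argparse.Namespace) -> bool:
    # Explicit precedence: strict-style flags win, then the opt-in flag,
    # then the v0.2.x default (refusal-aware on).
    if getattr(args, "strict", False) or getattr(args, "no_refusal_aware", False):
        return False
    if getattr(args, "refusal_aware", False):
        return True
    return True  # default for v0.2.x

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
group.add_argument("--strict", action="store_true")
group.add_argument("--refusal-aware", dest="refusal_aware", action="store_true")
group.add_argument("--no-refusal-aware", dest="no_refusal_aware", action="store_true")

print(resolve_refusal_aware(parser.parse_args([])))                      # True
print(resolve_refusal_aware(parser.parse_args(["--strict"])))            # False
print(resolve_refusal_aware(parser.parse_args(["--no-refusal-aware"])))  # False
```

The mutually exclusive group makes conflicting combinations (e.g. --strict with --refusal-aware) exit with a parser error, while the resolver keeps the default explicit and easy to flip in a future release.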

packages/evaluator/tests/test_pdf_generator.py (1)

31-62: Good coverage addition for print-media emulation.

This test meaningfully guards the new PDF rendering behavior and mocks the Playwright call chain cleanly.

packages/evaluator/src/rag_forge_evaluator/engines/__init__.py (1)

18-23: Refusal-aware threading into LLMJudgeEvaluator is correct.

The new factory parameter is propagated cleanly and keeps behavior centralized.

packages/evaluator/pyproject.toml (3)

34-35: Optional dependency updates are coherent with the new evaluator paths.

The ragas + datasets pairing and ragas-voyage extra are aligned with the new adapter wiring.


49-50: Template force-include is the right packaging safeguard.

This ensures audit.html.j2 is available in wheel installs where PackageLoader is used.


52-56: Pytest marker registration is a good addition.

Declaring opt-in markers here prevents unknown-marker noise and documents test gating clearly.

packages/evaluator/tests/test_json_report.py (4)

6-13: LGTM: Updated imports align with v0.2.0 data model additions.

The imports correctly include pytest for approx usage and SkipRecord for testing the new skip semantics. These align with the engine type changes.


65-92: LGTM: Comprehensive test for skipped_samples field.

The test correctly validates the new skipped_samples JSON structure, including all four required fields (sample_id, metric_name, reason, exception_type). This aligns with the SkipRecord dataclass definition and the generate_json() implementation.


109-122: LGTM: Good use of pytest.approx for floating-point comparison.

Testing safety_refusal_rate with pytest.approx(1 / 12) is the correct approach for floating-point equality in Python tests, avoiding precision issues.


155-163: LGTM: Edge case coverage for zero samples.

Good defensive test ensuring that when samples_evaluated=0, the report gracefully handles the division-by-zero case by returning safety_refusal_rate=0.0 and empty collections.

packages/evaluator/tests/test_ragas_extractor.py (2)

49-58: LGTM: Test expectations correctly updated for v0.2.0 behavior.

The tests now properly assert ValueError is raised instead of silent 0.0 fallback, matching the updated _extract_ragas_score implementation that eliminates the bug where broken infrastructure was indistinguishable from zero scores.


74-93: LGTM: Good coverage for RagasEvaluator judge injection.

These tests validate the factory threading and the new requirement that RagasEvaluator must have a judge injected. The expected metrics list matches _METRIC_NAMES from the implementation.

packages/evaluator/tests/test_engine_types.py (1)

1-57: LGTM: Solid unit test coverage for new engine types.

The test file correctly validates:

  • SkipRecord dataclass construction with all required fields
  • MetricResult accepts new scoring_mode and refusal_justification fields
  • EvaluationResult defaults skipped_samples to [] and scoring_modes_count to {}
  • SampleResult properly stores scoring mode and allows None justification

These tests align with the dataclass definitions in engine.py.

docs/superpowers/plans/2026-04-14-v0.2.0-evaluator-refresh.md (1)

1-57: LGTM: Comprehensive implementation plan with clear task breakdown.

The plan document is well-structured with:

  • Clear codebase orientation for new engineers
  • File structure plan with responsibilities
  • 20 tasks with TDD approach (failing test → implementation → pass)
  • Spec coverage self-review checklist

This serves as excellent documentation for the v0.2.0 evaluator refresh.

packages/evaluator/tests/fixtures/ragas_regression/cycle-2-excerpt.jsonl (1)

1-3: LGTM: Well-designed regression fixture for v0.1.3 bug coverage.

The three-sample fixture effectively covers:

  1. acs-001-typical-stemi: Standard clinical case for positive scoring
  2. htn-001-emergency-end-organ: Another standard case with multi-organ involvement
  3. t2dm-002-paediatric-dosing-refusal: Refusal case designed to test the new refusal-aware scoring path

The refusal case (line 3) is particularly valuable — it represents the PearMedica scenario where the v0.1.3 evaluator incorrectly penalized valid safety refusals. The sample IDs are unique and follow a consistent {condition}-{number}-{descriptor} pattern.

packages/evaluator/tests/test_audit_config_validation.py (3)

50-62: LGTM: Test correctly updated for v0.2.0 ragas+mock behavior.

The test now validates that ragas + mock judge combination is allowed, reflecting the new RagForgeRagasLLM wrapper that honors the configured judge end-to-end. This replaces the old blocking behavior where mock would silently bypass the cost gate.


128-166: LGTM: Good coverage for voyageai conditional logic.

The tests properly validate:

  • When _voyageai_installed() returns True, ragas+claude construction succeeds
  • When it returns False, a ConfigurationError is raised with "ragas-voyage" in the message

The monkeypatching approach correctly isolates the voyageai availability check.


169-185: LGTM: Default value assertions match AuditConfig definition.

The tests correctly validate:

  • refusal_aware defaults to True
  • ragas_max_tokens defaults to 8192
  • ragas_embeddings_provider defaults to None (auto-select)

These match the AuditConfig dataclass definition in the context snippet.

packages/evaluator/tests/test_cli_refusal_flags.py (3)

9-63: LGTM: Parser reconstruction provides good test isolation.

The _build_parser() function correctly reconstructs the audit subparser with all the refusal-aware flags in a mutually exclusive group. This mirrors the actual CLI implementation and allows testing without invoking the full CLI main().


66-87: LGTM: Config builder correctly mirrors cli.py logic.

The _build_audit_config_from_args() function accurately replicates the refusal-aware computation from cli.py:

strict = bool(getattr(args, "strict", False) or getattr(args, "no_refusal_aware_flag", False))
refusal_aware = not strict

This ensures test assertions reflect actual CLI behavior.


140-174: LGTM: Comprehensive mutual exclusivity tests.

Good coverage of all invalid flag combinations:

  • --strict + --refusal-aware
  • --strict + --no-refusal-aware
  • --refusal-aware + --no-refusal-aware

Using pytest.raises(SystemExit) correctly handles argparse's behavior on conflicting arguments.

packages/evaluator/src/rag_forge_evaluator/report/templates/audit.html.j2 (2)

1-870: LGTM: Well-structured CSS with proper theming and print support.

The CSS architecture is solid:

  • CSS custom properties (--color-*, --font-*) enable consistent theming
  • Responsive breakpoints at 720px handle mobile gracefully
  • Print media rules properly expand <details> elements and prevent page breaks inside cards
  • The @page rules with @bottom-right and @bottom-left content are appropriate for PDF generation

1176-1202: LGTM: Conditional skipped samples section with clear explanation.

The conditional {% if skipped_samples %} block correctly renders only when there are skipped samples, with a helpful note explaining that "Broken infrastructure is visible rather than hidden behind zero scores." This aligns with the v0.2.0 SkipRecord semantics.

packages/evaluator/src/rag_forge_evaluator/engine.py (1)

5-84: Nice backwards-compatible engine model refresh.

Adding the new scoring/skip fields as optional or default-factory-backed extensions keeps existing call sites stable while exposing the richer report state this release needs.

packages/evaluator/tests/test_llm_judge_aggregation.py (1)

37-163: Strong regression coverage for the combined-response path.

These cases exercise both full-sample and per-metric skip behavior, including the PearMedica pollution scenario that motivated the aggregation fix.

Comment thread CHANGELOG.md
Comment on lines +1605 to +1610
```bash
uv pip install 'ragas>=0.4,<0.6'
uv run pytest -m ragas_integration packages/evaluator/tests/test_ragas_integration.py -v
```
Expected: both tests pass.


🧹 Nitpick | 🔵 Trivial

Optional: Add blank lines around fenced code blocks for markdown consistency.

Several fenced code blocks in this document are not surrounded by blank lines (e.g., lines 1608, 2231, 2324, 2333, 2341, 2348, 2360, 2370). While this doesn't affect functionality, adding blank lines improves markdown rendering consistency across parsers.

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 1608-1608: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/superpowers/plans/2026-04-14-v0.2.0-evaluator-refresh.md` around lines
1605 - 1610, Add a blank line before and after each fenced code block (e.g., the
```bash block containing "uv pip install 'ragas>=0.4,<0.6'" and "uv run pytest
-m ragas_integration ...") so every ```...``` block is separated by an empty
line above and below; update all similar fenced blocks in the document to
maintain consistent markdown rendering.

Comment thread docs/superpowers/specs/2026-04-14-v0.2.0-evaluator-refresh-design.md Outdated
Comment on lines 24 to +26
    if engine == "ragas":
        from rag_forge_evaluator.engines.ragas_evaluator import RagasEvaluator
-       return RagasEvaluator(thresholds=thresholds)
+       return RagasEvaluator(judge=judge, thresholds=thresholds, refusal_aware=refusal_aware)

⚠️ Potential issue | 🟠 Major

Fail fast when creating ragas evaluator without a judge.

Line 26 allows judge=None, but ragas now requires a judge and only errors later during evaluate(). This delays failure and weakens diagnostics.

🛠️ Proposed fix
     if engine == "ragas":
         from rag_forge_evaluator.engines.ragas_evaluator import RagasEvaluator
+        if judge is None:
+            raise ValueError(
+                "Ragas evaluator requires a judge. "
+                "Pass --judge (e.g. mock|claude|openai) or provide a JudgeProvider."
+            )
         return RagasEvaluator(judge=judge, thresholds=thresholds, refusal_aware=refusal_aware)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (indentation restored):

    if engine == "ragas":
        from rag_forge_evaluator.engines.ragas_evaluator import RagasEvaluator
        if judge is None:
            raise ValueError(
                "Ragas evaluator requires a judge. "
                "Pass --judge (e.g. mock|claude|openai) or provide a JudgeProvider."
            )
        return RagasEvaluator(judge=judge, thresholds=thresholds, refusal_aware=refusal_aware)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/engines/__init__.py` around lines
24 - 26, When engine == "ragas", fail fast if no judge is provided: before
instantiating RagasEvaluator in __init__.py, check that the judge parameter is
not None and raise a descriptive ValueError (e.g., "ragas engine requires a
judge") so we don't defer the error until RagasEvaluator.evaluate(); then
proceed to import/return RagasEvaluator(judge=judge, thresholds=thresholds,
refusal_aware=refusal_aware).

Comment on lines +41 to +79
class RagForgeRagasLLM:
    """ragas LLM wrapper that forwards to a rag-forge Judge.

    Implements both sync (``generate_text``) and async (``agenerate_text``)
    interfaces that ragas 0.4.x calls internally. The async variant runs
    the sync path in a thread — the underlying Judge implementations are
    sync, and ragas parallelizes metrics at a higher level anyway.
    """

    def __init__(
        self,
        judge: _JudgeLike,
        max_tokens: int = 8192,
        system_prompt: str = "",
        refusal_aware: bool = True,
    ) -> None:
        self._judge = judge
        self.max_tokens = max_tokens
        self._system_prompt = system_prompt
        self._refusal_aware = refusal_aware

    def generate_text(self, prompt: str) -> str:
        """Generate text using the underlying judge."""
        if self._refusal_aware:
            prompt = _REFUSAL_NOTE + prompt
        return self._judge.judge(self._system_prompt, prompt)

    async def agenerate_text(self, prompt: str) -> str:
        """Async variant of generate_text.

        Runs the sync path in a thread to maintain compatibility
        with ragas 0.4.x's async callsites.
        """
        return await asyncio.to_thread(self.generate_text, prompt)

    def model_name(self) -> str:
        """Return the model name from the underlying judge."""
        return self._judge.model_name()
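
The sync-to-async bridge above can be exercised in isolation with a stand-in judge (MockJudge and SyncToAsyncLLM below are illustrative names, not the shipped classes):

```python
import asyncio

class MockJudge:
    # Stand-in for a rag-forge Judge: sync judge() plus model_name().
    def judge(self, system_prompt: str, user_prompt: str) -> str:
        return f"scored:{user_prompt[:10]}"

    def model_name(self) -> str:
        return "mock-judge"

class SyncToAsyncLLM:
    # Mirrors the adapter's pattern: sync generate_text, async variant
    # delegating to a worker thread via asyncio.to_thread.
    def __init__(self, judge: MockJudge) -> None:
        self._judge = judge

    def generate_text(self, prompt: str) -> str:
        return self._judge.judge("", prompt)

    async def agenerate_text(self, prompt: str) -> str:
        return await asyncio.to_thread(self.generate_text, prompt)

llm = SyncToAsyncLLM(MockJudge())
print(asyncio.run(llm.agenerate_text("evaluate faithfulness")))  # scored:evaluate f
```

Running the sync path in a thread keeps the event loop unblocked without requiring the underlying Judge implementations to grow async APIs.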


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

(Verification scripts executed against hallengray/rag-forge: locating
engines/ragas_adapters.py, tracing RagForgeRagasLLM / RagForgeRagasEmbeddings
usage, checking the ragas pin in pyproject.toml, and inspecting the
Judge.judge and _REFUSAL_NOTE call sites; script outputs elided.)

🌐 Web query:

ragas BaseRagasLLM BaseRagasEmbeddings evaluate() method contracts

💡 Result:

The evaluate method in Ragas is the primary function for performing evaluation on a dataset using specified metrics. Its key contracts regarding BaseRagasLLM and BaseRagasEmbeddings are as follows:

- llm: Optional[BaseRagasLLM | LangchainLLM] - the language model used to generate scores for metrics requiring an LLM. If not provided, Ragas uses a default LLM. This can be overridden at the metric level via metric.llm. Custom LLMs must subclass BaseRagasLLM or be wrapped (e.g., LangchainLLMWrapper).
- embeddings: Optional[BaseRagasEmbeddings | BaseRagasEmbedding | LangchainEmbeddings] - the embeddings model for metrics requiring embeddings. If not provided, Ragas uses a default. Custom embeddings must subclass BaseRagasEmbeddings (or the single-text BaseRagasEmbedding) or be wrapped (e.g., LangchainEmbeddingsWrapper).

BaseRagasLLM is an abstract base class for LLMs in Ragas, providing methods like get_temperature and supporting run_config, caching, and multiple completions. It is used by MetricWithLLM subclasses. BaseRagasEmbeddings extends LangChain's Embeddings ABC, adding run_config support, caching (via embed_query, aembed_query, etc.), and methods like embed_text, embed_texts. BaseRagasEmbedding is a modern single-text embedding ABC with embed_text and aembed_text.

These classes ensure compatibility: metrics initialize with the provided llm/embeddings (raising ValueError if missing when required), set run_config, and use them for scoring (sync/async via _score/_ascore). Example usage:

from ragas import evaluate
result = evaluate(
    dataset=dataset,
    metrics=[...],
    llm=your_BaseRagasLLM_instance,
    embeddings=your_BaseRagasEmbeddings_instance,
)

Returns EvaluationResult with metric scores (e.g., {'context_precision': 0.817, ...}).



🌐 Web query:

ragas 0.4.x LLMResult agenerate_text prompt n temperature stop callbacks

💡 Result:

In Ragas 0.4.x, LLMResult is the return type of the text generation methods on BaseRagasLLM instances, imported from langchain_core.outputs (langchain_core.outputs.LLMResult). It contains generations (a list of ChatGeneration or Generation objects with text, generation_info, etc.). Key methods that produce an LLMResult:

- BaseRagasLLM.generate(prompt: PromptValue, n: int = 1, temperature: Optional[float] = 0.01, stop: Optional[List[str]] = None, callbacks: Callbacks = None) -> LLMResult (async wrapper calling agenerate_text)
- generate_text(prompt: PromptValue, n: int = 1, temperature: Optional[float] = 0.01, stop: Optional[List[str]] = None, callbacks: Callbacks = None) -> LLMResult (sync, implemented by subclasses)
- agenerate_text(prompt: PromptValue, n: int = 1, temperature: Optional[float] = 0.01, stop: Optional[List[str]] = None, callbacks: Callbacks = None) -> LLMResult (async, abstract/implemented in wrappers like LangchainLLMWrapper, OCIGenAIWrapper)

These are called internally by prompts (e.g., StringPrompt.generate calls llm.agenerate_text(...)) and higher-level components. There is no "agenerate_text" method on LLMResult itself; it is the output type, and agenerate_text refers to the abstract async implementation in BaseRagasLLM subclasses. Example usage from source:

llm_result = await llm.agenerate_text(StringPromptValue(text=data), n=1, temperature=temperature, stop=stop, callbacks=callbacks)

Access: llm_result.generations[0][0].text, llm_result.flatten for is_finished checks. Ragas v0.4.x docs confirm this API in /en/v0.4.1/references/llms/ and GitHub src/ragas/llms/base.py.



Implement proper ragas base class inheritance instead of relying on duck typing.

The adapter classes use narrower method signatures (generate_text(str) -> str, agenerate_text(str) -> str) than ragas 0.4.x contracts require (generate_text(prompt: PromptValue, n: int, temperature: float, stop: list, callbacks) -> LLMResult). While duck typing currently works because ragas metrics only call the basic subset, this is fragile. ragas can silently break these adapters on a minor version update if it starts passing additional parameters or expecting LLMResult. Additionally, max_tokens is stored but never passed to the underlying judge, limiting token control. Subclass BaseRagasLLM and BaseRagasEmbeddings directly, implement the full interface, and properly thread max_tokens through to the judge calls. (docs.ragas.io/en/latest/references/evaluate/)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py` around
lines 41 - 79, The RagForgeRagasLLM adapter currently duck-types ragas and
exposes a narrow generate_text( str )->str; update it to subclass ragas'
BaseRagasLLM and implement the full ragas contract: replace class
RagForgeRagasLLM to inherit BaseRagasLLM, change generate_text to the full
signature (accept PromptValue prompt, n:int, temperature:float, stop:list|None,
callbacks:Optional[...] etc.), return an LLMResult object, and ensure
agenerate_text mirrors the async signature and runs the sync implementation via
asyncio.to_thread; also thread the stored max_tokens into the underlying call
(pass max_tokens to _judge.judge or map it into the judge call parameters) and
keep model_name() forwarding to _judge.model_name() so ragas sees the proper
model metadata.
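The full-contract adapter the prompt describes can be sketched as follows. This is a minimal, self-contained illustration: Generation, LLMResult, and BaseRagasLLM are stand-ins defined inline so the sketch runs without ragas installed (the real adapter would import BaseRagasLLM from ragas and LLMResult/Generation from langchain_core.outputs, per the contracts quoted above), and FullContractRagasLLM plus the judge's max_tokens keyword are hypothetical names standing in for the reworked RagForgeRagasLLM.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, List, Optional

# Inline stand-ins so this sketch runs without ragas / langchain-core installed.
@dataclass
class Generation:
    text: str

@dataclass
class LLMResult:
    generations: List[List[Generation]]

class BaseRagasLLM:  # placeholder for the ragas ABC
    pass

class FullContractRagasLLM(BaseRagasLLM):
    """Adapter exposing the full ragas 0.4.x text-generation signature
    and threading max_tokens through to the underlying judge."""

    def __init__(self, judge: Any, max_tokens: int = 1024) -> None:
        self._judge = judge
        self._max_tokens = max_tokens

    def model_name(self) -> str:
        # Forward so ragas sees proper model metadata.
        return self._judge.model_name()

    def generate_text(
        self,
        prompt: Any,  # ragas passes a PromptValue; the sketch also accepts str
        n: int = 1,
        temperature: Optional[float] = 0.01,
        stop: Optional[List[str]] = None,
        callbacks: Any = None,
    ) -> LLMResult:
        text = prompt.to_string() if hasattr(prompt, "to_string") else str(prompt)
        # Pass max_tokens through instead of silently dropping it.
        outputs = [self._judge.judge(text, max_tokens=self._max_tokens) for _ in range(n)]
        return LLMResult(generations=[[Generation(text=o)] for o in outputs])

    async def agenerate_text(
        self,
        prompt: Any,
        n: int = 1,
        temperature: Optional[float] = 0.01,
        stop: Optional[List[str]] = None,
        callbacks: Any = None,
    ) -> LLMResult:
        # Mirror the sync path without blocking the event loop.
        return await asyncio.to_thread(self.generate_text, prompt, n, temperature, stop, callbacks)
```

Because ragas reads generations[i][j].text off the result, returning a properly shaped LLMResult keeps the adapter stable if a future ragas release starts passing n > 1, stop sequences, or callbacks.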

Comment on lines +152 to +175
def test_ragas_evaluator_threads_refusal_aware_into_wrapper():
    """RagasEvaluator(refusal_aware=True) should construct RagForgeRagasLLM
    with refusal_aware=True. We verify by capturing what gets passed to the judge."""
    from rag_forge_evaluator.engines.ragas_evaluator import RagasEvaluator

    class CapturingMockJudge:
        def judge(self, system_prompt: str, user_prompt: str) -> str:
            captured_prompts.append(user_prompt)
            # Return a minimal valid JSON that ragas metrics expect
            return '{"faithfulness": 0.5}'

        def model_name(self) -> str:
            return "capture-judge"

    evaluator = RagasEvaluator(
        judge=CapturingMockJudge(),
        thresholds={},
        refusal_aware=True,
        embeddings_provider="mock",
    )

    # To avoid needing ragas installed, just verify that the evaluator
    # was constructed with refusal_aware=True by checking the attribute
    assert evaluator._refusal_aware is True

⚠️ Potential issue | 🔴 Critical

Make this test exercise the behavior it claims to cover.

captured_prompts is undefined in CapturingMockJudge.judge() (ruff F821), and the test never calls the path that constructs RagForgeRagasLLM anyway—it only inspects a private attribute on RagasEvaluator. As written, it will not catch a regression where the evaluator stops passing refusal_aware into the wrapper.

🧰 Tools
🪛 GitHub Actions: CI

[error] 159-159: ruff check: Undefined name captured_prompts. (F821)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/tests/test_ragas_adapters.py` around lines 152 - 175,
Define a captured_prompts list and actually exercise the code path that
builds/uses RagForgeRagasLLM so the test proves refusal_aware is propagated: add
captured_prompts = [] before CapturingMockJudge and have the test invoke the
evaluator flow that calls judge (e.g., call the evaluator method that triggers
RagForgeRagasLLM -> judge) or monkeypatch/replace RagForgeRagasLLM with a small
fake that captures its init args; then assert the captured judge calls happened
and that the wrapper/fake received refusal_aware=True (or that captured_prompts
contains the expected user_prompt), instead of only checking
evaluator._refusal_aware. Reference symbols:
test_ragas_evaluator_threads_refusal_aware_into_wrapper,
CapturingMockJudge.judge, RagasEvaluator, and RagForgeRagasLLM.
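The strengthened test the prompt describes can be illustrated with a self-contained sketch of the capture pattern. Evaluator and Wrapper here are hypothetical stand-ins for RagasEvaluator and RagForgeRagasLLM, and the class-attribute swap plays the role of pytest's monkeypatch.setattr on whatever module attribute the evaluator actually resolves:

```python
from typing import Any

class Wrapper:
    """Stand-in for RagForgeRagasLLM."""
    def __init__(self, judge: Any, refusal_aware: bool = False) -> None:
        self.judge = judge
        self.refusal_aware = refusal_aware

class Evaluator:
    """Stand-in for RagasEvaluator."""
    wrapper_cls = Wrapper  # patchable construction point

    def __init__(self, judge: Any, refusal_aware: bool = True) -> None:
        self._llm = self.wrapper_cls(judge, refusal_aware=refusal_aware)

captured_kwargs: dict = {}

class CapturingWrapper:
    def __init__(self, judge: Any, **kwargs: Any) -> None:
        captured_kwargs.update(kwargs)

def test_refusal_aware_is_forwarded() -> None:
    original = Evaluator.wrapper_cls
    Evaluator.wrapper_cls = CapturingWrapper  # monkeypatch.setattr equivalent
    try:
        Evaluator(judge=object(), refusal_aware=True)
    finally:
        Evaluator.wrapper_cls = original
    # The assertion now covers the forwarding path, not a private attribute.
    assert captured_kwargs == {"refusal_aware": True}
```

The fake never constructs the real wrapper, so the test stays independent of ragas while still failing if the evaluator stops threading the flag through.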

Comment thread packages/evaluator/tests/test_ragas_extractor.py Outdated
Comment on lines +41 to +83
@pytest.mark.ragas_integration
def test_ragas_end_to_end_with_mock_judge():
    """Wire real ragas 0.4.x through our injected wrappers. The exact
    scores depend on ragas's internal prompts, but the contract is:
    - no exception propagates out
    - samples_evaluated matches the input count (or 0 if ragas crashed and everything was skipped)
    - skipped_samples plus metrics cover what ragas tried to do
    """
    samples = _load_samples()
    evaluator = RagasEvaluator(
        judge=MockJudge(),
        thresholds={},
        embeddings_provider="mock",
    )
    result = evaluator.evaluate(samples)

    assert result is not None
    # Either ragas ran and produced metrics OR the whole-batch-crash path
    # populated skipped_samples — both are acceptable outcomes for this
    # contract test. The invariant is that the evaluator did not raise.
    assert len(result.metrics) + len(result.skipped_samples) >= 1


@pytest.mark.ragas_integration
def test_ragas_whole_batch_crash_captured_as_skips():
    """Empty contexts is a known bad input for ragas 0.4.x — assert that
    when it crashes, we emit SkipRecords instead of raising."""
    samples = [
        EvaluationSample(
            query="broken",
            contexts=[],  # ragas rejects empty contexts
            response="ignored",
            sample_id="broken-001",
        ),
    ]
    evaluator = RagasEvaluator(
        judge=MockJudge(),
        thresholds={},
        embeddings_provider="mock",
    )
    # Must not raise. Either ragas tolerated it (unlikely) or we captured skips.
    result = evaluator.evaluate(samples)
    assert result is not None

🧹 Nitpick | 🔵 Trivial

Tighten the contract assertions in these ragas integration tests.

Both docstrings promise more than the assertions enforce. In particular, the crash-path test still passes on an empty EvaluationResult, and the first test never checks the documented samples_evaluated invariant.

Proposed fix
 @pytest.mark.ragas_integration
 def test_ragas_end_to_end_with_mock_judge():
@@
     result = evaluator.evaluate(samples)

     assert result is not None
+    assert result.samples_evaluated in (0, len(samples))
     # Either ragas ran and produced metrics OR the whole-batch-crash path
     # populated skipped_samples — both are acceptable outcomes for this
     # contract test. The invariant is that the evaluator did not raise.
     assert len(result.metrics) + len(result.skipped_samples) >= 1
@@
 @pytest.mark.ragas_integration
 def test_ragas_whole_batch_crash_captured_as_skips():
@@
     # Must not raise. Either ragas tolerated it (unlikely) or we captured skips.
     result = evaluator.evaluate(samples)
     assert result is not None
+    assert result.samples_evaluated in (0, len(samples))
+    assert len(result.metrics) + len(result.skipped_samples) >= 1
+    if result.skipped_samples:
+        assert all(s.sample_id == "broken-001" for s in result.skipped_samples)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@pytest.mark.ragas_integration
def test_ragas_end_to_end_with_mock_judge():
    """Wire real ragas 0.4.x through our injected wrappers. The exact
    scores depend on ragas's internal prompts, but the contract is:
    - no exception propagates out
    - samples_evaluated matches the input count (or 0 if ragas crashed and everything was skipped)
    - skipped_samples plus metrics cover what ragas tried to do
    """
    samples = _load_samples()
    evaluator = RagasEvaluator(
        judge=MockJudge(),
        thresholds={},
        embeddings_provider="mock",
    )
    result = evaluator.evaluate(samples)
    assert result is not None
    # Either ragas ran and produced metrics OR the whole-batch-crash path
    # populated skipped_samples — both are acceptable outcomes for this
    # contract test. The invariant is that the evaluator did not raise.
    assert len(result.metrics) + len(result.skipped_samples) >= 1


@pytest.mark.ragas_integration
def test_ragas_whole_batch_crash_captured_as_skips():
    """Empty contexts is a known bad input for ragas 0.4.x — assert that
    when it crashes, we emit SkipRecords instead of raising."""
    samples = [
        EvaluationSample(
            query="broken",
            contexts=[],  # ragas rejects empty contexts
            response="ignored",
            sample_id="broken-001",
        ),
    ]
    evaluator = RagasEvaluator(
        judge=MockJudge(),
        thresholds={},
        embeddings_provider="mock",
    )
    # Must not raise. Either ragas tolerated it (unlikely) or we captured skips.
    result = evaluator.evaluate(samples)
    assert result is not None
@pytest.mark.ragas_integration
def test_ragas_end_to_end_with_mock_judge():
    """Wire real ragas 0.4.x through our injected wrappers. The exact
    scores depend on ragas's internal prompts, but the contract is:
    - no exception propagates out
    - samples_evaluated matches the input count (or 0 if ragas crashed and everything was skipped)
    - skipped_samples plus metrics cover what ragas tried to do
    """
    samples = _load_samples()
    evaluator = RagasEvaluator(
        judge=MockJudge(),
        thresholds={},
        embeddings_provider="mock",
    )
    result = evaluator.evaluate(samples)
    assert result is not None
    assert result.samples_evaluated in (0, len(samples))
    # Either ragas ran and produced metrics OR the whole-batch-crash path
    # populated skipped_samples — both are acceptable outcomes for this
    # contract test. The invariant is that the evaluator did not raise.
    assert len(result.metrics) + len(result.skipped_samples) >= 1


@pytest.mark.ragas_integration
def test_ragas_whole_batch_crash_captured_as_skips():
    """Empty contexts is a known bad input for ragas 0.4.x — assert that
    when it crashes, we emit SkipRecords instead of raising."""
    samples = [
        EvaluationSample(
            query="broken",
            contexts=[],  # ragas rejects empty contexts
            response="ignored",
            sample_id="broken-001",
        ),
    ]
    evaluator = RagasEvaluator(
        judge=MockJudge(),
        thresholds={},
        embeddings_provider="mock",
    )
    # Must not raise. Either ragas tolerated it (unlikely) or we captured skips.
    result = evaluator.evaluate(samples)
    assert result is not None
    assert result.samples_evaluated in (0, len(samples))
    assert len(result.metrics) + len(result.skipped_samples) >= 1
    if result.skipped_samples:
        assert all(s.sample_id == "broken-001" for s in result.skipped_samples)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/tests/test_ragas_integration.py` around lines 41 - 83,
Update the two ragas tests to assert the documented samples_evaluated / skip
semantics: in test_ragas_end_to_end_with_mock_judge add an assertion that
result.samples_evaluated is either len(samples) or 0 (i.e.,
result.samples_evaluated in {len(samples), 0}) in addition to the existing
metrics/ skipped check; in test_ragas_whole_batch_crash_captured_as_skips assert
result.samples_evaluated is either len(samples) or 0 and, if it is 0, assert
that result.skipped_samples contains the input sample (e.g., sample_id
"broken-001") and that len(result.skipped_samples) == len(samples) so an empty
EvaluationResult no longer passes. Ensure you reference the EvaluationResult
fields samples_evaluated and skipped_samples when adding these assertions.

Comment on lines +131 to +135
total_pixels = baseline_img.width * baseline_img.height
diff_pixels = sum(1 for p in diff.getdata() if p != (0, 0, 0))
diff_ratio = diff_pixels / total_pixels
assert diff_ratio < 0.005, (
f"{diff_pixels}/{total_pixels} pixels differ from baseline ({diff_ratio:.2%}). "

⚠️ Potential issue | 🟡 Minor

Make the advertised 0.5% tolerance inclusive.

The module header says “up to 0.5%” drift is tolerated, but this assertion still fails at exactly 0.5%.

Proposed fix
-    assert diff_ratio < 0.005, (
+    assert diff_ratio <= 0.005, (
         f"{diff_pixels}/{total_pixels} pixels differ from baseline ({diff_ratio:.2%}). "
         f"Re-capture baseline if the template deliberately changed."
     )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
total_pixels = baseline_img.width * baseline_img.height
diff_pixels = sum(1 for p in diff.getdata() if p != (0, 0, 0))
diff_ratio = diff_pixels / total_pixels
assert diff_ratio < 0.005, (
f"{diff_pixels}/{total_pixels} pixels differ from baseline ({diff_ratio:.2%}). "
total_pixels = baseline_img.width * baseline_img.height
diff_pixels = sum(1 for p in diff.getdata() if p != (0, 0, 0))
diff_ratio = diff_pixels / total_pixels
assert diff_ratio <= 0.005, (
f"{diff_pixels}/{total_pixels} pixels differ from baseline ({diff_ratio:.2%}). "
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/tests/test_report_visual_regression.py` around lines 131 -
135, The assertion currently uses a strict less-than check so an exact 0.5% diff
fails; change the assertion to allow the advertised tolerance by using a
less-than-or-equal comparison for diff_ratio (i.e., replace the "< 0.005" check
in the assertion that references total_pixels, diff_pixels and diff_ratio with
"<= 0.005") and keep the existing formatted failure message unchanged.

Comment thread pyproject.toml Outdated
# Optional integrations installed as extras — stubs not available
[[tool.mypy.overrides]]
module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*"]
module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*", "rag_forge_evaluator.*"]

⚠️ Potential issue | 🟠 Major

Do not suppress mypy errors for first-party package imports.

Line 36 adds rag_forge_evaluator.* to ignore_missing_imports; this can hide real typing regressions in your own code while strict=true is enabled.

Suggested fix
-module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*", "rag_forge_evaluator.*"]
+module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*"]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*", "rag_forge_evaluator.*"]
module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` at line 36, The pyproject.toml currently suppresses mypy for
a first-party package by including "rag_forge_evaluator.*" in the module ignore
list; remove "rag_forge_evaluator.*" from the module = [...] entry (the line
containing module = ["anthropic.*", ..., "rag_forge_evaluator.*"]) so mypy will
type-check your own package, or alternatively replace the suppression by adding
proper type information (inline typings, py.typed, or stubs) for
rag_forge_evaluator so it can be checked under strict=true.
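Under that suggestion, the override would list only third-party namespaces, for example (entries abridged from the PR's list; the exact set in the repo may differ):

```toml
[tool.mypy]
strict = true

# Third-party packages without type stubs only; first-party code
# (rag_forge_evaluator) stays under strict checking.
[[tool.mypy.overrides]]
module = ["anthropic.*", "ragas.*", "PIL.*", "voyageai.*"]
ignore_missing_imports = true
```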

The v0.2.0 report generator introduced 24 mypy errors under the
project's strict mode:
- 17x "Missing type arguments for generic type 'dict'" where the new
  helper functions returned bare `dict` / `list[dict]` annotations
- 7x "float / int has incompatible type 'object'" in `_build_cost`
  where `_attr` was typed as returning `object` and mypy rejected
  passing it to numeric constructors

Fix: parameterize every dict annotation as `dict[str, Any]`, retype
`_attr` to return `Any`, and add `typing.Any` to the imports. Also
add `voyageai.*` to the mypy override list so the optional-extra
import in ragas_adapters.py and audit.py doesn't trip missing-stub
errors.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/evaluator/src/rag_forge_evaluator/report/generator.py (1)

780-836: 🧹 Nitpick | 🔵 Trivial

Consider adding v0.2.0 fields to partial JSON for consistency.

The generate_partial_json method doesn't include scoring_mode or refusal_justification in its sample_results array, unlike generate_json. If partial audits can have refusal-classified samples, this creates an inconsistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py` around lines
780 - 836, generate_partial_json omits the scoring_mode and
refusal_justification fields in each sample entry, causing inconsistency with
generate_json; update the sample_results construction in generate_partial_json
to include "scoring_mode": s.scoring_mode and "refusal_justification":
s.refusal_justification (matching the keys and behavior used in generate_json)
so partial reports contain the same per-sample fields as full reports,
preserving null/None values when those attributes are absent.
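A minimal sketch of that fix, with field names taken from the comment; sample_entry and the getattr defaults are illustrative assumptions about how generate_partial_json builds each record:

```python
from types import SimpleNamespace
from typing import Any

def sample_entry(s: Any) -> dict[str, Any]:
    # Mirror generate_json's keys; getattr keeps samples that predate
    # v0.2.0 serializing as null instead of raising AttributeError.
    return {
        "sample_id": s.sample_id,
        "scoring_mode": getattr(s, "scoring_mode", None),
        "refusal_justification": getattr(s, "refusal_justification", None),
    }
```

A pre-v0.2.0 sample such as SimpleNamespace(sample_id="s1") serializes with both new fields set to None, so partial and full reports share one schema.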
♻️ Duplicate comments (2)
pyproject.toml (1)

36-36: ⚠️ Potential issue | 🟠 Major

Remove first-party namespace from ignore_missing_imports override.

Including rag_forge_evaluator.* in this override can hide real typing regressions in your own package while strict = true is enabled.

Suggested patch
-module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*", "voyageai.*", "rag_forge_evaluator.*"]
+module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*", "voyageai.*"]
#!/bin/bash
# Verify that rag_forge_evaluator is a first-party namespace and currently suppressed in mypy overrides.
set -euo pipefail

echo "== pyproject override =="
rg -n -C2 '^\s*module\s*=\s*\[.*rag_forge_evaluator\.\*' pyproject.toml || true

echo
echo "== first-party package location candidates =="
fd -i 'rag_forge_evaluator' packages || true

echo
echo "== python package declarations referencing evaluator =="
rg -n -C2 'name\s*=\s*"rag-forge-evaluator"|packages/evaluator' pyproject.toml packages || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` at line 36, The mypy override in pyproject.toml currently
lists "rag_forge_evaluator.*" under the module ignore_missing_imports override,
which silences type errors in your own package; remove "rag_forge_evaluator.*"
from the module = [...] array so your first-party package is no longer ignored
by the override (leave other third-party entries intact), then run mypy/type
checks to confirm no regressions; locate the entry by searching for the module =
["anthropic.*", ..., "rag_forge_evaluator.*"] line in pyproject.toml and delete
only the rag_forge_evaluator.* item (and any trailing comma if needed).
packages/evaluator/tests/test_ragas_adapters.py (1)

154-168: 🧹 Nitpick | 🔵 Trivial

This test name overstates what is validated.

It currently verifies only internal state (_refusal_aware) and does not prove that RagasEvaluator actually passes the flag into RagForgeRagasLLM construction on the execution path. Consider monkeypatching the wrapper constructor (or wrapper factory point) and asserting the captured refusal_aware argument.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/tests/test_ragas_adapters.py` around lines 154 - 168, The
test test_ragas_evaluator_threads_refusal_aware_into_wrapper only asserts the
internal flag _refusal_aware on RagasEvaluator but doesn't prove the flag is
forwarded to the wrapper; update the test to monkeypatch the wrapper
factory/constructor (the point that creates RagForgeRagasLLM used by
RagasEvaluator) to capture its kwargs and then instantiate RagasEvaluator with
refusal_aware=True and assert the captured call included refusal_aware=True;
locate the creation point in RagasEvaluator (search for where RagForgeRagasLLM
or the wrapper factory is invoked) and replace/patch that constructor within the
test to record the argument instead of actually constructing the real wrapper.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py`:
- Around line 714-719: The computation for safety_refusal_rate is using both the
guarded local variable samples_evaluated and the original
result.samples_evaluated in a redundant conditional; simplify by using the
already-guarded samples_evaluated for the division (i.e., compute refusal_count
= result.scoring_modes_count.get("safety_refusal", 0) and then
safety_refusal_rate = refusal_count / samples_evaluated) and ensure you preserve
the zero-sample semantics by initializing samples_evaluated =
result.samples_evaluated or 1 as currently done.

---

Outside diff comments:
In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py`:
- Around line 780-836: generate_partial_json omits the scoring_mode and
refusal_justification fields in each sample entry, causing inconsistency with
generate_json; update the sample_results construction in generate_partial_json
to include "scoring_mode": s.scoring_mode and "refusal_justification":
s.refusal_justification (matching the keys and behavior used in generate_json)
so partial reports contain the same per-sample fields as full reports,
preserving null/None values when those attributes are absent.

---

Duplicate comments:
In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 154-168: The test
test_ragas_evaluator_threads_refusal_aware_into_wrapper only asserts the
internal flag _refusal_aware on RagasEvaluator but doesn't prove the flag is
forwarded to the wrapper; update the test to monkeypatch the wrapper
factory/constructor (the point that creates RagForgeRagasLLM used by
RagasEvaluator) to capture its kwargs and then instantiate RagasEvaluator with
refusal_aware=True and assert the captured call included refusal_aware=True;
locate the creation point in RagasEvaluator (search for where RagForgeRagasLLM
or the wrapper factory is invoked) and replace/patch that constructor within the
test to record the argument instead of actually constructing the real wrapper.

In `@pyproject.toml`:
- Line 36: The mypy override in pyproject.toml currently lists
"rag_forge_evaluator.*" under the module ignore_missing_imports override, which
silences type errors in your own package; remove "rag_forge_evaluator.*" from
the module = [...] array so your first-party package is no longer ignored by the
override (leave other third-party entries intact), then run mypy/type checks to
confirm no regressions; locate the entry by searching for the module =
["anthropic.*", ..., "rag_forge_evaluator.*"] line in pyproject.toml and delete
only the rag_forge_evaluator.* item (and any trailing comma if needed).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 93235b5a-160a-415f-906b-5f4d6578d355

📥 Commits

Reviewing files that changed from the base of the PR and between 3ef4130 and dd12e1f.

📒 Files selected for processing (5)
  • packages/evaluator/src/rag_forge_evaluator/judge/mock_judge.py
  • packages/evaluator/src/rag_forge_evaluator/report/generator.py
  • packages/evaluator/tests/test_ragas_adapters.py
  • packages/evaluator/tests/test_ragas_extractor.py
  • pyproject.toml
📜 Review details
🔇 Additional comments (19)
packages/evaluator/src/rag_forge_evaluator/judge/mock_judge.py (1)

5-18: Strong backward/forward compatibility in the mock payload.

Including both combined-path fields and legacy per-metric fields here is the right shape for v0.2.0 and keeps both test paths valid.

packages/evaluator/tests/test_ragas_extractor.py (1)

49-93: Behavioral assertions are aligned with v0.2.0 failure semantics.

These tests correctly enforce the new ValueError contract for extraction failures and the evaluate() guard when no judge is provided.

packages/evaluator/tests/test_ragas_adapters.py (1)

79-152: Good adapter coverage across sync/async, provider wiring, and refusal-note prompt behavior.

The patched-client checks and prompt assertions provide solid guardrails for wrapper behavior without external dependencies.

packages/evaluator/src/rag_forge_evaluator/report/generator.py (16)

323-329: Data-handling disclaimer is misleading.

This text claims data is "processed in-memory only" and "no sample data is persisted," but the HTML/JSON reports are written to disk and include raw queries/responses. Additionally, claiming the report "contains no personally-identifiable information" is incorrect if sample data includes PII.

Proposed wording
     data_handling_html = (
-        "Queries, responses, and retrieved contexts are processed in-memory only. "
-        "No sample data is persisted beyond the audit run. "
+        "Queries, responses, and retrieved contexts are processed during the audit run. "
+        "If HTML or JSON reports are generated, included sample content is persisted in those artifacts. "
         "Evaluation inputs are transmitted to the configured judge endpoint "
         "under the operator's API key and subject to that provider's data policy. "
-        "This report contains no personally-identifiable information."
+        "Redact or disable per-sample content before sharing reports that may contain regulated or personally identifiable data."
     )

667-693: Backward-compatibility shim drops caller-provided parameters.

The method signature accepts rmm_level, trends, and sample_results, but the implementation discards all of them and passes hardcoded values to generate_html. This silently produces incorrect metadata (always "RAG-Forge Audit", "llm-judge", empty judge model) regardless of what the caller provides.

Suggested fix: pass through the available parameters
     def generate_html(
         self,
         result: EvaluationResult,
         rmm_level: RMMLevel,
         trends: dict[str, str] | None = None,
         sample_results: list[SampleResult] | None = None,
+        *,
+        project_name: str = "RAG-Forge Audit",
+        evaluator_name: str = "llm-judge",
+        judge_model_display: str = "",
     ) -> Path:
         ...
         html = generate_html(
             result,
-            project_name="RAG-Forge Audit",
-            evaluator_name="llm-judge",
-            judge_model_display="",
+            project_name=project_name,
+            evaluator_name=evaluator_name,
+            judge_model_display=judge_model_display,
         )

31-37: LGTM!

Good use of StrictUndefined to catch missing template variables and select_autoescape for XSS protection. Module-level environment initialization is an appropriate pattern for Jinja2.


82-90: LGTM!

Including both answer_relevance and answer_relevancy variants in _METRIC_DESCRIPTIONS is a good defensive choice for handling different metric naming conventions.


101-112: LGTM!

Clean delegation to RMMScorer keeps scoring logic centralized. Fallbacks for missing lookups are appropriate.


115-170: LGTM!

Good edge-case handling for empty passing/failing lists, and the priority recommendation logic is well-structured with appropriate fallbacks.


173-190: LGTM!

The ladder state logic correctly distinguishes between cleared, current, next, and future levels for visual representation.


193-228: LGTM!

The divide-by-zero guard is correct, and the 30% threshold warning aligns with the PR spec. Case collection logic correctly filters by scoring_mode.


246-256: LGTM!

Good defensive check at lines 247-248 to skip samples with empty metrics before calling min(), and the fallback at lines 254-256 ensures a valid return even when all samples lack metric data.


279-306: LGTM!

Clean transformation functions with appropriate defaults for optional fields.


350-405: LGTM!

Duck-typing approach is pragmatic for optional cost tracking. Division guards with or 1 prevent zero-division errors.


426-456: LGTM!

SVG coordinate mapping is correct: X spans 15→145 with proper single-point centering, and Y inverts correctly for the SVG coordinate system.


469-485: LGTM!

Clean executive summary generation with all relevant metrics included.


493-614: LGTM!

The context dict comprehensively covers the template's expected variables. Using StrictUndefined ensures any mismatch will be caught at render time.


622-648: LGTM!

Legacy helpers retained for backward compatibility with proper edge-case handling.


755-774: LGTM!

The sample_results injection correctly includes per-sample scoring_mode and refusal_justification fields for v0.2.0 compatibility.

Comment on lines +714 to +719
# Compute safety refusal rate — v0.2.0
samples_evaluated = result.samples_evaluated or 1 # guard against divide-by-zero
refusal_count = result.scoring_modes_count.get("safety_refusal", 0)
safety_refusal_rate = (
refusal_count / samples_evaluated if result.samples_evaluated else 0.0
)
🧹 Nitpick | 🔵 Trivial

Simplify the redundant conditional.

The guarded samples_evaluated (line 715) is always ≥ 1, but line 718 then checks result.samples_evaluated (the original unguarded value). While the end result is technically correct (0.0 when no samples exist), the logic is confusing. Consider simplifying:

Cleaner alternative
-        samples_evaluated = result.samples_evaluated or 1  # guard against divide-by-zero
-        refusal_count = result.scoring_modes_count.get("safety_refusal", 0)
-        safety_refusal_rate = (
-            refusal_count / samples_evaluated if result.samples_evaluated else 0.0
-        )
+        refusal_count = result.scoring_modes_count.get("safety_refusal", 0)
+        safety_refusal_rate = (
+            refusal_count / result.samples_evaluated
+            if result.samples_evaluated
+            else 0.0
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py` around lines
714 - 719, The computation for safety_refusal_rate is using both the guarded
local variable samples_evaluated and the original result.samples_evaluated in a
redundant conditional; simplify by using the already-guarded samples_evaluated
for the division (i.e., compute refusal_count =
result.scoring_modes_count.get("safety_refusal", 0) and then safety_refusal_rate
= refusal_count / samples_evaluated) and ensure you preserve the zero-sample
semantics by initializing samples_evaluated = result.samples_evaluated or 1 as
currently done.

Addresses the critical + major review comments on PR #33.

- **ragas_adapters.py** — implement the full ragas 0.4.x BaseRagasLLM
  contract. RagForgeRagasLLM.generate_text / agenerate_text now accept
  PromptValue + n/temperature/stop/callbacks kwargs and return
  LLMResult, so ragas cannot silently break the adapter when it starts
  passing additional parameters on minor version upgrades. Deferred
  langchain_core imports keep the module importable without the optional
  extras. A duck-typed _StringLLMResult fallback lets existing unit
  tests keep passing without pulling langchain as a dev dependency.
  RagForgeRagasEmbeddings also gains aembed_query / aembed_documents to
  match the BaseRagasEmbeddings async contract.

- **audit.py** — when config.judge_model is None on the ragas path,
  normalize to "mock" before the allowlist check so programmatic
  configs that leave judge_model unset no longer trip the guard.

- **engines/__init__.py** — if create_evaluator(engine="ragas") is
  called without a judge, default to MockJudge immediately rather than
  letting the missing-judge error surface mid-evaluate().

- **ragas_evaluator.py** — whole-batch crash path now reports
  samples_evaluated=len(samples) (attempted count) instead of 0, and
  aggregate-extraction path sets passed=False whenever any skip
  record exists, not just when every metric failed its threshold.
  Silent partial success was the v0.1.3 pathology we set out to kill.

- **llm_judge.py** — _are_standard_metrics now does an exact type-set
  match (frozenset of type objects) instead of isinstance, so
  MetricEvaluator subclasses no longer get silently rerouted into the
  combined single-call path that cannot honor their overrides.

- **llm_judge.py** — when _parse_combined_response marks every metric
  skipped for a sample, do not emit a normal SampleResult with
  misleading 0.0 scores or pollute scoring_modes_count with a fake
  "standard" classification. Per-metric outcomes still record the
  skip for aggregation; progress is still reported.

- **generator.py** — compliance footer's data-handling copy no longer
  falsely claims samples stay in-memory only. The text now tells
  operators that HTML/JSON/PDF reports persist sample content and
  does not claim automatic PII redaction.

- **generator.py** — ReportGenerator.generate_html compatibility shim
  now forwards project_name, project_description, evaluator_name,
  judge_model_display, report_number, and report_date to the
  module-level generate_html instead of hardcoding "RAG-Forge Audit"
  and "llm-judge". Callers that rely on the old defaults still work.

- **audit.html.j2** — sparkline SVG always renders the <svg> wrapper
  (preserving the existing "html contains svg" test assertions) but
  guards the polyline/circles behind {% if history_points %}, with a
  "First run — no history yet" text fallback so empty history does
  not produce a malformed SVG. Per-sample scores loop uses
  (sample.scores or {}).items() to tolerate None.

- **pyproject.toml** — add langchain_core.* to the mypy override list
  so the new deferred TYPE_CHECKING imports don't trip missing-stub
  errors on machines where langchain isn't installed.

Local gates:
  uv run ruff check packages/        → All checks passed!
  uv run mypy packages/core/src packages/evaluator/src packages/observability/src
                                     → Success: no issues found in 102 source files
  uv run pytest packages/evaluator/tests -q
                                     → 256 passed, 4 skipped
  rag-forge-eval audit --judge mock  → faithfulness 0.90, overall 0.90, passed: true

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/evaluator/src/rag_forge_evaluator/engines/__init__.py (1)

8-14: ⚠️ Potential issue | 🟠 Major

Plumb the new ragas options through the factory.

RagasEvaluator now accepts max_tokens and embeddings_provider, but create_evaluator() exposes no way to pass either. That leaves the new AuditConfig.ragas_max_tokens / AuditConfig.ragas_embeddings_provider knobs dead on the default path.

Proposed shape
 def create_evaluator(
     engine: str,
     judge: JudgeProvider | None = None,
     thresholds: dict[str, float] | None = None,
     progress: ProgressReporter | None = None,
     refusal_aware: bool = True,
+    ragas_max_tokens: int = 8192,
+    ragas_embeddings_provider: str | None = None,
 ) -> EvaluatorInterface:
@@
     if engine == "ragas":
         from rag_forge_evaluator.engines.ragas_evaluator import RagasEvaluator
         if judge is None:
             from rag_forge_evaluator.judge.mock_judge import MockJudge
             judge = MockJudge()
-        return RagasEvaluator(judge=judge, thresholds=thresholds, refusal_aware=refusal_aware)
+        return RagasEvaluator(
+            judge=judge,
+            thresholds=thresholds,
+            max_tokens=ragas_max_tokens,
+            embeddings_provider=ragas_embeddings_provider,
+            refusal_aware=refusal_aware,
+        )

Also applies to: 24-32

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/engines/__init__.py` around lines
8 - 14, The factory create_evaluator doesn't accept the new Ragas options, so
add optional parameters max_tokens: int | None and embeddings_provider:
EmbeddingsProvider | None to create_evaluator's signature and forward them into
the RagasEvaluator constructor (i.e. where RagasEvaluator(...) is created pass
max_tokens=max_tokens and embeddings_provider=embeddings_provider, and when
applicable fall back to AuditConfig.ragas_max_tokens /
AuditConfig.ragas_embeddings_provider if needed); also make the same
signature/forwarding change for the other create_evaluator variant in this
module so both paths plumb the ragas options through.
♻️ Duplicate comments (3)
packages/evaluator/src/rag_forge_evaluator/report/generator.py (1)

670-713: ⚠️ Potential issue | 🟠 Major

The compatibility wrapper still never supplies sparkline history.

The new template renders from history_points, but this wrapper only accepts trends and then drops it. On the normal audit.py path, the history panel will always fall back to the first-run empty state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py` around lines
670 - 713, The instance method generate_html currently accepts trends but
discards it before calling the module-level generate_html, so the template never
receives the historical sparkline data; update the method to forward the trends
data as the template's history_points (or convert trends into the expected
history_points shape) when calling the module-level generate_html function —
specifically modify the method generate_html in report/generator.py to include a
history_points=... argument (derived from the trends parameter) in the call to
the top-level generate_html(result, ...) so the template can render the
sparkline history.
pyproject.toml (1)

35-37: ⚠️ Potential issue | 🟠 Major

Do not suppress mypy on the repo’s own package.

rag_forge_evaluator.* is first-party code, so this override disables strict type-checking for the package this workspace owns. That hides real regressions instead of isolating optional third-party imports.

#!/bin/bash
set -euo pipefail

echo "mypy override:"
sed -n '34,37p' pyproject.toml

echo
echo "first-party package directories:"
fd -t d '^rag_forge_evaluator$' packages src . || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 35 - 37, The mypy override is incorrectly
silencing type-checks for our own package by including "rag_forge_evaluator.*";
remove "rag_forge_evaluator.*" from the modules list in the
[[tool.mypy.overrides]] block so the repo's package is not subject to
ignore_missing_imports, then run mypy to fix any real type errors surfaced in
the rag_forge_evaluator package (ensure the package name "rag_forge_evaluator"
matches the actual top-level package directory used in the repo).
packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py (1)

346-447: ⚠️ Potential issue | 🟠 Major

Emit SkipRecords for fully skipped combined samples.

Suppressing SampleResult here fixed the misleading worst-case path, but the failure still disappears from EvaluationResult.skipped_samples. HTML/JSON consumers only see an aggregate skip count and lose the sample ID plus parse reason entirely.

Possible fix
 from rag_forge_evaluator.engine import (
     EvaluationResult,
     EvaluationSample,
     EvaluatorInterface,
     MetricResult,
     SampleResult,
+    SkipRecord,
     ScoringMode,
 )
@@
         sample_results = self._partial_sample_results
         scoring_modes_count: dict[str, int] = {}
+        skipped_samples: list[SkipRecord] = []
@@
             all_skipped = sample_skipped == len(metric_names)

-            if not all_skipped:
+            if all_skipped:
+                for name in metric_names:
+                    result = per_metric_results[name]
+                    skipped_samples.append(
+                        SkipRecord(
+                            sample_id=sample.sample_id or sample.query[:40],
+                            metric_name=name,
+                            reason=result.details or "combined judge response could not be parsed",
+                            exception_type="JudgeParseError",
+                        )
+                    )
+            else:
                 worst_metric = min(
                     sample_metric_scores, key=sample_metric_scores.get  # type: ignore[arg-type]
                 )
@@
         return EvaluationResult(
             metrics=aggregated,
             overall_score=round(overall, 4),
             samples_evaluated=len(samples),
             passed=all(m.passed for m in aggregated),
             sample_results=sample_results,
             skipped_evaluations=total_skipped,
             scoring_modes_count=scoring_modes_count,
+            skipped_samples=skipped_samples,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py` around lines
346 - 447, In _evaluate_combined, when a sample is fully skipped (all_skipped
True) emit a SkipRecord that captures sample.sample_id and the parse/refusal
reason (use refusal_justification or a brief "unparseable" message) instead of
silently omitting the sample; append that SkipRecord to whatever collection is
returned as EvaluationResult.skipped_samples (or add a skipped_samples list to
the return payload) so HTML/JSON consumers receive the sample id and reason, and
keep the existing behavior of not creating a SampleResult or mutating
scoring_modes_count.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/evaluator/src/rag_forge_evaluator/audit.py`:
- Around line 160-178: The ragas-specific allowlist wrongly rejects aliases like
"claude-sonnet"; update the validation in the evaluator_engine == "ragas" branch
to use the same alias normalization as _create_judge() (and
_KNOWN_JUDGE_ALIASES) before checking support: map config.judge_model (or the
normalized "mock" default) through _KNOWN_JUDGE_ALIASES or the same
alias-resolution function to obtain the canonical judge name, then validate that
canonical name is one of ("openai","gpt-4o","claude","mock") and perform the
voyageai installation check against the canonical "claude" value so aliases such
as "claude-sonnet" are accepted for ragas.

In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py`:
- Around line 102-105: The adapter stores ragas_max_tokens but never forwards it
to the judge: update the calls to _complete() (and any direct judge invocations
in RagasAdapter methods such as where ragas_max_tokens is set/used) to pass
ragas_max_tokens into the underlying judge call (e.g., as an explicit
max_tokens/max_tokens_override keyword or the third parameter expected by the
judge API) so the judge actually receives and applies the token limit; ensure
both the initial call around the _complete(system_prompt, user_prompt) site and
the other similar call sites (the block covering the later 115-132 region) are
changed to forward ragas_max_tokens consistently.

In `@packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py`:
- Around line 208-239: The combined-metric parsing loop currently accepts any
float (e.g., -1, 2.5, nan, inf) from data[field]; update the logic in the loop
that iterates over _COMBINED_METRIC_FIELDS (where thresholds and
_DEFAULT_THRESHOLDS are used and MetricResult is constructed) so that after
converting with float(data[field]) you validate the value is finite and within
0.0–1.0 (use math.isfinite or equivalent and check 0.0 <= score <= 1.0); if it
is out of range or non-finite, log a warning via logger (including the offending
value and field), set results[field] to a skipped MetricResult with score 0.0,
passed False, and a descriptive details message (similar style to the existing
non-numeric handling), and only create a normal MetricResult when the score is
valid.

In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py`:
- Around line 333-350: The returned report dictionary includes a hardcoded
"page_total_display": "Page 1 of 1" which is invalid for multi-page reports;
remove this hardcoded key from the return value in generator.py (where the dict
with method_html, data_handling_html, issued_date/report_time_utc, etc. is
assembled) or replace it by deriving the string from real pagination metadata
(e.g., a page_count or pagination helper) so that "page_total_display" is either
omitted when unknown or computed like f"Page {current_page} of {total_pages}"
using the actual pagination variables.

---

Outside diff comments:
In `@packages/evaluator/src/rag_forge_evaluator/engines/__init__.py`:
- Around line 8-14: The factory create_evaluator doesn't accept the new Ragas
options, so add optional parameters max_tokens: int | None and
embeddings_provider: EmbeddingsProvider | None to create_evaluator's signature
and forward them into the RagasEvaluator constructor (i.e. where
RagasEvaluator(...) is created pass max_tokens=max_tokens and
embeddings_provider=embeddings_provider, and when applicable fall back to
AuditConfig.ragas_max_tokens / AuditConfig.ragas_embeddings_provider if needed);
also make the same signature/forwarding change for the other create_evaluator
variant in this module so both paths plumb the ragas options through.

---

Duplicate comments:
In `@packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py`:
- Around line 346-447: In _evaluate_combined, when a sample is fully skipped
(all_skipped True) emit a SkipRecord that captures sample.sample_id and the
parse/refusal reason (use refusal_justification or a brief "unparseable"
message) instead of silently omitting the sample; append that SkipRecord to
whatever collection is returned as EvaluationResult.skipped_samples (or add a
skipped_samples list to the return payload) so HTML/JSON consumers receive the
sample id and reason, and keep the existing behavior of not creating a
SampleResult or mutating scoring_modes_count.

In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py`:
- Around line 670-713: The instance method generate_html currently accepts
trends but discards it before calling the module-level generate_html, so the
template never receives the historical sparkline data; update the method to
forward the trends data as the template's history_points (or convert trends into
the expected history_points shape) when calling the module-level generate_html
function — specifically modify the method generate_html in report/generator.py
to include a history_points=... argument (derived from the trends parameter) in
the call to the top-level generate_html(result, ...) so the template can render
the sparkline history.

In `@pyproject.toml`:
- Around line 35-37: The mypy override is incorrectly silencing type-checks for
our own package by including "rag_forge_evaluator.*"; remove
"rag_forge_evaluator.*" from the modules list in the [[tool.mypy.overrides]]
block so the repo's package is not subject to ignore_missing_imports, then run
mypy to fix any real type errors surfaced in the rag_forge_evaluator package
(ensure the package name "rag_forge_evaluator" matches the actual top-level
package directory used in the repo).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 80638afd-ae35-4e76-b876-6f2226fb968d

📥 Commits

Reviewing files that changed from the base of the PR and between dd12e1f and b7a15e2.

📒 Files selected for processing (8)
  • packages/evaluator/src/rag_forge_evaluator/audit.py
  • packages/evaluator/src/rag_forge_evaluator/engines/__init__.py
  • packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
  • packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py
  • packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py
  • packages/evaluator/src/rag_forge_evaluator/report/generator.py
  • packages/evaluator/src/rag_forge_evaluator/report/templates/audit.html.j2
  • pyproject.toml
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: RAG Quality Gate
  • GitHub Check: Lint, Typecheck & Test

Comment thread packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
Comment on lines +102 to +105
passes through from its internal metric prompts. ``max_tokens`` is
retained on the instance so future Judge-protocol extensions can pick
it up; the current Judge interface does not accept it, so the underlying
model's own max-tokens default applies.

⚠️ Potential issue | 🟠 Major

max_tokens is a no-op in the current adapter.

This setting is stored, but _complete() still calls the judge with only (system_prompt, user_prompt). Callers will assume ragas_max_tokens constrains ragas prompts even though provider defaults still apply.

Also applies to: 115-132

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py` around
lines 102 - 105, The adapter stores ragas_max_tokens but never forwards it to
the judge: update the calls to _complete() (and any direct judge invocations in
RagasAdapter methods such as where ragas_max_tokens is set/used) to pass
ragas_max_tokens into the underlying judge call (e.g., as an explicit
max_tokens/max_tokens_override keyword or the third parameter expected by the
judge API) so the judge actually receives and applies the token limit; ensure
both the initial call around the _complete(system_prompt, user_prompt) site and
the other similar call sites (the block covering the later 115-132 region) are
changed to forward ragas_max_tokens consistently.

Comment thread packages/evaluator/src/rag_forge_evaluator/report/generator.py
Comment on lines +333 to +350
return {
"method_html": method_html,
"data_handling_html": data_handling_html,
"regulatory_html": _REGULATORY_ALIGNMENT_HTML,
"authored_by": "RAG-Forge",
"issued_date": report_date,
"issued_utc_time": report_time_utc,
"valid_until": "Next material change",
"limitations_html": (
"This report reflects the pipeline state at the time of the audit. "
"Scores are estimates produced by an LLM judge and should be interpreted "
"alongside human review for high-stakes decisions. "
"RMM level is based on metric thresholds; infrastructure-level requirements "
"(caching, RBAC, drift detection) are not automatically verified."
),
"github_url": "github.com/hallengray/rag-forge",
"page_total_display": "Page 1 of 1",
}

⚠️ Potential issue | 🟡 Minor

Remove the hardcoded Page 1 of 1 attestation text.

Per-sample detail makes multi-page reports common, so this footer becomes false as soon as the report spans more than one page. It should be omitted or derived from real pagination metadata.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py` around lines
333 - 350, The returned report dictionary includes a hardcoded
"page_total_display": "Page 1 of 1" which is invalid for multi-page reports;
remove this hardcoded key from the return value in generator.py (where the dict
with method_html, data_handling_html, issued_date/report_time_utc, etc. is
assembled) or replace it by deriving the string from real pagination metadata
(e.g., a page_count or pagination helper) so that "page_total_display" is either
omitted when unknown or computed like f"Page {current_page} of {total_pages}"
using the actual pagination variables.

Final review pass on PR #33.

- **audit.py** — ragas allowlist now uses `_KNOWN_JUDGE_ALIASES` as the
  single source of truth. Previously rejected the `claude-sonnet` alias
  that `_create_judge()` happily accepted for the llm-judge path. The
  voyageai gate now triggers for both `claude` and `claude-sonnet`.

- **ragas_adapters.py + ragas_evaluator.py** — remove the no-op
  `max_tokens` field from `RagForgeRagasLLM.__init__` and from
  `RagasEvaluator.__init__`. The Judge protocol is
  `(system_prompt, user_prompt) -> str` with no per-call token cap,
  so storing a value that never reached the judge was misleading.
  The underlying judge's own default (ClaudeJudge: 4096, raisable via
  `RAG_FORGE_JUDGE_MAX_TOKENS` env var) applies during ragas runs.
  Proper per-call forwarding is a v0.2.1 follow-up documented in the
  wrapper's docstring.

- **audit.py** — drop `AuditConfig.ragas_max_tokens` for the same
  reason. The field was never wired through the factory, so it was a
  lie in two places.

- **metrics/llm_judge.py** — `_parse_combined_response` now rejects
  non-finite and out-of-range scores (< 0.0 or > 1.0) as malformed
  judge output and marks the metric skipped instead of polluting the
  aggregate. Covers the `-1`, `2.5`, `nan`, `inf` pathologies.

- **report/generator.py** — drop the hardcoded `"Page 1 of 1"` from
  the compliance footer. Per-sample detail makes multi-page reports
  common, and page numbers are handled at print time by the template's
  `@page` CSS counters.

- **report/generator.py** — simplify the redundant `safety_refusal_rate`
  divide-by-zero guard. The `or 1` sentinel and the separate
  `if result.samples_evaluated` branch were double-guarding against
  the same zero case.

- **CHANGELOG.md** — migration step now references the `>=0.4,<0.6`
  pin range instead of claiming ragas "will upgrade to 0.4.x"
  specifically.

- **docs/superpowers/specs/...** — add `text` language tag to the
  architecture diagram fenced block for markdownlint compliance.

- **Tests** — update `test_ragas_adapters.py` and
  `test_pearmedica_regression.py` to drop the now-removed `max_tokens`
  kwarg. Update `test_audit_config_validation.py` to assert the new
  `ragas_embeddings_provider` default instead of the removed
  `ragas_max_tokens` field.

Local gates:
  uv run ruff check packages/        → All checks passed!
  uv run mypy packages/core/src packages/evaluator/src packages/observability/src
                                     → Success: no issues found in 102 source files
  uv run pytest packages/evaluator/tests -q
                                     → 255 passed, 4 skipped
  rag-forge-eval audit --judge mock  → faithfulness 0.90, overall 0.90, passed: true
…-refresh

# Conflicts:
#	docs/superpowers/specs/2026-04-14-v0.2.0-evaluator-refresh-design.md
@hallengray hallengray merged commit 330465f into main Apr 14, 2026
2 checks passed
@hallengray hallengray deleted the feat/v0.2.0-evaluator-refresh branch April 14, 2026 19:30