feat(evaluator): v0.2.0 evaluator refresh #33
Conversation
Captures the four workstreams locked during brainstorming in response to the PearMedica Cycle 2 audit findings:

- Workstream A: rebuild the RAGAS adapter around injected LLM and embeddings wrappers (RagForgeRagasLLM + RagForgeRagasEmbeddings) so the evaluator stops depending on ragas's version-fragile default picker. Hard cut to ragas>=0.4,<0.6. Adds a [ragas-voyage] extra for Claude-judge users. Deletes the 0.0-on-exception coercion; skips surface in reports.
- Workstream B: default-on refusal-aware scoring with a --strict escape hatch. The judge classifies each sample as standard or safety_refusal inline and applies the appropriate rubric. A 30% spot-check banner guards against judge misclassification.
- Workstream C: rebuild the audit report to top-tier standard. Single Jinja template with a shared browser/print stylesheet, Playwright print-media emulation, and a 14-section information architecture covering the run manifest, RMM hero with historical sparkline, TL;DR callout, cost summary, safety refusals, per-sample detail, and a compliance footer aligned with NIST AI RMF / ISO 42001 / EU AI Act Article 15.
- Workstream D: coordinated 0.2.0 version bump across the core, evaluator, observability, cli, and mcp packages with a migration checklist.

Includes a file-by-file change summary, a four-layer testing strategy, locked decisions from brainstorming, and explicit non-goals.
…k with SkipRecord RagasEvaluator now accepts a judge, max_tokens, and embeddings_provider and injects RagForgeRagasLLM + RagForgeRagasEmbeddings into every ragas evaluate() call — ragas no longer reaches for its own default LLM or embeddings. _extract_ragas_score raises ValueError instead of silently returning 0.0 so broken infrastructure surfaces in the report as SkipRecords rather than producing invisibly-wrong scores (root cause of the PearMedica Cycle 2 silent failure). Whole-batch ragas crashes are also caught and converted to SkipRecords per sample×metric.
…onfig fields

- Add a _voyageai_installed() module-level probe so tests can monkeypatch the voyageai import check without requiring the real SDK to be installed
- Add three new AuditConfig fields: refusal_aware=True, ragas_max_tokens=8192, ragas_embeddings_provider=None
- Replace the _validate_config ragas branch: claude is now allowed when voyageai is installed; it blocks only on the missing voyageai dependency, with a clear install hint pointing at the [ragas-voyage] extra
- Drop the hardcoded 'gpt-4o-mini (RAGAS internal — judge override ignored)' banner — ragas now honors the configured judge end-to-end via the RagForgeRagasLLM wrapper shipped in Wave 1
- Update test_ragas_with_mock_judge_is_blocked → _is_allowed: mock is a valid zero-cost judge with the new wrapper; the old block premise is obsolete
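The monkeypatchable-probe idea above can be sketched like this. All names besides `_voyageai_installed` are hypothetical, and the validation message is paraphrased from the PR description, not copied from the codebase.

```python
import importlib.util


def _voyageai_installed() -> bool:
    # Module-level probe: tests can monkeypatch this single symbol
    # instead of needing the real voyageai SDK installed.
    return importlib.util.find_spec("voyageai") is not None


def validate_claude_judge(probe=_voyageai_installed) -> None:
    # claude is allowed as a ragas judge only when voyageai is present;
    # otherwise fail with an actionable install hint.
    if not probe():
        raise ValueError(
            "claude judge with the ragas engine requires embeddings support: "
            "install the [ragas-voyage] extra"
        )


validate_claude_judge(probe=lambda: True)  # simulated "SDK installed"
try:
    validate_claude_judge(probe=lambda: False)
except ValueError as exc:
    assert "[ragas-voyage]" in str(exc)
```

Routing the check through a module-level function is what makes the test-time monkeypatch trivial: tests replace the probe rather than manipulating `sys.modules`.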
…llm-judge Adds refusal_aware=True parameter to LLMJudgeEvaluator. When enabled, a classification preamble is prepended to the combined single-call prompt so the judge can identify safety-refusal samples (context too thin, response correctly abstained) and apply the refusal rubric — all in one round-trip, zero extra API cost. scoring_mode and refusal_justification are threaded through SampleResult and aggregated into EvaluationResult.scoring_modes_count. Migrates the default evaluation path to a single combined judge call per sample (all four metrics in one JSON response). Per-metric path is preserved for callers who inject custom MetricEvaluator instances (detected via instance type check so patched test evaluators continue to work correctly).
Replaces the old Jinja FileSystemLoader + audit_report.html.j2 approach with a module-level generate_html() function that builds a typed context dict (RMM summary, TL;DR, refusals, worst-case, compliance footer, cost block, sparkline points) and renders the Task 12 audit.html.j2 template. ReportGenerator class preserved as a backward-compatible shim for audit.py.
…fusal_rate to JSON output
Add four new additive fields to the JSON audit report (v0.2.0):
- Top-level 'skipped_samples': list of skip records with sample_id, metric_name, reason, exception_type
- Top-level 'scoring_modes_count': dict of scoring mode counts (e.g. {'standard': 11, 'safety_refusal': 1})
- Top-level 'safety_refusal_rate': float ratio of safety_refusal samples to total samples_evaluated
- Per-sample 'scoring_mode' and 'refusal_justification': tracking the scoring context for each result
All existing JSON fields preserved. Computation handles edge cases (empty reports, zero samples).
8 new tests added covering all combinations.
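The aggregate fields above can be computed with a small helper like this. The function name is hypothetical; the field semantics (mode counts plus a refusal ratio, with the zero-sample edge case handled) follow the list above.

```python
from collections import Counter


def aggregate_scoring_modes(sample_modes: list[str]) -> tuple[dict[str, int], float]:
    """Derive scoring_modes_count and safety_refusal_rate from per-sample
    scoring_mode values, guarding the zero-sample edge case."""
    modes = Counter(sample_modes)
    total = len(sample_modes)
    rate = modes.get("safety_refusal", 0) / total if total else 0.0
    return dict(modes), rate


modes, rate = aggregate_scoring_modes(["standard"] * 11 + ["safety_refusal"])
assert modes == {"standard": 11, "safety_refusal": 1}
assert abs(rate - 1 / 12) < 1e-9
assert aggregate_scoring_modes([]) == ({}, 0.0)  # empty report stays well-defined
```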
- Add return type annotations to _openai_client() and _voyage_client()
- Suppress TC001 for generator.py (types used in signatures)
- Combine nested with statements in test_audit_config_validation.py
- Remove unused pytest imports in test_refusal_aware_scoring.py
- Fix import ordering and formatting per ruff rules
Caution: Review failed. Pull request was closed or merged during review.

Summary by CodeRabbit
Walkthrough

This release updates RAG-Forge from v0.1.3 to v0.2.0 with default refusal-aware scoring, RAGAS LLM/embeddings adapter wrappers, skip-record error handling (no silent 0.0 coercion), revised evaluator wiring and CLI flags, and a rebuilt Jinja HTML/PDF report, plus broad test and fixture additions.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant AuditConfig
    participant EvaluatorFactory
    participant LLMJudgeEvaluator
    participant Judge
    participant ReportGenerator
    User->>CLI: audit --refusal-aware / --strict
    CLI->>CLI: parse flags
    CLI->>AuditConfig: create(refusal_aware=...)
    AuditConfig->>EvaluatorFactory: create_evaluator(refusal_aware=...)
    EvaluatorFactory->>LLMJudgeEvaluator: instantiate(refusal_aware=...)
    User->>LLMJudgeEvaluator: evaluate(samples)
    alt refusal_aware=True
        LLMJudgeEvaluator->>Judge: judge(system + classification preamble + dual rubric)
        Judge-->>LLMJudgeEvaluator: combined JSON (scoring_mode, scores, justification)
        LLMJudgeEvaluator->>LLMJudgeEvaluator: apply safety_refusal rubric -> MetricResults with scoring_mode/refusal_justification
    else refusal_aware=False
        LLMJudgeEvaluator->>Judge: judge(system + standard rubric)
        Judge-->>LLMJudgeEvaluator: JSON scores
        LLMJudgeEvaluator->>LLMJudgeEvaluator: produce MetricResults (standard)
    end
    LLMJudgeEvaluator-->>ReportGenerator: EvaluationResult (metrics, skipped_samples, scoring_modes_count)
    ReportGenerator->>ReportGenerator: build Jinja context
    ReportGenerator-->>User: HTML report (safety refusals, skipped samples, metrics)
```
```mermaid
sequenceDiagram
    participant User
    participant RagasEvaluator
    participant RagForgeRagasLLM
    participant RagForgeRagasEmbeddings
    participant Judge
    participant RagasLib
    User->>RagasEvaluator: evaluate(samples)
    RagasEvaluator->>RagasEvaluator: auto-select embeddings provider
    RagasEvaluator->>RagForgeRagasLLM: create(judge, refusal_aware)
    RagasEvaluator->>RagForgeRagasEmbeddings: create(provider)
    RagasEvaluator->>RagasLib: ragas_evaluate(samples, llm=LLM, embeddings=Embeddings)
    alt refusal_aware=True
        RagForgeRagasLLM->>Judge: generate_text(prompt + refusal note)
        Judge-->>RagForgeRagasLLM: response
    else refusal_aware=False
        RagForgeRagasLLM->>Judge: generate_text(prompt)
        Judge-->>RagForgeRagasLLM: response
    end
    RagasLib-->>RagasEvaluator: raw metric outputs or raise
    alt extraction/exception failure
        RagasEvaluator->>RagasEvaluator: record SkipRecord(sample_id, metric, reason, exception)
        RagasEvaluator-->>User: EvaluationResult with skipped_samples
    else success
        RagasEvaluator-->>User: EvaluationResult with metrics
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Two regressions on PR #33 CI:

1. ruff failures in test_ragas_adapters.py and test_ragas_extractor.py:
   - I001: unsorted import block
   - F401: unused pytest import (top-level), unused SkipRecord import
   - F811: pytest redefined inside a test function
   - F821: leftover captured_prompts reference in a dead stub class

   Fix: clean up the imports, remove the unused stub, sort the import block.

2. RAG Quality Gate: rag-forge-eval audit --judge mock scored faithfulness 0.0 because the combined-path parser (added in v0.2.0) expects {faithfulness, answer_relevance, context_relevance, hallucination} at the top level of the judge response, but MockJudge still returned the legacy per-metric shape (claims/ratings/score).

   Fix: additively include the combined-path fields in MockJudge's default response. Legacy per-metric fields stay so the partial-report tests (which monkey-patch _default_metrics) still work.

Local verification: `rag-forge-eval audit --golden-set eval/golden_set.json --judge mock --evaluator llm-judge` now scores faithfulness 0.90, all four metrics pass, overall 0.90, RMM Level 3.
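The additive MockJudge fix described in point 2 can be sketched as below. The function name and exact default values are hypothetical; the shape follows the comment: legacy per-metric keys preserved, combined-path keys added alongside.

```python
def mock_judge_response() -> dict:
    """Additive fix: keep the legacy per-metric shape (claims/ratings/score)
    and add the top-level combined-path fields the v0.2.0 parser expects."""
    return {
        # legacy per-metric shape, still read by the partial-report tests
        "claims": [],
        "ratings": [],
        "score": 0.9,
        # combined-path fields, new in v0.2.0
        "faithfulness": 0.9,
        "answer_relevance": 0.9,
        "context_relevance": 0.9,
        "hallucination": 0.9,
    }


resp = mock_judge_response()
assert {"claims", "ratings", "score"} <= resp.keys()   # legacy consumers unaffected
assert all(m in resp for m in ("faithfulness", "answer_relevance",
                               "context_relevance", "hallucination"))
```

Keeping both shapes in one payload is what lets the combined-path parser and the monkey-patched legacy tests pass against the same mock.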
Actionable comments posted: 21
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/evaluator/src/rag_forge_evaluator/audit.py (1)
219-225: ⚠️ Potential issue | 🟠 Major

Propagate the new ragas-specific config here.

`ragas_max_tokens` and `ragas_embeddings_provider` were added to `AuditConfig`, but this construction path still only forwards the generic factory args. Any non-default value is silently ignored today, so callers cannot actually raise the ragas token cap or force `mock`/`voyage`/`openai` embeddings.

Proposed fix
```diff
-        evaluator = create_evaluator(
-            self.config.evaluator_engine,
-            judge=judge,
-            thresholds=self.config.thresholds,
-            progress=self._progress,
-            refusal_aware=self.config.refusal_aware,
-        )
+        if self.config.evaluator_engine == "ragas":
+            from rag_forge_evaluator.engines.ragas_evaluator import RagasEvaluator
+
+            evaluator = RagasEvaluator(
+                judge=judge,
+                thresholds=self.config.thresholds,
+                max_tokens=self.config.ragas_max_tokens,
+                embeddings_provider=self.config.ragas_embeddings_provider,
+                refusal_aware=self.config.refusal_aware,
+            )
+        else:
+            evaluator = create_evaluator(
+                self.config.evaluator_engine,
+                judge=judge,
+                thresholds=self.config.thresholds,
+                progress=self._progress,
+                refusal_aware=self.config.refusal_aware,
+            )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/evaluator/src/rag_forge_evaluator/audit.py` around lines 219 - 225, The create_evaluator call in audit.py currently ignores the new AuditConfig fields ragas_max_tokens and ragas_embeddings_provider, so non-default values are dropped; update the create_evaluator invocation (where evaluator is assigned) to pass through these new config values (e.g., ragas_max_tokens=self.config.ragas_max_tokens and ragas_embeddings_provider=self.config.ragas_embeddings_provider) so the factory receives them; ensure create_evaluator’s signature (and any downstream evaluator factory code) accepts and uses these parameters if not already implemented.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CHANGELOG.md`:
- Line 10: Update the migration text that currently says "ragas will upgrade to
0.4.x automatically" to match the constraint used above: replace that phrase
with the exact constraint "ragas>=0.4,<0.6" or with "ragas 0.4.x/0.5.x" so the
statement in the line referencing ragas mirrors the constraint at the top of the
file (the sentence referencing the `ragas` requirement).
In `@docs/superpowers/plans/2026-04-14-v0.2.0-evaluator-refresh.md`:
- Around line 1605-1610: Add a blank line before and after each fenced code
block (e.g., the ```bash block containing "uv pip install 'ragas>=0.4,<0.6'" and
"uv run pytest -m ragas_integration ...") so every ```...``` block is separated
by an empty line above and below; update all similar fenced blocks in the
document to maintain consistent markdown rendering.
In `@docs/superpowers/specs/2026-04-14-v0.2.0-evaluator-refresh-design.md`:
- Line 59: The fenced code block containing the ASCII architecture diagram
should include a language tag to satisfy markdownlint MD040; update the opening
fence for that block by changing the triple backticks to include "text" (i.e.,
```text) for the ASCII diagram block so the diagram is treated as plain text and
the linter passes.
In `@packages/evaluator/src/rag_forge_evaluator/audit.py`:
- Around line 160-174: The validation in the ragas branch rejects
AuditConfig.judge_model when it is still the default None; update the check in
the block that inspects config.evaluator_engine == "ragas" so that it allows
judge to be None (or implicitly treated as "mock") — e.g., treat None as
equivalent to "mock" or skip the allowlist check when judge is None, leaving
normalization to _create_judge() which constructs MockJudge(); ensure references
to judge, config.evaluator_engine, AuditConfig.judge_model, and _create_judge()
are used so the change only affects the ragas-specific validation and preserves
the existing error paths for explicit invalid strings.
In `@packages/evaluator/src/rag_forge_evaluator/engines/__init__.py`:
- Around line 24-26: When engine == "ragas", fail fast if no judge is provided:
before instantiating RagasEvaluator in __init__.py, check that the judge
parameter is not None and raise a descriptive ValueError (e.g., "ragas engine
requires a judge") so we don't defer the error until RagasEvaluator.evaluate();
then proceed to import/return RagasEvaluator(judge=judge, thresholds=thresholds,
refusal_aware=refusal_aware).
In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py`:
- Around line 41-79: The RagForgeRagasLLM adapter currently duck-types ragas and
exposes a narrow generate_text( str )->str; update it to subclass ragas'
BaseRagasLLM and implement the full ragas contract: replace class
RagForgeRagasLLM to inherit BaseRagasLLM, change generate_text to the full
signature (accept PromptValue prompt, n:int, temperature:float, stop:list|None,
callbacks:Optional[...] etc.), return an LLMResult object, and ensure
agenerate_text mirrors the async signature and runs the sync implementation via
asyncio.to_thread; also thread the stored max_tokens into the underlying call
(pass max_tokens to _judge.judge or map it into the judge call parameters) and
keep model_name() forwarding to _judge.model_name() so ragas sees the proper
model metadata.
In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py`:
- Around line 154-205: The failure branch currently appends SkipRecord entries
but returns samples_evaluated=0 and the aggregate branch can mark passed=True
even when some metrics were dropped; update the exception branch to return
samples_evaluated=len(samples) (use the same samples list used for SkipRecord)
and adjust the final passed logic to require that all expected metrics were
produced before marking the run passed (i.e., require len(aggregated) ==
len(_METRIC_NAMES) && all(m.passed for m in aggregated)); ensure you still
compute overall_score the same way and keep scored_count on each MetricResult as
len(samples). Reference: SkipRecord, EvaluationResult, _METRIC_NAMES,
_extract_ragas_score, MetricResult, self._thresholds, _DEFAULT_THRESHOLDS,
aggregated, samples_evaluated, passed.
In `@packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py`:
- Around line 353-387: When all metrics in per_metric_results from
self._evaluate_sample_combined(sample) are skipped, do not build a SampleResult
or increment scoring_modes_count; instead create and append a SkipRecord and
continue. Concretely: after populating sample_metric_scores and computing
sample_skipped, detect if sample_skipped == len(metric_names) and if so import
and use SkipRecord to record the skip (including sample.sample_id,
sample.query/response context and refusal_justification/scoring_mode as
appropriate), do not compute worst_metric or root_cause or update
scoring_modes_count, and do not append to sample_results; otherwise keep the
existing flow (compute worst_metric via min(...), build thresholds_map and
root_cause via _determine_root_cause, update scoring_modes_count, and append
SampleResult). Ensure SkipRecord is imported at top of module.
- Around line 117-128: The guard in _are_standard_metrics uses isinstance()
which is too permissive and lets different instances (or subclassed evaluators
overriding evaluate_sample) pass; instead compare the concrete type multiset so
only the exact default metric types match. Replace the isinstance-based check
with an exact-type comparison, e.g. compare Counter(type(m) for m in metrics) ==
Counter(_STANDARD_METRIC_TYPES) (or an equivalent multiset comparison) inside
_are_standard_metrics so only the exact default metric classes are accepted.
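The exact-type multiset comparison suggested above is a one-liner with `collections.Counter`. The metric classes here are hypothetical stand-ins; the point is that a subclass (which may override `evaluate_sample`) no longer slips through the way `isinstance()` allowed.

```python
from collections import Counter


class Faithfulness: ...
class Relevance: ...
class PatchedFaithfulness(Faithfulness): ...   # e.g. a test double


_STANDARD_METRIC_TYPES = [Faithfulness, Relevance]


def are_standard_metrics(metrics) -> bool:
    # Compare concrete types as a multiset: order-insensitive, but
    # subclasses and missing/extra metrics all fail the check.
    return Counter(type(m) for m in metrics) == Counter(_STANDARD_METRIC_TYPES)


assert are_standard_metrics([Relevance(), Faithfulness()])       # order ignored
assert not are_standard_metrics([PatchedFaithfulness(), Relevance()])  # subclass rejected
assert not are_standard_metrics([Faithfulness()])                # incomplete set rejected
```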
In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py`:
- Around line 673-692: The compatibility shim currently calls generate_html with
hardcoded metadata and drops rmm_level, trends, and sample_results; update the
wrapper so it forwards the caller's arguments into generate_html (e.g., pass
through project_name, evaluator_name, judge_model_display, rmm_level, trends,
and sample_results) or, if you intend a breaking change, make the API explicit
by removing the shim and documenting the new signature; locate the wrapper that
calls generate_html (and references self.output_dir) and ensure it propagates
the original inputs instead of the hardcoded strings or else convert it to a
clearly breaking call.
- Around line 322-328: Update the data_handling_html string in generator.py (the
data_handling_html variable) to remove the incorrect claim that
queries/responses are only processed in-memory and that the report contains no
PII; instead state explicitly that report artifacts (HTML/JSON) may persist
sample data to disk unless redacted and that evaluation inputs may be
transmitted to the configured judge endpoint under the operator's API key and
are subject to that provider's data policy. Keep the message concise, replace
the existing sentences in data_handling_html, and ensure it clearly warns users
about potential persistence of sample data and the need to redact regulated data
before running the audit.
In `@packages/evaluator/src/rag_forge_evaluator/report/templates/audit.html.j2`:
- Around line 1162-1169: The template loops over sample.scores.items() which
will error if sample.scores is None or missing; update the Jinja loop in the
audit.html.j2 template to defensively handle that by using a default/empty
mapping or a conditional check around sample.scores (e.g., use the |default({})
filter or an if sample.scores statement) so the for-loop over
sample.scores.items() only runs when scores exist and avoids runtime failures.
- Around line 933-951: The template currently renders an <svg> with a <polyline>
and <circle> elements based on history_points; add a Jinja guard so when
history_points is empty the sparkline block (the <svg> / <polyline> / <circle>
rendering) is not emitted or a fallback placeholder is shown instead.
Concretely, wrap the existing SVG block (the polyline points generation and the
loop that emits circles) in a Jinja if/else that checks "if history_points" (or
"if history_points|length") and only renders the polyline and circles when
non-empty; otherwise render a simple empty-state element or omit the SVG to
avoid an empty points attribute and missing markers.
In `@packages/evaluator/tests/test_cli_refusal_flags.py`:
- Line 177: Move the pytest import to the top of the module with the other
imports and remove the trailing `import pytest # noqa: E402` line; ensure any
references to pytest in test functions (used around the tests at lines where
pytest is currently referenced) continue to work and drop the E402 noqa comment
since pytest has no circular dependency here.
In `@packages/evaluator/tests/test_pdf_generator.py`:
- Around line 63-69: The test currently only checks that emulate_media was
called but not that it happened before navigation/content calls; update the
assertions to enforce ordering by inspecting mock_page.method_calls: find the
index of the emulate_media call (from emulate_calls or by scanning for call[0]
== "emulate_media") and the index of the goto and/or set_content calls (call[0]
== "goto" and call[0] == "set_content") and assert emulate_index < goto_index
(and emulate_index < set_content_index if relevant) so the test validates that
emulate_media executed before navigation/content operations.
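The ordering assertion the comment asks for is straightforward with `MagicMock.method_calls`, whose `call` entries unpack into `(name, args, kwargs)`. This sketch simulates the generator's call sequence rather than exercising the real Playwright page.

```python
from unittest.mock import MagicMock

page = MagicMock()

# Simulate the generator's call sequence against a mocked Playwright page.
page.emulate_media(media="print")
page.set_content("<html></html>")
page.pdf(path="report.pdf")

# Assert ordering, not mere presence: emulate_media must precede set_content.
names = [name for name, _args, _kwargs in page.method_calls]
assert names.index("emulate_media") < names.index("set_content")
```

Indexing into `method_calls` is the standard way to pin call order on a single mock without reaching for `call_args_list` on each attribute separately.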
In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 3-8: Remove the duplicate local "import pytest" inside the
tests/test_ragas_adapters.py module so it doesn't shadow the top-level import;
keep only the top-level "import pytest" already present near the top of the file
and delete the inner import (the one inside the test scope that causes
F401/F811), then run ruff/formatter to normalize the import block for the
RagForgeRagasLLM and RagForgeRagasEmbeddings tests.
- Around line 152-175: Define a captured_prompts list and actually exercise the
code path that builds/uses RagForgeRagasLLM so the test proves refusal_aware is
propagated: add captured_prompts = [] before CapturingMockJudge and have the
test invoke the evaluator flow that calls judge (e.g., call the evaluator method
that triggers RagForgeRagasLLM -> judge) or monkeypatch/replace RagForgeRagasLLM
with a small fake that captures its init args; then assert the captured judge
calls happened and that the wrapper/fake received refusal_aware=True (or that
captured_prompts contains the expected user_prompt), instead of only checking
evaluator._refusal_aware. Reference symbols:
test_ragas_evaluator_threads_refusal_aware_into_wrapper,
CapturingMockJudge.judge, RagasEvaluator, and RagForgeRagasLLM.
In `@packages/evaluator/tests/test_ragas_extractor.py`:
- Around line 19-23: The import list in the test module unnecessarily includes
SkipRecord which causes a lint error; edit the import statement that currently
imports EvaluationSample and SkipRecord from rag_forge_evaluator.engine and
remove SkipRecord so only used symbols (e.g., EvaluationSample) are imported;
ensure the other imports (RagasEvaluator, _extract_ragas_score) remain unchanged
to avoid breaking references in the tests.
In `@packages/evaluator/tests/test_ragas_integration.py`:
- Around line 41-83: Update the two ragas tests to assert the documented
samples_evaluated / skip semantics: in test_ragas_end_to_end_with_mock_judge add
an assertion that result.samples_evaluated is either len(samples) or 0 (i.e.,
result.samples_evaluated in {len(samples), 0}) in addition to the existing
metrics/ skipped check; in test_ragas_whole_batch_crash_captured_as_skips assert
result.samples_evaluated is either len(samples) or 0 and, if it is 0, assert
that result.skipped_samples contains the input sample (e.g., sample_id
"broken-001") and that len(result.skipped_samples) == len(samples) so an empty
EvaluationResult no longer passes. Ensure you reference the EvaluationResult
fields samples_evaluated and skipped_samples when adding these assertions.
In `@packages/evaluator/tests/test_report_visual_regression.py`:
- Around line 131-135: The assertion currently uses a strict less-than check so
an exact 0.5% diff fails; change the assertion to allow the advertised tolerance
by using a less-than-or-equal comparison for diff_ratio (i.e., replace the "<
0.005" check in the assertion that references total_pixels, diff_pixels and
diff_ratio with "<= 0.005") and keep the existing formatted failure message
unchanged.
In `@pyproject.toml`:
- Line 36: The pyproject.toml currently suppresses mypy for a first-party
package by including "rag_forge_evaluator.*" in the module ignore list; remove
"rag_forge_evaluator.*" from the module = [...] entry (the line containing
module = ["anthropic.*", ..., "rag_forge_evaluator.*"]) so mypy will type-check
your own package, or alternatively replace the suppression by adding proper type
information (inline typings, py.typed, or stubs) for rag_forge_evaluator so it
can be checked under strict=true.
---
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: cb95c54b-c1aa-404f-95cd-d939bdf537b5
📒 Files selected for processing (43)

- .gitignore
- CHANGELOG.md
- docs/evaluators.md
- docs/superpowers/plans/2026-04-14-v0.2.0-evaluator-refresh.md
- docs/superpowers/specs/2026-04-14-v0.2.0-evaluator-refresh-design.md
- packages/cli/package.json
- packages/core/pyproject.toml
- packages/evaluator/pyproject.toml
- packages/evaluator/src/rag_forge_evaluator/audit.py
- packages/evaluator/src/rag_forge_evaluator/cli.py
- packages/evaluator/src/rag_forge_evaluator/engine.py
- packages/evaluator/src/rag_forge_evaluator/engines/__init__.py
- packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
- packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py
- packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py
- packages/evaluator/src/rag_forge_evaluator/report/generator.py
- packages/evaluator/src/rag_forge_evaluator/report/pdf.py
- packages/evaluator/src/rag_forge_evaluator/report/templates/__init__.py
- packages/evaluator/src/rag_forge_evaluator/report/templates/audit.html.j2
- packages/evaluator/tests/fixtures/ragas_regression/__init__.py
- packages/evaluator/tests/fixtures/ragas_regression/cycle-2-excerpt.jsonl
- packages/evaluator/tests/fixtures/ragas_tiny_fixture.json
- packages/evaluator/tests/fixtures/visual_baseline/.gitkeep
- packages/evaluator/tests/test_audit_config_validation.py
- packages/evaluator/tests/test_cli_refusal_flags.py
- packages/evaluator/tests/test_engine_types.py
- packages/evaluator/tests/test_enhanced_report.py
- packages/evaluator/tests/test_evaluator_factory.py
- packages/evaluator/tests/test_json_report.py
- packages/evaluator/tests/test_llm_judge_aggregation.py
- packages/evaluator/tests/test_pdf_generator.py
- packages/evaluator/tests/test_pearmedica_regression.py
- packages/evaluator/tests/test_ragas_adapters.py
- packages/evaluator/tests/test_ragas_extractor.py
- packages/evaluator/tests/test_ragas_integration.py
- packages/evaluator/tests/test_refusal_aware_scoring.py
- packages/evaluator/tests/test_report.py
- packages/evaluator/tests/test_report_generator.py
- packages/evaluator/tests/test_report_template.py
- packages/evaluator/tests/test_report_visual_regression.py
- packages/mcp/package.json
- packages/observability/pyproject.toml
- pyproject.toml
📜 Review details
🧰 Additional context used
🪛 GitHub Actions: CI
packages/evaluator/tests/test_ragas_extractor.py
[error] 19-19: ruff check: rag_forge_evaluator.engine.SkipRecord imported but unused. (F401)
packages/evaluator/tests/test_ragas_adapters.py
[error] 1-8: ruff check: Import block is un-sorted or un-formatted. (Organize imports)
[error] 6-6: ruff check: pytest imported but unused. (F401)
[error] 106-106: ruff check: Redefinition of unused pytest from line 6. (F811)
[error] 159-159: ruff check: Undefined name captured_prompts. (F821)
🪛 LanguageTool
docs/evaluators.md
[style] ~13-~13: Consider an alternative for the overused word “exactly”.
Context: ...-relevance — even though the refusal is exactly what the pipeline was designed to do. ...
(EXACTLY_PRECISELY)
docs/superpowers/specs/2026-04-14-v0.2.0-evaluator-refresh-design.md
[style] ~36-~36: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...rs.** DeepEval engine is untouched. - No core pipeline changes. `packages/core...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~37-~37: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...rsion bump for release consistency. - No offline / air-gapped embeddings. `sen...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~38-~38: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ers` support is deferred to v0.2.1. - No vercel.ts migration or other infrastr...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~39-~39: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...e churn unrelated to the evaluator. - No ragas 0.2.x compatibility. Hard cut t...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~209-~209: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...or aggregate reporting. - SampleScore gains the same two fields. **audit.py / `A...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[grammar] ~267-~267: Use a hyphen to join words.
Context: ...ds with judge justification. 11. Worst performing case callout — sample ID, h...
(QB_NEW_EN_HYPHEN)
🪛 markdownlint-cli2 (0.22.0)
docs/superpowers/specs/2026-04-14-v0.2.0-evaluator-refresh-design.md
[warning] 59-59: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
docs/superpowers/plans/2026-04-14-v0.2.0-evaluator-refresh.md
[warning] 1608-1608: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 2231-2231: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 2324-2324: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 2333-2333: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 2341-2341: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 2348-2348: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 2360-2360: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 2370-2370: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
🔇 Additional comments (38)
.gitignore (1)
71-72: Looks good — scratch workspace is correctly excluded. Line 72 cleanly prevents accidental commits of local brainstorming artifacts.
packages/evaluator/tests/fixtures/ragas_tiny_fixture.json (1)
1-18: Fixture structure is solid and test-friendly. Both samples are minimal, deterministic, and include the required evaluation fields.
packages/mcp/package.json (1)
3-3: Version bump is consistent with release scope. Line 3 aligns MCP package metadata with the v0.2.0 rollout.
packages/core/pyproject.toml (1)
3-3: Version update is correct for coordinated release. Line 3 cleanly aligns `rag-forge-core` with v0.2.0.

packages/observability/pyproject.toml (1)
3-3: Version bump looks good. Line 3 keeps the observability package version in sync with the v0.2.0 release train.
packages/cli/package.json (1)
3-3: CLI version alignment is correct. Line 3 properly reflects the v0.2.0 release.
packages/evaluator/src/rag_forge_evaluator/report/pdf.py (1)
72-79: Print-media PDF rendering update looks correct. Line 72 and Lines 74-79 correctly force print media and honor CSS page sizing, which matches the new print-oriented template behavior.

packages/evaluator/tests/test_evaluator_factory.py (1)
packages/evaluator/tests/test_evaluator_factory.py (1)
24-29: Test setup now matches v0.2.0 ragas construction requirements.Passing
MockJudge()on Line 28 is the right fix to keep this test focused on deferred dependency import behavior.packages/evaluator/tests/test_report.py (2)
33-35: RMM badge assertions are aligned with the new template format. The `"Level"` + `"Better Trust"` checks correctly match the updated header rendering.
40-42: Metric display-name assertion update is correct. Checking `"Faithfulness"` (title-cased) reflects the generator's display transformation and keeps the test contract accurate.
packages/evaluator/tests/test_enhanced_report.py (2)
11-25: SVG assertion update matches the new sparkline implementation. The revised check is consistent with the template's `<svg>` + polyline/circle rendering.
27-43: Backward-compatibility test for the `trends` kwarg is on point. This keeps API compatibility covered without asserting removed visual behavior.
packages/evaluator/src/rag_forge_evaluator/cli.py (2)
packages/evaluator/src/rag_forge_evaluator/cli.py (2)
57-59: Use explicit tri-state resolution for refusal flags. Line 58 currently ignores `refusal_aware_flag`; today it works by default, but it makes precedence less explicit and future default changes riskier.
♻️ Proposed refactor
```diff
-# Compute refusal_aware from mutually-exclusive flags
-strict = bool(getattr(args, "strict", False) or getattr(args, "no_refusal_aware_flag", False))
-refusal_aware = not strict
+# Compute refusal_aware from mutually-exclusive flags (explicit precedence)
+if bool(getattr(args, "strict", False) or getattr(args, "no_refusal_aware_flag", False)):
+    refusal_aware = False
+elif bool(getattr(args, "refusal_aware_flag", False)):
+    refusal_aware = True
+else:
+    refusal_aware = True  # default for v0.2.x
```
311-329: Mutually-exclusive refusal flag group is implemented correctly. The parser-level exclusivity cleanly prevents invalid combinations and keeps CLI behavior predictable.
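For reference, the exclusivity pattern this comment approves can be sketched with argparse's mutually exclusive groups. Flag names mirror the PR's CLI; the `dest` names are assumptions for illustration:

```python
import argparse

def build_refusal_parser() -> argparse.ArgumentParser:
    # Sketch of the three mutually-exclusive refusal flags.
    parser = argparse.ArgumentParser(prog="rag-forge-audit")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--strict", action="store_true")
    group.add_argument("--refusal-aware", dest="refusal_aware_flag", action="store_true")
    group.add_argument("--no-refusal-aware", dest="no_refusal_aware_flag", action="store_true")
    return parser

parser = build_refusal_parser()
assert parser.parse_args(["--strict"]).strict is True

# Conflicting flags trigger an argparse usage error, which raises SystemExit.
try:
    parser.parse_args(["--strict", "--refusal-aware"])
    conflict_rejected = False
except SystemExit:
    conflict_rejected = True
assert conflict_rejected
```

Because argparse enforces exclusivity at parse time, downstream config code never has to reconcile contradictory flags.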
packages/evaluator/tests/test_pdf_generator.py (1)
31-62: Good coverage addition for print-media emulation. This test meaningfully guards the new PDF rendering behavior and mocks the Playwright call chain cleanly.
packages/evaluator/src/rag_forge_evaluator/engines/__init__.py (1)
18-23: Refusal-aware threading into LLMJudgeEvaluator is correct. The new factory parameter is propagated cleanly and keeps behavior centralized.
packages/evaluator/pyproject.toml (3)
34-35: Optional dependency updates are coherent with the new evaluator paths. The `ragas` + `datasets` pairing and the `ragas-voyage` extra are aligned with the new adapter wiring.
49-50: Template force-include is the right packaging safeguard. This ensures `audit.html.j2` is available in wheel installs where `PackageLoader` is used.
52-56: Pytest marker registration is a good addition. Declaring opt-in markers here prevents unknown-marker noise and documents test gating clearly.
packages/evaluator/tests/test_json_report.py (4)
6-13: LGTM: Updated imports align with v0.2.0 data model additions. The imports correctly include `pytest` for `approx` usage and `SkipRecord` for testing the new skip semantics. These align with the engine type changes.
65-92: LGTM: Comprehensive test for the `skipped_samples` field. The test correctly validates the new `skipped_samples` JSON structure, including all four required fields (`sample_id`, `metric_name`, `reason`, `exception_type`). This aligns with the `SkipRecord` dataclass definition and the `generate_json()` implementation.
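For readers outside the diff, the shape implied by those four JSON fields can be sketched as a plain dataclass. This is a hedged stand-in; the real definition lives in `engine.py` and may carry extra fields:

```python
from dataclasses import dataclass

@dataclass
class SkipRecord:
    """Sketch of the per-sample×metric skip record (fields from the JSON contract)."""
    sample_id: str
    metric_name: str
    reason: str
    exception_type: str

# Example: a ragas batch crash recorded as a visible skip instead of a 0.0 score.
rec = SkipRecord(
    sample_id="acs-001",
    metric_name="faithfulness",
    reason="ragas batch crash",
    exception_type="ValueError",
)
assert rec.exception_type == "ValueError"
```

Keeping skips as structured records is what lets the report surface broken infrastructure instead of hiding it behind zero scores.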
109-122: LGTM: Good use of `pytest.approx` for floating-point comparison. Testing `safety_refusal_rate` with `pytest.approx(1 / 12)` is the correct approach for floating-point equality in Python tests, avoiding precision issues.
155-163: LGTM: Edge case coverage for zero samples. Good defensive test ensuring that when `samples_evaluated=0`, the report gracefully handles the division-by-zero case by returning `safety_refusal_rate=0.0` and empty collections.
packages/evaluator/tests/test_ragas_extractor.py (2)
49-58: LGTM: Test expectations correctly updated for v0.2.0 behavior. The tests now properly assert that `ValueError` is raised instead of a silent `0.0` fallback, matching the updated `_extract_ragas_score` implementation that eliminates the bug where broken infrastructure was indistinguishable from zero scores.
74-93: LGTM: Good coverage for RagasEvaluator judge injection. These tests validate the factory threading and the new requirement that `RagasEvaluator` must have a judge injected. The expected metrics list matches `_METRIC_NAMES` from the implementation.
packages/evaluator/tests/test_engine_types.py (1)
1-57: LGTM: Solid unit test coverage for new engine types. The test file correctly validates:
- `SkipRecord` dataclass construction with all required fields
- `MetricResult` accepts the new `scoring_mode` and `refusal_justification` fields
- `EvaluationResult` defaults `skipped_samples` to `[]` and `scoring_modes_count` to `{}`
- `SampleResult` properly stores scoring mode and allows `None` justification

These tests align with the dataclass definitions in `engine.py`.
docs/superpowers/plans/2026-04-14-v0.2.0-evaluator-refresh.md (1)
1-57: LGTM: Comprehensive implementation plan with clear task breakdown. The plan document is well-structured with:
- Clear codebase orientation for new engineers
- File structure plan with responsibilities
- 20 tasks with TDD approach (failing test → implementation → pass)
- Spec coverage self-review checklist
This serves as excellent documentation for the v0.2.0 evaluator refresh.
packages/evaluator/tests/fixtures/ragas_regression/cycle-2-excerpt.jsonl (1)
1-3: LGTM: Well-designed regression fixture for v0.1.3 bug coverage. The three-sample fixture effectively covers:
- acs-001-typical-stemi: Standard clinical case for positive scoring
- htn-001-emergency-end-organ: Another standard case with multi-organ involvement
- t2dm-002-paediatric-dosing-refusal: Refusal case designed to test the new refusal-aware scoring path
The refusal case (line 3) is particularly valuable — it represents the PearMedica scenario where the v0.1.3 evaluator incorrectly penalized valid safety refusals. The sample IDs are unique and follow a consistent `{condition}-{number}-{descriptor}` pattern.
packages/evaluator/tests/test_audit_config_validation.py (3)
50-62: LGTM: Test correctly updated for v0.2.0 ragas+mock behavior. The test now validates that the `ragas` + `mock` judge combination is allowed, reflecting the new `RagForgeRagasLLM` wrapper that honors the configured judge end-to-end. This replaces the old blocking behavior where mock would silently bypass the cost gate.
128-166: LGTM: Good coverage for voyageai conditional logic. The tests properly validate:
- When `_voyageai_installed()` returns `True`, ragas+claude construction succeeds
- When it returns `False`, a `ConfigurationError` is raised with "ragas-voyage" in the message

The monkeypatching approach correctly isolates the voyageai availability check.
169-185: LGTM: Default value assertions match AuditConfig definition. The tests correctly validate:
- `refusal_aware` defaults to `True`
- `ragas_max_tokens` defaults to `8192`
- `ragas_embeddings_provider` defaults to `None` (auto-select)

These match the `AuditConfig` dataclass definition in the context snippet.
9-63: LGTM: Parser reconstruction provides good test isolation. The `_build_parser()` function correctly reconstructs the audit subparser with all the refusal-aware flags in a mutually exclusive group. This mirrors the actual CLI implementation and allows testing without invoking the full CLI `main()`.
66-87: LGTM: Config builder correctly mirrors cli.py logic. The `_build_audit_config_from_args()` function accurately replicates the refusal-aware computation from `cli.py`:

```python
strict = bool(getattr(args, "strict", False) or getattr(args, "no_refusal_aware_flag", False))
refusal_aware = not strict
```

This ensures test assertions reflect actual CLI behavior.
140-174: LGTM: Comprehensive mutual exclusivity tests. Good coverage of all invalid flag combinations:
- `--strict` + `--refusal-aware`
- `--strict` + `--no-refusal-aware`
- `--refusal-aware` + `--no-refusal-aware`

Using `pytest.raises(SystemExit)` correctly handles argparse's behavior on conflicting arguments.
packages/evaluator/src/rag_forge_evaluator/report/templates/audit.html.j2 (2)
1-870: LGTM: Well-structured CSS with proper theming and print support. The CSS architecture is solid:
- CSS custom properties (`--color-*`, `--font-*`) enable consistent theming
- Responsive breakpoints at 720px handle mobile gracefully
- Print media rules properly expand `<details>` elements and prevent page breaks inside cards
- The `@page` rules with `@bottom-right` and `@bottom-left` content are appropriate for PDF generation
1176-1202: LGTM: Conditional skipped samples section with clear explanation. The conditional `{% if skipped_samples %}` block correctly renders only when there are skipped samples, with a helpful note explaining that "Broken infrastructure is visible rather than hidden behind zero scores." This aligns with the v0.2.0 SkipRecord semantics.
packages/evaluator/src/rag_forge_evaluator/engine.py (1)
5-84: Nice backwards-compatible engine model refresh. Adding the new scoring/skip fields as optional or default-factory-backed extensions keeps existing call sites stable while exposing the richer report state this release needs.
packages/evaluator/tests/test_llm_judge_aggregation.py (1)
37-163: Strong regression coverage for the combined-response path. These cases exercise both full-sample and per-metric skip behavior, including the PearMedica pollution scenario that motivated the aggregation fix.
```bash
uv pip install 'ragas>=0.4,<0.6'
uv run pytest -m ragas_integration packages/evaluator/tests/test_ragas_integration.py -v
```

Expected: both tests pass.
🧹 Nitpick | 🔵 Trivial
Optional: Add blank lines around fenced code blocks for markdown consistency.
Several fenced code blocks in this document are not surrounded by blank lines (e.g., lines 1608, 2231, 2324, 2333, 2341, 2348, 2360, 2370). While this doesn't affect functionality, adding blank lines improves markdown rendering consistency across parsers.
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 1608-1608: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/superpowers/plans/2026-04-14-v0.2.0-evaluator-refresh.md` around lines
1605 - 1610, Add a blank line before and after each fenced code block (e.g., the
```bash block containing "uv pip install 'ragas>=0.4,<0.6'" and "uv run pytest
-m ragas_integration ...") so every ```...``` block is separated by an empty
line above and below; update all similar fenced blocks in the document to
maintain consistent markdown rendering.
```diff
 if engine == "ragas":
     from rag_forge_evaluator.engines.ragas_evaluator import RagasEvaluator
-    return RagasEvaluator(thresholds=thresholds)
+    return RagasEvaluator(judge=judge, thresholds=thresholds, refusal_aware=refusal_aware)
```
Fail fast when creating ragas evaluator without a judge.
Line 26 allows judge=None, but ragas now requires a judge and only errors later during evaluate(). This delays failure and weakens diagnostics.
🛠️ Proposed fix

```diff
 if engine == "ragas":
     from rag_forge_evaluator.engines.ragas_evaluator import RagasEvaluator
+    if judge is None:
+        raise ValueError(
+            "Ragas evaluator requires a judge. "
+            "Pass --judge (e.g. mock|claude|openai) or provide a JudgeProvider."
+        )
     return RagasEvaluator(judge=judge, thresholds=thresholds, refusal_aware=refusal_aware)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if engine == "ragas":
    from rag_forge_evaluator.engines.ragas_evaluator import RagasEvaluator
    if judge is None:
        raise ValueError(
            "Ragas evaluator requires a judge. "
            "Pass --judge (e.g. mock|claude|openai) or provide a JudgeProvider."
        )
    return RagasEvaluator(judge=judge, thresholds=thresholds, refusal_aware=refusal_aware)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/evaluator/src/rag_forge_evaluator/engines/__init__.py` around lines
24 - 26, When engine == "ragas", fail fast if no judge is provided: before
instantiating RagasEvaluator in __init__.py, check that the judge parameter is
not None and raise a descriptive ValueError (e.g., "ragas engine requires a
judge") so we don't defer the error until RagasEvaluator.evaluate(); then
proceed to import/return RagasEvaluator(judge=judge, thresholds=thresholds,
refusal_aware=refusal_aware).
```python
class RagForgeRagasLLM:
    """ragas LLM wrapper that forwards to a rag-forge Judge.

    Implements both sync (``generate_text``) and async (``agenerate_text``)
    interfaces that ragas 0.4.x calls internally. The async variant runs
    the sync path in a thread — the underlying Judge implementations are
    sync, and ragas parallelizes metrics at a higher level anyway.
    """

    def __init__(
        self,
        judge: _JudgeLike,
        max_tokens: int = 8192,
        system_prompt: str = "",
        refusal_aware: bool = True,
    ) -> None:
        self._judge = judge
        self.max_tokens = max_tokens
        self._system_prompt = system_prompt
        self._refusal_aware = refusal_aware

    def generate_text(self, prompt: str) -> str:
        """Generate text using the underlying judge."""
        if self._refusal_aware:
            prompt = _REFUSAL_NOTE + prompt
        return self._judge.judge(self._system_prompt, prompt)

    async def agenerate_text(self, prompt: str) -> str:
        """Async variant of generate_text.

        Runs the sync path in a thread to maintain compatibility
        with ragas 0.4.x's async callsites.
        """
        return await asyncio.to_thread(self.generate_text, prompt)

    def model_name(self) -> str:
        """Return the model name from the underlying judge."""
        return self._judge.model_name()
```
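The async bridge in `agenerate_text` above is the standard `asyncio.to_thread` pattern. A minimal standalone demo with a stub judge function (not the real Judge class):

```python
import asyncio

def sync_judge(prompt: str) -> str:
    # Stand-in for the sync Judge.judge call.
    return f"judged:{prompt}"

async def agenerate_text(prompt: str) -> str:
    # Same shape as the adapter: run the sync path in a worker thread
    # so ragas's async callsites are not blocked by the sync judge.
    return await asyncio.to_thread(sync_judge, prompt)

result = asyncio.run(agenerate_text("hello"))
assert result == "judged:hello"
```

This keeps the event loop responsive while the judge call runs; concurrency across metrics is then ragas's responsibility, as the docstring notes.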
🧩 Analysis chain
🏁 Scripts executed (repository: hallengray/rag-forge):

```shell
fd ragas_adapters.py
wc -l packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
find . -type f -name "*.py" -path "*/ragas_adapters.py"
git ls-files | grep ragas_adapters
find . -type f -name "*.py" | grep -i ragas
fd "ragas" --type f
git ls-files | grep packages/evaluator
ls -la packages/ 2>/dev/null || find . -maxdepth 3 -name evaluator -type d
find . -type f -name "*.py" | grep -i "evaluator\|ragas" | head -30
find . -type f -path "*ragas_adapters.py" -o -path "*rag_forge_evaluator*" | head -20
git log --name-only --oneline --max-count=50 | grep -i ragas
git log --name-status --oneline -20 | head -50
find . -type f -name "ragas_adapters.py" 2>/dev/null || git ls-files "*ragas*"
git show HEAD:packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py 2>/dev/null || echo "File not in HEAD"
git status
git diff --name-only --cached
find . \( -name ".git" -o -name "node_modules" \) -prune -o -type f -name "*ragas*" -print
git log --all --name-status --oneline | grep -i ragas | head -20
rg "RagForgeRagasLLM" -l
rg "RagForgeRagasEmbeddings" -A 50
rg "from ragas" -A 2
rg "BaseRagasLLM|BaseRagasEmbeddings"
find . -name "requirements*.txt" -o -name "pyproject.toml" -o -name "setup.py" | head -10
rg "ragas" pyproject.toml
rg "agenerate_text|generate_text" --type python -B 3 -A 8
rg "_REFUSAL_NOTE" -B 5 -A 5
rg "def judge\(" -B 5 -A 10
rg "agenerate_text|RagForgeRagasEmbeddings" --type py -B 2 -A 15
rg "ragas_adapters"
python3 << 'EOF'
import subprocess
# Get all files in repo
result = subprocess.run(['git', 'ls-files'], capture_output=True, text=True)
files = result.stdout.strip().split('\n')
for f in files:
    if 'ragas' in f.lower() or 'rag_forge_evaluator' in f:
        print(f)
EOF
```
🌐 Web query:
ragas BaseRagasLLM BaseRagasEmbeddings evaluate() method contracts
💡 Result:
The evaluate method in Ragas is the primary function for performing evaluation on a dataset using specified metrics. Its key contracts regarding BaseRagasLLM and BaseRagasEmbeddings are as follows:

- `llm: Optional[BaseRagasLLM | LangchainLLM]` - the language model used to generate scores for metrics requiring an LLM. If not provided, Ragas uses a default LLM. This can be overridden at the metric level via `metric.llm`. Custom LLMs must subclass `BaseRagasLLM` or be wrapped (e.g., `LangchainLLMWrapper`).
- `embeddings: Optional[BaseRagasEmbeddings | BaseRagasEmbedding | LangchainEmbeddings]` - the embeddings model for metrics requiring embeddings. If not provided, Ragas uses a default. Custom embeddings must subclass `BaseRagasEmbeddings` (or the single-text `BaseRagasEmbedding`) or be wrapped (e.g., `LangchainEmbeddingsWrapper`).

`BaseRagasLLM` is an abstract base class for LLMs in Ragas, providing methods like `get_temperature` and supporting run_config, caching, and multiple completions. It is used by `MetricWithLLM` subclasses. `BaseRagasEmbeddings` extends LangChain's `Embeddings` ABC, adding run_config support, caching (via `embed_query`, `aembed_query`, etc.), and methods like `embed_text`, `embed_texts`. `BaseRagasEmbedding` is a modern single-text embedding ABC with `embed_text` and `aembed_text`.

These classes ensure compatibility: metrics initialize with the provided llm/embeddings (raising ValueError if missing when required), set run_config, and use them for scoring (sync/async via `_score`/`_ascore`). Example usage:

```python
from ragas import evaluate
result = evaluate(
    dataset=dataset,
    metrics=[...],
    llm=your_BaseRagasLLM_instance,
    embeddings=your_BaseRagasEmbeddings_instance,
)
```

Returns `EvaluationResult` with metric scores (e.g., `{'context_precision': 0.817, ...}`).
Citations:
- 1: https://docs.ragas.io/en/stable/references/evaluate/
- 2: https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/base.py
- 3: https://github.com/explodinggradients/ragas/blob/main/src/ragas/embeddings/base.py
- 4: https://docs.ragas.io/en/stable/references/embeddings/
- 5: https://docs.ragas.io/en/stable/references/llms/
- 6: https://docs.ragas.io/en/v0.2.13/references/evaluate/
🌐 Web query:
ragas 0.4.x LLMResult agenerate_text prompt n temperature stop callbacks
💡 Result:
In Ragas 0.4.x, `LLMResult` is the return type of text generation methods on `BaseRagasLLM` instances, imported from `langchain_core.outputs`. It contains `generations` (a list of `ChatGeneration` or `Generation` objects with `text`, `generation_info`, etc.).

Key methods returning `LLMResult`:
- `BaseRagasLLM.generate(prompt: PromptValue, n: int = 1, temperature: Optional[float] = 0.01, stop: Optional[List[str]] = None, callbacks: Callbacks = None) -> LLMResult` (async wrapper calling `agenerate_text`)
- Subclasses implement `generate_text(...)` (sync) with the same signature
- `agenerate_text(...)` (async, abstract/implemented in wrappers like `LangchainLLMWrapper`, `OCIGenAIWrapper`)

These are called internally by prompts (e.g., `StringPrompt.generate` calls `llm.agenerate_text(...)`) and higher-level components. There is no method of that name on `LLMResult` itself; it is the output type. Example usage from source:

```python
llm_result = await llm.agenerate_text(
    StringPromptValue(text=data), n=1, temperature=temperature, stop=stop, callbacks=callbacks
)
```

Access: `llm_result.generations[0][0].text`; `llm_result.flatten` for `is_finished` checks. Ragas v0.4.x docs confirm this API in /en/v0.4.1/references/llms/ and GitHub src/ragas/llms/base.py.
Citations:
- 1: https://docs.ragas.io/en/v0.4.1/references/llms/
- 2: https://docs.ragas.io/en/stable/references/llms/
- 3: https://docs.ragas.io/en/v0.3.4/references/llms/
- 4: https://docs.ragas.io/en/v0.4.0/references/llms/
- 5: https://github.com/explodinggradients/ragas/blob/main/src/ragas/llms/base.py
- 6: https://github.com/explodinggradients/ragas/blob/298b6827/src/ragas/prompt/base.py
- 7: https://github.com/explodinggradients/ragas/blob/298b6827/src/ragas/llms/base.py
Implement proper ragas base class inheritance instead of relying on duck typing.
The adapter classes use narrower method signatures (generate_text(str) -> str, agenerate_text(str) -> str) than ragas 0.4.x contracts require (generate_text(prompt: PromptValue, n: int, temperature: float, stop: list, callbacks) -> LLMResult). While duck typing currently works because ragas metrics only call the basic subset, this is fragile. ragas can silently break these adapters on a minor version update if it starts passing additional parameters or expecting LLMResult. Additionally, max_tokens is stored but never passed to the underlying judge, limiting token control. Subclass BaseRagasLLM and BaseRagasEmbeddings directly, implement the full interface, and properly thread max_tokens through to the judge calls. (docs.ragas.io/en/latest/references/evaluate/)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py` around
lines 41 - 79, The RagForgeRagasLLM adapter currently duck-types ragas and
exposes a narrow generate_text( str )->str; update it to subclass ragas'
BaseRagasLLM and implement the full ragas contract: replace class
RagForgeRagasLLM to inherit BaseRagasLLM, change generate_text to the full
signature (accept PromptValue prompt, n:int, temperature:float, stop:list|None,
callbacks:Optional[...] etc.), return an LLMResult object, and ensure
agenerate_text mirrors the async signature and runs the sync implementation via
asyncio.to_thread; also thread the stored max_tokens into the underlying call
(pass max_tokens to _judge.judge or map it into the judge call parameters) and
keep model_name() forwarding to _judge.model_name() so ragas sees the proper
model metadata.
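A sketch of the wider signature the finding asks for, with `max_tokens` threaded through. `Generation` and `LLMResult` here are local stand-ins for the `langchain_core.outputs` types so the sketch stays self-contained; a real fix would subclass `ragas.llms.BaseRagasLLM` and return the real `LLMResult`. The judge's `max_tokens` kwarg is also an assumption, since the current Judge protocol takes only `(system_prompt, user_prompt)`:

```python
from dataclasses import dataclass, field

@dataclass
class Generation:
    """Local stand-in for langchain_core.outputs.Generation."""
    text: str

@dataclass
class LLMResult:
    """Local stand-in for langchain_core.outputs.LLMResult."""
    generations: list = field(default_factory=list)

class WideRagasLLM:
    """Sketch of the ragas-0.4.x-shaped LLM signature with max_tokens threaded."""

    def __init__(self, judge, max_tokens: int = 8192) -> None:
        self._judge = judge
        self.max_tokens = max_tokens

    def generate_text(self, prompt, n: int = 1, temperature: float = 0.01,
                      stop=None, callbacks=None) -> LLMResult:
        # Accept either a PromptValue-like object (has .text) or a plain str.
        text = getattr(prompt, "text", None) or str(prompt)
        out = self._judge.judge("", text, max_tokens=self.max_tokens)
        return LLMResult(generations=[[Generation(text=out)]])

class EchoJudge:
    """Toy judge that proves max_tokens actually reaches the call."""
    def judge(self, system_prompt, user_prompt, max_tokens=None):
        return f"{user_prompt}|max={max_tokens}"

llm = WideRagasLLM(EchoJudge(), max_tokens=256)
res = llm.generate_text("q")
assert res.generations[0][0].text == "q|max=256"
```

Matching ragas's documented parameter list (`n`, `temperature`, `stop`, `callbacks`) and return shape (`generations[0][0].text`) is what removes the fragility the review describes.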
```python
def test_ragas_evaluator_threads_refusal_aware_into_wrapper():
    """RagasEvaluator(refusal_aware=True) should construct RagForgeRagasLLM
    with refusal_aware=True. We verify by capturing what gets passed to the judge."""
    from rag_forge_evaluator.engines.ragas_evaluator import RagasEvaluator

    class CapturingMockJudge:
        def judge(self, system_prompt: str, user_prompt: str) -> str:
            captured_prompts.append(user_prompt)
            # Return a minimal valid JSON that ragas metrics expect
            return '{"faithfulness": 0.5}'

        def model_name(self) -> str:
            return "capture-judge"

    evaluator = RagasEvaluator(
        judge=CapturingMockJudge(),
        thresholds={},
        refusal_aware=True,
        embeddings_provider="mock",
    )

    # To avoid needing ragas installed, just verify that the evaluator
    # was constructed with refusal_aware=True by checking the attribute
    assert evaluator._refusal_aware is True
```
Make this test exercise the behavior it claims to cover.
captured_prompts is undefined in CapturingMockJudge.judge() (ruff F821), and the test never calls the path that constructs RagForgeRagasLLM anyway—it only inspects a private attribute on RagasEvaluator. As written, it will not catch a regression where the evaluator stops passing refusal_aware into the wrapper.
🧰 Tools
🪛 GitHub Actions: CI
[error] 159-159: ruff check: Undefined name captured_prompts. (F821)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/evaluator/tests/test_ragas_adapters.py` around lines 152 - 175,
Define a captured_prompts list and actually exercise the code path that
builds/uses RagForgeRagasLLM so the test proves refusal_aware is propagated: add
captured_prompts = [] before CapturingMockJudge and have the test invoke the
evaluator flow that calls judge (e.g., call the evaluator method that triggers
RagForgeRagasLLM -> judge) or monkeypatch/replace RagForgeRagasLLM with a small
fake that captures its init args; then assert the captured judge calls happened
and that the wrapper/fake received refusal_aware=True (or that captured_prompts
contains the expected user_prompt), instead of only checking
evaluator._refusal_aware. Reference symbols:
test_ragas_evaluator_threads_refusal_aware_into_wrapper,
CapturingMockJudge.judge, RagasEvaluator, and RagForgeRagasLLM.
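One way to make the test exercise the wiring is the capture-via-fake pattern the prompt describes. The monkeypatch target path is an assumption; the fake stands in for `RagForgeRagasLLM` and records the constructor kwargs the evaluator forwards:

```python
captured = {}

class FakeRagasLLM:
    """Stand-in that records what the evaluator would pass to RagForgeRagasLLM."""
    def __init__(self, judge, **kwargs):
        captured["judge"] = judge
        captured.update(kwargs)

# In the real test:
#   monkeypatch.setattr(
#       "rag_forge_evaluator.engines.ragas_evaluator.RagForgeRagasLLM",
#       FakeRagasLLM,
#   )
# then construct RagasEvaluator(..., refusal_aware=True) and trigger the
# code path that builds the wrapper. Simulated directly here:
FakeRagasLLM(judge="capture-judge", refusal_aware=True, max_tokens=8192)

assert captured["refusal_aware"] is True
assert captured["max_tokens"] == 8192
```

Asserting on the captured kwargs, rather than a private attribute, is what makes the test fail if the evaluator ever stops forwarding `refusal_aware`.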
```python
@pytest.mark.ragas_integration
def test_ragas_end_to_end_with_mock_judge():
    """Wire real ragas 0.4.x through our injected wrappers. The exact
    scores depend on ragas's internal prompts, but the contract is:
    - no exception propagates out
    - samples_evaluated matches the input count (or 0 if ragas crashed and everything was skipped)
    - skipped_samples plus metrics cover what ragas tried to do
    """
    samples = _load_samples()
    evaluator = RagasEvaluator(
        judge=MockJudge(),
        thresholds={},
        embeddings_provider="mock",
    )
    result = evaluator.evaluate(samples)

    assert result is not None
    # Either ragas ran and produced metrics OR the whole-batch-crash path
    # populated skipped_samples — both are acceptable outcomes for this
    # contract test. The invariant is that the evaluator did not raise.
    assert len(result.metrics) + len(result.skipped_samples) >= 1


@pytest.mark.ragas_integration
def test_ragas_whole_batch_crash_captured_as_skips():
    """Empty contexts is a known bad input for ragas 0.4.x — assert that
    when it crashes, we emit SkipRecords instead of raising."""
    samples = [
        EvaluationSample(
            query="broken",
            contexts=[],  # ragas rejects empty contexts
            response="ignored",
            sample_id="broken-001",
        ),
    ]
    evaluator = RagasEvaluator(
        judge=MockJudge(),
        thresholds={},
        embeddings_provider="mock",
    )
    # Must not raise. Either ragas tolerated it (unlikely) or we captured skips.
    result = evaluator.evaluate(samples)
    assert result is not None
```
🧹 Nitpick | 🔵 Trivial
Tighten the contract assertions in these ragas integration tests.
Both docstrings promise more than the assertions enforce. In particular, the crash-path test still passes on an empty EvaluationResult, and the first test never checks the documented samples_evaluated invariant.
Proposed fix

```diff
 @pytest.mark.ragas_integration
 def test_ragas_end_to_end_with_mock_judge():
 @@
     result = evaluator.evaluate(samples)
     assert result is not None
+    assert result.samples_evaluated in (0, len(samples))
     # Either ragas ran and produced metrics OR the whole-batch-crash path
     # populated skipped_samples — both are acceptable outcomes for this
     # contract test. The invariant is that the evaluator did not raise.
     assert len(result.metrics) + len(result.skipped_samples) >= 1
 @@
 @pytest.mark.ragas_integration
 def test_ragas_whole_batch_crash_captured_as_skips():
 @@
     # Must not raise. Either ragas tolerated it (unlikely) or we captured skips.
     result = evaluator.evaluate(samples)
     assert result is not None
+    assert result.samples_evaluated in (0, len(samples))
+    assert len(result.metrics) + len(result.skipped_samples) >= 1
+    if result.skipped_samples:
+        assert all(s.sample_id == "broken-001" for s in result.skipped_samples)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
@pytest.mark.ragas_integration
def test_ragas_end_to_end_with_mock_judge():
    """Wire real ragas 0.4.x through our injected wrappers. The exact
    scores depend on ragas's internal prompts, but the contract is:
    - no exception propagates out
    - samples_evaluated matches the input count (or 0 if ragas crashed and everything was skipped)
    - skipped_samples plus metrics cover what ragas tried to do
    """
    samples = _load_samples()
    evaluator = RagasEvaluator(
        judge=MockJudge(),
        thresholds={},
        embeddings_provider="mock",
    )
    result = evaluator.evaluate(samples)
    assert result is not None
    # Either ragas ran and produced metrics OR the whole-batch-crash path
    # populated skipped_samples — both are acceptable outcomes for this
    # contract test. The invariant is that the evaluator did not raise.
    assert len(result.metrics) + len(result.skipped_samples) >= 1


@pytest.mark.ragas_integration
def test_ragas_whole_batch_crash_captured_as_skips():
    """Empty contexts is a known bad input for ragas 0.4.x — assert that
    when it crashes, we emit SkipRecords instead of raising."""
    samples = [
        EvaluationSample(
            query="broken",
            contexts=[],  # ragas rejects empty contexts
            response="ignored",
            sample_id="broken-001",
        ),
    ]
    evaluator = RagasEvaluator(
        judge=MockJudge(),
        thresholds={},
        embeddings_provider="mock",
    )
    # Must not raise. Either ragas tolerated it (unlikely) or we captured skips.
    result = evaluator.evaluate(samples)
    assert result is not None
```
With the suggested `samples_evaluated` assertions applied:

```python
@pytest.mark.ragas_integration
def test_ragas_end_to_end_with_mock_judge():
    """Wire real ragas 0.4.x through our injected wrappers. The exact
    scores depend on ragas's internal prompts, but the contract is:
    - no exception propagates out
    - samples_evaluated matches the input count (or 0 if ragas crashed and everything was skipped)
    - skipped_samples plus metrics cover what ragas tried to do
    """
    samples = _load_samples()
    evaluator = RagasEvaluator(
        judge=MockJudge(),
        thresholds={},
        embeddings_provider="mock",
    )
    result = evaluator.evaluate(samples)
    assert result is not None
    assert result.samples_evaluated in (0, len(samples))
    # Either ragas ran and produced metrics OR the whole-batch-crash path
    # populated skipped_samples — both are acceptable outcomes for this
    # contract test. The invariant is that the evaluator did not raise.
    assert len(result.metrics) + len(result.skipped_samples) >= 1


@pytest.mark.ragas_integration
def test_ragas_whole_batch_crash_captured_as_skips():
    """Empty contexts is a known bad input for ragas 0.4.x — assert that
    when it crashes, we emit SkipRecords instead of raising."""
    samples = [
        EvaluationSample(
            query="broken",
            contexts=[],  # ragas rejects empty contexts
            response="ignored",
            sample_id="broken-001",
        ),
    ]
    evaluator = RagasEvaluator(
        judge=MockJudge(),
        thresholds={},
        embeddings_provider="mock",
    )
    # Must not raise. Either ragas tolerated it (unlikely) or we captured skips.
    result = evaluator.evaluate(samples)
    assert result is not None
    assert result.samples_evaluated in (0, len(samples))
    assert len(result.metrics) + len(result.skipped_samples) >= 1
    if result.skipped_samples:
        assert all(s.sample_id == "broken-001" for s in result.skipped_samples)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/evaluator/tests/test_ragas_integration.py` around lines 41 - 83,
Update the two ragas tests to assert the documented samples_evaluated / skip
semantics: in test_ragas_end_to_end_with_mock_judge add an assertion that
result.samples_evaluated is either len(samples) or 0 (i.e.,
result.samples_evaluated in {len(samples), 0}) in addition to the existing
metrics/ skipped check; in test_ragas_whole_batch_crash_captured_as_skips assert
result.samples_evaluated is either len(samples) or 0 and, if it is 0, assert
that result.skipped_samples contains the input sample (e.g., sample_id
"broken-001") and that len(result.skipped_samples) == len(samples) so an empty
EvaluationResult no longer passes. Ensure you reference the EvaluationResult
fields samples_evaluated and skipped_samples when adding these assertions.
```python
total_pixels = baseline_img.width * baseline_img.height
diff_pixels = sum(1 for p in diff.getdata() if p != (0, 0, 0))
diff_ratio = diff_pixels / total_pixels
assert diff_ratio < 0.005, (
    f"{diff_pixels}/{total_pixels} pixels differ from baseline ({diff_ratio:.2%}). "
```
Make the advertised 0.5% tolerance inclusive.
The module header says “up to 0.5%” drift is tolerated, but this assertion still fails at exactly 0.5%.
Proposed fix
```diff
-    assert diff_ratio < 0.005, (
+    assert diff_ratio <= 0.005, (
         f"{diff_pixels}/{total_pixels} pixels differ from baseline ({diff_ratio:.2%}). "
         f"Re-capture baseline if the template deliberately changed."
     )
```

📝 Committable suggestion
```python
total_pixels = baseline_img.width * baseline_img.height
diff_pixels = sum(1 for p in diff.getdata() if p != (0, 0, 0))
diff_ratio = diff_pixels / total_pixels
assert diff_ratio <= 0.005, (
    f"{diff_pixels}/{total_pixels} pixels differ from baseline ({diff_ratio:.2%}). "
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/evaluator/tests/test_report_visual_regression.py` around lines 131 -
135, The assertion currently uses a strict less-than check so an exact 0.5% diff
fails; change the assertion to allow the advertised tolerance by using a
less-than-or-equal comparison for diff_ratio (i.e., replace the "< 0.005" check
in the assertion that references total_pixels, diff_pixels and diff_ratio with
"<= 0.005") and keep the existing formatted failure message unchanged.
```diff
 # Optional integrations installed as extras — stubs not available
 [[tool.mypy.overrides]]
-module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*"]
+module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*", "rag_forge_evaluator.*"]
```
Do not suppress mypy errors for first-party package imports.
Line 36 adds rag_forge_evaluator.* to ignore_missing_imports; this can hide real typing regressions in your own code while strict=true is enabled.
Suggested fix
```diff
-module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*", "rag_forge_evaluator.*"]
+module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*"]
```

📝 Committable suggestion
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` at line 36, The pyproject.toml currently suppresses mypy for
a first-party package by including "rag_forge_evaluator.*" in the module ignore
list; remove "rag_forge_evaluator.*" from the module = [...] entry (the line
containing module = ["anthropic.*", ..., "rag_forge_evaluator.*"]) so mypy will
type-check your own package, or alternatively replace the suppression by adding
proper type information (inline typings, py.typed, or stubs) for
rag_forge_evaluator so it can be checked under strict=true.
The v0.2.0 report generator introduced 24 mypy errors under the project's strict mode:

- 17x "Missing type arguments for generic type 'dict'" where the new helper functions returned bare `dict` / `list[dict]` annotations
- 7x "float / int has incompatible type 'object'" in `_build_cost`, where `_attr` was typed as returning `object` and mypy rejected passing its result to numeric constructors

Fix: parameterize every dict annotation as `dict[str, Any]`, retype `_attr` to return `Any`, and add `typing.Any` to the imports. Also add `voyageai.*` to the mypy override list so the optional-extra import in ragas_adapters.py and audit.py doesn't trip missing-stub errors.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/evaluator/src/rag_forge_evaluator/report/generator.py (1)
780-836: 🧹 Nitpick | 🔵 Trivial
Consider adding v0.2.0 fields to partial JSON for consistency.
The `generate_partial_json` method doesn't include `scoring_mode` or `refusal_justification` in its `sample_results` array, unlike `generate_json`. If partial audits can have refusal-classified samples, this creates an inconsistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py` around lines 780 - 836, generate_partial_json omits the scoring_mode and refusal_justification fields in each sample entry, causing inconsistency with generate_json; update the sample_results construction in generate_partial_json to include "scoring_mode": s.scoring_mode and "refusal_justification": s.refusal_justification (matching the keys and behavior used in generate_json) so partial reports contain the same per-sample fields as full reports, preserving null/None values when those attributes are absent.
♻️ Duplicate comments (2)
pyproject.toml (1)
36-36: ⚠️ Potential issue | 🟠 Major
Remove first-party namespace from `ignore_missing_imports` override.
Including `rag_forge_evaluator.*` in this override can hide real typing regressions in your own package while `strict = true` is enabled.
Suggested patch

```diff
-module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*", "voyageai.*", "rag_forge_evaluator.*"]
+module = ["anthropic.*", "cohere.*", "sentence_transformers.*", "bm25s.*", "jinja2.*", "presidio_analyzer.*", "ragas.*", "deepeval.*", "datasets.*", "redis.*", "playwright.*", "PIL.*", "voyageai.*"]
```

```bash
#!/bin/bash
# Verify that rag_forge_evaluator is a first-party namespace and currently suppressed in mypy overrides.
set -euo pipefail

echo "== pyproject override =="
rg -n -C2 '^\s*module\s*=\s*\[.*rag_forge_evaluator\.\*' pyproject.toml || true
echo

echo "== first-party package location candidates =="
fd -i 'rag_forge_evaluator' packages || true
echo

echo "== python package declarations referencing evaluator =="
rg -n -C2 'name\s*=\s*"rag-forge-evaluator"|packages/evaluator' pyproject.toml packages || true
```
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` at line 36, The mypy override in pyproject.toml currently lists "rag_forge_evaluator.*" under the module ignore_missing_imports override, which silences type errors in your own package; remove "rag_forge_evaluator.*" from the module = [...] array so your first-party package is no longer ignored by the override (leave other third-party entries intact), then run mypy/type checks to confirm no regressions; locate the entry by searching for the module = ["anthropic.*", ..., "rag_forge_evaluator.*"] line in pyproject.toml and delete only the rag_forge_evaluator.* item (and any trailing comma if needed).

packages/evaluator/tests/test_ragas_adapters.py (1)
154-168: 🧹 Nitpick | 🔵 Trivial
This test name overstates what is validated.
It currently verifies only internal state (`_refusal_aware`) and does not prove that `RagasEvaluator` actually passes the flag into `RagForgeRagasLLM` construction on the execution path. Consider monkeypatching the wrapper constructor (or wrapper factory point) and asserting the captured `refusal_aware` argument.
Verify each finding against the current code and only fix it if needed. In `@packages/evaluator/tests/test_ragas_adapters.py` around lines 154 - 168, The test test_ragas_evaluator_threads_refusal_aware_into_wrapper only asserts the internal flag _refusal_aware on RagasEvaluator but doesn't prove the flag is forwarded to the wrapper; update the test to monkeypatch the wrapper factory/constructor (the point that creates RagForgeRagasLLM used by RagasEvaluator) to capture its kwargs and then instantiate RagasEvaluator with refusal_aware=True and assert the captured call included refusal_aware=True; locate the creation point in RagasEvaluator (search for where RagForgeRagasLLM or the wrapper factory is invoked) and replace/patch that constructor within the test to record the argument instead of actually constructing the real wrapper.
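The capture-the-kwargs pattern the comment suggests can be sketched without pulling in the real wrapper. The module and class below are stand-ins for the actual `RagForgeRagasLLM` import site, which a real test would locate inside `RagasEvaluator` and patch with pytest's `monkeypatch`:

```python
import types

# Stand-in for the evaluator module that the real test would import and patch.
engine_module = types.ModuleType("ragas_evaluator_stub")

captured_kwargs: dict = {}


class CapturingWrapper:
    """Records constructor kwargs instead of building a real judge wrapper."""

    def __init__(self, **kwargs: object) -> None:
        captured_kwargs.update(kwargs)


# Equivalent of: monkeypatch.setattr(module, "RagForgeRagasLLM", CapturingWrapper)
engine_module.RagForgeRagasLLM = CapturingWrapper  # type: ignore[attr-defined]

# Simulate the evaluator's wrapper-construction path with the flag under test.
engine_module.RagForgeRagasLLM(judge="mock", refusal_aware=True)

assert captured_kwargs["refusal_aware"] is True
```

This proves the flag was forwarded through the construction call rather than merely stored on the evaluator instance.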
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py`:
- Around line 714-719: The computation for safety_refusal_rate is using both the
guarded local variable samples_evaluated and the original
result.samples_evaluated in a redundant conditional; simplify by using the
already-guarded samples_evaluated for the division (i.e., compute refusal_count
= result.scoring_modes_count.get("safety_refusal", 0) and then
safety_refusal_rate = refusal_count / samples_evaluated) and ensure you preserve
the zero-sample semantics by initializing samples_evaluated =
result.samples_evaluated or 1 as currently done.
---
Outside diff comments:
In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py`:
- Around line 780-836: generate_partial_json omits the scoring_mode and
refusal_justification fields in each sample entry, causing inconsistency with
generate_json; update the sample_results construction in generate_partial_json
to include "scoring_mode": s.scoring_mode and "refusal_justification":
s.refusal_justification (matching the keys and behavior used in generate_json)
so partial reports contain the same per-sample fields as full reports,
preserving null/None values when those attributes are absent.
---
Duplicate comments:
In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 154-168: The test
test_ragas_evaluator_threads_refusal_aware_into_wrapper only asserts the
internal flag _refusal_aware on RagasEvaluator but doesn't prove the flag is
forwarded to the wrapper; update the test to monkeypatch the wrapper
factory/constructor (the point that creates RagForgeRagasLLM used by
RagasEvaluator) to capture its kwargs and then instantiate RagasEvaluator with
refusal_aware=True and assert the captured call included refusal_aware=True;
locate the creation point in RagasEvaluator (search for where RagForgeRagasLLM
or the wrapper factory is invoked) and replace/patch that constructor within the
test to record the argument instead of actually constructing the real wrapper.
In `@pyproject.toml`:
- Line 36: The mypy override in pyproject.toml currently lists
"rag_forge_evaluator.*" under the module ignore_missing_imports override, which
silences type errors in your own package; remove "rag_forge_evaluator.*" from
the module = [...] array so your first-party package is no longer ignored by the
override (leave other third-party entries intact), then run mypy/type checks to
confirm no regressions; locate the entry by searching for the module =
["anthropic.*", ..., "rag_forge_evaluator.*"] line in pyproject.toml and delete
only the rag_forge_evaluator.* item (and any trailing comma if needed).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 93235b5a-160a-415f-906b-5f4d6578d355
📒 Files selected for processing (5)
- packages/evaluator/src/rag_forge_evaluator/judge/mock_judge.py
- packages/evaluator/src/rag_forge_evaluator/report/generator.py
- packages/evaluator/tests/test_ragas_adapters.py
- packages/evaluator/tests/test_ragas_extractor.py
- pyproject.toml
📜 Review details
🔇 Additional comments (19)
packages/evaluator/src/rag_forge_evaluator/judge/mock_judge.py (1)
5-18: Strong backward/forward compatibility in the mock payload.
Including both combined-path fields and legacy per-metric fields here is the right shape for v0.2.0 and keeps both test paths valid.
packages/evaluator/tests/test_ragas_extractor.py (1)
49-93: Behavioral assertions are aligned with v0.2.0 failure semantics.
These tests correctly enforce the new `ValueError` contract for extraction failures and the `evaluate()` guard when no judge is provided.

packages/evaluator/tests/test_ragas_adapters.py (1)
79-152: Good adapter coverage across sync/async, provider wiring, and refusal-note prompt behavior.
The patched-client checks and prompt assertions provide solid guardrails for wrapper behavior without external dependencies.
packages/evaluator/src/rag_forge_evaluator/report/generator.py (16)
323-329: Data-handling disclaimer is misleading.
This text claims data is "processed in-memory only" and "no sample data is persisted," but the HTML/JSON reports are written to disk and include raw queries/responses. Additionally, claiming the report "contains no personally-identifiable information" is incorrect if sample data includes PII.
Proposed wording

```diff
 data_handling_html = (
-    "Queries, responses, and retrieved contexts are processed in-memory only. "
-    "No sample data is persisted beyond the audit run. "
+    "Queries, responses, and retrieved contexts are processed during the audit run. "
+    "If HTML or JSON reports are generated, included sample content is persisted in those artifacts. "
     "Evaluation inputs are transmitted to the configured judge endpoint "
     "under the operator's API key and subject to that provider's data policy. "
-    "This report contains no personally-identifiable information."
+    "Redact or disable per-sample content before sharing reports that may contain regulated or personally identifiable data."
 )
```
667-693: Backward-compatibility shim drops caller-provided parameters.
The method signature accepts `rmm_level`, `trends`, and `sample_results`, but the implementation discards all of them and passes hardcoded values to `generate_html`. This silently produces incorrect metadata (always "RAG-Forge Audit", "llm-judge", empty judge model) regardless of what the caller provides.
Suggested fix: pass through the available parameters

```diff
 def generate_html(
     self,
     result: EvaluationResult,
     rmm_level: RMMLevel,
     trends: dict[str, str] | None = None,
     sample_results: list[SampleResult] | None = None,
+    *,
+    project_name: str = "RAG-Forge Audit",
+    evaluator_name: str = "llm-judge",
+    judge_model_display: str = "",
 ) -> Path:
     ...
     html = generate_html(
         result,
-        project_name="RAG-Forge Audit",
-        evaluator_name="llm-judge",
-        judge_model_display="",
+        project_name=project_name,
+        evaluator_name=evaluator_name,
+        judge_model_display=judge_model_display,
     )
```
31-37: LGTM!
Good use of `StrictUndefined` to catch missing template variables and `select_autoescape` for XSS protection. Module-level environment initialization is an appropriate pattern for Jinja2.

82-90: LGTM!
Including both `answer_relevance` and `answer_relevancy` variants in `_METRIC_DESCRIPTIONS` is a good defensive choice for handling different metric naming conventions.

101-112: LGTM!
Clean delegation to `RMMScorer` keeps scoring logic centralized. Fallbacks for missing lookups are appropriate.

115-170: LGTM!
Good edge-case handling for empty passing/failing lists, and the priority recommendation logic is well-structured with appropriate fallbacks.

173-190: LGTM!
The ladder state logic correctly distinguishes between cleared, current, next, and future levels for visual representation.

193-228: LGTM!
Proper divide-by-zero guard, and the 30% threshold warning aligns with the PR spec. Case collection logic correctly filters by `scoring_mode`.

246-256: LGTM!
Good defensive check at lines 247-248 to skip samples with empty metrics before calling `min()`, and the fallback at lines 254-256 ensures a valid return even when all samples lack metric data.

279-306: LGTM!
Clean transformation functions with appropriate defaults for optional fields.

350-405: LGTM!
The duck-typing approach is pragmatic for optional cost tracking. Division guards with `or 1` prevent zero-division errors.

426-456: LGTM!
SVG coordinate mapping is correct: X spans 15→145 with proper single-point centering, and Y inverts correctly for the SVG coordinate system.

469-485: LGTM!
Clean executive summary generation with all relevant metrics included.

493-614: LGTM!
The context dict comprehensively covers the template's expected variables. Using `StrictUndefined` ensures any mismatch will be caught at render time.

622-648: LGTM!
Legacy helpers retained for backward compatibility with proper edge-case handling.

755-774: LGTM!
The sample_results injection correctly includes per-sample `scoring_mode` and `refusal_justification` fields for v0.2.0 compatibility.
```python
# Compute safety refusal rate — v0.2.0
samples_evaluated = result.samples_evaluated or 1  # guard against divide-by-zero
refusal_count = result.scoring_modes_count.get("safety_refusal", 0)
safety_refusal_rate = (
    refusal_count / samples_evaluated if result.samples_evaluated else 0.0
)
```
🧹 Nitpick | 🔵 Trivial
Simplify the redundant conditional.
The guarded samples_evaluated (line 715) is always ≥ 1, but line 718 then checks result.samples_evaluated (the original unguarded value). While the end result is technically correct (0.0 when no samples exist), the logic is confusing. Consider simplifying:
Cleaner alternative
```diff
-    samples_evaluated = result.samples_evaluated or 1  # guard against divide-by-zero
-    refusal_count = result.scoring_modes_count.get("safety_refusal", 0)
-    safety_refusal_rate = (
-        refusal_count / samples_evaluated if result.samples_evaluated else 0.0
-    )
+    refusal_count = result.scoring_modes_count.get("safety_refusal", 0)
+    safety_refusal_rate = (
+        refusal_count / result.samples_evaluated
+        if result.samples_evaluated
+        else 0.0
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py` around lines
714 - 719, The computation for safety_refusal_rate is using both the guarded
local variable samples_evaluated and the original result.samples_evaluated in a
redundant conditional; simplify by using the already-guarded samples_evaluated
for the division (i.e., compute refusal_count =
result.scoring_modes_count.get("safety_refusal", 0) and then safety_refusal_rate
= refusal_count / samples_evaluated) and ensure you preserve the zero-sample
semantics by initializing samples_evaluated = result.samples_evaluated or 1 as
currently done.
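The simplification under discussion boils down to a single conditional. A standalone sketch (hypothetical helper name) showing that the zero-sample semantics are preserved:

```python
def safety_refusal_rate(samples_evaluated: int, refusal_count: int) -> float:
    # 0.0 when nothing was evaluated; otherwise the plain ratio.
    # No separate "or 1" guard variable is needed.
    return refusal_count / samples_evaluated if samples_evaluated else 0.0
```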
Addresses the critical + major review comments on PR #33.

- **ragas_adapters.py** — implement the full ragas 0.4.x BaseRagasLLM contract. RagForgeRagasLLM.generate_text / agenerate_text now accept PromptValue + n/temperature/stop/callbacks kwargs and return LLMResult, so ragas cannot silently break the adapter when it starts passing additional parameters on minor version upgrades. Deferred langchain_core imports keep the module importable without the optional extras. A duck-typed _StringLLMResult fallback lets existing unit tests keep passing without pulling langchain as a dev dependency. RagForgeRagasEmbeddings also gains aembed_query / aembed_documents to match the BaseRagasEmbeddings async contract.
- **audit.py** — when config.judge_model is None on the ragas path, normalize to "mock" before the allowlist check so programmatic configs that leave judge_model unset no longer trip the guard.
- **engines/__init__.py** — if create_evaluator(engine="ragas") is called without a judge, default to MockJudge immediately rather than letting the missing-judge error surface mid-evaluate().
- **ragas_evaluator.py** — whole-batch crash path now reports samples_evaluated=len(samples) (attempted count) instead of 0, and the aggregate-extraction path sets passed=False whenever any skip record exists, not just when every metric failed its threshold. Silent partial success was the v0.1.3 pathology we set out to kill.
- **llm_judge.py** — _are_standard_metrics now does an exact type-set match (frozenset of type objects) instead of isinstance, so MetricEvaluator subclasses no longer get silently rerouted into the combined single-call path that cannot honor their overrides.
- **llm_judge.py** — when _parse_combined_response marks every metric skipped for a sample, do not emit a normal SampleResult with misleading 0.0 scores or pollute scoring_modes_count with a fake "standard" classification. Per-metric outcomes still record the skip for aggregation; progress is still reported.
- **generator.py** — compliance footer's data-handling copy no longer falsely claims samples stay in-memory only. The text now tells operators that HTML/JSON/PDF reports persist sample content and does not claim automatic PII redaction.
- **generator.py** — ReportGenerator.generate_html compatibility shim now forwards project_name, project_description, evaluator_name, judge_model_display, report_number, and report_date to the module-level generate_html instead of hardcoding "RAG-Forge Audit" and "llm-judge". Callers that rely on the old defaults still work.
- **audit.html.j2** — sparkline SVG always renders the `<svg>` wrapper (preserving the existing "html contains svg" test assertions) but guards the polyline/circles behind `{% if history_points %}`, with a "First run — no history yet" text fallback so empty history does not produce a malformed SVG. Per-sample scores loop uses `(sample.scores or {}).items()` to tolerate None.
- **pyproject.toml** — add langchain_core.* to the mypy override list so the new deferred TYPE_CHECKING imports don't trip missing-stub errors on machines where langchain isn't installed.

Local gates:
- uv run ruff check packages/ → All checks passed!
- uv run mypy packages/core/src packages/evaluator/src packages/observability/src → Success: no issues found in 102 source files
- uv run pytest packages/evaluator/tests -q → 256 passed, 4 skipped
- rag-forge-eval audit --judge mock → faithfulness 0.90, overall 0.90, passed: true
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/evaluator/src/rag_forge_evaluator/engines/__init__.py (1)
8-14: ⚠️ Potential issue | 🟠 Major
Plumb the new ragas options through the factory.
`RagasEvaluator` now accepts `max_tokens` and `embeddings_provider`, but `create_evaluator()` exposes no way to pass either. That leaves the new `AuditConfig.ragas_max_tokens` / `AuditConfig.ragas_embeddings_provider` knobs dead on the default path.
Proposed shape

```diff
 def create_evaluator(
     engine: str,
     judge: JudgeProvider | None = None,
     thresholds: dict[str, float] | None = None,
     progress: ProgressReporter | None = None,
     refusal_aware: bool = True,
+    ragas_max_tokens: int = 8192,
+    ragas_embeddings_provider: str | None = None,
 ) -> EvaluatorInterface:
 @@
     if engine == "ragas":
         from rag_forge_evaluator.engines.ragas_evaluator import RagasEvaluator

         if judge is None:
             from rag_forge_evaluator.judge.mock_judge import MockJudge

             judge = MockJudge()
-        return RagasEvaluator(judge=judge, thresholds=thresholds, refusal_aware=refusal_aware)
+        return RagasEvaluator(
+            judge=judge,
+            thresholds=thresholds,
+            max_tokens=ragas_max_tokens,
+            embeddings_provider=ragas_embeddings_provider,
+            refusal_aware=refusal_aware,
+        )
```

Also applies to: 24-32
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/evaluator/src/rag_forge_evaluator/engines/__init__.py` around lines 8 - 14, The factory create_evaluator doesn't accept the new Ragas options, so add optional parameters max_tokens: int | None and embeddings_provider: EmbeddingsProvider | None to create_evaluator's signature and forward them into the RagasEvaluator constructor (i.e. where RagasEvaluator(...) is created pass max_tokens=max_tokens and embeddings_provider=embeddings_provider, and when applicable fall back to AuditConfig.ragas_max_tokens / AuditConfig.ragas_embeddings_provider if needed); also make the same signature/forwarding change for the other create_evaluator variant in this module so both paths plumb the ragas options through.
♻️ Duplicate comments (3)
packages/evaluator/src/rag_forge_evaluator/report/generator.py (1)
670-713: ⚠️ Potential issue | 🟠 Major
The compatibility wrapper still never supplies sparkline history.
The new template renders from `history_points`, but this wrapper only accepts `trends` and then drops it. On the normal `audit.py` path, the history panel will always fall back to the first-run empty state.
Verify each finding against the current code and only fix it if needed. In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py` around lines 670 - 713, The instance method generate_html currently accepts trends but discards it before calling the module-level generate_html, so the template never receives the historical sparkline data; update the method to forward the trends data as the template's history_points (or convert trends into the expected history_points shape) when calling the module-level generate_html function — specifically modify the method generate_html in report/generator.py to include a history_points=... argument (derived from the trends parameter) in the call to the top-level generate_html(result, ...) so the template can render the sparkline history.pyproject.toml (1)
35-37: ⚠️ Potential issue | 🟠 Major
Do not suppress mypy on the repo's own package.
`rag_forge_evaluator.*` is first-party code, so this override disables strict type-checking for the package this workspace owns. That hides real regressions instead of isolating optional third-party imports.

```bash
#!/bin/bash
set -euo pipefail
echo "mypy override:"
sed -n '34,37p' pyproject.toml
echo
echo "first-party package directories:"
fd -t d '^rag_forge_evaluator$' packages src . || true
```
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 35 - 37, The mypy override is incorrectly silencing type-checks for our own package by including "rag_forge_evaluator.*"; remove "rag_forge_evaluator.*" from the modules list in the [[tool.mypy.overrides]] block so the repo's package is not subject to ignore_missing_imports, then run mypy to fix any real type errors surfaced in the rag_forge_evaluator package (ensure the package name "rag_forge_evaluator" matches the actual top-level package directory used in the repo).packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py (1)
346-447: ⚠️ Potential issue | 🟠 Major
Emit `SkipRecord`s for fully skipped combined samples.
Suppressing `SampleResult` here fixed the misleading worst-case path, but the failure still disappears from `EvaluationResult.skipped_samples`. HTML/JSON consumers only see an aggregate skip count and lose the sample ID plus parse reason entirely.
Possible fix

```diff
 from rag_forge_evaluator.engine import (
     EvaluationResult,
     EvaluationSample,
     EvaluatorInterface,
     MetricResult,
     SampleResult,
+    SkipRecord,
     ScoringMode,
 )
 @@
     sample_results = self._partial_sample_results
     scoring_modes_count: dict[str, int] = {}
+    skipped_samples: list[SkipRecord] = []
 @@
     all_skipped = sample_skipped == len(metric_names)
-    if not all_skipped:
+    if all_skipped:
+        for name in metric_names:
+            result = per_metric_results[name]
+            skipped_samples.append(
+                SkipRecord(
+                    sample_id=sample.sample_id or sample.query[:40],
+                    metric_name=name,
+                    reason=result.details or "combined judge response could not be parsed",
+                    exception_type="JudgeParseError",
+                )
+            )
+    else:
         worst_metric = min(
             sample_metric_scores, key=sample_metric_scores.get  # type: ignore[arg-type]
         )
 @@
     return EvaluationResult(
         metrics=aggregated,
         overall_score=round(overall, 4),
         samples_evaluated=len(samples),
         passed=all(m.passed for m in aggregated),
         sample_results=sample_results,
         skipped_evaluations=total_skipped,
         scoring_modes_count=scoring_modes_count,
+        skipped_samples=skipped_samples,
     )
```
Verify each finding against the current code and only fix it if needed. In `@packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py` around lines 346 - 447, In _evaluate_combined, when a sample is fully skipped (all_skipped True) emit a SkipRecord that captures sample.sample_id and the parse/refusal reason (use refusal_justification or a brief "unparseable" message) instead of silently omitting the sample; append that SkipRecord to whatever collection is returned as EvaluationResult.skipped_samples (or add a skipped_samples list to the return payload) so HTML/JSON consumers receive the sample id and reason, and keep the existing behavior of not creating a SampleResult or mutating scoring_modes_count.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/evaluator/src/rag_forge_evaluator/audit.py`:
- Around line 160-178: The ragas-specific allowlist wrongly rejects aliases like
"claude-sonnet"; update the validation in the evaluator_engine == "ragas" branch
to use the same alias normalization as _create_judge() (and
_KNOWN_JUDGE_ALIASES) before checking support: map config.judge_model (or the
normalized "mock" default) through _KNOWN_JUDGE_ALIASES or the same
alias-resolution function to obtain the canonical judge name, then validate that
canonical name is one of ("openai","gpt-4o","claude","mock") and perform the
voyageai installation check against the canonical "claude" value so aliases such
as "claude-sonnet" are accepted for ragas.
In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py`:
- Around line 102-105: The adapter stores ragas_max_tokens but never forwards it
to the judge: update the calls to _complete() (and any direct judge invocations
in RagasAdapter methods such as where ragas_max_tokens is set/used) to pass
ragas_max_tokens into the underlying judge call (e.g., as an explicit
max_tokens/max_tokens_override keyword or the third parameter expected by the
judge API) so the judge actually receives and applies the token limit; ensure
both the initial call around the _complete(system_prompt, user_prompt) site and
the other similar call sites (the block covering the later 115-132 region) are
changed to forward ragas_max_tokens consistently.
In `@packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py`:
- Around line 208-239: The combined-metric parsing loop currently accepts any
float (e.g., -1, 2.5, nan, inf) from data[field]; update the logic in the loop
that iterates over _COMBINED_METRIC_FIELDS (where thresholds and
_DEFAULT_THRESHOLDS are used and MetricResult is constructed) so that after
converting with float(data[field]) you validate the value is finite and within
0.0–1.0 (use math.isfinite or equivalent and check 0.0 <= score <= 1.0); if it
is out of range or non-finite, log a warning via logger (including the offending
value and field), set results[field] to a skipped MetricResult with score 0.0,
passed False, and a descriptive details message (similar style to the existing
non-numeric handling), and only create a normal MetricResult when the score is
valid.
In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py`:
- Around line 333-350: The returned report dictionary includes a hardcoded
"page_total_display": "Page 1 of 1" which is invalid for multi-page reports;
remove this hardcoded key from the return value in generator.py (where the dict
with method_html, data_handling_html, issued_date/report_time_utc, etc. is
assembled) or replace it by deriving the string from real pagination metadata
(e.g., a page_count or pagination helper) so that "page_total_display" is either
omitted when unknown or computed like f"Page {current_page} of {total_pages}"
using the actual pagination variables.
---
Outside diff comments:
In `@packages/evaluator/src/rag_forge_evaluator/engines/__init__.py`:
- Around line 8-14: The factory create_evaluator doesn't accept the new Ragas
options, so add optional parameters max_tokens: int | None and
embeddings_provider: EmbeddingsProvider | None to create_evaluator's signature
and forward them into the RagasEvaluator constructor (i.e. where
RagasEvaluator(...) is created pass max_tokens=max_tokens and
embeddings_provider=embeddings_provider, and when applicable fall back to
AuditConfig.ragas_max_tokens / AuditConfig.ragas_embeddings_provider if needed);
also make the same signature/forwarding change for the other create_evaluator
variant in this module so both paths plumb the ragas options through.
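The plumbing described above is plain keyword forwarding. A hedged sketch with stand-in classes, since only the constructor and factory signatures matter here:

```python
from typing import Any


class RagasEvaluator:
    """Stand-in for the real evaluator; only the constructor shape matters."""

    def __init__(self, judge: Any = None, embeddings_provider: Any = None) -> None:
        self.judge = judge
        self.embeddings_provider = embeddings_provider


def create_evaluator(
    engine: str,
    judge: Any = None,
    embeddings_provider: Any = None,
) -> Any:
    """Factory that forwards the new ragas options instead of dropping them."""
    if engine == "ragas":
        return RagasEvaluator(judge=judge, embeddings_provider=embeddings_provider)
    raise ValueError(f"unknown engine: {engine}")
```

The point of the review comment is that both factory variants must grow the same optional parameters, otherwise one call path silently falls back to defaults.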
---
Duplicate comments:
In `@packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py`:
- Around line 346-447: In _evaluate_combined, when a sample is fully skipped
(all_skipped True) emit a SkipRecord that captures sample.sample_id and the
parse/refusal reason (use refusal_justification or a brief "unparseable"
message) instead of silently omitting the sample; append that SkipRecord to
whatever collection is returned as EvaluationResult.skipped_samples (or add a
skipped_samples list to the return payload) so HTML/JSON consumers receive the
sample id and reason, and keep the existing behavior of not creating a
SampleResult or mutating scoring_modes_count.
In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py`:
- Around line 670-713: The instance method generate_html currently accepts
trends but discards it before calling the module-level generate_html, so the
template never receives the historical sparkline data; update the method to
forward the trends data as the template's history_points (or convert trends into
the expected history_points shape) when calling the module-level generate_html
function — specifically modify the method generate_html in report/generator.py
to include a history_points=... argument (derived from the trends parameter) in
the call to the top-level generate_html(result, ...) so the template can render
the sparkline history.
In `@pyproject.toml`:
- Around line 35-37: The mypy override is incorrectly silencing type-checks for
our own package by including "rag_forge_evaluator.*"; remove
"rag_forge_evaluator.*" from the modules list in the [[tool.mypy.overrides]]
block so the repo's package is not subject to ignore_missing_imports, then run
mypy to fix any real type errors surfaced in the rag_forge_evaluator package
(ensure the package name "rag_forge_evaluator" matches the actual top-level
package directory used in the repo).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 80638afd-ae35-4e76-b876-6f2226fb968d
📒 Files selected for processing (8)
- packages/evaluator/src/rag_forge_evaluator/audit.py
- packages/evaluator/src/rag_forge_evaluator/engines/__init__.py
- packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
- packages/evaluator/src/rag_forge_evaluator/engines/ragas_evaluator.py
- packages/evaluator/src/rag_forge_evaluator/metrics/llm_judge.py
- packages/evaluator/src/rag_forge_evaluator/report/generator.py
- packages/evaluator/src/rag_forge_evaluator/report/templates/audit.html.j2
- pyproject.toml
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: RAG Quality Gate
- GitHub Check: Lint, Typecheck & Test
> passes through from its internal metric prompts. ``max_tokens`` is
> retained on the instance so future Judge-protocol extensions can pick
> it up; the current Judge interface does not accept it, so the underlying
> model's own max-tokens default applies.
max_tokens is a no-op in the current adapter.
This setting is stored, but _complete() still calls the judge with only (system_prompt, user_prompt). Callers will assume ragas_max_tokens constrains ragas prompts even though provider defaults still apply.
Also applies to: 115-132
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py` around
lines 102 - 105, The adapter stores ragas_max_tokens but never forwards it to
the judge: update the calls to _complete() (and any direct judge invocations in
RagasAdapter methods such as where ragas_max_tokens is set/used) to pass
ragas_max_tokens into the underlying judge call (e.g., as an explicit
max_tokens/max_tokens_override keyword or the third parameter expected by the
judge API) so the judge actually receives and applies the token limit; ensure
both the initial call around the _complete(system_prompt, user_prompt) site and
the other similar call sites (the block covering the later 115-132 region) are
changed to forward ragas_max_tokens consistently.
```python
        return {
            "method_html": method_html,
            "data_handling_html": data_handling_html,
            "regulatory_html": _REGULATORY_ALIGNMENT_HTML,
            "authored_by": "RAG-Forge",
            "issued_date": report_date,
            "issued_utc_time": report_time_utc,
            "valid_until": "Next material change",
            "limitations_html": (
                "This report reflects the pipeline state at the time of the audit. "
                "Scores are estimates produced by an LLM judge and should be interpreted "
                "alongside human review for high-stakes decisions. "
                "RMM level is based on metric thresholds; infrastructure-level requirements "
                "(caching, RBAC, drift detection) are not automatically verified."
            ),
            "github_url": "github.com/hallengray/rag-forge",
            "page_total_display": "Page 1 of 1",
        }
```
Remove the hardcoded Page 1 of 1 attestation text.
Per-sample detail makes multi-page reports common, so this footer becomes false as soon as the report spans more than one page. It should be omitted or derived from real pagination metadata.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/evaluator/src/rag_forge_evaluator/report/generator.py` around lines
333 - 350, The returned report dictionary includes a hardcoded
"page_total_display": "Page 1 of 1" which is invalid for multi-page reports;
remove this hardcoded key from the return value in generator.py (where the dict
with method_html, data_handling_html, issued_date/report_time_utc, etc. is
assembled) or replace it by deriving the string from real pagination metadata
(e.g., a page_count or pagination helper) so that "page_total_display" is either
omitted when unknown or computed like f"Page {current_page} of {total_pages}"
using the actual pagination variables.
Final review pass on PR #33.

- **audit.py** — ragas allowlist now uses `_KNOWN_JUDGE_ALIASES` as the single source of truth. Previously rejected the `claude-sonnet` alias that `_create_judge()` happily accepted for the llm-judge path. The voyageai gate now triggers for both `claude` and `claude-sonnet`.
- **ragas_adapters.py + ragas_evaluator.py** — remove the no-op `max_tokens` field from `RagForgeRagasLLM.__init__` and from `RagasEvaluator.__init__`. The Judge protocol is `(system_prompt, user_prompt) -> str` with no per-call token cap, so storing a value that never reached the judge was misleading. The underlying judge's own default (ClaudeJudge: 4096, raisable via `RAG_FORGE_JUDGE_MAX_TOKENS` env var) applies during ragas runs. Proper per-call forwarding is a v0.2.1 follow-up documented in the wrapper's docstring.
- **audit.py** — drop `AuditConfig.ragas_max_tokens` for the same reason. The field was never wired through the factory, so it was a lie in two places.
- **metrics/llm_judge.py** — `_parse_combined_response` now rejects non-finite and out-of-range scores (< 0.0 or > 1.0) as malformed judge output and marks the metric skipped instead of polluting the aggregate. Covers the `-1`, `2.5`, `nan`, `inf` pathologies.
- **report/generator.py** — drop the hardcoded `"Page 1 of 1"` from the compliance footer. Per-sample detail makes multi-page reports common, and page numbers are handled at print time by the template's `@page` CSS counters.
- **report/generator.py** — simplify the redundant `safety_refusal_rate` divide-by-zero guard. The `or 1` sentinel and the separate `if result.samples_evaluated` branch were double-guarding against the same zero case.
- **CHANGELOG.md** — migration step now references the `>=0.4,<0.6` pin range instead of claiming ragas "will upgrade to 0.4.x" specifically.
- **docs/superpowers/specs/...** — add `text` language tag to the architecture diagram fenced block for markdownlint compliance.
- **Tests** — update `test_ragas_adapters.py` and `test_pearmedica_regression.py` to drop the now-removed `max_tokens` kwarg. Update `test_audit_config_validation.py` to assert the new `ragas_embeddings_provider` default instead of the removed `ragas_max_tokens` field.

Local gates:
- `uv run ruff check packages/` → All checks passed!
- `uv run mypy packages/core/src packages/evaluator/src packages/observability/src` → Success: no issues found in 102 source files
- `uv run pytest packages/evaluator/tests -q` → 255 passed, 4 skipped
- `rag-forge-eval audit --judge mock` → faithfulness 0.90, overall 0.90, passed: true
…-refresh

# Conflicts:
#	docs/superpowers/specs/2026-04-14-v0.2.0-evaluator-refresh-design.md
Summary
Implements the full v0.2.0 evaluator refresh per the design in #32. Rebuilds the RAGAS adapter around injected LLM + embeddings wrappers, ships default-on refusal-aware scoring with a `--strict` escape hatch, and rewrites the audit report around a single Jinja template with shared browser/print CSS.

Driven by findings from the PearMedica Cycle 2 audit (2026-04-14). Three v0.1.3 bugs fixed at the architectural root, one class of scoring pathology closed, and the report is rebuilt to top-tier standard.
What's in this PR (22 commits)
Workstream A — RAGAS adapter rebuild
- `engines/ragas_adapters.py` with `RagForgeRagasLLM` (sync + async) and `RagForgeRagasEmbeddings` (openai / voyage / mock providers)
- `ragas_evaluator.py` now injects our wrappers instead of letting ragas pick defaults — version-stable across ragas releases
- `_extract_ragas_score` raises `ValueError` instead of silently coercing to 0.0; exceptions are captured as `SkipRecord` entries visible in the report
- Hard pin to `ragas>=0.4,<0.6`; new `[ragas-voyage]` extra for Claude-judge users

Workstream B — Refusal-aware scoring (default-on)
- `metrics/llm_judge.py` injects an inline classification preamble into the scoring prompt. The judge classifies each sample as `standard` or `safety_refusal` and scores on the appropriate rubric in one round-trip (no extra API calls)
- `RagForgeRagasLLM` prepends a shorter refusal-aware NOTE to every prompt ragas sends
- `AuditConfig.refusal_aware: bool = True` + CLI flags `--strict`, `--refusal-aware`, `--no-refusal-aware`
- `MetricResult` and `SampleResult` gain `scoring_mode` and `refusal_justification` fields
- `EvaluationResult` gains `scoring_modes_count` for aggregate reporting

Workstream C — Top-tier audit report
- `report/templates/audit.html.j2` (430 lines) with all 14 sections: header → title → run manifest → RMM hero with SVG sparkline → maturity ladder → executive summary → TL;DR callout → metric breakdown → cost summary → safety refusals → worst case → per-sample detail cards → skipped samples → compliance footer
- `report/generator.py` rewritten as a context builder around the new template. Pure helper functions for RMM, TL;DR auto-generation, refusals block, worst case, compliance footer
- `report/pdf.py` uses Playwright's `emulate_media(media="print")` so the template's `@media print` rules (page breaks, `@page` rules, forced-open `<details>`) actually apply
- Report data surfaces `skipped_samples`, `scoring_modes_count`, `safety_refusal_rate`, and per-sample `scoring_mode` + `refusal_justification`, with RAG-Forge attribution
- `audit.py` drops the hard block on `--evaluator ragas --judge claude` and replaces it with a capability check for `voyageai`; the manifest no longer reads "gpt-4o-mini (RAGAS internal — judge override ignored)" — ragas now honors `--judge` end-to-end

Workstream D — Release coordination
- Coordinated 0.2.0 version bump across `rag-forge-core`, `rag-forge-evaluator`, `rag-forge-observability`, `@rag-forge/cli`, `@rag-forge/mcp`

Breaking changes
- Refusal-aware scoring is now the default; pass `--strict` to revert.
- ragas is pinned to `>=0.4,<0.6`. Users with pinned 0.2.x must upgrade ragas before upgrading rag-forge-evaluator.
- `--evaluator ragas --judge claude` now works — was previously a hard error.

Tests
- Skipped tests are gated behind the `ragas_integration` and `visual` markers
- `rag-forge-eval audit --evaluator llm-judge --judge mock` produces HTML + JSON + PDF cleanly

Regression coverage
- `tests/test_ragas_integration.py` — real ragas 0.4.x contract test (gated). If ragas 0.5 breaks `BaseRagasLLM`/`BaseRagasEmbeddings`, this test catches it
- `tests/test_pearmedica_regression.py` — sanitized 3-sample PearMedica fixture. Locks in fixes for Findings #4, #5, #6 — if any of the v0.1.3 pathologies reappear, these assertions fail
- `tests/test_report_visual_regression.py` — Playwright screenshot comparison against committed baseline (gated)
Depends on
Test plan
- `pnpm turbo run build` green in CI
- `uv run pytest packages/evaluator/tests -q` green in CI (256 passed)
- `rag-forge-eval audit` with mock judge against `packages/evaluator/tests/fixtures/ragas_tiny_fixture.json` — HTML/JSON/PDF all generated without errors
- Open `audit-report.html` in a browser and confirm layout matches the design spec (serif headings, RMM hero, TL;DR callout, compliance footer)
- `rag-forge-eval audit` against PearMedica Cycle 2 data with the `ragas_integration` marker enabled to confirm findings #4/#5/#6 stay dead