feat: Phase 2D — MCP Server Wiring + Templates (#5)
Conversation
⚠️ Review failed — the pull request is closed.
Walkthrough: Extracts the Python bridge into a new shared workspace package.
Sequence Diagram(s):

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant NodeProcess as "MCP Node Process"
    participant MCP as "MCP startServer"
    participant Stdio as "StdioServerTransport"
    participant Tool as "MCP Tool Handler"
    participant Python as "Python Module"
    User->>CLI: rag-forge serve --mcp <path>
    CLI->>NodeProcess: spawn node <mcp/dist/main.js>
    NodeProcess->>MCP: invoke startServer()
    MCP->>Stdio: create transport
    MCP->>Stdio: connect(transport)
    Stdio-->>MCP: connected
    User->>Stdio: invoke rag_query input
    Stdio->>Tool: handleRagQuery(input)
    Tool->>Python: runPythonModule("rag_forge_core.cli", ["query", ...])
    Python-->>Tool: {stdout, stderr, exitCode}
    alt exitCode == 0
        Tool-->>Stdio: return stdout
    else
        Tool-->>Stdio: return error payload (stderr or message)
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Extracts the Python subprocess bridge into a standalone workspace package so both the CLI and MCP server can share a single implementation without duplication. Exposes runPythonModule and checkPythonAvailable via ESM exports with full TypeScript declarations. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Removes the inline execa dependency from the CLI and replaces the full python-bridge implementation with a thin re-export of the shared package, ensuring a single source of truth for the subprocess bridge across all workspace consumers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the stub not-implemented handlers in rag-query, rag-audit, and rag-status with real implementations that delegate to Python via the shared runPythonModule bridge. Adds @rag-forge/shared as a workspace dependency so the MCP server shares the same subprocess layer as the CLI. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…gate Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Force-pushed from 436c2f4 to 78a7081
… gate
- turbo.json: typecheck now depends on ^build so @rag-forge/shared is built before CLI/MCP typecheck runs
- eval/golden_set.json: root-level golden set for CI audit gate

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 18
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/cli/src/commands/serve.ts`:
- Around line 7-10: The getMcpMainPath function currently hardcodes upward
traversal to locate mcp/dist/main.js which breaks outside the monorepo; change
getMcpMainPath to resolve the target using Node's module resolution (e.g., via
createRequire(import.meta.url).resolve or require.resolve('mcp/dist/main.js'))
and fall back to a configurable override (env var or CLI option) if resolution
fails, and ensure you catch and surface resolution errors from getMcpMainPath so
callers can handle packaged/global installs robustly.
- Around line 27-29: The spawn call currently uses the literal "node" which can
break in PATH-restricted environments; change it to use process.execPath when
spawning the child in serve.ts so the child uses the same Node binary as the
parent (update the spawn invocation that creates child from spawn("node",
[mcpMain], { stdio: "inherit" }) to spawn(process.execPath, [mcpMain], { stdio:
"inherit" })). Ensure any tests or wrappers referencing the child variable
remain compatible.
- Around line 36-38: The exit handler for child.on("exit", (code) => ...)
ignores the signal parameter and treats null codes as success; update the
handler signature to accept both (code, signal), propagate signal-terminated
exits by re-emitting the same signal to the parent (e.g.,
process.kill(process.pid, signal)) when signal is set, and fall back to exiting
with a non-zero code by changing process.exit(code ?? 0) to process.exit(code ??
1) when no signal is present; update the child.on("exit", ...) callback
accordingly so signal and code are both handled.
In `@packages/core/src/rag_forge_core/cli.py`:
- Around line 260-265: The except branch around store.count(collection) in
cmd_status is dead because QdrantStore.count() already handles errors and
returns 0; remove the try/except and simplify to a direct call: set count =
store.count(collection) and indexed = count > 0. Update the block that currently
contains the try/except (referencing store.count and collection in cmd_status)
to use this simplified logic so the ValueError/KeyError branch is removed.
In `@packages/mcp/src/tools/rag-audit.ts`:
- Around line 15-19: The code builds CLI args in rag-audit.ts (const args =
["audit", "--judge", "mock"]; and subsequent pushes) but ignores input.metrics
from the tool schema; either wire input.metrics into the args before invoking
the evaluator or explicitly reject it. Update the args construction to map
input.metrics to the appropriate CLI flag(s) (e.g., push "--metrics" and the
metrics value or serialize as needed) when input.metrics is present, or add a
validation step that throws or returns an error if input.metrics is set to
prevent silent input loss; adjust the evaluator invocation that consumes args
accordingly so the metrics are actually passed through.
In `@packages/mcp/src/tools/rag-query.ts`:
- Around line 28-32: The error branch in the rag-query result handling (where
result is checked with result.exitCode !== 0) currently only returns
result.stderr, which can drop structured JSON error payloads emitted to stdout;
update the failure path in the function handling the child process result to
prefer and preserve result.stdout (parse it if JSON) before falling back to
result.stderr, and return the preserved stdout JSON/error message (or include
both stdout and stderr) so callers of the module get the CLI's structured error;
look for symbols result and the exitCode check in
packages/mcp/src/tools/rag-query.ts to locate the change.
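A minimal sketch of that failure path; the `BridgeResult` shape mirrors the `{stdout, stderr, exitCode}` result described in the walkthrough, and the helper name is hypothetical:

```typescript
// Shape returned by the bridge (field names assumed from the walkthrough).
interface BridgeResult {
  stdout: string;
  stderr: string;
  exitCode: number;
}

function formatToolError(result: BridgeResult): string {
  // Prefer structured JSON the CLI may have written to stdout even on
  // failure; fall back to stderr, then to a generic message.
  const trimmed = result.stdout.trim();
  if (trimmed) {
    try {
      JSON.parse(trimmed); // already a structured payload: pass it through
      return trimmed;
    } catch {
      // stdout was not JSON; fall through and include both streams below
    }
  }
  return JSON.stringify({
    status: "error",
    exitCode: result.exitCode,
    stderr: result.stderr || "command failed with no output",
    stdout: result.stdout || undefined,
  });
}
```

This keeps the CLI's own error payload intact for MCP clients instead of discarding it in favor of stderr.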
In `@packages/shared/src/python-bridge.ts`:
- Around line 24-28: The return currently coerces undefined exit codes to 0
which mislabels signaled/canceled processes as success; change the mapping of
result.exitCode so that undefined is treated as a failure (e.g., map undefined
to a non‑zero value or propagate undefined explicitly and let callers treat it
as failure) instead of defaulting to 0 by replacing the `result.exitCode ?? 0`
logic; also add a sane timeout to the execa calls (the invocations using execa
shown earlier in this file) and ensure timeout/cancelled outcomes populate
stderr/exitCode consistently so callers (rag-query.ts, rag-audit.ts,
rag-status.ts) can detect failures.
- Around line 19-22: The execa calls in python-bridge.ts (the const result =
await execa(...) at the two invocation sites) run without a timeout; add a
timeout option (in milliseconds) to both execa option objects so stuck Python/uv
processes are killed automatically (e.g., timeout:
Number(process.env.PYTHON_BRIDGE_TIMEOUT_MS) || 30000). Update both calls that
invoke execa (the one at the shown line and the other around line 41) to include
this timeout setting in their options to prevent hanging tool requests.
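The two bridge findings above (exit-code coercion and missing timeout) can be sketched together as a normalization helper. The field names mirror execa-style results, where `exitCode` may be undefined after a signal kill or a `timeout`; the helper name, the `124` convention, and the env-var default are assumptions:

```typescript
// Raw child-process outcome, execa-style (all failure markers optional).
interface RawOutcome {
  exitCode?: number;
  signal?: string;
  timedOut?: boolean;
  stdout: string;
  stderr: string;
}

interface BridgeOutcome {
  exitCode: number; // always defined; non-zero on any failure
  stdout: string;
  stderr: string;
}

// Hypothetical default; the env-var override mirrors the review suggestion.
const DEFAULT_TIMEOUT_MS = Number(process.env.PYTHON_BRIDGE_TIMEOUT_MS) || 30_000;

function normalizeOutcome(raw: RawOutcome): BridgeOutcome {
  if (raw.timedOut) {
    return {
      exitCode: 124, // conventional "timed out" exit code
      stdout: raw.stdout,
      stderr: raw.stderr || `process timed out after ${DEFAULT_TIMEOUT_MS}ms`,
    };
  }
  if (raw.exitCode === undefined) {
    // Signal-terminated or cancelled: never report success.
    return {
      exitCode: 1,
      stdout: raw.stdout,
      stderr: raw.stderr || `process terminated by signal ${raw.signal ?? "unknown"}`,
    };
  }
  return { exitCode: raw.exitCode, stdout: raw.stdout, stderr: raw.stderr };
}
```

With this in place, `rag-query.ts`, `rag-audit.ts`, and `rag-status.ts` can treat any non-zero `exitCode` uniformly as failure.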
In `@templates/enterprise/project/.github/workflows/rag-audit.yml`:
- Around line 30-34: Replace the hardcoded THRESHOLD with the value from the
generated report: read THRESHOLD from reports/audit-report.json (e.g.
THRESHOLD=$(jq -r '.metrics.faithfulness.threshold // "0.85"'
reports/audit-report.json)) and keep the rest of the comparison using SCORE and
THRESHOLD as before; ensure jq extraction returns a numeric string or provide a
sensible fallback so the existing bc comparison works.
In `@templates/enterprise/project/eval/config.yaml`:
- Around line 4-21: The template includes unsupported keys ("metrics",
"ci_gate", "golden_set") that the evaluator's config model (AuditConfig and the
CLI wiring) does not consume; either remove these keys from the template or
adapt the config parsing to populate AuditConfig from them. Fix by updating the
template to only include fields that AuditConfig expects (remove "metrics",
"ci_gate", "golden_set" and keep only supported threshold entries), or
alternatively update the CLI/config loader (the AuditConfig parsing logic) to
accept and map "metrics", "ci_gate", and "golden_set" into the corresponding
AuditConfig properties so the declared gate truly configures the evaluator.
Ensure references to AuditConfig and the config loader/parsing function are
updated to reflect the chosen approach.
In `@templates/enterprise/project/eval/golden_set.json`:
- Line 4: The expected_answer_keywords array in golden_set.json uses overly
broad tokens ("topic", "document") which allow false-positive matches; update
the "expected_answer_keywords" value(s) to use more specific phrases or entities
that reflect the true expected content (e.g., "primary research topic", "source
document title", "document ID", "document summary") so the evaluation checks
substantive overlap—locate the "expected_answer_keywords" key in the JSON entry
and replace the generic tokens with targeted keyword phrases relevant to the
question/answer pair.
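The false-positive risk can be seen with a toy containment check; substring matching here is an illustrative assumption, not the evaluator's actual scoring:

```typescript
// Returns true when every keyword appears in the answer (case-insensitive).
// Broad tokens like "topic" or "document" match almost any prose, so a
// wrong answer can still pass; specific phrases do not.
function keywordHit(answer: string, keywords: string[]): boolean {
  const lower = answer.toLowerCase();
  return keywords.every((k) => lower.includes(k.toLowerCase()));
}
```

An answer about an entirely unrelated subject still satisfies `["topic", "document"]`, while a targeted phrase like `"primary research topic"` correctly rejects it.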
In `@templates/enterprise/project/pyproject.toml`:
- Around line 22-23: The template sets reranker = "cohere" which triggers
RetrievalConfig validation requiring cohere_api_key; change the default reranker
in the template (the reranker field in
templates/enterprise/project/pyproject.toml) to a safe empty/none value (or
remove the reranker entry) so generated projects do not demand cohere_api_key at
load time; ensure the change references the reranker setting and leaves enrich =
true untouched so only the reranker default is modified.
In `@templates/enterprise/project/README.md`:
- Around line 15-18: The Quick Start should add an explicit credential setup
step before running commands like `rag-forge index`, `rag-forge query`, and
`rag-forge audit` because the template uses provider-backed
reranking/generation; update the README to instruct users to export or configure
required API keys (e.g., set provider-specific env vars or run the
project-specific credential setup command) and show a short example of the
expected variables or config (so first-run commands don't fail due to missing
keys).
In `@templates/enterprise/project/src/config.py`:
- Around line 23-25: PipelineConfig currently declares reranker and cohere_model
but lacks a Cohere credential field, so add a cohere_api_key attribute to
PipelineConfig (same style/type as other config fields, e.g., Optional[str] =
None or str = "") so users can supply the API key when reranker == "cohere";
update any validation/usage sites that read PipelineConfig.cohere_api_key to
handle missing/empty keys accordingly (look for class PipelineConfig, attributes
reranker and cohere_model to locate where to add the new cohere_api_key field).
In `@templates/hybrid/project/eval/config.yaml`:
- Around line 4-21: The template exposes keys (metrics, ci_gate, golden_set)
that AuditConfig in packages/evaluator/src/rag_forge_evaluator/audit.py and the
CLI in packages/evaluator/src/rag_forge_evaluator/cli.py don’t handle; either
remove them from the template or make the evaluator read and enforce them—update
AuditConfig to add typed fields for metrics: List[str], ci_gate: {metric: str,
threshold: float, block_on_failure: bool}, and golden_set: Optional[str], update
the CLI parsing in cli.py to load those fields from the config JSON, and
implement enforcement logic that uses AuditConfig.ci_gate to fail/exit when gate
conditions are not met and loads AuditConfig.golden_set when present to evaluate
golden examples.
In `@templates/hybrid/project/README.md`:
- Around line 15-17: The quick-start query command is missing the required
hybrid retrieval flag; update the example for the `rag-forge query` invocation
to include the `--sparse-index-path` argument (matching the earlier index step)
so hybrid mode runs correctly—modify the line with the `rag-forge query "your
question" --strategy hybrid --reranker cohere` example to add
`--sparse-index-path .rag-forge/sparse` so the documented flow is runnable.
In `@templates/hybrid/project/src/config.py`:
- Around line 23-25: PipelineConfig is missing a Cohere credential field while
defaulting reranker="cohere"; add a cohere_api_key field to the PipelineConfig
definition (e.g., cohere_api_key: Optional[str] = None) and import
typing.Optional if needed so users can set the Cohere API key alongside reranker
and cohere_model; update any code that constructs PipelineConfig (or reads its
fields) to use this new cohere_api_key when initializing the Cohere client.
In `@turbo.json`:
- Around line 17-19: The "typecheck" pipeline currently has "dependsOn":
["^build"] which only depends on build tasks in dependency packages; update the
"typecheck" pipeline's dependsOn to include the local build by adding "build" to
the array (i.e., "dependsOn": ["build", "^build"]) so the local build task runs
before the "typecheck" task; locate the "typecheck" entry and modify its
dependsOn accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 50e6ee76-0208-40ed-b836-adc9d5845aa5
⛔ Files ignored due to path filters (1)
`pnpm-lock.yaml` is excluded by `!**/pnpm-lock.yaml`
📒 Files selected for processing (34)
- eval/golden_set.json
- packages/cli/package.json
- packages/cli/src/commands/serve.ts
- packages/cli/src/index.ts
- packages/cli/src/lib/python-bridge.ts
- packages/core/src/rag_forge_core/cli.py
- packages/mcp/package.json
- packages/mcp/src/main.ts
- packages/mcp/src/server.ts
- packages/mcp/src/tools/rag-audit.ts
- packages/mcp/src/tools/rag-query.ts
- packages/mcp/src/tools/rag-status.ts
- packages/mcp/tsup.config.ts
- packages/shared/package.json
- packages/shared/src/python-bridge.ts
- packages/shared/tsconfig.json
- packages/shared/tsup.config.ts
- pnpm-workspace.yaml
- templates/enterprise/README.md
- templates/enterprise/project/.github/workflows/rag-audit.yml
- templates/enterprise/project/README.md
- templates/enterprise/project/eval/config.yaml
- templates/enterprise/project/eval/golden_set.json
- templates/enterprise/project/pyproject.toml
- templates/enterprise/project/src/config.py
- templates/enterprise/project/src/pipeline.py
- templates/hybrid/README.md
- templates/hybrid/project/README.md
- templates/hybrid/project/eval/config.yaml
- templates/hybrid/project/eval/golden_set.json
- templates/hybrid/project/pyproject.toml
- templates/hybrid/project/src/config.py
- templates/hybrid/project/src/pipeline.py
- turbo.json
📜 Review details
🧰 Additional context used
🪛 LanguageTool
templates/enterprise/project/README.md
[uncategorized] ~22-~22: The official name of this software platform is spelled with a capital “H”.
Context: "...olden_set.json ## CI/CD The included `.github/workflows/rag-audit.yml` runs evaluatio..."
(GITHUB)
🔇 Additional comments (21)

packages/mcp/src/server.ts (1)
- 4-7: Startup wiring is clean and correct. `startServer()` correctly creates the MCP server and awaits stdio transport connection.

packages/cli/src/index.ts (1)
- 6-6: Serve command registration is integrated correctly. Import and registration match the existing command pattern. Also applies to: 21-21.

packages/mcp/src/main.ts (1)
- 3-6: Entrypoint failure handling is solid. Top-level rejection handling logs context and exits non-zero as expected.

packages/mcp/tsup.config.ts (1)
- 4-4: Build entry expansion is correct. Adding `src/main.ts` to tsup entries matches the new MCP runtime entrypoint.

templates/hybrid/README.md (1)
- 5-5: Usage instruction is clear and immediately actionable. Line 5 gives a concrete command users can run, which is better than a phase-status placeholder.

pnpm-workspace.yaml (1)
- 2-2: Workspace package registration looks correct. Line 2 correctly adds `packages/shared`, enabling workspace linking for `@rag-forge/shared`.

packages/shared/tsconfig.json (1)
- 1-9: TypeScript package config is well-scoped. `rootDir`, `outDir`, and `include`/`exclude` are consistent for building `packages/shared` cleanly.

templates/enterprise/README.md (2)
- 3-3: Enterprise positioning text remains coherent. Line 3 still communicates security-focused, gated deployment intent clearly.
- 5-5: Direct initialization command is a good UX improvement. Line 5 provides a runnable command instead of release-phase wording.

packages/mcp/package.json (1)
- 23-23: Dependency addition is correct for shared bridge reuse. Line 23 properly declares `@rag-forge/shared` for MCP runtime integration.

packages/cli/package.json (1)
- 25-25: CLI dependency wiring to shared package looks correct. Line 25 correctly links the CLI to `@rag-forge/shared` via workspace dependency.

templates/hybrid/project/eval/golden_set.json (1)
- 1-14: Golden set structure is consistent and valid. This dataset is well-formed and suitable as a starter template.

packages/shared/tsup.config.ts (1)
- 3-10: Build config aligns with the package export contract. The entry/output settings are coherent with the shared bridge packaging goals.

eval/golden_set.json (1)
- 1-20: Starter golden set is well-structured. Entries are consistent and usable for baseline evaluation flows.

packages/mcp/src/tools/rag-status.ts (1)
- 4-16: Good wiring from MCP tool to Python status command. Lines 4-16 correctly delegate execution and handle error/success paths cleanly.

templates/hybrid/project/pyproject.toml (1)
- 1-23: Hybrid template manifest is coherent and complete for initial scaffolding. Project/dependency metadata and `tool.rag-forge` settings are internally consistent.

packages/shared/package.json (1)
- 1-29: Package manifest wiring looks correct. Export map, scripts, engine floor, and dependency split are coherent for a shared ESM utility package.

packages/core/src/rag_forge_core/cli.py (1)
- 324-333: Status subcommand wiring is clean. Parser registration and command dispatch are correctly integrated into `main()`.

packages/cli/src/lib/python-bridge.ts (1)
- 1-8: Re-export consolidation looks good. This keeps CLI bridge imports stable while centralizing implementation in `@rag-forge/shared`.

templates/hybrid/project/src/pipeline.py (1)
- 6-36: Template scaffold is clear and usable. The placeholder API shape and CLI hinting are coherent for generated starter code.

templates/enterprise/project/src/pipeline.py (1)
- 6-36: Enterprise template scaffold is consistent. The generated placeholders and invocation pattern are aligned with starter-template expectations.
```ts
function getMcpMainPath(): string {
  const currentDir = fileURLToPath(new URL(".", import.meta.url));
  return resolve(currentDir, "..", "..", "..", "mcp", "dist", "main.js");
}
```
Path resolution is monorepo-coupled and can break outside this layout.
Line 9 hardcodes traversal to a sibling mcp/dist/main.js. This is fragile for packaged/global installs and any altered build layout.
🔧 Proposed fix

```diff
+import { existsSync } from "node:fs";
 import { resolve } from "node:path";
 import { fileURLToPath } from "node:url";
@@
 function getMcpMainPath(): string {
   const currentDir = fileURLToPath(new URL(".", import.meta.url));
-  return resolve(currentDir, "..", "..", "..", "mcp", "dist", "main.js");
+  const candidates = [
+    process.env.RAG_FORGE_MCP_MAIN,
+    resolve(currentDir, "..", "..", "..", "mcp", "dist", "main.js"),
+  ].filter((p): p is string => Boolean(p));
+
+  const resolved = candidates.find((p) => existsSync(p));
+  if (!resolved) {
+    throw new Error(
+      "Unable to locate MCP entrypoint. Set RAG_FORGE_MCP_MAIN to the built main.js path.",
+    );
+  }
+  return resolved;
 }
```
```ts
child.on("exit", (code) => {
  process.exit(code ?? 0);
});
```
Signal-terminated child processes incorrectly exit with success code.
The exit handler only uses the code parameter and ignores signal. When a child process is terminated by a signal, code is null, so code ?? 0 exits the parent with code 0 (success), masking interrupt and termination signals like SIGTERM or SIGKILL.
Node.js >=20.0.0 is required, which provides both code and signal to the exit handler. Handle signals explicitly to propagate them:
Fix

```diff
-child.on("exit", (code) => {
-  process.exit(code ?? 0);
+child.on("exit", (code, signal) => {
+  if (signal) {
+    process.kill(process.pid, signal);
+    return;
+  }
+  process.exit(code ?? 1);
 });
```

Also change `?? 0` to `?? 1` to avoid masking unexpected null exit codes as success.
```python
try:
    count = store.count(collection)
    indexed = count > 0
except (ValueError, KeyError):
    count = 0
    indexed = False
```
🧹 Nitpick | 🔵 Trivial
cmd_status exception branch is effectively dead code.
QdrantStore.count() already swallows exceptions and returns 0, so this except (ValueError, KeyError) path is not doing real work. Simplifying this block will make status behavior clearer.
♻️ Proposed simplification

```diff
 def cmd_status(args: argparse.Namespace) -> None:
     """Check pipeline status."""
     collection = args.collection or "rag-forge"
     store = QdrantStore()
-    try:
-        count = store.count(collection)
-        indexed = count > 0
-    except (ValueError, KeyError):
-        count = 0
-        indexed = False
+    count = store.count(collection)
+    indexed = count > 0
```
```ts
const args = ["audit", "--judge", "mock"];

if (input.golden_set_path) {
  args.push("--golden-set", input.golden_set_path);
}
```
Do not silently ignore input.metrics.
Line 15–Line 19 build CLI args without using input.metrics, even though the tool schema accepts it. Either wire this field into the evaluator invocation or reject it explicitly to avoid a misleading API contract.
Proposed guard to avoid silent input loss

```diff
 export async function handleRagAudit(input: RagAuditInput): Promise<string> {
+  if (input.metrics?.length) {
+    return JSON.stringify({
+      status: "error",
+      message: "The 'metrics' filter is not supported by rag_forge_evaluator.cli yet",
+    });
+  }
+
   const args = ["audit", "--judge", "mock"];
```
```yaml
metrics:
  - context_relevance
  - faithfulness
  - answer_relevance
  - hallucination

thresholds:
  context_relevance: 0.80
  faithfulness: 0.85
  answer_relevance: 0.80
  hallucination_rate: 0.05

ci_gate:
  metric: faithfulness
  threshold: 0.85
  block_on_failure: true

golden_set: eval/golden_set.json
```
metrics/ci_gate/golden_set appear non-functional with current evaluator config model.
Line 4–Line 21 define keys that are not represented in packages/evaluator/src/rag_forge_evaluator/audit.py AuditConfig, while packages/evaluator/src/rag_forge_evaluator/cli.py only reads thresholds from config JSON. This can mislead template users into thinking these gates are enforced when they are not.
```json
"typecheck": {
  "dependsOn": ["^build"]
}
```
typecheck should explicitly depend on the local build task.
In Turborepo, "dependsOn": ["^build"] targets only the build tasks in dependency packages, not the current package's build task. To ensure typecheck runs after the local build task completes, add "build" to the dependsOn array.
Proposed fix

```diff
 "typecheck": {
-  "dependsOn": ["^build"]
+  "dependsOn": ["build", "^build"]
 }
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:
"typecheck": {
  "dependsOn": ["^build"]
}
After:
"typecheck": {
  "dependsOn": ["build", "^build"]
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@turbo.json` around lines 17 - 19, The "typecheck" pipeline currently has
"dependsOn": ["^build"] which only depends on build tasks in dependency
packages; update the "typecheck" pipeline's dependsOn to include the local build
by adding "build" to the array (i.e., "dependsOn": ["build", "^build"]) so the
local build task runs before the "typecheck" task; locate the "typecheck" entry
and modify its dependsOn accordingly.
- serve.ts: validate MCP path exists before spawning, use process.execPath
- rag-query/audit/status: preserve stdout on error for partial results
- python-bridge: add 5min timeout, handle signal-terminated processes
- enterprise/hybrid templates: default reranker to "none", add cohere_api_key field
- hybrid README: fix query command args, add prerequisites section
- enterprise README: add prerequisites section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Phase 2D — MCP Server Wiring + Templates
Cycle 3's PearMedica audit (2026-04-15) documented two residual skip-
handling gaps in v0.2.1, separate from the C3-2 generate() AttributeError:
1. EvaluationResult.skipped_evaluations (the integer counter the report's
TL;DR "Skipped: N" line reads) was never set by RagasEvaluator.
skipped_samples held real SkipRecords, but the counter stayed at 0.
Users reading the top-level summary saw "Scored: 0, Skipped: 0" and
thought nothing had happened when in reality every job had crashed.
2. Individual-metric extraction failures created one SkipRecord per
metric name with sample_id="<aggregate>". A 12-sample x 4-metric run
that failed entirely produced 4 records instead of 48 — the blast
radius was under-reported by 12x.
Fix:
- Extract skip-record creation into _fan_out_skip_records() — fans a
single coarse failure out into one record per (sample, metric) pair
with the real sample_id attached.
- Both the whole-batch crash path and the per-metric extraction failure
path now use it.
- Set EvaluationResult.skipped_evaluations = len(skipped_samples) on
both return sites so the counter and the detail list can never drift
apart silently again.
- Truncate reason strings to 400 chars (with trailing ellipsis) so long
Python tracebacks don't blow up HTML/PDF rendering downstream.
Tests:
- test_ragas_skip_counter.py — three new tests covering:
(a) skipped_evaluations counter equals len(skipped_samples),
(b) whole-batch crashes fan out to sample_count * metric_count records
with real sample_ids (not "<aggregate>"),
(c) reason truncation at 400 chars.
- test_cycle2_regression.py — updated two assertions that pre-dated the
fan-out change. The original assertions demanded "metrics must be
populated"; with fan-out, a MockJudge run legitimately produces zero
scored metrics and a full skip list. Tests now assert the absence of
Finding #4 (embed_query AttributeError) and Finding #5
(InstructorRetryException / max_tokens truncation) signatures in the
skip records instead, which is the actual regression guard.
- test_evaluator_factory.py — gate test_ragas_not_installed_raises_on_evaluate
with a reverse importorskip so it runs only on systems without the
[ragas] extra. CI matrices with ragas installed now skip it instead
of failing.
Pre-existing mypy arg-type errors on the ragas_evaluate() kwargs silenced
with targeted # type: ignore comments and a docstring note pointing
readers at the adapter contract test as the real interface guard — duck
typing is the deliberate design, not an oversight.
G2's deliverable is proof — not claims — that v0.2.2's code actually closes every Cycle 2 / Cycle 3 finding attributed to RAG-Forge. This file is the artifact. One section per finding with:
- what the original audit saw (raw error signatures)
- why earlier cycles could not verify the fix (masking)
- what v0.2.2 changes to close the gap
- passing test output naming each test by ID

Covers:
- Cycle 2 #4 — OpenAIEmbeddings.embed_query AttributeError (FIXED)
- Cycle 2 #5 — max_tokens overflow on long responses (FIXED)
- Cycle 2 #6 — silent 0.0 coercion (FULLY FIXED)
- Cycle 3 C3-2 — RagForgeRagasLLM.generate missing (FIXED)
- Cycle 3 C3-5 — stale __version__ (FIXED)

Closes with a falsifiability section naming the concrete error signatures that, if they reappear in a Cycle 4 run against real PearMedica telemetry, would falsify the report and justify holding the release. The release bar is "Cycle 4 produces non-empty metrics OR skip records with real error signatures — not empty both." Combined test suite run shows 31/31 passing on the rebased G1+G3+G5 branch.
…3) (#38)

* fix(evaluator): add missing ragas BaseRagasLLM methods (G1)

RAG-Forge v0.2.1's RagForgeRagasLLM wrapper was shipped without the concrete .generate() method that ragas 0.4.x's BaseRagasLLM exposes. Every RAGAS metric job crashed on first contact with AttributeError: 'RagForgeRagasLLM' object has no attribute 'generate' during the PearMedica Cycle 3 audit (2026-04-15), producing a Scored: 0, Overall: 0.0000 report across all 48 metric evaluations.

Root cause: the adapter's design doc asserted "ragas only calls generate_text / agenerate_text / model_name" and deliberately avoided subclassing BaseRagasLLM to keep ragas as a soft dependency. That assumption was wrong — BaseRagasLLM.generate is a concrete async helper that ragas's metric code invokes on every LLM regardless of subclass status. The duck-typed wrapper must re-declare it.

Fix preserves the no-hard-import design by adding duck-typed shims for every public method on BaseRagasLLM and BaseRagasEmbeddings:

RagForgeRagasLLM:
- async generate() — the specific shim Cycle 3 caught missing
- is_finished() — was abstract on base, returns True
- get_temperature(n) — matches base convention (0.01 / 0.3)
- set_run_config() — stores ragas RunConfig for compatibility
- run_config attribute — defaults to None

RagForgeRagasEmbeddings:
- embed_text(is_async) — dispatch helper on base class
- embed_texts(is_async) — batch variant
- set_run_config()
- run_config attribute

Tests: a contract test iterates every public method on the real ragas base classes and asserts our wrappers declare a callable of the same name, so the next ragas release that adds a method fails in CI, not in a user audit. An end-to-end smoke test runs ragas.evaluate() against our wrappers on a 1-sample dataset and asserts it never raises an AttributeError naming our wrapper classes — the exact regression signature from Cycle 3.
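The contract test described above can be sketched as a method-coverage check. The classes here are toys standing in for the real ragas base class and the RAG-Forge wrapper; only the `inspect`-based coverage logic is the point.

```python
# Sketch of the contract test: every public callable on the base class
# must be declared as a callable on the duck-typed wrapper.
import inspect


class ToyBase:  # stands in for ragas's BaseRagasLLM (illustrative only)
    def generate_text(self, prompt): ...
    async def agenerate_text(self, prompt): ...
    async def generate(self, prompt, n=1, temperature=0.01): ...


class ToyWrapper:  # stands in for RagForgeRagasLLM (illustrative only)
    def generate_text(self, prompt): ...
    async def agenerate_text(self, prompt): ...
    async def generate(self, prompt, n=1, temperature=0.01): ...


def missing_methods(base_cls, wrapper_cls):
    """Public base-class callables the wrapper fails to declare."""
    return [
        name
        for name, _ in inspect.getmembers(base_cls, callable)
        if not name.startswith("_")
        and not callable(getattr(wrapper_cls, name, None))
    ]
```

A pytest version would simply `assert missing_methods(BaseRagasLLM, RagForgeRagasLLM) == []`, so a new ragas release that adds a public method fails in CI.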
Three pre-existing test failures in test_cycle2_regression.py and test_evaluator_factory.py are unrelated to this change (pre-existing in v0.2.1 when ragas is installed; they assert on compound conditions that break when MockJudge triggers the skip path). They belong in workstream G3 (skip counter plumbing) and the G1 PR leaves them alone.

* fix(evaluator): address CodeRabbit review on G1 adapter shims

Five fixes from CodeRabbit review on PR #36. All were real issues the original contract test missed because it only checked method names, not signatures.

### generate(n > 1) now produces the correct LLMResult shape

ragas uses n > 1 for multi-sample metrics (answer_correctness consistency checks, etc.) and expects LLMResult.generations shaped [[gen1, gen2, ..., genN]] — one prompt run with N candidate generations. The original shim ignored n and returned a single generation, silently breaking any metric that relied on sample diversity.

Fix: when n > 1, fan out n independent agenerate_text calls via asyncio.gather and fuse the per-call results into a single [[gen1..genN]]-shaped LLMResult via the new _fuse_llm_results helper. n == 1 (the common case) stays a single call.

New test: test_wrapper_async_generate_n_greater_than_one_produces_fused_shape uses a counting judge to verify all N calls fire and the fused result carries all N distinct outputs.

### embed_text / embed_texts now async (crash fix)

Critical bug: the original shims were sync methods wrapping asyncio.run(self.aembed_query(text)). ragas's BaseRagasEmbeddings declares embed_text and embed_texts as async coroutines, and metric code invokes them with await embeddings.embed_text(...) from inside ragas's evaluation event loop. asyncio.run() inside a running loop crashes with RuntimeError: asyncio.run() cannot be called from a running event loop — a real live-fire crash that would have taken down any Cycle 4 run.

Fix: make both methods async.
They now just await the existing aembed_query / aembed_documents paths. The is_async parameter is accepted for signature parity with the base class but ignored — our underlying clients are synchronous and aembed_query already runs them in a worker thread via asyncio.to_thread, so both flag values land on the same code path.

New tests:
- test_embeddings_embed_text_is_async_and_awaitable: asserts both methods are inspect.iscoroutinefunction and round-trips through asyncio.run(embed.embed_text(...)).
- test_embeddings_embed_text_callable_from_running_event_loop: exercises the exact asyncio.run-inside-loop path ragas creates.

### Contract test now checks async/sync parity, not just names

CodeRabbit correctly pointed out that the original contract test compared method names but not their async/sync shape — exactly why the sync embed_text / embed_texts slipped past review.

Fix: two new tests (test_llm_wrapper_async_signature_matches_base and test_embeddings_wrapper_async_signature_matches_base) iterate every public method on the real ragas base classes and assert inspect.iscoroutinefunction matches on both sides. A future release that adds a method in one shape or the other will fail in CI.

### generate() default temperature: None -> 0.01

BaseRagasLLM.generate's default is 0.01. Our shim had None, which is functionally equivalent (None triggers get_temperature()) but diverges from the base signature. Changed to 0.01 to match. Tests that exercise the None fallback still pass None explicitly.

### Tightened existing temperature-fallback test

The original test asserted "doesn't raise" but would have passed even if get_temperature() was never called. Now patches llm.get_temperature with wraps= and asserts assert_called_once_with(1). Added an inverse test that patches get_temperature and asserts it was NOT called when an explicit temperature is passed — guards against over-eager fallback firing.
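The n > 1 fan-out and fuse described above can be sketched with a stand-in result type. `_StubResult` and `fake_agenerate_text` are illustrative stand-ins for langchain's LLMResult and the wrapper's real agenerate_text; only the gather-then-fuse shape is the point.

```python
# Sketch: fan out n independent generation calls, fuse into [[g1..gN]].
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class _StubResult:
    # mirrors LLMResult.generations: one prompt run, N candidate generations
    generations: List[List[str]]


async def fake_agenerate_text(prompt: str, i: int) -> _StubResult:
    # stand-in for a single judge call returning one candidate
    return _StubResult(generations=[[f"candidate-{i}"]])


async def generate(prompt: str, n: int = 1) -> _StubResult:
    if n == 1:
        # common case: a single call, no fusing
        return await fake_agenerate_text(prompt, 0)
    # fan out n independent calls concurrently, preserving order
    results = await asyncio.gather(
        *(fake_agenerate_text(prompt, i) for i in range(n))
    )
    # fuse per-call results into a single [[gen1..genN]]-shaped result
    fused = [r.generations[0][0] for r in results]
    return _StubResult(generations=[fused])
```

Because `asyncio.gather` preserves argument order, the fused inner list carries all N distinct candidates in call order, which is what the counting-judge test verifies.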
### Narrowed e2e smoke test exception handler

Original except Exception accepted any non-AttributeError as a pass. A bad shim raising TypeError or a nested-loop bug raising RuntimeError would have counted as success. Tightened to allow only RagasOutputParserException (expected downstream parser failure from the stub judge's canned JSON); everything else propagates.

* fix(evaluator): fuse _StringLLMResult stubs in n>1 path for no-ragas CI

The n>1 fan-out fix in the previous commit built a fused LLMResult via langchain_core.outputs.LLMResult. On CI without the [ragas] extra installed, langchain isn't available and generate_text falls back to the _StringLLMResult stub — which _fuse_llm_results had no fuse path for, so it hit the ImportError fallback and returned results[0] (a single generation). The n>1 test then asserted inner length == 3 and failed with 1.

Fix: separate the flatten step (build the fused_generations list) from the wrap step (pick LLMResult or _StringLLMResult). The flatten works on either shape because both expose the same .generations[0] interface. Wrap picks langchain if available, otherwise constructs via a new _StringLLMResult._from_generations alt constructor that carries a pre-fused list.

Local verification:
- without [ragas]: 18/18 adapter tests pass (contract + e2e skip)
- with [ragas]: 28/28 adapter tests pass

* fix(evaluator): address CodeRabbit round-2 findings on G1

Three more findings from the second CodeRabbit review on PR #36 / #38.

### _fuse_llm_results now fails loud on malformed input

The previous commit's AttributeError/IndexError fallback silently returned results[0] — turning a requested n>1 generation into a single-sample result with no signal. Downstream ragas metrics that rely on sample diversity would consume the degraded result as if it were a valid N-sample fuse. CodeRabbit rightly flagged this as hiding real correctness bugs.

Fix: raise ValueError with the observed result types and the underlying exception chained.
Added a targeted test (test_fuse_llm_results_raises_on_malformed_input) covering three malformed-input shapes: objects without .generations, empty outer lists, and empty input lists.

### CountingJudge in n>1 test is now thread-safe

RagForgeRagasLLM.generate fans out via asyncio.gather, and each agenerate_text call runs in a worker thread via asyncio.to_thread. counter["i"] += 1 is not atomic across threads — the read-modify-write race would produce duplicate sample labels or an undercounted total, flaking the test. Added a threading.Lock around the increment. Test now deterministic under concurrent fan-out.

### Contract tests now check parameter parity, not just name + async

Previously we asserted that each public method on BaseRagasLLM / BaseRagasEmbeddings exists on our wrapper and matches async/sync. That missed parameter drift — a future ragas release that adds max_tokens to generate_text would silently break our wrapper until a user audit caught it.

Fix: two new tests (test_llm_wrapper_parameter_names_cover_base_class and test_embeddings_wrapper_parameter_names_cover_base_class) use inspect.signature to enumerate the named parameters on each base-class method and assert the wrapper accepts every name. Variadic *args / **kwargs are ignored (they can absorb any kwarg by definition). A helper _required_param_names() shares the logic.

Local: 30 passed on adapter + contract + e2e suites with ragas installed, 18 passed on adapter suite alone without the [ragas] extra. Ruff + mypy clean.

* fix(evaluator): route RAGAS exceptions into Skipped counter (G3)

Cycle 3's PearMedica audit (2026-04-15) documented two residual skip-handling gaps in v0.2.1, separate from the C3-2 generate() AttributeError:

1. EvaluationResult.skipped_evaluations (the integer counter the report's TL;DR "Skipped: N" line reads) was never set by RagasEvaluator. skipped_samples held real SkipRecords, but the counter stayed at 0. Users reading the top-level summary saw "Scored: 0, Skipped: 0" and thought nothing had happened when in reality every job had crashed.
2. Individual-metric extraction failures created one SkipRecord per metric name with sample_id="<aggregate>". A 12-sample x 4-metric run that failed entirely produced 4 records instead of 48 — the blast radius was under-reported by 12x.

Fix:
- Extract skip-record creation into _fan_out_skip_records() — fans a single coarse failure out into one record per (sample, metric) pair with the real sample_id attached.
- Both the whole-batch crash path and the per-metric extraction failure path now use it.
- Set EvaluationResult.skipped_evaluations = len(skipped_samples) on both return sites so the counter and the detail list can never drift apart silently again.
- Truncate reason strings to 400 chars (with trailing ellipsis) so long Python tracebacks don't blow up HTML/PDF rendering downstream.

Tests:
- test_ragas_skip_counter.py — three new tests covering: (a) skipped_evaluations counter equals len(skipped_samples), (b) whole-batch crashes fan out to sample_count * metric_count records with real sample_ids (not "<aggregate>"), (c) reason truncation at 400 chars.
- test_cycle2_regression.py — updated two assertions that pre-dated the fan-out change. The original assertions demanded "metrics must be populated"; with fan-out, a MockJudge run legitimately produces zero scored metrics and a full skip list. Tests now assert the absence of Finding #4 (embed_query AttributeError) and Finding #5 (InstructorRetryException / max_tokens truncation) signatures in the skip records instead, which is the actual regression guard.
- test_evaluator_factory.py — gate test_ragas_not_installed_raises_on_evaluate with a reverse importorskip so it runs only on systems without the [ragas] extra. CI matrices with ragas installed now skip it instead of failing.

Pre-existing mypy arg-type errors on the ragas_evaluate() kwargs silenced with targeted # type: ignore comments and a docstring note pointing readers at the adapter contract test as the real interface guard — duck typing is the deliberate design, not an oversight.

* fix(evaluator): tolerate cross-environment mypy on ragas_evaluate ignores

The # type: ignore[arg-type] comments on the ragas_evaluate() kwargs only fire when mypy can see ragas's real type signatures — i.e. when the [ragas] extra is installed (local dev). On CI without the extra, mypy falls back to Any and flags the ignores as [unused-ignore]. Adding unused-ignore to the ignore codes tells mypy to tolerate the comment when it has nothing to suppress. Both environments are now clean: local mypy with the extra still sees the arg-type mismatch and honours the suppression; CI mypy without the extra silently accepts the comment as dead.

* fix(evaluator): remove dead real_import code in G3 skip counter test

CodeRabbit on PR #38 spotted the leftover
real_import = mod.__dict__.copy()  # line 96
_ = real_import  # line 111
in test_whole_batch_crash_fans_out_to_every_sample_metric_pair. The dict was copied but never used — an artefact of an earlier monkeypatching approach that got simplified but not cleaned up. Removing it and the unnecessary module-level import that fed it. No behavioural change; test still passes.
* docs(v0.2.2): Cycle 2/3 verification evidence report (G2)

G2's deliverable is proof — not claims — that v0.2.2's code actually closes every Cycle 2 / Cycle 3 finding attributed to RAG-Forge. This file is the artifact. One section per finding with:
- what the original audit saw (raw error signatures)
- why earlier cycles could not verify the fix (masking)
- what v0.2.2 changes to close the gap
- passing test output naming each test by ID

Covers:
- Cycle 2 #4 — OpenAIEmbeddings.embed_query AttributeError (FIXED)
- Cycle 2 #5 — max_tokens overflow on long responses (FIXED)
- Cycle 2 #6 — silent 0.0 coercion (FULLY FIXED)
- Cycle 3 C3-2 — RagForgeRagasLLM.generate missing (FIXED)
- Cycle 3 C3-5 — stale __version__ (FIXED)

Closes with a falsifiability section naming the concrete error signatures that, if they reappear in a Cycle 4 run against real PearMedica telemetry, would falsify the report and justify holding the release. The release bar is "Cycle 4 produces non-empty metrics OR skip records with real error signatures — not empty both." Combined test suite run shows 31/31 passing on the rebased G1+G3+G5 branch.

* docs(v0.2.2): markdown polish on G2 verification report

Addresses CodeRabbit findings on PR #40.
- Added language specifiers to every fenced code block (10 blocks, all previously bare ```). text for error traces and pytest output, bash for the shell command block. Silences markdownlint MD040.
- Reformatted the inline 7-method enumeration in the Finding #4 section as a proper bulleted list. The methods now render one per line instead of as a comma-separated prose run, which was hard to scan.

* docs(v0.2.2): clarify version-fix scoping + US spelling on G2 report

CodeRabbit on PR #40 round 2 raised two findings on the C3-5 section:

1. The evidence snippet showing __version__ = "0.1.0" vs pyproject version = "0.2.1" could be read as a current-state claim rather than a pre-fix snapshot.
Added an explicit "pre-v0.2.2 state" header and a paragraph scoping the snippet so nobody reads it as the repo's current state.

2. "parametrized" -> "parameterized" for US spelling consistency.

Also updated the description to reflect G5's round-2 move to auto-discovery so the report matches the actual test impl.
Summary
- @rag-forge/shared: Extracted Python bridge into a reusable package, used by both CLI and MCP
- rag_query, rag_audit, rag_status now call real Python pipelines via the bridge (no more stubs)
- server.ts + main.ts with stdio transport for Claude Code integration
- serve --mcp CLI command: Launch MCP server via rag-forge serve --mcp
- status subcommand: Reports collection state (indexed, chunk count)

Quality
Test plan
- rag-forge serve --mcp launches MCP server
- rag-forge init hybrid scaffolds correct files
- rag-forge init enterprise scaffolds with CI workflow

🤖 Generated with Claude Code