
feat: Phase 2D — MCP Server Wiring + Templates#5

Merged
hallengray merged 10 commits into main from feat/phase2d-mcp-templates
Apr 12, 2026

Conversation

@hallengray (Owner)

Summary

  • Shared bridge package (@rag-forge/shared): Extracted Python bridge into a reusable package, used by both CLI and MCP
  • MCP tools wired: rag_query, rag_audit, rag_status now call real Python pipelines via the bridge (no more stubs)
  • MCP server entry point: server.ts + main.ts with stdio transport for Claude Code integration
  • serve --mcp CLI command: Launch MCP server via rag-forge serve --mcp
  • Python status subcommand: Reports collection state (indexed, chunk count)
  • Hybrid template: 6-file scaffold with BM25 + reranking + contextual enrichment config
  • Enterprise template: 7-file scaffold with security guards + CI/CD gate workflow
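  The tool-to-bridge handoff described above can be sketched as follows. This is a minimal illustration, assuming a `{ stdout, stderr, exitCode }` result shape inferred from this PR's description — it is not the published `@rag-forge/shared` API.

  ```typescript
  // Assumed result shape of the shared Python bridge (an assumption based
  // on this PR, not the actual @rag-forge/shared types).
  interface BridgeResult {
    stdout: string;
    stderr: string;
    exitCode: number;
  }

  // Map a bridge result to a tool response: stdout on success; on failure,
  // prefer structured stdout (the Python CLI may emit JSON errors there),
  // then fall back to stderr.
  function toToolOutput(result: BridgeResult): { isError: boolean; text: string } {
    if (result.exitCode === 0) {
      return { isError: false, text: result.stdout };
    }
    return { isError: true, text: result.stdout || result.stderr || "python bridge failed" };
  }
  ```

  Keeping this mapping in one small pure function makes the success/failure contract easy to share across the `rag_query`, `rag_audit`, and `rag_status` handlers.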

Quality

  • 273 Python tests passing (190 core + 83 evaluator)
  • Ruff, mypy, ESLint, TypeScript typecheck all clean
  • 3 TS packages build successfully (shared, cli, mcp)

Test plan

  • All Python tests pass
  • All TS packages build and typecheck
  • rag-forge serve --mcp launches MCP server
  • rag-forge init hybrid scaffolds correct files
  • rag-forge init enterprise scaffolds with CI workflow

🤖 Generated with Claude Code


coderabbitai bot commented Apr 12, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: aa46445d-3af1-40bc-8057-d727a601e511

📥 Commits

Reviewing files that changed from the base of the PR and between a216fe1 and f12958a.

📒 Files selected for processing (11)
  • packages/cli/src/commands/serve.ts
  • packages/mcp/src/tools/rag-audit.ts
  • packages/mcp/src/tools/rag-query.ts
  • packages/mcp/src/tools/rag-status.ts
  • packages/shared/src/python-bridge.ts
  • templates/enterprise/project/README.md
  • templates/enterprise/project/pyproject.toml
  • templates/enterprise/project/src/config.py
  • templates/hybrid/project/README.md
  • templates/hybrid/project/pyproject.toml
  • templates/hybrid/project/src/config.py

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting


Summary by CodeRabbit

  • New Features

    • New CLI: added serve command to run the MCP server and status command to report indexing state
    • MCP server entrypoint and runtime startup flow
    • Tools now invoke Python-based query/audit/status flows
  • Documentation

    • Added Enterprise and Hybrid pipeline templates, docs, quick-starts, evaluation configs and golden sets
    • CI quality-gate workflow for RAG audits
  • Chores

    • Introduced shared utilities package and consolidated Python bridge
    • Build/workspace updates (multi-entry build, workspace package, typecheck ordering)

Walkthrough

Extracts a Python-bridge into a new @rag-forge/shared package; refactors CLI and MCP to use it; adds an MCP runtime entrypoint and serve CLI command that spawns the MCP process; replaces MCP tool stubs with Python module invocations; adds a status Python CLI subcommand; and introduces enterprise/hybrid templates with evaluation configs and golden sets.
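
The serve-command spawn described in this walkthrough might look like the sketch below, folded together with the reviewer's suggestions later in this thread (use `process.execPath`, propagate signals, treat a missing exit code as failure). Function names here are illustrative, not the PR's actual symbols.

```typescript
import { spawn, type ChildProcess } from "node:child_process";

// Pure helper: decide the parent's exit behavior from the child's (code, signal).
function exitPlan(
  code: number | null,
  signal: NodeJS.Signals | null,
): { kind: "signal"; signal: NodeJS.Signals } | { kind: "code"; code: number } {
  if (signal) return { kind: "signal", signal };
  // A null code with no signal is unexpected; fail rather than mask it as success.
  return { kind: "code", code: code ?? 1 };
}

function launchMcp(mcpMain: string): ChildProcess {
  // Use the parent's own Node binary instead of relying on "node" being in PATH.
  const child = spawn(process.execPath, [mcpMain], { stdio: "inherit" });
  child.on("exit", (code, signal) => {
    const plan = exitPlan(code, signal);
    if (plan.kind === "signal") process.kill(process.pid, plan.signal); // re-raise
    else process.exit(plan.code);
  });
  return child;
}
```

Separating the exit decision into `exitPlan` keeps the signal/code policy unit-testable without actually spawning a process.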

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Shared package** — `packages/shared/package.json`, `packages/shared/src/python-bridge.ts`, `packages/shared/tsconfig.json`, `packages/shared/tsup.config.ts` | New package `@rag-forge/shared` implementing `runPythonModule` and `checkPythonAvailable` (types + build config). |
| **Workspace & deps** — `pnpm-workspace.yaml`, `packages/cli/package.json`, `packages/mcp/package.json` | Added `packages/shared` to the workspace; cli and mcp now depend on `@rag-forge/shared` (removed local execa from CLI). |
| **CLI integration** — `packages/cli/src/lib/python-bridge.ts`, `packages/cli/src/commands/serve.ts`, `packages/cli/src/index.ts` | CLI now re-exports the Python bridge from shared and registers a new serve command, which validates `--mcp`, resolves the MCP main path, checks file existence, and spawns a node process with inherited stdio. |
| **MCP runtime & build** — `packages/mcp/src/main.ts`, `packages/mcp/src/server.ts`, `packages/mcp/tsup.config.ts` | Added a `main.ts` entrypoint that calls the exported `startServer()`; `startServer()` creates a server and connects a `StdioServerTransport`; `main.ts` added to tsup entries. |
| **MCP tools: Python execution** — `packages/mcp/src/tools/rag-query.ts`, `packages/mcp/src/tools/rag-audit.ts`, `packages/mcp/src/tools/rag-status.ts` | Replaced "not_implemented" stubs with calls to `runPythonModule(...)`; return stdout on success or structured/error output on failure; arguments constructed from tool inputs. |
| **Python core CLI** — `packages/core/src/rag_forge_core/cli.py` | Added a `status` subcommand (`--collection`) that queries the Qdrant count, handles ValueError/KeyError, and emits JSON with `indexed`, `collection`, and `chunk_count`. |
| **Templates — Enterprise** — `templates/enterprise/*`, `templates/enterprise/project/*` (README, pyproject.toml, src/config.py, src/pipeline.py, eval/*, .github/workflows/rag-audit.yml) | New enterprise template: docs, pipeline stubs, config dataclass, evaluation config and golden set, and a GitHub Actions audit gate workflow. |
| **Templates — Hybrid** — `templates/hybrid/*`, `templates/hybrid/project/*` (README, pyproject.toml, src/config.py, src/pipeline.py, eval/*) | New hybrid template: docs, pipeline stubs, config dataclass, evaluation config and golden set. |
| **Root changes** — `eval/golden_set.json`, `turbo.json` | Added a root evaluation golden set; the typecheck task now depends on `^build`. |
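
The `turbo.json` tweak noted in the root-changes row might look like this sketch (the top-level key — `pipeline` vs. `tasks` — depends on the Turborepo version in use):

```json
{
  "pipeline": {
    "typecheck": {
      "dependsOn": ["^build"]
    }
  }
}
```

With `^build` in `dependsOn`, dependency packages such as `@rag-forge/shared` are built before the CLI and MCP typechecks run.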

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant NodeProcess as "MCP Node Process"
    participant MCP as "MCP startServer"
    participant Stdio as "StdioServerTransport"
    participant Tool as "MCP Tool Handler"
    participant Python as "Python Module"

    User->>CLI: rag-forge serve --mcp <path>
    CLI->>NodeProcess: spawn node <mcp/dist/main.js>
    NodeProcess->>MCP: invoke startServer()
    MCP->>Stdio: create transport
    MCP->>Stdio: connect(transport)
    Stdio-->>MCP: connected

    User->>Stdio: invoke rag_query input
    Stdio->>Tool: handleRagQuery(input)
    Tool->>Python: runPythonModule("rag_forge_core.cli", ["query", ...])
    Python-->>Tool: {stdout, stderr, exitCode}
    alt exitCode == 0
        Tool-->>Stdio: return stdout
    else
        Tool-->>Stdio: return error payload (stderr or message)
    end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I tunneled through packages, shared and new,
Spawned a server, let the MCP brew.
Tools now call Python, templates take flight,
Status and serve keep the pipeline bright.
Hop, hop—RAG-Forge grows overnight! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 46.67%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: introducing MCP server wiring and templates (Hybrid and Enterprise) as part of Phase 2D development. |
| Description check | ✅ Passed | The description is directly related to the changeset, detailing the shared bridge package, MCP tool wiring, server entry point, CLI serve command, Python status subcommand, and template scaffolds with quality metrics. |



Comment @coderabbitai help to get the list of available commands and usage tips.

hallengray and others added 8 commits April 12, 2026 11:45
Extracts the Python subprocess bridge into a standalone workspace package
so both the CLI and MCP server can share a single implementation without
duplication. Exposes runPythonModule and checkPythonAvailable via ESM
exports with full TypeScript declarations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Removes the inline execa dependency from the CLI and replaces the
full python-bridge implementation with a thin re-export of the shared
package, ensuring a single source of truth for the subprocess bridge
across all workspace consumers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the stub not-implemented handlers in rag-query, rag-audit, and
rag-status with real implementations that delegate to Python via the
shared runPythonModule bridge. Adds @rag-forge/shared as a workspace
dependency so the MCP server shares the same subprocess layer as the CLI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…gate

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@hallengray hallengray force-pushed the feat/phase2d-mcp-templates branch from 436c2f4 to 78a7081 Compare April 12, 2026 10:46
… gate

- turbo.json: typecheck now depends on ^build so @rag-forge/shared is
  built before CLI/MCP typecheck runs
- eval/golden_set.json: root-level golden set for CI audit gate

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 18

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/cli/src/commands/serve.ts`:
- Around line 7-10: The getMcpMainPath function currently hardcodes upward
traversal to locate mcp/dist/main.js which breaks outside the monorepo; change
getMcpMainPath to resolve the target using Node's module resolution (e.g., via
createRequire(import.meta.url).resolve or require.resolve('mcp/dist/main.js'))
and fall back to a configurable override (env var or CLI option) if resolution
fails, and ensure you catch and surface resolution errors from getMcpMainPath so
callers can handle packaged/global installs robustly.
- Around line 27-29: The spawn call currently uses the literal "node" which can
break in PATH-restricted environments; change it to use process.execPath when
spawning the child in serve.ts so the child uses the same Node binary as the
parent (update the spawn invocation that creates child from spawn("node",
[mcpMain], { stdio: "inherit" }) to spawn(process.execPath, [mcpMain], { stdio:
"inherit" })). Ensure any tests or wrappers referencing the child variable
remain compatible.
- Around line 36-38: The exit handler for child.on("exit", (code) => ...)
ignores the signal parameter and treats null codes as success; update the
handler signature to accept both (code, signal), propagate signal-terminated
exits by re-emitting the same signal to the parent (e.g.,
process.kill(process.pid, signal)) when signal is set, and fall back to exiting
with a non-zero code by changing process.exit(code ?? 0) to process.exit(code ??
1) when no signal is present; update the child.on("exit", ...) callback
accordingly so signal and code are both handled.

In `@packages/core/src/rag_forge_core/cli.py`:
- Around line 260-265: The except branch around store.count(collection) in
cmd_status is dead because QdrantStore.count() already handles errors and
returns 0; remove the try/except and simplify to a direct call: set count =
store.count(collection) and indexed = count > 0. Update the block that currently
contains the try/except (referencing store.count and collection in cmd_status)
to use this simplified logic so the ValueError/KeyError branch is removed.

In `@packages/mcp/src/tools/rag-audit.ts`:
- Around line 15-19: The code builds CLI args in rag-audit.ts (const args =
["audit", "--judge", "mock"]; and subsequent pushes) but ignores input.metrics
from the tool schema; either wire input.metrics into the args before invoking
the evaluator or explicitly reject it. Update the args construction to map
input.metrics to the appropriate CLI flag(s) (e.g., push "--metrics" and the
metrics value or serialize as needed) when input.metrics is present, or add a
validation step that throws or returns an error if input.metrics is set to
prevent silent input loss; adjust the evaluator invocation that consumes args
accordingly so the metrics are actually passed through.

In `@packages/mcp/src/tools/rag-query.ts`:
- Around line 28-32: The error branch in the rag-query result handling (where
result is checked with result.exitCode !== 0) currently only returns
result.stderr, which can drop structured JSON error payloads emitted to stdout;
update the failure path in the function handling the child process result to
prefer and preserve result.stdout (parse it if JSON) before falling back to
result.stderr, and return the preserved stdout JSON/error message (or include
both stdout and stderr) so callers of the module get the CLI's structured error;
look for symbols result and the exitCode check in
packages/mcp/src/tools/rag-query.ts to locate the change.

In `@packages/shared/src/python-bridge.ts`:
- Around line 24-28: The return currently coerces undefined exit codes to 0
which mislabels signaled/canceled processes as success; change the mapping of
result.exitCode so that undefined is treated as a failure (e.g., map undefined
to a non‑zero value or propagate undefined explicitly and let callers treat it
as failure) instead of defaulting to 0 by replacing the `result.exitCode ?? 0`
logic; also add a sane timeout to the execa calls (the invocations using execa
shown earlier in this file) and ensure timeout/cancelled outcomes populate
stderr/exitCode consistently so callers (rag-query.ts, rag-audit.ts,
rag-status.ts) can detect failures.
- Around line 19-22: The execa calls in python-bridge.ts (the const result =
await execa(...) at the two invocation sites) run without a timeout; add a
timeout option (in milliseconds) to both execa option objects so stuck Python/uv
processes are killed automatically (e.g., timeout:
Number(process.env.PYTHON_BRIDGE_TIMEOUT_MS) || 30000). Update both calls that
invoke execa (the one at the shown line and the other around line 41) to include
this timeout setting in their options to prevent hanging tool requests.

In `@templates/enterprise/project/.github/workflows/rag-audit.yml`:
- Around line 30-34: Replace the hardcoded THRESHOLD with the value from the
generated report: read THRESHOLD from reports/audit-report.json (e.g.
THRESHOLD=$(jq -r '.metrics.faithfulness.threshold // "0.85"'
reports/audit-report.json)) and keep the rest of the comparison using SCORE and
THRESHOLD as before; ensure jq extraction returns a numeric string or provide a
sensible fallback so the existing bc comparison works.

In `@templates/enterprise/project/eval/config.yaml`:
- Around line 4-21: The template includes unsupported keys ("metrics",
"ci_gate", "golden_set") that the evaluator's config model (AuditConfig and the
CLI wiring) does not consume; either remove these keys from the template or
adapt the config parsing to populate AuditConfig from them. Fix by updating the
template to only include fields that AuditConfig expects (remove "metrics",
"ci_gate", "golden_set" and keep only supported threshold entries), or
alternatively update the CLI/config loader (the AuditConfig parsing logic) to
accept and map "metrics", "ci_gate", and "golden_set" into the corresponding
AuditConfig properties so the declared gate truly configures the evaluator.
Ensure references to AuditConfig and the config loader/parsing function are
updated to reflect the chosen approach.

In `@templates/enterprise/project/eval/golden_set.json`:
- Line 4: The expected_answer_keywords array in golden_set.json uses overly
broad tokens ("topic", "document") which allow false-positive matches; update
the "expected_answer_keywords" value(s) to use more specific phrases or entities
that reflect the true expected content (e.g., "primary research topic", "source
document title", "document ID", "document summary") so the evaluation checks
substantive overlap—locate the "expected_answer_keywords" key in the JSON entry
and replace the generic tokens with targeted keyword phrases relevant to the
question/answer pair.

In `@templates/enterprise/project/pyproject.toml`:
- Around line 22-23: The template sets reranker = "cohere" which triggers
RetrievalConfig validation requiring cohere_api_key; change the default reranker
in the template (the reranker field in
templates/enterprise/project/pyproject.toml) to a safe empty/none value (or
remove the reranker entry) so generated projects do not demand cohere_api_key at
load time; ensure the change references the reranker setting and leaves enrich =
true untouched so only the reranker default is modified.

In `@templates/enterprise/project/README.md`:
- Around line 15-18: The Quick Start should add an explicit credential setup
step before running commands like `rag-forge index`, `rag-forge query`, and
`rag-forge audit` because the template uses provider-backed
reranking/generation; update the README to instruct users to export or configure
required API keys (e.g., set provider-specific env vars or run the
project-specific credential setup command) and show a short example of the
expected variables or config (so first-run commands don't fail due to missing
keys).

In `@templates/enterprise/project/src/config.py`:
- Around line 23-25: PipelineConfig currently declares reranker and cohere_model
but lacks a Cohere credential field, so add a cohere_api_key attribute to
PipelineConfig (same style/type as other config fields, e.g., Optional[str] =
None or str = "") so users can supply the API key when reranker == "cohere";
update any validation/usage sites that read PipelineConfig.cohere_api_key to
handle missing/empty keys accordingly (look for class PipelineConfig, attributes
reranker and cohere_model to locate where to add the new cohere_api_key field).

In `@templates/hybrid/project/eval/config.yaml`:
- Around line 4-21: The template exposes keys (metrics, ci_gate, golden_set)
that AuditConfig in packages/evaluator/src/rag_forge_evaluator/audit.py and the
CLI in packages/evaluator/src/rag_forge_evaluator/cli.py don’t handle; either
remove them from the template or make the evaluator read and enforce them—update
AuditConfig to add typed fields for metrics: List[str], ci_gate: {metric: str,
threshold: float, block_on_failure: bool}, and golden_set: Optional[str], update
the CLI parsing in cli.py to load those fields from the config JSON, and
implement enforcement logic that uses AuditConfig.ci_gate to fail/exit when gate
conditions are not met and loads AuditConfig.golden_set when present to evaluate
golden examples.

In `@templates/hybrid/project/README.md`:
- Around line 15-17: The quick-start query command is missing the required
hybrid retrieval flag; update the example for the `rag-forge query` invocation
to include the `--sparse-index-path` argument (matching the earlier index step)
so hybrid mode runs correctly—modify the line with the `rag-forge query "your
question" --strategy hybrid --reranker cohere` example to add
`--sparse-index-path .rag-forge/sparse` so the documented flow is runnable.

In `@templates/hybrid/project/src/config.py`:
- Around line 23-25: PipelineConfig is missing a Cohere credential field while
defaulting reranker="cohere"; add a cohere_api_key field to the PipelineConfig
definition (e.g., cohere_api_key: Optional[str] = None) and import
typing.Optional if needed so users can set the Cohere API key alongside reranker
and cohere_model; update any code that constructs PipelineConfig (or reads its
fields) to use this new cohere_api_key when initializing the Cohere client.

In `@turbo.json`:
- Around line 17-19: The "typecheck" pipeline currently has "dependsOn":
["^build"] which only depends on build tasks in dependency packages; update the
"typecheck" pipeline's dependsOn to include the local build by adding "build" to
the array (i.e., "dependsOn": ["build", "^build"]) so the local build task runs
before the "typecheck" task; locate the "typecheck" entry and modify its
dependsOn accordingly.
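
The shared-bridge hardening described above (bound the subprocess with a timeout; never coerce a missing exit code to success) can be sketched like this. The real bridge uses execa; Node's built-in `spawnSync` is used here so the sketch stays self-contained, and the env-var name and default are assumptions.

```typescript
import { spawnSync } from "node:child_process";

// Assumed override knob; the PR does not define this variable.
const BRIDGE_TIMEOUT_MS = Number(process.env.PYTHON_BRIDGE_TIMEOUT_MS) || 30_000;

function runModule(cmd: string, args: string[], timeoutMs = BRIDGE_TIMEOUT_MS) {
  const result = spawnSync(cmd, args, { encoding: "utf8", timeout: timeoutMs });
  return {
    stdout: result.stdout ?? "",
    stderr: result.signal
      ? `process killed by ${result.signal} (timed out after ${timeoutMs}ms?)`
      : result.stderr ?? "",
    // status is null when the child was signaled or timed out; map that to
    // failure (1) instead of defaulting to 0.
    exitCode: result.status ?? 1,
  };
}
```

Callers such as `rag-query.ts` can then treat any non-zero `exitCode` uniformly, whether the Python process failed, hung past the timeout, or was killed by a signal.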

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 50e6ee76-0208-40ed-b836-adc9d5845aa5

📥 Commits

Reviewing files that changed from the base of the PR and between 70ce9be and a216fe1.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (34)
  • eval/golden_set.json
  • packages/cli/package.json
  • packages/cli/src/commands/serve.ts
  • packages/cli/src/index.ts
  • packages/cli/src/lib/python-bridge.ts
  • packages/core/src/rag_forge_core/cli.py
  • packages/mcp/package.json
  • packages/mcp/src/main.ts
  • packages/mcp/src/server.ts
  • packages/mcp/src/tools/rag-audit.ts
  • packages/mcp/src/tools/rag-query.ts
  • packages/mcp/src/tools/rag-status.ts
  • packages/mcp/tsup.config.ts
  • packages/shared/package.json
  • packages/shared/src/python-bridge.ts
  • packages/shared/tsconfig.json
  • packages/shared/tsup.config.ts
  • pnpm-workspace.yaml
  • templates/enterprise/README.md
  • templates/enterprise/project/.github/workflows/rag-audit.yml
  • templates/enterprise/project/README.md
  • templates/enterprise/project/eval/config.yaml
  • templates/enterprise/project/eval/golden_set.json
  • templates/enterprise/project/pyproject.toml
  • templates/enterprise/project/src/config.py
  • templates/enterprise/project/src/pipeline.py
  • templates/hybrid/README.md
  • templates/hybrid/project/README.md
  • templates/hybrid/project/eval/config.yaml
  • templates/hybrid/project/eval/golden_set.json
  • templates/hybrid/project/pyproject.toml
  • templates/hybrid/project/src/config.py
  • templates/hybrid/project/src/pipeline.py
  • turbo.json
📜 Review details
🧰 Additional context used
🪛 LanguageTool
templates/enterprise/project/README.md

[uncategorized] ~22-~22: The official name of this software platform is spelled with a capital “H”.
Context: ...olden_set.json` ## CI/CD The included `.github/workflows/rag-audit.yml` runs evaluatio...

(GITHUB)

🔇 Additional comments (21)
packages/mcp/src/server.ts (1)

4-7: Startup wiring is clean and correct.

startServer() correctly creates the MCP server and awaits stdio transport connection.

packages/cli/src/index.ts (1)

6-6: Serve command registration is integrated correctly.

Import and registration match the existing command pattern.

Also applies to: 21-21

packages/mcp/src/main.ts (1)

3-6: Entrypoint failure handling is solid.

Top-level rejection handling logs context and exits non-zero as expected.

packages/mcp/tsup.config.ts (1)

4-4: Build entry expansion is correct.

Adding src/main.ts to tsup entries matches the new MCP runtime entrypoint.

templates/hybrid/README.md (1)

5-5: Usage instruction is clear and immediately actionable.

Line 5 gives a concrete command users can run, which is better than a phase-status placeholder.

pnpm-workspace.yaml (1)

2-2: Workspace package registration looks correct.

Line 2 correctly adds packages/shared, enabling workspace linking for @rag-forge/shared.

packages/shared/tsconfig.json (1)

1-9: TypeScript package config is well-scoped.

rootDir, outDir, and include/exclude are consistent for building packages/shared cleanly.

templates/enterprise/README.md (2)

3-3: Enterprise positioning text remains coherent.

Line 3 still communicates security-focused, gated deployment intent clearly.


5-5: Direct initialization command is a good UX improvement.

Line 5 provides a runnable command instead of release-phase wording.

packages/mcp/package.json (1)

23-23: Dependency addition is correct for shared bridge reuse.

Line 23 properly declares @rag-forge/shared for MCP runtime integration.

packages/cli/package.json (1)

25-25: CLI dependency wiring to shared package looks correct.

Line 25 correctly links the CLI to @rag-forge/shared via workspace dependency.

templates/hybrid/project/eval/golden_set.json (1)

1-14: Golden set structure is consistent and valid.
This dataset is well-formed and suitable as a starter template.

packages/shared/tsup.config.ts (1)

3-10: Build config aligns with the package export contract.
The entry/output settings are coherent with the shared bridge packaging goals.

eval/golden_set.json (1)

1-20: Starter golden set is well-structured.
Entries are consistent and usable for baseline evaluation flows.

packages/mcp/src/tools/rag-status.ts (1)

4-16: Good wiring from MCP tool to Python status command.
Line 4–Line 16 correctly delegates execution and handles error/success paths cleanly.

templates/hybrid/project/pyproject.toml (1)

1-23: Hybrid template manifest is coherent and complete for initial scaffolding.
Project/dependency metadata and tool.rag-forge settings are internally consistent.

packages/shared/package.json (1)

1-29: Package manifest wiring looks correct.

Export map, scripts, engine floor, and dependency split are coherent for a shared ESM utility package.

packages/core/src/rag_forge_core/cli.py (1)

324-333: Status subcommand wiring is clean.

Parser registration and command dispatch are correctly integrated into main().

packages/cli/src/lib/python-bridge.ts (1)

1-8: Re-export consolidation looks good.

This keeps CLI bridge imports stable while centralizing implementation in @rag-forge/shared.

templates/hybrid/project/src/pipeline.py (1)

6-36: Template scaffold is clear and usable.

The placeholder API shape and CLI hinting are coherent for generated starter code.

templates/enterprise/project/src/pipeline.py (1)

6-36: Enterprise template scaffold is consistent.

The generated placeholders and invocation pattern are aligned with starter-template expectations.

Comment on lines +7 to +10
function getMcpMainPath(): string {
  const currentDir = fileURLToPath(new URL(".", import.meta.url));
  return resolve(currentDir, "..", "..", "..", "mcp", "dist", "main.js");
}

⚠️ Potential issue | 🟠 Major

Path resolution is monorepo-coupled and can break outside this layout.

Line 9 hardcodes traversal to a sibling mcp/dist/main.js. This is fragile for packaged/global installs and any altered build layout.

🔧 Proposed fix
+import { existsSync } from "node:fs";
 import { resolve } from "node:path";
 import { fileURLToPath } from "node:url";
@@
 function getMcpMainPath(): string {
   const currentDir = fileURLToPath(new URL(".", import.meta.url));
-  return resolve(currentDir, "..", "..", "..", "mcp", "dist", "main.js");
+  const candidates = [
+    process.env.RAG_FORGE_MCP_MAIN,
+    resolve(currentDir, "..", "..", "..", "mcp", "dist", "main.js"),
+  ].filter((p): p is string => Boolean(p));
+
+  const resolved = candidates.find((p) => existsSync(p));
+  if (!resolved) {
+    throw new Error(
+      "Unable to locate MCP entrypoint. Set RAG_FORGE_MCP_MAIN to the built main.js path.",
+    );
+  }
+  return resolved;
 }

Comment on lines +36 to +38
child.on("exit", (code) => {
process.exit(code ?? 0);
});

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect current exit handler arity and fallback behavior
rg -n -C2 'child\.on\("exit"|process\.exit\(code \?\? 0\)' packages/cli/src/commands/serve.ts

Repository: hallengray/rag-forge

Length of output: 178


🏁 Script executed:

cat packages/cli/package.json | head -50

Repository: hallengray/rag-forge

Length of output: 1079


🏁 Script executed:

fd -type f -name "*.test.*" -o -name "*.spec.*" | grep -i serve

Repository: hallengray/rag-forge

Length of output: 234


🏁 Script executed:

rg "child\.on\(.*exit" --type ts --type js

Repository: hallengray/rag-forge

Length of output: 133


🏁 Script executed:

find packages/cli -type f \( -name "*.test.*" -o -name "*.spec.*" \)

Repository: hallengray/rag-forge

Length of output: 98


🏁 Script executed:

cat -n packages/cli/src/commands/serve.ts | head -60

Repository: hallengray/rag-forge

Length of output: 1521


🏁 Script executed:

cat -n packages/cli/__tests__/cli.test.ts

Repository: hallengray/rag-forge

Length of output: 1384


Signal-terminated child processes incorrectly exit with success code.

The exit handler only uses the code parameter and ignores signal. When a child process is terminated by a signal, code is null, so code ?? 0 exits the parent with code 0 (success), masking interrupt and termination signals like SIGTERM or SIGKILL.

Node.js >=20.0.0 is required, which provides both code and signal to the exit handler. Handle signals explicitly to propagate them:

Fix
-      child.on("exit", (code) => {
-        process.exit(code ?? 0);
+      child.on("exit", (code, signal) => {
+        if (signal) {
+          process.kill(process.pid, signal);
+          return;
+        }
+        process.exit(code ?? 1);
       });

Also change ?? 0 to ?? 1 to avoid masking unexpected null exit codes as success.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/cli/src/commands/serve.ts` around lines 36 - 38, The exit handler
for child.on("exit", (code) => ...) ignores the signal parameter and treats null
codes as success; update the handler signature to accept both (code, signal),
propagate signal-terminated exits by re-emitting the same signal to the parent
(e.g., process.kill(process.pid, signal)) when signal is set, and fall back to
exiting with a non-zero code by changing process.exit(code ?? 0) to
process.exit(code ?? 1) when no signal is present; update the child.on("exit",
...) callback accordingly so signal and code are both handled.

Comment on lines +260 to +265
    try:
        count = store.count(collection)
        indexed = count > 0
    except (ValueError, KeyError):
        count = 0
        indexed = False

🧹 Nitpick | 🔵 Trivial

cmd_status exception branch is effectively dead code.

QdrantStore.count() already swallows exceptions and returns 0, so this except (ValueError, KeyError) path is not doing real work. Simplifying this block will make status behavior clearer.

♻️ Proposed simplification
 def cmd_status(args: argparse.Namespace) -> None:
     """Check pipeline status."""
     collection = args.collection or "rag-forge"
     store = QdrantStore()
-    try:
-        count = store.count(collection)
-        indexed = count > 0
-    except (ValueError, KeyError):
-        count = 0
-        indexed = False
+    count = store.count(collection)
+    indexed = count > 0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    try:
        count = store.count(collection)
        indexed = count > 0
    except (ValueError, KeyError):
        count = 0
        indexed = False
def cmd_status(args: argparse.Namespace) -> None:
    """Check pipeline status."""
    collection = args.collection or "rag-forge"
    store = QdrantStore()
    count = store.count(collection)
    indexed = count > 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/rag_forge_core/cli.py` around lines 260 - 265, The except
branch around store.count(collection) in cmd_status is dead because
QdrantStore.count() already handles errors and returns 0; remove the try/except
and simplify to a direct call: set count = store.count(collection) and indexed =
count > 0. Update the block that currently contains the try/except (referencing
store.count and collection in cmd_status) to use this simplified logic so the
ValueError/KeyError branch is removed.

Comment on lines +15 to +19
const args = ["audit", "--judge", "mock"];

if (input.golden_set_path) {
  args.push("--golden-set", input.golden_set_path);
}

⚠️ Potential issue | 🟠 Major

Do not silently ignore input.metrics.
Line 15–Line 19 build CLI args without using input.metrics, even though the tool schema accepts it. Either wire this field into the evaluator invocation or reject it explicitly to avoid a misleading API contract.

Proposed guard to avoid silent input loss
 export async function handleRagAudit(input: RagAuditInput): Promise<string> {
+  if (input.metrics?.length) {
+    return JSON.stringify({
+      status: "error",
+      message: "The 'metrics' filter is not supported by rag_forge_evaluator.cli yet",
+    });
+  }
+
   const args = ["audit", "--judge", "mock"];
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const args = ["audit", "--judge", "mock"];

if (input.golden_set_path) {
  args.push("--golden-set", input.golden_set_path);
}
if (input.metrics?.length) {
  return JSON.stringify({
    status: "error",
    message: "The 'metrics' filter is not supported by rag_forge_evaluator.cli yet",
  });
}

const args = ["audit", "--judge", "mock"];

if (input.golden_set_path) {
  args.push("--golden-set", input.golden_set_path);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/mcp/src/tools/rag-audit.ts` around lines 15 - 19, The code builds
CLI args in rag-audit.ts (const args = ["audit", "--judge", "mock"]; and
subsequent pushes) but ignores input.metrics from the tool schema; either wire
input.metrics into the args before invoking the evaluator or explicitly reject
it. Update the args construction to map input.metrics to the appropriate CLI
flag(s) (e.g., push "--metrics" and the metrics value or serialize as needed)
when input.metrics is present, or add a validation step that throws or returns
an error if input.metrics is set to prevent silent input loss; adjust the
evaluator invocation that consumes args accordingly so the metrics are actually
passed through.

Comment thread templates/enterprise/project/src/config.py
Comment on lines +4 to +21
metrics:
  - context_relevance
  - faithfulness
  - answer_relevance
  - hallucination

thresholds:
  context_relevance: 0.80
  faithfulness: 0.85
  answer_relevance: 0.80
  hallucination_rate: 0.05

ci_gate:
  metric: faithfulness
  threshold: 0.85
  block_on_failure: true

golden_set: eval/golden_set.json

⚠️ Potential issue | 🟠 Major

metrics/ci_gate/golden_set appear non-functional with current evaluator config model.
Line 4–Line 21 define keys that are not represented in packages/evaluator/src/rag_forge_evaluator/audit.py AuditConfig, while packages/evaluator/src/rag_forge_evaluator/cli.py only reads thresholds from config JSON. This can mislead template users into thinking these gates are enforced when they are not.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@templates/hybrid/project/eval/config.yaml` around lines 4 - 21, The template
exposes keys (metrics, ci_gate, golden_set) that AuditConfig in
packages/evaluator/src/rag_forge_evaluator/audit.py and the CLI in
packages/evaluator/src/rag_forge_evaluator/cli.py don’t handle; either remove
them from the template or make the evaluator read and enforce them—update
AuditConfig to add typed fields for metrics: List[str], ci_gate: {metric: str,
threshold: float, block_on_failure: bool}, and golden_set: Optional[str], update
the CLI parsing in cli.py to load those fields from the config JSON, and
implement enforcement logic that uses AuditConfig.ci_gate to fail/exit when gate
conditions are not met and loads AuditConfig.golden_set when present to evaluate
golden examples.

Comment thread templates/hybrid/project/README.md
Comment thread templates/hybrid/project/src/config.py
Comment thread turbo.json
Comment on lines +17 to +19
"typecheck": {
  "dependsOn": ["^build"]
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

In Turborepo task configuration, does "dependsOn": ["^build"] include the current package's build task, or only dependency packages' build tasks?

💡 Result:

"dependsOn": ["^build"] includes only dependency packages' build tasks, not the current package's build task. The ^ prefix specifically means "run the task in direct dependencies before the target package." To depend on the current package's own build task (e.g., for a test task), use "build" without the ^ prefix.

typecheck should explicitly depend on the local build task.

In Turborepo, "dependsOn": ["^build"] targets only the build tasks in dependency packages, not the current package's build task. To ensure typecheck runs after the local build task completes, add "build" to the dependsOn array.

Proposed fix
    "typecheck": {
-      "dependsOn": ["^build"]
+      "dependsOn": ["build", "^build"]
    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"typecheck": {
  "dependsOn": ["^build"]
}
"typecheck": {
  "dependsOn": ["build", "^build"]
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@turbo.json` around lines 17 - 19, The "typecheck" pipeline currently has
"dependsOn": ["^build"] which only depends on build tasks in dependency
packages; update the "typecheck" pipeline's dependsOn to include the local build
by adding "build" to the array (i.e., "dependsOn": ["build", "^build"]) so the
local build task runs before the "typecheck" task; locate the "typecheck" entry
and modify its dependsOn accordingly.

- serve.ts: validate MCP path exists before spawning, use process.execPath
- rag-query/audit/status: preserve stdout on error for partial results
- python-bridge: add 5min timeout, handle signal-terminated processes
- enterprise/hybrid templates: default reranker to "none", add cohere_api_key field
- hybrid README: fix query command args, add prerequisites section
- enterprise README: add prerequisites section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hallengray hallengray merged commit 5b2babc into main Apr 12, 2026
2 of 3 checks passed
hallengray added a commit that referenced this pull request Apr 14, 2026
feat: Phase 2D — MCP Server Wiring + Templates
hallengray added a commit that referenced this pull request Apr 15, 2026
Cycle 3's PearMedica audit (2026-04-15) documented two residual skip-
handling gaps in v0.2.1, separate from the C3-2 generate() AttributeError:

1. EvaluationResult.skipped_evaluations (the integer counter the report's
   TL;DR "Skipped: N" line reads) was never set by RagasEvaluator.
   skipped_samples held real SkipRecords, but the counter stayed at 0.
   Users reading the top-level summary saw "Scored: 0, Skipped: 0" and
   thought nothing had happened when in reality every job had crashed.

2. Individual-metric extraction failures created one SkipRecord per
   metric name with sample_id="<aggregate>". A 12-sample x 4-metric run
   that failed entirely produced 4 records instead of 48 — the blast
   radius was under-reported by 12x.

Fix:
- Extract skip-record creation into _fan_out_skip_records() — fans a
  single coarse failure out into one record per (sample, metric) pair
  with the real sample_id attached.
- Both the whole-batch crash path and the per-metric extraction failure
  path now use it.
- Set EvaluationResult.skipped_evaluations = len(skipped_samples) on
  both return sites so the counter and the detail list can never drift
  apart silently again.
- Truncate reason strings to 400 chars (with trailing ellipsis) so long
  Python tracebacks don't blow up HTML/PDF rendering downstream.
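
The fan-out shape can be sketched in a few lines (an illustrative sketch, not the real rag_forge_evaluator code; the SkipRecord fields and the helper name are simplified stand-ins):

```python
from dataclasses import dataclass

MAX_REASON_LEN = 400

@dataclass
class SkipRecord:
    sample_id: str
    metric: str
    reason: str

def fan_out_skip_records(sample_ids, metric_names, reason):
    # One coarse failure becomes one record per (sample, metric) pair,
    # each carrying the real sample_id, with the reason truncated so
    # long tracebacks cannot blow up report rendering downstream.
    if len(reason) > MAX_REASON_LEN:
        reason = reason[:MAX_REASON_LEN] + "…"
    return [
        SkipRecord(sample_id=sid, metric=m, reason=reason)
        for sid in sample_ids
        for m in metric_names
    ]

records = fan_out_skip_records(
    [f"s{i}" for i in range(12)],
    ["context_relevance", "faithfulness", "answer_relevance", "hallucination"],
    "boom",
)
skipped_evaluations = len(records)  # counter derived from the same list
```

A 12-sample x 4-metric failure now yields 48 records with real sample IDs, and the counter cannot drift from the detail list because both come from the same value.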

Tests:
- test_ragas_skip_counter.py — three new tests covering:
  (a) skipped_evaluations counter equals len(skipped_samples),
  (b) whole-batch crashes fan out to sample_count * metric_count records
      with real sample_ids (not "<aggregate>"),
  (c) reason truncation at 400 chars.
- test_cycle2_regression.py — updated two assertions that pre-dated the
  fan-out change. The original assertions demanded "metrics must be
  populated"; with fan-out, a MockJudge run legitimately produces zero
  scored metrics and a full skip list. Tests now assert the absence of
  Finding #4 (embed_query AttributeError) and Finding #5
  (InstructorRetryException / max_tokens truncation) signatures in the
  skip records instead, which is the actual regression guard.
- test_evaluator_factory.py — gate test_ragas_not_installed_raises_on_evaluate
  with a reverse importorskip so it runs only on systems without the
  [ragas] extra. CI matrices with ragas installed now skip it instead
  of failing.

Pre-existing mypy arg-type errors on the ragas_evaluate() kwargs silenced
with targeted # type: ignore comments and a docstring note pointing
readers at the adapter contract test as the real interface guard — duck
typing is the deliberate design, not an oversight.
hallengray added a commit that referenced this pull request Apr 15, 2026
G2's deliverable is proof — not claims — that v0.2.2's code actually
closes every Cycle 2 / Cycle 3 finding attributed to RAG-Forge. This
file is the artifact.

One section per finding with:
  - what the original audit saw (raw error signatures)
  - why earlier cycles could not verify the fix (masking)
  - what v0.2.2 changes to close the gap
  - passing test output naming each test by ID

Covers:
  Cycle 2 #4  — OpenAIEmbeddings.embed_query AttributeError  (FIXED)
  Cycle 2 #5  — max_tokens overflow on long responses        (FIXED)
  Cycle 2 #6  — silent 0.0 coercion                          (FULLY FIXED)
  Cycle 3 C3-2 — RagForgeRagasLLM.generate missing           (FIXED)
  Cycle 3 C3-5 — stale __version__                           (FIXED)

Closes with a falsifiability section naming the concrete error
signatures that, if they reappear in a Cycle 4 run against real
PearMedica telemetry, would falsify the report and justify holding
the release. The release bar is "Cycle 4 produces non-empty metrics
OR skip records with real error signatures — not empty both."

Combined test suite run shows 31/31 passing on the rebased G1+G3+G5
branch.
hallengray added a commit that referenced this pull request Apr 15, 2026
…3) (#38)

* fix(evaluator): add missing ragas BaseRagasLLM methods (G1)

RAG-Forge v0.2.1's RagForgeRagasLLM wrapper was shipped without the
concrete .generate() method that ragas 0.4.x's BaseRagasLLM exposes.
Every RAGAS metric job crashed on first contact with
  AttributeError: 'RagForgeRagasLLM' object has no attribute 'generate'
during the PearMedica Cycle 3 audit (2026-04-15), producing a
Scored: 0, Overall: 0.0000 report across all 48 metric evaluations.

Root cause: the adapter's design doc asserted "ragas only calls
generate_text / agenerate_text / model_name" and deliberately avoided
subclassing BaseRagasLLM to keep ragas as a soft dependency. That
assumption was wrong — BaseRagasLLM.generate is a concrete async helper
that ragas's metric code invokes on every LLM regardless of subclass
status. The duck-typed wrapper must re-declare it.

Fix preserves the no-hard-import design by adding duck-typed shims for
every public method on BaseRagasLLM and BaseRagasEmbeddings:

  RagForgeRagasLLM:
    - async generate()        — the specific shim Cycle 3 caught missing
    - is_finished()           — was abstract on base, returns True
    - get_temperature(n)      — matches base convention (0.01 / 0.3)
    - set_run_config()        — stores ragas RunConfig for compatibility
    - run_config attribute    — defaults to None

  RagForgeRagasEmbeddings:
    - embed_text(is_async)    — dispatch helper on base class
    - embed_texts(is_async)   — batch variant
    - set_run_config()
    - run_config attribute
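
The failure mode and the shim fix reduce to a toy example (illustrative names only; BaseRagasLLMLike stands in for ragas's real BaseRagasLLM):

```python
import asyncio

class BaseRagasLLMLike:
    """Toy stand-in for ragas's base class: generate() is a concrete
    helper that metric code calls on every LLM, subclass or not."""
    async def generate(self, prompt):
        return await self.agenerate_text(prompt)

class DuckTypedWrapper:
    """Only the methods the design doc assumed ragas would call."""
    async def agenerate_text(self, prompt):
        return f"answer:{prompt}"

class FixedWrapper(DuckTypedWrapper):
    """Re-declares the concrete helper the base would have provided."""
    async def generate(self, prompt):
        return await self.agenerate_text(prompt)

async def metric_step(llm, prompt):
    # what ragas metric code actually invokes
    return await llm.generate(prompt)

try:
    asyncio.run(metric_step(DuckTypedWrapper(), "q"))
    crashed = False
except AttributeError:
    crashed = True

ok = asyncio.run(metric_step(FixedWrapper(), "q"))
```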

Tests: a contract test iterates every public method on the real ragas
base classes and asserts our wrappers declare a callable of the same
name, so the next ragas release that adds a method fails in CI not in
a user audit. An end-to-end smoke test runs ragas.evaluate() against
our wrappers on a 1-sample dataset and asserts it never raises an
AttributeError naming our wrapper classes — the exact regression
signature from Cycle 3.

Three pre-existing test failures in test_cycle2_regression.py and
test_evaluator_factory.py are unrelated to this change (pre-existing
in v0.2.1 when ragas is installed; they assert on compound conditions
that break when MockJudge triggers the skip path). They belong in
workstream G3 (skip counter plumbing) and the G1 PR leaves them alone.

* fix(evaluator): address CodeRabbit review on G1 adapter shims

Five fixes from CodeRabbit review on PR #36. All were real issues the
original contract test missed because it only checked method names,
not signatures.

### generate(n > 1) now produces the correct LLMResult shape

ragas uses n > 1 for multi-sample metrics (answer_correctness
consistency checks, etc.) and expects LLMResult.generations shaped
[[gen1, gen2, ..., genN]] — one prompt run with N candidate
generations. The original shim ignored n and returned a single
generation, silently breaking any metric that relied on sample
diversity.

Fix: when n > 1, fan out n independent agenerate_text calls via
asyncio.gather and fuse the per-call results into a single
[[gen1..genN]]-shaped LLMResult via the new _fuse_llm_results helper.
n == 1 (the common case) stays a single call.

New test: test_wrapper_async_generate_n_greater_than_one_produces_
fused_shape uses a counting judge to verify all N calls fire and the
fused result carries all N distinct outputs.
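
A minimal sketch of the fan-out-and-fuse shape (StubResult is an illustrative stand-in for langchain's LLMResult, not the real class):

```python
import asyncio
from dataclasses import dataclass

@dataclass
class StubResult:
    # mimics the .generations[0] shape of a single-candidate LLM result
    generations: list

async def agenerate_text(prompt: str, i: int) -> StubResult:
    # stand-in for one single-candidate LLM call
    return StubResult(generations=[[f"gen{i}"]])

async def generate(prompt: str, n: int) -> list:
    # fan out n independent calls, then fuse into [[gen1, ..., genN]]
    results = await asyncio.gather(
        *(agenerate_text(prompt, i) for i in range(n))
    )
    return [[r.generations[0][0] for r in results]]

out = asyncio.run(generate("q", 3))
```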

### embed_text / embed_texts now async (crash fix)

Critical bug: the original shims were sync methods wrapping
asyncio.run(self.aembed_query(text)). ragas's BaseRagasEmbeddings
declares embed_text and embed_texts as async coroutines and metric
code invokes them with await embeddings.embed_text(...) from inside
ragas's evaluation event loop. asyncio.run() inside a running loop
crashes with RuntimeError: asyncio.run() cannot be called from a
running event loop — a real live-fire crash that would have taken
down any Cycle 4 run.

Fix: make both methods async. They now just await the existing
aembed_query / aembed_documents paths. The is_async parameter is
accepted for signature parity with the base class but ignored —
our underlying clients are synchronous and aembed_query already
runs them in a worker thread via asyncio.to_thread, so both flag
values land on the same code path.
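
The crash and the fix can be reproduced in isolation (a sketch, with a trivial aembed_query standing in for the real embedding call):

```python
import asyncio

async def aembed_query(text: str) -> list:
    # trivial stand-in for the real embedding call
    return [0.0]

def embed_text_sync(text: str) -> list:
    # the broken shape: asyncio.run inside a running loop raises RuntimeError
    return asyncio.run(aembed_query(text))

async def embed_text(text: str, is_async: bool = True) -> list:
    # the fix: the wrapper is itself a coroutine and simply awaits
    return await aembed_query(text)

async def caller():
    # simulates ragas invoking the embeddings from inside its event loop
    try:
        embed_text_sync("x")
        sync_ok = True
    except RuntimeError:
        sync_ok = False
    fixed = await embed_text("x")
    return sync_ok, fixed

sync_ok, fixed = asyncio.run(caller())
```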

New tests:
- test_embeddings_embed_text_is_async_and_awaitable: asserts both
  methods are inspect.iscoroutinefunction and round-trips through
  asyncio.run(embed.embed_text(...)).
- test_embeddings_embed_text_callable_from_running_event_loop:
  exercises the exact asyncio.run-inside-loop path ragas creates.

### Contract test now checks async/sync parity, not just names

CodeRabbit correctly pointed out that the original contract test
compared method names but not their async/sync shape — exactly why
the sync embed_text / embed_texts slipped past review.

Fix: two new tests (test_llm_wrapper_async_signature_matches_base
and test_embeddings_wrapper_async_signature_matches_base) iterate
every public method on the real ragas base classes and assert
inspect.iscoroutinefunction matches on both sides. A future release
that adds a method in one shape or the other will fail in CI.

### generate() default temperature: None -> 0.01

BaseRagasLLM.generate's default is 0.01. Our shim had None, which
is functionally equivalent (None triggers get_temperature()) but
diverges from the base signature. Changed to 0.01 to match.

Tests that exercise the None fallback still pass None explicitly.

### Tightened existing temperature-fallback test

The original test asserted "doesn't raise" but would have passed
even if get_temperature() was never called. Now patches
llm.get_temperature with wraps= and asserts assert_called_once_with(1).
Added inverse test that patches get_temperature and asserts it was
NOT called when an explicit temperature is passed — guards against
over-eager fallback firing.

### Narrowed e2e smoke test exception handler

Original except Exception accepted any non-AttributeError as a pass.
A bad shim raising TypeError or a nested-loop bug raising RuntimeError
would have counted as success. Tightened to allow only
RagasOutputParserException (expected downstream parser failure from
the stub judge's canned JSON); everything else propagates.

* fix(evaluator): fuse _StringLLMResult stubs in n>1 path for no-ragas CI

The n>1 fan-out fix in the previous commit built a fused LLMResult
via langchain_core.outputs.LLMResult. On CI without the [ragas]
extra installed, langchain isn't available and generate_text falls
back to the _StringLLMResult stub — which _fuse_llm_results had no
fuse path for, so it hit the ImportError fallback and returned
results[0] (a single generation). The n>1 test then asserted
inner length == 3 and failed with 1.

Fix: separate the flatten step (build the fused_generations list)
from the wrap step (pick LLMResult or _StringLLMResult). The
flatten works on either shape because both expose the same
.generations[0] interface. Wrap picks langchain if available,
otherwise constructs via a new _StringLLMResult._from_generations
alt constructor that carries a pre-fused list.

Local verification:
  - without [ragas]: 18/18 adapter tests pass (contract + e2e skip)
  - with [ragas]: 28/28 adapter tests pass

* fix(evaluator): address CodeRabbit round-2 findings on G1

Three more findings from the second CodeRabbit review on PR #36 / #38.

### _fuse_llm_results now fails loud on malformed input

The previous commit's AttributeError/IndexError fallback silently
returned results[0] — turning a requested n>1 generation into a
single-sample result with no signal. Downstream ragas metrics that
rely on sample diversity would consume the degraded result as if it
were a valid N-sample fuse. CodeRabbit rightly flagged this as
hiding real correctness bugs.

Fix: raise ValueError with the observed result types and the
underlying exception chained. Added a targeted test
(test_fuse_llm_results_raises_on_malformed_input) covering three
malformed-input shapes: objects without .generations, empty outer
lists, and empty input lists.
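
A sketch of the fail-loud behavior (StubResult and the error message are illustrative, not the real helper):

```python
class StubResult:
    """Stand-in for a single-generation LLM result (illustrative only)."""
    def __init__(self, text):
        self.generations = [[text]]

def fuse_llm_results(results):
    # Fuse n single-generation results into one [[gen1, ..., genN]] result.
    # Malformed input raises instead of silently degrading to results[0].
    if not results:
        raise ValueError("cannot fuse an empty result list")
    try:
        return [[r.generations[0][0] for r in results]]
    except (AttributeError, IndexError) as exc:
        types = ", ".join(type(r).__name__ for r in results)
        raise ValueError(f"malformed LLM results: {types}") from exc

fused = fuse_llm_results([StubResult("a"), StubResult("b")])
```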

### CountingJudge in n>1 test is now thread-safe

RagForgeRagasLLM.generate fans out via asyncio.gather, and each
agenerate_text call runs in a worker thread via asyncio.to_thread.
counter["i"] += 1 is not atomic across threads — the read-modify-
write race would produce duplicate sample labels or an undercounted
total, flaking the test. Added a threading.Lock around the
increment. Test now deterministic under concurrent fan-out.
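
The locked counter looks roughly like this (a minimal sketch; CountingJudge here is a toy, not the real test fixture):

```python
import asyncio
import threading

class CountingJudge:
    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0

    def generate(self, prompt: str) -> str:
        # read-modify-write guarded so concurrent to_thread calls can't race
        with self._lock:
            self.count += 1
            return f"sample-{self.count}"

async def fan_out(judge: CountingJudge, n: int) -> list:
    # mirrors the asyncio.gather + asyncio.to_thread fan-out path
    return await asyncio.gather(
        *(asyncio.to_thread(judge.generate, "p") for _ in range(n))
    )

judge = CountingJudge()
outputs = asyncio.run(fan_out(judge, 8))
```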

### Contract tests now check parameter parity, not just name + async

Previously we asserted that each public method on BaseRagasLLM /
BaseRagasEmbeddings exists on our wrapper and matches async/sync.
That missed parameter drift — a future ragas release that adds
max_tokens to generate_text would silently break our wrapper until
a user audit caught it.

Fix: two new tests (test_llm_wrapper_parameter_names_cover_base_class
and test_embeddings_wrapper_parameter_names_cover_base_class) use
inspect.signature to enumerate the named parameters on each base-
class method and assert the wrapper accepts every name. Variadic
*args / **kwargs are ignored (they can absorb any kwarg by
definition). A helper _required_param_names() shares the logic.
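
The parity-check pattern, reduced to a self-contained sketch (Base and Wrapper are toy classes standing in for the ragas base classes and our wrappers):

```python
import inspect

class Base:
    """Toy stand-in for ragas's BaseRagasLLM."""
    async def generate_text(self, prompt, n=1, temperature=0.01): ...
    def is_finished(self, response): ...

class Wrapper:
    """Toy stand-in for our wrapper: must mirror Base's surface."""
    async def generate_text(self, prompt, n=1, temperature=0.01, **kwargs): ...
    def is_finished(self, response): ...

def required_param_names(func):
    # Named parameters only; self and variadics can absorb anything.
    return {
        p.name
        for p in inspect.signature(func).parameters.values()
        if p.name != "self" and p.kind not in (p.VAR_POSITIONAL, p.VAR_KEYWORD)
    }

mismatches = []
for name, base_method in inspect.getmembers(Base, callable):
    if name.startswith("_"):
        continue
    wrapped = getattr(Wrapper, name, None)
    if not callable(wrapped):
        mismatches.append(f"{name}: missing")
    elif inspect.iscoroutinefunction(base_method) != inspect.iscoroutinefunction(wrapped):
        mismatches.append(f"{name}: async/sync mismatch")
    elif not required_param_names(base_method) <= required_param_names(wrapped):
        mismatches.append(f"{name}: parameter drift")
```

A future release that adds, renames, or changes the shape of a method on the base class turns up in `mismatches` in CI rather than in a user audit.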

Local: 30 passed on adapter + contract + e2e suites with ragas
installed, 18 passed on adapter suite alone without the [ragas]
extra. Ruff + mypy clean.

* fix(evaluator): route RAGAS exceptions into Skipped counter (G3)


* fix(evaluator): tolerate cross-environment mypy on ragas_evaluate ignores

The # type: ignore[arg-type] comments on the ragas_evaluate() kwargs
only fire when mypy can see ragas's real type signatures — i.e. when
the [ragas] extra is installed (local dev). On CI without the extra,
mypy falls back to Any and flags the ignores as [unused-ignore].

Adding unused-ignore to the ignore codes tells mypy to tolerate the
comment when it has nothing to suppress. Both environments are now
clean: local mypy with the extra still sees the arg-type mismatch and
honours the suppression; CI mypy without the extra silently accepts
the comment as dead.

* fix(evaluator): remove dead real_import code in G3 skip counter test

CodeRabbit on PR #38 spotted the leftover
  real_import = mod.__dict__.copy()   # line 96
  _ = real_import                      # line 111
in test_whole_batch_crash_fans_out_to_every_sample_metric_pair. The
dict was copied but never used — an artifact of an earlier
monkeypatching approach that was simplified but never cleaned up.
Removed it along with the unnecessary module-level import that fed it.

No behavioural change; test still passes.
hallengray added a commit that referenced this pull request Apr 15, 2026
* docs(v0.2.2): Cycle 2/3 verification evidence report (G2)

G2's deliverable is proof — not claims — that v0.2.2's code actually
closes every Cycle 2 / Cycle 3 finding attributed to RAG-Forge. This
file is the artifact.

One section per finding with:
  - what the original audit saw (raw error signatures)
  - why earlier cycles could not verify the fix (masking)
  - what v0.2.2 changes to close the gap
  - passing test output naming each test by ID

Covers:
  Cycle 2 #4  — OpenAIEmbeddings.embed_query AttributeError  (FIXED)
  Cycle 2 #5  — max_tokens overflow on long responses        (FIXED)
  Cycle 2 #6  — silent 0.0 coercion                          (FULLY FIXED)
  Cycle 3 C3-2 — RagForgeRagasLLM.generate missing           (FIXED)
  Cycle 3 C3-5 — stale __version__                           (FIXED)

Closes with a falsifiability section naming the concrete error
signatures that, if they reappear in a Cycle 4 run against real
PearMedica telemetry, would falsify the report and justify holding
the release. The release bar is "Cycle 4 produces non-empty metrics
OR skip records with real error signatures — not both empty."

Combined test suite run shows 31/31 passing on the rebased G1+G3+G5
branch.

* docs(v0.2.2): markdown polish on G2 verification report

Addresses CodeRabbit findings on PR #40.

- Added language specifiers to every fenced code block (10 blocks,
  all previously bare ```). Used text for error traces and pytest
  output, bash for the shell command block. Silences markdownlint MD040.
- Reformatted the inline 7-method enumeration in the Finding #4
  section as a proper bulleted list. The methods now render one per
  line instead of as a comma-separated prose run, which was hard to
  scan.

* docs(v0.2.2): clarify version-fix scoping + US spelling on G2 report

CodeRabbit on PR #40 round 2 raised two findings on the C3-5 section:

1. The evidence snippet showing __version__ = "0.1.0" vs pyproject
   version = "0.2.1" could be read as a current-state claim rather
   than a pre-fix snapshot. Added an explicit "pre-v0.2.2 state"
   header and a paragraph scoping the snippet so nobody reads it
   as the repo's current state.
2. "parametrized" -> "parameterized" for US spelling consistency.
   Also updated the description to reflect G5's round-2 move to
   auto-discovery so the report matches the actual test impl.