
feat: Phase 2D — MCP Server Wiring + Templates#5

Merged
hallengray merged 10 commits into main from feat/phase2d-mcp-templates
Apr 12, 2026

Conversation

@hallengray (Owner)

Summary

  • Shared bridge package (@rag-forge/shared): Extracted Python bridge into a reusable package, used by both CLI and MCP
  • MCP tools wired: rag_query, rag_audit, rag_status now call real Python pipelines via the bridge (no more stubs)
  • MCP server entry point: server.ts + main.ts with stdio transport for Claude Code integration
  • serve --mcp CLI command: Launch MCP server via rag-forge serve --mcp
  • Python status subcommand: Reports collection state (indexed, chunk count)
  • Hybrid template: 6-file scaffold with BM25 + reranking + contextual enrichment config
  • Enterprise template: 7-file scaffold with security guards + CI/CD gate workflow
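  The tool-to-bridge handoff described above can be sketched as follows. This is a minimal illustration, assuming a `{ stdout, stderr, exitCode }` result shape inferred from this PR's description — it is not the published `@rag-forge/shared` API.

  ```typescript
  // Assumed result shape of the shared Python bridge (an assumption based
  // on this PR, not the actual @rag-forge/shared types).
  interface BridgeResult {
    stdout: string;
    stderr: string;
    exitCode: number;
  }

  // Map a bridge result to a tool response: stdout on success; on failure,
  // prefer structured stdout (the Python CLI may emit JSON errors there),
  // then fall back to stderr.
  function toToolOutput(result: BridgeResult): { isError: boolean; text: string } {
    if (result.exitCode === 0) {
      return { isError: false, text: result.stdout };
    }
    return { isError: true, text: result.stdout || result.stderr || "python bridge failed" };
  }
  ```

  Keeping this mapping in one small pure function makes the success/failure contract easy to share across the `rag_query`, `rag_audit`, and `rag_status` handlers.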

Quality

  • 273 Python tests passing (190 core + 83 evaluator)
  • Ruff, mypy, ESLint, TypeScript typecheck all clean
  • 3 TS packages build successfully (shared, cli, mcp)

Test plan

  • All Python tests pass
  • All TS packages build and typecheck
  • rag-forge serve --mcp launches MCP server
  • rag-forge init hybrid scaffolds correct files
  • rag-forge init enterprise scaffolds with CI workflow

🤖 Generated with Claude Code


coderabbitai bot commented Apr 12, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: aa46445d-3af1-40bc-8057-d727a601e511

📥 Commits

Reviewing files that changed from the base of the PR and between a216fe1 and f12958a.

📒 Files selected for processing (11)
  • packages/cli/src/commands/serve.ts
  • packages/mcp/src/tools/rag-audit.ts
  • packages/mcp/src/tools/rag-query.ts
  • packages/mcp/src/tools/rag-status.ts
  • packages/shared/src/python-bridge.ts
  • templates/enterprise/project/README.md
  • templates/enterprise/project/pyproject.toml
  • templates/enterprise/project/src/config.py
  • templates/hybrid/project/README.md
  • templates/hybrid/project/pyproject.toml
  • templates/hybrid/project/src/config.py

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting


Summary by CodeRabbit

  • New Features

    • New CLI: added serve command to run the MCP server and status command to report indexing state
    • MCP server entrypoint and runtime startup flow
    • Tools now invoke Python-based query/audit/status flows
  • Documentation

    • Added Enterprise and Hybrid pipeline templates, docs, quick-starts, evaluation configs and golden sets
    • CI quality-gate workflow for RAG audits
  • Chores

    • Introduced shared utilities package and consolidated Python bridge
    • Build/workspace updates (multi-entry build, workspace package, typecheck ordering)

Walkthrough

Extracts a Python-bridge into a new @rag-forge/shared package; refactors CLI and MCP to use it; adds an MCP runtime entrypoint and serve CLI command that spawns the MCP process; replaces MCP tool stubs with Python module invocations; adds a status Python CLI subcommand; and introduces enterprise/hybrid templates with evaluation configs and golden sets.
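
The serve-command spawn described in this walkthrough might look like the sketch below, folded together with the reviewer's suggestions later in this thread (use `process.execPath`, propagate signals, treat a missing exit code as failure). Function names here are illustrative, not the PR's actual symbols.

```typescript
import { spawn, type ChildProcess } from "node:child_process";

// Pure helper: decide the parent's exit behavior from the child's (code, signal).
function exitPlan(
  code: number | null,
  signal: NodeJS.Signals | null,
): { kind: "signal"; signal: NodeJS.Signals } | { kind: "code"; code: number } {
  if (signal) return { kind: "signal", signal };
  // A null code with no signal is unexpected; fail rather than mask it as success.
  return { kind: "code", code: code ?? 1 };
}

function launchMcp(mcpMain: string): ChildProcess {
  // Use the parent's own Node binary instead of relying on "node" being in PATH.
  const child = spawn(process.execPath, [mcpMain], { stdio: "inherit" });
  child.on("exit", (code, signal) => {
    const plan = exitPlan(code, signal);
    if (plan.kind === "signal") process.kill(process.pid, plan.signal); // re-raise
    else process.exit(plan.code);
  });
  return child;
}
```

Separating the exit decision into `exitPlan` keeps the signal/code policy unit-testable without actually spawning a process.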

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Shared package** — `packages/shared/package.json`, `packages/shared/src/python-bridge.ts`, `packages/shared/tsconfig.json`, `packages/shared/tsup.config.ts` | New package `@rag-forge/shared` implementing `runPythonModule` and `checkPythonAvailable` (types + build config). |
| **Workspace & deps** — `pnpm-workspace.yaml`, `packages/cli/package.json`, `packages/mcp/package.json` | Added `packages/shared` to the workspace; cli and mcp now depend on `@rag-forge/shared` (removed local execa from CLI). |
| **CLI integration** — `packages/cli/src/lib/python-bridge.ts`, `packages/cli/src/commands/serve.ts`, `packages/cli/src/index.ts` | CLI now re-exports the Python bridge from shared and registers a new serve command, which validates `--mcp`, resolves the MCP main path, checks file existence, and spawns a node process with inherited stdio. |
| **MCP runtime & build** — `packages/mcp/src/main.ts`, `packages/mcp/src/server.ts`, `packages/mcp/tsup.config.ts` | Added a `main.ts` entrypoint that calls the exported `startServer()`; `startServer()` creates a server and connects a `StdioServerTransport`; `main.ts` added to tsup entries. |
| **MCP tools: Python execution** — `packages/mcp/src/tools/rag-query.ts`, `packages/mcp/src/tools/rag-audit.ts`, `packages/mcp/src/tools/rag-status.ts` | Replaced "not_implemented" stubs with calls to `runPythonModule(...)`; return stdout on success or structured/error output on failure; arguments constructed from tool inputs. |
| **Python core CLI** — `packages/core/src/rag_forge_core/cli.py` | Added a `status` subcommand (`--collection`) that queries the Qdrant count, handles ValueError/KeyError, and emits JSON with `indexed`, `collection`, and `chunk_count`. |
| **Templates — Enterprise** — `templates/enterprise/*`, `templates/enterprise/project/*` (README, pyproject.toml, src/config.py, src/pipeline.py, eval/*, .github/workflows/rag-audit.yml) | New enterprise template: docs, pipeline stubs, config dataclass, evaluation config and golden set, and a GitHub Actions audit gate workflow. |
| **Templates — Hybrid** — `templates/hybrid/*`, `templates/hybrid/project/*` (README, pyproject.toml, src/config.py, src/pipeline.py, eval/*) | New hybrid template: docs, pipeline stubs, config dataclass, evaluation config and golden set. |
| **Root changes** — `eval/golden_set.json`, `turbo.json` | Added a root evaluation golden set; the typecheck task now depends on `^build`. |
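
The `turbo.json` tweak noted in the root-changes row might look like this sketch (the top-level key — `pipeline` vs. `tasks` — depends on the Turborepo version in use):

```json
{
  "pipeline": {
    "typecheck": {
      "dependsOn": ["^build"]
    }
  }
}
```

With `^build` in `dependsOn`, dependency packages such as `@rag-forge/shared` are built before the CLI and MCP typechecks run.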

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant NodeProcess as "MCP Node Process"
    participant MCP as "MCP startServer"
    participant Stdio as "StdioServerTransport"
    participant Tool as "MCP Tool Handler"
    participant Python as "Python Module"

    User->>CLI: rag-forge serve --mcp <path>
    CLI->>NodeProcess: spawn node <mcp/dist/main.js>
    NodeProcess->>MCP: invoke startServer()
    MCP->>Stdio: create transport
    MCP->>Stdio: connect(transport)
    Stdio-->>MCP: connected

    User->>Stdio: invoke rag_query input
    Stdio->>Tool: handleRagQuery(input)
    Tool->>Python: runPythonModule("rag_forge_core.cli", ["query", ...])
    Python-->>Tool: {stdout, stderr, exitCode}
    alt exitCode == 0
        Tool-->>Stdio: return stdout
    else
        Tool-->>Stdio: return error payload (stderr or message)
    end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I tunneled through packages, shared and new,
Spawned a server, let the MCP brew.
Tools now call Python, templates take flight,
Status and serve keep the pipeline bright.
Hop, hop—RAG-Forge grows overnight! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 46.67%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: introducing MCP server wiring and templates (Hybrid and Enterprise) as part of Phase 2D development. |
| Description check | ✅ Passed | The description is directly related to the changeset, detailing the shared bridge package, MCP tool wiring, server entry point, CLI serve command, Python status subcommand, and template scaffolds with quality metrics. |



Comment @coderabbitai help to get the list of available commands and usage tips.

hallengray and others added 8 commits April 12, 2026 11:45
Extracts the Python subprocess bridge into a standalone workspace package
so both the CLI and MCP server can share a single implementation without
duplication. Exposes runPythonModule and checkPythonAvailable via ESM
exports with full TypeScript declarations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Removes the inline execa dependency from the CLI and replaces the
full python-bridge implementation with a thin re-export of the shared
package, ensuring a single source of truth for the subprocess bridge
across all workspace consumers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the stub not-implemented handlers in rag-query, rag-audit, and
rag-status with real implementations that delegate to Python via the
shared runPythonModule bridge. Adds @rag-forge/shared as a workspace
dependency so the MCP server shares the same subprocess layer as the CLI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…gate

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@hallengray hallengray force-pushed the feat/phase2d-mcp-templates branch from 436c2f4 to 78a7081 Compare April 12, 2026 10:46
… gate

- turbo.json: typecheck now depends on ^build so @rag-forge/shared is
  built before CLI/MCP typecheck runs
- eval/golden_set.json: root-level golden set for CI audit gate

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 18

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/cli/src/commands/serve.ts`:
- Around line 7-10: The getMcpMainPath function currently hardcodes upward
traversal to locate mcp/dist/main.js which breaks outside the monorepo; change
getMcpMainPath to resolve the target using Node's module resolution (e.g., via
createRequire(import.meta.url).resolve or require.resolve('mcp/dist/main.js'))
and fall back to a configurable override (env var or CLI option) if resolution
fails, and ensure you catch and surface resolution errors from getMcpMainPath so
callers can handle packaged/global installs robustly.
- Around line 27-29: The spawn call currently uses the literal "node" which can
break in PATH-restricted environments; change it to use process.execPath when
spawning the child in serve.ts so the child uses the same Node binary as the
parent (update the spawn invocation that creates child from spawn("node",
[mcpMain], { stdio: "inherit" }) to spawn(process.execPath, [mcpMain], { stdio:
"inherit" })). Ensure any tests or wrappers referencing the child variable
remain compatible.
- Around line 36-38: The exit handler for child.on("exit", (code) => ...)
ignores the signal parameter and treats null codes as success; update the
handler signature to accept both (code, signal), propagate signal-terminated
exits by re-emitting the same signal to the parent (e.g.,
process.kill(process.pid, signal)) when signal is set, and fall back to exiting
with a non-zero code by changing process.exit(code ?? 0) to process.exit(code ??
1) when no signal is present; update the child.on("exit", ...) callback
accordingly so signal and code are both handled.

In `@packages/core/src/rag_forge_core/cli.py`:
- Around line 260-265: The except branch around store.count(collection) in
cmd_status is dead because QdrantStore.count() already handles errors and
returns 0; remove the try/except and simplify to a direct call: set count =
store.count(collection) and indexed = count > 0. Update the block that currently
contains the try/except (referencing store.count and collection in cmd_status)
to use this simplified logic so the ValueError/KeyError branch is removed.

In `@packages/mcp/src/tools/rag-audit.ts`:
- Around line 15-19: The code builds CLI args in rag-audit.ts (const args =
["audit", "--judge", "mock"]; and subsequent pushes) but ignores input.metrics
from the tool schema; either wire input.metrics into the args before invoking
the evaluator or explicitly reject it. Update the args construction to map
input.metrics to the appropriate CLI flag(s) (e.g., push "--metrics" and the
metrics value or serialize as needed) when input.metrics is present, or add a
validation step that throws or returns an error if input.metrics is set to
prevent silent input loss; adjust the evaluator invocation that consumes args
accordingly so the metrics are actually passed through.

In `@packages/mcp/src/tools/rag-query.ts`:
- Around line 28-32: The error branch in the rag-query result handling (where
result is checked with result.exitCode !== 0) currently only returns
result.stderr, which can drop structured JSON error payloads emitted to stdout;
update the failure path in the function handling the child process result to
prefer and preserve result.stdout (parse it if JSON) before falling back to
result.stderr, and return the preserved stdout JSON/error message (or include
both stdout and stderr) so callers of the module get the CLI's structured error;
look for symbols result and the exitCode check in
packages/mcp/src/tools/rag-query.ts to locate the change.

In `@packages/shared/src/python-bridge.ts`:
- Around line 24-28: The return currently coerces undefined exit codes to 0
which mislabels signaled/canceled processes as success; change the mapping of
result.exitCode so that undefined is treated as a failure (e.g., map undefined
to a non‑zero value or propagate undefined explicitly and let callers treat it
as failure) instead of defaulting to 0 by replacing the `result.exitCode ?? 0`
logic; also add a sane timeout to the execa calls (the invocations using execa
shown earlier in this file) and ensure timeout/cancelled outcomes populate
stderr/exitCode consistently so callers (rag-query.ts, rag-audit.ts,
rag-status.ts) can detect failures.
- Around line 19-22: The execa calls in python-bridge.ts (the const result =
await execa(...) at the two invocation sites) run without a timeout; add a
timeout option (in milliseconds) to both execa option objects so stuck Python/uv
processes are killed automatically (e.g., timeout:
Number(process.env.PYTHON_BRIDGE_TIMEOUT_MS) || 30000). Update both calls that
invoke execa (the one at the shown line and the other around line 41) to include
this timeout setting in their options to prevent hanging tool requests.

In `@templates/enterprise/project/.github/workflows/rag-audit.yml`:
- Around line 30-34: Replace the hardcoded THRESHOLD with the value from the
generated report: read THRESHOLD from reports/audit-report.json (e.g.
THRESHOLD=$(jq -r '.metrics.faithfulness.threshold // "0.85"'
reports/audit-report.json)) and keep the rest of the comparison using SCORE and
THRESHOLD as before; ensure jq extraction returns a numeric string or provide a
sensible fallback so the existing bc comparison works.

In `@templates/enterprise/project/eval/config.yaml`:
- Around line 4-21: The template includes unsupported keys ("metrics",
"ci_gate", "golden_set") that the evaluator's config model (AuditConfig and the
CLI wiring) does not consume; either remove these keys from the template or
adapt the config parsing to populate AuditConfig from them. Fix by updating the
template to only include fields that AuditConfig expects (remove "metrics",
"ci_gate", "golden_set" and keep only supported threshold entries), or
alternatively update the CLI/config loader (the AuditConfig parsing logic) to
accept and map "metrics", "ci_gate", and "golden_set" into the corresponding
AuditConfig properties so the declared gate truly configures the evaluator.
Ensure references to AuditConfig and the config loader/parsing function are
updated to reflect the chosen approach.

In `@templates/enterprise/project/eval/golden_set.json`:
- Line 4: The expected_answer_keywords array in golden_set.json uses overly
broad tokens ("topic", "document") which allow false-positive matches; update
the "expected_answer_keywords" value(s) to use more specific phrases or entities
that reflect the true expected content (e.g., "primary research topic", "source
document title", "document ID", "document summary") so the evaluation checks
substantive overlap—locate the "expected_answer_keywords" key in the JSON entry
and replace the generic tokens with targeted keyword phrases relevant to the
question/answer pair.

In `@templates/enterprise/project/pyproject.toml`:
- Around line 22-23: The template sets reranker = "cohere" which triggers
RetrievalConfig validation requiring cohere_api_key; change the default reranker
in the template (the reranker field in
templates/enterprise/project/pyproject.toml) to a safe empty/none value (or
remove the reranker entry) so generated projects do not demand cohere_api_key at
load time; ensure the change references the reranker setting and leaves enrich =
true untouched so only the reranker default is modified.

In `@templates/enterprise/project/README.md`:
- Around line 15-18: The Quick Start should add an explicit credential setup
step before running commands like `rag-forge index`, `rag-forge query`, and
`rag-forge audit` because the template uses provider-backed
reranking/generation; update the README to instruct users to export or configure
required API keys (e.g., set provider-specific env vars or run the
project-specific credential setup command) and show a short example of the
expected variables or config (so first-run commands don't fail due to missing
keys).

In `@templates/enterprise/project/src/config.py`:
- Around line 23-25: PipelineConfig currently declares reranker and cohere_model
but lacks a Cohere credential field, so add a cohere_api_key attribute to
PipelineConfig (same style/type as other config fields, e.g., Optional[str] =
None or str = "") so users can supply the API key when reranker == "cohere";
update any validation/usage sites that read PipelineConfig.cohere_api_key to
handle missing/empty keys accordingly (look for class PipelineConfig, attributes
reranker and cohere_model to locate where to add the new cohere_api_key field).

In `@templates/hybrid/project/eval/config.yaml`:
- Around line 4-21: The template exposes keys (metrics, ci_gate, golden_set)
that AuditConfig in packages/evaluator/src/rag_forge_evaluator/audit.py and the
CLI in packages/evaluator/src/rag_forge_evaluator/cli.py don’t handle; either
remove them from the template or make the evaluator read and enforce them—update
AuditConfig to add typed fields for metrics: List[str], ci_gate: {metric: str,
threshold: float, block_on_failure: bool}, and golden_set: Optional[str], update
the CLI parsing in cli.py to load those fields from the config JSON, and
implement enforcement logic that uses AuditConfig.ci_gate to fail/exit when gate
conditions are not met and loads AuditConfig.golden_set when present to evaluate
golden examples.

In `@templates/hybrid/project/README.md`:
- Around line 15-17: The quick-start query command is missing the required
hybrid retrieval flag; update the example for the `rag-forge query` invocation
to include the `--sparse-index-path` argument (matching the earlier index step)
so hybrid mode runs correctly—modify the line with the `rag-forge query "your
question" --strategy hybrid --reranker cohere` example to add
`--sparse-index-path .rag-forge/sparse` so the documented flow is runnable.

In `@templates/hybrid/project/src/config.py`:
- Around line 23-25: PipelineConfig is missing a Cohere credential field while
defaulting reranker="cohere"; add a cohere_api_key field to the PipelineConfig
definition (e.g., cohere_api_key: Optional[str] = None) and import
typing.Optional if needed so users can set the Cohere API key alongside reranker
and cohere_model; update any code that constructs PipelineConfig (or reads its
fields) to use this new cohere_api_key when initializing the Cohere client.

In `@turbo.json`:
- Around line 17-19: The "typecheck" pipeline currently has "dependsOn":
["^build"] which only depends on build tasks in dependency packages; update the
"typecheck" pipeline's dependsOn to include the local build by adding "build" to
the array (i.e., "dependsOn": ["build", "^build"]) so the local build task runs
before the "typecheck" task; locate the "typecheck" entry and modify its
dependsOn accordingly.
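
The shared-bridge hardening described above (bound the subprocess with a timeout; never coerce a missing exit code to success) can be sketched like this. The real bridge uses execa; Node's built-in `spawnSync` is used here so the sketch stays self-contained, and the env-var name and default are assumptions.

```typescript
import { spawnSync } from "node:child_process";

// Assumed override knob; the PR does not define this variable.
const BRIDGE_TIMEOUT_MS = Number(process.env.PYTHON_BRIDGE_TIMEOUT_MS) || 30_000;

function runModule(cmd: string, args: string[], timeoutMs = BRIDGE_TIMEOUT_MS) {
  const result = spawnSync(cmd, args, { encoding: "utf8", timeout: timeoutMs });
  return {
    stdout: result.stdout ?? "",
    stderr: result.signal
      ? `process killed by ${result.signal} (timed out after ${timeoutMs}ms?)`
      : result.stderr ?? "",
    // status is null when the child was signaled or timed out; map that to
    // failure (1) instead of defaulting to 0.
    exitCode: result.status ?? 1,
  };
}
```

Callers such as `rag-query.ts` can then treat any non-zero `exitCode` uniformly, whether the Python process failed, hung past the timeout, or was killed by a signal.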

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 50e6ee76-0208-40ed-b836-adc9d5845aa5

📥 Commits

Reviewing files that changed from the base of the PR and between 70ce9be and a216fe1.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (34)
  • eval/golden_set.json
  • packages/cli/package.json
  • packages/cli/src/commands/serve.ts
  • packages/cli/src/index.ts
  • packages/cli/src/lib/python-bridge.ts
  • packages/core/src/rag_forge_core/cli.py
  • packages/mcp/package.json
  • packages/mcp/src/main.ts
  • packages/mcp/src/server.ts
  • packages/mcp/src/tools/rag-audit.ts
  • packages/mcp/src/tools/rag-query.ts
  • packages/mcp/src/tools/rag-status.ts
  • packages/mcp/tsup.config.ts
  • packages/shared/package.json
  • packages/shared/src/python-bridge.ts
  • packages/shared/tsconfig.json
  • packages/shared/tsup.config.ts
  • pnpm-workspace.yaml
  • templates/enterprise/README.md
  • templates/enterprise/project/.github/workflows/rag-audit.yml
  • templates/enterprise/project/README.md
  • templates/enterprise/project/eval/config.yaml
  • templates/enterprise/project/eval/golden_set.json
  • templates/enterprise/project/pyproject.toml
  • templates/enterprise/project/src/config.py
  • templates/enterprise/project/src/pipeline.py
  • templates/hybrid/README.md
  • templates/hybrid/project/README.md
  • templates/hybrid/project/eval/config.yaml
  • templates/hybrid/project/eval/golden_set.json
  • templates/hybrid/project/pyproject.toml
  • templates/hybrid/project/src/config.py
  • templates/hybrid/project/src/pipeline.py
  • turbo.json
📜 Review details
🧰 Additional context used
🪛 LanguageTool
templates/enterprise/project/README.md

[uncategorized] ~22-~22: The official name of this software platform is spelled with a capital “H”.
Context: ...olden_set.json` ## CI/CD The included `.github/workflows/rag-audit.yml` runs evaluatio...

(GITHUB)

🔇 Additional comments (21)
packages/mcp/src/server.ts (1)

4-7: Startup wiring is clean and correct.

startServer() correctly creates the MCP server and awaits stdio transport connection.

packages/cli/src/index.ts (1)

6-6: Serve command registration is integrated correctly.

Import and registration match the existing command pattern.

Also applies to: 21-21

packages/mcp/src/main.ts (1)

3-6: Entrypoint failure handling is solid.

Top-level rejection handling logs context and exits non-zero as expected.

packages/mcp/tsup.config.ts (1)

4-4: Build entry expansion is correct.

Adding src/main.ts to tsup entries matches the new MCP runtime entrypoint.

templates/hybrid/README.md (1)

5-5: Usage instruction is clear and immediately actionable.

Line 5 gives a concrete command users can run, which is better than a phase-status placeholder.

pnpm-workspace.yaml (1)

2-2: Workspace package registration looks correct.

Line 2 correctly adds packages/shared, enabling workspace linking for @rag-forge/shared.

packages/shared/tsconfig.json (1)

1-9: TypeScript package config is well-scoped.

rootDir, outDir, and include/exclude are consistent for building packages/shared cleanly.

templates/enterprise/README.md (2)

3-3: Enterprise positioning text remains coherent.

Line 3 still communicates security-focused, gated deployment intent clearly.


5-5: Direct initialization command is a good UX improvement.

Line 5 provides a runnable command instead of release-phase wording.

packages/mcp/package.json (1)

23-23: Dependency addition is correct for shared bridge reuse.

Line 23 properly declares @rag-forge/shared for MCP runtime integration.

packages/cli/package.json (1)

25-25: CLI dependency wiring to shared package looks correct.

Line 25 correctly links the CLI to @rag-forge/shared via workspace dependency.

templates/hybrid/project/eval/golden_set.json (1)

1-14: Golden set structure is consistent and valid.
This dataset is well-formed and suitable as a starter template.

packages/shared/tsup.config.ts (1)

3-10: Build config aligns with the package export contract.
The entry/output settings are coherent with the shared bridge packaging goals.

eval/golden_set.json (1)

1-20: Starter golden set is well-structured.
Entries are consistent and usable for baseline evaluation flows.

packages/mcp/src/tools/rag-status.ts (1)

4-16: Good wiring from MCP tool to Python status command.
Line 4–Line 16 correctly delegates execution and handles error/success paths cleanly.

templates/hybrid/project/pyproject.toml (1)

1-23: Hybrid template manifest is coherent and complete for initial scaffolding.
Project/dependency metadata and tool.rag-forge settings are internally consistent.

packages/shared/package.json (1)

1-29: Package manifest wiring looks correct.

Export map, scripts, engine floor, and dependency split are coherent for a shared ESM utility package.

packages/core/src/rag_forge_core/cli.py (1)

324-333: Status subcommand wiring is clean.

Parser registration and command dispatch are correctly integrated into main().

packages/cli/src/lib/python-bridge.ts (1)

1-8: Re-export consolidation looks good.

This keeps CLI bridge imports stable while centralizing implementation in @rag-forge/shared.

templates/hybrid/project/src/pipeline.py (1)

6-36: Template scaffold is clear and usable.

The placeholder API shape and CLI hinting are coherent for generated starter code.

templates/enterprise/project/src/pipeline.py (1)

6-36: Enterprise template scaffold is consistent.

The generated placeholders and invocation pattern are aligned with starter-template expectations.

Comment on lines +7 to +10
function getMcpMainPath(): string {
  const currentDir = fileURLToPath(new URL(".", import.meta.url));
  return resolve(currentDir, "..", "..", "..", "mcp", "dist", "main.js");
}

⚠️ Potential issue | 🟠 Major

Path resolution is monorepo-coupled and can break outside this layout.

Line 9 hardcodes traversal to a sibling mcp/dist/main.js. This is fragile for packaged/global installs and any altered build layout.

🔧 Proposed fix
+import { existsSync } from "node:fs";
 import { resolve } from "node:path";
 import { fileURLToPath } from "node:url";
@@
 function getMcpMainPath(): string {
   const currentDir = fileURLToPath(new URL(".", import.meta.url));
-  return resolve(currentDir, "..", "..", "..", "mcp", "dist", "main.js");
+  const candidates = [
+    process.env.RAG_FORGE_MCP_MAIN,
+    resolve(currentDir, "..", "..", "..", "mcp", "dist", "main.js"),
+  ].filter((p): p is string => Boolean(p));
+
+  const resolved = candidates.find((p) => existsSync(p));
+  if (!resolved) {
+    throw new Error(
+      "Unable to locate MCP entrypoint. Set RAG_FORGE_MCP_MAIN to the built main.js path.",
+    );
+  }
+  return resolved;
 }

Comment on lines +36 to +38
child.on("exit", (code) => {
process.exit(code ?? 0);
});

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect current exit handler arity and fallback behavior
rg -n -C2 'child\.on\("exit"|process\.exit\(code \?\? 0\)' packages/cli/src/commands/serve.ts

Repository: hallengray/rag-forge

Length of output: 178


🏁 Script executed:

cat packages/cli/package.json | head -50

Repository: hallengray/rag-forge

Length of output: 1079


🏁 Script executed:

fd -type f -name "*.test.*" -o -name "*.spec.*" | grep -i serve

Repository: hallengray/rag-forge

Length of output: 234


🏁 Script executed:

rg "child\.on\(.*exit" --type ts --type js

Repository: hallengray/rag-forge

Length of output: 133


🏁 Script executed:

find packages/cli -type f \( -name "*.test.*" -o -name "*.spec.*" \)

Repository: hallengray/rag-forge

Length of output: 98


🏁 Script executed:

cat -n packages/cli/src/commands/serve.ts | head -60

Repository: hallengray/rag-forge

Length of output: 1521


🏁 Script executed:

cat -n packages/cli/__tests__/cli.test.ts

Repository: hallengray/rag-forge

Length of output: 1384


Signal-terminated child processes incorrectly exit with success code.

The exit handler only uses the code parameter and ignores signal. When a child process is terminated by a signal, code is null, so code ?? 0 exits the parent with code 0 (success), masking interrupt and termination signals like SIGTERM or SIGKILL.

Node.js >=20.0.0 is required, which provides both code and signal to the exit handler. Handle signals explicitly to propagate them:

Fix
-      child.on("exit", (code) => {
-        process.exit(code ?? 0);
+      child.on("exit", (code, signal) => {
+        if (signal) {
+          process.kill(process.pid, signal);
+          return;
+        }
+        process.exit(code ?? 1);
       });

Also change ?? 0 to ?? 1 to avoid masking unexpected null exit codes as success.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/cli/src/commands/serve.ts` around lines 36 - 38, The exit handler
for child.on("exit", (code) => ...) ignores the signal parameter and treats null
codes as success; update the handler signature to accept both (code, signal),
propagate signal-terminated exits by re-emitting the same signal to the parent
(e.g., process.kill(process.pid, signal)) when signal is set, and fall back to
exiting with a non-zero code by changing process.exit(code ?? 0) to
process.exit(code ?? 1) when no signal is present; update the child.on("exit",
...) callback accordingly so signal and code are both handled.

Comment on lines +260 to +265
    try:
        count = store.count(collection)
        indexed = count > 0
    except (ValueError, KeyError):
        count = 0
        indexed = False

🧹 Nitpick | 🔵 Trivial

cmd_status exception branch is effectively dead code.

QdrantStore.count() already swallows exceptions and returns 0, so this except (ValueError, KeyError) path is not doing real work. Simplifying this block will make status behavior clearer.

♻️ Proposed simplification
 def cmd_status(args: argparse.Namespace) -> None:
     """Check pipeline status."""
     collection = args.collection or "rag-forge"
     store = QdrantStore()
-    try:
-        count = store.count(collection)
-        indexed = count > 0
-    except (ValueError, KeyError):
-        count = 0
-        indexed = False
+    count = store.count(collection)
+    indexed = count > 0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    try:
        count = store.count(collection)
        indexed = count > 0
    except (ValueError, KeyError):
        count = 0
        indexed = False
def cmd_status(args: argparse.Namespace) -> None:
    """Check pipeline status."""
    collection = args.collection or "rag-forge"
    store = QdrantStore()
    count = store.count(collection)
    indexed = count > 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/rag_forge_core/cli.py` around lines 260 - 265, The except
branch around store.count(collection) in cmd_status is dead because
QdrantStore.count() already handles errors and returns 0; remove the try/except
and simplify to a direct call: set count = store.count(collection) and indexed =
count > 0. Update the block that currently contains the try/except (referencing
store.count and collection in cmd_status) to use this simplified logic so the
ValueError/KeyError branch is removed.

Comment on lines +15 to +19
const args = ["audit", "--judge", "mock"];

if (input.golden_set_path) {
  args.push("--golden-set", input.golden_set_path);
}

⚠️ Potential issue | 🟠 Major

Do not silently ignore input.metrics.
Line 15–Line 19 build CLI args without using input.metrics, even though the tool schema accepts it. Either wire this field into the evaluator invocation or reject it explicitly to avoid a misleading API contract.

Proposed guard to avoid silent input loss
 export async function handleRagAudit(input: RagAuditInput): Promise<string> {
+  if (input.metrics?.length) {
+    return JSON.stringify({
+      status: "error",
+      message: "The 'metrics' filter is not supported by rag_forge_evaluator.cli yet",
+    });
+  }
+
   const args = ["audit", "--judge", "mock"];
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const args = ["audit", "--judge", "mock"];

if (input.golden_set_path) {
  args.push("--golden-set", input.golden_set_path);
}
if (input.metrics?.length) {
  return JSON.stringify({
    status: "error",
    message: "The 'metrics' filter is not supported by rag_forge_evaluator.cli yet",
  });
}

const args = ["audit", "--judge", "mock"];

if (input.golden_set_path) {
  args.push("--golden-set", input.golden_set_path);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/mcp/src/tools/rag-audit.ts` around lines 15 - 19, The code builds
CLI args in rag-audit.ts (const args = ["audit", "--judge", "mock"]; and
subsequent pushes) but ignores input.metrics from the tool schema; either wire
input.metrics into the args before invoking the evaluator or explicitly reject
it. Update the args construction to map input.metrics to the appropriate CLI
flag(s) (e.g., push "--metrics" and the metrics value or serialize as needed)
when input.metrics is present, or add a validation step that throws or returns
an error if input.metrics is set to prevent silent input loss; adjust the
evaluator invocation that consumes args accordingly so the metrics are actually
passed through.

Comment thread templates/enterprise/project/src/config.py
Comment on lines +4 to +21
metrics:
  - context_relevance
  - faithfulness
  - answer_relevance
  - hallucination

thresholds:
  context_relevance: 0.80
  faithfulness: 0.85
  answer_relevance: 0.80
  hallucination_rate: 0.05

ci_gate:
  metric: faithfulness
  threshold: 0.85
  block_on_failure: true

golden_set: eval/golden_set.json

⚠️ Potential issue | 🟠 Major

metrics/ci_gate/golden_set appear non-functional with current evaluator config model.
Line 4–Line 21 define keys that are not represented in packages/evaluator/src/rag_forge_evaluator/audit.py AuditConfig, while packages/evaluator/src/rag_forge_evaluator/cli.py only reads thresholds from config JSON. This can mislead template users into thinking these gates are enforced when they are not.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@templates/hybrid/project/eval/config.yaml` around lines 4 - 21, The template
exposes keys (metrics, ci_gate, golden_set) that AuditConfig in
packages/evaluator/src/rag_forge_evaluator/audit.py and the CLI in
packages/evaluator/src/rag_forge_evaluator/cli.py don’t handle; either remove
them from the template or make the evaluator read and enforce them—update
AuditConfig to add typed fields for metrics: List[str], ci_gate: {metric: str,
threshold: float, block_on_failure: bool}, and golden_set: Optional[str], update
the CLI parsing in cli.py to load those fields from the config JSON, and
implement enforcement logic that uses AuditConfig.ci_gate to fail/exit when gate
conditions are not met and loads AuditConfig.golden_set when present to evaluate
golden examples.

Comment thread templates/hybrid/project/README.md
Comment thread templates/hybrid/project/src/config.py
Comment thread turbo.json
Comment on lines +17 to +19
"typecheck": {
  "dependsOn": ["^build"]
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

In Turborepo task configuration, does "dependsOn": ["^build"] include the current package's build task, or only dependency packages' build tasks?

💡 Result:

"dependsOn": ["^build"] includes only dependency packages' build tasks, not the current package's build task. The ^ prefix specifically means "run the task in direct dependencies before the target package." To depend on the current package's own build task (e.g., for a test task), use "build" without the ^ prefix.

typecheck should explicitly depend on the local build task.

In Turborepo, "dependsOn": ["^build"] targets only the build tasks in dependency packages, not the current package's build task. To ensure typecheck runs after the local build task completes, add "build" to the dependsOn array.

Proposed fix
    "typecheck": {
-      "dependsOn": ["^build"]
+      "dependsOn": ["build", "^build"]
    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"typecheck": {
  "dependsOn": ["^build"]
}
"typecheck": {
  "dependsOn": ["build", "^build"]
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@turbo.json` around lines 17 - 19, The "typecheck" pipeline currently has
"dependsOn": ["^build"] which only depends on build tasks in dependency
packages; update the "typecheck" pipeline's dependsOn to include the local build
by adding "build" to the array (i.e., "dependsOn": ["build", "^build"]) so the
local build task runs before the "typecheck" task; locate the "typecheck" entry
and modify its dependsOn accordingly.

- serve.ts: validate MCP path exists before spawning, use process.execPath
- rag-query/audit/status: preserve stdout on error for partial results
- python-bridge: add 5min timeout, handle signal-terminated processes
- enterprise/hybrid templates: default reranker to "none", add cohere_api_key field
- hybrid README: fix query command args, add prerequisites section
- enterprise README: add prerequisites section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hallengray hallengray merged commit 5b2babc into main Apr 12, 2026
2 of 3 checks passed
hallengray added a commit that referenced this pull request Apr 14, 2026
feat: Phase 2D — MCP Server Wiring + Templates
hallengray added a commit that referenced this pull request Apr 15, 2026
Cycle 3's PearMedica audit (2026-04-15) documented two residual skip-
handling gaps in v0.2.1, separate from the C3-2 generate() AttributeError:

1. EvaluationResult.skipped_evaluations (the integer counter the report's
   TL;DR "Skipped: N" line reads) was never set by RagasEvaluator.
   skipped_samples held real SkipRecords, but the counter stayed at 0.
   Users reading the top-level summary saw "Scored: 0, Skipped: 0" and
   thought nothing had happened when in reality every job had crashed.

2. Individual-metric extraction failures created one SkipRecord per
   metric name with sample_id="<aggregate>". A 12-sample x 4-metric run
   that failed entirely produced 4 records instead of 48 — the blast
   radius was under-reported by 12x.

Fix:
- Extract skip-record creation into _fan_out_skip_records() — fans a
  single coarse failure out into one record per (sample, metric) pair
  with the real sample_id attached.
- Both the whole-batch crash path and the per-metric extraction failure
  path now use it.
- Set EvaluationResult.skipped_evaluations = len(skipped_samples) on
  both return sites so the counter and the detail list can never drift
  apart silently again.
- Truncate reason strings to 400 chars (with trailing ellipsis) so long
  Python tracebacks don't blow up HTML/PDF rendering downstream.
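
The fan-out shape can be sketched in a few lines (an illustrative sketch, not the real rag_forge_evaluator code; the SkipRecord fields and the helper name are simplified stand-ins):

```python
from dataclasses import dataclass

MAX_REASON_LEN = 400

@dataclass
class SkipRecord:
    sample_id: str
    metric: str
    reason: str

def fan_out_skip_records(sample_ids, metric_names, reason):
    # One coarse failure becomes one record per (sample, metric) pair,
    # each carrying the real sample_id, with the reason truncated so
    # long tracebacks cannot blow up report rendering downstream.
    if len(reason) > MAX_REASON_LEN:
        reason = reason[:MAX_REASON_LEN] + "…"
    return [
        SkipRecord(sample_id=sid, metric=m, reason=reason)
        for sid in sample_ids
        for m in metric_names
    ]

records = fan_out_skip_records(
    [f"s{i}" for i in range(12)],
    ["context_relevance", "faithfulness", "answer_relevance", "hallucination"],
    "boom",
)
skipped_evaluations = len(records)  # counter derived from the same list
```

A 12-sample x 4-metric failure now yields 48 records with real sample IDs, and the counter cannot drift from the detail list because both come from the same value.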

Tests:
- test_ragas_skip_counter.py — three new tests covering:
  (a) skipped_evaluations counter equals len(skipped_samples),
  (b) whole-batch crashes fan out to sample_count * metric_count records
      with real sample_ids (not "<aggregate>"),
  (c) reason truncation at 400 chars.
- test_cycle2_regression.py — updated two assertions that pre-dated the
  fan-out change. The original assertions demanded "metrics must be
  populated"; with fan-out, a MockJudge run legitimately produces zero
  scored metrics and a full skip list. Tests now assert the absence of
  Finding #4 (embed_query AttributeError) and Finding #5
  (InstructorRetryException / max_tokens truncation) signatures in the
  skip records instead, which is the actual regression guard.
- test_evaluator_factory.py — gate test_ragas_not_installed_raises_on_evaluate
  with a reverse importorskip so it runs only on systems without the
  [ragas] extra. CI matrices with ragas installed now skip it instead
  of failing.

Pre-existing mypy arg-type errors on the ragas_evaluate() kwargs silenced
with targeted # type: ignore comments and a docstring note pointing
readers at the adapter contract test as the real interface guard — duck
typing is the deliberate design, not an oversight.
hallengray added a commit that referenced this pull request Apr 15, 2026
G2's deliverable is proof — not claims — that v0.2.2's code actually
closes every Cycle 2 / Cycle 3 finding attributed to RAG-Forge. This
file is the artifact.

One section per finding with:
  - what the original audit saw (raw error signatures)
  - why earlier cycles could not verify the fix (masking)
  - what v0.2.2 changes to close the gap
  - passing test output naming each test by ID

Covers:
  Cycle 2 #4  — OpenAIEmbeddings.embed_query AttributeError  (FIXED)
  Cycle 2 #5  — max_tokens overflow on long responses        (FIXED)
  Cycle 2 #6  — silent 0.0 coercion                          (FULLY FIXED)
  Cycle 3 C3-2 — RagForgeRagasLLM.generate missing           (FIXED)
  Cycle 3 C3-5 — stale __version__                           (FIXED)

Closes with a falsifiability section naming the concrete error
signatures that, if they reappear in a Cycle 4 run against real
PearMedica telemetry, would falsify the report and justify holding
the release. The release bar is "Cycle 4 produces non-empty metrics
OR skip records with real error signatures — not empty both."

Combined test suite run shows 31/31 passing on the rebased G1+G3+G5
branch.
hallengray added a commit that referenced this pull request Apr 15, 2026
…3) (#38)

* fix(evaluator): add missing ragas BaseRagasLLM methods (G1)

RAG-Forge v0.2.1's RagForgeRagasLLM wrapper was shipped without the
concrete .generate() method that ragas 0.4.x's BaseRagasLLM exposes.
Every RAGAS metric job crashed on first contact with
  AttributeError: 'RagForgeRagasLLM' object has no attribute 'generate'
during the PearMedica Cycle 3 audit (2026-04-15), producing a
Scored: 0, Overall: 0.0000 report across all 48 metric evaluations.

Root cause: the adapter's design doc asserted "ragas only calls
generate_text / agenerate_text / model_name" and deliberately avoided
subclassing BaseRagasLLM to keep ragas as a soft dependency. That
assumption was wrong — BaseRagasLLM.generate is a concrete async helper
that ragas's metric code invokes on every LLM regardless of subclass
status. The duck-typed wrapper must re-declare it.

Fix preserves the no-hard-import design by adding duck-typed shims for
every public method on BaseRagasLLM and BaseRagasEmbeddings:

  RagForgeRagasLLM:
    - async generate()        — the specific shim Cycle 3 caught missing
    - is_finished()           — was abstract on base, returns True
    - get_temperature(n)      — matches base convention (0.01 / 0.3)
    - set_run_config()        — stores ragas RunConfig for compatibility
    - run_config attribute    — defaults to None

  RagForgeRagasEmbeddings:
    - embed_text(is_async)    — dispatch helper on base class
    - embed_texts(is_async)   — batch variant
    - set_run_config()
    - run_config attribute
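
The failure mode and the shim fix reduce to a toy example (illustrative names only; BaseRagasLLMLike stands in for ragas's real BaseRagasLLM):

```python
import asyncio

class BaseRagasLLMLike:
    """Toy stand-in for ragas's base class: generate() is a concrete
    helper that metric code calls on every LLM, subclass or not."""
    async def generate(self, prompt):
        return await self.agenerate_text(prompt)

class DuckTypedWrapper:
    """Only the methods the design doc assumed ragas would call."""
    async def agenerate_text(self, prompt):
        return f"answer:{prompt}"

class FixedWrapper(DuckTypedWrapper):
    """Re-declares the concrete helper the base would have provided."""
    async def generate(self, prompt):
        return await self.agenerate_text(prompt)

async def metric_step(llm, prompt):
    # what ragas metric code actually invokes
    return await llm.generate(prompt)

try:
    asyncio.run(metric_step(DuckTypedWrapper(), "q"))
    crashed = False
except AttributeError:
    crashed = True

ok = asyncio.run(metric_step(FixedWrapper(), "q"))
```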

Tests: a contract test iterates every public method on the real ragas
base classes and asserts our wrappers declare a callable of the same
name, so the next ragas release that adds a method fails in CI not in
a user audit. An end-to-end smoke test runs ragas.evaluate() against
our wrappers on a 1-sample dataset and asserts it never raises an
AttributeError naming our wrapper classes — the exact regression
signature from Cycle 3.

Three pre-existing test failures in test_cycle2_regression.py and
test_evaluator_factory.py are unrelated to this change (pre-existing
in v0.2.1 when ragas is installed; they assert on compound conditions
that break when MockJudge triggers the skip path). They belong in
workstream G3 (skip counter plumbing) and the G1 PR leaves them alone.

* fix(evaluator): address CodeRabbit review on G1 adapter shims

Five fixes from CodeRabbit review on PR #36. All were real issues the
original contract test missed because it only checked method names,
not signatures.

### generate(n > 1) now produces the correct LLMResult shape

ragas uses n > 1 for multi-sample metrics (answer_correctness
consistency checks, etc.) and expects LLMResult.generations shaped
[[gen1, gen2, ..., genN]] — one prompt run with N candidate
generations. The original shim ignored n and returned a single
generation, silently breaking any metric that relied on sample
diversity.

Fix: when n > 1, fan out n independent agenerate_text calls via
asyncio.gather and fuse the per-call results into a single
[[gen1..genN]]-shaped LLMResult via the new _fuse_llm_results helper.
n == 1 (the common case) stays a single call.

New test: test_wrapper_async_generate_n_greater_than_one_produces_
fused_shape uses a counting judge to verify all N calls fire and the
fused result carries all N distinct outputs.
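
A minimal sketch of the fan-out-and-fuse shape (StubResult is an illustrative stand-in for langchain's LLMResult, not the real class):

```python
import asyncio
from dataclasses import dataclass

@dataclass
class StubResult:
    # mimics the .generations[0] shape of a single-candidate LLM result
    generations: list

async def agenerate_text(prompt: str, i: int) -> StubResult:
    # stand-in for one single-candidate LLM call
    return StubResult(generations=[[f"gen{i}"]])

async def generate(prompt: str, n: int) -> list:
    # fan out n independent calls, then fuse into [[gen1, ..., genN]]
    results = await asyncio.gather(
        *(agenerate_text(prompt, i) for i in range(n))
    )
    return [[r.generations[0][0] for r in results]]

out = asyncio.run(generate("q", 3))
```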

### embed_text / embed_texts now async (crash fix)

Critical bug: the original shims were sync methods wrapping
asyncio.run(self.aembed_query(text)). ragas's BaseRagasEmbeddings
declares embed_text and embed_texts as async coroutines and metric
code invokes them with await embeddings.embed_text(...) from inside
ragas's evaluation event loop. asyncio.run() inside a running loop
crashes with RuntimeError: asyncio.run() cannot be called from a
running event loop — a real live-fire crash that would have taken
down any Cycle 4 run.

Fix: make both methods async. They now just await the existing
aembed_query / aembed_documents paths. The is_async parameter is
accepted for signature parity with the base class but ignored —
our underlying clients are synchronous and aembed_query already
runs them in a worker thread via asyncio.to_thread, so both flag
values land on the same code path.
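
The crash and the fix can be reproduced in isolation (a sketch, with a trivial aembed_query standing in for the real embedding call):

```python
import asyncio

async def aembed_query(text: str) -> list:
    # trivial stand-in for the real embedding call
    return [0.0]

def embed_text_sync(text: str) -> list:
    # the broken shape: asyncio.run inside a running loop raises RuntimeError
    return asyncio.run(aembed_query(text))

async def embed_text(text: str, is_async: bool = True) -> list:
    # the fix: the wrapper is itself a coroutine and simply awaits
    return await aembed_query(text)

async def caller():
    # simulates ragas invoking the embeddings from inside its event loop
    try:
        embed_text_sync("x")
        sync_ok = True
    except RuntimeError:
        sync_ok = False
    fixed = await embed_text("x")
    return sync_ok, fixed

sync_ok, fixed = asyncio.run(caller())
```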

New tests:
- test_embeddings_embed_text_is_async_and_awaitable: asserts both
  methods are inspect.iscoroutinefunction and round-trips through
  asyncio.run(embed.embed_text(...)).
- test_embeddings_embed_text_callable_from_running_event_loop:
  exercises the exact asyncio.run-inside-loop path ragas creates.

### Contract test now checks async/sync parity, not just names

CodeRabbit correctly pointed out that the original contract test
compared method names but not their async/sync shape — exactly why
the sync embed_text / embed_texts slipped past review.

Fix: two new tests (test_llm_wrapper_async_signature_matches_base
and test_embeddings_wrapper_async_signature_matches_base) iterate
every public method on the real ragas base classes and assert
inspect.iscoroutinefunction matches on both sides. A future release
that adds a method in one shape or the other will fail in CI.

### generate() default temperature: None -> 0.01

BaseRagasLLM.generate's default is 0.01. Our shim had None, which
is functionally equivalent (None triggers get_temperature()) but
diverges from the base signature. Changed to 0.01 to match.

Tests that exercise the None fallback still pass None explicitly.

### Tightened existing temperature-fallback test

The original test asserted "doesn't raise" but would have passed
even if get_temperature() was never called. Now patches
llm.get_temperature with wraps= and asserts assert_called_once_with(1).
Added inverse test that patches get_temperature and asserts it was
NOT called when an explicit temperature is passed — guards against
over-eager fallback firing.

### Narrowed e2e smoke test exception handler

Original except Exception accepted any non-AttributeError as a pass.
A bad shim raising TypeError or a nested-loop bug raising RuntimeError
would have counted as success. Tightened to allow only
RagasOutputParserException (expected downstream parser failure from
the stub judge's canned JSON); everything else propagates.

* fix(evaluator): fuse _StringLLMResult stubs in n>1 path for no-ragas CI

The n>1 fan-out fix in the previous commit built a fused LLMResult
via langchain_core.outputs.LLMResult. On CI without the [ragas]
extra installed, langchain isn't available and generate_text falls
back to the _StringLLMResult stub — which _fuse_llm_results had no
fuse path for, so it hit the ImportError fallback and returned
results[0] (a single generation). The n>1 test then asserted
inner length == 3 and failed with 1.

Fix: separate the flatten step (build the fused_generations list)
from the wrap step (pick LLMResult or _StringLLMResult). The
flatten works on either shape because both expose the same
.generations[0] interface. Wrap picks langchain if available,
otherwise constructs via a new _StringLLMResult._from_generations
alt constructor that carries a pre-fused list.

Local verification:
  - without [ragas]: 18/18 adapter tests pass (contract + e2e skip)
  - with [ragas]: 28/28 adapter tests pass

* fix(evaluator): address CodeRabbit round-2 findings on G1

Three more findings from the second CodeRabbit review on PR #36 / #38.

### _fuse_llm_results now fails loud on malformed input

The previous commit's AttributeError/IndexError fallback silently
returned results[0] — turning a requested n>1 generation into a
single-sample result with no signal. Downstream ragas metrics that
rely on sample diversity would consume the degraded result as if it
were a valid N-sample fuse. CodeRabbit rightly flagged this as
hiding real correctness bugs.

Fix: raise ValueError with the observed result types and the
underlying exception chained. Added a targeted test
(test_fuse_llm_results_raises_on_malformed_input) covering three
malformed-input shapes: objects without .generations, empty outer
lists, and empty input lists.
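
A sketch of the fail-loud behavior (StubResult and the error message are illustrative, not the real helper):

```python
class StubResult:
    """Stand-in for a single-generation LLM result (illustrative only)."""
    def __init__(self, text):
        self.generations = [[text]]

def fuse_llm_results(results):
    # Fuse n single-generation results into one [[gen1, ..., genN]] result.
    # Malformed input raises instead of silently degrading to results[0].
    if not results:
        raise ValueError("cannot fuse an empty result list")
    try:
        return [[r.generations[0][0] for r in results]]
    except (AttributeError, IndexError) as exc:
        types = ", ".join(type(r).__name__ for r in results)
        raise ValueError(f"malformed LLM results: {types}") from exc

fused = fuse_llm_results([StubResult("a"), StubResult("b")])
```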

### CountingJudge in n>1 test is now thread-safe

RagForgeRagasLLM.generate fans out via asyncio.gather, and each
agenerate_text call runs in a worker thread via asyncio.to_thread.
counter["i"] += 1 is not atomic across threads — the read-modify-
write race would produce duplicate sample labels or an undercounted
total, flaking the test. Added a threading.Lock around the
increment. Test now deterministic under concurrent fan-out.
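
The locked counter looks roughly like this (a minimal sketch; CountingJudge here is a toy, not the real test fixture):

```python
import asyncio
import threading

class CountingJudge:
    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0

    def generate(self, prompt: str) -> str:
        # read-modify-write guarded so concurrent to_thread calls can't race
        with self._lock:
            self.count += 1
            return f"sample-{self.count}"

async def fan_out(judge: CountingJudge, n: int) -> list:
    # mirrors the asyncio.gather + asyncio.to_thread fan-out path
    return await asyncio.gather(
        *(asyncio.to_thread(judge.generate, "p") for _ in range(n))
    )

judge = CountingJudge()
outputs = asyncio.run(fan_out(judge, 8))
```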

### Contract tests now check parameter parity, not just name + async

Previously we asserted that each public method on BaseRagasLLM /
BaseRagasEmbeddings exists on our wrapper and matches async/sync.
That missed parameter drift — a future ragas release that adds
max_tokens to generate_text would silently break our wrapper until
a user audit caught it.

Fix: two new tests (test_llm_wrapper_parameter_names_cover_base_class
and test_embeddings_wrapper_parameter_names_cover_base_class) use
inspect.signature to enumerate the named parameters on each base-
class method and assert the wrapper accepts every name. Variadic
*args / **kwargs are ignored (they can absorb any kwarg by
definition). A helper _required_param_names() shares the logic.
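
The parity-check pattern, reduced to a self-contained sketch (Base and Wrapper are toy classes standing in for the ragas base classes and our wrappers):

```python
import inspect

class Base:
    """Toy stand-in for ragas's BaseRagasLLM."""
    async def generate_text(self, prompt, n=1, temperature=0.01): ...
    def is_finished(self, response): ...

class Wrapper:
    """Toy stand-in for our wrapper: must mirror Base's surface."""
    async def generate_text(self, prompt, n=1, temperature=0.01, **kwargs): ...
    def is_finished(self, response): ...

def required_param_names(func):
    # Named parameters only; self and variadics can absorb anything.
    return {
        p.name
        for p in inspect.signature(func).parameters.values()
        if p.name != "self" and p.kind not in (p.VAR_POSITIONAL, p.VAR_KEYWORD)
    }

mismatches = []
for name, base_method in inspect.getmembers(Base, callable):
    if name.startswith("_"):
        continue
    wrapped = getattr(Wrapper, name, None)
    if not callable(wrapped):
        mismatches.append(f"{name}: missing")
    elif inspect.iscoroutinefunction(base_method) != inspect.iscoroutinefunction(wrapped):
        mismatches.append(f"{name}: async/sync mismatch")
    elif not required_param_names(base_method) <= required_param_names(wrapped):
        mismatches.append(f"{name}: parameter drift")
```

A future release that adds, renames, or changes the shape of a method on the base class turns up in `mismatches` in CI rather than in a user audit.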

Local: 30 passed on adapter + contract + e2e suites with ragas
installed, 18 passed on adapter suite alone without the [ragas]
extra. Ruff + mypy clean.

* fix(evaluator): route RAGAS exceptions into Skipped counter (G3)


* fix(evaluator): tolerate cross-environment mypy on ragas_evaluate ignores

The # type: ignore[arg-type] comments on the ragas_evaluate() kwargs
only fire when mypy can see ragas's real type signatures — i.e. when
the [ragas] extra is installed (local dev). On CI without the extra,
mypy falls back to Any and flags the ignores as [unused-ignore].

Adding unused-ignore to the ignore codes tells mypy to tolerate the
comment when it has nothing to suppress. Both environments are now
clean: local mypy with the extra still sees the arg-type mismatch and
honours the suppression; CI mypy without the extra silently accepts
the comment as dead.

* fix(evaluator): remove dead real_import code in G3 skip counter test

CodeRabbit on PR #38 spotted the leftover
  real_import = mod.__dict__.copy()   # line 96
  _ = real_import                      # line 111
in test_whole_batch_crash_fans_out_to_every_sample_metric_pair. The
dict was copied but never used — an artifact of an earlier
monkeypatching approach that was simplified but never cleaned up.
Removed it along with the unnecessary module-level import that fed it.

No behavioural change; test still passes.
hallengray added a commit that referenced this pull request Apr 15, 2026
* docs(v0.2.2): Cycle 2/3 verification evidence report (G2)

G2's deliverable is proof — not claims — that v0.2.2's code actually
closes every Cycle 2 / Cycle 3 finding attributed to RAG-Forge. This
file is the artifact.

One section per finding with:
  - what the original audit saw (raw error signatures)
  - why earlier cycles could not verify the fix (masking)
  - what v0.2.2 changes to close the gap
  - passing test output naming each test by ID

Covers:
  Cycle 2 #4  — OpenAIEmbeddings.embed_query AttributeError  (FIXED)
  Cycle 2 #5  — max_tokens overflow on long responses        (FIXED)
  Cycle 2 #6  — silent 0.0 coercion                          (FULLY FIXED)
  Cycle 3 C3-2 — RagForgeRagasLLM.generate missing           (FIXED)
  Cycle 3 C3-5 — stale __version__                           (FIXED)

Closes with a falsifiability section naming the concrete error
signatures that, if they reappear in a Cycle 4 run against real
PearMedica telemetry, would falsify the report and justify holding
the release. The release bar is "Cycle 4 produces non-empty metrics
OR skip records with real error signatures — not both empty."

Combined test suite run shows 31/31 passing on the rebased G1+G3+G5
branch.

* docs(v0.2.2): markdown polish on G2 verification report

Addresses CodeRabbit findings on PR #40.

- Added language specifiers to every fenced code block (10 blocks,
  all previously bare ```). Used text for error traces and pytest
  output, bash for the shell command block. Silences markdownlint MD040.
- Reformatted the inline 7-method enumeration in the Finding #4
  section as a proper bulleted list. The methods now render one per
  line instead of as a comma-separated prose run, which was hard to
  scan.

* docs(v0.2.2): clarify version-fix scoping + US spelling on G2 report

CodeRabbit on PR #40 round 2 raised two findings on the C3-5 section:

1. The evidence snippet showing __version__ = "0.1.0" vs pyproject
   version = "0.2.1" could be read as a current-state claim rather
   than a pre-fix snapshot. Added an explicit "pre-v0.2.2 state"
   header and a paragraph scoping the snippet so nobody reads it
   as the repo's current state.
2. "parametrized" -> "parameterized" for US spelling consistency.
   Also updated the description to reflect G5's round-2 move to
   auto-discovery so the report matches the actual test impl.