
fix(evaluator): add missing ragas BaseRagasLLM methods (v0.2.2 G1) #36

Merged

hallengray merged 4 commits into main from fix/v0.2.2-g1-ragas-adapter-generate on Apr 15, 2026

Conversation

@hallengray
Owner

Summary

  • Re-declares every public method on ragas.llms.base.BaseRagasLLM and ragas.embeddings.base.BaseRagasEmbeddings as duck-typed shims on RagForgeRagasLLM / RagForgeRagasEmbeddings
  • Fixes the Cycle 3 C3-2 regression: v0.2.1 crashed every RAGAS metric job in ~6s with AttributeError: 'RagForgeRagasLLM' object has no attribute 'generate'
  • Adds a contract test that iterates every public method on the real ragas base classes and asserts the wrappers declare a callable of the same name — so the next ragas release that grows a new method fails fast in CI, not in a user audit

Root cause

The adapter's module docstring claimed "ragas only calls generate_text / agenerate_text / model_name" and deliberately avoided subclassing BaseRagasLLM to keep ragas a soft dependency. That assumption was wrong — BaseRagasLLM.generate is a concrete async helper that ragas's metric code invokes on every LLM, regardless of subclass status. Because our duck-typed class doesn't inherit from BaseRagasLLM, it doesn't inherit .generate(), and RAGAS's first metric call crashes.

The fix preserves the no-hard-import design. We implement our own .generate() / is_finished() / get_temperature() / set_run_config() shims that forward to the existing generate_text / agenerate_text. When the contract test was added it caught two additional missing methods on the embeddings wrapper (embed_text, embed_texts) — exactly the kind of drift the test exists to prevent.

What v0.2.2 adds on RagForgeRagasLLM

  • async generate(prompt, n, temperature, stop, callbacks) -> LLMResult — the specific shim Cycle 3 caught missing
  • is_finished(response) -> bool — returns True; Judge protocol has no finish signal
  • get_temperature(n) -> float — matches base convention (0.01 / 0.3)
  • set_run_config(run_config) — stores ragas RunConfig for contract compatibility
  • run_config: Any attribute defaulting to None

What v0.2.2 adds on RagForgeRagasEmbeddings

  • embed_text(text, is_async) + embed_texts(texts, is_async) — dispatch helpers the base class exposes
  • set_run_config(run_config) — contract parity with the LLM wrapper

Tests

  • tests/test_ragas_adapters_contract.py (new): the contract tripwire. Iterates every public method on BaseRagasLLM / BaseRagasEmbeddings and asserts our wrappers declare them. Also adds instance-level tests: generate is async, is_finished returns True, get_temperature(n) matches the base convention, and set_run_config stores the value.
  • tests/test_ragas_adapters_e2e.py (new): runs ragas.evaluate() with our wrappers on a 1-sample dataset and fails if ragas hits an AttributeError referencing our wrapper classes. This would have caught the C3-2 regression before v0.2.1 shipped.
  • tests/test_ragas_adapters.py — updated to extract text from real LLMResult objects (langchain now pulled in via the [ragas] extra), plus new tests for the async generate shim and the temperature=None default-resolution path.

Local run: 22 passed, 0 failed on the adapter test suite.

Test plan

  • pnpm run typecheck clean
  • pnpm run test:py green (filter to tests/test_ragas_adapters*.py if faster)
  • Contract test asserts every ragas base-class method (not just generate)
  • End-to-end smoke test runs ragas.evaluate() without hitting AttributeError
  • No change to the deliberate "don't hard-import ragas" design — the adapter module still imports cleanly without the [ragas] extra

Merge order

First of the v0.2.2 series. Plan is G1 → G3 → {G2, G4, G5} in parallel → release prep. G3's branch is already rebased on this one, so merging this unblocks G3 immediately.

Full release plan: docs/release-notes/v0.2.2-plan.md (lands with G2).

RAG-Forge v0.2.1's RagForgeRagasLLM wrapper was shipped without the
concrete .generate() method that ragas 0.4.x's BaseRagasLLM exposes.
Every RAGAS metric job crashed on first contact with
  AttributeError: 'RagForgeRagasLLM' object has no attribute 'generate'
during the PearMedica Cycle 3 audit (2026-04-15), producing a
Scored: 0, Overall: 0.0000 report across all 48 metric evaluations.

Root cause: the adapter's design doc asserted "ragas only calls
generate_text / agenerate_text / model_name" and deliberately avoided
subclassing BaseRagasLLM to keep ragas as a soft dependency. That
assumption was wrong — BaseRagasLLM.generate is a concrete async helper
that ragas's metric code invokes on every LLM regardless of subclass
status. The duck-typed wrapper must re-declare it.

Fix preserves the no-hard-import design by adding duck-typed shims for
every public method on BaseRagasLLM and BaseRagasEmbeddings:

  RagForgeRagasLLM:
    - async generate()        — the specific shim Cycle 3 caught missing
    - is_finished()           — was abstract on base, returns True
    - get_temperature(n)      — matches base convention (0.01 / 0.3)
    - set_run_config()        — stores ragas RunConfig for compatibility
    - run_config attribute    — defaults to None

  RagForgeRagasEmbeddings:
    - embed_text(is_async)    — dispatch helper on base class
    - embed_texts(is_async)   — batch variant
    - set_run_config()
    - run_config attribute

Tests: a contract test iterates every public method on the real ragas
base classes and asserts our wrappers declare a callable of the same
name, so the next ragas release that adds a method fails in CI not in
a user audit. An end-to-end smoke test runs ragas.evaluate() against
our wrappers on a 1-sample dataset and asserts it never raises an
AttributeError naming our wrapper classes — the exact regression
signature from Cycle 3.

Three pre-existing test failures in test_cycle2_regression.py and
test_evaluator_factory.py are unrelated to this change (pre-existing
in v0.2.1 when ragas is installed; they assert on compound conditions
that break when MockJudge triggers the skip path). They belong in
workstream G3 (skip counter plumbing) and the G1 PR leaves them alone.
@coderabbitai

coderabbitai bot commented Apr 15, 2026

Summary by CodeRabbit

  • New Features

    • Enhanced evaluation adapters with improved configuration management and state tracking.
    • Added parallel generation support and dynamic temperature adjustment for LLM evaluation.
    • Introduced async convenience methods for embeddings operations.
    • Improved result validation with detailed error handling.
  • Tests

    • Expanded test coverage including contract validation and end-to-end evaluation scenarios.

Walkthrough

Implements ragas v0.4.x public contract in RagForge adapters: adds async LLM generate, result fusion, is_finished, temperature helpers, set_run_config/run_config, async embedding conveniences, and accompanying unit, contract, and e2e tests to validate parity and behavior.

Changes

Cohort / File(s) Summary
LLM & Embeddings Adapter
packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
Adds run_config attributes; introduces _fuse_llm_results(results) to merge single-sample results; extends RagForgeRagasLLM with async generate(...), is_finished(...), get_temperature(...), set_run_config(...) and fans out agenerate_text(..., n=1) calls for n>1, then fuses and validates outputs. Extends RagForgeRagasEmbeddings with set_run_config(...) and async convenience embed_text/embed_texts delegating to async embedding methods.
Unit Tests
packages/evaluator/tests/test_ragas_adapters.py
Adds _llm_result_text helper; updates tests to use it; adds tests for async generate shim (prompt forwarding, temperature=None fallback to get_temperature, explicit temperature bypass, n>1 fused shape), _fuse_llm_results validation, and embedding coroutine behavior including safe use inside an active event loop.
Contract Parity Tests
packages/evaluator/tests/test_ragas_adapters_contract.py
New contract tests that import ragas bases (skip if absent) and assert wrappers expose the same public callable names, async/sync parity, and parameter-name coverage; includes instance checks for generate asyncness, is_finished default, get_temperature values, and set_run_config persistence.
End-to-end Smoke Test
packages/evaluator/tests/test_ragas_adapters_e2e.py
Adds conditional e2e test using ragas.evaluate() with deterministic judge stub and tiny dataset; skips if optional deps missing; surfaces AttributeErrors caused by missing wrapper methods and tolerates ragas output/parser exceptions where expected.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant RagForgeRagasLLM as LLM
  participant Judge

  Client->>LLM: await generate(prompt, n, temperature=None)
  LLM->>LLM: temperature = get_temperature(n)  -- if temperature is None
  alt n == 1
    LLM->>Judge: agenerate_text(prompt, n=1, temperature)
    Judge-->>LLM: LLMResult (single)
    LLM->>LLM: is_finished(result) ?
    LLM-->>Client: LLMResult
  else n > 1
    par call 1
      LLM->>Judge: agenerate_text(prompt, n=1, temperature)
    and call 2
      LLM->>Judge: agenerate_text(prompt, n=1, temperature)
    and ... call N
      LLM->>Judge: agenerate_text(prompt, n=1, temperature)
    end
    Judge-->>LLM: multiple LLMResult objects
    LLM->>LLM: _fuse_llm_results(results)
    LLM->>LLM: is_finished(fused_result) ?
    LLM-->>Client: fused LLMResult (generations=[[...]])
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through wrappers, code in paw,
Fused many results—what a small awe!
Temperatures set, run_config snug and neat,
Async calls lined up, each task complete.
Tests guard the path—now hop on with me!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 61.40%, below the required 80.00% threshold. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The PR title clearly summarizes the main change: adding missing ragas BaseRagasLLM methods to the evaluator adapter, specifically addressing the generate method that caused v0.2.1 crashes.
Description check | ✅ Passed | The PR description is comprehensive and directly related to the changeset, explaining the root cause of the regression, the specific methods added, test coverage, and merge order details.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py`:
- Around line 442-462: Replace the use of asyncio.run in the adapter helpers
embed_text and embed_texts with ragas's run_async_in_current_loop utility:
import run_async_in_current_loop from ragas.embeddings.utils and call
run_async_in_current_loop(self.aembed_query(text)) in embed_text and
run_async_in_current_loop(self.aembed_documents(texts)) in embed_texts so the
async helpers (aembed_query, aembed_documents) run correctly whether an event
loop is already running or not.

In `@packages/evaluator/tests/test_ragas_adapters_contract.py`:
- Around line 46-52: The current _public_methods(cls) only compares method names
which misses incompatible wrappers; update the test to also compare
inspect.signature for each public callable: use inspect.signature on the
reference class (e.g., BaseRagasLLM) and on the class under test, iterate the
public method names produced by _public_methods, and assert that each method's
Signature has compatible parameters (matching parameter names, kinds, and
default presence/values where relevant) rather than just names — this should
detect cases like a wrapper using generate(**kwargs) or missing parameters;
reference _public_methods and the BaseRagasLLM.generate signature in the test to
implement the signature comparison and fail the test when signatures are not
compatible.

In `@packages/evaluator/tests/test_ragas_adapters_e2e.py`:
- Around line 143-151: Narrow the broad except Exception handler: instead of
treating any non-AttributeError as acceptable, only allow the downstream
parsing/schema/embedding errors you expect — e.g., catch Exception as exc then
if not isinstance(exc, (ragas.parser.ParserError, ragas.schema.SchemaError,
adapters.MockEmbeddingDimensionError, ValueError)): raise; otherwise continue
and keep the existing AttributeError assertion. Update the except block around
the test (the exc/exc_name handling) so unexpected exceptions are re-raised
rather than treated as success, importing the concrete exception classes used by
the parser/schema/adapter modules.

In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 83-92: Update the test to assert the resolved temperature value is
actually used: override or monkeypatch RagForgeRagasLLM.agenerate_text (or
llm.get_temperature) to capture the temperature passed into agenerate_text, call
llm.generate("prompt", n=1, temperature=None), and assert that the captured
temperature equals the value returned by llm.get_temperature(n) (not just that
no exception was raised); reference the RagForgeRagasLLM class, its generate
method, get_temperature, and agenerate_text to locate where to inject the
spy/capture and perform the equality assertion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c68799c9-6093-482e-8766-4c0cc92a3672

📥 Commits

Reviewing files that changed from the base of the PR and between 18947f4 and 91d10ca.

📒 Files selected for processing (4)
  • packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
  • packages/evaluator/tests/test_ragas_adapters.py
  • packages/evaluator/tests/test_ragas_adapters_contract.py
  • packages/evaluator/tests/test_ragas_adapters_e2e.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: RAG Quality Gate
  • GitHub Check: Lint, Typecheck & Test

Comment thread packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py Outdated
Comment on lines +442 to +462
    def embed_text(self, text: str, is_async: bool = True) -> list[float]:
        """Dispatch helper that ragas's concrete ``BaseRagasEmbeddings``
        exposes alongside ``embed_query``. Some metrics call ``embed_text``
        instead of ``embed_query`` — they are conceptually the same, just
        different entry points. We forward to ``embed_query`` for the
        sync path and to ``aembed_query`` when ``is_async=True``.

        ``is_async=True`` is the base-class default, so we match it — but
        because our underlying Judge/embedding clients are synchronous,
        the "async" path just runs the sync call in a worker thread
        (same pattern as ``aembed_query``).
        """
        if is_async:
            return asyncio.run(self.aembed_query(text))
        return self.embed_query(text)

    def embed_texts(self, texts: list[str], is_async: bool = True) -> list[list[float]]:
        """Batch variant of ``embed_text`` — same dispatch semantics."""
        if is_async:
            return asyncio.run(self.aembed_documents(texts))
        return self.embed_documents(texts)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

How does the current ragas release implement BaseRagasEmbeddings.embed_text and embed_texts, and is it safe to call those helpers via asyncio.run() from code that may already be inside an asyncio event loop?

💡 Result:

In the current Ragas release (main branch as of 2026-04-15), BaseRagasEmbeddings.embed_text and embed_texts are implemented as asynchronous methods in src/ragas/embeddings/base.py. embed_text(text: str, is_async: bool = True) -> List[float] awaits self.embed_texts([text], is_async=is_async) and returns the first embedding. embed_texts(texts: List[str], is_async: bool = True) -> List[List[float]]: - If is_async=True (default): Applies retry via add_async_retry to self.aembed_documents and awaits it. - If is_async=False: Applies retry to sync self.embed_documents and runs it via await loop.run_in_executor(None, ...). These are designed to be called from async contexts (e.g., await embedder.embed_texts(texts)). It is NOT safe to call them via asyncio.run from code already inside an event loop. asyncio.run requires no running loop and creates a new one, raising RuntimeError if called nested. Ragas provides run_async_in_current_loop(coro) in src/ragas/embeddings/utils.py specifically for this: detects running loops (e.g., Jupyter), spawns a thread with new loop if running, or uses loop.run_until_complete(coro) if not running. Used in BaseRagasEmbedding for sync-to-async bridging. For sync calling inside loops, use await loop.run_in_executor(None, lambda: embedder.embed_documents(texts)) or run_async_in_current_loop(embedder.embed_texts(texts)). Direct asyncio.run(awaitable) will fail.

Use Ragas's run_async_in_current_loop() instead of asyncio.run() in these adapter helpers.

Calling asyncio.run(self.aembed_query(text)) or asyncio.run(self.aembed_documents(texts)) will raise RuntimeError if ragas invokes these methods from within an already-running event loop. Ragas provides run_async_in_current_loop() in src/ragas/embeddings/utils.py specifically for this—it detects whether a loop is running and either spawns a thread with a new loop or calls run_until_complete() as appropriate. Migrate these methods to use that utility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py` around
lines 442 - 462, Replace the use of asyncio.run in the adapter helpers
embed_text and embed_texts with ragas's run_async_in_current_loop utility:
import run_async_in_current_loop from ragas.embeddings.utils and call
run_async_in_current_loop(self.aembed_query(text)) in embed_text and
run_async_in_current_loop(self.aembed_documents(texts)) in embed_texts so the
async helpers (aembed_query, aembed_documents) run correctly whether an event
loop is already running or not.

Comment thread packages/evaluator/tests/test_ragas_adapters_contract.py
Comment thread packages/evaluator/tests/test_ragas_adapters_e2e.py Outdated
Comment thread packages/evaluator/tests/test_ragas_adapters.py
Five fixes from CodeRabbit review on PR #36. All were real issues the
original contract test missed because it only checked method names,
not signatures.

### generate(n > 1) now produces the correct LLMResult shape

ragas uses n > 1 for multi-sample metrics (answer_correctness
consistency checks, etc.) and expects LLMResult.generations shaped
[[gen1, gen2, ..., genN]] — one prompt run with N candidate
generations. The original shim ignored n and returned a single
generation, silently breaking any metric that relied on sample
diversity.

Fix: when n > 1, fan out n independent agenerate_text calls via
asyncio.gather and fuse the per-call results into a single
[[gen1..genN]]-shaped LLMResult via the new _fuse_llm_results helper.
n == 1 (the common case) stays a single call.
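The fan-out-and-fuse path above can be sketched with a toy result type in place of langchain's LLMResult; the real `_fuse_llm_results` and signatures live in ragas_adapters.py, so treat this as illustration only.

```python
import asyncio


class ToyResult:
    def __init__(self, generations):
        self.generations = generations  # shape: [[gen, ...]]


def _fuse_llm_results(results):
    if not results:
        raise ValueError("_fuse_llm_results called with empty results list")
    if len(results) == 1:
        return results[0]
    # One prompt run with N candidate generations: [[gen1, ..., genN]]
    fused = [gen for r in results for gen in r.generations[0]]
    return ToyResult(generations=[fused])


async def agenerate_text(prompt, n=1, temperature=None):
    # Stand-in single-sample call; the real path delegates to the Judge.
    return ToyResult(generations=[[f"sample for {prompt!r}"]])


async def generate(prompt, n=1, temperature=None):
    if n == 1:
        return await agenerate_text(prompt, n=1, temperature=temperature)
    # Fan out n independent single-sample calls, then fuse their outputs.
    results = await asyncio.gather(
        *(agenerate_text(prompt, n=1, temperature=temperature) for _ in range(n))
    )
    return _fuse_llm_results(list(results))


fused = asyncio.run(generate("q", n=3))
assert len(fused.generations[0]) == 3  # [[gen1, gen2, gen3]]
```

Keeping n == 1 as a plain single call avoids any gather overhead on the common path.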

New test: test_wrapper_async_generate_n_greater_than_one_produces_fused_shape
uses a counting judge to verify all N calls fire and the fused result
carries all N distinct outputs.

### embed_text / embed_texts now async (crash fix)

Critical bug: the original shims were sync methods wrapping
asyncio.run(self.aembed_query(text)). ragas's BaseRagasEmbeddings
declares embed_text and embed_texts as async coroutines and metric
code invokes them with await embeddings.embed_text(...) from inside
ragas's evaluation event loop. asyncio.run() inside a running loop
crashes with RuntimeError: asyncio.run() cannot be called from a
running event loop — a real live-fire crash that would have taken
down any Cycle 4 run.

Fix: make both methods async. They now just await the existing
aembed_query / aembed_documents paths. The is_async parameter is
accepted for signature parity with the base class but ignored —
our underlying clients are synchronous and aembed_query already
runs them in a worker thread via asyncio.to_thread, so both flag
values land on the same code path.
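Sketched under the same assumptions (a synchronous toy embedding client; method names follow the base-class shape described above, everything else is illustrative):

```python
import asyncio


class EmbeddingsSketch:
    def embed_query(self, text: str) -> list[float]:
        return [float(len(text))]  # toy synchronous embedding

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self.embed_query(t) for t in texts]

    async def aembed_query(self, text: str) -> list[float]:
        # The sync client runs in a worker thread, never blocking the loop.
        return await asyncio.to_thread(self.embed_query, text)

    async def aembed_documents(self, texts: list[str]) -> list[list[float]]:
        return await asyncio.to_thread(self.embed_documents, texts)

    # Base-class-shaped coroutines; is_async is accepted for signature
    # parity but ignored: both flag values land on the worker-thread path.
    async def embed_text(self, text: str, is_async: bool = True) -> list[float]:
        return await self.aembed_query(text)

    async def embed_texts(self, texts: list[str], is_async: bool = True) -> list[list[float]]:
        return await self.aembed_documents(texts)


# Safe from inside a running event loop, unlike the asyncio.run() version:
async def caller():
    emb = EmbeddingsSketch()
    return await emb.embed_text("abc")


assert asyncio.run(caller()) == [3.0]
```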

New tests:
- test_embeddings_embed_text_is_async_and_awaitable: asserts both
  methods are inspect.iscoroutinefunction and round-trips through
  asyncio.run(embed.embed_text(...)).
- test_embeddings_embed_text_callable_from_running_event_loop:
  exercises the exact asyncio.run-inside-loop path ragas creates.

### Contract test now checks async/sync parity, not just names

CodeRabbit correctly pointed out that the original contract test
compared method names but not their async/sync shape — exactly why
the sync embed_text / embed_texts slipped past review.

Fix: two new tests (test_llm_wrapper_async_signature_matches_base
and test_embeddings_wrapper_async_signature_matches_base) iterate
every public method on the real ragas base classes and assert
inspect.iscoroutinefunction matches on both sides. A future release
that adds a method in one shape or the other will fail in CI.
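The parity check itself is small. A stand-in sketch (Base and SyncDriftWrapper are illustrative; the real tests iterate the actual ragas base classes):

```python
import inspect


def assert_async_parity(base_cls, wrapper_cls):
    """For every shared public method, require matching async/sync shape."""
    for name in dir(base_cls):
        if name.startswith("_"):
            continue
        base_fn = getattr(base_cls, name)
        wrapper_fn = getattr(wrapper_cls, name, None)
        if not callable(base_fn) or wrapper_fn is None:
            continue
        assert inspect.iscoroutinefunction(base_fn) == inspect.iscoroutinefunction(wrapper_fn), (
            f"{name}: async/sync shape differs between "
            f"{base_cls.__name__} and {wrapper_cls.__name__}"
        )


class Base:
    async def embed_text(self, text): ...


class SyncDriftWrapper:
    def embed_text(self, text): ...   # the exact drift that slipped past review


try:
    assert_async_parity(Base, SyncDriftWrapper)
    raised = False
except AssertionError:
    raised = True
assert raised
```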

### generate() default temperature: None -> 0.01

BaseRagasLLM.generate's default is 0.01. Our shim had None, which
is functionally equivalent (None triggers get_temperature()) but
diverges from the base signature. Changed to 0.01 to match.

Tests that exercise the None fallback still pass None explicitly.

### Tightened existing temperature-fallback test

The original test asserted "doesn't raise" but would have passed
even if get_temperature() was never called. Now patches
llm.get_temperature with wraps= and asserts assert_called_once_with(1).
Added inverse test that patches get_temperature and asserts it was
NOT called when an explicit temperature is passed — guards against
over-eager fallback firing.

### Narrowed e2e smoke test exception handler

Original except Exception accepted any non-AttributeError as a pass.
A bad shim raising TypeError or a nested-loop bug raising RuntimeError
would have counted as success. Tightened to allow only
RagasOutputParserException (expected downstream parser failure from
the stub judge's canned JSON); everything else propagates.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
packages/evaluator/tests/test_ragas_adapters.py (1)

96-101: ⚠️ Potential issue | 🟡 Minor

Assert the resolved temperature is forwarded into agenerate_text.

This still passes if generate() calls get_temperature() and then ignores its return value before invoking agenerate_text(). Spy on agenerate_text and assert the captured temperature matches the patched fallback.

Suggested tightening
     judge = FakeJudge(response="ok")
     llm = RagForgeRagasLLM(judge=judge, refusal_aware=False)
-    with patch.object(
-        llm, "get_temperature", wraps=llm.get_temperature
-    ) as get_temp:
+    captured: dict[str, float | None] = {}
+    original = llm.agenerate_text
+
+    async def spy(*args, **kwargs):
+        captured["temperature"] = kwargs["temperature"]
+        return await original(*args, **kwargs)
+
+    with patch.object(llm, "get_temperature", return_value=0.42) as get_temp, patch.object(
+        llm, "agenerate_text", side_effect=spy
+    ):
         result = asyncio.run(llm.generate("prompt", n=1, temperature=None))
     assert _llm_result_text(result) == "ok"
     get_temp.assert_called_once_with(1)
+    assert captured["temperature"] == 0.42
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/tests/test_ragas_adapters.py` around lines 96 - 101, The
test currently only verifies get_temperature was called but not that its
resolved value is passed to the async implementation; update the test to spy on
the internal agenerate_text call (e.g., patch or mock llm.agenerate_text) and
capture the temperature argument used by agenerate_text when
llm.generate("prompt", n=1, temperature=None) is invoked; assert that the
captured temperature equals the patched fallback value (1) in addition to
keeping the existing get_temperature.assert_called_once_with(1) and result text
assertion so you validate the resolved temperature is forwarded into
agenerate_text.
packages/evaluator/tests/test_ragas_adapters_contract.py (1)

46-52: ⚠️ Potential issue | 🟠 Major

Compare signatures, not just method names.

These tests still pass if a wrapper keeps the right method names but drops callbacks, changes defaults, or replaces a positional signature with **kwargs. That drift only shows up later as TypeError inside ragas. Add an inspect.signature(...) comparison after stripping self.

#!/bin/bash
python - <<'PY'
import inspect

from rag_forge_evaluator.engines.ragas_adapters import (
    RagForgeRagasEmbeddings,
    RagForgeRagasLLM,
)

from ragas.embeddings.base import BaseRagasEmbeddings
from ragas.llms.base import BaseRagasLLM

def public_methods(cls):
    return sorted(
        name for name in dir(cls)
        if not name.startswith("_") and callable(getattr(cls, name))
    )

def normalized_signature(cls, name):
    sig = inspect.signature(getattr(cls, name))
    params = [
        (p.name, p.kind, p.default)
        for p in sig.parameters.values()
        if p.name != "self"
    ]
    return params

for base_cls, wrapper_cls in (
    (BaseRagasLLM, RagForgeRagasLLM),
    (BaseRagasEmbeddings, RagForgeRagasEmbeddings),
):
    print(f"\n{base_cls.__name__} vs {wrapper_cls.__name__}")
    for name in public_methods(base_cls):
        if not hasattr(wrapper_cls, name):
            print("MISSING", name)
            continue
        base_sig = inspect.signature(getattr(base_cls, name))
        wrapper_sig = inspect.signature(getattr(wrapper_cls, name))
        if normalized_signature(base_cls, name) != normalized_signature(wrapper_cls, name):
            print("DIFF", name, base_sig, wrapper_sig)
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/tests/test_ragas_adapters_contract.py` around lines 46 -
52, The test helper _public_methods should be expanded to assert that wrappers
preserve method signatures, not just names: for each public method on the base
classes (BaseRagasLLM, BaseRagasEmbeddings) that exists on the wrapper classes
(RagForgeRagasLLM, RagForgeRagasEmbeddings), compute inspect.signature(...) for
both, normalize by removing the 'self' parameter and comparing (name, kind,
default) tuples for each parameter, and fail the test if any signature differs
(covers changed defaults, added **kwargs, positional->kw-only drift, missing
callbacks, etc.).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py`:
- Around line 131-148: The current _fuse_llm_results function silently returns
results[0] on any fusion error, collapsing n>1 samples into one; change this so
we do NOT degrade silently: on AttributeError or IndexError (malformed result
shapes) raise a ValueError with a concise diagnostic (include len(results) and
the problematic result type/repr) and do not return a single sample, and on
ImportError (missing langchain) re-raise or allow the ImportError to bubble so
the soft-dependency path is explicit rather than silently returning results[0];
update the except block accordingly and remove the fallback return results[0]
behavior.

In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 128-150: The CountingJudge's judge method mutates shared state
counter["i"] without synchronization causing race conditions when llm.generate
fans out to threads; make the counter increment atomic by adding a lock (e.g.,
threading.Lock) as a member of CountingJudge and acquiring it around
counter["i"] += 1, or replace the shared dict with a thread-safe atomic counter
(e.g., itertools.count or threading-safe integer wrapper) and update
CountingJudge.judge to use that safe increment so tests for n>1 become
deterministic.

---

Duplicate comments:
In `@packages/evaluator/tests/test_ragas_adapters_contract.py`:
- Around line 46-52: The test helper _public_methods should be expanded to
assert that wrappers preserve method signatures, not just names: for each public
method on the base classes (BaseRagasLLM, BaseRagasEmbeddings) that exists on
the wrapper classes (RagForgeRagasLLM, RagForgeRagasEmbeddings), compute
inspect.signature(...) for both, normalize by removing the 'self' parameter and
comparing (name, kind, default) tuples for each parameter, and fail the test if
any signature differs (covers changed defaults, added **kwargs,
positional->kw-only drift, missing callbacks, etc.).

In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 96-101: The test currently only verifies get_temperature was
called but not that its resolved value is passed to the async implementation;
update the test to spy on the internal agenerate_text call (e.g., patch or mock
llm.agenerate_text) and capture the temperature argument used by agenerate_text
when llm.generate("prompt", n=1, temperature=None) is invoked; assert that the
captured temperature equals the patched fallback value (1) in addition to
keeping the existing get_temperature.assert_called_once_with(1) and result text
assertion so you validate the resolved temperature is forwarded into
agenerate_text.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: aee68e59-0676-4965-af91-a54a52be113b

📥 Commits

Reviewing files that changed from the base of the PR and between 91d10ca and 0724d99.

📒 Files selected for processing (4)
  • packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
  • packages/evaluator/tests/test_ragas_adapters.py
  • packages/evaluator/tests/test_ragas_adapters_contract.py
  • packages/evaluator/tests/test_ragas_adapters_e2e.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: RAG Quality Gate
  • GitHub Check: Lint, Typecheck & Test

Comment on lines +131 to +148
    If fusion fails (mixed result shapes, missing langchain, etc.) we
    fall back to returning the first result — ragas will then see a
    single generation, which is strictly worse than N but still
    correct shape. The ``n>1`` path is rare in stock ragas metrics so
    conservative fallback is preferable to crashing.
    """
    if not results:
        msg = "_fuse_llm_results called with empty results list"
        raise ValueError(msg)
    if len(results) == 1:
        return results[0]
    try:
        from langchain_core.outputs import LLMResult as _LLMResult

        fused_generations = [gen for r in results for gen in r.generations[0]]
        return _LLMResult(generations=[fused_generations])
    except (ImportError, AttributeError, IndexError):
        return results[0]
⚠️ Potential issue | 🟠 Major

Don't silently collapse n>1 into a single sample on fuse failure.

Falling back to results[0] here turns a requested multi-sample generation into a single-sample result with no signal. That both skews downstream ragas scoring and breaks the advertised soft-dependency path because _StringLLMResult never actually gets fused. Handle the stub case explicitly, and raise on malformed result shapes instead of degrading silently.

Proposed fix
     try:
         from langchain_core.outputs import LLMResult as _LLMResult
 
         fused_generations = [gen for r in results for gen in r.generations[0]]
         return _LLMResult(generations=[fused_generations])
-    except (ImportError, AttributeError, IndexError):
-        return results[0]
+    except ImportError:
+        fused = _StringLLMResult(results[0].generations[0][0].text)
+        fused.generations = [[gen for r in results for gen in r.generations[0]]]
+        return fused
+    except (AttributeError, IndexError, TypeError) as exc:
+        msg = "Cannot fuse malformed LLMResult payloads"
+        raise ValueError(msg) from exc
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py` around
lines 131 - 148, The current _fuse_llm_results function silently returns
results[0] on any fusion error, collapsing n>1 samples into one; change this so
we do NOT degrade silently: on AttributeError or IndexError (malformed result
shapes) raise a ValueError with a concise diagnostic (include len(results) and
the problematic result type/repr) and do not return a single sample, and on
ImportError (missing langchain) re-raise or allow the ImportError to bubble so
the soft-dependency path is explicit rather than silently returning results[0];
update the except block accordingly and remove the fallback return results[0]
behavior.

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 163-186: Add a test case to cover the scenario where each result
has an empty inner generations list so _fuse_llm_results raises a ValueError for
malformed input; specifically, in the test function
test_fuse_llm_results_raises_on_malformed_input add a class (e.g., EmptyInner)
whose self.generations is [[]] and assert that calling
_fuse_llm_results([EmptyInner(), EmptyInner()]) raises ValueError with the
"malformed result" message, referencing the existing _fuse_llm_results import to
locate the code under test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 26dab04a-f0e5-4552-aac3-edde6d722f3d

📥 Commits

Reviewing files that changed from the base of the PR and between 0724d99 and 863394b.

📒 Files selected for processing (3)
  • packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
  • packages/evaluator/tests/test_ragas_adapters.py
  • packages/evaluator/tests/test_ragas_adapters_contract.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: RAG Quality Gate
  • GitHub Check: Lint, Typecheck & Test
🔇 Additional comments (15)
packages/evaluator/tests/test_ragas_adapters.py (4)

15-26: LGTM!

The helper correctly normalizes assertions across environments by handling both raw strings and LLMResult-like objects. The lack of defensive checks for empty generations is acceptable in test code — tests should fail loudly on malformed responses.


83-101: LGTM!

The test now correctly verifies the fallback path by patching get_temperature with wraps and asserting it was called with the expected argument. This addresses the prior feedback about asserting the actual behavior rather than just absence of exceptions.


117-160: LGTM!

The test correctly validates the n>1 generation shape and uses thread-safe counting as recommended in prior review. The assertions cover the exact [[gen1, gen2, gen3]] shape ragas expects.


189-233: LGTM!

The tests thoroughly validate the async contract for embed_text and embed_texts, including the regression test that exercises calling from within a running event loop — exactly the scenario that crashed with the original asyncio.run() implementation.

packages/evaluator/tests/test_ragas_adapters_contract.py (5)

46-52: LGTM!

The helper correctly enumerates public callables including inherited methods, which is the right behavior for contract testing against base classes.


149-172: LGTM!

The helper correctly extracts named parameters while excluding self/cls and variadic catchalls. This enables the parameter-parity tests to detect missing keyword arguments that would cause TypeError at runtime.


101-146: LGTM!

The async/sync parity tests correctly use inspect.iscoroutinefunction to validate that wrapper methods match the base class's async nature. This would have caught the original embed_text sync implementation that crashed in async contexts.


175-215: LGTM!

The parameter-name coverage tests address the prior feedback about signature parity. They correctly check that the wrapper accepts all parameters the base class declares, preventing TypeError at runtime when ragas passes keyword arguments.


218-254: LGTM!

The instance-level tests comprehensively validate the specific shim behaviors: async generate, is_finished default, temperature convention matching ragas, and set_run_config persistence. Good coverage of the new v0.2.2 contract surface.

packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py (6)

119-170: LGTM!

The implementation correctly addresses prior feedback by raising ValueError with descriptive messages on malformed input instead of silently falling back to results[0]. The error message includes the types of all result objects, aiding debugging.


267-341: LGTM!

The generate implementation correctly handles both n=1 and n>1 cases, resolves temperature via get_temperature(n) when None, and uses asyncio.gather for parallel fan-out. The defensive is_finished check and documented rationale for skipping add_async_retry show thoughtful design.


393-408: LGTM!

The factory method correctly constructs a fused _StringLLMResult stub for the no-langchain path, maintaining the [[gen1..genN]] shape and preserving str() / __eq__ usability via _text.


544-571: LGTM!

The methods are correctly implemented as async def coroutines, addressing the prior feedback about asyncio.run() crashing inside running event loops. The ignored is_async parameter is well-documented — both code paths converge on asyncio.to_thread anyway.


343-377: LGTM!

The helper methods correctly implement the ragas contract: is_finished conservatively returns True (documented limitation with clear upgrade path), get_temperature matches ragas's convention, and set_run_config stores the value for contract compatibility.


10-58: LGTM!

Excellent documentation of the ragas 0.4.x contract surface and the v0.2.0 → v0.2.2 evolution. The explicit listing of required methods and the rationale for duck-typing will help future maintainers understand the design constraints.

Comment on lines +163 to +186
def test_fuse_llm_results_raises_on_malformed_input():
    """``_fuse_llm_results`` must fail loud on malformed result shapes
    instead of silently collapsing to ``results[0]``. CodeRabbit on
    PR #36 round 2 pointed out that a silent fallback hides real
    ``n > 1`` correctness bugs — a single returned sample skews
    downstream ragas metrics with no signal.
    """
    from rag_forge_evaluator.engines.ragas_adapters import _fuse_llm_results

    class NotAnLLMResult:
        pass

    with pytest.raises(ValueError, match="malformed result"):
        _fuse_llm_results([NotAnLLMResult(), NotAnLLMResult()])

    class EmptyOuter:
        def __init__(self) -> None:
            self.generations: list[list[object]] = []

    with pytest.raises(ValueError, match="malformed result"):
        _fuse_llm_results([EmptyOuter(), EmptyOuter()])

    with pytest.raises(ValueError, match="empty results list"):
        _fuse_llm_results([])
🧹 Nitpick | 🔵 Trivial

Consider adding a test for empty inner generations.

The test covers missing .generations, empty outer list, and empty results list. However, _fuse_llm_results also catches IndexError when r.generations[0] exists but is empty. Adding this case would complete coverage:

class EmptyInner:
    def __init__(self) -> None:
        self.generations: list[list[object]] = [[]]

with pytest.raises(ValueError, match="malformed result"):
    _fuse_llm_results([EmptyInner(), EmptyInner()])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/tests/test_ragas_adapters.py` around lines 163 - 186, Add
a test case to cover the scenario where each result has an empty inner
generations list so _fuse_llm_results raises a ValueError for malformed input;
specifically, in the test function
test_fuse_llm_results_raises_on_malformed_input add a class (e.g., EmptyInner)
whose self.generations is [[]] and assert that calling
_fuse_llm_results([EmptyInner(), EmptyInner()]) raises ValueError with the
"malformed result" message, referencing the existing _fuse_llm_results import to
locate the code under test.

@hallengray hallengray merged commit 73d8c5e into main Apr 15, 2026
3 checks passed
@hallengray hallengray deleted the fix/v0.2.2-g1-ragas-adapter-generate branch April 15, 2026 22:52
hallengray added a commit that referenced this pull request Apr 15, 2026
…3) (#38)

* fix(evaluator): add missing ragas BaseRagasLLM methods (G1)

RAG-Forge v0.2.1's RagForgeRagasLLM wrapper was shipped without the
concrete .generate() method that ragas 0.4.x's BaseRagasLLM exposes.
Every RAGAS metric job crashed on first contact with
  AttributeError: 'RagForgeRagasLLM' object has no attribute 'generate'
during the PearMedica Cycle 3 audit (2026-04-15), producing a
Scored: 0, Overall: 0.0000 report across all 48 metric evaluations.

Root cause: the adapter's design doc asserted "ragas only calls
generate_text / agenerate_text / model_name" and deliberately avoided
subclassing BaseRagasLLM to keep ragas as a soft dependency. That
assumption was wrong — BaseRagasLLM.generate is a concrete async helper
that ragas's metric code invokes on every LLM regardless of subclass
status. The duck-typed wrapper must re-declare it.

Fix preserves the no-hard-import design by adding duck-typed shims for
every public method on BaseRagasLLM and BaseRagasEmbeddings:

  RagForgeRagasLLM:
    - async generate()        — the specific shim Cycle 3 caught missing
    - is_finished()           — was abstract on base, returns True
    - get_temperature(n)      — matches base convention (0.01 / 0.3)
    - set_run_config()        — stores ragas RunConfig for compatibility
    - run_config attribute    — defaults to None

  RagForgeRagasEmbeddings:
    - embed_text(is_async)    — dispatch helper on base class
    - embed_texts(is_async)   — batch variant
    - set_run_config()
    - run_config attribute

Tests: a contract test iterates every public method on the real ragas
base classes and asserts our wrappers declare a callable of the same
name, so the next ragas release that adds a method fails in CI not in
a user audit. An end-to-end smoke test runs ragas.evaluate() against
our wrappers on a 1-sample dataset and asserts it never raises an
AttributeError naming our wrapper classes — the exact regression
signature from Cycle 3.
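The contract-test mechanism can be sketched like this. `BaseLLM`, `Wrapper`, and `IncompleteWrapper` are illustrative stand-ins, not the real ragas base class or the repo's wrappers:

```python
# Stand-ins for ragas's BaseRagasLLM and the duck-typed wrappers.
class BaseLLM:
    def generate_text(self, prompt): ...
    async def agenerate_text(self, prompt): ...
    async def generate(self, prompt): ...

class Wrapper:
    def generate_text(self, prompt): ...
    async def agenerate_text(self, prompt): ...
    async def generate(self, prompt): ...

class IncompleteWrapper:
    def generate_text(self, prompt): ...
    async def agenerate_text(self, prompt): ...
    # no generate() — the v0.2.1 bug shape

def public_callables(cls):
    # enumerate public callables, including inherited ones
    return {
        name
        for name in dir(cls)
        if not name.startswith("_") and callable(getattr(cls, name))
    }

assert public_callables(BaseLLM) - public_callables(Wrapper) == set()
assert public_callables(BaseLLM) - public_callables(IncompleteWrapper) == {"generate"}
```

Because the set difference is computed against the *real* base class at test time, any method a future ragas release adds shows up as a missing name in CI.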

Three pre-existing test failures in test_cycle2_regression.py and
test_evaluator_factory.py are unrelated to this change (pre-existing
in v0.2.1 when ragas is installed; they assert on compound conditions
that break when MockJudge triggers the skip path). They belong in
workstream G3 (skip counter plumbing) and the G1 PR leaves them alone.

* fix(evaluator): address CodeRabbit review on G1 adapter shims

Five fixes from CodeRabbit review on PR #36. All were real issues the
original contract test missed because it only checked method names,
not signatures.

### generate(n > 1) now produces the correct LLMResult shape

ragas uses n > 1 for multi-sample metrics (answer_correctness
consistency checks, etc.) and expects LLMResult.generations shaped
[[gen1, gen2, ..., genN]] — one prompt run with N candidate
generations. The original shim ignored n and returned a single
generation, silently breaking any metric that relied on sample
diversity.

Fix: when n > 1, fan out n independent agenerate_text calls via
asyncio.gather and fuse the per-call results into a single
[[gen1..genN]]-shaped LLMResult via the new _fuse_llm_results helper.
n == 1 (the common case) stays a single call.

New test: test_wrapper_async_generate_n_greater_than_one_produces_
fused_shape uses a counting judge to verify all N calls fire and the
fused result carries all N distinct outputs.
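The fan-out shape can be sketched as below; `FanOutLLM` is a simplified stand-in that returns bare lists rather than a real `LLMResult`:

```python
import asyncio

class FanOutLLM:
    """Sketch of the n>1 path: fan out n independent agenerate_text calls
    and fuse them into a single [[gen1..genN]]-shaped result."""

    def __init__(self):
        self._i = 0

    async def agenerate_text(self, prompt, temperature=0.01):
        self._i += 1  # single event-loop thread here, so no race
        return f"sample-{self._i}"

    async def generate(self, prompt, n=1, temperature=0.01):
        if n == 1:  # the common case stays a single call
            return [[await self.agenerate_text(prompt, temperature)]]
        results = await asyncio.gather(
            *(self.agenerate_text(prompt, temperature) for _ in range(n))
        )
        return [list(results)]  # one prompt run, N candidate generations

out = asyncio.run(FanOutLLM().generate("q", n=3))
assert len(out) == 1 and len(out[0]) == 3  # [[gen1, gen2, gen3]]
```

The outer list has length 1 (one prompt) while the inner list carries all N candidates, which is the shape multi-sample metrics consume.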

### embed_text / embed_texts now async (crash fix)

Critical bug: the original shims were sync methods wrapping
asyncio.run(self.aembed_query(text)). ragas's BaseRagasEmbeddings
declares embed_text and embed_texts as async coroutines and metric
code invokes them with await embeddings.embed_text(...) from inside
ragas's evaluation event loop. asyncio.run() inside a running loop
crashes with RuntimeError: asyncio.run() cannot be called from a
running event loop — a real live-fire crash that would have taken
down any Cycle 4 run.

Fix: make both methods async. They now just await the existing
aembed_query / aembed_documents paths. The is_async parameter is
accepted for signature parity with the base class but ignored —
our underlying clients are synchronous and aembed_query already
runs them in a worker thread via asyncio.to_thread, so both flag
values land on the same code path.

New tests:
- test_embeddings_embed_text_is_async_and_awaitable: asserts both
  methods are inspect.iscoroutinefunction and round-trips through
  asyncio.run(embed.embed_text(...)).
- test_embeddings_embed_text_callable_from_running_event_loop:
  exercises the exact asyncio.run-inside-loop path ragas creates.
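The shape of the fix can be sketched as follows, with a toy sync client standing in for the real embedding backend:

```python
import asyncio

class Embeddings:
    """Sketch: sync client pushed to a worker thread; embed_text is a true
    coroutine so it can be awaited from inside a running event loop."""

    def _embed_sync(self, text):
        return [float(len(text))]  # stand-in for the real sync client

    async def aembed_query(self, text):
        return await asyncio.to_thread(self._embed_sync, text)

    async def embed_text(self, text, is_async=True):
        # is_async accepted for signature parity, ignored: both flag values
        # converge on the asyncio.to_thread path anyway
        return await self.aembed_query(text)

async def inside_running_loop():
    # the call shape ragas makes; asyncio.run() here would raise
    # RuntimeError, a plain await does not
    return await Embeddings().embed_text("hello")

vec = asyncio.run(inside_running_loop())
assert vec == [5.0]
```

Making the shim `async def` and awaiting the existing path removes the nested `asyncio.run()` entirely, so there is no loop-inside-loop to crash.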

### Contract test now checks async/sync parity, not just names

CodeRabbit correctly pointed out that the original contract test
compared method names but not their async/sync shape — exactly why
the sync embed_text / embed_texts slipped past review.

Fix: two new tests (test_llm_wrapper_async_signature_matches_base
and test_embeddings_wrapper_async_signature_matches_base) iterate
every public method on the real ragas base classes and assert
inspect.iscoroutinefunction matches on both sides. A future release
that adds a method in one shape or the other will fail in CI.
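The async/sync-parity check reduces to comparing `inspect.iscoroutinefunction` on both sides; the classes below are illustrative stand-ins:

```python
import inspect

class Base:
    def generate_text(self, p): ...
    async def agenerate_text(self, p): ...

class GoodWrapper:
    def generate_text(self, p): ...
    async def agenerate_text(self, p): ...

class BadWrapper:
    def generate_text(self, p): ...
    def agenerate_text(self, p): ...  # sync where the base is async

def async_parity_mismatches(base, wrapper):
    # public callables whose coroutine-ness differs between base and wrapper
    return [
        name
        for name in dir(base)
        if not name.startswith("_")
        and callable(getattr(base, name))
        and inspect.iscoroutinefunction(getattr(base, name))
        != inspect.iscoroutinefunction(getattr(wrapper, name, None))
    ]

assert async_parity_mismatches(Base, GoodWrapper) == []
assert async_parity_mismatches(Base, BadWrapper) == ["agenerate_text"]
```

This is exactly the check that would have flagged the sync `embed_text` / `embed_texts` before they shipped.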

### generate() default temperature: None -> 0.01

BaseRagasLLM.generate's default is 0.01. Our shim had None, which
is functionally equivalent (None triggers get_temperature()) but
diverges from the base signature. Changed to 0.01 to match.

Tests that exercise the None fallback still pass None explicitly.

### Tightened existing temperature-fallback test

The original test asserted "doesn't raise" but would have passed
even if get_temperature() was never called. Now patches
llm.get_temperature with wraps= and asserts assert_called_once_with(1).
Added inverse test that patches get_temperature and asserts it was
NOT called when an explicit temperature is passed — guards against
over-eager fallback firing.

### Narrowed e2e smoke test exception handler

Original except Exception accepted any non-AttributeError as a pass.
A bad shim raising TypeError or a nested-loop bug raising RuntimeError
would have counted as success. Tightened to allow only
RagasOutputParserException (expected downstream parser failure from
the stub judge's canned JSON); everything else propagates.

* fix(evaluator): fuse _StringLLMResult stubs in n>1 path for no-ragas CI

The n>1 fan-out fix in the previous commit built a fused LLMResult
via langchain_core.outputs.LLMResult. On CI without the [ragas]
extra installed, langchain isn't available and generate_text falls
back to the _StringLLMResult stub — which _fuse_llm_results had no
fuse path for, so it hit the ImportError fallback and returned
results[0] (a single generation). The n>1 test then asserted
inner length == 3 and failed with 1.

Fix: separate the flatten step (build the fused_generations list)
from the wrap step (pick LLMResult or _StringLLMResult). The
flatten works on either shape because both expose the same
.generations[0] interface. Wrap picks langchain if available,
otherwise constructs via a new _StringLLMResult._from_generations
alt constructor that carries a pre-fused list.

Local verification:
  - without [ragas]: 18/18 adapter tests pass (contract + e2e skip)
  - with [ragas]: 28/28 adapter tests pass

* fix(evaluator): address CodeRabbit round-2 findings on G1

Three more findings from the second CodeRabbit review on PR #36 / #38.

### _fuse_llm_results now fails loud on malformed input

The previous commit's AttributeError/IndexError fallback silently
returned results[0] — turning a requested n>1 generation into a
single-sample result with no signal. Downstream ragas metrics that
rely on sample diversity would consume the degraded result as if it
were a valid N-sample fuse. CodeRabbit rightly flagged this as
hiding real correctness bugs.

Fix: raise ValueError with the observed result types and the
underlying exception chained. Added a targeted test
(test_fuse_llm_results_raises_on_malformed_input) covering three
malformed-input shapes: objects without .generations, empty outer
lists, and empty input lists.

### CountingJudge in n>1 test is now thread-safe

RagForgeRagasLLM.generate fans out via asyncio.gather, and each
agenerate_text call runs in a worker thread via asyncio.to_thread.
counter["i"] += 1 is not atomic across threads — the read-modify-
write race would produce duplicate sample labels or an undercounted
total, flaking the test. Added a threading.Lock around the
increment. Test now deterministic under concurrent fan-out.
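A sketch of the thread-safe counting judge (names are illustrative; the real `CountingJudge` lives in the test file):

```python
import asyncio
import threading

class CountingJudge:
    """Counter incremented from asyncio.to_thread workers; the lock makes
    the read-modify-write atomic across threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0

    def complete(self, prompt):
        with self._lock:  # without this, concurrent += can drop updates
            self.count += 1
            return f"sample-{self.count}"

async def fan_out(judge, n):
    # mirrors generate()'s asyncio.gather over to_thread workers
    return await asyncio.gather(
        *(asyncio.to_thread(judge.complete, "p") for _ in range(n))
    )

judge = CountingJudge()
samples = asyncio.run(fan_out(judge, 3))
assert judge.count == 3
assert sorted(samples) == ["sample-1", "sample-2", "sample-3"]
```

Because the label is derived inside the locked section, every worker gets a distinct sample number even under true thread concurrency.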

### Contract tests now check parameter parity, not just name + async

Previously we asserted that each public method on BaseRagasLLM /
BaseRagasEmbeddings exists on our wrapper and matches async/sync.
That missed parameter drift — a future ragas release that adds
max_tokens to generate_text would silently break our wrapper until
a user audit caught it.

Fix: two new tests (test_llm_wrapper_parameter_names_cover_base_class
and test_embeddings_wrapper_parameter_names_cover_base_class) use
inspect.signature to enumerate the named parameters on each base-
class method and assert the wrapper accepts every name. Variadic
*args / **kwargs are ignored (they can absorb any kwarg by
definition). A helper _required_param_names() shares the logic.

Local: 30 passed on adapter + contract + e2e suites with ragas
installed, 18 passed on adapter suite alone without the [ragas]
extra. Ruff + mypy clean.

* fix(evaluator): route RAGAS exceptions into Skipped counter (G3)

Cycle 3's PearMedica audit (2026-04-15) documented two residual skip-
handling gaps in v0.2.1, separate from the C3-2 generate() AttributeError:

1. EvaluationResult.skipped_evaluations (the integer counter the report's
   TL;DR "Skipped: N" line reads) was never set by RagasEvaluator.
   skipped_samples held real SkipRecords, but the counter stayed at 0.
   Users reading the top-level summary saw "Scored: 0, Skipped: 0" and
   thought nothing had happened when in reality every job had crashed.

2. Individual-metric extraction failures created one SkipRecord per
   metric name with sample_id="<aggregate>". A 12-sample x 4-metric run
   that failed entirely produced 4 records instead of 48 — the blast
   radius was under-reported by 12x.

Fix:
- Extract skip-record creation into _fan_out_skip_records() — fans a
  single coarse failure out into one record per (sample, metric) pair
  with the real sample_id attached.
- Both the whole-batch crash path and the per-metric extraction failure
  path now use it.
- Set EvaluationResult.skipped_evaluations = len(skipped_samples) on
  both return sites so the counter and the detail list can never drift
  apart silently again.
- Truncate reason strings to 400 chars (with trailing ellipsis) so long
  Python tracebacks don't blow up HTML/PDF rendering downstream.

Tests:
- test_ragas_skip_counter.py — three new tests covering:
  (a) skipped_evaluations counter equals len(skipped_samples),
  (b) whole-batch crashes fan out to sample_count * metric_count records
      with real sample_ids (not "<aggregate>"),
  (c) reason truncation at 400 chars.
- test_cycle2_regression.py — updated two assertions that pre-dated the
  fan-out change. The original assertions demanded "metrics must be
  populated"; with fan-out, a MockJudge run legitimately produces zero
  scored metrics and a full skip list. Tests now assert the absence of
  Finding #4 (embed_query AttributeError) and Finding #5
  (InstructorRetryException / max_tokens truncation) signatures in the
  skip records instead, which is the actual regression guard.
- test_evaluator_factory.py — gate test_ragas_not_installed_raises_on_evaluate
  with a reverse importorskip so it runs only on systems without the
  [ragas] extra. CI matrices with ragas installed now skip it instead
  of failing.

Pre-existing mypy arg-type errors on the ragas_evaluate() kwargs silenced
with targeted # type: ignore comments and a docstring note pointing
readers at the adapter contract test as the real interface guard — duck
typing is the deliberate design, not an oversight.

* fix(evaluator): tolerate cross-environment mypy on ragas_evaluate ignores

The # type: ignore[arg-type] comments on the ragas_evaluate() kwargs
only fire when mypy can see ragas's real type signatures — i.e. when
the [ragas] extra is installed (local dev). On CI without the extra,
mypy falls back to Any and flags the ignores as [unused-ignore].

Adding unused-ignore to the ignore codes tells mypy to tolerate the
comment when it has nothing to suppress. Both environments are now
clean: local mypy with the extra still sees the arg-type mismatch and
honours the suppression; CI mypy without the extra silently accepts
the comment as dead.
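A toy illustration of the two-environment ignore (the function and value are made up; only the ignore-comment shape mirrors the commit):

```python
def takes_int(x: int) -> int:
    return x

# With real type stubs visible (local dev), mypy sees an arg-type error and
# the suppression fires; without them (CI), the argument is Any and a bare
# ignore[arg-type] would be flagged as unused — listing unused-ignore too
# lets both environments pass. Harmless at runtime.
value = takes_int("not-an-int")  # type: ignore[arg-type, unused-ignore]
```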

* fix(evaluator): remove dead real_import code in G3 skip counter test

CodeRabbit on PR #38 spotted the leftover
  real_import = mod.__dict__.copy()   # line 96
  _ = real_import                      # line 111
in test_whole_batch_crash_fans_out_to_every_sample_metric_pair. The
dict was copied but never used — an artefact of an earlier
monkeypatching approach that got simplified but not cleaned up.
Removing it and the unnecessary module-level import that fed it.

No behavioural change; test still passes.
hallengray added a commit that referenced this pull request Apr 15, 2026
…ages (#41)

Bumps 0.2.1 -> 0.2.2 on every package that ships to a registry, in
lockstep. The v0.2.1 G5 drift guard enforces the three Python
__version__ constants match their pyproject.toml siblings, and the
publish.yml verify step refuses to upload if any of the six drifts
from the release tag.

Python (pyproject.toml + __init__.py __version__, both sides):
  - rag-forge-core          0.2.1 -> 0.2.2
  - rag-forge-evaluator     0.2.1 -> 0.2.2
  - rag-forge-observability 0.2.1 -> 0.2.2

npm (package.json "version"):
  - @rag-forge/cli          0.2.1 -> 0.2.2
  - @rag-forge/mcp          0.2.1 -> 0.2.2
  - @rag-forge/shared       0.2.1 -> 0.2.2

CHANGELOG: dated the v0.2.2 entry 2026-04-15 (removes "unreleased").

The five v0.2.2 workstreams (G1-G5) already merged to main — see
PRs #36-#40. This PR is the final step before cutting the GitHub
Release that triggers publish.yml.

Once merged, the release workflow:
  1. twine check + check-wheel-contents on every Python wheel
  2. G1 contract tests (iterate real ragas base classes)
  3. G5 __version__ drift guard (auto-discovers packages)
  4. OIDC push to PyPI
  5. NPM_TOKEN push to npm (pnpm publish --no-git-checks)