
fix(evaluator): add missing ragas BaseRagasLLM methods (v0.2.2 G1) #36

Merged

hallengray merged 4 commits into main from fix/v0.2.2-g1-ragas-adapter-generate on Apr 15, 2026

Conversation

@hallengray
Owner

Summary

  • Re-declares every public method on ragas.llms.base.BaseRagasLLM and ragas.embeddings.base.BaseRagasEmbeddings as duck-typed shims on RagForgeRagasLLM / RagForgeRagasEmbeddings
  • Fixes the Cycle 3 C3-2 regression: v0.2.1 crashed every RAGAS metric job in ~6s with AttributeError: 'RagForgeRagasLLM' object has no attribute 'generate'
  • Adds a contract test that iterates every public method on the real ragas base classes and asserts the wrappers declare a callable of the same name — so the next ragas release that grows a new method fails fast in CI, not in a user audit

Root cause

The adapter's module docstring claimed "ragas only calls generate_text / agenerate_text / model_name" and deliberately avoided subclassing BaseRagasLLM to keep ragas a soft dependency. That assumption was wrong — BaseRagasLLM.generate is a concrete async helper that ragas's metric code invokes on every LLM, regardless of subclass status. Because our duck-typed class doesn't inherit from BaseRagasLLM, it doesn't inherit .generate(), and RAGAS's first metric call crashes.

The fix preserves the no-hard-import design. We implement our own .generate() / is_finished() / get_temperature() / set_run_config() shims that forward to the existing generate_text / agenerate_text. When the contract test was added it caught two additional missing methods on the embeddings wrapper (embed_text, embed_texts) — exactly the kind of drift the test exists to prevent.

What v0.2.2 adds on RagForgeRagasLLM

  • async generate(prompt, n, temperature, stop, callbacks) -> LLMResult — the specific shim Cycle 3 caught missing
  • is_finished(response) -> bool — returns True; Judge protocol has no finish signal
  • get_temperature(n) -> float — matches base convention (0.01 / 0.3)
  • set_run_config(run_config) — stores ragas RunConfig for contract compatibility
  • run_config: Any attribute defaulting to None

What v0.2.2 adds on RagForgeRagasEmbeddings

  • embed_text(text, is_async) + embed_texts(texts, is_async) — dispatch helpers the base class exposes
  • set_run_config(run_config) — contract parity with the LLM wrapper

Tests

  • tests/test_ragas_adapters_contract.py (new): the contract tripwire. Iterates every public method on BaseRagasLLM / BaseRagasEmbeddings and asserts our wrappers declare them. Also adds instance-level tests: generate is async, is_finished returns True, get_temperature(n) matches the base convention, and set_run_config stores the value.
  • tests/test_ragas_adapters_e2e.py (new): runs ragas.evaluate() with our wrappers on a 1-sample dataset and fails if ragas hits an AttributeError referencing our wrapper classes. This would have caught the C3-2 regression before v0.2.1 shipped.
  • tests/test_ragas_adapters.py — updated to extract text from real LLMResult objects (langchain now pulled in via the [ragas] extra), plus new tests for the async generate shim and the temperature=None default-resolution path.

Local run: 22 passed, 0 failed on the adapter test suite.

Test plan

  • pnpm run typecheck clean
  • pnpm run test:py green (filter to tests/test_ragas_adapters*.py if faster)
  • Contract test asserts every ragas base-class method (not just generate)
  • End-to-end smoke test runs ragas.evaluate() without hitting AttributeError
  • No change to the deliberate "don't hard-import ragas" design — the adapter module still imports cleanly without the [ragas] extra

Merge order

First of the v0.2.2 series. Plan is G1 → G3 → {G2, G4, G5} in parallel → release prep. G3's branch is already rebased on this one, so merging this unblocks G3 immediately.

Full release plan: docs/release-notes/v0.2.2-plan.md (lands with G2).

RAG-Forge v0.2.1's RagForgeRagasLLM wrapper was shipped without the
concrete .generate() method that ragas 0.4.x's BaseRagasLLM exposes.
Every RAGAS metric job crashed on first contact with
  AttributeError: 'RagForgeRagasLLM' object has no attribute 'generate'
during the PearMedica Cycle 3 audit (2026-04-15), producing a
Scored: 0, Overall: 0.0000 report across all 48 metric evaluations.

Root cause: the adapter's design doc asserted "ragas only calls
generate_text / agenerate_text / model_name" and deliberately avoided
subclassing BaseRagasLLM to keep ragas as a soft dependency. That
assumption was wrong — BaseRagasLLM.generate is a concrete async helper
that ragas's metric code invokes on every LLM regardless of subclass
status. The duck-typed wrapper must re-declare it.

Fix preserves the no-hard-import design by adding duck-typed shims for
every public method on BaseRagasLLM and BaseRagasEmbeddings:

  RagForgeRagasLLM:
    - async generate()        — the specific shim Cycle 3 caught missing
    - is_finished()           — was abstract on base, returns True
    - get_temperature(n)      — matches base convention (0.01 / 0.3)
    - set_run_config()        — stores ragas RunConfig for compatibility
    - run_config attribute    — defaults to None

  RagForgeRagasEmbeddings:
    - embed_text(is_async)    — dispatch helper on base class
    - embed_texts(is_async)   — batch variant
    - set_run_config()
    - run_config attribute

Tests: a contract test iterates every public method on the real ragas
base classes and asserts our wrappers declare a callable of the same
name, so the next ragas release that adds a method fails in CI not in
a user audit. An end-to-end smoke test runs ragas.evaluate() against
our wrappers on a 1-sample dataset and asserts it never raises an
AttributeError naming our wrapper classes — the exact regression
signature from Cycle 3.

Three pre-existing test failures in test_cycle2_regression.py and
test_evaluator_factory.py are unrelated to this change (pre-existing
in v0.2.1 when ragas is installed; they assert on compound conditions
that break when MockJudge triggers the skip path). They belong in
workstream G3 (skip counter plumbing) and the G1 PR leaves them alone.
@coderabbitai

coderabbitai bot commented Apr 15, 2026

Summary by CodeRabbit

  • New Features

    • Enhanced evaluation adapters with improved configuration management and state tracking.
    • Added parallel generation support and dynamic temperature adjustment for LLM evaluation.
    • Introduced async convenience methods for embeddings operations.
    • Improved result validation with detailed error handling.
  • Tests

    • Expanded test coverage including contract validation and end-to-end evaluation scenarios.

Walkthrough

Implements ragas v0.4.x public contract in RagForge adapters: adds async LLM generate, result fusion, is_finished, temperature helpers, set_run_config/run_config, async embedding conveniences, and accompanying unit, contract, and e2e tests to validate parity and behavior.

Changes

Cohort / File(s) Summary
LLM & Embeddings Adapter
packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
Adds run_config attributes; introduces _fuse_llm_results(results) to merge single-sample results; extends RagForgeRagasLLM with async generate(...), is_finished(...), get_temperature(...), set_run_config(...) and fans out agenerate_text(..., n=1) calls for n>1, then fuses and validates outputs. Extends RagForgeRagasEmbeddings with set_run_config(...) and async convenience embed_text/embed_texts delegating to async embedding methods.
Unit Tests
packages/evaluator/tests/test_ragas_adapters.py
Adds _llm_result_text helper; updates tests to use it; adds tests for async generate shim (prompt forwarding, temperature=None fallback to get_temperature, explicit temperature bypass, n>1 fused shape), _fuse_llm_results validation, and embedding coroutine behavior including safe use inside an active event loop.
Contract Parity Tests
packages/evaluator/tests/test_ragas_adapters_contract.py
New contract tests that import ragas bases (skip if absent) and assert wrappers expose the same public callable names, async/sync parity, and parameter-name coverage; includes instance checks for generate asyncness, is_finished default, get_temperature values, and set_run_config persistence.
End-to-end Smoke Test
packages/evaluator/tests/test_ragas_adapters_e2e.py
Adds conditional e2e test using ragas.evaluate() with deterministic judge stub and tiny dataset; skips if optional deps missing; surfaces AttributeErrors caused by missing wrapper methods and tolerates ragas output/parser exceptions where expected.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant RagForgeRagasLLM as LLM
  participant Judge

  Client->>LLM: await generate(prompt, n, temperature=None)
  LLM->>LLM: temperature = get_temperature(n)  -- if temperature is None
  alt n == 1
    LLM->>Judge: agenerate_text(prompt, n=1, temperature)
    Judge-->>LLM: LLMResult (single)
    LLM->>LLM: is_finished(result) ?
    LLM-->>Client: LLMResult
  else n > 1
    par call 1
      LLM->>Judge: agenerate_text(prompt, n=1, temperature)
    and call 2
      LLM->>Judge: agenerate_text(prompt, n=1, temperature)
    and ... call N
      LLM->>Judge: agenerate_text(prompt, n=1, temperature)
    end
    Judge-->>LLM: multiple LLMResult objects
    LLM->>LLM: _fuse_llm_results(results)
    LLM->>LLM: is_finished(fused_result) ?
    LLM-->>Client: fused LLMResult (generations=[[...]])
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through wrappers, code in paw,
Fused many results—what a small awe!
Temperatures set, run_config snug and neat,
Async calls lined up, each task complete.
Tests guard the path—now hop on with me!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 61.40%, below the required 80.00% threshold. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The PR title clearly summarizes the main change: adding missing ragas BaseRagasLLM methods to the evaluator adapter, specifically addressing the generate method that caused v0.2.1 crashes.
Description check | ✅ Passed | The PR description is comprehensive and directly related to the changeset, explaining the root cause of the regression, the specific methods added, test coverage, and merge order details.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py`:
- Around line 442-462: Replace the use of asyncio.run in the adapter helpers
embed_text and embed_texts with ragas's run_async_in_current_loop utility:
import run_async_in_current_loop from ragas.embeddings.utils and call
run_async_in_current_loop(self.aembed_query(text)) in embed_text and
run_async_in_current_loop(self.aembed_documents(texts)) in embed_texts so the
async helpers (aembed_query, aembed_documents) run correctly whether an event
loop is already running or not.

In `@packages/evaluator/tests/test_ragas_adapters_contract.py`:
- Around line 46-52: The current _public_methods(cls) only compares method names
which misses incompatible wrappers; update the test to also compare
inspect.signature for each public callable: use inspect.signature on the
reference class (e.g., BaseRagasLLM) and on the class under test, iterate the
public method names produced by _public_methods, and assert that each method's
Signature has compatible parameters (matching parameter names, kinds, and
default presence/values where relevant) rather than just names — this should
detect cases like a wrapper using generate(**kwargs) or missing parameters;
reference _public_methods and the BaseRagasLLM.generate signature in the test to
implement the signature comparison and fail the test when signatures are not
compatible.

In `@packages/evaluator/tests/test_ragas_adapters_e2e.py`:
- Around line 143-151: Narrow the broad except Exception handler: instead of
treating any non-AttributeError as acceptable, only allow the downstream
parsing/schema/embedding errors you expect — e.g., catch Exception as exc then
if not isinstance(exc, (ragas.parser.ParserError, ragas.schema.SchemaError,
adapters.MockEmbeddingDimensionError, ValueError)): raise; otherwise continue
and keep the existing AttributeError assertion. Update the except block around
the test (the exc/exc_name handling) so unexpected exceptions are re-raised
rather than treated as success, importing the concrete exception classes used by
the parser/schema/adapter modules.

In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 83-92: Update the test to assert the resolved temperature value is
actually used: override or monkeypatch RagForgeRagasLLM.agenerate_text (or
llm.get_temperature) to capture the temperature passed into agenerate_text, call
llm.generate("prompt", n=1, temperature=None), and assert that the captured
temperature equals the value returned by llm.get_temperature(n) (not just that
no exception was raised); reference the RagForgeRagasLLM class, its generate
method, get_temperature, and agenerate_text to locate where to inject the
spy/capture and perform the equality assertion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c68799c9-6093-482e-8766-4c0cc92a3672

📥 Commits

Reviewing files that changed from the base of the PR and between 18947f4 and 91d10ca.

📒 Files selected for processing (4)
  • packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
  • packages/evaluator/tests/test_ragas_adapters.py
  • packages/evaluator/tests/test_ragas_adapters_contract.py
  • packages/evaluator/tests/test_ragas_adapters_e2e.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: RAG Quality Gate
  • GitHub Check: Lint, Typecheck & Test

Comment thread packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py Outdated
Comment on lines +442 to +462
    def embed_text(self, text: str, is_async: bool = True) -> list[float]:
        """Dispatch helper that ragas's concrete ``BaseRagasEmbeddings``
        exposes alongside ``embed_query``. Some metrics call ``embed_text``
        instead of ``embed_query`` — they are conceptually the same, just
        different entry points. We forward to ``embed_query`` for the
        sync path and to ``aembed_query`` when ``is_async=True``.

        ``is_async=True`` is the base-class default, so we match it — but
        because our underlying Judge/embedding clients are synchronous,
        the "async" path just runs the sync call in a worker thread
        (same pattern as ``aembed_query``).
        """
        if is_async:
            return asyncio.run(self.aembed_query(text))
        return self.embed_query(text)

    def embed_texts(self, texts: list[str], is_async: bool = True) -> list[list[float]]:
        """Batch variant of ``embed_text`` — same dispatch semantics."""
        if is_async:
            return asyncio.run(self.aembed_documents(texts))
        return self.embed_documents(texts)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

How does the current ragas release implement BaseRagasEmbeddings.embed_text and embed_texts, and is it safe to call those helpers via asyncio.run() from code that may already be inside an asyncio event loop?

💡 Result:

In the current Ragas release (main branch as of 2026-04-15), BaseRagasEmbeddings.embed_text and embed_texts are implemented as asynchronous methods in src/ragas/embeddings/base.py. embed_text(text: str, is_async: bool = True) -> List[float] awaits self.embed_texts([text], is_async=is_async) and returns the first embedding. embed_texts(texts: List[str], is_async: bool = True) -> List[List[float]]: - If is_async=True (default): Applies retry via add_async_retry to self.aembed_documents and awaits it. - If is_async=False: Applies retry to sync self.embed_documents and runs it via await loop.run_in_executor(None, ...). These are designed to be called from async contexts (e.g., await embedder.embed_texts(texts)). It is NOT safe to call them via asyncio.run from code already inside an event loop. asyncio.run requires no running loop and creates a new one, raising RuntimeError if called nested. Ragas provides run_async_in_current_loop(coro) in src/ragas/embeddings/utils.py specifically for this: detects running loops (e.g., Jupyter), spawns a thread with new loop if running, or uses loop.run_until_complete(coro) if not running. Used in BaseRagasEmbedding for sync-to-async bridging. For sync calling inside loops, use await loop.run_in_executor(None, lambda: embedder.embed_documents(texts)) or run_async_in_current_loop(embedder.embed_texts(texts)). Direct asyncio.run(awaitable) will fail.

Use Ragas's run_async_in_current_loop() instead of asyncio.run() in these adapter helpers.

Calling asyncio.run(self.aembed_query(text)) or asyncio.run(self.aembed_documents(texts)) will raise RuntimeError if ragas invokes these methods from within an already-running event loop. Ragas provides run_async_in_current_loop() in src/ragas/embeddings/utils.py specifically for this—it detects whether a loop is running and either spawns a thread with a new loop or calls run_until_complete() as appropriate. Migrate these methods to use that utility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py` around
lines 442 - 462, Replace the use of asyncio.run in the adapter helpers
embed_text and embed_texts with ragas's run_async_in_current_loop utility:
import run_async_in_current_loop from ragas.embeddings.utils and call
run_async_in_current_loop(self.aembed_query(text)) in embed_text and
run_async_in_current_loop(self.aembed_documents(texts)) in embed_texts so the
async helpers (aembed_query, aembed_documents) run correctly whether an event
loop is already running or not.

Comment thread packages/evaluator/tests/test_ragas_adapters_contract.py
Comment thread packages/evaluator/tests/test_ragas_adapters_e2e.py Outdated
Comment thread packages/evaluator/tests/test_ragas_adapters.py
Five fixes from CodeRabbit review on PR #36. All were real issues the
original contract test missed because it only checked method names,
not signatures.

### generate(n > 1) now produces the correct LLMResult shape

ragas uses n > 1 for multi-sample metrics (answer_correctness
consistency checks, etc.) and expects LLMResult.generations shaped
[[gen1, gen2, ..., genN]] — one prompt run with N candidate
generations. The original shim ignored n and returned a single
generation, silently breaking any metric that relied on sample
diversity.

Fix: when n > 1, fan out n independent agenerate_text calls via
asyncio.gather and fuse the per-call results into a single
[[gen1..genN]]-shaped LLMResult via the new _fuse_llm_results helper.
n == 1 (the common case) stays a single call.
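The fan-out-and-fuse path above can be sketched with a toy result type in place of langchain's LLMResult; the real `_fuse_llm_results` and signatures live in ragas_adapters.py, so treat this as illustration only.

```python
import asyncio


class ToyResult:
    def __init__(self, generations):
        self.generations = generations  # shape: [[gen, ...]]


def _fuse_llm_results(results):
    if not results:
        raise ValueError("_fuse_llm_results called with empty results list")
    if len(results) == 1:
        return results[0]
    # One prompt run with N candidate generations: [[gen1, ..., genN]]
    fused = [gen for r in results for gen in r.generations[0]]
    return ToyResult(generations=[fused])


async def agenerate_text(prompt, n=1, temperature=None):
    # Stand-in single-sample call; the real path delegates to the Judge.
    return ToyResult(generations=[[f"sample for {prompt!r}"]])


async def generate(prompt, n=1, temperature=None):
    if n == 1:
        return await agenerate_text(prompt, n=1, temperature=temperature)
    # Fan out n independent single-sample calls, then fuse their outputs.
    results = await asyncio.gather(
        *(agenerate_text(prompt, n=1, temperature=temperature) for _ in range(n))
    )
    return _fuse_llm_results(list(results))


fused = asyncio.run(generate("q", n=3))
assert len(fused.generations[0]) == 3  # [[gen1, gen2, gen3]]
```

Keeping n == 1 as a plain single call avoids any gather overhead on the common path.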

New test: test_wrapper_async_generate_n_greater_than_one_produces_fused_shape
uses a counting judge to verify all N calls fire and the fused result
carries all N distinct outputs.

### embed_text / embed_texts now async (crash fix)

Critical bug: the original shims were sync methods wrapping
asyncio.run(self.aembed_query(text)). ragas's BaseRagasEmbeddings
declares embed_text and embed_texts as async coroutines and metric
code invokes them with await embeddings.embed_text(...) from inside
ragas's evaluation event loop. asyncio.run() inside a running loop
crashes with RuntimeError: asyncio.run() cannot be called from a
running event loop — a real live-fire crash that would have taken
down any Cycle 4 run.

Fix: make both methods async. They now just await the existing
aembed_query / aembed_documents paths. The is_async parameter is
accepted for signature parity with the base class but ignored —
our underlying clients are synchronous and aembed_query already
runs them in a worker thread via asyncio.to_thread, so both flag
values land on the same code path.
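Sketched under the same assumptions (a synchronous toy embedding client; method names follow the base-class shape described above, everything else is illustrative):

```python
import asyncio


class EmbeddingsSketch:
    def embed_query(self, text: str) -> list[float]:
        return [float(len(text))]  # toy synchronous embedding

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self.embed_query(t) for t in texts]

    async def aembed_query(self, text: str) -> list[float]:
        # The sync client runs in a worker thread, never blocking the loop.
        return await asyncio.to_thread(self.embed_query, text)

    async def aembed_documents(self, texts: list[str]) -> list[list[float]]:
        return await asyncio.to_thread(self.embed_documents, texts)

    # Base-class-shaped coroutines; is_async is accepted for signature
    # parity but ignored: both flag values land on the worker-thread path.
    async def embed_text(self, text: str, is_async: bool = True) -> list[float]:
        return await self.aembed_query(text)

    async def embed_texts(self, texts: list[str], is_async: bool = True) -> list[list[float]]:
        return await self.aembed_documents(texts)


# Safe from inside a running event loop, unlike the asyncio.run() version:
async def caller():
    emb = EmbeddingsSketch()
    return await emb.embed_text("abc")


assert asyncio.run(caller()) == [3.0]
```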

New tests:
- test_embeddings_embed_text_is_async_and_awaitable: asserts both
  methods are inspect.iscoroutinefunction and round-trips through
  asyncio.run(embed.embed_text(...)).
- test_embeddings_embed_text_callable_from_running_event_loop:
  exercises the exact asyncio.run-inside-loop path ragas creates.

### Contract test now checks async/sync parity, not just names

CodeRabbit correctly pointed out that the original contract test
compared method names but not their async/sync shape — exactly why
the sync embed_text / embed_texts slipped past review.

Fix: two new tests (test_llm_wrapper_async_signature_matches_base
and test_embeddings_wrapper_async_signature_matches_base) iterate
every public method on the real ragas base classes and assert
inspect.iscoroutinefunction matches on both sides. A future release
that adds a method in one shape or the other will fail in CI.
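The parity check itself is small. A stand-in sketch (Base and SyncDriftWrapper are illustrative; the real tests iterate the actual ragas base classes):

```python
import inspect


def assert_async_parity(base_cls, wrapper_cls):
    """For every shared public method, require matching async/sync shape."""
    for name in dir(base_cls):
        if name.startswith("_"):
            continue
        base_fn = getattr(base_cls, name)
        wrapper_fn = getattr(wrapper_cls, name, None)
        if not callable(base_fn) or wrapper_fn is None:
            continue
        assert inspect.iscoroutinefunction(base_fn) == inspect.iscoroutinefunction(wrapper_fn), (
            f"{name}: async/sync shape differs between "
            f"{base_cls.__name__} and {wrapper_cls.__name__}"
        )


class Base:
    async def embed_text(self, text): ...


class SyncDriftWrapper:
    def embed_text(self, text): ...   # the exact drift that slipped past review


try:
    assert_async_parity(Base, SyncDriftWrapper)
    raised = False
except AssertionError:
    raised = True
assert raised
```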

### generate() default temperature: None -> 0.01

BaseRagasLLM.generate's default is 0.01. Our shim had None, which
is functionally equivalent (None triggers get_temperature()) but
diverges from the base signature. Changed to 0.01 to match.

Tests that exercise the None fallback still pass None explicitly.

### Tightened existing temperature-fallback test

The original test asserted "doesn't raise" but would have passed
even if get_temperature() was never called. Now patches
llm.get_temperature with wraps= and asserts assert_called_once_with(1).
Added inverse test that patches get_temperature and asserts it was
NOT called when an explicit temperature is passed — guards against
over-eager fallback firing.

### Narrowed e2e smoke test exception handler

Original except Exception accepted any non-AttributeError as a pass.
A bad shim raising TypeError or a nested-loop bug raising RuntimeError
would have counted as success. Tightened to allow only
RagasOutputParserException (expected downstream parser failure from
the stub judge's canned JSON); everything else propagates.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
packages/evaluator/tests/test_ragas_adapters.py (1)

96-101: ⚠️ Potential issue | 🟡 Minor

Assert the resolved temperature is forwarded into agenerate_text.

This still passes if generate() calls get_temperature() and then ignores its return value before invoking agenerate_text(). Spy on agenerate_text and assert the captured temperature matches the patched fallback.

Suggested tightening
     judge = FakeJudge(response="ok")
     llm = RagForgeRagasLLM(judge=judge, refusal_aware=False)
-    with patch.object(
-        llm, "get_temperature", wraps=llm.get_temperature
-    ) as get_temp:
+    captured: dict[str, float | None] = {}
+    original = llm.agenerate_text
+
+    async def spy(*args, **kwargs):
+        captured["temperature"] = kwargs["temperature"]
+        return await original(*args, **kwargs)
+
+    with patch.object(llm, "get_temperature", return_value=0.42) as get_temp, patch.object(
+        llm, "agenerate_text", side_effect=spy
+    ):
         result = asyncio.run(llm.generate("prompt", n=1, temperature=None))
     assert _llm_result_text(result) == "ok"
     get_temp.assert_called_once_with(1)
+    assert captured["temperature"] == 0.42
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/tests/test_ragas_adapters.py` around lines 96 - 101, The
test currently only verifies get_temperature was called but not that its
resolved value is passed to the async implementation; update the test to spy on
the internal agenerate_text call (e.g., patch or mock llm.agenerate_text) and
capture the temperature argument used by agenerate_text when
llm.generate("prompt", n=1, temperature=None) is invoked; assert that the
captured temperature equals the patched fallback value (1) in addition to
keeping the existing get_temperature.assert_called_once_with(1) and result text
assertion so you validate the resolved temperature is forwarded into
agenerate_text.
packages/evaluator/tests/test_ragas_adapters_contract.py (1)

46-52: ⚠️ Potential issue | 🟠 Major

Compare signatures, not just method names.

These tests still pass if a wrapper keeps the right method names but drops callbacks, changes defaults, or replaces a positional signature with **kwargs. That drift only shows up later as TypeError inside ragas. Add an inspect.signature(...) comparison after stripping self.

#!/bin/bash
python - <<'PY'
import inspect

from rag_forge_evaluator.engines.ragas_adapters import (
    RagForgeRagasEmbeddings,
    RagForgeRagasLLM,
)

from ragas.embeddings.base import BaseRagasEmbeddings
from ragas.llms.base import BaseRagasLLM

def public_methods(cls):
    return sorted(
        name for name in dir(cls)
        if not name.startswith("_") and callable(getattr(cls, name))
    )

def normalized_signature(cls, name):
    sig = inspect.signature(getattr(cls, name))
    params = [
        (p.name, p.kind, p.default)
        for p in sig.parameters.values()
        if p.name != "self"
    ]
    return params

for base_cls, wrapper_cls in (
    (BaseRagasLLM, RagForgeRagasLLM),
    (BaseRagasEmbeddings, RagForgeRagasEmbeddings),
):
    print(f"\n{base_cls.__name__} vs {wrapper_cls.__name__}")
    for name in public_methods(base_cls):
        if not hasattr(wrapper_cls, name):
            print("MISSING", name)
            continue
        base_sig = inspect.signature(getattr(base_cls, name))
        wrapper_sig = inspect.signature(getattr(wrapper_cls, name))
        if normalized_signature(base_cls, name) != normalized_signature(wrapper_cls, name):
            print("DIFF", name, base_sig, wrapper_sig)
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/tests/test_ragas_adapters_contract.py` around lines 46 -
52, The test helper _public_methods should be expanded to assert that wrappers
preserve method signatures, not just names: for each public method on the base
classes (BaseRagasLLM, BaseRagasEmbeddings) that exists on the wrapper classes
(RagForgeRagasLLM, RagForgeRagasEmbeddings), compute inspect.signature(...) for
both, normalize by removing the 'self' parameter and comparing (name, kind,
default) tuples for each parameter, and fail the test if any signature differs
(covers changed defaults, added **kwargs, positional->kw-only drift, missing
callbacks, etc.).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py`:
- Around line 131-148: The current _fuse_llm_results function silently returns
results[0] on any fusion error, collapsing n>1 samples into one; change this so
we do NOT degrade silently: on AttributeError or IndexError (malformed result
shapes) raise a ValueError with a concise diagnostic (include len(results) and
the problematic result type/repr) and do not return a single sample, and on
ImportError (missing langchain) re-raise or allow the ImportError to bubble so
the soft-dependency path is explicit rather than silently returning results[0];
update the except block accordingly and remove the fallback return results[0]
behavior.

In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 128-150: The CountingJudge's judge method mutates shared state
counter["i"] without synchronization causing race conditions when llm.generate
fans out to threads; make the counter increment atomic by adding a lock (e.g.,
threading.Lock) as a member of CountingJudge and acquiring it around
counter["i"] += 1, or replace the shared dict with a thread-safe atomic counter
(e.g., itertools.count or threading-safe integer wrapper) and update
CountingJudge.judge to use that safe increment so tests for n>1 become
deterministic.

---

Duplicate comments:
In `@packages/evaluator/tests/test_ragas_adapters_contract.py`:
- Around line 46-52: The test helper _public_methods should be expanded to
assert that wrappers preserve method signatures, not just names: for each public
method on the base classes (BaseRagasLLM, BaseRagasEmbeddings) that exists on
the wrapper classes (RagForgeRagasLLM, RagForgeRagasEmbeddings), compute
inspect.signature(...) for both, normalize by removing the 'self' parameter and
comparing (name, kind, default) tuples for each parameter, and fail the test if
any signature differs (covers changed defaults, added **kwargs,
positional->kw-only drift, missing callbacks, etc.).

In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 96-101: The test currently only verifies get_temperature was
called but not that its resolved value is passed to the async implementation;
update the test to spy on the internal agenerate_text call (e.g., patch or mock
llm.agenerate_text) and capture the temperature argument used by agenerate_text
when llm.generate("prompt", n=1, temperature=None) is invoked; assert that the
captured temperature equals the patched fallback value (1) in addition to
keeping the existing get_temperature.assert_called_once_with(1) and result text
assertion so you validate the resolved temperature is forwarded into
agenerate_text.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: aee68e59-0676-4965-af91-a54a52be113b

📥 Commits

Reviewing files that changed from the base of the PR and between 91d10ca and 0724d99.

📒 Files selected for processing (4)
  • packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
  • packages/evaluator/tests/test_ragas_adapters.py
  • packages/evaluator/tests/test_ragas_adapters_contract.py
  • packages/evaluator/tests/test_ragas_adapters_e2e.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: RAG Quality Gate
  • GitHub Check: Lint, Typecheck & Test

Comment on lines +131 to +148
    If fusion fails (mixed result shapes, missing langchain, etc.) we
    fall back to returning the first result — ragas will then see a
    single generation, which is strictly worse than N but still
    correct shape. The ``n>1`` path is rare in stock ragas metrics so
    conservative fallback is preferable to crashing.
    """
    if not results:
        msg = "_fuse_llm_results called with empty results list"
        raise ValueError(msg)
    if len(results) == 1:
        return results[0]
    try:
        from langchain_core.outputs import LLMResult as _LLMResult

        fused_generations = [gen for r in results for gen in r.generations[0]]
        return _LLMResult(generations=[fused_generations])
    except (ImportError, AttributeError, IndexError):
        return results[0]
⚠️ Potential issue | 🟠 Major

Don't silently collapse n>1 into a single sample on fuse failure.

Falling back to results[0] here turns a requested multi-sample generation into a single-sample result with no signal. That both skews downstream ragas scoring and breaks the advertised soft-dependency path because _StringLLMResult never actually gets fused. Handle the stub case explicitly, and raise on malformed result shapes instead of degrading silently.

Proposed fix
     try:
         from langchain_core.outputs import LLMResult as _LLMResult
 
         fused_generations = [gen for r in results for gen in r.generations[0]]
         return _LLMResult(generations=[fused_generations])
-    except (ImportError, AttributeError, IndexError):
-        return results[0]
+    except ImportError:
+        fused = _StringLLMResult(results[0].generations[0][0].text)
+        fused.generations = [[gen for r in results for gen in r.generations[0]]]
+        return fused
+    except (AttributeError, IndexError, TypeError) as exc:
+        msg = "Cannot fuse malformed LLMResult payloads"
+        raise ValueError(msg) from exc
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py` around
lines 131 - 148, The current _fuse_llm_results function silently returns
results[0] on any fusion error, collapsing n>1 samples into one; change this so
we do NOT degrade silently: on AttributeError or IndexError (malformed result
shapes) raise a ValueError with a concise diagnostic (include len(results) and
the problematic result type/repr) and do not return a single sample, and on
ImportError (missing langchain) re-raise or allow the ImportError to bubble so
the soft-dependency path is explicit rather than silently returning results[0];
update the except block accordingly and remove the fallback return results[0]
behavior.

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 163-186: Add a test case to cover the scenario where each result
has an empty inner generations list so _fuse_llm_results raises a ValueError for
malformed input; specifically, in the test function
test_fuse_llm_results_raises_on_malformed_input add a class (e.g., EmptyInner)
whose self.generations is [[]] and assert that calling
_fuse_llm_results([EmptyInner(), EmptyInner()]) raises ValueError with the
"malformed result" message, referencing the existing _fuse_llm_results import to
locate the code under test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 26dab04a-f0e5-4552-aac3-edde6d722f3d

📥 Commits

Reviewing files that changed from the base of the PR and between 0724d99 and 863394b.

📒 Files selected for processing (3)
  • packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
  • packages/evaluator/tests/test_ragas_adapters.py
  • packages/evaluator/tests/test_ragas_adapters_contract.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: RAG Quality Gate
  • GitHub Check: Lint, Typecheck & Test
🔇 Additional comments (15)
packages/evaluator/tests/test_ragas_adapters.py (4)

15-26: LGTM!

The helper correctly normalizes assertions across environments by handling both raw strings and LLMResult-like objects. The lack of defensive checks for empty generations is acceptable in test code — tests should fail loudly on malformed responses.


83-101: LGTM!

The test now correctly verifies the fallback path by patching get_temperature with wraps and asserting it was called with the expected argument. This addresses the prior feedback about asserting the actual behavior rather than just absence of exceptions.


117-160: LGTM!

The test correctly validates the n>1 generation shape and uses thread-safe counting as recommended in prior review. The assertions cover the exact [[gen1, gen2, gen3]] shape ragas expects.


189-233: LGTM!

The tests thoroughly validate the async contract for embed_text and embed_texts, including the regression test that exercises calling from within a running event loop — exactly the scenario that crashed with the original asyncio.run() implementation.

packages/evaluator/tests/test_ragas_adapters_contract.py (5)

46-52: LGTM!

The helper correctly enumerates public callables including inherited methods, which is the right behavior for contract testing against base classes.


149-172: LGTM!

The helper correctly extracts named parameters while excluding self/cls and variadic catchalls. This enables the parameter-parity tests to detect missing keyword arguments that would cause TypeError at runtime.


101-146: LGTM!

The async/sync parity tests correctly use inspect.iscoroutinefunction to validate that wrapper methods match the base class's async nature. This would have caught the original embed_text sync implementation that crashed in async contexts.


175-215: LGTM!

The parameter-name coverage tests address the prior feedback about signature parity. They correctly check that the wrapper accepts all parameters the base class declares, preventing TypeError at runtime when ragas passes keyword arguments.


218-254: LGTM!

The instance-level tests comprehensively validate the specific shim behaviors: async generate, is_finished default, temperature convention matching ragas, and set_run_config persistence. Good coverage of the new v0.2.2 contract surface.

packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py (6)

119-170: LGTM!

The implementation correctly addresses prior feedback by raising ValueError with descriptive messages on malformed input instead of silently falling back to results[0]. The error message includes the types of all result objects, aiding debugging.


267-341: LGTM!

The generate implementation correctly handles both n=1 and n>1 cases, resolves temperature via get_temperature(n) when None, and uses asyncio.gather for parallel fan-out. The defensive is_finished check and documented rationale for skipping add_async_retry show thoughtful design.


393-408: LGTM!

The factory method correctly constructs a fused _StringLLMResult stub for the no-langchain path, maintaining the [[gen1..genN]] shape and preserving str() / __eq__ usability via _text.


544-571: LGTM!

The methods are correctly implemented as async def coroutines, addressing the prior feedback about asyncio.run() crashing inside running event loops. The ignored is_async parameter is well-documented — both code paths converge on asyncio.to_thread anyway.


343-377: LGTM!

The helper methods correctly implement the ragas contract: is_finished conservatively returns True (documented limitation with clear upgrade path), get_temperature matches ragas's convention, and set_run_config stores the value for contract compatibility.


10-58: LGTM!

Excellent documentation of the ragas 0.4.x contract surface and the v0.2.0 → v0.2.2 evolution. The explicit listing of required methods and the rationale for duck-typing will help future maintainers understand the design constraints.

Comment on lines +163 to +186
def test_fuse_llm_results_raises_on_malformed_input():
    """``_fuse_llm_results`` must fail loud on malformed result shapes
    instead of silently collapsing to ``results[0]``. CodeRabbit on
    PR #36 round 2 pointed out that a silent fallback hides real
    ``n > 1`` correctness bugs — a single returned sample skews
    downstream ragas metrics with no signal.
    """
    from rag_forge_evaluator.engines.ragas_adapters import _fuse_llm_results

    class NotAnLLMResult:
        pass

    with pytest.raises(ValueError, match="malformed result"):
        _fuse_llm_results([NotAnLLMResult(), NotAnLLMResult()])

    class EmptyOuter:
        def __init__(self) -> None:
            self.generations: list[list[object]] = []

    with pytest.raises(ValueError, match="malformed result"):
        _fuse_llm_results([EmptyOuter(), EmptyOuter()])

    with pytest.raises(ValueError, match="empty results list"):
        _fuse_llm_results([])
🧹 Nitpick | 🔵 Trivial

Consider adding a test for empty inner generations.

The test covers missing .generations, empty outer list, and empty results list. However, _fuse_llm_results also catches IndexError when r.generations[0] exists but is empty. Adding this case would complete coverage:

class EmptyInner:
    def __init__(self) -> None:
        self.generations: list[list[object]] = [[]]

with pytest.raises(ValueError, match="malformed result"):
    _fuse_llm_results([EmptyInner(), EmptyInner()])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/evaluator/tests/test_ragas_adapters.py` around lines 163 - 186, Add
a test case to cover the scenario where each result has an empty inner
generations list so _fuse_llm_results raises a ValueError for malformed input;
specifically, in the test function
test_fuse_llm_results_raises_on_malformed_input add a class (e.g., EmptyInner)
whose self.generations is [[]] and assert that calling
_fuse_llm_results([EmptyInner(), EmptyInner()]) raises ValueError with the
"malformed result" message, referencing the existing _fuse_llm_results import to
locate the code under test.

@hallengray hallengray merged commit 73d8c5e into main Apr 15, 2026
3 checks passed
@hallengray hallengray deleted the fix/v0.2.2-g1-ragas-adapter-generate branch April 15, 2026 22:52
hallengray added a commit that referenced this pull request Apr 15, 2026
…3) (#38)

* fix(evaluator): add missing ragas BaseRagasLLM methods (G1)

RAG-Forge v0.2.1's RagForgeRagasLLM wrapper was shipped without the
concrete .generate() method that ragas 0.4.x's BaseRagasLLM exposes.
Every RAGAS metric job crashed on first contact with
  AttributeError: 'RagForgeRagasLLM' object has no attribute 'generate'
during the PearMedica Cycle 3 audit (2026-04-15), producing a
Scored: 0, Overall: 0.0000 report across all 48 metric evaluations.

Root cause: the adapter's design doc asserted "ragas only calls
generate_text / agenerate_text / model_name" and deliberately avoided
subclassing BaseRagasLLM to keep ragas as a soft dependency. That
assumption was wrong — BaseRagasLLM.generate is a concrete async helper
that ragas's metric code invokes on every LLM regardless of subclass
status. The duck-typed wrapper must re-declare it.

Fix preserves the no-hard-import design by adding duck-typed shims for
every public method on BaseRagasLLM and BaseRagasEmbeddings:

  RagForgeRagasLLM:
    - async generate()        — the specific shim Cycle 3 caught missing
    - is_finished()           — was abstract on base, returns True
    - get_temperature(n)      — matches base convention (0.01 / 0.3)
    - set_run_config()        — stores ragas RunConfig for compatibility
    - run_config attribute    — defaults to None

  RagForgeRagasEmbeddings:
    - embed_text(is_async)    — dispatch helper on base class
    - embed_texts(is_async)   — batch variant
    - set_run_config()
    - run_config attribute

Tests: a contract test iterates every public method on the real ragas
base classes and asserts our wrappers declare a callable of the same
name, so the next ragas release that adds a method fails in CI not in
a user audit. An end-to-end smoke test runs ragas.evaluate() against
our wrappers on a 1-sample dataset and asserts it never raises an
AttributeError naming our wrapper classes — the exact regression
signature from Cycle 3.
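The contract-test mechanism can be sketched like this. `BaseLLM`, `Wrapper`, and `IncompleteWrapper` are illustrative stand-ins, not the real ragas base class or the repo's wrappers:

```python
# Stand-ins for ragas's BaseRagasLLM and the duck-typed wrappers.
class BaseLLM:
    def generate_text(self, prompt): ...
    async def agenerate_text(self, prompt): ...
    async def generate(self, prompt): ...

class Wrapper:
    def generate_text(self, prompt): ...
    async def agenerate_text(self, prompt): ...
    async def generate(self, prompt): ...

class IncompleteWrapper:
    def generate_text(self, prompt): ...
    async def agenerate_text(self, prompt): ...
    # no generate() — the v0.2.1 bug shape

def public_callables(cls):
    # enumerate public callables, including inherited ones
    return {
        name
        for name in dir(cls)
        if not name.startswith("_") and callable(getattr(cls, name))
    }

assert public_callables(BaseLLM) - public_callables(Wrapper) == set()
assert public_callables(BaseLLM) - public_callables(IncompleteWrapper) == {"generate"}
```

Because the set difference is computed against the *real* base class at test time, any method a future ragas release adds shows up as a missing name in CI.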

Three pre-existing test failures in test_cycle2_regression.py and
test_evaluator_factory.py are unrelated to this change (pre-existing
in v0.2.1 when ragas is installed; they assert on compound conditions
that break when MockJudge triggers the skip path). They belong in
workstream G3 (skip counter plumbing) and the G1 PR leaves them alone.

* fix(evaluator): address CodeRabbit review on G1 adapter shims

Five fixes from CodeRabbit review on PR #36. All were real issues the
original contract test missed because it only checked method names,
not signatures.

### generate(n > 1) now produces the correct LLMResult shape

ragas uses n > 1 for multi-sample metrics (answer_correctness
consistency checks, etc.) and expects LLMResult.generations shaped
[[gen1, gen2, ..., genN]] — one prompt run with N candidate
generations. The original shim ignored n and returned a single
generation, silently breaking any metric that relied on sample
diversity.

Fix: when n > 1, fan out n independent agenerate_text calls via
asyncio.gather and fuse the per-call results into a single
[[gen1..genN]]-shaped LLMResult via the new _fuse_llm_results helper.
n == 1 (the common case) stays a single call.

New test: test_wrapper_async_generate_n_greater_than_one_produces_
fused_shape uses a counting judge to verify all N calls fire and the
fused result carries all N distinct outputs.
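The fan-out shape can be sketched as below; `FanOutLLM` is a simplified stand-in that returns bare lists rather than a real `LLMResult`:

```python
import asyncio

class FanOutLLM:
    """Sketch of the n>1 path: fan out n independent agenerate_text calls
    and fuse them into a single [[gen1..genN]]-shaped result."""

    def __init__(self):
        self._i = 0

    async def agenerate_text(self, prompt, temperature=0.01):
        self._i += 1  # single event-loop thread here, so no race
        return f"sample-{self._i}"

    async def generate(self, prompt, n=1, temperature=0.01):
        if n == 1:  # the common case stays a single call
            return [[await self.agenerate_text(prompt, temperature)]]
        results = await asyncio.gather(
            *(self.agenerate_text(prompt, temperature) for _ in range(n))
        )
        return [list(results)]  # one prompt run, N candidate generations

out = asyncio.run(FanOutLLM().generate("q", n=3))
assert len(out) == 1 and len(out[0]) == 3  # [[gen1, gen2, gen3]]
```

The outer list has length 1 (one prompt) while the inner list carries all N candidates, which is the shape multi-sample metrics consume.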

### embed_text / embed_texts now async (crash fix)

Critical bug: the original shims were sync methods wrapping
asyncio.run(self.aembed_query(text)). ragas's BaseRagasEmbeddings
declares embed_text and embed_texts as async coroutines and metric
code invokes them with await embeddings.embed_text(...) from inside
ragas's evaluation event loop. asyncio.run() inside a running loop
crashes with RuntimeError: asyncio.run() cannot be called from a
running event loop — a real live-fire crash that would have taken
down any Cycle 4 run.

Fix: make both methods async. They now just await the existing
aembed_query / aembed_documents paths. The is_async parameter is
accepted for signature parity with the base class but ignored —
our underlying clients are synchronous and aembed_query already
runs them in a worker thread via asyncio.to_thread, so both flag
values land on the same code path.

New tests:
- test_embeddings_embed_text_is_async_and_awaitable: asserts both
  methods are inspect.iscoroutinefunction and round-trips through
  asyncio.run(embed.embed_text(...)).
- test_embeddings_embed_text_callable_from_running_event_loop:
  exercises the exact asyncio.run-inside-loop path ragas creates.
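The shape of the fix can be sketched as follows, with a toy sync client standing in for the real embedding backend:

```python
import asyncio

class Embeddings:
    """Sketch: sync client pushed to a worker thread; embed_text is a true
    coroutine so it can be awaited from inside a running event loop."""

    def _embed_sync(self, text):
        return [float(len(text))]  # stand-in for the real sync client

    async def aembed_query(self, text):
        return await asyncio.to_thread(self._embed_sync, text)

    async def embed_text(self, text, is_async=True):
        # is_async accepted for signature parity, ignored: both flag values
        # converge on the asyncio.to_thread path anyway
        return await self.aembed_query(text)

async def inside_running_loop():
    # the call shape ragas makes; asyncio.run() here would raise
    # RuntimeError, a plain await does not
    return await Embeddings().embed_text("hello")

vec = asyncio.run(inside_running_loop())
assert vec == [5.0]
```

Making the shim `async def` and awaiting the existing path removes the nested `asyncio.run()` entirely, so there is no loop-inside-loop to crash.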

### Contract test now checks async/sync parity, not just names

CodeRabbit correctly pointed out that the original contract test
compared method names but not their async/sync shape — exactly why
the sync embed_text / embed_texts slipped past review.

Fix: two new tests (test_llm_wrapper_async_signature_matches_base
and test_embeddings_wrapper_async_signature_matches_base) iterate
every public method on the real ragas base classes and assert
inspect.iscoroutinefunction matches on both sides. A future release
that adds a method in one shape or the other will fail in CI.
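The async/sync-parity check reduces to comparing `inspect.iscoroutinefunction` on both sides; the classes below are illustrative stand-ins:

```python
import inspect

class Base:
    def generate_text(self, p): ...
    async def agenerate_text(self, p): ...

class GoodWrapper:
    def generate_text(self, p): ...
    async def agenerate_text(self, p): ...

class BadWrapper:
    def generate_text(self, p): ...
    def agenerate_text(self, p): ...  # sync where the base is async

def async_parity_mismatches(base, wrapper):
    # public callables whose coroutine-ness differs between base and wrapper
    return [
        name
        for name in dir(base)
        if not name.startswith("_")
        and callable(getattr(base, name))
        and inspect.iscoroutinefunction(getattr(base, name))
        != inspect.iscoroutinefunction(getattr(wrapper, name, None))
    ]

assert async_parity_mismatches(Base, GoodWrapper) == []
assert async_parity_mismatches(Base, BadWrapper) == ["agenerate_text"]
```

This is exactly the check that would have flagged the sync `embed_text` / `embed_texts` before they shipped.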

### generate() default temperature: None -> 0.01

BaseRagasLLM.generate's default is 0.01. Our shim had None, which
is functionally equivalent (None triggers get_temperature()) but
diverges from the base signature. Changed to 0.01 to match.

Tests that exercise the None fallback still pass None explicitly.

### Tightened existing temperature-fallback test

The original test asserted "doesn't raise" but would have passed
even if get_temperature() was never called. Now patches
llm.get_temperature with wraps= and asserts assert_called_once_with(1).
Added inverse test that patches get_temperature and asserts it was
NOT called when an explicit temperature is passed — guards against
over-eager fallback firing.

### Narrowed e2e smoke test exception handler

Original except Exception accepted any non-AttributeError as a pass.
A bad shim raising TypeError or a nested-loop bug raising RuntimeError
would have counted as success. Tightened to allow only
RagasOutputParserException (expected downstream parser failure from
the stub judge's canned JSON); everything else propagates.

* fix(evaluator): fuse _StringLLMResult stubs in n>1 path for no-ragas CI

The n>1 fan-out fix in the previous commit built a fused LLMResult
via langchain_core.outputs.LLMResult. On CI without the [ragas]
extra installed, langchain isn't available and generate_text falls
back to the _StringLLMResult stub — which _fuse_llm_results had no
fuse path for, so it hit the ImportError fallback and returned
results[0] (a single generation). The n>1 test then asserted
inner length == 3 and failed with 1.

Fix: separate the flatten step (build the fused_generations list)
from the wrap step (pick LLMResult or _StringLLMResult). The
flatten works on either shape because both expose the same
.generations[0] interface. Wrap picks langchain if available,
otherwise constructs via a new _StringLLMResult._from_generations
alt constructor that carries a pre-fused list.

Local verification:
  - without [ragas]: 18/18 adapter tests pass (contract + e2e skip)
  - with [ragas]: 28/28 adapter tests pass

* fix(evaluator): address CodeRabbit round-2 findings on G1

Three more findings from the second CodeRabbit review on PR #36 / #38.

### _fuse_llm_results now fails loud on malformed input

The previous commit's AttributeError/IndexError fallback silently
returned results[0] — turning a requested n>1 generation into a
single-sample result with no signal. Downstream ragas metrics that
rely on sample diversity would consume the degraded result as if it
were a valid N-sample fuse. CodeRabbit rightly flagged this as
hiding real correctness bugs.

Fix: raise ValueError with the observed result types and the
underlying exception chained. Added a targeted test
(test_fuse_llm_results_raises_on_malformed_input) covering three
malformed-input shapes: objects without .generations, empty outer
lists, and empty input lists.

### CountingJudge in n>1 test is now thread-safe

RagForgeRagasLLM.generate fans out via asyncio.gather, and each
agenerate_text call runs in a worker thread via asyncio.to_thread.
counter["i"] += 1 is not atomic across threads — the read-modify-
write race would produce duplicate sample labels or an undercounted
total, flaking the test. Added a threading.Lock around the
increment. Test now deterministic under concurrent fan-out.
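A sketch of the thread-safe counting judge (names are illustrative; the real `CountingJudge` lives in the test file):

```python
import asyncio
import threading

class CountingJudge:
    """Counter incremented from asyncio.to_thread workers; the lock makes
    the read-modify-write atomic across threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0

    def complete(self, prompt):
        with self._lock:  # without this, concurrent += can drop updates
            self.count += 1
            return f"sample-{self.count}"

async def fan_out(judge, n):
    # mirrors generate()'s asyncio.gather over to_thread workers
    return await asyncio.gather(
        *(asyncio.to_thread(judge.complete, "p") for _ in range(n))
    )

judge = CountingJudge()
samples = asyncio.run(fan_out(judge, 3))
assert judge.count == 3
assert sorted(samples) == ["sample-1", "sample-2", "sample-3"]
```

Because the label is derived inside the locked section, every worker gets a distinct sample number even under true thread concurrency.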

### Contract tests now check parameter parity, not just name + async

Previously we asserted that each public method on BaseRagasLLM /
BaseRagasEmbeddings exists on our wrapper and matches async/sync.
That missed parameter drift — a future ragas release that adds
max_tokens to generate_text would silently break our wrapper until
a user audit caught it.

Fix: two new tests (test_llm_wrapper_parameter_names_cover_base_class
and test_embeddings_wrapper_parameter_names_cover_base_class) use
inspect.signature to enumerate the named parameters on each base-
class method and assert the wrapper accepts every name. Variadic
*args / **kwargs are ignored (they can absorb any kwarg by
definition). A helper _required_param_names() shares the logic.

Local: 30 passed on adapter + contract + e2e suites with ragas
installed, 18 passed on adapter suite alone without the [ragas]
extra. Ruff + mypy clean.

* fix(evaluator): route RAGAS exceptions into Skipped counter (G3)

Cycle 3's PearMedica audit (2026-04-15) documented two residual skip-
handling gaps in v0.2.1, separate from the C3-2 generate() AttributeError:

1. EvaluationResult.skipped_evaluations (the integer counter the report's
   TL;DR "Skipped: N" line reads) was never set by RagasEvaluator.
   skipped_samples held real SkipRecords, but the counter stayed at 0.
   Users reading the top-level summary saw "Scored: 0, Skipped: 0" and
   thought nothing had happened when in reality every job had crashed.

2. Individual-metric extraction failures created one SkipRecord per
   metric name with sample_id="<aggregate>". A 12-sample x 4-metric run
   that failed entirely produced 4 records instead of 48 — the blast
   radius was under-reported by 12x.

Fix:
- Extract skip-record creation into _fan_out_skip_records() — fans a
  single coarse failure out into one record per (sample, metric) pair
  with the real sample_id attached.
- Both the whole-batch crash path and the per-metric extraction failure
  path now use it.
- Set EvaluationResult.skipped_evaluations = len(skipped_samples) on
  both return sites so the counter and the detail list can never drift
  apart silently again.
- Truncate reason strings to 400 chars (with trailing ellipsis) so long
  Python tracebacks don't blow up HTML/PDF rendering downstream.

Tests:
- test_ragas_skip_counter.py — three new tests covering:
  (a) skipped_evaluations counter equals len(skipped_samples),
  (b) whole-batch crashes fan out to sample_count * metric_count records
      with real sample_ids (not "<aggregate>"),
  (c) reason truncation at 400 chars.
- test_cycle2_regression.py — updated two assertions that pre-dated the
  fan-out change. The original assertions demanded "metrics must be
  populated"; with fan-out, a MockJudge run legitimately produces zero
  scored metrics and a full skip list. Tests now assert the absence of
  Finding #4 (embed_query AttributeError) and Finding #5
  (InstructorRetryException / max_tokens truncation) signatures in the
  skip records instead, which is the actual regression guard.
- test_evaluator_factory.py — gate test_ragas_not_installed_raises_on_evaluate
  with a reverse importorskip so it runs only on systems without the
  [ragas] extra. CI matrices with ragas installed now skip it instead
  of failing.

Pre-existing mypy arg-type errors on the ragas_evaluate() kwargs silenced
with targeted # type: ignore comments and a docstring note pointing
readers at the adapter contract test as the real interface guard — duck
typing is the deliberate design, not an oversight.

* fix(evaluator): tolerate cross-environment mypy on ragas_evaluate ignores

The # type: ignore[arg-type] comments on the ragas_evaluate() kwargs
only fire when mypy can see ragas's real type signatures — i.e. when
the [ragas] extra is installed (local dev). On CI without the extra,
mypy falls back to Any and flags the ignores as [unused-ignore].

Adding unused-ignore to the ignore codes tells mypy to tolerate the
comment when it has nothing to suppress. Both environments are now
clean: local mypy with the extra still sees the arg-type mismatch and
honours the suppression; CI mypy without the extra silently accepts
the comment as dead.
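A toy illustration of the two-environment ignore (the function and value are made up; only the ignore-comment shape mirrors the commit):

```python
def takes_int(x: int) -> int:
    return x

# With real type stubs visible (local dev), mypy sees an arg-type error and
# the suppression fires; without them (CI), the argument is Any and a bare
# ignore[arg-type] would be flagged as unused — listing unused-ignore too
# lets both environments pass. Harmless at runtime.
value = takes_int("not-an-int")  # type: ignore[arg-type, unused-ignore]
```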

* fix(evaluator): remove dead real_import code in G3 skip counter test

CodeRabbit on PR #38 spotted the leftover
  real_import = mod.__dict__.copy()   # line 96
  _ = real_import                      # line 111
in test_whole_batch_crash_fans_out_to_every_sample_metric_pair. The
dict was copied but never used — an artefact of an earlier
monkeypatching approach that got simplified but not cleaned up.
Removing it and the unnecessary module-level import that fed it.

No behavioural change; test still passes.
hallengray added a commit that referenced this pull request Apr 15, 2026
…ages (#41)

Bumps 0.2.1 -> 0.2.2 on every package that ships to a registry, in
lockstep. The v0.2.1 G5 drift guard enforces the three Python
__version__ constants match their pyproject.toml siblings, and the
publish.yml verify step refuses to upload if any of the six drifts
from the release tag.

Python (pyproject.toml + __init__.py __version__, both sides):
  - rag-forge-core          0.2.1 -> 0.2.2
  - rag-forge-evaluator     0.2.1 -> 0.2.2
  - rag-forge-observability 0.2.1 -> 0.2.2

npm (package.json "version"):
  - @rag-forge/cli          0.2.1 -> 0.2.2
  - @rag-forge/mcp          0.2.1 -> 0.2.2
  - @rag-forge/shared       0.2.1 -> 0.2.2

CHANGELOG: dated the v0.2.2 entry 2026-04-15 (removes "unreleased").

The five v0.2.2 workstreams (G1-G5) already merged to main — see
PRs #36-#40. This PR is the final step before cutting the GitHub
Release that triggers publish.yml.

Once merged, the release workflow:
  1. twine check + check-wheel-contents on every Python wheel
  2. G1 contract tests (iterate real ragas base classes)
  3. G5 __version__ drift guard (auto-discovers packages)
  4. OIDC push to PyPI
  5. NPM_TOKEN push to npm (pnpm publish --no-git-checks)