fix(evaluator): add missing ragas BaseRagasLLM methods (v0.2.2 G1) #36
hallengray merged 4 commits into main from
Conversation
RAG-Forge v0.2.1's RagForgeRagasLLM wrapper was shipped without the
concrete .generate() method that ragas 0.4.x's BaseRagasLLM exposes.
Every RAGAS metric job crashed on first contact with
AttributeError: 'RagForgeRagasLLM' object has no attribute 'generate'
during the PearMedica Cycle 3 audit (2026-04-15), producing a
Scored: 0, Overall: 0.0000 report across all 48 metric evaluations.
Root cause: the adapter's design doc asserted "ragas only calls
generate_text / agenerate_text / model_name" and deliberately avoided
subclassing BaseRagasLLM to keep ragas as a soft dependency. That
assumption was wrong — BaseRagasLLM.generate is a concrete async helper
that ragas's metric code invokes on every LLM regardless of subclass
status. The duck-typed wrapper must re-declare it.
Fix preserves the no-hard-import design by adding duck-typed shims for
every public method on BaseRagasLLM and BaseRagasEmbeddings:
RagForgeRagasLLM:
- async generate() — the specific shim Cycle 3 caught missing
- is_finished() — was abstract on base, returns True
- get_temperature(n) — matches base convention (0.01 / 0.3)
- set_run_config() — stores ragas RunConfig for compatibility
- run_config attribute — defaults to None
RagForgeRagasEmbeddings:
- embed_text(is_async) — dispatch helper on base class
- embed_texts(is_async) — batch variant
- set_run_config()
- run_config attribute
Tests: a contract test iterates every public method on the real ragas
base classes and asserts our wrappers declare a callable of the same
name, so the next ragas release that adds a method fails in CI not in
a user audit. An end-to-end smoke test runs ragas.evaluate() against
our wrappers on a 1-sample dataset and asserts it never raises an
AttributeError naming our wrapper classes — the exact regression
signature from Cycle 3.
Three pre-existing test failures in test_cycle2_regression.py and
test_evaluator_factory.py are unrelated to this change (pre-existing
in v0.2.1 when ragas is installed; they assert on compound conditions
that break when MockJudge triggers the skip path). They belong in
workstream G3 (skip counter plumbing) and the G1 PR leaves them alone.
Summary by CodeRabbit

Walkthrough
Implements ragas v0.4.x public contract in RagForge adapters: adds async LLM

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant RagForgeRagasLLM as LLM
    participant Judge
    Client->>LLM: await generate(prompt, n, temperature=None)
    LLM->>LLM: temperature = get_temperature(n) -- if temperature is None
    alt n == 1
        LLM->>Judge: agenerate_text(prompt, n=1, temperature)
        Judge-->>LLM: LLMResult (single)
        LLM->>LLM: is_finished(result) ?
        LLM-->>Client: LLMResult
    else n > 1
        par call 1
            LLM->>Judge: agenerate_text(prompt, n=1, temperature)
        and call 2
            LLM->>Judge: agenerate_text(prompt, n=1, temperature)
        and ... call N
            LLM->>Judge: agenerate_text(prompt, n=1, temperature)
        end
        Judge-->>LLM: multiple LLMResult objects
        LLM->>LLM: _fuse_llm_results(results)
        LLM->>LLM: is_finished(fused_result) ?
        LLM-->>Client: fused LLMResult (generations=[[...]])
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ❌ Failed checks (1 warning) | ✅ Passed checks (2 passed)
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py`:
- Around line 442-462: Replace the use of asyncio.run in the adapter helpers
embed_text and embed_texts with ragas's run_async_in_current_loop utility:
import run_async_in_current_loop from ragas.embeddings.utils and call
run_async_in_current_loop(self.aembed_query(text)) in embed_text and
run_async_in_current_loop(self.aembed_documents(texts)) in embed_texts so the
async helpers (aembed_query, aembed_documents) run correctly whether an event
loop is already running or not.
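The loop-detection pattern this comment describes can be hand-rolled as below; `run_async_anywhere` is a hypothetical helper sketching the behavior attributed to ragas's run_async_in_current_loop, not the ragas source:

```python
import asyncio
import threading


def run_async_anywhere(coro):
    """Run a coroutine whether or not an event loop is already running.
    Sketch of the pattern the review describes; not ragas's implementation."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # no loop running in this thread: asyncio.run is safe
        return asyncio.run(coro)
    # a loop IS running: bare asyncio.run would raise RuntimeError,
    # so drive the coroutine on a fresh loop in a worker thread
    box = {}

    def worker():
        box["result"] = asyncio.run(coro)

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return box["result"]


async def _double(x):
    await asyncio.sleep(0)
    return x * 2


# works with no loop running...
no_loop = run_async_anywhere(_double(2))


# ...and from inside a running loop, where asyncio.run alone would crash
async def _inside():
    return run_async_anywhere(_double(3))


inside_loop = asyncio.run(_inside())
```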
In `@packages/evaluator/tests/test_ragas_adapters_contract.py`:
- Around line 46-52: The current _public_methods(cls) only compares method names
which misses incompatible wrappers; update the test to also compare
inspect.signature for each public callable: use inspect.signature on the
reference class (e.g., BaseRagasLLM) and on the class under test, iterate the
public method names produced by _public_methods, and assert that each method's
Signature has compatible parameters (matching parameter names, kinds, and
default presence/values where relevant) rather than just names — this should
detect cases like a wrapper using generate(**kwargs) or missing parameters;
reference _public_methods and the BaseRagasLLM.generate signature in the test to
implement the signature comparison and fail the test when signatures are not
compatible.
In `@packages/evaluator/tests/test_ragas_adapters_e2e.py`:
- Around line 143-151: Narrow the broad except Exception handler: instead of
treating any non-AttributeError as acceptable, only allow the downstream
parsing/schema/embedding errors you expect — e.g., catch Exception as exc then
if not isinstance(exc, (ragas.parser.ParserError, ragas.schema.SchemaError,
adapters.MockEmbeddingDimensionError, ValueError)): raise; otherwise continue
and keep the existing AttributeError assertion. Update the except block around
the test (the exc/exc_name handling) so unexpected exceptions are re-raised
rather than treated as success, importing the concrete exception classes used by
the parser/schema/adapter modules.
In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 83-92: Update the test to assert the resolved temperature value is
actually used: override or monkeypatch RagForgeRagasLLM.agenerate_text (or
llm.get_temperature) to capture the temperature passed into agenerate_text, call
llm.generate("prompt", n=1, temperature=None), and assert that the captured
temperature equals the value returned by llm.get_temperature(n) (not just that
no exception was raised); reference the RagForgeRagasLLM class, its generate
method, get_temperature, and agenerate_text to locate where to inject the
spy/capture and perform the equality assertion.
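The spy-and-capture pattern this prompt describes, sketched against a minimal stand-in for the wrapper (the real RagForgeRagasLLM takes a judge and has more surface):

```python
import asyncio


class _LLM:
    """Minimal stand-in mirroring the generate/agenerate_text split."""

    def get_temperature(self, n):
        return 0.01 if n == 1 else 0.3

    async def agenerate_text(self, prompt, n=1, temperature=None):
        return "ok"

    async def generate(self, prompt, n=1, temperature=None):
        if temperature is None:
            temperature = self.get_temperature(n)
        return await self.agenerate_text(prompt, n=n, temperature=temperature)


llm = _LLM()
captured = {}
real_agen = llm.agenerate_text  # keep the original bound method


async def spy(prompt, n=1, temperature=None):
    # record the temperature generate() actually forwarded
    captured["temperature"] = temperature
    return await real_agen(prompt, n=n, temperature=temperature)


llm.agenerate_text = spy  # instance-level monkeypatch (sketch)
result = asyncio.run(llm.generate("prompt", n=1, temperature=None))
```

The assertion then checks `captured["temperature"]` against `get_temperature(1)`, not just the absence of an exception.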
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c68799c9-6093-482e-8766-4c0cc92a3672
📒 Files selected for processing (4)
- packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
- packages/evaluator/tests/test_ragas_adapters.py
- packages/evaluator/tests/test_ragas_adapters_contract.py
- packages/evaluator/tests/test_ragas_adapters_e2e.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: RAG Quality Gate
- GitHub Check: Lint, Typecheck & Test
```python
def embed_text(self, text: str, is_async: bool = True) -> list[float]:
    """Dispatch helper that ragas's concrete ``BaseRagasEmbeddings``
    exposes alongside ``embed_query``. Some metrics call ``embed_text``
    instead of ``embed_query`` — they are conceptually the same, just
    different entry points. We forward to ``embed_query`` for the
    sync path and to ``aembed_query`` when ``is_async=True``.

    ``is_async=True`` is the base-class default, so we match it — but
    because our underlying Judge/embedding clients are synchronous,
    the "async" path just runs the sync call in a worker thread
    (same pattern as ``aembed_query``).
    """
    if is_async:
        return asyncio.run(self.aembed_query(text))
    return self.embed_query(text)


def embed_texts(self, texts: list[str], is_async: bool = True) -> list[list[float]]:
    """Batch variant of ``embed_text`` — same dispatch semantics."""
    if is_async:
        return asyncio.run(self.aembed_documents(texts))
    return self.embed_documents(texts)
```
🧩 Analysis chain
🌐 Web query:
How does the current ragas release implement BaseRagasEmbeddings.embed_text and embed_texts, and is it safe to call those helpers via asyncio.run() from code that may already be inside an asyncio event loop?
💡 Result:
In the current Ragas release (main branch as of 2026-04-15), BaseRagasEmbeddings.embed_text and embed_texts are implemented as asynchronous methods in src/ragas/embeddings/base.py.

- embed_text(text: str, is_async: bool = True) -> List[float] awaits self.embed_texts([text], is_async=is_async) and returns the first embedding.
- embed_texts(texts: List[str], is_async: bool = True) -> List[List[float]]:
  - If is_async=True (default): applies retry via add_async_retry to self.aembed_documents and awaits it.
  - If is_async=False: applies retry to the sync self.embed_documents and runs it via await loop.run_in_executor(None, ...).

These are designed to be called from async contexts (e.g., await embedder.embed_texts(texts)). It is NOT safe to call them via asyncio.run from code already inside an event loop: asyncio.run requires no running loop and creates a new one, raising RuntimeError if called nested. Ragas provides run_async_in_current_loop(coro) in src/ragas/embeddings/utils.py specifically for this: it detects running loops (e.g., Jupyter), spawns a thread with a new loop if one is running, or uses loop.run_until_complete(coro) if not. It is used in BaseRagasEmbedding for sync-to-async bridging. For sync calling inside loops, use await loop.run_in_executor(None, lambda: embedder.embed_documents(texts)) or run_async_in_current_loop(embedder.embed_texts(texts)). Direct asyncio.run(awaitable) will fail.
Citations:
- 1: https://docs.ragas.io/en/v0.3.4/references/embeddings/
- 2: https://github.com/explodinggradients/ragas/blob/main/src/ragas/embeddings/base.py
- 3: https://github.com/explodinggradients/ragas/blob/298b6827/src/ragas/embeddings/base.py
- 4: https://github.com/explodinggradients/ragas/blob/298b6827/src/ragas/embeddings/utils.py
Use Ragas's run_async_in_current_loop() instead of asyncio.run() in these adapter helpers.
Calling asyncio.run(self.aembed_query(text)) or asyncio.run(self.aembed_documents(texts)) will raise RuntimeError if ragas invokes these methods from within an already-running event loop. Ragas provides run_async_in_current_loop() in src/ragas/embeddings/utils.py specifically for this—it detects whether a loop is running and either spawns a thread with a new loop or calls run_until_complete() as appropriate. Migrate these methods to use that utility.
Five fixes from CodeRabbit review on PR #36. All were real issues the original contract test missed because it only checked method names, not signatures.

### generate(n > 1) now produces the correct LLMResult shape

ragas uses n > 1 for multi-sample metrics (answer_correctness consistency checks, etc.) and expects LLMResult.generations shaped [[gen1, gen2, ..., genN]] — one prompt run with N candidate generations. The original shim ignored n and returned a single generation, silently breaking any metric that relied on sample diversity.

Fix: when n > 1, fan out n independent agenerate_text calls via asyncio.gather and fuse the per-call results into a single [[gen1..genN]]-shaped LLMResult via the new _fuse_llm_results helper. n == 1 (the common case) stays a single call.

New test: test_wrapper_async_generate_n_greater_than_one_produces_fused_shape uses a counting judge to verify all N calls fire and the fused result carries all N distinct outputs.

### embed_text / embed_texts now async (crash fix)

Critical bug: the original shims were sync methods wrapping asyncio.run(self.aembed_query(text)). ragas's BaseRagasEmbeddings declares embed_text and embed_texts as async coroutines, and metric code invokes them with await embeddings.embed_text(...) from inside ragas's evaluation event loop. asyncio.run() inside a running loop crashes with RuntimeError: asyncio.run() cannot be called from a running event loop — a real live-fire crash that would have taken down any Cycle 4 run.

Fix: make both methods async. They now just await the existing aembed_query / aembed_documents paths. The is_async parameter is accepted for signature parity with the base class but ignored — our underlying clients are synchronous and aembed_query already runs them in a worker thread via asyncio.to_thread, so both flag values land on the same code path.
New tests:
- test_embeddings_embed_text_is_async_and_awaitable: asserts both methods are inspect.iscoroutinefunction and round-trips through asyncio.run(embed.embed_text(...)).
- test_embeddings_embed_text_callable_from_running_event_loop: exercises the exact asyncio.run-inside-loop path ragas creates.

### Contract test now checks async/sync parity, not just names

CodeRabbit correctly pointed out that the original contract test compared method names but not their async/sync shape — exactly why the sync embed_text / embed_texts slipped past review.

Fix: two new tests (test_llm_wrapper_async_signature_matches_base and test_embeddings_wrapper_async_signature_matches_base) iterate every public method on the real ragas base classes and assert inspect.iscoroutinefunction matches on both sides. A future release that adds a method in one shape or the other will fail in CI.

### generate() default temperature: None -> 0.01

BaseRagasLLM.generate's default is 0.01. Our shim had None, which is functionally equivalent (None triggers get_temperature()) but diverges from the base signature. Changed to 0.01 to match. Tests that exercise the None fallback still pass None explicitly.

### Tightened existing temperature-fallback test

The original test asserted "doesn't raise" but would have passed even if get_temperature() was never called. Now patches llm.get_temperature with wraps= and asserts assert_called_once_with(1). Added an inverse test that patches get_temperature and asserts it was NOT called when an explicit temperature is passed — guards against over-eager fallback firing.

### Narrowed e2e smoke test exception handler

The original except Exception accepted any non-AttributeError as a pass. A bad shim raising TypeError or a nested-loop bug raising RuntimeError would have counted as success. Tightened to allow only RagasOutputParserException (expected downstream parser failure from the stub judge's canned JSON); everything else propagates.
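The n > 1 fan-out and fuse described above can be sketched with plain dicts standing in for the LLMResult shape (the real adapter builds langchain LLMResult objects when available; `agenerate_one` is an illustrative stand-in for agenerate_text):

```python
import asyncio


async def agenerate_one(prompt, temperature):
    """Stand-in for a single agenerate_text call (hypothetical judge)."""
    await asyncio.sleep(0)
    return {"generations": [[f"sample@{temperature}"]]}


def fuse(results):
    # flatten each result's single prompt-row into one [[gen1..genN]] row
    return {"generations": [[g for r in results for g in r["generations"][0]]]}


async def generate(prompt, n=1, temperature=None):
    if temperature is None:
        temperature = 0.01 if n == 1 else 0.3  # ragas temperature convention
    if n == 1:
        return await agenerate_one(prompt, temperature)
    # fan out n independent calls and fuse into one multi-sample result
    results = await asyncio.gather(
        *(agenerate_one(prompt, temperature) for _ in range(n))
    )
    return fuse(results)


fused = asyncio.run(generate("q", n=3))
```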
Actionable comments posted: 2
♻️ Duplicate comments (2)
packages/evaluator/tests/test_ragas_adapters.py (1)

96-101: ⚠️ Potential issue | 🟡 Minor
Assert the resolved temperature is forwarded into agenerate_text. This still passes if generate() calls get_temperature() and then ignores its return value before invoking agenerate_text(). Spy on agenerate_text and assert the captured temperature matches the patched fallback.

Suggested tightening:

```diff
 judge = FakeJudge(response="ok")
 llm = RagForgeRagasLLM(judge=judge, refusal_aware=False)
-with patch.object(
-    llm, "get_temperature", wraps=llm.get_temperature
-) as get_temp:
+captured: dict[str, float | None] = {}
+original = llm.agenerate_text
+
+async def spy(*args, **kwargs):
+    captured["temperature"] = kwargs["temperature"]
+    return await original(*args, **kwargs)
+
+with patch.object(llm, "get_temperature", return_value=0.42) as get_temp, patch.object(
+    llm, "agenerate_text", side_effect=spy
+):
     result = asyncio.run(llm.generate("prompt", n=1, temperature=None))
 assert _llm_result_text(result) == "ok"
 get_temp.assert_called_once_with(1)
+assert captured["temperature"] == 0.42
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/evaluator/tests/test_ragas_adapters.py` around lines 96 - 101: the test currently only verifies get_temperature was called but not that its resolved value is passed to the async implementation; spy on llm.agenerate_text, capture the temperature argument used when llm.generate("prompt", n=1, temperature=None) is invoked, and assert the captured temperature equals the patched fallback value, in addition to keeping the existing get_temperature.assert_called_once_with(1) and result text assertion.

packages/evaluator/tests/test_ragas_adapters_contract.py (1)
46-52: ⚠️ Potential issue | 🟠 Major
Compare signatures, not just method names. These tests still pass if a wrapper keeps the right method names but drops callbacks, changes defaults, or replaces a positional signature with **kwargs. That drift only shows up later as TypeError inside ragas. Add an inspect.signature(...) comparison after stripping self.

```bash
#!/bin/bash
python - <<'PY'
import inspect

from rag_forge_evaluator.engines.ragas_adapters import (
    RagForgeRagasEmbeddings,
    RagForgeRagasLLM,
)
from ragas.embeddings.base import BaseRagasEmbeddings
from ragas.llms.base import BaseRagasLLM

def public_methods(cls):
    return sorted(
        name for name in dir(cls)
        if not name.startswith("_") and callable(getattr(cls, name))
    )

def normalized_signature(cls, name):
    sig = inspect.signature(getattr(cls, name))
    params = [
        (p.name, p.kind, p.default)
        for p in sig.parameters.values()
        if p.name != "self"
    ]
    return params

for base_cls, wrapper_cls in (
    (BaseRagasLLM, RagForgeRagasLLM),
    (BaseRagasEmbeddings, RagForgeRagasEmbeddings),
):
    print(f"\n{base_cls.__name__} vs {wrapper_cls.__name__}")
    for name in public_methods(base_cls):
        if not hasattr(wrapper_cls, name):
            print("MISSING", name)
            continue
        base_sig = inspect.signature(getattr(base_cls, name))
        wrapper_sig = inspect.signature(getattr(wrapper_cls, name))
        if normalized_signature(base_cls, name) != normalized_signature(wrapper_cls, name):
            print("DIFF", name, base_sig, wrapper_sig)
PY
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/evaluator/tests/test_ragas_adapters_contract.py` around lines 46 - 52: the test helper _public_methods should be expanded to assert that wrappers preserve method signatures, not just names: for each public method on the base classes (BaseRagasLLM, BaseRagasEmbeddings) that exists on the wrapper classes (RagForgeRagasLLM, RagForgeRagasEmbeddings), compute inspect.signature(...) for both, normalize by removing the 'self' parameter and comparing (name, kind, default) tuples for each parameter, and fail the test if any signature differs (covers changed defaults, added **kwargs, positional->kw-only drift, missing callbacks, etc.).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py`:
- Around line 131-148: The current _fuse_llm_results function silently returns
results[0] on any fusion error, collapsing n>1 samples into one; change this so
we do NOT degrade silently: on AttributeError or IndexError (malformed result
shapes) raise a ValueError with a concise diagnostic (include len(results) and
the problematic result type/repr) and do not return a single sample, and on
ImportError (missing langchain) re-raise or allow the ImportError to bubble so
the soft-dependency path is explicit rather than silently returning results[0];
update the except block accordingly and remove the fallback return results[0]
behavior.
In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 128-150: The CountingJudge's judge method mutates shared state
counter["i"] without synchronization causing race conditions when llm.generate
fans out to threads; make the counter increment atomic by adding a lock (e.g.,
threading.Lock) as a member of CountingJudge and acquiring it around
counter["i"] += 1, or replace the shared dict with a thread-safe atomic counter
(e.g., itertools.count or threading-safe integer wrapper) and update
CountingJudge.judge to use that safe increment so tests for n>1 become
deterministic.
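The locked-counter fix this comment asks for, in miniature; `CountingJudge` and the fan-out here are simplified stand-ins for the test's real classes:

```python
import asyncio
import threading


class CountingJudge:
    """Thread-safe call counter (sketch of the fix the review asks for)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.calls = 0

    def judge(self, prompt):
        with self._lock:  # make the read-modify-write atomic across threads
            self.calls += 1
            label = self.calls
        return f"sample-{label}"


async def fan_out(judge, n):
    # mimic the adapter: each call runs the sync judge in a worker thread
    return await asyncio.gather(
        *(asyncio.to_thread(judge.judge, "p") for _ in range(n))
    )


judge = CountingJudge()
outputs = asyncio.run(fan_out(judge, 8))
```

With the lock, every label is distinct and the total is exact, so the n > 1 test is deterministic under concurrent fan-out.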
---
Duplicate comments:
In `@packages/evaluator/tests/test_ragas_adapters_contract.py`:
- Around line 46-52: The test helper _public_methods should be expanded to
assert that wrappers preserve method signatures, not just names: for each public
method on the base classes (BaseRagasLLM, BaseRagasEmbeddings) that exists on
the wrapper classes (RagForgeRagasLLM, RagForgeRagasEmbeddings), compute
inspect.signature(...) for both, normalize by removing the 'self' parameter and
comparing (name, kind, default) tuples for each parameter, and fail the test if
any signature differs (covers changed defaults, added **kwargs,
positional->kw-only drift, missing callbacks, etc.).
In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 96-101: The test currently only verifies get_temperature was
called but not that its resolved value is passed to the async implementation;
update the test to spy on the internal agenerate_text call (e.g., patch or mock
llm.agenerate_text) and capture the temperature argument used by agenerate_text
when llm.generate("prompt", n=1, temperature=None) is invoked; assert that the
captured temperature equals the patched fallback value (1) in addition to
keeping the existing get_temperature.assert_called_once_with(1) and result text
assertion so you validate the resolved temperature is forwarded into
agenerate_text.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: aee68e59-0676-4965-af91-a54a52be113b
📒 Files selected for processing (4)
- packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
- packages/evaluator/tests/test_ragas_adapters.py
- packages/evaluator/tests/test_ragas_adapters_contract.py
- packages/evaluator/tests/test_ragas_adapters_e2e.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: RAG Quality Gate
- GitHub Check: Lint, Typecheck & Test
```python
    If fusion fails (mixed result shapes, missing langchain, etc.) we
    fall back to returning the first result — ragas will then see a
    single generation, which is strictly worse than N but still
    correct shape. The ``n>1`` path is rare in stock ragas metrics so
    conservative fallback is preferable to crashing.
    """
    if not results:
        msg = "_fuse_llm_results called with empty results list"
        raise ValueError(msg)
    if len(results) == 1:
        return results[0]
    try:
        from langchain_core.outputs import LLMResult as _LLMResult

        fused_generations = [gen for r in results for gen in r.generations[0]]
        return _LLMResult(generations=[fused_generations])
    except (ImportError, AttributeError, IndexError):
        return results[0]
```
Don't silently collapse n>1 into a single sample on fuse failure.
Falling back to results[0] here turns a requested multi-sample generation into a single-sample result with no signal. That both skews downstream ragas scoring and breaks the advertised soft-dependency path because _StringLLMResult never actually gets fused. Handle the stub case explicitly, and raise on malformed result shapes instead of degrading silently.
Proposed fix:

```diff
 try:
     from langchain_core.outputs import LLMResult as _LLMResult
     fused_generations = [gen for r in results for gen in r.generations[0]]
     return _LLMResult(generations=[fused_generations])
-except (ImportError, AttributeError, IndexError):
-    return results[0]
+except ImportError:
+    fused = _StringLLMResult(results[0].generations[0][0].text)
+    fused.generations = [[gen for r in results for gen in r.generations[0]]]
+    return fused
+except (AttributeError, IndexError, TypeError) as exc:
+    msg = "Cannot fuse malformed LLMResult payloads"
+    raise ValueError(msg) from exc
```
Verify each finding against the current code and only fix it if needed.
In `@packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py` around
lines 131 - 148, The current _fuse_llm_results function silently returns
results[0] on any fusion error, collapsing n>1 samples into one; change this so
we do NOT degrade silently: on AttributeError or IndexError (malformed result
shapes) raise a ValueError with a concise diagnostic (include len(results) and
the problematic result type/repr) and do not return a single sample, and on
ImportError (missing langchain) re-raise or allow the ImportError to bubble so
the soft-dependency path is explicit rather than silently returning results[0];
update the except block accordingly and remove the fallback return results[0]
behavior.
The n>1 fan-out fix in the previous commit built a fused LLMResult via langchain_core.outputs.LLMResult. On CI without the [ragas] extra installed, langchain isn't available and generate_text falls back to the _StringLLMResult stub — which _fuse_llm_results had no fuse path for, so it hit the ImportError fallback and returned results[0] (a single generation). The n>1 test then asserted inner length == 3 and failed with 1.

Fix: separate the flatten step (build the fused_generations list) from the wrap step (pick LLMResult or _StringLLMResult). The flatten works on either shape because both expose the same .generations[0] interface. Wrap picks langchain if available, otherwise constructs via a new _StringLLMResult._from_generations alt constructor that carries a pre-fused list.

Local verification:
- without [ragas]: 18/18 adapter tests pass (contract + e2e skip)
- with [ragas]: 28/28 adapter tests pass
Three more findings from the second CodeRabbit review on PR #36 / #38.

### _fuse_llm_results now fails loud on malformed input

The previous commit's AttributeError/IndexError fallback silently returned results[0] — turning a requested n>1 generation into a single-sample result with no signal. Downstream ragas metrics that rely on sample diversity would consume the degraded result as if it were a valid N-sample fuse. CodeRabbit rightly flagged this as hiding real correctness bugs.

Fix: raise ValueError with the observed result types and the underlying exception chained. Added a targeted test (test_fuse_llm_results_raises_on_malformed_input) covering three malformed-input shapes: objects without .generations, empty outer lists, and empty input lists.

### CountingJudge in n>1 test is now thread-safe

RagForgeRagasLLM.generate fans out via asyncio.gather, and each agenerate_text call runs in a worker thread via asyncio.to_thread. counter["i"] += 1 is not atomic across threads — the read-modify-write race would produce duplicate sample labels or an undercounted total, flaking the test. Added a threading.Lock around the increment. The test is now deterministic under concurrent fan-out.

### Contract tests now check parameter parity, not just name + async

Previously we asserted that each public method on BaseRagasLLM / BaseRagasEmbeddings exists on our wrapper and matches async/sync. That missed parameter drift — a future ragas release that adds max_tokens to generate_text would silently break our wrapper until a user audit caught it.

Fix: two new tests (test_llm_wrapper_parameter_names_cover_base_class and test_embeddings_wrapper_parameter_names_cover_base_class) use inspect.signature to enumerate the named parameters on each base-class method and assert the wrapper accepts every name. Variadic *args / **kwargs are ignored (they can absorb any kwarg by definition). A helper _required_param_names() shares the logic.
Local: 30 passed on adapter + contract + e2e suites with ragas installed, 18 passed on adapter suite alone without the [ragas] extra. Ruff + mypy clean.
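The parameter-parity check described above can be sketched as follows; `required_param_names` mirrors the described _required_param_names helper, and the `*_gen` functions are illustrative, not the real ragas signatures:

```python
import inspect


def required_param_names(func):
    """Named parameters a wrapper must accept; self/cls and variadic
    *args/**kwargs are skipped (sketch of the described helper)."""
    skip = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    return {
        p.name for p in inspect.signature(func).parameters.values()
        if p.name not in ("self", "cls") and p.kind not in skip
    }


def wrapper_covers_base(base_method, wrapper_method):
    wparams = inspect.signature(wrapper_method).parameters.values()
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in wparams):
        return True  # **kwargs absorbs any keyword by definition
    # every named base parameter must be accepted by the wrapper
    return required_param_names(base_method) <= required_param_names(wrapper_method)


# illustrative signatures (not the ragas ones)
def base_gen(self, prompt, n=1, temperature=0.01, callbacks=None): ...
def good_gen(self, prompt, n=1, temperature=0.01, callbacks=None, extra=None): ...
def bad_gen(self, prompt, n=1): ...  # drops temperature / callbacks
def kwargs_gen(self, prompt, **kwargs): ...  # **kwargs covers the rest
```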
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/evaluator/tests/test_ragas_adapters.py`:
- Around line 163-186: Add a test case to cover the scenario where each result
has an empty inner generations list so _fuse_llm_results raises a ValueError for
malformed input; specifically, in the test function
test_fuse_llm_results_raises_on_malformed_input add a class (e.g., EmptyInner)
whose self.generations is [[]] and assert that calling
_fuse_llm_results([EmptyInner(), EmptyInner()]) raises ValueError with the
"malformed result" message, referencing the existing _fuse_llm_results import to
locate the code under test.
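A sketch of the fail-loud fuse plus the malformed-input cases this prompt lists; `fuse_llm_results` here returns a bare nested list rather than an LLMResult, purely for illustration:

```python
def fuse_llm_results(results):
    """Sketch of the fail-loud fuse: raise on malformed shapes instead
    of silently degrading to a single sample."""
    if not results:
        raise ValueError("fuse called with empty results list")
    try:
        fused = [g for r in results for g in r.generations[0]]
    except (AttributeError, IndexError) as exc:
        types = [type(r).__name__ for r in results]
        raise ValueError(f"malformed result among {types}") from exc
    if not fused:
        raise ValueError("malformed result: empty inner generations")
    return [fused]


class Good:
    generations = [["g1"]]


class EmptyInner:
    generations = [[]]  # the empty-inner-list case the prompt asks to cover


class NoGenerations:
    pass  # no .generations attribute at all
```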
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 26dab04a-f0e5-4552-aac3-edde6d722f3d
📒 Files selected for processing (3)
- packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py
- packages/evaluator/tests/test_ragas_adapters.py
- packages/evaluator/tests/test_ragas_adapters_contract.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: RAG Quality Gate
- GitHub Check: Lint, Typecheck & Test
🔇 Additional comments (15)
packages/evaluator/tests/test_ragas_adapters.py (4)

15-26: LGTM! The helper correctly normalizes assertions across environments by handling both raw strings and LLMResult-like objects. The lack of defensive checks for empty generations is acceptable in test code — tests should fail loudly on malformed responses.

83-101: LGTM! The test now correctly verifies the fallback path by patching get_temperature with wraps and asserting it was called with the expected argument. This addresses the prior feedback about asserting the actual behavior rather than just absence of exceptions.

117-160: LGTM! The test correctly validates the n>1 generation shape and uses thread-safe counting as recommended in prior review. The assertions cover the exact [[gen1, gen2, gen3]] shape ragas expects.

189-233: LGTM! The tests thoroughly validate the async contract for embed_text and embed_texts, including the regression test that exercises calling from within a running event loop — exactly the scenario that crashed with the original asyncio.run() implementation.

packages/evaluator/tests/test_ragas_adapters_contract.py (5)
46-52: LGTM!The helper correctly enumerates public callables including inherited methods, which is the right behavior for contract testing against base classes.
149-172: LGTM!The helper correctly extracts named parameters while excluding
self/clsand variadic catchalls. This enables the parameter-parity tests to detect missing keyword arguments that would causeTypeErrorat runtime.
101-146: LGTM! The async/sync parity tests correctly use `inspect.iscoroutinefunction` to validate that wrapper methods match the base class's async nature. This would have caught the original sync `embed_text` implementation that crashed in async contexts.
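The core of such a parity check can be sketched with hypothetical stand-in classes (the real test iterates the actual ragas base classes and the RAG-Forge wrappers):

```python
import inspect


# Stand-ins: a base declaring an async method and two candidate wrappers.
class Base:
    async def embed_text(self, text: str): ...
    def set_run_config(self, run_config): ...


class GoodWrapper:
    async def embed_text(self, text: str): ...
    def set_run_config(self, run_config): ...


class BadWrapper:  # sync embed_text -- the shape of the original bug
    def embed_text(self, text: str): ...
    def set_run_config(self, run_config): ...


def async_parity_mismatches(base: type, wrapper: type) -> list[str]:
    """Names of public base methods that are missing on the wrapper
    or whose async/sync shape differs from the base."""
    bad = []
    for name, member in inspect.getmembers(base, callable):
        if name.startswith("_"):
            continue
        counterpart = getattr(wrapper, name, None)
        if counterpart is None or (
            inspect.iscoroutinefunction(member)
            != inspect.iscoroutinefunction(counterpart)
        ):
            bad.append(name)
    return bad
```

A name-only check passes `BadWrapper`; the parity check flags it.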
175-215: LGTM! The parameter-name coverage tests address the prior feedback about signature parity. They correctly check that the wrapper accepts all parameters the base class declares, preventing `TypeError` at runtime when ragas passes keyword arguments.
218-254: LGTM! The instance-level tests comprehensively validate the specific shim behaviors: async `generate`, the `is_finished` default, the temperature convention matching ragas, and `set_run_config` persistence. Good coverage of the new v0.2.2 contract surface.

packages/evaluator/src/rag_forge_evaluator/engines/ragas_adapters.py (6)
119-170: LGTM! The implementation correctly addresses prior feedback by raising `ValueError` with descriptive messages on malformed input instead of silently falling back to `results[0]`. The error message includes the types of all result objects, aiding debugging.
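A minimal sketch of that fail-loud fuse, with a hypothetical stand-in result type (the real `_fuse_llm_results` handles both langchain's `LLMResult` and the `_StringLLMResult` stub):

```python
def fuse_llm_results(results):
    """Flatten per-call results (each shaped [[gen]]) into one
    [[gen1..genN]] row; raise ValueError on malformed shapes instead
    of silently returning results[0]."""
    if not results:
        raise ValueError("empty results list")
    fused = []
    try:
        for r in results:
            fused.append(r.generations[0][0])
    except (AttributeError, IndexError) as exc:
        types = [type(r).__name__ for r in results]
        raise ValueError(f"malformed result objects: {types}") from exc
    return [fused]


class FakeResult:
    # Stand-in for a single-generation result: generations == [[gen]]
    def __init__(self, text):
        self.generations = [[text]]
```

Chaining the original exception (`from exc`) keeps the underlying `AttributeError`/`IndexError` visible in the traceback.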
267-341: LGTM! The `generate` implementation correctly handles both `n=1` and `n>1` cases, resolves temperature via `get_temperature(n)` when `None`, and uses `asyncio.gather` for parallel fan-out. The defensive `is_finished` check and documented rationale for skipping `add_async_retry` show thoughtful design.
393-408: LGTM! The factory method correctly constructs a fused `_StringLLMResult` stub for the no-langchain path, maintaining the `[[gen1..genN]]` shape and preserving `str()` / `__eq__` usability via `_text`.
544-571: LGTM! The methods are correctly implemented as `async def` coroutines, addressing the prior feedback about `asyncio.run()` crashing inside running event loops. The ignored `is_async` parameter is well-documented — both code paths converge on `asyncio.to_thread` anyway.
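The pattern is small enough to sketch; `_sync_embed` is a hypothetical stand-in for the synchronous embeddings client, not RAG-Forge's real code:

```python
import asyncio


def _sync_embed(text: str) -> list[float]:
    # Hypothetical synchronous embeddings client (stand-in).
    return [float(len(text))]


async def embed_text(text: str, is_async: bool = True) -> list[float]:
    """Async shim: safe to await from inside a running event loop,
    unlike a sync method wrapping asyncio.run(). ``is_async`` is
    accepted for signature parity but ignored -- both flag values run
    the sync client in a worker thread via asyncio.to_thread."""
    return await asyncio.to_thread(_sync_embed, text)
```

Because the coroutine only awaits, callers already inside an event loop (as ragas's metric code is) never trigger the `asyncio.run() cannot be called from a running event loop` crash.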
343-377: LGTM! The helper methods correctly implement the ragas contract: `is_finished` conservatively returns `True` (a documented limitation with a clear upgrade path), `get_temperature` matches ragas's convention, and `set_run_config` stores the value for contract compatibility.
10-58: LGTM! Excellent documentation of the ragas 0.4.x contract surface and the v0.2.0 → v0.2.2 evolution. The explicit listing of required methods and the rationale for duck-typing will help future maintainers understand the design constraints.
```python
def test_fuse_llm_results_raises_on_malformed_input():
    """``_fuse_llm_results`` must fail loud on malformed result shapes
    instead of silently collapsing to ``results[0]``. CodeRabbit on
    PR #36 round 2 pointed out that a silent fallback hides real
    ``n > 1`` correctness bugs — a single returned sample skews
    downstream ragas metrics with no signal.
    """
    from rag_forge_evaluator.engines.ragas_adapters import _fuse_llm_results

    class NotAnLLMResult:
        pass

    with pytest.raises(ValueError, match="malformed result"):
        _fuse_llm_results([NotAnLLMResult(), NotAnLLMResult()])

    class EmptyOuter:
        def __init__(self) -> None:
            self.generations: list[list[object]] = []

    with pytest.raises(ValueError, match="malformed result"):
        _fuse_llm_results([EmptyOuter(), EmptyOuter()])

    with pytest.raises(ValueError, match="empty results list"):
        _fuse_llm_results([])
```
🧹 Nitpick | 🔵 Trivial
Consider adding a test for empty inner generations.
The test covers a missing `.generations`, an empty outer list, and an empty results list. However, `_fuse_llm_results` also catches `IndexError` when `r.generations[0]` exists but is empty. Adding this case would complete coverage:
```python
class EmptyInner:
    def __init__(self) -> None:
        self.generations: list[list[object]] = [[]]


with pytest.raises(ValueError, match="malformed result"):
    _fuse_llm_results([EmptyInner(), EmptyInner()])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/evaluator/tests/test_ragas_adapters.py` around lines 163 - 186, add
a test case to cover the scenario where each result has an empty inner
generations list so _fuse_llm_results raises a ValueError for malformed input;
specifically, in the test function
test_fuse_llm_results_raises_on_malformed_input add a class (e.g., EmptyInner)
whose self.generations is [[]] and assert that calling
_fuse_llm_results([EmptyInner(), EmptyInner()]) raises ValueError with the
"malformed result" message, referencing the existing _fuse_llm_results import to
locate the code under test.
…3) (#38)

* fix(evaluator): add missing ragas BaseRagasLLM methods (G1)

RAG-Forge v0.2.1's RagForgeRagasLLM wrapper was shipped without the concrete .generate() method that ragas 0.4.x's BaseRagasLLM exposes. Every RAGAS metric job crashed on first contact with AttributeError: 'RagForgeRagasLLM' object has no attribute 'generate' during the PearMedica Cycle 3 audit (2026-04-15), producing a Scored: 0, Overall: 0.0000 report across all 48 metric evaluations.

Root cause: the adapter's design doc asserted "ragas only calls generate_text / agenerate_text / model_name" and deliberately avoided subclassing BaseRagasLLM to keep ragas as a soft dependency. That assumption was wrong — BaseRagasLLM.generate is a concrete async helper that ragas's metric code invokes on every LLM regardless of subclass status. The duck-typed wrapper must re-declare it.

Fix preserves the no-hard-import design by adding duck-typed shims for every public method on BaseRagasLLM and BaseRagasEmbeddings:

RagForgeRagasLLM:
- async generate() — the specific shim Cycle 3 caught missing
- is_finished() — was abstract on base, returns True
- get_temperature(n) — matches base convention (0.01 / 0.3)
- set_run_config() — stores ragas RunConfig for compatibility
- run_config attribute — defaults to None

RagForgeRagasEmbeddings:
- embed_text(is_async) — dispatch helper on base class
- embed_texts(is_async) — batch variant
- set_run_config()
- run_config attribute

Tests: a contract test iterates every public method on the real ragas base classes and asserts our wrappers declare a callable of the same name, so the next ragas release that adds a method fails in CI, not in a user audit. An end-to-end smoke test runs ragas.evaluate() against our wrappers on a 1-sample dataset and asserts it never raises an AttributeError naming our wrapper classes — the exact regression signature from Cycle 3.

Three pre-existing test failures in test_cycle2_regression.py and test_evaluator_factory.py are unrelated to this change (pre-existing in v0.2.1 when ragas is installed; they assert on compound conditions that break when MockJudge triggers the skip path). They belong in workstream G3 (skip counter plumbing) and the G1 PR leaves them alone.

* fix(evaluator): address CodeRabbit review on G1 adapter shims

Five fixes from CodeRabbit review on PR #36. All were real issues the original contract test missed because it only checked method names, not signatures.

### generate(n > 1) now produces the correct LLMResult shape

ragas uses n > 1 for multi-sample metrics (answer_correctness consistency checks, etc.) and expects LLMResult.generations shaped [[gen1, gen2, ..., genN]] — one prompt run with N candidate generations. The original shim ignored n and returned a single generation, silently breaking any metric that relied on sample diversity.

Fix: when n > 1, fan out n independent agenerate_text calls via asyncio.gather and fuse the per-call results into a single [[gen1..genN]]-shaped LLMResult via the new _fuse_llm_results helper. n == 1 (the common case) stays a single call.

New test: test_wrapper_async_generate_n_greater_than_one_produces_fused_shape uses a counting judge to verify all N calls fire and the fused result carries all N distinct outputs.

### embed_text / embed_texts now async (crash fix)

Critical bug: the original shims were sync methods wrapping asyncio.run(self.aembed_query(text)). ragas's BaseRagasEmbeddings declares embed_text and embed_texts as async coroutines, and metric code invokes them with await embeddings.embed_text(...) from inside ragas's evaluation event loop. asyncio.run() inside a running loop crashes with RuntimeError: asyncio.run() cannot be called from a running event loop — a real live-fire crash that would have taken down any Cycle 4 run.

Fix: make both methods async. They now just await the existing aembed_query / aembed_documents paths. The is_async parameter is accepted for signature parity with the base class but ignored — our underlying clients are synchronous and aembed_query already runs them in a worker thread via asyncio.to_thread, so both flag values land on the same code path.

New tests:
- test_embeddings_embed_text_is_async_and_awaitable: asserts both methods are inspect.iscoroutinefunction and round-trips through asyncio.run(embed.embed_text(...)).
- test_embeddings_embed_text_callable_from_running_event_loop: exercises the exact asyncio.run-inside-loop path ragas creates.

### Contract test now checks async/sync parity, not just names

CodeRabbit correctly pointed out that the original contract test compared method names but not their async/sync shape — exactly why the sync embed_text / embed_texts slipped past review.

Fix: two new tests (test_llm_wrapper_async_signature_matches_base and test_embeddings_wrapper_async_signature_matches_base) iterate every public method on the real ragas base classes and assert inspect.iscoroutinefunction matches on both sides. A future release that adds a method in one shape or the other will fail in CI.

### generate() default temperature: None -> 0.01

BaseRagasLLM.generate's default is 0.01. Our shim had None, which is functionally equivalent (None triggers get_temperature()) but diverges from the base signature. Changed to 0.01 to match. Tests that exercise the None fallback still pass None explicitly.

### Tightened existing temperature-fallback test

The original test asserted "doesn't raise" but would have passed even if get_temperature() was never called. Now patches llm.get_temperature with wraps= and asserts assert_called_once_with(1). Added an inverse test that patches get_temperature and asserts it was NOT called when an explicit temperature is passed — guards against over-eager fallback firing.

### Narrowed e2e smoke test exception handler

The original except Exception accepted any non-AttributeError as a pass. A bad shim raising TypeError or a nested-loop bug raising RuntimeError would have counted as success. Tightened to allow only RagasOutputParserException (expected downstream parser failure from the stub judge's canned JSON); everything else propagates.

* fix(evaluator): fuse _StringLLMResult stubs in n>1 path for no-ragas CI

The n>1 fan-out fix in the previous commit built a fused LLMResult via langchain_core.outputs.LLMResult. On CI without the [ragas] extra installed, langchain isn't available and generate_text falls back to the _StringLLMResult stub — which _fuse_llm_results had no fuse path for, so it hit the ImportError fallback and returned results[0] (a single generation). The n>1 test then asserted inner length == 3 and failed with 1.

Fix: separate the flatten step (build the fused_generations list) from the wrap step (pick LLMResult or _StringLLMResult). The flatten works on either shape because both expose the same .generations[0] interface. Wrap picks langchain if available, otherwise constructs via a new _StringLLMResult._from_generations alt constructor that carries a pre-fused list.

Local verification:
- without [ragas]: 18/18 adapter tests pass (contract + e2e skip)
- with [ragas]: 28/28 adapter tests pass

* fix(evaluator): address CodeRabbit round-2 findings on G1

Three more findings from the second CodeRabbit review on PR #36 / #38.

### _fuse_llm_results now fails loud on malformed input

The previous commit's AttributeError/IndexError fallback silently returned results[0] — turning a requested n>1 generation into a single-sample result with no signal. Downstream ragas metrics that rely on sample diversity would consume the degraded result as if it were a valid N-sample fuse. CodeRabbit rightly flagged this as hiding real correctness bugs.

Fix: raise ValueError with the observed result types and the underlying exception chained. Added a targeted test (test_fuse_llm_results_raises_on_malformed_input) covering three malformed-input shapes: objects without .generations, empty outer lists, and empty input lists.

### CountingJudge in n>1 test is now thread-safe

RagForgeRagasLLM.generate fans out via asyncio.gather, and each agenerate_text call runs in a worker thread via asyncio.to_thread. counter["i"] += 1 is not atomic across threads — the read-modify-write race would produce duplicate sample labels or an undercounted total, flaking the test. Added a threading.Lock around the increment. The test is now deterministic under concurrent fan-out.

### Contract tests now check parameter parity, not just name + async

Previously we asserted that each public method on BaseRagasLLM / BaseRagasEmbeddings exists on our wrapper and matches async/sync. That missed parameter drift — a future ragas release that adds max_tokens to generate_text would silently break our wrapper until a user audit caught it.

Fix: two new tests (test_llm_wrapper_parameter_names_cover_base_class and test_embeddings_wrapper_parameter_names_cover_base_class) use inspect.signature to enumerate the named parameters on each base-class method and assert the wrapper accepts every name. Variadic *args / **kwargs are ignored (they can absorb any kwarg by definition). A helper _required_param_names() shares the logic.

Local: 30 passed on adapter + contract + e2e suites with ragas installed, 18 passed on the adapter suite alone without the [ragas] extra. Ruff + mypy clean.

* fix(evaluator): route RAGAS exceptions into Skipped counter (G3)

Cycle 3's PearMedica audit (2026-04-15) documented two residual skip-handling gaps in v0.2.1, separate from the C3-2 generate() AttributeError:

1. EvaluationResult.skipped_evaluations (the integer counter the report's TL;DR "Skipped: N" line reads) was never set by RagasEvaluator. skipped_samples held real SkipRecords, but the counter stayed at 0. Users reading the top-level summary saw "Scored: 0, Skipped: 0" and thought nothing had happened when in reality every job had crashed.

2. Individual-metric extraction failures created one SkipRecord per metric name with sample_id="<aggregate>". A 12-sample x 4-metric run that failed entirely produced 4 records instead of 48 — the blast radius was under-reported by 12x.

Fix:
- Extract skip-record creation into _fan_out_skip_records() — fans a single coarse failure out into one record per (sample, metric) pair with the real sample_id attached.
- Both the whole-batch crash path and the per-metric extraction failure path now use it.
- Set EvaluationResult.skipped_evaluations = len(skipped_samples) on both return sites so the counter and the detail list can never drift apart silently again.
- Truncate reason strings to 400 chars (with trailing ellipsis) so long Python tracebacks don't blow up HTML/PDF rendering downstream.

Tests:
- test_ragas_skip_counter.py — three new tests covering: (a) the skipped_evaluations counter equals len(skipped_samples), (b) whole-batch crashes fan out to sample_count * metric_count records with real sample_ids (not "<aggregate>"), (c) reason truncation at 400 chars.
- test_cycle2_regression.py — updated two assertions that pre-dated the fan-out change. The original assertions demanded "metrics must be populated"; with fan-out, a MockJudge run legitimately produces zero scored metrics and a full skip list. Tests now assert the absence of Finding #4 (embed_query AttributeError) and Finding #5 (InstructorRetryException / max_tokens truncation) signatures in the skip records instead, which is the actual regression guard.
- test_evaluator_factory.py — gate test_ragas_not_installed_raises_on_evaluate with a reverse importorskip so it runs only on systems without the [ragas] extra. CI matrices with ragas installed now skip it instead of failing.

Pre-existing mypy arg-type errors on the ragas_evaluate() kwargs are silenced with targeted # type: ignore comments and a docstring note pointing readers at the adapter contract test as the real interface guard — duck typing is the deliberate design, not an oversight.

* fix(evaluator): tolerate cross-environment mypy on ragas_evaluate ignores

The # type: ignore[arg-type] comments on the ragas_evaluate() kwargs only fire when mypy can see ragas's real type signatures — i.e. when the [ragas] extra is installed (local dev). On CI without the extra, mypy falls back to Any and flags the ignores as [unused-ignore]. Adding unused-ignore to the ignore codes tells mypy to tolerate the comment when it has nothing to suppress. Both environments are now clean: local mypy with the extra still sees the arg-type mismatch and honours the suppression; CI mypy without the extra silently accepts the comment as dead.

* fix(evaluator): remove dead real_import code in G3 skip counter test

CodeRabbit on PR #38 spotted the leftover real_import = mod.__dict__.copy() (line 96) and _ = real_import (line 111) in test_whole_batch_crash_fans_out_to_every_sample_metric_pair. The dict was copied but never used — an artefact of an earlier monkeypatching approach that got simplified but not cleaned up. Removing it and the unnecessary module-level import that fed it. No behavioural change; the test still passes.
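The G3 fan-out and truncation described above can be sketched as follows; `SkipRecord` here is a minimal hypothetical stand-in for the real record type, and `fan_out_skip_records` is illustrative, not the shipped `_fan_out_skip_records`:

```python
from dataclasses import dataclass


@dataclass
class SkipRecord:
    # Minimal hypothetical record shape; the real one carries more fields.
    sample_id: str
    metric: str
    reason: str


def fan_out_skip_records(sample_ids, metrics, reason, limit=400):
    """Fan one coarse failure out into a record per (sample, metric)
    pair, truncating long reasons (tracebacks) for report rendering."""
    if len(reason) > limit:
        reason = reason[: limit - 1] + "…"
    return [
        SkipRecord(sample_id=s, metric=m, reason=reason)
        for s in sample_ids
        for m in metrics
    ]
```

A 12-sample x 4-metric crash then yields 48 records with real sample_ids, and `skipped_evaluations` can simply be set to the length of this list.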
…ages (#41)

Bumps 0.2.1 -> 0.2.2 on every package that ships to a registry, in lockstep. The v0.2.1 G5 drift guard enforces that the three Python __version__ constants match their pyproject.toml siblings, and the publish.yml verify step refuses to upload if any of the six drifts from the release tag.

Python (pyproject.toml + __init__.py __version__, both sides):
- rag-forge-core 0.2.1 -> 0.2.2
- rag-forge-evaluator 0.2.1 -> 0.2.2
- rag-forge-observability 0.2.1 -> 0.2.2

npm (package.json "version"):
- @rag-forge/cli 0.2.1 -> 0.2.2
- @rag-forge/mcp 0.2.1 -> 0.2.2
- @rag-forge/shared 0.2.1 -> 0.2.2

CHANGELOG: dated the v0.2.2 entry 2026-04-15 (removes "unreleased").

The five v0.2.2 workstreams (G1-G5) already merged to main — see PRs #36-#40. This PR is the final step before cutting the GitHub Release that triggers publish.yml. Once merged, the release workflow:
1. twine check + check-wheel-contents on every Python wheel
2. G1 contract tests (iterate real ragas base classes)
3. G5 __version__ drift guard (auto-discovers packages)
4. OIDC push to PyPI
5. NPM_TOKEN push to npm (pnpm publish --no-git-checks)
Summary
- Adds every public method on `ragas.llms.base.BaseRagasLLM` and `ragas.embeddings.base.BaseRagasEmbeddings` as duck-typed shims on `RagForgeRagasLLM` / `RagForgeRagasEmbeddings`
- Fixes the Cycle 3 crash: `AttributeError: 'RagForgeRagasLLM' object has no attribute 'generate'`

Root cause
The adapter's module docstring claimed "ragas only calls `generate_text` / `agenerate_text` / `model_name`" and deliberately avoided subclassing `BaseRagasLLM` to keep ragas a soft dependency. That assumption was wrong — `BaseRagasLLM.generate` is a concrete async helper that ragas's metric code invokes on every LLM, regardless of subclass status. Because our duck-typed class doesn't inherit from `BaseRagasLLM`, it doesn't inherit `.generate()`, and RAGAS's first metric call crashes.

The fix preserves the no-hard-import design. We implement our own `.generate()` / `is_finished()` / `get_temperature()` / `set_run_config()` shims that forward to the existing `generate_text` / `agenerate_text`. When the contract test was added, it caught two additional missing methods on the embeddings wrapper (`embed_text`, `embed_texts`) — exactly the kind of drift the test exists to prevent.

What v0.2.2 adds on
`RagForgeRagasLLM`

- `async generate(prompt, n, temperature, stop, callbacks) -> LLMResult` — the specific shim Cycle 3 caught missing
- `is_finished(response) -> bool` — returns True; the Judge protocol has no finish signal
- `get_temperature(n) -> float` — matches base convention (0.01 / 0.3)
- `set_run_config(run_config)` — stores ragas `RunConfig` for contract compatibility
- `run_config: Any` attribute defaulting to None

What v0.2.2 adds on
`RagForgeRagasEmbeddings`

- `embed_text(text, is_async)` + `embed_texts(texts, is_async)` — dispatch helpers the base class exposes
- `set_run_config(run_config)` — contract parity with the LLM wrapper

Tests
- `tests/test_ragas_adapters_contract.py` — new, the contract tripwire. Iterates every public method on `BaseRagasLLM` / `BaseRagasEmbeddings` and asserts our wrappers declare them. Also instance-level tests for `generate` being async, `is_finished` returning True, `get_temperature(n)` matching the base convention, and `set_run_config` storing the value.
- `tests/test_ragas_adapters_e2e.py` — new, runs `ragas.evaluate()` with our wrappers on a 1-sample dataset. Fails if ragas hits an `AttributeError` referencing our wrapper classes. This would have caught the C3-2 regression before v0.2.1 shipped.
- `tests/test_ragas_adapters.py` — updated to extract text from real `LLMResult` objects (langchain now pulled in via the `[ragas]` extra), plus new tests for the async `generate` shim and the `temperature=None` default-resolution path.

Local run: 22 passed, 0 failed on the adapter test suite.
Test plan
- `pnpm run typecheck` clean
- `pnpm run test:py` green (filter to `tests/test_ragas_adapters*.py` if faster)
- … `generate`)
- `ragas.evaluate()` without hitting `AttributeError`
- … `[ragas]` extra

Merge order
First of the v0.2.2 series. Plan is G1 → G3 → {G2, G4, G5} in parallel → release prep. G3's branch is already rebased on this one, so merging this unblocks G3 immediately.

Full release plan: `docs/release-notes/v0.2.2-plan.md` (lands with G2).