🔴 Required Information
Is your feature request related to a specific problem?
There is no clean public API for running evaluations programmatically and getting results back in memory. The two available options are:
- `AgentEvaluator`: has `_get_eval_results_by_eval_id`, which runs inference + evaluation and returns `dict[str, list[EvalCaseResult]]` in memory. This is exactly the right abstraction, but:
  - It's a private method (prefixed with `_`)
  - `app_name` is hardcoded to `"test_app"` inside the method
  - It creates its own `InMemoryEvalSetsManager` internally, so you can't pass your own `eval_sets_manager`, `session_service`, `artifact_service`, or `eval_set_results_manager`
  - The public method `evaluate_eval_set` wraps it but immediately runs assertions (`assert not failures`), making it impossible to retrieve results without the pass/fail side effect
- `cli_eval`: a CLI tool that writes results to disk. To use it programmatically, you have to call it, then read results back from the filesystem using `LocalEvalSetResultsManager`, parse the files, and do your own post-processing. This is the workaround we currently use.
`LocalEvalService` itself is actually configurable (accepts custom services, eval sets managers, etc.), but it requires manually constructing `InferenceRequest` / `EvaluateRequest` / `EvaluateConfig` and yields results via async generators. `AgentEvaluator._get_eval_results_by_eval_id` is the convenience wrapper that handles this plumbing; it's just not public or configurable.
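The plumbing in question is essentially: drive the service's async generator and fold the yielded per-case results into a dict keyed by eval case ID. A minimal self-contained sketch of that pattern (`EvalCaseResult` and `fake_evaluate` below are stand-ins defined inline, not the real ADK types):

```python
import asyncio
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EvalCaseResult:  # stand-in for the real ADK result type
    eval_id: str
    score: float


async def fake_evaluate():
    # Stands in for the service's async generator of per-case results;
    # "case_1" appears twice to simulate num_runs > 1.
    for eval_id, score in [("case_1", 0.9), ("case_2", 0.4), ("case_1", 0.8)]:
        yield EvalCaseResult(eval_id=eval_id, score=score)


async def collect_results() -> dict[str, list[EvalCaseResult]]:
    """Group yielded results by eval case ID, as a wrapper would return them."""
    results: dict[str, list[EvalCaseResult]] = defaultdict(list)
    async for case_result in fake_evaluate():
        results[case_result.eval_id].append(case_result)
    return dict(results)


results = asyncio.run(collect_results())
# results maps "case_1" to two results and "case_2" to one
```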
Describe the Solution You'd Like
A public method on `AgentEvaluator` (or a new class) that:
- Returns `EvalCaseResult` objects in memory without running assertions or requiring disk I/O
- Accepts configurable services: at minimum `app_name`, and optionally `session_service`, `artifact_service`, `eval_sets_manager`, `eval_set_results_manager`
- Separates evaluation from assertion, letting the caller decide what to do with results (assert, post-process, export, run additional judges, etc.)
The method signature could look something like:
```python
@staticmethod
async def run_eval(
    agent_module_or_agent: str | BaseAgent,
    eval_set: EvalSet,
    eval_config: EvalConfig,
    app_name: str = "test_app",
    num_runs: int = 1,
    session_service: Optional[BaseSessionService] = None,
    artifact_service: Optional[BaseArtifactService] = None,
    eval_set_results_manager: Optional[EvalSetResultsManager] = None,
) -> dict[str, list[EvalCaseResult]]:
    """Run evaluation and return results grouped by eval case ID."""
```
Impact on your work
We run evaluations programmatically with custom post-processing: running additional LLM judges on results, exporting to Excel, computing custom statistics, and integrating with our own reporting pipeline. All of this requires EvalCaseResult objects in memory.
Currently we call `cli_eval()`, then read results back from disk via `LocalEvalSetResultsManager` to get the `EvalCaseResult` objects that were already in memory during evaluation but were only persisted to files. This roundtrip is unnecessary and fragile: it depends on filesystem conventions, timestamp-based result filtering, and internal directory structures (`_get_eval_history_dir`).
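With results in memory, the post-processing described above becomes a few lines of ordinary Python. As an illustration, here is a custom statistic (per-case pass rate across repeated runs) computed directly from the returned dict; `EvalCaseResult` and its fields are stand-ins, not the real ADK schema:

```python
from dataclasses import dataclass


@dataclass
class EvalCaseResult:  # stand-in for the real ADK result type
    eval_case_id: str
    final_eval_status: str  # simplified: "PASSED" or "FAILED"


def pass_rate_per_case(results: dict[str, list[EvalCaseResult]]) -> dict[str, float]:
    """Example post-processing: fraction of passing runs per eval case."""
    return {
        case_id: sum(r.final_eval_status == "PASSED" for r in runs) / len(runs)
        for case_id, runs in results.items()
    }


# Shape matches what run_eval would return with num_runs=2 for case_1:
results = {
    "case_1": [EvalCaseResult("case_1", "PASSED"), EvalCaseResult("case_1", "FAILED")],
    "case_2": [EvalCaseResult("case_2", "PASSED")],
}
rates = pass_rate_per_case(results)  # {"case_1": 0.5, "case_2": 1.0}
```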
Willingness to contribute
No
🟡 Recommended Information
Describe Alternatives You've Considered
1. Calling `cli_eval()` programmatically and reading results from disk (current approach):
```python
# Run evaluation via CLI tool
cli_eval(["agents/my_app", "evals/my_eval.evalset.json", ...], standalone_mode=False)

# Read results back from disk
results_manager = LocalEvalSetResultsManager(agents_dir="agents")
result_ids = results_manager.list_eval_set_results("my_app")
for result_id in result_ids:
    result = results_manager.get_eval_set_result("my_app", result_id)
    if result.creation_timestamp >= start_time:
        # Now we finally have EvalCaseResult objects to work with
        for case in result.eval_case_results:
            run_custom_judges(case)
```
This works but is fragile: it depends on filesystem paths, timestamp-based filtering to identify which results belong to the current run, and internal `LocalEvalSetResultsManager` directory conventions.
2. Using `LocalEvalService` directly:
This is the most flexible option and actually supports custom services. However, it requires significant boilerplate: manually constructing `InferenceRequest`, `InferenceConfig`, `EvaluateRequest`, `EvaluateConfig`, setting up an `EvalSetsManager`, iterating async generators, and collecting results. `AgentEvaluator._get_eval_results_by_eval_id` already does all of this; it just needs to be public and configurable.
3. Calling `AgentEvaluator._get_eval_results_by_eval_id` directly:
This private method does exactly what we need, but relying on private APIs is not sustainable. It also hardcodes `app_name = "test_app"` and creates its own services internally, so even using it directly doesn't give us the configurability we need.
Proposed API / Implementation
The simplest path would be to refactor the existing `_get_eval_results_by_eval_id` into a public method with configurable parameters:
```python
class AgentEvaluator:

    @staticmethod
    async def run_eval(
        agent_module_or_agent: str | BaseAgent,
        eval_set: EvalSet,
        eval_config: EvalConfig,
        app_name: str = "test_app",
        num_runs: int = 1,
        agent_name: Optional[str] = None,
        session_service: Optional[BaseSessionService] = None,
        artifact_service: Optional[BaseArtifactService] = None,
        eval_set_results_manager: Optional[EvalSetResultsManager] = None,
    ) -> dict[str, list[EvalCaseResult]]:
        """Run evaluation and return results grouped by eval case ID.

        Unlike evaluate_eval_set, this method returns results directly
        without running assertions, allowing callers to implement custom
        post-processing.
        """
        ...
```
Key design choices:
- `agent_module_or_agent`: Accept either a module path (existing behavior) or a `BaseAgent` directly, avoiding the module import convention for users who construct agents programmatically.
- Return type `dict[str, list[EvalCaseResult]]`: Same structure as `_get_eval_results_by_eval_id`, grouped by eval case ID to support `num_runs > 1`.
- No assertions: The caller decides what to do with results.
- Optional services: Default to in-memory implementations (existing behavior) but allow overrides.

`evaluate_eval_set` could then be refactored to call `run_eval` internally, keeping the `assert`-based behavior for backward compatibility.
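That layering could look roughly like the sketch below. Everything here is stubbed so the example is self-contained: `run_eval` is a fake returning canned results, `EvalCaseResult` is a stand-in, and the failure check is simplified relative to whatever `evaluate_eval_set` actually asserts today.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class EvalCaseResult:  # stand-in for the real ADK result type
    eval_case_id: str
    final_eval_status: str  # simplified: "PASSED" or "FAILED"


async def run_eval(eval_set) -> dict[str, list[EvalCaseResult]]:
    # Stub for the proposed public method; the real one would run
    # inference + evaluation and group results by eval case ID.
    return {"case_1": [EvalCaseResult("case_1", "PASSED")]}


async def evaluate_eval_set(eval_set) -> None:
    """Keeps the existing assert-based contract, now layered on run_eval."""
    results = await run_eval(eval_set)
    failures = [
        r
        for runs in results.values()
        for r in runs
        if r.final_eval_status != "PASSED"
    ]
    assert not failures, f"{len(failures)} eval case run(s) failed"


asyncio.run(evaluate_eval_set(eval_set=None))  # no failures in the stub
```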
Additional Context
`LocalEvalService` already supports all the configurability needed (custom services, eval sets managers, etc.). The gap is purely in the public convenience API: `AgentEvaluator` wraps `LocalEvalService` but doesn't expose its flexibility.
- Accepting a `BaseAgent` directly (not just a module path string) would also help users who construct agents programmatically or need to configure them before evaluation (e.g., injecting test dependencies, setting up plugins).
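Supporting `str | BaseAgent` only needs a small resolution step at the top of `run_eval`. A sketch of that dispatch, where `BaseAgent` is an inline stand-in and the `root_agent` module attribute is an assumption about ADK's module convention:

```python
import importlib


class BaseAgent:  # stand-in for the real ADK base class
    pass


def resolve_agent(agent_module_or_agent):
    """Accept either an agent instance or a module path exposing `root_agent`."""
    if isinstance(agent_module_or_agent, BaseAgent):
        # Constructed programmatically (test dependencies injected, plugins
        # configured, etc.): use the instance as-is.
        return agent_module_or_agent
    # Existing behavior: treat the string as an importable module path
    # whose `root_agent` attribute holds the agent (assumed convention).
    module = importlib.import_module(agent_module_or_agent)
    return module.root_agent


# A programmatically constructed agent is returned unchanged:
agent = BaseAgent()
assert resolve_agent(agent) is agent
```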