
Provide a public programmatic API for running evaluations and retrieving results in memory #4534

@lucasbarzotto-axonify

Description

🔴 Required Information

Is your feature request related to a specific problem?

There is no clean public API for running evaluations programmatically and getting results back in memory. The two available options are:

  1. AgentEvaluator has _get_eval_results_by_eval_id, which runs inference + evaluation and returns dict[str, list[EvalCaseResult]] in memory. This is exactly the right abstraction, but:

    • It's a private method (prefixed with _)
    • app_name is hardcoded to "test_app" inside the method
    • It creates its own InMemoryEvalSetsManager internally, so you can't pass your own eval_sets_manager, session_service, artifact_service, or eval_set_results_manager
    • The public method evaluate_eval_set wraps it but immediately runs assertions (assert not failures), making it impossible to retrieve results without the pass/fail side effect
  2. cli_eval is a CLI tool that writes results to disk. To use it programmatically, you have to call it, then read the results back from the filesystem using LocalEvalSetResultsManager, parse the files, and do your own post-processing. This is the workaround we currently use.

LocalEvalService itself is actually configurable (accepts custom services, eval sets managers, etc.), but it requires manually constructing InferenceRequest / EvaluateRequest / EvaluateConfig, and yields results via async generators. AgentEvaluator._get_eval_results_by_eval_id is the convenience wrapper that handles this plumbing; it's just not public or configurable.

Describe the Solution You'd Like

A public method on AgentEvaluator (or a new class) that:

  1. Returns EvalCaseResult objects in memory without running assertions or requiring disk I/O
  2. Accepts configurable services: at minimum app_name, and optionally session_service, artifact_service, eval_sets_manager, eval_set_results_manager
  3. Separates evaluation from assertion: let the caller decide what to do with results (assert, post-process, export, run additional judges, etc.)

The method signature could look something like:

@staticmethod
async def run_eval(
    agent_module_or_agent: str | BaseAgent,
    eval_set: EvalSet,
    eval_config: EvalConfig,
    app_name: str = "test_app",
    num_runs: int = 1,
    session_service: Optional[BaseSessionService] = None,
    artifact_service: Optional[BaseArtifactService] = None,
    eval_set_results_manager: Optional[EvalSetResultsManager] = None,
) -> dict[str, list[EvalCaseResult]]:
    """Run evaluation and return results grouped by eval case ID."""

Impact on your work

We run evaluations programmatically with custom post-processing: running additional LLM judges on results, exporting to Excel, computing custom statistics, and integrating with our own reporting pipeline. All of this requires EvalCaseResult objects in memory.

Currently we call cli_eval(), then read results back from disk via LocalEvalSetResultsManager to get the EvalCaseResult objects that were already in memory during evaluation but were only persisted to files. This roundtrip is unnecessary and fragile: it depends on filesystem conventions, timestamp-based result filtering, and internal directory structures (_get_eval_history_dir).

Willingness to contribute

No


🟡 Recommended Information

Describe Alternatives You've Considered

1. Calling cli_eval() programmatically and reading results from disk (current approach):

# cli_eval and LocalEvalSetResultsManager come from the ADK eval tooling
# (imports omitted, as in the original snippet)
import time

start_time = time.time()  # marks this run; used to filter results below

# Run evaluation via the CLI tool
cli_eval(["agents/my_app", "evals/my_eval.evalset.json", ...], standalone_mode=False)

# Read results back from disk
results_manager = LocalEvalSetResultsManager(agents_dir="agents")
result_ids = results_manager.list_eval_set_results("my_app")
for result_id in result_ids:
    result = results_manager.get_eval_set_result("my_app", result_id)
    if result.creation_timestamp >= start_time:
        # Now we finally have EvalCaseResult objects to work with
        for case in result.eval_case_results:
            run_custom_judges(case)

This works but is fragile: it depends on filesystem paths, timestamp-based filtering to identify which results belong to the current run, and internal LocalEvalSetResultsManager directory conventions.

2. Using LocalEvalService directly:

This is the most flexible option and actually supports custom services. However, it requires significant boilerplate: manually constructing InferenceRequest, InferenceConfig, EvaluateRequest, EvaluateConfig, setting up an EvalSetsManager, iterating async generators, and collecting results. AgentEvaluator._get_eval_results_by_eval_id already does all of this; it just needs to be public and configurable.
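To illustrate the plumbing burden, here is a self-contained sketch of the async-generator pattern described above and the small convenience layer this issue asks for. Every class here is an inline stand-in defined for the example; the real LocalEvalService, EvaluateRequest, and EvalCaseResult have richer signatures than shown:

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class EvalCaseResult:
    """Simplified stand-in for the ADK's EvalCaseResult."""
    eval_id: str


@dataclass
class EvaluateRequest:
    """Stand-in for the request object a caller must construct by hand."""
    eval_case_ids: list[str] = field(default_factory=list)


class LocalEvalServiceStub:
    """Stand-in showing why the async-generator interface needs wrapping."""

    async def evaluate(self, request: EvaluateRequest):
        for eval_id in request.eval_case_ids:
            yield EvalCaseResult(eval_id)  # results arrive one at a time


async def collect_results(
    service: LocalEvalServiceStub, request: EvaluateRequest
) -> dict[str, list[EvalCaseResult]]:
    """The convenience layer the issue asks to expose: drain the async
    generator and group results by eval case ID."""
    grouped: dict[str, list[EvalCaseResult]] = {}
    async for result in service.evaluate(request):
        grouped.setdefault(result.eval_id, []).append(result)
    return grouped


grouped = asyncio.run(
    collect_results(LocalEvalServiceStub(), EvaluateRequest(["case_1", "case_2"]))
)
print(sorted(grouped))  # eval case IDs seen in this run
```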

3. Calling AgentEvaluator._get_eval_results_by_eval_id directly:

This private method does exactly what we need, but relying on private APIs is not sustainable. It also hardcodes app_name = "test_app" and creates its own services internally, so even using it directly doesn't give us the configurability we need.

Proposed API / Implementation

The simplest path would be to refactor the existing _get_eval_results_by_eval_id into a public method with configurable parameters:

class AgentEvaluator:

    @staticmethod
    async def run_eval(
        agent_module_or_agent: str | BaseAgent,
        eval_set: EvalSet,
        eval_config: EvalConfig,
        app_name: str = "test_app",
        num_runs: int = 1,
        agent_name: Optional[str] = None,
        session_service: Optional[BaseSessionService] = None,
        artifact_service: Optional[BaseArtifactService] = None,
        eval_set_results_manager: Optional[EvalSetResultsManager] = None,
    ) -> dict[str, list[EvalCaseResult]]:
        """Run evaluation and return results grouped by eval case ID.

        Unlike evaluate_eval_set, this method returns results directly
        without running assertions, allowing callers to implement custom
        post-processing.
        """
        ...

Key design choices:

  • agent_module_or_agent: Accept either a module path (existing behavior) or a BaseAgent directly, to avoid the module import convention for users who construct agents programmatically.
  • Return dict[str, list[EvalCaseResult]]: Same structure as _get_eval_results_by_eval_id, grouped by eval case ID to support num_runs > 1.
  • No assertions: The caller decides what to do with results.
  • Optional services: Default to in-memory implementations (existing behavior) but allow overrides.

evaluate_eval_set could then be refactored to call run_eval internally, keeping the assert-based behavior for backward compatibility.
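A hedged sketch of that refactor, using inline stand-ins rather than the real AgentEvaluator internals, could look like this:

```python
import asyncio


async def run_eval(statuses: dict[str, list[str]]) -> dict[str, list[str]]:
    """Stand-in for the proposed public method: returns per-case run
    statuses in memory and performs no assertions. (The real method
    would run inference + evaluation instead of echoing its input.)"""
    return statuses


async def evaluate_eval_set(statuses: dict[str, list[str]]) -> None:
    """Backward-compatible wrapper: keeps today's assert-based contract,
    but implemented on top of run_eval."""
    results = await run_eval(statuses)
    failures = [
        case_id
        for case_id, runs in results.items()
        if any(status != "PASSED" for status in runs)
    ]
    assert not failures, f"Eval cases failed: {failures}"


asyncio.run(evaluate_eval_set({"case_1": ["PASSED", "PASSED"]}))  # no error
```

Callers who want the results rather than the assertion would call run_eval directly.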

Additional Context

  • LocalEvalService already supports all the configurability needed (custom services, eval sets managers, etc.). The gap is purely in the public convenience API: AgentEvaluator wraps LocalEvalService but doesn't expose its flexibility.
  • Accepting a BaseAgent directly (not just a module path string) would also help users who construct agents programmatically or need to configure them before evaluation (e.g., injecting test dependencies, setting up plugins).

Metadata

Labels

eval [Component] This issue is related to evaluation
needs review [Status] The PR/issue is awaiting review from the maintainer
