
Provide a public programmatic API for running evaluations and retrieving results in memory #4534

@lucasbarzotto-axonify

Description

🔴 Required Information

Is your feature request related to a specific problem?

There is no clean public API for running evaluations programmatically and getting results back in memory. The two available options are:

  1. AgentEvaluator has _get_eval_results_by_eval_id, which runs inference + evaluation and returns dict[str, list[EvalCaseResult]] in memory. This is exactly the right abstraction, but:

    • It's a private method (prefixed with _)
    • app_name is hardcoded to "test_app" inside the method
    • It creates its own InMemoryEvalSetsManager internally, so you can't pass your own eval_sets_manager, session_service, artifact_service, or eval_set_results_manager
    • The public method evaluate_eval_set wraps it but immediately runs assertions (assert not failures), making it impossible to retrieve results without the pass/fail side effect
  2. cli_eval is a CLI tool that writes results to disk. To use it programmatically, you have to call it, then read the results back from the filesystem using LocalEvalSetResultsManager, parse the files, and do your own post-processing. This is the workaround we currently use.

LocalEvalService itself is actually configurable (accepts custom services, eval sets managers, etc.), but it requires manually constructing InferenceRequest / EvaluateRequest / EvaluateConfig, and yields results via async generators. AgentEvaluator._get_eval_results_by_eval_id is the convenience wrapper that handles this plumbing; it's just not public or configurable.

Describe the Solution You'd Like

A public method on AgentEvaluator (or a new class) that:

  1. Returns EvalCaseResult objects in memory without running assertions or requiring disk I/O
  2. Accepts configurable services: at minimum app_name, and optionally session_service, artifact_service, eval_sets_manager, eval_set_results_manager
  3. Separates evaluation from assertion: let the caller decide what to do with results (assert, post-process, export, run additional judges, etc.)

The method signature could look something like:

@staticmethod
async def run_eval(
    agent_module_or_agent: str | BaseAgent,
    eval_set: EvalSet,
    eval_config: EvalConfig,
    app_name: str = "test_app",
    num_runs: int = 1,
    session_service: Optional[BaseSessionService] = None,
    artifact_service: Optional[BaseArtifactService] = None,
    eval_set_results_manager: Optional[EvalSetResultsManager] = None,
) -> dict[str, list[EvalCaseResult]]:
    """Run evaluation and return results grouped by eval case ID."""

Impact on your work

We run evaluations programmatically with custom post-processing: running additional LLM judges on results, exporting to Excel, computing custom statistics, and integrating with our own reporting pipeline. All of this requires EvalCaseResult objects in memory.

Currently we call cli_eval(), then read results back from disk via LocalEvalSetResultsManager to get the EvalCaseResult objects that were already in memory during evaluation but were only persisted to files. This roundtrip is unnecessary and fragile: it depends on filesystem conventions, timestamp-based result filtering, and internal directory structures (_get_eval_history_dir).

Willingness to contribute

No


🟡 Recommended Information

Describe Alternatives You've Considered

1. Calling cli_eval() programmatically and reading results from disk (current approach):

# cli_eval and LocalEvalSetResultsManager come from the ADK eval tooling
# (imports omitted, as in the original snippet)
import time

start_time = time.time()  # marks this run; used to filter results below

# Run evaluation via the CLI tool
cli_eval(["agents/my_app", "evals/my_eval.evalset.json", ...], standalone_mode=False)

# Read results back from disk
results_manager = LocalEvalSetResultsManager(agents_dir="agents")
result_ids = results_manager.list_eval_set_results("my_app")
for result_id in result_ids:
    result = results_manager.get_eval_set_result("my_app", result_id)
    if result.creation_timestamp >= start_time:
        # Now we finally have EvalCaseResult objects to work with
        for case in result.eval_case_results:
            run_custom_judges(case)

This works but is fragile: it depends on filesystem paths, timestamp-based filtering to identify which results belong to the current run, and internal LocalEvalSetResultsManager directory conventions.

2. Using LocalEvalService directly:

This is the most flexible option and actually supports custom services. However, it requires significant boilerplate: manually constructing InferenceRequest, InferenceConfig, EvaluateRequest, EvaluateConfig, setting up an EvalSetsManager, iterating async generators, and collecting results. AgentEvaluator._get_eval_results_by_eval_id already does all of this; it just needs to be public and configurable.
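To illustrate the plumbing burden, here is a self-contained sketch of the async-generator pattern described above and the small convenience layer this issue asks for. Every class here is an inline stand-in defined for the example; the real LocalEvalService, EvaluateRequest, and EvalCaseResult have richer signatures than shown:

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class EvalCaseResult:
    """Simplified stand-in for the ADK's EvalCaseResult."""
    eval_id: str


@dataclass
class EvaluateRequest:
    """Stand-in for the request object a caller must construct by hand."""
    eval_case_ids: list[str] = field(default_factory=list)


class LocalEvalServiceStub:
    """Stand-in showing why the async-generator interface needs wrapping."""

    async def evaluate(self, request: EvaluateRequest):
        for eval_id in request.eval_case_ids:
            yield EvalCaseResult(eval_id)  # results arrive one at a time


async def collect_results(
    service: LocalEvalServiceStub, request: EvaluateRequest
) -> dict[str, list[EvalCaseResult]]:
    """The convenience layer the issue asks to expose: drain the async
    generator and group results by eval case ID."""
    grouped: dict[str, list[EvalCaseResult]] = {}
    async for result in service.evaluate(request):
        grouped.setdefault(result.eval_id, []).append(result)
    return grouped


grouped = asyncio.run(
    collect_results(LocalEvalServiceStub(), EvaluateRequest(["case_1", "case_2"]))
)
print(sorted(grouped))  # eval case IDs seen in this run
```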

3. Calling AgentEvaluator._get_eval_results_by_eval_id directly:

This private method does exactly what we need, but relying on private APIs is not sustainable. It also hardcodes app_name = "test_app" and creates its own services internally, so even using it directly doesn't give us the configurability we need.

Proposed API / Implementation

The simplest path would be to refactor the existing _get_eval_results_by_eval_id into a public method with configurable parameters:

class AgentEvaluator:

    @staticmethod
    async def run_eval(
        agent_module_or_agent: str | BaseAgent,
        eval_set: EvalSet,
        eval_config: EvalConfig,
        app_name: str = "test_app",
        num_runs: int = 1,
        agent_name: Optional[str] = None,
        session_service: Optional[BaseSessionService] = None,
        artifact_service: Optional[BaseArtifactService] = None,
        eval_set_results_manager: Optional[EvalSetResultsManager] = None,
    ) -> dict[str, list[EvalCaseResult]]:
        """Run evaluation and return results grouped by eval case ID.

        Unlike evaluate_eval_set, this method returns results directly
        without running assertions, allowing callers to implement custom
        post-processing.
        """
        ...

Key design choices:

  • agent_module_or_agent: Accept either a module path (existing behavior) or a BaseAgent directly, to avoid the module import convention for users who construct agents programmatically.
  • Return dict[str, list[EvalCaseResult]]: Same structure as _get_eval_results_by_eval_id, grouped by eval case ID to support num_runs > 1.
  • No assertions: The caller decides what to do with results.
  • Optional services: Default to in-memory implementations (existing behavior) but allow overrides.

evaluate_eval_set could then be refactored to call run_eval internally, keeping the assert-based behavior for backward compatibility.
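A hedged sketch of that refactor, using inline stand-ins rather than the real AgentEvaluator internals, could look like this:

```python
import asyncio


async def run_eval(statuses: dict[str, list[str]]) -> dict[str, list[str]]:
    """Stand-in for the proposed public method: returns per-case run
    statuses in memory and performs no assertions. (The real method
    would run inference + evaluation instead of echoing its input.)"""
    return statuses


async def evaluate_eval_set(statuses: dict[str, list[str]]) -> None:
    """Backward-compatible wrapper: keeps today's assert-based contract,
    but implemented on top of run_eval."""
    results = await run_eval(statuses)
    failures = [
        case_id
        for case_id, runs in results.items()
        if any(status != "PASSED" for status in runs)
    ]
    assert not failures, f"Eval cases failed: {failures}"


asyncio.run(evaluate_eval_set({"case_1": ["PASSED", "PASSED"]}))  # no error
```

Callers who want the results rather than the assertion would call run_eval directly.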

Additional Context

  • LocalEvalService already supports all the configurability needed (custom services, eval sets managers, etc.). The gap is purely in the public convenience API: AgentEvaluator wraps LocalEvalService but doesn't expose its flexibility.
  • Accepting a BaseAgent directly (not just a module path string) would also help users who construct agents programmatically or need to configure them before evaluation (e.g., injecting test dependencies, setting up plugins).

Metadata

Labels

eval [Component] This issue is related to evaluation
needs review [Status] The PR/issue is awaiting review from the maintainer
