049-eval-trials-parallelism #121

@aroff

Description

Evaluation Trials and Parallelism Enhancement

1. Problem Statement

The current evaluation system in FastSkill runs only one trial per case per eval run, making it unsuitable for reliably assessing stochastic agent behavior. Each case execution through AikitEvalRunner::run_case() in src/eval/runner.rs:75-210 produces a single CaseResult with binary pass/fail status. This creates false confidence: a single lucky or unlucky run may not represent typical agent performance. The system also lacks the configurable pass-rate thresholds and parallelism options that are essential for CI/CD integration, where deterministic pass rates (e.g., "80% of trials must pass") are required for gating merges. Out of scope: changes to agent execution semantics, check definition syntax, or evaluation report formats beyond adding trial aggregation data.

2. Goals

  1. Support configurable trials per case with deterministic aggregation rules for pass/fail determination
  2. Enable optional parallelism for trial execution to reduce total evaluation time
  3. Provide CI-friendly threshold-based exit semantics with configurable pass rates (0.0-1.0)
  4. Maintain backward compatibility with existing single-trial configurations
  5. Implement cost/token awareness with warnings for large trial × case combinations
  6. Extend artifact storage to handle multiple trials per case with summary aggregation

3. Current Behavior

| Component | Role | Mutable at runtime? |
| --- | --- | --- |
| AikitEvalRunner | Executes single case via aikit-sdk, produces one CaseResult | No |
| EvalConfig | Holds prompts path, checks path, timeout, fail-on-missing settings | No |
| CaseRunOptions | Runtime parameters: agent, model, project root, timeout | No |
| SummaryResult | Aggregates case results with binary suite pass/fail | No |
| allocate_run_dir() | Creates timestamped run directory for artifacts | No |

Current build flow: CLI parses args → load EvalConfig from TOML → load EvalSuite from CSV → execute cases sequentially via AikitEvalRunner::run_case() → write artifacts per case → write summary.json. Resume flow: not supported; each run is independent with new run directory allocation.

4. Design

4.1 Data Structures

Extended EvalConfigToml (in src/core/manifest.rs):

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvalConfigToml {
    pub prompts: PathBuf,
    pub checks: Option<PathBuf>,
    pub timeout_seconds: u64,
    pub fail_on_missing_agent: bool,
    // New fields:
    #[serde(default = "default_trials_per_case")]
    pub trials_per_case: u32,
    #[serde(default)]
    pub parallel: Option<u32>,
    #[serde(default = "default_pass_threshold")]
    pub pass_threshold: f64,
}

fn default_trials_per_case() -> u32 { 1 }
fn default_pass_threshold() -> f64 { 1.0 }

Trial Result Types (new in src/eval/artifacts.rs):

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TrialResult {
    pub trial_id: u32,
    pub status: CaseStatus,
    pub command_count: Option<usize>,
    pub input_tokens: Option<u64>,
    pub output_tokens: Option<u64>,
    pub check_results: Vec<CheckResult>,
    pub error_message: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CaseTrialsResult {
    pub id: String,
    pub trials: Vec<TrialResult>,
    pub aggregated_status: CaseStatus,
    pub pass_count: u32,
    pub total_trials: u32,
    pub pass_rate: f64,
}
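The aggregation rule these types feed (section 4.5) can be sketched with a simplified, self-contained status enum — the real CaseStatus and CaseTrialsResult live in src/eval/artifacts.rs, so this is an illustration of the rule, not the project's code:

```rust
// Simplified stand-in for CaseStatus, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Passed,
    Failed,
}

/// Deterministic aggregation: returns (pass_count, pass_rate, aggregated_status).
/// The case passes iff pass_count / total_trials >= threshold.
fn aggregate(statuses: &[Status], threshold: f64) -> (u32, f64, Status) {
    let pass_count = statuses.iter().filter(|s| **s == Status::Passed).count() as u32;
    // Validation guarantees at least one trial, so the division is safe.
    let pass_rate = f64::from(pass_count) / statuses.len() as f64;
    let aggregated = if pass_rate >= threshold {
        Status::Passed
    } else {
        Status::Failed
    };
    (pass_count, pass_rate, aggregated)
}
```

With 3 passes out of 5 trials and a 0.6 threshold, this yields Passed — the exact scenario in acceptance criterion 3.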

4.2 Concurrency Model

Parallelism Strategy: Use Tokio's JoinSet for bounded concurrency with semaphore-based limiting. Maximum parallelism defaults to number of CPU cores, configurable via parallel field. Each trial executes independently in spawn_blocking (existing aikit pattern) with no shared state beyond read-only configuration.

Locking Strategy: No explicit locking required. Each trial writes to separate artifact paths using trial ID suffix (case-id/trial-N/). Aggregation happens after all trials complete.

4.3 API Surface

New CLI Options (in src/cli/commands/eval/run.rs):

#[derive(Debug, Args)]
pub struct RunArgs {
    // Existing fields...
    
    /// Override trials per case from config
    #[arg(long)]
    pub trials: Option<u32>,
    
    /// Enable CI mode: exit non-zero if pass rate below threshold
    #[arg(long)]
    pub ci: bool,
    
    /// Override pass threshold (0.0-1.0)
    #[arg(long)]
    pub threshold: Option<f64>,
}

4.4 Integration Points

EvalRunner Trait Extension:

#[async_trait]
pub trait EvalRunner: Send + Sync {
    /// Run multiple trials for one case
    async fn run_case_trials(
        &self,
        case: &EvalCase,
        opts: &CaseRunOptions,
        checks: &[CheckDefinition],
        trial_count: u32,
        max_parallelism: Option<u32>,
    ) -> CaseTrialsResult;
}

4.5 Invariants

  1. Deterministic Aggregation: Case passes if pass_count / total_trials >= pass_threshold
  2. Trial Independence: Each trial MUST execute in isolation with no shared mutable state
  3. Artifact Consistency: Trial artifacts MUST be stored in separate subdirectories: run-dir/case-id/trial-N/
  4. Configuration Inheritance: CLI args override TOML config; unspecified values use TOML defaults

4.6 Identity and Naming

  • Trial Directories: {run_dir}/{case_id}/trial-{trial_id}/ where trial_id is 1-indexed
  • Aggregation Files: {run_dir}/{case_id}/aggregated.json contains CaseTrialsResult
  • Summary Format: Existing summary.json structure extended with trial-aware fields

4.7 Error Conditions

  • Invalid Configuration: trials_per_case < 1 or pass_threshold not in [0.0, 1.0]
  • Resource Exhaustion: Parallel execution fails due to system limits
  • Partial Trial Failure: Some trials timeout/error while others succeed
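The two configuration conditions above imply a validation step; a minimal sketch, assuming the function name and error strings (the real error codes are listed in section 6):

```rust
/// Hypothetical validation for the new configuration fields.
/// Error messages here are placeholders, not the EVAL_* error codes.
fn validate_trials_config(trials_per_case: u32, pass_threshold: f64) -> Result<(), String> {
    if trials_per_case < 1 {
        return Err("trials_per_case must be >= 1".into());
    }
    // Inclusive range check: 0.0 and 1.0 are both valid thresholds.
    if !(0.0..=1.0).contains(&pass_threshold) {
        return Err("pass_threshold must be in [0.0, 1.0]".into());
    }
    Ok(())
}
```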

4.8 Configuration Example

[tool.fastskill.eval]
prompts = "evals/prompts.csv"
checks = "evals/checks.toml"
timeout_seconds = 300
trials_per_case = 5
parallel = 4
pass_threshold = 0.8
fail_on_missing_agent = true

5. Schema and API

Extended Configuration Types:

// In src/eval/config.rs
#[derive(Debug, Clone)]
pub struct EvalConfig {
    pub prompts_path: PathBuf,
    pub checks_path: Option<PathBuf>,
    pub timeout_seconds: u64,
    pub fail_on_missing_agent: bool,
    pub project_root: PathBuf,
    // New fields:
    pub trials_per_case: u32,
    pub parallel: Option<u32>,
    pub pass_threshold: f64,
}

New Runner Methods:

// In src/eval/runner.rs
impl AikitEvalRunner {
    pub async fn run_case_trials(
        &self,
        case: &EvalCase,
        opts: &CaseRunOptions,
        checks: &[CheckDefinition],
        trial_count: u32,
        max_parallelism: Option<u32>,
    ) -> CaseTrialsResult {
        let limit = max_parallelism.map(|n| n as usize).unwrap_or_else(num_cpus::get);
        let semaphore = Arc::new(Semaphore::new(limit));
        let mut join_set = JoinSet::new();
        
        for trial_id in 1..=trial_count {
            let permit = Arc::clone(&semaphore);
            // Spawned tasks must be 'static, so the runner and inputs are
            // cloned into each task (AikitEvalRunner is assumed to be cheap
            // to clone or Arc-wrapped).
            let runner = self.clone();
            let case_clone = case.clone();
            let opts_clone = opts.clone();
            let checks_clone = checks.to_vec();
            
            join_set.spawn(async move {
                let _permit = permit.acquire().await.expect("semaphore closed");
                let (_output, result, _trace) =
                    runner.run_case_inner(&case_clone, &opts_clone, &checks_clone).await;
                TrialResult {
                    trial_id,
                    status: result.status,
                    command_count: result.command_count,
                    input_tokens: result.input_tokens,
                    output_tokens: result.output_tokens,
                    check_results: result.check_results,
                    error_message: result.error_message,
                }
            });
        }
        
        let mut trials = Vec::new();
        while let Some(trial_result) = join_set.join_next().await {
            trials.push(trial_result.expect("trial task panicked"));
        }
        
        trials.sort_by_key(|t| t.trial_id);
        let pass_count = trials.iter().filter(|t| t.status == CaseStatus::Passed).count() as u32;
        let total_trials = trial_count;
        // Validation guarantees trial_count >= 1; max(1) guards the division anyway.
        let pass_rate = f64::from(pass_count) / f64::from(total_trials.max(1));
        let threshold = opts.pass_threshold; // Added to CaseRunOptions
        
        CaseTrialsResult {
            id: case.id.clone(),
            trials,
            aggregated_status: if pass_rate >= threshold {
                CaseStatus::Passed
            } else {
                CaseStatus::Failed
            },
            pass_count,
            total_trials,
            pass_rate,
        }
    }
}

Artifact Storage Extensions:

// In src/eval/artifacts.rs
pub fn write_trial_artifacts(
    run_dir: &Path,
    case_id: &str,
    trial_id: u32,
    stdout: &[u8],
    stderr: &[u8],
    trace_jsonl: &str,
    result: &TrialResult,
) -> Result<PathBuf, ArtifactsError>;

pub fn write_case_trials_summary(
    run_dir: &Path,
    case_id: &str,
    trials_result: &CaseTrialsResult,
) -> Result<(), ArtifactsError>;
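A hedged sketch of how these writers might lay files out per the naming rules in section 4.6, using only std::fs and taking pre-serialized JSON for simplicity (the real write_case_trials_summary would serialize CaseTrialsResult itself and return ArtifactsError):

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Invariant 4.6: trial artifacts live at {run_dir}/{case_id}/trial-{trial_id}/,
/// with trial_id 1-indexed. Separate directories per trial mean no locking is needed.
fn trial_dir(run_dir: &Path, case_id: &str, trial_id: u32) -> PathBuf {
    run_dir.join(case_id).join(format!("trial-{trial_id}"))
}

/// Simplified writer: the aggregated result arrives as a pre-serialized JSON string.
fn write_case_trials_summary(
    run_dir: &Path,
    case_id: &str,
    aggregated_json: &str,
) -> io::Result<PathBuf> {
    let case_dir = run_dir.join(case_id);
    fs::create_dir_all(&case_dir)?;
    let path = case_dir.join("aggregated.json");
    fs::write(&path, aggregated_json)?;
    Ok(path)
}
```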

6. Error Codes

| Code | Category | Trigger |
| --- | --- | --- |
| EVAL_INVALID_TRIALS_CONFIG | Configuration | trials_per_case < 1 or > 1000 |
| EVAL_INVALID_THRESHOLD | Configuration | pass_threshold not in range [0.0, 1.0] |
| EVAL_PARALLEL_EXHAUSTION | Runtime | Semaphore acquisition fails or join_set panics |
| EVAL_TRIAL_ARTIFACTS_CORRUPT | Storage | Cannot write trial artifacts due to filesystem errors |
| EVAL_COST_WARNING | Advisory | trials_per_case * case_count >= 100 (configurable threshold) |
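The EVAL_COST_WARNING trigger reduces to a single comparison; a sketch with the threshold passed in as a parameter, since the spec calls it configurable (the inclusive >= 100 boundary matches acceptance criterion 7):

```rust
/// Advisory check: warn before running when the total trial volume is large.
/// Widening to u64 avoids overflow for large trial x case products.
fn should_warn_cost(trials_per_case: u32, total_cases: u32, warn_threshold: u64) -> bool {
    u64::from(trials_per_case) * u64::from(total_cases) >= warn_threshold
}
```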

7. Acceptance Criteria

  1. Backward Compatibility: Existing projects with no trials configuration MUST continue working with trials_per_case = 1
  2. Configuration Validation: Invalid trials_per_case or pass_threshold values MUST be rejected with clear error messages
  3. Deterministic Aggregation: Case with 3 passed trials out of 5 total and threshold 0.6 MUST result in CaseStatus::Passed
  4. Parallel Execution: Trials MUST execute concurrently when parallel > 1 is configured
  5. Artifact Structure: Each trial MUST write artifacts to {run_dir}/{case_id}/trial-{trial_id}/ directory
  6. CI Exit Semantics: --ci flag MUST cause non-zero exit when suite aggregate pass rate is below threshold
  7. Cost Warning: Commands MUST warn when trials_per_case * total_cases >= 100
  8. No Agent Changes: Implementation MUST NOT modify agent execution behavior or aikit-sdk integration
  9. Summary Extension: summary.json MUST include trial aggregation data while maintaining existing field compatibility

8. Functionality Comparison (Before vs After)

Before:

  • Single trial per case with binary pass/fail
  • Sequential execution only
  • No threshold-based CI integration
  • Basic summary.json with case-level results
  • No cost/token awareness

After:

  • Configurable trials per case (1-N) with aggregated pass rates
  • Optional parallel execution with bounded concurrency
  • CI-friendly --ci flag with threshold-based exit codes
  • Extended summary.json with trial breakdowns and pass rates
  • Cost warnings for large trial × case combinations
  • Backward compatible defaults (trials=1, threshold=1.0)

9. Benefit of Implementing This Functionality

This enhancement addresses a critical gap in FastSkill's evaluation system by enabling reliable assessment of stochastic agent behavior. Stochastic agents are increasingly common in production AI systems, and single-shot evaluations provide false confidence that can lead to deployment of unreliable agents. By implementing configurable trials with deterministic aggregation rules, FastSkill becomes suitable for CI/CD pipelines where merge gates require statistical confidence (e.g., "agent passes 80% of trials"). Optional parallelism cuts wall-clock evaluation time roughly by the concurrency factor — from linear in the trial count toward trial count divided by the parallelism limit — making larger trial counts practical. The cost awareness features prevent accidental expensive runs while maintaining development velocity for smaller test suites.

10. Stages

Stage 1: Core Configuration Extension

  • Deliverable: Extended EvalConfigToml and EvalConfig with trials, parallelism, and threshold fields
  • Scope: Modify src/core/manifest.rs (add new fields), src/eval/config.rs (extend resolution logic), add validation functions
  • Dependencies: None

Stage 2: Trial Result Types and Aggregation Logic

  • Deliverable: TrialResult, CaseTrialsResult types with deterministic aggregation rules
  • Scope: Extend src/eval/artifacts.rs with new types, implement aggregation functions, add serialization support
  • Dependencies: Stage 1 (requires extended configuration types)

Stage 3: Runner Extension for Multi-Trial Execution

  • Deliverable: run_case_trials() method with parallel execution and trial management
  • Scope: Extend src/eval/runner.rs with new trait method, implement in AikitEvalRunner, add semaphore-based concurrency control
  • Dependencies: Stage 2 (requires trial result types)

Stage 4: Artifact Storage for Trial Data

  • Deliverable: Trial-specific artifact writing with directory structure {run_dir}/{case_id}/trial-{trial_id}/
  • Scope: Extend src/eval/artifacts.rs with write_trial_artifacts(), write_case_trials_summary(), update directory allocation
  • Dependencies: Stage 2 (requires trial result types)

Stage 5: CLI Interface and Orchestration

  • Deliverable: Extended RunArgs with --trials, --ci, --threshold options and integration with multi-trial runner
  • Scope: Modify src/cli/commands/eval/run.rs to handle new arguments, orchestrate multi-trial execution, implement CI exit semantics
  • Dependencies: Stages 3-4 (requires runner extension and artifact storage)

Stage 6: Cost Awareness and Warnings

  • Deliverable: Token/cost estimation with warnings for large trial × case combinations
  • Scope: Add cost calculation logic to CLI orchestration, implement warning thresholds, extend help text
  • Dependencies: Stage 5 (requires CLI integration)

11. Breaking Changes

None. All changes are backward compatible:

  • Existing TOML configurations without trials fields use default values (trials_per_case = 1, pass_threshold = 1.0)
  • Existing CLI commands continue working without new flags
  • summary.json format is extended but maintains all existing fields
  • EvalRunner trait gains new methods but existing run_case() remains unchanged

12. Modified Files and Summary

| File/path | Summary of modifications |
| --- | --- |
| src/core/manifest.rs | Add trials_per_case, parallel, pass_threshold fields to EvalConfigToml with default functions |
| src/eval/config.rs | Extend EvalConfig struct and resolve_from_toml() function to handle new configuration fields |
| src/eval/artifacts.rs | Add TrialResult, CaseTrialsResult types; implement write_trial_artifacts(), write_case_trials_summary() functions |
| src/eval/runner.rs | Add run_case_trials() method to EvalRunner trait; implement parallel execution in AikitEvalRunner |
| src/cli/commands/eval/run.rs | Extend RunArgs with trials/CI options; modify execute_run_with_runner() for multi-trial orchestration |
| tests/cli/eval_tests.rs | Add integration tests for trials configuration, parallel execution, CI exit semantics |
| src/eval/config.rs (tests) | Extend existing config tests with trials field validation |

13. Dependencies

External Dependencies:

  • tokio::task::JoinSet (already in Cargo.toml) for parallel trial execution
  • tokio::sync::Semaphore (already in Cargo.toml) for bounded concurrency
  • num_cpus crate (new) for default parallelism detection

Internal Dependencies:

  • No blocking dependencies on other teams or services
  • Optional: Performance testing team consultation for large-scale trial validation (non-blocking, Stage 6)

If num_cpus dependency is skipped: Default parallelism falls back to hardcoded value (4 cores).
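As an alternative to the hardcoded fallback, the standard library's std::thread::available_parallelism (stable since Rust 1.59) could supply the default without the num_cpus dependency at all — a sketch of that option, not a decision recorded in this spec:

```rust
/// Default trial parallelism: query the OS for available cores, falling back
/// to the spec's hardcoded value of 4 if the query fails.
fn default_parallelism() -> u32 {
    std::thread::available_parallelism()
        .map(|n| n.get() as u32)
        .unwrap_or(4)
}
```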

14. Proposed Folder Structure

Modified paths only (no new directories):

  • Trial artifacts use existing run directory pattern: {run_dir}/{case_id}/trial-{trial_id}/
  • Aggregated results stored as: {run_dir}/{case_id}/aggregated.json
  • No changes to top-level project structure

15. Design Decisions

| # | Question | Decision | Rationale |
| --- | --- | --- | --- |
| 1 | How to handle partial trial failures? | Aggregate all completed trials, include failed/errored trials in pass rate calculation | Provides transparent reporting; users can set thresholds to handle expected failure rates |
| 2 | Should parallelism be per-case or per-trial? | Per-trial within each case | Maintains case isolation while maximizing concurrency; prevents resource exhaustion |
| 3 | What maximum trials limit to enforce? | 1000 trials per case | Prevents accidental expensive runs while allowing extensive testing scenarios |
| 4 | How to handle trial artifact naming conflicts? | Use trial-specific subdirectories with consistent naming | Avoids filesystem conflicts; enables easy trial comparison |
| 5 | Should pass threshold apply per-case or suite-wide? | Per-case with suite aggregation | Provides fine-grained control while maintaining intuitive suite-level semantics |

16. Out of Scope (Explicit)

  • Agent Execution Changes: No modifications to aikit-sdk integration or agent execution semantics
  • Check Definition Syntax: No changes to existing checks.toml format or check execution logic
  • Report Format Overhaul: No changes to existing CLI output formats beyond adding trial aggregation data
  • Resume/Checkpoint Support: No implementation of partial run resumption or trial checkpointing
  • Dynamic Trial Allocation: No support for adaptive trial counts based on variance or early stopping
  • Cross-Agent Comparison: No multi-agent trial comparison or ranking features
  • Custom Aggregation Rules: No support for aggregation functions beyond simple pass rate thresholds

17. References

Key Files to Start From:

  • src/eval/runner.rs:75-210 - AikitEvalRunner::run_case() implementation to extend for multi-trial support
  • src/eval/config.rs:55-84 - resolve_from_toml() function for configuration resolution logic
  • src/cli/commands/eval/run.rs:78-272 - execute_run_with_runner() orchestration function
  • src/eval/artifacts.rs:109-129 - write_case_artifacts() pattern to replicate for trial artifacts
  • src/core/manifest.rs:283-295 - EvalConfigToml struct to extend with new fields
  • tests/cli/eval_tests.rs:134-150 - Existing eval integration test patterns to follow for trials testing
