049-eval-trials-parallelism #121

@aroff

Description

Evaluation Trials and Parallelism Enhancement

1. Problem Statement

The current evaluation system in FastSkill runs only one trial per case per eval run, making it unsuitable for reliably assessing stochastic agent behavior. Each case execution through AikitEvalRunner::run_case() in src/eval/runner.rs:75-210 produces a single CaseResult with binary pass/fail status. This creates false confidence: a single lucky or unlucky run may not represent typical agent performance. The system also lacks the configurable pass-rate thresholds and parallelism options that are essential for CI/CD integration, where deterministic pass rates (e.g., "80% of trials must pass") are required for gating merges. Out of scope: changes to agent execution semantics, check definition syntax, or evaluation report formats beyond adding trial aggregation data.

2. Goals

  1. Support configurable trials per case with deterministic aggregation rules for pass/fail determination
  2. Enable optional parallelism for trial execution to reduce total evaluation time
  3. Provide CI-friendly threshold-based exit semantics with configurable pass rates (0.0-1.0)
  4. Maintain backward compatibility with existing single-trial configurations
  5. Implement cost/token awareness with warnings for large trial × case combinations
  6. Extend artifact storage to handle multiple trials per case with summary aggregation

3. Current Behavior

| Component | Role | Mutable at runtime? |
| --- | --- | --- |
| AikitEvalRunner | Executes single case via aikit-sdk, produces one CaseResult | No |
| EvalConfig | Holds prompts path, checks path, timeout, fail-on-missing settings | No |
| CaseRunOptions | Runtime parameters: agent, model, project root, timeout | No |
| SummaryResult | Aggregates case results with binary suite pass/fail | No |
| allocate_run_dir() | Creates timestamped run directory for artifacts | No |

Current build flow: CLI parses args → load EvalConfig from TOML → load EvalSuite from CSV → execute cases sequentially via AikitEvalRunner::run_case() → write artifacts per case → write summary.json. Resume flow: not supported; each run is independent with new run directory allocation.

4. Design

4.1 Data Structures

Extended EvalConfigToml (in src/core/manifest.rs):

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvalConfigToml {
    pub prompts: PathBuf,
    pub checks: Option<PathBuf>,
    pub timeout_seconds: u64,
    pub fail_on_missing_agent: bool,
    // New fields:
    #[serde(default = "default_trials_per_case")]
    pub trials_per_case: u32,
    #[serde(default)]
    pub parallel: Option<u32>,
    #[serde(default = "default_pass_threshold")]
    pub pass_threshold: f64,
}

fn default_trials_per_case() -> u32 { 1 }
fn default_pass_threshold() -> f64 { 1.0 }

Trial Result Types (new in src/eval/artifacts.rs):

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TrialResult {
    pub trial_id: u32,
    pub status: CaseStatus,
    pub command_count: Option<usize>,
    pub input_tokens: Option<u64>,
    pub output_tokens: Option<u64>,
    pub check_results: Vec<CheckResult>,
    pub error_message: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CaseTrialsResult {
    pub id: String,
    pub trials: Vec<TrialResult>,
    pub aggregated_status: CaseStatus,
    pub pass_count: u32,
    pub total_trials: u32,
    pub pass_rate: f64,
}
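The aggregation rule these types feed (section 4.5) can be sketched with a simplified, self-contained status enum — the real CaseStatus and CaseTrialsResult live in src/eval/artifacts.rs, so this is an illustration of the rule, not the project's code:

```rust
// Simplified stand-in for CaseStatus, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Passed,
    Failed,
}

/// Deterministic aggregation: returns (pass_count, pass_rate, aggregated_status).
/// The case passes iff pass_count / total_trials >= threshold.
fn aggregate(statuses: &[Status], threshold: f64) -> (u32, f64, Status) {
    let pass_count = statuses.iter().filter(|s| **s == Status::Passed).count() as u32;
    // Validation guarantees at least one trial, so the division is safe.
    let pass_rate = f64::from(pass_count) / statuses.len() as f64;
    let aggregated = if pass_rate >= threshold {
        Status::Passed
    } else {
        Status::Failed
    };
    (pass_count, pass_rate, aggregated)
}
```

With 3 passes out of 5 trials and a 0.6 threshold, this yields Passed — the exact scenario in acceptance criterion 3.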

4.2 Concurrency Model

Parallelism Strategy: Use Tokio's JoinSet for bounded concurrency with semaphore-based limiting. Maximum parallelism defaults to number of CPU cores, configurable via parallel field. Each trial executes independently in spawn_blocking (existing aikit pattern) with no shared state beyond read-only configuration.

Locking Strategy: No explicit locking required. Each trial writes to separate artifact paths using trial ID suffix (case-id/trial-N/). Aggregation happens after all trials complete.

4.3 API Surface

New CLI Options (in src/cli/commands/eval/run.rs):

#[derive(Debug, Args)]
pub struct RunArgs {
    // Existing fields...
    
    /// Override trials per case from config
    #[arg(long)]
    pub trials: Option<u32>,
    
    /// Enable CI mode: exit non-zero if pass rate below threshold
    #[arg(long)]
    pub ci: bool,
    
    /// Override pass threshold (0.0-1.0)
    #[arg(long)]
    pub threshold: Option<f64>,
}

4.4 Integration Points

EvalRunner Trait Extension:

#[async_trait]
pub trait EvalRunner: Send + Sync {
    /// Run multiple trials for one case
    async fn run_case_trials(
        &self,
        case: &EvalCase,
        opts: &CaseRunOptions,
        checks: &[CheckDefinition],
        trial_count: u32,
        max_parallelism: Option<u32>,
    ) -> CaseTrialsResult;
}

4.5 Invariants

  1. Deterministic Aggregation: Case passes if pass_count / total_trials >= pass_threshold
  2. Trial Independence: Each trial MUST execute in isolation with no shared mutable state
  3. Artifact Consistency: Trial artifacts MUST be stored in separate subdirectories: run-dir/case-id/trial-N/
  4. Configuration Inheritance: CLI args override TOML config; unspecified values use TOML defaults

4.6 Identity and Naming

  • Trial Directories: {run_dir}/{case_id}/trial-{trial_id}/ where trial_id is 1-indexed
  • Aggregation Files: {run_dir}/{case_id}/aggregated.json contains CaseTrialsResult
  • Summary Format: Existing summary.json structure extended with trial-aware fields

4.7 Error Conditions

  • Invalid Configuration: trials_per_case < 1 or pass_threshold not in [0.0, 1.0]
  • Resource Exhaustion: Parallel execution fails due to system limits
  • Partial Trial Failure: Some trials timeout/error while others succeed
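The two configuration conditions above imply a validation step; a minimal sketch, assuming the function name and error strings (the real error codes are listed in section 6):

```rust
/// Hypothetical validation for the new configuration fields.
/// Error messages here are placeholders, not the EVAL_* error codes.
fn validate_trials_config(trials_per_case: u32, pass_threshold: f64) -> Result<(), String> {
    if trials_per_case < 1 {
        return Err("trials_per_case must be >= 1".into());
    }
    // Inclusive range check: 0.0 and 1.0 are both valid thresholds.
    if !(0.0..=1.0).contains(&pass_threshold) {
        return Err("pass_threshold must be in [0.0, 1.0]".into());
    }
    Ok(())
}
```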

4.8 Configuration Example

[tool.fastskill.eval]
prompts = "evals/prompts.csv"
checks = "evals/checks.toml"
timeout_seconds = 300
trials_per_case = 5
parallel = 4
pass_threshold = 0.8
fail_on_missing_agent = true

5. Schema and API

Extended Configuration Types:

// In src/eval/config.rs
#[derive(Debug, Clone)]
pub struct EvalConfig {
    pub prompts_path: PathBuf,
    pub checks_path: Option<PathBuf>,
    pub timeout_seconds: u64,
    pub fail_on_missing_agent: bool,
    pub project_root: PathBuf,
    // New fields:
    pub trials_per_case: u32,
    pub parallel: Option<u32>,
    pub pass_threshold: f64,
}

New Runner Methods:

// In src/eval/runner.rs
impl AikitEvalRunner {
    pub async fn run_case_trials(
        &self,
        case: &EvalCase,
        opts: &CaseRunOptions,
        checks: &[CheckDefinition],
        trial_count: u32,
        max_parallelism: Option<u32>,
    ) -> CaseTrialsResult {
        let limit = max_parallelism.map(|n| n as usize).unwrap_or_else(num_cpus::get);
        let semaphore = Arc::new(Semaphore::new(limit));
        let mut join_set = JoinSet::new();
        
        for trial_id in 1..=trial_count {
            let permit = Arc::clone(&semaphore);
            // Spawned tasks must be 'static, so the runner and inputs are
            // cloned into each task (AikitEvalRunner is assumed to be cheap
            // to clone or Arc-wrapped).
            let runner = self.clone();
            let case_clone = case.clone();
            let opts_clone = opts.clone();
            let checks_clone = checks.to_vec();
            
            join_set.spawn(async move {
                let _permit = permit.acquire().await.expect("semaphore closed");
                let (_output, result, _trace) =
                    runner.run_case_inner(&case_clone, &opts_clone, &checks_clone).await;
                TrialResult {
                    trial_id,
                    status: result.status,
                    command_count: result.command_count,
                    input_tokens: result.input_tokens,
                    output_tokens: result.output_tokens,
                    check_results: result.check_results,
                    error_message: result.error_message,
                }
            });
        }
        
        let mut trials = Vec::new();
        while let Some(trial_result) = join_set.join_next().await {
            trials.push(trial_result.expect("trial task panicked"));
        }
        
        trials.sort_by_key(|t| t.trial_id);
        let pass_count = trials.iter().filter(|t| t.status == CaseStatus::Passed).count() as u32;
        let total_trials = trial_count;
        // Validation guarantees trial_count >= 1; max(1) guards the division anyway.
        let pass_rate = f64::from(pass_count) / f64::from(total_trials.max(1));
        let threshold = opts.pass_threshold; // Added to CaseRunOptions
        
        CaseTrialsResult {
            id: case.id.clone(),
            trials,
            aggregated_status: if pass_rate >= threshold {
                CaseStatus::Passed
            } else {
                CaseStatus::Failed
            },
            pass_count,
            total_trials,
            pass_rate,
        }
    }
}

Artifact Storage Extensions:

// In src/eval/artifacts.rs
pub fn write_trial_artifacts(
    run_dir: &Path,
    case_id: &str,
    trial_id: u32,
    stdout: &[u8],
    stderr: &[u8],
    trace_jsonl: &str,
    result: &TrialResult,
) -> Result<PathBuf, ArtifactsError>;

pub fn write_case_trials_summary(
    run_dir: &Path,
    case_id: &str,
    trials_result: &CaseTrialsResult,
) -> Result<(), ArtifactsError>;
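A hedged sketch of how these writers might lay files out per the naming rules in section 4.6, using only std::fs and taking pre-serialized JSON for simplicity (the real write_case_trials_summary would serialize CaseTrialsResult itself and return ArtifactsError):

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Invariant 4.6: trial artifacts live at {run_dir}/{case_id}/trial-{trial_id}/,
/// with trial_id 1-indexed. Separate directories per trial mean no locking is needed.
fn trial_dir(run_dir: &Path, case_id: &str, trial_id: u32) -> PathBuf {
    run_dir.join(case_id).join(format!("trial-{trial_id}"))
}

/// Simplified writer: the aggregated result arrives as a pre-serialized JSON string.
fn write_case_trials_summary(
    run_dir: &Path,
    case_id: &str,
    aggregated_json: &str,
) -> io::Result<PathBuf> {
    let case_dir = run_dir.join(case_id);
    fs::create_dir_all(&case_dir)?;
    let path = case_dir.join("aggregated.json");
    fs::write(&path, aggregated_json)?;
    Ok(path)
}
```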

6. Error Codes

| Code | Category | Trigger |
| --- | --- | --- |
| EVAL_INVALID_TRIALS_CONFIG | Configuration | trials_per_case < 1 or > 1000 |
| EVAL_INVALID_THRESHOLD | Configuration | pass_threshold not in range [0.0, 1.0] |
| EVAL_PARALLEL_EXHAUSTION | Runtime | Semaphore acquisition fails or join_set panics |
| EVAL_TRIAL_ARTIFACTS_CORRUPT | Storage | Cannot write trial artifacts due to filesystem errors |
| EVAL_COST_WARNING | Advisory | trials_per_case * case_count >= 100 (configurable threshold) |
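The EVAL_COST_WARNING trigger reduces to a single comparison; a sketch with the threshold passed in as a parameter, since the spec calls it configurable (the inclusive >= 100 boundary matches acceptance criterion 7):

```rust
/// Advisory check: warn before running when the total trial volume is large.
/// Widening to u64 avoids overflow for large trial x case products.
fn should_warn_cost(trials_per_case: u32, total_cases: u32, warn_threshold: u64) -> bool {
    u64::from(trials_per_case) * u64::from(total_cases) >= warn_threshold
}
```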

7. Acceptance Criteria

  1. Backward Compatibility: Existing projects with no trials configuration MUST continue working with trials_per_case = 1
  2. Configuration Validation: Invalid trials_per_case or pass_threshold values MUST be rejected with clear error messages
  3. Deterministic Aggregation: Case with 3 passed trials out of 5 total and threshold 0.6 MUST result in CaseStatus::Passed
  4. Parallel Execution: Trials MUST execute concurrently when parallel > 1 is configured
  5. Artifact Structure: Each trial MUST write artifacts to {run_dir}/{case_id}/trial-{trial_id}/ directory
  6. CI Exit Semantics: --ci flag MUST cause non-zero exit when suite aggregate pass rate is below threshold
  7. Cost Warning: Commands MUST warn when trials_per_case * total_cases >= 100
  8. No Agent Changes: Implementation MUST NOT modify agent execution behavior or aikit-sdk integration
  9. Summary Extension: summary.json MUST include trial aggregation data while maintaining existing field compatibility

8. Functionality Comparison (Before vs After)

Before:

  • Single trial per case with binary pass/fail
  • Sequential execution only
  • No threshold-based CI integration
  • Basic summary.json with case-level results
  • No cost/token awareness

After:

  • Configurable trials per case (1-N) with aggregated pass rates
  • Optional parallel execution with bounded concurrency
  • CI-friendly --ci flag with threshold-based exit codes
  • Extended summary.json with trial breakdowns and pass rates
  • Cost warnings for large trial × case combinations
  • Backward compatible defaults (trials=1, threshold=1.0)

9. Benefit of Implementing This Functionality

This enhancement addresses a critical gap in FastSkill's evaluation system by enabling reliable assessment of stochastic agent behavior. Stochastic agents are increasingly common in production AI systems, and single-shot evaluations provide false confidence that can lead to deployment of unreliable agents. By implementing configurable trials with deterministic aggregation rules, FastSkill becomes suitable for CI/CD pipelines where merge gates require statistical confidence (e.g., "agent passes 80% of trials"). Optional parallelism cuts wall-clock evaluation time roughly by the concurrency factor — from linear in the trial count toward trial count divided by the parallelism limit — making larger trial counts practical. The cost awareness features prevent accidental expensive runs while maintaining development velocity for smaller test suites.

10. Stages

Stage 1: Core Configuration Extension

  • Deliverable: Extended EvalConfigToml and EvalConfig with trials, parallelism, and threshold fields
  • Scope: Modify src/core/manifest.rs (add new fields), src/eval/config.rs (extend resolution logic), add validation functions
  • Dependencies: None

Stage 2: Trial Result Types and Aggregation Logic

  • Deliverable: TrialResult, CaseTrialsResult types with deterministic aggregation rules
  • Scope: Extend src/eval/artifacts.rs with new types, implement aggregation functions, add serialization support
  • Dependencies: Stage 1 (requires extended configuration types)

Stage 3: Runner Extension for Multi-Trial Execution

  • Deliverable: run_case_trials() method with parallel execution and trial management
  • Scope: Extend src/eval/runner.rs with new trait method, implement in AikitEvalRunner, add semaphore-based concurrency control
  • Dependencies: Stage 2 (requires trial result types)

Stage 4: Artifact Storage for Trial Data

  • Deliverable: Trial-specific artifact writing with directory structure {run_dir}/{case_id}/trial-{trial_id}/
  • Scope: Extend src/eval/artifacts.rs with write_trial_artifacts(), write_case_trials_summary(), update directory allocation
  • Dependencies: Stage 2 (requires trial result types)

Stage 5: CLI Interface and Orchestration

  • Deliverable: Extended RunArgs with --trials, --ci, --threshold options and integration with multi-trial runner
  • Scope: Modify src/cli/commands/eval/run.rs to handle new arguments, orchestrate multi-trial execution, implement CI exit semantics
  • Dependencies: Stages 3-4 (requires runner extension and artifact storage)

Stage 6: Cost Awareness and Warnings

  • Deliverable: Token/cost estimation with warnings for large trial × case combinations
  • Scope: Add cost calculation logic to CLI orchestration, implement warning thresholds, extend help text
  • Dependencies: Stage 5 (requires CLI integration)

11. Breaking Changes

None. All changes are backward compatible:

  • Existing TOML configurations without trials fields use default values (trials_per_case = 1, pass_threshold = 1.0)
  • Existing CLI commands continue working without new flags
  • summary.json format is extended but maintains all existing fields
  • EvalRunner trait gains new methods but existing run_case() remains unchanged

12. Modified Files and Summary

| File/path | Summary of modifications |
| --- | --- |
| src/core/manifest.rs | Add trials_per_case, parallel, pass_threshold fields to EvalConfigToml with default functions |
| src/eval/config.rs | Extend EvalConfig struct and resolve_from_toml() function to handle new configuration fields |
| src/eval/artifacts.rs | Add TrialResult, CaseTrialsResult types; implement write_trial_artifacts(), write_case_trials_summary() functions |
| src/eval/runner.rs | Add run_case_trials() method to EvalRunner trait; implement parallel execution in AikitEvalRunner |
| src/cli/commands/eval/run.rs | Extend RunArgs with trials/CI options; modify execute_run_with_runner() for multi-trial orchestration |
| tests/cli/eval_tests.rs | Add integration tests for trials configuration, parallel execution, CI exit semantics |
| src/eval/config.rs (tests) | Extend existing config tests with trials field validation |

13. Dependencies

External Dependencies:

  • tokio::task::JoinSet (already in Cargo.toml) for parallel trial execution
  • tokio::sync::Semaphore (already in Cargo.toml) for bounded concurrency
  • num_cpus crate (new) for default parallelism detection

Internal Dependencies:

  • No blocking dependencies on other teams or services
  • Optional: Performance testing team consultation for large-scale trial validation (non-blocking, Stage 6)

If num_cpus dependency is skipped: Default parallelism falls back to hardcoded value (4 cores).
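As an alternative to the hardcoded fallback, the standard library's std::thread::available_parallelism (stable since Rust 1.59) could supply the default without the num_cpus dependency at all — a sketch of that option, not a decision recorded in this spec:

```rust
/// Default trial parallelism: query the OS for available cores, falling back
/// to the spec's hardcoded value of 4 if the query fails.
fn default_parallelism() -> u32 {
    std::thread::available_parallelism()
        .map(|n| n.get() as u32)
        .unwrap_or(4)
}
```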

14. Proposed Folder Structure

Modified paths only (no new directories):

  • Trial artifacts use existing run directory pattern: {run_dir}/{case_id}/trial-{trial_id}/
  • Aggregated results stored as: {run_dir}/{case_id}/aggregated.json
  • No changes to top-level project structure

15. Design Decisions

| # | Question | Decision | Rationale |
| --- | --- | --- | --- |
| 1 | How to handle partial trial failures? | Aggregate all completed trials, include failed/errored trials in pass rate calculation | Provides transparent reporting; users can set thresholds to handle expected failure rates |
| 2 | Should parallelism be per-case or per-trial? | Per-trial within each case | Maintains case isolation while maximizing concurrency; prevents resource exhaustion |
| 3 | What maximum trials limit to enforce? | 1000 trials per case | Prevents accidental expensive runs while allowing extensive testing scenarios |
| 4 | How to handle trial artifact naming conflicts? | Use trial-specific subdirectories with consistent naming | Avoids filesystem conflicts; enables easy trial comparison |
| 5 | Should pass threshold apply per-case or suite-wide? | Per-case with suite aggregation | Provides fine-grained control while maintaining intuitive suite-level semantics |

16. Out of Scope (Explicit)

  • Agent Execution Changes: No modifications to aikit-sdk integration or agent execution semantics
  • Check Definition Syntax: No changes to existing checks.toml format or check execution logic
  • Report Format Overhaul: No changes to existing CLI output formats beyond adding trial aggregation data
  • Resume/Checkpoint Support: No implementation of partial run resumption or trial checkpointing
  • Dynamic Trial Allocation: No support for adaptive trial counts based on variance or early stopping
  • Cross-Agent Comparison: No multi-agent trial comparison or ranking features
  • Custom Aggregation Rules: No support for aggregation functions beyond simple pass rate thresholds

17. References

Key Files to Start From:

  • src/eval/runner.rs:75-210 - AikitEvalRunner::run_case() implementation to extend for multi-trial support
  • src/eval/config.rs:55-84 - resolve_from_toml() function for configuration resolution logic
  • src/cli/commands/eval/run.rs:78-272 - execute_run_with_runner() orchestration function
  • src/eval/artifacts.rs:109-129 - write_case_artifacts() pattern to replicate for trial artifacts
  • src/core/manifest.rs:283-295 - EvalConfigToml struct to extend with new fields
  • tests/cli/eval_tests.rs:134-150 - Existing eval integration test patterns to follow for trials testing
