Evaluation Trials and Parallelism Enhancement
1. Problem Statement
The current evaluation system in FastSkill runs only one trial per case per eval run execution, making it unsuitable for reliably assessing stochastic agent behavior. Each case execution through AikitEvalRunner::run_case() in src/eval/runner.rs:75-210 produces a single CaseResult with binary pass/fail status. A single lucky or unlucky run can therefore misrepresent typical agent performance and create false confidence. The system also lacks the configurable pass-rate thresholds and parallelism options needed for CI/CD integration, where merges are gated on an explicit pass rate (e.g., "80% of trials must pass"). Out of scope: changes to agent execution semantics, check definition syntax, or evaluation report formats beyond adding trial aggregation data.
2. Goals
- Support configurable trials per case with deterministic aggregation rules for pass/fail determination
- Enable optional parallelism for trial execution to reduce total evaluation time
- Provide CI-friendly threshold-based exit semantics with configurable pass rates (0.0-1.0)
- Maintain backward compatibility with existing single-trial configurations
- Implement cost/token awareness with warnings for large trial × case combinations
- Extend artifact storage to handle multiple trials per case with summary aggregation
3. Current Behavior
| Component | Role | Mutable at runtime? |
|---|---|---|
| AikitEvalRunner | Executes single case via aikit-sdk, produces one CaseResult | No |
| EvalConfig | Holds prompts path, checks path, timeout, fail-on-missing settings | No |
| CaseRunOptions | Runtime parameters: agent, model, project root, timeout | No |
| SummaryResult | Aggregates case results with binary suite pass/fail | No |
| allocate_run_dir() | Creates timestamped run directory for artifacts | No |
Current run flow: CLI parses args → load EvalConfig from TOML → load EvalSuite from CSV → execute cases sequentially via AikitEvalRunner::run_case() → write artifacts per case → write summary.json. Resume flow: not supported; each run is independent and allocates a new run directory.
4. Design
4.1 Data Structures
Extended EvalConfigToml (in src/core/manifest.rs):
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvalConfigToml {
    pub prompts: PathBuf,
    pub checks: Option<PathBuf>,
    pub timeout_seconds: u64,
    pub fail_on_missing_agent: bool,
    // New fields:
    #[serde(default = "default_trials_per_case")]
    pub trials_per_case: u32,
    #[serde(default)]
    pub parallel: Option<u32>,
    #[serde(default = "default_pass_threshold")]
    pub pass_threshold: f64,
}

fn default_trials_per_case() -> u32 { 1 }
fn default_pass_threshold() -> f64 { 1.0 }
Trial Result Types (new in src/eval/artifacts.rs):
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TrialResult {
    pub trial_id: u32,
    pub status: CaseStatus,
    pub command_count: Option<usize>,
    pub input_tokens: Option<u64>,
    pub output_tokens: Option<u64>,
    pub check_results: Vec<CheckResult>,
    pub error_message: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CaseTrialsResult {
    pub id: String,
    pub trials: Vec<TrialResult>,
    pub aggregated_status: CaseStatus,
    pub pass_count: u32,
    pub total_trials: u32,
    pub pass_rate: f64,
}
4.2 Concurrency Model
Parallelism Strategy: Spawn trials on a Tokio JoinSet and bound the number running at once with a Semaphore. The limit defaults to the number of CPU cores and is configurable via the parallel field. Each trial executes independently in spawn_blocking (existing aikit pattern) with no shared state beyond read-only configuration.
Locking Strategy: No explicit locking required. Each trial writes to separate artifact paths using trial ID suffix (case-id/trial-N/). Aggregation happens after all trials complete.
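The bounding and post-hoc ordering described above can be illustrated in isolation. The following is a minimal sketch that deliberately uses std scoped threads instead of Tokio's JoinSet and Semaphore, so it runs without external crates; run_trials_bounded and the run_trial closure are hypothetical names, and the closure merely stands in for one agent execution.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Mutex;
use std::thread;

/// Runs `trial_count` trials with at most `max_parallel` in flight at once.
/// `run_trial` is a hypothetical stand-in for one agent execution: it receives
/// the 1-indexed trial id and returns whether that trial passed.
pub fn run_trials_bounded<F>(trial_count: u32, max_parallel: u32, run_trial: F) -> Vec<(u32, bool)>
where
    F: Fn(u32) -> bool + Sync,
{
    // Always start at least one worker, and never more workers than trials.
    let workers = max_parallel.clamp(1, trial_count.max(1));
    let next_id = AtomicU32::new(1);
    let results = Mutex::new(Vec::with_capacity(trial_count as usize));

    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| loop {
                // Claim the next unclaimed trial id; stop once none remain.
                let trial_id = next_id.fetch_add(1, Ordering::Relaxed);
                if trial_id > trial_count {
                    break;
                }
                let passed = run_trial(trial_id);
                results.lock().unwrap().push((trial_id, passed));
            });
        }
    });

    // Completion order is nondeterministic, so sort before aggregating.
    let mut results = results.into_inner().unwrap();
    results.sort_by_key(|(trial_id, _)| *trial_id);
    results
}
```

Calling run_trials_bounded(5, 2, |id| id % 2 == 1) runs five trials with at most two at a time and returns them ordered by trial id, regardless of which finished first. The only shared mutable state is the result list, which is touched only after a trial has finished.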
4.3 API Surface
New CLI Options (in src/cli/commands/eval/run.rs):
#[derive(Debug, Args)]
pub struct RunArgs {
    // Existing fields...

    /// Override trials per case from config
    #[arg(long)]
    pub trials: Option<u32>,

    /// Enable CI mode: exit non-zero if pass rate below threshold
    #[arg(long)]
    pub ci: bool,

    /// Override pass threshold (0.0-1.0)
    #[arg(long)]
    pub threshold: Option<f64>,
}
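
How the optional CLI flags combine with the TOML values can be sketched as follows. This is an illustration only: TomlSettings, CliOverrides, and resolve_settings are hypothetical names that mirror just the two overridable fields, not the actual EvalConfig resolution code.

```rust
/// Hypothetical subset of the values loaded from TOML.
#[derive(Debug, Clone, PartialEq)]
pub struct TomlSettings {
    pub trials_per_case: u32,
    pub pass_threshold: f64,
}

/// Hypothetical subset of the parsed CLI flags; `None` means "flag not given".
#[derive(Debug, Clone, Default)]
pub struct CliOverrides {
    pub trials: Option<u32>,
    pub threshold: Option<f64>,
}

/// A CLI value wins when present; otherwise the TOML value (which already
/// carries the serde default) is used.
pub fn resolve_settings(toml: &TomlSettings, cli: &CliOverrides) -> TomlSettings {
    TomlSettings {
        trials_per_case: cli.trials.unwrap_or(toml.trials_per_case),
        pass_threshold: cli.threshold.unwrap_or(toml.pass_threshold),
    }
}
```

With TOML values of 5 trials and a 0.8 threshold, passing only --trials 10 would yield 10 trials at the unchanged 0.8 threshold. Validation would presumably run on the resolved values, so an out-of-range CLI override is rejected the same way as an out-of-range TOML value.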
4.4 Integration Points
EvalRunner Trait Extension:
#[async_trait]
pub trait EvalRunner: Send + Sync {
    /// Run multiple trials for one case
    async fn run_case_trials(
        &self,
        case: &EvalCase,
        opts: &CaseRunOptions,
        checks: &[CheckDefinition],
        trial_count: u32,
        max_parallelism: Option<u32>,
    ) -> CaseTrialsResult;
}
4.5 Invariants
- Deterministic Aggregation: Case passes if pass_count / total_trials >= pass_threshold
- Trial Independence: Each trial MUST execute in isolation with no shared mutable state
- Artifact Consistency: Trial artifacts MUST be stored in separate subdirectories: run-dir/case-id/trial-N/
- Configuration Inheritance: CLI args override TOML config; unspecified values use TOML defaults
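
The aggregation invariant can be written down as a small pure function. This is a sketch under stated assumptions: Status is a hypothetical two-variant stand-in for CaseStatus, and the zero-trials guard is an assumption added here (the spec rejects trials_per_case < 1 at configuration time instead).

```rust
/// Hypothetical stand-in for the two CaseStatus variants relevant here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Passed,
    Failed,
}

/// Returns (pass_rate, aggregated status) for one case.
/// Errored or timed-out trials simply do not count toward `pass_count`.
pub fn aggregate(pass_count: u32, total_trials: u32, pass_threshold: f64) -> (f64, Status) {
    // Assumption: zero trials is treated as a failure rather than dividing by zero.
    if total_trials == 0 {
        return (0.0, Status::Failed);
    }
    let pass_rate = f64::from(pass_count) / f64::from(total_trials);
    let status = if pass_rate >= pass_threshold {
        Status::Passed
    } else {
        Status::Failed
    };
    (pass_rate, status)
}
```

For example, 3 passes out of 5 trials at a threshold of 0.6 aggregates to Passed, while 2 out of 5 at the same threshold aggregates to Failed. With the defaults (one trial, threshold 1.0) the function reduces to the current binary pass/fail behavior.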
4.6 Identity and Naming
- Trial Directories: {run_dir}/{case_id}/trial-{trial_id}/ where trial_id is 1-indexed
- Aggregation Files: {run_dir}/{case_id}/aggregated.json contains CaseTrialsResult
- Summary Format: Existing summary.json structure extended with trial-aware fields
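
The naming scheme maps directly onto path construction. A minimal sketch, assuming two hypothetical helpers (trial_dir and aggregated_path) that are not part of the existing artifacts module:

```rust
use std::path::{Path, PathBuf};

/// Directory for one trial's artifacts: {run_dir}/{case_id}/trial-{trial_id}/
/// `trial_id` is 1-indexed, matching the naming rule above.
pub fn trial_dir(run_dir: &Path, case_id: &str, trial_id: u32) -> PathBuf {
    run_dir.join(case_id).join(format!("trial-{trial_id}"))
}

/// File holding the serialized CaseTrialsResult: {run_dir}/{case_id}/aggregated.json
pub fn aggregated_path(run_dir: &Path, case_id: &str) -> PathBuf {
    run_dir.join(case_id).join("aggregated.json")
}
```

Because every trial resolves to a distinct directory, concurrent trials never write to the same path, which is what lets the design avoid explicit locking.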
4.7 Error Conditions
- Invalid Configuration: trials_per_case < 1 or > 1000, or pass_threshold not in [0.0, 1.0]
- Resource Exhaustion: Parallel execution fails due to system limits
- Partial Trial Failure: Some trials time out or error while others succeed
4.8 Configuration Example
[tool.fastskill.eval]
prompts = "evals/prompts.csv"
checks = "evals/checks.toml"
timeout_seconds = 300
trials_per_case = 5
parallel = 4
pass_threshold = 0.8
fail_on_missing_agent = true
5. Schema and API
Extended Configuration Types:
// In src/eval/config.rs
#[derive(Debug, Clone)]
pub struct EvalConfig {
    pub prompts_path: PathBuf,
    pub checks_path: Option<PathBuf>,
    pub timeout_seconds: u64,
    pub fail_on_missing_agent: bool,
    pub project_root: PathBuf,
    // New fields:
    pub trials_per_case: u32,
    pub parallel: Option<u32>,
    pub pass_threshold: f64,
}
New Runner Methods:
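// In src/eval/runner.rs
use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::task::JoinSet;

impl AikitEvalRunner {
    pub async fn run_case_trials(
        &self,
        case: &EvalCase,
        opts: &CaseRunOptions,
        checks: &[CheckDefinition],
        trial_count: u32,
        max_parallelism: Option<u32>,
    ) -> CaseTrialsResult {
        // Semaphore::new takes a usize. Fall back to the CPU count when unset,
        // and never allow 0 permits, which would block every trial forever.
        let limit = max_parallelism
            .map(|p| p as usize)
            .unwrap_or_else(num_cpus::get)
            .max(1);
        let semaphore = Arc::new(Semaphore::new(limit));
        let mut join_set = JoinSet::new();

        for trial_id in 1..=trial_count {
            let semaphore = Arc::clone(&semaphore);
            // Spawned tasks must be 'static, so each one gets owned copies.
            // Assumes AikitEvalRunner is cheap to clone (or is held in an Arc).
            let runner = self.clone();
            let case_clone = case.clone();
            let opts_clone = opts.clone();
            let checks_clone = checks.to_vec();
            join_set.spawn(async move {
                // The semaphore is never closed, so acquisition cannot fail here.
                let _permit = semaphore.acquire_owned().await.expect("semaphore closed");
                // Output and trace are persisted via write_trial_artifacts() (omitted here).
                let (_output, result, _trace) = runner
                    .run_case_inner(&case_clone, &opts_clone, &checks_clone)
                    .await;
                TrialResult {
                    trial_id,
                    status: result.status,
                    command_count: result.command_count,
                    input_tokens: result.input_tokens,
                    output_tokens: result.output_tokens,
                    check_results: result.check_results,
                    error_message: result.error_message,
                }
            });
        }

        let mut trials = Vec::with_capacity(trial_count as usize);
        while let Some(joined) = join_set.join_next().await {
            // A panicked or cancelled task surfaces as a JoinError; production code
            // should map it to EVAL_PARALLEL_EXHAUSTION instead of panicking.
            trials.push(joined.expect("trial task panicked"));
        }
        // Tasks finish in arbitrary order; restore trial order before reporting.
        trials.sort_by_key(|t| t.trial_id);

        let pass_count = trials.iter().filter(|t| t.status == CaseStatus::Passed).count() as u32;
        let total_trials = trial_count; // validated to be >= 1 at config time
        let pass_rate = f64::from(pass_count) / f64::from(total_trials);
        let threshold = opts.pass_threshold; // Added to CaseRunOptions

        CaseTrialsResult {
            id: case.id.clone(),
            trials,
            aggregated_status: if pass_rate >= threshold { CaseStatus::Passed } else { CaseStatus::Failed },
            pass_count,
            total_trials,
            pass_rate,
        }
    }
}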
Artifact Storage Extensions:
// In src/eval/artifacts.rs
pub fn write_trial_artifacts(
    run_dir: &Path,
    case_id: &str,
    trial_id: u32,
    stdout: &[u8],
    stderr: &[u8],
    trace_jsonl: &str,
    result: &TrialResult,
) -> Result<PathBuf, ArtifactsError>;

pub fn write_case_trials_summary(
    run_dir: &Path,
    case_id: &str,
    trials_result: &CaseTrialsResult,
) -> Result<(), ArtifactsError>;
6. Error Codes
| Code | Category | Trigger |
|---|---|---|
| EVAL_INVALID_TRIALS_CONFIG | Configuration | trials_per_case < 1 or > 1000 |
| EVAL_INVALID_THRESHOLD | Configuration | pass_threshold not in range [0.0, 1.0] |
| EVAL_PARALLEL_EXHAUSTION | Runtime | Semaphore acquisition fails or join_set panics |
| EVAL_TRIAL_ARTIFACTS_CORRUPT | Storage | Cannot write trial artifacts due to filesystem errors |
| EVAL_COST_WARNING | Advisory | trials_per_case * case_count > 100 (configurable threshold) |
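
The two configuration error codes correspond to simple range checks. The sketch below is one possible shape, with ConfigError, MAX_TRIALS_PER_CASE, and validate_trial_settings all being hypothetical names chosen for illustration:

```rust
/// Hypothetical error type covering the two configuration error codes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfigError {
    InvalidTrialsConfig,
    InvalidThreshold,
}

impl ConfigError {
    /// The stable error code string from the table above.
    pub fn code(self) -> &'static str {
        match self {
            ConfigError::InvalidTrialsConfig => "EVAL_INVALID_TRIALS_CONFIG",
            ConfigError::InvalidThreshold => "EVAL_INVALID_THRESHOLD",
        }
    }
}

/// Upper bound on trials per case, as stated in the trigger column.
pub const MAX_TRIALS_PER_CASE: u32 = 1000;

pub fn validate_trial_settings(trials_per_case: u32, pass_threshold: f64) -> Result<(), ConfigError> {
    if !(1..=MAX_TRIALS_PER_CASE).contains(&trials_per_case) {
        return Err(ConfigError::InvalidTrialsConfig);
    }
    // NaN is not contained in any range, so it is rejected here as well.
    if !(0.0..=1.0).contains(&pass_threshold) {
        return Err(ConfigError::InvalidThreshold);
    }
    Ok(())
}
```

Both bounds are inclusive, so 1 and 1000 trials, and thresholds of exactly 0.0 and 1.0, are accepted.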
7. Acceptance Criteria
- Backward Compatibility: Existing projects with no trials configuration MUST continue working with trials_per_case = 1
- Configuration Validation: Invalid trials_per_case or pass_threshold values MUST be rejected with clear error messages
- Deterministic Aggregation: A case with 3 passed trials out of 5 total and threshold 0.6 MUST result in CaseStatus::Passed
- Parallel Execution: Trials MUST execute concurrently when parallel > 1 is configured
- Artifact Structure: Each trial MUST write artifacts to the {run_dir}/{case_id}/trial-{trial_id}/ directory
- CI Exit Semantics: The --ci flag MUST cause a non-zero exit when the suite aggregate pass rate is below threshold
- Cost Warning: Commands MUST warn when trials_per_case * total_cases > 100 (the configurable cost-warning threshold)
- No Agent Changes: Implementation MUST NOT modify agent execution behavior or aikit-sdk integration
- Summary Extension: summary.json MUST include trial aggregation data while maintaining existing field compatibility
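
The cost-warning criterion reduces to one multiplication and one comparison. The following sketch assumes the strict comparison used in the error-code table (EVAL_COST_WARNING fires when the product exceeds the threshold); cost_warning and DEFAULT_COST_WARNING_THRESHOLD are hypothetical names, and the message wording is illustrative only.

```rust
/// Default threshold from the error-code table; configurable in the real design.
pub const DEFAULT_COST_WARNING_THRESHOLD: u64 = 100;

/// Returns a warning message when the planned number of agent executions
/// (trials per case times number of cases) exceeds `threshold`.
pub fn cost_warning(trials_per_case: u32, case_count: usize, threshold: u64) -> Option<String> {
    // Widen before multiplying and saturate, so huge suites cannot overflow.
    let total_runs = u64::from(trials_per_case).saturating_mul(case_count as u64);
    if total_runs > threshold {
        Some(format!(
            "EVAL_COST_WARNING: {total_runs} agent runs planned \
             ({trials_per_case} trials x {case_count} cases), threshold is {threshold}"
        ))
    } else {
        None
    }
}
```

With 5 trials per case, a 20-case suite plans exactly 100 runs and stays silent under the strict comparison, while a 21-case suite plans 105 runs and triggers the warning.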
8. Functionality Comparison (Before vs After)
Before:
- Single trial per case with binary pass/fail
- Sequential execution only
- No threshold-based CI integration
- Basic summary.json with case-level results
- No cost/token awareness
After:
- Configurable trials per case (1-N) with aggregated pass rates
- Optional parallel execution with bounded concurrency
- CI-friendly --ci flag with threshold-based exit codes
- Extended summary.json with trial breakdowns and pass rates
- Cost warnings for large trial × case combinations
- Backward compatible defaults (trials=1, threshold=1.0)
9. Benefit of Implementing This Functionality
This enhancement addresses a critical gap in FastSkill's evaluation system by enabling reliable assessment of stochastic agent behavior. Stochastic agents are increasingly common in production AI systems, and single-shot evaluations provide false confidence that can lead to deployment of unreliable agents. By implementing configurable trials with deterministic aggregation rules, FastSkill becomes suitable for CI/CD pipelines where merge gates require statistical confidence (e.g., "agent passes 80% of trials"). Optional parallelism cuts wall-clock evaluation time by up to a factor of the configured concurrency limit: N trials at parallelism P take roughly ceil(N / P) sequential rounds instead of N, which makes larger trial counts practical. Total token cost is unchanged by parallelism. The cost awareness features prevent accidental expensive runs while maintaining development velocity for smaller test suites.
10. Stages
Stage 1: Core Configuration Extension
- Deliverable: Extended EvalConfigToml and EvalConfig with trials, parallelism, and threshold fields
- Scope: Modify src/core/manifest.rs (add new fields), src/eval/config.rs (extend resolution logic), add validation functions
- Dependencies: None
Stage 2: Trial Result Types and Aggregation Logic
- Deliverable: TrialResult, CaseTrialsResult types with deterministic aggregation rules
- Scope: Extend src/eval/artifacts.rs with new types, implement aggregation functions, add serialization support
- Dependencies: Stage 1 (requires extended configuration types)
Stage 3: Runner Extension for Multi-Trial Execution
- Deliverable: run_case_trials() method with parallel execution and trial management
- Scope: Extend src/eval/runner.rs with new trait method, implement in AikitEvalRunner, add semaphore-based concurrency control
- Dependencies: Stage 2 (requires trial result types)
Stage 4: Artifact Storage for Trial Data
- Deliverable: Trial-specific artifact writing with directory structure {run_dir}/{case_id}/trial-{trial_id}/
- Scope: Extend src/eval/artifacts.rs with write_trial_artifacts(), write_case_trials_summary(), update directory allocation
- Dependencies: Stage 2 (requires trial result types)
Stage 5: CLI Interface and Orchestration
- Deliverable: Extended RunArgs with --trials, --ci, --threshold options and integration with multi-trial runner
- Scope: Modify src/cli/commands/eval/run.rs to handle new arguments, orchestrate multi-trial execution, implement CI exit semantics
- Dependencies: Stages 3-4 (requires runner extension and artifact storage)
Stage 6: Cost Awareness and Warnings
- Deliverable: Token/cost estimation with warnings for large trial × case combinations
- Scope: Add cost calculation logic to CLI orchestration, implement warning thresholds, extend help text
- Dependencies: Stage 5 (requires CLI integration)
11. Breaking Changes
None. All changes are backward compatible:
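- Existing TOML configurations without trials fields use default values (trials_per_case = 1, pass_threshold = 1.0)
- Existing CLI commands continue working without new flags
- summary.json format is extended but maintains all existing fields
- EvalRunner trait gains a new method, run_case_trials(), while the existing run_case() remains unchanged. For any EvalRunner implementor other than AikitEvalRunner, this stays non-breaking only if run_case_trials() ships with a default implementation.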
12. Modified Files and Summary
| File/path | Summary of modifications |
|---|---|
| src/core/manifest.rs | Add trials_per_case, parallel, pass_threshold fields to EvalConfigToml with default functions |
| src/eval/config.rs | Extend EvalConfig struct and resolve_from_toml() function to handle new configuration fields |
| src/eval/artifacts.rs | Add TrialResult, CaseTrialsResult types; implement write_trial_artifacts(), write_case_trials_summary() functions |
| src/eval/runner.rs | Add run_case_trials() method to EvalRunner trait; implement parallel execution in AikitEvalRunner |
| src/cli/commands/eval/run.rs | Extend RunArgs with trials/CI options; modify execute_run_with_runner() for multi-trial orchestration |
| tests/cli/eval_tests.rs | Add integration tests for trials configuration, parallel execution, CI exit semantics |
| src/eval/config.rs (tests) | Extend existing config tests with trials field validation |
13. Dependencies
External Dependencies:
- tokio::task::JoinSet (already in Cargo.toml) for parallel trial execution
- tokio::sync::Semaphore (already in Cargo.toml) for bounded concurrency
- num_cpus crate (new) for default parallelism detection
Internal Dependencies:
- No blocking dependencies on other teams or services
- Optional: Performance testing team consultation for large-scale trial validation (non-blocking, Stage 6)
If num_cpus dependency is skipped: Default parallelism falls back to hardcoded value (4 cores).
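
If the num_cpus dependency is skipped, the standard library can still supply a core count before resorting to the hardcoded value. This is a sketch of that fallback chain, assuming hypothetical names (resolve_parallelism, FALLBACK_PARALLELISM) and assuming that a configured value of 0 is treated the same as "not configured":

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Hardcoded last-resort value named in the text above.
pub const FALLBACK_PARALLELISM: usize = 4;

/// Resolves the effective concurrency limit for trial execution.
pub fn resolve_parallelism(configured: Option<u32>) -> usize {
    match configured {
        // An explicit positive setting always wins.
        Some(n) if n > 0 => n as usize,
        // Assumption: `parallel = 0` is treated as unset, since a limit of
        // zero would never let any trial start.
        _ => thread::available_parallelism()
            .map(NonZeroUsize::get)
            .unwrap_or(FALLBACK_PARALLELISM),
    }
}
```

std::thread::available_parallelism() can return an error on platforms where the value cannot be determined, which is the only case in which the hardcoded 4 is used here.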
14. Proposed Folder Structure
Modified paths only (no new directories):
- Trial artifacts use existing run directory pattern: {run_dir}/{case_id}/trial-{trial_id}/
- Aggregated results stored as: {run_dir}/{case_id}/aggregated.json
- No changes to top-level project structure
15. Design Decisions
| # | Question | Decision | Rationale |
|---|---|---|---|
| 1 | How to handle partial trial failures? | Aggregate all completed trials, include failed/errored trials in pass rate calculation | Provides transparent reporting; users can set thresholds to handle expected failure rates |
| 2 | Should parallelism be per-case or per-trial? | Per-trial within each case | Maintains case isolation while maximizing concurrency; prevents resource exhaustion |
| 3 | What maximum trials limit to enforce? | 1000 trials per case | Prevents accidental expensive runs while allowing extensive testing scenarios |
| 4 | How to handle trial artifact naming conflicts? | Use trial-specific subdirectories with consistent naming | Avoids filesystem conflicts; enables easy trial comparison |
| 5 | Should pass threshold apply per-case or suite-wide? | Per-case with suite aggregation | Provides fine-grained control while maintaining intuitive suite-level semantics |
16. Out of Scope (Explicit)
- Agent Execution Changes: No modifications to aikit-sdk integration or agent execution semantics
- Check Definition Syntax: No changes to existing checks.toml format or check execution logic
- Report Format Overhaul: No changes to existing CLI output formats beyond adding trial aggregation data
- Resume/Checkpoint Support: No implementation of partial run resumption or trial checkpointing
- Dynamic Trial Allocation: No support for adaptive trial counts based on variance or early stopping
- Cross-Agent Comparison: No multi-agent trial comparison or ranking features
- Custom Aggregation Rules: No support for aggregation functions beyond simple pass rate thresholds
17. References
Key Files to Start From:
- src/eval/runner.rs:75-210 - AikitEvalRunner::run_case() implementation to extend for multi-trial support
- src/eval/config.rs:55-84 - resolve_from_toml() function for configuration resolution logic
- src/cli/commands/eval/run.rs:78-272 - execute_run_with_runner() orchestration function
- src/eval/artifacts.rs:109-129 - write_case_artifacts() pattern to replicate for trial artifacts
- src/core/manifest.rs:283-295 - EvalConfigToml struct to extend with new fields
- tests/cli/eval_tests.rs:134-150 - Existing eval integration test patterns to follow for trials testing