- 🎯 What is ProjectScylla?
- Core Concepts
- 🚀 Quick Start
- 📊 System Requirements
- Analysis Pipeline Architecture
- Development
- 🔧 Troubleshooting
- Publication Readiness
- 🤝 Contributing
ProjectScylla is a comprehensive testing framework for AI agent workflows that:
- 🔬 Measures agent performance under constrained conditions
- 📈 Analyzes results with rigorous statistical methods
- ⚖️ Optimizes agent decisions through trade-off evaluation
- 📋 Generates publication-ready reports, figures, and tables
Key Output: Publication-quality statistical reports with 34 figures and 11 tables from a single command.
"In Homer's Odyssey, Scylla represents one of the greatest challenges on the journey home — a monster that forced sailors to navigate perilous straits where every choice carried risk. ProjectScylla provides the same proving ground for AI agents."
```bash
# 1. Install prerequisites
curl -fsSL https://pixi.sh/install.sh | bash

# 2. Clone and setup
git clone https://github.com/HomericIntelligence/ProjectScylla.git
cd ProjectScylla

# 3. Run your first analysis
pixi run python --version  # Verify installation
pixi run python scripts/generate_all_results.py --data-dir ~/fullruns

# 4. View results (34 figures + 11 tables generated)
open results/analysis/figures/*.png      # macOS
xdg-open results/analysis/figures/*.png  # Linux
```

That's it! All outputs appear in the `results/analysis/` directory.
Compare Two Agent Configurations:

```bash
pixi run python scripts/generate_all_results.py \
  --data-dir ~/experiments/ \
  --output-dir comparison_results/ \
  --exclude test001-dryrun
```

Fast Development Mode (No Rendering):

```bash
# Quick iteration - generates Vega-Lite specs only
pixi run python scripts/generate_all_results.py \
  --data-dir ~/quick_test \
  --no-render \
  --skip-data  # Skip if CSVs already exist
```

Minimum Requirements:
- Python 3.10+
- 8GB RAM for full dataset analysis
- 2GB disk space for results
Typical Performance:
- Full analysis: 10-15 minutes (10,000 bootstrap samples)
- Figures only: 2-3 minutes
- Tables only: 1-2 minutes
Scale: Handles experiments with 1000+ runs efficiently
- ⚖️ Trade-Off Evaluation: Agents face scenarios where every decision has a cost, mirroring the Scylla-and-Charybdis dilemma
- 📊 Metrics & Benchmarks: Structured measurement across adaptability, efficiency, and reliability
- 🔄 Iterative Optimization: Continuous refinement through repeated trials
- 🧭 Ablation Benchmarking: Systematic evaluation of agent architectures across complexity tiers
Part of a 12-repository ecosystem:
| Repository | Role |
|---|---|
| AchaeanFleet | Container images for the agent mesh |
| Myrmidons | GitOps agent provisioning |
| Odysseus | CLI and core platform for agent lifecycle management |
| ProjectArgus | Observability — monitoring and metrics |
| ProjectHephaestus | Shared Python utilities and foundational tools |
| ProjectHermes | Webhook-to-NATS bridge — event ingestion |
| ProjectKeystone | DAG execution engine |
| ProjectMnemosyne | Skills marketplace — team knowledge sharing |
| ProjectOdyssey | Training and capability development for agents |
| ProjectProteus | CI/CD pipeline infrastructure |
| ProjectScylla | Testing, measurement, and optimization under constraints (this project) |
| ProjectTelemachy | Workflow engine |
Generate all outputs (data exports, figures, tables):

```bash
pixi run python scripts/generate_all_results.py \
  --data-dir ~/fullruns \
  --output-dir results/analysis
```

Key Options:
- `--data-dir` → Directory with experiment results (default: `~/fullruns`)
- `--output-dir` → Base output directory (default: `docs/`)
- `--no-render` → Skip PNG/PDF rendering (faster, Vega-Lite specs only)
- `--skip-data` / `--skip-figures` / `--skip-tables` → Generate specific components only
- `--exclude` → Filter experiments (e.g., `--exclude test001-dryrun`)
```bash
# Development mode - no rendering
pixi run python scripts/generate_all_results.py \
  --no-render \
  --exclude test001-dryrun test001-debug

# Regenerate tables only (assumes data/figures exist)
pixi run python scripts/generate_all_results.py \
  --skip-data --skip-figures
```

1. Export Data Only
```bash
pixi run python scripts/export_data.py \
  --data-dir ~/fullruns \
  --output-dir results/analysis/data
```

Outputs: runs.csv, judges.csv, criteria.csv, subtests.csv, summary.json, statistical_results.json
2. Generate Figures Only (34 figures × 5 formats)
```bash
pixi run python scripts/generate_figures.py \
  --data-dir ~/fullruns \
  --output-dir results/analysis/figures
```

Outputs: *.vl.json, *.csv, *.png (300 DPI), *.pdf, *_include.tex
3. Generate Tables Only (11 tables × 2 formats)
```bash
pixi run python scripts/generate_tables.py \
  --data-dir ~/fullruns \
  --output-dir results/analysis/tables
```

Outputs: *.md (human-readable), *.tex (LaTeX, booktabs formatted)
```text
results/analysis/
├── data/
│   ├── runs.csv                    # Per-run metrics
│   ├── judges.csv                  # Judge evaluations
│   ├── criteria.csv                # Criterion-level scores
│   ├── subtests.csv                # Subtest metadata
│   ├── summary.json                # Experiment summary
│   └── statistical_results.json    # Statistical analysis
├── figures/                        # 34 figures × 5 formats
│   ├── fig01_score_variance.*
│   ├── fig02_grade_distribution.*
│   └── ... (34 total)
└── tables/                         # 11 tables × 2 formats
    ├── table01_tier_summary.md
    ├── table01_tier_summary.tex
    └── ... (11 total)
```
LaTeX Integration:

```latex
\begin{figure}
  \centering
  \input{results/analysis/figures/fig04_pass_rate_by_tier_include.tex}
  \caption{Pass rate by tier with 95\% bootstrap confidence intervals.}
  \label{fig:pass-rate}
\end{figure}

\input{results/analysis/tables/table02_tier_comparison.tex}
```

Python/Jupyter:
```python
import pandas as pd
import json

# Load data
runs_df = pd.read_csv('results/analysis/data/runs.csv')
judges_df = pd.read_csv('results/analysis/data/judges.csv')

# Load statistical results
with open('results/analysis/data/statistical_results.json') as f:
    stats = json.load(f)
```

ProjectScylla provides comprehensive scripts for running, managing, and analyzing experiments.
Primary Experiment Runner:

```bash
# Run full experiment
pixi run python scripts/manage_experiment.py run --config config/test.yaml

# Run specific tiers
pixi run python scripts/manage_experiment.py run \
  --tiers-dir tests/fixtures/tests/test-001 \
  --tiers T0 T1 --runs 10 -v
```

Container-Based Execution:

```bash
./scripts/setup_api_key.sh
./scripts/run_experiment_in_container.sh \
  --tiers-dir tests/fixtures/tests/test-001 \
  --tiers T0 --runs 5 --verbose
```

```bash
# Re-run failed agents
pixi run python scripts/manage_experiment.py rerun-agents \
  ~/fullruns/test_experiment --tier T0 T1

# Re-run failed judges
pixi run python scripts/manage_experiment.py rerun-judges \
  ~/fullruns/test_experiment

# Regenerate all results
pixi run python scripts/manage_experiment.py regenerate \
  ~/fullruns/test_experiment

# Repair corrupt checkpoint
pixi run python scripts/manage_experiment.py repair \
  ~/fullruns/test_experiment/checkpoint.json
```

Rigorous non-parametric methods for bounded, ordinal, non-normal data:
- Bootstrap Confidence Intervals: BCa with 10,000 resamples
- Omnibus Testing: Kruskal-Wallis H test (controls FWER)
- Pairwise Comparisons: Mann-Whitney U + Holm-Bonferroni correction
- Effect Sizes: Cliff's delta with bootstrapped CIs
- Inter-Rater Reliability: Krippendorff's alpha for judge agreement
Configuration: src/scylla/analysis/config.yaml (all parameters externalized)
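The ideas behind two of these methods can be sketched in plain Python. This is an illustrative percentile bootstrap and Holm step-down procedure, not the pipeline's implementation (which uses BCa intervals; the real parameters live in `src/scylla/analysis/config.yaml`):

```python
import random
import statistics

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean.

    Illustrative only: the pipeline reports BCa intervals, which also
    correct for bias and skew in the bootstrap distribution.
    """
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

def holm_bonferroni(pvals, alpha=0.05):
    """Holm step-down: reject the smallest p-values while p <= alpha/(m - rank)."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    reject = [False] * len(pvals)
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (len(pvals) - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

scores = [0.62, 0.74, 0.58, 0.81, 0.69, 0.77, 0.71, 0.66, 0.73, 0.60]
low, high = bootstrap_ci(scores)
print(f"95% CI for mean score: ({low:.3f}, {high:.3f})")
print(holm_bonferroni([0.010, 0.040, 0.030]))  # [True, False, False]
```

Holm's procedure controls the family-wise error rate with more power than a flat Bonferroni correction, which is why it pairs well with many pairwise Mann-Whitney tests.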
Quality:
- Pass-Rate (functional test coverage)
- Implementation Rate (semantic satisfaction)
- Score (weighted rubric evaluation)
- Consistency (1 - Coefficient of Variation)
Economic:
- Cost-of-Pass (expected cost per success)
- Frontier CoP (minimum CoP across configs)
- Token Distribution (cost breakdown)
Process:
- Latency (query to resolution time)
- Judge Agreement (Krippendorff's alpha)
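As a concrete illustration of how a few of these metrics combine, here is a toy computation over hypothetical per-run records. The field names and exact formulas are assumptions for the sketch; the authoritative definitions live in the analysis modules:

```python
import statistics

# Hypothetical per-run records (not the pipeline's actual schema)
runs = [
    {"score": 0.82, "cost_usd": 0.10, "passed": True},
    {"score": 0.75, "cost_usd": 0.12, "passed": True},
    {"score": 0.40, "cost_usd": 0.09, "passed": False},
    {"score": 0.78, "cost_usd": 0.11, "passed": True},
]

scores = [r["score"] for r in runs]

# Pass-Rate: fraction of runs whose functional tests passed
pass_rate = sum(r["passed"] for r in runs) / len(runs)

# Consistency: 1 - coefficient of variation (stdev / mean) of scores
consistency = 1 - statistics.stdev(scores) / statistics.fmean(scores)

# Cost-of-Pass: expected cost per success = total cost / number of passes
cost_of_pass = sum(r["cost_usd"] for r in runs) / sum(r["passed"] for r in runs)

print(f"pass_rate={pass_rate:.2f} consistency={consistency:.3f} CoP=${cost_of_pass:.3f}")
```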
Expected structure:

```text
fullruns/{experiment_name}/{timestamp}/
├── config/experiment.json                  # Metadata
└── T0-T6/{subtest_id}/run_{01-10}/
    ├── run_result.json                     # Outcomes
    └── judge/judge_{01-03}/judgment.json   # Evaluations
```
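A small helper along these lines can walk that layout; it is a sketch under the structure shown above (error handling for malformed JSON omitted), not part of the ProjectScylla API:

```python
import json
from pathlib import Path

def iter_run_results(experiment_dir):
    """Yield (tier, subtest, run_name, parsed run_result.json) for each run."""
    root = Path(experiment_dir).expanduser()
    for path in sorted(root.glob("T*/*/run_*/run_result.json")):
        run_dir = path.parent
        tier = run_dir.parent.parent.name   # e.g. "T0"
        subtest = run_dir.parent.name       # subtest id
        with path.open() as f:
            yield tier, subtest, run_dir.name, json.load(f)

# Usage (hypothetical experiment path):
# for tier, subtest, run, result in iter_run_results("~/fullruns/exp/2026-01-01"):
#     print(tier, subtest, run, result.get("exit_code"))
```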
Required fields in run_result.json:
- `run_number` (integer)
- `exit_code` (0 = success)
- `judges` (list with grades & criteria)
Schema: src/scylla/analysis/schemas/run_result.schema.json
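For orientation, a minimal run_result.json consistent with the required fields might look like the sketch below. The shape of each `judges` entry is an assumption; consult the schema file above for the authoritative structure:

```json
{
  "run_number": 1,
  "exit_code": 0,
  "judges": [
    {
      "grade": "A",
      "criteria": [
        {"name": "correctness", "achieved": 0.9}
      ]
    }
  ]
}
```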
ProjectScylla has a comprehensive test suite covering all functionality. To see the current test count:
```bash
pixi run pytest tests/ --collect-only -q | tail -1
```

- Unit Tests: Analysis (incl. integration-style tests), adapters, config, executors, judges, metrics, reporting
- E2E Tests (1 file): Full pipeline validation
- Test Fixtures (47+ scenarios): Complete test cases with expected outputs
```bash
# All tests (comprehensive)
pixi run pytest tests/ --verbose

# Unit tests only (fastest)
pixi run pytest tests/unit/ -v

# Specific modules
pixi run pytest tests/unit/analysis/ -v
pixi run pytest tests/unit/adapters/ -v
pixi run pytest tests/unit/config/ -v

# Coverage analysis
pixi run pytest tests/ --cov=src/scylla --cov-report=html

# Specific test file
pixi run pytest tests/unit/analysis/test_stats.py -v
```

Code quality (linting + formatting):

```bash
pixi run ruff check src/scylla/
pixi run ruff format src/scylla/ --check
```

Git hooks enforce quality checks locally before code reaches CI. Install them once after cloning:
```bash
bash scripts/install_hooks.sh
```

| Hook | Trigger | What it does |
|---|---|---|
| pre-push | Every `git push` | Runs the full test suite with coverage; aborts the push if tests fail or coverage drops below the threshold in pyproject.toml |
The coverage threshold is read directly from pyproject.toml — update it there and the hook stays in sync automatically.
Hook source files live in `scripts/hooks/` and are version-controlled. See `scripts/README.md` for details.
New Figures:
- Create a module in `src/scylla/analysis/figures/`
- Implement a function following the existing pattern
- Register in `scripts/generate_figures.py`
- Add tests in `tests/unit/analysis/test_figures.py`
New Tables:
- Add a function to a module in `src/scylla/analysis/tables/`
- Register in `scripts/generate_tables.py`
- Add tests in `tests/unit/analysis/test_tables.py`
```bash
# Linting
pixi run ruff check src/scylla/analysis/

# Auto-fix and format
pixi run ruff check --fix src/scylla/analysis/
pixi run ruff format src/scylla/analysis/
```

| Symptom | Solution |
|---|---|
| `Schema validation failed: 'N/A' does not match` | Ensure grades are S, A, B, C, D, or F only |
| `[Errno 2] No such file or directory` | Run: `find ~/fullruns -name "run_result.json"` |
| `TypeError: unsupported operand` | Fix type coercion in `criterion.achieved` values |
| Empty outputs | Check: ≥2 experiments, ≥1 completed run each |
| Slow performance | Use the `--no-render` flag for faster iteration |
1. Data Validation Errors

```text
Schema validation failed: 'N/A' does not match '^[SABCDF]$'
```

Fix: Review the problematic runs and ensure grades are valid (S/A/B/C/D/F), or update the schema.

2. Missing Files

```text
Failed to load: [Errno 2] No such file or directory
```

Fix: Incomplete runs are skipped with warnings. Investigate with:

```bash
find ~/fullruns -name "run_*" -type d -exec sh -c 'test -f "$1/run_result.json" || echo "Missing: $1"' _ {} \;
```

3. Type Errors

```text
TypeError: unsupported operand type(s) for +: 'float' and 'str'
```

Fix: Some `criterion.achieved` values are strings. Fix this in data generation or add coercion.
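A defensive coercion helper along these lines can normalize the data before analysis. The function name, the true/false mapping, and the grade check are assumptions for illustration, not part of the pipeline:

```python
import re

# Grade pattern taken from the schema error message above
GRADE_RE = re.compile(r"^[SABCDF]$")

def coerce_achieved(value):
    """Best-effort coercion of criterion.achieved to float (hypothetical helper)."""
    if isinstance(value, bool):
        return float(value)
    if isinstance(value, str):
        text = value.strip().lower()
        if text in {"true", "yes"}:
            return 1.0
        if text in {"false", "no"}:
            return 0.0
        return float(text)  # raises ValueError for non-numeric strings
    return float(value)

print(coerce_achieved("0.5"), coerce_achieved("true"), bool(GRADE_RE.match("A")))
```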
- Documentation: `docs/research.md` for methodology
- Examples: `tests/unit/analysis/` for usage patterns
- Issues: GitHub Issues
- Support: Create an issue with the error message and steps to reproduce
✅ Rigorous non-parametric statistics (Kruskal-Wallis, Mann-Whitney U, Cliff's delta)
✅ Multiple comparison correction (Holm-Bonferroni throughout)
✅ Bootstrap confidence intervals (BCa, 10K resamples, seed=42)
✅ Effect sizes with confidence intervals
✅ 300 DPI publication-quality figures
✅ LaTeX-ready tables with booktabs formatting
✅ Reproducible configuration (all parameters in config.yaml)
✅ Comprehensive test suite
✅ Documented methodology with citations
See docs/research.md for complete research methodology and metric definitions.
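For reference, a config fragment with the parameters cited above might look like the sketch below. The key names are illustrative; consult `src/scylla/analysis/config.yaml` for the actual schema:

```yaml
bootstrap:
  n_resamples: 10000       # BCa bootstrap resamples
  seed: 42                 # fixed seed for reproducibility
  confidence_level: 0.95
multiple_comparisons:
  method: holm-bonferroni
  alpha: 0.05
```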
Required packages for document compilation:

```latex
\documentclass{article}
\usepackage{booktabs}        % Professional tables
\usepackage{longtable}       % Multi-page tables
\usepackage{threeparttable}  % Table notes
\usepackage{graphicx}        % Figure inclusion
\usepackage{amsmath}         % Statistical symbols

\begin{document}
% Your content here
\end{document}
```

We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines on:
- Development setup and environment configuration
- Git workflow and branch management
- Code quality standards and testing requirements
- Pull request and code review process
- Issue reporting guidelines
Quick Start for Contributors:
- Fork the repository and clone locally
- Copy `.env.example` to `.env` and configure API keys
- Install dependencies: `curl -fsSL https://pixi.sh/install.sh | bash`
- Install git hooks: `bash scripts/install_hooks.sh`
- Run tests: `pixi run pytest tests/ -v`
- Check CONTRIBUTING.md for the detailed workflow
Areas for contribution:
- Additional statistical methods and metrics
- New visualization types and formats
- Performance optimizations
- Documentation improvements
- Bug fixes and feature requests
Visit our GitHub Repository to get started.
```bibtex
@software{projectscylla2026,
  title  = {ProjectScylla: A Testing and Optimization Framework for Agentic Workflows},
  author = {Micah Villmow},
  year   = {2026},
  url    = {https://github.com/HomericIntelligence/ProjectScylla}
}
```