ProjectScylla

🎯 What is ProjectScylla?

ProjectScylla is a comprehensive testing framework for AI agent workflows that:

  • 🔬 Measures agent performance under constrained conditions
  • 📈 Analyzes results with rigorous statistical methods
  • ⚖️ Optimizes agent decisions through trade-off evaluation
  • 📋 Generates publication-ready reports, figures, and tables

Key Output: Publication-quality statistical reports with 34 figures and 11 tables from a single command.

"In Homer's Odyssey, Scylla represents one of the greatest challenges on the journey home — a monster that forced sailors to navigate perilous straits where every choice carried risk. ProjectScylla provides the same proving ground for AI agents."

Quick Start Guide

🚀 5-Minute Setup

# 1. Install prerequisites
curl -fsSL https://pixi.sh/install.sh | bash

# 2. Clone and setup
git clone https://github.com/HomericIntelligence/ProjectScylla.git
cd ProjectScylla

# 3. Run your first analysis
pixi run python --version  # Verify installation
pixi run python scripts/generate_all_results.py --data-dir ~/fullruns

# 4. View results (34 figures + 11 tables generated)
open results/analysis/figures/*.png  # macOS
xdg-open results/analysis/figures/*.png  # Linux

That's it! All outputs appear in the results/analysis/ directory.

💡 Usage Examples

Compare Two Agent Configurations:

pixi run python scripts/generate_all_results.py \
  --data-dir ~/experiments/ \
  --output-dir comparison_results/ \
  --exclude test001-dryrun

Fast Development Mode (No Rendering):

# Quick iteration - generates Vega-Lite specs only
pixi run python scripts/generate_all_results.py \
  --data-dir ~/quick_test \
  --no-render \
  --skip-data  # Skip if CSVs already exist

📊 System Requirements

Minimum Requirements:

  • Python 3.10+
  • 8GB RAM for full dataset analysis
  • 2GB disk space for results

Typical Performance:

  • Full analysis: 10-15 minutes (10,000 bootstrap samples)
  • Figures only: 2-3 minutes
  • Tables only: 1-2 minutes

Scale: Handles experiments with 1000+ runs efficiently


Core Concepts

  • ⚖️ Trade-Off Evaluation: Agents face scenarios where every decision has a cost, mirroring the Scylla and Charybdis dilemma
  • 📊 Metrics & Benchmarks: Structured measurement across adaptability, efficiency, and reliability
  • 🔄 Iterative Optimization: Continuous refinement through repeated trials
  • 🧭 Ablation Benchmarking: Systematic evaluation of agent architectures across complexity tiers

Ecosystem

Part of a 12-repository ecosystem:

Repository          Role
AchaeanFleet        Container images for the agent mesh
Myrmidons           GitOps agent provisioning
Odysseus            CLI and core platform for agent lifecycle management
ProjectArgus        Observability — monitoring and metrics
ProjectHephaestus   Shared Python utilities and foundational tools
ProjectHermes       Webhook-to-NATS bridge — event ingestion
ProjectKeystone     DAG execution engine
ProjectMnemosyne    Skills marketplace — team knowledge sharing
ProjectOdyssey      Training and capability development for agents
ProjectProteus      CI/CD pipeline infrastructure
ProjectScylla       Testing, measurement, and optimization under constraints (this project)
ProjectTelemachy    Workflow engine

Running the Analysis Pipeline

Full Analysis (Recommended)

Generate all outputs (data exports, figures, tables):

pixi run python scripts/generate_all_results.py \
  --data-dir ~/fullruns \
  --output-dir results/analysis

Key Options:

  • --data-dir → Directory with experiment results (default: ~/fullruns)
  • --output-dir → Base output directory (default: docs/)
  • --no-render → Skip PNG/PDF (faster, Vega-Lite specs only)
  • --skip-data / --skip-figures / --skip-tables → Generate specific components only
  • --exclude → Filter experiments (e.g., --exclude test001-dryrun)

Examples:

# Development mode - no rendering
pixi run python scripts/generate_all_results.py \
  --no-render \
  --exclude test001-dryrun test001-debug

# Regenerate tables only (assumes data/figures exist)
pixi run python scripts/generate_all_results.py \
  --skip-data --skip-figures

Individual Pipeline Steps

1. Export Data Only

pixi run python scripts/export_data.py \
  --data-dir ~/fullruns \
  --output-dir results/analysis/data

Outputs: runs.csv, judges.csv, criteria.csv, subtests.csv, summary.json, statistical_results.json

2. Generate Figures Only (34 figures × 5 formats)

pixi run python scripts/generate_figures.py \
  --data-dir ~/fullruns \
  --output-dir results/analysis/figures

Outputs: *.vl.json, *.csv, *.png (300 DPI), *.pdf, *_include.tex

3. Generate Tables Only (11 tables × 2 formats)

pixi run python scripts/generate_tables.py \
  --data-dir ~/fullruns \
  --output-dir results/analysis/tables

Outputs: *.md (human-readable), *.tex (LaTeX, booktabs formatted)

Output Structure

results/analysis/
├── data/
│   ├── runs.csv                      # Per-run metrics
│   ├── judges.csv                    # Judge evaluations
│   ├── criteria.csv                  # Criterion-level scores
│   ├── subtests.csv                  # Subtest metadata
│   ├── summary.json                  # Experiment summary
│   └── statistical_results.json      # Statistical analysis
├── figures/                          # 34 figures × 5 formats
│   ├── fig01_score_variance.*
│   ├── fig02_grade_distribution.*
│   └── ... (34 total)
└── tables/                           # 11 tables × 2 formats
    ├── table01_tier_summary.md
    ├── table01_tier_summary.tex
    └── ... (11 total)

Using the Outputs

LaTeX Integration:

\begin{figure}
  \centering
  \input{results/analysis/figures/fig04_pass_rate_by_tier_include.tex}
  \caption{Pass rate by tier with 95\% bootstrap confidence intervals.}
  \label{fig:pass-rate}
\end{figure}

\input{results/analysis/tables/table02_tier_comparison.tex}

Python/Jupyter:

import pandas as pd
import json

# Load data
runs_df = pd.read_csv('results/analysis/data/runs.csv')
judges_df = pd.read_csv('results/analysis/data/judges.csv')

# Load statistical results
with open('results/analysis/data/statistical_results.json') as f:
    stats = json.load(f)
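The per-run and per-judge tables share a run identifier, so they can be joined for combined analysis. A minimal sketch follows; the run_id and grade column names are illustrative assumptions, so check the actual CSV headers first:

```python
import pandas as pd

# Illustrative frames standing in for runs.csv / judges.csv;
# the run_id / grade column names are assumptions, not the real schema.
runs_df = pd.DataFrame({"run_id": [1, 2], "tier": ["T0", "T1"], "score": [0.80, 0.90]})
judges_df = pd.DataFrame({"run_id": [1, 1, 2], "grade": ["A", "B", "S"]})

# Collect each run's judge grades into a list, then attach them to the run rows
grades = judges_df.groupby("run_id")["grade"].agg(list).reset_index()
combined = runs_df.merge(grades, on="run_id", how="left")
```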

Experiment Management Scripts

ProjectScylla provides comprehensive scripts for running, managing, and analyzing experiments.

🧪 Running Experiments

Primary Experiment Runner:

# Run full experiment
pixi run python scripts/manage_experiment.py run --config config/test.yaml

# Run specific tiers
pixi run python scripts/manage_experiment.py run \
  --tiers-dir tests/fixtures/tests/test-001 \
  --tiers T0 T1 --runs 10 -v

Container-Based Execution:

./scripts/setup_api_key.sh
./scripts/run_experiment_in_container.sh \
  --tiers-dir tests/fixtures/tests/test-001 \
  --tiers T0 --runs 5 --verbose

🔄 Recovery & Re-running

# Re-run failed agents
pixi run python scripts/manage_experiment.py rerun-agents \
  ~/fullruns/test_experiment --tier T0 T1

# Re-run failed judges
pixi run python scripts/manage_experiment.py rerun-judges \
  ~/fullruns/test_experiment

📊 Results Management

# Regenerate all results
pixi run python scripts/manage_experiment.py regenerate \
  ~/fullruns/test_experiment

# Repair corrupt checkpoint
pixi run python scripts/manage_experiment.py repair \
  ~/fullruns/test_experiment/checkpoint.json

Analysis Pipeline Architecture

Statistical Methodology

Rigorous non-parametric methods for bounded, ordinal, non-normal data:

  • Bootstrap Confidence Intervals: BCa with 10,000 resamples
  • Omnibus Testing: Kruskal-Wallis H test (controls FWER)
  • Pairwise Comparisons: Mann-Whitney U + Holm-Bonferroni correction
  • Effect Sizes: Cliff's delta with bootstrapped CIs
  • Inter-Rater Reliability: Krippendorff's alpha for judge agreement

Configuration: src/scylla/analysis/config.yaml (all parameters externalized)
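The building blocks above are standard enough to sketch in a few lines. The snippet below is an illustration only, not the project's implementation: it shows a plain percentile bootstrap (the pipeline uses the more involved BCa variant), a brute-force Cliff's delta, and Holm-Bonferroni adjustment. The omnibus and pairwise tests themselves would come from scipy.stats (kruskal, mannwhitneyu), and the tier scores here are made up.

```python
import random
from itertools import combinations

def bootstrap_ci(values, stat=lambda v: sum(v) / len(v),
                 n_resamples=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for a statistic (the pipeline uses BCa)."""
    rng = random.Random(seed)
    reps = sorted(stat([rng.choice(values) for _ in values])
                  for _ in range(n_resamples))
    return reps[int(alpha / 2 * n_resamples)], reps[int((1 - alpha / 2) * n_resamples) - 1]

def cliffs_delta(a, b):
    """Cliff's delta effect size: P(x > y) - P(x < y) over all pairs."""
    gt = sum(x > y for x in a for y in b)
    lt = sum(x < y for x in a for y in b)
    return (gt - lt) / (len(a) * len(b))

def holm_bonferroni(pvals):
    """Holm step-down correction; returns adjusted p-values in input order."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    adjusted = [0.0] * len(pvals)
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (len(pvals) - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

# Hypothetical per-tier scores for illustration
tiers = {"T0": [0.58, 0.61, 0.64], "T1": [0.70, 0.72, 0.75], "T2": [0.79, 0.81, 0.84]}
deltas = {(a, b): cliffs_delta(tiers[a], tiers[b]) for a, b in combinations(tiers, 2)}
lo, hi = bootstrap_ci(tiers["T0"])
```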

Metrics

Quality:

  • Pass-Rate (functional test coverage)
  • Implementation Rate (semantic satisfaction)
  • Score (weighted rubric evaluation)
  • Consistency (1 - Coefficient of Variation)

Economic:

  • Cost-of-Pass (expected cost per success)
  • Frontier CoP (minimum CoP across configs)
  • Token Distribution (cost breakdown)

Process:

  • Latency (query to resolution time)
  • Judge Agreement (Krippendorff's alpha)
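As an illustration of the quality and economic metrics above (a sketch with made-up numbers, not the project's code):

```python
from statistics import mean, stdev

def cost_of_pass(costs, passed):
    """Expected cost per success: total spend over the number of passing runs."""
    return sum(costs) / sum(passed)

def consistency(scores):
    """1 - coefficient of variation; closer to 1 means more consistent runs."""
    return 1 - stdev(scores) / mean(scores)

# Hypothetical per-run records for one configuration
runs = [
    {"cost_usd": 0.12, "passed": 1, "score": 0.80},
    {"cost_usd": 0.15, "passed": 0, "score": 0.55},
    {"cost_usd": 0.10, "passed": 1, "score": 0.75},
]
cop = cost_of_pass([r["cost_usd"] for r in runs], [r["passed"] for r in runs])
cons = consistency([r["score"] for r in runs])
```

The frontier CoP would then be the minimum of cop across all configurations tested.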

Data Requirements

Expected structure:

fullruns/{experiment_name}/{timestamp}/
├── config/experiment.json            # Metadata
└── T0-T6/{subtest_id}/run_{01-10}/
    ├── run_result.json              # Outcomes
    └── judge/judge_{01-03}/judgment.json  # Evaluations

Required fields in run_result.json:

  • run_number (integer)
  • exit_code (0 = success)
  • judges (list with grades & criteria)

Schema: src/scylla/analysis/schemas/run_result.schema.json
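A minimal structural check for the required fields listed above (a sketch only; the real validation uses the JSON Schema file):

```python
import json

# Required fields and their expected Python types (mirrors the schema's intent)
REQUIRED = {"run_number": int, "exit_code": int, "judges": list}

def check_required(payload):
    """Return the names of required fields that are missing or mistyped."""
    return [name for name, typ in REQUIRED.items()
            if not isinstance(payload.get(name), typ)]

sample = json.loads('{"run_number": 1, "exit_code": 0, "judges": [{"grade": "A"}]}')
problems = check_required(sample)  # empty list means the sample passes
```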


Development

🧪 Testing

ProjectScylla has a comprehensive test suite covering all functionality. To see the current test count:

pixi run pytest tests/ --collect-only -q | tail -1

Test Categories

  • Unit Tests: Analysis (incl. integration-style tests), adapters, config, executors, judges, metrics, reporting
  • E2E Tests (1 file): Full pipeline validation
  • Test Fixtures (47+ scenarios): Complete test cases with expected outputs

Running Tests

# All tests (comprehensive)
pixi run pytest tests/ --verbose

# Unit tests only (fastest)
pixi run pytest tests/unit/ -v

# Specific modules
pixi run pytest tests/unit/analysis/ -v
pixi run pytest tests/unit/adapters/ -v
pixi run pytest tests/unit/config/ -v

# Coverage analysis
pixi run pytest tests/ --cov=src/scylla --cov-report=html

# Specific test file
pixi run pytest tests/unit/analysis/test_stats.py -v

Test Quality Assurance

# Code quality (linting + formatting)
pixi run ruff check src/scylla/
pixi run ruff format src/scylla/ --check

Git Hooks

Git hooks enforce quality checks locally before code reaches CI. Install them once after cloning:

bash scripts/install_hooks.sh

Hook       Trigger         What it does
pre-push   Every git push  Runs the full test suite with coverage; aborts the push if tests fail or coverage drops below the threshold in pyproject.toml

The coverage threshold is read directly from pyproject.toml — update it there and the hook stays in sync automatically.

Hook source files live in scripts/hooks/ and are version-controlled. See scripts/README.md for details.

Adding Components

New Figures:

  1. Create module in src/scylla/analysis/figures/
  2. Implement function following existing pattern
  3. Register in scripts/generate_figures.py
  4. Add tests in tests/unit/analysis/test_figures.py

New Tables:

  1. Add function to module in src/scylla/analysis/tables/
  2. Register in scripts/generate_tables.py
  3. Add tests in tests/unit/analysis/test_tables.py
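The registration pattern might look like the following sketch. Every name here is hypothetical; follow an existing module in src/scylla/analysis/figures/ and scripts/generate_figures.py for the real conventions.

```python
# Hypothetical registry mirroring the "implement, then register" steps above
FIGURES = {}

def register(name):
    def decorator(fn):
        FIGURES[name] = fn
        return fn
    return decorator

@register("fig35_example_pass_rate")
def fig35_example_pass_rate(rows):
    """Return a minimal Vega-Lite spec for a bar chart of pass rate by tier."""
    return {
        "mark": "bar",
        "data": {"values": rows},
        "encoding": {
            "x": {"field": "tier", "type": "nominal"},
            "y": {"field": "pass_rate", "type": "quantitative"},
        },
    }

spec = FIGURES["fig35_example_pass_rate"]([{"tier": "T0", "pass_rate": 0.6}])
```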

Code Quality

# Linting
pixi run ruff check src/scylla/analysis/

# Auto-fix and format
pixi run ruff check --fix src/scylla/analysis/
pixi run ruff format src/scylla/analysis/

🔧 Troubleshooting

Quick Reference

  • Schema validation failed: 'N/A' does not match → Ensure grades are S, A, B, C, D, or F only
  • [Errno 2] No such file or directory → Run find ~/fullruns -name "run_result.json"
  • TypeError: unsupported operand → Fix type coercion in criterion.achieved values
  • Empty outputs → Check for ≥2 experiments with ≥1 completed run each
  • Slow performance → Use the --no-render flag for faster iteration

Common Issues

1. Data Validation Errors

Schema validation failed: 'N/A' does not match '^[SABCDF]$'

Fix: Review the problematic runs and ensure every grade is one of S/A/B/C/D/F, or update the schema.
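To find the offending grades before re-running the pipeline, a quick check against the same pattern can help (the judgment structure here is a simplified assumption):

```python
import re

GRADE_RE = re.compile(r"^[SABCDF]$")  # same pattern the schema enforces

def invalid_grades(judges):
    """Return grade values that would fail schema validation."""
    return [j.get("grade") for j in judges
            if not GRADE_RE.match(str(j.get("grade", "")))]

bad = invalid_grades([{"grade": "A"}, {"grade": "N/A"}, {"grade": "S"}])
```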

2. Missing Files

Failed to load: [Errno 2] No such file or directory

Fix: Incomplete runs are skipped with warnings. To locate them:

find ~/fullruns -name "run_*" -type d -exec sh -c 'test -f "$1/run_result.json" || echo "Missing: $1"' _ {} \;

3. Type Errors

TypeError: unsupported operand type(s) for +: 'float' and 'str'

Fix: Some criterion.achieved values are strings. Correct them at data generation time, or coerce them when loading.
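A defensive coercion helper for loading is sketched below; the string vocabulary is an assumption about what stray values might look like:

```python
def coerce_achieved(value):
    """Coerce criterion.achieved values that arrive as strings into floats."""
    if isinstance(value, bool):
        return 1.0 if value else 0.0
    if isinstance(value, str):
        text = value.strip().lower()
        if text in {"true", "yes", "pass"}:   # assumed string vocabulary
            return 1.0
        if text in {"false", "no", "fail"}:
            return 0.0
        return float(text)  # numeric strings like "0.5"
    return float(value)
```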

Getting Help

  • Documentation: docs/research.md for methodology
  • Examples: tests/unit/analysis/ for usage patterns
  • Issues: GitHub Issues
  • Support: Create an issue with error message and steps to reproduce

Publication Readiness

  • Rigorous non-parametric statistics (Kruskal-Wallis, Mann-Whitney U, Cliff's delta)
  • Multiple comparison correction (Holm-Bonferroni throughout)
  • Bootstrap confidence intervals (BCa, 10K resamples, seed=42)
  • Effect sizes with confidence intervals
  • 300 DPI publication-quality figures
  • LaTeX-ready tables with booktabs formatting
  • Reproducible configuration (all parameters in config.yaml)
  • Comprehensive test suite
  • Documented methodology with citations

See docs/research.md for complete research methodology and metric definitions.

LaTeX Dependencies

Required packages for document compilation:

\documentclass{article}
\usepackage{booktabs}        % Professional tables
\usepackage{longtable}       % Multi-page tables
\usepackage{threeparttable}  % Table notes
\usepackage{graphicx}        % Figure inclusion
\usepackage{amsmath}         % Statistical symbols

\begin{document}
% Your content here
\end{document}

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines on:

  • Development setup and environment configuration
  • Git workflow and branch management
  • Code quality standards and testing requirements
  • Pull request and code review process
  • Issue reporting guidelines

Quick Start for Contributors:

  1. Fork the repository and clone locally
  2. Copy .env.example to .env and configure API keys
  3. Install dependencies: curl -fsSL https://pixi.sh/install.sh | bash
  4. Install git hooks: bash scripts/install_hooks.sh
  5. Run tests: pixi run pytest tests/ -v
  6. Check CONTRIBUTING.md for detailed workflow

Areas for contribution:

  • Additional statistical methods and metrics
  • New visualization types and formats
  • Performance optimizations
  • Documentation improvements
  • Bug fixes and feature requests

Visit our GitHub Repository to get started.


License

See the LICENSE file in the repository for terms.

Citation

@software{projectscylla2026,
  title = {ProjectScylla: A Testing and Optimization Framework for Agentic Workflows},
  author = {Micah Villmow},
  year = {2026},
  url = {https://github.com/HomericIntelligence/ProjectScylla}
}

About

ProjectScylla is a testing and optimization framework inspired by Odysseus’ trials, built to measure and improve agentic workflows under constrained decision-making. It evaluates resilience, adaptability, and trade-offs, ensuring agents optimize outcomes in complex environments.
