microsoft/amplifier-eval-harness

amplifier-eval-harness

Test harness for running scenarios through amplifier-app-cli inside Digital Twin Universe (DTU) environments.

Runs bundles × scenarios × runs matrices in isolated containers, captures per-run artifacts and metrics, and supports swapping in local working trees of any ecosystem repo via Gitea mirroring.

Status

Pre-alpha (v0.2). Sequential and parallel execution paths in place. First successful end-to-end smoke run against a live DTU on 2026-05-08 (Linux/Incus, foundation bundle, claude-opus-4-7); broader validation across configs/scenarios is still pending.

Quick start

# Prerequisites:
#   - amplifier CLI installed (uv tool install git+https://github.com/microsoft/amplifier)
#   - amplifier-bundle-gitea (provides amplifier-gitea CLI)
#   - amplifier-bundle-digital-twin-universe v0.2.0+ (provides amplifier-digital-twin CLI).
#       v0.1.x silently ignores `default_match_mode: boundary`; URL prefix collisions
#       with sibling repos can over-match. PR #7 (merged 2026-05-05) fixes it.
#   - Docker running (for Gitea container) + Incus (for DTU containers)
#   - At least one provider env var set (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY, GITHUB_TOKEN…)

# Install
uv tool install --from . amplifier-eval-harness

# Sanity check (no DTU launches)
amplifier-eval-harness validate --config configs/smoke.yaml

# Smoke run (1 bundle × 1 scenario × 1 run, sequential)
amplifier-eval-harness run --config configs/smoke.yaml

# Baseline (foundation + amplifier-dev × 3 runs each, up to 2 in parallel)
amplifier-eval-harness run --config configs/baseline.yaml

# Override parallelism at the CLI without editing the config
amplifier-eval-harness run --config configs/baseline.yaml --parallelism 4

# Dry run (just expand and print the matrix)
amplifier-eval-harness run --config configs/smoke.yaml --dry-run

Output lands in eval-results/<config-stem>-<timestamp>/. Read summary.md first.

Configs

Configs live in configs/. Add new ones with descriptive names; pick which to run via --config.

| Config | Purpose |
| --- | --- |
| smoke.yaml | Inner-dev-loop. 1 bundle × 1 scenario × 1 run, sequential. |
| baseline.yaml | foundation + amplifier-dev × explain-repo × 3 runs each, parallelism=2. |

See docs/designs/architecture.md for the full schema and run flow.

Scenarios

Scenarios live in scenarios/<id>/. Each scenario has a prompt.md and an optional workspace/ directory of fixture files seeded into /workspace inside the DTU before the prompt runs.

| Scenario | What it exercises |
| --- | --- |
| explain-repo | File reading, code summarization. Stable across runs. |
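As an illustration of the scenario layout described above, here is a minimal sketch of loading one scenario from disk. The `Scenario` dataclass and `load_scenario` function are hypothetical names used only for this example; they are not taken from the harness source.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class Scenario:
    """Hypothetical in-memory view of scenarios/<id>/ (not the harness's own type)."""
    scenario_id: str
    prompt: str
    workspace: Optional[Path]  # None when the scenario ships no fixtures


def load_scenario(root: Path, scenario_id: str) -> Scenario:
    """Read prompt.md and detect the optional workspace/ fixture directory."""
    scenario_dir = root / scenario_id
    prompt = (scenario_dir / "prompt.md").read_text(encoding="utf-8")
    workspace = scenario_dir / "workspace"
    return Scenario(
        scenario_id=scenario_id,
        prompt=prompt,
        workspace=workspace if workspace.is_dir() else None,
    )
```

Calling `load_scenario(Path("scenarios"), "explain-repo")` would then return the prompt text plus the fixture directory, if one exists.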

Settings overlays

Per-config provider/model selection happens via a YAML overlay deep-merged into the container's ~/.amplifier/settings.yaml at provision time. The default overlay (settings/default-providers.yaml) is lifted from the harness owner's ~/.amplifier/settings.yaml minus provider-chat-completions (which is local-only and not relevant inside DTUs).

To use a different model mix, copy the overlay, edit, and point the config's settings_overlay: at the new file.
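To make the "deep-merged" behaviour concrete, here is a minimal sketch of one common deep-merge rule: nested mappings are merged key by key, and everything else (scalars, lists) is replaced by the overlay value. The harness may treat lists or missing keys differently. The `deep_merge` function and the setting keys below are illustrative assumptions, not the real settings schema.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], overlay: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict with overlay merged into base, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Assumption: non-dict values (including lists) are replaced wholesale.
            merged[key] = deepcopy(value)
    return merged


# Placeholder keys and values, for illustration only.
container_settings = {"providers": {"anthropic": {"model": "model-a"}}, "ui": {"color": True}}
overlay = {"providers": {"anthropic": {"model": "model-b"}, "openai": {"model": "model-c"}}}
merged = deep_merge(container_settings, overlay)
```

Under this rule the overlay only needs to list the keys it changes; untouched keys in the container's settings survive the merge.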

Gitea instance pinning

By default the harness greedily reuses the first instance returned by amplifier-gitea list, falling back to creating a new one if none exist. That's fine on a solo dev machine but dangerous when multiple workspaces share a host — two harness invocations against the same Gitea race on populate_repo and may stomp on each other's mirrors.

To isolate a workspace, create a dedicated Gitea instance and pin to it:

amplifier-gitea create --port 10111 --name gitea-myworkspace
# {"id": "gitea-abcd1234", "port": 10111, ...}

# Pin via env var (per-invocation, no config edit required)
EVAL_HARNESS_GITEA_INSTANCE=gitea-abcd1234 amplifier-eval-harness run --config configs/smoke.yaml

# Or pin in the config itself
echo "gitea_instance_id: gitea-abcd1234" >> configs/myconfig.yaml

Resolution order (first wins): EVAL_HARNESS_GITEA_INSTANCE env var → YAML gitea_instance_id → greedy reuse of first listed instance → create new on port 10110. Pinned instances must already exist; the harness errors out rather than silently falling back.
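The resolution order can be sketched as a small pure function. This is an illustration under stated assumptions, not the harness's actual code: `resolve_gitea_instance` is a hypothetical name, and `listed_ids` stands in for the instance ids returned by amplifier-gitea list.

```python
from typing import Mapping, Optional, Sequence, Tuple

DEFAULT_CREATE_PORT = 10110  # port used when a new instance has to be created


def resolve_gitea_instance(
    env: Mapping[str, str],
    config: Mapping[str, str],
    listed_ids: Sequence[str],
) -> Tuple[str, Optional[str]]:
    """Return (source, instance_id); instance_id is None when one must be created."""
    # The env var wins over the YAML key; both count as pins.
    for source, pinned in (
        ("env", env.get("EVAL_HARNESS_GITEA_INSTANCE")),
        ("config", config.get("gitea_instance_id")),
    ):
        if pinned:
            if pinned not in listed_ids:
                # Pinned instances must already exist: fail rather than fall back.
                raise LookupError(f"pinned Gitea instance {pinned!r} not found")
            return source, pinned
    if listed_ids:
        return "reuse", listed_ids[0]  # greedy reuse of the first listed instance
    return "create", None  # caller creates a new instance on DEFAULT_CREATE_PORT
```

Passing `os.environ` as `env` and the parsed YAML config as `config` would reproduce the documented precedence.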

Running inside a nested Incus DTU

When you run amplifier-eval-harness from inside an Incus DTU (e.g. a resolve-stack instance), eval-sub-DTUs are spawned as siblings via the forwarded Incus socket. Their localhost is their own loopback, not the harness DTU's, so the default http://localhost:<port> GITEA_URL baked into sub-DTU profiles is unreachable. As a result, uv tool install inside the sub-DTU fails on any transitive git+https://github.com/microsoft/... dependency, because mitmproxy's url_rewrites redirect those requests to the unreachable host.

Fix: set AMPLIFIER_EVAL_HARNESS_GITEA_HOST to the harness DTU's eth0 IP. The harness will use this IP (instead of localhost) when passing GITEA_URL to eval-sub-DTU launch vars.

# Find the harness DTU's eth0 IP (run this inside the DTU):
ip -4 addr show eth0 | awk '/inet / {print $2}' | cut -d/ -f1
# e.g. 10.119.176.124

# Set before running the harness:
export AMPLIFIER_EVAL_HARNESS_GITEA_HOST=10.119.176.124
amplifier-eval-harness run --config configs/smoke.yaml

Local harness operations (Gitea API calls, mirroring, token fetches) are not affected — they still reach Gitea via localhost from the harness DTU's own perspective.
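The host selection amounts to a one-line fallback. The sketch below assumes the URL keeps the http://<host>:<port> shape shown above; `gitea_url_for_sub_dtu` is a hypothetical helper name, not a function from the harness.

```python
from typing import Mapping


def gitea_url_for_sub_dtu(port: int, env: Mapping[str, str]) -> str:
    """Build the GITEA_URL passed to eval-sub-DTU launch vars."""
    # Fall back to localhost when the override is unset or empty.
    host = env.get("AMPLIFIER_EVAL_HARNESS_GITEA_HOST") or "localhost"
    return f"http://{host}:{port}"
```

With the variable exported as in the example above and a Gitea instance on port 10110, this would yield http://10.119.176.124:10110.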

Architecture in 60 seconds

  1. Read config → expand bundles × scenarios × runs_per_combo into a flat list of RunSpec.
  2. Ensure a Gitea instance, push every relevant repo into it (upstream mirror or local working-tree snapshot).
  3. For each RunSpec (sequential when parallelism: 1, ThreadPoolExecutor-bounded when > 1):
    • Render a parameterized DTU profile.
    • Launch DTU; wait for readiness; push scenario workspace fixture; deep-merge settings overlay.
    • exec amplifier run --bundle <name> --output-format json-trace "<prompt>" and capture stdout, stderr, exit code.
    • file-pull the session directory; destroy DTU (or keep on failure).
  4. Aggregate results into manifest.json, summary.csv, summary.md.

The harness always routes installs through Gitea: there is one code path, and swapping in a local working tree is a per-repo flag rather than a runtime mode switch.
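Step 1 of the flow above (matrix expansion) can be sketched as a Cartesian product. The name `RunSpec` comes from the text, but its fields here, the `expand_matrix` function, and the 1-based run index are assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence


@dataclass(frozen=True)
class RunSpec:
    """Hypothetical shape of one matrix cell; the real RunSpec may carry more fields."""
    bundle: str
    scenario: str
    run_index: int


def expand_matrix(
    bundles: Sequence[str], scenarios: Sequence[str], runs_per_combo: int
) -> List[RunSpec]:
    """Expand bundles × scenarios × runs_per_combo into a flat, ordered list."""
    return [
        RunSpec(bundle, scenario, run_index)
        for bundle, scenario in product(bundles, scenarios)
        for run_index in range(1, runs_per_combo + 1)
    ]


# Mirrors the baseline config: 2 bundles × 1 scenario × 3 runs = 6 specs.
specs = expand_matrix(["foundation", "amplifier-dev"], ["explain-repo"], 3)
```

Keeping the expansion a pure function of the config is also what makes a dry run cheap: the matrix can be printed without launching any DTU.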

Parallelism

parallelism: N in the config (or --parallelism N on the CLI) caps the number of concurrent DTUs. Each running DTU consumes ~1.5–2 GB of RAM and a CPU core during provisioning. Pick a value your machine can sustain.

Gitea is shared but read-mostly during the run loop — repo population happens once, sequentially, before any DTU launches. Output from concurrent runs is interleaved on stderr; each line is prefixed with the run id for traceability.

Limitations (v0.2)

  • No token / cost capture. amplifier CLI doesn't surface these. Wall clock, tool call count, agent invocations, full transcript, and per-tool execution trace are captured.
  • No quality scoring. Raw artifacts only; LLM-as-judge / rubric scoring is a separate later layer that reads runs/*/result.json.
  • Single provider mix per config. Different model setups require different settings_overlay: files (and therefore different configs).

Layout

.
├── README.md
├── pyproject.toml
├── docs/designs/architecture.md       # source-of-truth design doc
├── eval_harness/                      # the CLI package
│   ├── cli.py        # amplifier-eval-harness CLI (run / validate / gitea-status)
│   ├── config.py     # YAML schema + matrix expansion
│   ├── gitea.py      # Gitea instance lifecycle + mirror/snapshot push
│   ├── profile.py    # Parameterized profile rendering (url_rewrites dedup, settings overlay splice)
│   ├── runner.py     # Per-run flow (launch / exec / file-pull / destroy)
│   ├── results.py    # Per-run JSON, summary CSV/MD, manifest
│   └── _log.py       # Thread-local log prefix for parallel runs
├── profiles/eval-base.yaml.tmpl       # parameterized DTU profile
├── configs/                           # ready-to-run named configs
├── scenarios/                         # prompt + workspace fixtures
└── settings/                          # provider/model overlay YAMLs

License

MIT (TBD)

Contributing

Note

This project is not currently accepting external contributions, but we're actively working toward opening this up. We value community input and look forward to collaborating in the future. For now, feel free to fork and experiment!

Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit Contributor License Agreements.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

About

Eval harness for exploring configuration of amplifier-app-cli for the Amplifier project
