Dart Arena is a Flutter desktop app for benchmarking AI coding models on Dart and Flutter tasks. It helps compare model quality across code generation, agentic execution, hidden verification, repeated trials, and human preference review.
- Codegen and agentic tracks: run direct model responses as well as agent-style planning/execution workflows.
- Task QA and hidden verifiers: score tasks with compile checks, analyzer checks, visible tests, hidden/reference tests, diff-size signals, and LLM judges.
- Reliable leaderboards: repeat trials per task/model combo and aggregate results across quality, speed, reliability, and category dimensions.
- Human review: compare competing outputs in a review queue and fold preferences into rankings.
- Provenance and exports: save run manifests, environment details, summaries, CSV/Markdown/JSON reports, and reproducible artifact bundles.
- Headless CI smoke: exercise the headless benchmark runner in GitHub Actions for release confidence.
Install Flutter for your desktop platform, then run:
cd app
flutter pub get
flutter run -d linux
Use windows or macos instead of linux when running on those hosts.
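To avoid editing the device id by hand on each host, the choice can be derived from the kernel name. This is a minimal sketch: `flutter_desktop_target` is a hypothetical helper and not part of Dart Arena, and the Windows patterns assume a Git Bash, MSYS, or Cygwin shell.

```shell
# Hypothetical helper: map the host kernel name to a Flutter desktop device id.
# Accepts an explicit kernel name as $1 (useful for testing); defaults to the host.
flutter_desktop_target() {
  local kernel="${1:-$(uname -s)}"
  case "$kernel" in
    Linux*) echo linux ;;
    Darwin*) echo macos ;;
    MINGW*|MSYS*|CYGWIN*) echo windows ;;
    *) echo "unsupported host: $kernel" >&2; return 1 ;;
  esac
}

# Usage, from the repository root:
#   cd app && flutter pub get && flutter run -d "$(flutter_desktop_target)"
```

The same helper could pick the argument for flutter build, since desktop builds also have to run on the matching host.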
Open Settings in the app to configure model providers. Dart Arena currently supports:
- Ollama Local and Ollama Cloud
- OpenCode Go
- OpenAI
- OpenRouter
- DeepSeek
- Anthropic
- custom OpenAI-compatible local providers
- local Factory Droid execution
API keys and provider base URLs are stored through platform secure storage. Do not commit keys, exported credentials, local databases, or benchmark work directories.
- Configure at least one provider in Settings.
- Select New Run.
- Choose tasks, providers, models, evaluator settings, concurrency, and trial count.
- Start the run and monitor progress.
- Review the leaderboard, inspect task-run details, export run bundles, or compare outputs in the review queue.
Use this flow when you want to run the private official agentic corpus, publish the result to the static Pickforge/Dart Arena web leaderboard, and keep the evidence reproducible.
Prerequisites:
- bwrap is installed and available on PATH.
- Flutter, Dart, and Bun are installed.
- Factory Droid can run the custom model from ~/.factory/settings.json.
- The default model id is custom:gpt-5.3-codex-spark---Codex, which maps to GPT 5.3 Codex Spark - Codex in the local Factory settings.
- The git worktree is clean before the benchmark run. Release reports intentionally mark dirty-worktree runs as non-release evidence.
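The tool and worktree prerequisites can be checked up front with a short preflight. The sketch below is an assumption about how such a check might look, not something the benchmark script is known to do. `missing_tools` and `worktree_is_clean` are hypothetical names, and the cleanliness check counts untracked files as dirty, which may be stricter than the rule the release report applies.

```shell
# Hypothetical helper: print the names of the given commands that are not on PATH.
missing_tools() {
  local missing="" tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  # Drop the leading space left by the concatenation above.
  echo "${missing# }"
}

# Hypothetical helper: succeed only if the git worktree in $1 (default: the
# current directory) has no modified, staged, or untracked files.
worktree_is_clean() {
  local changes
  # Fail closed when git is missing or the directory is not a repository.
  changes="$(git -C "${1:-.}" status --porcelain)" || return 1
  [ -z "$changes" ]
}

# Usage:
#   missing="$(missing_tools bwrap flutter dart bun)"
#   [ -z "$missing" ] || { echo "missing tools: $missing" >&2; exit 1; }
#   worktree_is_clean || { echo "worktree is dirty" >&2; exit 1; }
```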
Run the official Bubblewrap benchmark:
RUN_ID=spark-sandboxed-official-$(date -u +%Y%m%dT%H%M%SZ) \
bash scripts/run-official-bubblewrap-benchmark.sh
The script writes .factory/$RUN_ID/run.json, runs dart_arena_headless, enables requireGeneratedCodeSandbox, uses Bubblewrap for generated code, runs the five active official Flutter tasks, and stores the run database plus artifact bundle under .factory/$RUN_ID/.
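Before publishing, it can help to confirm that a run directory holds what later steps expect. The sketch below lists the paths this README mentions for a run id: run.json from the benchmark script, plus the database and bundle locations that the manual export commands in this README pass as arguments. `run_artifact_paths` is a hypothetical helper, and the real script may write further files that are not listed here.

```shell
# Hypothetical helper: print the artifact paths this README names for a run id,
# relative to the repository root.
run_artifact_paths() {
  local run_id="$1"
  local root=".factory/$run_id"
  printf '%s\n' \
    "$root/run.json" \
    "$root/dart_arena.sqlite" \
    "$root/bundles/dart_arena_run_$run_id"
}

# Usage: report any expected path that does not exist yet.
#   run_artifact_paths "$RUN_ID" | while IFS= read -r path; do
#     [ -e "$path" ] || echo "missing: $path" >&2
#   done
```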
Useful overrides:
TRIALS_PER_TASK=3 MAX_CONCURRENCY=1 TIMEOUT_SECONDS=7200 \
RUN_ID=spark-sandboxed-official-20260606T120000Z \
bash scripts/run-official-bubblewrap-benchmark.sh
Publish a completed run to the static website data:
bash scripts/publish-benchmark-to-web.sh .factory/<run-id>
That command:
- exports web/static/data/leaderboard.v1.json with --strategy aggregate-compatible;
- exports web/static/data/release_report.v1.json as a provenance sidecar;
- validates the Svelte static site with bun run web:check and bun run web:smoke.
To publish and push in one command after reviewing the run id:
COMMIT=1 PUSH=1 \
COMMIT_MESSAGE="data: publish spark benchmark results" \
bash scripts/publish-benchmark-to-web.sh .factory/<run-id>
The script stages only the generated static data files. It does not stage local databases, workdirs, screenshots, credentials, or .factory/ contents.
Manual equivalent:
cd app
dart run --verbosity=error dart_arena:dart_arena_export_leaderboard \
--database ../.factory/<run-id>/dart_arena.sqlite \
--out ../web/static/data/leaderboard.v1.json \
--track agentic \
--strategy aggregate-compatible \
--run-id <run-id>
dart run --verbosity=error dart_arena:dart_arena_release_report \
--leaderboard ../web/static/data/leaderboard.v1.json \
--database ../.factory/<run-id>/dart_arena.sqlite \
--artifact-bundle-root ../.factory/<run-id>/bundles/dart_arena_run_<run-id> \
--task-qa-report-root ../tasks/flutter \
--release-id <run-id> \
--out ../web/static/data/release_report.v1.json
cd ..
bun run web:check
bun run web:smoke
git add web/static/data/leaderboard.v1.json web/static/data/release_report.v1.json
git commit -m "data: publish benchmark results"
git push origin main
web/static/data/leaderboard.v1.json is the file consumed by the public Svelte site. web/static/data/release_report.v1.json is published for auditability, but the current site UI does not require it to render the leaderboard.
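When following the manual steps, nothing enforces that only the two data files are staged. A small filter over the staged file list can catch a stray database or work directory before the commit. This is a sketch under that assumption: `unexpected_staged_files` is a hypothetical helper, not part of the publish script.

```shell
# Hypothetical helper: read file paths on stdin and print every path that is
# not one of the two generated static data files.
unexpected_staged_files() {
  # -F: fixed strings, -x: whole-line match, -v: keep the non-matching lines.
  # grep exits 1 when it prints nothing, so "|| true" keeps "set -e" scripts alive.
  grep -v -x -F \
    -e 'web/static/data/leaderboard.v1.json' \
    -e 'web/static/data/release_report.v1.json' || true
}

# Usage, after "git add" and before "git commit":
#   extra="$(git diff --cached --name-only | unexpected_staged_files)"
#   [ -z "$extra" ] || { echo "unexpected staged files:" >&2; echo "$extra" >&2; exit 1; }
```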
Deploying the website:
- If Vercel, Netlify, or another static host is connected to main, pushing the data commit is enough for the host to rebuild.
- Otherwise, run bun run web:smoke and deploy the generated web/build/ directory.
- For a non-root path such as GitHub Pages at /dart_arena, build with PUBLIC_BASE_PATH=/dart_arena bun run web:smoke and deploy web/build/.
Use these commands before submitting changes:
cd app
flutter pub get
dart format --set-exit-if-changed lib test
flutter analyze
flutter test
flutter build linux --debug
The CI smoke workflow also runs:
cd app
flutter test test/headless/headless_benchmark_runner_test.dart
Build debug desktop artifacts from the matching host OS:
cd app
flutter build linux --debug
flutter build windows --debug
flutter build macos --debug
Cross-building Windows or macOS from Linux is not supported by Flutter, so run those commands on native hosts.
- Provider credentials stay in platform secure storage.
- Benchmark tasks and generated work directories may contain model output and code diffs; inspect exported bundles before sharing them.
- Hidden verifier fixtures are part of the local benchmark corpus and should not be exposed to model prompts during a run.
- The app does not require committing local databases, caches, generated build outputs, or exported benchmark artifacts.
Contributions should keep the package/import name as dart_arena, preserve benchmark reproducibility, and include tests for scoring, task fixtures, or UI behavior when changed.
Before opening a pull request:
cd app
dart format --set-exit-if-changed lib test
flutter analyze
flutter test
Dart Arena is released under the MIT License. See LICENSE.

