Quick Start
This page assumes you have completed Installation and have at least one model pulled into Ollama. It walks you from a fresh stack to your first set of metrics in under five minutes.
    docker compose run --rm puma_runner puma run \
      --scenario triage_jira \
      --model qwen2.5:3b \
      --strategy zero_shot \
      --instances 10

What each flag means:

- `--scenario triage_jira`: picks the multi-class issue-triage task on the Jira Social Repository dataset.
- `--model qwen2.5:3b`: the Ollama tag of the model under test.
- `--strategy zero_shot`: the prompting strategy (no in-context examples).
- `--instances 10`: how many issues from the dataset to evaluate. Ten is enough to verify everything works; production sweeps use 100–500.
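To repeat the same quick-start run across several models, the command can be wrapped in a small loop. The following is a minimal sketch, not part of PUMA: the `run_one` and `sweep` helpers and the `DRY_RUN` variable are hypothetical names, and `llama3.2:3b` in the usage note is only an illustrative second tag. It uses only the flags shown above and, by default, prints the commands instead of running them.

```shell
#!/bin/sh
# run_one MODEL: build the quick-start command for one Ollama tag.
# With DRY_RUN=1 (the default here) the command is printed, not executed.
run_one() {
  model=$1
  set -- docker compose run --rm puma_runner puma run \
    --scenario triage_jira --model "$model" \
    --strategy zero_shot --instances 10
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# sweep MODEL...: call run_one for each tag, stopping at the first failure.
sweep() {
  for m in "$@"; do
    run_one "$m" || return 1
  done
}
```

Calling `sweep qwen2.5:3b llama3.2:3b` prints one command per model; exporting `DRY_RUN=0` first would execute them, assuming both models are already pulled into Ollama.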
The `puma run` command prints progress, writes raw predictions to SQLite, and emits a final summary with F1-macro, latency, and CodeCarbon emissions.
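By its standard definition, F1-macro is the unweighted mean of the per-class F1 scores, so rare classes count as much as common ones. The sketch below illustrates that definition on tab-separated gold/predicted label pairs. The input format and the `macro_f1` helper are assumptions made for illustration; they are not PUMA's storage format or its implementation.

```shell
#!/bin/sh
# macro_f1: read "gold<TAB>predicted" pairs on stdin and print macro-averaged F1.
macro_f1() {
  awk -F'\t' '
    {
      labels[$1] = 1; labels[$2] = 1      # every class seen in gold or predictions
      if ($1 == $2) tp[$1]++              # correct prediction: true positive
      else { fn[$1]++; fp[$2]++ }         # miss for gold class, false alarm for predicted class
    }
    END {
      n = 0; total = 0
      for (c in labels) {
        denom = 2 * tp[c] + fp[c] + fn[c]
        total += (denom > 0) ? 2 * tp[c] / denom : 0   # per-class F1
        n++
      }
      if (n > 0) printf "%.4f\n", total / n            # unweighted mean over classes
    }'
}
```

For example, `printf 'bug\tbug\nbug\tfeature\nfeature\tfeature\ntask\ttask\n' | macro_f1` prints `0.7778`: the per-class scores are 2/3 for `bug`, 2/3 for `feature`, and 1 for `task`.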
    docker compose run --rm puma_runner puma list-runs

This shows every run logged in the local database with its scenario, model, strategy, key metric, and run ID. Use the run ID with `puma compare` to put two runs side by side, or pass it to `puma share-results` to submit it to PUMA Community.
    docker compose up -d puma_dashboard

Then open http://localhost:8501 in your browser. The dashboard exposes nine views: overview, model comparison, multi-model, reliability, robustness, fairness, sustainability frontier, instance drill-down, and a Community submissions panel.
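Because the dashboard container starts in the background, the page may not respond immediately. One way to wait for it from a script is to poll the URL until it answers. This is a minimal sketch that assumes `curl` is installed; `wait_for_url` is a hypothetical helper, not a PUMA command.

```shell
#!/bin/sh
# wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as URL answers with a successful HTTP status,
# or 1 if it is still unreachable after ATTEMPTS tries.
wait_for_url() {
  url=$1
  attempts=${2:-30}
  delay=${3:-1}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl --silent --fail --max-time 2 --output /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For instance, `wait_for_url http://localhost:8501 && echo "dashboard is up"` retries for roughly 30 seconds with the default settings before giving up.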
- Read Architecture to understand how PUMA's six layers fit together.
- Read Running Benchmarks for the full `puma run` flag reference and the strategy catalog.
- Read Publishing Results once you have a result you'd like to share with the community.