Running A Benchmark
puma-community-bot edited this page May 24, 2026
This page is a short overview of running PUMA locally so you have a result worth submitting. For the complete reference, see the PUMA Wiki.
git clone https://github.com/pumacp/puma.git
cd puma
docker compose up -d
docker compose exec puma_ollama ollama pull qwen2.5:3b
You can swap qwen2.5:3b for any Ollama tag. The curated catalog is listed under puma models; off-catalog models are accepted but flagged as "experimental" when submitted.
docker compose run --rm puma_runner puma run \
--scenario triage_jira \
--model qwen2.5:3b \
--strategy zero_shot \
--instances 10
The run writes its predictions and metrics to the local SQLite database and prints a final summary. Make a note of the run ID in the output; you'll need it when submitting.
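Because the run ID only appears in the terminal output, it can help to keep a copy of that output on disk. The following is a minimal sketch, not part of PUMA: `run_and_log` is a hypothetical helper that runs any command and mirrors its output to a log file with `tee`. The log file name in the usage comment is likewise an assumption.

```shell
# run_and_log LOGFILE CMD [ARGS...]
# Hypothetical helper (not shipped with PUMA): runs CMD, shows its output
# on the terminal, and saves a copy of stdout and stderr to LOGFILE.
# Note: the return status is that of tee, not of CMD.
run_and_log() {
  ral_log_file=$1
  shift
  "$@" 2>&1 | tee "$ral_log_file"
}

# Usage (assumed log file name), wrapping the run command shown above:
#   run_and_log puma_run.log \
#     docker compose run --rm puma_runner puma run \
#       --scenario triage_jira \
#       --model qwen2.5:3b \
#       --strategy zero_shot \
#       --instances 10
```

The saved log can then be searched for the run ID when you are ready to submit.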
- --scenario must be one of the catalog values: triage_jira, effort_tawos, prioritization_jira. Off-catalog scenarios cannot be submitted.
- --model should ideally be from the curated PUMA catalog. Off-catalog models are accepted but marked experimental: true in the submission metadata so other users know the model's provenance hasn't been vetted by the project.
- --strategy must be one of the supported strategies: zero_shot, few_shot_3, few_shot_6, chain_of_thought, rcoif, contextual_anchoring.
- --instances should be at least 10 for a publishable result. Smaller runs are accepted but flagged in the submission metadata so readers know the metrics come from a small sample.
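The rules above can be checked before starting a long run. The following is a rough sketch, assuming you want a local pre-flight check: `puma_check_args` is a hypothetical shell function, not a PUMA command, and it only encodes the scenario, strategy, and instance-count values listed on this page. It does not check the model against the curated catalog.

```shell
# puma_check_args SCENARIO STRATEGY INSTANCES
# Hypothetical pre-flight check (not part of PUMA). Returns 0 if the values
# match the lists on this page, 1 otherwise. A run with fewer than 10
# instances only produces a warning, since such runs are still accepted.
puma_check_args() {
  pca_scenario=$1
  pca_strategy=$2
  pca_instances=$3

  case "$pca_scenario" in
    triage_jira|effort_tawos|prioritization_jira) ;;
    *) echo "error: scenario '$pca_scenario' is not in the catalog" >&2
       return 1 ;;
  esac

  case "$pca_strategy" in
    zero_shot|few_shot_3|few_shot_6|chain_of_thought|rcoif|contextual_anchoring) ;;
    *) echo "error: unsupported strategy '$pca_strategy'" >&2
       return 1 ;;
  esac

  # Reject empty or non-numeric instance counts before comparing.
  case "$pca_instances" in
    ''|*[!0-9]*) echo "error: instances must be a non-negative integer" >&2
                 return 1 ;;
  esac

  if [ "$pca_instances" -lt 10 ]; then
    echo "warning: fewer than 10 instances; the result will be flagged as a small sample" >&2
  fi
  return 0
}
```

For example, `puma_check_args triage_jira zero_shot 10` succeeds silently, while an unknown scenario or strategy prints an error and returns a non-zero status.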
Continue with Submitting Results to publish what you just ran.