Installation
PUMA ships as a Docker Compose stack that bundles a Python runner, an Ollama inference server, and an optional Streamlit dashboard. The recommended path is Docker; a native install is also possible for advanced users who want to run PUMA directly on their host Python interpreter.
- Docker 24+ with the docker compose plugin.
- RAM: ~16 GB minimum for small models (1.5B–3B parameters); 32 GB or more is comfortable when running 7B–8B models or many concurrent runs.
- GPU (optional): NVIDIA with a recent CUDA driver, or Apple Silicon. PUMA also runs entirely on CPU, at correspondingly lower inference speed.
- Disk: ~10 GB free for the base stack plus ~2–8 GB per model image.
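
The requirements above can be checked before cloning anything. The following is a minimal sketch, not a script shipped with PUMA: the helper names (`major_version`, `total_ram_gb`, `free_disk_gb`, `check_prereqs`) are made up for illustration, and it only reports what it finds rather than enforcing anything.

```shell
#!/bin/sh
# Prerequisite check sketch. Helper names are illustrative, not part of PUMA.

# major_version 24.0.7 -> 24
major_version() {
  v="${1#v}"                 # tolerate a leading "v"
  printf '%s\n' "${v%%.*}"   # keep everything before the first dot
}

# total_ram_gb: installed RAM rounded to the nearest GB (Linux or macOS).
total_ram_gb() {
  if [ -r /proc/meminfo ]; then
    awk '/^MemTotal:/ {printf "%d\n", $2 / 1024 / 1024 + 0.5}' /proc/meminfo
  else
    echo $(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 ))
  fi
}

# free_disk_gb [DIR]: free space on DIR's filesystem in whole GB.
free_disk_gb() {
  df -Pk "${1:-.}" | awk 'NR == 2 {printf "%d\n", $4 / 1024 / 1024}'
}

check_prereqs() {
  if command -v docker >/dev/null 2>&1; then
    ver="$(docker version --format '{{.Client.Version}}' 2>/dev/null || true)"
    if [ "$(major_version "${ver:-0}")" -ge 24 ] 2>/dev/null; then
      echo "docker $ver: ok"
    else
      echo "docker ${ver:-unknown}: need 24+"
    fi
    if docker compose version >/dev/null 2>&1; then
      echo "compose plugin: ok"
    else
      echo "compose plugin: missing"
    fi
  else
    echo "docker: not found"
  fi
  echo "RAM: $(total_ram_gb) GB (16 GB minimum)"
  echo "free disk: $(free_disk_gb) GB (about 10 GB plus models)"
}

check_prereqs
```

The RAM figure is rounded because Linux reports slightly less than the nominal size, so a 16 GB machine would otherwise show as 15.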
git clone https://github.com/pumacp/puma.git
cd puma
docker compose up -d

This starts the Ollama service and prepares the PUMA runner container. To benchmark anything you need at least one model pulled into Ollama:
docker compose exec puma_ollama ollama pull qwen2.5:3b

Verify everything is wired up correctly:
docker compose run --rm puma_runner puma preflight

preflight inspects your host, picks a hardware profile, and reports whether
your environment is ready to run benchmarks.
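
The start, pull, and verify steps can be chained into one script. This is a sketch under assumptions: `retry`, `model_listed`, and `bootstrap` are hypothetical helpers, the retry count and delay are arbitrary, and it assumes the service names `puma_ollama` and `puma_runner` used in the commands above. It waits for Ollama to answer, pulls the model, confirms it shows up in `ollama list`, and then runs preflight.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY COMMAND...: rerun COMMAND until it succeeds
# or the attempts run out.
retry() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then sleep "$delay"; fi
  done
  return 1
}

# model_listed MODEL: reads `ollama list` output on stdin and succeeds
# when MODEL appears in the first (NAME) column of a non-header row.
model_listed() {
  awk -v m="$1" 'NR > 1 && $1 == m {found = 1} END {exit !found}'
}

# bootstrap [MODEL]: bring the stack up, pull MODEL, run preflight.
# Each step runs only if the previous one succeeded.
bootstrap() {
  model="${1:-qwen2.5:3b}"
  docker compose up -d &&
    retry 10 3 docker compose exec -T puma_ollama ollama list >/dev/null &&
    docker compose exec -T puma_ollama ollama pull "$model" &&
    docker compose exec -T puma_ollama ollama list | model_listed "$model" &&
    docker compose run --rm puma_runner puma preflight
}
```

Run `bootstrap` from the repository root, optionally passing a different model tag. The `-T` flag disables TTY allocation so the commands also work in non-interactive shells such as CI jobs.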
A native install requires Python 3.11+ and a separately running Ollama instance
on http://localhost:11434. After cloning the repo:
pip install -e .

Then run puma directly without docker compose run. See CONTRIBUTING.md
for the full development setup (test runner, linters, pre-commit hooks).
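
Both native prerequisites can be tested from the shell before installing. The sketch below is illustrative: `python_ok` and `ollama_up` are hypothetical helper names, and the reachability check assumes the server answers on Ollama's `/api/version` endpoint at the address given above.

```shell
#!/bin/sh
# python_ok [INTERPRETER]: succeeds when the interpreter is Python 3.11+.
python_ok() {
  "${1:-python3}" -c \
    'import sys; sys.exit(0 if sys.version_info >= (3, 11) else 1)' \
    2>/dev/null
}

# ollama_up [BASE_URL]: succeeds when an Ollama server answers at BASE_URL.
ollama_up() {
  curl -fsS --max-time 5 "${1:-http://localhost:11434}/api/version" \
    >/dev/null 2>&1
}

if python_ok; then
  echo "python: ok"
else
  echo "python: need 3.11+"
fi

if ollama_up; then
  echo "ollama: reachable"
else
  echo "ollama: not reachable on http://localhost:11434"
fi
```

If either check fails, `pip install -e .` may still succeed, but puma would have nothing to talk to or could hit syntax it does not support, so it is worth fixing first.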
puma preflight selects one of fifteen hardware profiles automatically based on
your detected RAM and GPU: the five baseline tiers cpu-lite, cpu-standard,
gpu-entry, gpu-mid, gpu-high, plus ten Apple-Silicon variants. Each profile
sets sensible defaults for concurrency, batch size,
and memory budget. See Models and Datasets for the full
profile matrix.
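
To make the idea of RAM- and GPU-based selection concrete, here is a purely illustrative sketch covering the five baseline tiers. The function `pick_profile` and every threshold in it are assumptions invented for this example; they are not PUMA's actual cut-offs, and the Apple-Silicon variants are not modelled. Treat the output of `puma preflight` as authoritative.

```shell
#!/bin/sh
# pick_profile RAM_GB [GPU_VRAM_GB]: prints one of the five baseline tiers.
# All thresholds below are hypothetical, chosen only to show the shape
# of the decision; they are NOT PUMA's real values.
pick_profile() {
  ram_gb="$1"
  vram_gb="${2:-0}"          # 0 means no usable GPU was detected
  if [ "$vram_gb" -ge 24 ]; then
    echo gpu-high
  elif [ "$vram_gb" -ge 12 ]; then
    echo gpu-mid
  elif [ "$vram_gb" -gt 0 ]; then
    echo gpu-entry
  elif [ "$ram_gb" -ge 32 ]; then
    echo cpu-standard
  else
    echo cpu-lite
  fi
}
```

For example, `pick_profile 16` falls through to the CPU branch, while `pick_profile 64 24` lands in the top GPU tier regardless of RAM, because this sketch checks the GPU first.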