Pi provider extension for running local llama.cpp inference that Pi manages itself.
The extension registers Qwen3.6 GGUF models under the llamacpp provider. On
first use it downloads/builds a matching llama.cpp runtime and downloads the
selected GGUF. It then starts llama-server and stops it automatically when Pi
shuts down.
Currently registered:
- llamacpp/qwen-3.6-dense-2bit (27B dense)
- llamacpp/qwen-3.6-dense-4bit (27B dense)
- llamacpp/qwen-3.6-dense-8bit (27B dense)
- llamacpp/qwen-3.6-moe-2bit (35B-A3B MoE)
- llamacpp/qwen-3.6-moe-4bit (35B-A3B MoE)
- llamacpp/qwen-3.6-moe-8bit (35B-A3B MoE)
The model names describe the architecture:
- dense is the Qwen3.6 27B dense model. All parameters participate in every token, which makes compute and memory use more direct and predictable.
- moe is the Qwen3.6 35B-A3B Mixture-of-Experts model. It has about 35B total parameters, but routes each token through only a small active subset of experts (about 3B active parameters). MoE can offer more total capacity for a similar amount of active compute, but the full expert weights still need to be stored and loaded.
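As a rough illustration of the storage versus active-compute trade-off described above, the following sketch estimates weight size as parameters × bits per weight / 8. It is back-of-the-envelope arithmetic only: it assumes the nominal 2/4/8 bits per weight implied by the model names and ignores quantization metadata, the KV cache, and any MTP layers, so real GGUF files will differ. The function and dictionary names are illustrative and not part of the extension.

```python
def weight_gigabytes(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a quantized model."""
    return params_billion * bits_per_weight / 8


# (total, active) parameter counts in billions, taken from the model names.
# For the dense model every parameter is active on every token.
MODELS = {"dense": (27.0, 27.0), "moe": (35.0, 3.0)}


def summarize(bits: int) -> dict:
    """Estimate stored weights and weights used per token for each model."""
    result = {}
    for name, (total, active) in MODELS.items():
        result[name] = {
            "stored_gb": weight_gigabytes(total, bits),
            "active_gb_per_token": weight_gigabytes(active, bits),
        }
    return result


if __name__ == "__main__":
    for bits in (2, 4, 8):
        print(f"{bits}-bit:", summarize(bits))
```

Under these assumptions, the 4-bit dense model stores about 13.5 GB of weights and uses all of them per token, while the 4-bit moe model stores about 17.5 GB but uses only about 1.5 GB of weights per token.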
The moe (35B-A3B) models are downloaded from
havenoammo/Qwen3.6-35B-A3B-MTP-GGUF
at revision 44ce525026e7e7d0e0915dc1bf83a783c813e75a, and the dense
(27B) models are downloaded from
froggeric/Qwen3.6-27B-MTP-GGUF
at revision 431204640c8511573e61a7964a12cc452114a223. Pinning the
revisions keeps downloads reproducible if upstream main moves; set
LLAMACPP_QWEN_35B_A3B_REVISION, LLAMACPP_QWEN_27B_REVISION, or
LLAMACPP_QWEN_REVISION to override.
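The override behaviour can be sketched as follows. This is not the extension's code. It assumes that a model-specific variable takes precedence over LLAMACPP_QWEN_REVISION, and that the repositories are hosted on Hugging Face, whose file URLs follow the `https://huggingface.co/<repo>/resolve/<revision>/<file>` pattern. The helper names and the file name passed to `download_url` are hypothetical.

```python
import os

# Repository, pinned revision, and model-specific override variable,
# as listed in the text above.
PINNED = {
    "moe": (
        "havenoammo/Qwen3.6-35B-A3B-MTP-GGUF",
        "44ce525026e7e7d0e0915dc1bf83a783c813e75a",
        "LLAMACPP_QWEN_35B_A3B_REVISION",
    ),
    "dense": (
        "froggeric/Qwen3.6-27B-MTP-GGUF",
        "431204640c8511573e61a7964a12cc452114a223",
        "LLAMACPP_QWEN_27B_REVISION",
    ),
}


def resolve_revision(kind: str, env=None) -> str:
    """Return the revision to download for 'moe' or 'dense'."""
    env = os.environ if env is None else env
    _repo, pinned, specific_var = PINNED[kind]
    # ASSUMPTION: specific variable first, then the general one, then the pin.
    return env.get(specific_var) or env.get("LLAMACPP_QWEN_REVISION") or pinned


def download_url(kind: str, filename: str, env=None) -> str:
    """Build a revision-pinned URL (assumes Hugging Face hosting)."""
    repo = PINNED[kind][0]
    revision = resolve_revision(kind, env)
    return f"https://huggingface.co/{repo}/resolve/{revision}/{filename}"
```

Because the revision is a full commit hash rather than a branch name, the resulting URL keeps pointing at the same file contents even if the upstream branch later changes.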
These files need llama.cpp MTP/NextN support, so the default runtime path builds
a pinned snapshot of llama.cpp pull request #22673
instead of using the stock binary release.

    pi install https://github.com/mitsuhiko/pi-llamacpp

For local development from this checkout:

    ./install-pi-extension-local.sh

Then restart Pi or run /reload.
Runtime state is kept under ~/.pi/llamacpp:
- source/: pinned llama.cpp source snapshots built locally (default: PR #22673 snapshot for MTP/NextN support)
- runtime/: extracted llama.cpp release archives when LLAMACPP_RUNTIME_KIND=release
- downloads/: release archives and resumable .part files
- models/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF/: cached moe (35B-A3B) GGUF model files
- models/froggeric/Qwen3.6-27B-MTP-GGUF/: cached dense (27B) GGUF model files
- clients/: active Pi process leases
- server.json: managed llama-server state
- log: download/extract/server/watchdog log
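To illustrate what a resumable .part file typically implies, here is a minimal sketch of the general technique: request only the missing bytes with an HTTP Range header, append them to the .part file, and rename it once the transfer completes. This is a generic pattern rather than the extension's actual downloader, and the function names are hypothetical.

```python
import os
import urllib.request
from pathlib import Path


def range_header(part_path: Path) -> dict:
    """Ask only for bytes we do not have yet; empty dict means start over."""
    size = part_path.stat().st_size if part_path.exists() else 0
    return {"Range": f"bytes={size}-"} if size else {}


def resume_download(url: str, dest: Path, chunk_size: int = 1 << 20) -> None:
    """Download url to dest, continuing from dest + '.part' if it exists."""
    part = dest.with_name(dest.name + ".part")
    request = urllib.request.Request(url, headers=range_header(part))
    with urllib.request.urlopen(request) as response:
        # 206 means the server honored the Range request, so append.
        # Any other success status means it sent the whole file, so overwrite.
        mode = "ab" if response.status == 206 else "wb"
        with open(part, mode) as out:
            while True:
                block = response.read(chunk_size)
                if not block:
                    break
                out.write(block)
    # Rename only after a complete transfer, so dest is never a partial file.
    os.replace(part, dest)
```

A production downloader would also need to handle a .part file that is already complete (servers usually answer such a Range request with status 416) and verify the final size or checksum.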
The managed server binds to a random localhost port by default and records the
active endpoint in server.json. Set LLAMACPP_PORT only if you explicitly
want a fixed port.
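One common way to obtain a random free localhost port is to bind a socket to port 0 and let the operating system choose. The sketch below shows that technique together with an LLAMACPP_PORT override. It is an assumption about how such a choice could be made, not a description of the extension's internals, and the function names are hypothetical.

```python
import os
import socket


def pick_free_localhost_port() -> int:
    """Let the OS assign an unused TCP port on 127.0.0.1."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 means "any free port"
        return sock.getsockname()[1]


def choose_port(env=None) -> int:
    """Use LLAMACPP_PORT if set, otherwise a random free port."""
    env = os.environ if env is None else env
    fixed = env.get("LLAMACPP_PORT")
    return int(fixed) if fixed else pick_free_localhost_port()
```

With this approach another process could claim the port between the probe socket closing and the server starting. A random port still avoids collisions between concurrent setups far better than a hard-coded default, which is why a fixed port is best reserved for cases that require one.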
Use /llamacpp inside Pi to show the live llama.cpp log, /llamacpp status for
paths/status, and /llamacpp stop to stop the managed server when no other
leases are active.