Local LLM inference for Common Lisp. Run GGUF models in-process on your CPU or GPU — generate text, chat, embed, constrain output with grammars, fork context state, and compose sampler pipelines. Built on llama.cpp via CFFI.
The generated %llama layer covers the complete public C API (881 functions, 38 enums,
59 structs). A higher-level cl-llama-cpp package wraps common workflows with
idiomatic Lisp macros, typed handles, and structured conditions. %llama is a
first-class escape hatch, not an implementation detail: when a high-level wrapper
doesn’t exist, call the C function directly. A missing wrapper is never a blocker.
Status: active development (beta). The API is unstable and will change without notice.
Some familiarity with Common Lisp and llama.cpp will help, but if you’re new to either, the Cloud VM Quickstart walks through everything from a bare Ubuntu GPU instance. An AI coding assistant can also handle the initial setup — see the installation tip below.
- SBCL (recommended, required for thread safety): llama.cpp triggers floating-point
  exceptions that SBCL masks automatically via
  with-llama-compatible-fp-environment. Thread-safety guarantees (backend lifecycle mutex, atomic handle deallocation via CAS, abort/log callback locking) are implemented with SBCL primitives (sb-thread:make-mutex, sb-ext:cas). On non-SBCL implementations these compile to no-ops: single-threaded usage works, but concurrent access to the backend lifecycle, or GC finalizers racing with explicit free-model/free-context calls, is unsafe.
- llama.cpp built as a shared library (libllama.so/libllama.dylib)
- CFFI and cffi-libffi
- trivial-garbage
Tested on Linux. macOS should work (.dylib is handled). Windows is untested.
Tip: If you use an AI coding assistant such as Claude Code, Codex, or OpenCode, you can often ask it to install and verify cl-llama-cpp for you. For example: “Install cl-llama-cpp from this repository on this machine, build llama.cpp if needed, satisfy all Common Lisp dependencies, and verify the examples run.” If the installation fails, compare the assistant’s actions against the installation guide below.
The repository includes llama.cpp as a git submodule:

    git clone --recursive https://github.com/licjon/cl-llama-cpp
    cd cl-llama-cpp/llama.cpp
    cmake -B build -DBUILD_SHARED_LIBS=ON
    cmake --build build -j$(nproc)

For GPU acceleration with CUDA, add -DGGML_CUDA=ON:

    cmake -B build -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON
    cmake --build build -j$(nproc)

See the llama.cpp build documentation for Metal, Vulkan, and other backend options.
The library loader finds libllama.so in llama.cpp/build/bin/ automatically.
Alternatively, install it somewhere on your system library path.
Place or symlink this directory into your ASDF source registry (e.g. ~/common-lisp/):

    (asdf:load-system "cl-llama-cpp")

Roswell users: clone into ~/.roswell/local-projects/ and it is found automatically.
Quicklisp users: run ql:register-local-projects first.

    (use-package :cl-llama-cpp)

    (with-model (model "/path/to/model.gguf" :n-gpu-layers 99)
      (with-context (ctx model :n-ctx 2048)
        (princ (generate ctx "The capital of France is"
                         :max-tokens 64 :temp 0.7))))

Streaming:

    (with-model (model "/path/to/model.gguf" :n-gpu-layers 99)
      (with-context (ctx model :n-ctx 2048)
        (generate ctx "Once upon a time"
                  :max-tokens 128 :temp 0.9
                  :token-callback (lambda (tok)
                                    ;; Print each token as soon as it arrives.
                                    (write-string tok)
                                    (force-output)
                                    t))))

Multi-turn chat with KV-cache reuse (each turn decodes only the new tokens):

    (with-model (model "/path/to/chat-model.gguf" :n-gpu-layers 99)
      (with-context (ctx model :n-ctx 4096)
        (let ((session (make-chat-session ctx
                                          :system-prompt "You are a helpful assistant.")))
          (format t "~A~%"
                  (chat-session-send session "What is Common Lisp?"
                                     :max-tokens 256))
          (format t "~A~%"
                  (chat-session-send session "Tell me more about CLOS."
                                     :max-tokens 256)))))

- Text generation: generate accepts 15 sampling strategies (temperature, top-k/p, min-p, mirostat v1/v2, XTC, DRY, typical-p, top-n-sigma, dynamic temperature, repetition/frequency/presence penalties, logit bias, greedy). Speculative decoding via caller-supplied draft/verify/accept closures (:speculative-fns; see cl-llama-cpp-extras for ready-made primitives). Reusable sampler config objects (make-sampler-config) and explicit sampler chains (with-sampler-chain, individual make-*-sampler constructors).
- Chat: Simple sessions (full-history prefill) and incremental sessions (make-chat-session/chat-session-send: KV-cache reuse keeps per-turn prefill cost proportional to new tokens only; supports :speculative-fns for speculative decoding). Chat template formatting and safe multi-turn tokenization (format-chat, tokenize-chat).
- Constrained generation: GBNF grammars, lazy grammars with trigger words, infill.
- Parallel decoding: generate-parallel runs multiple prompts simultaneously in shared forward passes. Independent contexts can run concurrently on separate threads (SBCL only for thread-safety guarantees; see Requirements).
- Embeddings: embed with optional normalization; context created with :embeddings t.
- LoRA adapters: with-lora/apply-lora with scale control.
- KV cache: prefill decodes tokens into the cache without sampling (the primitive for context forking); clear, copy, shift, divide, per-sequence position queries.
- Session state: save/load to disk (survives process restarts) or to in-memory octet vectors (context forking, snapshotting).
- GGUF inspection: with-gguf reads file-level metadata, typed KV entries (including arrays), and tensor info without loading weights or initializing the backend.
- Resource planning: estimate-memory, validate-configuration, suggest-configuration, feasibility-report before committing to a model load. Optional pre-creation guardrails in with-context (:validation :warn/:error).
- Introspection: model description and metadata, architecture properties, context config, backend device enumeration, system capabilities.
- Performance: with-perf macro, structured timing data, sampler timing.
- Logging: route llama.cpp log output to a Lisp callback (set-log-callback).
- Backend lifecycle: with-backend, ensure-backend, NUMA init, threadpool management, abort callbacks.
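
The difference between the two chat session styles listed above comes down to token arithmetic. The following is a minimal sketch in plain Common Lisp that does not call cl-llama-cpp; the function names are hypothetical, and the model is simplified by assuming each turn adds a known number of tokens (user message plus the previous reply) to the conversation.

```lisp
;; Hypothetical helpers for illustration only; not part of cl-llama-cpp.

(defun full-history-prefill-costs (new-tokens-per-turn)
  "Tokens decoded at each turn when the whole history is prefilled again."
  (loop for n in new-tokens-per-turn
        sum n into history        ; history grows by the new tokens first
        collect history))         ; then the entire history is decoded

(defun incremental-prefill-costs (new-tokens-per-turn)
  "Tokens decoded at each turn when the KV cache already holds earlier turns."
  (copy-list new-tokens-per-turn))

(defun total-cost (costs)
  "Sum of per-turn costs."
  (reduce #'+ costs))

;; Three turns adding 100, 50, and 50 tokens respectively.
(defparameter *turns* '(100 50 50))

(format t "full history: ~A (total ~D)~%"
        (full-history-prefill-costs *turns*)
        (total-cost (full-history-prefill-costs *turns*)))
(format t "incremental:  ~A (total ~D)~%"
        (incremental-prefill-costs *turns*)
        (total-cost (incremental-prefill-costs *turns*)))
```

With full-history prefill the per-turn cost grows with the conversation, while with KV-cache reuse it stays equal to the number of new tokens. Actual timings depend on the model and hardware, which this sketch does not model.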
Conditions inherit from llama-error with structured slots (e.g. model-load-error
carries path, decode-error carries code, input-validation-error carries
function/argument/value/reason). Resource-loading conditions also establish interactive
restarts for the debugger: retry with different params, use CPU only, skip LoRA, etc.
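
The condition-plus-restart pattern described above can be tried without a model. The sketch below is plain Common Lisp using stand-in names: `demo-llama-error`, `demo-model-load-error`, `demo-load-model`, and the `use-cpu-only` restart are all hypothetical and are not the library's actual definitions or restart names (the API Reference lists the real ones). It only shows how a structured slot and a restart let calling code recover programmatically instead of landing in the debugger.

```lisp
;; Stand-in definitions for illustration; NOT the cl-llama-cpp conditions.
(define-condition demo-llama-error (error) ())

(define-condition demo-model-load-error (demo-llama-error)
  ((path :initarg :path :reader demo-error-path))
  (:report (lambda (c stream)
             (format stream "Could not load model from ~A"
                     (demo-error-path c)))))

(defun demo-load-model (path &key (n-gpu-layers 99))
  "Pretend loader: fails whenever GPU layers are requested."
  (restart-case
      (if (plusp n-gpu-layers)
          (error 'demo-model-load-error :path path)
          (list :model path :n-gpu-layers n-gpu-layers))
    (use-cpu-only ()
      :report "Retry the load with no GPU layers."
      (demo-load-model path :n-gpu-layers 0))))

(defun load-with-cpu-fallback (path)
  "Load PATH, choosing the CPU-only restart automatically on failure."
  (handler-bind ((demo-model-load-error
                   (lambda (c)
                     (declare (ignore c))
                     (invoke-restart 'use-cpu-only))))
    (demo-load-model path)))

(print (load-with-cpu-fallback "/path/to/model.gguf"))
```

Because the handler runs before the stack unwinds, the same restart that a person would pick interactively in the debugger can be selected by code.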
Not yet wrapped at the high level (use %llama directly):
- Model quantization
%llama is a supported, first-class layer — not an implementation detail. All 881
public C functions are available. The naming convention strips the llama_ prefix and
converts underscores to hyphens (llama_decode → %llama:decode,
LLAMA_SPLIT_MODE → %llama:split-mode); ggml_ and gguf_ prefixes are kept.
Extract raw pointers from high-level handles with llama-model-pointer,
llama-context-pointer, etc.
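
As a quick way to predict a binding's name from a C identifier, the naming rule above can be written as a small string function. This is a sketch in plain Common Lisp; `c-name->lisp-name` is a hypothetical helper and not part of the library, and it assumes that ggml_ and gguf_ names live in the same %llama package.

```lisp
;; Hypothetical helper for illustration; not part of cl-llama-cpp.
(defun c-name->lisp-name (c-name)
  "Apply the documented convention: drop a leading llama_ prefix,
turn underscores into hyphens, and qualify with the %llama package."
  (let* ((lower (string-downcase c-name))
         (prefix "llama_")
         (stripped (if (and (>= (length lower) (length prefix))
                            (string= prefix lower :end2 (length prefix)))
                       (subseq lower (length prefix))
                       lower)))   ; ggml_ and gguf_ prefixes are kept as-is
    (concatenate 'string "%llama:" (substitute #\- #\_ stripped))))

(dolist (name '("llama_decode" "LLAMA_SPLIT_MODE" "gguf_init_from_file"))
  (format t "~A -> ~A~%" name (c-name->lisp-name name)))
```

The first two inputs reproduce the examples given above; the third shows a gguf_ name keeping its prefix.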
Direct %llama calls must run inside with-llama-compatible-fp-environment to mask
SBCL’s FP traps. The high-level API handles this automatically.
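
To see what masking FP traps means in practice, here is a minimal SBCL-only sketch that uses SBCL's own `sb-int:with-float-traps-masked`. Whether with-llama-compatible-fp-environment is built on this macro, and which traps it masks, is an assumption; `masked-divide` is a hypothetical name used only for illustration.

```lisp
;; SBCL-only illustration; not the library's macro.
#+sbcl
(defun masked-divide (x y)
  "Divide X by Y with the listed floating-point traps masked, so the
hardware result (for example an infinity) is returned instead of a
Lisp error being signalled."
  (sb-int:with-float-traps-masked (:divide-by-zero :invalid :overflow)
    (/ x y)))

#+sbcl
(format t "masked result is infinite: ~A~%"
        (sb-ext:float-infinity-p (masked-divide 1.0 0.0)))
```

Without masking, SBCL would normally signal a division-by-zero error for the same operation, which is the kind of interruption that C code such as llama.cpp does not expect.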
See docs/low-level.org for naming conventions, struct and enum usage, and worked examples (including model quantization).

    (asdf:load-system "cl-llama-cpp/examples")

| File | Demonstrates |
|---|---|
| examples/incremental-chat.lisp | Multi-turn chat with KV-cache reuse: per-turn prefill cost independent of history length |
| examples/tool-calling.lisp | Tool calls via chat template + XML parsing |
| examples/sampler-showcase.lisp | Grammars, lazy grammars, extended samplers |
| examples/sampler-comparison.lisp | Side-by-side comparison of all sampling strategies |
| examples/parallel.lisp | generate-parallel: multiple prompts in shared passes |
| examples/parallel-threads.lisp | Concurrent contexts on separate Lisp threads |
| examples/kv-cache.lisp | KV cache ops with assert-driven verification |
| examples/context-fork.lisp | In-memory state snapshotting and context forking |
| examples/lora.lisp | LoRA adapter loading and application |
| examples/resource-planning.lisp | VRAM estimation and configuration validation |
| examples/introspection.lisp | Model metadata, tensor listing, system info |
| examples/backend-lifecycle.lisp | Thread tuning, abort callbacks, time-limited generation |
| examples/perf-and-logging.lisp | Timing, throughput calculation, log capture |

- API Reference — complete function, condition, and restart reference
- Getting Started — guided walkthrough for common patterns
- Guides — samplers, grammar, KV cache, embeddings, LoRA, parallel, resource planning, cloud quickstart
- Low-level %llama access — calling the C API directly
- Upgrading llama.cpp — keeping the submodule and bindings in sync
- Contributing — running tests, adding wrappers, PR guidelines
- cl-llama-chat — terminal interface for cl-llama-cpp providing automatic model configuration and dual-branch response selection across sampling configurations.
Smoke tests (no model required):

    sbcl --eval '(ql:quickload "cl-llama-cpp/tests")' \
         --eval '(asdf:test-system "cl-llama-cpp/tests")' \
         --eval '(quit)'

Integration tests require GGUF model files set via environment variables
(LLAMA_TEST_MODEL, LLAMA_TEST_EMBED_MODEL, LLAMA_TEST_LORA):

    LLAMA_TEST_MODEL=/path/to/model.gguf \
    LLAMA_TEST_EMBED_MODEL=/path/to/embedding-model.gguf \
    LLAMA_TEST_LORA=/path/to/adapter.gguf \
    sbcl --eval '(ql:quickload "cl-llama-cpp/tests")' \
         --eval '(asdf:test-system "cl-llama-cpp/tests")' \
         --eval '(quit)'

License: MIT