cl-llama-cpp

Local LLM inference for Common Lisp. Run GGUF models in-process on your CPU or GPU — generate text, chat, embed, constrain output with grammars, fork context state, and compose sampler pipelines. Built on llama.cpp via CFFI.

The generated %llama layer covers the complete public C API (881 functions, 38 enums, 59 structs). A higher-level cl-llama-cpp package wraps common workflows with idiomatic Lisp macros, typed handles, and structured conditions. %llama is a first-class escape hatch, not an implementation detail: when a high-level wrapper doesn’t exist, call the C function directly. A missing wrapper is never a blocker.

Status: active development, beta. The API is unstable and will change without notice.

Getting Set Up

Some familiarity with Common Lisp and llama.cpp will help, but if you’re new to either, the Cloud VM Quickstart walks through everything from a bare Ubuntu GPU instance. An AI coding assistant can also handle the initial setup — see the installation tip below.

Requirements

SBCL (recommended, required for thread safety): llama.cpp triggers floating-point exceptions that SBCL traps by default; the high-level API masks them automatically via with-llama-compatible-fp-environment. Thread-safety guarantees (backend lifecycle mutex, atomic handle deallocation via CAS, abort/log callback locking) are implemented with SBCL primitives (sb-thread:make-mutex, sb-ext:cas). On non-SBCL implementations these compile to no-ops: single-threaded usage works, but concurrent access to the backend lifecycle, or GC finalizers racing with explicit free-model / free-context calls, is unsafe.
llama.cpp built as a shared library (libllama.so / libllama.dylib)
CFFI and cffi-libffi
trivial-garbage

Tested on Linux. macOS should work (.dylib is handled). Windows is untested.

Installation

Tip: If you use an AI coding assistant such as Claude Code, Codex, or OpenCode, you can often ask it to install and verify cl-llama-cpp for you. For example: “Install cl-llama-cpp from this repository on this machine, build llama.cpp if needed, satisfy all Common Lisp dependencies, and verify the examples run.” If the installation fails, compare the assistant’s actions against the installation guide below.

The repository includes llama.cpp as a git submodule:

git clone --recursive https://github.com/licjon/cl-llama-cpp
cd cl-llama-cpp/llama.cpp
cmake -B build -DBUILD_SHARED_LIBS=ON
cmake --build build -j$(nproc)

For GPU acceleration with CUDA, add -DGGML_CUDA=ON:

cmake -B build -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON
cmake --build build -j$(nproc)

See the llama.cpp build documentation for Metal, Vulkan, and other backend options.

The library loader finds libllama.so in llama.cpp/build/bin/ automatically. Alternatively, install it somewhere on your system library path.

Place or symlink this directory somewhere in your ASDF source registry (e.g. ~/common-lisp/), then load the system:

(asdf:load-system "cl-llama-cpp")

Roswell users: clone into ~/.roswell/local-projects/ and it is found automatically. Quicklisp users: run ql:register-local-projects first.
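If symlinking is inconvenient, ASDF can also be pointed at the checkout directly through its older *central-registry* mechanism. A minimal sketch, assuming a hypothetical checkout at /home/me/src/cl-llama-cpp/ (adjust the path; the trailing slash matters because the entry names a directory):

```lisp
(require :asdf)

;; Hypothetical checkout location, not a path taken from this README.
;; ASDF searches the directories in *central-registry* for .asd files.
(pushnew #p"/home/me/src/cl-llama-cpp/" asdf:*central-registry*
         :test #'equal)

(asdf:load-system "cl-llama-cpp")
```

This only affects the running image; put the pushnew form in your Lisp init file if you want it to persist.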

Quick Start

(use-package :cl-llama-cpp)

(with-model (model "/path/to/model.gguf" :n-gpu-layers 99)
  (with-context (ctx model :n-ctx 2048)
    (princ (generate ctx "The capital of France is"
                     :max-tokens 64 :temp 0.7))))

Streaming:

(with-model (model "/path/to/model.gguf" :n-gpu-layers 99)
  (with-context (ctx model :n-ctx 2048)
    (generate ctx "Once upon a time"
              :max-tokens 128 :temp 0.9
              :token-callback (lambda (tok)
                                (write-string tok)
                                (force-output)
                                t))))
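The callback is an ordinary closure, so it can accumulate text as well as print it. A small sketch in plain Common Lisp, assuming only what the example above shows (the callback receives each piece as a string and returns t). make-collecting-callback is a hypothetical helper, not part of cl-llama-cpp:

```lisp
(defun make-collecting-callback (&key (echo *standard-output*))
  "Return two closures: a token callback, and a function returning the text so far.
ECHO is a stream to print each piece to, or NIL to collect silently."
  (let ((buffer (make-array 0 :element-type 'character
                              :adjustable t :fill-pointer 0)))
    (values
     (lambda (tok)
       ;; Append the new piece to the growing buffer.
       (loop for ch across tok do (vector-push-extend ch buffer))
       (when echo
         (write-string tok echo)
         (force-output echo))
       t)                               ; same return value as the example above
     ;; Hand out a fresh copy so callers cannot mutate the buffer.
     (lambda () (copy-seq buffer)))))

;; Intended use inside with-context, mirroring the streaming example:
;; (multiple-value-bind (callback text-so-far) (make-collecting-callback)
;;   (generate ctx "Once upon a time"
;;             :max-tokens 128 :token-callback callback)
;;   (funcall text-so-far))
```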

Multi-turn chat with KV-cache reuse — each turn decodes only the new tokens:

(with-model (model "/path/to/chat-model.gguf" :n-gpu-layers 99)
  (with-context (ctx model :n-ctx 4096)
    (let ((session (make-chat-session ctx
                     :system-prompt "You are a helpful assistant.")))
      (format t "~A~%"
              (chat-session-send session "What is Common Lisp?"
                                 :max-tokens 256))
      (format t "~A~%"
              (chat-session-send session "Tell me more about CLOS."
                                 :max-tokens 256)))))
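For interactive use, the two hard-coded turns can be replaced by a read loop. The sketch below is plain Common Lisp and does not call the library itself: chat-loop is a hypothetical helper that takes any function from a user line to a reply string, so it can be exercised without a model.

```lisp
(defun chat-loop (send &key (in *standard-input*) (out *standard-output*))
  "Read lines from IN, pass each to SEND, and print the reply to OUT.
Stops at end of input or on an empty line. Returns the number of turns."
  (loop with turns = 0
        for line = (progn (write-string "> " out)
                          (force-output out)
                          (read-line in nil nil))
        while (and line (plusp (length line)))
        do (format out "~A~%" (funcall send line))
           (incf turns)
        finally (return turns)))

;; Intended use, with SESSION bound as in the example above:
;; (chat-loop (lambda (line)
;;              (chat-session-send session line :max-tokens 256)))
```

Keeping the model call behind a closure means the loop does not need to know about sessions or sampling options.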

High-Level API

Text generation — generate accepts 15 sampling strategies (temperature, top-k/p, min-p, mirostat v1/v2, XTC, DRY, typical-p, top-n-sigma, dynamic temperature, repetition/frequency/presence penalties, logit bias, greedy). Speculative decoding via caller-supplied draft/verify/accept closures (:speculative-fns; see cl-llama-cpp-extras for ready-made primitives). Reusable sampler config objects (make-sampler-config) and explicit sampler chains (with-sampler-chain, individual make-*-sampler constructors).
Chat — Simple sessions (full-history prefill) and incremental sessions (make-chat-session / chat-session-send: KV-cache reuse keeps per-turn prefill cost proportional to new tokens only; supports :speculative-fns for speculative decoding). Chat template formatting and safe multi-turn tokenization (format-chat, tokenize-chat).
Constrained generation — GBNF grammars, lazy grammars with trigger words, infill.
Parallel decoding — generate-parallel runs multiple prompts simultaneously in shared forward passes. Independent contexts can run concurrently on separate threads (SBCL only for thread-safety guarantees; see Requirements).
Embeddings — embed with optional normalization; context created with :embeddings t.
LoRA adapters — with-lora / apply-lora with scale control.
KV cache — prefill decodes tokens into the cache without sampling (the primitive for context forking); clear, copy, shift, divide, per-sequence position queries.
Session state — save/load to disk (survives process restarts) or to in-memory octet vectors (context forking, snapshotting).
GGUF inspection — with-gguf reads file-level metadata, typed KV entries (including arrays), and tensor info without loading weights or initializing the backend.
Resource planning — estimate-memory, validate-configuration, suggest-configuration, feasibility-report before committing to a model load. Optional pre-creation guardrails in with-context (:validation :warn/:error).
Introspection — model description and metadata, architecture properties, context config, backend device enumeration, system capabilities.
Performance — with-perf macro, structured timing data, sampler timing.
Logging — route llama.cpp log output to a Lisp callback (set-log-callback).
Backend lifecycle — with-backend, ensure-backend, NUMA init, threadpool management, abort callbacks.

Conditions inherit from llama-error with structured slots (e.g. model-load-error carries path, decode-error carries code, input-validation-error carries function/argument/value/reason). Resource-loading conditions also establish interactive restarts for the debugger: retry with different params, use CPU only, skip LoRA, etc.

Not yet wrapped at the high level (use %llama directly):

Model quantization

Low-Level `%llama` Access

%llama is a supported, first-class layer — not an implementation detail. All 881 public C functions are available. The naming convention strips the llama_ prefix and converts underscores to hyphens (llama_decode → %llama:decode, LLAMA_SPLIT_MODE → %llama:split-mode); ggml_ and gguf_ prefixes are kept. Extract raw pointers from high-level handles with llama-model-pointer, llama-context-pointer, etc.
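The naming rule can be written down as a few lines of plain Common Lisp. This is a sketch of the convention as described above, not the binding generator's actual code, which may treat special cases differently; c-name->lisp-name is a hypothetical helper:

```lisp
(defun c-name->lisp-name (c-name)
  "Return the lowercase, hyphenated name that the convention gives for C-NAME.
A leading llama_ prefix is dropped; ggml_ and gguf_ prefixes are kept."
  (let* ((lower (string-downcase c-name))
         (prefix "llama_")
         (stripped (if (and (> (length lower) (length prefix))
                            (string= prefix lower :end2 (length prefix)))
                       (subseq lower (length prefix))
                       lower)))
    (substitute #\- #\_ stripped)))

(c-name->lisp-name "llama_decode")      ; => "decode"
(c-name->lisp-name "LLAMA_SPLIT_MODE")  ; => "split-mode"
(c-name->lisp-name "ggml_some_function") ; => "ggml-some-function" (made-up C name)
```

The resulting string names a symbol in the %llama package, so llama_decode is reached as %llama:decode.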

Direct %llama calls must run inside with-llama-compatible-fp-environment to mask SBCL’s FP traps. The high-level API handles this automatically.

See docs/low-level.org for naming conventions, struct and enum usage, and worked examples (including model quantization).

Examples

(asdf:load-system "cl-llama-cpp/examples")

File	Demonstrates
`examples/incremental-chat.lisp`	Multi-turn chat with KV-cache reuse: per-turn prefill cost proportional to new tokens only
`examples/tool-calling.lisp`	Tool calls via chat template + XML parsing
`examples/sampler-showcase.lisp`	Grammars, lazy grammars, extended samplers
`examples/sampler-comparison.lisp`	Side-by-side comparison of all sampling strategies
`examples/parallel.lisp`	`generate-parallel`: multiple prompts in shared passes
`examples/parallel-threads.lisp`	Concurrent contexts on separate Lisp threads
`examples/kv-cache.lisp`	KV cache ops with assert-driven verification
`examples/context-fork.lisp`	In-memory state snapshotting and context forking
`examples/lora.lisp`	LoRA adapter loading and application
`examples/resource-planning.lisp`	VRAM estimation and configuration validation
`examples/introspection.lisp`	Model metadata, tensor listing, system info
`examples/backend-lifecycle.lisp`	Thread tuning, abort callbacks, time-limited generation
`examples/perf-and-logging.lisp`	Timing, throughput calculation, log capture

Documentation

API Reference — complete function, condition, and restart reference
Getting Started — guided walkthrough for common patterns
Guides — samplers, grammar, KV cache, embeddings, LoRA, parallel, resource planning, cloud quickstart
Low-level %llama access — calling the C API directly
Upgrading llama.cpp — keeping the submodule and bindings in sync
Contributing — running tests, adding wrappers, PR guidelines

Built with cl-llama-cpp

cl-llama-chat — terminal interface for cl-llama-cpp providing automatic model configuration and dual-branch response selection across sampling configurations.

Running Tests

Smoke tests (no model required):

sbcl --eval '(ql:quickload "cl-llama-cpp/tests")' \
     --eval '(asdf:test-system "cl-llama-cpp/tests")' \
     --eval '(quit)'

Integration tests require GGUF model files set via environment variables (LLAMA_TEST_MODEL, LLAMA_TEST_EMBED_MODEL, LLAMA_TEST_LORA):

LLAMA_TEST_MODEL=/path/to/model.gguf \
LLAMA_TEST_EMBED_MODEL=/path/to/embedding-model.gguf \
LLAMA_TEST_LORA=/path/to/adapter.gguf \
sbcl --eval '(ql:quickload "cl-llama-cpp/tests")' \
     --eval '(asdf:test-system "cl-llama-cpp/tests")' \
     --eval '(quit)'
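Before a long integration run it can help to confirm that the three variables point at real files. The following is a hypothetical pre-flight helper, not part of the test suite; it assumes UIOP is available (it ships with ASDF) and says nothing about how the suite itself treats unset variables:

```lisp
(eval-when (:compile-toplevel :load-toplevel :execute)
  (require :asdf))                      ; provides UIOP

(defparameter *test-model-variables*
  '("LLAMA_TEST_MODEL" "LLAMA_TEST_EMBED_MODEL" "LLAMA_TEST_LORA"))

(defun unusable-test-variables (&key (lookup #'uiop:getenv))
  "Return the variables that are unset, empty, or name a file that does not exist.
LOOKUP maps a variable name to its value; it defaults to the process environment."
  (remove-if (lambda (name)
               (let ((value (funcall lookup name)))
                 (and value
                      (plusp (length value))
                      (probe-file value))))
             *test-model-variables*))

;; (unusable-test-variables) returns NIL when all three files are in place.
```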

License

MIT
