licjon/cl-llama-cpp

cl-llama-cpp

Local LLM inference for Common Lisp. Run GGUF models in-process on your CPU or GPU — generate text, chat, embed, constrain output with grammars, fork context state, and compose sampler pipelines. Built on llama.cpp via CFFI.

The generated %llama layer covers the complete public C API (881 functions, 38 enums, 59 structs). A higher-level cl-llama-cpp package wraps common workflows with idiomatic Lisp macros, typed handles, and structured conditions. %llama is a first-class escape hatch, not an implementation detail: when a high-level wrapper doesn’t exist, call the C function directly. A missing wrapper is never a blocker.

Active Development / Beta. API is unstable and will change without notice.

Getting Set Up

Some familiarity with Common Lisp and llama.cpp will help, but if you’re new to either, the Cloud VM Quickstart walks through everything from a bare Ubuntu GPU instance. An AI coding assistant can also handle the initial setup — see the installation tip below.

Requirements

  • SBCL (recommended, required for thread safety) — llama.cpp triggers floating-point exceptions that SBCL masks automatically via with-llama-compatible-fp-environment. Thread-safety guarantees (backend lifecycle mutex, atomic handle deallocation via CAS, abort/log callback locking) are implemented with SBCL primitives (sb-thread:make-mutex, sb-ext:cas). On non-SBCL implementations these compile to no-ops: single-threaded usage works, but concurrent access to the backend lifecycle or GC finalizers racing with explicit free-model / free-context calls is unsafe.
  • llama.cpp built as a shared library (libllama.so / libllama.dylib)
  • CFFI and cffi-libffi
  • trivial-garbage

Tested on Linux. macOS should work (.dylib is handled). Windows is untested.

Installation

Tip: If you use an AI coding assistant such as Claude Code, Codex, or OpenCode, you can often ask it to install and verify cl-llama-cpp for you. For example: “Install cl-llama-cpp from this repository on this machine, build llama.cpp if needed, satisfy all Common Lisp dependencies, and verify the examples run.” If the installation fails, compare the assistant’s actions against the installation guide below.

The repository includes llama.cpp as a git submodule:

git clone --recursive https://github.com/licjon/cl-llama-cpp
cd cl-llama-cpp/llama.cpp
cmake -B build -DBUILD_SHARED_LIBS=ON
cmake --build build -j$(nproc)

For GPU acceleration with CUDA, add -DGGML_CUDA=ON:

cmake -B build -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON
cmake --build build -j$(nproc)

See the llama.cpp build documentation for Metal, Vulkan, and other backend options.

The library loader finds libllama.so in llama.cpp/build/bin/ automatically. Alternatively, install it somewhere on your system library path.

Place or symlink this directory into your ASDF source registry (e.g. ~/common-lisp/), then load the system:

(asdf:load-system "cl-llama-cpp")

Roswell users: clone into ~/.roswell/local-projects/ and it is found automatically. Quicklisp users: run ql:register-local-projects first.

Quick Start

(use-package :cl-llama-cpp)

(with-model (model "/path/to/model.gguf" :n-gpu-layers 99)
  (with-context (ctx model :n-ctx 2048)
    (princ (generate ctx "The capital of France is"
                     :max-tokens 64 :temp 0.7))))

Streaming:

(with-model (model "/path/to/model.gguf" :n-gpu-layers 99)
  (with-context (ctx model :n-ctx 2048)
    (generate ctx "Once upon a time"
              :max-tokens 128 :temp 0.9
              :token-callback (lambda (tok)
                                (write-string tok)
                                (force-output)
                                t))))
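When the streamed text is also needed afterwards as a single string, the callback can accumulate pieces while echoing them. The following is a minimal sketch in plain Common Lisp. make-collecting-callback is a hypothetical helper, not part of cl-llama-cpp, and it assumes only what the example above shows: the callback receives each piece as a string and returns T.

```lisp
;; Hypothetical helper (not part of cl-llama-cpp): returns two closures,
;; a token callback and a thunk that yields the text collected so far.
(defun make-collecting-callback (&optional (echo *standard-output*))
  (let ((buffer (make-array 0 :element-type 'character
                              :adjustable t :fill-pointer 0)))
    (values
     (lambda (tok)
       ;; Append the piece to the growing buffer.
       (with-output-to-string (out buffer)
         (write-string tok out))
       ;; Optionally echo it, as the streaming example does.
       (when echo
         (write-string tok echo)
         (force-output echo))
       ;; Return T, mirroring the streaming example above.
       t)
     (lambda () (copy-seq buffer)))))

;; Exercising the callback without a model, using made-up pieces:
(multiple-value-bind (callback text-so-far) (make-collecting-callback nil)
  (dolist (piece '("Once" " upon" " a" " time"))
    (funcall callback piece))
  (funcall text-so-far))
```

With a loaded model, the first closure would be passed as :token-callback to generate and the second called once generation returns.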

Multi-turn chat with KV-cache reuse — each turn decodes only the new tokens:

(with-model (model "/path/to/chat-model.gguf" :n-gpu-layers 99)
  (with-context (ctx model :n-ctx 4096)
    (let ((session (make-chat-session ctx
                     :system-prompt "You are a helpful assistant.")))
      (format t "~A~%"
              (chat-session-send session "What is Common Lisp?"
                                 :max-tokens 256))
      (format t "~A~%"
              (chat-session-send session "Tell me more about CLOS."
                                 :max-tokens 256)))))
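The saving from KV-cache reuse can be pictured as a prefix comparison: tokens already in the cache are skipped and only the remainder is decoded. The sketch below illustrates that accounting in plain Common Lisp. It is not the library's implementation, tokens-to-decode is a hypothetical name, and the token ids are made up.

```lisp
;; Illustration only: which tokens of the new prompt still need decoding,
;; given the tokens already held in the cache?
(defun tokens-to-decode (cached-tokens new-tokens)
  "Return the part of NEW-TOKENS that follows its common prefix with CACHED-TOKENS."
  (subseq new-tokens
          ;; MISMATCH returns NIL when the sequences are identical,
          ;; otherwise the index of the first difference.
          (or (mismatch cached-tokens new-tokens)
              (length new-tokens))))

;; The cache holds four tokens; the next turn appends two more.
(tokens-to-decode '(1 2 3 4) '(1 2 3 4 5 6))   ; => (5 6)
```

If the new prompt diverges from the cached prefix, everything from the first differing token onward must be decoded again, which is why an incremental session only appends to the conversation.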

High-Level API

  • Text generation: generate accepts 15 sampling strategies (temperature, top-k/p, min-p, mirostat v1/v2, XTC, DRY, typical-p, top-n-sigma, dynamic temperature, repetition/frequency/presence penalties, logit bias, greedy). Speculative decoding via caller-supplied draft/verify/accept closures (:speculative-fns; see cl-llama-cpp-extras for ready-made primitives). Reusable sampler config objects (make-sampler-config) and explicit sampler chains (with-sampler-chain, individual make-*-sampler constructors).
  • Chat: simple sessions (full-history prefill) and incremental sessions (make-chat-session / chat-session-send: KV-cache reuse keeps per-turn prefill cost proportional to new tokens only; supports :speculative-fns for speculative decoding). Chat template formatting and safe multi-turn tokenization (format-chat, tokenize-chat).
  • Constrained generation: GBNF grammars, lazy grammars with trigger words, infill.
  • Parallel decoding: generate-parallel runs multiple prompts simultaneously in shared forward passes. Independent contexts can run concurrently on separate threads (SBCL only for thread-safety guarantees; see Requirements).
  • Embeddings: embed with optional normalization; context created with :embeddings t.
  • LoRA adapters: with-lora / apply-lora with scale control.
  • KV cache: prefill decodes tokens into the cache without sampling (the primitive for context forking); clear, copy, shift, divide, per-sequence position queries.
  • Session state: save/load to disk (survives process restarts) or to in-memory octet vectors (context forking, snapshotting).
  • GGUF inspection: with-gguf reads file-level metadata, typed KV entries (including arrays), and tensor info without loading weights or initializing the backend.
  • Resource planning: estimate-memory, validate-configuration, suggest-configuration, feasibility-report before committing to a model load. Optional pre-creation guardrails in with-context (:validation :warn/:error).
  • Introspection: model description and metadata, architecture properties, context config, backend device enumeration, system capabilities.
  • Performance: with-perf macro, structured timing data, sampler timing.
  • Logging: route llama.cpp log output to a Lisp callback (set-log-callback).
  • Backend lifecycle: with-backend, ensure-backend, NUMA init, threadpool management, abort callbacks.

Conditions inherit from llama-error with structured slots (e.g. model-load-error carries path, decode-error carries code, input-validation-error carries function/argument/value/reason). Resource-loading conditions also establish interactive restarts for the debugger: retry with different params, use CPU only, skip LoRA, etc.
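Restarts can also be invoked from code, not only from the debugger. The sketch below shows the general pattern with stand-in definitions so that it runs without the library or a model. Everything prefixed demo- is hypothetical, and the restart name use-cpu-only is an assumption; the actual restart and accessor names in cl-llama-cpp may differ.

```lisp
;; Stand-in conditions mirroring the shape described above
;; (a base error plus a load error that carries a path).
(define-condition demo-llama-error (error) ())

(define-condition demo-model-load-error (demo-llama-error)
  ((path :initarg :path :reader demo-error-path))
  (:report (lambda (c stream)
             (format stream "Could not load model ~A" (demo-error-path c)))))

;; Pretend loader: fails whenever GPU layers are requested and
;; offers a restart that retries with all layers on the CPU.
(defun demo-load-model (path &key (n-gpu-layers 99))
  (restart-case
      (if (plusp n-gpu-layers)
          (error 'demo-model-load-error :path path)
          (list :model path :n-gpu-layers n-gpu-layers))
    (use-cpu-only ()
      :report "Retry with all layers on the CPU."
      (demo-load-model path :n-gpu-layers 0))))

;; Choose the restart programmatically instead of in the debugger.
(defun load-with-cpu-fallback (path)
  (handler-bind ((demo-model-load-error
                   (lambda (c)
                     (declare (ignore c))
                     (invoke-restart 'use-cpu-only))))
    (demo-load-model path)))

(load-with-cpu-fallback "/path/to/model.gguf")
;; => (:MODEL "/path/to/model.gguf" :N-GPU-LAYERS 0)
```

handler-bind is used rather than handler-case because the handler must run while the restart is still active, before the stack unwinds.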

Not yet wrapped at the high level (use %llama directly):

  • Model quantization

Low-Level %llama Access

%llama is a supported, first-class layer, not an implementation detail. All 881 public C functions are available. The naming convention strips the llama_ prefix and converts underscores to hyphens (llama_decode → %llama:decode, LLAMA_SPLIT_MODE → %llama:split-mode); ggml_ and gguf_ prefixes are kept. Extract raw pointers from high-level handles with llama-model-pointer, llama-context-pointer, etc.

Direct %llama calls must run inside with-llama-compatible-fp-environment to mask SBCL’s FP traps. The high-level API handles this automatically.

See docs/low-level.org for naming conventions, struct and enum usage, and worked examples (including model quantization).
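As a quick way to guess the Lisp name for a given C identifier, the naming convention can be written as a small function. This is a sketch of the rule as stated above, not the generator's actual code. c-name->lisp-name is a hypothetical helper, and it assumes that underscores also become hyphens in ggml_ and gguf_ names.

```lisp
;; Hypothetical helper: map a C identifier to the symbol name expected
;; in the %llama package under the convention described above.
(defun c-name->lisp-name (c-name)
  (let* ((lower (string-downcase c-name))
         ;; Strip a leading "llama_"; ggml_ and gguf_ prefixes are kept.
         (stripped (if (and (>= (length lower) 6)
                            (string= "llama_" lower :end2 6))
                       (subseq lower 6)
                       lower)))
    ;; Underscores become hyphens.
    (substitute #\- #\_ stripped)))

(c-name->lisp-name "llama_decode")         ; => "decode"
(c-name->lisp-name "LLAMA_SPLIT_MODE")     ; => "split-mode"
(c-name->lisp-name "gguf_init_from_file")  ; => "gguf-init-from-file"
```

The authoritative list of exported names is the %llama package itself; docs/low-level.org covers the exceptions and the struct and enum conventions.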

Examples

(asdf:load-system "cl-llama-cpp/examples")

| File | Demonstrates |
|------|--------------|
| examples/incremental-chat.lisp | Multi-turn chat with KV-cache reuse: per-turn prefill proportional to new tokens only |
| examples/tool-calling.lisp | Tool calls via chat template + XML parsing |
| examples/sampler-showcase.lisp | Grammars, lazy grammars, extended samplers |
| examples/sampler-comparison.lisp | Side-by-side comparison of all sampling strategies |
| examples/parallel.lisp | generate-parallel: multiple prompts in shared passes |
| examples/parallel-threads.lisp | Concurrent contexts on separate Lisp threads |
| examples/kv-cache.lisp | KV cache ops with assert-driven verification |
| examples/context-fork.lisp | In-memory state snapshotting and context forking |
| examples/lora.lisp | LoRA adapter loading and application |
| examples/resource-planning.lisp | VRAM estimation and configuration validation |
| examples/introspection.lisp | Model metadata, tensor listing, system info |
| examples/backend-lifecycle.lisp | Thread tuning, abort callbacks, time-limited generation |
| examples/perf-and-logging.lisp | Timing, throughput calculation, log capture |

Documentation

Built with cl-llama-cpp

  • cl-llama-chat — terminal interface for cl-llama-cpp providing automatic model configuration and dual-branch response selection across sampling configurations.

Running Tests

Smoke tests (no model required):

sbcl --eval '(ql:quickload "cl-llama-cpp/tests")' \
     --eval '(asdf:test-system "cl-llama-cpp/tests")' \
     --eval '(quit)'

Integration tests require GGUF model files set via environment variables (LLAMA_TEST_MODEL, LLAMA_TEST_EMBED_MODEL, LLAMA_TEST_LORA):

LLAMA_TEST_MODEL=/path/to/model.gguf \
LLAMA_TEST_EMBED_MODEL=/path/to/embedding-model.gguf \
LLAMA_TEST_LORA=/path/to/adapter.gguf \
sbcl --eval '(ql:quickload "cl-llama-cpp/tests")' \
     --eval '(asdf:test-system "cl-llama-cpp/tests")' \
     --eval '(quit)'

License

MIT
