llm.c is a portable C library and CLI for native LLM inference using GGUF models and GGML. It loads transformer model weights, tokenizes prompt text with the tokenizer described in the GGUF metadata, builds the inference graph, applies optional LoRA adapters, generates tokens with sampling controls, and streams generated text through a small C callback API.
The library is designed as a standalone inference primitive for C/C++ applications. A `kc_llm_t` context owns the loaded model, backend resources, LoRA adapters, error state, and the optional KV cache used to accelerate generation.
Run native GGUF text generation from standard input. The CLI opens a model once and runs a resident request loop: each request is read from stdin until the `--until` delimiter byte (default EOT, byte 4), inference runs, the response is streamed to stdout, and the same delimiter is written at the end of each response. When stdin closes, the process exits cleanly.
Single response generation:

    echo "What is the capital of France?" | ./bin/x86_64/linux/llm --model llama-3-8b.gguf

With a LoRA adapter:

    printf '%s\n' "Hello" | ./bin/x86_64/linux/llm \
      --model base.gguf \
      --lora adapter.safetensors --lora-scale 0.8

| Flag | Description | Default |
|---|---|---|
| `-h`, `--help` | Show help and usage | - |
| `-v`, `--version` | Show version | - |
| `--model PATH` | Path to GGUF model file (required) | - |
| `--ctx N` | Context size in tokens | 2048 |
| `--predict N` | Max tokens to predict | 128 |
| `--threads N` | Number of threads | auto |
| `--gpu N` | GPU mode: `-1` auto, `0` CPU, `>0` require GPU | -1 |
| `--gpu-layers N` | Number of layers to offload to GPU | all |
| `--kv-cache N` | Enable (1) or disable (0) KV-caching | 1 |
| `--temp F` | Temperature for sampling | 0.80 |
| `--top-k N` | Top-k sampling parameter | 40 |
| `--top-p F` | Top-p sampling parameter | 0.95 |
| `--penalty F` | Repeat penalty parameter | 1.10 |
| `--repeat-last-n N` | Last tokens for penalty | 64 |
| `--seed N` | RNG seed (-1 for random) | -1 |
| `--until N` | Request/response delimiter byte | 4 (EOT) |
| `--lora PATH` | Apply a LoRA adapter (repeatable) | - |
| `--lora-scale F` | Scale for the previous LoRA | 1.0 |

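To make the sampling flags more concrete, the sketch below shows one conventional way a repeat penalty, temperature, and top-k filter can be applied to a logit vector before a token is drawn. It is an illustration only: the function names are hypothetical, the penalty rule (divide positive logits, multiply non-positive ones) is a widely used convention assumed here rather than confirmed for llm.c, and top-p is omitted because it additionally needs a softmax over the filtered logits.

```c
#include <assert.h>
#include <math.h>    /* INFINITY macro only; no libm functions are called */
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: penalize each distinct token id in `recent` once.
 * Assumed convention: positive logits are divided by `penalty`,
 * non-positive logits are multiplied by it. */
static void apply_repeat_penalty(float *logits, size_t n_vocab,
                                 const int *recent, size_t n_recent,
                                 float penalty)
{
    for (size_t i = 0; i < n_recent; i++) {
        int tok = recent[i];
        int seen = 0;
        if (tok < 0 || (size_t)tok >= n_vocab)
            continue;                       /* ignore out-of-range ids */
        for (size_t j = 0; j < i; j++)
            if (recent[j] == tok) { seen = 1; break; }
        if (seen)
            continue;                       /* already penalized once */
        if (logits[tok] > 0.0f)
            logits[tok] /= penalty;
        else
            logits[tok] *= penalty;
    }
}

/* Hypothetical helper: divide every logit by the temperature.
 * Values below 1 sharpen the distribution, values above 1 flatten it. */
static void apply_temperature(float *logits, size_t n_vocab, float temp)
{
    assert(temp > 0.0f);
    for (size_t i = 0; i < n_vocab; i++)
        logits[i] /= temp;
}

/* qsort comparator: descending order of float values. */
static int cmp_desc(const void *a, const void *b)
{
    float x = *(const float *)a, y = *(const float *)b;
    return (x < y) - (x > y);
}

/* Hypothetical helper: keep the k largest logits and mask the rest with
 * -INFINITY. Ties at the threshold are all kept. Returns 0 on success,
 * -1 if the temporary buffer cannot be allocated. */
static int apply_top_k(float *logits, size_t n_vocab, size_t k)
{
    if (k == 0 || k >= n_vocab)
        return 0;                           /* nothing to filter */
    float *sorted = malloc(n_vocab * sizeof *sorted);
    if (sorted == NULL)
        return -1;
    memcpy(sorted, logits, n_vocab * sizeof *sorted);
    qsort(sorted, n_vocab, sizeof *sorted, cmp_desc);
    float threshold = sorted[k - 1];
    free(sorted);
    for (size_t i = 0; i < n_vocab; i++)
        if (logits[i] < threshold)
            logits[i] = -INFINITY;
    return 0;
}
```

In this sketch the three steps would run in the order shown, with `recent` holding at most the last `--repeat-last-n` generated token ids. The order in which llm.c applies its own sampling stages is not stated here, so treat the ordering as an assumption too.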
Generated text is written directly to standard output as it is produced, followed by the `--until` delimiter byte. Diagnostics and errors are written to standard error.
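Because both requests and responses are framed by a single delimiter byte, a program talking to the CLI over pipes only needs a small "read until delimiter" routine. The following is a minimal sketch of such a routine, not code taken from llm.c; `read_request` and `REQUEST_DELIM` are hypothetical names, and the truncation behavior for oversized messages is a choice made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* ASCII EOT (byte 4), matching the documented --until default. */
#define REQUEST_DELIM 4

/*
 * Reads one delimiter-terminated message from `in` into `buf`
 * (capacity `cap`, including the terminating NUL).
 * Stops at `delim` or at end of input.
 * Returns the number of bytes stored, or -1 when the input was already
 * exhausted and nothing could be read (where a resident loop would exit).
 * Bytes that do not fit in `buf` are consumed and dropped.
 */
static long read_request(FILE *in, int delim, char *buf, size_t cap)
{
    assert(in != NULL && buf != NULL && cap > 0);

    size_t len = 0;
    int saw_any = 0;
    int c;

    while ((c = fgetc(in)) != EOF) {
        saw_any = 1;
        if (c == delim)
            break;                 /* end of this message */
        if (len + 1 < cap)
            buf[len++] = (char)c;  /* keep room for the NUL */
    }
    buf[len] = '\0';

    return saw_any ? (long)len : -1;
}
```

The same function can serve a client reading a response from the CLI's stdout or a test harness splitting several prompts out of one input stream.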
llm.c supports the GGUF model families implemented by its local graph builders and tokenizer backends:
- Llama-style (`llama`, `mistral`, `mixtral`): SiLU activation, RoPE, standard GQA
- Qwen-style (`qwen2`, `qwen2.5`, `qwen3`): SiLU activation, RoPE with Qwen freq base, Qwen3 Q/K RMS normalization
- Gemma-style (`gemma`): GELU activation, embedding scale √n_embd

The engine requires all mandatory tensors (token embeddings, output norm, and standard transformer block weights) to be present in the GGUF file. Quantized models (Q4_0, Q4_K_M, Q8_0, etc.) are supported via GGML.
Prompt text is encoded by the tokenizer implementation selected from the GGUF metadata. The engine currently implements:
- BPE: GPT-2 byte-level BPE (`tokenizer.ggml.model = gpt2`) with `gpt2`, `llama-bpe`, and `qwen2` pre-tokenizers.
- SentencePiece: Google SentencePiece (`tokenizer.ggml.model = llama`), used by LLaMA, Mistral, and Gemma models.
- Unigram: Google Unigram (`tokenizer.ggml.model = unigram`).

Model compatibility is therefore bounded by both the architecture metadata and the tokenizer metadata. Unsupported architectures, tokenizer models, or tokenizer pre-tokenizers fail during model load with a clear error.
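The selection rule described above can be pictured as a small dispatch on two metadata strings. The sketch below is illustrative only: `tok_kind_t` and `tok_kind_from_metadata` are hypothetical names, and rejecting a BPE model whose pre-tokenizer name is missing is an assumption of this example rather than documented llm.c behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    TOK_UNSUPPORTED = 0,
    TOK_BPE,
    TOK_SENTENCEPIECE,
    TOK_UNIGRAM
} tok_kind_t;

/*
 * Maps the tokenizer model name (the value of tokenizer.ggml.model) and
 * the pre-tokenizer name to a tokenizer kind.
 * `model` must not be NULL; `pre` may be NULL when the metadata has no
 * pre-tokenizer entry. Anything not listed in the README is reported as
 * TOK_UNSUPPORTED so the caller can fail the model load.
 */
static tok_kind_t tok_kind_from_metadata(const char *model, const char *pre)
{
    static const char *const bpe_pre[] = { "gpt2", "llama-bpe", "qwen2" };

    assert(model != NULL);

    if (strcmp(model, "llama") == 0)
        return TOK_SENTENCEPIECE;
    if (strcmp(model, "unigram") == 0)
        return TOK_UNIGRAM;
    if (strcmp(model, "gpt2") == 0 && pre != NULL) {
        /* BPE is accepted only with a known pre-tokenizer. */
        for (size_t i = 0; i < sizeof bpe_pre / sizeof bpe_pre[0]; i++)
            if (strcmp(pre, bpe_pre[i]) == 0)
                return TOK_BPE;
    }
    return TOK_UNSUPPORTED;
}
```

A loader built this way would turn `TOK_UNSUPPORTED` into the load-time error mentioned above instead of attempting to tokenize with a mismatched scheme.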

    #include "llm.h"

    /* write_callback and user_data are supplied by the caller. */
    kc_llm_options_t opts = { .model_path = "model.gguf", .ctx = 2048, .predict = 128 };
    kc_llm_t *ctx = NULL;

    if (kc_llm_open(&ctx, &opts) == 0) {
        /* Generated text is streamed to write_callback as it is produced. */
        kc_llm_generate(ctx, "Hello!", write_callback, user_data);
        kc_llm_close(ctx);
    }

llm.c uses a clear ownership model to ensure predictable memory behavior:
- Options: `kc_llm_options_t` is copied during `kc_llm_open()`. You can release your copy immediately.
- Callbacks: The buffer provided to the `kc_llm_write_fn` callback is owned by the library and is only valid for the duration of that specific callback execution.
- Errors: `kc_llm_error()` returns a pointer to a context-owned string. It remains valid until the next state-modifying call on that context.
- Generation: The caller owns prompt storage before and after each `kc_llm_generate()` call.

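Since the callback buffer is only valid during the callback, a caller that wants to keep the generated text has to copy it out. The sketch below shows one way to do that with a growable caller-owned buffer. The callback shape used here (`const char *`, length, `void *user_data`, returning 0 to continue) is an assumption for illustration; the real `kc_llm_write_fn` signature is defined in `llm.h` and may differ. `text_sink_t` and `sink_write` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Caller-owned accumulator for streamed text (hypothetical type). */
typedef struct {
    char  *data;  /* NUL-terminated once at least one chunk arrived */
    size_t len;   /* bytes stored, excluding the NUL */
    size_t cap;   /* allocated size of data */
} text_sink_t;

/*
 * Assumed callback shape: copies `n` bytes from the library-owned `chunk`
 * into the sink, growing it as needed. Returns 0 on success and -1 if
 * memory cannot be allocated. Nothing retains `chunk` after returning.
 */
static int sink_write(const char *chunk, size_t n, void *user_data)
{
    text_sink_t *sink = user_data;
    assert(sink != NULL);

    if (sink->len + n + 1 > sink->cap) {
        size_t new_cap = sink->cap ? sink->cap : 64;
        while (new_cap < sink->len + n + 1)
            new_cap *= 2;
        char *p = realloc(sink->data, new_cap);
        if (p == NULL)
            return -1;
        sink->data = p;
        sink->cap = new_cap;
    }
    if (n > 0)
        memcpy(sink->data + sink->len, chunk, n);
    sink->len += n;
    sink->data[sink->len] = '\0';
    return 0;
}
```

A zero-initialized `text_sink_t` passed as `user_data` would collect the whole response; the caller frees `data` with `free()` when done.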
Core API functions:

- `kc_llm_open()` - allocates and prepares a new LLM context with specific options.
- `kc_llm_lora_apply()` - loads and registers a safetensors LoRA adapter.
- `kc_llm_lora_clear()` - releases all applied adapters.
- `kc_llm_generate()` - performs synchronous generation from input. Supports streaming via callback.
- `kc_llm_stop()` - thread-safe mechanism to stop an ongoing generation.
- `kc_llm_close()` - releases the context and all associated resources.

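A thread-safe stop request alongside a synchronous generate call is commonly built on an atomic flag that the generation loop polls between tokens. The sketch below illustrates that general pattern with C11 atomics and a mock loop. It makes no claim about how llm.c implements `kc_llm_stop()`; `gen_state_t`, `gen_request_stop`, `gen_run`, and `stop_after_three` are all hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-context state holding the stop flag. */
typedef struct {
    atomic_bool stop_requested;
} gen_state_t;

/* Safe to call from any thread, or from inside a streaming callback. */
static void gen_request_stop(gen_state_t *s)
{
    assert(s != NULL);
    atomic_store(&s->stop_requested, true);
}

/*
 * Mock generation loop: "produces" up to max_tokens tokens and checks the
 * stop flag before each one. `on_token` (may be NULL) stands in for the
 * streaming callback. Returns the number of tokens produced.
 * The flag is cleared on entry, so a stop issued before the call starts
 * is ignored in this sketch.
 */
static size_t gen_run(gen_state_t *s, size_t max_tokens,
                      void (*on_token)(gen_state_t *, size_t))
{
    size_t produced = 0;

    assert(s != NULL);
    atomic_store(&s->stop_requested, false);
    while (produced < max_tokens) {
        if (atomic_load(&s->stop_requested))
            break;
        produced++;
        if (on_token != NULL)
            on_token(s, produced);
    }
    return produced;
}

/* Example hook: request a stop once three tokens have been produced. */
static void stop_after_three(gen_state_t *s, size_t produced)
{
    if (produced == 3)
        gen_request_stop(s);
}
```

Polling between tokens keeps the hot path free of locks; the cost is that a stop takes effect only after the token currently being computed has finished.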
Compiled artifacts are generated under bin/{arch}/{platform}/ for the host architecture running the build.

    make clean && make

CUDA support is opt-in. Pass `CUDA=1` to request a CUDA-enabled build. The flag only has an effect for supported targets and only when the build machine has a usable CUDA toolkit; otherwise the build remains CPU-only.

    make CUDA=1
    make CUDA=1 x86_64/linux

When GPU support is available, the engine follows these semantics:
- `--gpu -1` (Auto): Uses GPU if a compatible device is found, falls back to CPU otherwise.
- `--gpu 0`: Disables GPU and strictly uses the CPU backend.
- `--gpu >0`: Explicitly requires a GPU. Fails with a descriptive error if CUDA support was not enabled at build time or no compatible device is found at runtime.
- `--gpu-layers N`: Controls how many transformer layers are offloaded to VRAM. Any remaining layers are kept in RAM. This allows running large models on limited hardware by combining GPU and CPU resources.

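The GPU rules above amount to a small decision table, sketched below. This is a reading of the documented semantics, not code from llm.c: `backend_t`, `resolve_backend`, and `gpu_layer_count` are hypothetical names. Treating a negative layer request as "all layers" and clamping oversized requests to the layer count are assumptions made for the example.

```c
#include <assert.h>

typedef enum {
    BACKEND_ERROR = -1,  /* GPU was required but is not usable */
    BACKEND_CPU   = 0,
    BACKEND_GPU   = 1
} backend_t;

/*
 * gpu_mode:     value of --gpu (-1 auto, 0 CPU only, >0 require GPU)
 * cuda_built:   nonzero if the binary was built with CUDA support
 * device_found: nonzero if a compatible device is present at runtime
 */
static backend_t resolve_backend(int gpu_mode, int cuda_built, int device_found)
{
    int gpu_usable = cuda_built && device_found;

    if (gpu_mode == 0)
        return BACKEND_CPU;                       /* GPU disabled */
    if (gpu_mode > 0)
        return gpu_usable ? BACKEND_GPU : BACKEND_ERROR;
    return gpu_usable ? BACKEND_GPU : BACKEND_CPU; /* auto with fallback */
}

/*
 * Number of transformer layers placed on the GPU. The remaining
 * n_layers - result layers stay in system RAM.
 * Assumption: requested < 0 means "all", mirroring the documented default.
 */
static int gpu_layer_count(int n_layers, int requested)
{
    assert(n_layers >= 0);
    if (requested < 0 || requested > n_layers)
        return n_layers;
    return requested;
}
```

For a 32-layer model with `--gpu-layers 20`, this sketch would place 20 layers on the GPU and leave 12 on the CPU; which specific layers are offloaded is not specified here.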
The project is prepared to build artifacts for multiple architectures under bin/{arch}/{platform}/. A plain make builds only the current host architecture, while the targets below build the full matrix or a specific target.

    make all
    make x86_64/linux
    make x86_64/windows
    make i686/linux
    make i686/windows
    make aarch64/linux
    make aarch64/android
    make armv7/linux
    make armv7/android
    make armv7hf/linux
    make riscv64/linux
    make powerpc64le/linux
    make mips/linux
    make mipsel/linux
    make mips64el/linux
    make s390x/linux
    make loongarch64/linux

This project is distributed under the GNU General Public License version 3 (GPLv3).
