cellm is a ground-up LLM serving engine for iOS and Android, written in Rust. It brings serving-engine concepts — paged KV cache management, continuous decode scheduling, multi-session concurrency, and a high-performance CLI — to phones running under 512 MB of RAM.
Not a wrapper around llama.cpp. Not a port of vLLM. A new runtime designed for mobile constraints from scratch.
cellm has evolved from an honest baseline into a multimodal-ready inference engine:
- Paged KV Cache: Fixed-size block allocation using `BlockAllocator` and `PageTable`.
- Multi-session Scheduler: Round-robin interleaved decoding for concurrent users.
- 4-bit Affine Dequantization: Native support for 4-bit affine-packed weights from MLX/HF.
- Multimodal Vision: Native ViT/SigLIP vision encoder and linear projector integration.
- Accelerated Math: Metal (macOS/iOS) compute kernels and SIMD-optimized CPU fallbacks.
- High-Performance CLI: Suite of tools for `.cellm` conversion, latency benchmarking, and debug inference.
- Vulkan Support: Cross-platform compute kernels (active research).
- Android Integration: Native Kotlin/JNI bindings and performance tuning (coming soon).
- Qwen iOS Porting: Integrate and optimize the Qwen inference path for native iOS deployment.
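The paged-cache idea can be sketched in a few lines. The types below are a hypothetical, simplified illustration; the real `BlockAllocator`/`PageTable` in `crates/cellm-cache` differ in detail:

```rust
/// Hands out fixed-size KV blocks from a free list (illustrative sketch).
struct BlockAllocator {
    free: Vec<usize>, // indices of free physical blocks
}

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        Self { free: (0..num_blocks).rev().collect() }
    }
    fn alloc(&mut self) -> Option<usize> {
        self.free.pop()
    }
    fn release(&mut self, block: usize) {
        self.free.push(block);
    }
}

/// Maps a session's logical block index to a physical block index.
struct PageTable {
    blocks: Vec<usize>,
    block_size: usize, // tokens per block
}

impl PageTable {
    fn new(block_size: usize) -> Self {
        Self { blocks: Vec::new(), block_size }
    }
    /// Ensure capacity for `num_tokens` tokens, allocating blocks on demand.
    fn grow_to(&mut self, num_tokens: usize, alloc: &mut BlockAllocator) -> bool {
        let needed = (num_tokens + self.block_size - 1) / self.block_size;
        while self.blocks.len() < needed {
            match alloc.alloc() {
                Some(b) => self.blocks.push(b),
                None => return false, // out of KV memory
            }
        }
        true
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(4); // 4 physical blocks
    let mut table = PageTable::new(16);     // 16 tokens per block
    assert!(table.grow_to(40, &mut alloc)); // 40 tokens -> 3 blocks
    assert_eq!(table.blocks.len(), 3);
    assert_eq!(alloc.free.len(), 1);
    alloc.release(table.blocks.pop().unwrap());
    assert_eq!(alloc.free.len(), 2);
}
```

Because sessions only hold logical block lists, memory can be handed back block-by-block when a session ends, which is what makes multi-session concurrency practical under a tight RAM budget.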
- Rust 1.75+ (Modern toolchain recommended)
To build the workspace:

```bash
cargo build --release
```

Run unoptimized CPU inference on a `.cellm` model:

```bash
cargo run --release --bin infer -- \
  --model models/smollm2-135m.cellm \
  --tokenizer models/hf/smollm2-135m/tokenizer.json \
  --prompt "Hello, how are you?" \
  --chat \
  --gen 32
```

Notes:

- `--chat` auto-detects ChatML-style tokens (for SmolLM2 this uses `<|im_start|>`/`<|im_end|>`). Without chat formatting, many base models behave like "text completion" and may not answer directly.
- Use `--chat-format plain` to force the simpler `User:`/`Assistant:` style. `--chat-format auto` (the default) only uses ChatML when the tokenizer advertises a chat template in `tokenizer_config.json`.
- `--max-layers` is only for debugging; using fewer layers significantly degrades quality.
- `--backend metal` now performs a Metal kernel smoke check before inference, then falls back to the current CPU math paths (full Metal forward kernels are still in progress).
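With ChatML formatting, the text actually sent to the model looks roughly like this. The helper below is illustrative, not cellm API; only the `<|im_start|>`/`<|im_end|>` token names come from the notes above:

```rust
/// Hypothetical helper: wrap a user message in ChatML-style markers and
/// open the assistant turn so the model continues as the assistant.
fn chatml_prompt(user_msg: &str) -> String {
    format!(
        "<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
        user_msg
    )
}

fn main() {
    let p = chatml_prompt("Hello, how are you?");
    // The formatted prompt contains the full user turn...
    assert!(p.contains("<|im_start|>user\nHello, how are you?<|im_end|>"));
    // ...and ends with an open assistant turn for the model to complete.
    assert!(p.ends_with("<|im_start|>assistant\n"));
}
```

This is why base models without chat formatting drift into plain text completion: nothing in the raw prompt marks where the "answer" should begin.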
Run with backend selection:

```bash
# CPU
cargo run --release --bin infer -- \
  --model models/smollm2-135m-int8.cellm \
  --tokenizer models/hf/smollm2-135m/tokenizer.json \
  --prompt "Hello" \
  --chat \
  --gen 16 \
  --backend cpu

# Metal (auto-falls back to CPU if unavailable)
cargo run --release --bin infer -- \
  --model models/smollm2-135m-int8.cellm \
  --tokenizer models/hf/smollm2-135m/tokenizer.json \
  --prompt "Hello" \
  --chat \
  --gen 16 \
  --backend metal
```

Qwen3.5 notes:
- Qwen3.5 4-bit MLX tokenizers sometimes store BPE merges as `[[a, b], ...]` instead of `["a b", ...]`; `infer` auto-normalizes this on load.
- Qwen3.5 mixes `full_attention` layers and `linear_attention` (DeltaNet / gated delta rule) layers; `infer` includes a CPU reference implementation for both.
- DeltaNet is stateful per session (separate from the paged KV cache). See `docs/qwen3_5-deltanet.md`.
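Conceptually, a gated delta-rule layer keeps a small per-session state matrix S and updates it once per token instead of growing a KV cache. The following is a generic sketch of that recurrence (decay S, nudge S·k toward v with a rank-1 correction, read out o = S·q); it is not cellm's exact Qwen3.5 kernel:

```rust
/// One delta-rule step on a d x d state matrix `s` (conceptual sketch).
/// `alpha` is the gate/decay in [0, 1]; `beta` is the write strength.
fn delta_step(
    s: &mut [Vec<f32>],
    k: &[f32],
    v: &[f32],
    q: &[f32],
    alpha: f32,
    beta: f32,
) -> Vec<f32> {
    let d = k.len();
    // 1) gated decay of the running state
    for row in s.iter_mut() {
        for x in row.iter_mut() {
            *x *= alpha;
        }
    }
    // 2) pred = S k: what the current state recalls for key k
    let pred: Vec<f32> = s
        .iter()
        .map(|row| row.iter().zip(k).map(|(a, b)| a * b).sum())
        .collect();
    // 3) S <- S - beta * (pred - v) k^T  (rank-1 delta-rule correction)
    for i in 0..d {
        let err = beta * (pred[i] - v[i]);
        for j in 0..d {
            s[i][j] -= err * k[j];
        }
    }
    // 4) o = S q
    s.iter()
        .map(|row| row.iter().zip(q).map(|(a, b)| a * b).sum())
        .collect()
}

fn main() {
    // From a zero state, one update with a unit key makes S recall v exactly.
    let mut s = vec![vec![0.0f32; 2]; 2];
    let o = delta_step(&mut s, &[1.0, 0.0], &[3.0, 4.0], &[1.0, 0.0], 1.0, 1.0);
    assert_eq!(o, vec![3.0, 4.0]);
}
```

Note the contrast with full attention: the state here is fixed-size per session, which is exactly why DeltaNet layers live outside the paged KV cache.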
Qwen3.5 quantization findings (March 30, 2026):
- Baseline text+vision `.cellm`: `models/qwen3.5-0.8b.cellm` = 1746.9 MB
- Int8 weight-only: `models/qwen3.5-0.8b-int8.cellm` = 1706.0 MB
- Int4 weight-only (all stacks): `models/qwen3.5-0.8b-int4.cellm` = 620.4 MB
- Int4 weight-only + text-only tensors: `models/qwen3.5-0.8b-int4-textonly.cellm` = 378.3 MB (< 500 MB)
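Per-row symmetric int4 can be illustrated roughly as follows: each row gets one scale (max|w| / 7), and two 4-bit two's-complement codes are packed per byte. The packing order and f32 scale here are assumptions for illustration, not the exact `.cellm` layout:

```rust
/// Quantize one weight row to symmetric int4 (illustrative sketch):
/// returns a per-row scale plus nibble-packed codes (low nibble first).
fn quantize_row_int4(row: &[f32]) -> (f32, Vec<u8>) {
    let amax = row.iter().fold(0f32, |m, &x| m.max(x.abs()));
    let scale = if amax == 0.0 { 1.0 } else { amax / 7.0 };
    let q: Vec<i8> = row
        .iter()
        .map(|&x| (x / scale).round().clamp(-7.0, 7.0) as i8)
        .collect();
    // pack two signed nibbles per byte (low-nibble-first is an assumption)
    let packed = q
        .chunks(2)
        .map(|c| {
            let lo = (c[0] as u8) & 0x0F;
            let hi = (*c.get(1).unwrap_or(&0) as u8) & 0x0F;
            lo | (hi << 4)
        })
        .collect();
    (scale, packed)
}

/// Recover one value: sign-extend a 4-bit code and apply the row scale.
fn dequantize_nibble(byte: u8, high: bool, scale: f32) -> f32 {
    let n = if high { byte >> 4 } else { byte & 0x0F };
    let signed = if n >= 8 { n as i8 - 16 } else { n as i8 };
    signed as f32 * scale
}

fn main() {
    let (scale, packed) = quantize_row_int4(&[7.0, -7.0]);
    assert_eq!(scale, 1.0);
    assert_eq!(packed, vec![0x97]); // 7 in low nibble, -7 (0b1001) in high
    assert_eq!(dequantize_nibble(packed[0], false, scale), 7.0);
    assert_eq!(dequantize_nibble(packed[0], true, scale), -7.0);
}
```

Packing two weights per byte is where the roughly 2x size drop from int8 to int4 in the table above comes from.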
Create Qwen3.5 int4:

```bash
./target/release/convert \
  --input models/hf/qwen3.5-0.8b \
  --output models/qwen3.5-0.8b-int4.cellm \
  --quantize-int4-symmetric
```

Create Qwen3.5 int4 text-only:

```bash
./target/release/convert \
  --input models/hf/qwen3.5-0.8b \
  --output models/qwen3.5-0.8b-int4-textonly.cellm \
  --quantize-int4-symmetric \
  --text-only
```

Run a prompt test on `qwen3.5-0.8b-int4.cellm`:

```bash
./target/release/infer \
  --model models/qwen3.5-0.8b-int4.cellm \
  --tokenizer models/hf/qwen3.5-0.8b/tokenizer.json \
  --prompt "Write one short sentence about Rust programming." \
  --chat \
  --chat-format chatml \
  --gen 24 \
  --temperature 0
```

Sample output:

```
Rust programming serves as a concise and concise language that simplifies the creation of programs, making it easier to understand the
```
Qwen int4 Metal run note:

- `--backend metal` currently reports: `Backend: metal (smoke ok). Forward path currently uses CPU math kernels.`
- This means the Metal device/kernel smoke test passes, but Qwen forward execution in `infer` is still on the CPU path today.
Repro command:

```bash
./target/release/infer \
  --model models/qwen3.5-0.8b-int4-textonly.cellm \
  --tokenizer models/hf/qwen3.5-0.8b/tokenizer.json \
  --prompt "Return exactly one uppercase letter: R" \
  --chat \
  --chat-format chatml \
  --gen 4 \
  --temperature 0 \
  --backend metal
```

Run a prompt test on `qwen3.5-0.8b-int4-textonly.cellm`:

```bash
./target/release/infer \
  --model models/qwen3.5-0.8b-int4-textonly.cellm \
  --tokenizer models/hf/qwen3.5-0.8b/tokenizer.json \
  --prompt "Write one short sentence about Rust programming." \
  --chat \
  --chat-format chatml \
  --gen 24 \
  --temperature 0
```

Sample output:

```
Rust programming serves as a concise and concise language that simplifies the creation of programs, making it easier to understand the
```
SmolVLM-256M-Instruct ships official ONNX exports (`vision_encoder`, `embed_tokens`, `decoder_model_merged`). For quick local validation on macOS, use:
```bash
cargo build --release -p cellm-vlm-onnx-infer

./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --onnx-variant fp16 \
  --image models/test_images/rococo.jpg \
  --prompt "Describe this image in detail." \
  --backend cpu \
  --split-image \
  --max-new-tokens 128
```

Recommended "works now" test command:

```bash
./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --onnx-variant fp16 \
  --image models/test_images/rococo.jpg \
  --prompt "Describe this image." \
  --split-image \
  --max-new-tokens 96 \
  --min-new-tokens 24 \
  --temperature 0.7 \
  --top-k 40 \
  --seed 1
```

Second image test:

```bash
./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --onnx-variant fp16 \
  --image models/test_images/rococo_1.jpg \
  --prompt "Describe this image." \
  --split-image \
  --max-new-tokens 96 \
  --min-new-tokens 24 \
  --temperature 0.7 \
  --top-k 40 \
  --seed 1
```

Notes:

- Use `--split-image` for best quality on detailed scenes (global + local tiles). It is slower, but improves caption relevance.
- Keep `--split-image` off for faster smoke tests.
- `vlm-infer` now computes decoder `position_ids` from the full attention history and grows `attention_mask` across decode steps, which fixes the repetitive/garbled outputs of earlier builds.
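The `position_ids`/`attention_mask` bookkeeping fix can be sketched like this (illustrative names and logic, not `vlm-infer` internals): each decode step appends one attendable entry to the mask, and the new token's position is derived from the attention history rather than reset to zero:

```rust
/// One decode step (sketch): grow the attention mask and derive the new
/// token's position_id from the number of attended positions so far.
fn decode_step_inputs(attention_mask: &mut Vec<i64>) -> i64 {
    attention_mask.push(1); // the newly generated token is attendable
    // position of the new token = attended positions before it
    attention_mask.iter().sum::<i64>() - 1
}

fn main() {
    // After a 5-token prefill, the first decode step sits at position 5.
    let mut mask = vec![1i64; 5];
    assert_eq!(decode_step_inputs(&mut mask), 5);
    assert_eq!(mask.len(), 6);
    // The next step advances to position 6 and the mask keeps growing.
    assert_eq!(decode_step_inputs(&mut mask), 6);
}
```

If the position were instead recomputed from only the current step's input (i.e. stuck at 0), rotary embeddings would see every generated token at the same position, which is one way the earlier repetitive/garbled outputs arise.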
Native `.cellm` decoder path (experimental):

```bash
./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --cellm-model models/smolvlm-256m-int8.cellm \
  --decoder-backend cellm \
  --onnx-variant fp16 \
  --image models/test_images/rococo.jpg \
  --prompt "Describe this image." \
  --max-new-tokens 24
```

- This uses ONNX for vision encoding and native `.cellm` for decoder text-stack execution.
Native `.cellm` vision + decoder path (experimental):

```bash
./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --cellm-model models/smolvlm-256m.cellm \
  --vision-backend cellm \
  --decoder-backend cellm \
  --image models/test_images/rococo.jpg \
  --prompt "Describe this image." \
  --max-new-tokens 12
```

- This bypasses ONNX model execution for both vision and decoder.
- Native vision now runs a full ViT encoder path in Rust (patch embed + 12 transformer blocks + post layernorm + connector projection).
- The native path now uses SIMD-optimized BLAS (Accelerate SGEMM on macOS) for linear layers and attention matmuls.
- Current limitation: ONNX Runtime is still faster for vision on the same machine.
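As a rough illustration of the patch-embed stage mentioned above: the image is cut into P x P patches, each patch is flattened, and a linear projection maps it to an embedding. This sketch uses a single channel and no bias; the real encoder works on 3-channel images and feeds 12 transformer blocks afterward:

```rust
/// Illustrative ViT patch embedding: split an HxW single-channel image
/// into PxP patches and project each flattened patch with `proj`
/// (one output dimension per projection row).
fn patch_embed(img: &[f32], h: usize, w: usize, p: usize,
               proj: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let mut out = Vec::new();
    for by in (0..h).step_by(p) {
        for bx in (0..w).step_by(p) {
            // flatten one PxP patch in row-major order
            let mut patch = Vec::with_capacity(p * p);
            for y in 0..p {
                for x in 0..p {
                    patch.push(img[(by + y) * w + (bx + x)]);
                }
            }
            // linear projection: dot product with each projection row
            out.push(
                proj.iter()
                    .map(|row| row.iter().zip(&patch).map(|(a, b)| a * b).sum())
                    .collect(),
            );
        }
    }
    out
}

fn main() {
    let img = vec![1.0f32; 16];       // 4x4 image of ones
    let proj = vec![vec![1.0f32; 4]]; // 1-dim embedding: sum of the patch
    let embeds = patch_embed(&img, 4, 4, 2, &proj);
    assert_eq!(embeds.len(), 4);      // four 2x2 patches
    assert_eq!(embeds[0], vec![4.0]); // each all-ones patch sums to 4
}
```

These per-patch projections are exactly the kind of dense matmul that the Accelerate SGEMM path accelerates on macOS.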
SDK FFI VLM smoke test (the same native `.cellm` vision+decoder stack used by iOS):

```bash
CELLM_VLM_TOKENIZER=models/hf/smolvlm-256m-instruct/tokenizer.json \
cargo run --release --bin vlm-smoke -- \
  --model models/smolvlm-256m-int8.cellm \
  --image models/test_images/rococo_1.jpg \
  --prompt "Describe this image."
```

Run VLM with backend selection:

```bash
# ONNX vision+decoder, CPU backend selection
./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --onnx-variant fp16 \
  --image models/test_images/rococo.jpg \
  --prompt "Describe this image." \
  --backend cpu

# ONNX vision+decoder, Metal requested (auto-fallback if unavailable)
./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --onnx-variant fp16 \
  --image models/test_images/rococo.jpg \
  --prompt "Describe this image." \
  --backend metal
```

See `docs/vlm-smolvlm-onnx.md` for details, debug flags, and current limitations.
Sequence progress is tracked in `docs/cellm-vlm-sequence.md`.
Test the latency and memory throughput of the engine:

```bash
# Run with tiny test configuration
cargo run --release --bin bench -- --model tiny

# Run with SmolLM2-135M configuration
cargo run --release --bin bench -- --model smollm2-135m --seq 128 --gen 64
```

Recent run-time snapshots (measured locally on March 29, 2026; these are reference numbers, not guaranteed across machines):
| Run | Key Params | Vision Time | Prefill Time | Decode Time |
|---|---|---|---|---|
| `infer` | `smollm2-135m.cellm`, 30 layers, prompt "Hello, how are you?", `--chat`, `--gen 8` | N/A | 12 tokens in 2.35s | 8 tokens in 1.62s |
| `infer` | `smollm2-135m-int8.cellm`, prompt "Hello, how are you?", `--chat`, `--gen 16` | N/A | 12 tokens in 2.38s | 16 tokens in 3.36s |
| `vlm-infer` ONNX fp16 | `rococo.jpg`, `--max-new-tokens 16` | [64, 576] in 0.99s | N/A | 16 tokens in 1.54s |
| `vlm-infer` ONNX quantized | `rococo.jpg`, `--onnx-variant quantized`, `--max-new-tokens 24` | [64, 576] in 1.44s | N/A | 18 tokens in 0.35s (EOS step 17) |
| `vlm-infer` native vision + ONNX decoder | `rococo.jpg`, `--vision-backend cellm --decoder-backend onnx --max-new-tokens 16` | [64, 576] in 5.96s | N/A | 16 tokens in 4.15s |
| `vlm-infer` native vision + native decoder | `rococo.jpg`, `--vision-backend cellm --decoder-backend cellm --max-new-tokens 24` | [64, 576] in 5.65s | N/A | 24 tokens in 18.39s |
CPU vs Metal request benchmark (same prompts/settings, local run on March 29, 2026):

| Tool | Backend Arg | Host Log | Vision Time | Prefill Time | Decode Time |
|---|---|---|---|---|---|
| `infer` (smollm2-135m-int8, `--gen 16`) | `--backend cpu` | `Backend: cpu (macos/aarch64)` | N/A | 12 tokens in 1.77s | 16 tokens in 2.07s |
| `infer` (smollm2-135m-int8, `--gen 16`) | `--backend metal` | `Backend: metal (smoke ok)` | N/A | 12 tokens in 1.75s | 16 tokens in 2.07s |
| `vlm-infer` (fp16, rococo.jpg, `--max-new-tokens 16`) | `--backend cpu` | `Backend: cpu (macos/aarch64)` | [64, 576] in 2.28s | N/A | 16 tokens in 2.37s |
| `vlm-infer` (fp16, rococo.jpg, `--max-new-tokens 16`) | `--backend metal` | `Backend: metal (smoke ok)` | [64, 576] in 1.85s | N/A | 16 tokens in 2.56s |
Note: in restricted/sandboxed shells, Metal device discovery can fail and trigger CPU fallback. On host macOS runs, Metal smoke succeeds.
For a dedicated benchmark page (commands + tables), see `docs/benchmarks/README.md`.
Metal troubleshooting:

```bash
# 1) Verify Metal device access
cargo run --release --bin metal-smoke

# 2) Verify infer picks Metal
./target/release/infer \
  --model models/smollm2-135m-int8.cellm \
  --tokenizer models/hf/smollm2-135m/tokenizer.json \
  --prompt "hello" \
  --gen 8 \
  --backend metal
```

(Local) Convert HuggingFace safetensors to `.cellm`:
```bash
cargo run --bin convert -- \
  --input ./models/hf/smollm2-135m \
  --output ./models/smollm2-135m.cellm \
  --dtype f16
```

PyTorch checkpoint import (`.bin` / `.pt`) is also supported (auto-converted to temporary safetensors first):

```bash
cargo run --bin convert -- \
  --input ./models/hf/some-model/pytorch_model.bin \
  --output ./models/some-model.cellm \
  --dtype f16
```

Working quantization option (Llama text stacks):

```bash
cargo run --bin convert -- \
  --input ./models/hf/smollm2-135m \
  --output ./models/smollm2-135m-int8.cellm \
  --dtype f16 \
  --quantize-int8-symmetric
```

- This is weight-only, per-row symmetric int8 for attention/MLP linear weights.
- `infer` runs these quantized weights directly (with per-row f16 scales).
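Per-row symmetric int8 reduces to a simple rule: one scale per row (max|w| / 127), then round each weight to the nearest code. A minimal sketch (f32 scale here for clarity; the converter stores per-row f16 scales):

```rust
/// Quantize one weight row to symmetric int8 (illustrative sketch):
/// scale = max|w| / 127, q = round(w / scale).
fn quantize_row_int8(row: &[f32]) -> (f32, Vec<i8>) {
    let amax = row.iter().fold(0f32, |m, &x| m.max(x.abs()));
    let scale = if amax == 0.0 { 1.0 } else { amax / 127.0 };
    (scale, row.iter().map(|&x| (x / scale).round() as i8).collect())
}

/// Recover approximate weights: w ~= q * scale.
fn dequantize_row_int8(scale: f32, q: &[i8]) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let (scale, q) = quantize_row_int8(&[1.5, -6.0]);
    assert_eq!(q, vec![32, -127]); // the row max always maps to +/-127
    let w = dequantize_row_int8(scale, &q);
    assert!((w[1] + 6.0).abs() < 1e-3); // roundtrip error bounded by scale/2
}
```

Because it is symmetric (no zero-point), dequantization is a single multiply per weight, which is why `infer` can run these weights directly in the hot path.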
Multimodal checkpoints (example: SmolVLM):
- `convert` now reads `text_config.model_type` (not just the top-level `model_type`) when writing `.cellm`, so multimodal wrappers map to the correct text backbone runner (llama/qwen) in `infer`/SDK.
- `.cellm` headers now include VLM-aware sections: `text_tensor_prefix`, `vision_tensor_prefix`, `projector_tensor_prefix`, plus the source `vision_config`/projector config metadata.
- If conversion fails with `metadata incomplete`, the local `.safetensors` file is truncated and must be re-downloaded before conversion.
- Keep enough free disk space for conversion (typically at least model_size + output_size + working headroom).
- Native `.cellm` VLM execution is now available in `vlm-infer` with `--vision-backend cellm --decoder-backend cellm` (experimental CPU path).
Quantized validation status:
- Text (quantized `.cellm`): tested and working. `smollm2-135m-int8.cellm` runs in `infer` and generates output; `smolvlm-256m-int8.cellm` also runs through the text path (`infer`).
- Vision (quantized): tested with the ONNX VLM path (`vlm-infer --onnx-variant quantized`) and produces image-relevant captions.
- Native `.cellm` vision execution is implemented in `vlm-infer` (`--vision-backend cellm`) and tested with both ONNX and native decoders.
- Native `.cellm` vision currently prioritizes correctness over speed (CPU-only Rust math, no fused kernels yet).
Quantized sizes:
- `models/smollm2-135m.cellm`: 257M
- `models/smollm2-135m-int8.cellm`: 156M (~39% smaller)
- `models/smolvlm-256m.cellm`: 489M
- `models/smolvlm-256m-int8.cellm`: 308M (~37% smaller)
Quantized checkpoints:
- Some HF folders (e.g. 4-bit affine: `uint32`-packed weights + `*.scales`/`*.biases`) require expanding weights to f16 during conversion: add `--dequant-4bit-affine`. This increases output size.
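The 4-bit affine scheme can be sketched as follows: each `u32` packs eight unsigned 4-bit codes, and each code expands as `w = q * scale + bias` using the accompanying `*.scales`/`*.biases` tensors. Nibble order and per-group scale handling below are assumptions about the layout, not a verified spec:

```rust
/// Expand one packed u32 into eight f32 weights (illustrative sketch):
/// w[i] = q[i] * scale + bias, with nibbles read low-first.
fn dequant_u32_affine(word: u32, scale: f32, bias: f32) -> [f32; 8] {
    let mut out = [0f32; 8];
    for i in 0..8 {
        let q = ((word >> (4 * i)) & 0xF) as f32; // i-th 4-bit code
        out[i] = q * scale + bias;
    }
    out
}

fn main() {
    // word 0x21 holds codes [1, 2, 0, 0, 0, 0, 0, 0] (low nibble first)
    let out = dequant_u32_affine(0x21, 2.0, 1.0);
    assert_eq!(out[0], 3.0); // 1 * 2 + 1
    assert_eq!(out[1], 5.0); // 2 * 2 + 1
    assert_eq!(out[7], 1.0); // 0 * 2 + 1
}
```

Expanding every code to f16 this way is why `--dequant-4bit-affine` grows the output: the 4-bit packing is traded away for a format the runtime can execute.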
Recommended Model: SmolLM2-135M
Sample .cellm checkpoints are included in the repository (tracked via Git LFS) and can be used for immediate testing:
- `models/smollm2-135m-int8.cellm`
- `models/smolvlm-256m-int8.cellm`
- `models/qwen3.5-0.8b-int4-textonly.cellm`
- `crates/cellm-core`: Memory arena, tensor layout, and op dispatch.
- `crates/cellm-model`: Model format, configuration, and weight management.
- `crates/cellm-cache`: Paged KV cache building blocks (allocator, page table, physical KV storage).
- `crates/cellm-sdk`: High-level public API for mobile consumers.
- `tools/bench`: Benchmark harness for TTFT and tok/s metrics.
- `tools/convert`: HuggingFace-to-`.cellm` conversion pipeline.
- `tools/infer`: Simple Rust inference runner for debugging models and cache behavior.
- `tools/vlm-onnx-infer`: Rust runner for SmolVLM ONNX exports (VLM validation on desktop).
Phase 6 provides native vision-language reasoning on device.
See `docs/paged-kv-cache-foundation.md` for a plain-English walkthrough of the `BlockAllocator`/`PageTable`/`KVCache` foundation.
Verify the Metal toolchain works on Apple Silicon/macOS (compile + dispatch a tiny compute kernel):
```bash
cargo run --release --bin metal-smoke
```

Build `bindings/swift/CellmFFI.xcframework` (staticlib + headers) so the Swift package works on both macOS (M-series dev) and iOS:

```bash
./scripts/build_xcframework.sh
```

There is a small SwiftUI demo app scaffold under:
`bindings/ios/CellmDemo`
It uses the C FFI from `cellm-sdk`:

- `cellm_engine_create_v3(...)` for engine + sampling + backend config (cpu/metal)
- `cellm_engine_backend_name(...)` to confirm the active backend in-app
- `cellm_tokenizer_create/encode/decode(...)` for prompt tokenization in-app
See `bindings/ios/CellmDemo/README.md` for the Xcode steps.
- Sampling PRNG: `cellm` uses simple, high-performance PRNGs for stochastic sampling.
  - Linear Congruential Generator (LCG): background on simple PRNG architectures.
  - Xorshift: the specific 64-bit implementation used in `cellm-sdk`.
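For reference, the classic 64-bit xorshift step (Marsaglia's 13/7/17 shift triple) looks like this; `cellm-sdk`'s exact seeding and variant may differ:

```rust
/// One step of the classic xorshift64 generator. State must be non-zero:
/// zero is a fixed point and the generator would never leave it.
fn xorshift64(state: &mut u64) -> u64 {
    debug_assert!(*state != 0, "xorshift state must be non-zero");
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

fn main() {
    let mut s = 1u64;
    // Deterministic: same seed, same stream (useful for reproducible sampling).
    assert_eq!(xorshift64(&mut s), 1082269761);
    assert_ne!(xorshift64(&mut s), 0);
}
```

Three shifts and three XORs per output make this far cheaper than a cryptographic RNG, which is the point for per-token sampling on a phone.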
Licensed under either of:
- MIT license (`LICENSE-MIT`)
- Apache License, Version 2.0 (`LICENSE-APACHE`)
at your option.