cellm — Mobile-Native LLM Serving Engine

cellm is a ground-up LLM serving engine for iOS and Android, written in Rust. It brings serving-engine concepts — paged KV cache management, continuous decode scheduling, multi-session concurrency, and a high-performance CLI — to phones running under 512MB RAM.

Not a wrapper around llama.cpp. Not a port of vLLM. A new runtime designed for mobile constraints from scratch.

Current Status: Phase 6 (Multimodal Vision)

cellm has evolved from an honest baseline into a multimodal-ready inference engine:

Paged KV Cache: Fixed-size block allocation using BlockAllocator & PageTable.
Multi-session Scheduler: Round-robin interleaved decoding for concurrent users.
4-bit Affine Dequantization: Native support for high-precision 4-bit packed weights from MLX/HF.
Multimodal Vision: Native ViT/SigLIP vision encoder and linear projector integration.
Accelerated Math: Metal (macOS/iOS) compute kernels and SIMD-optimized CPU fallbacks.
High-Performance CLI: Suite of tools for .cellm conversion, latency benchmarking, and debug inference.
Vulkan Support: Cross-platform compute kernels (Active Research).
Android Integration: Native Kotlin/JNI bindings and performance tuning (Coming Soon).
Qwen iOS Porting: Integrate and optimize Qwen inference path for native iOS deployment.
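
The round-robin interleaving in the scheduler item above can be pictured as a queue in which each session gets one decode step per turn. The following is a minimal sketch under that assumption; `Session` and `RoundRobin` are hypothetical names for illustration, not cellm's actual scheduler types.

```rust
use std::collections::VecDeque;

/// Hypothetical session record: an id plus how many tokens it still wants.
#[derive(Debug)]
struct Session {
    id: u32,
    remaining: usize,
}

/// Hypothetical round-robin scheduler: each turn, the session at the front
/// of the queue decodes one token and, if unfinished, rejoins at the back.
struct RoundRobin {
    queue: VecDeque<Session>,
}

impl RoundRobin {
    fn new() -> Self {
        Self { queue: VecDeque::new() }
    }

    /// Registers a session that wants `tokens` more tokens.
    fn add(&mut self, id: u32, tokens: usize) {
        if tokens > 0 {
            self.queue.push_back(Session { id, remaining: tokens });
        }
    }

    /// Runs one decode step and returns the id of the session that was served.
    fn step(&mut self) -> Option<u32> {
        let mut s = self.queue.pop_front()?;
        s.remaining -= 1; // stand-in for a real single-token decode
        let id = s.id;
        if s.remaining > 0 {
            self.queue.push_back(s);
        }
        Some(id)
    }
}
```

With three sessions wanting 2, 1, and 2 tokens, repeated calls to `step` serve them in the order 1, 2, 3, 1, 3, so no session waits for another to finish.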

Getting Started

Prerequisites

Rust 1.75+ (Modern toolchain recommended)

Build

To build the workspace:

cargo build --release

Run Inference (Smoke Test)

Run unoptimized CPU inference for a .cellm model:

cargo run --release --bin infer -- \
  --model models/smollm2-135m.cellm \
  --tokenizer models/hf/smollm2-135m/tokenizer.json \
  --prompt "Hello, how are you?" \
  --chat \
  --gen 32

Notes:

--chat auto-detects ChatML-style tokens (for SmolLM2 this uses <|im_start|> / <|im_end|>). Without chat formatting, many base models behave like “text completion” and may not answer directly.
Use --chat-format plain to force the simpler User:/Assistant: style. --chat-format auto (default) only uses ChatML when the tokenizer advertises a chat template in tokenizer_config.json.
--max-layers is only for debugging; using fewer layers will significantly degrade quality.
--backend metal performs a Metal kernel smoke check before inference, then runs the forward pass on the current CPU math paths (full Metal forward kernels are still in progress).
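
To make the two chat styles concrete, here is a rough sketch of how a single user prompt might be wrapped. The exact whitespace and the function names (`format_prompt`, `resolve_auto`) are assumptions for illustration; infer's real templates may differ.

```rust
/// Chat formats named in the notes above.
#[derive(Clone, Copy, PartialEq, Debug)]
enum ChatFormat {
    Plain,
    ChatMl,
}

/// Wraps a single user prompt for generation. The exact whitespace is an
/// assumption of this sketch.
fn format_prompt(format: ChatFormat, prompt: &str) -> String {
    match format {
        ChatFormat::ChatMl => format!(
            "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
        ),
        ChatFormat::Plain => format!("User: {prompt}\nAssistant:"),
    }
}

/// Mirrors the `auto` rule described above: ChatML only when the tokenizer
/// config advertises a chat template, otherwise the plain style.
fn resolve_auto(has_chat_template: bool) -> ChatFormat {
    if has_chat_template {
        ChatFormat::ChatMl
    } else {
        ChatFormat::Plain
    }
}
```

A base model given the raw prompt tends to continue the text, whereas either wrapper ends on an open assistant turn, which is what nudges the model to answer.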

Run with backend selection:

# CPU
cargo run --release --bin infer -- \
  --model models/smollm2-135m-int8.cellm \
  --tokenizer models/hf/smollm2-135m/tokenizer.json \
  --prompt "Hello" \
  --chat \
  --gen 16 \
  --backend cpu

# Metal (auto-falls back to CPU if unavailable)
cargo run --release --bin infer -- \
  --model models/smollm2-135m-int8.cellm \
  --tokenizer models/hf/smollm2-135m/tokenizer.json \
  --prompt "Hello" \
  --chat \
  --gen 16 \
  --backend metal

Qwen3.5 notes:

Qwen3.5 4-bit MLX tokenizers sometimes store BPE merges as [[a,b], ...] instead of ["a b", ...]. infer auto-normalizes this on load.
Qwen3.5 mixes full_attention layers and linear_attention (DeltaNet / gated delta rule) layers. infer includes a CPU reference implementation for both.
DeltaNet is stateful per session (separate from the paged KV cache). See docs/qwen3_5-deltanet.md.
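
The merge normalization mentioned above amounts to joining each pair with a single space. A minimal sketch follows, assuming the entries have already been parsed from tokenizer.json into a hypothetical `MergeEntry` enum (JSON parsing is omitted, and this is not infer's actual code):

```rust
/// A BPE merge entry as it may appear in tokenizer.json: either a
/// two-element pair or a single space-joined string.
#[derive(Debug, Clone, PartialEq)]
enum MergeEntry {
    Pair(String, String),
    Joined(String),
}

/// Converts every entry to the "a b" string form, leaving entries that are
/// already joined untouched.
fn normalize_merges(merges: Vec<MergeEntry>) -> Vec<String> {
    merges
        .into_iter()
        .map(|m| match m {
            MergeEntry::Pair(a, b) => format!("{a} {b}"),
            MergeEntry::Joined(s) => s,
        })
        .collect()
}
```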

Qwen3.5 quantization findings (March 30, 2026):

Baseline text+vision .cellm: models/qwen3.5-0.8b.cellm = 1746.9 MB
Int8 weight-only: models/qwen3.5-0.8b-int8.cellm = 1706.0 MB
Int4 weight-only (all stacks): models/qwen3.5-0.8b-int4.cellm = 620.4 MB
Int4 weight-only + text-only tensors: models/qwen3.5-0.8b-int4-textonly.cellm = 378.3 MB (< 500MB)

Create Qwen3.5 int4:

./target/release/convert \
  --input models/hf/qwen3.5-0.8b \
  --output models/qwen3.5-0.8b-int4.cellm \
  --quantize-int4-symmetric

Create Qwen3.5 int4 text-only:

./target/release/convert \
  --input models/hf/qwen3.5-0.8b \
  --output models/qwen3.5-0.8b-int4-textonly.cellm \
  --quantize-int4-symmetric \
  --text-only

Run prompt test on qwen3.5-0.8b-int4.cellm:

./target/release/infer \
  --model models/qwen3.5-0.8b-int4.cellm \
  --tokenizer models/hf/qwen3.5-0.8b/tokenizer.json \
  --prompt "Write one short sentence about Rust programming." \
  --chat \
  --chat-format chatml \
  --gen 24 \
  --temperature 0

Sample output:

Rust programming serves as a concise and concise language that simplifies the creation of programs, making it easier to understand the

Qwen int4 Metal run note:

--backend metal currently reports: Backend: metal (smoke ok). Forward path currently uses CPU math kernels.
This means the Metal device/kernel smoke check passes, but the Qwen forward pass in infer still runs on the CPU path.

Repro command:

./target/release/infer \
  --model models/qwen3.5-0.8b-int4-textonly.cellm \
  --tokenizer models/hf/qwen3.5-0.8b/tokenizer.json \
  --prompt "Return exactly one uppercase letter: R" \
  --chat \
  --chat-format chatml \
  --gen 4 \
  --temperature 0 \
  --backend metal

Run prompt test on qwen3.5-0.8b-int4-textonly.cellm:

./target/release/infer \
  --model models/qwen3.5-0.8b-int4-textonly.cellm \
  --tokenizer models/hf/qwen3.5-0.8b/tokenizer.json \
  --prompt "Write one short sentence about Rust programming." \
  --chat \
  --chat-format chatml \
  --gen 24 \
  --temperature 0

Sample output:

Rust programming serves as a concise and concise language that simplifies the creation of programs, making it easier to understand the

Run VLM (SmolVLM-256M via ONNX, Rust validation)

SmolVLM-256M-Instruct ships official ONNX exports (vision_encoder, embed_tokens, decoder_model_merged). For quick local validation on macOS, use:

cargo build --release -p cellm-vlm-onnx-infer

./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --onnx-variant fp16 \
  --image models/test_images/rococo.jpg \
  --prompt "Describe this image in detail." \
  --backend cpu \
  --split-image \
  --max-new-tokens 128

Recommended “works now” test command:

./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --onnx-variant fp16 \
  --image models/test_images/rococo.jpg \
  --prompt "Describe this image." \
  --split-image \
  --max-new-tokens 96 \
  --min-new-tokens 24 \
  --temperature 0.7 \
  --top-k 40 \
  --seed 1

Second image test:

./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --onnx-variant fp16 \
  --image models/test_images/rococo_1.jpg \
  --prompt "Describe this image." \
  --split-image \
  --max-new-tokens 96 \
  --min-new-tokens 24 \
  --temperature 0.7 \
  --top-k 40 \
  --seed 1

Notes:

Use --split-image for best quality on detailed scenes (global + local tiles). It is slower, but improves caption relevance.
Keep --split-image off for faster smoke tests.
vlm-infer now computes decoder position_ids from the full attention history and grows attention_mask across decode steps, which fixes repetitive/garbled outputs from earlier builds.
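
The position_ids fix described above can be sketched as simple bookkeeping: the mask covers every token seen so far, and each new token's position is derived from that full history. `DecodeState` is a hypothetical name, and this is an illustration of the idea rather than vlm-infer's implementation.

```rust
/// Hypothetical decode-time bookkeeping: the mask covers every token seen so
/// far (image, prompt, and generated), and the next position id is derived
/// from that full history rather than from the current step alone.
struct DecodeState {
    attention_mask: Vec<i64>,
}

impl DecodeState {
    /// Starts from a prefill of `prefill_len` attended tokens.
    fn after_prefill(prefill_len: usize) -> Self {
        Self { attention_mask: vec![1; prefill_len] }
    }

    /// Registers one new token: grows the mask by one and returns that
    /// token's position id (count of attended tokens minus one).
    fn next_step(&mut self) -> i64 {
        self.attention_mask.push(1);
        self.attention_mask.iter().sum::<i64>() - 1
    }
}
```

After a 5-token prefill, the first generated token gets position 5 and the second gets position 6, while the mask grows to length 7. Restarting positions at each step, or keeping the mask at its prefill length, is the kind of mismatch that produces repetitive output.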

Native .cellm decoder path (experimental):

./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --cellm-model models/smolvlm-256m-int8.cellm \
  --decoder-backend cellm \
  --onnx-variant fp16 \
  --image models/test_images/rococo.jpg \
  --prompt "Describe this image." \
  --max-new-tokens 24

This uses ONNX for vision encoding and native .cellm for decoder text-stack execution.

Native .cellm vision + decoder path (experimental):

./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --cellm-model models/smolvlm-256m.cellm \
  --vision-backend cellm \
  --decoder-backend cellm \
  --image models/test_images/rococo.jpg \
  --prompt "Describe this image." \
  --max-new-tokens 12

This bypasses ONNX model execution for both vision and decoder.
Native vision now runs a full ViT encoder path in Rust (patch embed + 12 transformer blocks + post layernorm + connector projection).
The native path now uses SIMD-optimized BLAS (Accelerate SGEMM on macOS) for linear layers and attention matmuls.
Current limitation: ONNX Runtime is still faster for vision on the same machine.

SDK FFI VLM smoke test (same native .cellm vision+decoder stack used by iOS):

CELLM_VLM_TOKENIZER=models/hf/smolvlm-256m-instruct/tokenizer.json \
cargo run --release --bin vlm-smoke -- \
  --model models/smolvlm-256m-int8.cellm \
  --image models/test_images/rococo_1.jpg \
  --prompt "Describe this image."

Run VLM with backend selection:

# ONNX vision+decoder, CPU backend selection
./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --onnx-variant fp16 \
  --image models/test_images/rococo.jpg \
  --prompt "Describe this image." \
  --backend cpu

# ONNX vision+decoder, Metal requested (auto-fallback if unavailable)
./target/release/vlm-infer \
  --model-dir models/hf/smolvlm-256m-instruct \
  --onnx-variant fp16 \
  --image models/test_images/rococo.jpg \
  --prompt "Describe this image." \
  --backend metal

See docs/vlm-smolvlm-onnx.md for details, debug flags, and current limitations. Sequence progress is tracked in docs/cellm-vlm-sequence.md.

Run Benchmarks

Test the latency and memory throughput of the engine:

# Run with tiny test configuration
cargo run --release --bin bench -- --model tiny

# Run with SmolLM2-135M configuration
cargo run --release --bin bench -- --model smollm2-135m --seq 128 --gen 64

Recent run-time snapshots (measured locally on March 29, 2026; these are reference numbers, not guaranteed across machines):

| Run | Key Params | Vision Time | Prefill Time | Decode Time |
| --- | --- | --- | --- | --- |
| `infer` | `smollm2-135m.cellm`, 30 layers, prompt `"Hello, how are you?"`, `--chat`, `--gen 8` | N/A | `12` tokens in `2.35s` | `8` tokens in `1.62s` |
| `infer` | `smollm2-135m-int8.cellm`, prompt `"Hello, how are you?"`, `--chat`, `--gen 16` | N/A | `12` tokens in `2.38s` | `16` tokens in `3.36s` |
| `vlm-infer` ONNX fp16 | `rococo.jpg`, `--max-new-tokens 16` | `[64, 576]` in `0.99s` | N/A | `16` tokens in `1.54s` |
| `vlm-infer` ONNX quantized | `rococo.jpg`, `--onnx-variant quantized`, `--max-new-tokens 24` | `[64, 576]` in `1.44s` | N/A | `18` tokens in `0.35s` (EOS step `17`) |
| `vlm-infer` native vision + ONNX decoder | `rococo.jpg`, `--vision-backend cellm --decoder-backend onnx --max-new-tokens 16` | `[64, 576]` in `5.96s` | N/A | `16` tokens in `4.15s` |
| `vlm-infer` native vision + native decoder | `rococo.jpg`, `--vision-backend cellm --decoder-backend cellm --max-new-tokens 24` | `[64, 576]` in `5.65s` | N/A | `24` tokens in `18.39s` |
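
The decode timings above convert to tokens per second by simple division. The helper below is a small illustrative sketch (the function name is hypothetical and not part of the bench tool):

```rust
/// Decode throughput in tokens per second. Returns None for a non-positive
/// duration so callers never divide by zero.
fn tokens_per_second(tokens: u32, seconds: f64) -> Option<f64> {
    if seconds > 0.0 {
        Some(tokens as f64 / seconds)
    } else {
        None
    }
}
```

For example, 8 tokens in 1.62s is about 4.9 tok/s, 16 tokens in 1.54s is about 10.4 tok/s, and 24 tokens in 18.39s is about 1.3 tok/s.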

CPU vs Metal request benchmark (same prompts/settings, local run on March 29, 2026):

| Tool | Backend Arg | Host Log | Vision Time | Prefill Time | Decode Time |
| --- | --- | --- | --- | --- | --- |
| `infer` (`smollm2-135m-int8`, `--gen 16`) | `--backend cpu` | `Backend: cpu (macos/aarch64)` | N/A | `12` tokens in `1.77s` | `16` tokens in `2.07s` |
| `infer` (`smollm2-135m-int8`, `--gen 16`) | `--backend metal` | `Backend: metal (smoke ok)` | N/A | `12` tokens in `1.75s` | `16` tokens in `2.07s` |
| `vlm-infer` (`fp16`, `rococo.jpg`, `--max-new-tokens 16`) | `--backend cpu` | `Backend: cpu (macos/aarch64)` | `[64, 576]` in `2.28s` | N/A | `16` tokens in `2.37s` |
| `vlm-infer` (`fp16`, `rococo.jpg`, `--max-new-tokens 16`) | `--backend metal` | `Backend: metal (smoke ok)` | `[64, 576]` in `1.85s` | N/A | `16` tokens in `2.56s` |

Note: in restricted/sandboxed shells, Metal device discovery can fail and trigger CPU fallback. On host macOS runs, Metal smoke succeeds.
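
The fallback behaviour in the note above can be summarised as a small decision function. This is a sketch only: `select_backend` and the `metal_smoke_ok` flag are hypothetical stand-ins for whatever the tools do internally after running the smoke check.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Backend {
    Cpu,
    Metal,
}

/// Resolves the backend to use. `metal_smoke_ok` stands in for the result of
/// a Metal device/kernel smoke check; it is a hypothetical input here.
fn select_backend(requested: Backend, metal_smoke_ok: bool) -> Backend {
    match requested {
        Backend::Metal if metal_smoke_ok => Backend::Metal,
        Backend::Metal => Backend::Cpu, // device discovery failed: fall back
        Backend::Cpu => Backend::Cpu,
    }
}
```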

For a dedicated benchmark page (commands + tables), see docs/benchmarks/README.md.

Metal troubleshooting:

# 1) Verify Metal device access
cargo run --release --bin metal-smoke

# 2) Verify infer picks Metal
./target/release/infer \
  --model models/smollm2-135m-int8.cellm \
  --tokenizer models/hf/smollm2-135m/tokenizer.json \
  --prompt "hello" \
  --gen 8 \
  --backend metal

Convert Models

(Local) Convert HuggingFace safetensors to .cellm:

cargo run --bin convert -- \
  --input  ./models/hf/smollm2-135m \
  --output ./models/smollm2-135m.cellm \
  --dtype  f16

PyTorch checkpoint import (.bin / .pt) is also supported (auto-converted to temporary safetensors first):

cargo run --bin convert -- \
  --input  ./models/hf/some-model/pytorch_model.bin \
  --output ./models/some-model.cellm \
  --dtype  f16

Working quantization option (Llama text stacks):

cargo run --bin convert -- \
  --input  ./models/hf/smollm2-135m \
  --output ./models/smollm2-135m-int8.cellm \
  --dtype  f16 \
  --quantize-int8-symmetric

This is weight-only per-row symmetric int8 for attention/MLP linear weights.
infer runs these quantized weights directly (with per-row f16 scales).
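
As an illustration of per-row symmetric int8, the sketch below quantizes one weight row with a single scale and dequantizes it again. It is not the converter's code: the names are hypothetical, and the scale is kept as f32 here because stable Rust's standard library has no f16 type, whereas the text above describes f16 scales.

```rust
/// One quantized row: int8 values plus a single scale (f32 in this sketch).
struct QuantRow {
    values: Vec<i8>,
    scale: f32,
}

/// Symmetric quantization: scale = max|x| / 127, q = round(x / scale).
/// An all-zero row uses scale 1.0 to avoid dividing by zero.
fn quantize_row(row: &[f32]) -> QuantRow {
    let max_abs = row.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let values = row
        .iter()
        .map(|x| (x / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    QuantRow { values, scale }
}

/// Dequantization is a single multiply per element.
fn dequantize_row(q: &QuantRow) -> Vec<f32> {
    q.values.iter().map(|&v| v as f32 * q.scale).collect()
}
```

Because there is no zero point, the round-trip error per element is bounded by roughly half the row's scale, which is why rows with one large outlier lose the most precision.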

Multimodal checkpoints (example: SmolVLM):

convert now reads text_config.model_type (not just top-level model_type) when writing .cellm, so multimodal wrappers can map to the correct text backbone runner (llama/qwen) in infer/SDK.
.cellm headers now include VLM-aware sections: text_tensor_prefix, vision_tensor_prefix, projector_tensor_prefix, plus source vision_config/projector config metadata.
If conversion fails with a "metadata incomplete" error, the local .safetensors file is likely truncated; re-download it before converting again.
Keep enough free disk space for conversion (typically at least model_size + output_size + working headroom).
Native .cellm VLM execution is now available in vlm-infer with --vision-backend cellm --decoder-backend cellm (experimental CPU path).
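
The model_type lookup described in the first note above boils down to "prefer the nested text config, else fall back to the top level". A minimal sketch with hypothetical stand-in structs (the real convert tool reads these fields from config.json):

```rust
/// Minimal stand-ins for the parts of config.json used here.
struct TextConfig {
    model_type: Option<String>,
}

struct HfConfig {
    model_type: Option<String>,
    text_config: Option<TextConfig>,
}

/// Prefers text_config.model_type, falling back to the top-level model_type.
fn text_backbone_type(cfg: &HfConfig) -> Option<&str> {
    cfg.text_config
        .as_ref()
        .and_then(|t| t.model_type.as_deref())
        .or(cfg.model_type.as_deref())
}
```

With this rule, a multimodal wrapper whose top-level type is not a text runner still resolves to its text backbone, while a plain text checkpoint behaves as before.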

Quantized validation status:

Text (quantized .cellm): tested and working.
smollm2-135m-int8.cellm runs in infer and generates output.
smolvlm-256m-int8.cellm also runs through the text path (infer).
Vision (quantized): tested with ONNX VLM path (vlm-infer --onnx-variant quantized) and produces image-relevant captions.
Native .cellm vision execution is implemented in vlm-infer (--vision-backend cellm) and tested with both ONNX and native decoders.
Native .cellm vision currently prioritizes correctness over speed (CPU-only Rust math, no fused kernels yet).

Quantized sizes:

models/smollm2-135m.cellm: 257M
models/smollm2-135m-int8.cellm: 156M (~39% smaller)
models/smolvlm-256m.cellm: 489M
models/smolvlm-256m-int8.cellm: 308M (~37% smaller)

Quantized checkpoints:

Some HF folders (e.g. 4-bit affine: uint32 packed weights + *.scales/*.biases) require expanding weights to f16 during conversion: add --dequant-4bit-affine. This increases output size.
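
To show what expanding 4-bit affine weights involves, here is a rough sketch that unpacks eight 4-bit values from each u32 and applies w = q * scale + bias per group. The nibble order (lowest bits first), the f32 scale/bias types, and the function name are assumptions of this sketch, not a statement of how convert implements --dequant-4bit-affine.

```rust
/// Unpacks eight 4-bit values from each u32 (lowest nibble first, which is an
/// assumption of this sketch) and applies w = q * scale + bias, where each
/// run of `group_size` consecutive weights shares one scale and one bias.
fn dequant_4bit_affine(
    packed: &[u32],
    scales: &[f32],
    biases: &[f32],
    group_size: usize,
) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 8);
    for (word_idx, &word) in packed.iter().enumerate() {
        for nibble in 0..8usize {
            let q = (word >> (4 * nibble)) & 0xF;
            let group = (word_idx * 8 + nibble) / group_size;
            out.push(q as f32 * scales[group] + biases[group]);
        }
    }
    out
}
```

Each packed u32 becomes eight f16 values (16 bytes instead of 4, ignoring scales and biases), which is where the larger output size comes from.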

Recommended Model: SmolLM2-135M

Sample Models

Sample .cellm checkpoints are included in the repository (tracked via Git LFS) and can be used for immediate testing:

models/smollm2-135m-int8.cellm
models/smolvlm-256m-int8.cellm
models/qwen3.5-0.8b-int4-textonly.cellm

Directory Structure

crates/cellm-core: Memory arena, tensor layout, and op dispatch.
crates/cellm-model: Model format, configuration, and weight management.
crates/cellm-cache: Paged KV cache building blocks (allocator, page table, physical KV storage).
crates/cellm-sdk: High-level public API for mobile consumers.
tools/bench: Benchmark harness for TTFT and tok/s metrics.
tools/convert: HuggingFace to .cellm conversion pipeline.
tools/infer: Simple Rust inference runner for debugging models and cache behavior.
tools/vlm-onnx-infer: Rust runner for SmolVLM ONNX exports (VLM validation on desktop).

Phase 6 provides native vision-language reasoning on device.

See docs/paged-kv-cache-foundation.md for a plain-English walkthrough of the BlockAllocator/PageTable/KVCache foundation.
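
For readers who want the gist before opening that document, the sketch below shows one way a fixed-size block allocator and a per-session page table can fit together. The type names match the ones mentioned above, but the fields and methods are hypothetical and are not taken from cellm-cache.

```rust
/// Hypothetical fixed-size block allocator: hands out physical block ids
/// from a free list and takes them back on release.
struct BlockAllocator {
    free: Vec<usize>,
}

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        // Reverse so that block 0 is handed out first.
        Self { free: (0..num_blocks).rev().collect() }
    }

    fn alloc(&mut self) -> Option<usize> {
        self.free.pop()
    }

    fn release(&mut self, block: usize) {
        self.free.push(block);
    }
}

/// Hypothetical per-session page table mapping a logical token position to
/// a (physical block, offset within block) pair.
struct PageTable {
    block_size: usize,
    blocks: Vec<usize>,
    len: usize,
}

impl PageTable {
    fn new(block_size: usize) -> Self {
        Self { block_size, blocks: Vec::new(), len: 0 }
    }

    /// Reserves a slot for one more token, allocating a new block only when
    /// the current one is full. Returns None when no blocks are left.
    fn append(&mut self, alloc: &mut BlockAllocator) -> Option<(usize, usize)> {
        if self.len == self.blocks.len() * self.block_size {
            self.blocks.push(alloc.alloc()?);
        }
        let slot = (self.blocks[self.len / self.block_size], self.len % self.block_size);
        self.len += 1;
        Some(slot)
    }

    /// Returns all blocks to the allocator when the session ends.
    fn free_all(&mut self, alloc: &mut BlockAllocator) {
        for b in self.blocks.drain(..) {
            alloc.release(b);
        }
        self.len = 0;
    }
}
```

The point of the indirection is that a session only holds as many blocks as its current length needs, and a finished session's blocks become available to other sessions immediately.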

Metal Smoke Test (macOS)

Verify the Metal toolchain works on Apple Silicon/macOS (compile + dispatch a tiny compute kernel):

cargo run --release --bin metal-smoke

Build Swift XCFramework (iOS + macOS)

Build bindings/swift/CellmFFI.xcframework (staticlib + headers) so the Swift package works on both macOS (M-series dev) and iOS:

./scripts/build_xcframework.sh

iOS Demo App (LLM now, VLM stub)

There is a small SwiftUI demo app scaffold under:

bindings/ios/CellmDemo

It uses the C FFI from cellm-sdk:

cellm_engine_create_v3(...) for engine + sampling + backend config (cpu / metal)
cellm_engine_backend_name(...) to confirm active backend in-app
cellm_tokenizer_create/encode/decode(...) for prompt tokenization in-app

See bindings/ios/CellmDemo/README.md for the Xcode steps.

Design References

Sampling PRNG: cellm uses simple, high-performance PRNGs for stochastic sampling.
- Linear Congruential Generator (LCG) — Background on simple PRNG architectures.
- Xorshift — The specific 64-bit implementation used in cellm-sdk.
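
For orientation, a 64-bit Xorshift generator fits in a few lines. The sketch below uses Marsaglia's (13, 7, 17) shift triple; whether cellm-sdk uses the same constants, seeding rule, or float conversion is an assumption, and `XorShift64` is a hypothetical name.

```rust
/// Xorshift64 with the (13, 7, 17) shift triple. The constants used by
/// cellm-sdk may differ; this is an illustrative sketch.
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    /// A zero state would stay zero forever, so it is replaced with a fixed
    /// nonzero constant.
    fn new(seed: u64) -> Self {
        Self { state: if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed } }
    }

    /// Advances the state with three shift-xor steps and returns it.
    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Uniform f32 in [0, 1) built from the top 24 bits of the next output.
    fn next_f32(&mut self) -> f32 {
        (self.next_u64() >> 40) as f32 / (1u32 << 24) as f32
    }
}
```

A fixed seed reproduces the same stream, which is what makes seeded sampling runs repeatable.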

License

Licensed under either of:

MIT license (LICENSE-MIT)
Apache License, Version 2.0 (LICENSE-APACHE)

at your option.
