Pure Rust inference engine for transformer models on Apple Silicon.
No ONNX. No Python. No CUDA. No external ML runtime. Lattice implements the full compute graph — weight loading, tokenization, forward pass, and vector operations — in Rust, with hand-written Metal shaders and SIMD kernels.
Lattice is the only inference engine that correctly runs Qwen3.5's hybrid GatedDeltaNet architecture at 4-bit on Apple Silicon — with QuaRot-Q4 and LoRA hot-swap that neither Ollama nor MLX supports. On a fair end-to-end decode measurement it is ~1.9× faster than Ollama/llama.cpp. Apple's MLX (Metal-native) decodes faster than Lattice at raw throughput — Lattice's edge is portability (pure Rust, no Python/framework) plus those Q4 + adapter capabilities. Full table and methodology below.
```bash
# Reproduce on your hardware (macOS + ollama + uv):
./scripts/bench_apples_to_apples.sh
```

Built for inference on CPU and macOS GPU. Optimized for AVX2 (x86), NEON (ARM), and Metal (Apple Silicon) — not CUDA. If you need NVIDIA GPU inference, use candle or mistral.rs. Lattice targets the other 90% of deployments: servers, edge, laptops, and library dependencies that shouldn't drag in a 300 MB ONNX runtime.
```toml
[dependencies]
lattice-embed = "0.1"
tokio = { version = "1", features = ["full"] }
```

```rust
use lattice_embed::{EmbeddingService, EmbeddingModel, NativeEmbeddingService};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let service = NativeEmbeddingService::default();

    // Single embedding (BGE-small-en-v1.5, 384 dimensions)
    let embedding = service
        .embed_one("The quick brown fox jumps over the lazy dog", EmbeddingModel::default())
        .await?;
    println!("Dimensions: {}", embedding.len()); // 384

    // Batch
    let texts = vec![
        "First document".to_string(),
        "Second document".to_string(),
    ];
    let embeddings = service.embed(&texts, EmbeddingModel::BgeSmallEnV15).await?;

    // SIMD-accelerated similarity
    let similarity = lattice_embed::utils::cosine_similarity(&embeddings[0], &embeddings[1]);
    println!("Similarity: {:.4}", similarity);

    Ok(())
}
```

Model weights are downloaded from HuggingFace on first use and cached at `~/.lattice/models` (or `$LATTICE_MODEL_CACHE`).
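To see or override the cache location in a deployment, the resolution order is the environment variable, falling back to the home-directory default. A sketch of that logic (illustrative only, not lattice-embed's actual implementation):

```rust
// Illustrative only: mirrors the documented cache resolution above.
// The real lookup lives inside lattice-embed.
let home = std::env::var("HOME").unwrap_or_default();
let cache_root = std::env::var("LATTICE_MODEL_CACHE")
    .unwrap_or_else(|_| format!("{home}/.lattice/models"));
println!("weights will be cached under {cache_root}");
```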
| Feature | Description |
|---|---|
| Pure Rust compute | Hand-written SIMD kernels (AVX2/NEON). No C++, no ONNX, no CUDA. |
| Two transformer architectures | BERT/BGE encoder-only (mean pooling) and Qwen3 decoder-only (causal GQA, last-token pooling) |
| 9 supported models | BGE, mE5, MiniLM, Qwen3-Embedding families — see table below |
| Metal GPU backend | Native Apple Silicon acceleration via Metal MSL shaders. WGPU fallback for cross-platform. |
| Three pure Rust tokenizers | WordPiece, SentencePiece, BPE — no Hugging Face tokenizers C extension |
| Safetensors native | Memory-mapped weight loading from HuggingFace .safetensors format |
| MRL support | Matryoshka truncation for Qwen3 models (configurable output dimension >= 32) |
| LRU embedding cache | CachedEmbeddingService with sharded in-memory cache and hit/miss stats |
| LoRA adapter injection | Inject trained LoRA adapters into inference models at runtime |
| Knowledge distillation | Train small models from Claude/GPT/Gemini teacher soft labels |
| Optimal transport | Sinkhorn-Knopp solver (log-domain, epsilon-scaling) for embedding drift detection |
| Tiny fast networks | lattice-fann: sub-5ms classifiers with pre-allocated buffers, zero-alloc forward pass |
```
Application
    |
    v
lattice-embed       (public API — embedding service, SIMD distance ops, LRU cache)
    |
    v
lattice-inference   (transformer kernel — BERT/Qwen3 forward pass, tokenizers, weights)
    |
    +---> CPU (primary)         Metal (macOS)       WGPU (fallback)
          AVX2/NEON kernels     Apple Silicon       Vulkan/DX12

lattice-fann        (standalone — tiny network primitives, <5ms CPU inference)
lattice-transport   (standalone — optimal transport math, Wasserstein distances)
lattice-tune        (depends on fann + inference — LoRA, distillation, model registry)
```
The three leaf crates (inference, fann, transport) have zero intra-workspace dependencies
and can be used standalone.
| Crate | Description | LOC |
|---|---|---|
| `lattice-embed` | Embedding service — `EmbeddingService` trait, `NativeEmbeddingService`, `CachedEmbeddingService`, SIMD cosine/dot/euclidean, backfill, migration | ~14 k |
| `lattice-inference` | Transformer kernel — safetensors loading, BERT/BGE/Qwen3 forward pass, WordPiece/SentencePiece/BPE tokenizers, Metal/WGPU backends, LoRA hooks, KV cache, speculative decoding | ~69 k |
| `lattice-fann` | Fast neural network primitives — `NetworkBuilder`, pre-allocated layers, zero-alloc forward pass, backprop trainer, FANN binary format | ~7.5 k |
| `lattice-tune` | Training infrastructure — knowledge distillation pipeline, dataset management, LoRA adapter management, model registry with semver lineage | ~13 k |
| `lattice-transport` | Optimal transport math — Sinkhorn-Knopp (balanced + unbalanced), Wasserstein barycenters, embedding drift detection, log-domain throughout | ~5.3 k |
All local models load from the HuggingFace safetensors format.
| Model | Variant | Architecture | Dimensions | Max Tokens | Tokenizer |
|---|---|---|---|---|---|
| `BgeSmallEnV15` | BAAI/bge-small-en-v1.5 | BERT encoder | 384 | 512 | WordPiece |
| `BgeBaseEnV15` | BAAI/bge-base-en-v1.5 | BERT encoder | 768 | 512 | WordPiece |
| `BgeLargeEnV15` | BAAI/bge-large-en-v1.5 | BERT encoder | 1024 | 512 | WordPiece |
| `MultilingualE5Small` | intfloat/multilingual-e5-small | BERT encoder | 384 | 512 | SentencePiece |
| `MultilingualE5Base` | intfloat/multilingual-e5-base | BERT encoder | 768 | 512 | SentencePiece |
| `AllMiniLmL6V2` | sentence-transformers/all-MiniLM-L6-v2 | BERT encoder | 384 | 256 | WordPiece |
| `ParaphraseMultilingualMiniLmL12V2` | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | BERT encoder | 384 | 128 | WordPiece |
| `Qwen3Embedding0_6B` | Qwen/Qwen3-Embedding-0.6B | Decoder (GQA+RoPE) | 1024 | 8192 | BPE |
| `Qwen3Embedding4B` | Qwen/Qwen3-Embedding-4B | Decoder (GQA+RoPE) | 2560* | 8192 | BPE |
*Qwen3-Embedding-4B supports MRL truncation to any dimension >= 32.
E5 models expect asymmetric "query: " / "passage: " prefixes for retrieval. Qwen3 models
expect an instruction prefix. The EmbeddingModel enum exposes query_instruction() and
document_instruction() so callers can apply these correctly.
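Applied in code, that looks roughly like the following. This is a sketch: the helper methods are real per the note above, but their exact return type is assumed here to be a plain prefix string such as "query: ".

```rust
use lattice_embed::EmbeddingModel;

// Assumption: query_instruction()/document_instruction() return a string-like
// prefix (e.g. "query: " / "passage: " for E5); adapt to the real signature.
let model = EmbeddingModel::MultilingualE5Base;

let question = "how do I detect embedding drift?";
let query_text = format!("{}{}", model.query_instruction(), question);

let passage = "Sinkhorn-Knopp measures distribution shift between vector sets.";
let doc_text = format!("{}{}", model.document_instruction(), passage);

// query_text / doc_text are then passed to embed() exactly like plain strings.
```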
```rust
use lattice_embed::EmbeddingModel;
// Fast general-purpose English retrieval
let model = EmbeddingModel::BgeSmallEnV15; // 384-dim, fastest
// Balanced quality/speed
let model = EmbeddingModel::BgeBaseEnV15; // 768-dim
// Best quality, English
let model = EmbeddingModel::BgeLargeEnV15; // 1024-dim
// Multilingual retrieval
let model = EmbeddingModel::MultilingualE5Base; // 768-dim, 100+ languages
// Long context + multilingual (requires GPU for practical throughput)
let model = EmbeddingModel::Qwen3Embedding0_6B; // 1024-dim, 8K context
// MRL: variable output dimension
use lattice_embed::ModelConfig;
let config = ModelConfig::try_new(EmbeddingModel::Qwen3Embedding4B, Some(512))?;
```

| Feature | Default | Description |
|---|---|---|
| `native` | yes | Pure Rust inference via lattice-inference |
| `metal-gpu` | no | Metal GPU acceleration (macOS) |
| `avx512` | no | AVX-512 SIMD kernels (requires nightly) |

| Feature | Default | Description |
|---|---|---|
| `f16` | no | Half-precision weights |
| `metal-gpu` | no | Metal compute backend |
| `wgpu-gpu` | no | WGPU cross-platform GPU backend |
| `download` | yes | HuggingFace weight download with checksum verification |
| `backfill` | no | Re-embedding coordinator (requires rusqlite) |
```toml
# GPU acceleration on macOS
lattice-embed = { version = "0.1", features = ["metal-gpu"] }

# Cross-platform GPU
lattice-embed = { version = "0.1", features = ["wgpu-gpu"] }
```

lattice-embed exposes SIMD-accelerated vector utilities as a stable public API:

```rust
use lattice_embed::utils;
// Runtime dispatch: AVX2 on x86_64, NEON on aarch64, scalar fallback elsewhere
let sim = utils::cosine_similarity(&a, &b);
let dot = utils::dot_product(&a, &b);
let dist = utils::euclidean_distance(&a, &b);
utils::normalize(&mut vector); // in-place L2 normalization
// Batch operations
let sims = utils::batch_cosine_similarity(&pairs);
```

Measured performance on normalized vectors (internal benchmarks; numbers vary with hardware):
| Operation | Scalar | SIMD |
|---|---|---|
| cosine similarity (384-dim) | ~650 ns | ~90 ns |
| cosine similarity (768-dim) | ~1300 ns | ~180 ns |
| cosine similarity (1024-dim) | ~1700 ns | ~240 ns |
For tiny classifiers that need to run in under 5 ms on CPU:
```rust
use lattice_fann::{NetworkBuilder, Activation, BackpropTrainer, TrainingConfig, Trainer};

// Build a network
let mut network = NetworkBuilder::new()
    .input(784)
    .hidden(128, Activation::ReLU)
    .hidden(64, Activation::ReLU)
    .output(10, Activation::Softmax)
    .build()?;

println!("{}", network.architecture()); // "784 -> ReLU(128) -> ReLU(64) -> Softmax(10)"
println!("Parameters: {}", network.total_params());

// Forward pass (no heap allocation)
let output = network.forward(&input)?;

// Serialize to compact binary (magic "FANN")
let bytes = network.to_bytes();
let restored = lattice_fann::Network::from_bytes(&bytes)?;
```

Entropy-regularized optimal transport for measuring embedding geometry drift:
```rust
// Sinkhorn-Knopp in log-domain (numerically stable, no Gibbs kernel materialization)
// Balanced OT, unbalanced OT (KL-relaxed), Wasserstein barycenters
// Pre-allocated SinkhornWorkspace for zero-alloc inner loops
```

Primary use case: detect when an embedding model update has shifted the distribution of stored vectors enough to warrant re-indexing.
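For reference, a minimal standalone sketch of what a log-domain Sinkhorn solve computes. This is illustration only: it is not lattice-transport's API, it uses a fixed epsilon rather than epsilon-scaling, and it allocates inside the loop instead of reusing a pre-allocated workspace.

```rust
/// Entropic OT cost between two discrete distributions, log-domain Sinkhorn.
/// `cost[i][j]` is the pairwise cost, `a`/`b` are source/target marginals,
/// `eps` is the entropic regularization strength. Names are illustrative.
fn sinkhorn_log(cost: &[Vec<f64>], a: &[f64], b: &[f64], eps: f64, iters: usize) -> f64 {
    let (n, m) = (a.len(), b.len());
    let mut f = vec![0.0; n]; // dual potential for the source marginal
    let mut g = vec![0.0; m]; // dual potential for the target marginal

    // Numerically stable log-sum-exp over a slice.
    fn lse(xs: &[f64]) -> f64 {
        let max = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        max + xs.iter().map(|x| (x - max).exp()).sum::<f64>().ln()
    }

    for _ in 0..iters {
        for i in 0..n {
            let row: Vec<f64> = (0..m).map(|j| (g[j] - cost[i][j]) / eps).collect();
            f[i] = eps * (a[i].ln() - lse(&row));
        }
        for j in 0..m {
            let col: Vec<f64> = (0..n).map(|i| (f[i] - cost[i][j]) / eps).collect();
            g[j] = eps * (b[j].ln() - lse(&col));
        }
    }

    // Entropic transport cost: sum_ij P_ij * C_ij with P_ij = exp((f_i + g_j - C_ij)/eps).
    let mut total = 0.0;
    for i in 0..n {
        for j in 0..m {
            total += ((f[i] + g[j] - cost[i][j]) / eps).exp() * cost[i][j];
        }
    }
    total
}
```

Keeping the dual potentials `f` and `g` in log space avoids ever materializing the Gibbs kernel exp(-C/eps), which is what the "no Gibbs kernel materialization" note above refers to and what keeps the iteration stable for small regularization.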
Fair end-to-end measurement — slope method: tok/s = (N₂−N₁) / (T(N₂)−T(N₁)) for a fixed
prompt, so prompt prefill, model load, and per-call overhead cancel and every engine is measured
the same way (greedy, median of 5 runs, N₁=32, N₂=256).
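In code, the slope is a two-point difference quotient (an illustrative helper, not part of the bench script):

```rust
// Decode tok/s from two generation runs of the same prompt: n1 and n2 new tokens,
// t1 and t2 total wall-clock seconds. Prefill and model-load time appear in both
// measurements, so they cancel in the subtraction.
fn decode_tok_per_s(n1: u32, t1: f64, n2: u32, t2: f64) -> f64 {
    f64::from(n2 - n1) / (t2 - t1)
}

// e.g. decode_tok_per_s(32, t_at_32, 256, t_at_256)
```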
| Engine | Quant | decode tok/s | Notes |
|---|---|---|---|
| MLX | Q8 (g64) | 247 | Apple's Metal-native framework — fastest |
| Lattice | f16 | 157 | Pure Rust + Metal; ~1.9× Ollama |
| Ollama | Q8_0 | 84 | llama.cpp Metal backend |
Cross-check: Ollama's slope (84 tok/s) closely matches its own reported eval_duration rate (87 tok/s), confirming the methodology is sound.
Honest caveats. Lattice's Qwen3.5 decode path is f16 (`MetalQwen35State::new`); MLX is Q8 and Ollama is Q8_0, so the comparison is not quant-matched. There is no end-to-end Q8 path for Qwen3.5 in Lattice — the earlier "Q8 139" number was an f16 `forward_step` micro-benchmark mislabeled "Q8" (the bench's `load_q8_state` builds the same f16 state as the f16 path). 157 tok/s (f16, true end-to-end) is Lattice's actual decode rate for this model; the only other real end-to-end path is Q4 (`from_q4_dir`). MLX decodes faster than Lattice. Lattice's value is portability plus capabilities no other engine has on this model:
| Lattice-only capability | MLX | Ollama |
|---|---|---|
| QuaRot 4-bit (rotated quant) | ✗ | ✗ |
| Q4 + LoRA r8 hot-swap (no reload) | ✗ | ✗ |
| Pure Rust, zero Python / framework | ✗ | ✗ |
A previous version of this table claimed "+8% vs MLX"; that was a measurement artifact (a
criterion forward_step micro-bench compared against MLX end-to-end with prefill counted in the
decode rate) and has been corrected. All three engines implement the full GDN recurrence for
Qwen3.5's hybrid architecture (18 GatedDeltaNet + 6 GQA layers). Reproducible via
./scripts/bench_apples_to_apples.sh.
```bash
# Embedding throughput
cargo bench --package lattice-embed

# Metal GPU decode (macOS only, requires model weights)
cargo bench -p lattice-inference --features metal-gpu,f16 -- metal_decode

# Attention kernel
cargo bench --package lattice-inference --bench attention_bench
```

Performance depends on hardware, model size, batch size, and sequence length. Run the benchmarks on your target hardware to get representative numbers.
- Architecture — crate dependency graph, design decisions, stability tiers
- ADR directory: docs/adr/ — architectural decision records
Apache-2.0. See LICENSE.
Built by Ocean (HaiyangLi). Powers khive, a cognitive infrastructure for AI agents.
