# Lattice

Pure Rust inference engine for transformer models on Apple Silicon.


No ONNX. No Python. No CUDA. No external ML runtime. Lattice implements the full compute graph — weight loading, tokenization, forward pass, and vector operations — in Rust, with hand-written Metal shaders and SIMD kernels.

Benchmark: Lattice vs Ollama — fair end-to-end decode on Qwen3.5-0.8B

Lattice is the only inference engine that correctly runs Qwen3.5's hybrid GatedDeltaNet architecture at 4-bit on Apple Silicon — with QuaRot-Q4 and LoRA hot-swap that neither Ollama nor MLX support. On a fair end-to-end decode measurement it is ~1.9× faster than Ollama/llama.cpp. Apple's MLX (Metal-native) decodes faster than Lattice at raw throughput — Lattice's edge is portability (pure Rust, no Python/framework) plus those Q4 + adapter capabilities. Full table and methodology below.

```sh
# Reproduce on your hardware (macOS + ollama + uv):
./scripts/bench_apples_to_apples.sh
```

Built for inference on CPU and macOS GPU. Optimized for AVX2 (x86), NEON (ARM), and Metal (Apple Silicon) — not CUDA. If you need NVIDIA GPU inference, use candle or mistral.rs. Lattice targets the other 90% of deployments: servers, edge, laptops, and library dependencies that shouldn't drag in a 300 MB ONNX runtime.


## Quick Start

```toml
[dependencies]
lattice-embed = "0.1"
tokio = { version = "1", features = ["full"] }
```

```rust
use lattice_embed::{EmbeddingService, EmbeddingModel, NativeEmbeddingService};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let service = NativeEmbeddingService::default();

    // Single embedding (BGE-small-en-v1.5, 384 dimensions)
    let embedding = service
        .embed_one("The quick brown fox jumps over the lazy dog", EmbeddingModel::default())
        .await?;

    println!("Dimensions: {}", embedding.len()); // 384

    // Batch
    let texts = vec![
        "First document".to_string(),
        "Second document".to_string(),
    ];
    let embeddings = service.embed(&texts, EmbeddingModel::BgeSmallEnV15).await?;

    // SIMD-accelerated similarity
    let similarity = lattice_embed::utils::cosine_similarity(&embeddings[0], &embeddings[1]);
    println!("Similarity: {:.4}", similarity);

    Ok(())
}
```

Model weights are downloaded from HuggingFace on first use and cached at ~/.lattice/models (or $LATTICE_MODEL_CACHE).
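To relocate the cache, point the environment variable at another directory before the first call (the path below is just an example):

```sh
export LATTICE_MODEL_CACHE=/data/lattice-models
```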


## Features

| Feature | Description |
|---------|-------------|
| Pure Rust compute | Hand-written SIMD kernels (AVX2/NEON). No C++, no ONNX, no CUDA. |
| Two transformer architectures | BERT/BGE encoder-only (mean pooling) and Qwen3 decoder-only (causal GQA, last-token pooling) |
| 9 supported models | BGE, mE5, MiniLM, Qwen3-Embedding families — see table below |
| Metal GPU backend | Native Apple Silicon acceleration via Metal MSL shaders. WGPU fallback for cross-platform. |
| Three pure Rust tokenizers | WordPiece, SentencePiece, BPE — no Hugging Face tokenizers C extension |
| Safetensors native | Memory-mapped weight loading from HuggingFace .safetensors format |
| MRL support | Matryoshka truncation for Qwen3 models (configurable output dimension >= 32) |
| LRU embedding cache | CachedEmbeddingService with sharded in-memory cache and hit/miss stats |
| LoRA adapter injection | Inject trained LoRA adapters into inference models at runtime |
| Knowledge distillation | Train small models from Claude/GPT/Gemini teacher soft labels |
| Optimal transport | Sinkhorn-Knopp solver (log-domain, epsilon-scaling) for embedding drift detection |
| Tiny fast networks | lattice-fann: sub-5ms classifiers with pre-allocated buffers, zero-alloc forward pass |

## Architecture

```
Application
    |
    v
lattice-embed          (public API — embedding service, SIMD distance ops, LRU cache)
    |
    v
lattice-inference      (transformer kernel — BERT/Qwen3 forward pass, tokenizers, weights)
    |
    +---> CPU (primary)      Metal (macOS)     WGPU (fallback)
          AVX2/NEON kernels   Apple Silicon      Vulkan/DX12

lattice-fann           (standalone — tiny network primitives, <5ms CPU inference)
lattice-transport      (standalone — optimal transport math, Wasserstein distances)
lattice-tune           (depends on fann + inference — LoRA, distillation, model registry)
```

The three leaf crates (inference, fann, transport) have zero intra-workspace dependencies and can be used standalone.


## Crates

| Crate | Description | LOC |
|-------|-------------|-----|
| lattice-embed | Embedding service — EmbeddingService trait, NativeEmbeddingService, CachedEmbeddingService, SIMD cosine/dot/euclidean, backfill, migration | ~14 k |
| lattice-inference | Transformer kernel — safetensors loading, BERT/BGE/Qwen3 forward pass, WordPiece/SentencePiece/BPE tokenizers, Metal/WGPU backends, LoRA hooks, KV cache, speculative decoding | ~69 k |
| lattice-fann | Fast neural network primitives — NetworkBuilder, pre-allocated layers, zero-alloc forward pass, backprop trainer, FANN binary format | ~7.5 k |
| lattice-tune | Training infrastructure — knowledge distillation pipeline, dataset management, LoRA adapter management, model registry with semver lineage | ~13 k |
| lattice-transport | Optimal transport math — Sinkhorn-Knopp (balanced + unbalanced), Wasserstein barycenters, embedding drift detection, log-domain throughout | ~5.3 k |

## Supported Models

All local models load from HuggingFace safetensors format.

| Model | Variant | Architecture | Dimensions | Max Tokens | Tokenizer |
|-------|---------|--------------|------------|------------|-----------|
| BgeSmallEnV15 | BAAI/bge-small-en-v1.5 | BERT encoder | 384 | 512 | WordPiece |
| BgeBaseEnV15 | BAAI/bge-base-en-v1.5 | BERT encoder | 768 | 512 | WordPiece |
| BgeLargeEnV15 | BAAI/bge-large-en-v1.5 | BERT encoder | 1024 | 512 | WordPiece |
| MultilingualE5Small | intfloat/multilingual-e5-small | BERT encoder | 384 | 512 | SentencePiece |
| MultilingualE5Base | intfloat/multilingual-e5-base | BERT encoder | 768 | 512 | SentencePiece |
| AllMiniLmL6V2 | sentence-transformers/all-MiniLM-L6-v2 | BERT encoder | 384 | 256 | WordPiece |
| ParaphraseMultilingualMiniLmL12V2 | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | BERT encoder | 384 | 128 | WordPiece |
| Qwen3Embedding0_6B | Qwen/Qwen3-Embedding-0.6B | Decoder (GQA+RoPE) | 1024 | 8192 | BPE |
| Qwen3Embedding4B | Qwen/Qwen3-Embedding-4B | Decoder (GQA+RoPE) | 2560* | 8192 | BPE |

*Qwen3-Embedding-4B supports MRL truncation to any dimension >= 32.

E5 models expect asymmetric "query: " / "passage: " prefixes for retrieval. Qwen3 models expect an instruction prefix. The EmbeddingModel enum exposes query_instruction() and document_instruction() so callers can apply these correctly.
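A minimal sketch of applying those prefixes by hand; the exact return types of query_instruction() and document_instruction() are an assumption here:

```rust
use lattice_embed::EmbeddingModel;

// E5-style asymmetric retrieval: prepend the query/passage instructions before embedding.
// Assumes the instruction methods return a displayable prefix string.
let model = EmbeddingModel::MultilingualE5Base;
let query = format!("{}{}", model.query_instruction(), "which crate exposes cosine_similarity?");
let passage = format!("{}{}", model.document_instruction(), "lattice-embed exposes SIMD-accelerated vector utilities.");
```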


## Selecting a Model

```rust
use lattice_embed::EmbeddingModel;

// Fast general-purpose English retrieval
let model = EmbeddingModel::BgeSmallEnV15;   // 384-dim, fastest

// Balanced quality/speed
let model = EmbeddingModel::BgeBaseEnV15;   // 768-dim

// Best quality, English
let model = EmbeddingModel::BgeLargeEnV15;  // 1024-dim

// Multilingual retrieval
let model = EmbeddingModel::MultilingualE5Base;  // 768-dim, 100+ languages

// Long context + multilingual (requires GPU for practical throughput)
let model = EmbeddingModel::Qwen3Embedding0_6B;  // 1024-dim, 8K context

// MRL: variable output dimension
use lattice_embed::ModelConfig;
let config = ModelConfig::try_new(EmbeddingModel::Qwen3Embedding4B, Some(512))?;
```

## Feature Flags

### lattice-embed

| Feature | Default | Description |
|---------|---------|-------------|
| native | yes | Pure Rust inference via lattice-inference |
| metal-gpu | no | Metal GPU acceleration (macOS) |
| avx512 | no | AVX-512 SIMD kernels (requires nightly) |

### lattice-inference

| Feature | Default | Description |
|---------|---------|-------------|
| f16 | no | Half-precision weights |
| metal-gpu | no | Metal compute backend |
| wgpu-gpu | no | WGPU cross-platform GPU backend |
| download | yes | HuggingFace weight download with checksum verification |
| backfill | no | Re-embedding coordinator (requires rusqlite) |

```toml
# GPU acceleration on macOS
lattice-embed = { version = "0.1", features = ["metal-gpu"] }

# Cross-platform GPU
lattice-embed = { version = "0.1", features = ["wgpu-gpu"] }
```

## Vector Operations

lattice-embed exposes SIMD-accelerated vector utilities as a stable public API:

```rust
use lattice_embed::utils;

// Runtime dispatch: AVX2 on x86_64, NEON on aarch64, scalar fallback elsewhere
let sim = utils::cosine_similarity(&a, &b);
let dot = utils::dot_product(&a, &b);
let dist = utils::euclidean_distance(&a, &b);

utils::normalize(&mut vector);  // in-place L2 normalization

// Batch operations
let sims = utils::batch_cosine_similarity(&pairs);
```

Measured performance on normalized vectors (internal benchmarks; results vary by hardware):

| Operation | Scalar | SIMD |
|-----------|--------|------|
| cosine similarity (384-dim) | ~650 ns | ~90 ns |
| cosine similarity (768-dim) | ~1300 ns | ~180 ns |
| cosine similarity (1024-dim) | ~1700 ns | ~240 ns |

## lattice-fann: Fast Neural Networks

For tiny classifiers that need to run in under 5 ms on CPU:

```rust
use lattice_fann::{NetworkBuilder, Activation, BackpropTrainer, TrainingConfig, Trainer};

// Build a network
let mut network = NetworkBuilder::new()
    .input(784)
    .hidden(128, Activation::ReLU)
    .hidden(64, Activation::ReLU)
    .output(10, Activation::Softmax)
    .build()?;

println!("{}", network.architecture()); // "784 -> ReLU(128) -> ReLU(64) -> Softmax(10)"
println!("Parameters: {}", network.total_params());

// Forward pass (no heap allocation)
let output = network.forward(&input)?;

// Serialize to compact binary (magic "FANN")
let bytes = network.to_bytes();
let restored = lattice_fann::Network::from_bytes(&bytes)?;
```
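To sanity-check the sub-5 ms target on your own hardware, a quick timing sketch that reuses only the calls shown above:

```rust
use std::time::Instant;

// Assumes `network` and `input` from the snippet above.
let start = Instant::now();
for _ in 0..1_000 {
    let _ = network.forward(&input)?;
}
println!("avg forward pass: {:?}", start.elapsed() / 1_000);
```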

## lattice-transport: Optimal Transport

Entropy-regularized optimal transport for measuring embedding geometry drift:

- Sinkhorn-Knopp in log-domain (numerically stable, no Gibbs kernel materialization)
- Balanced OT, unbalanced OT (KL-relaxed), Wasserstein barycenters
- Pre-allocated SinkhornWorkspace for zero-alloc inner loops

Primary use case: detect when an embedding model update has shifted the distribution of stored vectors enough to warrant re-indexing.
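To make the iteration concrete, here is a minimal, self-contained sketch of the log-domain Sinkhorn-Knopp update. It is not lattice-transport's actual API, just the underlying math: dual potentials f, g are updated against the cost matrix, and the plan is recovered as P_ij = exp((f_i + g_j - c_ij) / eps).

```rust
/// Log-domain Sinkhorn-Knopp (illustrative sketch, not the lattice-transport API).
/// `cost` is an n x m row-major cost matrix; `a`, `b` are the source/target marginals.
fn sinkhorn_log(cost: &[f64], a: &[f64], b: &[f64], eps: f64, iters: usize) -> (Vec<f64>, Vec<f64>) {
    fn logsumexp(xs: impl Iterator<Item = f64> + Clone) -> f64 {
        let max = xs.clone().fold(f64::NEG_INFINITY, f64::max);
        if max == f64::NEG_INFINITY { return max; }
        max + xs.map(|x| (x - max).exp()).sum::<f64>().ln()
    }

    let (n, m) = (a.len(), b.len());
    let (mut f, mut g) = (vec![0.0; n], vec![0.0; m]);
    for _ in 0..iters {
        for i in 0..n {
            // f_i = eps * ln a_i - eps * logsumexp_j((g_j - c_ij) / eps)
            f[i] = eps * a[i].ln()
                - eps * logsumexp((0..m).map(|j| (g[j] - cost[i * m + j]) / eps));
        }
        for j in 0..m {
            // g_j = eps * ln b_j - eps * logsumexp_i((f_i - c_ij) / eps)
            g[j] = eps * b[j].ln()
                - eps * logsumexp((0..n).map(|i| (f[i] - cost[i * m + j]) / eps));
        }
    }
    (f, g) // transport plan: P[i][j] = ((f[i] + g[j] - cost[i * m + j]) / eps).exp()
}
```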


## Benchmarks

### Qwen3.5-0.8B Decode Throughput (Apple M2 Max)

Fair end-to-end measurement — slope method: tok/s = (N₂−N₁) / (T(N₂)−T(N₁)) for a fixed prompt, so prompt prefill, model load, and per-call overhead cancel and every engine is measured the same way (greedy, median of 5 runs, N₁=32, N₂=256).

| Engine | Quant | Decode tok/s | Notes |
|--------|-------|--------------|-------|
| MLX | Q8 (g64) | 247 | Apple's Metal-native framework — fastest |
| Lattice | f16 | 157 | Pure Rust + Metal; ~1.9× Ollama |
| Ollama | Q8_0 | 84 | llama.cpp Metal backend |

Cross-check: Ollama's slope (84) matches its own eval_duration rate (87), confirming the methodology is sound.
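For concreteness, the slope calculation itself is just a difference quotient over two timed decodes of the same prompt (the numbers below are made-up illustrations, not benchmark output):

```rust
/// Slope method: tok/s between two decode lengths of the same prompt.
/// Prefill, model load, and per-call overhead appear in both timings, so they cancel.
fn slope_tok_per_sec(n1: u32, t1_secs: f64, n2: u32, t2_secs: f64) -> f64 {
    f64::from(n2 - n1) / (t2_secs - t1_secs)
}

// e.g. 32 tokens decoded in 1.80 s and 256 tokens in 3.22 s:
// slope_tok_per_sec(32, 1.80, 256, 3.22) = 224 / 1.42 ≈ 158 tok/s
```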

Honest caveats. Lattice's Qwen3.5 decode path is f16 (MetalQwen35State::new); MLX is Q8 and Ollama is Q8_0, so this is not quant-matched. There is no end-to-end Q8 path for Qwen3.5 in Lattice — the earlier "Q8 139" number was an f16 forward_step micro-benchmark mislabeled "Q8" (the bench's load_q8_state builds the same f16 state as the f16 path). 157 tok/s (f16, true e2e) is Lattice's actual decode rate for this model; the only other real e2e path is Q4 (from_q4_dir). MLX decodes faster than Lattice. Lattice's value is portability plus capabilities no other engine has on this model:

| Lattice-only capability | MLX | Ollama |
|-------------------------|-----|--------|
| QuaRot 4-bit (rotated quant) | ✗ | ✗ |
| Q4 + LoRA r8 hot-swap (no reload) | ✗ | ✗ |
| Pure Rust, zero Python / framework | ✗ | ✗ |

A previous version of this table claimed "+8% vs MLX"; that was a measurement artifact (a criterion forward_step micro-bench compared against MLX end-to-end with prefill counted in the decode rate) and has been corrected. All three engines implement the full GDN recurrence for Qwen3.5's hybrid architecture (18 GatedDeltaNet + 6 GQA layers). Reproducible via ./scripts/bench_apples_to_apples.sh.

### Embedding & Kernel Benchmarks

```sh
# Embedding throughput
cargo bench --package lattice-embed

# Metal GPU decode (macOS only, requires model weights)
cargo bench -p lattice-inference --features metal-gpu,f16 -- metal_decode

# Attention kernel
cargo bench --package lattice-inference --bench attention_bench
```

Performance depends on hardware, model size, batch size, and sequence length. Run the benchmarks on your target hardware to get representative numbers.


## Documentation

- Architecture — crate dependency graph, design decisions, stability tiers
- ADR directory: docs/adr/ — architectural decision records

## License

Apache-2.0. See LICENSE.


Built by Ocean (HaiyangLi). Powers khive, a cognitive infrastructure for AI agents.
