# Lattice

Pure Rust inference engine for transformer models on Apple Silicon.


No ONNX. No Python. No CUDA. No external ML runtime. Lattice implements the full compute graph — weight loading, tokenization, forward pass, and vector operations — in Rust, with hand-written Metal shaders and SIMD kernels.

Benchmark: Lattice vs Ollama — fair end-to-end decode on Qwen3.5-0.8B

Lattice is the only inference engine that correctly runs Qwen3.5's hybrid GatedDeltaNet architecture at 4-bit on Apple Silicon — with QuaRot-Q4 and LoRA hot-swap that neither Ollama nor MLX support. On a fair end-to-end decode measurement it is ~1.9× faster than Ollama/llama.cpp. Apple's MLX (Metal-native) decodes faster than Lattice at raw throughput — Lattice's edge is portability (pure Rust, no Python/framework) plus those Q4 + adapter capabilities. Full table and methodology below.

```sh
# Reproduce on your hardware (macOS + ollama + uv):
./scripts/bench_apples_to_apples.sh
```

Built for inference on CPU and macOS GPU. Optimized for AVX2 (x86), NEON (ARM), and Metal (Apple Silicon) — not CUDA. If you need NVIDIA GPU inference, use candle or mistral.rs. Lattice targets the other 90% of deployments: servers, edge, laptops, and library dependencies that shouldn't drag in a 300 MB ONNX runtime.


## Quick Start

```toml
[dependencies]
lattice-embed = "0.1"
tokio = { version = "1", features = ["full"] }
```

```rust
use lattice_embed::{EmbeddingService, EmbeddingModel, NativeEmbeddingService};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let service = NativeEmbeddingService::default();

    // Single embedding (BGE-small-en-v1.5, 384 dimensions)
    let embedding = service
        .embed_one("The quick brown fox jumps over the lazy dog", EmbeddingModel::default())
        .await?;

    println!("Dimensions: {}", embedding.len()); // 384

    // Batch
    let texts = vec![
        "First document".to_string(),
        "Second document".to_string(),
    ];
    let embeddings = service.embed(&texts, EmbeddingModel::BgeSmallEnV15).await?;

    // SIMD-accelerated similarity
    let similarity = lattice_embed::utils::cosine_similarity(&embeddings[0], &embeddings[1]);
    println!("Similarity: {:.4}", similarity);

    Ok(())
}
```

Model weights are downloaded from HuggingFace on first use and cached at ~/.lattice/models (or $LATTICE_MODEL_CACHE).
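To relocate the cache, point the environment variable at another directory before the first call (the path below is just an example):

```sh
export LATTICE_MODEL_CACHE=/data/lattice-models
```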


## Features

| Feature | Description |
|---------|-------------|
| Pure Rust compute | Hand-written SIMD kernels (AVX2/NEON). No C++, no ONNX, no CUDA. |
| Two transformer architectures | BERT/BGE encoder-only (mean pooling) and Qwen3 decoder-only (causal GQA, last-token pooling) |
| 9 supported models | BGE, mE5, MiniLM, Qwen3-Embedding families — see table below |
| Metal GPU backend | Native Apple Silicon acceleration via Metal MSL shaders. WGPU fallback for cross-platform. |
| Three pure Rust tokenizers | WordPiece, SentencePiece, BPE — no Hugging Face tokenizers C extension |
| Safetensors native | Memory-mapped weight loading from HuggingFace .safetensors format |
| MRL support | Matryoshka truncation for Qwen3 models (configurable output dimension >= 32) |
| LRU embedding cache | CachedEmbeddingService with sharded in-memory cache and hit/miss stats |
| LoRA adapter injection | Inject trained LoRA adapters into inference models at runtime |
| Knowledge distillation | Train small models from Claude/GPT/Gemini teacher soft labels |
| Optimal transport | Sinkhorn-Knopp solver (log-domain, epsilon-scaling) for embedding drift detection |
| Tiny fast networks | lattice-fann: sub-5ms classifiers with pre-allocated buffers, zero-alloc forward pass |

## Architecture

```
Application
    |
    v
lattice-embed          (public API — embedding service, SIMD distance ops, LRU cache)
    |
    v
lattice-inference      (transformer kernel — BERT/Qwen3 forward pass, tokenizers, weights)
    |
    +---> CPU (primary)      Metal (macOS)     WGPU (fallback)
          AVX2/NEON kernels   Apple Silicon      Vulkan/DX12

lattice-fann           (standalone — tiny network primitives, <5ms CPU inference)
lattice-transport      (standalone — optimal transport math, Wasserstein distances)
lattice-tune           (depends on fann + inference — LoRA, distillation, model registry)
```

The three leaf crates (inference, fann, transport) have zero intra-workspace dependencies and can be used standalone.


## Crates

| Crate | Description | LOC |
|-------|-------------|-----|
| lattice-embed | Embedding service — EmbeddingService trait, NativeEmbeddingService, CachedEmbeddingService, SIMD cosine/dot/euclidean, backfill, migration | ~14 k |
| lattice-inference | Transformer kernel — safetensors loading, BERT/BGE/Qwen3 forward pass, WordPiece/SentencePiece/BPE tokenizers, Metal/WGPU backends, LoRA hooks, KV cache, speculative decoding | ~69 k |
| lattice-fann | Fast neural network primitives — NetworkBuilder, pre-allocated layers, zero-alloc forward pass, backprop trainer, FANN binary format | ~7.5 k |
| lattice-tune | Training infrastructure — knowledge distillation pipeline, dataset management, LoRA adapter management, model registry with semver lineage | ~13 k |
| lattice-transport | Optimal transport math — Sinkhorn-Knopp (balanced + unbalanced), Wasserstein barycenters, embedding drift detection, log-domain throughout | ~5.3 k |

## Supported Models

All local models load from HuggingFace safetensors format.

| Model | Variant | Architecture | Dimensions | Max Tokens | Tokenizer |
|-------|---------|--------------|------------|------------|-----------|
| BgeSmallEnV15 | BAAI/bge-small-en-v1.5 | BERT encoder | 384 | 512 | WordPiece |
| BgeBaseEnV15 | BAAI/bge-base-en-v1.5 | BERT encoder | 768 | 512 | WordPiece |
| BgeLargeEnV15 | BAAI/bge-large-en-v1.5 | BERT encoder | 1024 | 512 | WordPiece |
| MultilingualE5Small | intfloat/multilingual-e5-small | BERT encoder | 384 | 512 | SentencePiece |
| MultilingualE5Base | intfloat/multilingual-e5-base | BERT encoder | 768 | 512 | SentencePiece |
| AllMiniLmL6V2 | sentence-transformers/all-MiniLM-L6-v2 | BERT encoder | 384 | 256 | WordPiece |
| ParaphraseMultilingualMiniLmL12V2 | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | BERT encoder | 384 | 128 | WordPiece |
| Qwen3Embedding0_6B | Qwen/Qwen3-Embedding-0.6B | Decoder (GQA+RoPE) | 1024 | 8192 | BPE |
| Qwen3Embedding4B | Qwen/Qwen3-Embedding-4B | Decoder (GQA+RoPE) | 2560* | 8192 | BPE |

*Qwen3-Embedding-4B supports MRL truncation to any dimension >= 32.

E5 models expect asymmetric "query: " / "passage: " prefixes for retrieval. Qwen3 models expect an instruction prefix. The EmbeddingModel enum exposes query_instruction() and document_instruction() so callers can apply these correctly.
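A minimal sketch of applying those prefixes by hand; the exact return types of query_instruction() and document_instruction() are an assumption here:

```rust
use lattice_embed::EmbeddingModel;

// E5-style asymmetric retrieval: prepend the query/passage instructions before embedding.
// Assumes the instruction methods return a displayable prefix string.
let model = EmbeddingModel::MultilingualE5Base;
let query = format!("{}{}", model.query_instruction(), "which crate exposes cosine_similarity?");
let passage = format!("{}{}", model.document_instruction(), "lattice-embed exposes SIMD-accelerated vector utilities.");
```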


## Selecting a Model

```rust
use lattice_embed::EmbeddingModel;

// Fast general-purpose English retrieval
let model = EmbeddingModel::BgeSmallEnV15;   // 384-dim, fastest

// Balanced quality/speed
let model = EmbeddingModel::BgeBaseEnV15;   // 768-dim

// Best quality, English
let model = EmbeddingModel::BgeLargeEnV15;  // 1024-dim

// Multilingual retrieval
let model = EmbeddingModel::MultilingualE5Base;  // 768-dim, 100+ languages

// Long context + multilingual (requires GPU for practical throughput)
let model = EmbeddingModel::Qwen3Embedding0_6B;  // 1024-dim, 8K context

// MRL: variable output dimension
use lattice_embed::ModelConfig;
let config = ModelConfig::try_new(EmbeddingModel::Qwen3Embedding4B, Some(512))?;
```

## Feature Flags

### lattice-embed

| Feature | Default | Description |
|---------|---------|-------------|
| native | yes | Pure Rust inference via lattice-inference |
| metal-gpu | no | Metal GPU acceleration (macOS) |
| avx512 | no | AVX-512 SIMD kernels (requires nightly) |

### lattice-inference

| Feature | Default | Description |
|---------|---------|-------------|
| f16 | no | Half-precision weights |
| metal-gpu | no | Metal compute backend |
| wgpu-gpu | no | WGPU cross-platform GPU backend |
| download | yes | HuggingFace weight download with checksum verification |
| backfill | no | Re-embedding coordinator (requires rusqlite) |

```toml
# GPU acceleration on macOS
lattice-embed = { version = "0.1", features = ["metal-gpu"] }

# Cross-platform GPU
lattice-embed = { version = "0.1", features = ["wgpu-gpu"] }
```

## Vector Operations

lattice-embed exposes SIMD-accelerated vector utilities as a stable public API:

```rust
use lattice_embed::utils;

// Runtime dispatch: AVX2 on x86_64, NEON on aarch64, scalar fallback elsewhere
let sim = utils::cosine_similarity(&a, &b);
let dot = utils::dot_product(&a, &b);
let dist = utils::euclidean_distance(&a, &b);

utils::normalize(&mut vector);  // in-place L2 normalization

// Batch operations
let sims = utils::batch_cosine_similarity(&pairs);
```

Measured performance on normalized vectors (internal benchmarks; results vary by hardware):

| Operation | Scalar | SIMD |
|-----------|--------|------|
| cosine similarity (384-dim) | ~650 ns | ~90 ns |
| cosine similarity (768-dim) | ~1300 ns | ~180 ns |
| cosine similarity (1024-dim) | ~1700 ns | ~240 ns |

## lattice-fann: Fast Neural Networks

For tiny classifiers that need to run in under 5 ms on CPU:

```rust
use lattice_fann::{NetworkBuilder, Activation, BackpropTrainer, TrainingConfig, Trainer};

// Build a network
let mut network = NetworkBuilder::new()
    .input(784)
    .hidden(128, Activation::ReLU)
    .hidden(64, Activation::ReLU)
    .output(10, Activation::Softmax)
    .build()?;

println!("{}", network.architecture()); // "784 -> ReLU(128) -> ReLU(64) -> Softmax(10)"
println!("Parameters: {}", network.total_params());

// Forward pass (no heap allocation)
let output = network.forward(&input)?;

// Serialize to compact binary (magic "FANN")
let bytes = network.to_bytes();
let restored = lattice_fann::Network::from_bytes(&bytes)?;
```
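To sanity-check the sub-5 ms target on your own hardware, a quick timing sketch that reuses only the calls shown above:

```rust
use std::time::Instant;

// Assumes `network` and `input` from the snippet above.
let start = Instant::now();
for _ in 0..1_000 {
    let _ = network.forward(&input)?;
}
println!("avg forward pass: {:?}", start.elapsed() / 1_000);
```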

## lattice-transport: Optimal Transport

Entropy-regularized optimal transport for measuring embedding geometry drift:

- Sinkhorn-Knopp in log-domain (numerically stable, no Gibbs kernel materialization)
- Balanced OT, unbalanced OT (KL-relaxed), Wasserstein barycenters
- Pre-allocated SinkhornWorkspace for zero-alloc inner loops

Primary use case: detect when an embedding model update has shifted the distribution of stored vectors enough to warrant re-indexing.
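To make the iteration concrete, here is a minimal, self-contained sketch of the log-domain Sinkhorn-Knopp update. It is not lattice-transport's actual API, just the underlying math: dual potentials f, g are updated against the cost matrix, and the plan is recovered as P_ij = exp((f_i + g_j - c_ij) / eps).

```rust
/// Log-domain Sinkhorn-Knopp (illustrative sketch, not the lattice-transport API).
/// `cost` is an n x m row-major cost matrix; `a`, `b` are the source/target marginals.
fn sinkhorn_log(cost: &[f64], a: &[f64], b: &[f64], eps: f64, iters: usize) -> (Vec<f64>, Vec<f64>) {
    fn logsumexp(xs: impl Iterator<Item = f64> + Clone) -> f64 {
        let max = xs.clone().fold(f64::NEG_INFINITY, f64::max);
        if max == f64::NEG_INFINITY { return max; }
        max + xs.map(|x| (x - max).exp()).sum::<f64>().ln()
    }

    let (n, m) = (a.len(), b.len());
    let (mut f, mut g) = (vec![0.0; n], vec![0.0; m]);
    for _ in 0..iters {
        for i in 0..n {
            // f_i = eps * ln a_i - eps * logsumexp_j((g_j - c_ij) / eps)
            f[i] = eps * a[i].ln()
                - eps * logsumexp((0..m).map(|j| (g[j] - cost[i * m + j]) / eps));
        }
        for j in 0..m {
            // g_j = eps * ln b_j - eps * logsumexp_i((f_i - c_ij) / eps)
            g[j] = eps * b[j].ln()
                - eps * logsumexp((0..n).map(|i| (f[i] - cost[i * m + j]) / eps));
        }
    }
    (f, g) // transport plan: P[i][j] = ((f[i] + g[j] - cost[i * m + j]) / eps).exp()
}
```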


## Benchmarks

### Qwen3.5-0.8B Decode Throughput (Apple M2 Max)

Fair end-to-end measurement — slope method: tok/s = (N₂−N₁) / (T(N₂)−T(N₁)) for a fixed prompt, so prompt prefill, model load, and per-call overhead cancel and every engine is measured the same way (greedy, median of 5 runs, N₁=32, N₂=256).

| Engine | Quant | Decode tok/s | Notes |
|--------|-------|--------------|-------|
| MLX | Q8 (g64) | 247 | Apple's Metal-native framework — fastest |
| Lattice | f16 | 157 | Pure Rust + Metal; ~1.9× Ollama |
| Ollama | Q8_0 | 84 | llama.cpp Metal backend |

Cross-check: Ollama's slope (84) matches its own eval_duration rate (87), confirming the methodology is sound.
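For concreteness, the slope calculation itself is just a difference quotient over two timed decodes of the same prompt (the numbers below are made-up illustrations, not benchmark output):

```rust
/// Slope method: tok/s between two decode lengths of the same prompt.
/// Prefill, model load, and per-call overhead appear in both timings, so they cancel.
fn slope_tok_per_sec(n1: u32, t1_secs: f64, n2: u32, t2_secs: f64) -> f64 {
    f64::from(n2 - n1) / (t2_secs - t1_secs)
}

// e.g. 32 tokens decoded in 1.80 s and 256 tokens in 3.22 s:
// slope_tok_per_sec(32, 1.80, 256, 3.22) = 224 / 1.42 ≈ 158 tok/s
```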

Honest caveats. Lattice's Qwen3.5 decode path is f16 (MetalQwen35State::new); MLX is Q8 and Ollama is Q8_0, so this is not quant-matched. There is no end-to-end Q8 path for Qwen3.5 in Lattice — the earlier "Q8 139" number was an f16 forward_step micro-benchmark mislabeled "Q8" (the bench's load_q8_state builds the same f16 state as the f16 path). 157 tok/s (f16, true e2e) is Lattice's actual decode rate for this model; the only other real e2e path is Q4 (from_q4_dir). MLX decodes faster than Lattice. Lattice's value is portability plus capabilities no other engine has on this model:

| Lattice-only capability | MLX | Ollama |
|-------------------------|-----|--------|
| QuaRot 4-bit (rotated quant) | ✗ | ✗ |
| Q4 + LoRA r8 hot-swap (no reload) | ✗ | ✗ |
| Pure Rust, zero Python / framework | ✗ | ✗ |

A previous version of this table claimed "+8% vs MLX"; that was a measurement artifact (a criterion forward_step micro-bench compared against MLX end-to-end with prefill counted in the decode rate) and has been corrected. All three engines implement the full GDN recurrence for Qwen3.5's hybrid architecture (18 GatedDeltaNet + 6 GQA layers). Reproducible via ./scripts/bench_apples_to_apples.sh.

### Embedding & Kernel Benchmarks

```sh
# Embedding throughput
cargo bench --package lattice-embed

# Metal GPU decode (macOS only, requires model weights)
cargo bench -p lattice-inference --features metal-gpu,f16 -- metal_decode

# Attention kernel
cargo bench --package lattice-inference --bench attention_bench
```

Performance depends on hardware, model size, batch size, and sequence length. Run the benchmarks on your target hardware to get representative numbers.


## Documentation

- Architecture — crate dependency graph, design decisions, stability tiers
- ADR directory: docs/adr/ — architectural decision records

## License

Apache-2.0. See LICENSE.


Built by Ocean (HaiyangLi). Powers khive, a cognitive infrastructure for AI agents.
