High-performance MaxSim (Maximum Similarity) scoring for Elixir using BLAS GEMM operations with SIMD acceleration. This is a port of the maxsim-cpu package by Mixedbread.ai.
MaxSim is a scoring function used in ColBERT-style neural retrieval models. For each query token, it finds the maximum similarity with any document token, then sums these maximums to produce the final score:
MaxSim(Q, D) = Σᵢ maxⱼ(Qᵢ · Dⱼ)
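To make the definition concrete, here is a naive, unoptimized sketch of the same computation in Nx (for intuition only; the `maxsim` helper is illustrative, not part of this library's API):

```elixir
# MaxSim for a single document: compute the similarity matrix,
# take the max over document tokens, then sum over query tokens.
# query: {q_len, dim}, doc: {d_len, dim}, both :f32.
maxsim = fn query, doc ->
  query
  |> Nx.dot(Nx.transpose(doc))   # {q_len, d_len} dot-product similarities
  |> Nx.reduce_max(axes: [1])    # max over document tokens (j)
  |> Nx.sum()                    # sum over query tokens (i)
end
```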
This library provides a highly optimized implementation using:
- BLAS GEMM for efficient matrix multiplication (OpenBLAS on Linux, Accelerate on macOS)
- SIMD for fast max reduction (AVX2 on x86_64, NEON on ARM64)
- Rayon for parallel document processing
- Optional libxsmm support for additional performance on Intel CPUs
Add ex_maxsim_cpu to your dependencies in mix.exs:
```elixir
def deps do
  [
    {:ex_maxsim_cpu, "~> 0.1.0"},
    {:nx, "~> 0.7"}  # Required for tensor operations
  ]
end
```

- macOS (Apple Silicon): Uses Accelerate framework (no additional setup needed)
- Linux x86_64: Requires OpenBLAS; AVX2 is recommended for SIMD speedups

```bash
# Ubuntu/Debian
sudo apt-get install libopenblas-dev

# Fedora/RHEL
sudo dnf install openblas-devel
```

For source builds that should use AVX2, set:

```bash
export RUSTFLAGS="-C target-feature=+avx2"
```
This library uses Rustler to compile native code. Ensure you have Rust installed:
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

Basic usage, scoring a query against a batch of same-length documents:

```elixir
# Create query tensor: [q_len, dim]
query = Nx.tensor([
[0.1, 0.2, 0.3, 0.4],
[0.5, 0.6, 0.7, 0.8]
], type: :f32)
# Create documents tensor: [n_docs, d_len, dim]
docs = Nx.tensor([
[[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]],
[[0.9, 0.8, 0.7, 0.6], [0.5, 0.4, 0.3, 0.2]]
], type: :f32)
# Compute MaxSim scores
scores = ExMaxsimCpu.maxsim_scores(query, docs)
# => #Nx.Tensor<f32[2]>
```

Note: MaxSim expects L2-normalized embeddings. Normalize each token embedding of the query and docs to unit length (along the embedding axis) before scoring.
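If your embeddings are not already normalized, a minimal Nx sketch along these lines works (the `normalize` helper and epsilon guard are illustrative, not part of this library's API):

```elixir
# L2-normalize each token embedding along the last (embedding) axis.
# The small epsilon guards against division by zero for all-zero rows.
normalize = fn t ->
  norms =
    t
    |> Nx.pow(2)
    |> Nx.sum(axes: [-1], keep_axes: true)
    |> Nx.sqrt()

  Nx.divide(t, Nx.max(norms, 1.0e-12))
end

query = normalize.(query)
docs = normalize.(docs)
```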
Documents of different lengths can be scored with `maxsim_scores_variable/2`:

```elixir
query = Nx.tensor([[1.0, 0.0], [0.0, 1.0]], type: :f32)
# Documents with different lengths
doc1 = Nx.tensor([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], type: :f32) # 3 tokens
doc2 = Nx.tensor([[0.5, 0.5]], type: :f32) # 1 token
scores = ExMaxsimCpu.maxsim_scores_variable(query, [doc1, doc2])
```
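By the definition above, these inputs score `[2.0, 1.0]`: each query token has an exact match in `doc1` (two maxima of 1.0), while each query token's best dot product with `doc2`'s single token is 0.5.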
Threading is controlled by two environment variables:

- `RAYON_NUM_THREADS`: controls the Rayon thread pool size (default: number of CPUs)
- `OPENBLAS_NUM_THREADS=1`: recommended, to avoid oversubscription with Rayon

```bash
export OPENBLAS_NUM_THREADS=1
export RAYON_NUM_THREADS=8
```

For additional performance on Intel CPUs, you can enable libxsmm:
- Build libxsmm:

  ```bash
  git clone https://github.com/libxsmm/libxsmm.git
  cd libxsmm
  make -j$(nproc)
  export LIBXSMM_DIR=$(pwd)
  ```

- Build with the feature enabled:

  ```bash
  cd native/maxsim_cpu
  cargo build --release --features use-libxsmm
  ```
ExMaxsimCpu delivers strong CPU performance, especially on small-to-moderate batch sizes:
- OS: macOS 26.2 (25C56)
- CPU: Apple M4 Pro (12 physical / 12 logical cores)
- Architecture: arm64
- Elixir: 1.19.5 (OTP 28)
- BLAS: Accelerate
- Env: OPENBLAS_NUM_THREADS=1, RAYON_NUM_THREADS=8
- Nx CPU backend: Torchx (CPU)
| Implementation | Typical Latency | vs ExMaxsimCpu |
|---|---|---|
| ExMaxsimCpu (BLAS+SIMD) | 0.08 - 0.28 ms | — |
| Nx CPU backend (Torchx) | 0.20 - 0.89 ms | ~4x slower |
| Nx + Torchx MPS (Apple GPU) | 1.09 - 3.24 ms | ~15x slower |
| Nx BinaryBackend (unaccelerated) | 386 - 5,038 ms | ~13,800x slower |
Notes:
- Nx BinaryBackend is an unaccelerated baseline; EXLA/Torchx CPU will be faster.
- Nx CPU backend results (EXLA or Torchx) are included when available.
- MPS timings are for small shapes where transfer overhead can dominate.
- Results are machine- and shape-dependent; run your own benchmarks for production sizing.
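For a quick, rough check on your own machine, something like the following works (shapes here are arbitrary, ColBERT-ish sizes; use a tool like Benchee for rigorous measurements):

```elixir
# Random inputs: 32 query tokens, 100 docs of 180 tokens each, 128 dims.
key = Nx.Random.key(42)
{query, key} = Nx.Random.uniform(key, shape: {32, 128}, type: :f32)
{docs, _key} = Nx.Random.uniform(key, shape: {100, 180, 128}, type: :f32)

# Time a single scoring call in microseconds.
{micros, _scores} = :timer.tc(fn -> ExMaxsimCpu.maxsim_scores(query, docs) end)
IO.puts("maxsim_scores: #{Float.round(micros / 1000, 2)} ms")
```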
Benchmark on Apple Silicon (this run), showing ExMaxsimCpu (blue), Nx BinaryBackend (orange), and optional Nx CPU/MPS backends on a log scale.
| Comparison | Speedup Factor |
|---|---|
| vs Nx BinaryBackend | 4,100x - 24,200x |
| vs Nx CPU backend (Torchx) | 2.1x - 6.2x |
| vs Nx + Torchx MPS (GPU) | 7.5x - 24.6x |
- Optimized BLAS: Uses Apple Accelerate (or OpenBLAS on Linux) which is highly tuned for the CPU architecture
- SIMD Instructions: Hand-optimized AVX2/NEON code for max-reduction operations
- No Transfer Overhead: GPU implementations incur memory transfer costs
- Cache Efficiency: Tiled algorithms keep data in CPU cache
Generate benchmark data (includes optional MPS and Nx CPU backends when available):
```bash
OPENBLAS_NUM_THREADS=1 mix run bench/generate_plots.exs
```

Generate plots:

```bash
uv run bench/plot_benchmarks.py
```

Note: For MPS benchmarks, ensure `torchx` is installed. For Nx CPU backends, install `torchx` or add `exla` to your dev deps. The benchmark auto-detects availability.
| Platform | Architecture | Status |
|---|---|---|
| macOS | arm64 (Apple Silicon) | ✅ Fully supported |
| Linux | x86_64 (AVX2) | ✅ Fully supported |
| Linux | arm64 | |
| Windows | Any | ❌ Not supported |
This is the Elixir binding for maxsim-cpu, a high-performance MaxSim implementation in Rust.




