A blazing-fast inference server for hybrid neural architectures, supporting Mamba2 SSM, Multi-Head Latent Attention (MLA), Mixture of Experts (MoE), and standard transformers.
- Auto-detection - Automatically detects model architecture, format (HuggingFace vs oxidizr), and tokenizer vocabulary from checkpoint tensors
- Hybrid Architecture Support - Seamlessly handles mixed Mamba2 and attention layers in a single model
- HuggingFace Compatible - Loads standard HuggingFace Llama models (tested with llama3.2-1b) alongside custom oxidizr checkpoints
- OpenAI-Compatible API - Drop-in replacement with `/v1/completions` and `/v1/chat/completions` endpoints
- High Performance - Written in Rust using the Candle ML framework with optional CUDA acceleration
- Multiple Tokenizers - Supports cl100k_base, o200k_base, llama3, and deepseek_v3 vocabularies via splintr
Build from source:

```bash
# Clone the repository
git clone https://github.com/farhan-syah/blazr.git
cd blazr

# Build (CPU-only)
cargo build --release

# Build with CUDA support (requires CUDA 12.x)
cargo build --release --features cuda
```

Generate text from a prompt:

```bash
blazr generate \
  --model ./checkpoints/nano \
  --prompt "Once upon a time" \
  --max-tokens 100 \
  --vocab llama3
```

Start the inference server:

```bash
blazr serve --model ./checkpoints/nano --port 8080
```

Then make API requests:
```bash
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Hello, world!",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```

Inspect a model's configuration:

```bash
blazr info --model ./checkpoints/nano
```
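Since the API is OpenAI-compatible, the `/v1/chat/completions` endpoint should accept a request in the standard OpenAI shape. A minimal sketch (the `messages` payload below follows the OpenAI schema and is illustrative, not taken from blazr's docs):

```bash
# Illustrative chat request; the JSON body follows the standard OpenAI schema
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello, world!"}
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'
```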
blazr auto-detects and supports:
- Mamba2 - Selective state space model (SSM) layers
- MLA - Multi-Head Latent Attention with compressed KV cache
- MoE - Mixture of Experts with top-k routing and optional shared expert
- Standard Transformers - GQA (Grouped Query Attention) with MLP layers

Models can mix and match these layer types freely.
blazr automatically detects:
- Architecture - Identifies layer types (Mamba2, MLA, MoE, Transformer) from tensor name patterns
- Model Format - Distinguishes between oxidizr format (`layers.X.`) and HuggingFace format (`model.layers.X.`)
- Tokenizer Vocabulary - Infers the vocabulary from `vocab_size` if `--vocab` is not specified:
  - ~100k tokens → `cl100k_base`
  - ~128k tokens → `llama3`
  - ~129k tokens → `deepseek_v3`
  - ~200k tokens → `o200k_base`
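In practice this means `--vocab` can usually be omitted; a minimal sketch, reusing the checkpoint path from the quick start:

```bash
# --vocab omitted: blazr infers the vocabulary from the checkpoint's vocab_size
blazr generate --model ./checkpoints/nano --prompt "Hello"
```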
blazr uses splintr for high-performance BPE tokenization with pretrained vocabularies.
| Vocabulary | Description | Vocab Size | Use Case |
|---|---|---|---|
| `cl100k_base` | GPT-4, GPT-3.5-turbo | ~100k | OpenAI-compatible models |
| `o200k_base` | GPT-4o | ~200k | Extended multilingual support |
| `llama3` | Meta Llama 3 family | ~128k | Llama 3.x models (default) |
| `deepseek_v3` | DeepSeek V3/R1 | ~129k | DeepSeek models |
All vocabularies include 54 agent tokens for chat, reasoning, and tool-use applications.
Custom vocabularies are not yet supported. If you need a custom vocabulary:
- Train your model with one of the supported vocabularies above, or
- Modify blazr's tokenizer module to load your own `.tiktoken` file (base64-encoded tokens with ranks); an illustrative excerpt of that format is shown below
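For reference, a `.tiktoken` vocabulary file is plain text with one entry per line: the token's bytes base64-encoded, a space, then its integer rank. A made-up two-line excerpt (here `IQ==` is base64 for the byte `!`):

```
IQ== 0
Ig== 1
```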
- API Reference - Complete API endpoint documentation
- Architecture - Technical details on hybrid model support
- Configuration - Model configuration and tuning options
```bash
# Generate text from a prompt
blazr generate --model <path> --prompt "text" [OPTIONS]

# Start inference server
blazr serve --model <path> [--port 8080] [--host 0.0.0.0]

# Display model configuration
blazr info --model <path>

# Decode token IDs (debugging)
blazr decode --ids "123,456,789" --vocab llama3
```

Generation:
- `--model` - Model path (local directory or HuggingFace ID like `meta-llama/Llama-3.2-1B`)
- `--prompt` - Input text prompt
- `--max-tokens` - Maximum tokens to generate (default: 100)
- `--temperature` - Sampling temperature (default: 0.7)
- `--top-p` - Nucleus sampling threshold (default: 0.9)
- `--top-k` - Top-k sampling (default: 40)
- `--vocab` - Tokenizer vocabulary (`llama3`, `cl100k_base`, `o200k_base`, `deepseek_v3`); auto-detected if not specified
- `--cpu` - Force CPU inference even if CUDA is available
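Putting the generation flags together, a sketch with illustrative sampling values (all flags are documented above; the model ID is the HuggingFace example from the `--model` flag):

```bash
# Pull a HuggingFace model and sample with explicit settings (values are illustrative)
blazr generate \
  --model meta-llama/Llama-3.2-1B \
  --prompt "Explain state space models in one paragraph" \
  --max-tokens 200 \
  --temperature 0.8 \
  --top-p 0.95 \
  --top-k 50
```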
Server:
- `--model` - Model path (local directory or HuggingFace ID)
- `--port` - Port to listen on (default: 8080)
- `--host` - Host to bind to (default: 0.0.0.0)
- `--cpu` - Force CPU inference even if CUDA is available
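For example, to serve on loopback only and force CPU inference (the port choice here is illustrative):

```bash
# Bind to localhost on a non-default port, CPU-only
blazr serve --model ./checkpoints/nano --host 127.0.0.1 --port 3000 --cpu
```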
blazr loads models from SafeTensors checkpoints in two formats.

oxidizr format:

```
checkpoint_dir/
├── model.safetensors   # Model weights
└── config.json         # Model configuration (optional)
```

Tensor naming: `embed_tokens`, `layers.X.mamba2`, `layers.X.self_attn`, `lm_head`

HuggingFace format:

```
checkpoint_dir/
├── model.safetensors   # Model weights
└── config.json         # Standard HuggingFace config
```

Tensor naming: `model.embed_tokens`, `model.layers.X.self_attn`, `lm_head`
blazr automatically detects the format and architecture from tensor names. If `config.json` is missing or incomplete, all parameters are inferred from tensor shapes.
- Rust 1.70 or later
- (Optional) CUDA 12.x for GPU acceleration
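A quick way to check that your toolchain meets these requirements (`nvcc` only matters for CUDA builds):

```bash
rustc --version   # should report 1.70 or later
nvcc --version    # optional; should report CUDA 12.x for GPU builds
```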
Apache-2.0 License - see LICENSE for details.
- oxidizr - Training framework for hybrid Mamba2 + MLA + MoE architectures
- splintr - High-performance BPE tokenizer with Python bindings
Contributions are welcome! Please open an issue or submit a pull request.