A blazing-fast inference server for hybrid neural architectures, supporting Mamba2 SSM, Multi-Head Latent Attention (MLA), Mixture of Experts (MoE), and standard transformers.
- Auto-detection - Automatically detects model architecture, format (HuggingFace vs oxidizr), and tokenizer vocabulary from checkpoint tensors
- Hybrid Architecture Support - Seamlessly handles mixed Mamba2 and attention layers in a single model
- HuggingFace Compatible - Loads standard HuggingFace Llama models (tested with llama3.2-1b) alongside custom oxidizr checkpoints
- OpenAI-Compatible API - Drop-in replacement with `/v1/completions` and `/v1/chat/completions` endpoints
- High Performance - Written in Rust using the Candle ML framework with optional CUDA acceleration
- Multiple Tokenizers - Supports cl100k_base, o200k_base, llama3, and deepseek_v3 vocabularies via splintr
Build from source:

```bash
# Clone the repository
git clone https://github.com/farhan-syah/blazr.git
cd blazr

# Build (CPU-only)
cargo build --release

# Build with CUDA support (requires CUDA 12.x)
cargo build --release --features cuda
```

Generate text from a prompt:

```bash
blazr generate \
  --model ./checkpoints/nano \
  --prompt "Once upon a time" \
  --max-tokens 100 \
  --vocab llama3
```

Start the inference server:

```bash
blazr serve --model ./checkpoints/nano --port 8080
```

Then make API requests:
```bash
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Hello, world!",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```

Inspect a model's configuration:

```bash
blazr info --model ./checkpoints/nano
```
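Since the API is OpenAI-compatible, the `/v1/chat/completions` endpoint should accept a request in the standard OpenAI shape. A minimal sketch (the `messages` payload below follows the OpenAI schema and is illustrative, not taken from blazr's docs):

```bash
# Illustrative chat request; the JSON body follows the standard OpenAI schema
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello, world!"}
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'
```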
blazr auto-detects and supports:
- Mamba2 - Selective state space model (SSM) layers
- MLA - Multi-Head Latent Attention with compressed KV cache
- MoE - Mixture of Experts with top-k routing and optional shared expert
- Standard Transformers - GQA (Grouped Query Attention) with MLP layers

Models can mix and match these layer types freely.
blazr automatically detects:
- Architecture - Identifies layer types (Mamba2, MLA, MoE, Transformer) from tensor name patterns
- Model Format - Distinguishes between oxidizr format (`layers.X.`) and HuggingFace format (`model.layers.X.`)
- Tokenizer Vocabulary - Infers the vocabulary from `vocab_size` if `--vocab` is not specified:
  - ~100k tokens → `cl100k_base`
  - ~128k tokens → `llama3`
  - ~129k tokens → `deepseek_v3`
  - ~200k tokens → `o200k_base`
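In practice this means `--vocab` can usually be omitted; a minimal sketch, reusing the checkpoint path from the quick start:

```bash
# --vocab omitted: blazr infers the vocabulary from the checkpoint's vocab_size
blazr generate --model ./checkpoints/nano --prompt "Hello"
```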
blazr uses splintr for high-performance BPE tokenization with pretrained vocabularies.
| Vocabulary | Description | Vocab Size | Use Case |
|---|---|---|---|
| `cl100k_base` | GPT-4, GPT-3.5-turbo | ~100k | OpenAI-compatible models |
| `o200k_base` | GPT-4o | ~200k | Extended multilingual support |
| `llama3` | Meta Llama 3 family | ~128k | Llama 3.x models (default) |
| `deepseek_v3` | DeepSeek V3/R1 | ~129k | DeepSeek models |
All vocabularies include 54 agent tokens for chat, reasoning, and tool-use applications.
Custom vocabularies are not yet supported. If you need a custom vocabulary:
- Train your model with one of the supported vocabularies above, or
- Modify blazr's tokenizer module to load your own `.tiktoken` file (base64-encoded tokens with ranks); an illustrative excerpt of that format is shown below
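For reference, a `.tiktoken` vocabulary file is plain text with one entry per line: the token's bytes base64-encoded, a space, then its integer rank. A made-up two-line excerpt (here `IQ==` is base64 for the byte `!`):

```
IQ== 0
Ig== 1
```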
- API Reference - Complete API endpoint documentation
- Architecture - Technical details on hybrid model support
- Configuration - Model configuration and tuning options
```bash
# Generate text from a prompt
blazr generate --model <path> --prompt "text" [OPTIONS]

# Start inference server
blazr serve --model <path> [--port 8080] [--host 0.0.0.0]

# Display model configuration
blazr info --model <path>

# Decode token IDs (debugging)
blazr decode --ids "123,456,789" --vocab llama3
```

Generation:
- `--model` - Model path (local directory or HuggingFace ID like `meta-llama/Llama-3.2-1B`)
- `--prompt` - Input text prompt
- `--max-tokens` - Maximum tokens to generate (default: 100)
- `--temperature` - Sampling temperature (default: 0.7)
- `--top-p` - Nucleus sampling threshold (default: 0.9)
- `--top-k` - Top-k sampling (default: 40)
- `--vocab` - Tokenizer vocabulary (`llama3`, `cl100k_base`, `o200k_base`, `deepseek_v3`); auto-detected if not specified
- `--cpu` - Force CPU inference even if CUDA is available
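Putting the generation flags together, a sketch with illustrative sampling values (all flags are documented above; the model ID is the HuggingFace example from the `--model` flag):

```bash
# Pull a HuggingFace model and sample with explicit settings (values are illustrative)
blazr generate \
  --model meta-llama/Llama-3.2-1B \
  --prompt "Explain state space models in one paragraph" \
  --max-tokens 200 \
  --temperature 0.8 \
  --top-p 0.95 \
  --top-k 50
```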
Server:
- `--model` - Model path (local directory or HuggingFace ID)
- `--port` - Port to listen on (default: 8080)
- `--host` - Host to bind to (default: 0.0.0.0)
- `--cpu` - Force CPU inference even if CUDA is available
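For example, to serve on loopback only and force CPU inference (the port choice here is illustrative):

```bash
# Bind to localhost on a non-default port, CPU-only
blazr serve --model ./checkpoints/nano --host 127.0.0.1 --port 3000 --cpu
```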
blazr loads models from SafeTensors checkpoints in two formats.

oxidizr format:

```
checkpoint_dir/
├── model.safetensors   # Model weights
└── config.json         # Model configuration (optional)
```

Tensor naming: `embed_tokens`, `layers.X.mamba2`, `layers.X.self_attn`, `lm_head`

HuggingFace format:

```
checkpoint_dir/
├── model.safetensors   # Model weights
└── config.json         # Standard HuggingFace config
```

Tensor naming: `model.embed_tokens`, `model.layers.X.self_attn`, `lm_head`
blazr automatically detects the format and architecture from tensor names. If `config.json` is missing or incomplete, all parameters are inferred from tensor shapes.
- Rust 1.70 or later
- (Optional) CUDA 12.x for GPU acceleration
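A quick way to check that your toolchain meets these requirements (`nvcc` only matters for CUDA builds):

```bash
rustc --version   # should report 1.70 or later
nvcc --version    # optional; should report CUDA 12.x for GPU builds
```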
Apache-2.0 License - see LICENSE for details.
- oxidizr - Training framework for hybrid Mamba2 + MLA + MoE architectures
- splintr - High-performance BPE tokenizer with Python bindings
Contributions are welcome! Please open an issue or submit a pull request.