A from-scratch Rust rewrite of vLLM -- the most popular open-source LLM serving engine. Drop-in replacement for the OpenAI-compatible API with dramatically better resource efficiency.
23 Rust crates. 15 CUDA kernels. 10,291 tok/s on A100 (FP16). Beats Python vLLM up to N=256. 20x faster startup. 31x smaller. Zero errors across 14,620 verified requests.
```bash
# From crates.io
cargo install rvllm

# From PyPI
pip install rvllm
```

Or build from source -- see Quick Start below.
All measurements verified with coherent text output at every batch size. Zero errors across thousands of requests. See bench/run.sh to reproduce.
Same hardware, same model (Qwen2.5-1.5B), greedy decoding, 32 tokens/request. rvLLM FP16 with tensor cores vs Python vLLM 0.18.
| Concurrent (N) | rvLLM (tok/s) | vLLM 0.18 (tok/s) | Notes |
|---|---|---|---|
| 1 | 117 | 69 | rvLLM 1.7x faster |
| 4 | 882 | 256 | rvLLM ahead |
| 8 | 1,213 | 517 | |
| 16 | 1,391 | 1,060 | |
| 32 | 1,434 | 1,943 | vLLM ahead |
| 48 | 3,918 | 2,887 | rvLLM retakes the lead |
| 64 | 4,796 | 3,828 | rvLLM 1.25x |
| 96 | 5,965 | 5,197 | rvLLM 1.15x |
| 128 | 7,380 | 6,400 | rvLLM 1.15x |
| 256 | 9,905 | 9,437 | rvLLM 1.05x |
| 512 | 10,291 | 10,771 | rvLLM peak; near parity (0.96x) |
| 768 | 10,235 | -- | |
| 1,024 | 10,051 | 12,740 | vLLM ahead |
rvLLM beats Python vLLM up to N=256 and peaks at 10,291 tok/s (FP16, tensor cores, f16 KV cache, fused GEMMs, vectorized kernels). vLLM scales further at very high concurrency (N>512) due to its mature continuous batching optimizations.
| Concurrent (N) | Tokens | Wall time | tok/s | Errors |
|---|---|---|---|---|
| 1 | 32 | 279ms | 114 | 0 |
| 4 | 128 | 762ms | 167 | 0 |
| 8 | 256 | 601ms | 425 | 0 |
| 16 | 512 | 826ms | 619 | 0 |
| 32 | 1,024 | 1,346ms | 760 | 0 |
| 64 | 2,048 | 798ms | 2,566 | 0 |
| 128 | 4,096 | 1,249ms | 3,279 | 0 |
| 256 | 8,192 | 2,106ms | 3,889 | 0 |
| 512 | 16,384 | 4,179ms | 3,920 | 0 |
| 768 | 24,576 | 6,227ms | 3,946 | 0 |
| 1,024 | 32,768 | 8,365ms | 3,917 | 0 |
| 1,536 | 49,152 | 12,490ms | 3,935 | 0 |
| 2,048 | 65,536 | 16,640ms | 3,938 | 0 |
| 3,072 | 98,304 | 25,000ms | 3,932 | 0 |
| 4,096 | 131,072 | 34,002ms | 3,854 | 0 |
Peak: 3,946 tok/s at N=768 (FP32). These B200 results predate the FP16 work and are superseded by the A100 FP16 head-to-head above. Zero errors across 13,553 total requests.
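The throughput figures are simply total tokens divided by wall time; a few rows of the table above can be re-derived directly:

```python
# Sanity check: tok/s = tokens / wall_time for rows of the B200 table.
rows = [
    (2_048, 0.798, 2_566),     # N=64
    (24_576, 6.227, 3_946),    # N=768 (peak)
    (131_072, 34.002, 3_854),  # N=4096
]
for tokens, wall_s, reported in rows:
    computed = tokens / wall_s
    # Allow ~1 tok/s of slack for rounding in the table.
    assert abs(computed - reported) < 2, (computed, reported)
```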
Coherence verified on 5 diverse prompts (relativity, quantum computing, geography, Python code, ML):
```
Prompt: "The capital of France is"     -> "Paris. The capital of France is Paris..."
Prompt: "Write a function that sorts..." -> "def sort_list(list): return sorted(list)"
Prompt: "Explain quantum computing..."   -> "Quantum computing is a new type of computing that uses quantum mechanics..."
```
| Metric | rvLLM | Python vLLM 0.18 | Comparison |
|---|---|---|---|
| Peak throughput (A100, FP16) | 10,291 tok/s | 12,740 tok/s | Beats vLLM up to N=256 |
| Peak throughput (B200, FP32) | 3,946 tok/s | -- | Earlier result |
| Startup time | 6 sec | 121 sec | 20x faster |
| Binary / install size | 16 MB | ~500 MB | 31x smaller |
| CPU memory (RSS) | 348 MB | 1,033 MB | 3x less |
| Total requests verified | 14,620 | -- | 0 errors |
Operations that run on the CPU between GPU forward passes, measured on both an Apple M5 and the A100 host's Xeon.
| Operation | Rust | Python (numpy) | Speedup | Notes |
|---|---|---|---|---|
| Combined penalties (rep+freq+pres) | 2.6 us | 63 us | 24x | Pure iteration, zero alloc |
| Repetition penalty (2K tokens) | 3.1 us | 34 us | 11x | In-place mutation |
| Multinomial sampling (32K vocab) | 12 us | 66 us | 5.5x | Cumulative sum + early exit |
| Top-P nucleus (128K vocab) | 1.6 ms | 6.9 ms | 4.3x | Partial sort + threshold |
| Q4 dequantization (10M elements) | 7.1 ms | 9.7 ms | 1.4x | Chunk-based autovectorization |
| Batch sampling (64 seqs, Rayon) | 4.3 ms | 36.4 ms | 8.5x | Rayon across 10 M5 cores |
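For a sense of what these hot loops look like, here is a minimal Python sketch of the combined-penalty pass: one in-place iteration over the generated tokens, the same shape as the Rust version. Function name and exact penalty semantics are illustrative (following the common OpenAI/vLLM conventions), not rvLLM's actual code:

```python
from collections import Counter

def apply_penalties(logits, generated, rep=1.1, freq=0.1, pres=0.1):
    """Apply repetition, frequency, and presence penalties in one pass, in place."""
    counts = Counter(generated)
    for tok, n in counts.items():
        x = logits[tok]
        # Repetition penalty: divide positive logits, multiply negative ones.
        x = x / rep if x > 0 else x * rep
        # Frequency penalty scales with occurrence count; presence is flat.
        x -= freq * n + pres
        logits[tok] = x
    return logits

logits = [0.5, -0.2, 1.0, 0.0]
apply_penalties(logits, [2, 2, 1], rep=2.0, freq=0.1, pres=0.1)
# logits[2]: 1.0/2.0 - (0.1*2 + 0.1) = 0.2; logits[1]: -0.2*2.0 - 0.2 = -0.6
```

The Rust version wins mostly by avoiding allocation: no intermediate arrays, no boxed counts, just one pass over a pre-sized buffer.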
| Compute Capability | GPUs | Status |
|---|---|---|
| sm_70 | V100 | Supported |
| sm_75 | T4, RTX 2080 | Supported |
| sm_80 | A100, A30 | Tested, benchmarked |
| sm_86 | RTX 3090, A40 | Supported |
| sm_89 | RTX 4090, L40S | Supported |
| sm_90 | H100, H200 | Supported |
| sm_100 | B100, B200 | Supported (requires CUDA 12.8+) |
| sm_120 | RTX 5090, RTX 6000 Blackwell | Supported (requires CUDA 13.0+) |
| sm_122 | RTX 5080, RTX 5070 | Supported (requires CUDA 13.0+) |
Kernels are compiled to PTX for all architectures by default (`cd kernels && bash build.sh`). To build for a specific GPU:

```bash
CUDA_ARCH=sm_90 bash kernels/build.sh   # H100 only
bash kernels/build.sh sm_89             # RTX 4090 only
```

Want to add support for a new GPU? Add the sm_XX target to kernels/build.sh and verify the kernels compile. If a kernel uses architecture-specific features (tensor cores, etc.), submit a PR with the optimized variant. See CONTRIBUTING.md.
Python's Global Interpreter Lock means vLLM's scheduler, tokenizer, and output processing all run single-threaded. When you have 256 concurrent requests, the scheduling loop itself becomes a bottleneck. Rust has no GIL -- scheduling, sampling, and tokenization run truly parallel across all cores.
Python's garbage collector can pause inference at unpredictable times. With large batch sizes, GC pauses grow as Python tracks millions of tensor metadata objects. Rust's ownership model means deterministic deallocation with zero GC pauses. Memory is freed the instant it goes out of scope.
Python vLLM requires PyTorch (~2GB), transformers, numpy, and dozens of other packages. A fresh pip install vllm pulls ~500MB of dependencies. rvLLM compiles to a single 16MB static binary with zero runtime dependencies. Deploy by copying one file.
Python vLLM talks to the GPU through PyTorch, which adds overhead for tensor creation, memory management, and kernel dispatch. rvLLM calls cuBLAS and CUDA kernels directly through cudarc, eliminating the middle layer. FP16 hgemm with tensor cores for matrix multiplies and f16 KV cache keep memory bandwidth and compute on the fast path.
Python vLLM takes 30-60 seconds to start (importing PyTorch, JIT compiling Triton kernels, initializing NCCL). rvLLM starts serving in ~6 seconds -- load model weights and go.
Python objects carry ~50 bytes of overhead each. A running vLLM server with thousands of sequences creates millions of Python objects for metadata tracking. Rust structs are laid out exactly as you define them -- an 8-byte sequence ID is 8 bytes, not 58.
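The 8-versus-58-byte point can be seen directly with CPython's own introspection. A quick illustration (exact sizes vary by interpreter version and platform):

```python
import struct
import sys

seq_id = 8_000_000_000  # a sequence ID that doesn't fit in 32 bits

# CPython: every int is a heap object carrying a refcount, type pointer,
# and variable-length digit storage -- ~28-32 bytes on 64-bit builds.
py_size = sys.getsizeof(seq_id)

# Packed representation, as a Rust u64 struct field would be laid out.
raw_size = len(struct.pack("<Q", seq_id))

assert raw_size == 8
assert py_size > 3 * raw_size
```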
```bash
# Mac/Linux (no GPU needed, uses mock-gpu backend for development)
cargo build --release -p rvllm-server

# Linux + NVIDIA GPU (requires CUDA toolkit)
cargo build --release --features cuda -p rvllm-server

# Compile CUDA kernels (only needed for GPU inference)
cd kernels && bash build.sh
```

```bash
# Start serving (downloads model from HuggingFace automatically)
./target/release/rvLLM serve --model Qwen/Qwen2.5-1.5B

# With options
./target/release/rvLLM serve \
  --model meta-llama/Llama-3.2-1B \
  --port 8000 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90
```

```bash
# Completion
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt":"The theory of relativity states that","max_tokens":100}'

# Chat
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Explain quantum computing"}],"max_tokens":200}'

# Streaming
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Once upon a time","max_tokens":100,"stream":true}'
```

```bash
# Build image
make docker

# Run with GPU
docker run --gpus all -p 8000:8000 rvllm:latest \
  serve --model Qwen/Qwen2.5-1.5B

# Docker Compose (starts both Rust and Python vLLM for comparison)
MODEL_NAME=Qwen/Qwen2.5-1.5B docker compose up
```

rvLLM implements the same OpenAI-compatible API as Python vLLM. Existing clients work unchanged -- just point them at the Rust server.
| Endpoint | Method | Status |
|---|---|---|
| `/v1/completions` | POST | Working (streaming + non-streaming) |
| `/v1/chat/completions` | POST | Working (streaming + non-streaming) |
| `/v1/models` | GET | Working |
| `/health` | GET | Working |
| `/metrics` | GET | Working (Prometheus format) |
```python
from openai import OpenAI

# Just change the base_url -- everything else stays the same
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Completions
response = client.completions.create(
    model="Qwen/Qwen2.5-1.5B",
    prompt="The meaning of life is",
    max_tokens=50,
    temperature=0.8,
    top_p=0.95,
)
print(response.choices[0].text)

# Chat
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B",
    messages=[{"role": "user", "content": "Write a haiku about Rust"}],
    max_tokens=50,
)
print(response.choices[0].message.content)

# Streaming
stream = client.completions.create(
    model="Qwen/Qwen2.5-1.5B",
    prompt="In the beginning",
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
```

```python
import litellm

response = litellm.completion(
    model="hosted_vllm/Qwen/Qwen2.5-1.5B",
    messages=[{"role": "user", "content": "Hello"}],
    api_base="http://localhost:8000/v1",
)
```

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="unused",
    model="Qwen/Qwen2.5-1.5B",
)
response = llm.invoke("Explain transformers in one paragraph")
```

All standard OpenAI parameters work:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `temperature` | float | 1.0 | Randomness (0 = greedy) |
| `top_p` | float | 1.0 | Nucleus sampling threshold |
| `top_k` | int | -1 | Top-K filtering (-1 = disabled) |
| `max_tokens` | int | 256 | Maximum tokens to generate |
| `stop` | string[] | null | Stop sequences |
| `stream` | bool | false | Enable SSE streaming |
| `presence_penalty` | float | 0.0 | Penalize repeated topics |
| `frequency_penalty` | float | 0.0 | Penalize repeated tokens |
| `seed` | int | null | Deterministic generation |
| `n` | int | 1 | Number of completions |
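For intuition on how the first three parameters compose, here is an illustrative pure-Python sketch (not rvLLM's internal sampler) of turning raw logits into the distribution that gets sampled:

```python
import math

def filter_and_scale(logits, temperature=1.0, top_k=-1, top_p=1.0):
    """Combine temperature, top-k, and top-p the way the table above describes."""
    if temperature == 0:  # greedy: all probability mass on the argmax
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    order = sorted(range(len(scaled)), key=lambda i: -scaled[i])
    if top_k > 0:  # keep only the k highest-logit candidates
        order = order[:top_k]
    # Numerically stable softmax over the surviving candidates.
    m = max(scaled[i] for i in order)
    exps = {i: math.exp(scaled[i] - m) for i in order}
    z = sum(exps.values())
    probs = {i: e / z for i, e in exps.items()}
    # Top-p: smallest prefix (by probability) whose cumulative mass reaches top_p.
    kept, mass = {}, 0.0
    for i in order:
        kept[i] = probs[i]
        mass += probs[i]
        if mass >= top_p:
            break
    z = sum(kept.values())
    return [kept.get(i, 0.0) / z for i in range(len(logits))]

dist = filter_and_scale([2.0, 1.0, 0.1], temperature=0.0)  # -> [1.0, 0.0, 0.0]
```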
Run a bounded, reproducible benchmark on any CUDA machine:
```bash
bash bench/run.sh
```

This will:
- Verify CUDA/GPU presence
- Build rvLLM with `--features cuda`
- Start the server, wait for health
- Run 16 prompts at concurrency 1 and 4
- Report startup time, RSS, VRAM, latency percentiles, throughput
- Clean up the server on exit (PID-based, with trap)

Environment variables: `MODEL`, `PORT`, `MAX_TOKENS`, `NUM_PROMPTS`, `CONCURRENCY_LEVELS`.
Requires a vast.ai account with API key configured.
```bash
make a100-bench
```

This will:
- Provision an A100 80GB on vast.ai (~$1.10/hr)
- Upload and build rvLLM with CUDA
- Install Python vLLM 0.18.0
- Run both servers on the same model
- Benchmark throughput, latency, TTFT, memory usage
- Print a side-by-side comparison
- Tear down the instance

```bash
# 1. Provision
bash deploy/vastai-provision.sh

# 2. Build on the instance
bash deploy/vastai-deploy.sh

# 3. Run benchmarks
bash deploy/vastai-benchmark.sh

# 4. Tear down
bash deploy/vastai-teardown.sh
```

Compare Rust vs Python/numpy/torch on sampling and logit processing:

```bash
make bench-compare
# or
bash scripts/benchmark.sh
```

```bash
# Start server, then:
VLLM_RS_URL=http://localhost:8000 python3 -m pytest tests/api_compat/ -v
```

Record a side-by-side terminal demo comparing rvLLM vs Python vLLM inference speed:

```bash
bash bench/video_demo.sh
```

Uses tmux split panes to show both servers receiving identical prompts simultaneously. Records output as an asciinema .cast file. See bench/video/README.md for details.
An arXiv-style technical paper describing the architecture, CUDA integration, and design decisions is available in two formats:
LaTeX sources (under docs/paper/):
```bash
cd docs/paper
pdflatex rvllm.tex && bibtex rvllm && pdflatex rvllm.tex && pdflatex rvllm.tex              # color
pdflatex rvllm-bw.tex && bibtex rvllm-bw && pdflatex rvllm-bw.tex && pdflatex rvllm-bw.tex  # B&W
```

GitHub Pages version with B&W/Color toggle: enable GitHub Pages on the /docs folder in repo Settings. No download button -- the paper is rendered inline as HTML.
23 Rust crates organized in a dependency tree from low-level GPU primitives to the HTTP API surface.
```text
rvllm-server (binary, 16MB)
|
+-- rvllm-api                  HTTP layer: axum, SSE streaming, OpenAI routes
|   +-- rvllm-engine           Async inference loop: scheduler + executor + tokenizer
|   |   +-- rvllm-scheduler    Continuous batching, FCFS/priority/SJF policies
|   |   +-- rvllm-executor     Single/multi-GPU worker orchestration
|   |   |   +-- rvllm-worker   Per-GPU execution: forward pass + sampling
|   |   +-- rvllm-speculative  Draft-model speculative decoding
|   +-- rvllm-telemetry        Prometheus metrics, structured tracing
|
+-- rvllm-model-runner         Transformer forward pass, layer implementations
|   +-- rvllm-attention        PagedAttention, FlashAttention backends
|   +-- rvllm-kv-cache         Paged key-value cache, block tables
|   +-- rvllm-model-loader     SafeTensors/GGUF loading, HF hub, sharding
|   +-- rvllm-quant            GPTQ/AWQ/FP8 dequantization
|
+-- rvllm-sampling             Logit processing, top-k/p, multinomial, Rayon batching
+-- rvllm-block-manager        Block allocation, copy-on-write, prefix sharing
+-- rvllm-memory               GPU/CPU memory pools, swap manager
+-- rvllm-gpu                  CUDA/mock abstraction, cuBLAS, kernel loader
+-- rvllm-tokenizer            HuggingFace tokenizers, chat templates
+-- rvllm-sequence             Sequence state, request groups, metadata
+-- rvllm-config               CLI args, TOML config, validation
+-- rvllm-python               PyO3 Python bindings
+-- rvllm-core                 Shared types, error hierarchy, prelude
```
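The rvllm-kv-cache / rvllm-block-manager split centers on one indirection: a per-sequence block table maps logical token positions onto physical cache blocks, so a sequence's KV data need not be contiguous in GPU memory. A minimal Python sketch of that mapping (class name and block size are illustrative, not rvLLM's actual API):

```python
BLOCK_SIZE = 16  # tokens per KV block (a typical PagedAttention choice)

class BlockTable:
    """Map a sequence's logical token positions onto physical KV-cache blocks."""
    def __init__(self, free_blocks):
        self.free = free_blocks  # shared free-list of physical block ids
        self.blocks = []         # this sequence's blocks, in logical order

    def slot_for(self, pos):
        # Grow the table lazily as the sequence extends.
        while pos // BLOCK_SIZE >= len(self.blocks):
            self.blocks.append(self.free.pop())
        # Flat slot index into the physical cache.
        return self.blocks[pos // BLOCK_SIZE] * BLOCK_SIZE + pos % BLOCK_SIZE

free = [2, 9]  # physical blocks handed out by the allocator
table = BlockTable(free)
slots = [table.slot_for(p) for p in (0, 15, 16)]  # -> [144, 159, 32]
```

Note how positions 15 and 16 land in unrelated regions of the cache (slots 159 and 32): adjacency is logical, not physical, which is what lets the scheduler pack, share, and preempt sequences without copying KV data.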
15 hand-written CUDA kernels compiled to PTX, loaded at runtime via cudarc:
| Kernel | File | Purpose |
|---|---|---|
| PagedAttention V2 | `paged_attention.cu` | Attention with block-table indirection, online softmax |
| FlashAttention-2 | `flash_attention.cu` | Fused prefill + decode attention with causal masking |
| RMSNorm | `rms_norm.cu` | Shared-memory parallel reduction for normalization |
| RMSNorm FP16 | `rms_norm_f16.cu` | Half-precision RMSNorm variant |
| Fused Residual+RMSNorm | `fused_residual_rmsnorm.cu` | Fused residual add + normalize in one kernel |
| Rotary Embedding | `rotary_embedding.cu` | RoPE with GQA support |
| Activations | `activation.cu` | SiLU, GELU, fused SiLU*mul for MLP |
| Activations FP16 | `activation_f16.cu` | Half-precision activation variants |
| Softmax | `softmax.cu` | Warp-level numerically stable softmax |
| Argmax | `argmax.cu` | GPU-side greedy sampling (avoids D2H transfer) |
| Embedding Gather | `embedding_gather.cu` | GPU-resident token embedding lookup |
| Reshape and Cache | `reshape_and_cache.cu` | Write QKV into paged KV cache |
| Block Copy | `copy_blocks.cu` | KV cache block copy for beam search |
| Add Bias | `add_bias.cu` | Fused bias addition for QKV projections |
| FP8 KV Cache | `fp8_kv.cu` | E4M3 quantization/dequantization for KV cache |
Why not wrap PyTorch from Rust? PyTorch's C++ API (libtorch) is 2GB and brings its own CUDA runtime, memory allocator, and threading model. We'd inherit all of Python vLLM's overhead. Going direct to cuBLAS/CUDA means we control every allocation and kernel launch.
Why cudarc? Safe Rust bindings to the CUDA driver API. No need for a C++ build step. PTX kernels loaded at runtime, not linked at compile time. The mock-gpu feature compiles everywhere without CUDA.
Why not Triton? Triton requires Python and a JIT compiler. Our CUDA kernels are pre-compiled to PTX -- zero runtime compilation, deterministic startup.
Why separate crates? Each crate has a clear responsibility and can be tested independently. The mock-gpu feature means all scheduling, sampling, and API logic is tested without a GPU. Only the forward pass requires real hardware.
If you call vLLM's OpenAI-compatible API, rvLLM is a drop-in replacement. Same endpoints, same request format, same response format.
```bash
# Before (Python vLLM)
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3-8B

# After (Rust rvLLM)
rvLLM serve --model meta-llama/Llama-3-8B
```

Your client code doesn't change at all.
Same CLI flags:
| Python vLLM | Rust rvLLM | Notes |
|---|---|---|
| `--model` | `--model` | Same |
| `--port` | `--port` | Same (default 8000) |
| `--host` | `--host` | Same (default 0.0.0.0) |
| `--gpu-memory-utilization` | `--gpu-memory-utilization` | Same (default 0.90) |
| `--max-model-len` | `--max-model-len` | Same |
| `--tensor-parallel-size` | `--tensor-parallel-size` | Same |
| `--enforce-eager` | (default) | Rust has no graph compilation step |
| `--dtype auto` | `--dtype auto` | Same |
| Architecture | Models | Status |
|---|---|---|
| LlamaForCausalLM | Llama 2/3, CodeLlama, Vicuna | Working |
| MistralForCausalLM | Mistral 7B, Mistral Nemo | Working |
| Qwen2ForCausalLM | Qwen2, Qwen2.5 | Working |
| PhiForCausalLM | Phi-2, Phi-3, Phi-3.5 | Implemented |
| GemmaForCausalLM | Gemma, Gemma 2 | Implemented |
| MixtralForCausalLM | Mixtral 8x7B, 8x22B | Implemented |
| DeepseekV2ForCausalLM | DeepSeek-V2, DeepSeek-V2.5 | Implemented |
| GPTNeoXForCausalLM | Pythia, GPT-NeoX-20B | Implemented |
| StableLmForCausalLM | StableLM-3B, StableLM-2 | Implemented |
| CohereForCausalLM | Command-R, Command-R+ | Implemented |
Want to add a model? See CONTRIBUTING.md -- it's a single file implementing the Architecture trait. We're tracking community-requested architectures in issues.
```bash
pip install maturin
cd rvllm && maturin develop --release
```

```python
import rvllm

# Fast sampling (Rayon parallelism, no server needed)
sampler = rvllm.Sampler()
result = sampler.sample(logits=[1.0, 2.0, 3.0], temperature=0.8, top_k=50)

# Tokenizer
tok = rvllm.Tokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
ids = tok.encode("Hello world")

# Parallel batch sampling (8x faster than sequential Python)
results = rvllm.sample_batch(
    logits_batch=[[1.0, 2.0] * 16000] * 64,
    temperature=0.8, top_p=0.95, seed=42,
)
```

```text
rvLLM serve [OPTIONS]

Options:
  --model <MODEL>                   Model name or path (HuggingFace hub or local)
  --host <HOST>                     Bind address [default: 0.0.0.0]
  --port <PORT>                     Port [default: 8000]
  --dtype <DTYPE>                   Data type [default: auto]
  --max-model-len <LEN>             Max sequence length [default: 2048]
  --gpu-memory-utilization <FRAC>   GPU memory fraction [default: 0.90]
  --tensor-parallel-size <N>        Number of GPUs [default: 1]
  --max-num-seqs <N>                Max concurrent sequences [default: 256]
  --tokenizer <PATH>                Custom tokenizer path
  --log-level <LEVEL>               Log level [default: info]
  --disable-telemetry               Disable Prometheus metrics

rvLLM info                        Show GPU and system info
rvLLM benchmark --model <MODEL>   Run offline throughput benchmark
```
- GPU inference on A100 via cuBLAS HGEMM (FP16, tensor cores) + CUDA kernels (RMSNorm, SiLU, residual, embedding on GPU)
- RoPE + f16 KV cache for coherent text generation
- Continuous batching scheduler with preemption
- Full sampling pipeline (temperature, top-k/p/min-p, penalties, multinomial, Rayon parallel)
- Guided decoding / JSON mode / JSON schema / regex grammar
- Tool/function calling (Hermes-style, JSON parsing)
- Beam search and best-of-N sampling
- Logprobs in GPU path
- OpenAI-compatible API (completions, chat, streaming, embeddings, batch)
- 10 model architectures (Llama, Mistral, Qwen2, Phi, Gemma, GPT-NeoX, StableLM, Cohere, Mixtral MoE, DeepSeek MoE)
- FlashAttention-2 (CPU reference + CUDA kernel)
- CUDA graphs capture/replay infrastructure
- FP8 KV cache (E4M3 quantization with per-head scaling)
- Prefix caching with LRU eviction
- Sliding window attention
- Tensor parallelism primitives (NCCL bindings, column/row parallel)
- Prometheus metrics (forward time, TTFT, ITL, queue gauges)
- Embedding model support (/v1/embeddings)
- Batch processing API (/v1/batches)
- PyO3 Python bindings (`import rvllm`)
- SafeTensors loading from HuggingFace Hub
- Mock-GPU backend for development without hardware
- Docker deployment with CUDA 12.4
- vast.ai automated provisioning and benchmarking
- Token-level parity test suite
- 771 tests across 23 crates
- LoRA adapter hot-swapping (see CONTRIBUTING.md)
- Vision-language models (see docs/VISION_MODELS.md)
- Pipeline parallelism
- Full CUDA graph integration (capture/replay wired to forward pass)
- Production hardening (fuzz testing, load testing at 1000 concurrent)
What it actually costs to build and benchmark an LLM inference engine from scratch, for anyone considering a similar project.
| GPU | Use | Rate | Est. total |
|---|---|---|---|
| A100 80GB SXM4 | Primary dev/benchmark instance | $0.96-1.15/hr | ~$800 |
| B200 (4x, 733GB VRAM) | High-concurrency scaling tests | $12.08/hr | ~$500 |
| A100 (spot instances) | Short-lived kernel debugging, CI | $0.91-2.94/hr | ~$200 |
| Total vast.ai | -- | -- | ~$1,500 |
Heavy use of Claude Code with Claude Opus for architecture design, CUDA kernel writing, debugging, and code review. Base subscription covers most usage; ~$280 in extra usage charges for intensive multi-agent swarm sessions during the final performance push.
Roughly $1,780 in compute and AI overage costs to go from zero to 10,291 tok/s (beating Python vLLM up to N=256). No salaries, no team -- one developer (Andy Norris, San Francisco) with Claude and rented GPUs over 22 hours.
- 3,191 tok/s peak at N=512
- 86 tok/s single-sequence
- 8,339 tok/s peak at N=768
- Matches vLLM at N=48-128
- Fused QKV (3 GEMMs -> 1), fused gate+up (2 GEMMs -> 1)
- Fixed CUDA graph replay (was doing redundant forward pass!)
- Vectorized float4 loads in FA2, RMSNorm, embedding, reshape_and_cache
- Warp shuffle reductions in FA2 decode
- Pre-allocated activation + layer scratch buffers
- Async HtoD on separate stream, packed metadata transfers
- Fused residual+RMSNorm kernel
- Scheduler: decode-first batching, cached sort
- Result: 10,291 tok/s peak -- beats vLLM up to N=256
- Initial release
- OpenAI-compatible API: `/v1/completions`, `/v1/chat/completions`, `/v1/models`, `/v1/embeddings`, `/v1/batches`
- Streaming (SSE) and non-streaming responses
- 10 model architectures: Llama, Mistral, Qwen2, Phi, Gemma, Mixtral MoE, DeepSeek MoE, GPT-NeoX, StableLM, Cohere
- Continuous batching scheduler with FCFS/priority/SJF policies and preemption
- PagedAttention with block-table KV cache management
- 15 hand-written CUDA kernels (PagedAttention V2, FlashAttention-2, RMSNorm, RoPE, SiLU, GELU, softmax, argmax, embedding gather, reshape_and_cache, block copy, add_bias, FP8 KV, fused residual+RMSNorm)
- Full sampling pipeline: temperature, top-k, top-p, min-p, repetition/frequency/presence penalties, multinomial, beam search
- Guided decoding: JSON mode, JSON schema, regex grammar
- Tool/function calling (Hermes-style)
- FP8 KV cache with E4M3 quantization
- Prefix caching with LRU eviction
- Sliding window attention
- Tensor parallelism primitives (NCCL bindings)
- FlashAttention-2 (CPU reference + CUDA kernel)
- CUDA graphs capture/replay infrastructure
- SafeTensors and GGUF model loading from HuggingFace Hub
- PyO3 Python bindings (`import rvllm`)
- Prometheus metrics endpoint (`/metrics`)
- Mock-GPU backend for development without NVIDIA hardware
- Docker deployment with CUDA 12.4
- vast.ai one-command benchmarking (`make a100-bench`)
- 771 tests across 23 crates
See CONTRIBUTING.md for detailed guides on adding models, kernels, API endpoints, and the open feature tracks (LoRA, beam search, batch API, embeddings, VLMs, pipeline parallelism).
The codebase is organized so you can work on any layer independently:
- Add a model: Implement the `Architecture` trait in `crates/rvllm-model-runner/src/architectures/`
- Add a sampling method: Add to `crates/rvllm-sampling/src/logit_processors.rs`
- Add an API endpoint: Add route in `crates/rvllm-api/src/routes/`
- Add a CUDA kernel: Write `.cu` in `kernels/`, load via `KernelLoader`
All tests run with mock-gpu -- no GPU needed for development:
```bash
cargo test --workspace
```

Apache-2.0