A lightweight LLM inference engine written in Rust, implementing key vLLM innovations.
| Model | Architecture | Status |
|-------|--------------|--------|
| Qwen3 | GQA + RoPE | ✅ |
| Llama | GQA + RMSNorm | ✅ |
| Mistral | Sliding Window + GQA | ✅ |
| Gemma4 | Hybrid Attention + GeGLU | ✅ |
| Mixtral | Sparse MoE (8 experts) | ✅ |
- Continuous Batching - Dynamic batch scheduling with fairness
- Paged KV Cache - Memory-efficient cache management with pool
- Prefix Caching - Block hash-based cache reuse
- Speculative Decoding - Accelerated token generation
- OpenAI-compatible API - `/v1/completions`, `/v1/chat/completions`
- Flash Attention - Dynamic tile size selection (64/128/256)
- Fused Kernels - Optimized attention + MLP layers
- Authentication - API key support
- Rate Limiting - Request rate control
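Prefix caching keys KV blocks by a chained hash of their token IDs, so two requests sharing a prompt prefix can reuse the same blocks. A minimal sketch of that idea (hypothetical names and block size, not the crate's actual API):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

const BLOCK_SIZE: usize = 16; // tokens per KV block (assumed)

/// Hash a block of token IDs together with the previous block's hash,
/// so equal hashes imply an identical prefix up to and including this block.
fn block_hash(prev: u64, tokens: &[u32]) -> u64 {
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    tokens.hash(&mut h);
    h.finish()
}

/// Walk the prompt block by block, collecting the physical block IDs of
/// every full block whose chained hash is already cached.
fn match_prefix(cache: &HashMap<u64, usize>, prompt: &[u32]) -> Vec<usize> {
    let mut prev = 0u64;
    let mut reused = Vec::new();
    for chunk in prompt.chunks(BLOCK_SIZE) {
        if chunk.len() < BLOCK_SIZE {
            break; // partial blocks are never shared
        }
        prev = block_hash(prev, chunk);
        match cache.get(&prev) {
            Some(&block_id) => reused.push(block_id),
            None => break, // first miss ends the shared prefix
        }
    }
    reused
}
```

Chaining the previous hash into each block's hash is what makes a match safe: a hit on block *n* implies the entire prefix up to block *n* is identical, not just that one block.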
```bash
# Build
cargo build --workspace

# Run server (default: /models/Qwen2.5-0.5B-Instruct)
cargo run -p vllm-server

# Or specify model path
cargo run -p vllm-server -- --model /path/to/your/model

# Test
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, how are you?", "max_tokens": 50, "stream": true}'
```
Configuration can be provided via YAML file or environment variables:
```yaml
# config.yaml
server:
  host: "0.0.0.0"
  port: 8000
  log_level: "info"

engine:
  max_draft_tokens: 8
  num_kv_blocks: 1024
  max_batch_size: 256
  tensor_parallel_size: 1
  kv_quantization: false  # Enable INT8 KV cache quantization

auth:
  api_keys: []
  rate_limit_requests: 100
  rate_limit_window_secs: 60
```
| Variable | Description | Default |
|----------|-------------|---------|
| `VLLM_HOST` | Server host | `0.0.0.0` |
| `VLLM_PORT` | Server port | `8000` |
| `VLLM_LOG_LEVEL` | Log level | `info` |
| `VLLM_MAX_DRAFT_TOKENS` | Max speculative tokens | `8` |
| `VLLM_KV_BLOCKS` | Number of KV blocks | `1024` |
| `VLLM_TENSOR_PARALLEL_SIZE` | Tensor parallel size | `1` |
| `VLLM_KV_QUANTIZATION` | Enable INT8 KV cache quantization | `false` |
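`VLLM_MAX_DRAFT_TOKENS` bounds how many tokens the draft model proposes per speculative-decoding step; the target model then verifies them in one batched forward pass. A simplified greedy acceptance sketch (the engine's actual verification may use rejection sampling instead; names here are illustrative):

```rust
/// Greedy speculative-decoding acceptance: keep draft tokens while they
/// match the target model's argmax; on the first mismatch, emit the
/// target's token and stop. Every accepted token saved one full
/// target-model decode step.
fn verify_draft(draft: &[u32], target_argmax: &[u32]) -> Vec<u32> {
    let mut accepted = Vec::new();
    for (&d, &t) in draft.iter().zip(target_argmax) {
        if d == t {
            accepted.push(d); // draft agreed with target: keep it
        } else {
            accepted.push(t); // mismatch: take target's token, stop
            break;
        }
    }
    accepted
}
```

Raising the draft budget increases the potential speedup on agreeable prompts but wastes draft work when the models diverge early.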
```bash
cargo run -p vllm-server -- --help
```
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/chat/completions` | POST | Chat completion |
| `/v1/completions` | POST | Text completion |
| `/v1/embeddings` | POST | Get embeddings |
| `/v1/batches` | POST/GET | Batch requests |
| `/metrics` | GET | Prometheus metrics |
| `/health` | GET | Health check |
| `/shutdown` | GET | Shutdown server |
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is Rust?"}
    ],
    "max_tokens": 100
  }'
```
```bash
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Once upon a time",
    "max_tokens": 50,
    "stream": true
  }'
```
```bash
# Set API key
export VLLM_API_KEY=your-secret-key

# Use with requests
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-key" \
  -d '{"prompt": "Hello", "max_tokens": 10}'
```
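With `rate_limit_requests: 100` and `rate_limit_window_secs: 60`, each API key gets at most 100 requests per 60-second window. One common way to implement this is a fixed-window counter per key; a hedged sketch of that approach (not necessarily how this server does it):

```rust
use std::collections::HashMap;

/// Fixed-window rate limiter: allow at most `limit` requests per
/// `window_secs` seconds for each key. `now` is a unix timestamp in
/// seconds, passed in explicitly to keep the sketch testable.
struct RateLimiter {
    limit: u32,
    window_secs: u64,
    counters: HashMap<String, (u64, u32)>, // key -> (window start, count)
}

impl RateLimiter {
    fn new(limit: u32, window_secs: u64) -> Self {
        Self { limit, window_secs, counters: HashMap::new() }
    }

    /// Returns true if the request is allowed, false if it should get
    /// an HTTP 429 response.
    fn check(&mut self, key: &str, now: u64) -> bool {
        let entry = self.counters.entry(key.to_string()).or_insert((now, 0));
        if now - entry.0 >= self.window_secs {
            *entry = (now, 0); // window expired: start a fresh one
        }
        if entry.1 < self.limit {
            entry.1 += 1;
            true
        } else {
            false
        }
    }
}
```

A fixed window is cheap but allows short bursts of up to 2x the limit at a window boundary; a sliding window or token bucket smooths that out at the cost of more bookkeeping.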
- 🚀 Fast Rust implementation
- 🎯 Continuous batching with decode-priority scheduling
- 💾 Paged KV cache with LRU eviction + memory pool
- 🔍 Block hash-based prefix caching
- ⚡ Flash Attention with dynamic tile selection
- 🔗 Fused attention and MLP kernels
- 🔄 Streaming token generation (SSE)
- 📡 OpenAI-compatible HTTP API
- 🖥️ CUDA GPU support (via Candle)
- 📊 Real-time metrics collection
- 🔐 API key authentication
- ⏱️ Rate limiting
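The dynamic tile selection listed above chooses among 64/128/256 per launch. The engine's actual policy isn't shown here; a plausible heuristic is simply to pick the largest supported tile the sequence can fill, so short sequences don't waste work on mostly-masked tiles:

```rust
/// Pick a Flash Attention tile size from the supported set (64/128/256).
/// Hypothetical heuristic, not the engine's documented policy: use the
/// largest tile that the sequence length can fully occupy.
fn select_tile_size(seq_len: usize) -> usize {
    if seq_len >= 256 {
        256 // long sequences amortize the larger tile well
    } else if seq_len >= 128 {
        128
    } else {
        64 // minimum tile; short prompts and early decode steps
    }
}
```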
| File | Description |
|------|-------------|
| ROADMAP.md | Development roadmap and milestones |
| CHANGELOG.md | Version history and changes |
| AGENTS.md | Developer guide and conventions |
| `docs/` | Design specs and implementation plans |
```
vllm-lite/
├── crates/
│   ├── traits/           # Interface definitions (ModelBackend trait)
│   ├── core/             # Engine, Scheduler, KV Cache, Metrics
│   │   └── src/
│   │       ├── scheduler/  # Scheduler modules (queue, preemption, eviction, batch)
│   │       └── kv_cache/   # Logical KV cache (block_allocator, prefix_cache)
│   ├── model/            # Model implementations
│   │   └── src/
│   │       ├── kernels/      # GPU kernels (flash_attention, fused_mlp, cuda_graph)
│   │       ├── paged_tensor/ # Physical KV cache (tensor_store, quantization)
│   │       └── components/   # Model components (attention, mlp, norm, positional)
│   ├── dist/             # Tensor Parallel support
│   └── server/           # HTTP API (OpenAI compatible)
```
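The `traits` crate decouples the engine from concrete models via the `ModelBackend` trait. Its real signature lives in `crates/traits`; a hypothetical shape, to illustrate the contract the engine relies on:

```rust
/// Hypothetical sketch of the `ModelBackend` interface (the actual trait
/// in `crates/traits` may differ): the engine hands the backend a batch
/// of token IDs plus each sequence's KV-block assignments, and receives
/// next-token logits back.
pub trait ModelBackend {
    type Error;

    /// Run one forward pass over a batch. `block_tables` maps each
    /// sequence to its physical KV-cache blocks in the paged store.
    fn forward(
        &mut self,
        token_ids: &[Vec<u32>],
        block_tables: &[Vec<usize>],
    ) -> Result<Vec<Vec<f32>>, Self::Error>;

    /// Vocabulary size, needed to interpret the returned logits.
    fn vocab_size(&self) -> usize;
}
```

Keeping the trait in its own crate lets `core` (scheduler, KV cache) and `model` (Qwen3, Llama, Mistral, ...) depend on the interface without depending on each other.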
- Runtime: tokio
- ML Backend: Candle
- HTTP: axum
- Weights: SafeTensors
MIT