Fast LiteLLM

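High-performance Rust acceleration for LiteLLM, covering token counting, routing, rate limiting, and connection management. The project targets 2-20x performance improvements; current measured gains range from roughly 0% to 46% depending on the operation (see Performance Benchmarks).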

Why Fast LiteLLM?

Fast LiteLLM is a drop-in Rust acceleration layer for LiteLLM that provides targeted performance improvements where it matters most:

Modest improvements in already well-optimized operations like token counting
~46% faster rate limiting with async and concurrent primitives
~39% faster connection management with improved pooling
~9% faster batch token counting when processing multiple texts at once
Lock-free data structures for concurrent operations

Built with PyO3 and Rust, it seamlessly integrates with existing LiteLLM code with zero configuration required. Performance gains are most significant in complex operations where Rust's concurrency model provides advantages over Python's.

Installation

pip install fast-litellm

Quick Start

import fast_litellm  # Automatically accelerates LiteLLM
import litellm

# All LiteLLM operations now use Rust acceleration where available
response = litellm.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)

That's it! Just import fast_litellm before litellm and acceleration is automatically applied.

Architecture

The acceleration uses PyO3 to create Python extensions from Rust code:

┌─────────────────────────────────────────────────────────────┐
│ LiteLLM Python Package                                      │
├─────────────────────────────────────────────────────────────┤
│ fast_litellm (Python Integration Layer)                    │
│ ├── Enhanced Monkeypatching                                │
│ ├── Feature Flags & Gradual Rollout                        │
│ ├── Performance Monitoring                                 │
│ └── Automatic Fallback                                     │
├─────────────────────────────────────────────────────────────┤
│ Rust Acceleration Components (PyO3)                        │
│ ├── core               (Advanced Routing)                   │
│ ├── tokens             (Token Counting)                    │
│ ├── connection_pool    (Connection Management)             │
│ └── rate_limiter       (Rate Limiting)                     │
└─────────────────────────────────────────────────────────────┘

Features

Zero Configuration: Works automatically on import
Production Safe: Built-in feature flags, monitoring, and automatic fallback to Python
Performance Monitoring: Real-time metrics and optimization recommendations
Gradual Rollout: Support for canary deployments and percentage-based feature rollout
Thread Safe: Concurrent data structures built on DashMap (a sharded concurrent hash map) for safe multi-threaded access
Type Safe: Full Python type hints and type stubs included

Performance Benchmarks

Component	Baseline	Improvement vs. Baseline	Use Case
Token Counting	Python implementation (already well-optimized)	~0%	Individual token counting
Batch Token Counting	Python implementation	+9%	Processing multiple texts at once
Request Routing	Python implementation	+0.7%	Load balancing, model selection
Rate Limiting	Python implementation	+46%	Request throttling, quota management
Connection Pooling	Python implementation	+39%	HTTP reuse, latency reduction

Note: Our benchmarking revealed that LiteLLM's core token counting is already well-optimized, so performance gains are most significant in complex operations like rate limiting and connection pooling, where Rust's concurrent primitives provide meaningful improvements.
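To make the percentages above easier to interpret, here is a minimal sketch of how such a figure could be computed. It assumes the improvement is the relative reduction in wall-clock time against the Python baseline; the project's own benchmark scripts may define it differently. The names `time_call` and `percent_faster` are illustrative and not part of Fast LiteLLM.

```python
import time
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 5, inner: int = 1000) -> float:
    """Return the best observed per-call time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(inner):
            fn()
        elapsed = (time.perf_counter() - start) / inner
        best = min(best, elapsed)
    return best


def percent_faster(baseline_s: float, optimized_s: float) -> float:
    """Relative time saved, expressed as a percentage of the baseline time."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return (baseline_s - optimized_s) / baseline_s * 100.0


if __name__ == "__main__":
    # Stand-ins for a pure-Python code path and an accelerated one.
    def slow() -> int:
        return sum(range(200))

    def fast() -> int:
        return 199 * 200 // 2

    gain = percent_faster(time_call(slow), time_call(fast))
    print(f"{gain:.1f}% faster")
```

Under this definition, a call that drops from 1.00 ms to 0.54 ms is reported as 46% faster, which is a 1.85x speedup rather than a multiple-fold one.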

Configuration

Fast LiteLLM works out of the box with zero configuration. For advanced use cases, you can configure behavior via environment variables:

# Disable specific features
export FAST_LITELLM_RUST_ROUTING=false

# Gradual rollout (10% of traffic)
export FAST_LITELLM_BATCH_TOKEN_COUNTING=canary:10

# Custom configuration file
export FAST_LITELLM_FEATURE_CONFIG=/path/to/config.json

See the Configuration Guide for all options.
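As an illustration of how a flag value such as `false` or `canary:10` might be interpreted, the sketch below parses the value and decides per request whether to use the accelerated path. This is not Fast LiteLLM's actual implementation: the accepted spellings (`true`, `1`, `on`, and so on), the `FlagState` type, and the hash-based bucketing are all assumptions made for the example.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FlagState:
    enabled: bool
    rollout_percent: int  # 0-100: share of requests sent to the Rust path


def parse_flag(raw: str) -> FlagState:
    """Parse a flag value like 'true', 'false', or 'canary:10' (assumed syntax)."""
    value = raw.strip().lower()
    if value in ("true", "1", "on"):
        return FlagState(True, 100)
    if value in ("false", "0", "off"):
        return FlagState(False, 0)
    if value.startswith("canary:"):
        percent = int(value.split(":", 1)[1])
        if not 0 <= percent <= 100:
            raise ValueError(f"canary percentage out of range: {percent}")
        return FlagState(percent > 0, percent)
    raise ValueError(f"unrecognized flag value: {raw!r}")


def use_rust_path(flag: FlagState, request_id: str) -> bool:
    """Deterministically map a request to a 0-99 bucket and compare to the rollout."""
    if not flag.enabled:
        return False
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < flag.rollout_percent


def flag_from_env(name: str, default: str = "true") -> FlagState:
    """Read a flag from the environment, falling back to `default` if unset."""
    return parse_flag(os.environ.get(name, default))
```

Hashing the request identifier, rather than sampling randomly, keeps a given request on the same code path across retries, which tends to make canary comparisons easier to reason about.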

Requirements

Python 3.8 or higher
LiteLLM

Rust is not required for installation - prebuilt wheels are available for all major platforms.

Development

To contribute or build from source:

Prerequisites:

Python 3.8+
Rust toolchain (1.70+)
maturin for building Python extensions

Setup:

git clone https://github.com/neul-labs/fast-litellm.git
cd fast-litellm

# Install maturin
pip install maturin

# Build and install in development mode
maturin develop

# Run unit tests
pip install pytest pytest-asyncio
pytest tests/

Integration Testing

Fast LiteLLM includes comprehensive integration tests that run LiteLLM's test suite with acceleration enabled:

# Setup LiteLLM for testing
./scripts/setup_litellm.sh

# Run LiteLLM tests with acceleration
./scripts/run_litellm_tests.sh

# Compare performance (with vs without acceleration)
./scripts/compare_performance.py

This helps verify that Fast LiteLLM does not break existing LiteLLM functionality. See the Testing Guide for details.

For more information, see our Contributing Guide.

Documentation

Performance Analysis - Realistic benchmarks and expectations
API Reference
Architecture Guide
Feature Flags
Performance Monitoring

How It Works

When you import fast_litellm, it automatically patches LiteLLM's performance-critical functions with Rust implementations while maintaining full compatibility with the Python API.
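The general pattern of patching a function while keeping an automatic fallback can be sketched as follows. This is a generic illustration, not the code Fast LiteLLM ships: `patch_with_fallback`, the demo module, and both token-counting functions are hypothetical, and the "fast" implementation here is plain Python standing in for a Rust extension.

```python
import functools
import types
from typing import Callable


def patch_with_fallback(module, name: str, fast_impl: Callable) -> Callable:
    """Replace `module.name` with a wrapper that tries `fast_impl` first.

    If the accelerated call raises, the wrapper falls back to the original
    function. Returns the original so the patch can be undone.
    """
    original = getattr(module, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        try:
            return fast_impl(*args, **kwargs)
        except Exception:
            # Any failure in the accelerated path falls back to the original.
            return original(*args, **kwargs)

    setattr(module, name, wrapper)
    return original


if __name__ == "__main__":
    # Hypothetical library module with a single function to accelerate.
    demo = types.ModuleType("demo_lib")
    demo.count_tokens = lambda text: len(text.split())

    def fast_count_tokens(text):
        # Stand-in for a compiled implementation that only accepts str.
        if not isinstance(text, str):
            raise TypeError("expected str")
        return text.count(" ") + 1

    original = patch_with_fallback(demo, "count_tokens", fast_count_tokens)
    print(demo.count_tokens("hello fast world"))   # accelerated path
    print(demo.count_tokens(b"bytes use fallback"))  # falls back to original
    demo.count_tokens = original                    # undo the patch
```

Because callers keep using the same attribute on the same module, no call sites need to change, which is what makes this style of integration a drop-in.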

Note: Performance gains vary significantly by operation. Core token counting shows minimal improvement because LiteLLM is already well-optimized for it. The largest measured gains (roughly 39-46%) come from complex concurrent operations such as rate limiting and connection pooling. See Performance Analysis for detailed benchmarks and realistic expectations.

Contributing

We welcome contributions! Please see our Contributing Guide.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Links

GitHub: https://github.com/neul-labs/fast-litellm
PyPI: https://pypi.org/project/fast-litellm/
Issues: https://github.com/neul-labs/fast-litellm/issues
LiteLLM: https://github.com/BerriAI/litellm
