FastLLM is a Rust workspace for routing LLM requests through a unified
gateway API. It provides typed routing contracts, provider registration,
scheduling, prompt caching, retries, local model metadata, and adapters backed
by autoagents-llm.
Configuration is typed Rust API. FastLLM does not currently expose YAML or TOML configuration loading.
The current workspace is intentionally compact:
| Path | Purpose |
|---|---|
crates/fastllm |
Public SDK facade exposed as fastllm |
crates/core |
Core gateway implementation exposed as fastllm-core |
examples/hello-world |
Minimal OpenAI chat example |
examples/local-model-inference |
Single local GGUF inference with autoagents-llamacpp |
examples/parallel-local-inference |
Two local model routes executed concurrently |
examples/scheduler-showcase |
Scheduler, prompt cache, and telemetry without external services |
examples/memory-management |
Local model memory-budget admission |
Build and check the workspace:
cargo checkRun the OpenAI hello-world example:
OPENAI_API_KEY=sk-... cargo run -p fastllm-example-hello-worldThe example defaults to gpt-4o-mini. Set OPENAI_MODEL to use a different
OpenAI chat model:
OPENAI_API_KEY=sk-... OPENAI_MODEL=gpt-4o-mini cargo run -p fastllm-example-hello-worldRun examples that do not need external services:
cargo run -p fastllm-example-scheduler-showcase
cargo run -p fastllm-example-memory-managementRun the default local GGUF model from Hugging Face
(unsloth/Qwen3.5-9B-GGUF, Qwen3.5-9B-Q4_0.gguf):
cargo run -p fastllm-example-local-model-inferenceOverride with a local GGUF file:
FASTLLM_GGUF_MODEL=/models/model.gguf cargo run -p fastllm-example-local-model-inferenceRun two local routes concurrently. By default both routes use the same Hugging Face model; set local paths to override:
FASTLLM_GGUF_MODEL_A=/models/a.gguf \
FASTLLM_GGUF_MODEL_B=/models/b.gguf \
cargo run -p fastllm-example-parallel-local-inferenceRegister an autoagents-llm provider and send one chat request:
use fastllm::{LlmGateway, LlmMessage, LlmRequest, ModelRoute, ProviderConfig};
let gateway = LlmGateway::new();
gateway.register_provider_config(ProviderConfig::from_env("openai", "gpt-4o-mini"))?;
let response = gateway
.chat(LlmRequest::new(
ModelRoute::new("openai", "gpt-4o-mini"),
vec![LlmMessage::user("What is the capital of France?")],
))
.await?;
println!("{}", response.text);Compose scheduler, cache, retry, and local model policy with Rust structs:
use fastllm::{
CacheConfig, GatewayConfig, LlmGateway, ModelConfig, ModelRoute, RetryConfig,
RuntimeKind, SchedulerConfig,
};
let route = ModelRoute::new("local", "llama-3.2");
let config = GatewayConfig {
scheduler: SchedulerConfig {
max_queue_depth: 2048,
max_concurrent_tasks: 64,
per_route_concurrency: 4,
default_deadline_ms: 120_000,
},
cache: CacheConfig {
enabled: true,
ttl_seconds: 300,
max_entries: 4096,
},
retry: RetryConfig {
max_attempts: 2,
..RetryConfig::default()
},
local_memory_budget_bytes: 24 * 1024 * 1024 * 1024,
..GatewayConfig::default()
}
.with_model(ModelConfig {
route,
runtime: RuntimeKind::Local,
model_path: Some("/models/llama.gguf".to_string()),
memory_bytes: 8 * 1024 * 1024 * 1024,
kv_cache_bytes: 2 * 1024 * 1024 * 1024,
max_parallel_sequences: 4,
ttl_seconds: 600,
..ModelConfig::default()
});
let gateway = LlmGateway::builder().config(config).build();With the local feature enabled, register_llamacpp_model attaches a
lazy-loading local runtime backed by autoagents-llamacpp.
ExecutionSchedulerapplies bounded admission, per-route concurrency, and request deadlines before dispatch.PromptCacheuses canonical request keys, TTL, and entry-limit eviction while excluding provider-only parameters from cache identity.RetryPipelineapplies typed retry and fallback policies around scheduler execution.ModelRegistry,MemoryManager,InferenceSlots, andKvCacheManagertrack local residency, memory pressure, parallel slots, and KV-prefix metadata.Telemetryexposes lightweight counters for cache hits/misses, scheduling, retries, and model load/unload events.
Required environment variables:
| Variable | Description |
|---|---|
OPENAI_API_KEY |
API key used by the OpenAI provider |
OPENAI_MODEL |
Optional model override for the example |
Format and check before sending changes:
cargo fmt --all --check
cargo checkAutoAgents is dual-licensed under:
- MIT License (MIT_LICENSE)
- Apache License 2.0 (APACHE_LICENSE)
You may choose either license for your use case.
Built by the Liquidos AI team and wonderful community of researchers and engineers.
Special thanks to:
- The Rust community for the excellent ecosystem
- LLM providers for enabling high-quality model APIs
- All contributors who help improve AutoAgents
