FastLLM

A Unified LLM Gateway with Local Models in Rust

Like this project? Star us on GitHub

FastLLM is a Rust workspace for routing LLM requests through a unified gateway API. It provides typed routing contracts, provider registration, scheduling, prompt caching, retries, local model metadata, and adapters backed by autoagents-llm.

Configuration is done through a typed Rust API. FastLLM does not currently expose YAML or TOML configuration loading.

The current workspace is intentionally compact:

Path	Purpose
`crates/fastllm`	Public SDK facade exposed as `fastllm`
`crates/core`	Core gateway implementation exposed as `fastllm-core`
`examples/hello-world`	Minimal OpenAI chat example
`examples/local-model-inference`	Single local GGUF inference with `autoagents-llamacpp`
`examples/parallel-local-inference`	Two local model routes executed concurrently
`examples/scheduler-showcase`	Scheduler, prompt cache, and telemetry without external services
`examples/memory-management`	Local model memory-budget admission

Quick Start

Build and check the workspace:

cargo check

Run the OpenAI hello-world example:

OPENAI_API_KEY=sk-... cargo run -p fastllm-example-hello-world

The example defaults to gpt-4o-mini. Set OPENAI_MODEL to use a different OpenAI chat model:

OPENAI_API_KEY=sk-... OPENAI_MODEL=gpt-4o-mini cargo run -p fastllm-example-hello-world

Run examples that do not need external services:

cargo run -p fastllm-example-scheduler-showcase
cargo run -p fastllm-example-memory-management

Run the default local GGUF model from Hugging Face (unsloth/Qwen3.5-9B-GGUF, Qwen3.5-9B-Q4_0.gguf):

cargo run -p fastllm-example-local-model-inference

Override with a local GGUF file:

FASTLLM_GGUF_MODEL=/models/model.gguf cargo run -p fastllm-example-local-model-inference

Run two local routes concurrently. By default both routes use the same Hugging Face model; set local paths to override:

FASTLLM_GGUF_MODEL_A=/models/a.gguf \
FASTLLM_GGUF_MODEL_B=/models/b.gguf \
cargo run -p fastllm-example-parallel-local-inference

Cloud Provider Example

Register an autoagents-llm provider and send one chat request:

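use fastllm::{LlmGateway, LlmMessage, LlmRequest, ModelRoute, ProviderConfig};

// This snippet assumes an enclosing async function that returns a Result,
// so that `?` and `.await` are valid.
let gateway = LlmGateway::new();
// Build the provider config from the environment (OPENAI_API_KEY for OpenAI).
gateway.register_provider_config(ProviderConfig::from_env("openai", "gpt-4o-mini"))?;

// Send one chat request to the registered provider and model.
let response = gateway
    .chat(LlmRequest::new(
        ModelRoute::new("openai", "gpt-4o-mini"),
        vec![LlmMessage::user("What is the capital of France?")],
    ))
    .await?;
println!("{}", response.text);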

Typed Configuration

Compose scheduler, cache, retry, and local model policy with Rust structs:

use fastllm::{
    CacheConfig, GatewayConfig, LlmGateway, ModelConfig, ModelRoute, RetryConfig,
    RuntimeKind, SchedulerConfig,
};

let route = ModelRoute::new("local", "llama-3.2");
let config = GatewayConfig {
    scheduler: SchedulerConfig {
        max_queue_depth: 2048,
        max_concurrent_tasks: 64,
        per_route_concurrency: 4,
        default_deadline_ms: 120_000,
    },
    cache: CacheConfig {
        enabled: true,
        ttl_seconds: 300,
        max_entries: 4096,
    },
    retry: RetryConfig {
        max_attempts: 2,
        ..RetryConfig::default()
    },
    local_memory_budget_bytes: 24 * 1024 * 1024 * 1024,
    ..GatewayConfig::default()
}
.with_model(ModelConfig {
    route,
    runtime: RuntimeKind::Local,
    model_path: Some("/models/llama.gguf".to_string()),
    memory_bytes: 8 * 1024 * 1024 * 1024,
    kv_cache_bytes: 2 * 1024 * 1024 * 1024,
    max_parallel_sequences: 4,
    ttl_seconds: 600,
    ..ModelConfig::default()
});

let gateway = LlmGateway::builder().config(config).build();

With the local feature enabled, register_llamacpp_model attaches a lazy-loading local runtime backed by autoagents-llamacpp.
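
The memory figures in the configuration above hint at how a budget check could behave. The sketch below is an assumption for illustration, not FastLLM's MemoryManager: it treats a model's footprint as memory_bytes plus kv_cache_bytes and admits models until the budget would be exceeded. Footprint, Budget, and admit are hypothetical names that do not come from the fastllm crate.

```rust
/// One gibibyte, used to keep the arithmetic readable.
pub const GIB: u64 = 1024 * 1024 * 1024;

/// Hypothetical footprint of one local model: weights plus KV cache.
#[derive(Debug, Clone, Copy)]
pub struct Footprint {
    pub memory_bytes: u64,
    pub kv_cache_bytes: u64,
}

impl Footprint {
    /// Assumed total: the sum of both fields (an assumption, not the crate's rule).
    pub fn total(&self) -> u64 {
        self.memory_bytes.saturating_add(self.kv_cache_bytes)
    }
}

/// Tracks reserved bytes against a fixed limit (illustrative only).
#[derive(Debug)]
pub struct Budget {
    limit: u64,
    reserved: u64,
}

impl Budget {
    pub fn new(limit: u64) -> Self {
        Self { limit, reserved: 0 }
    }

    /// Bytes still available under the limit.
    pub fn free(&self) -> u64 {
        self.limit - self.reserved
    }

    /// Reserve space for a model, or return how many bytes are missing.
    pub fn admit(&mut self, model: Footprint) -> Result<(), u64> {
        let needed = model.total();
        let free = self.free();
        if needed > free {
            return Err(needed - free);
        }
        self.reserved += needed;
        Ok(())
    }
}
```

Under these assumptions, a 24 GiB budget and a 10 GiB footprint (8 GiB of weights plus 2 GiB of KV cache) admit two models; a third is rejected 6 GiB short.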

Runtime Features

ExecutionScheduler applies bounded admission, per-route concurrency, and request deadlines before dispatch.
PromptCache uses canonical request keys, TTL, and entry-limit eviction while excluding provider-only parameters from cache identity.
RetryPipeline applies typed retry and fallback policies around scheduler execution.
ModelRegistry, MemoryManager, InferenceSlots, and KvCacheManager track local residency, memory pressure, parallel slots, and KV-prefix metadata.
Telemetry exposes lightweight counters for cache hits/misses, scheduling, retries, and model load/unload events.
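
The PromptCache behavior described above (canonical keys, TTL, entry-limit eviction, provider-only parameters left out of the key) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the crate's implementation: Request, provider_options, cache_key, and Cache are hypothetical names, and oldest-first eviction is one possible policy among several.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, VecDeque};
use std::hash::{Hash, Hasher};
use std::time::{Duration, Instant};

/// Hypothetical request shape; `provider_options` stands in for
/// provider-only parameters that should not affect cache identity.
pub struct Request {
    pub provider: String,
    pub model: String,
    pub messages: Vec<(String, String)>, // (role, content)
    pub provider_options: Vec<(String, String)>,
}

/// Hash the route and messages, deliberately skipping `provider_options`.
pub fn cache_key(req: &Request) -> u64 {
    let mut hasher = DefaultHasher::new();
    req.provider.hash(&mut hasher);
    req.model.hash(&mut hasher);
    req.messages.hash(&mut hasher);
    hasher.finish()
}

/// TTL cache that evicts the oldest inserted entry once it is over capacity.
pub struct Cache {
    ttl: Duration,
    max_entries: usize,
    entries: HashMap<u64, (Instant, String)>,
    order: VecDeque<u64>, // insertion order, oldest first
}

impl Cache {
    pub fn new(ttl: Duration, max_entries: usize) -> Self {
        Self { ttl, max_entries, entries: HashMap::new(), order: VecDeque::new() }
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    /// Return a cached response if present and younger than the TTL;
    /// an expired entry is dropped on lookup.
    pub fn get(&mut self, key: u64, now: Instant) -> Option<String> {
        let (stored, text) = self.entries.get(&key)?;
        if now.saturating_duration_since(*stored) < self.ttl {
            return Some(text.clone());
        }
        self.entries.remove(&key);
        self.order.retain(|k| *k != key);
        None
    }

    /// Store a response, then evict oldest entries while over the limit.
    pub fn insert(&mut self, key: u64, text: String, now: Instant) {
        if self.entries.insert(key, (now, text)).is_none() {
            self.order.push_back(key);
        }
        while self.entries.len() > self.max_entries {
            match self.order.pop_front() {
                Some(oldest) => {
                    self.entries.remove(&oldest);
                }
                None => break,
            }
        }
    }
}
```

Passing the current time into get and insert keeps the sketch deterministic to test. Two requests that differ only in provider_options produce the same key here, so they would share one cache entry.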

Environment variables used by the OpenAI example:

Variable	Description
`OPENAI_API_KEY`	API key used by the OpenAI provider
`OPENAI_MODEL`	Optional model override for the example

Development

Format and check before sending changes:

cargo fmt --all --check
cargo check

License

FastLLM is dual-licensed under:

MIT License (MIT_LICENSE)
Apache License 2.0 (APACHE_LICENSE)

You may choose either license for your use case.

Acknowledgments

Built by the Liquidos AI team and a wonderful community of researchers and engineers.

Special thanks to:

The Rust community for the excellent ecosystem
LLM providers for enabling high-quality model APIs
All contributors who help improve FastLLM
