This is a python package focused on systems performance: quantized weights, KV cache reuse, dynamic batching, token streaming, and rigorous benchmarking across backends.
llminfer is for engineers who want to:
- run and compare multiple local inference backends with one API,
- tune latency/throughput trade-offs (batching, quantization, compile),
- benchmark systematically and export reproducible artifacts,
- expose models behind an OpenAI-compatible API for internal tools.
It is not intended to replace full distributed serving stacks; it is a compact, hackable inference/benchmarking layer you can use in experiments, Colab, and single-node production prototypes.
# Base (eager + compiled)
pip install -e .
# Optional extras
pip install -e ".[vllm]" # vLLM backend
pip install -e ".[serve]" # FastAPI + uvicorn + metrics
pip install -e ".[peft]" # LoRA adapter hot-swapNiccolò Forzano --- github.com/nickforce989 · nic.forz@gmail.com
- Start with
examples/llminfer_all_examples_colab.ipynbfor a complete guided tour. - Use
examples/llminfer_colab.ipynbfor a fast smoke test. - Use
examples/llminfer_advanced_colab.ipynbfor compile/speculative/TP tuning. - Use
examples/llminfer_serving_colab.ipynbfor OpenAI-compatible API + SSE. - Use
examples/llminfer_readme_colab.ipynbas a concise README companion.
┌──────────────────────────────────────────────────────────────┐
│ InferenceEngine │
│ .run() .run_batch() .stream() .cache_stats() │
└──────────────┬───────────────────────────────────────────────┘
│ selects backend
┌──────────▼──────────┬──────────────────┬────────────────┐
│ EagerBackend │ CompiledBackend │ VLLMBackend │
│ (HF transformers) │ (torch.compile) │ (PagedAttn.) │
└──────────┬──────────┴──────────────────┴────────────────┘
│ uses
┌──────────▼──────────┐ ┌─────────────────┐
│ KVCacheManager │ │ BatchQueue │
│ prefix cache (LRU) │ │ dynamic batch │
└─────────────────────┘ └─────────────────┘
| Feature | Details |
|---|---|
| Quantization | 4-bit NF4/FP4 (QLoRA), INT8 via bitsandbytes |
| KV cache | Per-sequence cache + prefix cache; optional paged KV representation (page_size_tokens) |
| Batching | Static batch API + continuous batching scheduler with configurable timeout |
| Streaming | Token-by-token via HF TextIteratorStreamer |
| Benchmarking | Throughput, latency p50/p95/p99, TTFT, GPU memory |
| Backends | PyTorch eager, torch.compile, vLLM (optional) |
| Parallelism | Tensor/pipeline parallel knobs (--tp-size, --pp-size) |
| CUDA graphs | Compile backend exposes cudagraph toggles (compile_cudagraphs, step-marking) |
| HF loading | Revision/token/local-files-only/cache-dir/trust-remote-code controls |
| Serving API | OpenAI-compatible /v1/completions + /v1/chat/completions + SSE |
| Artifacts | Export benchmark/comparison data to JSON + CSV with environment metadata |
| Advanced decode | Stop sequences, no-repeat n-gram, bad/force words, optional speculative decoding |
| Adapters | LoRA hot-swap via PEFT (load_adapter, set_adapter, unload_adapter) |
| CLI | llminfer run / stream / bench / compare / serve / info |
from llminfer import InferenceEngine, EngineConfig
from llminfer.config import Backend, QuantConfig, QuantMode
# ── Eager inference, no quantization ──────────────────────────
engine = InferenceEngine(EngineConfig(model_name="facebook/opt-1.3b"))
engine.load()
result = engine.run("Explain attention mechanisms.")
print(result.generated_text)
print(f"Latency: {result.stats.total_latency_ms:.0f}ms | {result.stats.throughput_tokens_per_sec:.1f} tok/s")
# ── 4-bit quantization (uses ~25% of float16 VRAM) ───────────
cfg = EngineConfig(
model_name="meta-llama/Llama-2-7b-hf",
quant=QuantConfig(mode=QuantMode.NF4, double_quant=True),
)
engine_4bit = InferenceEngine(cfg)
engine_4bit.load()
# ── Streaming ─────────────────────────────────────────────────
for chunk in engine.stream("Tell me a story"):
if not chunk.is_final:
print(chunk.token, end="", flush=True)
# ── Batch inference ───────────────────────────────────────────
results = engine.run_batch(["What is RLHF?", "Explain LoRA."])
# ── Prefix KV cache reuse ─────────────────────────────────────
from llminfer.kv_cache import KVCacheManager
sys_prompt = "You are a helpful AI assistant."
key = KVCacheManager.hash_prefix(sys_prompt)
r1 = engine.run(sys_prompt + "\n\nUser: What is BERT?", prefix_key=key) # miss
r2 = engine.run(sys_prompt + "\n\nUser: What is GPT-2?", prefix_key=key) # hit ✓
print(f"Cache hit: {r2.stats.cache_hit}")You can load a specific Hugging Face revision, use private/gated models with a token, force local-only loading, and pin the cache directory.
cfg = EngineConfig(
model_name="meta-llama/Llama-3.2-1B-Instruct",
hf_revision="main", # branch/tag/commit
hf_token=None, # or "hf_xxx" for gated models
hf_local_files_only=False, # True = do not hit network
hf_cache_dir="./hf-cache",
hf_trust_remote_code=True,
)
engine = InferenceEngine(cfg).load()CLI:
llminfer run "Explain tensor parallelism." \
--model meta-llama/Llama-3.2-1B-Instruct \
--revision main \
--cache-dir ./hf-cache \
--hf-token "$HF_TOKEN"For serving-like traffic, you can:
- represent cached KV tensors in fixed-size pages (
enable_paged_kv), - process requests with the continuous-batching scheduler.
cfg = EngineConfig(
model_name="Qwen/Qwen2.5-1.5B-Instruct",
max_batch_size=16,
batch_timeout_ms=20,
)
cfg.cache.enable_paged_kv = True
cfg.cache.page_size_tokens = 16
engine = InferenceEngine(cfg).load()
results = engine.run_batch_continuous(
["Give me 3 GPU facts.", "Explain KV cache briefly."],
max_new_tokens=96,
temperature=0.2,
)
for r in results:
print(r.generated_text)CLI:
llminfer bench \
--model facebook/opt-125m \
--batch-sizes 1,2,4,8 \
--runs 5 \
--continuous \
--paged-kv \
--page-size-tokens 16You can pass additional controls per request to reduce repetition, enforce style, or improve reproducibility.
result = engine.run(
"Explain GPU memory hierarchy in 5 bullets.",
max_new_tokens=160,
temperature=0.2,
top_p=0.9,
top_k=50,
repetition_penalty=1.05,
no_repeat_ngram_size=3,
stop_sequences=["\n\nReferences:"],
bad_words=["I'm not sure", "joking"],
force_words=["HBM", "Tensor Core"],
seed=42,
)
print(result.generated_text)CLI equivalents:
llminfer run "Explain GPU memory hierarchy in 5 bullets." \
--model Qwen/Qwen2.5-1.5B-Instruct \
--temp 0.2 \
--top-p 0.9 \
--no-repeat-ngram-size 3 \
--bad-words "I'm not sure,joking" \
--force-words "HBM,Tensor Core" \
--stop "References:" \
--seed 42llminfer can pass a smaller assistant model into Hugging Face generation
(assistant_model) to accelerate decoding on some workloads.
cfg = EngineConfig(
model_name="Qwen/Qwen2.5-1.5B-Instruct",
assistant_model_name="Qwen/Qwen2.5-0.5B-Instruct",
)
engine = InferenceEngine(cfg).load()
print(engine.run("What is speculative decoding?", max_new_tokens=96, temperature=0.2).generated_text)Advanced speculative knobs:
cfg = EngineConfig(
model_name="Qwen/Qwen2.5-1.5B-Instruct",
assistant_model_name="Qwen/Qwen2.5-0.5B-Instruct",
speculative_num_assistant_tokens=8,
speculative_confidence_threshold=0.4,
)CLI:
llminfer run "What is speculative decoding?" \
--model Qwen/Qwen2.5-1.5B-Instruct \
--assistant-model Qwen/Qwen2.5-0.5B-Instruct \
--spec-num-assistant-tokens 8 \
--spec-confidence-threshold 0.4 \
--temp 0.2Use adapters without reloading the base model.
engine = InferenceEngine(EngineConfig(model_name="Qwen/Qwen2.5-1.5B-Instruct")).load()
engine.load_adapter("your-adapter-path-or-hf-repo", adapter_name="domain_a")
print(engine.list_adapters())
engine.set_adapter("domain_a")
print(engine.run("Summarize LoRA adapters in one paragraph.", max_new_tokens=96).generated_text)
engine.unload_adapter("domain_a")CLI:
llminfer run "Summarize LoRA adapters." \
--model Qwen/Qwen2.5-1.5B-Instruct \
--adapter your-adapter-path-or-hf-repo \
--adapter-name domain_afrom llminfer import Benchmarker
bm = Benchmarker(engine)
result = bm.run(batch_sizes=[1, 2, 4, 8, 16], num_runs=20, max_new_tokens=128)
result.print_summary()
result.plot("bench.png")Output:
┌─────────────────────────────────────────────────────────────────┐
│ Benchmark: facebook/opt-125m [eager / none] │
├────────────┬──────────────┬──────────────┬───────────┬─────────┤
│ Batch Size │ Latency p50 │ Latency p95 │ Tok/s │ GPU MB │
├────────────┼──────────────┼──────────────┼───────────┼─────────┤
│ 1 │ 412 ms │ 428 ms │ 311.2 │ 486 │
│ 2 │ 489 ms │ 503 ms │ 524.6 │ 487 │
│ 4 │ 701 ms │ 718 ms │ 731.8 │ 490 │
│ 8 │ 1121 ms │ 1145 ms │ 915.4 │ 497 │
└────────────┴──────────────┴──────────────┴───────────┴─────────┘
Benchmarks and backend comparisons can be exported as machine-readable artifacts with environment metadata (Python, PyTorch, CUDA, GPU, transformers versions).
Single backend benchmark:
llminfer bench \
--model facebook/opt-125m \
--batch-sizes 1,2,4 \
--runs 5 \
--artifacts-dir benchmarks/outOutputs:
benchmarks/out/benchmark.jsonbenchmarks/out/benchmark.csv
Comparison benchmark:
llminfer compare \
--model facebook/opt-125m \
--backends eager,compiled,vllm \
--batch-sizes 1,2,4 \
--runs 5 \
--artifacts-dir benchmarks/compare_outOutputs:
benchmarks/compare_out/comparison.jsonbenchmarks/compare_out/comparison.csv
You can also write explicit paths with --json-out and --csv-out.
Dedicated eager-vs-vLLM script:
python examples/benchmark_vs_vllm.py \
--model facebook/opt-125m \
--batch-sizes 1,2,4,8 \
--runs 5 \
--continuous \
--outdir benchmarks/vllm_compareAdditional plotting outputs:
llminfer bench --model facebook/opt-125m --plot-suite-dir plots/bench
llminfer compare --model facebook/opt-125m --backends eager,compiled --plot-suite-dir plots/compareEach plot suite writes:
<prefix>_dashboard.png(combined 2x2 view)<prefix>_throughput.png<prefix>_latency.png<prefix>_ttft.png(if TTFT is available)<prefix>_memory.png
Tensor/pipeline parallel knobs can be applied during compare runs, especially for vLLM:
llminfer compare \
--model meta-llama/Llama-3.2-1B-Instruct \
--backends eager,vllm \
--tp-size 2 \
--pp-size 1 \
--compile-cudagraphs \
--batch-sizes 1,2,4,8 \
--runs 5from llminfer.benchmark import BackendComparison
from llminfer.config import Backend
cmp = BackendComparison(
model_name="facebook/opt-1.3b",
backends=[Backend.EAGER, Backend.COMPILED, Backend.VLLM],
)
results = cmp.run(batch_sizes=[1, 4, 8, 16])
cmp.print_table(results)
cmp.plot(results, "comparison.png")Typical speedups on A100 (Llama-7B, bs=8, 128 tok output):
| Backend | Throughput | vs Eager |
|---|---|---|
| Eager | 480 tok/s | 1.0× |
| torch.compile (reduce-overhead) | 710 tok/s | 1.48× |
| vLLM (PagedAttention) | 1240 tok/s | 2.58× |
Compiled runtime safety:
CompiledBackendmarks CUDAGraph step boundaries before each invocation.- If a compile/runtime-specific failure occurs (e.g. CUDAGraph overwrite errors),
it can automatically fall back to eager execution (
compile_fallback_to_eager=True).
# Run inference
llminfer run "Tell me about GPUs" --model facebook/opt-125m
# Streaming
llminfer stream "Explain backpropagation" --model facebook/opt-1.3b --quant nf4
# Benchmark
llminfer bench --model facebook/opt-125m --batch-sizes 1,2,4,8 --runs 10 --plot bench.png
# Benchmark + continuous batching + paged KV
llminfer bench \
--model facebook/opt-125m \
--batch-sizes 1,2,4,8 \
--runs 10 \
--continuous \
--paged-kv \
--page-size-tokens 16 \
--plot-suite-dir plots/bench
# Compare backends
llminfer compare --model facebook/opt-125m --backends eager,compiled,vllm --plot compare.png
# Compare backends + full plot suite
llminfer compare --model facebook/opt-125m --backends eager,vllm --continuous --plot-suite-dir plots/compare
# Run OpenAI-compatible server
llminfer serve --model Qwen/Qwen2.5-1.5B-Instruct --host 0.0.0.0 --port 8000
# Info
llminfer info --model facebook/opt-125m --backend compiledllminfer serve launches a FastAPI server with an internal continuous-batching
scheduler (max_batch_size, batch_timeout_ms, max_queue_size).
Endpoints:
POST /v1/completionsPOST /v1/chat/completionsGET /v1/modelsGET /healthzGET /metrics(Prometheus format, ifprometheus-clientinstalled)
Start server:
llminfer serve \
--model Qwen/Qwen2.5-1.5B-Instruct \
--backend eager \
--host 0.0.0.0 \
--port 8000 \
--max-batch-size 16 \
--batch-timeout-ms 20 \
--max-queue-size 1024Non-streaming chat request:
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local-llminfer",
"messages": [
{"role":"system","content":"You are concise."},
{"role":"user","content":"Tell me about GPUs in 3 bullets."}
],
"max_tokens": 128,
"temperature": 0.2
}'Non-streaming semantics:
- one JSON response (
object: "chat.completion") - final model output is in
choices[0].message.content - token counts are in
usage
Python example (non-streaming):
import requests
payload = {
"model": "local-llminfer",
"messages": [
{"role": "system", "content": "You are concise."},
{"role": "user", "content": "Tell me about GPUs in 4 bullet points."},
],
"max_tokens": 160,
"temperature": 0.2,
}
resp = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload, timeout=180)
resp.raise_for_status()
data = resp.json()
print(data["choices"][0]["message"]["content"])
print(data["usage"])Streaming chat request (SSE):
curl -N http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model":"local-llminfer",
"messages":[{"role":"user","content":"Explain KV cache in under 80 words."}],
"stream": true
}'Streaming semantics (SSE):
- chunks arrive as Server-Sent Event lines:
data: {...} - text increments are in
choices[0].delta.content - stream terminates with
data: [DONE]
Python example (SSE, robust parser):
import json
import requests
stream_payload = {
"model": "local-llminfer",
"messages": [{"role": "user", "content": "Explain KV cache in under 80 words."}],
"max_tokens": 120,
"temperature": 0.2,
"stream": True,
}
decoder = json.JSONDecoder()
done = False
with requests.post(
"http://127.0.0.1:8000/v1/chat/completions",
json=stream_payload,
stream=True,
timeout=180,
) as r:
r.raise_for_status()
for raw_line in r.iter_lines(decode_unicode=True):
if not raw_line:
continue
line = raw_line.strip()
if "data:" not in line:
continue
# Some transports can coalesce multiple data frames in one line.
payloads = [p.strip() for p in line.split("data:") if p.strip()]
for payload in payloads:
if payload == "[DONE]":
done = True
break
# Ignore heartbeat/newline fragments.
if payload in {"\\n", "\\n\\n", "\"\\n\"", "\"\\n\\n\""}:
continue
s = payload
while s:
s = s.lstrip()
if not s:
break
try:
obj, idx = decoder.raw_decode(s)
except json.JSONDecodeError:
break
s = s[idx:]
delta = obj.get("choices", [{}])[0].get("delta", {}).get("content", "")
if delta:
print(delta, end="", flush=True)
if done:
print("\\n[done]")
breakThis parser pattern is also used in:
examples/llminfer_serving_colab.ipynbexamples/openai_api_client.py
Health and metrics:
curl http://127.0.0.1:8000/healthz
curl http://127.0.0.1:8000/metricsllminfer/
├── __init__.py Public API surface
├── config.py EngineConfig, QuantConfig, CacheConfig
├── engine.py InferenceEngine (main entrypoint)
├── request.py GenerationRequest, GenerationResult, StreamChunk
├── kv_cache.py KVCacheManager with prefix cache + optional paged KV layout
├── batching.py Dynamic batch assembly (sync + async)
├── streaming.py TokenStreamer wrapping HF TextIteratorStreamer
├── benchmark.py Benchmarker + BackendComparison + artifact export
├── serving.py ContinuousBatchScheduler + chat prompt helpers
├── api.py OpenAI-compatible FastAPI app + SSE streaming
├── cli.py Typer CLI (run / stream / bench / compare / serve / info)
└── backends/
├── base.py Abstract BaseBackend
├── eager.py HF generation + quantization + constraints + adapters
├── compiled.py torch.compile backend + runtime fallback to eager
└── vllm_backend.py vLLM async backend + true token streaming
examples/ ├── full_demo.py Original end-to-end walkthrough ├── advanced_features.py Constraints, speculative setup, artifacts, adapters ├── benchmark_vs_vllm.py Focused eager-vs-vLLM benchmark/export script ├── run_server.py Launch OpenAI-compatible server ├── openai_api_client.py Non-stream + stream API client examples ├── llminfer_all_examples_colab.ipynb Master Colab that covers all example paths ├── llminfer_colab.ipynb Quickstart notebook ├── llminfer_readme_colab.ipynb Concise README companion notebook ├── llminfer_advanced_colab.ipynb Advanced tuning notebook └── llminfer_serving_colab.ipynb Serving/API notebook
-
examples/full_demo.py- Original end-to-end package tour (eager, quantization, streaming, batch, prefix cache, benchmark, backend comparison).
-
examples/advanced_features.py- Demonstrates advanced decoding controls.
- Shows optional speculative-decoding setup with
assistant_model_name. - Shows optional PEFT adapter workflow.
- Exports benchmark artifacts (
benchmark.json,benchmark.csv,benchmark.png).
-
examples/benchmark_vs_vllm.py- Compares eager and vLLM backends with the same model/prompts.
- Supports HF loading flags, continuous batching, and paged-KV toggles.
- Exports JSON/CSV plus dashboard and per-metric plots.
-
examples/run_server.py- Starts OpenAI-compatible API with
ContinuousBatchScheduler. - Intended as a minimal serving bootstrap script.
- Starts OpenAI-compatible API with
-
examples/openai_api_client.py- Calls
/v1/chat/completionsin non-streaming mode. - Parses SSE streaming chunks for streaming mode.
- Calls
The notebooks are intentionally de-duplicated so each has a distinct job:
-
examples/llminfer_all_examples_colab.ipynb- End-to-end master notebook covering all example scripts/workflows.
- Includes clear section headers, explicit prints, artifact exports, and plot displays.
-
examples/llminfer_colab.ipynb- Minimal quickstart: install, run, stream, static vs continuous batching, benchmark plots.
-
examples/llminfer_readme_colab.ipynb- Concise runnable companion to README sections.
- Links out to specialized notebooks instead of duplicating all steps.
-
examples/llminfer_advanced_colab.ipynb- Advanced controls: HF loading flags, speculative knobs, compile/cudagraph toggles, continuous batching, artifact suite, optional vLLM compare.
- Includes vLLM install/import checks and guidance for torch/vLLM ABI mismatch in Colab.
-
examples/llminfer_serving_colab.ipynb- OpenAI-compatible server workflow with robust startup checks.
- Non-streaming + robust SSE streaming parser, concurrency smoke test, metrics checks.
Notebook quality conventions:
- clean cells (no stale output blobs committed),
- explicit print summaries for each section,
- saved plots rendered inline from generated artifacts,
- optional heavy sections (
RUN_VLLM, etc.) gated with clear flags.
vLLM note for Colab:
- If you see an import error with
undefined symbol ... _C.abi3.so, your runtime has a torch/vLLM binary mismatch. Restart the runtime, rerun install cells, and use the notebook's vLLM section (it force-refreshes vLLM and validates import).
Why KV cache at the application level?
HuggingFace manages KV cache internally per generate() call. Our
KVCacheManager adds prefix-level caching across calls — identical
system prompts don't re-compute their KV blocks on the second request.
It can also represent cached KV in fixed-size pages to reduce memory
fragmentation pressure in long-running processes.
Why dynamic batching?
Static batching wastes GPU time waiting for the slowest sequence.
BatchQueue collects requests for up to batch_timeout_ms before
dispatching, trading a small fixed latency for much higher throughput.
Why compare all three backends?
- Eager: baseline, debuggable
- torch.compile: ~1.5× speedup with zero infra change
- vLLM: the production gold standard — PagedAttention eliminates KV cache fragmentation and enables continuous batching for ~2.5× throughput
- Python ≥ 3.9
- PyTorch ≥ 2.1 (for
torch.compilesupport) - CUDA GPU (CPU inference works but is slow for large models)
bitsandbytesfor quantization (Linux/Windows with CUDA)vllmfor the VLLM backend (Linux + CUDA only)fastapi+uvicorn+prometheus-clientforllminfer servepeftfor LoRA adapter hot-swap
This project is licensed under the GNU General Public License v3.0 or later (GPL-3.0-or-later). See the LICENSE file.