LLM-D v0.8.0 Component Summary
Themes: solidify CI coverage & project operations, expand accelerator coverage, graduate multimodal/batch/flow-control to production, introduce initial RL support.

| Component | Version | Previous Version | Type | Notes |
| --- | --- | --- | --- | --- |
| llm-d/llm-d-router-endpoint-picker | v0.9.0 | v0.8.0 | Image + Helm Chart | Core EPP image (renamed from llm-d-inference-scheduler) |
| llm-d/llm-d-router-disagg-sidecar | v0.9.0 | v0.8.0 | Image | P/D routing sidecar (renamed from llm-d-routing-sidecar) |
| llm-d/llm-d-uds-tokenizer | vllm-v0.23.0 | vllm-v0.19.1 | Image | Tokenizer sidecar aligned with vLLM version |
| llm-d/llm-d-kv-cache | v0.9.0 | v0.8.0 | Library | HMA support, storage events, multi-tier offloading |
| llm-d/llm-d-inference-sim | v0.9.2 | v0.8.2 | Image | Multimodal support, Mooncake bootstrap, configurable latency |
| llm-d/llm-d-cuda | v0.8.0 | v0.7.0 | Image | vLLM v0.23.0, CUDA 13.0.2 |
| llm-d/llm-d-aws (EFA) | v0.8.0 | v0.7.0 | Image | |
| llm-d/llm-d-hpu | v0.8.0 | v0.7.0 | Image | |
| llm-d/llm-d-kv-cache/llmd_fs_backend_connector | v0.23 | v0.19.1 | Wheel installed in llm-d | Migrated to vLLM 0.23.0 offload API |
| llm-d/llm-d-benchmark | v0.7.0 | v0.6.8.1 | Image | Benchmark workload launcher |
| llm-d/llm-d-workload-variant-autoscaler | v0.8.0 | v0.7.0 | Helm Chart + Image | CRD migration to llm-d.ai API group, improved observability |
| vllm-project/vllm | v0.23.0 | v0.19.1 | Wheel installed in llm-d | Confirmed by @Gregory-Pereira and @tessapham |
| kubernetes-sigs/gateway-api-inference-extension | v1.5.0 | v1.5.0 | Helm Chart | Charts now published from llm-d-router OCI registry |

Upstream vLLM Images (replacing llm-d-built images)
Per PR #1791, the following platforms now use upstream vLLM images directly instead of llm-d-built custom images:

| Platform | New Image | Tag | Previous llm-d Image |
| --- | --- | --- | --- |
| GPU (CUDA) | vllm/vllm-openai | v0.23.0 | ghcr.io/llm-d/llm-d-cuda (still available for advanced builds) |
| ROCm (AMD) | vllm/vllm-openai-rocm | | ghcr.io/llm-d/llm-d-rocm |
| CPU | ghcr.io/llm-d/llm-d-cpu | v0.7.0 | Same image, version bump |
| XPU (Intel) | vllm/vllm-openai | v0.23.0 | ghcr.io/llm-d/llm-d-xpu |

Infrastructure Changes

| Component | Version | Previous Version | Notes |
| --- | --- | --- | --- |
| Gateway API | v1.5.1 | v1.5.1 | No change |
| Istio | 1.29.4 | 1.29.1 | Patch update |
| kgateway (agentgateway) | v2.3.3 | v2.2.1 | |

Deprecated / Removed Components

| Component | Status | Replaced By |
| --- | --- | --- |
| llm-d/llm-d-inference-scheduler | Renamed | ghcr.io/llm-d/llm-d-router-endpoint-picker |
| llm-d/llm-d-routing-sidecar | Renamed | ghcr.io/llm-d/llm-d-router-disagg-sidecar |
| llm-d/llm-d-cuda (debug) | Removed | N/A |
| llm-d/llm-d-cuda-gb200 | Removed | N/A |
| llm-d/llm-d-xpu | Removed | vllm/vllm-openai |
| llm-d/llm-d-rocm | Removed | vllm/vllm-openai-rocm |

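
When upgrading, it can help to scan existing manifests for image references that this release renames or removes. The following is a minimal sketch, not project tooling: the mapping is copied from the table above, while the helper names (`repository`, `find_deprecated`) and the `example-tag` tags are hypothetical. It assumes old images were referenced under `ghcr.io/llm-d/...`, and it omits the `llm-d-cuda` debug variant because the table does not say how that variant was named or tagged.

```python
from typing import Iterable, List, Optional, Tuple

# Renamed/removed repositories from the "Deprecated / Removed Components" table.
# None means the table lists no replacement (N/A).
REPLACEMENTS = {
    "llm-d/llm-d-inference-scheduler": "ghcr.io/llm-d/llm-d-router-endpoint-picker",
    "llm-d/llm-d-routing-sidecar": "ghcr.io/llm-d/llm-d-router-disagg-sidecar",
    "llm-d/llm-d-cuda-gb200": None,
    "llm-d/llm-d-xpu": "vllm/vllm-openai",
    "llm-d/llm-d-rocm": "vllm/vllm-openai-rocm",
}


def repository(image_ref: str) -> str:
    """Return the repository part of an image reference.

    Drops a trailing @digest or :tag and a leading ghcr.io/ prefix
    (assumed registry for the old llm-d images).
    """
    ref = image_ref.split("@", 1)[0]
    # Only a colon after the last slash is a tag separator;
    # an earlier colon would belong to a registry port.
    if ref.rfind(":") > ref.rfind("/"):
        ref = ref[: ref.rfind(":")]
    if ref.startswith("ghcr.io/"):
        ref = ref[len("ghcr.io/"):]
    return ref


def find_deprecated(image_refs: Iterable[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each deprecated image reference with its replacement (or None)."""
    findings = []
    for ref in image_refs:
        repo = repository(ref)
        if repo in REPLACEMENTS:
            findings.append((ref, REPLACEMENTS[repo]))
    return findings


if __name__ == "__main__":
    images = [
        "ghcr.io/llm-d/llm-d-rocm:example-tag",
        "ghcr.io/llm-d/llm-d-cuda-gb200:example-tag",
        "vllm/vllm-openai:v0.23.0",
    ]
    for old, new in find_deprecated(images):
        print(f"{old} -> {new or 'no replacement listed'}")
```

The replacement values carry no tag on purpose: pick the tag from the "Upstream vLLM Images" table or from the component table for the renamed router images.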
New Capabilities in This Release

| Capability | Component(s) | Status |
| --- | --- | --- |
| Multi-modal serving (production) | llm-d-router, llm-d-inference-sim | Graduated |
| Batch gateway (production) | llm-d-router | Graduated |
| Flow control (production) | llm-d-router | Graduated |
| Non-Kubernetes mode (RL/Slurm) | llm-d-router (FileDiscovery plugin) | New |
| Responses API support | llm-d-router | New |
| Multi-tier KV offloading (CPU → storage) | llm-d-kv-cache | New |
| HMA (Heterogeneous Memory Allocation) support | llm-d-kv-cache, fs-connector | New |
| DP-Aware scheduling | llm-d-router | Graduating |
| Mooncake connector | llm-d-kv-cache | New |
| Predicted latency scheduling | llm-d-router | New |
| Agentic workload routing | llm-d-router | New |
| TPU nightly tests | CI/CD | New |

What's Changed
Full Changelog: v0.8.0...v0.8.1