v0.11.0 - 2026-07-01
This release closes out the migration off the legacy preset system, tightens node estimation, and rounds out three long-running workstreams — GPU node provisioning via Karpenter, model weight streaming from blob, and prefill/decode disaggregation with MultiRoleInference. It also promotes InferenceSet to v1beta1 on by default, ships the KAITO ProductionStack as an optional Helm subchart, and refreshes the model catalog, base images, and runtime versions.
Breaking Changes
- Legacy preset system removed . The old built-in preset API is gone. Workloads that still reference legacy preset names must migrate to
modelNamein the Model Catalog before upgrading. - Stricter
max-model-lenand node estimation . KAITO now uses vLLM's native estimation to compute defaultmax-model-lenand the number of GPU nodes required. Deployments that previously relied on looser estimates may need to setmax-model-lenexplicitly or request additional nodes.
Major Features
Karpenter-based GPU node provisioning
Set --node-provisioner=karpenter at controller startup and KAITO will provision, reconcile, and drift-detect GPU nodes through Karpenter instead of the legacy gpu-provisioner path. NodeClass RBAC is derived from the configured karpenterProvider so operators no longer hand-roll it. The bundled gpu-provisioner is also updated to v0.4.2 for clusters that stay on the classic path.
Model weight streaming from blob storage
Two new resources work together to eliminate per-pod model downloads at startup:
ModelMirrorCRD creates a PVC-backed mirror of a HuggingFace model on blob storage.- Workspace
ModelStreamingannotation — addkaito.sh/model-streaming: "true"to a Workspace and the controller creates theModelMirrorfor you and starts the download in parallel with GPU node provisioning. Once the mirror is ready, inference pods start with--model=az://and stream weights directly from blob via RunAI Model Streamer, with no per-pod weight volume. Download bandwidth is exposed as a Prometheus metric on the vLLM pod.
Prefill/decode disaggregation — MultiRoleInference
A new MultiRoleInference CRD lets you run prefill and decode on separate, independently-scaled pod groups. You declare the roles and node requirements; the controller creates the InferencePool, wires the EPP plugin config, and injects the NixlConnector KV-transfer settings so the two groups can exchange KV-cache state. A GWIE + MultiRoleInference deployment guide is included in the docs.
InferenceSet promoted to v1beta1
InferenceSet is now v1beta1 and enabled by default — the feature gate is gone. Two behaviors ship with the promotion: replicas can scale to zero, and the Kubernetes deployment name can be used directly as the served-model-name.
Base image auto-upgrade for Workspaces
Workspaces can now opt in to automatic base image updates. When a new vLLM base image is published, opted-in Workspaces roll forward without any manual spec change.
ProductionStack as an optional subchart
@rambohe-ch @techworldhello @tnsimon @zhuangqh
The KAITO ProductionStack is now available as an optional Helm subchart — a turnkey inference gateway designed based on the llm-d reference stack. It stitches together Istio, llm-gateway-auth (API-key or Azure Entra ID authz), the llm-d inference scheduler for KV-cache-aware routing, KEDA-driven autoscaling on vllm:* metrics, and KAITO InferenceSet into a single deployable configuration.
Model Catalog Additions
| Family | Models added |
|---|---|
| Gemma 4 | Gemma 4 series |
| NVIDIA Nemotron | Nemotron series |
| Mistral / Qwen | Mistral 4, Qwen 3.5, Qwen 3.6 |
| MoonshotAI / Minimax | Kimi-K2.5, Minimax series |
AWQ 4-bit quantization is now supported as a deployment option for compatible models.
Dependency & Runtime Upgrades
| Component | Version |
|---|---|
| vLLM | 0.22.1 |
| PyTorch | 2.11.0 |
| transformers | 5.6.0 |
| Go | 1.26.4 |
| Kubernetes | 1.33.8 |
| gpu-provisioner | v0.4.2 |
| local-csi-driver | 0.2.18 |
| ProductionStack | 0.2.2 |
Changelog
Features 🌈
- 4d73506 feat: Add minimal block-only streaming guardrails (#2157)
- 2853512 feat: use vLLM native estimation for max-model-len (#2156)
- a8def1d feat: add RAGEngine streaming buffer window (#2138)
- f4b80b3 feat: derive NodeClass RBAC from karpenterProvider config (#2150)
- 77ddc2c feat: add RAGEngine streaming semantic utilities (#2136)
- 2007c56 feat: promote InferenceSet to v1beta1 and enable EnableInferenceSetController by default (#2112)
- 7245974 feat: support P/D disaggregation (#2093)
- 45266bd feat: Support streaming passthrough for RAGEngine chat completions (#2128)
- d5b9af0 feat: model streaming (#2100)
- 27ad3b0 feat: upgrade gpu-provisioner version to v0.4.2 for kaito (#2111)
- ab299f9 feat: add productionstack to kaito chart (#2105)
- 561a658 feat: expose model download bandwidth metrics in vLLM pod (#2087)
- d34d424 feat: support base image auto-upgrade for Kaito Workspaces (#2062)
- af0167f feat: model mirror crd + controller (#2082)
- 70848a5 feat: add invisible text and token limit guardrail scanners (#2064)
- f8876c4 feat: scanner pack 3 json reading time length limit (#2061)
- f7de6d4 feat: add secrets and sensitive scanners (#2060)
- ef5cda7 feat: MRI controller creates InferencePool + EPP plugins config (basic functionalities) (#2048)
- 410d557 feat: Add response-path guardrails metrics (#2070)
- 73d51ba feat: add guardrails observability metrics (#2037)
- 5842a40 feat: enable karpenter and add e2e tests (#2041)
- 79b2a1c feat: support hybrid Mamba model in max_concurrency calculation (#2052)
- 6aaff7e feat: add guardrails scanner capability validation (#2039)
- 0c4ce3e feat: ragengine guardrails reload observability (#2038)
- e72797f feat: hot-reload output guardrails policy via watchfiles (#2025)
- d0b2168 feat: onboard kimi and minimax models to model catalog (#2047)
- b0002fd feat: add MultiRoleInference CRD types, controller, and status aggregation for P/D disaggregated inference (#2005)
- 75d60ba feat: support per-scanner guardrails actions (#2042)
- f2220cd feat: inject NixlConnector kv-transfer-config for P/D disaggregated inference (#2027)
- 4a4a40b feat: wire default guardrails policy configmap for ragengine (#1992)
- 464662e feat: onboard mistral 4 and qwen 3.5/3.6 models to model catalog (#2031)
- f507b2c feat: Add guardrails fail-open config and policy ConfigMap mounting for RAGEngine (#2022)
- ce81a5f feat: 1952 integrate karpenter provisioner (#2017)
- 6353a00 feat: onboard Gemma 4 models to model catalog (#2021)
- 898d91a feat: add fail-open and basic error handling for output guardrails (#1963)
- cc371e0 feat: add AWQ quantization support (#2007)
- ecb905c feat: onboard Nvidia Nemotron models to model catalog (#2008)
- 23a745c feat: karpenter drift controller (#2003)
- dcbd663 feat: support deployment name as served-mode-name for inferenceset (#2004)
- 7b4a600 feat: add guardrails YAML policy loader to RAG runtime (#1990)
- 557194d feat: karpenter provisioner (#1982)
- afcdfbe feat: add RAGEngine guardrails API and policy validation (#1986)
- d325c36 feat: [breaking] complete legacy preset to model catalog migration (#1977)
- 4cdfaa1 feat: Migrate default EPP from GWIE to llm-d inference scheduler (#1975)
- 12836f1 feat: add --node-provisioner startup parameter (#1974)
- eeb5b3e feat: add karpenter crds dependencies and ensure aksnodeclass resources (#1968)
- a60bcf1 feat: onboard more built-in models to model catalog (#1945)
- bb5b4db feat: add non-streaming output guardrails hook for chat completions (#1962)
- 2783b2e feat: add nodes provision interface and gpu-provisoner implementation (#1954)
Bug Fixes 🐞
- b023ad9 fix: skip inference config validation when no ConfigMap is specified (#2171)
- cb140a3 fix: pin inference pods to provisioned nodes (#2161)
- 888ec57 fix: isolate inference config from default template (#2162)
- 5009de8 fix: [breaking] harden max-model-len and node estimation (#2139)
- eded298 fix: CVE-2026-25680 (#2140)
- 286d685 fix: inference deployment of Qwen3.6-35B-A3B and gpt-oss-120b (#2131)
- 6e9b08b fix: exempt legacy workspaces from benchmarking checks and retain first benchmark res and status (#2134)
- 858e518 fix: cap inference node count at 3 to dodge vLLM PP>3 bug (#2121)
- 513065a fix: inference deployment for model moonshotai/Kimi-K2.5 (#2110)
- 7da9159 fix: pin fastapi<0.137.0 to fix vLLM health check crash & replace NC12s_v3 SKU in E2E (#2104)
- bd0c83e fix: install keda-kaito-scaler in the same namespace as KEDA in aikit-integration-test (#2079)
- 9ec0033 fix: allow scaling InferenceSet replicas to 0 (#2071)
- f2c6d4b fix: Increase timeout and add dump for failed resources. (#2056)
- a76c28a fix: Deepseek r1/v3 models tokenization issue (#2053)
- 6d87657 fix: missing webhook infrastructure and add e2e test for MultiRoleInference (#2046)
- f8c1916 fix: ignore KAITO-reserved labels in user-supplied selectors (#2032)
- 685816d fix: benchmark: sum vLLM metrics across engine shards (#2030)
- 004587f fix: switch 1es runner for image building (#2019)
- b932405 fix: broken upgrade compatibility test by defining GO_VERSION at workflow level (#2013)
- 245c1ef fix: inferenceset should list workspaces in the same namespace (#1999)
- 6961877 fix: nil reference in validation (#1936)
Code Refactoring 💎
- 9a9d621 refactor: reorganize utils to reduce dependency contagion (#2084)
- ec88c58 refactor: introduce local apis.FieldError shim for non-webhook callers (#2018)
- 6e6c9ba refactor: unify output guardrails scanner configuration (#1996)
Continuous Integration 💜
- a3e8c13 ci: automate release on tag push
- 57a75d4 ci: pin nightly chart to nightly image and document new tag format (#2129)
- a75a6bb ci: switch nightly images to semver tag and publish to corp ACR (#2099)
- 1c50dee ci: refine large model e2e tests (#1978)
- 4ef62e2 ci: pass ginkgo_label input to full and fast e2e test steps
Documentation 📘
- 02b0731 docs: add wiki for inferenceset and auto-upgrade (#2167)
- 5f49aa0 docs: model mirror and streaming doc (#2153)
- 77f2a5b docs: add prefill/decode disaggregation guide for GWIE + MultiRoleInference (#2152)
- 9b598d9 docs: update InferenceSet and KEDA docs for beta promotion in v0.11.0 (#2135)
- 00ba1b1 docs: proposal for distributed cache integration (#2088)
- 0031c98 docs: Modify copyright notice in website (#2113)
- 1b3155a docs: update guardrails current behavior (#2068)
- 73a1b51 docs: add docs for model catalog (#2081)
- 3716736 docs: model mirror proposal (#2058)
- 01d123e docs: Dynamo vs llm-d comparison (#1995)
- 09ddd95 docs: add versioned documentation for v0.10.x (#1965)
- 7d06530 docs: update README for release (#1964)
Maintenance 🔧
- 9898a82 chore: bump productionstack to 0.2.2 (#2168)
- 5cf14cc chore: streaming E2E (#2130)
- 55432a1 chore: update code owner and maintainer (#2151)
- eb957a9 chore: bump local-csi-driver to 0.2.18 (#2146)
- 77a86ca chore: Publish Nightly Workspace Extension (#2125)
- ca98186 chore: bump go version to 1.26.4 (#2117)
- 8a31d23 chore: bump vllm to 0.22.1 and torch to 2.11.0 (#2078)
- c695896 chore: use CRD API enum for inference role constants instead of consts package (#2045)
- fdb89f3 chore: bump go version from 1.26.2 to 1.26.3 (#2033)
- c2d7357 chore: enable TPM benchmark by default (#2016)
- 39b16f6 chore: bump vllm to 0.19.1, transformers to 5.6.0 (#1987)
- a87d58c chore: bump follow-redirects from 1.15.11 to 1.16.0 in /website (#1966)
- 17b80c0 chore: add workflow dispatch for e2e test
- 0970cc7 chore: bump k8s version from 1.32.9 to 1.33.8 (#1967)