v0.9.0-rc.2

Pre-release

liu-cong released this 17 Jun 17:10
v0.9.0-rc.2
181aa83

Here is a summary of the release notes for v0.9.0-rc.2:

Highlights

  • Performance Wins: The sidecar proxy hot path was optimized for high-concurrency P/D routing (#746), and allocations in Scheduler.Schedule were reduced by ~90% on large fleets (#1171). Allocations in the OpenAI parser's response usage parsing were also minimized (#1583).
  • Key Features: Added support for multi-modal image content in approximate prefix cache match and encoder cache affinity scorer; support for the Anthropic v1/messages API (#1088); Mooncake as a new KV-connector in the routing sidecar (#1193); and a chunked decode feature in the sidecar (#822).
  • New Metrics: Introduced inter-token latency (ITL) (#1591), stream-aware TPOT, and core TTFT metrics (#1499).
  • Major Bug Fixes: Resolved a panic in the SGLang proxy handling concurrent requests (#632), fixed dropped multimodal content in prefix cache scoring/tokenizer rendering (#781), and patched a ZMQ port range validation issue to prevent multi-rank pod failures (#1242).
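
For readers unfamiliar with the latency metrics named above, the sketch below shows how TTFT, ITL, and TPOT are commonly derived from per-token arrival timestamps. It follows the general definitions of these terms, not necessarily the exact computation used in #1591 or #1499; the function names and the input shape (a list of timestamps in seconds) are assumptions made for illustration.

```python
from typing import List


def ttft(request_start: float, token_times: List[float]) -> float:
    """Time to first token: delay between request start and the first token."""
    return token_times[0] - request_start


def inter_token_latencies(token_times: List[float]) -> List[float]:
    """ITL: gap between each pair of consecutive output tokens."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]


def tpot(token_times: List[float]) -> float:
    """Time per output token: mean gap after the first token.

    Returns 0.0 when fewer than two tokens were produced (assumed convention).
    """
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


# Hypothetical timestamps (seconds) for a four-token streamed response.
sample_times = [0.5, 0.6, 0.8, 0.9]
sample_ttft = ttft(0.0, sample_times)
sample_itl = inter_token_latencies(sample_times)
sample_tpot = tpot(sample_times)
```

Under these definitions TPOT is the mean of the ITL values, so a stream with uneven token gaps can show a modest TPOT while individual ITL samples spike.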

Upgrade Steps & Deprecations

  • Repository & Binary Renaming: The project has officially transitioned from llm-d-inference-scheduler to llm-d-router. Reflecting this change, Docker images have been renamed to router-endpoint-picker and router-disagg-sidecar (#1098), and configuration pseudo CRD groups are migrating to llm-d.ai (#972).
  • Removed Sidecar Flags: The deprecated --inference-pool-name and --inference-pool-namespace flags have been permanently removed from the sidecar (#1416).
  • Metric Changes: Legacy metrics feature gates and their associated CLI flags have been removed from the EPP template (#1418, #1466). The metrics subsystem has also been unified and renamed to llm_d_epp (#1661).
  • Component Deprecations:
    • The disagg-headers-handler has been deprecated in favor of PreRequest inside the disagg-profile-handler (#905).
    • The UDS-backend in token-producer is deprecated; configuring the vLLM render endpoint instead is strongly recommended (#1079).
  • Istio & Gateway API: Istio has been upgraded to 1.29.2 (#1052) and the Gateway API has been updated to v1.5.1 (#780). The previous workaround for vLLM Data Parallel on Istio 1.28 is now deprecated (#727).
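
Because the removed sidecar flags may still appear in existing deployment manifests, a quick pre-upgrade check can help. The following is a minimal sketch, assuming the sidecar's container arguments have already been extracted into a list of strings; the helper name and the `--port` flag in the example are hypothetical, and only the two removed flag names come from these release notes.

```python
from typing import List

# Flags removed from the sidecar in this release (#1416).
REMOVED_SIDECAR_FLAGS = ("--inference-pool-name", "--inference-pool-namespace")


def find_removed_flags(args: List[str]) -> List[str]:
    """Return the removed flags present in args, in order of first appearance.

    Handles both the "--flag=value" and "--flag value" forms.
    """
    found: List[str] = []
    for arg in args:
        name = arg.split("=", 1)[0]
        if name in REMOVED_SIDECAR_FLAGS and name not in found:
            found.append(name)
    return found


# Hypothetical argument list taken from an older sidecar container spec.
old_args = [
    "--port=8000",  # hypothetical flag, for illustration only
    "--inference-pool-name=my-pool",
    "--inference-pool-namespace",
    "default",
]
stale_flags = find_removed_flags(old_args)
```

Any flag reported by this check should be dropped from the manifest before upgrading.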

Full Changelog: v0.4.0-rc.1...v0.9.0-rc.2