v1.5.0

@slin1237 slin1237 released this 12 Jun 02:44
· 63 commits to main since this release
df9824b

🚀 Shepherd Model Gateway v1.5.0 Released

Our biggest release yet: priority scheduling with preemption, two new backends (TokenSpeed and MLX) that bring SMG to five supported engines, and complete Responses API protocol coverage.

🎛️ Priority Scheduler

New request scheduler with admission control, prioritization, and preemption:

  • Priority classes: Configurable request classes with policy-driven admission
  • Admission control: Inflight tracking, queue management, and capacity slots
  • Preemption: Victim search and preempt path for high-priority requests under load
  • Capacity-proportional reservations: Floor + share guarantees per priority class, restored automatically after backend capacity recovers
  • Full observability: Operational metrics, tracing, and autoscaling gauges via metrics sampler
  • Battle-tested: Integration tests covering rejection, clamping, preemption, and starvation scenarios

Impact: Run mixed workloads on shared capacity. Latency-sensitive interactive traffic preempts batch jobs, priority classes get guaranteed capacity shares, and autoscalers get the signals they need.
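To make the "floor + share" idea concrete, here is a minimal sketch of one possible reading: each priority class reserves the larger of a fixed floor and a fraction of the current backend capacity, so reservations shrink when capacity drops and come back when it recovers. The class names, the `ClassReservation` fields, and the `max(floor, share * capacity)` rule are assumptions made for illustration. They are not SMG's configuration schema or its actual algorithm.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassReservation:
    """Hypothetical per-class settings (not SMG's config schema)."""
    name: str
    floor: int    # minimum slots reserved regardless of capacity
    share: float  # fraction of current capacity reserved for this class


def reserved_slots(classes, capacity):
    """Return {class name: reserved slots} for the current backend capacity.

    Assumed rule: each class wants max(floor, floor(share * capacity)).
    If the wants exceed capacity, they are granted in list order until
    the capacity runs out.
    """
    remaining = capacity
    result = {}
    for c in classes:
        wanted = max(c.floor, math.floor(c.share * capacity))
        granted = min(wanted, remaining)
        result[c.name] = granted
        remaining -= granted
    return result


classes = [
    ClassReservation("interactive", floor=4, share=0.5),
    ClassReservation("batch", floor=2, share=0.1),
]

healthy = reserved_slots(classes, capacity=40)    # shares dominate the floors
degraded = reserved_slots(classes, capacity=5)    # floors dominate, capped by capacity
recovered = reserved_slots(classes, capacity=40)  # recomputing restores reservations
```

Because the reservation is recomputed from the current capacity rather than stored, nothing special is needed to restore it after a backend recovers. Whether SMG takes this approach is not stated in these notes.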

⚑ TokenSpeed Backend

TokenSpeed joins SMG with full first-class gRPC integration:

  • Complete gRPC pipeline: Native TokenSpeed gRPC client, Python servicer, and router wiring
  • Multimodal support: Vision-language model (VLM) input through the gRPC pipeline
  • Admin operations: FlushCache and profile RPCs with worker-abstracted admin ops
  • Load reporting: Server-info label conversion for load-aware routing
  • Production CI: Full install and GPU E2E coverage from day one

Impact: TokenSpeed deployments get everything SMG offers: cache-aware routing, load balancing, priority scheduling, the full API surface (Chat Completions, Responses, Messages), and multimodal, all through the same battle-tested gRPC pipeline that powers SGLang, vLLM, and TensorRT-LLM.

🍎 MLX Backend: Apple Silicon Support

SMG now runs on Apple Silicon via MLX:

  • Native MlxEngine gRPC proto, Rust client, and Python gRPC servicer
  • Per-step admission and own-thread BatchGenerator for stable concurrency
  • Coalesced concurrent prefill admission for agent workloads
  • macOS CI workflow with E2E coverage
  • Nightly benchmark: MLX direct HTTP vs Router+gRPC

Impact: Develop and serve on Mac. The same gateway that fronts your GPU fleet now runs your local MLX models with the same APIs, the same routing, and the same tooling.

SMG now supports five backends: vLLM, TokenSpeed, TensorRT-LLM, SGLang, and MLX.

🛠️ Responses API: Complete Protocol Surface

Full OpenAI Responses API spec coverage across tools, content parts, and typed protocol items.

New hosted tools:

  • image_generation: Full streaming events wired across OpenAI, gRPC-regular, and gRPC-harmony routers, with MCP-backed dispatch and output metadata (action, background, format, quality, size)
  • web_search (non-preview) with results and return_token_budget
  • file_search, computer use, local_shell, containerized shell, apply_patch, custom tools, and namespace tool grouping

Protocol completeness:

  • Typed content parts and annotations with round-trip tests
  • ConversationRef union typing, item_reference inputs, compaction items
  • Full ToolChoice variant coverage
  • Fail-fast on unknown content/items: no more silent swallowing

Impact: SMG remains the only gateway with Responses API support for open-source models and third-party vendors, now with the complete protocol surface, including hosted tools like image generation and computer use.
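The difference between silently swallowing unknown content and failing fast can be sketched in a few lines. This is a generic illustration, not SMG's parser: the function names and the error type are hypothetical, the lenient version is not a reproduction of SMG's earlier code, and `KNOWN_PART_TYPES` is only a partial list of content part types from the OpenAI Responses API.

```python
# Partial, illustrative list of Responses API content part types.
KNOWN_PART_TYPES = {"input_text", "input_image", "input_file", "output_text", "refusal"}


class UnknownContentPart(ValueError):
    """Raised when a content part has a type the parser does not know."""


def parse_parts_lenient(parts):
    """Silent swallowing: unknown parts disappear without a trace."""
    return [p for p in parts if p.get("type") in KNOWN_PART_TYPES]


def parse_parts_strict(parts):
    """Fail-fast: the first unknown part aborts parsing with a clear error."""
    for index, part in enumerate(parts):
        kind = part.get("type")
        if kind not in KNOWN_PART_TYPES:
            raise UnknownContentPart(f"content[{index}]: unknown type {kind!r}")
    return list(parts)


parts = [
    {"type": "input_text", "text": "Describe this clip."},
    {"type": "input_hologram", "data": "..."},  # made-up type for the demo
]

kept = parse_parts_lenient(parts)  # the second part is dropped silently
try:
    parse_parts_strict(parts)
    error_message = None
except UnknownContentPart as exc:
    error_message = str(exc)
```

With the lenient parser the caller never learns that part of the request was ignored. The strict parser turns the same input into an error that names the offending item.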

💬 Messages API Enhancements

  • HTTP router support: Messages API now available on the HTTP router, not just gRPC
  • Adaptive thinking: Support adaptive thinking on /v1/messages
  • Protocol fidelity: Preserve system text and unknown fields end-to-end

🤖 New Model Support

  • DeepSeek V3.2 and V4: Chat-template encoders with tools, thinking introspection, and DSML tool call parser
  • Kimi K2 / K2.5 / K2.6: Tiktoken tokenizer support, K2.5 vision via gRPC router, chat template tool injection
  • Qwen3.5 family: Routed to Qwen3-VL multimodal processor
  • GPT-4o and modern OpenAI models: o200k_base tokenizer support

⚖️ Smarter Load Balancing

New policies and KV-aware routing signals:

  • least_load policy: Routes by token-work expected-wait, not just request counts
  • kv_pressure_weight policy: Balances on KV cache pressure
  • KV-aware imbalance triggers: Spread and overload detection via load monitor
  • DP-aware load balancing: Scheduler load info forwarded from gRPC backends, DP logical workers routed via base endpoint
  • GetLoads endpoint for the vLLM engine service
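The contrast between counting requests and estimating wait from token work can be sketched as follows. This only illustrates the general idea behind a policy like least_load. The `WorkerLoad` fields, the wait formula (pending tokens divided by recent throughput), and the worker URLs are assumptions, not SMG's implementation.

```python
from dataclasses import dataclass


@dataclass
class WorkerLoad:
    """Hypothetical load snapshot; field names are illustrative only."""
    url: str
    inflight_requests: int
    pending_tokens: int       # outstanding prefill + decode token work
    tokens_per_second: float  # recently observed throughput


def expected_wait_seconds(w):
    """Estimate how long the queued token work would take to drain."""
    if w.tokens_per_second <= 0:
        return float("inf")  # treat a stalled worker as unavailable
    return w.pending_tokens / w.tokens_per_second


def pick_by_request_count(workers):
    """Baseline: choose the worker with the fewest in-flight requests."""
    return min(workers, key=lambda w: w.inflight_requests)


def pick_by_expected_wait(workers):
    """Choose the worker whose queued token work drains soonest."""
    return min(workers, key=expected_wait_seconds)


workers = [
    # Few requests, but each one carries a long prompt.
    WorkerLoad("http://worker-a:8000", inflight_requests=2,
               pending_tokens=60_000, tokens_per_second=3_000.0),
    # More requests, all short.
    WorkerLoad("http://worker-b:8000", inflight_requests=5,
               pending_tokens=6_000, tokens_per_second=3_000.0),
]

by_count = pick_by_request_count(workers).url
by_wait = pick_by_expected_wait(workers).url
```

In this example the request-count baseline picks worker-a, which has 20 seconds of queued work, while the expected-wait choice picks worker-b, which has 2 seconds.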

🔄 Worker Lifecycle State Machine

Workers now have a real state machine:

  • WorkerStatus enum with Pending start semantics and Draining state
  • Workflow-driven drain for graceful removal
  • Event-driven WorkerMonitor with richer event payloads
  • WorkerCapacity tracker with 4-tier capacity sourcing
  • max_running_requests surfaced from SGLang /server_info
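As a rough picture of what a worker state machine with Pending and Draining states can look like, consider the sketch below. Only the names `WorkerStatus`, Pending, and Draining come from these notes. The other states, the transition table, and the `Worker` class are hypothetical and are not taken from SMG's code.

```python
from enum import Enum


class WorkerStatus(Enum):
    PENDING = "pending"      # registered, has not yet passed a health check
    READY = "ready"          # hypothetical: healthy and routable
    NOT_READY = "not_ready"  # hypothetical: failing health checks
    DRAINING = "draining"    # no new requests; in-flight ones finish
    REMOVED = "removed"      # hypothetical terminal state


# Allowed transitions in this sketch (an assumption, not SMG's actual table).
ALLOWED = {
    WorkerStatus.PENDING: {WorkerStatus.READY, WorkerStatus.REMOVED},
    WorkerStatus.READY: {WorkerStatus.NOT_READY, WorkerStatus.DRAINING},
    WorkerStatus.NOT_READY: {WorkerStatus.READY, WorkerStatus.DRAINING},
    WorkerStatus.DRAINING: {WorkerStatus.REMOVED},
    WorkerStatus.REMOVED: set(),
}


class Worker:
    def __init__(self, url):
        self.url = url
        self.status = WorkerStatus.PENDING  # workers start in Pending
        self.inflight = 0

    def transition(self, new_status):
        """Move to new_status, rejecting transitions the table does not allow."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"{self.status.name} -> {new_status.name} not allowed")
        self.status = new_status

    def is_routable(self):
        """Only READY workers accept new requests."""
        return self.status is WorkerStatus.READY

    def finish_request(self):
        """Complete one in-flight request; a fully drained worker is removed."""
        self.inflight -= 1
        if self.status is WorkerStatus.DRAINING and self.inflight == 0:
            self.transition(WorkerStatus.REMOVED)


w = Worker("grpc://worker-0:30000")
routable_while_pending = w.is_routable()
w.transition(WorkerStatus.READY)
w.inflight = 1
w.transition(WorkerStatus.DRAINING)
routable_while_draining = w.is_routable()
w.finish_request()
final_status = w.status
```

Compared with a single healthy/unhealthy flag, an explicit status lets the router distinguish a worker that has not started yet from one that is being removed gracefully.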

🔧 PD Disaggregation Reliability

  • NIXL PD now actually transfers KV cache: kv_transfer_params relayed correctly
  • vLLM MooncakeConnector PD driven with router-minted kv_transfer_params
  • Smooth regular→PD transition: no disruption when enabling disaggregation
  • Fixed PD cache-aware policy lifecycle

🌐 Golang gRPC Client

Official Go client package joins Python, Rust, and Java SDKs.

🎙️ Audio Transcriptions

New /v1/audio/transcriptions multipart route.

📈 Performance

  • /workers endpoint 5x faster
  • Fused cache-aware match+insert: Single tree descent instead of two, with concurrency stress coverage
  • K8s discovery: Label selectors pushed to the API server instead of client-side filtering
  • Python bindings: GIL released while the router server runs
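The idea of fusing prefix matching and insertion into one tree descent can be shown on a plain trie keyed by token ID. This is a simplified, single-threaded sketch under that assumption. SMG's cache-aware tree, its node layout, and its concurrency handling are not described in these notes and are not reproduced here.

```python
class Node:
    """Trie node keyed by token ID (a simplification for illustration)."""
    __slots__ = ("children",)

    def __init__(self):
        self.children = {}


def match_and_insert(root, tokens):
    """Return the length of the longest cached prefix of tokens and insert
    the unmatched suffix, all in a single walk from the root."""
    node = root
    matched = 0
    for tok in tokens:
        child = node.children.get(tok)
        if child is None:
            # First miss: from here on every node is freshly created, so
            # later lookups also miss and `matched` stops growing.
            child = Node()
            node.children[tok] = child
        else:
            matched += 1
        node = child
    return matched


root = Node()
first = match_and_insert(root, [1, 2, 3, 4])   # nothing cached yet
second = match_and_insert(root, [1, 2, 3, 9])  # shares the prefix [1, 2, 3]
third = match_and_insert(root, [1, 2, 3, 9])   # fully cached now
```

A separate match followed by an insert would walk the shared prefix twice. The fused version continues inserting from the exact node where the match ended.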

🐛 Notable Fixes

  • Tokenizer: HuggingFace tojson formatting parity, EOS token IDs from generation_config.json, L1 prefix cache keyed on add_special_tokens
  • Reasoning: Fresh parser on non-streaming path (no shared mutex), whitespace preserved in BaseReasoningParser
  • MCP: Internal self-brought MCP details hidden from final responses, citations/sources/query included when available, forwarded request headers preserved
  • gRPC: request.seed passed through to vLLM, backend sampling defaults applied, ResponsesRequest sampling params passed to all backends
  • Readiness: Wait for gRPC worker tokenizer autoload before reporting ready
  • Build: Debug symbols kept in release binaries
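For the tokenizer fix, it helps to know that Hugging Face `generation_config.json` files store `eos_token_id` as either a single integer or a list of integers. The sketch below normalizes both forms and strips a trailing EOS token from an output sequence. The function names are hypothetical, and this is not SMG's StopDecoder logic.

```python
import json


def load_eos_token_ids(config_text):
    """Return the EOS token IDs declared in a generation_config.json document.

    eos_token_id may be a single integer or a list of integers; both forms
    are normalized to a sorted list. A missing key yields an empty list.
    """
    config = json.loads(config_text)
    eos = config.get("eos_token_id")
    if eos is None:
        return []
    if isinstance(eos, int):
        return [eos]
    return sorted(set(eos))


def strip_trailing_eos(token_ids, eos_ids):
    """Drop one trailing EOS token so it is not decoded into the output text."""
    if token_ids and token_ids[-1] in eos_ids:
        return token_ids[:-1]
    return token_ids


single = load_eos_token_ids('{"eos_token_id": 2}')
multiple = load_eos_token_ids('{"eos_token_id": [128009, 128001]}')
missing = load_eos_token_ids('{"temperature": 0.6}')
stripped = strip_trailing_eos([15, 27, 128009], multiple)
untouched = strip_trailing_eos([15, 27], multiple)
```

Models that declare several EOS IDs are the case a tokenizer-only lookup tends to miss, since the tokenizer config usually names a single EOS token.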

🏗️ Infrastructure

  • Nightly GHCR releases for SMG + vLLM/SGLang/TRT-LLM engine images
  • Dev wheels published to GitHub Releases and wheel index
  • CI: vLLM ≥0.22.1 (cu13 stack), SGLang 0.5.12.post1, TensorRT-LLM 1.3.0rc18
  • Rust toolchain 1.95.0

🙏 Welcome New Contributors

25 first-time contributors landed in this release. Thank you all!

Full Changelog: v1.4.1...v1.5.0

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

What's Changed

  • feat(image): add image generation tool support by @TingtingZhou7 in #1057
  • refactor(core): move worker domain files from core/ to worker/ by @slin1237 in #1079
  • refactor(core): move workflow files to workflow/, delete core/ by @slin1237 in #1081
  • refactor: move job_queue.rs from worker/ to workflow/ by @slin1237 in #1083
  • fix(python): add router arg fallback disable flag by @gongwei-130 in #1072
  • Revert "feat(image): add image generation tool support" by @CatherineSue in #1087
  • fix(docker): install smg-grpc-servicer from source in engine images by @key4ng in #1086
  • feat(mcp): hide internal self-brought MCP details from final responses by @zhoug9127 in #1061
  • refactor(protocols): add WorkerStatus enum to protocol crate by @slin1237 in #1093
  • feat(mlx): add MlxEngine gRPC proto and Rust client by @key4ng in #1034
  • ci: allow 'ci-approved' label to bypass fork PR approval gate by @slin1237 in #1095
  • refactor(worker): expand WorkerEvent with richer payloads by @slin1237 in #1100
  • refactor(worker): replace healthy AtomicBool with status AtomicU8 by @slin1237 in #1101
  • ci: apply change detection to pull_request_target events by @ai-jz in #1104
  • refactor(worker): land Pending start semantics and state machine by @slin1237 in #1102
  • refactor(worker): extract health loop into WorkerManager by @slin1237 in #1105
  • test(completions): add E2E tests for /v1/completions gRPC endpoint by @vschandramourya in #1021
  • refactor(worker): reorganize WorkerRegistry methods and doc API by @slin1237 in #1113
  • fix(deps): bump google.golang.org/grpc to v1.79.3 for CVE-2026-33186 by @slin1237 in #1120
  • refactor(worker): create event-driven WorkerMonitor (PR 8) by @slin1237 in #1118
  • refactor(worker): delete set_healthy and migrate every caller (PR 9) by @slin1237 in #1121
  • fix(worker): http_health_check must use the per-worker http_client by @slin1237 in #1125
  • refactor(routers): extract shared utilities into routers/common by @slin1237 in #1126
  • refactor(worker): canonical metadata API on WorkerMetadata, trait keeps convenience defaults by @slin1237 in #1127
  • refactor(worker): extract HashRing into its own module (Finding 5) by @slin1237 in #1129
  • ci(mergify): upgrade configuration to current format by @mergify[bot] in #906
  • ci: move branch naming check from Mergify to GitHub Actions by @CatherineSue in #1131
  • feat(protocols): add serde(flatten) to ChatCompletionRequest for engine-specific fields by @DavidBellamy in #1130
  • chore(deps): bump actions/upload-pages-artifact from 4 to 5 by @dependabot[bot] in #1135
  • feat(grpc_servicer): implement GetTokenizer RPC for SGLang backend by @CatherineSue in #1136
  • feat(grpc_servicer): implement GetTokenizer RPC for vLLM backend by @CatherineSue in #1142
  • fix(grpc_servicer): remove snapshot_download from vLLM GetTokenizer by @CatherineSue in #1143
  • fix(mcp): avoid relisting tools on resumed responses by @zhaowenzi in #1060
  • feat(tokenizer): load EOS token IDs from generation_config.json, strip in StopDecoder by @CatherineSue in #1122
  • refactor(grpc): replace PreparationOutput struct with typed enum by @CatherineSue in #1144
  • feat(multimodal): add Kimi-K2.5 vision support for gRPC router by @Kangyan-Zhou in #1044
  • refactor(grpc): remove filtered_request from Chat PreparationOutput by @CatherineSue in #1145
  • feat(tool_parser): add structural tag constraint generation, fix skip_special_tokens derivation by @CatherineSue in #1090
  • refactor(worker): canonical runtime API on WorkerRuntime (Finding 4 Part B) by @slin1237 in #1128
  • docs(worker): strip redundant inline comments by @slin1237 in #1147
  • fix(grpc_client): warn instead of error on conflicting constraints by @CatherineSue in #1146
  • feat(python/serve): disable router retries and circuit breaker by default when dp<=1 by @slin1237 in #1148
  • refactor(middleware): split middleware.rs into a folder + relocate token_bucket (Finding 2) by @slin1237 in #1151
  • docs(entrypoints): audit and fix top-level navigation, README, and getting-started by @slin1237 in #1152
  • docs(reliability): audit and fix circuit breakers, rate limiting, health checks, graceful shutdown by @slin1237 in #1153
  • docs(extensibility): audit and fix tokenizer cache, MCP, WASM plugin docs by @slin1237 in #1158
  • docs(api-ref): audit and fix OpenAI, Responses, Admin, Extensions API reference by @slin1237 in #1157
  • docs(ops): audit and fix monitoring, metrics reference, control-plane, auth concepts by @slin1237 in #1156
  • docs(config): audit and fix configuration reference + contributing docs by @slin1237 in #1155
  • docs(routing): audit and fix PD disaggregation, routing policies, gRPC pipeline by @slin1237 in #1154
  • feat(grpc): add golang gRPC client code package by @zetxqx in #874
  • docs(workers): audit and fix worker setup, service discovery, external providers, HA by @slin1237 in #1159
  • feat(mesh): MeshKV wrapper with explicit namespace handles (v2 Step 1) by @CatherineSue in #1164
  • feat(mesh): stream buffer integration with backpressure (v2 Step 2a) by @CatherineSue in #1166
  • feat(mesh): centralized gossip loop (v2 Step 2b) by @CatherineSue in #1169
  • refactor(mesh): cleanup tasks from Step 2b review by @CatherineSue in #1170
  • test(mesh): multi-peer delivery verification + legacy collector removal (v2 Step 2c) by @CatherineSue in #1171
  • revert(ci): ci-approved fork PR trigger (security + stability) by @slin1237 in #1175
  • docs(contributing): add Code of Conduct and root CONTRIBUTING guide by @slin1237 in #1230
  • refactor(tool_parser): rename qwen_coder to qwen_xml and fix model mappings by @key4ng in #1165
  • feat(mesh): ChunkAssembler with memory bounds and proto wire-format (v2 Step 3 part 1) by @CatherineSue in #1173
  • feat(mcp): Include citations, sources and query when available by @Tobel158 in #1109
  • chore(codeowners): add @zhoug9127 and @zhaowenzi to mcp and data_connector by @slin1237 in #1236
  • feat(protocols): add skills core types by @slin1237 in #1235
  • feat(responses): interrupt approval-required MCP tool calls by @zhaowenzi in #1174
  • fix(mcp): centralize non-stream internal privacy filtering by @zhoug9127 in #1168
  • chore(ci): bump Rust toolchain to 1.95.0 to unblock CI by @slin1237 in #1243
  • feat(protocols): integrate skills request surfaces by @slin1237 in #1242
  • feat(skills): add crate and config scaffolding by @slin1237 in #1249
  • chore: remove ut from main by @slin1237 in #1252
  • fix(ci): pin vllm<0.19.1 to restore embedding test correctness by @slin1237 in #1251
  • feat(skills): parse SKILL.md and sidecar metadata by @slin1237 in #1253
  • feat(skills): normalize skill bundle zip archives by @slin1237 in #1255
  • feat(protocols): BGM-PR-01 additive Responses API fields for background mode by @slin1237 in #1244
  • feat(mesh): stream chunking wiring in gossip loop (v2 Step 3 part 2) by @CatherineSue in #1254
  • fix(grpc): forward scheduler load info for DP-aware load balancing by @Kangyan-Zhou in #1116
  • feat(skills): add storage contracts and schema migrations by @slin1237 in #1257
  • perf(mesh): stream entry data as Bytes instead of Vec by @CatherineSue in #1259
  • chore(mesh): drop spec section refs from code comments by @CatherineSue in #1261
  • feat(gateway): add /v1/audio/transcriptions multipart route by @shenxiul in #1256
  • feat(memory): add conversation memory header contract no-op hook by @spalimpaaces-star in #1134
  • fix(e2e): pin oracle-custom schema to latest migration by @slin1237 in #1268
  • fix(workflow): skip tokenizer autoload for HTTP-mode workers by @key4ng in #1265
  • fix(grpc_client): pass ResponsesRequest sampling params to all backends by @key4ng in #1161
  • ci(pr-naming): allow uppercase letters in branch names by @CatherineSue in #1280
  • fix(model-gateway): format conversation memory header assertions by @CatherineSue in #1279
  • feat(protocols): implement P3 EasyInputMessage.phase by @slin1237 in #1281
  • feat(protocols): implement P4 IncludeField web_search_call variants by @slin1237 in #1274
  • fix(protocols): implement P8 Reasoning.summary SummaryTextContent by @slin1237 in #1273
  • feat(protocols): implement T1 file_search tool declaration by @slin1237 in #1272
  • feat(skills): add in-memory runtime and filesystem blob cache by @slin1237 in #1262
  • fix(memory): skip memory injection when header is absent by @spalimpaaces-star in #1277
  • feat(protocols): implement P2 top-level ResponsesRequest fields by @slin1237 in #1278
  • feat(protocols): implement P1 content parts + typed annotations by @slin1237 in #1275
  • fix(skills): tighten local storage semantics by @slin1237 in #1283
  • feat(mesh): route targeted stream entries on server-side sync streams by @CatherineSue in #1260
  • ci: run full PR test suite for dependabot PRs by @CatherineSue in #1292
  • chore(deps): bump actions/setup-go from 5 to 6 by @dependabot[bot] in #1286
  • perf(mesh): carry drain stream entries as Bytes through RoundBatch by @CatherineSue in #1285
  • test(mesh): stream chunking integration tests (v2 Step 3 part 3) by @CatherineSue in #1289
  • chore(deps): update wasm-encoder requirement from 0.246 to 0.247 by @dependabot[bot] in #1290
  • chore(deps): update lru requirement from 0.16.2 to 0.17.0 by @dependabot[bot] in #1291
  • feat(smg): validate skills startup config by @slin1237 in #1294
  • chore(deps): update npyz requirement from 0.8 to 0.9 by @dependabot[bot] in #1293
  • feat(protocols): implement P7 ToolChoice variant coverage by @slin1237 in #1276
  • feat(mesh): EpochMaxWins merge helper for rate-limit counters by @CatherineSue in #1295
  • fix(protocols): implement P5 fail-fast on unknown content/items by @slin1237 in #1298
  • feat(gateway): implement R1 OpenAI Responses content-part transformer regression coverage by @slin1237 in #1300
  • feat(protocols): implement T3 web_search non-preview tool + WebSearchCall.results by @slin1237 in #1304
  • feat(protocols): implement P6 ConversationRef union typing by @slin1237 in #1306
  • feat(protocols): implement T4 image_generation tool + call item by @slin1237 in #1303
  • refactor(protocols): strip audit-process residue from comments by @slin1237 in #1310
  • feat(mesh): auto-register config: CRDT prefix and mirror v1 AppStore on receive by @CatherineSue in #1299
  • refactor(protocols): extract responses tests to crates/protocols/tests/ by @slin1237 in #1309
  • feat(memory): add runtime-gated conversation memory substrate by @zhoug9127 in #1149
  • refactor(gateway): strip scope-bleed guardrails from Wave-1/2 Responses router code by @slin1237 in #1311
  • refactor(protocols): restore Chat/Responses ToolChoice separation invariant by @slin1237 in #1314
  • refactor(openai): split provider module into submodules by @VadivelanU in #1308
  • feat(smg): add tenant resolution request metadata by @slin1237 in #1312
  • feat(smg): add WorkerSyncAdapter for v2 worker: CRDT namespace by @CatherineSue in #1313
  • fix(multimodal): extract images from tool role messages by @ConnorLi96 in #1307
  • test(embeddings): enable sglang embedding correctness on h100 by @CatherineSue in #1326
  • fix(metrics): add path label to HTTP response metrics by @zhaowenzi in #1328
  • feat(data-connector): BGM-PR-02 background repository trait + schema + migrations by @slin1237 in #1258
  • feat(protocols): implement I3 compaction input+output item by @slin1237 in #1302
  • fix(bench): handle Compaction variant in routing_allocation_bench by @slin1237 in #1331
  • test(embeddings): restore skip_for_runtime("sglang") on correctness test by @slin1237 in #1330
  • feat(skills): add create upload admin endpoints by @slin1237 in #1332
  • feat(protocols): implement T2 computer / computer_use_preview tools by @slin1237 in #1305
  • feat(protocols): implement T8 custom tool + call/output items by @slin1237 in #1301
  • feat(protocols): implement T9 namespace tool grouping by @slin1237 in #1337
  • test(responses): add negative tests for silent swallow (E7) by @slin1237 in #1336
  • test(responses): add annotation round-trip tests (E5) by @slin1237 in #1334
  • feat(skills): add read and list admin endpoints by @slin1237 in #1335
  • feat(smg): add RateLimitSyncAdapter for v2 rl: CRDT namespace by @CatherineSue in #1327
  • feat(skills): add patch and delete admin endpoints by @slin1237 in #1344
  • feat(protocols): implement T6 shell (containerized) tool + call/output items by @slin1237 in #1342
  • feat(background): BGM-PR-03 config + AppContext + memory repository by @slin1237 in #1340
  • feat(protocols): implement I2 item_reference input item by @slin1237 in #1341
  • feat(protocols): implement T7 apply_patch tool + call/output items by @slin1237 in #1339
  • fix(background): keep raw_response.status in sync with terminal transitions by @slin1237 in #1348
  • feat(protocols): implement T5 local_shell tool + call/output items by @slin1237 in #1338
  • fix(data-connector): scope startup migrations to core history by @slin1237 in #1347
  • feat(gateway): add image_generation infrastructure for hosted-tool MCP plumbing (R6.1) by @slin1237 in #1355
  • feat(smg): resolve tenant aliases in request metadata by @slin1237 in #1354
  • feat(smg): TreeSyncAdapter foundation - tenant-delta fast path by @CatherineSue in #1345
  • feat(openai): wire image_generation streaming events in Responses router (R6.2) by @slin1237 in #1356
  • feat(grpc-regular): wire image_generation streaming events in Responses router (R6.4) by @slin1237 in #1358
  • feat(grpc-harmony): wire image_generation streaming events (R6.3) by @slin1237 in #1359
  • fix(background): share memory storage + validate all retry fields + builder setter by @slin1237 in #1349
  • fix(skills): harden skills API review findings by @slin1237 in #1366
  • feat(mcp): preserve forwarded request headers in MCP request context by @RohanSogani in #1333
  • feat(mm): share multimodal config registry keyed by tokenizer UUID by @key4ng in #1248
  • fix(grpc-harmony): dispatch image_generation as function tool for gpt-oss (R6.8) by @slin1237 in #1368
  • fix(openai): image_generation result round-trip + streaming event ordering (R6.7) by @slin1237 in #1369
  • feat(mcp): add hosted-tool overrides helpers and wire image_generation dispatch (R6.6) by @slin1237 in #1370
  • fix(openai): broaden output_item.done suppression gate to all tool-call item types (R6.7b) by @slin1237 in #1371
  • refactor: strip R6.x audit-process residue from MCP + router comments (C3) by @slin1237 in #1374
  • fix(openai-router): suppress duplicate output-item envelopes on native passthrough (R6.7c) by @slin1237 in #1376
  • feat(protocols): ImageGenerationCall output metadata - action/background/output_format/quality/size (R6.9) by @slin1237 in #1377
  • feat(protocols): implement T11 mcp input roundtrip by @slin1237 in #1350
  • fix(helm): add clusterWide flag for opt-in cluster-scoped service discovery by @MohanKumar21 in #1379
  • feat(tokenizer): add DeepSeek V3.2 and V4 chat-template encoders by @CatherineSue in #1373
  • fix(tokenizer): plumb tools, thinking introspection, and V3.2 iteration for DeepSeek encoders by @CatherineSue in #1381
  • feat(tool_parser): add DeepSeek V3.2 and V4 DSML tool call parser by @key4ng in #1030
  • test(e2e): add mock MCP server + image_generation integration tests (R6.5) by @slin1237 in #1365
  • feat(skills): resolve request skill manifests by @slin1237 in #1382
  • test(e2e): lock mcp_call shape for plain-MCP image_generation tool by @slin1237 in #1391
  • fix(mcp): forward request.user to hosted-tool MCP dispatch args by @slin1237 in #1389
  • feat(mlx): add Python gRPC servicer for MLX backend by @key4ng in #1099
  • ci(docker): move image builds to cpu-e5 runner and parallelize make by @key4ng in #1372
  • refactor(smg): move skill resolution into skills crate by @slin1237 in #1393
  • fix(mlx-grpc): drain-and-batch to avoid BatchGenerator rope crash at concurrency≥4 by @key4ng in #1414
  • fix(router): require model for IGW generate requests by @jshanson7 in #1420
  • refactor(server): extract /v1/audio/transcriptions multipart parsing into FromRequest by @slin1237 in #1426
  • fix(mlx-grpc): per-step admission + own-thread BatchGenerator (eliminates chat-c4 TTFT regression and Stream(gpu,1) crash) by @key4ng in #1427
  • docs: fix image reference paths by @Weili-0234 in #1428
  • feat(smg): TreeHandle on CacheAwarePolicy - adapter consumes policy-owned hash membership by @CatherineSue in #1364
  • chore(deps): update lru requirement from 0.17.0 to 0.18.0 by @dependabot[bot] in #1408
  • feat(mesh): publish tree:req: repair on unknown tenant deltas by @CatherineSue in #1444
  • fix(openai-bridge): address PR #1429 review follow-ups (alias miss, error state, wire shape) by @slin1237 in #1442
  • feat(mesh): apply known remote tenant deltas instead of just logging by @CatherineSue in #1446
  • refactor(openai-bridge): finish descriptor pattern, fix FileSearch routing gap by @slin1237 in #1450
  • refactor(openai-bridge): close last new-builtin coupling leaks (descriptor + connect wrapper) by @slin1237 in #1455
  • feat(gateway): Smooth transition from regular->PD by @ekzhang in #1445
  • feat(openai-protocol): add ResponseTool::as_str() by @slin1237 in #1457
  • feat(sse): add shared SSE codec module by @XinyueZhang369 in #987
  • feat(realtime-api): WebRTC signaling handler and OpenAIRouter integra… by @pallasathena92 in #748
  • feat(kv-index): iter_entries lazy walker for Tree and TokenTree by @CatherineSue in #1454
  • fix(deps): drop default features on hf-hub; opt into rustls-tls by @krishung5 in #1459
  • fix(protocols): add continuous_usage_stats to StreamOptions by @Abhishek8108 in #1460
  • feat(mesh): tree-repair pages, responder side of the protocol by @CatherineSue in #1458
  • chore(deps): update str0m requirement from 0.18 to 0.19 by @dependabot[bot] in #1453
  • feat(mesh): tree-repair pages, receiver side of the protocol by @CatherineSue in #1461
  • feat(grpc): apply backend sampling defaults by @CatherineSue in #1462
  • fix(grpc): tighten sampling defaults metadata by @CatherineSue in #1463
  • docs: Update README by @lightseek-bot in #1467
  • ci(release): add dev wheel workflow → GitHub Releases by @key4ng in #1471
  • ci(release): publish dev wheels to whl index by @zhyncs in #1473
  • smg now needs either launch, server or serve - those examples are all… by @surak in #1474
  • fix(ci): pin nixl<1.1.0 to avoid cu13 libcudart on CUDA 12 runners by @key4ng in #1477
  • fix(tokenizer): match HuggingFace tojson formatting by @CatherineSue in #1478
  • test(e2e): stabilize Mistral required streaming by @CatherineSue in #1483
  • refactor(mesh): remove v1 mesh from gateway and public API by @CatherineSue in #1476
  • chore(deps): update tokenizers requirement from 0.22.0 to 0.23.1 by @dependabot[bot] in #1410
  • chore(deps): update wasm-encoder requirement from 0.247 to 0.248 by @dependabot[bot] in #1409
  • chore(gprc): bump up grpc-servicer version to 0.5.3 by @YouNeedCryDear in #1443
  • perf(discovery): push label selectors to K8s API server by @slin1237 in #1488
  • fix(protocols): accept return_token_budget for web_search tool by @Tobel158 in #1490
  • refactor(mesh): delete v1 internals (-10k lines) by @CatherineSue in #1486
  • fix(worker): honor explicit worker connection schemes by @heymrbox in #1485
  • feat(worker): add Draining state with workflow-driven drain by @slin1237 in #1491
  • refactor(grpc-client): extract shared Channel builder by @slin1237 in #1494
  • refactor(grpc-client): extract AbortOnDropStream + engine basics by @slin1237 in #1498
  • ci: skip unit-tests on PRs without Rust changes by @MohanKumar21 in #1493
  • docs: end-to-end example for KV-events cache-aware routing by @yetone in #1497
  • test(mlx): add MLX backend E2E tests + macos-latest CI workflow by @key4ng in #1398
  • fix(mlx-grpc): runtime error in mlx multi streams when loading gpt-oss model series by @zach-li-sudo in #1489
  • feat(mlx-bench): nightly benchmark - MLX direct HTTP vs Router+gRPC by @key4ng in #1399
  • feat(grpc): add TokenSpeed gRPC client and router wiring (Part 1/3) by @yetone in #1351
  • feat(protocols): add serde(flatten) to GenerateRequest for engine-specific fields by @devvrit-mirendil in #1503
  • test(realtime): pin to gpt-realtime GA alias instead of retired preview snapshot by @key4ng in #1504
  • feat(grpc_servicer): add TokenSpeed servicer (Part 2/3) by @key4ng in #1464
  • ci(tokenspeed): add CI install + GPU e2e coverage (Part 3/3) by @key4ng in #1465
  • feat(reasoning_parser): add none reasoning parser by @zhyncs in #1505
  • refactor(mesh): rename gossip files, consolidate transport limits by @CatherineSue in #1495
  • fix(mesh): off-by-one in RetryManager backoff slot lookup by @CatherineSue in #1511
  • refactor(mesh): move chunking + chunk_assembler under transport/ by @CatherineSue in #1514
  • refactor(mesh): extract shared SyncStream message helpers by @CatherineSue in #1513
  • ci(unit-tests): use python -m pip for vision golden deps by @key4ng in #1518
  • fix(realtime): drop OpenAI-Beta header rejected by GA Realtime API by @key4ng in #1516
  • test(e2e): drop tokenspeed marker from structurally-unsupported tests by @key4ng in #1517
  • fix(mesh): inbound sender drops targeted entries before peer is learned by @CatherineSue in #1519
  • refactor(reasoning_parser): replace NoneParser with PassthroughParser by @CatherineSue in #1512
  • fix(pd): Fix PD cache-aware policy lifecycle by @aurickq in #1520
  • fix(pd): route DP logical workers via base endpoint by @aurickq in #1522
  • chore(deps): update opentelemetry-proto requirement from 0.31 to 0.32 by @dependabot[bot] in #1506
  • feat(gateway): Add Messages API to HTTP router by @ekzhang in #1521
  • fix(observability): include addr in metrics bind failure by @zhyncs in #1525
  • fix(observability): use expect for metrics bind failure by @zhyncs in #1528
  • ci: cancel stale PR test runs by @zhyncs in #1530
  • fix(service-discovery): drop http:// prefix so dual-probe detects gRPC by @CatherineSue in #1523
  • feat(workflow): surface max_running_requests on SGLang HTTP /server_info by @slin1237 in #1529
  • feat(worker): add max_running_requests accessor on Worker trait by @slin1237 in #1526
  • fix(mesh): wire EpochMaxWins into CRDT merge by @CatherineSue in #1469
  • chore(deps): update wasm-encoder requirement from 0.248 to 0.250 by @dependabot[bot] in #1508
  • fix(ci): restore per-commit CI runs on main by @CatherineSue in #1536
  • ci(claude-review): flag PRs missing required template sections by @CatherineSue in #1537
  • feat(worker): add WorkerCapacity tracker (4-tier capacity sourcing) by @slin1237 in #1534
  • refactor(mesh): move tree path hashing out of mesh into kv-index by @CatherineSue in #1535
  • feat(grpc): add TokenSpeed multimodal (VLM) input support by @chenht2022 in #1515
  • fix(tokenizer): support Kimi K2/K2.5/K2.6 tiktoken models by @CatherineSue in #1482
  • fix(docs): replace HTML img tags with markdown image syntax for correct MkDocs path rewriting by @1195343015 in #1543
  • fix(tokenizer): add o200k_base support for GPT-4o and modern OpenAI models by @1195343015 in #1542
  • feat(scheduler): foundation - class, config, policy (PR 2 M1) by @CatherineSue in #1541
  • refactor(mesh): extract NamespaceCrdtEngine + LwwEngine + transitional EpochMaxWins by @CatherineSue in #1539
  • feat(scheduler): admission data layer - inflight, queue, slots (PR 2 M2a) by @CatherineSue in #1546
  • feat(messages): support adaptive thinking on /v1/messages by @CatherineSue in #1555
  • feat(scheduler): engine - PriorityScheduler, admit, dispatcher (PR 2 M2b) by @CatherineSue in #1550
  • ci: fix YAML comment swallowing echo in cancel-merged-pr-tests by @CatherineSue in #1556
  • refactor(mesh): typed RateLimitEngine, drop EpochMaxWinsLegacyEngine by @CatherineSue in #1549
  • ci: remove cancel-merged-pr-tests workflow by @CatherineSue in #1558
  • chore(grpc): release smg-grpc-proto 0.4.8 by @CatherineSue in #1562
  • fix(metrics): add /v1/messages to route_to_endpoint match by @ekzhang in #1564
  • feat(multimodal): route Qwen3.5 family to Qwen3-VL processor by @chenht2022 in #1563
  • refactor(scheduler): remove queue_size_per_slot multiplier by @CatherineSue in #1559
  • refactor(mesh): make OperationLog strategy-free; engines own merge/compact by @CatherineSue in #1560
  • perf(mesh): size apply_remote_ops dedup to the incoming batch by @CatherineSue in #1569
  • ci(trtllm): install pre-release wheel from PyPI instead of building from source by @key4ng in #1501
  • test(e2e): cache workers + tp=1 for gpt-oss-20b/Qwen2.5-14B + move responses & chat to 1-GPU by @key4ng in #1502
  • perf(gateway): Make /workers 5x faster by @ekzhang in #1576
  • fix(cache-aware): gate hash_index hot-path writes behind explicit flag by @ekzhang in #1565
  • feat(mesh): gossip CRDT operation log over the wire (d-3a) by @CatherineSue in #1570
  • feat(scheduler): preemption - victim search, preempt path, body wrapper (PR 2 M3) by @CatherineSue in #1572
  • feat(scheduler): wire priority admission middleware + cancel propagation (PR 2 M4) by @CatherineSue in #1577
  • feat(scheduler): operational metrics + tracing (PR 2 M5a) by @CatherineSue in #1579
  • fix(build): keep debug symbols in release binaries by @slin1237 in #1582
  • feat(scheduler): autoscaling gauges via metrics sampler (PR 2 M5b) by @CatherineSue in #1580
  • test(scheduler): integration wiring + fallback guards (PR 2 M6) by @CatherineSue in #1581
  • fix(scheduler): restore reservations after backend capacity recovers by @CatherineSue in #1587
  • test(scheduler): add admission micro-benchmark and concurrent load harness by @slin1237 in #1584
  • docs(scheduler): document the priority scheduler by @slin1237 in #1583
  • test(scheduler): integration tests for rejection, clamp, preemption, starvation by @slin1237 in #1585
  • feat(mesh): per-key CRDT send watermark + ack + retry by @CatherineSue in #1586
  • feat(scheduler): capacity-proportional reservations (floor + share) by @CatherineSue in #1588
  • fix(bindings): expose consistent_hashing and prefix_hash in python CLI by @deokjinkim in #1589
  • chore(deps): update wasm-encoder requirement from 0.250 to 0.251 by @dependabot[bot] in #1593
  • fix(mlx-bench): request-bound runs and compare on equal sample size by @key4ng in #1590
  • ci(nightly): nightly GHCR releases for SMG + vllm/sglang/trtllm engine images by @key4ng in #1600
  • fix(responses): preserve empty output_text annotations by @zhaowenzi in #1599
  • fix(tokenizer): inject tools_ts_str for Kimi-K2.5 chat templates by @key4ng in #1448
  • fix(readiness): wait for gRPC worker tokenizer autoload before reporting ready by @LorrinWWW in #1605
  • perf(mlx-grpc): coalesce concurrent prefill admission to fix agent-c4 TTFT by @key4ng in #1606
  • docs: update license by @key4ng in #1607
  • chore: remove the Skills API by @slin1237 in #1612
  • feat(protocols): finish background-mode protocol surface (BGM-PR-01) by @slin1237 in #1609
  • chore: remove conversation memory (STM/LTM/STMO) by @slin1237 in #1613
  • perf(kv-index): fuse cache-aware match+insert into a single tree descent by @slin1237 in #1615
  • test(kv-index): concurrency stress test for fused match+insert; qualify equivalence docs by @slin1237 in #1618
  • feat(background): BGM-PR-04 create path + snapshot resolution by @slin1237 in #1614
  • fix(scripts): accept HF repo IDs and extra vLLM args in launch-pd-workers by @slin1237 in #1620
  • ci(claude): switch PR review workflow to claude-fable-5 by @slin1237 in #1623
  • revert: ci(claude): switch PR review workflow to claude-fable-5 by @slin1237 in #1624
  • fix(mesh): track op-id in CRDT send watermark (replica tie-break) by @CatherineSue in #1592
  • fix(grpc): make NIXL PD actually transfer KV cache (relay kv_transfer_params) by @CatherineSue in #1622
  • feat(background): BGM-PR-06 background driver + sweeper by @slin1237 in #1619
  • feat(smg): MeshAdapters composition root - register CRDT engines before gossip, start inbound sync by @CatherineSue in #1626
  • feat(policies): add least_load load-balancing policy by @slin1237 in #1629
  • fix(ci): use NVMe storage for H100 runner workspaces by @key4ng in #1627
  • chore(background): remove background-mode support pending redesign by @slin1237 in #1631
  • ci(vllm): upgrade CI to vllm>=0.22.1 (cu13 stack, flashinfer jit-cache, Qwen3 embedding) by @CatherineSue in #1625
  • refactor(policies): rename least_load to kv_pressure_weight by @slin1237 in #1632
  • fix(policies): stop BucketPolicy adjustment thread on drop by @slin1237 in #1643
  • chore(workflow): remove dead code and trim stale comments by @slin1237 in #1648
  • chore(wasm): trim stale comments by @slin1237 in #1645
  • fix(reasoning): use fresh parser on non-streaming path to avoid shared mutex by @slin1237 in #1642
  • fix(mcp): classify image_generation builtin in session bindings by @slin1237 in #1646
  • feat(policies): route least_load by token-work expected-wait by @slin1237 in #1647
  • feat(grpc): add GetLoads endpoint to the vLLM engine service by @slin1237 in #1630
  • chore(observability): remove dead code and trim stale comments by @slin1237 in #1649
  • chore(worker): remove dead code and trim stale comments by @slin1237 in #1650
  • fix(tool_parser): add (?s) DOTALL to Kimi-K2 regexes for multi-line JSON args by @slin1237 in #1636
  • fix(grpc-client): set num_waiting_uncached_tokens in vLLM SchedulerLoad conversion by @key4ng in #1653
  • fix(tokenizer): key L1 prefix cache on add_special_tokens by @slin1237 in #1637
  • fix(grpc_servicer): fix sglang grpc binds wrong IPC socket when --skip-tokenizer-init by @wufann in #1591
  • docs: add TokenSpeed to engine support docs by @CatherineSue in #1656
  • test(grpc): cover TokenSpeed and SGLang server-info label conversion by @slin1237 in #1654
  • fix(grpc_client): pass through request.seed in vLLM SamplingParams builders by @EntilZha in #1657
  • feat(grpc): add FlushCache and profile RPCs with worker-abstracted admin ops by @slin1237 in #1655
  • fix(reasoning_parser): preserve whitespace in BaseReasoningParser by @EntilZha in #1658
  • fix(ci): give each runner its own hf-xet cache dir by @XinyueZhang369 in #1660
  • fix(python): release the GIL while the router server runs by @slin1237 in #1641
  • ci(trtllm): bump CI wheel to TensorRT-LLM 1.3.0rc18 by @slin1237 in #1663
  • feat(cache-aware): KV-aware imbalance triggers (spread + overload) via load monitor by @slin1237 in #1621
  • ci(sglang): bump CI install to sglang 0.5.12.post1 (cu13 stack) by @slin1237 in #1662
  • feat(grpc): add TokenSpeed FlushCache and profile RPCs by @slin1237 in #1659
  • feat(grpc): drive vLLM MooncakeConnector PD with router-minted kv_transfer_params by @CatherineSue in #1664
  • feat(smg): outbound worker mesh sync - single-writer ownership, publish loop, tombstones, recovery by @CatherineSue in #1661
  • fix(docker): shadow distro-owned PyYAML on TRT-LLM bases before smg install by @slin1237 in #1666
  • fix(messages): Anthropic messages API, preserve system "text" and unknown fields by @ekzhang in #1670
  • fix(e2e): unbreak trtllm lanes broken by mcp 2.0.0a1 by @slin1237 in #1675
  • feat(gateway): Filter /workers by model by @ekzhang in #1672
  • fix(gateway): DP-aware rank suffix by @ekzhang in #1669
  • refactor(protocols): move ListWorkersQuery to the protocols crate by @slin1237 in #1690
  • chore: bump versions for v1.5.0 release by @slin1237 in #1665

New Contributors

Full Changelog: v1.4.1...v1.5.0