Shepherd Model Gateway v1.5.0 Released
Our biggest release yet: priority scheduling with preemption, two new backends (TokenSpeed and MLX) that bring SMG to five supported engines, and complete Responses API protocol coverage.
Priority Scheduler
New request scheduler with admission control, prioritization, and preemption:
- Priority classes: Configurable request classes with policy-driven admission
- Admission control: Inflight tracking, queue management, and capacity slots
- Preemption: Victim search and preempt path for high-priority requests under load
- Capacity-proportional reservations: Floor + share guarantees per priority class, restored automatically after backend capacity recovers
- Full observability: Operational metrics, tracing, and autoscaling gauges via metrics sampler
- Integration-tested: Tests covering rejection, clamping, preemption, and starvation scenarios
Impact: Run mixed workloads on shared capacity. Latency-sensitive interactive traffic preempts batch jobs, priority classes get guaranteed capacity shares, and autoscalers get the signals they need.
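To make the admission and preemption flow concrete, here is a minimal Python sketch. It is not SMG's implementation: the `Admitter` class, its method names, the "higher number means higher priority" convention, and the "evict the lowest-priority inflight request" victim rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Admitter:
    """Toy admission controller: fixed capacity slots plus priority preemption."""

    capacity: int
    inflight: Dict[str, int] = field(default_factory=dict)  # request id -> priority

    def admit(self, request_id: str, priority: int) -> Optional[str]:
        """Try to admit a request.

        Returns "admitted", "preempted:<victim id>", or None when rejected.
        Higher numbers mean higher priority (an assumption of this sketch).
        """
        if len(self.inflight) < self.capacity:
            self.inflight[request_id] = priority
            return "admitted"
        # Victim search: pick the lowest-priority inflight request.
        victim = min(self.inflight, key=self.inflight.get)
        if self.inflight[victim] < priority:
            del self.inflight[victim]
            self.inflight[request_id] = priority
            return f"preempted:{victim}"
        return None  # no strictly lower-priority victim, so reject

    def release(self, request_id: str) -> None:
        """Free the slot held by a finished request."""
        self.inflight.pop(request_id, None)


gate = Admitter(capacity=2)
print(gate.admit("batch-1", priority=0))
print(gate.admit("batch-2", priority=0))
print(gate.admit("chat-1", priority=10))  # evicts a batch request
print(gate.admit("batch-3", priority=0))  # rejected: nothing lower to evict
```

A real scheduler would also queue rejected requests and honor per-class reservations; the sketch only shows the admit-or-preempt decision.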
TokenSpeed Backend
TokenSpeed joins SMG with full first-class gRPC integration:
- Complete gRPC pipeline: Native TokenSpeed gRPC client, Python servicer, and router wiring
- Multimodal support: Vision-language model (VLM) input through the gRPC pipeline
- Admin operations: FlushCache and profile RPCs with worker-abstracted admin ops
- Load reporting: Server-info label conversion for load-aware routing
- Production CI: Full install and GPU E2E coverage from day one
Impact: TokenSpeed deployments get everything SMG offers: cache-aware routing, load balancing, priority scheduling, multimodal, and the full API surface (Chat Completions, Responses, Messages). All of it runs through the same gRPC pipeline that powers SGLang, vLLM, and TensorRT-LLM.
MLX Backend: Apple Silicon Support
SMG now runs on Apple Silicon via MLX:
- Native MlxEngine gRPC proto, Rust client, and Python gRPC servicer
- Per-step admission and own-thread BatchGenerator for stable concurrency
- Coalesced concurrent prefill admission for agent workloads
- macOS CI workflow with E2E coverage
- Nightly benchmark: MLX direct HTTP vs Router+gRPC
Impact: Develop and serve on Mac. The same gateway that fronts your GPU fleet now runs your local MLX models, with the same APIs, same routing, and same tooling.
SMG now supports five backends: vLLM, TokenSpeed, TensorRT-LLM, SGLang, and MLX.
Responses API: Complete Protocol Surface
Full OpenAI Responses API spec coverage across tools, content parts, and typed protocol items.
New hosted tools:
- image_generation: Full streaming events wired across OpenAI, gRPC-regular, and gRPC-harmony routers, with MCP-backed dispatch and output metadata (action, background, format, quality, size)
- web_search (non-preview) with results and return_token_budget
- file_search, computer use, local_shell, containerized shell, apply_patch, custom tools, and namespace tool grouping
Protocol completeness:
- Typed content parts and annotations with round-trip tests
- ConversationRef union typing, item_reference inputs, compaction items
- Full ToolChoice variant coverage
- Fail-fast on unknown content/items: no more silent swallowing
Impact: SMG remains the only gateway with Responses API support for open-source models and third-party vendors, now with the complete protocol surface, including hosted tools like image generation and computer use.
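The fail-fast behavior can be pictured with a short Python sketch. It is illustrative only: `parse_content`, `UnknownContentPart`, and the three-entry type set are hypothetical names and a deliberately small subset, not SMG's protocol crate.

```python
from typing import Any, Dict, List

# Illustrative subset of content-part types; the real protocol defines more.
KNOWN_PART_TYPES = {"input_text", "input_image", "output_text"}


class UnknownContentPart(ValueError):
    """Raised instead of silently dropping a part the parser does not know."""


def parse_content(parts: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Validate every part; reject the whole request on the first unknown type."""
    parsed = []
    for index, part in enumerate(parts):
        part_type = part.get("type")
        if part_type not in KNOWN_PART_TYPES:
            raise UnknownContentPart(
                f"content[{index}]: unknown type {part_type!r}"
            )
        parsed.append(part)
    return parsed
```

Rejecting the request tells the caller exactly which part was not understood, whereas silently dropping it would change what the model sees without any signal.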
Messages API Enhancements
- HTTP router support: Messages API now available on the HTTP router, not just gRPC
- Adaptive thinking: Support adaptive thinking on /v1/messages
- Protocol fidelity: Preserve system text and unknown fields end-to-end
New Model Support
- DeepSeek V3.2 and V4: Chat-template encoders with tools, thinking introspection, and DSML tool call parser
- Kimi K2 / K2.5 / K2.6: Tiktoken tokenizer support, K2.5 vision via gRPC router, chat template tool injection
- Qwen3.5 family: Routed to Qwen3-VL multimodal processor
- GPT-4o and modern OpenAI models: o200k_base tokenizer support
Smarter Load Balancing
New policies and KV-aware routing signals:
- least_load policy: Routes by token-work expected-wait, not just request counts
- kv_pressure_weight policy: Balances on KV cache pressure
- KV-aware imbalance triggers: Spread and overload detection via load monitor
- DP-aware load balancing: Scheduler load info forwarded from gRPC backends, DP logical workers routed via base endpoint
- GetLoads endpoint for the vLLM engine service
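The difference between routing on request counts and routing on token-work expected wait can be shown with a small Python sketch. The `WorkerLoad` fields and the `pending_tokens / tokens_per_second` estimate are assumptions made for illustration; the notes above do not specify how SMG computes expected wait.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkerLoad:
    name: str
    inflight_requests: int
    pending_tokens: int       # token work already queued on the worker (assumed signal)
    tokens_per_second: float  # measured or estimated throughput (assumed signal)

    def expected_wait(self) -> float:
        """Seconds until the queued token work drains at current throughput."""
        return self.pending_tokens / self.tokens_per_second


def pick_least_load(workers: List[WorkerLoad]) -> WorkerLoad:
    """Choose the worker with the smallest expected wait, ignoring request counts."""
    return min(workers, key=WorkerLoad.expected_wait)


workers = [
    # Few requests, but each one is long: 16000 / 2000 = 8.0 s of queued work.
    WorkerLoad("a", inflight_requests=2, pending_tokens=16000, tokens_per_second=2000.0),
    # More requests, but short ones: 3000 / 2000 = 1.5 s of queued work.
    WorkerLoad("b", inflight_requests=5, pending_tokens=3000, tokens_per_second=2000.0),
]
print(pick_least_load(workers).name)  # b
```

A count-based policy would pick worker `a` here because it has fewer inflight requests, even though it holds far more queued work.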
Worker Lifecycle State Machine
Workers now have a real state machine:
- WorkerStatus enum with Pending start semantics and Draining state
- Workflow-driven drain for graceful removal
- Event-driven WorkerMonitor with richer event payloads
- WorkerCapacity tracker with 4-tier capacity sourcing
- max_running_requests surfaced from SGLang /server_info
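As a rough picture of what a worker state machine with Pending and Draining states buys you, consider the Python sketch below. Only the Pending and Draining names come from these notes; the `READY` and `REMOVED` states, the transition table, and the helper functions are assumptions, and the real WorkerStatus enum lives in SMG's Rust protocol crate.

```python
from enum import Enum


class WorkerStatus(Enum):
    PENDING = "pending"    # registered, not yet confirmed healthy
    READY = "ready"        # assumed name for a routable worker
    DRAINING = "draining"  # finishing inflight work, no new requests
    REMOVED = "removed"    # assumed terminal state


# Assumed transition table for this sketch.
ALLOWED_TRANSITIONS = {
    WorkerStatus.PENDING: {WorkerStatus.READY, WorkerStatus.REMOVED},
    WorkerStatus.READY: {WorkerStatus.DRAINING, WorkerStatus.REMOVED},
    WorkerStatus.DRAINING: {WorkerStatus.REMOVED},
    WorkerStatus.REMOVED: set(),
}


def transition(current: WorkerStatus, target: WorkerStatus) -> WorkerStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def accepts_new_requests(status: WorkerStatus) -> bool:
    """Only routable workers receive new traffic; draining workers finish old work."""
    return status is WorkerStatus.READY
```

Compared with a single healthy/unhealthy flag, explicit states let the router distinguish "not ready yet" from "going away", which is what makes graceful removal possible.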
PD Disaggregation Reliability
- NIXL PD now actually transfers KV cache: kv_transfer_params relayed correctly
- vLLM MooncakeConnector PD driven with router-minted kv_transfer_params
- Smooth regular->PD transition: no disruption when enabling disaggregation
- Fixed PD cache-aware policy lifecycle
Golang gRPC Client
Official Go client package joins Python, Rust, and Java SDKs.
Audio Transcriptions
New /v1/audio/transcriptions multipart route.
Performance
- /workers endpoint 5x faster
- Fused cache-aware match+insert: Single tree descent instead of two, with concurrency stress coverage
- K8s discovery: Label selectors pushed to the API server instead of client-side filtering
- Python bindings: GIL released while the router server runs
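The idea behind fusing match and insert can be sketched with a plain per-token trie in Python. This is a simplified stand-in under stated assumptions: SMG's kv-index tree is written in Rust and is not reproduced here, and `Node` and `match_and_insert` are hypothetical names.

```python
from typing import Dict, List


class Node:
    """One trie node; children are keyed by token id."""

    __slots__ = ("children",)

    def __init__(self) -> None:
        self.children: Dict[int, "Node"] = {}


def match_and_insert(root: Node, tokens: List[int]) -> int:
    """Walk the tree once: count the cached prefix and insert the missing suffix.

    Returns the number of leading tokens that were already present.
    """
    node = root
    matched = 0
    for token in tokens:
        child = node.children.get(token)
        if child is None:
            # First miss: from here on every node is new, so nothing else can match.
            child = node.children[token] = Node()
        else:
            matched += 1
        node = child
    return matched


root = Node()
print(match_and_insert(root, [1, 2, 3]))  # 0: nothing cached yet
print(match_and_insert(root, [1, 2, 4]))  # 2: shares the prefix [1, 2]
```

A separate match followed by an insert would descend the same path twice; doing both in one walk halves the traversal work on the hot path.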
Notable Fixes
- Tokenizer: HuggingFace tojson formatting parity, EOS token IDs from generation_config.json, L1 prefix cache keyed on add_special_tokens
- Reasoning: Fresh parser on non-streaming path (no shared mutex), whitespace preserved in BaseReasoningParser
- MCP: Internal self-brought MCP details hidden from final responses, citations/sources/query included when available, forwarded request headers preserved
- gRPC: request.seed passed through to vLLM, backend sampling defaults applied, ResponsesRequest sampling params passed to all backends
- Readiness: Wait for gRPC worker tokenizer autoload before reporting ready
- Build: Debug symbols kept in release binaries
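For the tokenizer fix, it helps to recall that Hugging Face's generation_config.json may store `eos_token_id` either as a single integer or as a list of integers. The Python sketch below normalizes both shapes; `load_eos_token_ids` is a hypothetical helper for illustration, not SMG's tokenizer code.

```python
import json
from pathlib import Path
from typing import List


def load_eos_token_ids(model_dir: str) -> List[int]:
    """Read eos_token_id from generation_config.json, normalizing to a list."""
    path = Path(model_dir) / "generation_config.json"
    if not path.is_file():
        return []  # no generation config shipped with the model
    config = json.loads(path.read_text(encoding="utf-8"))
    eos = config.get("eos_token_id")
    if eos is None:
        return []
    if isinstance(eos, int):
        return [eos]  # single-id form
    return [int(token_id) for token_id in eos]  # list form
```

Models with several end-of-sequence tokens rely on the list form, so a decoder that only reads the tokenizer's single EOS token can miss some stop conditions.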
Infrastructure
- Nightly GHCR releases for SMG + vLLM/SGLang/TRT-LLM engine images
- Dev wheels published to GitHub Releases and wheel index
- CI: vLLM ≥0.22.1 (cu13 stack), SGLang 0.5.12.post1, TensorRT-LLM 1.3.0rc18
- Rust toolchain 1.95.0
Welcome New Contributors
25 first-time contributors landed in this release. Thank you all!
Full Changelog: v1.4.1...v1.5.0
Upgrade now: pip install smg --upgrade
Shepherd your LLM infrastructure with confidence.
What's Changed
- feat(image): add image generation tool support by @TingtingZhou7 in #1057
- refactor(core): move worker domain files from core/ to worker/ by @slin1237 in #1079
- refactor(core): move workflow files to workflow/, delete core/ by @slin1237 in #1081
- refactor: move job_queue.rs from worker/ to workflow/ by @slin1237 in #1083
- fix(python): add router arg fallback disable flag by @gongwei-130 in #1072
- Revert "feat(image): add image generation tool support" by @CatherineSue in #1087
- fix(docker): install smg-grpc-servicer from source in engine images by @key4ng in #1086
- feat(mcp): hide internal self-brought MCP details from final responses by @zhoug9127 in #1061
- refactor(protocols): add WorkerStatus enum to protocol crate by @slin1237 in #1093
- feat(mlx): add MlxEngine gRPC proto and Rust client by @key4ng in #1034
- ci: allow 'ci-approved' label to bypass fork PR approval gate by @slin1237 in #1095
- refactor(worker): expand WorkerEvent with richer payloads by @slin1237 in #1100
- refactor(worker): replace healthy AtomicBool with status AtomicU8 by @slin1237 in #1101
- ci: apply change detection to pull_request_target events by @ai-jz in #1104
- refactor(worker): land Pending start semantics and state machine by @slin1237 in #1102
- refactor(worker): extract health loop into WorkerManager by @slin1237 in #1105
- test(completions): add E2E tests for /v1/completions gRPC endpoint by @vschandramourya in #1021
- refactor(worker): reorganize WorkerRegistry methods and doc API by @slin1237 in #1113
- fix(deps): bump google.golang.org/grpc to v1.79.3 for CVE-2026-33186 by @slin1237 in #1120
- refactor(worker): create event-driven WorkerMonitor (PR 8) by @slin1237 in #1118
- refactor(worker): delete set_healthy and migrate every caller (PR 9) by @slin1237 in #1121
- fix(worker): http_health_check must use the per-worker http_client by @slin1237 in #1125
- refactor(routers): extract shared utilities into routers/common by @slin1237 in #1126
- refactor(worker): canonical metadata API on WorkerMetadata, trait keeps convenience defaults by @slin1237 in #1127
- refactor(worker): extract HashRing into its own module (Finding 5) by @slin1237 in #1129
- ci(mergify): upgrade configuration to current format by @mergify[bot] in #906
- ci: move branch naming check from Mergify to GitHub Actions by @CatherineSue in #1131
- feat(protocols): add serde(flatten) to ChatCompletionRequest for engine-specific fields by @DavidBellamy in #1130
- chore(deps): bump actions/upload-pages-artifact from 4 to 5 by @dependabot[bot] in #1135
- feat(grpc_servicer): implement GetTokenizer RPC for SGLang backend by @CatherineSue in #1136
- feat(grpc_servicer): implement GetTokenizer RPC for vLLM backend by @CatherineSue in #1142
- fix(grpc_servicer): remove snapshot_download from vLLM GetTokenizer by @CatherineSue in #1143
- fix(mcp): avoid relisting tools on resumed responses by @zhaowenzi in #1060
- feat(tokenizer): load EOS token IDs from generation_config.json, strip in StopDecoder by @CatherineSue in #1122
- refactor(grpc): replace PreparationOutput struct with typed enum by @CatherineSue in #1144
- feat(multimodal): add Kimi-K2.5 vision support for gRPC router by @Kangyan-Zhou in #1044
- refactor(grpc): remove filtered_request from Chat PreparationOutput by @CatherineSue in #1145
- feat(tool_parser): add structural tag constraint generation, fix skip_special_tokens derivation by @CatherineSue in #1090
- refactor(worker): canonical runtime API on WorkerRuntime (Finding 4 Part B) by @slin1237 in #1128
- docs(worker): strip redundant inline comments by @slin1237 in #1147
- fix(grpc_client): warn instead of error on conflicting constraints by @CatherineSue in #1146
- feat(python/serve): disable router retries and circuit breaker by default when dp<=1 by @slin1237 in #1148
- refactor(middleware): split middleware.rs into a folder + relocate token_bucket (Finding 2) by @slin1237 in #1151
- docs(entrypoints): audit and fix top-level navigation, README, and getting-started by @slin1237 in #1152
- docs(reliability): audit and fix circuit breakers, rate limiting, health checks, graceful shutdown by @slin1237 in #1153
- docs(extensibility): audit and fix tokenizer cache, MCP, WASM plugin docs by @slin1237 in #1158
- docs(api-ref): audit and fix OpenAI, Responses, Admin, Extensions API reference by @slin1237 in #1157
- docs(ops): audit and fix monitoring, metrics reference, control-plane, auth concepts by @slin1237 in #1156
- docs(config): audit and fix configuration reference + contributing docs by @slin1237 in #1155
- docs(routing): audit and fix PD disaggregation, routing policies, gRPC pipeline by @slin1237 in #1154
- feat(grpc): add golang gRPC client code package by @zetxqx in #874
- docs(workers): audit and fix worker setup, service discovery, external providers, HA by @slin1237 in #1159
- feat(mesh): MeshKV wrapper with explicit namespace handles (v2 Step 1) by @CatherineSue in #1164
- feat(mesh): stream buffer integration with backpressure (v2 Step 2a) by @CatherineSue in #1166
- feat(mesh): centralized gossip loop (v2 Step 2b) by @CatherineSue in #1169
- refactor(mesh): cleanup tasks from Step 2b review by @CatherineSue in #1170
- test(mesh): multi-peer delivery verification + legacy collector removal (v2 Step 2c) by @CatherineSue in #1171
- revert(ci): ci-approved fork PR trigger (security + stability) by @slin1237 in #1175
- docs(contributing): add Code of Conduct and root CONTRIBUTING guide by @slin1237 in #1230
- refactor(tool_parser): rename qwen_coder to qwen_xml and fix model mappings by @key4ng in #1165
- feat(mesh): ChunkAssembler with memory bounds and proto wire-format (v2 Step 3 part 1) by @CatherineSue in #1173
- feat(mcp): Include citations, sources and query when available by @Tobel158 in #1109
- chore(codeowners): add @zhoug9127 and @zhaowenzi to mcp and data_connector by @slin1237 in #1236
- feat(protocols): add skills core types by @slin1237 in #1235
- feat(responses): interrupt approval-required MCP tool calls by @zhaowenzi in #1174
- fix(mcp): centralize non-stream internal privacy filtering by @zhoug9127 in #1168
- chore(ci): bump Rust toolchain to 1.95.0 to unblock CI by @slin1237 in #1243
- feat(protocols): integrate skills request surfaces by @slin1237 in #1242
- feat(skills): add crate and config scaffolding by @slin1237 in #1249
- chore: remove ut from main by @slin1237 in #1252
- fix(ci): pin vllm<0.19.1 to restore embedding test correctness by @slin1237 in #1251
- feat(skills): parse SKILL.md and sidecar metadata by @slin1237 in #1253
- feat(skills): normalize skill bundle zip archives by @slin1237 in #1255
- feat(protocols): BGM-PR-01 additive Responses API fields for background mode by @slin1237 in #1244
- feat(mesh): stream chunking wiring in gossip loop (v2 Step 3 part 2) by @CatherineSue in #1254
- fix(grpc): forward scheduler load info for DP-aware load balancing by @Kangyan-Zhou in #1116
- feat(skills): add storage contracts and schema migrations by @slin1237 in #1257
- perf(mesh): stream entry data as Bytes instead of Vec by @CatherineSue in #1259
- chore(mesh): drop spec section refs from code comments by @CatherineSue in #1261
- feat(gateway): add /v1/audio/transcriptions multipart route by @shenxiul in #1256
- feat(memory): add conversation memory header contract no-op hook by @spalimpaaces-star in #1134
- fix(e2e): pin oracle-custom schema to latest migration by @slin1237 in #1268
- fix(workflow): skip tokenizer autoload for HTTP-mode workers by @key4ng in #1265
- fix(grpc_client): pass ResponsesRequest sampling params to all backends by @key4ng in #1161
- ci(pr-naming): allow uppercase letters in branch names by @CatherineSue in #1280
- fix(model-gateway): format conversation memory header assertions by @CatherineSue in #1279
- feat(protocols): implement P3 EasyInputMessage.phase by @slin1237 in #1281
- feat(protocols): implement P4 IncludeField web_search_call variants by @slin1237 in #1274
- fix(protocols): implement P8 Reasoning.summary SummaryTextContent by @slin1237 in #1273
- feat(protocols): implement T1 file_search tool declaration by @slin1237 in #1272
- feat(skills): add in-memory runtime and filesystem blob cache by @slin1237 in #1262
- fix(memory): skip memory injection when header is absent by @spalimpaaces-star in #1277
- feat(protocols): implement P2 top-level ResponsesRequest fields by @slin1237 in #1278
- feat(protocols): implement P1 content parts + typed annotations by @slin1237 in #1275
- fix(skills): tighten local storage semantics by @slin1237 in #1283
- feat(mesh): route targeted stream entries on server-side sync streams by @CatherineSue in #1260
- ci: run full PR test suite for dependabot PRs by @CatherineSue in #1292
- chore(deps): bump actions/setup-go from 5 to 6 by @dependabot[bot] in #1286
- perf(mesh): carry drain stream entries as Bytes through RoundBatch by @CatherineSue in #1285
- test(mesh): stream chunking integration tests (v2 Step 3 part 3) by @CatherineSue in #1289
- chore(deps): update wasm-encoder requirement from 0.246 to 0.247 by @dependabot[bot] in #1290
- chore(deps): update lru requirement from 0.16.2 to 0.17.0 by @dependabot[bot] in #1291
- feat(smg): validate skills startup config by @slin1237 in #1294
- chore(deps): update npyz requirement from 0.8 to 0.9 by @dependabot[bot] in #1293
- feat(protocols): implement P7 ToolChoice variant coverage by @slin1237 in #1276
- feat(mesh): EpochMaxWins merge helper for rate-limit counters by @CatherineSue in #1295
- fix(protocols): implement P5 fail-fast on unknown content/items by @slin1237 in #1298
- feat(gateway): implement R1 OpenAI Responses content-part transformer regression coverage by @slin1237 in #1300
- feat(protocols): implement T3 web_search non-preview tool + WebSearchCall.results by @slin1237 in #1304
- feat(protocols): implement P6 ConversationRef union typing by @slin1237 in #1306
- feat(protocols): implement T4 image_generation tool + call item by @slin1237 in #1303
- refactor(protocols): strip audit-process residue from comments by @slin1237 in #1310
- feat(mesh): auto-register config: CRDT prefix and mirror v1 AppStore on receive by @CatherineSue in #1299
- refactor(protocols): extract responses tests to crates/protocols/tests/ by @slin1237 in #1309
- feat(memory): add runtime-gated conversation memory substrate by @zhoug9127 in #1149
- refactor(gateway): strip scope-bleed guardrails from Wave-1/2 Responses router code by @slin1237 in #1311
- refactor(protocols): restore Chat/Responses ToolChoice separation invariant by @slin1237 in #1314
- refactor(openai): split provider module into submodules by @VadivelanU in #1308
- feat(smg): add tenant resolution request metadata by @slin1237 in #1312
- feat(smg): add WorkerSyncAdapter for v2 worker: CRDT namespace by @CatherineSue in #1313
- fix(multimodal): extract images from tool role messages by @ConnorLi96 in #1307
- test(embeddings): enable sglang embedding correctness on h100 by @CatherineSue in #1326
- fix(metrics): add path label to HTTP response metrics by @zhaowenzi in #1328
- feat(data-connector): BGM-PR-02 background repository trait + schema + migrations by @slin1237 in #1258
- feat(protocols): implement I3 compaction input+output item by @slin1237 in #1302
- fix(bench): handle Compaction variant in routing_allocation_bench by @slin1237 in #1331
- test(embeddings): restore skip_for_runtime("sglang") on correctness test by @slin1237 in #1330
- feat(skills): add create upload admin endpoints by @slin1237 in #1332
- feat(protocols): implement T2 computer / computer_use_preview tools by @slin1237 in #1305
- feat(protocols): implement T8 custom tool + call/output items by @slin1237 in #1301
- feat(protocols): implement T9 namespace tool grouping by @slin1237 in #1337
- test(responses): add negative tests for silent swallow (E7) by @slin1237 in #1336
- test(responses): add annotation round-trip tests (E5) by @slin1237 in #1334
- feat(skills): add read and list admin endpoints by @slin1237 in #1335
- feat(smg): add RateLimitSyncAdapter for v2 rl: CRDT namespace by @CatherineSue in #1327
- feat(skills): add patch and delete admin endpoints by @slin1237 in #1344
- feat(protocols): implement T6 shell (containerized) tool + call/output items by @slin1237 in #1342
- feat(background): BGM-PR-03 config + AppContext + memory repository by @slin1237 in #1340
- feat(protocols): implement I2 item_reference input item by @slin1237 in #1341
- feat(protocols): implement T7 apply_patch tool + call/output items by @slin1237 in #1339
- fix(background): keep raw_response.status in sync with terminal transitions by @slin1237 in #1348
- feat(protocols): implement T5 local_shell tool + call/output items by @slin1237 in #1338
- fix(data-connector): scope startup migrations to core history by @slin1237 in #1347
- feat(gateway): add image_generation infrastructure for hosted-tool MCP plumbing (R6.1) by @slin1237 in #1355
- feat(smg): resolve tenant aliases in request metadata by @slin1237 in #1354
- feat(smg): TreeSyncAdapter foundation - tenant-delta fast path by @CatherineSue in #1345
- feat(openai): wire image_generation streaming events in Responses router (R6.2) by @slin1237 in #1356
- feat(grpc-regular): wire image_generation streaming events in Responses router (R6.4) by @slin1237 in #1358
- feat(grpc-harmony): wire image_generation streaming events (R6.3) by @slin1237 in #1359
- fix(background): share memory storage + validate all retry fields + builder setter by @slin1237 in #1349
- fix(skills): harden skills API review findings by @slin1237 in #1366
- feat(mcp): preserve forwarded request headers in MCP request context by @RohanSogani in #1333
- feat(mm): share multimodal config registry keyed by tokenizer UUID by @key4ng in #1248
- fix(grpc-harmony): dispatch image_generation as function tool for gpt-oss (R6.8) by @slin1237 in #1368
- fix(openai): image_generation result round-trip + streaming event ordering (R6.7) by @slin1237 in #1369
- feat(mcp): add hosted-tool overrides helpers and wire image_generation dispatch (R6.6) by @slin1237 in #1370
- fix(openai): broaden output_item.done suppression gate to all tool-call item types (R6.7b) by @slin1237 in #1371
- refactor: strip R6.x audit-process residue from MCP + router comments (C3) by @slin1237 in #1374
- fix(openai-router): suppress duplicate output-item envelopes on native passthrough (R6.7c) by @slin1237 in #1376
- feat(protocols): ImageGenerationCall output metadata β action/background/output_format/quality/size (R6.9) by @slin1237 in #1377
- feat(protocols): implement T11 mcp input roundtrip by @slin1237 in #1350
- fix(helm): add clusterWide flag for opt-in cluster-scoped service discovery by @MohanKumar21 in #1379
- feat(tokenizer): add DeepSeek V3.2 and V4 chat-template encoders by @CatherineSue in #1373
- fix(tokenizer): plumb tools, thinking introspection, and V3.2 iteration for DeepSeek encoders by @CatherineSue in #1381
- feat(tool_parser): add DeepSeek V3.2 and V4 DSML tool call parser by @key4ng in #1030
- test(e2e): add mock MCP server + image_generation integration tests (R6.5) by @slin1237 in #1365
- feat(skills): resolve request skill manifests by @slin1237 in #1382
- test(e2e): lock mcp_call shape for plain-MCP image_generation tool by @slin1237 in #1391
- fix(mcp): forward request.user to hosted-tool MCP dispatch args by @slin1237 in #1389
- feat(mlx): add Python gRPC servicer for MLX backend by @key4ng in #1099
- ci(docker): move image builds to cpu-e5 runner and parallelize make by @key4ng in #1372
- refactor(smg): move skill resolution into skills crate by @slin1237 in #1393
- fix(mlx-grpc): drain-and-batch to avoid BatchGenerator rope crash at concurrency≥4 by @key4ng in #1414
- fix(router): require model for IGW generate requests by @jshanson7 in #1420
- refactor(server): extract /v1/audio/transcriptions multipart parsing into FromRequest by @slin1237 in #1426
- fix(mlx-grpc): per-step admission + own-thread BatchGenerator (eliminates chat-c4 TTFT regression and Stream(gpu,1) crash) by @key4ng in #1427
- docs: fix image reference paths by @Weili-0234 in #1428
- feat(smg): TreeHandle on CacheAwarePolicy - adapter consumes policy-owned hash membership by @CatherineSue in #1364
- chore(deps): update lru requirement from 0.17.0 to 0.18.0 by @dependabot[bot] in #1408
- feat(mesh): publish tree:req: repair on unknown tenant deltas by @CatherineSue in #1444
- fix(openai-bridge): address PR #1429 review follow-ups (alias miss, error state, wire shape) by @slin1237 in #1442
- feat(mesh): apply known remote tenant deltas instead of just logging by @CatherineSue in #1446
- refactor(openai-bridge): finish descriptor pattern, fix FileSearch routing gap by @slin1237 in #1450
- refactor(openai-bridge): close last new-builtin coupling leaks (descriptor + connect wrapper) by @slin1237 in #1455
- feat(gateway): Smooth transition from regular->PD by @ekzhang in #1445
- feat(openai-protocol): add ResponseTool::as_str() by @slin1237 in #1457
- feat(sse): add shared SSE codec module by @XinyueZhang369 in #987
- feat(realtime-api): WebRTC signaling handler and OpenAIRouter integra… by @pallasathena92 in #748
- feat(kv-index): iter_entries lazy walker for Tree and TokenTree by @CatherineSue in #1454
- fix(deps): drop default features on hf-hub; opt into rustls-tls by @krishung5 in #1459
- fix(protocols): add continuous_usage_stats to StreamOptions by @Abhishek8108 in #1460
- feat(mesh): tree-repair pages, responder side of the protocol by @CatherineSue in #1458
- chore(deps): update str0m requirement from 0.18 to 0.19 by @dependabot[bot] in #1453
- feat(mesh): tree-repair pages, receiver side of the protocol by @CatherineSue in #1461
- feat(grpc): apply backend sampling defaults by @CatherineSue in #1462
- fix(grpc): tighten sampling defaults metadata by @CatherineSue in #1463
- docs: Update README by @lightseek-bot in #1467
- ci(release): add dev wheel workflow -> GitHub Releases by @key4ng in #1471
- ci(release): publish dev wheels to whl index by @zhyncs in #1473
- smg now needs either launch, server or serve - those examples are all… by @surak in #1474
- fix(ci): pin nixl<1.1.0 to avoid cu13 libcudart on CUDA 12 runners by @key4ng in #1477
- fix(tokenizer): match HuggingFace tojson formatting by @CatherineSue in #1478
- test(e2e): stabilize Mistral required streaming by @CatherineSue in #1483
- refactor(mesh): remove v1 mesh from gateway and public API by @CatherineSue in #1476
- chore(deps): update tokenizers requirement from 0.22.0 to 0.23.1 by @dependabot[bot] in #1410
- chore(deps): update wasm-encoder requirement from 0.247 to 0.248 by @dependabot[bot] in #1409
- chore(gprc): bump up grpc-servicer version to 0.5.3 by @YouNeedCryDear in #1443
- perf(discovery): push label selectors to K8s API server by @slin1237 in #1488
- fix(protocols): accept return_token_budget for web_search tool by @Tobel158 in #1490
- refactor(mesh): delete v1 internals (-10k lines) by @CatherineSue in #1486
- fix(worker): honor explicit worker connection schemes by @heymrbox in #1485
- feat(worker): add Draining state with workflow-driven drain by @slin1237 in #1491
- refactor(grpc-client): extract shared Channel builder by @slin1237 in #1494
- refactor(grpc-client): extract AbortOnDropStream + engine basics by @slin1237 in #1498
- ci: skip unit-tests on PRs without Rust changes by @MohanKumar21 in #1493
- docs: end-to-end example for KV-events cache-aware routing by @yetone in #1497
- test(mlx): add MLX backend E2E tests + macos-latest CI workflow by @key4ng in #1398
- fix(mlx-grpc): runtime error in mlx multi streams when loading gpt-oss model series by @zach-li-sudo in #1489
- feat(mlx-bench): nightly benchmark - MLX direct HTTP vs Router+gRPC by @key4ng in #1399
- feat(grpc): add TokenSpeed gRPC client and router wiring (Part 1/3) by @yetone in #1351
- feat(protocols): add serde(flatten) to GenerateRequest for engine-specific fields by @devvrit-mirendil in #1503
- test(realtime): pin to gpt-realtime GA alias instead of retired preview snapshot by @key4ng in #1504
- feat(grpc_servicer): add TokenSpeed servicer (Part 2/3) by @key4ng in #1464
- ci(tokenspeed): add CI install + GPU e2e coverage (Part 3/3) by @key4ng in #1465
- feat(reasoning_parser): add none reasoning parser by @zhyncs in #1505
- refactor(mesh): rename gossip files, consolidate transport limits by @CatherineSue in #1495
- fix(mesh): off-by-one in RetryManager backoff slot lookup by @CatherineSue in #1511
- refactor(mesh): move chunking + chunk_assembler under transport/ by @CatherineSue in #1514
- refactor(mesh): extract shared SyncStream message helpers by @CatherineSue in #1513
- ci(unit-tests): use `python -m pip` for vision golden deps by @key4ng in #1518
- fix(realtime): drop OpenAI-Beta header rejected by GA Realtime API by @key4ng in #1516
- test(e2e): drop tokenspeed marker from structurally-unsupported tests by @key4ng in #1517
- fix(mesh): inbound sender drops targeted entries before peer is learned by @CatherineSue in #1519
- refactor(reasoning_parser): replace NoneParser with PassthroughParser by @CatherineSue in #1512
- fix(pd): Fix PD cache-aware policy lifecycle by @aurickq in #1520
- fix(pd): route DP logical workers via base endpoint by @aurickq in #1522
- chore(deps): update opentelemetry-proto requirement from 0.31 to 0.32 by @dependabot[bot] in #1506
- feat(gateway): Add Messages API to HTTP router by @ekzhang in #1521
- fix(observability): include addr in metrics bind failure by @zhyncs in #1525
- fix(observability): use expect for metrics bind failure by @zhyncs in #1528
- ci: cancel stale PR test runs by @zhyncs in #1530
- fix(service-discovery): drop http:// prefix so dual-probe detects gRPC by @CatherineSue in #1523
- feat(workflow): surface max_running_requests on SGLang HTTP /server_info by @slin1237 in #1529
- feat(worker): add max_running_requests accessor on Worker trait by @slin1237 in #1526
- fix(mesh): wire EpochMaxWins into CRDT merge by @CatherineSue in #1469
- chore(deps): update wasm-encoder requirement from 0.248 to 0.250 by @dependabot[bot] in #1508
- fix(ci): restore per-commit CI runs on main by @CatherineSue in #1536
- ci(claude-review): flag PRs missing required template sections by @CatherineSue in #1537
- feat(worker): add WorkerCapacity tracker (4-tier capacity sourcing) by @slin1237 in #1534
- refactor(mesh): move tree path hashing out of mesh into kv-index by @CatherineSue in #1535
- feat(grpc): add TokenSpeed multimodal (VLM) input support by @chenht2022 in #1515
- fix(tokenizer): support Kimi K2/K2.5/K2.6 tiktoken models by @CatherineSue in #1482
- fix(docs): replace HTML img tags with markdown image syntax for correct MkDocs path rewriting by @1195343015 in #1543
- fix(tokenizer): add o200k_base support for GPT-4o and modern OpenAI models by @1195343015 in #1542
- feat(scheduler): foundation - class, config, policy (PR 2 M1) by @CatherineSue in #1541
- refactor(mesh): extract NamespaceCrdtEngine + LwwEngine + transitional EpochMaxWins by @CatherineSue in #1539
- feat(scheduler): admission data layer - inflight, queue, slots (PR 2 M2a) by @CatherineSue in #1546
- feat(messages): support adaptive thinking on /v1/messages by @CatherineSue in #1555
- feat(scheduler): engine - PriorityScheduler, admit, dispatcher (PR 2 M2b) by @CatherineSue in #1550
- ci: fix YAML comment swallowing echo in cancel-merged-pr-tests by @CatherineSue in #1556
- refactor(mesh): typed RateLimitEngine, drop EpochMaxWinsLegacyEngine by @CatherineSue in #1549
- ci: remove cancel-merged-pr-tests workflow by @CatherineSue in #1558
- chore(grpc): release smg-grpc-proto 0.4.8 by @CatherineSue in #1562
- fix(metrics): add /v1/messages to route_to_endpoint match by @ekzhang in #1564
- feat(multimodal): route Qwen3.5 family to Qwen3-VL processor by @chenht2022 in #1563
- refactor(scheduler): remove queue_size_per_slot multiplier by @CatherineSue in #1559
- refactor(mesh): make OperationLog strategy-free; engines own merge/compact by @CatherineSue in #1560
- perf(mesh): size apply_remote_ops dedup to the incoming batch by @CatherineSue in #1569
- ci(trtllm): install pre-release wheel from PyPI instead of building from source by @key4ng in #1501
- test(e2e): cache workers + tp=1 for gpt-oss-20b/Qwen2.5-14B + move responses & chat to 1-GPU by @key4ng in #1502
- perf(gateway): Make `/workers` 5x faster by @ekzhang in #1576
- fix(cache-aware): gate hash_index hot-path writes behind explicit flag by @ekzhang in #1565
- feat(mesh): gossip CRDT operation log over the wire (d-3a) by @CatherineSue in #1570
- feat(scheduler): preemption - victim search, preempt path, body wrapper (PR 2 M3) by @CatherineSue in #1572
- feat(scheduler): wire priority admission middleware + cancel propagation (PR 2 M4) by @CatherineSue in #1577
- feat(scheduler): operational metrics + tracing (PR 2 M5a) by @CatherineSue in #1579
- fix(build): keep debug symbols in release binaries by @slin1237 in #1582
- feat(scheduler): autoscaling gauges via metrics sampler (PR 2 M5b) by @CatherineSue in #1580
- test(scheduler): integration wiring + fallback guards (PR 2 M6) by @CatherineSue in #1581
- fix(scheduler): restore reservations after backend capacity recovers by @CatherineSue in #1587
- test(scheduler): add admission micro-benchmark and concurrent load harness by @slin1237 in #1584
- docs(scheduler): document the priority scheduler by @slin1237 in #1583
- test(scheduler): integration tests for rejection, clamp, preemption, starvation by @slin1237 in #1585
- feat(mesh): per-key CRDT send watermark + ack + retry by @CatherineSue in #1586
- feat(scheduler): capacity-proportional reservations (floor + share) by @CatherineSue in #1588
- fix(bindings): expose consistent_hashing and prefix_hash in python CLI by @deokjinkim in #1589
- chore(deps): update wasm-encoder requirement from 0.250 to 0.251 by @dependabot[bot] in #1593
- fix(mlx-bench): request-bound runs and compare on equal sample size by @key4ng in #1590
- ci(nightly): nightly GHCR releases for SMG + vllm/sglang/trtllm engine images by @key4ng in #1600
- fix(responses): preserve empty output_text annotations by @zhaowenzi in #1599
- fix(tokenizer): inject tools_ts_str for Kimi-K2.5 chat templates by @key4ng in #1448
- fix(readiness): wait for gRPC worker tokenizer autoload before reporting ready by @LorrinWWW in #1605
- perf(mlx-grpc): coalesce concurrent prefill admission to fix agent-c4 TTFT by @key4ng in #1606
- docs: update license by @key4ng in #1607
- chore: remove the Skills API by @slin1237 in #1612
- feat(protocols): finish background-mode protocol surface (BGM-PR-01) by @slin1237 in #1609
- chore: remove conversation memory (STM/LTM/STMO) by @slin1237 in #1613
- perf(kv-index): fuse cache-aware match+insert into a single tree descent by @slin1237 in #1615
- test(kv-index): concurrency stress test for fused match+insert; qualify equivalence docs by @slin1237 in #1618
- feat(background): BGM-PR-04 create path + snapshot resolution by @slin1237 in #1614
- fix(scripts): accept HF repo IDs and extra vLLM args in launch-pd-workers by @slin1237 in #1620
- ci(claude): switch PR review workflow to claude-fable-5 by @slin1237 in #1623
- revert: ci(claude): switch PR review workflow to claude-fable-5 by @slin1237 in #1624
- fix(mesh): track op-id in CRDT send watermark (replica tie-break) by @CatherineSue in #1592
- fix(grpc): make NIXL PD actually transfer KV cache (relay kv_transfer_params) by @CatherineSue in #1622
- feat(background): BGM-PR-06 background driver + sweeper by @slin1237 in #1619
- feat(smg): MeshAdapters composition root - register CRDT engines before gossip, start inbound sync by @CatherineSue in #1626
- feat(policies): add least_load load-balancing policy by @slin1237 in #1629
- fix(ci): use NVMe storage for H100 runner workspaces by @key4ng in #1627
- chore(background): remove background-mode support pending redesign by @slin1237 in #1631
- ci(vllm): upgrade CI to vllm>=0.22.1 (cu13 stack, flashinfer jit-cache, Qwen3 embedding) by @CatherineSue in #1625
- refactor(policies): rename least_load to kv_pressure_weight by @slin1237 in #1632
- fix(policies): stop BucketPolicy adjustment thread on drop by @slin1237 in #1643
- chore(workflow): remove dead code and trim stale comments by @slin1237 in #1648
- chore(wasm): trim stale comments by @slin1237 in #1645
- fix(reasoning): use fresh parser on non-streaming path to avoid shared mutex by @slin1237 in #1642
- fix(mcp): classify image_generation builtin in session bindings by @slin1237 in #1646
- feat(policies): route least_load by token-work expected-wait by @slin1237 in #1647
- feat(grpc): add GetLoads endpoint to the vLLM engine service by @slin1237 in #1630
- chore(observability): remove dead code and trim stale comments by @slin1237 in #1649
- chore(worker): remove dead code and trim stale comments by @slin1237 in #1650
- fix(tool_parser): add (?s) DOTALL to Kimi-K2 regexes for multi-line JSON args by @slin1237 in #1636
- fix(grpc-client): set num_waiting_uncached_tokens in vLLM SchedulerLoad conversion by @key4ng in #1653
- fix(tokenizer): key L1 prefix cache on add_special_tokens by @slin1237 in #1637
- fix(grpc_servicer): fix sglang grpc binds wrong IPC socket when --skip-tokenizer-init by @wufann in #1591
- docs: add TokenSpeed to engine support docs by @CatherineSue in #1656
- test(grpc): cover TokenSpeed and SGLang server-info label conversion by @slin1237 in #1654
- fix(grpc_client): pass through request.seed in vLLM SamplingParams builders by @EntilZha in #1657
- feat(grpc): add FlushCache and profile RPCs with worker-abstracted admin ops by @slin1237 in #1655
- fix(reasoning_parser): preserve whitespace in BaseReasoningParser by @EntilZha in #1658
- fix(ci): give each runner its own hf-xet cache dir by @XinyueZhang369 in #1660
- fix(python): release the GIL while the router server runs by @slin1237 in #1641
- ci(trtllm): bump CI wheel to TensorRT-LLM 1.3.0rc18 by @slin1237 in #1663
- feat(cache-aware): KV-aware imbalance triggers (spread + overload) via load monitor by @slin1237 in #1621
- ci(sglang): bump CI install to sglang 0.5.12.post1 (cu13 stack) by @slin1237 in #1662
- feat(grpc): add TokenSpeed FlushCache and profile RPCs by @slin1237 in #1659
- feat(grpc): drive vLLM MooncakeConnector PD with router-minted kv_transfer_params by @CatherineSue in #1664
- feat(smg): outbound worker mesh sync - single-writer ownership, publish loop, tombstones, recovery by @CatherineSue in #1661
- fix(docker): shadow distro-owned PyYAML on TRT-LLM bases before smg install by @slin1237 in #1666
- fix(messages): Anthropic messages API, preserve system "text" and unknown fields by @ekzhang in #1670
- fix(e2e): unbreak trtllm lanes broken by mcp 2.0.0a1 by @slin1237 in #1675
- feat(gateway): Filter /workers by model by @ekzhang in #1672
- fix(gateway): DP-aware rank suffix by @ekzhang in #1669
- refactor(protocols): move ListWorkersQuery to the protocols crate by @slin1237 in #1690
- chore: bump versions for v1.5.0 release by @slin1237 in #1665
New Contributors
- @TingtingZhou7 made their first contribution in #1057
- @ai-jz made their first contribution in #1104
- @DavidBellamy made their first contribution in #1130
- @zetxqx made their first contribution in #874
- @shenxiul made their first contribution in #1256
- @spalimpaaces-star made their first contribution in #1134
- @VadivelanU made their first contribution in #1308
- @RohanSogani made their first contribution in #1333
- @MohanKumar21 made their first contribution in #1379
- @Weili-0234 made their first contribution in #1428
- @krishung5 made their first contribution in #1459
- @Abhishek8108 made their first contribution in #1460
- @zhyncs made their first contribution in #1473
- @surak made their first contribution in #1474
- @heymrbox made their first contribution in #1485
- @yetone made their first contribution in #1497
- @zach-li-sudo made their first contribution in #1489
- @devvrit-mirendil made their first contribution in #1503
- @aurickq made their first contribution in #1520
- @chenht2022 made their first contribution in #1515
- @1195343015 made their first contribution in #1543
- @deokjinkim made their first contribution in #1589
- @LorrinWWW made their first contribution in #1605
- @wufann made their first contribution in #1591
- @EntilZha made their first contribution in #1657
Full Changelog: v1.4.1...v1.5.0