Release v1.5.0 · lightseekorg/smg

🚀 Shepherd Model Gateway v1.5.0 Released

Our biggest release yet: priority scheduling with preemption, two new backends — TokenSpeed and MLX — bringing SMG to five supported engines, and complete Responses API protocol coverage.

🎛️ Priority Scheduler

New request scheduler with admission control, prioritization, and preemption:

Priority classes — Configurable request classes with policy-driven admission
Admission control — Inflight tracking, queue management, and capacity slots
Preemption — Victim search and preempt path for high-priority requests under load
Capacity-proportional reservations — Floor + share guarantees per priority class, restored automatically after backend capacity recovers
Full observability — Operational metrics, tracing, and autoscaling gauges via metrics sampler
Battle-tested — Integration tests covering rejection, clamping, preemption, and starvation scenarios

Impact: Run mixed workloads on shared capacity. Latency-sensitive interactive traffic preempts batch jobs, priority classes get guaranteed capacity shares, and autoscalers get the signals they need.
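
To make the reservation and preemption ideas above concrete, here is a minimal Python sketch. It is not SMG's implementation: the `PriorityClass` fields, the `max(floor, share * capacity)` rule, and the "evict the lowest-priority in-flight request" victim search are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PriorityClass:
    name: str
    priority: int   # assumption: a higher number means more important
    floor: int      # minimum number of reserved slots
    share: float    # fraction of total backend capacity


def reserved_slots(cls: PriorityClass, capacity: int) -> int:
    """Hypothetical floor + share rule: take the larger of the fixed floor
    and the proportional share, never exceeding total capacity."""
    return min(capacity, max(cls.floor, int(cls.share * capacity)))


@dataclass
class Scheduler:
    capacity: int
    inflight: list = field(default_factory=list)  # (request_id, PriorityClass)

    def admit(self, request_id: str, cls: PriorityClass):
        """Return ('admitted', None), ('preempted', victim_id) or ('rejected', None)."""
        if len(self.inflight) < self.capacity:
            self.inflight.append((request_id, cls))
            return ("admitted", None)
        # Victim search: only requests strictly below the newcomer qualify.
        victims = [e for e in self.inflight if e[1].priority < cls.priority]
        if not victims:
            return ("rejected", None)
        victim = min(victims, key=lambda e: e[1].priority)
        self.inflight.remove(victim)
        self.inflight.append((request_id, cls))
        return ("preempted", victim[0])
```

In this sketch, an interactive request arriving at a full scheduler evicts a batch request, while a further batch request is rejected because nothing below it is running. Because the reservation is recomputed from the current capacity, it grows back on its own when capacity recovers.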

⚡ TokenSpeed Backend

TokenSpeed joins SMG with full first-class gRPC integration:

Complete gRPC pipeline — Native TokenSpeed gRPC client, Python servicer, and router wiring
Multimodal support — Vision-language model (VLM) input through the gRPC pipeline
Admin operations — FlushCache and profile RPCs with worker-abstracted admin ops
Load reporting — Server-info label conversion for load-aware routing
Production CI — Full install and GPU E2E coverage from day one

Impact: TokenSpeed deployments get everything SMG offers — cache-aware routing, load balancing, priority scheduling, the full API surface (Chat Completions, Responses, Messages), and multimodal — through the same battle-tested gRPC pipeline that powers SGLang, vLLM, and TensorRT-LLM.

🍎 MLX Backend: Apple Silicon Support

SMG now runs on Apple Silicon via MLX:

Native MlxEngine gRPC proto, Rust client, and Python gRPC servicer
Per-step admission, with BatchGenerator running on its own thread, for stable concurrency
Coalesced concurrent prefill admission for agent workloads
macOS CI workflow with E2E coverage
Nightly benchmark: MLX direct HTTP vs Router+gRPC

Impact: Develop and serve on Mac. The same gateway that fronts your GPU fleet now runs your local MLX models — same APIs, same routing, same tooling.

SMG now supports five backends: vLLM, TokenSpeed, TensorRT-LLM, SGLang, and MLX.

🛠️ Responses API: Complete Protocol Surface

Full OpenAI Responses API spec coverage — tools, content parts, and typed protocol items:

New hosted tools:

image_generation — Full streaming events wired across OpenAI, gRPC-regular, and gRPC-harmony routers, with MCP-backed dispatch and output metadata (action, background, format, quality, size)
web_search (non-preview) with results and return_token_budget
file_search, computer use, local_shell, containerized shell, apply_patch, custom tools, and namespace tool grouping

Protocol completeness:

Typed content parts and annotations with round-trip tests
ConversationRef union typing, item_reference inputs, compaction items
Full ToolChoice variant coverage
Fail-fast on unknown content/items — no more silent swallowing

Impact: SMG remains the only gateway with Responses API support for open-source models and third-party vendors — now with the complete protocol surface, including hosted tools like image generation and computer use.
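
The fail-fast behavior listed under protocol completeness can be illustrated with a small Python sketch. The `KNOWN_PART_TYPES` set is only a subset of the Responses API content-part types, and both functions are hypothetical stand-ins, not SMG's parser; the lenient variant is one plausible form of the "silent swallowing" the release removes.

```python
# Illustrative subset of Responses API content-part types.
KNOWN_PART_TYPES = {"input_text", "input_image", "output_text"}


class UnknownContentPart(ValueError):
    """Raised when a request carries a content part the parser does not know."""


def parse_parts_lenient(parts):
    """Assumed old-style behavior: unknown parts are dropped without notice."""
    return [p for p in parts if p.get("type") in KNOWN_PART_TYPES]


def parse_parts_strict(parts):
    """Fail-fast behavior: reject the whole request on the first unknown part."""
    for index, part in enumerate(parts):
        if part.get("type") not in KNOWN_PART_TYPES:
            raise UnknownContentPart(
                f"content[{index}]: unknown type {part.get('type')!r}"
            )
    return list(parts)
```

With the strict variant the caller gets an error that names the offending item instead of a response computed from a quietly truncated input.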

💬 Messages API Enhancements

HTTP router support — Messages API now available on the HTTP router, not just gRPC
Adaptive thinking — Support adaptive thinking on /v1/messages
Protocol fidelity — Preserve system text and unknown fields end-to-end

🤖 New Model Support

DeepSeek V3.2 and V4 — Chat-template encoders with tools, thinking introspection, and DSML tool call parser
Kimi K2 / K2.5 / K2.6 — Tiktoken tokenizer support, K2.5 vision via gRPC router, chat template tool injection
Qwen3.5 family — Routed to Qwen3-VL multimodal processor
GPT-4o and modern OpenAI models — o200k_base tokenizer support

⚖️ Smarter Load Balancing

New policies and KV-aware routing signals:

least_load policy — Routes by token-work expected-wait, not just request counts
kv_pressure_weight policy — Balances on KV cache pressure
KV-aware imbalance triggers — Spread and overload detection via load monitor
DP-aware load balancing — Scheduler load info forwarded from gRPC backends, DP logical workers routed via base endpoint
GetLoads endpoint for the vLLM engine service
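
The difference between routing on request counts and routing on expected wait can be sketched as follows. The `WorkerLoad` fields and the `pending_tokens / tokens_per_second` estimate are assumptions made for this example; the release notes do not specify how least_load computes its expected-wait signal.

```python
from dataclasses import dataclass


@dataclass
class WorkerLoad:
    url: str
    inflight_requests: int
    pending_tokens: int        # assumed signal: outstanding token work
    tokens_per_second: float   # assumed signal: observed throughput


def expected_wait(worker: WorkerLoad) -> float:
    """Rough seconds until the worker clears its current token backlog."""
    return worker.pending_tokens / max(worker.tokens_per_second, 1e-9)


def pick_least_load(workers):
    """Choose the worker with the smallest estimated wait."""
    return min(workers, key=expected_wait)


def pick_fewest_requests(workers):
    """Baseline for comparison: choose by request count only."""
    return min(workers, key=lambda w: w.inflight_requests)
```

A worker holding two very long generations can be a worse target than one holding five short ones; a count-based policy picks the former, while the expected-wait estimate picks the latter.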

🔄 Worker Lifecycle State Machine

Workers now have a real state machine:

WorkerStatus enum with Pending start semantics and Draining state
Workflow-driven drain for graceful removal
Event-driven WorkerMonitor with richer event payloads
WorkerCapacity tracker with 4-tier capacity sourcing
max_running_requests surfaced from SGLang /server_info
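
As a rough picture of what a worker state machine looks like, the Python sketch below models a status enum with an explicit transition table. Only Pending and Draining are named in these notes; the `READY` and `REMOVED` states and the specific transitions allowed here are hypothetical.

```python
from enum import Enum


class WorkerStatus(Enum):
    PENDING = "pending"     # registered, not yet proven healthy
    READY = "ready"         # hypothetical name for a routable worker
    DRAINING = "draining"   # finishing in-flight work, taking no new requests
    REMOVED = "removed"     # hypothetical terminal state


# Assumed transition table, for illustration only.
ALLOWED = {
    WorkerStatus.PENDING: {WorkerStatus.READY, WorkerStatus.DRAINING},
    WorkerStatus.READY: {WorkerStatus.PENDING, WorkerStatus.DRAINING},
    WorkerStatus.DRAINING: {WorkerStatus.REMOVED},
    WorkerStatus.REMOVED: set(),
}


def transition(current: WorkerStatus, target: WorkerStatus) -> WorkerStatus:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def is_routable(status: WorkerStatus) -> bool:
    """Only fully ready workers receive new traffic in this sketch."""
    return status is WorkerStatus.READY
```

Compared with a single healthy/unhealthy flag, an explicit table makes it impossible for a draining worker to slip back into rotation by accident.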

🔧 PD Disaggregation Reliability

NIXL PD now actually transfers KV cache — kv_transfer_params relayed correctly
vLLM MooncakeConnector PD driven with router-minted kv_transfer_params
Smooth regular→PD transition — no disruption when enabling disaggregation
Fixed PD cache-aware policy lifecycle

🌐 Golang gRPC Client

An official Go client package joins the Python, Rust, and Java SDKs.

🎙️ Audio Transcriptions

New /v1/audio/transcriptions multipart route.

📈 Performance

/workers endpoint 5x faster
Fused cache-aware match+insert — Single tree descent instead of two, with concurrency stress coverage
K8s discovery — Label selectors pushed to the API server instead of client-side filtering
Python bindings — GIL released while the router server runs
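
The fused match+insert item can be pictured with a toy token trie in Python. This is a simplified stand-in, not the kv-index data structure: it has no tenants, no eviction, and no concurrency handling. It only shows how one walk can both report the cached prefix length and insert the missing suffix.

```python
class Node:
    """One trie node; children are keyed by token ID."""
    __slots__ = ("children",)

    def __init__(self):
        self.children = {}


def match_and_insert(root: Node, tokens) -> int:
    """Single descent: return how many leading tokens were already present,
    then extend the tree from the point where the walk stopped."""
    node = root
    matched = 0
    for tok in tokens:
        child = node.children.get(tok)
        if child is None:
            break
        node = child
        matched += 1
    # The walk already ended at the deepest matching node, so the insert
    # continues from there instead of starting again at the root.
    for tok in tokens[matched:]:
        new_node = Node()
        node.children[tok] = new_node
        node = new_node
    return matched
```

A separate match followed by a separate insert would traverse the shared prefix twice; the fused form traverses it once, which matters most when prefixes are long.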

🐛 Notable Fixes

Tokenizer: HuggingFace tojson formatting parity, EOS token IDs from generation_config.json, L1 prefix cache keyed on add_special_tokens
Reasoning: Fresh parser on non-streaming path (no shared mutex), whitespace preserved in BaseReasoningParser
MCP: Internal self-brought MCP details hidden from final responses, citations/sources/query included when available, forwarded request headers preserved
gRPC: request.seed passed through to vLLM, backend sampling defaults applied, ResponsesRequest sampling params passed to all backends
Readiness: Wait for gRPC worker tokenizer autoload before reporting ready
Build: Debug symbols kept in release binaries
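
For the tokenizer fix involving generation_config.json, the Python sketch below shows one way to read EOS token IDs and strip them from the end of a decoded sequence. In Hugging Face generation configs `eos_token_id` may be a single integer or a list; the helper names and the trailing-strip rule here are assumptions, not SMG's StopDecoder logic.

```python
import json


def load_eos_token_ids(generation_config_text: str) -> set:
    """Normalize eos_token_id (absent, int, or list of ints) into a set."""
    config = json.loads(generation_config_text)
    eos = config.get("eos_token_id")
    if eos is None:
        return set()
    if isinstance(eos, int):
        return {eos}
    return set(eos)


def strip_trailing_eos(token_ids, eos_ids):
    """Drop any run of EOS tokens at the end of the sequence."""
    end = len(token_ids)
    while end > 0 and token_ids[end - 1] in eos_ids:
        end -= 1
    return token_ids[:end]
```

Handling the list form matters for models that declare several end-of-sequence tokens, where checking only a single tokenizer-level EOS ID would leave the others in the output.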

🏗️ Infrastructure

Nightly GHCR releases for SMG + vLLM/SGLang/TRT-LLM engine images
Dev wheels published to GitHub Releases and wheel index
CI: vLLM ≥0.22.1 (cu13 stack), SGLang 0.5.12.post1, TensorRT-LLM 1.3.0rc18
Rust toolchain 1.95.0

🙏 Welcome New Contributors

25 first-time contributors landed in this release — thank you all!

Full Changelog: v1.4.1...v1.5.0

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

What's Changed

feat(image): add image generation tool support by @TingtingZhou7 in #1057
refactor(core): move worker domain files from core/ to worker/ by @slin1237 in #1079
refactor(core): move workflow files to workflow/, delete core/ by @slin1237 in #1081
refactor: move job_queue.rs from worker/ to workflow/ by @slin1237 in #1083
fix(python): add router arg fallback disable flag by @gongwei-130 in #1072
Revert "feat(image): add image generation tool support" by @CatherineSue in #1087
fix(docker): install smg-grpc-servicer from source in engine images by @key4ng in #1086
feat(mcp): hide internal self-brought MCP details from final responses by @zhoug9127 in #1061
refactor(protocols): add WorkerStatus enum to protocol crate by @slin1237 in #1093
feat(mlx): add MlxEngine gRPC proto and Rust client by @key4ng in #1034
ci: allow 'ci-approved' label to bypass fork PR approval gate by @slin1237 in #1095
refactor(worker): expand WorkerEvent with richer payloads by @slin1237 in #1100
refactor(worker): replace healthy AtomicBool with status AtomicU8 by @slin1237 in #1101
ci: apply change detection to pull_request_target events by @ai-jz in #1104
refactor(worker): land Pending start semantics and state machine by @slin1237 in #1102
refactor(worker): extract health loop into WorkerManager by @slin1237 in #1105
test(completions): add E2E tests for /v1/completions gRPC endpoint by @vschandramourya in #1021
refactor(worker): reorganize WorkerRegistry methods and doc API by @slin1237 in #1113
fix(deps): bump google.golang.org/grpc to v1.79.3 for CVE-2026-33186 by @slin1237 in #1120
refactor(worker): create event-driven WorkerMonitor (PR 8) by @slin1237 in #1118
refactor(worker): delete set_healthy and migrate every caller (PR 9) by @slin1237 in #1121
fix(worker): http_health_check must use the per-worker http_client by @slin1237 in #1125
refactor(routers): extract shared utilities into routers/common by @slin1237 in #1126
refactor(worker): canonical metadata API on WorkerMetadata, trait keeps convenience defaults by @slin1237 in #1127
refactor(worker): extract HashRing into its own module (Finding 5) by @slin1237 in #1129
ci(mergify): upgrade configuration to current format by @mergify[bot] in #906
ci: move branch naming check from Mergify to GitHub Actions by @CatherineSue in #1131
feat(protocols): add serde(flatten) to ChatCompletionRequest for engine-specific fields by @DavidBellamy in #1130
chore(deps): bump actions/upload-pages-artifact from 4 to 5 by @dependabot[bot] in #1135
feat(grpc_servicer): implement GetTokenizer RPC for SGLang backend by @CatherineSue in #1136
feat(grpc_servicer): implement GetTokenizer RPC for vLLM backend by @CatherineSue in #1142
fix(grpc_servicer): remove snapshot_download from vLLM GetTokenizer by @CatherineSue in #1143
fix(mcp): avoid relisting tools on resumed responses by @zhaowenzi in #1060
feat(tokenizer): load EOS token IDs from generation_config.json, strip in StopDecoder by @CatherineSue in #1122
refactor(grpc): replace PreparationOutput struct with typed enum by @CatherineSue in #1144
feat(multimodal): add Kimi-K2.5 vision support for gRPC router by @Kangyan-Zhou in #1044
refactor(grpc): remove filtered_request from Chat PreparationOutput by @CatherineSue in #1145
feat(tool_parser): add structural tag constraint generation, fix skip_special_tokens derivation by @CatherineSue in #1090
refactor(worker): canonical runtime API on WorkerRuntime (Finding 4 Part B) by @slin1237 in #1128
docs(worker): strip redundant inline comments by @slin1237 in #1147
fix(grpc_client): warn instead of error on conflicting constraints by @CatherineSue in #1146
feat(python/serve): disable router retries and circuit breaker by default when dp<=1 by @slin1237 in #1148
refactor(middleware): split middleware.rs into a folder + relocate token_bucket (Finding 2) by @slin1237 in #1151
docs(entrypoints): audit and fix top-level navigation, README, and getting-started by @slin1237 in #1152
docs(reliability): audit and fix circuit breakers, rate limiting, health checks, graceful shutdown by @slin1237 in #1153
docs(extensibility): audit and fix tokenizer cache, MCP, WASM plugin docs by @slin1237 in #1158
docs(api-ref): audit and fix OpenAI, Responses, Admin, Extensions API reference by @slin1237 in #1157
docs(ops): audit and fix monitoring, metrics reference, control-plane, auth concepts by @slin1237 in #1156
docs(config): audit and fix configuration reference + contributing docs by @slin1237 in #1155
docs(routing): audit and fix PD disaggregation, routing policies, gRPC pipeline by @slin1237 in #1154
feat(grpc): add golang gRPC client code package by @zetxqx in #874
docs(workers): audit and fix worker setup, service discovery, external providers, HA by @slin1237 in #1159
feat(mesh): MeshKV wrapper with explicit namespace handles (v2 Step 1) by @CatherineSue in #1164
feat(mesh): stream buffer integration with backpressure (v2 Step 2a) by @CatherineSue in #1166
feat(mesh): centralized gossip loop (v2 Step 2b) by @CatherineSue in #1169
refactor(mesh): cleanup tasks from Step 2b review by @CatherineSue in #1170
test(mesh): multi-peer delivery verification + legacy collector removal (v2 Step 2c) by @CatherineSue in #1171
revert(ci): ci-approved fork PR trigger (security + stability) by @slin1237 in #1175
docs(contributing): add Code of Conduct and root CONTRIBUTING guide by @slin1237 in #1230
refactor(tool_parser): rename qwen_coder to qwen_xml and fix model mappings by @key4ng in #1165
feat(mesh): ChunkAssembler with memory bounds and proto wire-format (v2 Step 3 part 1) by @CatherineSue in #1173
feat(mcp): Include citations, sources and query when available by @Tobel158 in #1109
chore(codeowners): add @zhoug9127 and @zhaowenzi to mcp and data_connector by @slin1237 in #1236
feat(protocols): add skills core types by @slin1237 in #1235
feat(responses): interrupt approval-required MCP tool calls by @zhaowenzi in #1174
fix(mcp): centralize non-stream internal privacy filtering by @zhoug9127 in #1168
chore(ci): bump Rust toolchain to 1.95.0 to unblock CI by @slin1237 in #1243
feat(protocols): integrate skills request surfaces by @slin1237 in #1242
feat(skills): add crate and config scaffolding by @slin1237 in #1249
chore: remove ut from main by @slin1237 in #1252
fix(ci): pin vllm<0.19.1 to restore embedding test correctness by @slin1237 in #1251
feat(skills): parse SKILL.md and sidecar metadata by @slin1237 in #1253
feat(skills): normalize skill bundle zip archives by @slin1237 in #1255
feat(protocols): BGM-PR-01 additive Responses API fields for background mode by @slin1237 in #1244
feat(mesh): stream chunking wiring in gossip loop (v2 Step 3 part 2) by @CatherineSue in #1254
fix(grpc): forward scheduler load info for DP-aware load balancing by @Kangyan-Zhou in #1116
feat(skills): add storage contracts and schema migrations by @slin1237 in #1257
perf(mesh): stream entry data as Bytes instead of Vec by @CatherineSue in #1259
chore(mesh): drop spec section refs from code comments by @CatherineSue in #1261
feat(gateway): add /v1/audio/transcriptions multipart route by @shenxiul in #1256
feat(memory): add conversation memory header contract no-op hook by @spalimpaaces-star in #1134
fix(e2e): pin oracle-custom schema to latest migration by @slin1237 in #1268
fix(workflow): skip tokenizer autoload for HTTP-mode workers by @key4ng in #1265
fix(grpc_client): pass ResponsesRequest sampling params to all backends by @key4ng in #1161
ci(pr-naming): allow uppercase letters in branch names by @CatherineSue in #1280
fix(model-gateway): format conversation memory header assertions by @CatherineSue in #1279
feat(protocols): implement P3 EasyInputMessage.phase by @slin1237 in #1281
feat(protocols): implement P4 IncludeField web_search_call variants by @slin1237 in #1274
fix(protocols): implement P8 Reasoning.summary SummaryTextContent by @slin1237 in #1273
feat(protocols): implement T1 file_search tool declaration by @slin1237 in #1272
feat(skills): add in-memory runtime and filesystem blob cache by @slin1237 in #1262
fix(memory): skip memory injection when header is absent by @spalimpaaces-star in #1277
feat(protocols): implement P2 top-level ResponsesRequest fields by @slin1237 in #1278
feat(protocols): implement P1 content parts + typed annotations by @slin1237 in #1275
fix(skills): tighten local storage semantics by @slin1237 in #1283
feat(mesh): route targeted stream entries on server-side sync streams by @CatherineSue in #1260
ci: run full PR test suite for dependabot PRs by @CatherineSue in #1292
chore(deps): bump actions/setup-go from 5 to 6 by @dependabot[bot] in #1286
perf(mesh): carry drain stream entries as Bytes through RoundBatch by @CatherineSue in #1285
test(mesh): stream chunking integration tests (v2 Step 3 part 3) by @CatherineSue in #1289
chore(deps): update wasm-encoder requirement from 0.246 to 0.247 by @dependabot[bot] in #1290
chore(deps): update lru requirement from 0.16.2 to 0.17.0 by @dependabot[bot] in #1291
feat(smg): validate skills startup config by @slin1237 in #1294
chore(deps): update npyz requirement from 0.8 to 0.9 by @dependabot[bot] in #1293
feat(protocols): implement P7 ToolChoice variant coverage by @slin1237 in #1276
feat(mesh): EpochMaxWins merge helper for rate-limit counters by @CatherineSue in #1295
fix(protocols): implement P5 fail-fast on unknown content/items by @slin1237 in #1298
feat(gateway): implement R1 OpenAI Responses content-part transformer regression coverage by @slin1237 in #1300
feat(protocols): implement T3 web_search non-preview tool + WebSearchCall.results by @slin1237 in #1304
feat(protocols): implement P6 ConversationRef union typing by @slin1237 in #1306
feat(protocols): implement T4 image_generation tool + call item by @slin1237 in #1303
refactor(protocols): strip audit-process residue from comments by @slin1237 in #1310
feat(mesh): auto-register config: CRDT prefix and mirror v1 AppStore on receive by @CatherineSue in #1299
refactor(protocols): extract responses tests to crates/protocols/tests/ by @slin1237 in #1309
feat(memory): add runtime-gated conversation memory substrate by @zhoug9127 in #1149
refactor(gateway): strip scope-bleed guardrails from Wave-1/2 Responses router code by @slin1237 in #1311
refactor(protocols): restore Chat/Responses ToolChoice separation invariant by @slin1237 in #1314
refactor(openai): split provider module into submodules by @VadivelanU in #1308
feat(smg): add tenant resolution request metadata by @slin1237 in #1312
feat(smg): add WorkerSyncAdapter for v2 worker: CRDT namespace by @CatherineSue in #1313
fix(multimodal): extract images from tool role messages by @ConnorLi96 in #1307
test(embeddings): enable sglang embedding correctness on h100 by @CatherineSue in #1326
fix(metrics): add path label to HTTP response metrics by @zhaowenzi in #1328
feat(data-connector): BGM-PR-02 background repository trait + schema + migrations by @slin1237 in #1258
feat(protocols): implement I3 compaction input+output item by @slin1237 in #1302
fix(bench): handle Compaction variant in routing_allocation_bench by @slin1237 in #1331
test(embeddings): restore skip_for_runtime("sglang") on correctness test by @slin1237 in #1330
feat(skills): add create upload admin endpoints by @slin1237 in #1332
feat(protocols): implement T2 computer / computer_use_preview tools by @slin1237 in #1305
feat(protocols): implement T8 custom tool + call/output items by @slin1237 in #1301
feat(protocols): implement T9 namespace tool grouping by @slin1237 in #1337
test(responses): add negative tests for silent swallow (E7) by @slin1237 in #1336
test(responses): add annotation round-trip tests (E5) by @slin1237 in #1334
feat(skills): add read and list admin endpoints by @slin1237 in #1335
feat(smg): add RateLimitSyncAdapter for v2 rl: CRDT namespace by @CatherineSue in #1327
feat(skills): add patch and delete admin endpoints by @slin1237 in #1344
feat(protocols): implement T6 shell (containerized) tool + call/output items by @slin1237 in #1342
feat(background): BGM-PR-03 config + AppContext + memory repository by @slin1237 in #1340
feat(protocols): implement I2 item_reference input item by @slin1237 in #1341
feat(protocols): implement T7 apply_patch tool + call/output items by @slin1237 in #1339
fix(background): keep raw_response.status in sync with terminal transitions by @slin1237 in #1348
feat(protocols): implement T5 local_shell tool + call/output items by @slin1237 in #1338
fix(data-connector): scope startup migrations to core history by @slin1237 in #1347
feat(gateway): add image_generation infrastructure for hosted-tool MCP plumbing (R6.1) by @slin1237 in #1355
feat(smg): resolve tenant aliases in request metadata by @slin1237 in #1354
feat(smg): TreeSyncAdapter foundation — tenant-delta fast path by @CatherineSue in #1345
feat(openai): wire image_generation streaming events in Responses router (R6.2) by @slin1237 in #1356
feat(grpc-regular): wire image_generation streaming events in Responses router (R6.4) by @slin1237 in #1358
feat(grpc-harmony): wire image_generation streaming events (R6.3) by @slin1237 in #1359
fix(background): share memory storage + validate all retry fields + builder setter by @slin1237 in #1349
fix(skills): harden skills API review findings by @slin1237 in #1366
feat(mcp): preserve forwarded request headers in MCP request context by @RohanSogani in #1333
feat(mm): share multimodal config registry keyed by tokenizer UUID by @key4ng in #1248
fix(grpc-harmony): dispatch image_generation as function tool for gpt-oss (R6.8) by @slin1237 in #1368
fix(openai): image_generation result round-trip + streaming event ordering (R6.7) by @slin1237 in #1369
feat(mcp): add hosted-tool overrides helpers and wire image_generation dispatch (R6.6) by @slin1237 in #1370
fix(openai): broaden output_item.done suppression gate to all tool-call item types (R6.7b) by @slin1237 in #1371
refactor: strip R6.x audit-process residue from MCP + router comments (C3) by @slin1237 in #1374
fix(openai-router): suppress duplicate output-item envelopes on native passthrough (R6.7c) by @slin1237 in #1376
feat(protocols): ImageGenerationCall output metadata — action/background/output_format/quality/size (R6.9) by @slin1237 in #1377
feat(protocols): implement T11 mcp input roundtrip by @slin1237 in #1350
fix(helm): add clusterWide flag for opt-in cluster-scoped service discovery by @MohanKumar21 in #1379
feat(tokenizer): add DeepSeek V3.2 and V4 chat-template encoders by @CatherineSue in #1373
fix(tokenizer): plumb tools, thinking introspection, and V3.2 iteration for DeepSeek encoders by @CatherineSue in #1381
feat(tool_parser): add DeepSeek V3.2 and V4 DSML tool call parser by @key4ng in #1030
test(e2e): add mock MCP server + image_generation integration tests (R6.5) by @slin1237 in #1365
feat(skills): resolve request skill manifests by @slin1237 in #1382
test(e2e): lock mcp_call shape for plain-MCP image_generation tool by @slin1237 in #1391
fix(mcp): forward request.user to hosted-tool MCP dispatch args by @slin1237 in #1389
feat(mlx): add Python gRPC servicer for MLX backend by @key4ng in #1099
ci(docker): move image builds to cpu-e5 runner and parallelize make by @key4ng in #1372
refactor(smg): move skill resolution into skills crate by @slin1237 in #1393
fix(mlx-grpc): drain-and-batch to avoid BatchGenerator rope crash at concurrency≥4 by @key4ng in #1414
fix(router): require model for IGW generate requests by @jshanson7 in #1420
refactor(server): extract /v1/audio/transcriptions multipart parsing into FromRequest by @slin1237 in #1426
fix(mlx-grpc): per-step admission + own-thread BatchGenerator (eliminates chat-c4 TTFT regression and Stream(gpu,1) crash) by @key4ng in #1427
docs: fix image reference paths by @Weili-0234 in #1428
feat(smg): TreeHandle on CacheAwarePolicy — adapter consumes policy-owned hash membership by @CatherineSue in #1364
chore(deps): update lru requirement from 0.17.0 to 0.18.0 by @dependabot[bot] in #1408
feat(mesh): publish tree:req: repair on unknown tenant deltas by @CatherineSue in #1444
fix(openai-bridge): address PR #1429 review follow-ups (alias miss, error state, wire shape) by @slin1237 in #1442
feat(mesh): apply known remote tenant deltas instead of just logging by @CatherineSue in #1446
refactor(openai-bridge): finish descriptor pattern, fix FileSearch routing gap by @slin1237 in #1450
refactor(openai-bridge): close last new-builtin coupling leaks (descriptor + connect wrapper) by @slin1237 in #1455
feat(gateway): Smooth transition from regular->PD by @ekzhang in #1445
feat(openai-protocol): add ResponseTool::as_str() by @slin1237 in #1457
feat(sse): add shared SSE codec module by @XinyueZhang369 in #987
feat(realtime-api): WebRTC signaling handler and OpenAIRouter integra… by @pallasathena92 in #748
feat(kv-index): iter_entries lazy walker for Tree and TokenTree by @CatherineSue in #1454
fix(deps): drop default features on hf-hub; opt into rustls-tls by @krishung5 in #1459
fix(protocols): add continuous_usage_stats to StreamOptions by @Abhishek8108 in #1460
feat(mesh): tree-repair pages, responder side of the protocol by @CatherineSue in #1458
chore(deps): update str0m requirement from 0.18 to 0.19 by @dependabot[bot] in #1453
feat(mesh): tree-repair pages, receiver side of the protocol by @CatherineSue in #1461
feat(grpc): apply backend sampling defaults by @CatherineSue in #1462
fix(grpc): tighten sampling defaults metadata by @CatherineSue in #1463
docs: Update README by @lightseek-bot in #1467
ci(release): add dev wheel workflow → GitHub Releases by @key4ng in #1471
ci(release): publish dev wheels to whl index by @zhyncs in #1473
smg now needs either launch, server or serve - those examples are all… by @surak in #1474
fix(ci): pin nixl<1.1.0 to avoid cu13 libcudart on CUDA 12 runners by @key4ng in #1477
fix(tokenizer): match HuggingFace tojson formatting by @CatherineSue in #1478
test(e2e): stabilize Mistral required streaming by @CatherineSue in #1483
refactor(mesh): remove v1 mesh from gateway and public API by @CatherineSue in #1476
chore(deps): update tokenizers requirement from 0.22.0 to 0.23.1 by @dependabot[bot] in #1410
chore(deps): update wasm-encoder requirement from 0.247 to 0.248 by @dependabot[bot] in #1409
chore(gprc): bump up grpc-servicer version to 0.5.3 by @YouNeedCryDear in #1443
perf(discovery): push label selectors to K8s API server by @slin1237 in #1488
fix(protocols): accept return_token_budget for web_search tool by @Tobel158 in #1490
refactor(mesh): delete v1 internals (-10k lines) by @CatherineSue in #1486
fix(worker): honor explicit worker connection schemes by @heymrbox in #1485
feat(worker): add Draining state with workflow-driven drain by @slin1237 in #1491
refactor(grpc-client): extract shared Channel builder by @slin1237 in #1494
refactor(grpc-client): extract AbortOnDropStream + engine basics by @slin1237 in #1498
ci: skip unit-tests on PRs without Rust changes by @MohanKumar21 in #1493
docs: end-to-end example for KV-events cache-aware routing by @yetone in #1497
test(mlx): add MLX backend E2E tests + macos-latest CI workflow by @key4ng in #1398
fix(mlx-grpc): runtime error in mlx multi streams when loading gpt-oss model series by @zach-li-sudo in #1489
feat(mlx-bench): nightly benchmark — MLX direct HTTP vs Router+gRPC by @key4ng in #1399
feat(grpc): add TokenSpeed gRPC client and router wiring (Part 1/3) by @yetone in #1351
feat(protocols): add serde(flatten) to GenerateRequest for engine-specific fields by @devvrit-mirendil in #1503
test(realtime): pin to gpt-realtime GA alias instead of retired preview snapshot by @key4ng in #1504
feat(grpc_servicer): add TokenSpeed servicer (Part 2/3) by @key4ng in #1464
ci(tokenspeed): add CI install + GPU e2e coverage (Part 3/3) by @key4ng in #1465
feat(reasoning_parser): add none reasoning parser by @zhyncs in #1505
refactor(mesh): rename gossip files, consolidate transport limits by @CatherineSue in #1495
fix(mesh): off-by-one in RetryManager backoff slot lookup by @CatherineSue in #1511
refactor(mesh): move chunking + chunk_assembler under transport/ by @CatherineSue in #1514
refactor(mesh): extract shared SyncStream message helpers by @CatherineSue in #1513
ci(unit-tests): use python -m pip for vision golden deps by @key4ng in #1518
fix(realtime): drop OpenAI-Beta header rejected by GA Realtime API by @key4ng in #1516
test(e2e): drop tokenspeed marker from structurally-unsupported tests by @key4ng in #1517
fix(mesh): inbound sender drops targeted entries before peer is learned by @CatherineSue in #1519
refactor(reasoning_parser): replace NoneParser with PassthroughParser by @CatherineSue in #1512
fix(pd): Fix PD cache-aware policy lifecycle by @aurickq in #1520
fix(pd): route DP logical workers via base endpoint by @aurickq in #1522
chore(deps): update opentelemetry-proto requirement from 0.31 to 0.32 by @dependabot[bot] in #1506
feat(gateway): Add Messages API to HTTP router by @ekzhang in #1521
fix(observability): include addr in metrics bind failure by @zhyncs in #1525
fix(observability): use expect for metrics bind failure by @zhyncs in #1528
ci: cancel stale PR test runs by @zhyncs in #1530
fix(service-discovery): drop http:// prefix so dual-probe detects gRPC by @CatherineSue in #1523
feat(workflow): surface max_running_requests on SGLang HTTP /server_info by @slin1237 in #1529
feat(worker): add max_running_requests accessor on Worker trait by @slin1237 in #1526
fix(mesh): wire EpochMaxWins into CRDT merge by @CatherineSue in #1469
chore(deps): update wasm-encoder requirement from 0.248 to 0.250 by @dependabot[bot] in #1508
fix(ci): restore per-commit CI runs on main by @CatherineSue in #1536
ci(claude-review): flag PRs missing required template sections by @CatherineSue in #1537
feat(worker): add WorkerCapacity tracker (4-tier capacity sourcing) by @slin1237 in #1534
refactor(mesh): move tree path hashing out of mesh into kv-index by @CatherineSue in #1535
feat(grpc): add TokenSpeed multimodal (VLM) input support by @chenht2022 in #1515
fix(tokenizer): support Kimi K2/K2.5/K2.6 tiktoken models by @CatherineSue in #1482
fix(docs): replace HTML img tags with markdown image syntax for correct MkDocs path rewriting by @1195343015 in #1543
fix(tokenizer): add o200k_base support for GPT-4o and modern OpenAI models by @1195343015 in #1542
feat(scheduler): foundation — class, config, policy (PR 2 M1) by @CatherineSue in #1541
refactor(mesh): extract NamespaceCrdtEngine + LwwEngine + transitional EpochMaxWins by @CatherineSue in #1539
feat(scheduler): admission data layer — inflight, queue, slots (PR 2 M2a) by @CatherineSue in #1546
feat(messages): support adaptive thinking on /v1/messages by @CatherineSue in #1555
feat(scheduler): engine — PriorityScheduler, admit, dispatcher (PR 2 M2b) by @CatherineSue in #1550
ci: fix YAML comment swallowing echo in cancel-merged-pr-tests by @CatherineSue in #1556
refactor(mesh): typed RateLimitEngine, drop EpochMaxWinsLegacyEngine by @CatherineSue in #1549
ci: remove cancel-merged-pr-tests workflow by @CatherineSue in #1558
chore(grpc): release smg-grpc-proto 0.4.8 by @CatherineSue in #1562
fix(metrics): add /v1/messages to route_to_endpoint match by @ekzhang in #1564
feat(multimodal): route Qwen3.5 family to Qwen3-VL processor by @chenht2022 in #1563
refactor(scheduler): remove queue_size_per_slot multiplier by @CatherineSue in #1559
refactor(mesh): make OperationLog strategy-free; engines own merge/compact by @CatherineSue in #1560
perf(mesh): size apply_remote_ops dedup to the incoming batch by @CatherineSue in #1569
ci(trtllm): install pre-release wheel from PyPI instead of building from source by @key4ng in #1501
test(e2e): cache workers + tp=1 for gpt-oss-20b/Qwen2.5-14B + move responses & chat to 1-GPU by @key4ng in #1502
perf(gateway): Make /workers 5x faster by @ekzhang in #1576
fix(cache-aware): gate hash_index hot-path writes behind explicit flag by @ekzhang in #1565
feat(mesh): gossip CRDT operation log over the wire (d-3a) by @CatherineSue in #1570
feat(scheduler): preemption — victim search, preempt path, body wrapper (PR 2 M3) by @CatherineSue in #1572
feat(scheduler): wire priority admission middleware + cancel propagation (PR 2 M4) by @CatherineSue in #1577
feat(scheduler): operational metrics + tracing (PR 2 M5a) by @CatherineSue in #1579
fix(build): keep debug symbols in release binaries by @slin1237 in #1582
feat(scheduler): autoscaling gauges via metrics sampler (PR 2 M5b) by @CatherineSue in #1580
test(scheduler): integration wiring + fallback guards (PR 2 M6) by @CatherineSue in #1581
fix(scheduler): restore reservations after backend capacity recovers by @CatherineSue in #1587
test(scheduler): add admission micro-benchmark and concurrent load harness by @slin1237 in #1584
docs(scheduler): document the priority scheduler by @slin1237 in #1583
test(scheduler): integration tests for rejection, clamp, preemption, starvation by @slin1237 in #1585
feat(mesh): per-key CRDT send watermark + ack + retry by @CatherineSue in #1586
feat(scheduler): capacity-proportional reservations (floor + share) by @CatherineSue in #1588
fix(bindings): expose consistent_hashing and prefix_hash in python CLI by @deokjinkim in #1589
chore(deps): update wasm-encoder requirement from 0.250 to 0.251 by @dependabot[bot] in #1593
fix(mlx-bench): request-bound runs and compare on equal sample size by @key4ng in #1590
ci(nightly): nightly GHCR releases for SMG + vllm/sglang/trtllm engine images by @key4ng in #1600
fix(responses): preserve empty output_text annotations by @zhaowenzi in #1599
fix(tokenizer): inject tools_ts_str for Kimi-K2.5 chat templates by @key4ng in #1448
fix(readiness): wait for gRPC worker tokenizer autoload before reporting ready by @LorrinWWW in #1605
perf(mlx-grpc): coalesce concurrent prefill admission to fix agent-c4 TTFT by @key4ng in #1606
docs: update license by @key4ng in #1607
chore: remove the Skills API by @slin1237 in #1612
feat(protocols): finish background-mode protocol surface (BGM-PR-01) by @slin1237 in #1609
chore: remove conversation memory (STM/LTM/STMO) by @slin1237 in #1613
perf(kv-index): fuse cache-aware match+insert into a single tree descent by @slin1237 in #1615
test(kv-index): concurrency stress test for fused match+insert; qualify equivalence docs by @slin1237 in #1618
feat(background): BGM-PR-04 create path + snapshot resolution by @slin1237 in #1614
fix(scripts): accept HF repo IDs and extra vLLM args in launch-pd-workers by @slin1237 in #1620
ci(claude): switch PR review workflow to claude-fable-5 by @slin1237 in #1623
revert: ci(claude): switch PR review workflow to claude-fable-5 by @slin1237 in #1624
fix(mesh): track op-id in CRDT send watermark (replica tie-break) by @CatherineSue in #1592
fix(grpc): make NIXL PD actually transfer KV cache (relay kv_transfer_params) by @CatherineSue in #1622
feat(background): BGM-PR-06 background driver + sweeper by @slin1237 in #1619
feat(smg): MeshAdapters composition root — register CRDT engines before gossip, start inbound sync by @CatherineSue in #1626
feat(policies): add least_load load-balancing policy by @slin1237 in #1629
fix(ci): use NVMe storage for H100 runner workspaces by @key4ng in #1627
chore(background): remove background-mode support pending redesign by @slin1237 in #1631
ci(vllm): upgrade CI to vllm>=0.22.1 (cu13 stack, flashinfer jit-cache, Qwen3 embedding) by @CatherineSue in #1625
refactor(policies): rename least_load to kv_pressure_weight by @slin1237 in #1632
fix(policies): stop BucketPolicy adjustment thread on drop by @slin1237 in #1643
chore(workflow): remove dead code and trim stale comments by @slin1237 in #1648
chore(wasm): trim stale comments by @slin1237 in #1645
fix(reasoning): use fresh parser on non-streaming path to avoid shared mutex by @slin1237 in #1642
fix(mcp): classify image_generation builtin in session bindings by @slin1237 in #1646
feat(policies): route least_load by token-work expected-wait by @slin1237 in #1647
feat(grpc): add GetLoads endpoint to the vLLM engine service by @slin1237 in #1630
chore(observability): remove dead code and trim stale comments by @slin1237 in #1649
chore(worker): remove dead code and trim stale comments by @slin1237 in #1650
fix(tool_parser): add (?s) DOTALL to Kimi-K2 regexes for multi-line JSON args by @slin1237 in #1636
fix(grpc-client): set num_waiting_uncached_tokens in vLLM SchedulerLoad conversion by @key4ng in #1653
fix(tokenizer): key L1 prefix cache on add_special_tokens by @slin1237 in #1637
fix(grpc_servicer): fix sglang grpc binds wrong IPC socket when --skip-tokenizer-init by @wufann in #1591
docs: add TokenSpeed to engine support docs by @CatherineSue in #1656
test(grpc): cover TokenSpeed and SGLang server-info label conversion by @slin1237 in #1654
fix(grpc_client): pass through request.seed in vLLM SamplingParams builders by @EntilZha in #1657
feat(grpc): add FlushCache and profile RPCs with worker-abstracted admin ops by @slin1237 in #1655
fix(reasoning_parser): preserve whitespace in BaseReasoningParser by @EntilZha in #1658
fix(ci): give each runner its own hf-xet cache dir by @XinyueZhang369 in #1660
fix(python): release the GIL while the router server runs by @slin1237 in #1641
ci(trtllm): bump CI wheel to TensorRT-LLM 1.3.0rc18 by @slin1237 in #1663
feat(cache-aware): KV-aware imbalance triggers (spread + overload) via load monitor by @slin1237 in #1621
ci(sglang): bump CI install to sglang 0.5.12.post1 (cu13 stack) by @slin1237 in #1662
feat(grpc): add TokenSpeed FlushCache and profile RPCs by @slin1237 in #1659
feat(grpc): drive vLLM MooncakeConnector PD with router-minted kv_transfer_params by @CatherineSue in #1664
feat(smg): outbound worker mesh sync — single-writer ownership, publish loop, tombstones, recovery by @CatherineSue in #1661
fix(docker): shadow distro-owned PyYAML on TRT-LLM bases before smg install by @slin1237 in #1666
fix(messages): Anthropic messages API, preserve system "text" and unknown fields by @ekzhang in #1670
fix(e2e): unbreak trtllm lanes broken by mcp 2.0.0a1 by @slin1237 in #1675
feat(gateway): Filter /workers by model by @ekzhang in #1672
fix(gateway): DP-aware rank suffix by @ekzhang in #1669
refactor(protocols): move ListWorkersQuery to the protocols crate by @slin1237 in #1690
chore: bump versions for v1.5.0 release by @slin1237 in #1665

New Contributors

@TingtingZhou7 made their first contribution in #1057
@ai-jz made their first contribution in #1104
@DavidBellamy made their first contribution in #1130
@zetxqx made their first contribution in #874
@shenxiul made their first contribution in #1256
@spalimpaaces-star made their first contribution in #1134
@VadivelanU made their first contribution in #1308
@RohanSogani made their first contribution in #1333
@MohanKumar21 made their first contribution in #1379
@Weili-0234 made their first contribution in #1428
@krishung5 made their first contribution in #1459
@Abhishek8108 made their first contribution in #1460
@zhyncs made their first contribution in #1473
@surak made their first contribution in #1474
@heymrbox made their first contribution in #1485
@yetone made their first contribution in #1497
@zach-li-sudo made their first contribution in #1489
@devvrit-mirendil made their first contribution in #1503
@aurickq made their first contribution in #1520
@chenht2022 made their first contribution in #1515
@1195343015 made their first contribution in #1543
@deokjinkim made their first contribution in #1589
@LorrinWWW made their first contribution in #1605
@wufann made their first contribution in #1591
@EntilZha made their first contribution in #1657