v1.4.1

@slin1237 slin1237 released this 09 Apr 18:18
ea9005d

🚀 Shepherd Model Gateway v1.4.1 Released

Patch release with mesh HA stability fix, DP rank scheduling, reasoning parser fixes, and engine version bumps.

Mesh HA Stability Fix

Fixed premature worker removal during rolling deploys:

  • Workers synced via mesh with health: false were being removed by the health checker before they had a chance to pass local health checks
  • Fix: health checker now only removes workers whose health check actually failed this tick, not workers that are merely marked unhealthy from mesh state
  • Eliminates the 500/503 error spike during gateway redeploys with --remove-unhealthy-workers enabled
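The removal rule described above can be sketched as follows. This is a minimal illustration, not the gateway's actual code: the function names, the `known_workers` map, and the `tick_results` map are all hypothetical, and the sketch assumes that a worker synced from the mesh may not have been probed locally yet.

```python
def legacy_removals(known_workers: dict[str, bool]) -> list[str]:
    """Pre-fix behavior as described above (hypothetical reconstruction):
    remove anything currently flagged unhealthy, whatever the flag's origin."""
    return sorted(url for url, healthy in known_workers.items() if not healthy)


def select_removals(known_workers: dict[str, bool],
                    tick_results: dict[str, bool]) -> list[str]:
    """Post-fix behavior: remove only workers whose local probe failed this tick.

    known_workers: url -> healthy flag (may have been set by mesh sync).
    tick_results:  url -> outcome of the local health probe run during this tick.
    """
    return sorted(
        url for url, passed in tick_results.items()
        if url in known_workers and not passed
    )


# w2 arrived via mesh sync with health: false and has not been probed locally yet.
known = {"http://w1": True, "http://w2": False, "http://w3": True}
tick = {"http://w1": True, "http://w3": False}

print(legacy_removals(known))        # w2 would be dropped prematurely
print(select_removals(known, tick))  # only w3, whose probe failed this tick
```

Keying the decision on this tick's probe results rather than on the stored flag means a worker that is unhealthy only according to mesh state survives until the local health checker has had a chance to probe it.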

DP Rank Scheduling

Data-parallel rank scheduling for multi-GPU inference:

  • Supports scheduling with the minimum number of required ranks
  • New scheduling policy for DP-aware worker selection

MCP Tool Improvements

  • Argument overrides (#1048) -- Adds support for argument overrides with MCP tools, enabling per-request customization of MCP tool call parameters
  • Passthrough output flattening (#1041) -- MCP passthrough mcp_call output is now flattened to plain strings for consistency
  • ID normalization (#989) -- MCP call item IDs are normalized to use the mcp_ prefix for OpenAI alignment
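The flattening and ID normalization items above might look roughly like this. It is a sketch under assumptions: `normalize_mcp_id` and `flatten_output` are hypothetical helper names, the release notes do not say how an existing non-mcp_ prefix is handled (here the prefix is simply prepended), and the treatment of non-text blocks (serialized as JSON) is a guess.

```python
import json


def normalize_mcp_id(item_id: str) -> str:
    """Ensure an mcp_call item ID carries the mcp_ prefix (assumed rule)."""
    return item_id if item_id.startswith("mcp_") else f"mcp_{item_id}"


def flatten_output(output) -> str:
    """Flatten a tool result into one plain string.

    Assumes MCP-style content blocks such as {"type": "text", "text": "..."}.
    Anything that is not a text block is serialized as JSON (an assumption).
    """
    if isinstance(output, str):
        return output
    if isinstance(output, list):
        parts = []
        for block in output:
            if isinstance(block, dict) and block.get("type") == "text":
                parts.append(str(block.get("text", "")))
            else:
                parts.append(json.dumps(block, sort_keys=True))
        return "\n".join(parts)
    return json.dumps(output, sort_keys=True)


print(normalize_mcp_id("call_123"))
print(flatten_output([{"type": "text", "text": "42 results"},
                      {"type": "text", "text": "done"}]))
```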

Reasoning Parser Fixes

  • Thinking toggle detection (#1031) -- Detect thinking toggle from chat template and override parser state automatically
  • NanoV3/Nemotron fix (#1067) -- Changed parser to always_in_reasoning=false to fix incorrect reasoning block detection
  • Harmony routing (#1025) -- Route reasoning_content to analysis channel per Harmony spec
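To show why the initial parser state matters, here is a simplified, non-streaming sketch. Only the flag name `always_in_reasoning` comes from the notes above. The `split_reasoning` function, the `<think>`/`</think>` delimiters, and the exact semantics are assumptions for illustration, and the real parser may behave differently.

```python
def split_reasoning(text: str, always_in_reasoning: bool,
                    start: str = "<think>", end: str = "</think>") -> tuple[str, str]:
    """Split model output into (reasoning, content).

    always_in_reasoning=True treats output as reasoning from the first token,
    even without an opening tag. With False, reasoning starts only after an
    explicit opening tag.
    """
    in_reasoning = always_in_reasoning
    if text.startswith(start):
        text = text[len(start):]
        in_reasoning = True
    if not in_reasoning:
        return "", text
    # Everything up to the closing tag is reasoning, the rest is content.
    reasoning, _sep, content = text.partition(end)
    return reasoning.strip(), content.strip()


plain = "The answer is 4."
print(split_reasoning(plain, always_in_reasoning=True))   # answer misfiled as reasoning
print(split_reasoning(plain, always_in_reasoning=False))  # answer kept as content
```

In this sketch, a model that answers without emitting any reasoning tags has its whole reply classified as reasoning when the flag is true. That is the kind of misdetection the fix description suggests.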

Bug Fixes

  • Routing: Eliminate unconditional token allocation on the hot path (#1024)
  • Responses API: Stop defaulting top_p when a request omits it (#1043), unify upstream header handling (#1029)
  • gRPC: Update vLLM imports for inputs reorganization (#1033)
  • Frontend: Fix smg serve rejecting vLLM OpenAI args (#832)
  • Discovery: Periodic reconciliation with identity-based pod equality (#1039)
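The Responses API fix for top_p (#1043) amounts to forwarding only the sampling fields the client actually set. A minimal sketch, assuming a hypothetical `build_upstream_payload` helper and a dict-shaped request. The field names are standard Responses API parameters, but the helper is not the gateway's real code.

```python
OPTIONAL_FIELDS = ("temperature", "top_p", "max_output_tokens")


def build_upstream_payload(request: dict) -> dict:
    """Copy optional sampling fields only when the client supplied them,
    so the upstream engine applies its own defaults for anything omitted."""
    payload = {"model": request["model"], "input": request["input"]}
    for key in OPTIONAL_FIELDS:
        if request.get(key) is not None:
            payload[key] = request[key]
    return payload


print(build_upstream_payload({"model": "m", "input": "hi"}))
print(build_upstream_payload({"model": "m", "input": "hi", "top_p": 0.9}))
```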

Engine Version Bumps

  • vLLM: v0.18.0 -> v0.19.0
  • SGLang: v0.5.9/v0.5.10rc0 -> v0.5.10
  • TensorRT-LLM: 1.3.0rc8 -> 1.3.0rc10

Infrastructure

  • Claude review workflow hardened with incremental reviews and auto-approve (#1036, #1040, #1042)
  • E2E worker failure diagnostics and cleanup improvements (#1015)
  • gRPC package releases: smg-grpc-proto 0.4.6, smg-grpc-servicer 0.5.2

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

Docker Images

Pre-built engine images on GitHub Container Registry:

SGLang:

docker pull ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10

vLLM:

docker pull ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.19.0

TensorRT-LLM:

docker pull ghcr.io/lightseekorg/smg:1.4.1-trtllm-1.3.0rc10

All images for v1.4.1:

Engine   Tag                      Pull Command
sglang   1.4.1-sglang-v0.5.10     docker pull ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10
trtllm   1.4.1-trtllm-1.3.0rc10   docker pull ghcr.io/lightseekorg/smg:1.4.1-trtllm-1.3.0rc10
vllm     1.4.1-vllm-v0.19.0       docker pull ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.19.0

What's Changed

  • perf: Eliminate unconditional token allocation on the routing hot path by @ppraneth in #1024
  • refactor(e2e): rename worker_args to sglang_args by @CatherineSue in #1019
  • fix(ci): improve e2e worker failure diagnostics and cleanup by @key4ng in #1015
  • feat(metrics-ws): [2/4] add protocol types and watch registry by @key4ng in #982
  • fix(harmony): route reasoning_content to analysis channel per Harmony spec by @CatherineSue in #1025
  • fix(openai): unify responses upstream header handling by @zhaowenzi in #1029
  • fix(grpc): update vLLM imports for inputs reorganization by @CatherineSue in #1033
  • fix(reasoning): detect thinking toggle from chat template and override parser state by @CatherineSue in #1031
  • fix(ci): harden Claude review workflow with incremental reviews and resilience by @key4ng in #1036
  • fix(ci): fix comment fetch, add review summary, and auto-approve by @key4ng in #1040
  • fix(ci): handle array-format execution output in review summary by @key4ng in #1042
  • fix(mcp): flatten passthrough mcp_call output to plain strings by @zhaowenzi in #1041
  • feat(metrics-ws): [3/4] add event-driven and polled collectors by @key4ng in #1027
  • fix(responses): stop defaulting top_p for omitted requests by @zhaowenzi in #1043
  • fix(frontend): Fix smg serve reject vLLM OpenAI args by @YouNeedCryDear in #832
  • feat(realtime-api): WebRTC relay bridge by @pallasathena92 in #733
  • feat(overrides): add support for argument overrides with mcp tools by @Tobel158 in #1048
  • fix(mcp): normalize mcp_call item IDs to use mcp_ prefix for OpenAI alignment by @zhaowenzi in #989
  • feat: supports dp rank scheduling and scheduling with the minimum number of… by @jiashaokun-1 in #1007
  • fix(discovery): periodic reconciliation with identity-based pod equality by @Kangyan-Zhou in #1039
  • chore(deps): update wasm-encoder requirement from 0.245 to 0.246 by @dependabot[bot] in #1054
  • chore(deps): update lz4_flex requirement from 0.11 to 0.13 by @dependabot[bot] in #1053
  • chore(deps): update str0m requirement from 0.16 to 0.18 by @dependabot[bot] in #1052
  • chore(deps): bump vllm base image from v0.18.0 to v0.19.0 by @slin1237 in #1066
  • fix(reasoning): change NanoV3/Nemotron parser to always_in_reasoning=false by @CatherineSue in #1067
  • chore(deps): bump sglang from 0.5.9/0.5.10rc0 to 0.5.10 by @slin1237 in #1064
  • feat(metrics-ws): [4/4] add /ws/metrics endpoint with subscription support by @key4ng in #1050
  • fix(mesh): prevent premature removal of unhealthy workers by health checker by @slin1237 in #1076
  • chore(deps): bump TensorRT-LLM from 1.3.0rc8 to 1.3.0rc10 by @slin1237 in #1077
  • chore(grpc): release smg-grpc-proto 0.4.6 and smg-grpc-servicer 0.5.2 by @slin1237 in #1078
  • chore: bump versions for v1.4.1 release by @slin1237 in #1080

Full Changelog: v1.4.0...v1.4.1