Release v1.4.1 · lightseekorg/smg

🚀 Shepherd Model Gateway v1.4.1 Released

Patch release with mesh HA stability fix, DP rank scheduling, reasoning parser fixes, and engine version bumps.

Mesh HA Stability Fix

Fixed premature worker removal during rolling deploys:

Workers synced via mesh with health: false were being removed by the health checker before they had a chance to pass local health checks
Fix: health checker now only removes workers whose health check actually failed this tick, not workers that are merely marked unhealthy from mesh state
Eliminates the 500/503 error spike during gateway redeploys with --remove-unhealthy-workers enabled
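
The removal rule described above can be sketched as follows. This is a minimal illustration, not the gateway's actual code: the `Worker` class, the `workers_to_remove` function, and the `probe` callback are hypothetical names, and the sketch assumes every worker is probed locally once per tick.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Worker:
    url: str
    healthy: bool  # may be False only because mesh sync said so


def workers_to_remove(workers: List[Worker], probe: Callable[[Worker], bool]) -> List[str]:
    """Return URLs of workers whose local probe failed during this tick.

    The stored `healthy` flag is never used as a removal reason, so a worker
    that arrived from mesh state with health: false gets a chance to pass.
    """
    failed_this_tick = []
    for worker in workers:
        ok = probe(worker)      # local health check for this tick
        worker.healthy = ok     # refresh local state from the probe result
        if not ok:
            failed_this_tick.append(worker.url)
    return failed_this_tick


workers = [
    Worker("http://w1:8000", healthy=False),  # synced from mesh, actually up
    Worker("http://w2:8000", healthy=True),   # previously healthy, now down
]
down = {"http://w2:8000"}
removed = workers_to_remove(workers, probe=lambda w: w.url not in down)
print(removed)  # ['http://w2:8000']
```

In this sketch the first worker survives even though it arrived marked unhealthy, because its probe succeeded during the tick.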

DP Rank Scheduling

Data-parallel rank scheduling for multi-GPU inference:

Supports scheduling with the minimum number of required ranks
New scheduling policy for DP-aware worker selection

MCP Tool Improvements

Argument overrides (#1048) -- MCP tools now accept argument overrides, enabling per-request customization of tool call parameters
Passthrough output flattening (#1041) -- MCP passthrough mcp_call output is now flattened to plain strings for consistency
ID normalization (#989) -- MCP call item IDs normalized to mcp_ prefix for OpenAI alignment
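
To make the three MCP changes concrete, here is a rough sketch of what each transformation could look like. All function names, the override precedence, and the shape of the output parts are assumptions made for illustration; the actual rules in the gateway may differ.

```python
import json
from typing import Any, Dict


def apply_overrides(model_args: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Merge per-request overrides over the arguments the model produced.

    Assumption: an override wins when both sides set the same key.
    """
    return {**model_args, **overrides}


def flatten_output(output: Any) -> str:
    """Collapse a structured tool result into one plain string.

    Assumption: text parts look like {"type": "text", "text": "..."}.
    """
    if isinstance(output, str):
        return output
    if isinstance(output, list):
        return "\n".join(flatten_output(part) for part in output)
    if isinstance(output, dict) and isinstance(output.get("text"), str):
        return output["text"]
    return json.dumps(output, sort_keys=True)  # fallback for unknown shapes


def normalize_mcp_id(item_id: str) -> str:
    """Ensure an mcp_call item ID carries the mcp_ prefix."""
    return item_id if item_id.startswith("mcp_") else "mcp_" + item_id


print(apply_overrides({"query": "llm", "limit": 10}, {"limit": 3}))
print(flatten_output([{"type": "text", "text": "first"}, "second"]))
print(normalize_mcp_id("abc123"))
```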

Reasoning Parser Fixes

Thinking toggle detection (#1031) -- Detect thinking toggle from chat template and override parser state automatically
NanoV3/Nemotron fix (#1067) -- Changed parser to always_in_reasoning=false to fix incorrect reasoning block detection
Harmony routing (#1025) -- Route reasoning_content to analysis channel per Harmony spec
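
The effect of an always_in_reasoning flag can be illustrated with a toy parser. This is a sketch under stated assumptions: the `<think>`/`</think>` delimiters and the `split_reasoning` function are hypothetical, and only the flag name comes from the release notes.

```python
from typing import Tuple


def split_reasoning(
    text: str,
    always_in_reasoning: bool,
    start: str = "<think>",   # assumed delimiter, for illustration only
    end: str = "</think>",
) -> Tuple[str, str]:
    """Split model output into (reasoning, content)."""
    if start in text:
        before, _, rest = text.partition(start)
        reasoning, _, after = rest.partition(end)
        return reasoning.strip(), (before + after).strip()
    if always_in_reasoning:
        # No start marker: treat everything up to the end marker as reasoning.
        reasoning, found_end, after = text.partition(end)
        if found_end:
            return reasoning.strip(), after.strip()
        return text.strip(), ""
    # No start marker and not assumed to be reasoning: it is all content.
    return "", text.strip()


plain = "The answer is 4."
print(split_reasoning(plain, always_in_reasoning=True))   # ('The answer is 4.', '')
print(split_reasoning(plain, always_in_reasoning=False))  # ('', 'The answer is 4.')
```

With the flag set to true, a plain answer that never opens a reasoning block is classified entirely as reasoning, which is the kind of misdetection a switch to false avoids in this toy model.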

Bug Fixes

Routing: Eliminate unconditional token allocation on the hot path (#1024)
Responses API: Stop defaulting top_p when a request omits it (#1043), unify upstream header handling (#1029)
gRPC: Update vLLM imports for inputs reorganization (#1033)
Frontend: Fix smg serve rejecting vLLM OpenAI args (#832)
Discovery: Periodic reconciliation with identity-based pod equality (#1039)
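
As an illustration of the top_p change, one way a gateway can avoid injecting a default is to forward only the sampling fields the client actually set. The `build_upstream_body` function and its parameters are hypothetical and are not taken from the SMG codebase.

```python
from typing import Any, Dict, Optional


def build_upstream_body(
    model: str,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
) -> Dict[str, Any]:
    """Build the upstream request body, skipping fields the client omitted."""
    body: Dict[str, Any] = {"model": model}
    optional = {"top_p": top_p, "temperature": temperature}
    # Only forward values that were explicitly provided; the upstream engine
    # then applies its own defaults for anything left out.
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


print(build_upstream_body("my-model"))             # {'model': 'my-model'}
print(build_upstream_body("my-model", top_p=0.9))  # {'model': 'my-model', 'top_p': 0.9}
```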

Engine Version Bumps

vLLM: v0.18.0 -> v0.19.0
SGLang: v0.5.9/v0.5.10rc0 -> v0.5.10
TensorRT-LLM: 1.3.0rc8 -> 1.3.0rc10

Infrastructure

Claude review workflow hardened with incremental reviews and auto-approve (#1036, #1040, #1042)
E2E worker failure diagnostics and cleanup improvements (#1015)
gRPC package releases: smg-grpc-proto 0.4.6, smg-grpc-servicer 0.5.2

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

Docker Images

Pre-built engine images on GitHub Container Registry:

SGLang:

docker pull ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10

vLLM:

docker pull ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.19.0

TensorRT-LLM:

docker pull ghcr.io/lightseekorg/smg:1.4.1-trtllm-1.3.0rc10

All images for v1.4.1:

| Engine | Tag | Pull Command |
| --- | --- | --- |
| sglang | `1.4.1-sglang-v0.5.10` | `docker pull ghcr.io/lightseekorg/smg:1.4.1-sglang-v0.5.10` |
| trtllm | `1.4.1-trtllm-1.3.0rc10` | `docker pull ghcr.io/lightseekorg/smg:1.4.1-trtllm-1.3.0rc10` |
| vllm | `1.4.1-vllm-v0.19.0` | `docker pull ghcr.io/lightseekorg/smg:1.4.1-vllm-v0.19.0` |

What's Changed

perf: Eliminate unconditional token allocation on the routing hot path by @ppraneth in #1024
refactor(e2e): rename worker_args to sglang_args by @CatherineSue in #1019
fix(ci): improve e2e worker failure diagnostics and cleanup by @key4ng in #1015
feat(metrics-ws): [2/4] add protocol types and watch registry by @key4ng in #982
fix(harmony): route reasoning_content to analysis channel per Harmony spec by @CatherineSue in #1025
fix(openai): unify responses upstream header handling by @zhaowenzi in #1029
fix(grpc): update vLLM imports for inputs reorganization by @CatherineSue in #1033
fix(reasoning): detect thinking toggle from chat template and override parser state by @CatherineSue in #1031
fix(ci): harden Claude review workflow with incremental reviews and resilience by @key4ng in #1036
fix(ci): fix comment fetch, add review summary, and auto-approve by @key4ng in #1040
fix(ci): handle array-format execution output in review summary by @key4ng in #1042
fix(mcp): flatten passthrough mcp_call output to plain strings by @zhaowenzi in #1041
feat(metrics-ws): [3/4] add event-driven and polled collectors by @key4ng in #1027
fix(responses): stop defaulting top_p for omitted requests by @zhaowenzi in #1043
fix(frontend): Fix smg serve reject vLLM OpenAI args by @YouNeedCryDear in #832
feat(realtime-api): WebRTC relay bridge by @pallasathena92 in #733
feat(overrides): add support for argument overrides with mcp tools by @Tobel158 in #1048
fix(mcp): normalize mcp_call item IDs to use mcp_ prefix for OpenAI alignment by @zhaowenzi in #989
feat: supports dp rank scheduling and scheduling with the minimum number of… by @jiashaokun-1 in #1007
fix(discovery): periodic reconciliation with identity-based pod equality by @Kangyan-Zhou in #1039
chore(deps): update wasm-encoder requirement from 0.245 to 0.246 by @dependabot[bot] in #1054
chore(deps): update lz4_flex requirement from 0.11 to 0.13 by @dependabot[bot] in #1053
chore(deps): update str0m requirement from 0.16 to 0.18 by @dependabot[bot] in #1052
chore(deps): bump vllm base image from v0.18.0 to v0.19.0 by @slin1237 in #1066
fix(reasoning): change NanoV3/Nemotron parser to always_in_reasoning=false by @CatherineSue in #1067
chore(deps): bump sglang from 0.5.9/0.5.10rc0 to 0.5.10 by @slin1237 in #1064
feat(metrics-ws): [4/4] add /ws/metrics endpoint with subscription support by @key4ng in #1050
fix(mesh): prevent premature removal of unhealthy workers by health checker by @slin1237 in #1076
chore(deps): bump TensorRT-LLM from 1.3.0rc8 to 1.3.0rc10 by @slin1237 in #1077
chore(grpc): release smg-grpc-proto 0.4.6 and smg-grpc-servicer 0.5.2 by @slin1237 in #1078
chore: bump versions for v1.4.1 release by @slin1237 in #1080

New Contributors

@Tobel158 made their first contribution in #1048
@jiashaokun-1 made their first contribution in #1007

Full Changelog: v1.4.0...v1.4.1
