v1.4.0
🚀 Shepherd Model Gateway v1.4.0 Released
The biggest SMG release yet -- Kubernetes-native deployment via Helm, a terminal dashboard, 200x mesh memory reduction, 7-11x faster multimodal preprocessing, native Completion API over gRPC, and per-model retry configuration.
Kubernetes-Native Deployment with Helm
Production-ready Helm chart for deploying SMG on Kubernetes:
- One-command deployment -- `helm install smg oci://ghcr.io/lightseekorg/smg-helm` deploys the full gateway stack
- Router + Worker deployment -- A single chart deploys both the gateway router and inference engine workers (vLLM, SGLang, TRT-LLM) with GPU scheduling
- Mesh HA with service discovery -- Deploy multiple gateway replicas as a StatefulSet with automatic gossip-based peer discovery via `--router-selector`
- Full K8s integration -- RBAC, Ingress, HPA, PDB, ServiceMonitor, Grafana dashboard ConfigMap, JSON Schema validation at `helm lint` time
- 5 example configurations -- Router-only, with-postgres, with-service-discovery, with-ingress, with-monitoring
Impact: Zero-to-production SMG deployment on Kubernetes with a single helm install. Declarative configuration, automatic scaling, and built-in observability.
Terminal Dashboard (smg-tui)
Full-featured terminal UI for real-time monitoring and interactive chat:
- 7 tabs -- Pulse (real-time dashboard with sparklines), Workers (per-worker stats + circuit breaker state), Chat (streaming markdown playground), Logs (per-component with ANSI stripping), Benchmark, Traffic, Mesh
- Worker management -- Quick-add presets for OpenAI/Anthropic/xAI/Gemini, local worker launch with automatic GPU selection via `nvidia-smi`, GPU claim tracking to prevent double-allocation
- Gateway auto-start -- `smg-tui --auto-start` launches the gateway, polls health, and cleans up on exit
- Chat playground -- Streaming SSE with live cursor, markdown rendering, multi-turn support, Tab to cycle models
Mesh Performance & Reliability Revolution
Eliminated catastrophic memory growth and achieved >200x improvement in mesh resource usage:
- Delta encoding (#899): Only send new tree operations since last sync -- 40x smaller sync payloads (18.3 MB → 417 KB), gzip compression for additional 5-8x wire reduction
- Lazy serialization (#919): Moved full TreeState serialization off the hot path -- memory: OOM crash → 31 MB stable, CPU: 280-345% → 56-58%, latency: 12s degrading → stable
- CRDT bypass (#961): Moved tree state out of CRDT operation log -- eliminated ~1 GB/1.5hr memory leak under sustained load
- Two-layer sync fix (#1011): Eliminated remaining memory leaks in the tree sync protocol
- Snapshot serialization (#974): Structure-preserving radix tree snapshots for mesh sync -- shared prefixes stored once, replacing 40 MB flat operation replay with compact tree format
- Timeout enforcement (#952): Consistent timeout contract across all RPC and stream paths
- Health mirroring (#912, #892): Mesh-synced workers now register locally for health checking with proper status mirroring
Benchmark Results (20 min, 500 rps, 20K-char prompts):
- 565,920 requests, 0 errors
- Memory plateaus at ~2.3 GB (no linear growth)
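The delta-encoding idea from #899 (send only the tree operations added since the last sync, then gzip the payload) can be sketched as follows. This is a minimal illustration, not SMG's implementation: the `OpLog` class, its `delta_since` method, the `apply_delta` helper, and the JSON wire format are all hypothetical.

```python
import gzip
import json


class OpLog:
    """Append-only log of tree operations (hypothetical, for illustration)."""

    def __init__(self):
        self.ops = []

    def append(self, op):
        self.ops.append(op)

    def delta_since(self, last_synced):
        """Return (compressed payload, new cursor) for ops after `last_synced`."""
        new_ops = self.ops[last_synced:]
        payload = gzip.compress(json.dumps(new_ops).encode("utf-8"))
        return payload, len(self.ops)


def apply_delta(replica_ops, payload):
    """Decode a payload and append its operations to a replica's log."""
    replica_ops.extend(json.loads(gzip.decompress(payload).decode("utf-8")))


log = OpLog()
replica = []
cursor = 0

for i in range(1000):
    log.append({"op": "insert", "key": f"prefix-{i}"})
payload, cursor = log.delta_since(cursor)  # first sync carries all 1000 ops
apply_delta(replica, payload)

log.append({"op": "insert", "key": "prefix-1000"})
small_payload, cursor = log.delta_since(cursor)  # second sync carries 1 op
apply_delta(replica, small_payload)
```

The second payload contains a single operation, so its size tracks the amount of new work rather than the total size of the tree.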
7-11x Faster Multimodal Image Preprocessing
SMG now matches or beats HuggingFace Python preprocessing performance:
- SIMD resize -- Replaced `image` crate (pure Rust) with `fast_image_resize` v6 (AVX2/SSE4.1) for 10-25x faster resize
- Fused operations -- Combined `to_tensor_and_normalize()`, zero-copy `patchify_into()`, fused pad + normalize + tile split for Llama4
- Additional optimizations -- Thread-local Resizer reuse, eliminated DynamicImage clones, optimized serialization and tensor conversion
Benchmark Results (Qwen3-VL):
| Image Size | Before | After | vs HuggingFace Python |
|---|---|---|---|
| 224×224 | 4.77 ms | 0.44 ms (10.8x) | 2.5x faster |
| 640×480 | 15.5 ms | 1.59 ms (9.7x) | 1.8x faster |
| 1024×768 | 40.6 ms | 4.31 ms (9.4x) | 1.6x faster |
| 1920×1080 | 286 ms | 39.6 ms (7.2x) | ~parity |
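The speedup factors in the "After" column are the ratio of the "Before" time to the "After" time. A short check, using only the timings from the table above, reproduces them and the 7-11x range in the headline:

```python
# (before_ms, after_ms) pairs copied from the table above.
timings = {
    "224x224": (4.77, 0.44),
    "640x480": (15.5, 1.59),
    "1024x768": (40.6, 4.31),
    "1920x1080": (286.0, 39.6),
}

# Speedup = before / after, rounded to one decimal as in the table.
speedups = {
    size: round(before / after, 1) for size, (before, after) in timings.items()
}

for size, factor in speedups.items():
    print(f"{size}: {factor}x")
```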
Native Completion API over gRPC
Full /v1/completions support through the gRPC pipeline with streaming and PD disaggregation:
- 6-PR pipeline -- CompletionRequest type, preparation stage, request building with backend sampling params, response processing, pipeline wiring, streaming support
- Streaming -- OpenAI-compatible SSE events with per-index stop decoder tracking, echo and suffix handling
- PD mode -- Dual streaming for prefill-decode disaggregation
- Type safety -- Native `RequestType::Completion` throughout the pipeline, exhaustive match arms in shared stages
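On the client side, an OpenAI-compatible completions stream arrives as `data: {...}` lines terminated by `data: [DONE]`, with each chunk carrying `choices[].index` and `choices[].text`. The sketch below shows one way a client might accumulate text per choice index. The `collect_completion_text` helper is hypothetical and the sample events are made up for illustration, not captured from a real gateway.

```python
import json
from collections import defaultdict


def collect_completion_text(sse_lines):
    """Accumulate streamed text per choice index from OpenAI-style SSE lines."""
    text_by_index = defaultdict(str)
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text_by_index[choice["index"]] += choice.get("text", "")
    return dict(text_by_index)


# Illustrative events only.
sample_stream = [
    'data: {"choices": [{"index": 0, "text": "Hello"}]}',
    "",
    'data: {"choices": [{"index": 1, "text": "Bonjour"}]}',
    "",
    'data: {"choices": [{"index": 0, "text": ", world"}]}',
    "",
    "data: [DONE]",
]

result = collect_completion_text(sample_stream)
print(result)
```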
Per-Model Retry Configuration
Different models can now have different retry policies:
- WorkerRegistry integration -- Workers declare per-model retry config via `WorkerSpec.resilience`, stored in `WorkerRegistry` with last-write-wins semantics
- All routers updated -- HTTP, gRPC, OpenAI, Gemini, gRPC PD, and HTTP PD routers all look up per-model config at request time, falling back to the global default
- Cleanup on removal -- Retry config is automatically cleaned up when the last worker for a model is removed
Impact: GPU-constrained models can have longer timeouts and more retries, while fast models use aggressive retry budgets. No more one-size-fits-all.
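The three behaviors listed above (last-write-wins storage, fallback to the global default, cleanup when the last worker for a model is removed) can be sketched in a few lines. This is an illustrative model only: `RetryConfig`, its fields, and `RetryRegistry` are hypothetical names, not the actual `WorkerRegistry` API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryConfig:
    max_retries: int       # hypothetical field
    timeout_secs: float    # hypothetical field


class RetryRegistry:
    """Hypothetical stand-in for the per-model part of a worker registry."""

    def __init__(self, global_default):
        self.global_default = global_default
        self._per_model = {}  # model -> RetryConfig
        self._workers = {}    # model -> set of worker ids

    def add_worker(self, worker_id, model, retry=None):
        self._workers.setdefault(model, set()).add(worker_id)
        if retry is not None:
            self._per_model[model] = retry  # last write wins

    def remove_worker(self, worker_id, model):
        workers = self._workers.get(model, set())
        workers.discard(worker_id)
        if not workers:
            # Last worker for this model is gone: drop its retry config too.
            self._workers.pop(model, None)
            self._per_model.pop(model, None)

    def retry_for(self, model):
        """Per-model config if present, otherwise the global default."""
        return self._per_model.get(model, self.global_default)


registry = RetryRegistry(RetryConfig(max_retries=3, timeout_secs=30.0))
registry.add_worker("w1", "big-model", RetryConfig(5, 120.0))
registry.add_worker("w2", "big-model", RetryConfig(6, 180.0))
big = registry.retry_for("big-model")      # config from w2, the last writer
other = registry.retry_for("small-model")  # no override, global default

registry.remove_worker("w1", "big-model")
registry.remove_worker("w2", "big-model")
after_removal = registry.retry_for("big-model")  # back to the default
```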
Three-Phase Graceful Shutdown
Replaces the fixed-timeout shutdown with a three-phase Gate → Drain → Teardown sequence:
- Phase 1 (Gate): Stop accepting new requests
- Phase 2 (Drain): Wait for in-flight requests to complete (up to configured timeout)
- Phase 3 (Teardown): MCP orchestrator cleanup + exit
Impact: Requests finishing in 2s no longer wait 28s for a fixed grace period. Requests needing 35s no longer get killed at 30s. The system drains to zero when possible.
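The Gate → Drain → Teardown sequence can be illustrated with a toy asyncio server. This is a sketch of the general pattern under simplifying assumptions, not SMG's shutdown code: the `Gateway` class and its methods are hypothetical, and teardown is reduced to a log entry.

```python
import asyncio


class Gateway:
    """Toy server illustrating Gate -> Drain -> Teardown (hypothetical names)."""

    def __init__(self):
        self.accepting = True
        self.in_flight = 0
        self._idle = asyncio.Event()
        self._idle.set()
        self.log = []

    async def handle(self, duration):
        if not self.accepting:
            return "rejected"
        self.in_flight += 1
        self._idle.clear()
        try:
            await asyncio.sleep(duration)  # stands in for real request work
            return "ok"
        finally:
            self.in_flight -= 1
            if self.in_flight == 0:
                self._idle.set()

    async def shutdown(self, drain_timeout):
        self.accepting = False  # Phase 1 (Gate): stop accepting new requests
        self.log.append("gate")
        try:  # Phase 2 (Drain): wait for in-flight work, bounded by the timeout
            await asyncio.wait_for(self._idle.wait(), drain_timeout)
            self.log.append("drained")
        except asyncio.TimeoutError:
            self.log.append("drain-timeout")
        self.log.append("teardown")  # Phase 3 (Teardown): cleanup, then exit


async def main():
    gw = Gateway()
    request = asyncio.create_task(gw.handle(0.05))
    await asyncio.sleep(0)        # let the request start
    await gw.shutdown(drain_timeout=1.0)
    late = await gw.handle(0.01)  # arrives after the gate closed
    return await request, late, gw.log


first, late, log = asyncio.run(main())
```

Shutdown returns as soon as the in-flight count reaches zero, so the drain timeout acts as an upper bound rather than a fixed wait.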
Worker Registry & REST API Improvements
- Model field required (#713) -- Clients omitting `model` now get 400 Bad Request instead of silent `"unknown"` injection. Matches OpenAI API spec. Breaking change.
- REST semantics (#875) -- `POST /workers` (create-only, 409 on conflict), `PUT /workers/{id}` (full replace), `PATCH /workers/{id}` (partial update). Breaking change: `PUT` now requires `WorkerSpec` instead of `WorkerUpdateRequest`.
- Split register paths (#836) -- `register()` (create-only), `replace()` (overwrite-then-diff, no transient gap), `register_or_replace()` (idempotent upsert)
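The difference between create-only, full replace, partial update, and upsert is easiest to see on a small in-memory model. The sketch below is illustrative only: the `Registry` class, the `Conflict` exception, and the `url` and `priority` fields are hypothetical, and it assumes `replace()` requires an existing worker, which the notes above do not state.

```python
class Conflict(Exception):
    """Stands in for an HTTP 409 response in this sketch."""


class Registry:
    def __init__(self):
        self.workers = {}

    def register(self, worker_id, spec):
        """Create-only (POST): fail if the worker already exists."""
        if worker_id in self.workers:
            raise Conflict(worker_id)
        self.workers[worker_id] = dict(spec)

    def replace(self, worker_id, spec):
        """Full replace (PUT): the new spec overwrites every field.

        Assumption: the worker must already exist.
        """
        if worker_id not in self.workers:
            raise KeyError(worker_id)
        self.workers[worker_id] = dict(spec)

    def register_or_replace(self, worker_id, spec):
        """Idempotent upsert: create or overwrite."""
        self.workers[worker_id] = dict(spec)

    def patch(self, worker_id, changes):
        """Partial update (PATCH): only the supplied fields change."""
        self.workers[worker_id].update(changes)


reg = Registry()
reg.register("w1", {"url": "http://a", "priority": 1})

try:
    reg.register("w1", {"url": "http://b"})
    status = 201
except Conflict:
    status = 409  # second create for the same id is rejected

reg.patch("w1", {"priority": 5})
after_patch = dict(reg.workers["w1"])  # url kept, priority changed

reg.replace("w1", {"url": "http://c"})
after_put = dict(reg.workers["w1"])    # priority gone: full replace
```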
vLLM gRPC Embedding Support
End-to-end embedding pipeline for vLLM via gRPC:
- Rust gateway + Python servicer (calls `engine.encode()` with `PoolingParams`)
- Flattened SGLang `EmbedResponse` proto (removed oneof, uses `tonic::Status` for errors)
- Removed SGLang-specific `log_metrics` and `cached_tokens` from embed/classify protos
DeepSeek V3.1 Tool Call Parser
Native parser for DeepSeek V3.1's tool calling format:
- Handles V3.1's simplified format (no `function` type prefix, no markdown code blocks)
- Complete + streaming (`parse_incremental`) support
- Auto-registered for `deepseek-v3.1*` and `deepseek-ai/DeepSeek-V3.1*` model patterns
- E2E validated against live DeepSeek V3.1 (FP8) on 8×H200
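To get a feel for which model IDs the two registration patterns would cover, the sketch below treats them as case-sensitive glob patterns. That interpretation is an assumption, since the notes do not specify the matching rules, and `uses_v31_parser` and the sample model IDs are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns quoted in the release notes; glob-style, case-sensitive matching
# is an assumption made for this sketch.
DEEPSEEK_V31_PATTERNS = ["deepseek-v3.1*", "deepseek-ai/DeepSeek-V3.1*"]


def uses_v31_parser(model_id):
    """Return True if the model ID matches either registration pattern."""
    return any(fnmatchcase(model_id, p) for p in DEEPSEEK_V31_PATTERNS)


for model_id in ["deepseek-v3.1-fp8", "deepseek-ai/DeepSeek-V3.1", "deepseek-v3"]:
    print(model_id, uses_v31_parser(model_id))
```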
Additional Features
- Configurable storage hook context (#807) -- Map HTTP headers to storage hook request context via `storage_context_headers`
- Conversation memories schema (#976) -- First-class `conversation_memories` table in data-connector with Oracle Flyway DDL and insert seam
- gRPC health checking (#885) -- Standard `grpc.health.v1` health service for vLLM workers
- Model metadata in GetModelInfo (#871) -- vLLM `GetModelInfo` RPC now returns model metadata fields
- Metrics server refactored to axum (#966) -- Foundation for `/ws/metrics` WebSocket endpoint
- `max_total_num_tokens` in GetServerInfo (#817) -- Aligns gRPC response with HTTP server
Performance Improvements
- Tokenizer: Optimized stop decoder and incremental sequence decoding (#990)
- Routing: Optimized `extract_text_for_routing` string handling (#967)
- Mesh: Eliminated per-request CRDT serialization in `sync_tree_operation` (#948)
- Multimodal: Thread-local Resizer reuse (#923), eliminated DynamicImage clones (#928), optimized serialization and tensor conversion (#1012)
Bug Fixes
- Multimodal: Fixed Phi-3-vision for string-format chat templates (#942), LLaVA-Next anyres multi-crop for vLLM gRPC (#941), hardened registry matching and token geometry (#945), propagated placeholder resolution errors (#943), fixed images smaller than patch_size × merge_size (#908), use preprocessor token counts in LlavaSpec (#958), fall back to config.model_type for aliased model IDs (#898)
- Chat Templates: Inject special tokens (bos_token, eos_token) into chat template context (#914), correct content format detection for Qwen3-style templates (#981), inject special tokens inside tokenizer impls (#918)
- Protocol: Accept null for boolean fields `logprobs` and `stream` (#1020), validate reasoning parser name at CLI and startup (#901)
- gRPC: Fix assistant tool_calls message serialization for chat templates (#1023), include stop tokens in TRT-LLM output for Harmony parsing (#879), handle vllm log forwarding on servicer side (#975)
- Mesh: Stop advertising 0.0.0.0 to peers (#883), set tonic message size limits to match application limit (#893), prevent duplicate store events from inflating tree_sizes (#946)
- Gateway: Update metric when removing unhealthy workers (#884), filter empty-string backend defaults in CLI arg fallback (#934)
- Responses API: Align store=false state persistence behavior (#916)
- Serve: Respect user-set CUDA_VISIBLE_DEVICES in gpu_env (#751)
- Metrics: Increase default Prometheus bucket coverage to 1h (#909)
- Logging: Show logs from all workspace crates, not just smg (#965)
- Dependencies: Pin unicode-segmentation <1.12 in WASM guest crates (#886)
Refactoring
- Unified circuit breaker recording to `record_outcome(status_code)` across all routers (#937)
- Removed dead `GenerateError` from protos and gateway (#1002)
- Renamed shared dispatcher stages, removed dead arms (#1004)
- Used `EngineClient` interface instead of `AsyncLLM` in vLLM servicer (#949)
- Responses API retrieval refactored to use data layer directly (#938)
Infrastructure & CI
- Docker: Build 3 engine versions on release, test all engines on PR (#858), run SMG as non-root user (#863)
- Helm CI: Helm chart release workflow to GHCR (#864)
- Claude Code PR review: Automated review workflow with severity markers (#985, #986)
- E2E: MMMU benchmark (#910), basic multimodal tests (#931), array content format coverage (#999), Harmony tests on vLLM and TRT-LLM (#801), worker log artifacts on failure (#963)
- Benchmarks: Llama-4 and Llama-3.3 70b in nightly benchmarks (#900)
Breaking Changes
- Model field required -- Requests omitting `model` now return 400 Bad Request instead of defaulting to `"unknown"`. Matches OpenAI API spec.
- Worker API REST semantics -- `PUT /workers/{id}` now expects `WorkerSpec` (full replace). Use `PATCH /workers/{id}` for partial updates with `WorkerUpdateRequest`.
- Embedding proto changes -- `log_metrics` and `cached_tokens` removed from embedding/classify protos. SGLang `EmbedResponse` flattened (oneof removed).
Full Changelog: v1.3.3...v1.4.0
Upgrade now: `pip install smg --upgrade`
Shepherd your LLM infrastructure with confidence.
Docker Images
Pre-built engine images on GitHub Container Registry:
- SGLang: `docker pull ghcr.io/lightseekorg/smg:1.4.0-sglang-v0.5.9`
- vLLM: `docker pull ghcr.io/lightseekorg/smg:1.4.0-vllm-v0.18.0`
- TensorRT-LLM: `docker pull ghcr.io/lightseekorg/smg:1.4.0-trtllm-1.3.0rc8`

All images for v1.4.0:
| Engine | Tag | Pull Command |
|---|---|---|
| sglang | 1.4.0-sglang-v0.5.9 | `docker pull ghcr.io/lightseekorg/smg:1.4.0-sglang-v0.5.9` |
| trtllm | 1.4.0-trtllm-1.3.0rc8 | `docker pull ghcr.io/lightseekorg/smg:1.4.0-trtllm-1.3.0rc8` |
| vllm | 1.4.0-vllm-v0.18.0 | `docker pull ghcr.io/lightseekorg/smg:1.4.0-vllm-v0.18.0` |
What's Changed
- feat support max_total_num_tokens in getserverinforesponse to keep align with get_server_info in HTTP server by @Huixxi in #817
- feat(helm): add Helm chart for SMG router deployment by @slin1237 in #699
- fix(helm): correct CLI flag names and defaults in chart templates by @slin1237 in #859
- ci(docker): build 3 engine versions on release, test all engines on PR by @slin1237 in #858
- fix(helm): quote custom labels and fix default securityContext by @slin1237 in #862
- fix(helm): fix security context and test discovery for SMG chart by @slin1237 in #861
- fix(docker): run SMG as non-root user and fix Helm chart compatibility by @slin1237 in #863
- fix(ci): fix Docker build UID conflict and split build vs push triggers by @slin1237 in #865
- ci(helm): add Helm chart release workflow to GHCR by @slin1237 in #864
- refactor: make model field required, remove resolve_model_id by @CatherineSue in #713
- fix(ci): pause minimax nightly bench and add HF token for H200 node by @key4ng in #868
- feat(grpc): add model metadata fields to vLLM GetModelInfo RPC by @slin1237 in #871
- chore(grpc): release smg-grpc-proto 0.5.0 and smg-grpc-servicer 0.6.0 by @slin1237 in #872
- refactor(registry): split register into create-only and replace paths by @CatherineSue in #836
- fix(ci): disable Docker layer cache for engine image builds by @slin1237 in #878
- feat(api): enforce REST semantics for worker endpoints by @CatherineSue in #875
- feat(core): add per-model retry config to WorkerRegistry by @CatherineSue in #821
- feat(server): three-phase graceful shutdown with drain coordination by @slin1237 in #876
- refactor(grpc): remove redundant harmony_stop_ids from build_generate_request_from_responses by @CatherineSue in #880
- fix(deps): pin unicode-segmentation <1.12 in WASM guest crates by @slin1237 in #886
- feat(grpc): add VllmHealthServicer for standard gRPC health checking (grpc.health.v1) by @V2arK in #885
- chore(deps): bump azure/setup-helm from 4 to 5 by @dependabot[bot] in #887
- fix(grpc): include stop tokens in trtllm output for Harmony parsing by @CatherineSue in #879
- fix(mesh): set tonic message size limits to match application limit by @slin1237 in #893
- chore: bump smg-grpc-proto to 0.4.5 by @CatherineSue in #896
- feat(gateway): add configurable storage hook request context mapping by @zhaowenzi in #807
- ci(mergify): upgrade configuration to current format by @mergify[bot] in #815
- ci: Change to use model cache by @XinyueZhang369 in #904
- fix(parsers): validate reasoning parser name at CLI and startup by @CatherineSue in #901
- fix(multimodal): fall back to config.model_type for aliased model IDs by @CatherineSue in #898
- feat(completions): add native gRPC pipeline typing for /v1/completions by @vschandramourya in #840
- fix(multimodal): handle images smaller than patch_size * merge_size by @CatherineSue in #908
- perf(mesh): delta encoding for tree state sync + profiling instrumentation by @slin1237 in #899
- feat(mesh): register mesh-synced workers locally for health checking by @rajatgoel in #892
- fix(metrics): Increase default prometheus bucket coverage to 1h by @ekzhang in #909
- feat(e2e): add MMMU benchmark and fix Qwen VL vision token duplication by @CatherineSue in #910
- fix(mesh): mirror health status for mesh-synced workers by @slin1237 in #912
- feat(completions): add CompletionPreparationStage for gRPC pipeline by @vschandramourya in #907
- fix(chat): inject special tokens (bos_token, eos_token) into chat template context by @CatherineSue in #914
- perf(multimodal): 7-11x faster image preprocessing via SIMD resize, fused ops, and zero-copy patchify by @CatherineSue in #913
- fix(mesh): stop advertising 0.0.0.0 to peers by @jshanson7 in #883
- fix(responses): align store=false state persistence behavior by @zhaowenzi in #916
- refactor(tokenizer): inject special tokens inside tokenizer impls instead of callers by @CatherineSue in #918
- fix(serve): filter empty-string backend defaults in CLI arg fallback by @CatherineSue in #934
- feat(http-router): use per-model retry config from WorkerRegistry by @CatherineSue in #881
- feat(tui): add terminal dashboard for SMG by @key4ng in #867
- feat(grpc-router): use per-model retry config from WorkerRegistry by @CatherineSue in #933
- feat(openai-router): use per-model retry config from WorkerRegistry by @CatherineSue in #935
- refactor(ci): Add actions.summerwind.dev ARC runner deployment option by @XinyueZhang369 in #797
- refactor(responses): retrieval to use data layer directly by @zhaowenzi in #938
- perf(mesh): eliminate per-request TreeState serialization on hot path by @slin1237 in #919
- perf(multimodal): reuse Resizer via thread_local by @slin1237 in #923
- perf(multimodal): avoid DynamicImage clone in no-op transform paths by @slin1237 in #928
- feat(completions): add CompletionRequestBuildingStage and backend sampling params by @vschandramourya in #915
- fix(ci): resolve trtllm cuda-bindings conflict with torch 2.10 by @CatherineSue in #940
- feat(e2e): add basic multimodal chat completion tests by @CatherineSue in #931
- refactor(routers): unify CB recording to record_outcome(status_code) by @CatherineSue in #937
- fix(multimodal): fix Phi-3-vision image processing for string-format chat templates by @CatherineSue in #942
- fix(multimodal): add LLaVA-Next anyres multi-crop support for vLLM gRPC by @CatherineSue in #941
- chore: remove unnecessary result wrapping in pd_router by @lawrence-harmonic in #939
- fix(multimodal): propagate placeholder resolution errors instead of swallowing by @CatherineSue in #943
- fix(kv_index): prevent duplicate store events from inflating tree_sizes by @slin1237 in #946
- fix(multimodal): harden LLaVA-Next registry matching and token geometry by @CatherineSue in #945
- fix(codeowners): correct grpc_servicer path and ownership by @CatherineSue in #950
- perf(mesh): eliminate per-request CRDT serialization in sync_tree_operation by @slin1237 in #948
- refactor(grpc): use EngineClient interface instead of AsyncLLM in vLLM servicer by @CatherineSue in #949
- fix(mesh): enforce timeout contract across all RPC and stream paths by @slin1237 in #952
- fix(multimodal): remove num_img_tokens fallback in Phi3VisionSpec by @CatherineSue in #955
- fix(multimodal): fix phi3_v test broken by num_img_tokens fallback removal by @CatherineSue in #957
- fix(multimodal): use preprocessor token counts in LlavaSpec prompt replacements by @CatherineSue in #958
- fix(multimodal): remove unused test_preprocessed helper by @CatherineSue in #962
- feat(completions): add CompletionResponseProcessingStage for non-streaming responses by @vschandramourya in #953
- feat(completions): wire Completion API pipeline into gRPC routers by @vschandramourya in #964
- fix(mesh): bypass CRDT operation log for tree state — eliminate memory leak by @slin1237 in #961
- perf(protocols): Optimize extract_text_for_routing strings by @ppraneth in #967
- fix(logging): show logs from all workspace crates, not just smg by @slin1237 in #965
- fix(ci): serialize model downloads to prevent PermissionError on shared NVMe by @key4ng in #968
- feat(helm): add worker deployment support with engine-specific images by @slin1237 in #969
- feat(helm): add mesh HA support with K8s service discovery by @slin1237 in #972
- feat(scripts): add --model and --host flags to mesh load generator by @slin1237 in #973
- fix(grpc_servicer): handle vllm log forwarding on servicer side by @CatherineSue in #975
- feat(routers): use per-model retry config for Gemini, gRPC PD, HTTP PD by @CatherineSue in #936
- ci(e2e): upload worker logs as artifacts on test failure by @key4ng in #963
- ci(e2e): add step-level timeouts for trtllm setup and e2e tests by @key4ng in #977
- feat(metrics-ws): [1/4] refactor metrics server to axum with PrometheusHandle by @key4ng in #966
- feat(kv-index): radix tree snapshot serialization for mesh sync [WIP] by @slin1237 in #974
- ci(e2e): bump genai-bench to 0.0.4 and improve benchmark uploads by @key4ng in #980
- fix(tokenizer): correct content format detection for Qwen3-style chat templates by @CatherineSue in #981
- fix(gateway): Update metric when removing unhealthy workers by @ekzhang in #884
- ci(review): add Claude Code PR review workflow by @slin1237 in #985
- feat(ci): add severity markers to Claude review comments by @slin1237 in #986
- fix(serve): respect user-set CUDA_VISIBLE_DEVICES in gpu_env by @paxiaatucsdedu in #751
- chore(deps): bump actions/upload-artifact from 4 to 7 by @dependabot[bot] in #991
- chore(deps): bump actions/deploy-pages from 4 to 5 by @dependabot[bot] in #992
- chore(deps): update ratatui requirement from 0.29 to 0.30 by @dependabot[bot] in #996
- fix(ci): detect and abort on incomplete GPU cleanup by @key4ng in #988
- test(e2e): enable Harmony tests on vLLM and TRT-LLM backends by @CatherineSue in #801
- chore(deps): update criterion requirement from 0.5 to 0.8 by @dependabot[bot] in #994
- fix(ci): deduplicate Claude review comments across pushes by @key4ng in #1003
- test(e2e): add array content format coverage to chat completion tests by @key4ng in #999
- refactor(grpc): rename shared dispatcher stages, remove dead arms by @CatherineSue in #1004
- ci: pin sglang to v0.5.10rc0 for proto compat by @CatherineSue in #1005
- chore(deps): update crossterm requirement from 0.28 to 0.29 by @dependabot[bot] in #995
- refactor(grpc): remove dead GenerateError from protos and gateway by @CatherineSue in #1002
- fix(ci): skip Claude review on fork PRs by @slin1237 in #1009
- feat(embed): add vLLM gRPC embedding support, clean up proto by @CatherineSue in #1001
- perf(multimodal): optimize preprocessing serialization, tensor conversion, and pad fusion by @CatherineSue in #1012
- fix(mesh): eliminate memory leaks in two-layer tree sync protocol by @slin1237 in #1011
- perf(tokenizer): optimize stop decoder and incremental sequence decoding by @CatherineSue in #990
- fix(ci): update labeler paths after crate extraction by @CatherineSue in #1014
- test(e2e): add Llama-4 and Llama-3.3 70b to nightly benchmarks by @paxiaatucsdedu in #900
- fix(multimodal): remove as_slice_memory_order fallback in pixel serialization by @CatherineSue in #1013
- feat(tool_parser): add DeepSeek V3.1 tool call parser by @key4ng in #1006
- feat(data-connector): add conversation memories schema and insert seam by @zhoug9127 in #976
- feat(completions): add Completion API streaming support to gRPC router by @vschandramourya in #978
- fix(protocols): accept null for boolean fields logprobs and stream by @CatherineSue in #1020
- fix(grpc): fix assistant tool_calls message serialization for chat templates by @key4ng in #1023
- chore: bump versions for v1.4.0 release by @slin1237 in #1018
New Contributors
- @V2arK made their first contribution in #885
- @mergify[bot] made their first contribution in #815
- @vschandramourya made their first contribution in #840
- @rajatgoel made their first contribution in #892
- @jshanson7 made their first contribution in #883
- @lawrence-harmonic made their first contribution in #939
- @paxiaatucsdedu made their first contribution in #751
- @zhoug9127 made their first contribution in #976