Release v1.4.0 · lightseekorg/smg

🚀 Shepherd Model Gateway v1.4.0 Released

The biggest SMG release yet -- Kubernetes-native deployment via Helm, a terminal dashboard, 200x mesh memory reduction, 7-11x faster multimodal preprocessing, native Completion API over gRPC, and per-model retry configuration.

Kubernetes-Native Deployment with Helm

Production-ready Helm chart for deploying SMG on Kubernetes:

One-command deployment -- helm install smg oci://ghcr.io/lightseekorg/smg-helm deploys the full gateway stack
Router + Worker deployment -- A single chart deploys both the gateway router and inference engine workers (vLLM, SGLang, TRT-LLM) with GPU scheduling
Mesh HA with service discovery -- Deploy multiple gateway replicas as a StatefulSet with automatic gossip-based peer discovery via --router-selector
Full K8s integration -- RBAC, Ingress, HPA, PDB, ServiceMonitor, Grafana dashboard ConfigMap, JSON Schema validation at helm lint time
5 example configurations -- Router-only, with-postgres, with-service-discovery, with-ingress, with-monitoring

Impact: Zero-to-production SMG deployment on Kubernetes with a single helm install. Declarative configuration, automatic scaling, and built-in observability.

Terminal Dashboard (smg-tui)

Full-featured terminal UI for real-time monitoring and interactive chat:

7 tabs -- Pulse (real-time dashboard with sparklines), Workers (per-worker stats + circuit breaker state), Chat (streaming markdown playground), Logs (per-component with ANSI stripping), Benchmark, Traffic, Mesh
Worker management -- Quick-add presets for OpenAI/Anthropic/xAI/Gemini, local worker launch with automatic GPU selection via nvidia-smi, GPU claim tracking to prevent double-allocation
Gateway auto-start -- smg-tui --auto-start launches the gateway, polls health, and cleans up on exit
Chat playground -- Streaming SSE with live cursor, markdown rendering, multi-turn support, Tab to cycle models

Mesh Performance & Reliability Revolution

Eliminated catastrophic memory growth and achieved >200x improvement in mesh resource usage:

Delta encoding (#899): Only send new tree operations since last sync -- 40x smaller sync payloads (18.3 MB → 417 KB), gzip compression for additional 5-8x wire reduction
Lazy serialization (#919): Moved full TreeState serialization off the hot path -- memory: OOM crash → 31 MB stable, CPU: 280-345% → 56-58%, latency: 12s degrading → stable
CRDT bypass (#961): Moved tree state out of CRDT operation log -- eliminated ~1 GB/1.5hr memory leak under sustained load
Two-layer sync fix (#1011): Eliminated remaining memory leaks in the tree sync protocol
Snapshot serialization (#974): Structure-preserving radix tree snapshots for mesh sync -- shared prefixes stored once, replacing 40 MB flat operation replay with compact tree format
Timeout enforcement (#952): Consistent timeout contract across all RPC and stream paths
Health mirroring (#912, #892): Mesh-synced workers now register locally for health checking with proper status mirroring

Benchmark Results (20 min, 500 rps, 20K-char prompts):

565,920 requests, 0 errors
Memory plateaus at ~2.3 GB (no linear growth)
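
The delta-encoding idea from #899 can be illustrated with a small sketch. This is not SMG's implementation: the `TreeOpLog` class, the per-peer cursor, and the JSON payload shape are all hypothetical, chosen only to show why sending just the operations added since the last sync, then compressing them, shrinks the payload.

```python
import gzip
import json


class TreeOpLog:
    """Append-only log of tree operations with one sync cursor per peer.

    Hypothetical structure for illustration; SMG's real types are not shown
    in these notes.
    """

    def __init__(self):
        self.ops = []        # every operation applied locally, in order
        self.cursors = {}    # peer id -> number of ops that peer already has

    def record(self, op):
        self.ops.append(op)

    def delta_for(self, peer):
        """Return a gzip-compressed payload holding only the ops the peer lacks."""
        start = self.cursors.get(peer, 0)
        delta = self.ops[start:]
        # Simplification: the cursor advances on send, with no ack handling.
        self.cursors[peer] = len(self.ops)
        return gzip.compress(json.dumps(delta).encode("utf-8"))


def apply_delta(payload):
    """Decode a payload produced by delta_for()."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))


log = TreeOpLog()
for i in range(1000):
    log.record({"op": "insert", "key": f"prompt-prefix-{i}", "worker": "w0"})

first = log.delta_for("peer-a")       # first sync carries all 1000 ops
log.record({"op": "insert", "key": "prompt-prefix-1000", "worker": "w1"})
second = log.delta_for("peer-a")      # later sync carries only the new op
```

In this sketch the second payload decodes to a single operation, so its size depends on how much changed since the last sync rather than on the total size of the tree.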

7-11x Faster Multimodal Image Preprocessing

SMG now matches or beats HuggingFace Python preprocessing performance:

SIMD resize -- Replaced image crate (pure Rust) with fast_image_resize v6 (AVX2/SSE4.1) for 10-25x faster resize
Fused operations -- Combined to_tensor_and_normalize(), zero-copy patchify_into(), fused pad + normalize + tile split for Llama4
Additional optimizations -- Thread-local Resizer reuse, eliminated DynamicImage clones, optimized serialization and tensor conversion
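
To show what fusing the tensor conversion and normalization buys, here is a minimal pure-Python sketch of the arithmetic. It is an illustration only: the function names below are hypothetical stand-ins (the real `to_tensor_and_normalize()` is Rust and is not reproduced in these notes), and the mean/std values are made up. The point is that `(p / 255 - mean) / std` can be folded into one multiply-add per value, which removes the intermediate buffer and the second pass.

```python
import math


def to_tensor_then_normalize(pixels, mean, std):
    """Two-pass baseline: scale to [0, 1] into a temporary list, then normalize."""
    scaled = [p / 255.0 for p in pixels]
    return [(v - mean[i % 3]) / std[i % 3] for i, v in enumerate(scaled)]


def fused_to_tensor_and_normalize(pixels, mean, std):
    """Single pass: precompute per-channel scale and bias, then one multiply-add."""
    scale = [1.0 / (255.0 * s) for s in std]
    bias = [-m / s for m, s in zip(mean, std)]
    return [p * scale[i % 3] + bias[i % 3] for i, p in enumerate(pixels)]


# Interleaved RGB bytes (R, G, B, R, G, B, ...); values are illustrative.
pixels = bytes(range(0, 240, 8))
mean = (0.48, 0.46, 0.41)   # illustrative per-channel mean, not a real model's
std = (0.27, 0.26, 0.28)    # illustrative per-channel std, not a real model's

baseline = to_tensor_then_normalize(pixels, mean, std)
fused = fused_to_tensor_and_normalize(pixels, mean, std)
max_diff = max(abs(a - b) for a, b in zip(baseline, fused))
```

Both paths agree up to floating-point rounding, so the fused form changes the cost and not the result.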

Benchmark Results (Qwen3-VL):

Image Size	Before	After	vs HuggingFace Python
224×224	4.77 ms	0.44 ms (10.8x)	2.5x faster
640×480	15.5 ms	1.59 ms (9.7x)	1.8x faster
1024×768	40.6 ms	4.31 ms (9.4x)	1.6x faster
1920×1080	286 ms	39.6 ms (7.2x)	~parity

Native Completion API over gRPC

Full /v1/completions support through the gRPC pipeline with streaming and PD disaggregation:

6-PR pipeline -- CompletionRequest type, preparation stage, request building with backend sampling params, response processing, pipeline wiring, streaming support
Streaming -- OpenAI-compatible SSE events with per-index stop decoder tracking, echo and suffix handling
PD mode -- Dual streaming for prefill-decode disaggregation
Type safety -- Native RequestType::Completion throughout the pipeline, exhaustive match arms in shared stages
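
For readers consuming the streaming endpoint, the sketch below shows one way a client might reassemble text per choice index from OpenAI-style SSE events. It assumes the publicly documented OpenAI completions stream shape (`data: {json}` lines with a `choices` array, ending in `data: [DONE]`); the exact payload SMG emits is not shown in these notes, and `collect_completion_stream` is a hypothetical helper.

```python
import json


def collect_completion_stream(lines):
    """Accumulate streamed text per choice index from SSE lines."""
    texts = {}
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                      # skip blank separators and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk["choices"]:
            idx = choice["index"]
            texts[idx] = texts.get(idx, "") + choice.get("text", "")
    return texts


# Hand-written sample events; two choices are interleaved on one stream.
sample = [
    'data: {"choices": [{"index": 0, "text": "Hello", "finish_reason": null}]}',
    "",
    'data: {"choices": [{"index": 1, "text": "Bonjour", "finish_reason": null}]}',
    "",
    'data: {"choices": [{"index": 0, "text": ", world", "finish_reason": "stop"}]}',
    "",
    "data: [DONE]",
]
result = collect_completion_stream(sample)
```

Keying the accumulator by `index` mirrors the per-index tracking mentioned above: chunks for different choices can arrive interleaved and still be reassembled independently.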

Per-Model Retry Configuration

Different models can now have different retry policies:

WorkerRegistry integration -- Workers declare per-model retry config via WorkerSpec.resilience, stored in WorkerRegistry with last-write-wins semantics
All routers updated -- HTTP, gRPC, OpenAI, Gemini, gRPC PD, and HTTP PD routers all look up per-model config at request time, falling back to the global default
Cleanup on removal -- Retry config is automatically cleaned up when the last worker for a model is removed

Impact: GPU-constrained models can have longer timeouts and more retries, while fast models use aggressive retry budgets. No more one-size-fits-all.
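
The lookup rules described above (last-write-wins per model, fallback to the global default, cleanup when the last worker goes away) can be sketched as follows. The `RetryRegistry` and `RetryConfig` names and their fields are hypothetical; they are not the fields of `WorkerSpec.resilience`, which these notes do not list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryConfig:
    max_retries: int       # hypothetical field
    timeout_secs: float    # hypothetical field


class RetryRegistry:
    """Toy model of per-model retry lookup with a global fallback."""

    def __init__(self, global_default):
        self.global_default = global_default
        self.per_model = {}   # model -> RetryConfig (last write wins)
        self.workers = {}     # model -> set of worker ids serving it

    def add_worker(self, worker_id, model, retry=None):
        self.workers.setdefault(model, set()).add(worker_id)
        if retry is not None:
            self.per_model[model] = retry   # overwrite any earlier config

    def remove_worker(self, worker_id, model):
        ids = self.workers.get(model, set())
        ids.discard(worker_id)
        if not ids:
            # Last worker for this model is gone: drop its retry config too.
            self.workers.pop(model, None)
            self.per_model.pop(model, None)

    def retry_for(self, model):
        """Resolve the policy at request time, falling back to the default."""
        return self.per_model.get(model, self.global_default)


default = RetryConfig(max_retries=3, timeout_secs=30.0)
registry = RetryRegistry(default)
registry.add_worker("w1", "big-model", RetryConfig(max_retries=5, timeout_secs=120.0))
registry.add_worker("w2", "big-model", RetryConfig(max_retries=8, timeout_secs=180.0))
registry.add_worker("w3", "small-model")

big = registry.retry_for("big-model")       # config from w2, the last writer
small = registry.retry_for("small-model")   # no override, so the global default

registry.remove_worker("w1", "big-model")
still_overridden = registry.retry_for("big-model")
registry.remove_worker("w2", "big-model")
after_cleanup = registry.retry_for("big-model")
```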

Three-Phase Graceful Shutdown

Replaces fixed-timeout shutdown with an intelligent Gate → Drain → Teardown approach:

Phase 1 (Gate): Stop accepting new requests
Phase 2 (Drain): Wait for in-flight requests to complete (up to configured timeout)
Phase 3 (Teardown): MCP orchestrator cleanup + exit

Impact: A request that finishes in 2s no longer waits out the remaining 28s of a fixed 30s grace period, and a request that needs 35s is no longer killed at 30s as long as the configured drain timeout allows it. The system drains to zero when possible.
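
The three phases can be sketched with `asyncio`. This is a toy model under stated assumptions, not the gateway's code: the `Gateway` class, its counters, and the log strings are hypothetical, and teardown is reduced to a log entry where the real system runs MCP orchestrator cleanup.

```python
import asyncio


class Gateway:
    """Toy server that tracks in-flight requests so shutdown can drain them."""

    def __init__(self):
        self.accepting = True
        self.in_flight = 0
        self.idle = asyncio.Event()   # set whenever nothing is in flight
        self.idle.set()
        self.log = []

    async def handle(self, duration):
        if not self.accepting:
            return "rejected"         # the gate is closed
        self.in_flight += 1
        self.idle.clear()
        try:
            await asyncio.sleep(duration)   # stand-in for real request work
            return "ok"
        finally:
            self.in_flight -= 1
            if self.in_flight == 0:
                self.idle.set()

    async def shutdown(self, drain_timeout):
        self.accepting = False                       # Phase 1: gate
        self.log.append("gate")
        try:                                         # Phase 2: drain
            await asyncio.wait_for(self.idle.wait(), drain_timeout)
            self.log.append("drained")
        except asyncio.TimeoutError:
            self.log.append("drain-timeout")
        self.log.append("teardown")                  # Phase 3: teardown


async def main():
    gw = Gateway()
    request = asyncio.create_task(gw.handle(0.05))
    await asyncio.sleep(0)                 # let the request start
    await gw.shutdown(drain_timeout=1.0)   # returns once the request is done
    late = await gw.handle(0.01)           # arrives after the gate closed
    return await request, late, gw.log


result = asyncio.run(main())
```

Because the drain phase waits on an idle signal and not on a fixed sleep, shutdown completes as soon as the last in-flight request finishes, and the timeout only acts as an upper bound.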

Worker Registry & REST API Improvements

Model field required (#713) -- Clients omitting model now get 400 Bad Request instead of silent "unknown" injection. Matches OpenAI API spec. Breaking change.
REST semantics (#875) -- POST /workers (create-only, 409 on conflict), PUT /workers/{id} (full replace), PATCH /workers/{id} (partial update). Breaking change: PUT now requires WorkerSpec instead of WorkerUpdateRequest.
Split register paths (#836) -- register() (create-only), replace() (overwrite-then-diff, no transient gap), register_or_replace() (idempotent upsert)
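
The difference between the create-only, full-replace, partial-update, and upsert paths can be shown with an in-memory sketch. Everything here is hypothetical: the `WorkerStore` class, the `Conflict` exception, and the spec fields (`url`, `model`, `priority`) are illustrative, and raising `KeyError` for a replace on a missing id is an assumption not stated in these notes.

```python
class Conflict(Exception):
    """Create-only call hit an existing id (maps to HTTP 409 on POST /workers)."""


class WorkerStore:
    def __init__(self):
        self.workers = {}

    def register(self, worker_id, spec):
        """Create-only: fail if the id already exists."""
        if worker_id in self.workers:
            raise Conflict(worker_id)
        self.workers[worker_id] = dict(spec)

    def replace(self, worker_id, spec):
        """Full replace (PUT): the stored spec becomes exactly `spec`."""
        if worker_id not in self.workers:
            raise KeyError(worker_id)   # assumption: missing id is an error
        self.workers[worker_id] = dict(spec)

    def register_or_replace(self, worker_id, spec):
        """Idempotent upsert: same result whether or not the id existed."""
        self.workers[worker_id] = dict(spec)

    def patch(self, worker_id, changes):
        """Partial update (PATCH): only the given fields change."""
        self.workers[worker_id].update(changes)


store = WorkerStore()
store.register("w1", {"url": "http://a:8000", "model": "m", "priority": 1})

try:
    store.register("w1", {"url": "http://other:8000", "model": "m"})
    conflict = False
except Conflict:
    conflict = True                     # second create is rejected

store.patch("w1", {"priority": 5})      # url and model are left untouched
patched = dict(store.workers["w1"])

store.replace("w1", {"url": "http://b:8000", "model": "m"})
replaced = dict(store.workers["w1"])    # priority is gone: full replace
```

The practical consequence for clients upgrading from earlier versions is that a PUT carrying only the changed fields would drop everything else, so partial updates belong on PATCH.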

vLLM gRPC Embedding Support

End-to-end embedding pipeline for vLLM via gRPC:

Rust gateway + Python servicer (calls engine.encode() with PoolingParams)
Flattened SGLang EmbedResponse proto (removed oneof, uses tonic::Status for errors)
Removed SGLang-specific log_metrics and cached_tokens from embed/classify protos

DeepSeek V3.1 Tool Call Parser

Native parser for DeepSeek V3.1's tool calling format:

Handles V3.1's simplified format (no function type prefix, no markdown code blocks)
Complete + streaming (parse_incremental) support
Auto-registered for deepseek-v3.1* and deepseek-ai/DeepSeek-V3.1* model patterns
E2E validated against live DeepSeek V3.1 (FP8) on 8×H200

Additional Features

Configurable storage hook context (#807) -- Map HTTP headers to storage hook request context via storage_context_headers
Conversation memories schema (#976) -- First-class conversation_memories table in data-connector with Oracle Flyway DDL and insert seam
gRPC health checking (#885) -- Standard grpc.health.v1 health service for vLLM workers
Model metadata in GetModelInfo (#871) -- vLLM GetModelInfo RPC now returns model metadata fields
Metrics server refactored to axum (#966) -- Foundation for /ws/metrics WebSocket endpoint
max_total_num_tokens in GetServerInfo (#817) -- Aligns gRPC response with HTTP server

Performance Improvements

Tokenizer: Optimized stop decoder and incremental sequence decoding (#990)
Routing: Optimized extract_text_for_routing string handling (#967)
Mesh: Eliminated per-request CRDT serialization in sync_tree_operation (#948)
Multimodal: Thread-local Resizer reuse (#923), eliminated DynamicImage clones (#928), optimized serialization and tensor conversion (#1012)

Bug Fixes

Multimodal: Fixed Phi-3-vision for string-format chat templates (#942), added LLaVA-Next anyres multi-crop support for vLLM gRPC (#941), hardened LLaVA-Next registry matching and token geometry (#945), propagated placeholder resolution errors instead of swallowing them (#943), fixed handling of images smaller than patch_size × merge_size (#908), switched LlavaSpec prompt replacements to preprocessor token counts (#958), added a fallback to config.model_type for aliased model IDs (#898)
Chat Templates: Inject special tokens (bos_token, eos_token) into chat template context (#914), correct content format detection for Qwen3-style templates (#981), inject special tokens inside tokenizer impls (#918)
Protocol: Accept null for boolean fields logprobs and stream (#1020), validate reasoning parser name at CLI and startup (#901)
gRPC: Fix assistant tool_calls message serialization for chat templates (#1023), include stop tokens in TRT-LLM output for Harmony parsing (#879), handle vLLM log forwarding on servicer side (#975)
Mesh: Stop advertising 0.0.0.0 to peers (#883), set tonic message size limits to match application limit (#893), prevent duplicate store events from inflating tree_sizes (#946)
Gateway: Update metric when removing unhealthy workers (#884), filter empty-string backend defaults in CLI arg fallback (#934)
Responses API: Align store=false state persistence behavior (#916)
Serve: Respect user-set CUDA_VISIBLE_DEVICES in gpu_env (#751)
Metrics: Increase default Prometheus bucket coverage to 1h (#909)
Logging: Show logs from all workspace crates, not just smg (#965)
Dependencies: Pin unicode-segmentation <1.12 in WASM guest crates (#886)

Refactoring

Unified circuit breaker recording to record_outcome(status_code) across all routers (#937)
Removed dead GenerateError from protos and gateway (#1002)
Renamed shared dispatcher stages, removed dead arms (#1004)
Used EngineClient interface instead of AsyncLLM in vLLM servicer (#949)
Responses API retrieval refactored to use data layer directly (#938)

Infrastructure & CI

Docker: Build 3 engine versions on release, test all engines on PR (#858), run SMG as non-root user (#863)
Helm CI: Helm chart release workflow to GHCR (#864)
Claude Code PR review: Automated review workflow with severity markers (#985, #986)
E2E: MMMU benchmark (#910), basic multimodal tests (#931), array content format coverage (#999), Harmony tests on vLLM and TRT-LLM (#801), worker log artifacts on failure (#963)
Benchmarks: Llama-4 and Llama-3.3 70B in nightly benchmarks (#900)

Breaking Changes

Model field required -- Requests omitting model now return 400 Bad Request instead of defaulting to "unknown". Matches OpenAI API spec.
Worker API REST semantics -- PUT /workers/{id} now expects WorkerSpec (full replace). Use PATCH /workers/{id} for partial updates with WorkerUpdateRequest.
Embedding proto changes -- log_metrics and cached_tokens removed from embedding/classify protos. SGLang EmbedResponse flattened (oneof removed).

Full Changelog: v1.3.3...v1.4.0

Upgrade now: pip install smg --upgrade

Shepherd your LLM infrastructure with confidence.

Docker Images

Pre-built engine images on GitHub Container Registry:

SGLang:

docker pull ghcr.io/lightseekorg/smg:1.4.0-sglang-v0.5.9

vLLM:

docker pull ghcr.io/lightseekorg/smg:1.4.0-vllm-v0.18.0

TensorRT-LLM:

docker pull ghcr.io/lightseekorg/smg:1.4.0-trtllm-1.3.0rc8

All images for v1.4.0:

Engine	Tag	Pull Command
sglang	`1.4.0-sglang-v0.5.9`	`docker pull ghcr.io/lightseekorg/smg:1.4.0-sglang-v0.5.9`
trtllm	`1.4.0-trtllm-1.3.0rc8`	`docker pull ghcr.io/lightseekorg/smg:1.4.0-trtllm-1.3.0rc8`
vllm	`1.4.0-vllm-v0.18.0`	`docker pull ghcr.io/lightseekorg/smg:1.4.0-vllm-v0.18.0`

What's Changed

feat support max_total_num_tokens in getserverinforesponse to keep align with get_server_info in HTTP server by @Huixxi in #817
feat(helm): add Helm chart for SMG router deployment by @slin1237 in #699
fix(helm): correct CLI flag names and defaults in chart templates by @slin1237 in #859
ci(docker): build 3 engine versions on release, test all engines on PR by @slin1237 in #858
fix(helm): quote custom labels and fix default securityContext by @slin1237 in #862
fix(helm): fix security context and test discovery for SMG chart by @slin1237 in #861
fix(docker): run SMG as non-root user and fix Helm chart compatibility by @slin1237 in #863
fix(ci): fix Docker build UID conflict and split build vs push triggers by @slin1237 in #865
ci(helm): add Helm chart release workflow to GHCR by @slin1237 in #864
refactor: make model field required, remove resolve_model_id by @CatherineSue in #713
fix(ci): pause minimax nightly bench and add HF token for H200 node by @key4ng in #868
feat(grpc): add model metadata fields to vLLM GetModelInfo RPC by @slin1237 in #871
chore(grpc): release smg-grpc-proto 0.5.0 and smg-grpc-servicer 0.6.0 by @slin1237 in #872
refactor(registry): split register into create-only and replace paths by @CatherineSue in #836
fix(ci): disable Docker layer cache for engine image builds by @slin1237 in #878
feat(api): enforce REST semantics for worker endpoints by @CatherineSue in #875
feat(core): add per-model retry config to WorkerRegistry by @CatherineSue in #821
feat(server): three-phase graceful shutdown with drain coordination by @slin1237 in #876
refactor(grpc): remove redundant harmony_stop_ids from build_generate_request_from_responses by @CatherineSue in #880
fix(deps): pin unicode-segmentation <1.12 in WASM guest crates by @slin1237 in #886
feat(grpc): add VllmHealthServicer for standard gRPC health checking (grpc.health.v1) by @V2arK in #885
chore(deps): bump azure/setup-helm from 4 to 5 by @dependabot[bot] in #887
fix(grpc): include stop tokens in trtllm output for Harmony parsing by @CatherineSue in #879
fix(mesh): set tonic message size limits to match application limit by @slin1237 in #893
chore: bump smg-grpc-proto to 0.4.5 by @CatherineSue in #896
feat(gateway): add configurable storage hook request context mapping by @zhaowenzi in #807
ci(mergify): upgrade configuration to current format by @mergify[bot] in #815
ci: Change to use model cache by @XinyueZhang369 in #904
fix(parsers): validate reasoning parser name at CLI and startup by @CatherineSue in #901
fix(multimodal): fall back to config.model_type for aliased model IDs by @CatherineSue in #898
feat(completions): add native gRPC pipeline typing for /v1/completions by @vschandramourya in #840
fix(multimodal): handle images smaller than patch_size * merge_size by @CatherineSue in #908
perf(mesh): delta encoding for tree state sync + profiling instrumentation by @slin1237 in #899
feat(mesh): register mesh-synced workers locally for health checking by @rajatgoel in #892
fix(metrics): Increase default prometheus bucket coverage to 1h by @ekzhang in #909
feat(e2e): add MMMU benchmark and fix Qwen VL vision token duplication by @CatherineSue in #910
fix(mesh): mirror health status for mesh-synced workers by @slin1237 in #912
feat(completions): add CompletionPreparationStage for gRPC pipeline by @vschandramourya in #907
fix(chat): inject special tokens (bos_token, eos_token) into chat template context by @CatherineSue in #914
perf(multimodal): 7-11x faster image preprocessing via SIMD resize, fused ops, and zero-copy patchify by @CatherineSue in #913
fix(mesh): stop advertising 0.0.0.0 to peers by @jshanson7 in #883
fix(responses): align store=false state persistence behavior by @zhaowenzi in #916
refactor(tokenizer): inject special tokens inside tokenizer impls instead of callers by @CatherineSue in #918
fix(serve): filter empty-string backend defaults in CLI arg fallback by @CatherineSue in #934
feat(http-router): use per-model retry config from WorkerRegistry by @CatherineSue in #881
feat(tui): add terminal dashboard for SMG by @key4ng in #867
feat(grpc-router): use per-model retry config from WorkerRegistry by @CatherineSue in #933
feat(openai-router): use per-model retry config from WorkerRegistry by @CatherineSue in #935
refactor(ci): Add actions.summerwind.dev ARC runner deployment option by @XinyueZhang369 in #797
refactor(responses): retrieval to use data layer directly by @zhaowenzi in #938
perf(mesh): eliminate per-request TreeState serialization on hot path by @slin1237 in #919
perf(multimodal): reuse Resizer via thread_local by @slin1237 in #923
perf(multimodal): avoid DynamicImage clone in no-op transform paths by @slin1237 in #928
feat(completions): add CompletionRequestBuildingStage and backend sampling params by @vschandramourya in #915
fix(ci): resolve trtllm cuda-bindings conflict with torch 2.10 by @CatherineSue in #940
feat(e2e): add basic multimodal chat completion tests by @CatherineSue in #931
refactor(routers): unify CB recording to record_outcome(status_code) by @CatherineSue in #937
fix(multimodal): fix Phi-3-vision image processing for string-format chat templates by @CatherineSue in #942
fix(multimodal): add LLaVA-Next anyres multi-crop support for vLLM gRPC by @CatherineSue in #941
chore: remove unnecessary result wrapping in pd_router by @lawrence-harmonic in #939
fix(multimodal): propagate placeholder resolution errors instead of swallowing by @CatherineSue in #943
fix(kv_index): prevent duplicate store events from inflating tree_sizes by @slin1237 in #946
fix(multimodal): harden LLaVA-Next registry matching and token geometry by @CatherineSue in #945
fix(codeowners): correct grpc_servicer path and ownership by @CatherineSue in #950
perf(mesh): eliminate per-request CRDT serialization in sync_tree_operation by @slin1237 in #948
refactor(grpc): use EngineClient interface instead of AsyncLLM in vLLM servicer by @CatherineSue in #949
fix(mesh): enforce timeout contract across all RPC and stream paths by @slin1237 in #952
fix(multimodal): remove num_img_tokens fallback in Phi3VisionSpec by @CatherineSue in #955
fix(multimodal): fix phi3_v test broken by num_img_tokens fallback removal by @CatherineSue in #957
fix(multimodal): use preprocessor token counts in LlavaSpec prompt replacements by @CatherineSue in #958
fix(multimodal): remove unused test_preprocessed helper by @CatherineSue in #962
feat(completions): add CompletionResponseProcessingStage for non-streaming responses by @vschandramourya in #953
feat(completions): wire Completion API pipeline into gRPC routers by @vschandramourya in #964
fix(mesh): bypass CRDT operation log for tree state — eliminate memory leak by @slin1237 in #961
perf(protocols): Optimize extract_text_for_routing strings by @ppraneth in #967
fix(logging): show logs from all workspace crates, not just smg by @slin1237 in #965
fix(ci): serialize model downloads to prevent PermissionError on shared NVMe by @key4ng in #968
feat(helm): add worker deployment support with engine-specific images by @slin1237 in #969
feat(helm): add mesh HA support with K8s service discovery by @slin1237 in #972
feat(scripts): add --model and --host flags to mesh load generator by @slin1237 in #973
fix(grpc_servicer): handle vllm log forwarding on servicer side by @CatherineSue in #975
feat(routers): use per-model retry config for Gemini, gRPC PD, HTTP PD by @CatherineSue in #936
ci(e2e): upload worker logs as artifacts on test failure by @key4ng in #963
ci(e2e): add step-level timeouts for trtllm setup and e2e tests by @key4ng in #977
feat(metrics-ws): [1/4] refactor metrics server to axum with PrometheusHandle by @key4ng in #966
feat(kv-index): radix tree snapshot serialization for mesh sync [WIP] by @slin1237 in #974
ci(e2e): bump genai-bench to 0.0.4 and improve benchmark uploads by @key4ng in #980
fix(tokenizer): correct content format detection for Qwen3-style chat templates by @CatherineSue in #981
fix(gateway): Update metric when removing unhealthy workers by @ekzhang in #884
ci(review): add Claude Code PR review workflow by @slin1237 in #985
feat(ci): add severity markers to Claude review comments by @slin1237 in #986
fix(serve): respect user-set CUDA_VISIBLE_DEVICES in gpu_env by @paxiaatucsdedu in #751
chore(deps): bump actions/upload-artifact from 4 to 7 by @dependabot[bot] in #991
chore(deps): bump actions/deploy-pages from 4 to 5 by @dependabot[bot] in #992
chore(deps): update ratatui requirement from 0.29 to 0.30 by @dependabot[bot] in #996
fix(ci): detect and abort on incomplete GPU cleanup by @key4ng in #988
test(e2e): enable Harmony tests on vLLM and TRT-LLM backends by @CatherineSue in #801
chore(deps): update criterion requirement from 0.5 to 0.8 by @dependabot[bot] in #994
fix(ci): deduplicate Claude review comments across pushes by @key4ng in #1003
test(e2e): add array content format coverage to chat completion tests by @key4ng in #999
refactor(grpc): rename shared dispatcher stages, remove dead arms by @CatherineSue in #1004
ci: pin sglang to v0.5.10rc0 for proto compat by @CatherineSue in #1005
chore(deps): update crossterm requirement from 0.28 to 0.29 by @dependabot[bot] in #995
refactor(grpc): remove dead GenerateError from protos and gateway by @CatherineSue in #1002
fix(ci): skip Claude review on fork PRs by @slin1237 in #1009
feat(embed): add vLLM gRPC embedding support, clean up proto by @CatherineSue in #1001
perf(multimodal): optimize preprocessing serialization, tensor conversion, and pad fusion by @CatherineSue in #1012
fix(mesh): eliminate memory leaks in two-layer tree sync protocol by @slin1237 in #1011
perf(tokenizer): optimize stop decoder and incremental sequence decoding by @CatherineSue in #990
fix(ci): update labeler paths after crate extraction by @CatherineSue in #1014
test(e2e): add Llama-4 and Llama-3.3 70b to nightly benchmarks by @paxiaatucsdedu in #900
fix(multimodal): remove as_slice_memory_order fallback in pixel serialization by @CatherineSue in #1013
feat(tool_parser): add DeepSeek V3.1 tool call parser by @key4ng in #1006
feat(data-connector): add conversation memories schema and insert seam by @zhoug9127 in #976
feat(completions): add Completion API streaming support to gRPC router by @vschandramourya in #978
fix(protocols): accept null for boolean fields logprobs and stream by @CatherineSue in #1020
fix(grpc): fix assistant tool_calls message serialization for chat templates by @key4ng in #1023
chore: bump versions for v1.4.0 release by @slin1237 in #1018

New Contributors

@V2arK made their first contribution in #885
@mergify[bot] made their first contribution in #815
@vschandramourya made their first contribution in #840
@rajatgoel made their first contribution in #892
@jshanson7 made their first contribution in #883
@lawrence-harmonic made their first contribution in #939
@paxiaatucsdedu made their first contribution in #751
@zhoug9127 made their first contribution in #976