v1.4.0

@slin1237 released this 02 Apr 15:53
· 440 commits to main since this release
52564df

🚀 Shepherd Model Gateway v1.4.0 Released

The biggest SMG release yet -- Kubernetes-native deployment via Helm, a terminal dashboard, 200x mesh memory reduction, 7-11x faster multimodal preprocessing, native Completion API over gRPC, and per-model retry configuration.

Kubernetes-Native Deployment with Helm

Production-ready Helm chart for deploying SMG on Kubernetes:

  • One-command deployment -- helm install smg oci://ghcr.io/lightseekorg/smg-helm deploys the full gateway stack
  • Router + Worker deployment -- A single chart deploys both the gateway router and inference engine workers (vLLM, SGLang, TRT-LLM) with GPU scheduling
  • Mesh HA with service discovery -- Deploy multiple gateway replicas as a StatefulSet with automatic gossip-based peer discovery via --router-selector
  • Full K8s integration -- RBAC, Ingress, HPA, PDB, ServiceMonitor, Grafana dashboard ConfigMap, JSON Schema validation at helm lint time
  • 5 example configurations -- Router-only, with-postgres, with-service-discovery, with-ingress, with-monitoring

Impact: Zero-to-production SMG deployment on Kubernetes with a single helm install. Declarative configuration, automatic scaling, and built-in observability.

Terminal Dashboard (smg-tui)

Full-featured terminal UI for real-time monitoring and interactive chat:

  • 7 tabs -- Pulse (real-time dashboard with sparklines), Workers (per-worker stats + circuit breaker state), Chat (streaming markdown playground), Logs (per-component with ANSI stripping), Benchmark, Traffic, Mesh
  • Worker management -- Quick-add presets for OpenAI/Anthropic/xAI/Gemini, local worker launch with automatic GPU selection via nvidia-smi, GPU claim tracking to prevent double-allocation
  • Gateway auto-start -- smg-tui --auto-start launches the gateway, polls health, and cleans up on exit
  • Chat playground -- Streaming SSE with live cursor, markdown rendering, multi-turn support, Tab to cycle models

Mesh Performance & Reliability Revolution

Eliminated catastrophic memory growth and achieved >200x improvement in mesh resource usage:

  • Delta encoding (#899): Only send new tree operations since last sync -- 40x smaller sync payloads (18.3 MB → 417 KB), gzip compression for additional 5-8x wire reduction
  • Lazy serialization (#919): Moved full TreeState serialization off the hot path -- memory: OOM crash → 31 MB stable, CPU: 280-345% → 56-58%, latency: 12s degrading → stable
  • CRDT bypass (#961): Moved tree state out of CRDT operation log -- eliminated ~1 GB/1.5hr memory leak under sustained load
  • Two-layer sync fix (#1011): Eliminated remaining memory leaks in the tree sync protocol
  • Snapshot serialization (#974): Structure-preserving radix tree snapshots for mesh sync -- shared prefixes stored once, replacing 40 MB flat operation replay with compact tree format
  • Timeout enforcement (#952): Consistent timeout contract across all RPC and stream paths
  • Health mirroring (#912, #892): Mesh-synced workers now register locally for health checking with proper status mirroring
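As a rough illustration of the delta-encoding idea, the sketch below keeps a per-peer cursor into a list of tree operations and sends only the operations recorded since that peer's last sync, gzip-compressed. It is not the SMG implementation: `TreeSyncSender`, the operation fields, and the JSON wire format are all hypothetical stand-ins.

```python
import gzip
import json


class TreeSyncSender:
    """Hypothetical sender that remembers how many operations each peer has seen."""

    def __init__(self):
        self.ops = []      # tree operations in arrival order (illustrative only)
        self.synced = {}   # peer id -> number of operations already sent

    def record(self, op):
        self.ops.append(op)

    def full_payload(self):
        # Baseline: serialize every operation on every sync.
        return gzip.compress(json.dumps(self.ops).encode("utf-8"))

    def delta_payload(self, peer):
        # Delta encoding: send only operations recorded since the last sync.
        start = self.synced.get(peer, 0)
        delta = self.ops[start:]
        self.synced[peer] = len(self.ops)
        return gzip.compress(json.dumps(delta).encode("utf-8"))


def decode(payload):
    return json.loads(gzip.decompress(payload).decode("utf-8"))


sender = TreeSyncSender()
for i in range(1000):
    sender.record({"op": "insert", "key": f"prompt-prefix-{i}", "worker": "w0"})

first = decode(sender.delta_payload("peer-a"))    # first sync carries all 1000 ops
sender.record({"op": "insert", "key": "prompt-prefix-1000", "worker": "w1"})
second = decode(sender.delta_payload("peer-a"))   # next sync carries only the new op
```

The in-memory list here grows without bound, so a real system would still need compaction or snapshots on top of the delta cursor.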

Benchmark Results (20 min, 500 rps, 20K-char prompts):

  • 565,920 requests, 0 errors
  • Memory plateaus at ~2.3 GB (no linear growth)

7-11x Faster Multimodal Image Preprocessing

SMG now matches or beats HuggingFace Python preprocessing performance:

  • SIMD resize -- Replaced image crate (pure Rust) with fast_image_resize v6 (AVX2/SSE4.1) for 10-25x faster resize
  • Fused operations -- Combined to_tensor_and_normalize(), zero-copy patchify_into(), fused pad + normalize + tile split for Llama4
  • Additional optimizations -- Thread-local Resizer reuse, eliminated DynamicImage clones, optimized serialization and tensor conversion
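To show why fusing the tensor conversion with normalization saves work, here is a minimal single-channel sketch. It relies on the identity (p / 255 - mean) / std = p * scale + bias, so each pixel is touched once instead of twice. The function names and the mean/std values are illustrative assumptions, and this pure-Python version says nothing about how the Rust code is actually written.

```python
def to_tensor_then_normalize(pixels, mean, std):
    # Two passes: scale 8-bit values to [0, 1], then normalize.
    scaled = [p / 255.0 for p in pixels]
    return [(s - mean) / std for s in scaled]


def fused_to_tensor_and_normalize(pixels, mean, std):
    # One pass: precompute scale and bias so each pixel needs one multiply-add.
    scale = 1.0 / (255.0 * std)
    bias = -mean / std
    return [p * scale + bias for p in pixels]


pixels = [0, 64, 128, 255]   # hypothetical 8-bit channel values
mean, std = 0.5, 0.25        # hypothetical normalization constants

two_pass = to_tensor_then_normalize(pixels, mean, std)
fused = fused_to_tensor_and_normalize(pixels, mean, std)

# Both paths agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(two_pass, fused))
```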

Benchmark Results (Qwen3-VL):

Image Size   Before    After (speedup)    vs HuggingFace Python
224×224      4.77 ms   0.44 ms (10.8x)    2.5x faster
640×480      15.5 ms   1.59 ms (9.7x)     1.8x faster
1024×768     40.6 ms   4.31 ms (9.4x)     1.6x faster
1920×1080    286 ms    39.6 ms (7.2x)     ~parity
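The speedup factors in the table follow directly from the before and after timings. A short check, using only the numbers reported above:

```python
# (image size, before in ms, after in ms), copied from the benchmark table
rows = [
    ("224×224", 4.77, 0.44),
    ("640×480", 15.5, 1.59),
    ("1024×768", 40.6, 4.31),
    ("1920×1080", 286.0, 39.6),
]

# Speedup is the ratio of the old latency to the new latency.
speedups = {size: round(before / after, 1) for size, before, after in rows}

lowest = min(speedups.values())
highest = max(speedups.values())
```

The ratios range from 7.2x to 10.8x, which is where the headline 7-11x figure comes from.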

Native Completion API over gRPC

Full /v1/completions support through the gRPC pipeline with streaming and PD disaggregation:

  • 6-PR pipeline -- CompletionRequest type, preparation stage, request building with backend sampling params, response processing, pipeline wiring, streaming support
  • Streaming -- OpenAI-compatible SSE events with per-index stop decoder tracking, echo and suffix handling
  • PD mode -- Dual streaming for prefill-decode disaggregation
  • Type safety -- Native RequestType::Completion throughout the pipeline, exhaustive match arms in shared stages
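On the client side, an OpenAI-compatible completion stream arrives as `data:` lines that each carry a JSON chunk, ending with `data: [DONE]`. The sketch below accumulates text per choice index from such lines. The sample payloads are made up for illustration, and the helper name is hypothetical; it is not part of SMG.

```python
import json


def collect_completion_stream(lines):
    """Accumulate streamed text and finish reasons per choice index."""
    texts = {}
    finish = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            idx = choice["index"]
            texts[idx] = texts.get(idx, "") + choice.get("text", "")
            if choice.get("finish_reason") is not None:
                finish[idx] = choice["finish_reason"]
    return texts, finish


# Hypothetical stream with two choices interleaved.
sample = [
    'data: {"choices": [{"index": 0, "text": "Hel", "finish_reason": null}]}',
    "",
    'data: {"choices": [{"index": 1, "text": "Hi", "finish_reason": "length"}]}',
    "",
    'data: {"choices": [{"index": 0, "text": "lo", "finish_reason": "stop"}]}',
    "",
    "data: [DONE]",
]

texts, finish = collect_completion_stream(sample)
```

Tracking state per index matters because chunks for different choices can interleave within one stream.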

Per-Model Retry Configuration

Different models can now have different retry policies:

  • WorkerRegistry integration -- Workers declare per-model retry config via WorkerSpec.resilience, stored in WorkerRegistry with last-write-wins semantics
  • All routers updated -- HTTP, gRPC, OpenAI, Gemini, gRPC PD, and HTTP PD routers all look up per-model config at request time, falling back to the global default
  • Cleanup on removal -- Retry config is automatically cleaned up when the last worker for a model is removed

Impact: GPU-constrained models can have longer timeouts and more retries, while fast models use aggressive retry budgets. No more one-size-fits-all.
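The lookup behavior described above (last-write-wins storage, fallback to the global default, cleanup when the last worker leaves) can be sketched as follows. This is an assumption-laden model, not the WorkerRegistry API: `RetryRegistry`, `RetryConfig`, and the `max_retries` and `timeout_secs` fields are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryConfig:
    max_retries: int      # hypothetical field
    timeout_secs: float   # hypothetical field


class RetryRegistry:
    def __init__(self, global_default):
        self.global_default = global_default
        self._by_model = {}   # model -> RetryConfig
        self._workers = {}    # model -> set of worker ids

    def register_worker(self, worker_id, model, retry=None):
        self._workers.setdefault(model, set()).add(worker_id)
        if retry is not None:
            self._by_model[model] = retry   # last write wins

    def remove_worker(self, worker_id, model):
        workers = self._workers.get(model, set())
        workers.discard(worker_id)
        if not workers:
            # Last worker for this model is gone: drop its retry config too.
            self._workers.pop(model, None)
            self._by_model.pop(model, None)

    def retry_for(self, model):
        # Per-model config if present, otherwise the global default.
        return self._by_model.get(model, self.global_default)


default = RetryConfig(max_retries=3, timeout_secs=30.0)
registry = RetryRegistry(default)
registry.register_worker("w1", "big-model", RetryConfig(5, 120.0))
registry.register_worker("w2", "big-model", RetryConfig(8, 180.0))  # overwrites w1's config
registry.register_worker("w3", "small-model")                       # no override

big = registry.retry_for("big-model")
small = registry.retry_for("small-model")
```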

Three-Phase Graceful Shutdown

Replaces fixed-timeout shutdown with an intelligent Gate → Drain → Teardown approach:

  • Phase 1 (Gate): Stop accepting new requests
  • Phase 2 (Drain): Wait for in-flight requests to complete (up to configured timeout)
  • Phase 3 (Teardown): MCP orchestrator cleanup + exit

Impact: Requests finishing in 2s no longer wait 28s for a fixed grace period. Requests needing 35s no longer get killed at 30s. The system drains to zero when possible.
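A minimal asyncio model of the three phases is sketched below. It assumes a single in-process request counter and an idle event; the `Server` class and its method names are hypothetical, and the real gateway (including the MCP orchestrator cleanup) is not represented.

```python
import asyncio


class Server:
    def __init__(self):
        self.accepting = True
        self.in_flight = 0
        self.events = []
        self._idle = asyncio.Event()
        self._idle.set()

    async def handle(self, duration):
        if not self.accepting:
            return "rejected"          # gate is closed
        self.in_flight += 1
        self._idle.clear()
        try:
            await asyncio.sleep(duration)   # stand-in for real request work
            return "ok"
        finally:
            self.in_flight -= 1
            if self.in_flight == 0:
                self._idle.set()

    async def shutdown(self, drain_timeout):
        # Phase 1 (Gate): stop accepting new requests.
        self.accepting = False
        self.events.append("gate")
        # Phase 2 (Drain): wait for in-flight work, bounded by the timeout.
        try:
            await asyncio.wait_for(self._idle.wait(), drain_timeout)
            self.events.append("drained")
        except asyncio.TimeoutError:
            self.events.append("drain-timeout")
        # Phase 3 (Teardown): release resources and exit.
        self.events.append("teardown")


async def demo():
    server = Server()
    task = asyncio.create_task(server.handle(0.05))
    await asyncio.sleep(0)                  # let the request start
    await server.shutdown(drain_timeout=1.0)
    late = await server.handle(0.01)        # arrives after the gate closed
    return server.events, await task, late


events, first, late = asyncio.run(demo())
```

Because the drain phase waits on an event rather than sleeping for a fixed period, shutdown proceeds as soon as the last in-flight request finishes.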

Worker Registry & REST API Improvements

  • Model field required (#713) -- Clients omitting model now get 400 Bad Request instead of silent "unknown" injection. Matches OpenAI API spec. Breaking change.
  • REST semantics (#875) -- POST /workers (create-only, 409 on conflict), PUT /workers/{id} (full replace), PATCH /workers/{id} (partial update). Breaking change: PUT now requires WorkerSpec instead of WorkerUpdateRequest.
  • Split register paths (#836) -- register() (create-only), replace() (overwrite-then-diff, no transient gap), register_or_replace() (idempotent upsert)
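The difference between the three verbs can be modeled with a small in-memory store. Only the 409-on-conflict behavior is taken from the notes above; the 201, 200, and 404 status codes, the `WorkerStore` class, and the `url`, `model`, and `priority` fields are assumptions made for the example.

```python
class WorkerStore:
    """Hypothetical in-memory model of POST / PUT / PATCH semantics."""

    def __init__(self):
        self.workers = {}

    def post(self, worker_id, spec):
        # Create-only: never overwrite an existing worker.
        if worker_id in self.workers:
            return 409, None
        self.workers[worker_id] = dict(spec)
        return 201, self.workers[worker_id]      # 201 is an assumption

    def put(self, worker_id, spec):
        # Full replace: the stored worker becomes exactly `spec`.
        if worker_id not in self.workers:
            return 404, None                     # 404 is an assumption
        self.workers[worker_id] = dict(spec)
        return 200, self.workers[worker_id]

    def patch(self, worker_id, update):
        # Partial update: only the supplied fields change.
        if worker_id not in self.workers:
            return 404, None                     # 404 is an assumption
        self.workers[worker_id].update(update)
        return 200, self.workers[worker_id]


store = WorkerStore()
created, _ = store.post("w1", {"url": "http://a", "model": "m"})
conflict, _ = store.post("w1", {"url": "http://b", "model": "m"})
_, patched = store.patch("w1", {"priority": 5})
patched = dict(patched)                          # copy before the replace below
_, replaced = store.put("w1", {"url": "http://b", "model": "m"})
```

After the PUT, the `priority` field set by the earlier PATCH is gone, which is the practical difference clients need to account for when migrating.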

vLLM gRPC Embedding Support

End-to-end embedding pipeline for vLLM via gRPC:

  • Rust gateway + Python servicer (calls engine.encode() with PoolingParams)
  • Flattened SGLang EmbedResponse proto (removed oneof, uses tonic::Status for errors)
  • Removed SGLang-specific log_metrics and cached_tokens from embed/classify protos

DeepSeek V3.1 Tool Call Parser

Native parser for DeepSeek V3.1's tool calling format:

  • Handles V3.1's simplified format (no function type prefix, no markdown code blocks)
  • Complete + streaming (parse_incremental) support
  • Auto-registered for deepseek-v3.1* and deepseek-ai/DeepSeek-V3.1* model patterns
  • E2E validated against live DeepSeek V3.1 (FP8) on 8×H200
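To get a feel for which model IDs the two registration patterns would cover, the sketch below applies them as shell-style globs with Python's `fnmatch`. Treating the patterns as case-sensitive globs is an assumption; the gateway's actual matching rules may differ, and the helper name is hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns quoted from the release notes.
PATTERNS = ["deepseek-v3.1*", "deepseek-ai/DeepSeek-V3.1*"]


def uses_v31_parser(model_id):
    # Assumption: case-sensitive glob matching, first match wins.
    return any(fnmatchcase(model_id, pattern) for pattern in PATTERNS)


matches = {
    model_id: uses_v31_parser(model_id)
    for model_id in [
        "deepseek-v3.1",
        "deepseek-ai/DeepSeek-V3.1",
        "deepseek-ai/DeepSeek-V3.1-Base",
        "deepseek-v3",
        "deepseek-ai/DeepSeek-V3",
    ]
}
```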

Additional Features

  • Configurable storage hook context (#807) -- Map HTTP headers to storage hook request context via storage_context_headers
  • Conversation memories schema (#976) -- First-class conversation_memories table in data-connector with Oracle Flyway DDL and insert seam
  • gRPC health checking (#885) -- Standard grpc.health.v1 health service for vLLM workers
  • Model metadata in GetModelInfo (#871) -- vLLM GetModelInfo RPC now returns model metadata fields
  • Metrics server refactored to axum (#966) -- Foundation for /ws/metrics WebSocket endpoint
  • max_total_num_tokens in GetServerInfo (#817) -- Aligns gRPC response with HTTP server

Performance Improvements

  • Tokenizer: Optimized stop decoder and incremental sequence decoding (#990)
  • Routing: Optimized extract_text_for_routing string handling (#967)
  • Mesh: Eliminated per-request CRDT serialization in sync_tree_operation (#948)
  • Multimodal: Thread-local Resizer reuse (#923), eliminated DynamicImage clones (#928), optimized serialization and tensor conversion (#1012)

Bug Fixes

  • Multimodal: Fixed Phi-3-vision image processing for string-format chat templates (#942), added LLaVA-Next anyres multi-crop support for vLLM gRPC (#941), hardened LLaVA-Next registry matching and token geometry (#945), propagated placeholder resolution errors instead of swallowing them (#943), handled images smaller than patch_size × merge_size (#908), switched LlavaSpec prompt replacements to preprocessor token counts (#958), added a fallback to config.model_type for aliased model IDs (#898)
  • Chat Templates: Inject special tokens (bos_token, eos_token) into chat template context (#914), correct content format detection for Qwen3-style templates (#981), inject special tokens inside tokenizer impls (#918)
  • Protocol: Accept null for boolean fields logprobs and stream (#1020), validate reasoning parser name at CLI and startup (#901)
  • gRPC: Fix assistant tool_calls message serialization for chat templates (#1023), include stop tokens in TRT-LLM output for Harmony parsing (#879), handle vllm log forwarding on servicer side (#975)
  • Mesh: Stop advertising 0.0.0.0 to peers (#883), set tonic message size limits to match application limit (#893), prevent duplicate store events from inflating tree_sizes (#946)
  • Gateway: Update metric when removing unhealthy workers (#884), filter empty-string backend defaults in CLI arg fallback (#934)
  • Responses API: Align store=false state persistence behavior (#916)
  • Serve: Respect user-set CUDA_VISIBLE_DEVICES in gpu_env (#751)
  • Metrics: Increase default Prometheus bucket coverage to 1h (#909)
  • Logging: Show logs from all workspace crates, not just smg (#965)
  • Dependencies: Pin unicode-segmentation <1.12 in WASM guest crates (#886)

Refactoring

  • Unified circuit breaker recording to record_outcome(status_code) across all routers (#937)
  • Removed dead GenerateError from protos and gateway (#1002)
  • Renamed shared dispatcher stages, removed dead arms (#1004)
  • Used EngineClient interface instead of AsyncLLM in vLLM servicer (#949)
  • Responses API retrieval refactored to use data layer directly (#938)
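The appeal of a single `record_outcome(status_code)` entry point is that every router reports results the same way and the breaker decides what counts as a failure. The sketch below is a deliberately simplified model: the 5xx-only failure rule, the consecutive-failure threshold, and the absence of a half-open state or cooldown are all assumptions, not a description of SMG's circuit breaker.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.state = "closed"

    def record_outcome(self, status_code):
        # Assumption: only 5xx responses count against the worker.
        if status_code >= 500:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.state = "open"
        else:
            # Simplification: any non-5xx response resets the breaker.
            self.consecutive_failures = 0
            self.state = "closed"


breaker = CircuitBreaker(failure_threshold=3)
for code in (200, 503, 500, 502):
    breaker.record_outcome(code)
state_after_failures = breaker.state

client_errors = CircuitBreaker(failure_threshold=3)
for code in (404, 400, 429, 404):
    client_errors.record_outcome(code)
state_after_client_errors = client_errors.state
```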

Infrastructure & CI

  • Docker: Build 3 engine versions on release, test all engines on PR (#858), run SMG as non-root user (#863)
  • Helm CI: Helm chart release workflow to GHCR (#864)
  • Claude Code PR review: Automated review workflow with severity markers (#985, #986)
  • E2E: MMMU benchmark (#910), basic multimodal tests (#931), array content format coverage (#999), Harmony tests on vLLM and TRT-LLM (#801), worker log artifacts on failure (#963)
  • Benchmarks: Llama-4 and Llama-3.3 70b in nightly benchmarks (#900)

Breaking Changes

  1. Model field required -- Requests omitting model now return 400 Bad Request instead of defaulting to "unknown". Matches OpenAI API spec.
  2. Worker API REST semantics -- PUT /workers/{id} now expects WorkerSpec (full replace). Use PATCH /workers/{id} for partial updates with WorkerUpdateRequest.
  3. Embedding proto changes -- log_metrics and cached_tokens removed from embedding/classify protos. SGLang EmbedResponse flattened (oneof removed).

Full Changelog: v1.3.3...v1.4.0

Upgrade now: pip install smg --upgrade

Shepherd your LLM infrastructure with confidence.

Docker Images

Pre-built engine images on GitHub Container Registry:

SGLang:

docker pull ghcr.io/lightseekorg/smg:1.4.0-sglang-v0.5.9

vLLM:

docker pull ghcr.io/lightseekorg/smg:1.4.0-vllm-v0.18.0

TensorRT-LLM:

docker pull ghcr.io/lightseekorg/smg:1.4.0-trtllm-1.3.0rc8

All images for v1.4.0:

Engine   Tag                      Pull Command
sglang   1.4.0-sglang-v0.5.9      docker pull ghcr.io/lightseekorg/smg:1.4.0-sglang-v0.5.9
trtllm   1.4.0-trtllm-1.3.0rc8    docker pull ghcr.io/lightseekorg/smg:1.4.0-trtllm-1.3.0rc8
vllm     1.4.0-vllm-v0.18.0       docker pull ghcr.io/lightseekorg/smg:1.4.0-vllm-v0.18.0

What's Changed

  • feat support max_total_num_tokens in getserverinforesponse to keep align with get_server_info in HTTP server by @Huixxi in #817
  • feat(helm): add Helm chart for SMG router deployment by @slin1237 in #699
  • fix(helm): correct CLI flag names and defaults in chart templates by @slin1237 in #859
  • ci(docker): build 3 engine versions on release, test all engines on PR by @slin1237 in #858
  • fix(helm): quote custom labels and fix default securityContext by @slin1237 in #862
  • fix(helm): fix security context and test discovery for SMG chart by @slin1237 in #861
  • fix(docker): run SMG as non-root user and fix Helm chart compatibility by @slin1237 in #863
  • fix(ci): fix Docker build UID conflict and split build vs push triggers by @slin1237 in #865
  • ci(helm): add Helm chart release workflow to GHCR by @slin1237 in #864
  • refactor: make model field required, remove resolve_model_id by @CatherineSue in #713
  • fix(ci): pause minimax nightly bench and add HF token for H200 node by @key4ng in #868
  • feat(grpc): add model metadata fields to vLLM GetModelInfo RPC by @slin1237 in #871
  • chore(grpc): release smg-grpc-proto 0.5.0 and smg-grpc-servicer 0.6.0 by @slin1237 in #872
  • refactor(registry): split register into create-only and replace paths by @CatherineSue in #836
  • fix(ci): disable Docker layer cache for engine image builds by @slin1237 in #878
  • feat(api): enforce REST semantics for worker endpoints by @CatherineSue in #875
  • feat(core): add per-model retry config to WorkerRegistry by @CatherineSue in #821
  • feat(server): three-phase graceful shutdown with drain coordination by @slin1237 in #876
  • refactor(grpc): remove redundant harmony_stop_ids from build_generate_request_from_responses by @CatherineSue in #880
  • fix(deps): pin unicode-segmentation <1.12 in WASM guest crates by @slin1237 in #886
  • feat(grpc): add VllmHealthServicer for standard gRPC health checking (grpc.health.v1) by @V2arK in #885
  • chore(deps): bump azure/setup-helm from 4 to 5 by @dependabot[bot] in #887
  • fix(grpc): include stop tokens in trtllm output for Harmony parsing by @CatherineSue in #879
  • fix(mesh): set tonic message size limits to match application limit by @slin1237 in #893
  • chore: bump smg-grpc-proto to 0.4.5 by @CatherineSue in #896
  • feat(gateway): add configurable storage hook request context mapping by @zhaowenzi in #807
  • ci(mergify): upgrade configuration to current format by @mergify[bot] in #815
  • ci: Change to use model cache by @XinyueZhang369 in #904
  • fix(parsers): validate reasoning parser name at CLI and startup by @CatherineSue in #901
  • fix(multimodal): fall back to config.model_type for aliased model IDs by @CatherineSue in #898
  • feat(completions): add native gRPC pipeline typing for /v1/completions by @vschandramourya in #840
  • fix(multimodal): handle images smaller than patch_size * merge_size by @CatherineSue in #908
  • perf(mesh): delta encoding for tree state sync + profiling instrumentation by @slin1237 in #899
  • feat(mesh): register mesh-synced workers locally for health checking by @rajatgoel in #892
  • fix(metrics): Increase default prometheus bucket coverage to 1h by @ekzhang in #909
  • feat(e2e): add MMMU benchmark and fix Qwen VL vision token duplication by @CatherineSue in #910
  • fix(mesh): mirror health status for mesh-synced workers by @slin1237 in #912
  • feat(completions): add CompletionPreparationStage for gRPC pipeline by @vschandramourya in #907
  • fix(chat): inject special tokens (bos_token, eos_token) into chat template context by @CatherineSue in #914
  • perf(multimodal): 7-11x faster image preprocessing via SIMD resize, fused ops, and zero-copy patchify by @CatherineSue in #913
  • fix(mesh): stop advertising 0.0.0.0 to peers by @jshanson7 in #883
  • fix(responses): align store=false state persistence behavior by @zhaowenzi in #916
  • refactor(tokenizer): inject special tokens inside tokenizer impls instead of callers by @CatherineSue in #918
  • fix(serve): filter empty-string backend defaults in CLI arg fallback by @CatherineSue in #934
  • feat(http-router): use per-model retry config from WorkerRegistry by @CatherineSue in #881
  • feat(tui): add terminal dashboard for SMG by @key4ng in #867
  • feat(grpc-router): use per-model retry config from WorkerRegistry by @CatherineSue in #933
  • feat(openai-router): use per-model retry config from WorkerRegistry by @CatherineSue in #935
  • refactor(ci): Add actions.summerwind.dev ARC runner deployment option by @XinyueZhang369 in #797
  • refactor(responses): retrieval to use data layer directly by @zhaowenzi in #938
  • perf(mesh): eliminate per-request TreeState serialization on hot path by @slin1237 in #919
  • perf(multimodal): reuse Resizer via thread_local by @slin1237 in #923
  • perf(multimodal): avoid DynamicImage clone in no-op transform paths by @slin1237 in #928
  • feat(completions): add CompletionRequestBuildingStage and backend sampling params by @vschandramourya in #915
  • fix(ci): resolve trtllm cuda-bindings conflict with torch 2.10 by @CatherineSue in #940
  • feat(e2e): add basic multimodal chat completion tests by @CatherineSue in #931
  • refactor(routers): unify CB recording to record_outcome(status_code) by @CatherineSue in #937
  • fix(multimodal): fix Phi-3-vision image processing for string-format chat templates by @CatherineSue in #942
  • fix(multimodal): add LLaVA-Next anyres multi-crop support for vLLM gRPC by @CatherineSue in #941
  • chore: remove unnecessary result wrapping in pd_router by @lawrence-harmonic in #939
  • fix(multimodal): propagate placeholder resolution errors instead of swallowing by @CatherineSue in #943
  • fix(kv_index): prevent duplicate store events from inflating tree_sizes by @slin1237 in #946
  • fix(multimodal): harden LLaVA-Next registry matching and token geometry by @CatherineSue in #945
  • fix(codeowners): correct grpc_servicer path and ownership by @CatherineSue in #950
  • perf(mesh): eliminate per-request CRDT serialization in sync_tree_operation by @slin1237 in #948
  • refactor(grpc): use EngineClient interface instead of AsyncLLM in vLLM servicer by @CatherineSue in #949
  • fix(mesh): enforce timeout contract across all RPC and stream paths by @slin1237 in #952
  • fix(multimodal): remove num_img_tokens fallback in Phi3VisionSpec by @CatherineSue in #955
  • fix(multimodal): fix phi3_v test broken by num_img_tokens fallback removal by @CatherineSue in #957
  • fix(multimodal): use preprocessor token counts in LlavaSpec prompt replacements by @CatherineSue in #958
  • fix(multimodal): remove unused test_preprocessed helper by @CatherineSue in #962
  • feat(completions): add CompletionResponseProcessingStage for non-streaming responses by @vschandramourya in #953
  • feat(completions): wire Completion API pipeline into gRPC routers by @vschandramourya in #964
  • fix(mesh): bypass CRDT operation log for tree state — eliminate memory leak by @slin1237 in #961
  • perf(protocols): Optimize extract_text_for_routing strings by @ppraneth in #967
  • fix(logging): show logs from all workspace crates, not just smg by @slin1237 in #965
  • fix(ci): serialize model downloads to prevent PermissionError on shared NVMe by @key4ng in #968
  • feat(helm): add worker deployment support with engine-specific images by @slin1237 in #969
  • feat(helm): add mesh HA support with K8s service discovery by @slin1237 in #972
  • feat(scripts): add --model and --host flags to mesh load generator by @slin1237 in #973
  • fix(grpc_servicer): handle vllm log forwarding on servicer side by @CatherineSue in #975
  • feat(routers): use per-model retry config for Gemini, gRPC PD, HTTP PD by @CatherineSue in #936
  • ci(e2e): upload worker logs as artifacts on test failure by @key4ng in #963
  • ci(e2e): add step-level timeouts for trtllm setup and e2e tests by @key4ng in #977
  • feat(metrics-ws): [1/4] refactor metrics server to axum with PrometheusHandle by @key4ng in #966
  • feat(kv-index): radix tree snapshot serialization for mesh sync [WIP] by @slin1237 in #974
  • ci(e2e): bump genai-bench to 0.0.4 and improve benchmark uploads by @key4ng in #980
  • fix(tokenizer): correct content format detection for Qwen3-style chat templates by @CatherineSue in #981
  • fix(gateway): Update metric when removing unhealthy workers by @ekzhang in #884
  • ci(review): add Claude Code PR review workflow by @slin1237 in #985
  • feat(ci): add severity markers to Claude review comments by @slin1237 in #986
  • fix(serve): respect user-set CUDA_VISIBLE_DEVICES in gpu_env by @paxiaatucsdedu in #751
  • chore(deps): bump actions/upload-artifact from 4 to 7 by @dependabot[bot] in #991
  • chore(deps): bump actions/deploy-pages from 4 to 5 by @dependabot[bot] in #992
  • chore(deps): update ratatui requirement from 0.29 to 0.30 by @dependabot[bot] in #996
  • fix(ci): detect and abort on incomplete GPU cleanup by @key4ng in #988
  • test(e2e): enable Harmony tests on vLLM and TRT-LLM backends by @CatherineSue in #801
  • chore(deps): update criterion requirement from 0.5 to 0.8 by @dependabot[bot] in #994
  • fix(ci): deduplicate Claude review comments across pushes by @key4ng in #1003
  • test(e2e): add array content format coverage to chat completion tests by @key4ng in #999
  • refactor(grpc): rename shared dispatcher stages, remove dead arms by @CatherineSue in #1004
  • ci: pin sglang to v0.5.10rc0 for proto compat by @CatherineSue in #1005
  • chore(deps): update crossterm requirement from 0.28 to 0.29 by @dependabot[bot] in #995
  • refactor(grpc): remove dead GenerateError from protos and gateway by @CatherineSue in #1002
  • fix(ci): skip Claude review on fork PRs by @slin1237 in #1009
  • feat(embed): add vLLM gRPC embedding support, clean up proto by @CatherineSue in #1001
  • perf(multimodal): optimize preprocessing serialization, tensor conversion, and pad fusion by @CatherineSue in #1012
  • fix(mesh): eliminate memory leaks in two-layer tree sync protocol by @slin1237 in #1011
  • perf(tokenizer): optimize stop decoder and incremental sequence decoding by @CatherineSue in #990
  • fix(ci): update labeler paths after crate extraction by @CatherineSue in #1014
  • test(e2e): add Llama-4 and Llama-3.3 70b to nightly benchmarks by @paxiaatucsdedu in #900
  • fix(multimodal): remove as_slice_memory_order fallback in pixel serialization by @CatherineSue in #1013
  • feat(tool_parser): add DeepSeek V3.1 tool call parser by @key4ng in #1006
  • feat(data-connector): add conversation memories schema and insert seam by @zhoug9127 in #976
  • feat(completions): add Completion API streaming support to gRPC router by @vschandramourya in #978
  • fix(protocols): accept null for boolean fields logprobs and stream by @CatherineSue in #1020
  • fix(grpc): fix assistant tool_calls message serialization for chat templates by @key4ng in #1023
  • chore: bump versions for v1.4.0 release by @slin1237 in #1018
