v1.2.0
🚀 Shepherd Model Gateway v1.2.0 Released!
We're thrilled to announce Shepherd Model Gateway v1.2.0 – a transformative release featuring enhanced event-driven cache-aware routing, production-ready client SDKs, Google Gemini integration, and vLLM gRPC server adoption!
⚡ Enhanced Event-Driven Cache-Aware Routing
Inspired by Amazon Dynamo's distributed caching principles, SMG extends its existing cache-aware routing with real-time KV cache event subscriptions:
- SubscribeKvEvents RPC - Real-time KV cache event stream from all backends (SGLang, vLLM, TensorRT-LLM)
- KvEventMonitor - Per-worker KV cache event subscriptions with automatic recovery
- PositionalIndexer - Event-driven cache-aware routing with router prefix hash for query-path disambiguation
- Auto-learned block_size - Dynamically learn from KV event stream
- Flash Indexer parity - Closed 4 performance gaps, tuned DashMap shards to 256
Production Results (8 Llama model replicas):
- TTFT avg: -23.0% (93.10 → 71.66 ms)
- TTFT p99: -27.9% (186.98 → 134.88 ms)
- TPOT avg: -0.9% (6.39 → 6.33 ms)
- Latency avg: -3.8% (731.60 → 703.92 ms)
- Latency max: -11.8% (1034.27 → 912.47 ms)
- Req/sec: +1.3% (9.959 → 10.093)
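The percentages above are relative changes against the baseline, (after - before) / before. A small Python check using the figures from the list (the helper name `relative_change_pct` is ours, for illustration only):

```python
def relative_change_pct(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the results list above.
results = {
    "TTFT avg (ms)": (93.10, 71.66),
    "TTFT p99 (ms)": (186.98, 134.88),
    "TPOT avg (ms)": (6.39, 6.33),
    "Latency avg (ms)": (731.60, 703.92),
    "Latency max (ms)": (1034.27, 912.47),
    "Req/sec": (9.959, 10.093),
}

for name, (before, after) in results.items():
    print(f"{name}: {relative_change_pct(before, after):+.1f}%")
```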
Impact: Higher KV cache utilization across your inference fleet. Requests are routed to workers with matching cached prefixes, which avoids redundant computation and cut average TTFT by 23% in the results above.
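To illustrate the general idea behind event-driven cache-aware routing, here is a minimal Python sketch. It is not SMG's implementation: the `PrefixIndex` class, its methods, the chained block hash, and the fixed `block_size` are all assumptions made for illustration. Workers report which token blocks they have cached (standing in for KV cache events), and the router picks the worker with the longest matching prefix.

```python
import hashlib
from collections import defaultdict


def block_hashes(tokens: list[int], block_size: int) -> list[str]:
    """Chain-hash full token blocks so each hash identifies a whole prefix."""
    hashes, parent = [], ""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(f"{parent}:{block}".encode()).hexdigest()
        hashes.append(digest)
        parent = digest
    return hashes


class PrefixIndex:
    """Hypothetical index: block hash -> workers that reported caching it."""

    def __init__(self, block_size: int) -> None:
        self.block_size = block_size
        self.workers_by_block: dict[str, set[str]] = defaultdict(set)

    def on_blocks_stored(self, worker: str, tokens: list[int]) -> None:
        # Stands in for a "blocks stored" KV cache event from a backend.
        for h in block_hashes(tokens, self.block_size):
            self.workers_by_block[h].add(worker)

    def best_worker(self, tokens: list[int], workers: list[str]) -> tuple[str, int]:
        """Return (worker, matched_blocks) for the longest cached prefix."""
        matched = {w: 0 for w in workers}
        candidates = set(workers)
        for depth, h in enumerate(block_hashes(tokens, self.block_size), start=1):
            candidates &= self.workers_by_block.get(h, set())
            if not candidates:
                break
            for w in candidates:
                matched[w] = depth
        # Ties (including no match at all) fall back to the first worker listed.
        best = max(workers, key=lambda w: matched[w])
        return best, matched[best]


index = PrefixIndex(block_size=4)
index.on_blocks_stored("worker-a", [1, 2, 3, 4, 5, 6, 7, 8])
index.on_blocks_stored("worker-b", [1, 2, 3, 4])
print(index.best_worker([1, 2, 3, 4, 5, 6, 7, 8, 9], ["worker-a", "worker-b"]))
```

In this toy setup the request shares two full blocks with `worker-a` and one with `worker-b`, so `worker-a` is chosen. A real router would also need to handle block eviction events and load balancing, which the sketch leaves out.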
🎨 TensorRT-LLM Multimodal Support
Complete vision-language model integration:
- gRPC multimodal pipeline - preprocessed data with hashing
- Backend-specific variants - optimized for TRT-LLM
- String-based stop sequences - no pre-tokenization overhead
- matched_stop support - proper stop sequence handling
🔄 vLLM Upstream gRPC Adoption
SMG's gRPC server implementation is now upstream in vLLM!
vLLM's PR #36169 formalizes SMG's protobuf and gRPC server implementation as an upstream dependency. gRPC is now an officially supported protocol in the vllm serve command.
- smg-grpc-servicer package published to PyPI
- Production-grade gRPC server infrastructure
- Credit to @CatherineSue and @njhill for driving this milestone
Impact: A significant milestone for the project—SMG's gRPC innovations are now the foundation for vLLM's official gRPC support.
📦 Production-Ready Client SDKs
Multi-language SDK ecosystem with OpenAPI codegen:
- Python SDK - Drop-in replacement for OpenAI/Anthropic SDKs with complete API coverage
- Rust HTTP Client - Type-safe, async-first client with all endpoints
- Java Type Generation - Full OpenAPI-derived types
Endpoints: Chat completions, classify, parser, responses, workers, loads, and more.
Impact: Integrate SMG into any tech stack with idiomatic, type-safe clients. Zero-friction migration from OpenAI/Anthropic.
🐳 Engine-Specific Docker Images
Pre-built Docker images for each inference engine:
docker pull ghcr.io/lightseekorg/smg:1.2.0-sglang-v0.5.9
docker pull ghcr.io/lightseekorg/smg:1.2.0-vllm-v0.17.0
docker pull ghcr.io/lightseekorg/smg:1.2.0-trtllm-1.3.0rc6
Impact: Zero-configuration deployment with engine-optimized images. Pull and run your preferred backend instantly. Credit to @gongwei-130 for driving this feature.
🔮 Google Gemini Integration
New Gemini router for Google's Interactions API:
- Complete router registration and infrastructure
- Native protocol support for Gemini models
- Seamless integration alongside OpenAI, Anthropic, and self-hosted engines
Impact: Route to Gemini alongside your entire model fleet. One gateway, all providers.
💾 Advanced Data Persistence
Enterprise-grade data connector enhancements:
- Schema versioning with safe-by-default migrations (Flyway integration)
- SchemaConfig - Customizable table and column names for existing databases
- Storage hooks - Pre/post persistence callbacks
- WASM bridge - Call storage APIs from WASM middleware
Performance: Reduced Oracle DB round trips in pagination, batch linking, and delete operations.
📊 Load Monitoring & Discovery
New /v1/loads endpoint with gRPC support, per-worker model customization (--model-id-from), external worker discovery with per-provider API keys, and model aliasing.
🔌 MCP Enhancements
Responses API Integration:
- MCP approval items in protocols and routers
- `X-SMG-MCP` header for MCP passthrough control
Anthropic Features:
- `tool_search_tool` support
- `defer_loading` for lazy tool initialization
Performance:
- Concurrent tool execution in McpToolSession
- Lock-free pool stats with AtomicUsize
- Fixed O(n²) insert in inject_mcp_output_items
- Reverse iteration with optional limit in AuditLog
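The AuditLog change can be pictured as follows. This is a Python sketch of the general pattern, not the Rust code in SMG: the `AuditLog` class shape, its `record` method, and the entry layout are hypothetical, and only the name `for_request` comes from the changelog. Walking the log from newest to oldest and stopping once `limit` entries are found avoids scanning the whole log when only recent entries are needed.

```python
from itertools import islice
from typing import Iterator, Optional


class AuditLog:
    """Hypothetical append-only log of (request_id, message) entries."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, str]] = []

    def record(self, request_id: str, message: str) -> None:
        self._entries.append((request_id, message))

    def for_request(self, request_id: str, limit: Optional[int] = None) -> list[str]:
        """Newest-first messages for one request, at most `limit` of them."""
        matches: Iterator[str] = (
            msg for rid, msg in reversed(self._entries) if rid == request_id
        )
        # islice stops pulling from the generator once `limit` items are taken,
        # so older entries are never visited. limit=None returns every match.
        return list(islice(matches, limit))


log = AuditLog()
log.record("req-1", "tool call started")
log.record("req-2", "tool call started")
log.record("req-1", "tool call finished")
print(log.for_request("req-1", limit=1))
print(log.for_request("req-1"))
```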
🎨 Multimodal Improvements
Qwen VL Support:
- Proper patchification and prompt replacement token counts
- Backend-specific MultimodalData variants
vLLM Integration:
- Send preprocessed multimodal data with hashing and structured tokens
- Derive keep_on_cpu keys from model spec
Code Quality:
- Split registry.rs into per-model spec modules
- Removed dead MultiModalInputs/Tensor/Value types
⚡ Performance Optimizations
Core Engine:
- Zero-allocation JSON streaming validation with `IgnoredAny`
- Move semantics in gRPC request handling instead of cloning
- Optimized clone usage and finalize/emit_completed ordering
Indexing:
- worker_blocks moved to caller-owned storage
- Setup moved out of concurrent benchmark timing loop
🛡️ Critical Bug Fixes
- gRPC: Proper status codes, circuit breaker accuracy, return `tonic::Status` directly, use `e.message()` for errors
- Reasoning Parser: 4MB buffer limit, non-fatal parse errors
- Tokenizer: Jinja2 trim_blocks/lstrip_blocks enabled, `Value::UNDEFINED` for missing params, SGLang tp_size fix
- Mesh: Prevent stale state overwrites via relay paths, increased multi-node timeouts
- Gateway: Release semaphore permit before completion wait, unified worker registration
- Multimodal: Qwen VL patchification and token counts, lazy model discovery in /v1/models
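The Gateway fix (release the semaphore permit before waiting for completion) follows a common concurrency pattern. The Python asyncio sketch below shows that pattern under our own assumptions; it is not the gateway's Rust code, and `submit`, `first`, and `second` are hypothetical names. The permit guards only the hand-off of a job, so a long-running job cannot block later submissions.

```python
import asyncio
from typing import Awaitable, Callable


async def submit(permits: asyncio.Semaphore, job: Callable[[], Awaitable[str]]) -> str:
    """Hold a queue permit only while handing the job off, not while awaiting it."""
    async with permits:
        # Permit held: enqueue the job (here, start it as a task).
        task = asyncio.ensure_future(job())
    # Permit released here, before the potentially long wait for completion.
    return await task


async def main() -> list[str]:
    permits = asyncio.Semaphore(1)
    ready = asyncio.Event()

    async def first() -> str:
        await ready.wait()  # can only finish after `second` has run
        return "first done"

    async def second() -> str:
        ready.set()
        return "second done"

    # If `submit` held the permit while awaiting `first`, `second` could never
    # be submitted and this would time out instead of completing.
    return await asyncio.wait_for(
        asyncio.gather(submit(permits, first), submit(permits, second)),
        timeout=1.0,
    )


print(asyncio.run(main()))
```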
🔧 Refactoring & Code Quality
- Repository: Library crates moved to `crates/` directory, UUID v4 → v7 migration workspace-wide
- MCP: Extracted shared iterators, deduplicated constructors, removed dead paths
- gRPC: Split utils.rs into focused modules, deduplicated streaming logic
- Workflow Engine: Simplified definition and internals
- E2E Testing: Removed parallel infrastructure, simplified helpers and architecture
- CI/CD: Reusable build workflows, pre-commit checks, conventional commits enforcement, file changes detection for E2E skipping, sequential execution, block AI co-author lines
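For context on the UUID v4 → v7 migration listed above: UUIDv7 (RFC 9562) places a 48-bit Unix millisecond timestamp in the most significant bits, so IDs sort roughly by creation time, whereas v4 is fully random. The Python sketch below builds a v7-style value by hand to show the layout. `uuid7_sketch` is an illustrative helper, not code from SMG, and production code should rely on a maintained library.

```python
import os
import time
import uuid
from typing import Optional


def uuid7_sketch(ts_ms: Optional[int] = None) -> uuid.UUID:
    """Build a UUIDv7-style value: 48-bit ms timestamp, version, variant, random bits."""
    if ts_ms is None:
        ts_ms = time.time_ns() // 1_000_000
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF           # 12 random bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # 62 random bits
    value = (
        ((ts_ms & ((1 << 48) - 1)) << 80)  # unix_ts_ms in the top 48 bits
        | (0x7 << 76)                       # version field = 7
        | (rand_a << 64)
        | (0b10 << 62)                      # RFC variant bits
        | rand_b
    )
    return uuid.UUID(int=value)


earlier = uuid7_sketch(ts_ms=1_700_000_000_000)
later = uuid7_sketch(ts_ms=1_700_000_000_001)
print(earlier.version, earlier < later)  # prints "7 True"
```

Because the timestamp occupies the high bits, IDs created in different milliseconds compare in creation order, which tends to suit database indexes better than random v4 values.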
📚 Documentation
- Standardized runtime ordering to SGLang, vLLM, TensorRT-LLM
- Documented O(n) complexity on pool URL-based lookups
🎯 Additional Features
- Unified flag for cache token usage report in HTTP mode
- OpenAI-compatible cached token usage
- GetTokenizer proto and tokenizer bundle streaming
- TRT-LLM parameter pass-through fixes
- matched_stop support for vLLM and TensorRT-LLM
- Respect workers_config in multi-worker gRPC setup
- Realtime API protocol foundations (session, conversation, response, WebSocket handler)
🔗 Full Changelog: v1.1.0...v1.2.0
Upgrade now: pip install smg --upgrade
🐑 Shepherd your LLM infrastructure with confidence.
⚡ Built for speed. Engineered for scale. Production-proven.
What's Changed
- ci(nightly): Add vLLM HTTP support to nightly benchmarks by @CatherineSue in #502
- docs: standardize runtime ordering to SGLang, vLLM, TensorRT-LLM by @slin1237 in #514
- fix(ci): install CUDA toolkit for SGLang JIT kernel compilation by @slin1237 in #513
- fix parameters pass through for trtllm by @gongwei-130 in #509
- chore(ci): remove pull request trigger from nightly benchmark workflow by @key4ng in #524
- refactor: migrate from UUID v4 to v7 across the workspace by @slin1237 in #518
- perf: optimize JSON streaming validation with zero-allocation `IgnoredAny` by @ppraneth in #516
- fix(e2e): respect workers_config in vLLM/TRT-LLM gRPC multi-worker setup by @slin1237 in #525
- feat(responses): add mcp approval items to protocols and routers by @zhaowenzi in #491
- fix(ci): modify docker-storage dir and add scaler for a10 runners by @XinyueZhang369 in #402
- feat(anthropic): add X-SMG-MCP header for MCP passthrough by @key4ng in #517
- feat: add --served-model-name support for model aliasing by @ConnorLi96 in #521
- feat(data-connector): add SchemaConfig for customizable table and column names by @slin1237 in #526
- refactor(protocol): model ResponseTool as tagged enum to match Responses spec and tighten MCP validation by @zhaowenzi in #532
- ci: add PR title conventional commits check by @CatherineSue in #540
- feat: add unified flag for cache token usage report in http mode by @gongwei-130 in #535
- feat(realtime-api): realtime api conversation and response protocol by @pallasathena92 in #387
- chore(deps): update hf-hub requirement from 0.4.3 to 0.5.0 by @dependabot[bot] in #528
- feat(data-connector): storage hooks, schema config, and WASM bridge by @slin1237 in #541
- chore(deps): update pyo3 requirement from 0.27.1 to 0.28.2 by @dependabot[bot] in #529
- chore(deps): update wasm-encoder requirement from 0.244 to 0.245 by @dependabot[bot] in #530
- feat(clients): add Python/Rust HTTP client SDKs and OpenAPI codegen pipeline by @slin1237 in #539
- chore(grpc-pip): fix grpc client workflow, bump 0.4.1 by @slin1237 in #545
- feat(clients): add classify, parser, responses, and workers endpoints by @slin1237 in #543
- ci(agentic-api): Move Oracle DB and Brave MCP to shared K8s services and simplify CI workflow by @key4ng in #533
- feat(clients): add Java type generation from OpenAPI spec by @slin1237 in #547
- fix(tokenizer): use Value::UNDEFINED for missing chat template params by @slin1237 in #548
- ci: remove check-ci gate, drop run-ci label, skip e2e for Dependabot by @CatherineSue in #550
- refactor(mesh): refector: make crdt into a general or-map kv struct by @llfl in #534
- feat(anthropic): add tool_search_tool and defer_loading support by @key4ng in #542
- fix: unify --model/--model-path and revert in-gateway served-model-name by @slin1237 in #537
- feat(load-monitor): add dedicated config, /v1/loads endpoint, and gRPC support by @slin1237 in #552
- feat(proto): add sglang encoder proto to smg-grpc-proto by @chenzongyao200127 in #536
- feat(data-connector): add schema versioning with safe-by-default migrations by @slin1237 in #544
- chore(ci): optimize DinD startup by @key4ng in #551
- feat(gRPC): Add GetTokenizer Proto and Tokenizer Bundle Streaming by @YouNeedCryDear in #470
- feat(load-monitor): per-group polling with rich load data by @slin1237 in #554
- feat(proto): add SubscribeKvEvents RPC and KV cache event messages by @slin1237 in #557
- feat(grpc): add subscribe_kv_events to all backend clients by @slin1237 in #558
- chore(ci): migrate Brave MCP to official image with Streamable HTTP by @key4ng in #553
- feat(kv-index): add PositionalIndexer for event-driven cache-aware routing by @slin1237 in #560
- ci(pre-commit): add pre-commit checks to CI pipeline by @CatherineSue in #563
- fix(openai): read lazily discovered models in /v1/models endpoint by @CatherineSue in #564
- feat(kv-index): router prefix hash for query-path disambiguation by @slin1237 in #565
- perf(e2e): reorder tests by model affinity to minimize GPU swaps by @key4ng in #556
- feat: add openai compatible cached token usage by @gongwei-130 in #519
- feat(gateway): add KvEventMonitor for per-worker KV cache event subscriptions by @slin1237 in #568
- perf(data-connector): reduce DB round trips in Oracle pagination, batch linking, and delete by @key4ng in #567
- feat(smg-client): make Python client a drop-in replacement for OpenAI/Anthropic SDKs by @slin1237 in #559
- feat(grpc): send preprocessed multimodal data to vLLM with hashing and structured tokens by @CatherineSue in #570
- feat(realtime-api): realtime api wire-formated event protocol by @pallasathena92 in #389
- fix(e2e): resolve marker MRO mismatch and add missing model marker by @key4ng in #569
- refactor(gateway): unify local and external worker registration into single workflow by @slin1237 in #573
- feat(discovery): add --model-id-from flag for per-worker model_id customization by @frankzhouhr in #562
- refactor(data-connector): remove redundant fields from StoredResponse by @key4ng in #574
- feat(policy): add event-driven cache-aware routing to CacheAwarePolicy by @slin1237 in #571
- fix(kv): stop retrying SubscribeKvEvents when backend returns UNIMPLEMENTED by @CatherineSue in #577
- test(oracle): add E2E tests for Flyway-managed Oracle schema by @key4ng in #576
- feat(discovery): per-provider admin API keys for external worker model discovery by @slin1237 in #578
- feat(bench): add PositionalIndexer benchmarks to radix tree benchmark by @slin1237 in #579
- fix(multimodal): Qwen VL patchification and prompt replacement token counts by @CatherineSue in #582
- perf(grpc): optimize request handling by using move semantics instead of cloning by @CatherineSue in #583
- refactor(grpc): split MultimodalData into backend-specific variants by @CatherineSue in #588
- feat(multimodal): derive keep_on_cpu keys from model spec by @CatherineSue in #589
- refactor(multimodal): remove dead MultiModalInputs/Tensor/Value types by @CatherineSue in #590
- fix(oracle): add project_id to conversation_items and resolve PR #576 feedback by @key4ng in #591
- refactor(multimodal): split registry.rs into per-model spec modules by @CatherineSue in #593
- chore(deps): bump actions/download-artifact from 7 to 8 by @dependabot[bot] in #596
- chore(deps): bump actions/upload-artifact from 6 to 7 by @dependabot[bot] in #595
- chore(deps): update toml requirement from 0.9 to 1.0 by @dependabot[bot] in #598
- refactor(grpc): deduplicate Harmony Chat Completion streaming decode logic by @CatherineSue in #594
- fix(grpc): add matched_stop support for vLLM and TensorRT-LLM by @CatherineSue in #602
- feat(realtime-api): realtime api protocol builder by @pallasathena92 in #390
- perf(grpc): optimize clone usage and fix finalize/emit_completed ordering by @CatherineSue in #603
- perf(kv-index): close 4 perf gaps with Flash Indexer by @slin1237 in #584
- refactor(e2e): simplify test infrastructure by removing duplication and dead code by @slin1237 in #587
- refactor(harmony): clean up parser signatures, fix delta accumulation, and GptOss tests by @CatherineSue in #592
- fix(grpc-trtllm): use string stop/bad words, remove pre-tokenization by @CatherineSue in #606
- refactor(grpc-proto): move version to pyproject.toml as single source of truth by @CatherineSue in #610
- chore(grpc-proto): bump version to 0.4.2 by @CatherineSue in #611
- perf(kv-index): tune index DashMap shard count to 256 by @slin1237 in #608
- feat(realtime-api): add Realtime API REST handlers and session registry by @pallasathena92 in #406
- refactor(mcp): remove dead execution paths from McpOrchestrator by @slin1237 in #625
- refactor(mcp): replace magic "mcp" fallback strings with DEFAULT_SERVER_LABEL constant by @slin1237 in #624
- perf(mcp): execute tools concurrently in McpToolSession by @slin1237 in #612
- refactor(mcp): extract shared iterator for response bridge builders by @slin1237 in #614
- refactor(mcp): deduplicate McpConnectionPool constructors by @slin1237 in #616
- docs(mcp): document O(n) complexity on pool URL-based lookups by @slin1237 in #617
- perf(mcp): use AtomicUsize for lock-free pool stats reads by @slin1237 in #621
- refactor(grpc): replace hand-rolled tool-info JSON with build_mcp_tool_infos by @slin1237 in #620
- perf(mcp): reverse iterate and add optional limit to AuditLog::for_request by @slin1237 in #618
- refactor(mcp): extract unknown server key sentinel into shared constant by @slin1237 in #615
- perf(mcp): fix O(n²) insert in inject_mcp_output_items by @slin1237 in #613
- perf(kv-index): move worker_blocks to caller-owned storage by @slin1237 in #626
- refactor(data-connector): extract shared random hex ID helper by @CatherineSue in #629
- perf(data-connector): eliminate double HashMap lookup in list_identifier_responses by @CatherineSue in #632
- refactor(data-connector): extract item row-parsing helpers in Postgres and Oracle by @CatherineSue in #630
- refactor(mcp): extract fallback label to DEFAULT_SERVER_LABEL constant by @slin1237 in #619
- fix(bench): move setup out of concurrent benchmark timing loop by @slin1237 in #634
- ci: skip irrelevant E2E jobs on PRs with file changes detection by @CatherineSue in #633
- refactor(data-connector): consolidate parse JSON helpers by @CatherineSue in #631
- ci: run E2E tests sequentially and fix duplicate log output by @CatherineSue in #635
- refactor(oracle): use existing CONVERSATIONS table in Flyway test schema by @key4ng in #627
- feat: add smg-grpc-servicer package by @CatherineSue in #638
- feat(bench): add trace-driven throughput benchmark for PositionalIndexer by @slin1237 in #636
- refactor(mesh): simplify code and fix flaky integration tests by @slin1237 in #628
- feat(realtime-api): realtime websocket handler by @pallasathena92 in #637
- refactor(workflow): simplify definition and engine internals by @slin1237 in #641
- refactor(mcp): remove unused invoked_tool_name and resolved_tool_name from ToolExecutionOutput by @slin1237 in #623
- refactor(mcp): extract schema_to_value helper and document clone cost by @slin1237 in #622
- feat(interactions): Create basic structure for gemini router by @XinyueZhang369 in #417
- fix: gRPC error handling — proper status codes and circuit breaker accuracy by @CatherineSue in #645
- refactor(grpc): split grpc `utils.rs` into focused modules by @CatherineSue in #646
- fix(grpc): return `tonic::Status` directly from grpc_client RPC methods by @CatherineSue in #647
- ci(docker): add engine-specific image build pipelines by @slin1237 in #651
- ci: block AI co-author and sign-off lines in commits by @CatherineSue in #652
- fix(reasoning-parser): increase buffer limit to 4MB and make parse errors non-fatal by @CatherineSue in #653
- fix(grpc): use `e.message()` instead of `Display` for `tonic::Status` in user-facing errors by @CatherineSue in #654
- fix(docker): fix image build failures in engine Dockerfile and release workflows by @gongwei-130 in #656
- chore: add njhill to code owner of grpc proto and servicer by @slin1237 in #661
- ci: Change to use github runner set for ci runners by @XinyueZhang369 in #659
- refactor(docker): consolidate engine image build into reusable workflow by @slin1237 in #658
- fix(ci): add missing Python client types generation to nightly benchmark by @CatherineSue in #663
- refactor(grpc): optimize vLLM servicer and remove duplicate server.py by @CatherineSue in #662
- fix(serve): use correct argparse attr for sglang tp_size by @CatherineSue in #665
- fix(docker): remove GHA docker cache from engine image builds by @slin1237 in #666
- refactor(realtime): simplify proxy and registry internals by @slin1237 in #649
- fix(gateway): release job queue semaphore permit before waiting for completion by @slin1237 in #667
- fix(mesh): increase timeouts in test_multi_node_data_propagation by @CatherineSue in #670
- refactor: remove dead embedding/classify code from regular PreparationStage by @CatherineSue in #668
- fix(scripts): handle empty arrays in check_release_versions.sh by @slin1237 in #672
- fix(ci): add CPU/memory resource requests to GPU runner pods by @slin1237 in #671
- docs: update link by @lightseek-bot in #674
- fix(docker): add --no-deps to engine install scripts by @CatherineSue in #675
- refactor: move library crates into crates/ directory by @CatherineSue in #676
- feat(kv-events): learn block_size from KV event stream by @slin1237 in #677
- refactor(e2e): remove parallel testing infrastructure and simplify architecture by @slin1237 in #643
- fix(ci): prevent runner shutdown on concurrent main pushes and restore path filters by @slin1237 in #678
- test(realtime-api): add integration test for websocket by @pallasathena92 in #660
- ops(ci): increase 1-gpu runner max from 8 to 16 by @slin1237 in #682
- fix: address PR #676 follow-up comments by @CatherineSue in #684
- refactor(gateway): remove redundant connection_mode special-case for external providers by @CatherineSue in #680
- fix(benchmarks): increase PD benchmark timeout from 240s to 480s by @slin1237 in #681
- fix(tokenizer): enable trim_blocks and lstrip_blocks in Jinja2 environment by @slin1237 in #685
- refactor(e2e): clean up test infrastructure code quality and efficiency by @slin1237 in #686
- refactor(e2e): simplify test helpers and fix redundant HF_HOME env passing by @slin1237 in #688
- feat(interactions): Register Gemini Router by @XinyueZhang369 in #679
- fix(mesh): prevent stale app state overwrites via relay paths by @slin1237 in #689
- fix(scripts): add smg-grpc-servicer to version check and fix pyproject.toml support by @CatherineSue in #683
- feat(release): bump v1.2.0 with version sync and engine docker auto-triggers by @slin1237 in #695
New Contributors
- @ConnorLi96 made their first contribution in #521
- @chenzongyao200127 made their first contribution in #536
- @YouNeedCryDear made their first contribution in #470
- @frankzhouhr made their first contribution in #562
- @lightseek-bot made their first contribution in #674
Full Changelog: v1.1.0...v1.2.0