v1.2.0

@slin1237 released this 10 Mar 16:23 · 662 commits to main since this release · 4b03d32

🚀 Shepherd Model Gateway v1.2.0 Released!

We're thrilled to announce Shepherd Model Gateway v1.2.0 – a transformative release featuring enhanced event-driven cache-aware routing, production-ready client SDKs, Google Gemini integration, and vLLM gRPC server adoption!

Enhanced Event-Driven Cache-Aware Routing

Inspired by NVIDIA Dynamo's distributed caching principles, SMG extends its existing cache-aware routing with real-time KV cache event subscriptions:

  • SubscribeKvEvents RPC - Real-time KV cache event stream from all backends (SGLang, vLLM, TensorRT-LLM)
  • KvEventMonitor - Per-worker KV cache event subscriptions with automatic recovery
  • PositionalIndexer - Event-driven cache-aware routing with router prefix hash for query-path disambiguation
  • Auto-learned block_size - Learned dynamically from the KV event stream
  • Flash Indexer parity - Closed 4 performance gaps, tuned DashMap shards to 256
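
The routing idea in the list above can be sketched in a few lines of Python. This is an illustrative toy, not SMG's implementation: `ToyPrefixIndex`, its method names, and the chained SHA-256 block hashing are all assumptions made for the example. Workers report which token blocks they hold (standing in for KV cache events), and the router picks the worker with the longest matching cached prefix.

```python
import hashlib
from collections import defaultdict


def block_hashes(tokens, block_size):
    """Chain-hash full blocks so each hash identifies the whole prefix up to that block."""
    hashes, prev = [], b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(digest)
        prev = digest
    return hashes


class ToyPrefixIndex:
    """Hypothetical index: block hash -> set of workers caching that block."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.owners = defaultdict(set)

    def on_block_stored(self, worker, tokens):
        # Stand-in for a "block stored" KV cache event from a backend.
        for h in block_hashes(tokens, self.block_size):
            self.owners[h].add(worker)

    def on_block_removed(self, worker, tokens):
        # Stand-in for an eviction event.
        for h in block_hashes(tokens, self.block_size):
            self.owners[h].discard(worker)

    def best_worker(self, tokens, workers):
        """Return (worker, matched_blocks) for the longest cached prefix."""
        matched = {w: 0 for w in workers}
        candidates = set(workers)
        for h in block_hashes(tokens, self.block_size):
            candidates &= self.owners.get(h, set())
            if not candidates:
                break  # no worker holds the prefix beyond this point
            for w in candidates:
                matched[w] += 1
        best = max(workers, key=lambda w: matched[w])
        return best, matched[best]


index = ToyPrefixIndex(block_size=4)
index.on_block_stored("w1", list(range(8)))   # w1 caches two blocks
index.on_block_stored("w2", list(range(4)))   # w2 caches one block
print(index.best_worker(list(range(12)), ["w1", "w2"]))  # -> ('w1', 2)
```

Chaining each block hash to the previous one means a match on block k implies a match on the entire prefix before it, so the lookup can stop at the first miss.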

Production Results (8 Llama model replicas):

  • TTFT avg: -23.0% (93.10 → 71.66 ms)
  • TTFT p99: -27.9% (186.98 → 134.88 ms)
  • TPOT avg: -0.9% (6.39 → 6.33 ms)
  • Latency avg: -3.8% (731.60 → 703.92 ms)
  • Latency max: -11.8% (1034.27 → 912.47 ms)
  • Req/sec: +1.3% (9.959 → 10.093)
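
The percentages above follow directly from the before/after values. A quick check, using only the numbers listed (the helper name `pct_change` is just for this example):

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


results = {
    "TTFT avg": (93.10, 71.66),
    "TTFT p99": (186.98, 134.88),
    "TPOT avg": (6.39, 6.33),
    "Latency avg": (731.60, 703.92),
    "Latency max": (1034.27, 912.47),
    "Req/sec": (9.959, 10.093),
}

for name, (before, after) in results.items():
    print(f"{name}: {pct_change(before, after):+.1f}%")
```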

Impact: Better KV cache utilization across your inference fleet. Requests are routed to workers that already hold matching cached prefixes, which avoids redundant prefill computation; in the benchmark above, average TTFT dropped by 23.0% and p99 TTFT by 27.9%.

🎨 TensorRT-LLM Multimodal Support

Complete vision-language model integration:

  • gRPC multimodal pipeline - preprocessed data with hashing
  • Backend-specific variants - optimized for TRT-LLM
  • String-based stop sequences - no pre-tokenization overhead
  • matched_stop support - proper stop sequence handling

🔄 vLLM Upstream gRPC Adoption

SMG's gRPC server implementation is now upstream in vLLM!

vLLM's PR #36169 formalizes SMG's protobuf and gRPC server implementation as an upstream dependency. gRPC is now an officially supported protocol in the vllm serve command.

  • smg-grpc-servicer package published to PyPI
  • Production-grade gRPC server infrastructure
  • Credit to @CatherineSue and @njhill for driving this milestone

Impact: A significant milestone for the project: SMG's gRPC implementation is now the foundation for vLLM's official gRPC support.

📦 Production-Ready Client SDKs

Multi-language SDK ecosystem with OpenAPI codegen:

  • Python SDK - Drop-in replacement for OpenAI/Anthropic SDKs with complete API coverage
  • Rust HTTP Client - Type-safe, async-first client with all endpoints
  • Java Type Generation - Full OpenAPI-derived types

Endpoints: Chat completions, classify, parser, responses, workers, loads, and more.

Impact: Integrate SMG into any tech stack with idiomatic, type-safe clients. Zero-friction migration from OpenAI/Anthropic.

🐳 Engine-Specific Docker Images

Pre-built Docker images for each inference engine:

  • docker pull ghcr.io/lightseekorg/smg:1.2.0-sglang-v0.5.9
  • docker pull ghcr.io/lightseekorg/smg:1.2.0-vllm-v0.17.0
  • docker pull ghcr.io/lightseekorg/smg:1.2.0-trtllm-1.3.0rc6

Impact: Zero-configuration deployment with engine-optimized images. Pull and run your preferred backend instantly. Credit to @gongwei-130 for driving this feature.

🔮 Google Gemini Integration

New Gemini router for Google's Interactions API:

  • Complete router registration and infrastructure
  • Native protocol support for Gemini models
  • Seamless integration alongside OpenAI, Anthropic, and self-hosted engines

Impact: Route to Gemini alongside your entire model fleet. One gateway, all providers.

💾 Advanced Data Persistence

Enterprise-grade data connector enhancements:

  • Schema versioning with safe-by-default migrations (Flyway integration)
  • SchemaConfig - Customizable table and column names for existing databases
  • Storage hooks - Pre/post persistence callbacks
  • WASM bridge - Call storage APIs from WASM middleware

Performance: Reduced Oracle DB round trips in pagination, batch linking, and delete operations.

📊 Load Monitoring & Discovery

  • New /v1/loads endpoint with gRPC support
  • Per-worker model_id customization (--model-id-from)
  • External worker discovery with per-provider API keys
  • Model aliasing

🔌 MCP Enhancements

Responses API Integration:

  • MCP approval items in protocols and routers
  • X-SMG-MCP header for MCP passthrough control

Anthropic Features:

  • tool_search_tool support
  • defer_loading for lazy tool initialization

Performance:

  • Concurrent tool execution in McpToolSession
  • Lock-free pool stats with AtomicUsize
  • Fixed O(n²) insert in inject_mcp_output_items
  • Reverse iteration with optional limit in AuditLog
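
As a rough illustration of the last item, reverse iteration with an optional limit lets a lookup return the newest entries first and stop early. The sketch below is a hypothetical Python analogue; `ToyAuditLog` and its fields are invented for the example and do not mirror the actual Rust `AuditLog` type.

```python
from itertools import islice


class ToyAuditLog:
    """Hypothetical append-only log of (request_id, message), oldest first."""

    def __init__(self):
        self._entries = []

    def record(self, request_id, message):
        self._entries.append((request_id, message))

    def for_request(self, request_id, limit=None):
        """Newest-first messages for one request; stops scanning once `limit` is reached."""
        matches = (msg for rid, msg in reversed(self._entries) if rid == request_id)
        return list(islice(matches, limit))  # limit=None returns every match


log = ToyAuditLog()
log.record("r1", "a")
log.record("r2", "b")
log.record("r1", "c")
log.record("r1", "d")
print(log.for_request("r1", limit=2))  # -> ['d', 'c']
```

Because the generator is lazy, a small limit means only the tail of the log is scanned, which is the common case when callers want the most recent activity.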

🎨 Multimodal Improvements

Qwen VL Support:

  • Proper patchification and prompt replacement token counts
  • Backend-specific MultimodalData variants

vLLM Integration:

  • Send preprocessed multimodal data with hashing and structured tokens
  • Derive keep_on_cpu keys from model spec

Code Quality:

  • Split registry.rs into per-model spec modules
  • Removed dead MultiModalInputs/Tensor/Value types

Performance Optimizations

Core Engine:

  • Zero-allocation JSON streaming validation with IgnoredAny
  • Move semantics in gRPC request handling instead of cloning
  • Optimized clone usage and finalize/emit_completed ordering

Indexing:

  • worker_blocks moved to caller-owned storage
  • Setup moved out of concurrent benchmark timing loop

🛡️ Critical Bug Fixes

  • gRPC: Proper status codes, circuit breaker accuracy, return tonic::Status directly, use e.message() for errors
  • Reasoning Parser: Buffer limit increased to 4MB, parse errors made non-fatal
  • Tokenizer: Jinja2 trim_blocks/lstrip_blocks enabled, Value::UNDEFINED for missing params, SGLang tp_size fix
  • Mesh: Prevent stale state overwrites via relay paths, increased multi-node timeouts
  • Gateway: Release semaphore permit before completion wait, unified worker registration
  • Multimodal: Qwen VL patchification and token counts, lazy model discovery in /v1/models

🔧 Refactoring & Code Quality

  • Repository: Library crates moved to crates/ directory, UUID v4 → v7 migration workspace-wide
  • MCP: Extracted shared iterators, deduplicated constructors, removed dead paths
  • gRPC: Split utils.rs into focused modules, deduplicated streaming logic
  • Workflow Engine: Simplified definition and internals
  • E2E Testing: Removed parallel infrastructure, simplified helpers and architecture
  • CI/CD: Reusable build workflows, pre-commit checks, conventional commits enforcement, skipping irrelevant E2E jobs via file-change detection, sequential E2E execution, blocking AI co-author lines in commits
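
The release notes do not state the motivation for the UUID v4 → v7 migration listed above, but a general property of v7 is that IDs sort by creation time, since the leading 48 bits are a Unix millisecond timestamp (RFC 9562). A minimal sketch of the layout, using a hand-rolled `uuid7` helper that is an assumption for this example and not code from SMG:

```python
import os
import time
import uuid


def uuid7(unix_ms=None):
    """Build a UUID v7: 48-bit ms timestamp, version, 12 random bits, variant, 62 random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")   # 80 random bits
    rand_a = rand >> 68                            # top 12 bits
    rand_b = rand & ((1 << 62) - 1)                # low 62 bits
    value = (unix_ms & ((1 << 48) - 1)) << 80      # timestamp in bits 127..80
    value |= 0x7 << 76                             # version 7 in bits 79..76
    value |= rand_a << 64                          # rand_a in bits 75..64
    value |= 0b10 << 62                            # RFC variant in bits 63..62
    value |= rand_b                                # rand_b in bits 61..0
    return uuid.UUID(int=value)


earlier = uuid7(unix_ms=1_000)
later = uuid7(unix_ms=2_000)
print(earlier.version, earlier < later)  # -> 7 True
```

Random v4 IDs have no such ordering, so sorting them says nothing about when they were created.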

📚 Documentation

  • Standardized runtime ordering to SGLang, vLLM, TensorRT-LLM
  • Documented O(n) complexity on pool URL-based lookups

🎯 Additional Features

  • Unified flag for cache token usage report in HTTP mode
  • OpenAI-compatible cached token usage
  • GetTokenizer proto and tokenizer bundle streaming
  • TRT-LLM parameter pass-through fixes
  • matched_stop support for vLLM and TensorRT-LLM
  • Respect workers_config in multi-worker gRPC setup
  • Realtime API protocol foundations (session, conversation, response, WebSocket handler)

🔗 Full Changelog: v1.1.0...v1.2.0

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

⚡ Built for speed. Engineered for scale. Production-proven.

What's Changed

  • ci(nightly): Add vLLM HTTP support to nightly benchmarks by @CatherineSue in #502
  • docs: standardize runtime ordering to SGLang, vLLM, TensorRT-LLM by @slin1237 in #514
  • fix(ci): install CUDA toolkit for SGLang JIT kernel compilation by @slin1237 in #513
  • fix parameters pass through for trtllm by @gongwei-130 in #509
  • chore(ci): remove pull request trigger from nightly benchmark workflow by @key4ng in #524
  • refactor: migrate from UUID v4 to v7 across the workspace by @slin1237 in #518
  • perf: optimize JSON streaming validation with zero-allocation IgnoredAny by @ppraneth in #516
  • fix(e2e): respect workers_config in vLLM/TRT-LLM gRPC multi-worker setup by @slin1237 in #525
  • feat(responses): add mcp approval items to protocols and routers by @zhaowenzi in #491
  • fix(ci): modify docker-storage dir and add scaler for a10 runners by @XinyueZhang369 in #402
  • feat(anthropic): add X-SMG-MCP header for MCP passthrough by @key4ng in #517
  • feat: add --served-model-name support for model aliasing by @ConnorLi96 in #521
  • feat(data-connector): add SchemaConfig for customizable table and column names by @slin1237 in #526
  • refactor(protocol): model ResponseTool as tagged enum to match Responses spec and tighten MCP validation by @zhaowenzi in #532
  • ci: add PR title conventional commits check by @CatherineSue in #540
  • feat: add unified flag for cache token usage report in http mode by @gongwei-130 in #535
  • feat(realtime-api): realtime api conversation and response protocol by @pallasathena92 in #387
  • chore(deps): update hf-hub requirement from 0.4.3 to 0.5.0 by @dependabot[bot] in #528
  • feat(data-connector): storage hooks, schema config, and WASM bridge by @slin1237 in #541
  • chore(deps): update pyo3 requirement from 0.27.1 to 0.28.2 by @dependabot[bot] in #529
  • chore(deps): update wasm-encoder requirement from 0.244 to 0.245 by @dependabot[bot] in #530
  • feat(clients): add Python/Rust HTTP client SDKs and OpenAPI codegen pipeline by @slin1237 in #539
  • chore(grpc-pip): fix grpc client workflow, bump 0.4.1 by @slin1237 in #545
  • feat(clients): add classify, parser, responses, and workers endpoints by @slin1237 in #543
  • ci(agentic-api): Move Oracle DB and Brave MCP to shared K8s services and simplify CI workflow by @key4ng in #533
  • feat(clients): add Java type generation from OpenAPI spec by @slin1237 in #547
  • fix(tokenizer): use Value::UNDEFINED for missing chat template params by @slin1237 in #548
  • ci: remove check-ci gate, drop run-ci label, skip e2e for Dependabot by @CatherineSue in #550
  • refactor(mesh): refector: make crdt into a general or-map kv struct by @llfl in #534
  • feat(anthropic): add tool_search_tool and defer_loading support by @key4ng in #542
  • fix: unify --model/--model-path and revert in-gateway served-model-name by @slin1237 in #537
  • feat(load-monitor): add dedicated config, /v1/loads endpoint, and gRPC support by @slin1237 in #552
  • feat(proto): add sglang encoder proto to smg-grpc-proto by @chenzongyao200127 in #536
  • feat(data-connector): add schema versioning with safe-by-default migrations by @slin1237 in #544
  • chore(ci): optimize DinD startup by @key4ng in #551
  • feat(gRPC): Add GetTokenizer Proto and Tokenizer Bundle Streaming by @YouNeedCryDear in #470
  • feat(load-monitor): per-group polling with rich load data by @slin1237 in #554
  • feat(proto): add SubscribeKvEvents RPC and KV cache event messages by @slin1237 in #557
  • feat(grpc): add subscribe_kv_events to all backend clients by @slin1237 in #558
  • chore(ci): migrate Brave MCP to official image with Streamable HTTP by @key4ng in #553
  • feat(kv-index): add PositionalIndexer for event-driven cache-aware routing by @slin1237 in #560
  • ci(pre-commit): add pre-commit checks to CI pipeline by @CatherineSue in #563
  • fix(openai): read lazily discovered models in /v1/models endpoint by @CatherineSue in #564
  • feat(kv-index): router prefix hash for query-path disambiguation by @slin1237 in #565
  • perf(e2e): reorder tests by model affinity to minimize GPU swaps by @key4ng in #556
  • feat: add openai compatible cached token usage by @gongwei-130 in #519
  • feat(gateway): add KvEventMonitor for per-worker KV cache event subscriptions by @slin1237 in #568
  • perf(data-connector): reduce DB round trips in Oracle pagination, batch linking, and delete by @key4ng in #567
  • feat(smg-client): make Python client a drop-in replacement for OpenAI/Anthropic SDKs by @slin1237 in #559
  • feat(grpc): send preprocessed multimodal data to vLLM with hashing and structured tokens by @CatherineSue in #570
  • feat(realtime-api): realtime api wire-formated event protocol by @pallasathena92 in #389
  • fix(e2e): resolve marker MRO mismatch and add missing model marker by @key4ng in #569
  • refactor(gateway): unify local and external worker registration into single workflow by @slin1237 in #573
  • feat(discovery): add --model-id-from flag for per-worker model_id customization by @frankzhouhr in #562
  • refactor(data-connector): remove redundant fields from StoredResponse by @key4ng in #574
  • feat(policy): add event-driven cache-aware routing to CacheAwarePolicy by @slin1237 in #571
  • fix(kv): stop retrying SubscribeKvEvents when backend returns UNIMPLEMENTED by @CatherineSue in #577
  • test(oracle): add E2E tests for Flyway-managed Oracle schema by @key4ng in #576
  • feat(discovery): per-provider admin API keys for external worker model discovery by @slin1237 in #578
  • feat(bench): add PositionalIndexer benchmarks to radix tree benchmark by @slin1237 in #579
  • fix(multimodal): Qwen VL patchification and prompt replacement token counts by @CatherineSue in #582
  • perf(grpc): optimize request handling by using move semantics instead of cloning by @CatherineSue in #583
  • refactor(grpc): split MultimodalData into backend-specific variants by @CatherineSue in #588
  • feat(multimodal): derive keep_on_cpu keys from model spec by @CatherineSue in #589
  • refactor(multimodal): remove dead MultiModalInputs/Tensor/Value types by @CatherineSue in #590
  • fix(oracle): add project_id to conversation_items and resolve PR #576 feedback by @key4ng in #591
  • refactor(multimodal): split registry.rs into per-model spec modules by @CatherineSue in #593
  • chore(deps): bump actions/download-artifact from 7 to 8 by @dependabot[bot] in #596
  • chore(deps): bump actions/upload-artifact from 6 to 7 by @dependabot[bot] in #595
  • chore(deps): update toml requirement from 0.9 to 1.0 by @dependabot[bot] in #598
  • refactor(grpc): deduplicate Harmony Chat Completion streaming decode logic by @CatherineSue in #594
  • fix(grpc): add matched_stop support for vLLM and TensorRT-LLM by @CatherineSue in #602
  • feat(realtime-api): realtime api protocol builder by @pallasathena92 in #390
  • perf(grpc): optimize clone usage and fix finalize/emit_completed ordering by @CatherineSue in #603
  • perf(kv-index): close 4 perf gaps with Flash Indexer by @slin1237 in #584
  • refactor(e2e): simplify test infrastructure by removing duplication and dead code by @slin1237 in #587
  • refactor(harmony): clean up parser signatures, fix delta accumulation, and GptOss tests by @CatherineSue in #592
  • fix(grpc-trtllm): use string stop/bad words, remove pre-tokenization by @CatherineSue in #606
  • refactor(grpc-proto): move version to pyproject.toml as single source of truth by @CatherineSue in #610
  • chore(grpc-proto): bump version to 0.4.2 by @CatherineSue in #611
  • perf(kv-index): tune index DashMap shard count to 256 by @slin1237 in #608
  • feat(realtime-api): add Realtime API REST handlers and session registry by @pallasathena92 in #406
  • refactor(mcp): remove dead execution paths from McpOrchestrator by @slin1237 in #625
  • refactor(mcp): replace magic "mcp" fallback strings with DEFAULT_SERVER_LABEL constant by @slin1237 in #624
  • perf(mcp): execute tools concurrently in McpToolSession by @slin1237 in #612
  • refactor(mcp): extract shared iterator for response bridge builders by @slin1237 in #614
  • refactor(mcp): deduplicate McpConnectionPool constructors by @slin1237 in #616
  • docs(mcp): document O(n) complexity on pool URL-based lookups by @slin1237 in #617
  • perf(mcp): use AtomicUsize for lock-free pool stats reads by @slin1237 in #621
  • refactor(grpc): replace hand-rolled tool-info JSON with build_mcp_tool_infos by @slin1237 in #620
  • perf(mcp): reverse iterate and add optional limit to AuditLog::for_request by @slin1237 in #618
  • refactor(mcp): extract unknown server key sentinel into shared constant by @slin1237 in #615
  • perf(mcp): fix O(n²) insert in inject_mcp_output_items by @slin1237 in #613
  • perf(kv-index): move worker_blocks to caller-owned storage by @slin1237 in #626
  • refactor(data-connector): extract shared random hex ID helper by @CatherineSue in #629
  • perf(data-connector): eliminate double HashMap lookup in list_identifier_responses by @CatherineSue in #632
  • refactor(data-connector): extract item row-parsing helpers in Postgres and Oracle by @CatherineSue in #630
  • refactor(mcp): extract fallback label to DEFAULT_SERVER_LABEL constant by @slin1237 in #619
  • fix(bench): move setup out of concurrent benchmark timing loop by @slin1237 in #634
  • ci: skip irrelevant E2E jobs on PRs with file changes detection by @CatherineSue in #633
  • refactor(data-connector): consolidate parse JSON helpers by @CatherineSue in #631
  • ci: run E2E tests sequentially and fix duplicate log output by @CatherineSue in #635
  • refactor(oracle): use existing CONVERSATIONS table in Flyway test schema by @key4ng in #627
  • feat: add smg-grpc-servicer package by @CatherineSue in #638
  • feat(bench): add trace-driven throughput benchmark for PositionalIndexer by @slin1237 in #636
  • refactor(mesh): simplify code and fix flaky integration tests by @slin1237 in #628
  • feat(realtime-api): realtime websocket handler by @pallasathena92 in #637
  • refactor(workflow): simplify definition and engine internals by @slin1237 in #641
  • refactor(mcp): remove unused invoked_tool_name and resolved_tool_name from ToolExecutionOutput by @slin1237 in #623
  • refactor(mcp): extract schema_to_value helper and document clone cost by @slin1237 in #622
  • feat(interactions): Create basic structure for gemini router by @XinyueZhang369 in #417
  • fix: gRPC error handling — proper status codes and circuit breaker accuracy by @CatherineSue in #645
  • refactor(grpc): split grpc utils.rs into focused modules by @CatherineSue in #646
  • fix(grpc): return tonic::Status directly from grpc_client RPC methods by @CatherineSue in #647
  • ci(docker): add engine-specific image build pipelines by @slin1237 in #651
  • ci: block AI co-author and sign-off lines in commits by @CatherineSue in #652
  • fix(reasoning-parser): increase buffer limit to 4MB and make parse errors non-fatal by @CatherineSue in #653
  • fix(grpc): use e.message() instead of Display for tonic::Status in user-facing errors by @CatherineSue in #654
  • fix(docker): fix image build failures in engine Dockerfile and release workflows by @gongwei-130 in #656
  • chore: add njhill to code owner of grpc proto and servicer by @slin1237 in #661
  • ci: Change to use github runner set for ci runners by @XinyueZhang369 in #659
  • refactor(docker): consolidate engine image build into reusable workflow by @slin1237 in #658
  • fix(ci): add missing Python client types generation to nightly benchmark by @CatherineSue in #663
  • refactor(grpc): optimize vLLM servicer and remove duplicate server.py by @CatherineSue in #662
  • fix(serve): use correct argparse attr for sglang tp_size by @CatherineSue in #665
  • fix(docker): remove GHA docker cache from engine image builds by @slin1237 in #666
  • refactor(realtime): simplify proxy and registry internals by @slin1237 in #649
  • fix(gateway): release job queue semaphore permit before waiting for completion by @slin1237 in #667
  • fix(mesh): increase timeouts in test_multi_node_data_propagation by @CatherineSue in #670
  • refactor: remove dead embedding/classify code from regular PreparationStage by @CatherineSue in #668
  • fix(scripts): handle empty arrays in check_release_versions.sh by @slin1237 in #672
  • fix(ci): add CPU/memory resource requests to GPU runner pods by @slin1237 in #671
  • docs: update link by @lightseek-bot in #674
  • fix(docker): add --no-deps to engine install scripts by @CatherineSue in #675
  • refactor: move library crates into crates/ directory by @CatherineSue in #676
  • feat(kv-events): learn block_size from KV event stream by @slin1237 in #677
  • refactor(e2e): remove parallel testing infrastructure and simplify architecture by @slin1237 in #643
  • fix(ci): prevent runner shutdown on concurrent main pushes and restore path filters by @slin1237 in #678
  • test(realtime-api): add integration test for websocket by @pallasathena92 in #660
  • ops(ci): increase 1-gpu runner max from 8 to 16 by @slin1237 in #682
  • fix: address PR #676 follow-up comments by @CatherineSue in #684
  • refactor(gateway): remove redundant connection_mode special-case for external providers by @CatherineSue in #680
  • fix(benchmarks): increase PD benchmark timeout from 240s to 480s by @slin1237 in #681
  • fix(tokenizer): enable trim_blocks and lstrip_blocks in Jinja2 environment by @slin1237 in #685
  • refactor(e2e): clean up test infrastructure code quality and efficiency by @slin1237 in #686
  • refactor(e2e): simplify test helpers and fix redundant HF_HOME env passing by @slin1237 in #688
  • feat(interactions): Register Gemini Router by @XinyueZhang369 in #679
  • fix(mesh): prevent stale app state overwrites via relay paths by @slin1237 in #689
  • fix(scripts): add smg-grpc-servicer to version check and fix pyproject.toml support by @CatherineSue in #683
  • feat(release): bump v1.2.0 with version sync and engine docker auto-triggers by @slin1237 in #695
