Release v1.2.0 · lightseekorg/smg

🚀 Shepherd Model Gateway v1.2.0 Released!

We're thrilled to announce Shepherd Model Gateway v1.2.0 – a transformative release featuring enhanced event-driven cache-aware routing, production-ready client SDKs, Google Gemini integration, and vLLM gRPC server adoption!

⚡ Enhanced Event-Driven Cache-Aware Routing

Inspired by Amazon Dynamo's distributed caching principles, SMG extends its existing cache-aware routing with real-time KV cache event subscriptions:

SubscribeKvEvents RPC - Real-time KV cache event stream from all backends (SGLang, vLLM, TensorRT-LLM)
KvEventMonitor - Per-worker KV cache event subscriptions with automatic recovery
PositionalIndexer - Event-driven cache-aware routing with router prefix hash for query-path disambiguation
Auto-learned block_size - Dynamically learn from KV event stream
Flash Indexer parity - Closed 4 performance gaps, tuned DashMap shards to 256

Production Results (8 Llama model replicas):

TTFT avg: -23.0% (93.10 → 71.66 ms)
TTFT p99: -27.9% (186.98 → 134.88 ms)
TPOT avg: -0.9% (6.39 → 6.33 ms)
Latency avg: -3.8% (731.60 → 703.92 ms)
Latency max: -11.8% (1034.27 → 912.47 ms)
Req/sec: +1.3% (9.959 → 10.093)

Impact: Maximum KV cache utilization across your inference fleet. Route requests to workers with matching cached prefixes, eliminating redundant computation and dramatically reducing TTFT.

🎨 TensorRT-LLM Multimodal Support

Complete vision-language model integration:

gRPC multimodal pipeline - preprocessed data with hashing
Backend-specific variants - optimized for TRT-LLM
String-based stop sequences - no pre-tokenization overhead
Matched_stop support - proper stop sequence handling

🔄 vLLM Upstream gRPC Adoption

SMG's gRPC server implementation is now upstream in vLLM!

vLLM's PR #36169 formalizes SMG's protobuf and gRPC server implementation as an upstream dependency. gRPC is now an officially supported protocol in the vllm serve command.

smg-grpc-servicer package published to PyPI
Production-grade gRPC server infrastructure
Credit to @CatherineSue and @njhill for driving this milestone

Impact: A significant milestone for the project—SMG's gRPC innovations are now the foundation for vLLM's official gRPC support.

📦 Production-Ready Client SDKs

Multi-language SDK ecosystem with OpenAPI codegen:

Python SDK - Drop-in replacement for OpenAI/Anthropic SDKs with complete API coverage
Rust HTTP Client - Type-safe, async-first client with all endpoints
Java Type Generation - Full OpenAPI-derived types

Endpoints: Chat completions, classify, parser, responses, workers, loads, and more.

Impact: Integrate SMG into any tech stack with idiomatic, type-safe clients. Zero friction migration from OpenAI/Anthropic.

🐳 Engine-Specific Docker Images

Pre-built Docker images for each inference engine:

docker pull ghcr.io/lightseekorg/smg:1.2.0-sglang-v0.5.9
docker pull ghcr.io/lightseekorg/smg:1.2.0-vllm-v0.17.0
docker pull ghcr.io/lightseekorg/smg:1.2.0-trtllm-1.3.0rc6

Impact: Zero-configuration deployment with engine-optimized images. Pull and run your preferred backend instantly. Credit to @gongwei-130 for driving this feature.

🔮 Google Gemini Integration

New Gemini router for Google's Interactions API:

Complete router registration and infrastructure
Native protocol support for Gemini models
Seamless integration alongside OpenAI, Anthropic, and self-hosted engines

Impact: Route to Gemini alongside your entire model fleet. One gateway, all providers.

💾 Advanced Data Persistence

Enterprise-grade data connector enhancements:

Schema versioning with safe-by-default migrations (Flyway integration)
SchemaConfig - Customizable table and column names for existing databases
Storage hooks - Pre/post persistence callbacks
WASM bridge - Call storage APIs from WASM middleware

Performance: Reduced Oracle DB round trips in pagination, batch linking, and delete operations.

📊 Load Monitoring & Discovery

New /v1/loads endpoint with gRPC support, per-worker model customization (--model-id-from), external worker discovery with per-provider API keys, and model aliasing.

🔌 MCP Enhancements

Responses API Integration:

MCP approval items in protocols and routers
X-SMG-MCP header for MCP passthrough control

Anthropic Features:

tool_search_tool support
defer_loading for lazy tool initialization

Performance:

Concurrent tool execution in McpToolSession
Lock-free pool stats with AtomicUsize
Fixed O(n²) insert in inject_mcp_output_items
Reverse iteration with optional limit in AuditLog

🎨 Multimodal Improvements

Qwen VL Support:

Proper patchification and prompt replacement token counts
Backend-specific MultimodalData variants

vLLM Integration:

Send preprocessed multimodal data with hashing and structured tokens
Derive keep_on_cpu keys from model spec

Code Quality:

Split registry.rs into per-model spec modules
Removed dead MultiModalInputs/Tensor/Value types

⚡ Performance Optimizations

Core Engine:

Zero-allocation JSON streaming validation with IgnoredAny
Move semantics in gRPC request handling instead of cloning
Optimized clone usage and finalize/emit_completed ordering

Indexing:

Worker_blocks moved to caller-owned storage
Setup moved out of concurrent benchmark timing loop

🛡️ Critical Bug Fixes

gRPC: Proper status codes, circuit breaker accuracy, return tonic::Status directly, use e.message() for errors
Reasoning Parser: 4MB buffer limit, non-fatal parse errors
Tokenizer: Jinja2 trim_blocks/lstrip_blocks enabled, Value::UNDEFINED for missing params, SGLang tp_size fix
Mesh: Prevent stale state overwrites via relay paths, increased multi-node timeouts
Gateway: Release semaphore permit before completion wait, unified worker registration
Multimodal: Qwen VL patchification and token counts, lazy model discovery in /v1/models

🔧 Refactoring & Code Quality

Repository: Library crates moved to crates/ directory, UUID v4 → v7 migration workspace-wide
MCP: Extracted shared iterators, deduplicated constructors, removed dead paths
gRPC: Split utils.rs into focused modules, deduplicated streaming logic
Workflow Engine: Simplified definition and internals
E2E Testing: Removed parallel infrastructure, simplified helpers and architecture
CI/CD: Reusable build workflows, pre-commit checks, conventional commits enforcement, file changes detection for E2E skipping, sequential execution, block AI co-author lines

📚 Documentation

Standardized runtime ordering to SGLang, vLLM, TensorRT-LLM
Documented O(n) complexity on pool URL-based lookups

🎯 Additional Features

Unified flag for cache token usage report in HTTP mode
OpenAI-compatible cached token usage
GetTokenizer proto and tokenizer bundle streaming
TRT-LLM parameter pass-through fixes
Matched_stop support for vLLM and TensorRT-LLM
Respect workers_config in multi-worker gRPC setup
Realtime API protocol foundations (session, conversation, response, WebSocket handler)
Standardized runtime ordering to SGLang, vLLM, TensorRT-LLM
Documented O(n) complexity on pool URL-based lookups

🔗 Full Changelog: v1.1.0...v1.2.0

Upgrade now: pip install smg --upgrade

🐑 Shepherd your LLM infrastructure with confidence.

⚡ Built for speed. Engineered for scale. Production-proven.

What's Changed

ci(nightly): Add vLLM HTTP support to nightly benchmarks by @CatherineSue in #502
docs: standardize runtime ordering to SGLang, vLLM, TensorRT-LLM by @slin1237 in #514
fix(ci): install CUDA toolkit for SGLang JIT kernel compilation by @slin1237 in #513
fix parameters pass through for trtllm by @gongwei-130 in #509
chore(ci): remove pull request trigger from nightly benchmark workflow by @key4ng in #524
refactor: migrate from UUID v4 to v7 across the workspace by @slin1237 in #518
perf: optimize JSON streaming validation with zero-allocation IgnoredAny by @ppraneth in #516
fix(e2e): respect workers_config in vLLM/TRT-LLM gRPC multi-worker setup by @slin1237 in #525
feat(responses): add mcp approval items to protocols and routers by @zhaowenzi in #491
fix(ci): modify docker-storage dir and add scaler for a10 runners by @XinyueZhang369 in #402
feat(anthropic): add X-SMG-MCP header for MCP passthrough by @key4ng in #517
feat: add --served-model-name support for model aliasing by @ConnorLi96 in #521
feat(data-connector): add SchemaConfig for customizable table and column names by @slin1237 in #526
refactor(protocol): model ResponseTool as tagged enum to match Responses spec and tighten MCP validation by @zhaowenzi in #532
ci: add PR title conventional commits check by @CatherineSue in #540
feat: add unified flag for cache token usage report in http mode by @gongwei-130 in #535
feat(realtime-api): realtime api conversation and response protocol by @pallasathena92 in #387
chore(deps): update hf-hub requirement from 0.4.3 to 0.5.0 by @dependabot[bot] in #528
feat(data-connector): storage hooks, schema config, and WASM bridge by @slin1237 in #541
chore(deps): update pyo3 requirement from 0.27.1 to 0.28.2 by @dependabot[bot] in #529
chore(deps): update wasm-encoder requirement from 0.244 to 0.245 by @dependabot[bot] in #530
feat(clients): add Python/Rust HTTP client SDKs and OpenAPI codegen pipeline by @slin1237 in #539
chore(grpc-pip): fix grpc client workflow, bump 0.4.1 by @slin1237 in #545
feat(clients): add classify, parser, responses, and workers endpoints by @slin1237 in #543
ci(agentic-api): Move Oracle DB and Brave MCP to shared K8s services and simplify CI workflow by @key4ng in #533
feat(clients): add Java type generation from OpenAPI spec by @slin1237 in #547
fix(tokenizer): use Value::UNDEFINED for missing chat template params by @slin1237 in #548
ci: remove check-ci gate, drop run-ci label, skip e2e for Dependabot by @CatherineSue in #550
refactor(mesh): refector: make crdt into a general or-map kv struct by @llfl in #534
feat(anthropic): add tool_search_tool and defer_loading support by @key4ng in #542
fix: unify --model/--model-path and revert in-gateway served-model-name by @slin1237 in #537
feat(load-monitor): add dedicated config, /v1/loads endpoint, and gRPC support by @slin1237 in #552
feat(proto): add sglang encoder proto to smg-grpc-proto by @chenzongyao200127 in #536
feat(data-connector): add schema versioning with safe-by-default migrations by @slin1237 in #544
chore(ci): optimize DinD startup by @key4ng in #551
feat(gRPC): Add GetTokenizer Proto and Tokenizer Bundle Streaming by @YouNeedCryDear in #470
feat(load-monitor): per-group polling with rich load data by @slin1237 in #554
feat(proto): add SubscribeKvEvents RPC and KV cache event messages by @slin1237 in #557
feat(grpc): add subscribe_kv_events to all backend clients by @slin1237 in #558
chore(ci): migrate Brave MCP to official image with Streamable HTTP by @key4ng in #553
feat(kv-index): add PositionalIndexer for event-driven cache-aware routing by @slin1237 in #560
ci(pre-commit): add pre-commit checks to CI pipeline by @CatherineSue in #563
fix(openai): read lazily discovered models in /v1/models endpoint by @CatherineSue in #564
feat(kv-index): router prefix hash for query-path disambiguation by @slin1237 in #565
perf(e2e): reorder tests by model affinity to minimize GPU swaps by @key4ng in #556
feat: add openai compatible cached token usage by @gongwei-130 in #519
feat(gateway): add KvEventMonitor for per-worker KV cache event subscriptions by @slin1237 in #568
perf(data-connector): reduce DB round trips in Oracle pagination, batch linking, and delete by @key4ng in #567
feat(smg-client): make Python client a drop-in replacement for OpenAI/Anthropic SDKs by @slin1237 in #559
feat(grpc): send preprocessed multimodal data to vLLM with hashing and structured tokens by @CatherineSue in #570
feat(realtime-api): realtime api wire-formated event protocol by @pallasathena92 in #389
fix(e2e): resolve marker MRO mismatch and add missing model marker by @key4ng in #569
refactor(gateway): unify local and external worker registration into single workflow by @slin1237 in #573
feat(discovery): add --model-id-from flag for per-worker model_id customization by @frankzhouhr in #562
refactor(data-connector): remove redundant fields from StoredResponse by @key4ng in #574
feat(policy): add event-driven cache-aware routing to CacheAwarePolicy by @slin1237 in #571
fix(kv): stop retrying SubscribeKvEvents when backend returns UNIMPLEMENTED by @CatherineSue in #577
test(oracle): add E2E tests for Flyway-managed Oracle schema by @key4ng in #576
feat(discovery): per-provider admin API keys for external worker model discovery by @slin1237 in #578
feat(bench): add PositionalIndexer benchmarks to radix tree benchmark by @slin1237 in #579
fix(multimodal): Qwen VL patchification and prompt replacement token counts by @CatherineSue in #582
perf(grpc): optimize request handling by using move semantics instead of cloning by @CatherineSue in #583
refactor(grpc): split MultimodalData into backend-specific variants by @CatherineSue in #588
feat(multimodal): derive keep_on_cpu keys from model spec by @CatherineSue in #589
refactor(multimodal): remove dead MultiModalInputs/Tensor/Value types by @CatherineSue in #590
fix(oracle): add project_id to conversation_items and resolve PR #576 feedback by @key4ng in #591
refactor(multimodal): split registry.rs into per-model spec modules by @CatherineSue in #593
chore(deps): bump actions/download-artifact from 7 to 8 by @dependabot[bot] in #596
chore(deps): bump actions/upload-artifact from 6 to 7 by @dependabot[bot] in #595
chore(deps): update toml requirement from 0.9 to 1.0 by @dependabot[bot] in #598
refactor(grpc): deduplicate Harmony Chat Completion streaming decode logic by @CatherineSue in #594
fix(grpc): add matched_stop support for vLLM and TensorRT-LLM by @CatherineSue in #602
feat(realtime-api): realtime api protocol builder by @pallasathena92 in #390
perf(grpc): optimize clone usage and fix finalize/emit_completed ordering by @CatherineSue in #603
perf(kv-index): close 4 perf gaps with Flash Indexer by @slin1237 in #584
refactor(e2e): simplify test infrastructure by removing duplication and dead code by @slin1237 in #587
refactor(harmony): clean up parser signatures, fix delta accumulation, and GptOss tests by @CatherineSue in #592
fix(grpc-trtllm): use string stop/bad words, remove pre-tokenization by @CatherineSue in #606
refactor(grpc-proto): move version to pyproject.toml as single source of truth by @CatherineSue in #610
chore(grpc-proto): bump version to 0.4.2 by @CatherineSue in #611
perf(kv-index): tune index DashMap shard count to 256 by @slin1237 in #608
feat(realtime-api): add Realtime API REST handlers and session registry by @pallasathena92 in #406
refactor(mcp): remove dead execution paths from McpOrchestrator by @slin1237 in #625
refactor(mcp): replace magic "mcp" fallback strings with DEFAULT_SERVER_LABEL constant by @slin1237 in #624
perf(mcp): execute tools concurrently in McpToolSession by @slin1237 in #612
refactor(mcp): extract shared iterator for response bridge builders by @slin1237 in #614
refactor(mcp): deduplicate McpConnectionPool constructors by @slin1237 in #616
docs(mcp): document O(n) complexity on pool URL-based lookups by @slin1237 in #617
perf(mcp): use AtomicUsize for lock-free pool stats reads by @slin1237 in #621
refactor(grpc): replace hand-rolled tool-info JSON with build_mcp_tool_infos by @slin1237 in #620
perf(mcp): reverse iterate and add optional limit to AuditLog::for_request by @slin1237 in #618
refactor(mcp): extract unknown server key sentinel into shared constant by @slin1237 in #615
perf(mcp): fix O(n²) insert in inject_mcp_output_items by @slin1237 in #613
perf(kv-index): move worker_blocks to caller-owned storage by @slin1237 in #626
refactor(data-connector): extract shared random hex ID helper by @CatherineSue in #629
perf(data-connector): eliminate double HashMap lookup in list_identifier_responses by @CatherineSue in #632
refactor(data-connector): extract item row-parsing helpers in Postgres and Oracle by @CatherineSue in #630
refactor(mcp): extract fallback label to DEFAULT_SERVER_LABEL constant by @slin1237 in #619
fix(bench): move setup out of concurrent benchmark timing loop by @slin1237 in #634
ci: skip irrelevant E2E jobs on PRs with file changes detection by @CatherineSue in #633
refactor(data-connector): consolidate parse JSON helpers by @CatherineSue in #631
ci: run E2E tests sequentially and fix duplicate log output by @CatherineSue in #635
refactor(oracle): use existing CONVERSATIONS table in Flyway test schema by @key4ng in #627
feat: add smg-grpc-servicer package by @CatherineSue in #638
feat(bench): add trace-driven throughput benchmark for PositionalIndexer by @slin1237 in #636
refactor(mesh): simplify code and fix flaky integration tests by @slin1237 in #628
feat(realtime-api): realtime websocket handler by @pallasathena92 in #637
refactor(workflow): simplify definition and engine internals by @slin1237 in #641
refactor(mcp): remove unused invoked_tool_name and resolved_tool_name from ToolExecutionOutput by @slin1237 in #623
refactor(mcp): extract schema_to_value helper and document clone cost by @slin1237 in #622
feat(interactions): Create basic structure for gemini router by @XinyueZhang369 in #417
fix: gRPC error handling — proper status codes and circuit breaker accuracy by @CatherineSue in #645
refactor(grpc): split grpc utils.rs into focused modules by @CatherineSue in #646
fix(grpc): return tonic::Status directly from grpc_client RPC methods by @CatherineSue in #647
ci(docker): add engine-specific image build pipelines by @slin1237 in #651
ci: block AI co-author and sign-off lines in commits by @CatherineSue in #652
fix(reasoning-parser): increase buffer limit to 4MB and make parse errors non-fatal by @CatherineSue in #653
fix(grpc): use e.message() instead of Display for tonic::Status in user-facing errors by @CatherineSue in #654
fix(docker): fix image build failures in engine Dockerfile and release workflows by @gongwei-130 in #656
chore: add njhill to code owner of grpc proto and servicer by @slin1237 in #661
ci: Change to use github runner set for ci runners by @XinyueZhang369 in #659
refactor(docker): consolidate engine image build into reusable workflow by @slin1237 in #658
fix(ci): add missing Python client types generation to nightly benchmark by @CatherineSue in #663
refactor(grpc): optimize vLLM servicer and remove duplicate server.py by @CatherineSue in #662
fix(serve): use correct argparse attr for sglang tp_size by @CatherineSue in #665
fix(docker): remove GHA docker cache from engine image builds by @slin1237 in #666
refactor(realtime): simplify proxy and registry internals by @slin1237 in #649
fix(gateway): release job queue semaphore permit before waiting for completion by @slin1237 in #667
fix(mesh): increase timeouts in test_multi_node_data_propagation by @CatherineSue in #670
refactor: remove dead embedding/classify code from regular PreparationStage by @CatherineSue in #668
fix(scripts): handle empty arrays in check_release_versions.sh by @slin1237 in #672
fix(ci): add CPU/memory resource requests to GPU runner pods by @slin1237 in #671
docs: update link by @lightseek-bot in #674
fix(docker): add --no-deps to engine install scripts by @CatherineSue in #675
refactor: move library crates into crates/ directory by @CatherineSue in #676
feat(kv-events): learn block_size from KV event stream by @slin1237 in #677
refactor(e2e): remove parallel testing infrastructure and simplify architecture by @slin1237 in #643
fix(ci): prevent runner shutdown on concurrent main pushes and restore path filters by @slin1237 in #678
test(realtime-api): add integration test for websocket by @pallasathena92 in #660
ops(ci): increase 1-gpu runner max from 8 to 16 by @slin1237 in #682
fix: address PR #676 follow-up comments by @CatherineSue in #684
refactor(gateway): remove redundant connection_mode special-case for external providers by @CatherineSue in #680
fix(benchmarks): increase PD benchmark timeout from 240s to 480s by @slin1237 in #681
fix(tokenizer): enable trim_blocks and lstrip_blocks in Jinja2 environment by @slin1237 in #685
refactor(e2e): clean up test infrastructure code quality and efficiency by @slin1237 in #686
refactor(e2e): simplify test helpers and fix redundant HF_HOME env passing by @slin1237 in #688
feat(interactions): Register Gemini Router by @XinyueZhang369 in #679
fix(mesh): prevent stale app state overwrites via relay paths by @slin1237 in #689
fix(scripts): add smg-grpc-servicer to version check and fix pyproject.toml support by @CatherineSue in #683
feat(release): bump v1.2.0 with version sync and engine docker auto-triggers by @slin1237 in #695

New Contributors

@ConnorLi96 made their first contribution in #521
@chenzongyao200127 made their first contribution in #536
@YouNeedCryDear made their first contribution in #470
@frankzhouhr made their first contribution in #562
@lightseek-bot made their first contribution in #674

Full Changelog: v1.1.0...v1.2.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v1.2.0

Choose a tag to compare

Sorry, something went wrong.