
Add KVStore, ready-check mechanism, and ServiceLauncher #215

Merged
nv-alicheng merged 8 commits into feat/alicheng-pubsub-integration from feat/alicheng-kvstore
Mar 31, 2026

Conversation

@nv-alicheng
Collaborator

  • BasicKVStore (mmap on /dev/shm) for cross-process metric reads
  • Ready-check protocol for subprocess startup synchronization
  • ServiceLauncher for managing subprocess lifecycle
  • Refactored metrics_table to remove duplicated code

What does this PR do?

Adds basic KVStore functionality (a lightweight Prometheus implementation) as the default MetricsAggregator backend.

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

@nv-alicheng nv-alicheng requested a review from a team as a code owner March 26, 2026 01:13
@github-actions

github-actions bot commented Mar 26, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions bot requested review from arekay-nv and nvzhihanj March 26, 2026 01:13
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly overhauls the metrics aggregation and service management infrastructure. It introduces a high-performance, shared-memory key-value store for metrics, enabling efficient cross-process reads. Alongside this, a reliable ready-check protocol and a service launcher utility were added to ensure robust and synchronized startup of various subprocesses. These changes enhance the system's ability to collect and manage real-time metrics while improving overall service orchestration.

Highlights

  • KVStore Implementation: Introduced a new KVStore abstract base class and a BasicKVStore implementation that uses memory-mapped files on /dev/shm for efficient, lock-free, cross-process metric storage. This replaces the previous MetricEmitter system.
  • Ready-Check Mechanism: Implemented a robust ZMQ PUSH/PULL-based readiness protocol, featuring ReadyCheckReceiver and send_ready_signal, to synchronize subprocess startup and ensure services are fully initialized before use.
  • ServiceLauncher Utility: Added a ServiceLauncher class to streamline the management of service subprocesses, including their launch, readiness synchronization using the new ready-check mechanism, and graceful shutdown.
  • Metrics Aggregator Refactoring: Refactored the MetricsAggregatorService and MetricsTable to integrate with the new KVStore, simplifying metric trigger logic and removing the JsonlMetricEmitter.
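
The mmap-backed idea in the KVStore highlight can be sketched roughly as follows. This is a minimal illustration, assuming a single writer, fixed 8-byte slots, and a temporary directory standing in for /dev/shm. The class name `TinyCounterStore` and its file layout are hypothetical and are not the PR's `BasicKVStore`.

```python
import mmap
import os
import struct
import tempfile

SLOT = struct.Struct("<Q")  # one unsigned 64-bit little-endian value per slot


class TinyCounterStore:
    """Hypothetical illustration only; not the PR's BasicKVStore."""

    def __init__(self, path: str, n_slots: int, create: bool = False) -> None:
        flags = os.O_RDWR | (os.O_CREAT if create else 0)
        # 0o600 keeps the backing file private to the owning user.
        fd = os.open(path, flags, 0o600)
        try:
            if create:
                os.ftruncate(fd, n_slots * SLOT.size)
            self._mm = mmap.mmap(fd, n_slots * SLOT.size)
        finally:
            os.close(fd)  # the mapping stays valid after the fd is closed

    def set(self, slot: int, value: int) -> None:
        SLOT.pack_into(self._mm, slot * SLOT.size, value)

    def get(self, slot: int) -> int:
        return SLOT.unpack_from(self._mm, slot * SLOT.size)[0]

    def close(self) -> None:
        self._mm.close()


def demo() -> int:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "metrics.bin")
        writer = TinyCounterStore(path, n_slots=4, create=True)
        reader = TinyCounterStore(path, n_slots=4)  # a second, independent mapping
        writer.set(2, 41)
        writer.set(2, writer.get(2) + 1)
        value = reader.get(2)  # the reader sees the write through the shared file
        reader.close()
        writer.close()
        return value
```

Here both mappings live in one process for brevity; a reader in another process would open the same path in the same way.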




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the metrics aggregation system to use a KVStore backed by mmap files in /dev/shm for lock-free cross-process reads, replacing the previous MetricEmitter JSONL file output. It also introduces a generic ZMQ-based readiness check mechanism for subprocess synchronization, including a ServiceLauncher to manage service startup and shutdown. The review highlights a high-severity issue where the wait_for_exit method in ServiceLauncher is blocking and could cause deadlocks in an async context, suggesting asyncio.to_thread as a fix. Additionally, a medium-severity security concern is raised regarding the use of 0o666 file permissions for mmap files in /dev/shm, recommending more restrictive permissions like 0o660 or 0o600 to prevent unauthorized modification.
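
The `asyncio.to_thread` suggestion for the blocking wait could look roughly like this. It is a sketch under assumptions: `wait_for_exit_async` is a hypothetical helper name, and the real `ServiceLauncher.wait_for_exit` signature is not shown in this conversation.

```python
import asyncio
import subprocess
import sys


async def wait_for_exit_async(proc: subprocess.Popen, timeout: float) -> int:
    # Popen.wait blocks the calling thread, so run it in a worker thread
    # and let the event loop keep servicing other tasks in the meantime.
    # subprocess.TimeoutExpired propagates to the caller if the wait times out.
    return await asyncio.to_thread(proc.wait, timeout)


async def main() -> tuple:
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.2)"])
    ticks = 0

    async def heartbeat() -> None:
        # Stands in for any other coroutine that must keep running.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    hb = asyncio.create_task(heartbeat())
    code = await wait_for_exit_async(proc, timeout=30.0)
    hb.cancel()
    return code, ticks
```

Running `asyncio.run(main())` returns the child's exit code together with a tick count greater than one, which shows the loop stayed responsive during the wait.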

Comment thread src/inference_endpoint/async_utils/services/launcher.py Outdated
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/kv_store.py Outdated
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/kv_store.py Outdated
Comment thread src/inference_endpoint/async_utils/services/launcher.py
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/__main__.py Outdated
@nvzhihanj
Collaborator

Review Council — Multi-AI Code Review

Reviewed by: Claude | Depth: standard

Found 4 issues across 3 files:

  • 1 high
  • 3 medium

| # | File | Line | Severity | Category | Summary |
|---|------|------|----------|----------|---------|
| 1 | launcher.py | 89 | high | error-handling | Crashed subprocess detection leaks remaining running processes and receiver socket |
| 2 | kv_store.py | 127 | medium | bug | _grow() leaves _capacity doubled on mmap failure; subsequent writes go out of bounds |
| 3 | kv_store.py | 109 | medium | concurrency | Series write relies on x86 TSO; breaks on ARM (Graviton, Apple Silicon) |
| 4 | __main__.py | 82 | medium | error-handling | /dev/shm/metrics_* directory leaked if constructor raises before try/finally |
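
One way to avoid the stale-capacity problem in issue 2 is to commit the new capacity only after the resize has succeeded. The sketch below assumes a hypothetical `GrowableBuffer` with an injectable `resize` callable standing in for the ftruncate and re-mmap steps; it is not the PR's `_grow()` implementation.

```python
from typing import Callable


class GrowableBuffer:
    """Hypothetical stand-in for a store whose backing mapping can grow."""

    def __init__(self, capacity: int, resize: Callable[[int], bytearray]) -> None:
        self._resize = resize  # stands in for ftruncate + re-mmap
        self._buf = resize(capacity)
        self._capacity = capacity

    @property
    def capacity(self) -> int:
        return self._capacity

    def grow(self) -> None:
        new_capacity = self._capacity * 2
        new_buf = self._resize(new_capacity)  # may raise; state is untouched so far
        new_buf[: self._capacity] = self._buf  # copy existing data across
        # Commit both fields only after every fallible step has succeeded.
        self._buf, self._capacity = new_buf, new_capacity


def failing_after(n_ok: int) -> Callable[[int], bytearray]:
    """Build a resize function that raises OSError after n_ok successful calls."""
    calls = 0

    def resize(size: int) -> bytearray:
        nonlocal calls
        calls += 1
        if calls > n_ok:
            raise OSError("simulated mmap failure")
        return bytearray(size)

    return resize
```

With `GrowableBuffer(8, failing_after(2))`, the first `grow()` succeeds and the second raises, leaving `capacity` at 16 rather than 32.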

🤖 Generated with Claude Code

Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/__main__.py Outdated
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/aggregator.py Outdated
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/aggregator.py Outdated
Comment thread src/inference_endpoint/async_utils/services/launcher.py
Comment thread src/inference_endpoint/async_utils/services/launcher.py Outdated
@nv-alicheng nv-alicheng changed the base branch from feat/alicheng-cleanup to feat/alicheng-pubsub-integration March 30, 2026 18:07
@nv-alicheng nv-alicheng force-pushed the feat/alicheng-kvstore branch from fd7b4ed to b3e80eb Compare March 31, 2026 00:07
Collaborator Author

@nv-alicheng nv-alicheng left a comment


Review Council — Multi-AI Code Review

Reviewed by: Claude (Codex unavailable — CLI crash) | Depth: thorough

Found 11 issues across 6 files.

Must Fix (critical/high)

Issues that will cause incorrect behavior or data loss.

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 1 | launcher.py | 129 | bug | Claude | Crash detection includes still-running processes: poll() returns None for running processes, and… |
| 2 | kv_store.py | 247 | performance | Claude | mmap.flush() issues an msync() syscall on every append(). On /dev/shm (tmpfs), msync is a… |
| 3 | __main__.py | 97 | data-integrity | Claude | The metrics_dir path is auto-generated via tempfile.mkdtemp() and only known inside the subproce… |
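
The `poll()` behaviour named in issue 1 can be illustrated with a small sketch. The summary above is truncated, so the exact bug in launcher.py is not known here; the hypothetical `crashed` helper below only shows how to tell a still-running process (`None`) apart from one that exited with a non-zero code.

```python
import subprocess
import sys


def crashed(procs: list) -> list:
    """Return only the processes that have exited with a non-zero code."""
    out = []
    for p in procs:
        code = p.poll()  # None means "still running", not "failed"
        if code is not None and code != 0:
            out.append(p)
    return out


def demo() -> list:
    running = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    failing = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
    try:
        failing.wait(timeout=30)
        # Only the failed process is reported; the running one is left alone.
        return [p.returncode for p in crashed([running, failing])]
    finally:
        running.kill()
        running.wait()
```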

Should Fix (medium)

Real issues under specific conditions or design flaws that will compound.

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 4 | aggregator.py | 169 | data-integrity | Claude | n_samples_issued/n_samples_completed are incremented for ALL samples, including those arriving b… |
| 5 | kv_store.py | 264 | concurrency | Claude | _grow() replaces the mmap while readers in other processes may have the same file mmap'd. The safe… |
| 6 | metrics_table.py | 128 | design | Claude | EmitTrigger.kv_store is initialized to None with type: ignore; calling fire() before `add_t… |
| 7 | launcher.py | 153 | api-contract | Claude | wait_for_exit applies timeout per-process, not as a total deadline. With N processes, worst case… |
| 8 | kv_store.py | 249 | concurrency | Claude | The barrier protocol is incomplete: flush() before the count update (line 247) targets ARM orderin… |
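
For issue 7, a shared deadline can replace the per-process timeout. The following is a sketch only: `wait_all` is a hypothetical function, not the PR's `wait_for_exit`, and it assumes that reporting `None` for processes still running at the deadline is acceptable.

```python
import subprocess
import sys
import time
from typing import Optional


def wait_all(procs: list, timeout: float) -> list:
    """Wait for every process, sharing one overall deadline of `timeout` seconds."""
    deadline = time.monotonic() + timeout
    codes: list = []
    for p in procs:
        # Each wait gets only what is left of the shared budget.
        remaining = max(0.0, deadline - time.monotonic())
        code: Optional[int]
        try:
            code = p.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            code = None  # still running when the shared deadline passed
        codes.append(code)
    return codes


def demo() -> tuple:
    cmd = [sys.executable, "-c", "import time; time.sleep(30)"]
    procs = [subprocess.Popen(cmd) for _ in range(3)]
    start = time.monotonic()
    try:
        codes = wait_all(procs, timeout=1.0)
        return codes, time.monotonic() - start
    finally:
        for p in procs:
            p.kill()
            p.wait()
```

With three long-running children and a 1.0 s budget, the call returns after roughly one second in total, where a per-process timeout would take about three.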

Consider (low)

Valid improvements that could be follow-ups.

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 9 | kv_store.py | 120 | api-contract | Claude | Parameter type shadows the Python builtin. Consider renaming to key_type or kind. |
| 10 | ready_check.py | 59 | design | Claude | PUSH socket LINGER is 5000ms. For readiness signals (happens once at startup, not hot-path), a silen… |
| 11 | __main__.py | 136 | error-handling | Claude | Triple cleanup: _finalize() closes kv_store, then inner finally calls kv_store.unlink() (which… |

🤖 Generated with Claude Code

Comment thread src/inference_endpoint/async_utils/services/launcher.py Outdated
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/kv_store.py Outdated
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/__main__.py Outdated
Comment thread src/inference_endpoint/async_utils/services/launcher.py
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/kv_store.py Outdated
Comment thread src/inference_endpoint/async_utils/transport/zmq/ready_check.py
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/__main__.py Outdated
@nv-alicheng nv-alicheng merged commit e4ef534 into feat/alicheng-pubsub-integration Mar 31, 2026
2 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 31, 2026