Add KVStore, ready-check mechanism, and ServiceLauncher#215
Add KVStore, ready-check mechanism, and ServiceLauncher#215nv-alicheng merged 8 commits intofeat/alicheng-pubsub-integrationfrom
Conversation
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly overhauls the metrics aggregation and service management infrastructure. It introduces a high-performance, shared-memory key-value store for metrics, enabling efficient cross-process reads. Alongside this, a reliable ready-check protocol and a service launcher utility were added to ensure robust and synchronized startup of various subprocesses. These changes enhance the system's ability to collect and manage real-time metrics while improving overall service orchestration. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request refactors the metrics aggregation system to use a KVStore backed by mmap files in /dev/shm for lock-free cross-process reads, replacing the previous MetricEmitter JSONL file output. It also introduces a generic ZMQ-based readiness check mechanism for subprocess synchronization, including a ServiceLauncher to manage service startup and shutdown. The review highlights a high-severity issue where the wait_for_exit method in ServiceLauncher is blocking and could cause deadlocks in an async context, suggesting asyncio.to_thread as a fix. Additionally, a medium-severity security concern is raised regarding the use of 0o666 file permissions for mmap files in /dev/shm, recommending more restrictive permissions like 0o660 or 0o600 to prevent unauthorized modification.
Review Council — Multi-AI Code ReviewReviewed by: Claude | Depth: standard Found 4 issues across 3 files:
🤖 Generated with Claude Code |
- BasicKVStore (mmap on /dev/shm) for cross-process metric reads - Ready-check protocol for subprocess startup synchronization - ServiceLauncher for managing subprocess lifecycle - Refactored metrics_table to remove duplicated code
fd7b4ed to
b3e80eb
Compare
nv-alicheng
left a comment
There was a problem hiding this comment.
Review Council — Multi-AI Code Review
Reviewed by: Claude (Codex unavailable — CLI crash) | Depth: thorough
Found 11 issues across 6 files.
Must Fix (critical/high)
Issues that will cause incorrect behavior or data loss.
| # | File | Line | Category | Reviewer(s) | Summary |
|---|---|---|---|---|---|
| 1 | launcher.py |
129 | bug | Claude | Crash detection includes still-running processes: poll() returns None for running processes, and |
| 2 | kv_store.py |
247 | performance | Claude | mmap.flush() issues an msync() syscall on every append(). On /dev/shm (tmpfs), msync is a |
| 3 | __main__.py |
97 | data-integrity | Claude | The metrics_dir path is auto-generated via tempfile.mkdtemp() and only known inside the subproce |
Should Fix (medium)
Real issues under specific conditions or design flaws that will compound.
| # | File | Line | Category | Reviewer(s) | Summary |
|---|---|---|---|---|---|
| 4 | aggregator.py |
169 | data-integrity | Claude | n_samples_issued/n_samples_completed are incremented for ALL samples, including those arriving b |
| 5 | kv_store.py |
264 | concurrency | Claude | _grow() replaces the mmap while readers in other processes may have the same file mmap'd. The safe |
| 6 | metrics_table.py |
128 | design | Claude | EmitTrigger.kv_store is initialized to None with type: ignore — calling fire() before `add_t |
| 7 | launcher.py |
153 | api-contract | Claude | wait_for_exit applies timeout per-process, not as a total deadline. With N processes, worst case |
| 8 | kv_store.py |
249 | concurrency | Claude | The barrier protocol is incomplete: flush() before the count update (line 247) targets ARM orderin |
Consider (low)
Valid improvements that could be follow-ups.
| # | File | Line | Category | Reviewer(s) | Summary |
|---|---|---|---|---|---|
| 9 | kv_store.py |
120 | api-contract | Claude | Parameter type shadows the Python builtin. Consider renaming to key_type or kind. |
| 10 | ready_check.py |
59 | design | Claude | PUSH socket LINGER is 5000ms. For readiness signals (happens once at startup, not hot-path), a silen |
| 11 | __main__.py |
136 | error-handling | Claude | Triple cleanup: _finalize() closes kv_store, then inner finally calls kv_store.unlink() (which |
🤖 Generated with Claude Code
…apacity update ordering, document ARM workaround
e4ef534
into
feat/alicheng-pubsub-integration
What does this PR do?
Adds basic KVStore functionality (lightweight Prometheus implementation) to be the default MetricsAggregator backend
Type of change
Related issues
Testing
Checklist