Add KVStore, ready-check mechanism, and ServiceLauncher by nv-alicheng · Pull Request #215 · mlcommons/endpoints

nv-alicheng · 2026-03-26T01:13:38Z

BasicKVStore (mmap on /dev/shm) for cross-process metric reads
Ready-check protocol for subprocess startup synchronization
ServiceLauncher for managing subprocess lifecycle
Refactored metrics_table to remove duplicated code

What does this PR do?

Adds basic KVStore functionality (a lightweight, Prometheus-style implementation) to serve as the default MetricsAggregator backend.

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

github-actions · 2026-03-26T01:13:47Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist · 2026-03-26T01:14:07Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly overhauls the metrics aggregation and service management infrastructure. It introduces a high-performance, shared-memory key-value store for metrics, enabling efficient cross-process reads. Alongside this, a reliable ready-check protocol and a service launcher utility were added to ensure robust and synchronized startup of various subprocesses. These changes enhance the system's ability to collect and manage real-time metrics while improving overall service orchestration.

Highlights

KVStore Implementation: Introduced a new KVStore abstract base class and a BasicKVStore implementation that uses memory-mapped files on /dev/shm for efficient, lock-free, cross-process metric storage. This replaces the previous MetricEmitter system.
Ready-Check Mechanism: Implemented a robust ZMQ PUSH/PULL-based readiness protocol, featuring ReadyCheckReceiver and send_ready_signal, to synchronize subprocess startup and ensure services are fully initialized before use.
ServiceLauncher Utility: Added a ServiceLauncher class to streamline the management of service subprocesses, including their launch, readiness synchronization using the new ready-check mechanism, and graceful shutdown.
Metrics Aggregator Refactoring: Refactored the MetricsAggregatorService and MetricsTable to integrate with the new KVStore, simplifying metric trigger logic and removing the JsonlMetricEmitter.
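As a rough illustration of the mmap-backed storage described in the highlights, here is a minimal sketch of one writer and one reader sharing a single 64-bit counter through a memory-mapped file. This is not the PR's `BasicKVStore`: the one-slot layout and the helper names (`create_counter_file`, `write_counter`, `read_counter`) are assumptions made for the example. It uses a temporary directory in its usage rather than `/dev/shm`, so it runs anywhere.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout: one unsigned 64-bit little-endian value per file.
SLOT = struct.Struct("<Q")


def create_counter_file(path: str) -> None:
    """Create a one-slot file readable and writable only by its owner (0o600)."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        os.ftruncate(fd, SLOT.size)  # new bytes are zero-filled
    finally:
        os.close(fd)


def write_counter(path: str, value: int) -> None:
    """Writer side: map the file read-write and store the value in place."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), SLOT.size) as mm:
        SLOT.pack_into(mm, 0, value)


def read_counter(path: str) -> int:
    """Reader side: map the same file read-only and decode the slot."""
    with open(path, "rb") as f, mmap.mmap(
        f.fileno(), SLOT.size, access=mmap.ACCESS_READ
    ) as mm:
        return SLOT.unpack_from(mm, 0)[0]
```

A reader in another process would call `read_counter` with the same path. Only the file path needs to be shared between processes, not a lock or a socket.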

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request refactors the metrics aggregation system to use a KVStore backed by mmap files in /dev/shm for lock-free cross-process reads, replacing the previous MetricEmitter JSONL file output. It also introduces a generic ZMQ-based readiness check mechanism for subprocess synchronization, including a ServiceLauncher to manage service startup and shutdown. The review highlights a high-severity issue where the wait_for_exit method in ServiceLauncher is blocking and could cause deadlocks in an async context, suggesting asyncio.to_thread as a fix. Additionally, a medium-severity security concern is raised regarding the use of 0o666 file permissions for mmap files in /dev/shm, recommending more restrictive permissions like 0o660 or 0o600 to prevent unauthorized modification.
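The `asyncio.to_thread` suggestion from the review can be sketched as follows. This is a hedged example, not the PR's `ServiceLauncher.wait_for_exit`: the function name `wait_for_exit_async` and its signature are assumptions. The idea is that `Popen.wait` blocks the calling thread, so running it in a worker thread keeps the event loop free to service other tasks.

```python
import asyncio
import subprocess
import sys
from typing import Optional


async def wait_for_exit_async(
    proc: subprocess.Popen, timeout: Optional[float] = None
) -> int:
    """Wait for a subprocess without blocking the event loop.

    Popen.wait() is a blocking call; asyncio.to_thread runs it in a worker
    thread and lets other coroutines make progress in the meantime.
    Raises subprocess.TimeoutExpired if the timeout elapses first.
    """
    return await asyncio.to_thread(proc.wait, timeout)


async def main() -> int:
    proc = subprocess.Popen([sys.executable, "-c", "pass"])
    return await wait_for_exit_async(proc, timeout=30.0)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

`asyncio.to_thread` requires Python 3.9 or newer. On older versions, `loop.run_in_executor(None, proc.wait, timeout)` is the equivalent call.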

nvzhihanj · 2026-03-30T04:02:28Z

Review Council — Multi-AI Code Review

Reviewed by: Claude | Depth: standard

Found 4 issues across 3 files:

1 high
3 medium

| # | File | Line | Severity | Category | Summary |
|---|------|------|----------|----------|---------|
| 1 | `launcher.py` | 89 | high | error-handling | Crashed subprocess detection leaks remaining running processes and receiver socket |
| 2 | `kv_store.py` | 127 | medium | bug | `_grow()` leaves `_capacity` doubled on mmap failure; subsequent writes go out of bounds |
| 3 | `kv_store.py` | 109 | medium | concurrency | Series write relies on x86 TSO; breaks on ARM (Graviton, Apple Silicon) |
| 4 | `__main__.py` | 82 | medium | error-handling | `/dev/shm/metrics_*` directory leaked if constructor raises before try/finally |
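Issue 2 above is about update ordering. A minimal sketch of one way to avoid it, assuming a simple growable mmap buffer (the class `GrowableBuffer` is hypothetical and is not the PR's `kv_store.py` code), is to build the larger mapping first and only then commit the new capacity:

```python
import mmap
import os
import tempfile


class GrowableBuffer:
    """Hypothetical mmap-backed buffer that doubles its capacity on demand."""

    def __init__(self, path: str, capacity: int) -> None:
        self._fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
        os.ftruncate(self._fd, capacity)
        self._mm = mmap.mmap(self._fd, capacity)
        self._capacity = capacity

    def _grow(self) -> None:
        new_capacity = self._capacity * 2
        # Both calls may raise OSError. Nothing on self has changed yet,
        # so a failure here leaves _capacity and _mm consistent.
        os.ftruncate(self._fd, new_capacity)
        new_mm = mmap.mmap(self._fd, new_capacity)
        # Commit only after the new mapping exists.
        old_mm, self._mm = self._mm, new_mm
        self._capacity = new_capacity
        old_mm.close()

    def close(self) -> None:
        self._mm.close()
        os.close(self._fd)
```

If the remap fails, the file may already have been extended on disk, but the in-memory capacity still matches the live mapping, so later writes stay in bounds.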

🤖 Generated with Claude Code

nv-alicheng

Review Council — Multi-AI Code Review

Reviewed by: Claude (Codex unavailable — CLI crash) | Depth: thorough

Found 11 issues across 6 files.

Must Fix (critical/high)

Issues that will cause incorrect behavior or data loss.

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 1 | `launcher.py` | 129 | bug | Claude | Crash detection includes still-running processes: `poll()` returns `None` for running processes, and |
| 2 | `kv_store.py` | 247 | performance | Claude | `mmap.flush()` issues an `msync()` syscall on every `append()`. On `/dev/shm` (tmpfs), `msync` is a |
| 3 | `__main__.py` | 97 | data-integrity | Claude | The `metrics_dir` path is auto-generated via `tempfile.mkdtemp()` and only known inside the subproce |
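For issue 1, the relevant fact is that `Popen.poll()` returns `None` while a process is still running, so a crash check has to exclude `None` explicitly. The sketch below shows one way to write such a check. The helper `find_crashed` is hypothetical, and treating any non-zero exit code as a crash is an assumption, not necessarily the PR's rule.

```python
import subprocess
import sys
from typing import List, Sequence


def find_crashed(procs: Sequence[subprocess.Popen]) -> List[subprocess.Popen]:
    """Return only the processes that have exited with a non-zero code.

    poll() returns None for a process that is still running, so None must
    not be treated as a failure.
    """
    crashed = []
    for proc in procs:
        returncode = proc.poll()
        if returncode is not None and returncode != 0:
            crashed.append(proc)
    return crashed


if __name__ == "__main__":
    running = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(5)"])
    failed = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
    failed.wait()
    print([p.returncode for p in find_crashed([running, failed])])
    running.kill()
    running.wait()
```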

Should Fix (medium)

Real issues under specific conditions or design flaws that will compound.

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 4 | `aggregator.py` | 169 | data-integrity | Claude | `n_samples_issued`/`n_samples_completed` are incremented for ALL samples, including those arriving b |
| 5 | `kv_store.py` | 264 | concurrency | Claude | `_grow()` replaces the mmap while readers in other processes may have the same file mmap'd. The safe |
| 6 | `metrics_table.py` | 128 | design | Claude | `EmitTrigger.kv_store` is initialized to `None` with `type: ignore`; calling `fire()` before `add_t |
| 7 | `launcher.py` | 153 | api-contract | Claude | `wait_for_exit` applies `timeout` per-process, not as a total deadline. With N processes, worst case |
| 8 | `kv_store.py` | 249 | concurrency | Claude | The barrier protocol is incomplete: `flush()` before the count update (line 247) targets ARM orderin |
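Issue 7 distinguishes a per-process timeout from a total deadline. A minimal sketch of the total-deadline approach, assuming a hypothetical helper `wait_for_all` rather than the PR's actual `wait_for_exit`, computes one deadline up front and passes each process only the time that remains:

```python
import subprocess
import sys
import time
from typing import List, Optional, Sequence


def wait_for_all(
    procs: Sequence[subprocess.Popen], timeout: Optional[float]
) -> List[int]:
    """Wait for every process, treating timeout as a budget shared by all.

    Raises subprocess.TimeoutExpired as soon as the shared budget runs out.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    return_codes = []
    for proc in procs:
        if deadline is None:
            remaining = None
        else:
            # Never pass a negative timeout; 0.0 still checks once.
            remaining = max(0.0, deadline - time.monotonic())
        return_codes.append(proc.wait(timeout=remaining))
    return return_codes


if __name__ == "__main__":
    procs = [subprocess.Popen([sys.executable, "-c", "pass"]) for _ in range(3)]
    print(wait_for_all(procs, timeout=30.0))
```

With this shape, the total wait is bounded by roughly `timeout` regardless of how many processes are passed in, instead of growing with the number of processes.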

Consider (low)

Valid improvements that could be follow-ups.

| # | File | Line | Category | Reviewer(s) | Summary |
|---|------|------|----------|-------------|---------|
| 9 | `kv_store.py` | 120 | api-contract | Claude | Parameter `type` shadows the Python builtin. Consider renaming to `key_type` or `kind`. |
| 10 | `ready_check.py` | 59 | design | Claude | PUSH socket LINGER is 5000ms. For readiness signals (happens once at startup, not hot-path), a silen |
| 11 | `__main__.py` | 136 | error-handling | Claude | Triple cleanup: `_finalize()` closes kv_store, then inner `finally` calls `kv_store.unlink()` (which |

🤖 Generated with Claude Code

nv-alicheng requested a review from a team as a code owner March 26, 2026 01:13

github-actions bot requested review from arekay-nv and nvzhihanj March 26, 2026 01:13

gemini-code-assist bot reviewed Mar 26, 2026

Comment thread src/inference_endpoint/async_utils/services/launcher.py Outdated

Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/kv_store.py Outdated

Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/kv_store.py Outdated

nvzhihanj reviewed Mar 30, 2026

Comment thread src/inference_endpoint/async_utils/services/launcher.py

nvzhihanj reviewed Mar 30, 2026

Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/kv_store.py

nvzhihanj reviewed Mar 30, 2026

Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/kv_store.py

nvzhihanj reviewed Mar 30, 2026

Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/__main__.py Outdated

arekay-nv reviewed Mar 30, 2026

nv-alicheng changed the base branch from feat/alicheng-cleanup to feat/alicheng-pubsub-integration March 30, 2026 18:07

arekay-nv approved these changes Mar 30, 2026

nv-alicheng added 3 commits March 30, 2026 13:51

Address PR comments

7ace72d

Move MetricsAggregator to use Enums instead of strings to track metrics

b3e80eb

nv-alicheng force-pushed the feat/alicheng-kvstore branch from fd7b4ed to b3e80eb Compare March 31, 2026 00:07

Minor comments, fixes. Restrict filemode to 600 instead of 666

153368c

nv-alicheng commented Mar 31, 2026

nv-alicheng added 2 commits March 30, 2026 19:36

fix: address PR #215 review feedback - crash detection bug, _grow() capacity update ordering, document ARM workaround

ebbe16e

Add guard to avoid no-op flush() syscall on x86

e073ce2

nvzhihanj approved these changes Mar 31, 2026

nv-alicheng added 2 commits March 30, 2026 20:26

Pass in KVStore to triggers to avoid injection during table.add_trigger

0e2e9fa

Fix timeout to be across all processes, not per process

7eb76ee

nv-alicheng merged commit e4ef534 into feat/alicheng-pubsub-integration Mar 31, 2026
2 checks passed

github-actions bot locked and limited conversation to collaborators Mar 31, 2026

3 participants
