Skip to content

[da-vinci][server] Add OTel store.version ASYNC_GAUGE for current/future version numbers#2722

Open
m-nagarajan wants to merge 6 commits intolinkedin:mainfrom
m-nagarajan:mnagaraj/addOtelMetricsForVeniceVersionedStats
Open

[da-vinci][server] Add OTel store.version ASYNC_GAUGE for current/future version numbers#2722
m-nagarajan wants to merge 6 commits intolinkedin:mainfrom
m-nagarajan:mnagaraj/addOtelMetricsForVeniceVersionedStats

Conversation

@m-nagarajan
Copy link
Copy Markdown
Contributor

@m-nagarajan m-nagarajan commented Apr 11, 2026

Problem Statement

Venice stores track which version number is current and which is future for each store via Tehuti current_version and future_version AsyncGauge sensors in VeniceVersionedStatsReporter. These metrics have no OTel counterpart, making them unavailable in OTel-based monitoring dashboards.

Solution

Add StoreVersionOtelStats — a standalone StoreDataChangedListener that registers OTel ASYNC_GAUGE callbacks per store for current and future version numbers, with VERSION_ROLE dimension (CURRENT/FUTURE). Created via StoreVersionOtelStats.create() factory method in both VeniceServer and DaVinciBackend.

Design choice: A standalone listener was chosen over adding OTel to VeniceVersionedStatsReporter because each store has 6 VeniceVersionedStatsReporter instances (one per AbstractVeniceAggVersionedStats subclass). Putting OTel in the reporter would create 6 duplicate ASYNC_GAUGE registrations per store. The standalone listener ensures exactly one registration per store.

Key design decisions:

  • create() factory method combines construction + registration to eliminate temporal coupling (cannot forget to call register()).
  • register() both adds the listener and initializes gauges for pre-existing stores via getAllStores() — the metadata repository does not replay existing stores to newly registered listeners.
  • initializeStoreIfAbsent() uses computeIfAbsent only (no unconditional .set()) to avoid overwriting concurrent ZK events with stale snapshot data during startup initialization.
  • Store deletion resets AtomicReference<VersionInfo> to NON_EXISTING rather than removing the map entry — OTel async gauge callbacks cannot be deregistered, so this avoids duplicate callback registration on store re-creation.
  • Early return in handleStoreChanged/handleStoreDeleted when otelRepository == null skips all computation for non-OTel deployments.

Additional cleanup:

  • Extract computeFutureVersion into OtelVersionedStatsUtils — shared by AbstractVeniceAggVersionedStats.applyVersionInfo() and StoreVersionOtelStats, eliminating duplicated STARTED/PUSHED version logic. Added null guard on input list.
  • Add VersionInfo.NON_EXISTING sentinel constant, replacing 5 inline new VersionInfo(NON_EXISTING_VERSION, NON_EXISTING_VERSION) duplicates across OTel stats classes.
  • Make VersionInfo a final class to prevent subclassing that could break the shared sentinel pattern.
  • Clean up 3 unused NON_EXISTING_VERSION imports.

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

New tests:

  • StoreVersionOtelStatsTest (12 tests): CURRENT/FUTURE gauge values, version change updates, multi-store isolation, store delete/recreate lifecycle, handleStoreCreated delegation, future version computation (STARTED/PUSHED), no future when all ONLINE, NPE prevention (OTel disabled + plain MetricsRepository), register() initializes pre-existing stores, register() does not overwrite concurrent ZK events, delete for never-seen store is no-op.
  • VeniceVersionedStatsOtelMetricEntityTest: metric entity validation via ModuleMetricEntityTestFixture.
  • OtelVersionedStatsUtilsTest (6 new tests): computeFutureVersion — empty list, all ONLINE, STARTED only, PUSHED only, mixed returns max, ERROR ignored.

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

Copilot AI review requested due to automatic review settings April 11, 2026 22:06
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an OpenTelemetry counterpart to existing per-store Tehuti current_version / future_version gauges by introducing a dedicated store metadata listener that emits a single store.version ASYNC_GAUGE per store with a VERSION_ROLE dimension.

Changes:

  • Register a new StoreVersionOtelStats StoreDataChangedListener in both VeniceServer and DaVinciBackend to publish store.version (CURRENT/FUTURE).
  • Introduce VeniceVersionedStatsOtelMetricEntity and wire it into the server metric entity set (and associated entity-count test).
  • Extract shared “future version” computation into OtelVersionedStatsUtils.computeFutureVersion and add a VersionInfo.NON_EXISTING sentinel reused across OTel stats classes (with new/updated unit tests).

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
services/venice-server/src/main/java/com/linkedin/venice/server/VeniceServer.java Registers the new per-store OTel version gauge listener on the server metadata repo.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/DaVinciBackend.java Registers the same listener for DaVinci backend store repo events.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/StoreVersionOtelStats.java New listener emitting store.version ASYNC_GAUGE per store with CURRENT/FUTURE role.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/VeniceVersionedStatsOtelMetricEntity.java Defines the new OTel metric entity (store.version) and its required dimensions.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java Adds the new metric entity module to the server’s exported metric entity list.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/OtelVersionedStatsUtils.java Adds VersionInfo.NON_EXISTING and shared computeFutureVersion(...) helper.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/AbstractVeniceAggVersionedStats.java Switches future-version computation to the extracted helper.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/StorageEngineOtelStats.java Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/IngestionOtelStats.java Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatOtelStats.java Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING and removes unused import.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelStats.java Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING and removes unused import.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/BlobTransferOtelStats.java Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING and removes unused import.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/StoreVersionOtelStatsTest.java New unit tests covering gauge values/updates, lifecycle, and OTel-disabled safety.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/VeniceVersionedStatsOtelMetricEntityTest.java Validates the new metric entity definition.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/OtelVersionedStatsUtilsTest.java Adds unit tests for computeFutureVersion(...).
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java Updates expected server metric entity count to include the new entity.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@m-nagarajan m-nagarajan added the needs-reviewer Looking for a reviewer to pick this up label Apr 20, 2026
Copilot AI review requested due to automatic review settings April 30, 2026 20:03
@m-nagarajan m-nagarajan force-pushed the mnagaraj/addOtelMetricsForVeniceVersionedStats branch from ef877e9 to ebe759e Compare April 30, 2026 20:03
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

version numbers

Add StoreVersionOtelStats — a standalone StoreDataChangedListener that emits
per-store ASYNC_GAUGE for version numbers with VERSION_ROLE dimension
(CURRENT/FUTURE).
Registered once per process in both VeniceServer and DaVinciBackend.

Also:
- Extract computeFutureVersion into OtelVersionedStatsUtils (shared by
  AbstractVeniceAggVersionedStats and StoreVersionOtelStats)
- Add VersionInfo.NON_EXISTING sentinel constant, replace 5 inline duplicates
- Clean up unused NON_EXISTING_VERSION imports in 3 OTel stats classes
Copilot AI review requested due to automatic review settings May 2, 2026 03:40
@m-nagarajan m-nagarajan force-pushed the mnagaraj/addOtelMetricsForVeniceVersionedStats branch from 914cd39 to ee6e5a6 Compare May 2, 2026 03:40
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

* callback on re-creation.
*/
@Override
public void handleStoreDeleted(String storeName) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use close methods to unregister callbacks?

Copy link
Copy Markdown
Contributor

@sushantmane sushantmane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice cleanup — sentinel + factory + extracted util are all good moves, and the test suite covers the lifecycle thoroughly. A few inline notes below; the lifecycle/leak items (#2, #3) are the most material. Nothing blocking.

* pre-existing stores. Prefer {@link #create} which combines construction and registration.
*/
void register(ReadOnlyStoreRepository metadataRepository) {
metadataRepository.registerStoreDataChangedListener(this);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this registers the listener before the otelRepository == null check below, so on non-OTel deployments a no-op listener still walks every create/change/delete event. Could we early-return before registering, and log once that OTel is disabled so it's discoverable?

* created lazily on first store change and bounded by the number of distinct store names ever
* observed by the process (entries are not removed on deletion — see cleanup limitation below).
*
* <p><b>Cleanup limitation:</b> OTel async gauge callbacks cannot be deregistered (OTel SDK
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small accuracy nit: the OTel SDK does support deregistration via ObservableLongGauge.close(). Our AsyncMetricEntityStateBase.create(...) is what doesn't surface the handle. Could we phrase it as a Venice-wrapper limitation so future readers know where to look if they want to add a deregister path?


/**
* Resets version info to {@link VersionInfo#NON_EXISTING} rather than removing the map entry.
* OTel async gauge callbacks cannot be deregistered, so removing and re-creating the entry
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "duplicate callbacks" framing is the symptom; the underlying reason is that the supplier lambda closes over the original AtomicReference, so allocating a new ref would orphan the live callback and the new one would shadow it. Worth calling that out so the next reader doesn't conclude "just track and remove the callback."

clusterConfig.getClusterName());

// OTel per-store version gauge
StoreVersionOtelStats.create(metricsRepository, clusterConfig.getClusterName(), metadataRepo);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returned reference is discarded. There's no way to unregister the listener later, and a second create() call in the same JVM would double-register the async-gauge callbacks. Worth holding the reference, making the class Closeable, and unregistering in the existing close path? Same comment applies to DaVinciBackend.


// OTel per-store version gauge
StoreVersionOtelStats
.create(metricsRepository, configLoader.getVeniceClusterConfig().getClusterName(), storeRepository);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as VeniceServer: reference is discarded. DaVinciBackend.close() already explicitly unregisters other listeners, so an unregister hook here would slot in naturally. Repeated DaVinci create/close cycles in the same JVM (tests, embedded use) will leak listeners.

* Returns the highest such version number, or {@link com.linkedin.venice.meta.Store#NON_EXISTING_VERSION} if none.
*/
public static int computeFutureVersion(List<Version> versions) {
if (versions == null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the AbstractVeniceAggVersionedStats.applyVersionInfo path, existingVersions.stream() is called before this util, so a null list would NPE there first. Either drop this guard so the contract matches surrounding code, or also null-guard applyVersionInfo so the silent -1 path is observable. Half-defensive feels worse than either choice.

Store snapshot = createMockStore(TEST_STORE_NAME, 3, createVersion(3, VersionStatus.ONLINE));
when(mockRepo.getAllStores()).thenReturn(Collections.singletonList(snapshot));

// Simulate a concurrent ZK event with newer version 5 arriving before register() scans
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test calls handleStoreChanged before register() runs, which is sequential rather than concurrent. The production race is an event arriving between registerStoreDataChangedListener and getAllStores(). Renaming to testRegisterDoesNotOverwriteExistingEntry and adjusting the comment would match what's actually verified. A real concurrency test (latched threads racing the snapshot) would be a nice follow-up.

}

@Test
public void testComputeFutureVersionIgnoresErrorStatus() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VersionStatus has 9 values; only STARTED, PUSHED, ONLINE, ERROR are exercised. A parameterized "every status except STARTED/PUSHED is ignored" test would catch a regression if someone widens the future-set later (e.g. adds PARTIALLY_ONLINE).

@Test
public void testServerMetricEntitiesCount() {
assertEquals(SERVER_METRIC_ENTITIES.size(), 156, "Expected 156 unique metric entities");
assertEquals(SERVER_METRIC_ENTITIES.size(), 157, "Expected 157 unique metric entities");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive-by: every PR adding a metric has to bump this constant, and the failure message just restates the value. Could replace with non-emptiness plus per-entity invariants (unique names, valid types) when convenient. Not blocking for this PR.

@sushantmane sushantmane added requested-changes Reviewer requested changes and removed needs-reviewer Looking for a reviewer to pick this up labels May 4, 2026
Copilot AI review requested due to automatic review settings May 5, 2026 00:06
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +45 to +53
* Computes the future version from a list of versions. A version is considered "future"
* if its status is {@link VersionStatus#STARTED} or {@link VersionStatus#PUSHED}.
* Returns the highest such version number, or {@link com.linkedin.venice.meta.Store#NON_EXISTING_VERSION} if none.
*
* <p>Caller is responsible for passing a non-null list.
*/
public static int computeFutureVersion(List<Version> versions) {
int futureVersion = NON_EXISTING_VERSION;
for (Version version: versions) {
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

requested-changes Reviewer requested changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants