[da-vinci][server] Add OTel store.version ASYNC_GAUGE for current/future version numbers#2722
Conversation
There was a problem hiding this comment.
Pull request overview
Adds an OpenTelemetry counterpart to existing per-store Tehuti current_version / future_version gauges by introducing a dedicated store metadata listener that emits a single store.version ASYNC_GAUGE per store with a VERSION_ROLE dimension.
Changes:
- Register a new
StoreVersionOtelStatsStoreDataChangedListenerin bothVeniceServerandDaVinciBackendto publishstore.version(CURRENT/FUTURE). - Introduce
VeniceVersionedStatsOtelMetricEntityand wire it into the server metric entity set (and associated entity-count test). - Extract shared “future version” computation into
OtelVersionedStatsUtils.computeFutureVersionand add aVersionInfo.NON_EXISTINGsentinel reused across OTel stats classes (with new/updated unit tests).
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| services/venice-server/src/main/java/com/linkedin/venice/server/VeniceServer.java | Registers the new per-store OTel version gauge listener on the server metadata repo. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/DaVinciBackend.java | Registers the same listener for DaVinci backend store repo events. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/StoreVersionOtelStats.java | New listener emitting store.version ASYNC_GAUGE per store with CURRENT/FUTURE role. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/VeniceVersionedStatsOtelMetricEntity.java | Defines the new OTel metric entity (store.version) and its required dimensions. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java | Adds the new metric entity module to the server’s exported metric entity list. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/OtelVersionedStatsUtils.java | Adds VersionInfo.NON_EXISTING and shared computeFutureVersion(...) helper. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/AbstractVeniceAggVersionedStats.java | Switches future-version computation to the extracted helper. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/StorageEngineOtelStats.java | Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/IngestionOtelStats.java | Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatOtelStats.java | Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING and removes unused import. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelStats.java | Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING and removes unused import. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/BlobTransferOtelStats.java | Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING and removes unused import. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/StoreVersionOtelStatsTest.java | New unit tests covering gauge values/updates, lifecycle, and OTel-disabled safety. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/VeniceVersionedStatsOtelMetricEntityTest.java | Validates the new metric entity definition. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/OtelVersionedStatsUtilsTest.java | Adds unit tests for computeFutureVersion(...). |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java | Updates expected server metric entity count to include the new entity. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ef877e9 to
ebe759e
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
version numbers Add StoreVersionOtelStats — a standalone StoreDataChangedListener that emits per-store ASYNC_GAUGE for version numbers with VERSION_ROLE dimension (CURRENT/FUTURE). Registered once per process in both VeniceServer and DaVinciBackend. Also: - Extract computeFutureVersion into OtelVersionedStatsUtils (shared by AbstractVeniceAggVersionedStats and StoreVersionOtelStats) - Add VersionInfo.NON_EXISTING sentinel constant, replace 5 inline duplicates - Clean up unused NON_EXISTING_VERSION imports in 3 OTel stats classes
914cd39 to
ee6e5a6
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| * callback on re-creation. | ||
| */ | ||
| @Override | ||
| public void handleStoreDeleted(String storeName) { |
There was a problem hiding this comment.
Can we use close methods to unregister callbacks?
| * pre-existing stores. Prefer {@link #create} which combines construction and registration. | ||
| */ | ||
| void register(ReadOnlyStoreRepository metadataRepository) { | ||
| metadataRepository.registerStoreDataChangedListener(this); |
There was a problem hiding this comment.
Nit: this registers the listener before the otelRepository == null check below, so on non-OTel deployments a no-op listener still walks every create/change/delete event. Could we early-return before registering, and log once that OTel is disabled so it's discoverable?
| * created lazily on first store change and bounded by the number of distinct store names ever | ||
| * observed by the process (entries are not removed on deletion — see cleanup limitation below). | ||
| * | ||
| * <p><b>Cleanup limitation:</b> OTel async gauge callbacks cannot be deregistered (OTel SDK |
There was a problem hiding this comment.
Small accuracy nit: the OTel SDK does support deregistration via ObservableLongGauge.close(). Our AsyncMetricEntityStateBase.create(...) is what doesn't surface the handle. Could we phrase it as a Venice-wrapper limitation so future readers know where to look if they want to add a deregister path?
|
|
||
| /** | ||
| * Resets version info to {@link VersionInfo#NON_EXISTING} rather than removing the map entry. | ||
| * OTel async gauge callbacks cannot be deregistered, so removing and re-creating the entry |
There was a problem hiding this comment.
The "duplicate callbacks" framing is the symptom; the underlying reason is that the supplier lambda closes over the original AtomicReference, so allocating a new ref would orphan the live callback and the new one would shadow it. Worth calling that out so the next reader doesn't conclude "just track and remove the callback."
| clusterConfig.getClusterName()); | ||
|
|
||
| // OTel per-store version gauge | ||
| StoreVersionOtelStats.create(metricsRepository, clusterConfig.getClusterName(), metadataRepo); |
There was a problem hiding this comment.
Returned reference is discarded. There's no way to unregister the listener later, and a second create() call in the same JVM would double-register the async-gauge callbacks. Worth holding the reference, making the class Closeable, and unregistering in the existing close path? Same comment applies to DaVinciBackend.
|
|
||
| // OTel per-store version gauge | ||
| StoreVersionOtelStats | ||
| .create(metricsRepository, configLoader.getVeniceClusterConfig().getClusterName(), storeRepository); |
There was a problem hiding this comment.
Same as VeniceServer: reference is discarded. DaVinciBackend.close() already explicitly unregisters other listeners, so an unregister hook here would slot in naturally. Repeated DaVinci create/close cycles in the same JVM (tests, embedded use) will leak listeners.
| * Returns the highest such version number, or {@link com.linkedin.venice.meta.Store#NON_EXISTING_VERSION} if none. | ||
| */ | ||
| public static int computeFutureVersion(List<Version> versions) { | ||
| if (versions == null) { |
There was a problem hiding this comment.
On the AbstractVeniceAggVersionedStats.applyVersionInfo path, existingVersions.stream() is called before this util, so a null list would NPE there first. Either drop this guard so the contract matches surrounding code, or also null-guard applyVersionInfo so the silent -1 path is observable. Half-defensive feels worse than either choice.
| Store snapshot = createMockStore(TEST_STORE_NAME, 3, createVersion(3, VersionStatus.ONLINE)); | ||
| when(mockRepo.getAllStores()).thenReturn(Collections.singletonList(snapshot)); | ||
|
|
||
| // Simulate a concurrent ZK event with newer version 5 arriving before register() scans |
There was a problem hiding this comment.
The test calls handleStoreChanged before register() runs, which is sequential rather than concurrent. The production race is an event arriving between registerStoreDataChangedListener and getAllStores(). Renaming to testRegisterDoesNotOverwriteExistingEntry and adjusting the comment would match what's actually verified. A real concurrency test (latched threads racing the snapshot) would be a nice follow-up.
| } | ||
|
|
||
| @Test | ||
| public void testComputeFutureVersionIgnoresErrorStatus() { |
There was a problem hiding this comment.
VersionStatus has 9 values; only STARTED, PUSHED, ONLINE, ERROR are exercised. A parameterized "every status except STARTED/PUSHED is ignored" test would catch a regression if someone widens the future-set later (e.g. adds PARTIALLY_ONLINE).
| @Test | ||
| public void testServerMetricEntitiesCount() { | ||
| assertEquals(SERVER_METRIC_ENTITIES.size(), 156, "Expected 156 unique metric entities"); | ||
| assertEquals(SERVER_METRIC_ENTITIES.size(), 157, "Expected 157 unique metric entities"); |
There was a problem hiding this comment.
Drive-by: every PR adding a metric has to bump this constant, and the failure message just restates the value. Could replace with non-emptiness plus per-entity invariants (unique names, valid types) when convenient. Not blocking for this PR.
mnagaraj/addOtelMetricsForVeniceVersionedStats
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| * Computes the future version from a list of versions. A version is considered "future" | ||
| * if its status is {@link VersionStatus#STARTED} or {@link VersionStatus#PUSHED}. | ||
| * Returns the highest such version number, or {@link com.linkedin.venice.meta.Store#NON_EXISTING_VERSION} if none. | ||
| * | ||
| * <p>Caller is responsible for passing a non-null list. | ||
| */ | ||
| public static int computeFutureVersion(List<Version> versions) { | ||
| int futureVersion = NON_EXISTING_VERSION; | ||
| for (Version version: versions) { |
Problem Statement
Venice stores track which version number is current and which is future for each store via Tehuti
current_versionandfuture_versionAsyncGauge sensors inVeniceVersionedStatsReporter. These metrics have no OTel counterpart, making them unavailable in OTel-based monitoring dashboards.Solution
Add
StoreVersionOtelStats— a standaloneStoreDataChangedListenerthat registers OTel ASYNC_GAUGE callbacks per store for current and future version numbers, withVERSION_ROLEdimension (CURRENT/FUTURE). Created viaStoreVersionOtelStats.create()factory method in bothVeniceServerandDaVinciBackend.Design choice: A standalone listener was chosen over adding OTel to
VeniceVersionedStatsReporterbecause each store has 6VeniceVersionedStatsReporterinstances (one perAbstractVeniceAggVersionedStatssubclass). Putting OTel in the reporter would create 6 duplicate ASYNC_GAUGE registrations per store. The standalone listener ensures exactly one registration per store.Key design decisions:
create()factory method combines construction + registration to eliminate temporal coupling (cannot forget to callregister()).register()both adds the listener and initializes gauges for pre-existing stores viagetAllStores()— the metadata repository does not replay existing stores to newly registered listeners.initializeStoreIfAbsent()usescomputeIfAbsentonly (no unconditional.set()) to avoid overwriting concurrent ZK events with stale snapshot data during startup initialization.AtomicReference<VersionInfo>toNON_EXISTINGrather than removing the map entry — OTel async gauge callbacks cannot be deregistered, so this avoids duplicate callback registration on store re-creation.handleStoreChanged/handleStoreDeletedwhenotelRepository == nullskips all computation for non-OTel deployments.Additional cleanup:
computeFutureVersionintoOtelVersionedStatsUtils— shared byAbstractVeniceAggVersionedStats.applyVersionInfo()andStoreVersionOtelStats, eliminating duplicated STARTED/PUSHED version logic. Added null guard on input list.VersionInfo.NON_EXISTINGsentinel constant, replacing 5 inlinenew VersionInfo(NON_EXISTING_VERSION, NON_EXISTING_VERSION)duplicates across OTel stats classes.VersionInfoafinalclass to prevent subclassing that could break the shared sentinel pattern.NON_EXISTING_VERSIONimports.Code changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized,RWLock) are used where needed.ConcurrentHashMap,CopyOnWriteArrayList).How was this PR tested?
New tests:
StoreVersionOtelStatsTest(12 tests): CURRENT/FUTURE gauge values, version change updates, multi-store isolation, store delete/recreate lifecycle,handleStoreCreateddelegation, future version computation (STARTED/PUSHED), no future when all ONLINE, NPE prevention (OTel disabled + plain MetricsRepository),register()initializes pre-existing stores,register()does not overwrite concurrent ZK events, delete for never-seen store is no-op.VeniceVersionedStatsOtelMetricEntityTest: metric entity validation viaModuleMetricEntityTestFixture.OtelVersionedStatsUtilsTest(6 new tests):computeFutureVersion— empty list, all ONLINE, STARTED only, PUSHED only, mixed returns max, ERROR ignored.Does this PR introduce any user-facing or breaking changes?