[da-vinci][server] Add OTel metrics to KafkaConsumerServiceStats#2604

Merged
m-nagarajan merged 10 commits into linkedin:main from m-nagarajan:mnagaraj/addOtelMetricsForKafkaConsumerServiceStats
Mar 14, 2026

Conversation

@m-nagarajan m-nagarajan commented Mar 12, 2026

Problem Statement

KafkaConsumerServiceStats has 17 Tehuti sensors tracking PubSub consumer pool behavior (poll latency, record counts, partition assignments, idle time, etc.) but no OpenTelemetry instrumentation.

Key challenges:

  • Per-store vs total-only split: Some metrics (poll bytes, record count) are per-store with Tehuti parent propagation to total; others are total-only recorded via AggKafkaConsumerServiceStats.recordTotal*().
  • Asymmetric partition histogram: Tehuti uses 4 pre-computed gauges (min/max/avg/subscribed); OTel records raw per-consumer values to a single MIN_MAX_COUNT_SUM histogram.
  • AsyncGauge: max_elapsed_time_since_last_successful_poll uses a LongSupplier callback — Tehuti-only, intentionally excluded from OTel (redundant with idle_time / POLL_TIME_SINCE_LAST_SUCCESS).
  • Shared OTel instrument: delegate_subscribe_latency and update_current_assignment_latency consolidate into one OTel metric differentiated by a VeniceConsumerPoolAction dimension.
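The asymmetry in the partition metric can be illustrated with a minimal sketch. The class below is a hypothetical stand-in written for this explanation, not Venice or OpenTelemetry SDK code, and it is not thread-safe. It accumulates raw per-consumer partition counts into min, max, count, and sum, so a value such as the average can be derived at query time instead of being pre-computed as a separate gauge.

```java
import java.util.List;

/** Hypothetical stand-in for a MIN_MAX_COUNT_SUM aggregation; not the OTel SDK. */
final class MinMaxCountSumSketch {
  private long min = Long.MAX_VALUE;
  private long max = Long.MIN_VALUE;
  private long count;
  private long sum;

  /** Records one raw observation, e.g. the partition count of a single consumer. */
  void record(long value) {
    min = Math.min(min, value);
    max = Math.max(max, value);
    count++;
    sum += value;
  }

  long min() {
    return min;
  }

  long max() {
    return max;
  }

  long count() {
    return count;
  }

  long sum() {
    return sum;
  }

  /** The average is derived from sum and count rather than stored as its own gauge. */
  double average() {
    return count == 0 ? 0.0 : (double) sum / count;
  }

  /** Feeds every per-consumer partition count into a single aggregation. */
  static MinMaxCountSumSketch fromPartitionCounts(List<Integer> partitionsPerConsumer) {
    MinMaxCountSumSketch histogram = new MinMaxCountSumSketch();
    for (int partitions : partitionsPerConsumer) {
      histogram.record(partitions);
    }
    return histogram;
  }
}
```

With three consumers holding 3, 7, and 5 partitions, the sketch reports min 3, max 7, count 3, and sum 15, from which an average of 5.0 follows.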

Solution

Migrated 16 of 17 Tehuti sensors to 12 OTel metrics using the joint Tehuti+OTel API pattern, integrated directly into KafkaConsumerServiceStats (no separate OtelStats class). One Tehuti sensor (max_elapsed_time_since_last_successful_poll AsyncGauge) is intentionally Tehuti-only.

OTel Metric Mapping (16 of 17 Tehuti sensors map to 12 OTel metrics)

| # | Tehuti Sensor(s) | OTel Metric Name | OTel Type | Dimensions |
|---|---|---|---|---|
| 1 | bytes_per_poll | ingestion.pubsub.consumer.poll.bytes | MIN_MAX_COUNT_SUM | STORE_NAME, CLUSTER_NAME |
| 2 | consumer_poll_result_num | ingestion.pubsub.consumer.poll.record_count | MIN_MAX_COUNT_SUM | STORE_NAME, CLUSTER_NAME |
| 3 | consumer_poll_request | ingestion.pubsub.consumer.poll.count | ASYNC_COUNTER | CLUSTER_NAME |
| 4 | consumer_poll_request_latency | ingestion.pubsub.consumer.poll.time | HISTOGRAM | CLUSTER_NAME |
| 5 | consumer_poll_non_zero_result_num | ingestion.pubsub.consumer.poll.non_empty_count | ASYNC_COUNTER | CLUSTER_NAME |
| 6 | consumer_poll_error | ingestion.pubsub.consumer.poll.error_count | COUNTER | CLUSTER_NAME |
| 7 | consumer_records_producing_to_write_buffer_latency | ingestion.pubsub.consumer.produce_to_write_buffer_time | HISTOGRAM | CLUSTER_NAME |
| 8 | detected_deleted_topic_num | ingestion.pubsub.consumer.topic.detected_deleted_count | COUNTER | CLUSTER_NAME |
| 9 | detected_no_running_ingestion_topic_partition_num | ingestion.pubsub.consumer.orphan_subscription_count | COUNTER | CLUSTER_NAME |
| 10 | delegate_subscribe_latency + update_current_assignment_latency | ingestion.pubsub.consumer.pool_action.time | HISTOGRAM | CLUSTER_NAME, CONSUMER_POOL_ACTION |
| 11 | idle_time | ingestion.pubsub.consumer.poll.time_since_last_success | MIN_MAX_COUNT_SUM | CLUSTER_NAME |
| 12 | 4 partition gauges (min/max/avg/subscribed) | ingestion.pubsub.consumer.partition_assignment.count | MIN_MAX_COUNT_SUM | CLUSTER_NAME |
| | max_elapsed_time_since_last_successful_poll | (Tehuti-only: no OTel, redundant with #11) | AsyncGauge | |

Key design decisions

  1. Shared OTel instrument for consumer actions (10): Two MetricEntityStateOneEnum<VeniceConsumerPoolAction> fields share the same OTel metric entity, each binding a different Tehuti sensor. OTel deduplicates by metric name; recordings are differentiated by the SUBSCRIBE/UPDATE_ASSIGNMENT dimension value.

  2. Asymmetric partition histogram (12): An OTel-only MetricEntityStateBase (4-arg factory) records raw per-consumer partition counts, while Tehuti keeps its 4 pre-computed gauges unchanged. A new recordTotalPartitionAssignmentForOtel() method on AggKafkaConsumerServiceStats is called from KafkaConsumerService.recordPartitionsPerConsumerSensor().

  3. ASYNC_COUNTER per-store suppression (3, 5): Per-store instances pass totalOnlyOtelRepo=null, suppressing OTel callbacks that would produce misleading data points (pool-wide values tagged with per-store attributes). Per-store Tehuti shares the total's LongAdderRateGauge sensor.

  4. No isTotalStats(true) on OpenTelemetryMetricsSetup: Unlike per-store-aggregated classes, this class has metrics that are only recorded on the total instance, so OTel must remain enabled on total.

  5. New dimension enum: VeniceConsumerPoolAction (SUBSCRIBE, UPDATE_ASSIGNMENT) implementing VeniceDimensionInterface, with corresponding VENICE_CONSUMER_POOL_ACTION entry in VeniceMetricsDimensions.

  6. Extracted shared registerPerStoreAndTotal(MetricEntityState): Moved from a private method in ServerHttpRequestStats to a protected method in AbstractVeniceStats, reused by both ServerHttpRequestStats and KafkaConsumerServiceStats.

  7. VENICE_STORE_NAME removed from total-only metrics (3–9, 11, 12): Total-only metrics are always recorded on the total instance where storeName is an opaque identifier (e.g., total.kafka_consumer_service_for_<region>), not an actual store name. Including VENICE_STORE_NAME would emit a misleading dimension value. Only the 2 per-store metrics (1, 2) retain VENICE_STORE_NAME. The production code builds a separate clusterOnlyDimensionsMap (backed by Collections.unmodifiableMap) for these metrics.
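Decisions 1 and 7 can be sketched together in plain Java. This is an illustrative model under stated assumptions, not the Venice implementation: the class name, the attribute keys `cluster_name` and `consumer_pool_action`, and the in-memory map that stands in for an OTel instrument are all hypothetical. Only the metric name and the two enum values come from this PR. The sketch shows two latency sources recording under one metric name, kept apart solely by the action attribute, on top of a read-only cluster-only dimension map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of one metric shared by two recorders; not Venice or OTel SDK code. */
final class SharedInstrumentSketch {
  enum PoolAction { SUBSCRIBE, UPDATE_ASSIGNMENT }

  /** Metric name taken from the mapping table; every series below belongs to it. */
  static final String METRIC_NAME = "ingestion.pubsub.consumer.pool_action.time";

  /** Base attributes without any store name (attribute key is an assumption). */
  private final Map<String, String> clusterOnlyDimensions;

  /** One entry per distinct attribute set, holding the total recorded latency. */
  private final Map<Map<String, String>, Double> totalsByAttributes = new HashMap<>();

  SharedInstrumentSketch(String clusterName) {
    Map<String, String> base = new TreeMap<>();
    base.put("cluster_name", clusterName);
    // Read-only view, so no recording path can add a store-name attribute by accident.
    this.clusterOnlyDimensions = Collections.unmodifiableMap(base);
  }

  /** Copies the base attributes and adds the action dimension value. */
  private Map<String, String> attributesFor(PoolAction action) {
    Map<String, String> attributes = new TreeMap<>(clusterOnlyDimensions);
    attributes.put("consumer_pool_action", action.name().toLowerCase(Locale.ROOT));
    return attributes;
  }

  /** Both callers record to the same metric; only the action attribute differs. */
  void recordPoolActionTime(PoolAction action, double latencyMs) {
    totalsByAttributes.merge(attributesFor(action), latencyMs, Double::sum);
  }

  int seriesCount() {
    return totalsByAttributes.size();
  }

  double totalLatencyMs(PoolAction action) {
    return totalsByAttributes.getOrDefault(attributesFor(action), 0.0);
  }

  Map<String, String> clusterOnlyDimensions() {
    return clusterOnlyDimensions;
  }
}
```

Recording two SUBSCRIBE latencies and one UPDATE_ASSIGNMENT latency yields exactly two series under the one metric name, and any attempt to add a key to the cluster-only map throws UnsupportedOperationException.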

Files changed

New files (6):

  • KafkaConsumerServiceOtelMetricEntity.java — 12 OTel metric entity definitions
  • VeniceConsumerPoolAction.java — dimension enum
  • KafkaConsumerServiceStatsOtelTest.java — 22 test methods
  • KafkaConsumerServiceOtelMetricEntityTest.java — ModuleMetricEntityTestFixture
  • KafkaConsumerServiceTehutiMetricNameTest.java — TehutiMetricNameEnumTestFixture
  • VeniceConsumerPoolActionTest.java — VeniceDimensionTestFixture

Modified files (10):

  • KafkaConsumerServiceStats.java — added MetricEntityState fields, joint API recording
  • AggKafkaConsumerServiceStats.java — added recordTotalPartitionAssignmentForOtel()
  • KafkaConsumerService.java — calls new OTel partition recording method
  • ServerMetricEntity.java — registered KafkaConsumerServiceOtelMetricEntity
  • ServerMetricEntityTest.java — updated count (87 → 99)
  • VeniceMetricsDimensions.java — added VENICE_CONSUMER_POOL_ACTION
  • VeniceMetricsDimensionsTest.java — added switch cases for new dimension
  • AbstractVeniceStats.java — extracted shared registerPerStoreAndTotal(MetricEntityState) method
  • AbstractVeniceStatsTest.java — test for extracted registerPerStoreAndTotal method
  • ServerHttpRequestStats.java — removed duplicate private registerPerStoreAndTotal method

Code changes

  • Added new code behind a config. OTel metrics are gated behind VeniceMetricsConfig.emitOtelMetrics() (existing config, default: disabled). No new config flags introduced.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.
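The config gating in the first item above can be pictured with a small sketch of joint recording. It is a simplified model, not the MetricEntityState API: the class and its counters are hypothetical, and only the flag name emitOtelMetrics is taken from this PR. The Tehuti side always records, while the OTel side exists only when the flag is on.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch of joint Tehuti+OTel recording behind a single flag. */
final class JointRecorderSketch {
  /** Stands in for a Tehuti sensor, which is always present. */
  private final LongAdder tehutiTotal = new LongAdder();

  /** Stands in for an OTel instrument; null when OTel emission is disabled. */
  private final LongAdder otelTotal;

  JointRecorderSketch(boolean emitOtelMetrics) {
    this.otelTotal = emitOtelMetrics ? new LongAdder() : null;
  }

  /** One call site feeds both systems, so callers never branch on the flag. */
  void record(long value) {
    tehutiTotal.add(value);
    if (otelTotal != null) {
      otelTotal.add(value);
    }
  }

  long tehutiTotal() {
    return tehutiTotal.sum();
  }

  /** Returns 0 when OTel is disabled, since nothing was ever recorded there. */
  long otelTotal() {
    return otelTotal == null ? 0L : otelTotal.sum();
  }

  boolean otelEnabled() {
    return otelTotal != null;
  }
}
```

Keeping the null check inside the recorder is one way to make the disabled path safe by construction, which is the property the "NPE prevention with OTel disabled" tests in this PR are described as covering.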

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues. All MetricEntityState fields are final and initialized in the constructor. Per-store parent propagation uses existing Tehuti sensor linking (immutable after construction).
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed. No new synchronization needed — all state is final or uses existing thread-safe patterns.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList). No new collections — existing patterns preserved.
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added. KafkaConsumerServiceStatsOtelTest (22 test methods) covering:
    • All 12 OTel metrics with value validation via InMemoryMetricReader
    • Tehuti regression (all recording methods produce correct Tehuti sensor values)
    • Double-counting prevention (per-store OTel data doesn't leak to total attributes and vice versa)
    • ASYNC_COUNTER per-store suppression and Tehuti sensor sharing
    • NPE prevention with OTel disabled and with plain MetricsRepository
    • Enum dimension isolation (SUBSCRIBE vs UPDATE_ASSIGNMENT)
    • AsyncGauge callback validation
    • OTel-only partition histogram recording
    • Tehuti-only partition gauges
  • New integration tests added. KafkaConsumerServiceOtelMetricEntityTest, KafkaConsumerServiceTehutiMetricNameTest, and VeniceConsumerPoolActionTest (enum validation fixtures).
  • Modified or extended existing tests. ServerMetricEntityTest count updated (87 → 99).
  • Verified backward compatibility (if applicable). All existing Tehuti sensor names and types preserved exactly.

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.

Copilot AI review requested due to automatic review settings March 12, 2026 07:31

Copilot AI left a comment

Pull request overview

Adds OpenTelemetry (OTel) instrumentation for KafkaConsumerServiceStats as part of the server-side OTel migration, including a new consumer-pool-action dimension and associated metric entity definitions/tests.

Changes:

  • Introduced KafkaConsumerServiceOtelMetricEntity with 13 OTel metric definitions and wired it into server metric entity registration.
  • Refactored KafkaConsumerServiceStats to use the joint Tehuti+OTel MetricEntityState APIs (including an OTel-only partition assignment histogram and an enum-dimensioned “consumer action” metric).
  • Added a new OTel dimension enum (VeniceConsumerPoolAction) + dimension registry updates and extensive unit/fixture tests validating OTel and Tehuti behavior.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java | Adds VENICE_CONSUMER_POOL_ACTION dimension constant. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceConsumerPoolAction.java | New dimension enum for subscribe vs update-assignment action tagging. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java | Extends dimension-name format tests to cover the new dimension. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceConsumerPoolActionTest.java | Fixture-based test validating dimension enum values/mapping. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/KafkaConsumerServiceOtelMetricEntity.java | Defines OTel metric entities for KafkaConsumerServiceStats. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java | Registers KafkaConsumerServiceOtelMetricEntity in server entity list. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/KafkaConsumerServiceStats.java | Main joint Tehuti+OTel refactor; adds MetricEntityState fields and recording paths. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/AggKafkaConsumerServiceStats.java | Threads Venice cluster name through stats supplier; adds total-only OTel partition recording helper. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/KafkaConsumerService.java | Passes cluster name into agg stats; records per-consumer partition counts to the new OTel histogram. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java | Updates expected server metric entity count after registering the new entity class. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/KafkaConsumerServiceOtelMetricEntityTest.java | Fixture test validating the new metric entity definitions (name/type/unit/dims). |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/KafkaConsumerServiceStatsOtelTest.java | End-to-end tests validating OTel exports + Tehuti regression + isolation behavior. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/KafkaConsumerServiceTehutiMetricNameTest.java | Ensures Tehuti metric-name enum maps to the expected legacy sensor names. |

ASYNC_COUNTER_FOR_HIGH_PERF_CASES registers an ObservableCounter callback
per instance. For pollCountOtel and pollNonEmptyCountOtel (total-only
metrics), per-store callbacks would always report zero. Pass null as
otelRepository on per-store instances to avoid registering redundant
callbacks, matching the same pattern used for the AsyncGauge metric.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
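The effect of passing null as the OTel repository can be sketched as follows. Both classes are hypothetical stand-ins, not Venice or OpenTelemetry SDK types: CallbackRegistry plays the role of a repository holding observable-counter callbacks, and AsyncCounter registers a callback only when it is handed a non-null registry. The total instance therefore contributes one callback, and the per-store instance contributes none.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.LongSupplier;

/** Hypothetical sketch of suppressing async-counter callbacks on per-store instances. */
final class AsyncCounterSuppressionSketch {
  /** Stands in for an OTel repository that polls callbacks at collection time. */
  static final class CallbackRegistry {
    private final List<LongSupplier> callbacks = new ArrayList<>();

    void register(LongSupplier callback) {
      callbacks.add(callback);
    }

    int size() {
      return callbacks.size();
    }

    /** Invokes every registered callback, as a metric reader would on export. */
    List<Long> collect() {
      List<Long> values = new ArrayList<>();
      for (LongSupplier callback : callbacks) {
        values.add(callback.getAsLong());
      }
      return values;
    }
  }

  /** Counter whose value is read lazily through a callback instead of pushed per event. */
  static final class AsyncCounter {
    private final LongAdder adder = new LongAdder();

    /** A null registry means no callback is registered, so nothing is exported. */
    AsyncCounter(CallbackRegistry registryOrNull) {
      if (registryOrNull != null) {
        registryOrNull.register(adder::sum);
      }
    }

    void increment() {
      adder.increment();
    }
  }
}
```

In this model, a total counter built with the registry and a per-store counter built with null leave exactly one callback registered, so a collection pass produces a single data point instead of an extra one that always reads zero.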

@sushantmane sushantmane left a comment

Some more comments

Copilot AI review requested due to automatic review settings March 13, 2026 20:46

Copilot AI left a comment

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.


Copilot AI review requested due to automatic review settings March 13, 2026 22:16

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.


Copilot AI review requested due to automatic review settings March 13, 2026 22:59

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.


Copilot AI review requested due to automatic review settings March 14, 2026 00:00

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.



@sushantmane sushantmane left a comment

Thanks, @m-nagarajan!

@m-nagarajan m-nagarajan merged commit 1ee70b4 into linkedin:main Mar 14, 2026
106 checks passed