@WhitneyDeng WhitneyDeng commented Apr 1, 2025

Problem Statement

  • integrating OpenTelemetry into controllers
  • adding metrics for log compaction

Solution

OpenTelemetry integration into controller

  • add OpenTelemetry dependency to venice-controller/build.gradle
  • VeniceControllerWrapper metricsRepository initialised with VeniceMetricsRepository

Log Compaction logic change

  • VeniceParentHelixAdmin::getStoresForCompaction() now calls VeniceHelixAdmin::getStoresForCompaction() rather than throwing an exception
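
The delegation described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `CompactionAdmin`, `ChildAdminSketch`, and `ParentAdminSketch` are hypothetical stand-ins for the real admin classes, and the returned store names are placeholders.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for the admin interface; the real
// Venice signatures may differ.
interface CompactionAdmin {
  List<String> getStoresForCompaction(String clusterName);
}

// Stand-in for VeniceHelixAdmin: owns the actual selection logic.
class ChildAdminSketch implements CompactionAdmin {
  @Override
  public List<String> getStoresForCompaction(String clusterName) {
    // Placeholder result; the real selection criteria are not shown in this PR description.
    return Arrays.asList(clusterName + "-storeA", clusterName + "-storeB");
  }
}

// Stand-in for VeniceParentHelixAdmin: delegates instead of throwing.
class ParentAdminSketch implements CompactionAdmin {
  private final CompactionAdmin delegate;

  ParentAdminSketch(CompactionAdmin delegate) {
    this.delegate = delegate;
  }

  @Override
  public List<String> getStoresForCompaction(String clusterName) {
    // Previously the parent admin threw an exception at this point.
    return delegate.getStoresForCompaction(clusterName);
  }
}
```

With this shape, a caller that only holds the parent admin gets the same answer as one that holds the child admin directly.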

Metric Type added:

  • Gauge (DoubleGauge)

Metrics added in LogCompactionStats:

| Metric Name | Metric Description | Metric Type | Dimensions (VENICE_CLUSTER_NAME is a base dimension) |
| --- | --- | --- | --- |
| REPUSH_CALL_COUNT | Repush API call count | Counter | VENICE_STORE_NAME, VENICE_RESPONSE_STATUS_CODE_CATEGORY, VENICE_CLUSTER_NAME, REPUSH_TRIGGER_SOURCE |
| COMPACTION_ELIGIBLE_STATE | Duration a store is nominated for compaction before repush triggered. Value set to 1: when store is nominated for compaction. Value set to 0: when repush on the store succeeds. | Gauge | VENICE_STORE_NAME, VENICE_CLUSTER_NAME |
| STORE_NOMINATED_FOR_COMPACTION_COUNT | Number of times a store is nominated for compaction | Counter | VENICE_STORE_NAME, VENICE_CLUSTER_NAME |
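
To make the intended semantics of these three metrics concrete, here is a small in-memory model. It is only a sketch under assumptions: `LogCompactionStatsSketch` and its method names are hypothetical, it does not use the OpenTelemetry SDK, and the status category strings ("SUCCESS", "FAIL") and trigger source strings are illustrative values rather than the PR's actual ones.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-memory model of the three metrics; not the PR's LogCompactionStats.
class LogCompactionStatsSketch {
  private final String clusterName; // base dimension attached to every series
  private final Map<String, LongAdder> repushCallCount = new ConcurrentHashMap<>();
  private final Map<String, LongAdder> nominatedCount = new ConcurrentHashMap<>();
  private final Map<String, Integer> compactionEligibleState = new ConcurrentHashMap<>();

  LogCompactionStatsSketch(String clusterName) {
    this.clusterName = clusterName;
  }

  // Builds one series key from the base dimension plus the given dimensions.
  private String key(String... dims) {
    return clusterName + "|" + String.join("|", dims);
  }

  // STORE_NOMINATED_FOR_COMPACTION_COUNT += 1 and COMPACTION_ELIGIBLE_STATE = 1.
  void recordStoreNominatedForCompaction(String store) {
    nominatedCount.computeIfAbsent(key(store), k -> new LongAdder()).increment();
    compactionEligibleState.put(key(store), 1);
  }

  // REPUSH_CALL_COUNT += 1; the gauge drops to 0 only when the repush succeeds.
  void recordRepushCall(String store, String triggerSource, String statusCategory) {
    repushCallCount
        .computeIfAbsent(key(store, statusCategory, triggerSource), k -> new LongAdder())
        .increment();
    if ("SUCCESS".equals(statusCategory)) { // assumed category name
      compactionEligibleState.put(key(store), 0);
    }
  }

  int getCompactionEligibleState(String store) {
    return compactionEligibleState.getOrDefault(key(store), 0);
  }

  long getNominatedCount(String store) {
    LongAdder count = nominatedCount.get(key(store));
    return count == null ? 0L : count.sum();
  }

  long getRepushCallCount(String store, String triggerSource, String statusCategory) {
    LongAdder count = repushCallCount.get(key(store, statusCategory, triggerSource));
    return count == null ? 0L : count.sum();
  }
}
```

Because the gauge stays at 1 from nomination until a successful repush, the length of time it reads 1 on a dashboard approximates how long the store waited for compaction.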

Metric Dimension Added

  • REPUSH_TRIGGER_SOURCE for REPUSH_CALL_COUNT
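
A metric dimension like this is typically backed by an enum whose constants map to the attribute values emitted with each data point. The sketch below assumes that shape; the enum name `RepushTriggerSourceSketch`, the constants `MANUAL` and `SCHEDULED`, and the `getDimensionValue()` accessor are all hypothetical and may not match the PR's `RepushStoreTriggerSource`.

```java
// Hypothetical enum backing the REPUSH_TRIGGER_SOURCE dimension.
// The constants below are illustrative; the PR's actual values are not listed here.
enum RepushTriggerSourceSketch {
  MANUAL("manual"),
  SCHEDULED("scheduled");

  private final String dimensionValue;

  RepushTriggerSourceSketch(String dimensionValue) {
    this.dimensionValue = dimensionValue;
  }

  // Value attached to the metric as the REPUSH_TRIGGER_SOURCE attribute.
  String getDimensionValue() {
    return dimensionValue;
  }
}
```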

Other OTel changes

  • VeniceOpenTelemetryMetricsRepository::getFullMetricName now takes a MetricEntity instead of a String prefix name and a String metric name
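
The signature change can be pictured as follows. This is a hedged sketch only: `MetricEntitySketch` and `MetricNamingSketch` are hypothetical classes, the optional per-entity prefix and the "." separator are assumptions made for illustration, and the real naming rules in VeniceOpenTelemetryMetricsRepository may differ.

```java
// Hypothetical, trimmed-down metric entity: a name plus an optional custom prefix.
final class MetricEntitySketch {
  private final String metricName;
  private final String customPrefix; // may be null (assumption for illustration)

  MetricEntitySketch(String metricName, String customPrefix) {
    this.metricName = metricName;
    this.customPrefix = customPrefix;
  }

  String getMetricName() {
    return metricName;
  }

  String getCustomPrefix() {
    return customPrefix;
  }
}

// Hypothetical naming helper mirroring the new single-argument signature.
final class MetricNamingSketch {
  private final String defaultPrefix;

  MetricNamingSketch(String defaultPrefix) {
    this.defaultPrefix = defaultPrefix;
  }

  // The prefix is resolved from the entity, falling back to the repository default.
  String getMetricPrefix(MetricEntitySketch entity) {
    return entity.getCustomPrefix() != null ? entity.getCustomPrefix() : defaultPrefix;
  }

  // Callers pass the entity once instead of a separate prefix and name.
  String getFullMetricName(MetricEntitySketch entity) {
    return getMetricPrefix(entity) + "." + entity.getMetricName();
  }
}
```

Passing the entity keeps prefix resolution in one place, so every metric type builds its name the same way.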

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
    • LogCompactionStatsTest to test LogCompactionStats metric emission
  • New integration tests added.
  • Modified or extended existing tests.
    • VeniceOpenTelemetryMetricsRepositoryTest: getFullMetricName & Gauge changes
    • VeniceMetricsDimensionsTest: add REPUSH_TRIGGER_SOURCE
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

@WhitneyDeng WhitneyDeng changed the title from "[controller] integrate OpenTelemetry into Controller" to "[controller] integrate OpenTelemetry into Controller & add log compaction metrics" on May 8, 2025

@m-nagarajan m-nagarajan left a comment

Thanks. Left some very initial review.

@WhitneyDeng WhitneyDeng force-pushed the log_compaction/observability/integrate-otel-to-controller branch from 6797aeb to 2062312 on May 23, 2025 22:11
Whitney Deng added 23 commits June 26, 2025 11:22
- add RepushStoreTriggerSource enum for metric dimension
- import `venice-client-common` in `venice-controller` to allow `VeniceParentHelixAdmin` to access `VeniceResponseStatusCategory` to emit metric
- add clusterName to RepushJobRequest for metric dimension
- in RepushJobRequest, streamline triggerSource to `RepushStoreTriggerSource`
- VeniceParentHelixAdmin::getStoresForCompaction call VeniceHelixAdmin::getStoresForCompaction directly rather than throw exception
- VeniceController only creates LogCompactionService if is parent controller
…::initializeParentAdmin to pass in metricsRepository for testing metric emission in VeniceParentHelixAdmin
Whitney Deng and others added 17 commits June 27, 2025 11:52
…licating

getMetricPrefix(metricEntity), metricEntity.getMetricName()
for all metric types
… `storeNominationToCompactionCompleteDurationMetric` emission from repush trigger logic
… `storeNominationToCompactionCompleteDurationMetric` metric description

@m-nagarajan m-nagarajan left a comment

Left a few comments. Thanks for iterating. Can you also add a detailed PR description covering the stats changes, including the metrics being added, as well as any other non-stats log compaction changes in the PR?

Whitney Deng added 8 commits July 7, 2025 18:57
- nixed STORE_COMPACTION_TRIGGER_STATUS
- move recordRepushStoreCall emission from VeniceParentHelixAdmin to common path in VeniceHelixAdmin
- problem is to mock all relevant components in VeniceHelixAdmin for testing in LogCompactionStatsTest