Conversation

@m-nagarajan m-nagarajan commented Apr 11, 2025

Problem Statement

Add client availability and latency metrics in OTel, alongside the existing Tehuti metrics, controlled by the existing OTel configs in VeniceMetricsConfig.

Solution

Two solutions were considered:

  1. Name the metrics venice.client.<> and add a dimension venice.client.type with values thin_client, fast_client, da_vinci_client, etc. This requires a metric with the same name across clients (e.g., venice.client.call_count) to have the same set of dimensions.
  2. Name the metrics venice.thin_client.<>, venice.fast_client.<>, venice.davinci_client.<>, etc., with no new dimension for the type.

This PR follows the second approach in order to:

  1. keep the metrics of each client isolated, so that clients like DaVinci, which emit a lot of metrics (including ingestion metrics), do not reduce the searchability of metrics emitted by other clients, which emit only a small number of metrics
  2. ensure that all metrics of the same name emit the same set of dimensions, so that data aggregated via pre-aggregates in the underlying metrics processing frameworks stays consistent

The following metrics are added:

venice.thin_client.call_count
venice.thin_client.call_time
venice.fast_client.call_count
venice.fast_client.call_time
venice.davinci_client.call_count
venice.davinci_client.call_time
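
The metric names listed above all follow one pattern: a per-client prefix followed by a shared suffix. A minimal sketch of that naming scheme is shown here; the `ClientType` enum and its `metricName` method are hypothetical names used only for illustration and are not taken from the Venice code base.

```java
/**
 * Hypothetical helper (not Venice's actual API): maps each client type to
 * the metric-name prefix used by the second naming approach.
 */
enum ClientType {
  THIN_CLIENT("thin_client"),
  FAST_CLIENT("fast_client"),
  DAVINCI_CLIENT("davinci_client");

  private final String prefix;

  ClientType(String prefix) {
    this.prefix = prefix;
  }

  /** Builds a name such as venice.thin_client.call_count. */
  String metricName(String suffix) {
    return "venice." + prefix + "." + suffix;
  }
}

final class ClientMetricNames {
  public static void main(String[] args) {
    // Prints the call_count and call_time names for every client type.
    for (ClientType type : ClientType.values()) {
      System.out.println(type.metricName("call_count"));
      System.out.println(type.metricName("call_time"));
    }
  }
}
```

Because the client type is part of the name rather than a dimension, each client is free to attach its own set of dimensions to its metrics.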

All these metrics currently live in a single class, BasicClientStats.java. This PR reuses that class and also moves the unhealthy latency metrics into it (from ClientStats.java), which means DVC will emit these metrics as well from now on.

Other existing Tehuti metrics, such as request and success_request_ratio, will be derived metrics in OTel, computed by aggregating on the dimension venice.response.status_code_category (success/fail).
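
As a rough illustration of how such derived metrics could be computed downstream, the sketch below aggregates a call count that has been broken down by the venice.response.status_code_category dimension. The `StatusCategory` enum, the `DerivedClientMetrics` class, and the sample counts are assumptions made for this example; in practice the derivation would happen in the metrics backend, not in client code.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical stand-in for the success/fail values of the status code category dimension. */
enum StatusCategory {
  SUCCESS,
  FAIL
}

/** Hypothetical helper: derives request-level metrics from per-category call counts. */
final class DerivedClientMetrics {
  /** Total requests = sum of call_count over all category values. */
  static long totalRequests(Map<StatusCategory, Long> callCountByCategory) {
    long total = 0;
    for (long count : callCountByCategory.values()) {
      total += count;
    }
    return total;
  }

  /** Success ratio = successful calls / all calls; NaN when there were no calls. */
  static double successRequestRatio(Map<StatusCategory, Long> callCountByCategory) {
    long total = totalRequests(callCountByCategory);
    if (total == 0) {
      return Double.NaN;
    }
    return (double) callCountByCategory.getOrDefault(StatusCategory.SUCCESS, 0L) / total;
  }

  public static void main(String[] args) {
    // Made-up sample counts, for illustration only.
    Map<StatusCategory, Long> counts = new EnumMap<>(StatusCategory.class);
    counts.put(StatusCategory.SUCCESS, 98L);
    counts.put(StatusCategory.FAIL, 2L);
    System.out.println("request = " + totalRequests(counts));
    System.out.println("success_request_ratio = " + successRequestRatio(counts));
  }
}
```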

Added setOtelAdditionalMetricsReader in VeniceMetricsConfig so that an InMemoryMetricReader from io.opentelemetry:opentelemetry-sdk-testing can be passed in to test the OTel metrics.

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

TBD

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.
    Clients will start emitting these new metrics in OTel when OTel metrics are enabled.

@lluwm lluwm left a comment

Thanks @m-nagarajan! The change looks good to me. Just a few minor comments.

@m-nagarajan m-nagarajan left a comment

Thanks @lluwm for the review. Addressed the comments.

@m-nagarajan m-nagarajan marked this pull request as ready for review April 15, 2025 21:01
lluwm previously approved these changes Apr 17, 2025
FelixGV commented Apr 17, 2025

[common] is an invalid commit summary prefix tag. Please fix prior to merging.

Valid tags are those of the deployables and user libraries affected by the change (i.e. including all those calling into the modified venice-common code).

Thanks.

@gaojieliu gaojieliu left a comment

Looks good overall and left some minor comments.

@m-nagarajan m-nagarajan left a comment

Thanks @gaojieliu for the review. Left some replies to your comments.

@m-nagarajan m-nagarajan changed the title [tc][fc][dvc][common] Add client availability/latency metrics in Otel [tc][fc][dvc] Add client availability/latency metrics in Otel Apr 23, 2025
@m-nagarajan m-nagarajan enabled auto-merge (squash) May 8, 2025 01:11
@lluwm lluwm left a comment

Looks good! Thanks @m-nagarajan !

@m-nagarajan m-nagarajan merged commit 0d7d335 into linkedin:main May 8, 2025
58 checks passed
WhitneyDeng pushed a commit to WhitneyDeng/venice that referenced this pull request May 16, 2025
…in#1689)
