[connector/spanmetrics] Produce delta temporality span metrics with timestamps representing an uninterrupted series #31780

swar8080
Contributor

Closes #31671

Description:
Currently, delta temporality span metrics are produced with (StartTimestamp, Timestamp) pairs of (T1, T2), (T3, T4), ... However, the specification says that the correct pattern for an uninterrupted delta series is (T1, T2), (T2, T3), ...

This misalignment with the spec can confuse downstream components that convert delta temporality to cumulative temporality, causing each data point to be treated as a cumulative counter "reset". One example is prometheusexporter.

The conversion issue forces you to generate cumulative span metrics, which use significantly more memory to cache the cumulative counts.

At work, I applied this patch to our collectors and switched to producing delta temporality metrics for prometheusexporter to then convert to cumulative. That caused a significant drop in memory usage:

image

Testing:

  • Unit tests asserting the timestamps
  • Manual testing with prometheusexporter to make sure counter values are cumulative and no longer being reset after receiving each delta data point

@@ -322,6 +361,7 @@ func (p *connectorImp) resetState() {
// and span metadata such as name, kind, status_code and any additional
// dimensions the user has configured.
func (p *connectorImp) aggregateMetrics(traces ptrace.Traces) {
startTimestamp := pcommon.NewTimestampFromTime(time.Now())
Contributor Author

Added this just to make testing easier. Previously, startTimestamp was regenerated during each iteration of the for loop on the next line.

@swar8080 swar8080 changed the title (#31671) produce delta temporality span metrics with timestamps repre… [connector/spanmetrics] Produce delta temporality span metrics with timestamps representing an uninterrupted series Mar 15, 2024
@swar8080 swar8080 closed this Mar 15, 2024
@swar8080 swar8080 reopened this Mar 15, 2024

codecov bot commented Mar 15, 2024

Codecov Report

Attention: Patch coverage is 82.69231% with 9 lines in your changes are missing coverage. Please review.

Project coverage is 81.89%. Comparing base (0d9b1b0) to head (f97f823).
Report is 3 commits behind head on main.

Files Patch % Lines
...r/spanmetricsconnector/internal/metrics/metrics.go 0.00% 6 Missing ⚠️
connector/spanmetricsconnector/connector.go 91.42% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #31780      +/-   ##
==========================================
+ Coverage   81.88%   81.89%   +0.01%     
==========================================
  Files        1858     1858              
  Lines      172702   172739      +37     
==========================================
+ Hits       141410   141459      +49     
+ Misses      26980    26967      -13     
- Partials     4312     4313       +1     


@swar8080 swar8080 closed this Mar 15, 2024
@swar8080 swar8080 reopened this Mar 15, 2024
@swar8080 swar8080 force-pushed the spanmetrics_uninterrupted_delta_timestamps branch from f97f823 to 397ec40 Compare March 18, 2024 23:44
@djaglowski
Member

Thanks for the issue and PR @swar8080. Hopefully @portertech or someone from @open-telemetry/collector-contrib-approvers can review.

startTime = lastTimestamp
}
// Collect lastDeltaTimestamps keys that need to be updated. Metrics can share the same key, so defer the update.
deltaMetricKeys[mk] = true
Contributor

Can we update lastDeltaTimestamps here directly?

Contributor Author

This anonymous function is called first for the calls metric and then again, with the same key, for the duration metric. If we updated the cache here, the duration metric would read the wrong value from it.

That is what I was trying to say in the comment above this line, but feel free to change the wording if it's unclear!

@@ -125,6 +129,16 @@ func newConnector(logger *zap.Logger, config component.Config, ticker *clock.Tic
		resourceMetricsKeyAttributes[attr] = s
	}

	var lastDeltaTimestamps *simplelru.LRU[metrics.Key, pcommon.Timestamp]
	if cfg.GetAggregationTemporality() == pmetric.AggregationTemporalityDelta {
		lastDeltaTimestamps, err = simplelru.NewLRU[metrics.Key, pcommon.Timestamp](cfg.GetDeltaTimestampCacheSize(), func(k metrics.Key, v pcommon.Timestamp) {
Contributor

Can we use cache.NewCache instead? This will help us maintain uniformity.

Contributor Author

@swar8080 swar8080 Apr 8, 2024

We could, but cache.NewCache keeps items evicted from its LRU cache in memory, which the other caches need but this one does not. We would have to remember to call RemoveEvictedItems() to discard those items from memory.
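
For readers unfamiliar with the distinction, here is a standard-library sketch of a bounded LRU that drops evicted entries immediately, which is the behavior wanted for the timestamp cache. It is an illustrative stand-in, not the hashicorp simplelru implementation and not the connector's internal cache package. It assumes a capacity greater than zero and uses `int64` values in place of timestamps.

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key   string
	value int64
}

// lru is a fixed-capacity cache that forgets the least recently used entry.
type lru struct {
	capacity int
	order    *list.List // front = most recently used
	items    map[string]*list.Element
}

func newLRU(capacity int) *lru {
	return &lru{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Add inserts or updates key. When the cache is over capacity, the oldest
// entry is removed from both the list and the map, so nothing lingers.
func (c *lru) Add(key string, value int64) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key, value})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

// Get returns the value for key and marks it as recently used.
func (c *lru) Get(key string) (int64, bool) {
	el, ok := c.items[key]
	if !ok {
		return 0, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

func main() {
	cache := newLRU(2)
	cache.Add("a", 1)
	cache.Add("b", 2)
	cache.Add("c", 3) // evicts "a"

	_, ok := cache.Get("a")
	fmt.Println(ok, len(cache.items)) // false 2
}
```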

@swar8080 swar8080 requested a review from Frapschen April 8, 2024 22:43
@swar8080 swar8080 force-pushed the spanmetrics_uninterrupted_delta_timestamps branch from cec515c to a5182b5 Compare April 9, 2024 02:09
…mestamps representing an uninterrupted series. This can avoid significant memory usage compared to producing cumulative span metrics, as long as a downstream component can convert from delta back to cumulative, which can depend on the timestamps being uninterrupted.
@swar8080 swar8080 force-pushed the spanmetrics_uninterrupted_delta_timestamps branch from a5182b5 to c419f4c Compare April 9, 2024 13:01
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Apr 24, 2024
@swar8080
Contributor Author

Hi @Frapschen @portertech, is this something you still want to include? The component should be usable with prometheusexporter now that I fixed #32210.

So this MR is mostly to allow using delta metrics as an optimization. It reduces memory and CPU usage because the connector no longer buffers the cumulative counts (prometheusexporter already does this) and only pushes metrics that changed to the next component.

Also, having timestamps aligned with the specification might prevent future bug reports. For example, I'm still maintaining a fork with this timestamp change in order to work with #32521 (which is also forked).

@github-actions github-actions bot removed the Stale label May 1, 2024
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label May 15, 2024
Contributor

@ankitpatel96 ankitpatel96 left a comment

This is a great change to fix a nasty bug! I'm not an approver but I think this is in a mergeable state

@github-actions github-actions bot removed the Stale label May 24, 2024
@mx-psi
Member

mx-psi commented May 24, 2024

@swar8080 would you have time to fix the merge conflicts?

@swar8080 swar8080 closed this May 31, 2024
@swar8080 swar8080 reopened this May 31, 2024
@@ -61,6 +61,7 @@ require (
github.com/grpc-ecosystem/grpc-gateway/v2 v2.20.0 // indirect
github.com/hashicorp/go-version v1.7.0 // indirect
github.com/hashicorp/golang-lru v1.0.2 // indirect
github.com/hashicorp/golang-lru/v2 v2.0.7 // indirect
Contributor Author

Needed because this module depends on spanmetricsconnector.

@swar8080
Contributor Author

swar8080 commented Jun 2, 2024

Finished implementing the most recent review feedback, so this seems ready to merge. cc @mx-psi

@mx-psi mx-psi merged commit 9395f36 into open-telemetry:main Jun 4, 2024
162 checks passed
@github-actions github-actions bot added this to the next release milestone Jun 4, 2024
@mx-psi
Member

mx-psi commented Jun 5, 2024

@swar8080 I see some failures on main on Windows that may be related to this PR. Can you take a look?

=== Failed
=== FAIL: . TestTimestampsForUninterruptedStream (0.00s)
    connector_test.go:1656: 
        	Error Trace:	D:/a/opentelemetry-collector-contrib/opentelemetry-collector-contrib/connector/spanmetricsconnector/connector_test.go:1656
        	            				D:/a/opentelemetry-collector-contrib/opentelemetry-collector-contrib/connector/spanmetricsconnector/connector_test.go:1714
        	Error:      	"2024-06-05 10:15:52.5883464 +0000 UTC" is not greater than "2024-06-05 10:15:52.5883464 +0000 UTC"
        	Test:       	TestTimestampsForUninterruptedStream
--- PASS: TestTimestampsForUninterruptedStream/AGGREGATION_TEMPORALITY_CUMULATIVE (0.00s)
--- PASS: TestTimestampsForUninterruptedStream/AGGREGATION_TEMPORALITY_DELTA (0.00s)

=== FAIL: . TestDeltaTimestampCacheExpiry (0.00s)
    connector_test.go:1837: 
        	Error Trace:	D:/a/opentelemetry-collector-contrib/opentelemetry-collector-contrib/connector/spanmetricsconnector/connector_test.go:1837
        	Error:      	"2024-06-05 10:15:52.5905404 +0000 UTC" is not greater than "2024-06-05 10:15:52.5905404 +0000 UTC"
        	Test:       	TestDeltaTimestampCacheExpiry

=== FAIL: . TestTimestampsForUninterruptedStream (re-run 1) (0.01s)
    connector_test.go:1665: 
        	Error Trace:	D:/a/opentelemetry-collector-contrib/opentelemetry-collector-contrib/connector/spanmetricsconnector/connector_test.go:1665
        	            				D:/a/opentelemetry-collector-contrib/opentelemetry-collector-contrib/connector/spanmetricsconnector/connector_test.go:1714
        	Error:      	"2024-06-05 10:15:56.1286061 +0000 UTC" is not greater than "2024-06-05 10:15:56.1286061 +0000 UTC"
        	Test:       	TestTimestampsForUninterruptedStream
--- PASS: TestTimestampsForUninterruptedStream/AGGREGATION_TEMPORALITY_CUMULATIVE (0.00s)
--- PASS: TestTimestampsForUninterruptedStream/AGGREGATION_TEMPORALITY_DELTA (0.00s)
