
Memory leaks for Sum metric exemplars #31683

Closed

tiithansen opened this issue Mar 11, 2024 · 8 comments · Fixed by #32210
Labels
bug Something isn't working connector/spanmetrics

Comments

@tiithansen

Component(s)

connector/spanmetrics

What happened?

Description

There is a memory leak if exemplars are enabled. In connector.go, histogram exemplars are reset with every export, but sum metric exemplars are not.

Expected Result

Memory usage should stay steady, depending on how many metrics are being generated.

Actual Result

Memory usage keeps growing until the process gets OOM killed in the k8s cluster.

Collector version

v0.95.0

Environment information

No response

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

@tiithansen added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Mar 11, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1
Member

I'm not super familiar with this connector, but following your comment it looks like the following lines should potentially be added:

			m.events.Reset()
			m.sums.Reset()

To this section:

p.resourceMetrics.ForEach(func(k resourceKey, m *resourceMetrics) {
	// Exemplars are only relevant to this batch of traces, so must be cleared within the lock
	if !p.config.Histogram.Disable {
		m.histograms.Reset(true)
	}
	// If metrics expiration is configured, remove metrics that haven't been seen for longer than the expiration period.
	if p.config.MetricsExpiration > 0 {
		if now.Sub(m.lastSeen) >= p.config.MetricsExpiration {
			p.resourceMetrics.Remove(k)
		}
	}
})

This would need to be confirmed through testing though, preferably a long-running test doing memory analysis on the collector to ensure this actually resolves it.
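
To see the failure mode in isolation, here is a minimal, self-contained toy model (not the connector's actual types): a sum data point that appends an exemplar for every span but is never reset grows without bound, while clearing the slice after each flush keeps memory steady.

package main

import "fmt"

// sumDataPoint is a toy stand-in for a sum data point that accumulates exemplars.
type sumDataPoint struct {
	count     uint64
	exemplars []float64 // grows on every recorded span
}

func (d *sumDataPoint) record(value float64) {
	d.count++
	d.exemplars = append(d.exemplars, value)
}

// resetExemplars discards buffered exemplars while reusing the backing array.
// Without a call like this after each flush, the slice grows forever, which is
// the unbounded memory growth described in this issue.
func (d *sumDataPoint) resetExemplars() {
	d.exemplars = d.exemplars[:0]
}

func main() {
	dp := &sumDataPoint{}
	for flush := 0; flush < 3; flush++ {
		for span := 0; span < 1000; span++ {
			dp.record(float64(span))
		}
		fmt.Printf("flush %d: %d exemplars buffered\n", flush, len(dp.exemplars))
		dp.resetExemplars() // comment this out and the count grows by 1000 per flush
	}
}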

@crobert-1 removed the needs triage (New item requiring triage) label on Mar 14, 2024
@tiithansen
Author

We are running a slightly modified version (adding exemplars only if a specific attribute is set on the span) in production, with 150k spans per second going through this connector. Before applying the fix it got OOM killed (8 GB limit across 5 instances, 40 GB total) in half an hour or so. After applying the change, it has been steady at 200-300 MB for days now.

@crobert-1
Member

Would you be able to share your fix, or even post a PR?

@tiithansen
Author

I will create a PR soon.

@tcaty

tcaty commented Apr 2, 2024

Hi @crobert-1 and @tiithansen! I can confirm that this issue really exists. We use the helm chart opentelemetry-collector-0.82.0 with otel/opentelemetry-collector-contrib:0.95.0 to deploy it in a k8s cluster. I'll share the parts of our config that probably relate to this issue.

values.yml
mode: "deployment"

config:
receivers:
  otlp:
    protocols:
      http: {}
  prometheus:
    config: 
      scrape_configs:
        - job_name: otel-collector-metrics
          scrape_interval: 15s
          static_configs:
          - targets: ['localhost:8888']

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 4000
    spike_limit_percentage: 20
  batch: {}

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    enable_open_metrics: true

connectors:
  spanmetrics:
    exemplars:
      enabled: true
          
extensions:
  pprof:
    endpoint: "0.0.0.0:1777"

service:
  ...
         
resources:
requests:
  cpu: 100m
  memory: 2Gi
limits:
  memory: 4Gi
useGOMEMLIMIT: true

Here are our metrics between 6:00 a.m. and 8:00 a.m. in the screenshots below; maybe they will be useful.

Crucial points:

  • 6:00 a.m. - otel-collector started with minimal load
  • 7:00 a.m. - otel-collector reached limits

Pod resources (screenshot)

Pprof at 6:00 a.m. (screenshot)

Pprof at 7:00 a.m. (screenshot)

And some otel-collector metrics (screenshots)

We'll try to run otel-collector without exemplar generation, and I'll be back with feedback about resource usage.

@swar8080
Contributor

swar8080 commented Apr 2, 2024

@tcaty configuring `metrics_expiration`, added in 0.97, helps a bit by discarding exemplars for infrequently received spans.

Also, generating delta temporality span metrics and then converting them to cumulative would solve the problem if either of these become available:

The delta span metrics only keep exemplars received since the last metrics_flush_interval
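
For reference, here is a sketch of how those options could be combined on the spanmetrics connector from the values.yml above. The durations are illustrative only, and delta temporality is only an option if something downstream can convert back to cumulative, as noted above.

connectors:
  spanmetrics:
    exemplars:
      enabled: true
    # Added in v0.97.0: drop series (and their exemplars) that haven't been
    # seen for longer than this period.
    metrics_expiration: 5m
    # With delta temporality, exemplars only cover spans received since the
    # last flush, but pull-based exporters like the prometheus exporter need
    # a cumulative conversion downstream.
    aggregation_temporality: AGGREGATION_TEMPORALITY_DELTA
    metrics_flush_interval: 15s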

@swar8080
Contributor

swar8080 commented Apr 2, 2024

@tiithansen are you still planning to submit a fix? If not then I can give this a go

evan-bradley pushed a commit that referenced this issue Apr 16, 2024
…lushing (#32210)

**Description:** 
Discard counter span metric exemplars after flushing to avoid unbounded
memory growth when exemplars are enabled.

This is needed because #28671 added exemplars to counter span metrics,
but they are not removed after each flush interval like they are for
histogram span metrics.

Note: this may change behaviour if you use the undocumented
`exemplars.max_per_data_point` configuration option, since exemplars
would no longer accumulate up to that count. However, I'm unclear on
the value of that feature, since there's no mechanism to replace old
exemplars with newer ones once the maximum is reached. A possible
follow-up enhancement is to only discard exemplars once the maximum is
reached, or to use a circular buffer to replace them. That could be
useful for pull-based exporters like `prometheusexporter`, as retaining
exemplars for longer would decrease the chance of them being discarded
before being scraped.

**Link to tracking Issue:** 

Closes #31683 

**Testing:** 
- Unit tests
- Running the collector and setting a breakpoint to verify that the
exemplars are being cleared between flushes. Before the change I could
see the exemplar count continually growing.

**Documentation:**
Updated the documentation to mention that exemplars are added to all
span metrics. Also mentioned when they are discarded.
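
The circular-buffer idea from the note above is not part of #32210; a minimal, self-contained sketch of what a bounded, overwrite-oldest exemplar store could look like (all names here are hypothetical):

package main

import "fmt"

// exemplarRing retains at most len(buf) exemplars, overwriting the oldest
// value once the maximum is reached instead of dropping new ones.
type exemplarRing struct {
	buf  []float64
	next int
	full bool
}

func newExemplarRing(max int) *exemplarRing {
	return &exemplarRing{buf: make([]float64, max)}
}

func (r *exemplarRing) add(v float64) {
	r.buf[r.next] = v
	r.next = (r.next + 1) % len(r.buf)
	if r.next == 0 {
		r.full = true
	}
}

// snapshot returns the retained exemplars, oldest first.
func (r *exemplarRing) snapshot() []float64 {
	if !r.full {
		return append([]float64(nil), r.buf[:r.next]...)
	}
	return append(append([]float64(nil), r.buf[r.next:]...), r.buf[:r.next]...)
}

func main() {
	r := newExemplarRing(3)
	for i := 1; i <= 5; i++ {
		r.add(float64(i))
	}
	fmt.Println(r.snapshot()) // prints [3 4 5]: the two oldest values were overwritten
}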
rimitchell pushed a commit to rimitchell/opentelemetry-collector-contrib that referenced this issue May 8, 2024
…lushing (open-telemetry#32210)
