
connectors/datadogconnector: Increasing Memory That Eventually Kills Collector Pods #30908

Closed
NickAnge opened this issue Jan 31, 2024 · 13 comments
Labels
bug, connector/datadog, needs triage, Stale, waiting for author

Comments

@NickAnge

Component(s)

connector/datadog

What happened?

Description

In our setup, we've activated both the Datadog connector and exporter to avoid APM stats sampling. We've been experiencing a continuous increase in memory, eventually leading to the pod reaching an Out-of-Memory (OOM) state after a few hours. We followed the suggested configuration from the README.md and have datadog/connector as the receiver for traces.

Steps to Reproduce

  1. Set up an OpenTelemetry Collector with the configuration shown below
  2. Publish trace telemetry data through the collector
  3. Evaluate memory usage through pprof (see the sketch after this list)
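
For step 3, heap profiles can be pulled from the collector's pprof extension; a minimal sketch, assuming the extension's default endpoint:

extensions:
  pprof:
    # default listen address of the pprof extension; heap profiles can then be
    # fetched from http://localhost:1777/debug/pprof/heap (e.g. with go tool pprof)
    endpoint: localhost:1777

service:
  extensions: [ pprof ]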

Expected Result

No memory increase that kills the pod; memory usage should remain stable.

Actual Result

Memory increases continuously until the pod is eventually OOM-killed.

(screenshot: memory usage graph)

Collector version

opentelemetry-collector-contrib:0.88.0

Environment information

Environment

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  probabilistic_sampler:
    sampling_percentage: 20

connectors:
    datadog/connector:

service:
  extensions: [ health_check, pprof ]
  pipelines:
    traces:
      receivers: [ otlp ]
      processors: [ batch ]
      exporters: [ datadog/connector ]
 
    traces/sampled:
      receivers: [ datadog/connector ]
      processors: [ probabilistic_sampler, batch ]
      exporters: [ datadog ]
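
The snippet above references the batch processor, the datadog exporter, and the health_check/pprof extensions without showing their definitions; a minimal sketch of the likely missing sections, assuming a standard Datadog exporter setup with the API key supplied via an environment variable:

processors:
  batch:

exporters:
  datadog:
    api:
      # placeholder: the actual key is supplied via an environment variable
      key: ${env:DD_API_KEY}

extensions:
  health_check:
  pprof: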

Log output

No response

Additional context

No response

@NickAnge added the bug and needs triage labels on Jan 31, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@mackjmr
Member

mackjmr commented Feb 7, 2024

@NickAnge thanks for reporting. We were able to reproduce and identify a memory leak in the Datadog connector's code path for the trace-to-trace pipeline, which is what is being used in your case. This memory leak was fixed in the following PR, which will be part of the next collector release.

@dmedinag

dmedinag commented Feb 8, 2024

Hey, we see that release 0.94.0 has been available on GitHub for 8 hours, but the image is not yet present on Docker Hub. Are the release schedules for the two artifacts different?

@mackjmr
Member

mackjmr commented Feb 8, 2024

The Docker image should be available once 0.94.0 is released in https://github.com/open-telemetry/opentelemetry-collector-releases. See open-telemetry/opentelemetry-collector-releases#472.

@diogotorres97

Still happening here too 😢

@mackjmr
Member

mackjmr commented Feb 15, 2024

@diogotorres97 we aren't able to reproduce a memory leak in 0.94.0. Can you please clarify what behaviour you are seeing, and share your config and the collector version you are using? Can you also please generate profiles and output traces in JSON format via the file exporter so we can attempt to reproduce using your traces?
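
For reference, a minimal sketch of a file exporter setup that dumps the received traces as JSON (the path and pipeline wiring are illustrative):

exporters:
  file:
    # the file exporter writes the telemetry it receives as JSON lines to this path
    path: /tmp/traces.json

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      exporters: [ file ]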

@diogotorres97

diogotorres97 commented Feb 16, 2024

We still receive traces, but the stats become unavailable in Datadog after a few hours. With low load on the system it can take about 24h until we lose stats, but with a spike in requests we can lose them much faster (yesterday it was 12h).

Screenshot 2024-02-16 at 10 39 55

Screenshots of memory consumption:
Screenshot 2024-02-16 at 08 27 13
Screenshot 2024-02-16 at 08 27 59

The logs from the deployment when we stop seeing stats:

2024-02-15T23:22:13.943Z    info    memorylimiter/memorylimiter.go:222    Memory usage is above soft limit. Forcing a GC.    {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 1238}
2024-02-15T23:22:15.648Z    info    memorylimiter/memorylimiter.go:192    Memory usage after GC.    {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 1228}
2024-02-15T23:22:15.648Z    warn    memorylimiter/memorylimiter.go:229    Memory usage is above soft limit. Refusing data.    {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 1228}
2024-02-15T23:24:23.914Z    info    memorylimiter/memorylimiter.go:215    Memory usage back within limits. Resuming normal operation.    {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 1018}
2024-02-15T23:24:28.943Z    info    memorylimiter/memorylimiter.go:222    Memory usage is above soft limit. Forcing a GC.    {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 1249}
2024-02-15T23:24:30.745Z    info    memorylimiter/memorylimiter.go:192    Memory usage after GC.    {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 1241}
2024-02-15T23:24:30.745Z    warn    memorylimiter/memorylimiter.go:229    Memory usage is above soft limit. Refusing data.    {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 1241}
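
(For context on these log lines: the memory_limiter's soft limit is its hard limit minus the spike limit, so with the limit_percentage: 80 and spike_limit_percentage: 25 settings in the config below, and assuming the limiter reads the deployment's 2Gi container limit, refusals start at roughly 55% of that, around 1126 MiB. A sketch of how the thresholds work out:)

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80        # hard limit: ~80% of 2048 MiB ≈ 1638 MiB
    spike_limit_percentage: 25  # soft limit = hard limit - spike ≈ 1126 MiB; "Refusing data" begins above this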

Also, only after a restart of the collector deployment do we start receiving the stats again (but this is not a solution 😄).

Config:

  - chart: opentelemetry-collector
    helm:
      releaseName: opentelemetry-collector-deployment
      values: |
        config:
          connectors:
            datadog/connector:
          exporters:
            datadog:
              api:
                key: ${env:DD_API_KEY}
              traces:
                trace_buffer: 500
          processors:
            batch:
              timeout: 5s
              send_batch_max_size: 1000
              send_batch_size: 250
            memory_limiter:
              check_interval: 1s
              limit_percentage: 80
              spike_limit_percentage: 25
            tail_sampling/limit:
              decision_wait: 1s
              num_traces: 100000
              policies:
              - name: rate-limit
                rate_limiting:
                  spans_per_second: 10000
                type: rate_limiting
            tail_sampling/logic:
              num_traces: 100000
              policies:
              - name: http-server-errors
                numeric_attribute:
                  key: http.status_code
                  max_value: 599
                  min_value: 500
                type: numeric_attribute
              - name: grpc-unknown-errors
                numeric_attribute:
                  key: rpc.grpc.status_code
                  max_value: 2
                  min_value: 2
                type: numeric_attribute
              - name: grpc-server-errors
                numeric_attribute:
                  key: rpc.grpc.status_code
                  max_value: 15
                  min_value: 12
                type: numeric_attribute
              - latency:
                  threshold_ms: 400
                name: slow
                type: latency
          service:
            pipelines:
              traces:
                exporters: [datadog/connector]
                processors: [memory_limiter, tail_sampling/logic, tail_sampling/limit]
                receivers: [otlp]
              traces/2:
                exporters: [datadog]
                processors: [memory_limiter, batch]
                receivers: [datadog/connector]
              metrics:
                exporters: [datadog]
                processors: [memory_limiter, batch]
                receivers: [datadog/connector]
        image:
          tag: 0.94.0
        mode: deployment
        podAnnotations:
          ad.datadoghq.com/opentelemetry-collector.checks: |
            {
              "openmetrics": {
                "instances": [
                  {
                    "openmetrics_endpoint": "http://%%host%%:%%port_metrics%%/metrics",
                    "namespace": "monitoring",
                    "metrics": [
                      "otelcol_exporter_sent_spans",
                      "otelcol_process_runtime_total_alloc_bytes",
                      "otelcol_process_runtime_total_sys_memory_bytes",
                      "otelcol_processor_tail_sampling_count_traces_sampled",
                      "otelcol_processor_tail_sampling_sampling_decision_latency",
                      "otelcol_processor_tail_sampling_sampling_traces_on_memory"
                    ]
                  }
                ]
              }
            }
        ports:
          metrics:
            enabled: true
        replicaCount: 5
        resources:
          limits:
            cpu: '2'
            memory: 2Gi
          requests:
            cpu: 500m
            memory: 2Gi
        service:
          clusterIP: None
    repoURL: https://open-telemetry.github.io/opentelemetry-helm-charts
    targetRevision: "0.80.0"
  - chart: opentelemetry-collector
    helm:
      releaseName: opentelemetry-collector-agent
      values: |
        config:
          exporters:
            loadbalancing:
              protocol:
                otlp:
                  timeout: 1s
                  tls:
                    insecure: true
              resolver:
                k8s:
                  service: opentelemetry-collector-deployment.monitoring
            otlp:
              endpoint: http://opentelemetry-collector-deployment.monitoring.svc.cluster.local:4317
              tls:
                insecure: true
          processors:
            memory_limiter:
              check_interval: 1s
              limit_percentage: 80
              spike_limit_percentage: 25
          service:
            pipelines:
              traces:
                exporters: [loadbalancing]
                processors: [memory_limiter]
                receivers: [otlp]
        image:
          tag: 0.94.0
        mode: daemonset
        podAnnotations:
          ad.datadoghq.com/opentelemetry-collector.checks: |
            {
              "openmetrics": {
                "instances": [
                  {
                    "openmetrics_endpoint": "http://%%host%%:%%port_metrics%%/metrics",
                    "namespace": "monitoring",
                    "metrics": [
                      "otelcol_exporter_sent_spans",
                      "otelcol_loadbalancer_backend_latency",
                      "otelcol_loadbalancer_backend_outcome",
                      "otelcol_process_runtime_total_alloc_bytes",
                      "otelcol_process_runtime_total_sys_memory_bytes"
                    ]
                  }
                ]
              }
            }
        ports:
          metrics:
            enabled: true
        resources:
          limits:
            cpu: '2'
            memory: 1500Mi
          requests:
            cpu: 500m
            memory: 1500Mi
        service:
          enabled: true
    repoURL: https://open-telemetry.github.io/opentelemetry-helm-charts
    targetRevision: "0.80.0"

We are updating to the latest version every time there is a new release, in the hope that it fixes the problem 😄

@mackjmr
Member

mackjmr commented Feb 16, 2024

@diogotorres97 around 21h20 in the screenshot you shared, was there an increase in data sent to the collectors? Did that time correspond to the spike in requests you mentioned?

@diogotorres97

@diogotorres97 around 21h20 in the screenshot you shared, was there an increase in data sent to the collectors? Did that time correspond to the spike in requests you mentioned?

Yes. Without spikes the memory usually increases over a day or two; with spikes (it depends) it can grow very fast...

@mackjmr
Member

mackjmr commented Feb 16, 2024

@diogotorres97 if higher data volume/cardinality is being sent, higher memory consumption is expected.

Without spikes the memory usually increases over a day or two

Memory increasing with steady traffic/cardinality is unexpected. We've been unable to reproduce a memory leak with 0.94.0 in tests with different cardinality and traffic levels.

In the scenario where memory increases under steady traffic, can you please provide us with output traces in JSON format via the file exporter, graphs showing the steady increase in memory, as well as profiles? Ideally, capture two profiles spaced apart during a period when memory is increasing; with those two profiles we'll be able to see what is growing in memory.

Contributor

github-actions bot commented May 8, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions bot added the Stale label on May 8, 2024
@NickAnge changed the title from "connectors/datadog/connector: Increasing Memory That Eventually Kills Collector Pods" to "connectors/datadogconnector: Increasing Memory That Eventually Kills Collector Pods" on May 8, 2024
@NickAnge
Author

NickAnge commented May 8, 2024

Hello @mackjmr. I just wanted to let you know that we have been using the new version and we no longer see any memory leaks coming from this component. I am not sure whether we should close this issue or wait a bit longer. Thanks in advance.

@mx-psi
Member

mx-psi commented May 8, 2024

Thanks for getting back to us @NickAnge! I think we can close this for now. If the issue comes back, please comment on the issue and we can reopen :)

@mx-psi closed this as completed on May 8, 2024