[prometheusremotewriteexporter] memory leak when downstream prometheus endpoint is slow/non-responsive leads to GC and "out of order" errors #33324

Open
diranged opened this issue May 31, 2024 · 2 comments
Labels
bug Something isn't working exporter/prometheusremotewrite needs triage New item requiring triage

diranged commented May 31, 2024

Component(s)

exporter/prometheusremotewrite

What happened?

Description

I'm reporting what I think is a memory leak in the prometheusremotewriteexporter that is triggered when the downstream prometheus endpoint is either slow to respond or failing entirely. This leak ultimately puts the collector into a GC loop that never recovers, impacting all of the work the collector is doing, not just the pipeline with the downstream problem. I've spent the last ~2 days troubleshooting this with AWS and talking through it with @Aneurysm9.

Steps to Reproduce

In my test case - we have a set of OTEL Collectors called metric-aggregators which accept inbound OTLP Metric data (generally sourced from Prometheus Receivers) and write the data into two different pipelines - call it a production and a debug pipeline. The data going into these pipelines can be the same, or it can be totally unique. In this case, the data is unique... I have data=foo -> production and data=bar -> debug essentially.

Once the pipeline is humming along, introduce intentional throttling to the Prometheus endpoint on the debug pipeline - I did this by setting resources.requests.cpu=1 and resources.limits.cpu=1 ... and we're writing ~50-80k datapoints/sec, so that was enough to introduce throttling.
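
The throttling step can be sketched as a resource fragment on the debug Prometheus pod. The CPU values come from the description above; the placement under a Prometheus custom resource's `spec.resources` is an assumption based on the `prometheus-operated` service name and may differ in other deployments.

```yaml
# Hypothetical placement: spec.resources of the Prometheus resource backing
# the prometheusremotewrite/debug endpoint. Only the CPU values are taken
# from the report; everything else about that resource is omitted.
resources:
  requests:
    cpu: "1"
  limits:
    cpu: "1"   # at ~50-80k datapoints/sec this was enough to cause throttling
```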

Expected Result

My expectation is that the debug pipeline will start failing requests (I'd expect to see context deadline exceeded messages) - and data would ultimately be refused by the batch processor, which would in turn refuse data upstream. I expect the production pipeline to continue to operate just fine because there's no impact to its downstream targets.

Actual Result

Interestingly, we see impact that starts with the debug pipeline, but then spreads to all of the pipelines in the collector. After a period of time (~20-40m), the collectors are completely stuck in a GC loop triggered by the memory_limiter. Data then fails to write to the production pipeline. Additionally, when we un-clog the prometheus debug endpoint, the collector doesn't self-recover ... it is stuck in this GC loop essentially indefinitely until we restart the pods.

Collector version

0.101.0

Environment information

Environment

OS: BottleRocket 1.19.4

OpenTelemetry Collector configuration

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector-metrics-aggregation
  namespace: otel
spec:
  args:
    feature-gates: +processor.resourcedetection.hostCPUSteppingAsString,+exporter.prometheusremotewritexporter.RetryOn429
  autoscaler:
    behavior:
      scaleDown:
        policies:
        - periodSeconds: 600
          type: Pods
          value: 1
        selectPolicy: Min
        stabilizationWindowSeconds: 900
      scaleUp:
        policies:
        - periodSeconds: 60
          type: Pods
          value: 4
        - periodSeconds: 60
          type: Percent
          value: 100
        selectPolicy: Max
        stabilizationWindowSeconds: 60
    maxReplicas: 24
    minReplicas: 3
    targetCPUUtilization: 60
    targetMemoryUtilization: 65
  config:
    exporters:
      debug:
        sampling_initial: 15
        sampling_thereafter: 60
      debug/verbose:
        sampling_initial: 15
        sampling_thereafter: 60
        verbosity: detailed
      prometheusremotewrite/amp:
        add_metric_suffixes: true
        auth:
          authenticator: sigv4auth
        endpoint: https://.../api/v1/remote_write
        max_batch_size_bytes: "1000000"
        remote_write_queue:
          num_consumers: 5
          queue_size: 50000
        resource_to_telemetry_conversion:
          enabled: true
        retry_on_failure:
          enabled: true
          initial_interval: 200ms
          max_elapsed_time: 60s
          max_interval: 5s
        send_metadata: false
        target_info:
          enabled: false
        timeout: 90s
      prometheusremotewrite/centralProd:
        add_metric_suffixes: true
        endpoint: https://.../api/v1/remote_write
        max_batch_size_bytes: "1000000"
        remote_write_queue:
          num_consumers: 5
          queue_size: 50000
        resource_to_telemetry_conversion:
          enabled: true
        retry_on_failure:
          enabled: true
          initial_interval: 200ms
          max_elapsed_time: 60s
          max_interval: 5s
        send_metadata: false
        target_info:
          enabled: false
        timeout: 90s
        tls:
          ca_file: /tls/ca.crt
          cert_file: /tls/tls.crt
          insecure_skip_verify: true
          key_file: /tls/tls.key
      prometheusremotewrite/debug:
        add_metric_suffixes: true
        endpoint: http://prometheus-operated:9090/api/v1/write
        max_batch_size_bytes: "1000000"
        remote_write_queue:
          num_consumers: 5
          queue_size: 50000
        resource_to_telemetry_conversion:
          enabled: true
        retry_on_failure:
          enabled: true
          initial_interval: 200ms
          max_elapsed_time: 60s
          max_interval: 5s
        send_metadata: false
        target_info:
          enabled: false
        timeout: 90s
        tls:
          insecure: true
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      pprof:
        endpoint: :1777
      sigv4auth:
        region: us-west-2
    processors:
      attributes/common:
        actions:
        - action: upsert
          key: k8s.cluster.name
          value: test
      batch/prometheus:
        send_batch_max_size: 16384
        send_batch_size: 8192
        timeout: 5s
      filter/drop_istio_metrics:
        error_mode: ignore
        metrics:
          exclude:
            match_type: regexp
            resource_attributes:
            - key: service.name
              value: istio-system/envoy-stats-monitor
      filter/prometheus_beta:
        error_mode: ignore
        metrics:
          include:
            match_type: regexp
            resource_attributes:
            - key: _meta.level
              value: beta
      filter/prometheus_prod:
        error_mode: ignore
        metrics:
          include:
            match_type: regexp
            resource_attributes:
            - key: _meta.level
              value: prod
      memory_limiter:
        check_interval: 1s
        limit_percentage: 85
        spike_limit_percentage: 10
      transform/set_beta_or_prod_flag:
        error_mode: ignore
        metric_statements:
        - context: resource
          statements:
          - set(attributes["_meta.level"], "beta") where attributes["service.name"]
            == "istio-system/envoy-stats-monitor"
          - set(attributes["_meta.level"], "prod") where attributes["service.name"]
            == "... other metrics ..."
    receivers:
      otlp:
        protocols:
          grpc:
            max_recv_msg_size_mib: 128
            tls:
              ca_file: /tls/ca.crt
              cert_file: /tls/tls.crt
              client_ca_file: /tls/ca.crt
              key_file: /tls/tls.key
    service:
      extensions:
      - health_check
      - pprof
      - sigv4auth
      pipelines:
        metrics/prometheus_beta:
          exporters:
          - debug
          - prometheusremotewrite/debug
          processors:
          - memory_limiter
          - transform/set_beta_or_prod_flag
          - filter/prometheus_beta
          - batch/prometheus
          receivers:
          - otlp
        metrics/prometheus_prod:
          exporters:
          - prometheusremotewrite/amp
          - prometheusremotewrite/centralProd
          processors:
          - memory_limiter
          - transform/set_beta_or_prod_flag
          - filter/prometheus_prod
          - attributes/common
          - batch/prometheus
          receivers:
          - otlp
      telemetry:
        logs:
          level: info
        metrics:
          level: detailed
  daemonSetUpdateStrategy: {}
  deploymentUpdateStrategy: {}
  env:
  - name: GOMAXPROCS
    value: "8"
  - name: GOMEMLIMIT
    valueFrom:
      resourceFieldRef:
        divisor: "1"
        resource: limits.memory
  ingress:
    route: {}
  livenessProbe:
    failureThreshold: 30
    initialDelaySeconds: 60
    periodSeconds: 30
  managementState: managed
  mode: statefulset
  observability:
    metrics: {}
  podDisruptionBudget:
    maxUnavailable: 1
  readinessProbe:
    failureThreshold: 3
    periodSeconds: 15
  replicas: 24
  resources:
    limits:
      memory: 3Gi
    requests:
      cpu: "2"
      memory: 3Gi
  serviceAccount: otel-collector-metrics-aggregation
  upgradeStrategy: automatic
  volumeMounts:
  - mountPath: /tls/ca.crt
    name: tls-ca
    readOnly: true
    subPath: ca.crt
  - mountPath: /tls/tls.key
    name: tls
    readOnly: true
    subPath: tls.key
  - mountPath: /tls/tls.crt
    name: tls
    readOnly: true
    subPath: tls.crt
  volumes:
  - name: tls-ca
    secret:
      defaultMode: 420
      items:
      - key: ca.crt
        path: ca.crt
      secretName: otel-collector-cacert
  - name: tls
    secret:
      defaultMode: 420
      items:
      - key: tls.crt
        path: tls.crt
      - key: tls.key
        path: tls.key
      secretName: otel-collector-metrics-v2

Additional context

Setting the Scene

I think the only way to explain the flow here is to start with a picture, and then talk through the timeline. In this picture, we have 5 graphs that are important to see at the same time.

  • Metric Datapoints Exported: This is the graph of successful exported metrics per exporter. The blue and yellow lines are two exporters connected to the metrics/prometheus_prod pipeline. They are sending "production" data that we've validated. The orange line is the metrics/prometheus_beta pipeline that is sending data we haven't yet validated - but a high volume of it. The green line is a debug output, it can be ignored.
  • Percentage of Metrics Exported: This is the success-rate graph for each of the exporters described above.
  • HTTP Response Times - ...amazonaws.com: This is the response time graph for the prometheusremotewrite/amp exporter (a local AMP endpoint in the same region/account)
  • HTTP Response Times - ...com: This is the exporter prometheusremotewrite/centralProd which happens to be an AMP endpoint, but is cross-account and region (going through a proxy).
  • HTTP Response Times - prometheus-operated: This is the internal prometheusremotewrite/debug endpoint which is a single internal prometheus pod attached to the prometheus_beta pipeline that I used to introduce throttling.

[Screenshot, 2024-05-31: dashboard showing the five graphs described above]

Timeline

  • 9:40:00: Everything is roughly humming along just fine....
  • 9:44:00: Throttling is introduced to the prometheusremotewrite/debug exporter by reducing the CPU limits on the Prometheus pod. Latency starts to creep up.
  • 10:07:00: We finally start to see the success-rate for the prometheusremotewrite/debug exporter tank. Note at this point the other two exporters are still operating just fine.
  • 10:33:00: We see a dip in the success-rate for the prometheusremotewrite/centralProd and prometheusremotewrite/amp endpoints now.
  • 10:37:00: Success rate tanks on all exporters now other than the debug exporter.
  • ...
  • 11:12:00: I un-cork the Prometheus Pod by removing its CPU limit and letting it restart. We see an immediate response in the latencies for the prometheusremotewrite/debug exporter (though it does not drop enough, and starts climbing again).
  • ... recovery never happens automatically...

Logs

Obviously we have lots of logs ... but here are two graphs that are interesting. First, just the high level graph of error log lines:
[Image: graph of error log lines over time]

Rather than looking at the logs individually, I started looking at them in terms of two key messages... Forcing a GC and out of order errors:

[Image: graph of "Forcing a GC" and "out of order" log message rates over time]

We can see that roughly at 10:03 we start seeing the Forcing a GC message rates start climbing, and at 10:07 and 10:36 respectively we see corresponding jumps in the out of order error messages from the upstream Prometheus endpoints. We never see an active recovery of these metrics, even after we un-corked the downstream prometheusremotewrite/debug endpoint.

Finally - PPROF...

At @Aneurysm9's suggestion, I grabbed a profile and a few heap dumps from one pod during this time frame:

We can see in the CPU profile that most of the time is spent in GC:

[Image: CPU profile, with most time spent in GC]

When we look at the heap, we can see some interesting memory usage in the prometheusremotewrite code:
[Image: heap profile showing memory usage in the prometheusremotewrite code]

Final thoughts

In this scenario, I expect the batch processor to prevent data from getting into the pipeline after the initial ~8-16k datapoints are collected and are failing to send. Once they fail to send, I expect that pressure to push upstream all the way to the receiver. I then expect to not really see any memory problems during this outage, just a blockage of the data going to the prometheusremotewrite/debug endpoint.
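
For a rough sense of how much data could be buffered before backpressure reaches the receiver, the settings from the configuration above can be annotated as follows. This is a back-of-the-envelope sketch that assumes each `remote_write_queue` slot holds one batch emitted by `batch/prometheus` rather than a single datapoint; that assumption is not confirmed in this report and should be checked against the exporter's documentation.

```yaml
processors:
  batch/prometheus:
    send_batch_size: 8192        # typical batch: 8192 datapoints
    send_batch_max_size: 16384   # largest batch: 16384 datapoints
exporters:
  prometheusremotewrite/debug:
    remote_write_queue:
      num_consumers: 5
      # ASSUMPTION: one queue slot = one batch. Under that assumption a full
      # queue holds:
      #   50000 * 8192  = 409,600,000 datapoints (typical batches)
      #   50000 * 16384 = 819,200,000 datapoints (maximum-size batches)
      queue_size: 50000
```

If that assumption holds, the queue could absorb far more than the ~8-16k datapoints expected here before it refuses data, which from the outside may be hard to tell apart from unbounded growth within a 3Gi memory limit.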

Instead, I believe we see a memory leak in the prometheusremotewrite code. When that leak happens, it eventually trips the memory_limiter circuit breaker, which then starts forcing GCs ... but these GCs can't reclaim the memory, so it just happens over and over again. This cycle then impacts the rest of the data pipelines flowing through the collector.
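
The thresholds at which the `memory_limiter` starts refusing data and forcing GCs can be worked out from the configuration above. The sketch below assumes the limiter derives its percentages from the container's 3Gi memory limit and that the soft limit equals the hard limit minus the spike limit, as described in the processor's documentation.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    # Total memory assumed: 3Gi = 3072 MiB (resources.limits.memory)
    limit_percentage: 85         # hard limit: 0.85 * 3072 = 2611.2 MiB
    spike_limit_percentage: 10   # spike:      0.10 * 3072 = 307.2 MiB
                                 # soft limit: 2611.2 - 307.2 = 2304 MiB
# GOMEMLIMIT is set from limits.memory with divisor "1", i.e. 3221225472 bytes
# (3072 MiB), which is above both limiter thresholds.
```

Under these assumptions, heap that stays above roughly 2304 MiB keeps the limiter refusing data, and heap above roughly 2611 MiB triggers forced GCs on its checks, which would be consistent with the repeated Forcing a GC messages if the retained memory is never released.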

Lastly, I think this memory leak has some critical impact on the data payloads sent to Prometheus, causing duplicate or out-of-order samples to be sent that normally would not be, which further exacerbates the problem.

diranged added the "bug" (Something isn't working) and "needs triage" (New item requiring triage) labels May 31, 2024
Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

diranged (Author) commented

Oh there is a secondary question here - why does the p99 graph pin out at 10s... all of my timeouts are far higher than that. I found that pretty suspicious too.
