error encoding and sending metric family: write tcp <ip> -> <ip>: write: broken pipe #30700

Closed
tcaty opened this issue Jan 22, 2024 · 5 comments
Labels: bug (Something isn't working), exporter/prometheus, needs triage (New item requiring triage)

Comments

tcaty commented Jan 22, 2024

Component(s)

exporter/prometheus

What happened?

Description

Hi! We recently ran into an issue with exporting metrics. The Prometheus exporter stops exporting metrics after roughly 8 hours and logs the error from the title:

error encoding and sending metric family: write tcp <ip> -> <ip>: write: broken pipe

[screenshot]

Right now it only works with a cronjob that restarts otel-collector every 8 hours, but it clearly should not have to work like this, and sometimes the collector hits this error even sooner than 8 hours. It is very unstable.

We suspected this error could be caused by low memory, but that does not seem to be the case: the memory_limiter processor is configured and there are no logs about soft or hard memory limits in the otel-collector pods' stdout.

Steps to Reproduce

  • Run otel-collector with the configuration below
  • Send a lot of metrics to it
  • Observe the error above

Expected Result

Otel-collector runs stably and exports Prometheus metrics continuously, without restarts.

Actual Result

Otel-collector runs stably without a restart for only ~8 hours.

Collector version

v0.88.0

Environment information

Environment

kubernetes v1.24
opentelemetry-collector helm chart v0.73.1
prometheus v2.47.1

OpenTelemetry Collector configuration

  receivers:
    otlp:
      protocols:
        http: {}
    prometheus:
      config: 
        scrape_configs:
          - job_name: otel-collector-metrics
            scrape_interval: 15s
            static_configs:
            - targets: ['localhost:8888']

  processors:
    memory_limiter:
      check_interval: 1s
      # -- limit_mib defines the hard limit. 
      # it should be equal to resources.limits.memory.
      limit_mib: 4000
      # -- the soft limit equals (limit_mib - spike_limit_mib);
      # as a share of limit_mib it should match autoscaling.targetMemoryUtilizationPercentage,
      # therefore spike_limit_percentage = 100 - autoscaling.targetMemoryUtilizationPercentage
      # (see the worked example after this configuration)
      spike_limit_percentage: 20
    batch: {}
    metricstransform:
      transforms:
        - include: duration
          match_type: strict
          action: update
          operations:
            - action: aggregate_labels
              label_set: 
                - env
                - service.name
              aggregation_type: sum
        - include: calls
          match_type: strict
          action: update
          operations:
            - action: aggregate_labels
              label_set: 
                - env
                - service.name
                - span.name
                - status.code
              aggregation_type: sum

  exporters:
    otlphttp:
      endpoint: "http://tempo-gateway:80"
      tls:
        insecure: true
    prometheus:
      endpoint: "0.0.0.0:8889"

  connectors:
    spanmetrics:
      # -- parse env from OTEL_RESOURCE_ATTRIBUTES
      # and create a separate series for each value
      dimensions:
        - name: env
          default: Stage
    servicegraph:

  service:
    pipelines:
      traces:
        receivers:
          - otlp
        processors: 
          - memory_limiter
          - batch
        exporters:
          - otlphttp
          - spanmetrics
          - servicegraph
      metrics:
        receivers:
          - otlp
          - prometheus
          - spanmetrics
          - servicegraph
        processors:
          - memory_limiter
          - batch
          - metricstransform
        exporters:
          - prometheus
    telemetry:
      metrics:
        address: localhost:8888
        level: detailed
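
As a worked example of the spike-limit arithmetic the memory_limiter comments above intend (assuming the spike limit is applied as a percentage of limit_mib, as the comment implies; these numbers are not stated explicitly in the issue):

    spike_limit_mib = limit_mib * spike_limit_percentage / 100 = 4000 * 20 / 100 = 800 MiB
    soft limit      = limit_mib - spike_limit_mib = 4000 - 800 = 3200 MiB

3200 MiB is 80% of the 4000 MiB hard limit, which matches autoscaling.targetMemoryUtilizationPercentage: 80.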

Log output

### otel-collector logs
2024-01-19T12:09:36.382Z error prometheusexporter@v0.88.0/log.go:23 error encoding and sending metric family: write tcp <ip> -> <ip>: write: broken pipe
 {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*promLogger).Println
 github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter@v0.88.0/log.go:23
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1.2
 github.com/prometheus/client_golang@v1.17.0/prometheus/promhttp/http.go:192
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1
 github.com/prometheus/client_golang@v1.17.0/prometheus/promhttp/http.go:210
net/http.HandlerFunc.ServeHTTP
 net/http/server.go:2136
net/http.(*ServeMux).ServeHTTP
 net/http/server.go:2514
go.opentelemetry.io/collector/config/confighttp.(*decompressor).ServeHTTP
 go.opentelemetry.io/collector/config/confighttp@v0.88.0/compression.go:147
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP
 go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.45.0/handler.go:217
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1
 go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.45.0/handler.go:81
net/http.HandlerFunc.ServeHTTP
 net/http/server.go:2136
go.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP
 go.opentelemetry.io/collector/config/confighttp@v0.88.0/clientinfohandler.go:28
net/http.serverHandler.ServeHTTP
 net/http/server.go:2938
net/http.(*conn).serve
 net/http/server.go:2009


### prometheus logs
ts=2024-01-19T12:09:51.104Z caller=scrape.go:1384 level=debug component="scrape manager" scrape_pool=otel-collector target=http://otel-collector-opentelemetry-collector:8889/metrics msg="Scrape failed" err="context deadline exceeded"

Additional context

Otel-collector metrics

OpenTelemetry Collector dashboard

[dashboard screenshots]

Kubernetes / Views / Pods

[dashboard screenshots]

Otel-collector resources configuration

resources:
  requests:
    cpu: 100m
    memory: 2Gi
  limits:
    cpu: 500m
    memory: 4Gi
# -- When enabled, the chart will set the GOMEMLIMIT env var to 80% of the configured
# resources.limits.memory and remove the memory ballast extension
# (see the note after this block).
useGOMEMLIMIT: true

autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 3
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 80
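
With the values above, the useGOMEMLIMIT behaviour described in the comment works out roughly as follows (illustrative; the rendered env var value is not shown in the issue):

    GOMEMLIMIT ≈ 0.8 * resources.limits.memory = 0.8 * 4Gi ≈ 3.2GiB (e.g. GOMEMLIMIT=3276MiB)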

Prometheus chart main configuration

scrapeInterval: 15s
scrapeTimeout: 10s
evaluationInterval: 15s
enableAdminAPI: true
logLevel: debug
retention: 14d
      
# this setting fixes errors when ingesting out-of-order samples,
# see: https://promlabs.com/blog/2022/12/15/understanding-duplicate-samples-and-out-of-order-timestamp-errors-in-prometheus/#intentional-ingestion-of-out-of-order-data
# by default the otel-collector batch processor has timeout = 200ms,
# the duration after which a batch is sent regardless of its size
# (see the sketch after this block)
tsdb:
  outOfOrderTimeWindow: 200ms
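
For reference, a minimal sketch of making the batch timeout mentioned in the comment above explicit on the collector side (200ms is the batch processor's documented default, so this would not change behaviour):

  processors:
    batch:
      timeout: 200ms
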
tcaty added the bug and needs triage labels on Jan 22, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1 (Member)

Hello @tcaty, can you share more information about the environment you're running Kubernetes in? This error is a networking issue coming from Prometheus closing the connection, and could be related to your underlying OS and kernel version.

References: Kernel bug fix, another related issue, and another related issue.

tcaty (Author) commented Jan 24, 2024

Hi @crobert-1! Thank you for your reply!
We use managed Kubernetes (KaaS) from a cloud provider, Yandex Cloud. Yesterday I contacted their technical support, and they said there are no network problems in our cluster.
But I recently heard that CPU throttling can be a connection killer, so I removed the CPU limits from the resources configuration and increased the otel-collector restart interval to 24 hours. There have been no errors for the last 22 hours, so I hope this is the solution. I'll follow up in a few days.
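
For reference, a sketch of what removing the CPU limit looks like in the Helm values shown above (illustrative; not the exact manifest):

resources:
  requests:
    cpu: 100m
    memory: 2Gi
  limits:
    # cpu limit removed to avoid CPU throttling
    memory: 4Gi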

tcaty (Author) commented Jan 29, 2024

Hello again @crobert-1 :) We have raised the CPU limits for our otel-collector and it now works well; there have been no errors sending metrics for the last 5 days. The problem really was that CPU throttling kills connections; one related article is linked here.
Thank you again for your reply, the issue can be closed :)

@crobert-1 (Member)

Glad to hear it's working again. Thanks for including a solution too; it's really helpful as a future reference!
