error encoding and sending metric family: write tcp <ip> -> <ip>: write: broken pipe #30700

Closed
tcaty opened this issue Jan 22, 2024 · 5 comments
Labels: bug (Something isn't working), exporter/prometheus, needs triage (New item requiring triage)

Comments

tcaty commented Jan 22, 2024

Component(s)

exporter/prometheus

What happened?

Description

Hi! We recently ran into an issue with exporting metrics. The Prometheus exporter stops exporting metrics after roughly 8 hours and logs the error from the title:

error encoding and sending metric family: write tcp <ip> -> <ip>: write: broken pipe

[screenshot]

Right now it only works with a cronjob that restarts otel-collector every 8 hours, but it clearly should not have to work like this, and sometimes the collector hits this error even sooner than 8 hours. It is very unstable.

We suspected this error could be caused by low memory, but that does not seem to be the case: the memory_limiter processor is configured and there are no logs about soft or hard memory limits in the otel-collector pods' stdout.

Steps to Reproduce

  • Run otel-collector with the configuration below
  • Send a lot of metrics to it
  • Observe the error above

Expected Result

Otel-collector runs stably and exports Prometheus metrics continuously, without restarts.

Actual Result

Otel-collector runs stably without a restart for only ~8 hours.

Collector version

v0.88.0

Environment information

Environment

kubernetes v1.24
opentelemetry-collector helm chart v0.73.1
prometheus v2.47.1

OpenTelemetry Collector configuration

  receivers:
    otlp:
      protocols:
        http: {}
    prometheus:
      config: 
        scrape_configs:
          - job_name: otel-collector-metrics
            scrape_interval: 15s
            static_configs:
            - targets: ['localhost:8888']

  processors:
    memory_limiter:
      check_interval: 1s
      # -- limit_mib defines the hard limit. 
      # it should be equal to resources.limits.memory.
      limit_mib: 4000
      # -- the soft limit equals (limit_mib - spike_limit_mib);
      # as a share of limit_mib it should match autoscaling.targetMemoryUtilizationPercentage,
      # therefore spike_limit_percentage = 100 - autoscaling.targetMemoryUtilizationPercentage
      # (see the worked example after this configuration)
      spike_limit_percentage: 20
    batch: {}
    metricstransform:
      transforms:
        - include: duration
          match_type: strict
          action: update
          operations:
            - action: aggregate_labels
              label_set: 
                - env
                - service.name
              aggregation_type: sum
        - include: calls
          match_type: strict
          action: update
          operations:
            - action: aggregate_labels
              label_set: 
                - env
                - service.name
                - span.name
                - status.code
              aggregation_type: sum

  exporters:
    otlphttp:
      endpoint: "http://tempo-gateway:80"
      tls:
        insecure: true
    prometheus:
      endpoint: "0.0.0.0:8889"

  connectors:
    spanmetrics:
      # -- parse env from OTEL_RESOURCE_ATTRIBUTES
      # and create a separate series for each value
      dimensions:
        - name: env
          default: Stage
    servicegraph:

  service:
    pipelines:
      traces:
        receivers:
          - otlp
        processors: 
          - memory_limiter
          - batch
        exporters:
          - otlphttp
          - spanmetrics
          - servicegraph
      metrics:
        receivers:
          - otlp
          - prometheus
          - spanmetrics
          - servicegraph
        processors:
          - memory_limiter
          - batch
          - metricstransform
        exporters:
          - prometheus
    telemetry:
      metrics:
        address: localhost:8888
        level: detailed
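
As a worked example of the spike-limit arithmetic the memory_limiter comments above intend (assuming the spike limit is applied as a percentage of limit_mib, as the comment implies; these numbers are not stated explicitly in the issue):

    spike_limit_mib = limit_mib * spike_limit_percentage / 100 = 4000 * 20 / 100 = 800 MiB
    soft limit      = limit_mib - spike_limit_mib = 4000 - 800 = 3200 MiB

3200 MiB is 80% of the 4000 MiB hard limit, which matches autoscaling.targetMemoryUtilizationPercentage: 80.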

Log output

### otel-collector logs
2024-01-19T12:09:36.382Z error prometheusexporter@v0.88.0/log.go:23 error encoding and sending metric family: write tcp <ip> -> <ip>: write: broken pipe
 {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*promLogger).Println
 github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter@v0.88.0/log.go:23
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1.2
 github.com/prometheus/client_golang@v1.17.0/prometheus/promhttp/http.go:192
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1
 github.com/prometheus/client_golang@v1.17.0/prometheus/promhttp/http.go:210
net/http.HandlerFunc.ServeHTTP
 net/http/server.go:2136
net/http.(*ServeMux).ServeHTTP
 net/http/server.go:2514
go.opentelemetry.io/collector/config/confighttp.(*decompressor).ServeHTTP
 go.opentelemetry.io/collector/config/confighttp@v0.88.0/compression.go:147
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP
 go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.45.0/handler.go:217
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1
 go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.45.0/handler.go:81
net/http.HandlerFunc.ServeHTTP
 net/http/server.go:2136
go.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP
 go.opentelemetry.io/collector/config/confighttp@v0.88.0/clientinfohandler.go:28
net/http.serverHandler.ServeHTTP
 net/http/server.go:2938
net/http.(*conn).serve
 net/http/server.go:2009


### prometheus logs
ts=2024-01-19T12:09:51.104Z caller=scrape.go:1384 level=debug component="scrape manager" scrape_pool=otel-collector target=http://otel-collector-opentelemetry-collector:8889/metrics msg="Scrape failed" err="context deadline exceeded"

Additional context

Otel-collector metrics

OpenTelemetry Collector dashboard

[dashboard screenshots]

Kubernetes / Views / Pods

[dashboard screenshots]

Otel-collector resources configuration

resources:
  requests:
    cpu: 100m
    memory: 2Gi
  limits:
    cpu: 500m
    memory: 4Gi
# -- When enabled, the chart will set the GOMEMLIMIT env var to 80% of the configured
# resources.limits.memory and remove the memory ballast extension
# (see the note after this block).
useGOMEMLIMIT: true

autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 3
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 80
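
With the values above, the useGOMEMLIMIT behaviour described in the comment works out roughly as follows (illustrative; the rendered env var value is not shown in the issue):

    GOMEMLIMIT ≈ 0.8 * resources.limits.memory = 0.8 * 4Gi ≈ 3.2GiB (e.g. GOMEMLIMIT=3276MiB)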

Prometheus chart main configuration

scrapeInterval: 15s
scrapeTimeout: 10s
evaluationInterval: 15s
enableAdminAPI: true
logLevel: debug
retention: 14d
      
# this setting fixes errors when ingesting out-of-order samples,
# see: https://promlabs.com/blog/2022/12/15/understanding-duplicate-samples-and-out-of-order-timestamp-errors-in-prometheus/#intentional-ingestion-of-out-of-order-data
# by default the otel-collector batch processor has timeout = 200ms,
# the duration after which a batch is sent regardless of its size
# (see the sketch after this block)
tsdb:
  outOfOrderTimeWindow: 200ms
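
For reference, a minimal sketch of making the batch timeout mentioned in the comment above explicit on the collector side (200ms is the batch processor's documented default, so this would not change behaviour):

  processors:
    batch:
      timeout: 200ms
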
tcaty added the bug and needs triage labels on Jan 22, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1 (Member)

Hello @tcaty, can you share more information about the environment you're running Kubernetes in? This error is a networking issue coming from Prometheus closing the connection, and could be related to your underlying OS and kernel version.

References: Kernel bug fix, another related issue, and another related issue.

tcaty (Author) commented Jan 24, 2024

Hi @crobert-1! Thank you for your reply!
We use managed Kubernetes (KaaS) from a cloud provider, Yandex Cloud. Yesterday I contacted their technical support, and they said there are no network problems in our cluster.
But I recently heard that CPU throttling can be a connection killer, so I removed the CPU limits from the resources configuration and increased the otel-collector restart interval to 24 hours. There have been no errors for the last 22 hours, so I hope this is the solution. I'll follow up in a few days.
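
For reference, a sketch of what removing the CPU limit looks like in the Helm values shown above (illustrative; not the exact manifest):

resources:
  requests:
    cpu: 100m
    memory: 2Gi
  limits:
    # cpu limit removed to avoid CPU throttling
    memory: 4Gi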

tcaty (Author) commented Jan 29, 2024

Hello again @crobert-1 :) We have raised the CPU limits for our otel-collector and it now works well; there have been no errors sending metrics for the last 5 days. The problem really was that CPU throttling kills connections; one related article is linked here.
Thank you again for your reply, the issue can be closed :)

@crobert-1 (Member)

Glad to hear it's working again. Thanks for including a solution too; it's really helpful as a future reference!
