
error encoding and sending metric family - write: broken pipe #2387

Closed
bocmanpy opened this issue Jun 1, 2022 · 13 comments


bocmanpy commented Jun 1, 2022

Host operating system: output of uname -a

The problem reproduces on several kernel versions:
Linux 4.15.0-171-generic #180-Ubuntu SMP Wed Mar 2 17:25:05 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Linux 5.4.0-107-generic #121~18.04.1-Ubuntu SMP Thu Mar 24 17:21:33 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.3.1 (branch: non-git, revision: non-git)
  build user:       root@devops-k8s-go1.15-vqdj1
  build date:       20211206-03:01:18
  go version:       go1.15.6
  platform:         linux/amd64

node_exporter command line flags

/usr/bin/node-exporter \
--collector.bcache \
--collector.cpu.info \
--collector.systemd \
--collector.systemd.unit-include=.+\\.(service|socket|timer) \
--collector.buddyinfo \
--collector.meminfo_numa \
--collector.ksmd \
--collector.tcpstat \
--collector.netstat.fields=^(.*_(InErrors|InErrs)|Ip_Forwarding|Ip(6|Ext)_(InOctets|OutOctets)|Icmp6?_(InMsgs|OutMsgs)|TcpExt_(.*)|Tcp_(ActiveOpens|InSegs|OutSegs|OutR
--collector.textfile.directory=/var/lib/node-exporter \
--no-collector.nfs \
--no-collector.nfsd \
--no-collector.zfs \
--no-collector.fibrechannel \
--no-collector.infiniband \
--no-collector.ipvs \
--no-collector.nvme \
--no-collector.tapestats

Are you running node_exporter in Docker?

No.

What did you do that produced an error?

Nothing; node-exporter was running in normal mode.

What did you expect to see?

Stable operation without broken-pipe errors.

What did you see instead?

May 31 16:36:58 node-exporter[628868]: ts=2022-05-31T16:36:58.024Z caller=stdlib.go:105 level=error msg="error encoding and sending metric family: write tcp X:9100->Y:23026: write: broken pipe"

bocmanpy commented Jun 1, 2022

Debug logs
node-exporter-debug_issue.log


bocmanpy commented Jun 1, 2022

I found a similar problem, #1066, but there the root cause was losing the connection because a collector's scrape time exceeded scrape_timeout.
The longest scrape time here is:
node-exporter[1471287]: ts=2022-05-31T16:14:42.138Z caller=systemd_linux.go:200 level=debug collector=systemd msg="collectSummaryMetrics took" duration_seconds=8.4897e-05
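
(For reference, one way to rank collectors by duration from a debug log like the one attached above; this is a sketch, and the journalctl unit name and time window are assumptions that may need adjusting:)

# List the slowest collectors recorded in the node_exporter debug log.
# The unit name "node_exporter" is an assumption; adjust it to match your setup.
journalctl -u node_exporter --since "1 hour ago" \
  | grep 'msg="collector succeeded"' \
  | sed -E 's/.*name=([a-z0-9_]+) duration_seconds=([0-9.e+-]+)/\2 \1/' \
  | sort -g -r \
  | head -n 10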


ventifus commented Jun 8, 2022

This still generally looks like the scrape timing out on the Prometheus side and the connection being closed from that side. A number of the collectors look slower than I expect; is this system very heavily loaded?

May 31 16:14:42 node-exporter[1471287]: ts=2022-05-31T16:14:42.138Z caller=collector.go:173 level=debug msg="collector succeeded" name=netclass duration_seconds=0.985494687
May 31 16:14:42 node-exporter[1471287]: ts=2022-05-31T16:14:42.238Z caller=collector.go:173 level=debug msg="collector succeeded" name=textfile duration_seconds=1.183959943
May 31 16:14:42 node-exporter[1471287]: ts=2022-05-31T16:14:42.338Z caller=collector.go:173 level=debug msg="collector succeeded" name=tcpstat duration_seconds=1.283826206
May 31 16:14:42 node-exporter[1471287]: ts=2022-05-31T16:14:42.459Z caller=collector.go:173 level=debug msg="collector succeeded" name=systemd duration_seconds=1.405464851

You can try disabling some of these if you don't need them. Review the node_scrape_collector_duration_seconds metric to see which ones are slowest. If Prometheus is unable to finish scraping for extended periods of time, try retrieving those metrics manually using curl, for example curl -s http://localhost:9100/metrics | fgrep node_scrape_collector_duration_seconds. You could also try increasing Prometheus' scrape_timeout and scrape_interval.
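
(A sketch of that check, sorted so the slowest collectors come first; the host and port are the defaults from this issue and may differ on your system:)

# Rank collectors by their last scrape duration, slowest first.
curl -s http://localhost:9100/metrics \
  | grep '^node_scrape_collector_duration_seconds' \
  | sort -t' ' -k2 -g -r \
  | head -n 10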


bocmanpy commented Jun 8, 2022

Hi @ventifus, thanks for the answer.
No, the system is barely loaded in any subsystem: CPU, RAM, network, I/O.

Yep, it's a good idea to check and disable some of the collectors; I'll try to find the slowest ones and disable them.

P.S.
I ran into a case where, on a system with a 128-core CPU, the cpufreq collector went beyond scrape_timeout.
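
(If that collector isn't needed, it can be switched off the same way the nfs/zfs collectors are disabled in the flags above; a sketch, with the other existing flags omitted:)

# Disable the cpufreq collector on many-core hosts where it slows down scrapes;
# keep the rest of your existing flags as-is.
/usr/bin/node-exporter --no-collector.cpufreq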

@bocmanpy

Found out that it was a network problem. 🫣


HwiLu commented Nov 15, 2022

Found out that it was a network problem. 🫣

How did you determine that it was a network problem? Are there obvious logs that show it?


HwiLu commented Nov 15, 2022

@bocmanpy


shoce commented Nov 17, 2022

Having the same issue now on two servers. The reason is completely obscure to me.

@bocmanpy how did you find out that it was a network problem?


shoce commented Nov 18, 2022

[screenshot: graph of node_scrape_collector_duration_seconds]

I have this graph of node_scrape_collector_duration_seconds for the last two days. It looks like something changed dramatically, but all collectors got slower, not just one.


shoce commented Nov 18, 2022

[screenshot: graph of node_scrape_collector_duration_seconds for another node]

Another node's graph of node_scrape_collector_duration_seconds for the last two days.


shoce commented Nov 18, 2022

The update after which the scraping changed so much included:

update of the Linux kernel: linux-image-virtual (5.4.0.132.132) over (5.4.0.131.131)
linux-modules-5.4.0-132-generic (5.4.0-132.148)
libexpat1:amd64 (2.2.9-1ubuntu0.5) over (2.2.9-1ubuntu0.4)
kpartx (0.8.3-1ubuntu2.1) over (0.8.3-1ubuntu2)
multipath-tools (0.8.3-1ubuntu2.1) over (0.8.3-1ubuntu2)
k0s binary 1.25.4 over 1.25.3

prometheus-node-exporter stayed at 1.4.0; that night's update did not touch it


anneum commented Nov 21, 2022

We have the same problem after an update, only with a different k8s version:

linux-modules-5.4.0-132-generic (5.4.0-132.148)
libexpat1:amd64 (2.2.9-1ubuntu0.5)
kpartx (0.8.3-1ubuntu2.1)
multipath-tools (0.8.3-1ubuntu2.1)
k8s binary 1.24.6
prometheus-node-exporter 1.4.0

One of our node exporter pods randomly loses the connection with the message "error encoding and sending metric family: write tcp write: broken pipe" and restores it after a few minutes.


SuperQ commented Nov 21, 2022

@shoce That looks like #2500.

prometheus locked this issue as resolved and limited the conversation to collaborators on Nov 21, 2022.