
error encoding and sending metric family - write: broken pipe #2387

Closed
bocmanpy opened this issue Jun 1, 2022 · 13 comments


bocmanpy commented Jun 1, 2022

Host operating system: output of uname -a

The problem reproduces on several kernel versions:
Linux 4.15.0-171-generic #180-Ubuntu SMP Wed Mar 2 17:25:05 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Linux 5.4.0-107-generic #121~18.04.1-Ubuntu SMP Thu Mar 24 17:21:33 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.3.1 (branch: non-git, revision: non-git)
  build user:       root@devops-k8s-go1.15-vqdj1
  build date:       20211206-03:01:18
  go version:       go1.15.6
  platform:         linux/amd64

node_exporter command line flags

/usr/bin/node-exporter \
--collector.bcache \
--collector.cpu.info \
--collector.systemd \
--collector.systemd.unit-include=.+\\.(service|socket|timer) \
--collector.buddyinfo \
--collector.meminfo_numa \
--collector.ksmd \
--collector.tcpstat \
--collector.netstat.fields=^(.*_(InErrors|InErrs)|Ip_Forwarding|Ip(6|Ext)_(InOctets|OutOctets)|Icmp6?_(InMsgs|OutMsgs)|TcpExt_(.*)|Tcp_(ActiveOpens|InSegs|OutSegs|OutR
--collector.textfile.directory=/var/lib/node-exporter \
--no-collector.nfs \
--no-collector.nfsd \
--no-collector.zfs \
--no-collector.fibrechannel \
--no-collector.infiniband \
--no-collector.ipvs \
--no-collector.nvme \
--no-collector.tapestats

Are you running node_exporter in Docker?

No.

What did you do that produced an error?

Nothing; node-exporter was running in normal mode.

What did you expect to see?

Stable operation without broken-pipe errors.

What did you see instead?

May 31 16:36:58 node-exporter[628868]: ts=2022-05-31T16:36:58.024Z caller=stdlib.go:105 level=error msg="error encoding and sending metric family: write tcp X:9100->Y:23026: write: broken pipe"

bocmanpy commented Jun 1, 2022

Debug logs
node-exporter-debug_issue.log


bocmanpy commented Jun 1, 2022

I found a similar problem, #1066, but there the root cause was losing the connection because a collector's scrape time exceeded scrape_timeout.
The longest scrape time here is:
node-exporter[1471287]: ts=2022-05-31T16:14:42.138Z caller=systemd_linux.go:200 level=debug collector=systemd msg="collectSummaryMetrics took" duration_seconds=8.4897e-05
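
(For reference, one way to rank collectors by duration from a debug log like the one attached above; this is a sketch, and the journalctl unit name and time window are assumptions that may need adjusting:)

# List the slowest collectors recorded in the node_exporter debug log.
# The unit name "node_exporter" is an assumption; adjust it to match your setup.
journalctl -u node_exporter --since "1 hour ago" \
  | grep 'msg="collector succeeded"' \
  | sed -E 's/.*name=([a-z0-9_]+) duration_seconds=([0-9.e+-]+)/\2 \1/' \
  | sort -g -r \
  | head -n 10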


ventifus commented Jun 8, 2022

This still generally looks like the scrape timing out on the Prometheus side and the connection being closed from that side. A number of the collectors look slower than I expect; is this system very heavily loaded?

May 31 16:14:42 node-exporter[1471287]: ts=2022-05-31T16:14:42.138Z caller=collector.go:173 level=debug msg="collector succeeded" name=netclass duration_seconds=0.985494687
May 31 16:14:42 node-exporter[1471287]: ts=2022-05-31T16:14:42.238Z caller=collector.go:173 level=debug msg="collector succeeded" name=textfile duration_seconds=1.183959943
May 31 16:14:42 node-exporter[1471287]: ts=2022-05-31T16:14:42.338Z caller=collector.go:173 level=debug msg="collector succeeded" name=tcpstat duration_seconds=1.283826206
May 31 16:14:42 node-exporter[1471287]: ts=2022-05-31T16:14:42.459Z caller=collector.go:173 level=debug msg="collector succeeded" name=systemd duration_seconds=1.405464851

You can try disabling some of these if you don't need them. Review the node_scrape_collector_duration_seconds metric to see which ones are slowest. If Prometheus is unable to finish scraping for extended periods of time, try retrieving those metrics manually using curl, for example curl -s http://localhost:9100/metrics | fgrep node_scrape_collector_duration_seconds. You could also try increasing Prometheus' scrape_timeout and scrape_interval.
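
(A sketch of that check, sorted so the slowest collectors come first; the host and port are the defaults from this issue and may differ on your system:)

# Rank collectors by their last scrape duration, slowest first.
curl -s http://localhost:9100/metrics \
  | grep '^node_scrape_collector_duration_seconds' \
  | sort -t' ' -k2 -g -r \
  | head -n 10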


bocmanpy commented Jun 8, 2022

Hi @ventifus, thanks for the answer.
No, the system is barely loaded in any subsystem: CPU, RAM, network, I/O.

Yep, it's a good idea to check and disable some of the collectors; I'll try to find the slowest ones and disable them.

P.S.
I ran into a case where, on a system with a 128-core CPU, the cpufreq collector went beyond scrape_timeout.
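
(If that collector isn't needed, it can be switched off the same way the nfs/zfs collectors are disabled in the flags above; a sketch, with the other existing flags omitted:)

# Disable the cpufreq collector on many-core hosts where it slows down scrapes;
# keep the rest of your existing flags as-is.
/usr/bin/node-exporter --no-collector.cpufreq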

@bocmanpy

Found out that it was a network problem. 🫣


HwiLu commented Nov 15, 2022

Found out that it was a network problem. 🫣

How did you determine that it was a network problem? Are there obvious logs that show it?


HwiLu commented Nov 15, 2022

@bocmanpy


shoce commented Nov 17, 2022

Having the same issue now on two servers. The reason is completely obscure to me.

@bocmanpy how did you find out that it was a network problem?


shoce commented Nov 18, 2022

[screenshot: graph of node_scrape_collector_duration_seconds]

I have this graph of node_scrape_collector_duration_seconds for the last two days. It looks like something changed dramatically, but all collectors got slower, not just one.


shoce commented Nov 18, 2022

[screenshot: graph of node_scrape_collector_duration_seconds for another node]

Another node's graph of node_scrape_collector_duration_seconds for the last two days.


shoce commented Nov 18, 2022

The update after which the scraping changed so much included:

update of the Linux kernel: linux-image-virtual (5.4.0.132.132) over (5.4.0.131.131)
linux-modules-5.4.0-132-generic (5.4.0-132.148)
libexpat1:amd64 (2.2.9-1ubuntu0.5) over (2.2.9-1ubuntu0.4)
kpartx (0.8.3-1ubuntu2.1) over (0.8.3-1ubuntu2)
multipath-tools (0.8.3-1ubuntu2.1) over (0.8.3-1ubuntu2)
k0s binary 1.25.4 over 1.25.3

prometheus-node-exporter stayed at 1.4.0; that night's update did not touch it


anneum commented Nov 21, 2022

We have the same problem after an update, only with a different k8s version:

linux-modules-5.4.0-132-generic (5.4.0-132.148)
libexpat1:amd64 (2.2.9-1ubuntu0.5)
kpartx (0.8.3-1ubuntu2.1)
multipath-tools (0.8.3-1ubuntu2.1)
k8s binary 1.24.6
prometheus-node-exporter 1.4.0

One of our node exporter pods randomly loses the connection with the message "error encoding and sending metric family: write tcp write: broken pipe" and restores it after a few minutes.


SuperQ commented Nov 21, 2022

@shoce That looks like #2500.

prometheus locked this issue as resolved and limited the conversation to collaborators on Nov 21, 2022.