Error while sending encoded metrics: write tcp ... write: broken pipe #1066
Try with
My apologies for not mentioning something,
That's fine, but the context of the error is important to understand the timing. Having the full log for a failed request is required to determine the fault.
Understood. Here's the full log with the error message included:
I noticed this issue seems to be happening on CentOS 7 only, because I have the same node_exporter version running on Ubuntu and it works fine. Could it be kernel related? EDIT: For anyone wondering what the kernels are for both CentOS & Ubuntu, here they are:
It looks like the filesystem collector is having problems on this machine. It's taking 10 seconds to collect data, which is the default timeout.
Are there network filesystems mounted on this machine? Or anything strange going on with the filesystems?
@SuperQ Disabling the filesystem collector seems to have fixed the issue. There is a network filesystem mounted on the CentOS machine, and removing it also fixes the problem. It would be nice if the error message could point out which collector is taking the longest; that would avoid having to pull the debug log and scan through it to find the collector that's taking too long. Thanks for the help!
It's not the default because they are technically valid filesystems, but I would recommend excluding the network filesystem types with
I'm not sure it's easily possible to capture what caused the scrape failure, because the request gets canceled. It could be a combination of collectors, rather than a single obvious one like in this case.
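As an aside, the exclusion idea amounts to filtering mounts by filesystem type before the collector ever touches them. A minimal sketch of that, with a made-up type list and helper names (this is not the actual node_exporter code):

```go
package collector

import (
	"log"
	"regexp"
)

// netFSTypes matches a few common network filesystem types. The pattern here
// is only an example; the real exclusion list depends on the machine.
var netFSTypes = regexp.MustCompile(`^(nfs|nfs4|cifs|smbfs|fuse\.sshfs)$`)

type mount struct {
	mountPoint string
	fsType     string
}

// filterMounts drops mounts whose filesystem type matches the exclusion
// pattern, so a hung network mount is never statted during a scrape.
func filterMounts(mounts []mount) []mount {
	kept := make([]mount, 0, len(mounts))
	for _, m := range mounts {
		if netFSTypes.MatchString(m.fsType) {
			log.Printf("skipping %s (%s): excluded filesystem type", m.mountPoint, m.fsType)
			continue
		}
		kept = append(kept, m)
	}
	return kept
}
```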
@SuperQ Any idea why it timed out yet logged success?
No, I don't have any idea; I haven't really looked at the code closely.
@SuperQ Looking at the code, it seems the mountTimeout is set to 30 seconds. If that were to time out, it should also log something, so I'm not sure what is going on here.
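For context on what a mount timeout like that guards against: the usual pattern is to run the blocking statfs(2) call in a goroutine and give up after a deadline, logging the stuck mount point. A rough sketch of that pattern (illustrative only, not the actual node_exporter code):

```go
package collector

import (
	"fmt"
	"log"
	"time"

	"golang.org/x/sys/unix"
)

// statWithTimeout runs the blocking statfs(2) call in a goroutine and gives
// up after mountTimeout, so a single hung NFS mount cannot stall the whole
// scrape. A timeout is logged and returned as an error.
func statWithTimeout(mountpoint string, mountTimeout time.Duration) (unix.Statfs_t, error) {
	type result struct {
		buf unix.Statfs_t
		err error
	}
	ch := make(chan result, 1) // buffered so the goroutine can exit even after a timeout
	go func() {
		var buf unix.Statfs_t
		err := unix.Statfs(mountpoint, &buf)
		ch <- result{buf, err}
	}()
	select {
	case r := <-ch:
		return r.buf, r.err
	case <-time.After(mountTimeout):
		log.Printf("mount %s is not responding after %s, skipping", mountpoint, mountTimeout)
		return unix.Statfs_t{}, fmt.Errorf("timed out collecting stats for %s", mountpoint)
	}
}
```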
The timeout hit here was the Prometheus scrape timeout of 10 seconds. Prometheus closed the connection to the node_exporter, hence the original error. The Prometheus client library has methods to pass through the server's timeout value, and we could pass this on to collectors. But I don't think it would help fix this problem: the filesystem read is deadlocked on IO anyway, so the timeout would still result in an error.
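For reference, Prometheus advertises its scrape timeout to the target in the X-Prometheus-Scrape-Timeout-Seconds request header, so a handler could in principle turn that into a context deadline for the collectors. A minimal sketch of that wiring (hypothetical, not what node_exporter did at the time):

```go
package handler

import (
	"context"
	"net/http"
	"strconv"
	"time"
)

// withScrapeTimeout derives a per-request deadline from the header Prometheus
// sends with every scrape and hands the resulting context to the next handler,
// which could pass it on to the collectors.
func withScrapeTimeout(next func(ctx context.Context, w http.ResponseWriter, r *http.Request)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx := r.Context()
		if v := r.Header.Get("X-Prometheus-Scrape-Timeout-Seconds"); v != "" {
			if seconds, err := strconv.ParseFloat(v, 64); err == nil {
				var cancel context.CancelFunc
				ctx, cancel = context.WithTimeout(ctx, time.Duration(seconds*float64(time.Second)))
				defer cancel()
			}
		}
		next(ctx, w, r)
	}
}
```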
Ah, got it, so the client closed the connection while the exporter was writing to it in the filesystem collector. The log is a bit confusing; maybe we could include a stack trace. Anyway, the actual problem here, IMO, is that it logged success. But I think this is due to the client_golang design, where there is no way to detect an error when sending something to the collect channel.
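To make that client_golang point concrete: a collector's Collect method only sends metrics into a channel and has no return value, so a send cannot tell the collector that the HTTP write on the other side has already failed; the closest thing to an error path is sending an invalid metric for errors the collector itself detects. A sketch with a made-up metric and helper:

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// exampleDesc is a hypothetical metric used only for illustration.
var exampleDesc = prometheus.NewDesc(
	"example_filesystem_free_bytes",
	"Free bytes on a filesystem (illustrative).",
	[]string{"mountpoint"},
	nil,
)

type fsCollector struct{}

func (c fsCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- exampleDesc
}

// Collect sends metrics into the channel; the send itself can never fail or
// report that the scrape was cancelled, which is why the exporter can log
// success even though the client has gone away. Sending an invalid metric is
// the only way to surface an error the collector detects on its own.
func (c fsCollector) Collect(ch chan<- prometheus.Metric) {
	free, err := readFreeBytes("/mnt/example") // hypothetical helper
	if err != nil {
		ch <- prometheus.NewInvalidMetric(exampleDesc, err)
		return
	}
	ch <- prometheus.MustNewConstMetric(exampleDesc, prometheus.GaugeValue, free, "/mnt/example")
}

// readFreeBytes is a stub; a real implementation would call statfs(2).
func readFreeBytes(mountpoint string) (float64, error) {
	return 0, nil
}
```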
I think there isn't much more we can do here, so closing. |
Host operating system: output of `uname -a`
node_exporter version: output of `node_exporter --version`
node_exporter command line flags
Are you running node_exporter in Docker?
No.
What did you do that produced an error?
Every time the metrics are viewed through a browser (Firefox) or Prometheus tries to scrape the metrics.
What did you expect to see?
To return the metrics in a timely manner and not report broken-pipe errors.
What did you see instead?
Every time the metrics are viewed, this error is produced:
I noticed viewing the metrics took longer than before (it worked fine before Sept. 6th, and nothing was changed around that time).
For anyone curious, Prometheus reports this error:
context deadline exceeded
Tried testing it through the `curl` command and it does return the output, but it took longer than usual (at least 10 seconds, instead of 1-3 seconds). However, the error is still reported by the node_exporter service.