Graphite Error: Couldn't set read deadline for connection / use of closed network connection #11429
Comments
@trauta Thank you for opening an issue. I assume the connection is getting closed after about a day. Unfortunately, the log message isn't very helpful because it omits the actual error. I created a pull request that will produce new Telegraf binaries with an improved log message. Would you mind trying the binaries posted by the Telegraf Bot to see what the error message says, so we can investigate this issue further? Thanks!
Hi @sspaink, thanks for your reply! I've copied the binary from your pull request to some development hosts (Ubuntu 18 and 20, CentOS 7, AlmaLinux 8). Once the error messages start, I will post them here.
Hi, here are the new error messages:
Does the new error message give you any more information about the problem?
In a previous test without your new debug build, I ran tcpdump on the client to capture the Telegraf traffic. After some time, the client closes the active TCP session to one carbon-c-relay server. The error messages start after that, and the client only sends metrics to the remaining carbon server. As far as I can tell, there are no hints of TCP errors in the dump.
Thank you for trying the artifacts and sharing the logs with me! I think the problem is that the Graphite output plugin only tries to reconnect if ALL of the server connections are closed or have issues. The send function doesn't return an error if at least ONE server succeeds, and when it doesn't return an error, the reconnect logic that follows is never called. If this is correct, I assume you expect the plugin to attempt to reconnect to any server that fails? The send function could be updated to track the failed servers and attempt to reconnect just those instead of waiting for all servers to fail. A config option like
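The idea of tracking failed servers and reconnecting only those can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the `pool` type, the `writer` interface, the `fakeConn` test double, and the server names are all hypothetical, and the real change in the pull request may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
)

// writer is a hypothetical stand-in for an open connection to one carbon server.
type writer interface {
	Write(p []byte) (int, error)
}

// pool holds one connection per server. A nil entry marks a server that
// needs to be redialed. All names here are illustrative assumptions.
type pool struct {
	servers []string
	conns   map[string]writer
	dial    func(addr string) (writer, error)
}

// send tries each server in order until one write succeeds. It returns the
// servers that failed along the way, and an error only if every server failed.
func (p *pool) send(batch []byte) (failed []string, err error) {
	for _, addr := range p.servers {
		c := p.conns[addr]
		if c == nil {
			failed = append(failed, addr)
			continue
		}
		if _, werr := c.Write(batch); werr != nil {
			p.conns[addr] = nil // drop the broken connection
			failed = append(failed, addr)
			continue
		}
		return failed, nil // one server accepted the batch
	}
	return failed, errors.New("could not write to any server")
}

// reconnect redials only the given servers and leaves healthy ones untouched.
func (p *pool) reconnect(failed []string) {
	for _, addr := range failed {
		c, err := p.dial(addr)
		if err != nil {
			continue // still down; try again on the next flush
		}
		p.conns[addr] = c
	}
}

// fakeConn simulates a connection that may already be closed.
type fakeConn struct{ closed bool }

func (f *fakeConn) Write(p []byte) (int, error) {
	if f.closed {
		return 0, errors.New("use of closed network connection")
	}
	return len(p), nil
}

func main() {
	p := &pool{
		servers: []string{"carbon-a:2003", "carbon-b:2003"}, // hypothetical hosts
		conns: map[string]writer{
			"carbon-a:2003": &fakeConn{closed: true},
			"carbon-b:2003": &fakeConn{},
		},
		dial: func(string) (writer, error) { return &fakeConn{}, nil },
	}
	metric := []byte("cpu.load 0.5 1656000000\n")

	failed, err := p.send(metric)
	fmt.Println("first flush:", failed, err)

	p.reconnect(failed) // runs even though send returned no error
	failed, err = p.send(metric)
	fmt.Println("second flush:", failed, err)
}
```

The point of the sketch is that `reconnect` is driven by the list of failed servers rather than by the error return of `send`, so a single broken connection is repaired on the next flush even while another server keeps accepting metrics.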
Thank you for the clarification! Your suggestion to implement logic that restores the broken connection in a timely manner sounds great! I have configured the two carbon servers for high availability. If one connection fails it is not too bad, but the many log messages are a bit annoying and lead to confusion. A new connection attempt should not cause a big load on the Graphite (carbon) servers, in my opinion a parameter like
@trauta I've created a pull request with the updated retry logic as I described: #11439. Would you be able to take the artifacts posted by the Tiger bot and see if this change works for you? I updated the connection errors to only be printed to the log if you run telegraf with
Hi, thank you for the pull request. I ran the new binary for a few days on two test servers, and the new retry logic seems to work as intended. Here is a snippet from the debug logs:
Since there are no errors in the default logs, I would say that your change solves the problem.
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf 1.23.0 (git: HEAD 806dc28) on Ubuntu 18.04 and newer & CentOS 7 and newer | carbon-c-relay v3.7.4 (2022-02-13) on receiving server
Docker
No response
Steps to reproduce
Hi,
I’m currently rolling out telegraf on all machines in our datacenter. Currently we have about 1000 telegraf installations sending their metrics via the Graphite Output Plugin to two load-balanced carbon-c-relay servers.
After running for a few hours (usually about 20 hours), the telegraf agent starts spamming this error message on every iteration.
After restarting the telegraf agent, there are no error messages for a few hours. After that, the error messages appear again on every iteration.
On the receiver side, there is no indication of any problem: no noticeable log entries, no TCP errors, and so on.
Expected behavior
No constant error messages after running telegraf for more than one day.
Actual behavior
After running telegraf for about a day, there are error messages every single iteration.
Additional info
Do you have any idea what causes these messages? Since there are no missing metrics on our Graphite Server, is there a way to suppress these log messages?
Thanks for your help in advance!