Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graphite Error: Couldn't set read deadline for connection / use of closed network connection #11429

Closed
trauta opened this issue Jun 30, 2022 · 8 comments · Fixed by #11439
Closed
Assignees
Labels
bug unexpected problem or unintended behavior

Comments

@trauta
Copy link

trauta commented Jun 30, 2022

Relevant telegraf.conf

[agent]
  hostname = "HOST"
  omit_hostname = false
  interval = "60s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  logfile = ""
  debug = false
  quiet = false

...

[outputs.graphite]
graphite_tag_support = false
prefix = "telegraf"
servers = [„carbon-c-relay-server1:2003“, "carbon-c-relay-server2:2003"]
template = "host.measurement.tags.field"
timeout = 10

Logs from Telegraf

Jun 30 14:17:05 HOST telegraf[760]: 2022-06-30T12:17:05Z E! [outputs.graphite] Couldn't set read deadline for connection &{{%!s(*net.netFD=&{{{1 0 0} -1 {0} <nil> 0 0 true true false} 2 1 true tcp 0xc00037ef00 0xc00037ef30})}}. closing conn explicitly
Jun 30 14:17:05 HOST telegraf[760]: 2022-06-30T12:17:05Z E! [outputs.graphite] Graphite Error: write tcp 1.2.3.4:57410->5.6.7.8:2003: use of closed network connection
Jun 30 14:20:05 HOST telegraf[760]: 2022-06-30T12:20:05Z E! [outputs.graphite] Couldn't set read deadline for connection &{{%!s(*net.netFD=&{{{1 0 0} -1 {0} <nil> 0 0 true true false} 2 1 true tcp 0xc00037ef00 0xc00037ef30})}}. closing conn explicitly
Jun 30 14:20:05 HOST telegraf[760]: 2022-06-30T12:20:05Z E! [outputs.graphite] Graphite Error: write tcp 1.2.3.4:57410->5.6.7.8:2003: use of closed network connection
Jun 30 14:26:05 HOST telegraf[760]: 2022-06-30T12:26:05Z E! [outputs.graphite] Couldn't set read deadline for connection &{{%!s(*net.netFD=&{{{1 0 0} -1 {0} <nil> 0 0 true true false} 2 1 true tcp 0xc00037ef00 0xc00037ef30})}}. closing conn explicitly
Jun 30 14:26:05 HOST telegraf[760]: 2022-06-30T12:26:05Z E! [outputs.graphite] Graphite Error: write tcp 1.2.3.4:57410->5.6.7.8:2003: use of closed network connection
Jun 30 14:28:05 HOST telegraf[760]: 2022-06-30T12:28:05Z E! [outputs.graphite] Couldn't set read deadline for connection &{{%!s(*net.netFD=&{{{1 0 0} -1 {0} <nil> 0 0 true true false} 2 1 true tcp 0xc00037ef00 0xc00037ef30})}}. closing conn explicitly
Jun 30 14:28:05 HOST telegraf[760]: 2022-06-30T12:28:05Z E! [outputs.graphite] Graphite Error: write tcp 1.2.3.4:57410->5.6.7.8:2003: use of closed network connection
Jun 30 14:29:05 HOST telegraf[760]: 2022-06-30T12:29:05Z E! [outputs.graphite] Couldn't set read deadline for connection &{{%!s(*net.netFD=&{{{1 0 0} -1 {0} <nil> 0 0 true true false} 2 1 true tcp 0xc00037ef00 0xc00037ef30})}}. closing conn explicitly
Jun 30 14:29:05 HOST telegraf[760]: 2022-06-30T12:29:05Z E! [outputs.graphite] Graphite Error: write tcp 1.2.3.4:57410->5.6.7.8:2003: use of closed network connection
Jun 30 14:31:05 HOST telegraf[760]: 2022-06-30T12:31:05Z E! [outputs.graphite] Couldn't set read deadline for connection &{{%!s(*net.netFD=&{{{1 0 0} -1 {0} <nil> 0 0 true true false} 2 1 true tcp 0xc00037ef00 0xc00037ef30})}}. closing conn explicitly
Jun 30 14:31:05 HOST telegraf[760]: 2022-06-30T12:31:05Z E! [outputs.graphite] Graphite Error: write tcp 1.2.3.4:57410->5.6.7.8:2003: use of closed network connection

System info

Telegraf 1.23.0 (git: HEAD 806dc28) on Ubuntu 18.04 and newer & CentOS 7 and newer | carbon-c-relay v3.7.4 (2022-02-13) on receiving server

Docker

No response

Steps to reproduce

Hi,

I’m currently rolling out telegraf on all machines in our datacenter. Currently we have about 1000 telegraf installations sending their metrics via the Graphite Output Plugin to two load-balanced carbon-c-relay servers.

After running for a few hours (usually about 20 hours), the telegraf agent starts spaming the said error message every iteration.

After restarting the telegraf agent, there are no error messages for a few hours. After that, the error messages continue to appear every further iteration.

On the receiver side, there are no indications for any problems, no noticeable log entries, no TCP errors and so on.

Expected behavior

No constant error messages after running telegraf for more than one day.

Actual behavior

After running telegraf for about a day, there are error messages every single iteration.

Additional info

Do you have any idea what causes these messages? Since there are no missing metrics on our Graphite Server, is there a way to suppress these log messages?

Thanks for your help in advance!

@trauta trauta added the bug unexpected problem or unintended behavior label Jun 30, 2022
@sspaink
Copy link
Contributor

sspaink commented Jun 30, 2022

@trauta Thank you for making an issue, so I assume that the connection might be getting closed after a day. The log message unfortunately isn't that helpful and is missing the actual error information. I created a pull request that will produce new Telegraf binaries with an improved log message. Would you mind trying the binaries posted by the Telegraf Bot to see what the error message says so we can help investigate this issue further? Thanks!

@trauta
Copy link
Author

trauta commented Jul 1, 2022

Hi @sspaink, thanks for your reply! I've copied the binary from your MR to some development hosts (Ubu 18, 20, CentOS 7, AlmaLinux 8). Once the error messages start I will post them here.

Regards,
Alexander

@trauta
Copy link
Author

trauta commented Jul 3, 2022

Hi, here are the new error messages:

# Host 1 (Ubuntu 20.04)
Jul 01 08:08:47 HOST1 telegraf[3094976]: 2022-07-01T06:08:47Z I! Starting Telegraf 1.24.0-ba285e94
Jul 01 08:08:47 HOST1 telegraf[3094976]: 2022-07-01T06:08:47Z I! Loaded inputs: cpu disk diskio kernel mem net processes swap system
Jul 01 08:08:47 HOST1 telegraf[3094976]: 2022-07-01T06:08:47Z I! Loaded aggregators:
Jul 01 08:08:47 HOST1 telegraf[3094976]: 2022-07-01T06:08:47Z I! Loaded processors:
Jul 01 08:08:47 HOST1 telegraf[3094976]: 2022-07-01T06:08:47Z I! Loaded outputs: graphite
Jul 01 08:08:47 HOST1 telegraf[3094976]: 2022-07-01T06:08:47Z I! Tags enabled: host=HOST1
Jul 01 08:08:47 HOST1 telegraf[3094976]: 2022-07-01T06:08:47Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"HOST1", Flush Interval:10s
Jul 01 12:00:08 HOST1 telegraf[3094976]: 2022-07-01T10:00:08Z E! [outputs.graphite] Conn &{{%!s(*net.netFD=&{{{0 0 0} 7 {139964603260600} <nil> 0 0 true true false} 2 1 true tcp 0xc0007b7e30 0xc0007b7e60})}} is closed. closing conn explicitly
Jul 01 12:00:08 HOST1 telegraf[3094976]: 2022-07-01T10:00:08Z E! [outputs.graphite] Graphite Error: write tcp 0.0.0.74:45120->0.0.0.211:2003: use of closed network connection
Jul 01 12:02:08 HOST1 telegraf[3094976]: 2022-07-01T10:02:08Z E! [outputs.graphite] Couldn't set read deadline for connection due to error set tcp 0.0.0.74:45120: use of closed network connection with remote address 0.0.0.211:2003. closing conn explicitly
Jul 01 12:02:08 HOST1 telegraf[3094976]: 2022-07-01T10:02:08Z E! [outputs.graphite] Graphite Error: write tcp 0.0.0.74:45120->0.0.0.211:2003: use of closed network connection
Jul 01 12:05:08 HOST1 telegraf[3094976]: 2022-07-01T10:05:08Z E! [outputs.graphite] Couldn't set read deadline for connection due to error set tcp 0.0.0.74:45120: use of closed network connection with remote address 0.0.0.211:2003. closing conn explicitly
# Host 2 (AlmaLinux 8)
Jul 01 08:10:17 HOST2 telegraf[523334]: 2022-07-01T06:10:17Z I! Starting Telegraf 1.24.0-ba285e94
Jul 01 08:10:17 HOST2 telegraf[523334]: 2022-07-01T06:10:17Z I! Loaded inputs: cpu disk diskio kernel mem net processes swap system
Jul 01 08:10:17 HOST2 telegraf[523334]: 2022-07-01T06:10:17Z I! Loaded aggregators:
Jul 01 08:10:17 HOST2 telegraf[523334]: 2022-07-01T06:10:17Z I! Loaded processors:
Jul 01 08:10:17 HOST2 telegraf[523334]: 2022-07-01T06:10:17Z I! Loaded outputs: graphite
Jul 01 08:10:17 HOST2 telegraf[523334]: 2022-07-01T06:10:17Z I! Tags enabled: host=HOST2
Jul 01 08:10:17 HOST2 telegraf[523334]: 2022-07-01T06:10:17Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"HOST2", Flush Interval:10s
Jul 01 21:52:00 HOST2 telegraf[523334]: 2022-07-01T19:52:00Z E! [outputs.graphite] Conn &{{%!s(*net.netFD=&{{{0 0 0} 8 {139970770851656} <nil> 0 0 true true false} 2 1 true tcp 0xc000978390 0xc0009783c0})}} is closed. closing conn explicitly
Jul 01 21:52:00 HOST2 telegraf[523334]: 2022-07-01T19:52:00Z E! [outputs.graphite] Graphite Error: write tcp 0.0.0.184:55126->0.0.0.212:2003: use of closed network connection
Jul 01 21:55:00 HOST2 telegraf[523334]: 2022-07-01T19:55:00Z E! [outputs.graphite] Couldn't set read deadline for connection due to error set tcp 0.0.0.184:55126: use of closed network connection with remote address 0.0.0.212:2003. closing conn explicitly
Jul 01 21:55:00 HOST2 telegraf[523334]: 2022-07-01T19:55:00Z E! [outputs.graphite] Graphite Error: write tcp 0.0.0.184:55126->0.0.0.212:2003: use of closed network connection
Jul 01 21:56:00 HOST2 telegraf[523334]: 2022-07-01T19:56:00Z E! [outputs.graphite] Couldn't set read deadline for connection due to error set tcp 0.0.0.184:55126: use of closed network connection with remote address 0.0.0.212:2003. closing conn explicitly
Jul 01 21:56:00 HOST2 telegraf[523334]: 2022-07-01T19:56:00Z E! [outputs.graphite] Graphite Error: write tcp 0.0.0.184:55126->0.0.0.212:2003: use of closed network connection
Jul 01 21:58:00 HOST2 telegraf[523334]: 2022-07-01T19:58:00Z E! [outputs.graphite] Couldn't set read deadline for connection due to error set tcp 0.0.0.184:55126: use of closed network connection with remote address 0.0.0.212:2003. closing conn explicitly

Does the new error message give you any more information about the problem?

@trauta
Copy link
Author

trauta commented Jul 3, 2022

In a previous test without your new debug build, I've ran a tcpdump on the client to capture the telegraf traffic.

The client closes the active TCP session to one carbon-c-relay server after some time, after this the error messages started and the client only send the metrics to the other remaining carbon server. There are no hints of TCP errors in the dump as far as I can tell.

telegraf_tcpdump

@sspaink sspaink self-assigned this Jul 5, 2022
@sspaink
Copy link
Contributor

sspaink commented Jul 5, 2022

Thank you for trying the artifacts and sharing the logs with me! So, I think the problem is that the Graphite output plugin will only try reconnecting if ALL of the server connections are closed or have issues. This happens because the send function doesn't return an error if at least ONE server succeeds, and then if this function doesn't return an error the following reconnect logic doesn't get called.

So if this is correct, I assume you expect the plugin to attempt to reconnect to any server if it fails? The send function could be updated to track the failed servers and attempt to reconnect just those instead of waiting for all servers to fail. A config option like retry_delay similar to https://github.com/influxdata/telegraf/tree/master/plugins/inputs/jti_openconfig_telemetry will need to be added so we don't accidentally overload the failed server with retries. Does this make sense? I personally haven't used Graphite at all so want to make sure there isn't anything I am missing. I did also notice there are more errors being ignored that the plugin probably should log.

@trauta
Copy link
Author

trauta commented Jul 7, 2022

Thank you for the clarification!

Your suggestion to implement a logic that restores the broken connection in a timely manner sounds great! I have configured the two carbon servers to realize a high-availability scenario. If one connection fails it is not too bad, but the many log messages are a bit annoying and lead to confusion. A new connection attempt should not cause a big load on the Graphite (carbon) servers, in my opinion a parameter like retry_delay is not mandatory, but would not bother.

@sspaink
Copy link
Contributor

sspaink commented Jul 12, 2022

@trauta I've created a pull request with the updated retry logic as I described: #11439 would you be able to take the artifacts posted by the Tiger bot and see if this change works for you? I updated the connection errors to only be printed to the log if you run telegraf with --debug so it should be less noisy as well, but if you notice issues perhaps try running it with --debug to learn more.

@trauta
Copy link
Author

trauta commented Jul 18, 2022

Hi, thank you for the merge request. I ran the new binary for a few days on two test servers, the new retry logic seem to work as intended. Here is a snippet from the debug logs:

2022-07-18T09:14:01Z D! [outputs.graphite] Graphite Error: write tcp 1.2.3.184:43960->1.2.3.211:2003: use of closed network connection
2022-07-18T09:14:01Z D! [outputs.graphite] Failed to close the connection: close tcp 1.2.3.184:43960->1.2.3.211:2003: use of closed network connection
2022-07-18T09:14:01Z D! [outputs.graphite] Graphite: Reconnecting and retrying for the following servers: 1.2.3.211:2003

Since there are no errors in the default logs, I would say that your change solves the problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants