Graphite Error: Couldn't set read deadline for connection / use of closed network connection #11429

trauta · 2022-06-30T12:58:13Z

Relevant telegraf.conf

[agent]
  hostname = "HOST"
  omit_hostname = false
  interval = "60s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  logfile = ""
  debug = false
  quiet = false

...

[outputs.graphite]
graphite_tag_support = false
prefix = "telegraf"
servers = [„carbon-c-relay-server1:2003“, "carbon-c-relay-server2:2003"]
template = "host.measurement.tags.field"
timeout = 10

Logs from Telegraf

Jun 30 14:17:05 HOST telegraf[760]: 2022-06-30T12:17:05Z E! [outputs.graphite] Couldn't set read deadline for connection &{{%!s(*net.netFD=&{{{1 0 0} -1 {0} <nil> 0 0 true true false} 2 1 true tcp 0xc00037ef00 0xc00037ef30})}}. closing conn explicitly
Jun 30 14:17:05 HOST telegraf[760]: 2022-06-30T12:17:05Z E! [outputs.graphite] Graphite Error: write tcp 1.2.3.4:57410->5.6.7.8:2003: use of closed network connection
Jun 30 14:20:05 HOST telegraf[760]: 2022-06-30T12:20:05Z E! [outputs.graphite] Couldn't set read deadline for connection &{{%!s(*net.netFD=&{{{1 0 0} -1 {0} <nil> 0 0 true true false} 2 1 true tcp 0xc00037ef00 0xc00037ef30})}}. closing conn explicitly
Jun 30 14:20:05 HOST telegraf[760]: 2022-06-30T12:20:05Z E! [outputs.graphite] Graphite Error: write tcp 1.2.3.4:57410->5.6.7.8:2003: use of closed network connection
Jun 30 14:26:05 HOST telegraf[760]: 2022-06-30T12:26:05Z E! [outputs.graphite] Couldn't set read deadline for connection &{{%!s(*net.netFD=&{{{1 0 0} -1 {0} <nil> 0 0 true true false} 2 1 true tcp 0xc00037ef00 0xc00037ef30})}}. closing conn explicitly
Jun 30 14:26:05 HOST telegraf[760]: 2022-06-30T12:26:05Z E! [outputs.graphite] Graphite Error: write tcp 1.2.3.4:57410->5.6.7.8:2003: use of closed network connection
Jun 30 14:28:05 HOST telegraf[760]: 2022-06-30T12:28:05Z E! [outputs.graphite] Couldn't set read deadline for connection &{{%!s(*net.netFD=&{{{1 0 0} -1 {0} <nil> 0 0 true true false} 2 1 true tcp 0xc00037ef00 0xc00037ef30})}}. closing conn explicitly
Jun 30 14:28:05 HOST telegraf[760]: 2022-06-30T12:28:05Z E! [outputs.graphite] Graphite Error: write tcp 1.2.3.4:57410->5.6.7.8:2003: use of closed network connection
Jun 30 14:29:05 HOST telegraf[760]: 2022-06-30T12:29:05Z E! [outputs.graphite] Couldn't set read deadline for connection &{{%!s(*net.netFD=&{{{1 0 0} -1 {0} <nil> 0 0 true true false} 2 1 true tcp 0xc00037ef00 0xc00037ef30})}}. closing conn explicitly
Jun 30 14:29:05 HOST telegraf[760]: 2022-06-30T12:29:05Z E! [outputs.graphite] Graphite Error: write tcp 1.2.3.4:57410->5.6.7.8:2003: use of closed network connection
Jun 30 14:31:05 HOST telegraf[760]: 2022-06-30T12:31:05Z E! [outputs.graphite] Couldn't set read deadline for connection &{{%!s(*net.netFD=&{{{1 0 0} -1 {0} <nil> 0 0 true true false} 2 1 true tcp 0xc00037ef00 0xc00037ef30})}}. closing conn explicitly
Jun 30 14:31:05 HOST telegraf[760]: 2022-06-30T12:31:05Z E! [outputs.graphite] Graphite Error: write tcp 1.2.3.4:57410->5.6.7.8:2003: use of closed network connection

System info

Telegraf 1.23.0 (git: HEAD 806dc28) on Ubuntu 18.04 and newer & CentOS 7 and newer | carbon-c-relay v3.7.4 (2022-02-13) on receiving server

Docker

No response

Steps to reproduce

Hi,

I’m currently rolling out telegraf on all machines in our datacenter. Currently we have about 1000 telegraf installations sending their metrics via the Graphite Output Plugin to two load-balanced carbon-c-relay servers.

After running for a few hours (usually about 20 hours), the telegraf agent starts spaming the said error message every iteration.

After restarting the telegraf agent, there are no error messages for a few hours. After that, the error messages continue to appear every further iteration.

On the receiver side, there are no indications for any problems, no noticeable log entries, no TCP errors and so on.

Expected behavior

No constant error messages after running telegraf for more than one day.

Actual behavior

After running telegraf for about a day, there are error messages every single iteration.

Additional info

Do you have any idea what causes these messages? Since there are no missing metrics on our Graphite Server, is there a way to suppress these log messages?

Thanks for your help in advance!

The text was updated successfully, but these errors were encountered:

sspaink · 2022-06-30T20:28:14Z

@trauta Thank you for making an issue, so I assume that the connection might be getting closed after a day. The log message unfortunately isn't that helpful and is missing the actual error information. I created a pull request that will produce new Telegraf binaries with an improved log message. Would you mind trying the binaries posted by the Telegraf Bot to see what the error message says so we can help investigate this issue further? Thanks!

trauta · 2022-07-01T06:17:54Z

Hi @sspaink, thanks for your reply! I've copied the binary from your MR to some development hosts (Ubu 18, 20, CentOS 7, AlmaLinux 8). Once the error messages start I will post them here.

Regards,
Alexander

trauta · 2022-07-03T09:42:49Z

Hi, here are the new error messages:

# Host 1 (Ubuntu 20.04)
Jul 01 08:08:47 HOST1 telegraf[3094976]: 2022-07-01T06:08:47Z I! Starting Telegraf 1.24.0-ba285e94
Jul 01 08:08:47 HOST1 telegraf[3094976]: 2022-07-01T06:08:47Z I! Loaded inputs: cpu disk diskio kernel mem net processes swap system
Jul 01 08:08:47 HOST1 telegraf[3094976]: 2022-07-01T06:08:47Z I! Loaded aggregators:
Jul 01 08:08:47 HOST1 telegraf[3094976]: 2022-07-01T06:08:47Z I! Loaded processors:
Jul 01 08:08:47 HOST1 telegraf[3094976]: 2022-07-01T06:08:47Z I! Loaded outputs: graphite
Jul 01 08:08:47 HOST1 telegraf[3094976]: 2022-07-01T06:08:47Z I! Tags enabled: host=HOST1
Jul 01 08:08:47 HOST1 telegraf[3094976]: 2022-07-01T06:08:47Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"HOST1", Flush Interval:10s
Jul 01 12:00:08 HOST1 telegraf[3094976]: 2022-07-01T10:00:08Z E! [outputs.graphite] Conn &{{%!s(*net.netFD=&{{{0 0 0} 7 {139964603260600} <nil> 0 0 true true false} 2 1 true tcp 0xc0007b7e30 0xc0007b7e60})}} is closed. closing conn explicitly
Jul 01 12:00:08 HOST1 telegraf[3094976]: 2022-07-01T10:00:08Z E! [outputs.graphite] Graphite Error: write tcp 0.0.0.74:45120->0.0.0.211:2003: use of closed network connection
Jul 01 12:02:08 HOST1 telegraf[3094976]: 2022-07-01T10:02:08Z E! [outputs.graphite] Couldn't set read deadline for connection due to error set tcp 0.0.0.74:45120: use of closed network connection with remote address 0.0.0.211:2003. closing conn explicitly
Jul 01 12:02:08 HOST1 telegraf[3094976]: 2022-07-01T10:02:08Z E! [outputs.graphite] Graphite Error: write tcp 0.0.0.74:45120->0.0.0.211:2003: use of closed network connection
Jul 01 12:05:08 HOST1 telegraf[3094976]: 2022-07-01T10:05:08Z E! [outputs.graphite] Couldn't set read deadline for connection due to error set tcp 0.0.0.74:45120: use of closed network connection with remote address 0.0.0.211:2003. closing conn explicitly

# Host 2 (AlmaLinux 8)
Jul 01 08:10:17 HOST2 telegraf[523334]: 2022-07-01T06:10:17Z I! Starting Telegraf 1.24.0-ba285e94
Jul 01 08:10:17 HOST2 telegraf[523334]: 2022-07-01T06:10:17Z I! Loaded inputs: cpu disk diskio kernel mem net processes swap system
Jul 01 08:10:17 HOST2 telegraf[523334]: 2022-07-01T06:10:17Z I! Loaded aggregators:
Jul 01 08:10:17 HOST2 telegraf[523334]: 2022-07-01T06:10:17Z I! Loaded processors:
Jul 01 08:10:17 HOST2 telegraf[523334]: 2022-07-01T06:10:17Z I! Loaded outputs: graphite
Jul 01 08:10:17 HOST2 telegraf[523334]: 2022-07-01T06:10:17Z I! Tags enabled: host=HOST2
Jul 01 08:10:17 HOST2 telegraf[523334]: 2022-07-01T06:10:17Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"HOST2", Flush Interval:10s
Jul 01 21:52:00 HOST2 telegraf[523334]: 2022-07-01T19:52:00Z E! [outputs.graphite] Conn &{{%!s(*net.netFD=&{{{0 0 0} 8 {139970770851656} <nil> 0 0 true true false} 2 1 true tcp 0xc000978390 0xc0009783c0})}} is closed. closing conn explicitly
Jul 01 21:52:00 HOST2 telegraf[523334]: 2022-07-01T19:52:00Z E! [outputs.graphite] Graphite Error: write tcp 0.0.0.184:55126->0.0.0.212:2003: use of closed network connection
Jul 01 21:55:00 HOST2 telegraf[523334]: 2022-07-01T19:55:00Z E! [outputs.graphite] Couldn't set read deadline for connection due to error set tcp 0.0.0.184:55126: use of closed network connection with remote address 0.0.0.212:2003. closing conn explicitly
Jul 01 21:55:00 HOST2 telegraf[523334]: 2022-07-01T19:55:00Z E! [outputs.graphite] Graphite Error: write tcp 0.0.0.184:55126->0.0.0.212:2003: use of closed network connection
Jul 01 21:56:00 HOST2 telegraf[523334]: 2022-07-01T19:56:00Z E! [outputs.graphite] Couldn't set read deadline for connection due to error set tcp 0.0.0.184:55126: use of closed network connection with remote address 0.0.0.212:2003. closing conn explicitly
Jul 01 21:56:00 HOST2 telegraf[523334]: 2022-07-01T19:56:00Z E! [outputs.graphite] Graphite Error: write tcp 0.0.0.184:55126->0.0.0.212:2003: use of closed network connection
Jul 01 21:58:00 HOST2 telegraf[523334]: 2022-07-01T19:58:00Z E! [outputs.graphite] Couldn't set read deadline for connection due to error set tcp 0.0.0.184:55126: use of closed network connection with remote address 0.0.0.212:2003. closing conn explicitly

Does the new error message give you any more information about the problem?

trauta · 2022-07-03T09:51:12Z

In a previous test without your new debug build, I've ran a tcpdump on the client to capture the telegraf traffic.

The client closes the active TCP session to one carbon-c-relay server after some time, after this the error messages started and the client only send the metrics to the other remaining carbon server. There are no hints of TCP errors in the dump as far as I can tell.

sspaink · 2022-07-05T19:39:26Z

Thank you for trying the artifacts and sharing the logs with me! So, I think the problem is that the Graphite output plugin will only try reconnecting if ALL of the server connections are closed or have issues. This happens because the send function doesn't return an error if at least ONE server succeeds, and then if this function doesn't return an error the following reconnect logic doesn't get called.

So if this is correct, I assume you expect the plugin to attempt to reconnect to any server if it fails? The send function could be updated to track the failed servers and attempt to reconnect just those instead of waiting for all servers to fail. A config option like retry_delay similar to https://github.com/influxdata/telegraf/tree/master/plugins/inputs/jti_openconfig_telemetry will need to be added so we don't accidentally overload the failed server with retries. Does this make sense? I personally haven't used Graphite at all so want to make sure there isn't anything I am missing. I did also notice there are more errors being ignored that the plugin probably should log.

trauta · 2022-07-07T08:21:57Z

Thank you for the clarification!

Your suggestion to implement a logic that restores the broken connection in a timely manner sounds great! I have configured the two carbon servers to realize a high-availability scenario. If one connection fails it is not too bad, but the many log messages are a bit annoying and lead to confusion. A new connection attempt should not cause a big load on the Graphite (carbon) servers, in my opinion a parameter like retry_delay is not mandatory, but would not bother.

sspaink · 2022-07-12T20:46:47Z

@trauta I've created a pull request with the updated retry logic as I described: #11439 would you be able to take the artifacts posted by the Tiger bot and see if this change works for you? I updated the connection errors to only be printed to the log if you run telegraf with --debug so it should be less noisy as well, but if you notice issues perhaps try running it with --debug to learn more.

trauta · 2022-07-18T11:00:14Z

Hi, thank you for the merge request. I ran the new binary for a few days on two test servers, the new retry logic seem to work as intended. Here is a snippet from the debug logs:

2022-07-18T09:14:01Z D! [outputs.graphite] Graphite Error: write tcp 1.2.3.184:43960->1.2.3.211:2003: use of closed network connection
2022-07-18T09:14:01Z D! [outputs.graphite] Failed to close the connection: close tcp 1.2.3.184:43960->1.2.3.211:2003: use of closed network connection
2022-07-18T09:14:01Z D! [outputs.graphite] Graphite: Reconnecting and retrying for the following servers: 1.2.3.211:2003

Since there are no errors in the default logs, I would say that your change solves the problem.

trauta added the bug unexpected problem or unintended behavior label Jun 30, 2022

sspaink mentioned this issue Jun 30, 2022

feat(outputs.graphite): Retry connecting to servers with failed send attempts #11439

Merged

sspaink self-assigned this Jul 5, 2022

powersj closed this as completed in #11439 Jul 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Graphite Error: Couldn't set read deadline for connection / use of closed network connection #11429

Graphite Error: Couldn't set read deadline for connection / use of closed network connection #11429

trauta commented Jun 30, 2022

sspaink commented Jun 30, 2022

trauta commented Jul 1, 2022

trauta commented Jul 3, 2022

trauta commented Jul 3, 2022

sspaink commented Jul 5, 2022 •

edited

trauta commented Jul 7, 2022

sspaink commented Jul 12, 2022

trauta commented Jul 18, 2022

Graphite Error: Couldn't set read deadline for connection / use of closed network connection #11429

Graphite Error: Couldn't set read deadline for connection / use of closed network connection #11429

Comments

trauta commented Jun 30, 2022

Relevant telegraf.conf

Logs from Telegraf

System info

Docker

Steps to reproduce

Expected behavior

Actual behavior

Additional info

sspaink commented Jun 30, 2022

trauta commented Jul 1, 2022

trauta commented Jul 3, 2022

trauta commented Jul 3, 2022

sspaink commented Jul 5, 2022 • edited

trauta commented Jul 7, 2022

sspaink commented Jul 12, 2022

trauta commented Jul 18, 2022

sspaink commented Jul 5, 2022 •

edited