Allow configuring idle incoming connections timeout #330

akqopensystems · 2018-04-26T15:17:56Z

Our setup:

--------------------------    ------------------    ---------------------    ----------------
| Client Application     | -> | Relay1         | -> | Relay2            | -> | Carbon Cache |
| Icinga2 GraphiteWriter |    | localhost:2013 |    | GraphiteHost:2013 |    | 6 Instances  |
--------------------------    ------------------    ---------------------    ----------------

The client application and relay1 are on the same host, as are relay2 and the carbon cache.
The client application monitors remote servers and generates on average 200M metrics per 24h. We noticed that the client frequently receives a "Connection reset by peer" error while talking to relay1:

[2018-04-26 14:35:04 +0200] critical/Socket: send() failed with error code 104, "Connection reset by peer"
Context:
        (0) Processing check result for 'xxxpwphy!ram'

[2018-04-26 14:35:04 +0200] critical/GraphiteWriter: Cannot write to TCP socket on host '127.0.0.1' port '2013'.
Context:
        (0) Processing check result for 'xxxpwphy!ram'

[2018-04-26 14:35:04 +0200] critical/GraphiteWriter: Exception during Graphite operation: Verify that your backend is operational!
[...]
[2018-04-26 14:35:13 +0200] information/GraphiteWriter: Finished reconnecting to Graphite in 0.000180006 second(s).

The carbon-c-relay process is running with the following configuration:

[root@xxxlpmo02 ~]# tail -n1 /etc/sysconfig/carbon-c-relay
ARGS="-i 127.0.0.1 -p 2013 -w 6 -b 2500 -B 512 -U 524288 -q 1000000000"
[root@xxxlpmo02 ~]# grep -v "^#" /etc/carbon-c-relay.conf

cluster graphite-prod
  forward
    xxxlpmo01.xxx.de:2013
    proto tcp
  ;

match *
  send to graphite-prod
  stop
  ;
[root@xxxlpmo02 ~]# rpm -qa |grep carbon-c
carbon-c-relay-2.6-1.el7.x86_64
[root@osszplpmo02 ~]# uname -r
3.10.0-693.17.1.el7.x86_64

We sniffed the communication between the client and the relay and noticed that the relay closes the connection. Directly afterwards, the client tries to send data on the already closed TCP session, which only gets a [RST, ACK] reply from the network stack. We are also discussing this with the Icinga2 team in Icinga/icinga2#6261.
From the discussion in issue #118 we gathered that there is an idle timeout in carbon-c-relay to safeguard the relay against possible connection leaks by clients.
Is it likely that the client doesn't handle the idle timeout correctly and this leads to our problem as described in the Icinga2 issue? And would it be possible to make the idle timeout configurable?


grobian · 2018-04-26T16:20:17Z

Yes, the relay closes connections when they are "idle".

#define IDLE_DISCONNECT_TIME  (10 * 60 * 1000 * 1000)  /* 10 minutes */

This was indeed necessary to prevent the relay from becoming unreachable due to (Windows) clients which opened a new connection for every metric and never closed any.
The question here is: are you idling for 10 minutes, or is the connection closed because some read error was encountered?
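
To make the arithmetic in that define concrete, here is a minimal sketch of an idle check. It is not carbon-c-relay's actual code: the function name `should_disconnect` and the bookkeeping of a last-activity timestamp are assumptions for illustration, and only the constant is taken from the define above.

```python
# Illustrative sketch only, not the relay's implementation.
# The constant mirrors the define quoted above: 10 minutes in microseconds.
IDLE_DISCONNECT_TIME_US = 10 * 60 * 1000 * 1000


def should_disconnect(last_activity_us: int, now_us: int,
                      idle_limit_us: int = IDLE_DISCONNECT_TIME_US) -> bool:
    """Return True when nothing was received for longer than the idle limit.

    Both timestamps are in microseconds on the same clock (hypothetical
    bookkeeping; the relay may track activity differently).
    """
    return now_us - last_activity_us > idle_limit_us
```

With these numbers, a client that has been silent for 600,000,000 microseconds is still within the limit, and one microsecond more would mark it as idle.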

Totally off-topic, but if you can send over UDP, that might be an option: IIRC the Linux kernel guarantees that UDP packets arrive for localhost.

akqopensystems · 2018-04-27T13:52:25Z

Unfortunately, we cannot use UDP. We also tested a setup where the client writes to carbon-relay, which forwards to carbon-c-relay on localhost, but carbon-relay also had trouble with the connection:

27/04/2018 10:34:22 :: CarbonClientProtocol(127.0.0.1:2013:None)::connectionMade
27/04/2018 10:34:23 :: CarbonClientProtocol(127.0.0.1:2013:None)::connectionLost Connection to the other side was lost in a non-clean fashion.
27/04/2018 10:34:23 :: CarbonClientFactory(127.0.0.1:2013:None)::clientConnectionLost (127.0.0.1:2013) Connection to the other side was lost in a non-clean fashion.
27/04/2018 10:34:27 :: CarbonClientFactory(127.0.0.1:2013:None)::startedConnecting (127.0.0.1:2013)
27/04/2018 10:34:27 :: CarbonClientProtocol(127.0.0.1:2013:None)::connectionMade
27/04/2018 10:34:27 :: CarbonClientProtocol(127.0.0.1:2013:None)::connectionLost Connection to the other side was lost in a non-clean fashion.
27/04/2018 10:34:27 :: CarbonClientFactory(127.0.0.1:2013:None)::clientConnectionLost (127.0.0.1:2013) Connection to the other side was lost in a non-clean fashion.
27/04/2018 10:34:31 :: CarbonClientFactory(127.0.0.1:2013:None)::startedConnecting (127.0.0.1:2013)

So for the moment, we've switched to carbon-relay-ng on the monitoring hosts, relaying to carbon-c-relay on the Graphite host. Unfortunately, our test environment did not experience this error, so it seems to be triggered only under production load.

grobian · 2018-04-27T19:09:49Z

I'd be happy to understand better what's going wrong here. Is it that the connection gets closed and the client (icinga2) is not very happy about this?

grobian · 2018-04-27T19:15:05Z

FYI: carbon-relay-ng does NOT time out connections at the moment, but it may get that feature at some point: grafana/carbon-relay-ng#250 (comment)

If disconnects are the problem here, it means your problem would be back.

akqopensystems · 2018-04-30T08:05:46Z

It seems that the client doesn't notice that the connection is closed and tries to send data anyway. It then gets a very clear answer from the TCP stack and only at that point notices that the connection is already closed. Unfortunately, the data that should have been sent by then gets dropped, which leads to large gaps in our Whisper files.
It seems that the client has so far only been tested with carbon-relay and carbon-relay-ng. We would be happy to stay with carbon-c-relay, as we already know that our Graphite host can be shut down for a few days without loss of data. Unfortunately, we needed a quick solution, as the previous situation was a big blocker for our Icinga2 migration project.
We would be very happy if the idle timeout default could be overridden with a config option. One small gap in the graphs once a week would not hurt us. But the earlier graphs were not acceptable.

Thanks for your effort!
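
The failure mode described above (a send fails on a connection the peer already closed, and the pending metric is dropped) can be avoided on the sender side by keeping the line and retrying it on a fresh connection. The following is a minimal sketch of that idea, not Icinga2's GraphiteWriter: the class name `ResendingWriter`, the `connect` factory and `max_attempts` are all hypothetical names introduced for illustration.

```python
import socket
from typing import Callable, Optional


class ResendingWriter:
    """Retry a line on a new connection when sending it raises an error.

    Hypothetical helper for illustration; `connect` is any callable that
    returns a connected socket.
    """

    def __init__(self, connect: Callable[[], socket.socket],
                 max_attempts: int = 3) -> None:
        self._connect = connect
        self._max_attempts = max_attempts
        self._sock: Optional[socket.socket] = None

    def send_line(self, line: bytes) -> None:
        last_error: Optional[OSError] = None
        for _ in range(self._max_attempts):
            try:
                if self._sock is None:
                    self._sock = self._connect()
                self._sock.sendall(line)
                return
            except OSError as exc:
                # Covers ECONNRESET, EPIPE and failed connects: keep the
                # line, discard the socket, and try again from scratch.
                last_error = exc
                self._drop_connection()
        raise ConnectionError("giving up after repeated send failures") from last_error

    def _drop_connection(self) -> None:
        if self._sock is not None:
            try:
                self._sock.close()
            except OSError:
                pass
            self._sock = None
```

A caller might pass `lambda: socket.create_connection(("127.0.0.1", 2013), timeout=5)` as `connect`. Note the limitation: this only resends the line whose send raised an error. Data that the kernel accepted just before the reset arrived can still be lost, so this narrows the gap but does not make delivery reliable.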

grobian · 2018-04-30T17:51:54Z

I can definitely see to making the timeout configurable. Regardless of whether you use c-relay, I think it is in general a good thing to do. The timeout the outgoing connections use can also be specified.

For Issue #330 it is useful to disable the idle disconnection logic, for it breaks the client on the next send it does. Using the -E flag, this behaviour can be disabled now, and as such the bad interaction avoided.

Farfaday · 2018-10-23T10:54:30Z

Hi,
it seems that we have the same setup with icinga2 and carbon-c-relay. I upgraded yesterday to version 3.4 and added the -E flag, but we still have the same holes in our graphs.
Did I miss something? Let me know if I can do anything to help with testing.
Many Thanks !

piotr1212 · 2018-10-23T11:10:04Z

There might be 100 other reasons why a connection gets broken. IMHO the sender should try to reconnect, but it seems the Icinga2 authors disagree with me. Try lowering your /proc/sys/net/ipv4/tcp_keepalive_time, that solves 90% of all network issues...
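
One caveat on the keepalive suggestion: on Linux, tcp_keepalive_time only affects sockets that have SO_KEEPALIVE enabled, and whether Icinga2 or carbon-c-relay enable it is not stated in this thread. As a rough sketch of what that looks like from the application side, the hypothetical helper below (`enable_keepalive` and its default values are illustrative, not taken from either project) turns keepalive on and sets per-socket timers where the platform exposes them.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 60,
                     interval_s: int = 10, probes: int = 5) -> None:
    """Enable TCP keepalive on `sock` and set per-socket timers if available.

    The default values are arbitrary examples, not recommendations.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* options are not defined on every platform, so each one
    # is looked up by name and skipped when missing.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

Where TCP_KEEPIDLE is set per socket, it takes precedence over the system-wide tcp_keepalive_time for that socket. Keepalive helps detect dead peers and keeps idle connections from being dropped by middleboxes; it does not by itself stop a peer from deliberately closing a connection it considers idle.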

Farfaday · 2018-10-24T14:23:36Z

Thanks for your answer! I tried that, but it did not help... anyway, that is not a big deal :)

akqopensystems mentioned this issue Apr 26, 2018

Unreliable sending of Graphite performance metrics for host and service checks Icinga/icinga2#6261

Closed

akqopensystems closed this as completed Apr 27, 2018

grobian reopened this Apr 30, 2018

grobian added the enhancement label Apr 30, 2018

grobian changed the title ~~Client receives frequently "connection refused" from relay on localhost~~ Allow configuring idle incoming connections timeout Apr 30, 2018

grobian closed this as completed Jun 21, 2018
