Allow configuring idle incoming connections timeout #330

Closed
akqopensystems opened this issue Apr 26, 2018 · 9 comments

@akqopensystems

Our setup:

--------------------------    ------------------    ---------------------    ----------------
| Client Application     | -> | Relay1         | -> | Relay2            | -> | Carbon Cache |
| Icinga2 GraphiteWriter |    | localhost:2013 |    | GraphiteHost:2013 |    | 6 Instances  |
--------------------------    ------------------    ---------------------    ----------------

The client application and relay1 are on the same host, as are relay2 and the carbon cache.
The client application monitors remote servers and generates on average 200M metrics per 24 hours. We noticed that the client frequently receives a "Connection reset by peer" error while talking to relay1:

[2018-04-26 14:35:04 +0200] critical/Socket: send() failed with error code 104, "Connection reset by peer"
Context:
        (0) Processing check result for 'xxxpwphy!ram'

[2018-04-26 14:35:04 +0200] critical/GraphiteWriter: Cannot write to TCP socket on host '127.0.0.1' port '2013'.
Context:
        (0) Processing check result for 'xxxpwphy!ram'

[2018-04-26 14:35:04 +0200] critical/GraphiteWriter: Exception during Graphite operation: Verify that your backend is operational!
[...]
[2018-04-26 14:35:13 +0200] information/GraphiteWriter: Finished reconnecting to Graphite in 0.000180006 second(s).

The carbon-c-relay process is running with the following configuration:

[root@xxxlpmo02 ~]# tail -n1 /etc/sysconfig/carbon-c-relay
ARGS="-i 127.0.0.1 -p 2013 -w 6 -b 2500 -B 512 -U 524288 -q 1000000000"
[root@xxxlpmo02 ~]# grep -v "^#" /etc/carbon-c-relay.conf

cluster graphite-prod
  forward
    xxxlpmo01.xxx.de:2013
    proto tcp
  ;

match *
  send to graphite-prod
  stop
  ;
[root@xxxlpmo02 ~]# rpm -qa |grep carbon-c
carbon-c-relay-2.6-1.el7.x86_64
[root@osszplpmo02 ~]# uname -r
3.10.0-693.17.1.el7.x86_64

We sniffed the communication between the client and the relay and noticed that the relay closes the connection and, directly afterwards, the client tries to send data on the already closed TCP session, which is answered only with a [RST, ACK] by the network stack. We are also discussing this with the Icinga2 team in this issue.
From the discussion in issue #118 we gathered that there is an idle timeout in the carbon-c-relay to safeguard the relay from possible connection leaks of clients.
Is it likely that the client doesn't handle the idle timeout correctly and this leads to our problem as described in the Icinga2 issue? And would it be possible to make the idle timeout configurable?
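
To illustrate the failure mode described above: once the relay has closed its side, the client only learns about it if it checks for end-of-stream before writing. The following is a minimal Python sketch, not Icinga2's or carbon-c-relay's actual code. `peer_closed` and `send_line` are hypothetical helper names, and the sketch assumes the plaintext Graphite protocol, in which the server never sends data back, so a readable socket can only mean a close or an error.

```python
import select
import socket


def peer_closed(sock, wait=0.0):
    """Return True if the peer has closed (or reset) the TCP connection.

    Waits up to `wait` seconds for the socket to become readable, then
    peeks at it without consuming any data. An empty read means EOF.
    """
    readable, _, _ = select.select([sock], [], [], wait)
    if not readable:
        return False  # neither data nor a FIN has arrived so far
    try:
        data = sock.recv(1, socket.MSG_PEEK)
    except OSError:
        return True  # e.g. ECONNRESET: the connection is unusable
    return data == b""


def send_line(sock, addr, line):
    """Send one metric line, reconnecting first if the peer went away.

    Returns the socket that was used, which may be a new one.
    """
    if peer_closed(sock):
        sock.close()
        sock = socket.create_connection(addr)
    sock.sendall(line)
    return sock
```

A check like this narrows the window but does not remove it: the peer can still close between the check and the write, so errors raised by the send itself need handling as well.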

@grobian
Owner

grobian commented Apr 26, 2018

Yes, the relay closes connections when they are "idle".

#define IDLE_DISCONNECT_TIME  (10 * 60 * 1000 * 1000)  /* 10 minutes */

This was indeed necessary to prevent the relay from becoming unreachable due to (Windows) clients which opened a new connection for every metric and never closed any.
The question here is: are you idling for 10 minutes, or is the connection closed because some read error was encountered?
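
The constant above works out to 600,000,000, which suggests the idle time is tracked in microseconds. As a rough sketch of the kind of check such a timeout implies (illustrative Python, not the relay's C implementation; `idle_connections` and the `last_seen_us` mapping are made-up names):

```python
# 10 min * 60 s * 1000 ms * 1000 us = 600_000_000 microseconds
IDLE_DISCONNECT_TIME_US = 10 * 60 * 1000 * 1000


def idle_connections(last_seen_us, now_us, limit_us=IDLE_DISCONNECT_TIME_US):
    """Return the connection ids whose last activity is older than limit_us.

    last_seen_us maps a connection id to the timestamp (in microseconds)
    of the last data received on it. A limit_us of None disables the check.
    """
    if limit_us is None:
        return []
    return [conn for conn, seen in last_seen_us.items()
            if now_us - seen > limit_us]
```

With a check like this, a client that sends nothing for longer than the limit gets disconnected even though its connection is otherwise healthy, which is the case the question above tries to separate from a read error.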

Totally off-topic, but if you can send over UDP, then that would perhaps be an option; the Linux kernel guarantees that UDP packets arrive for localhost, IIRC.

@akqopensystems
Author

Unfortunately, we cannot use UDP. We've also tested a setup where the client writes to carbon-relay, which forwards to carbon-c-relay on localhost, but carbon-relay also had trouble with the connection:

27/04/2018 10:34:22 :: CarbonClientProtocol(127.0.0.1:2013:None)::connectionMade
27/04/2018 10:34:23 :: CarbonClientProtocol(127.0.0.1:2013:None)::connectionLost Connection to the other side was lost in a non-clean fashion.
27/04/2018 10:34:23 :: CarbonClientFactory(127.0.0.1:2013:None)::clientConnectionLost (127.0.0.1:2013) Connection to the other side was lost in a non-clean fashion.
27/04/2018 10:34:27 :: CarbonClientFactory(127.0.0.1:2013:None)::startedConnecting (127.0.0.1:2013)
27/04/2018 10:34:27 :: CarbonClientProtocol(127.0.0.1:2013:None)::connectionMade
27/04/2018 10:34:27 :: CarbonClientProtocol(127.0.0.1:2013:None)::connectionLost Connection to the other side was lost in a non-clean fashion.
27/04/2018 10:34:27 :: CarbonClientFactory(127.0.0.1:2013:None)::clientConnectionLost (127.0.0.1:2013) Connection to the other side was lost in a non-clean fashion.
27/04/2018 10:34:31 :: CarbonClientFactory(127.0.0.1:2013:None)::startedConnecting (127.0.0.1:2013)

So for the moment, we've switched to carbon-relay-ng on the monitoring hosts, relaying to carbon-c-relay on the Graphite host. Unfortunately, our test environment did not experience this error, so it seems to be triggered only under production load.

@grobian
Owner

grobian commented Apr 27, 2018

I'd be happy to understand better what's going wrong here. Is it that the connection gets closed and the client (icinga2) is not very happy about this?

@grobian
Owner

grobian commented Apr 27, 2018

FYI: carbon-relay-ng does NOT time out connections at the moment, but it may get that feature at some point: grafana/carbon-relay-ng#250 (comment)

If disconnects are the problem here, it means your problem would be back.

@akqopensystems
Author

It seems that the client doesn't notice that the connection is closed and tries to send data anyway. It then gets a very clear answer from the TCP stack, and only at that point notices that the connection is already closed. Unfortunately, the data that should have been sent by then gets dropped, which leads to large gaps in our Whisper files.
It seems that the client has so far only been tested with carbon-relay and carbon-relay-ng. We would be happy to stay with carbon-c-relay, as we already know that our Graphite host can be shut down for a few days without loss of data. Unfortunately, we needed a quick solution, as the previous situation was a big blocker for our Icinga2 migration project.
We would be very happy if the idle timeout default could be overridden with a config option. One small gap in the graphs once a week would not hurt us, but the graphs we had before were not acceptable:
[image: grafik]
Thanks for your effort!
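
One way a sender could avoid dropping metrics on a reset is to keep each line queued until the write has gone through, and to reconnect before retrying. A minimal Python sketch under that assumption (the `BufferedWriter` class is hypothetical and not part of Icinga2 or any of the relays):

```python
import socket
from collections import deque


class BufferedWriter:
    """Queue metric lines and only drop them once a send has succeeded."""

    def __init__(self, addr):
        self.addr = addr        # (host, port) of the relay
        self.sock = None        # connected lazily on first flush
        self.pending = deque()  # lines not yet handed to the kernel

    def write(self, line):
        """Queue one line and try to flush. Returns True if the queue is empty."""
        self.pending.append(line)
        return self.flush()

    def flush(self):
        while self.pending:
            try:
                if self.sock is None:
                    self.sock = socket.create_connection(self.addr, timeout=5)
                self.sock.sendall(self.pending[0])
            except OSError:
                # Drop the broken socket but keep the line for the next flush.
                if self.sock is not None:
                    self.sock.close()
                self.sock = None
                return False
            self.pending.popleft()
        return True
```

A successful `sendall` only means the kernel accepted the data, so a line written right after the peer closed can still be lost; this reduces the gaps rather than guaranteeing delivery.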

@grobian
Owner

grobian commented Apr 30, 2018

I can definitely look into making the timeout configurable. Regardless of whether you use c-relay, I think it is a good thing to do in general. The timeout the outgoing connections use can also be specified.

@grobian reopened this Apr 30, 2018
@grobian changed the title from "Client receives frequently "connection refused" from relay on localhost" to "Allow configuring idle incoming connections timeout" Apr 30, 2018
grobian added a commit that referenced this issue Jun 21, 2018
For Issue #330 it is useful to disable the idle disconnection logic, for
it breaks the client on the next send it does.  Using the -E flag, this
behaviour can be disabled now, and as such the bad interaction avoided.
@grobian closed this as completed Jun 21, 2018
@Farfaday

Hi,
it seems that we have the same setup with icinga2 and carbon-c-relay. I upgraded to version 3.4 yesterday and added the -E flag, but we still have the same holes in our graphs.
Did I miss something? Let me know if I can do anything to help with testing.
Many thanks!

@piotr1212
Contributor

piotr1212 commented Oct 23, 2018

There might be 100 other reasons why a connection gets broken. IMHO the sender should try to reconnect, but it seems the Icinga2 authors disagree with me. Try lowering your /proc/sys/net/ipv4/tcp_keepalive_time; that solves 90% of all network issues...
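
Note that the system-wide `tcp_keepalive_time` only affects sockets that have `SO_KEEPALIVE` enabled. A sender can also set the keepalive timing per socket. The sketch below assumes Linux (the `TCP_KEEP*` options are platform-specific); `enable_keepalive` and its default values are illustrative choices, not recommendations from this thread:

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive for one socket and tune its timing.

    idle:     seconds of silence before the first probe is sent
    interval: seconds between unanswered probes
    count:    unanswered probes before the connection is declared dead
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket timing options are not available on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
```

Keepalive probes carry no application data, so they would probably not reset an application-level idle timer such as the one discussed earlier in this issue; they mainly help both ends notice a dead connection sooner.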

@Farfaday

Thanks for your answer! I tried that, but it did not help... anyway, that is not a big deal :)
