This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

No way to set TCP keepalive interval #4109

Closed
deanm opened this issue Oct 10, 2012 · 21 comments

@deanm

deanm commented Oct 10, 2012

Currently the TCP keepalive support (socket.setKeepAlive) accepts an argument that sets the initial delay (aka tcp_keepalive_time / TCP_KEEPIDLE), which is implemented in uv__tcp_keepalive. However, there is no support for TCP_KEEPINTVL or TCP_KEEPCNT, which means you cannot change the time between keepalive probes or the number of failed probes before a connection is considered broken.

Is there a reason it was decided to only allow configuration of the initial delay?

@bnoordhuis
Member

> Is there a reason it was decided to only allow configuration of the initial delay?

Yes. TCP_KEEPINTVL and TCP_KEEPCNT are platform-specific extensions. We don't expose those unless there's a compelling reason.

@piscisaureus

Note to self:

Linux and Windows both support these options. On OS X they are only available from Mountain Lion and up. On Solaris there are TCP_KEEPALIVE_THRESHOLD and TCP_KEEPALIVE_ABORT_THRESHOLD, but I don't know how they compare.

@bnoordhuis
Member

> On Solaris there are TCP_KEEPALIVE_THRESHOLD and TCP_KEEPALIVE_ABORT_THRESHOLD, but I don't know how they compare.

TCP_KEEPALIVE_THRESHOLD is equivalent to TCP_KEEPIDLE but measured in milliseconds instead of seconds.

TCP_KEEPALIVE_ABORT_THRESHOLD is like TCP_KEEPCNT, except that TCP_KEEPCNT sets the number of keep-alive probes to send before severing the connection, whereas TCP_KEEPALIVE_ABORT_THRESHOLD sets a timeout (in milliseconds).

@deanm
Author

deanm commented Oct 11, 2012

Right now setting the initial delay is not really that useful without being able to fully configure the rest of the keepalive parameters. At least in my case it's important, otherwise I need to change the defaults at the sysctl level, and every socket will need to have the same keepalive configuration.

I realize there are some cross-platform issues, but that's really common with networking. For example, async accept() (setSimultaneousAccepts) is only supported on Windows.

@deanm
Author

deanm commented Oct 12, 2012

It sounds like in Solaris's case we could more or less translate the parameters from the Linux style. I suppose ABORT_THRESHOLD would be something like KEEPCNT*KEEPINTVL, and converting to milliseconds is easy enough... So what platforms does that leave out?
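
The translation suggested here can be sketched in a few lines. This only illustrates the arithmetic, assuming the abort threshold is approximately KEEPCNT*KEEPINTVL expressed in milliseconds, as proposed above. The helper name `toSolarisThresholds` and its field names are hypothetical and not part of Node or libuv.

```javascript
// Hypothetical helper: map Linux-style keep-alive settings (seconds and a
// probe count) onto the two Solaris-style thresholds (milliseconds).
function toSolarisThresholds({ idleSec, intervalSec, probes }) {
  for (const [name, value] of Object.entries({ idleSec, intervalSec, probes })) {
    if (!Number.isInteger(value) || value <= 0) {
      throw new RangeError(`${name} must be a positive integer`);
    }
  }
  return {
    // TCP_KEEPALIVE_THRESHOLD: same meaning as TCP_KEEPIDLE, but in ms.
    keepaliveThresholdMs: idleSec * 1000,
    // TCP_KEEPALIVE_ABORT_THRESHOLD: approximated as KEEPCNT * KEEPINTVL, in ms.
    abortThresholdMs: probes * intervalSec * 1000,
  };
}

const thresholds = toSolarisThresholds({ idleSec: 60, intervalSec: 10, probes: 6 });
console.log(thresholds);
```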

@bnoordhuis
Member

> async accept() (setSimultaneousAccepts) is only supported on Windows.

It does something on Unices too (toggles accept back-off behavior) but that's admittedly very undocumented behavior right now. :-)

> So what platforms does that leave out?

OpenBSD - but I don't think it allows you to turn on keep-alive on a per-socket basis anyway. Besides, no one actually uses OpenBSD so I don't care.

The main thing is that I don't want to enlarge the uv_tcp_t struct with fields that are almost never used. You would need those to store the values in case the socket hasn't been created yet.

Either we need to move away from lazy socket creation (which I'm belatedly in favor of but it's a lot of work) or we impose the additional restriction that you can't set any keep-alive related properties until you've finished connecting or listening.

A dirty little secret is that uv_tcp_keepalive() is broken in that respect: if you call it before the socket is created, it will silently ignore your delay argument and default to 60 seconds instead...
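
From JavaScript, one way to live with the "only after connecting" restriction is to defer the call until the socket is connected. Below is a minimal sketch, assuming a `net.Socket`-like object that exposes a `pending` flag, emits `'connect'`, and has `setKeepAlive`. The helper name `setKeepAliveWhenConnected` and the `FakeSocket` stand-in are hypothetical, and whether a given Node version already does this internally is not something this thread settles.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: apply keep-alive only once the socket is connected,
// so the delay argument is never handed to a socket that does not exist yet.
function setKeepAliveWhenConnected(socket, initialDelayMs) {
  const apply = () => socket.setKeepAlive(true, initialDelayMs);
  if (socket.pending) {
    socket.once('connect', apply);
  } else {
    apply();
  }
  return socket;
}

// Stand-in for a net.Socket so the sketch runs without a network.
class FakeSocket extends EventEmitter {
  constructor() {
    super();
    this.pending = true;
    this.calls = [];
  }
  setKeepAlive(enable, delayMs) {
    this.calls.push([enable, delayMs]);
    return this;
  }
}

const fake = new FakeSocket();
setKeepAliveWhenConnected(fake, 5000);
const callsWhilePending = fake.calls.length; // nothing applied yet

fake.pending = false;
fake.emit('connect');
console.log(callsWhilePending, fake.calls);
```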

@jalateras

I wanted to use the setKeepAlive call to be informed when the client side of the connection is disrupted due to a reboot or loss of network connectivity, but it seems that it's not completely working in node v0.8.15.

I am using something like socket.setKeepAlive(true, 5000), but I'm not sure whether this means "send a probe every 5 seconds when the connection is idle" or something else. Additionally, is there any notification that will be sent when the connection is broken?

In this particular application it is important we are able to quickly detect a broken connection.

@bnoordhuis
Member

> I wanted to use the setKeepAlive call to be informed when the client side of the connection is disrupted due to a reboot or loss of network connectivity, but it seems that it's not completely working in node v0.8.15.

Being able to quickly detect a lost connection is not what TCP keep-alive is designed for. It's rather the opposite: keep-alive tries to keep a connection going in the face of unreliable network conditions (spotty connections, extreme congestion, network outages, etc.)

On most systems you won't get notified about a broken connection until about 2 hours have passed.

@jalateras

That's not my understanding from reading this article: http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html

> Keepalive can be used to advise you when your peer dies before it is able to notify you. This could happen for several reasons, like kernel panic or a brutal termination of the process handling that peer. Another scenario that illustrates when you need keepalive to detect peer death is when the peer is still alive but the network channel between it and you has gone down. In this scenario, if the network doesn't become operational again, you have the equivalent of peer death. This is one of those situations where normal TCP operations aren't useful to check the connection status

@bnoordhuis
Member

The TCP keep-alive socket options control when and how many keep-alive probes are sent before the connection is written off. There is a reason the defaults are as high as they are (the aforementioned ~2 hours) because if you set them too low, you run the risk of false positives.
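
To make the "~2 hours" figure concrete, here is a rough sketch of the arithmetic: the worst-case time to write off a dead peer is roughly the idle delay plus the probe interval times the probe count. The Linux defaults used below (7200 s idle, 75 s interval, 9 probes) are the usual sysctl values and may differ on a given system. `detectionTimeSec` is a hypothetical helper, not an API.

```javascript
// Hypothetical helper: approximate seconds until an unresponsive peer is
// declared dead, given the three keep-alive parameters.
function detectionTimeSec({ idleSec, intervalSec, probes }) {
  return idleSec + intervalSec * probes;
}

// Assumed typical Linux defaults (tcp_keepalive_time, _intvl, _probes).
const LINUX_DEFAULTS = { idleSec: 7200, intervalSec: 75, probes: 9 };

// All defaults: a little over two hours.
const withDefaults = detectionTimeSec(LINUX_DEFAULTS);

// Only the initial delay lowered to 5 s, as setKeepAlive(true, 5000) would do.
const onlyDelayLowered = detectionTimeSec({ ...LINUX_DEFAULTS, idleSec: 5 });

console.log(withDefaults, onlyDelayLowered);
```

Under these assumptions, lowering only the initial delay still leaves more than eleven minutes of probing before the connection is dropped, which is the limitation this issue is about.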

@deanm
Author

deanm commented Jul 11, 2013

Isn't a false positive really just a matter of definition? In theory a connection can be "gone" and "come back later", for example if a network link is down or if someone sleeps/restores a laptop, etc. This can happen across any window of time; even 2 hours may be "not enough". So the idea of keep-alives is really just to decide what makes sense for your application. For example, if I'm doing some sort of realtime communication over a TCP socket and there is no response over the connection for even 10 seconds, I might want to consider it dead. Even if that socket might have come alive again 2 hours later, waiting makes no sense for the application I'm using the socket for. This is why keep-alives exist and why the values are adjustable: there is no right answer, it is all arbitrary, and it depends entirely on what you want and expect from your application.

That is why there exists the mechanism to set these keepalive parameters on a per-socket basis.

@jameshartig

I had already made a pull request for this:
#4882

It is useful in determining when a client has disconnected. I realize that might not be the intended purpose but it's useful in that you don't have to make your app randomly ping and process/wait for a response, everything is handled by the system.
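
For comparison, the application-level alternative mentioned here (pinging and tracking responses yourself) might look roughly like the sketch below. It only covers the bookkeeping. `LivenessTracker` is a hypothetical name, the 15-minute timeout is just an example value, and the clock is injected so the demo runs without sockets or timers.

```javascript
// Hypothetical helper: remembers when the peer was last heard from and
// reports whether it has been silent for longer than the timeout.
class LivenessTracker {
  constructor(timeoutMs, now = Date.now) {
    this.timeoutMs = timeoutMs;
    this.now = now;
    this.lastSeen = now();
  }

  // Call whenever any data (or a ping reply) arrives from the peer.
  touch() {
    this.lastSeen = this.now();
  }

  // True once the peer has been silent for more than timeoutMs.
  isStale() {
    return this.now() - this.lastSeen > this.timeoutMs;
  }
}

// Deterministic demo with a fake clock (milliseconds).
let clock = 0;
const tracker = new LivenessTracker(15 * 60 * 1000, () => clock);

clock = 10 * 60 * 1000;               // 10 minutes of silence
const staleAfter10Min = tracker.isStale();

tracker.touch();                       // peer was heard from
clock = 26 * 60 * 1000;               // 16 minutes after the last touch
const staleAfter16Min = tracker.isStale();

console.log(staleAfter10Min, staleAfter16Min);
```

In a real application, `touch()` would be called from the socket's `'data'` handler and `isStale()` checked on a timer; the ping message format is application-specific and left out here.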

@bnoordhuis, should I add support for Solaris in my PR? I don't have a machine to test that on though.

@jasonkuhrt

I was going to add to this discussion but have nothing to add after @deanm's and @jalateras's comments. This is 100% our situation too. We have realtime hardware socket connections, and our app must know near-instantly when they lose connection.

@bnoordhuis
Member

> We have realtime hardware socket connections, and our app must know near-instantly when they lose connection.

I want to refer you to this comment - in a nutshell, TCP keep-alive is not what you're looking for.

As I mentioned in the other issue, I'm okay with making the keep-alive configurable. On the other hand, I just know people will abuse it and then complain it's not working like they expect it to...

@jameshartig

I made the PR for this because I'm not necessarily trying to detect exactly when it happens, just to detect it in a more reasonable time than 2 hours... Like 15 minutes, for instance.

I think it's fine if we just clarify in the docs what it can actually be useful for.

@jalateras

@bnoordhuis, I'm not sure why you think people will abuse this specific feature when the same is true of other language features. The suggestion of providing the mechanism is a reasonable one, don't you think?

@bnoordhuis
Member

@jalateras Allow me to quote myself:

> As I mentioned in the other issue, I'm okay with making the keep-alive configurable.

That said, it's a reasonable amount of work to implement and I won't be the one doing it. I know there is a PR for it but there were some issues when last I looked at it and the changes don't apply anymore.

What's more, there are tons of unreviewed pull requests for (IMO) more important changes and there's only limited reviewer time. If you want to get this in, you'll have to find another champion among the committers. :-)

@jameshartig

I'd be more than willing to update the PRs I had already made to node/libuv so they can be merged, if we can find a committer to review them. @isaacs had already commented on the node PR (#4882) and mentioned that he wanted to see a test accompany the changes, but otherwise the code looked fine. If someone wants to help figure out a test, I'd be willing to implement it.

@davepacheco

As others have mentioned, this would be pretty useful for us for detecting certain failure modes in networked systems (e.g., a remote system panicked or lost power). It doesn't have to be instant, but KeepAlive appears to be a fine mechanism for detecting such situations on the order of tens of seconds. (Whether those are false positives is entirely use-case-specific, and it's up to users to configure this to suit their environment. The docs can always link to appropriate resources on the subject.)

@hertzg

hertzg commented Jul 26, 2015

I've written a module using FFI to set those values per connection using the socket._handle.fd (which is available only on Linux).

It's in no way a solution to this issue, but just in case:
https://github.com/hertzg/node-net-keepalive

@davepacheco

I forgot this had not been mentioned earlier: SunOS systems (at least including illumos, and probably Solaris) support TCP_KEEPINTVL and TCP_KEEPCNT in addition to the other parameters mentioned above. It just translates them appropriately. No special support should be needed for this platform.
