Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: KeepAlive is disabled by write to dead link #31490

Open
networkimprov opened this issue Apr 16, 2019 · 18 comments
Open

net: KeepAlive is disabled by write to dead link #31490

networkimprov opened this issue Apr 16, 2019 · 18 comments

Comments

@networkimprov
Copy link

@networkimprov networkimprov commented Apr 16, 2019

Was: net.Dialer.Dial() doesn't respect .KeepAlive

What version of Go are you using (go version)?

$ go version
go version go1.12 linux/amd64

Does this issue reproduce with the latest release?

Haven't tried 1.12.1+

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/root/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/lib/go"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build601953146=/tmp/go-build -gno-record-gcc-switches"

What did you do/see?

For a TLS connection that's been dropped, c.Read() returns a net.Error with e.Timeout()==true after ~18 minutes, apparently due to KeepAlive failure. The connection is closed by the server while the client (laptop running VMWare with Linux) is suspended. Error string:
"read tcp n.n.n.n:p->n.n.n.n:p: read: connection timed out"

The note on 5bd7e9c (discussed in #23459) says the default for net.Dialer keepalive failure is just 2-3min. With an explicit KeepAlive, I also see odd waits:

net.Dialer{KeepAlive: 30 * time.Second} 18min
net.Dialer{KeepAlive: 25 * time.Second} 18min
net.Dialer{KeepAlive: 10 * time.Second} 16min
net.Dialer{KeepAlive: 5 * time.Second} <1min, but at least once 18min

Measurements aren't precise; 16-18 minutes could be 1000 or 1024 seconds. Code has:

   aDlr := net.Dialer{Timeout: 3*time.Second} // default KeepAlive
   aCfgTls := tls.Config{InsecureSkipVerify: true}
   aConn, err := tls.DialWithDialer(&aDlr, "tcp", "host:port", &aCfgTls)
   if err != nil { return err }
   err = aConn.Write(...) // <256 bytes
   // brief exchange of short messages
   aLen, err := aConn.Read(...)

Also filed #31449 to report that the error due to keepalive doesn't comply with the docs re connection timeout.

cc @FiloSottile @bradfitz

@bradfitz

This comment has been minimized.

Copy link
Member

@bradfitz bradfitz commented Apr 16, 2019

tls.DialWithDialer literally calls net.Dialer.Dial, so I don't think there's anything TLS-specific here.

@FiloSottile

This comment has been minimized.

Copy link
Member

@FiloSottile FiloSottile commented Apr 16, 2019

Yeah, I looked and while net/http used to mess with the keep-alives, crypto/tls does not.

Can you figure out the KEEPCNT value on your system?

@FiloSottile FiloSottile changed the title tls: DialWithDialer() doesn't respect .KeepAlive net: net.Dialer.Dial() doesn't respect .KeepAlive Apr 16, 2019
@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Apr 16, 2019

I have the standard defaults:

$ cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
$ cat /proc/sys/net/ipv4/tcp_keepalive_probes
9
$ cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Apr 16, 2019

A likely factor... After resuming the client, its date jumps forward to the current time.

On my system, that happens after 0-10 mins (it isn't immediate due to a VMWare or Linux driver issue). But that should cause a max wait of 13mins for the default KeepAlive.

@julieqiu julieqiu added this to the Go1.13 milestone Apr 22, 2019
@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Jun 23, 2019

cc @mikioh

@pam4

This comment has been minimized.

Copy link

@pam4 pam4 commented Oct 17, 2019

@networkimprov, I'm a little confused about your setup.
Your dialer has keep-alives active and you expect the dialed connection to die sooner, but the dialer side is the one being suspended, right?

Anyway I suspect that in your test case, a Write on the connection is happening after the network failure, therefore the keepalive state is abandoned or never reached.
A socket can only be in the keepalive state if it is idle. If there is outstanding data, the socket will be in the on state as it retransmits the data over and over because of the network failure.

It seems a common misconception that a connection with keep-alives enabled is going to return an error within TCP_KEEPIDLE + TCP_KEEPINTV * TCP_KEEPCNT seconds from a network failure. This is often not the case.

I reproduced the issue on Linux, with client and server on the same host, using iptables to cut the network:

# iptables -A INPUT -p tcp --dport 12345 -j DROP; iptables -A INPUT -p tcp --sport 12345 -j DROP; date

I used netcat as server, and a simple Go program with default keep-alives (15s) as client.
The client terminates on read error and has 2 modes: #1 writes 1 byte; #2 writes 1 byte, sleeps 10s, then writes another byte (during those 10s I cut the network).
In mode #1 I get the read error consistently after ~2m26s while in mode #2 I get it after ~16m.
It doesn't hang forever because tcp_retries2 is reached, but it could take as long as 30m.

This is the first web search result I found about the issue.

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Oct 17, 2019

I haven't seen this with the same client app running on MacOS, so it may be Linux-specific and suspend-related. It could be due to my system, which is ArchLinux on VMWare on Win7. The net.Dialer side, which sets a keepalive, gets suspended. EDIT: the server side is on a remote host.

I'm pretty sure my client doesn't Write to that link during the period in question (it sits in Read); only the server is writing. ... oop, I misspoke...

EDIT 2: The client writes a pulse to the server every 115s. The lack of that pulse causes the server to drop the connection after the laptop suspends. On the MacOS client I use KeepAlive: 5 * time.Second which yields a timeout in <1m. I'll try logging the client pulses to see what happens when they fire right after resume and before timeout.

@gopherbot add OS-Linux

@gopherbot gopherbot added the OS-Linux label Oct 17, 2019
@pam4

This comment has been minimized.

Copy link

@pam4 pam4 commented Oct 17, 2019

@networkimprov, if I understand correctly:

  • the client is suspended;
  • the server drops the connection because it doesn't receive application level pulses anymore;
  • the client is resumed;
  • the client connection hangs for more than the expected 50s (for a 5s KeepAlive setting).

EDIT: sorry, I misunderstood the last part of your message; what I wrote below was probably unnecessary.

If the client doesn't receive a RST from the server, then it would hang until reaching either the keep-alive timeout or the retry timeout (13 to 30 minutes in Linux).
You said it yourself that the client *is* writing, so it seems likely to me that you are experiencing the longer retry timeout.

If the client writes anything to the connection within ~50s after resume, keep-alives are out of the game.
If the client tries to send a pulse in that time span, the connection will leave the keepalive state forever and die only after the longer retry timeout.
Even if your pulse period is longer than your keep-alive timeout, a pulse can still happen at any time relative to the resume.
If you have some reason to believe that this is not happening, you could try to log any Write to the underlying TCPConn to be absolutely sure.

Maybe MacOS just has a shorter retry timeout?

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Oct 17, 2019

Is there a way to trigger a RST to the client on resume?

I'm planning to add a server-to-client pulse and drop net.Dialer keepalive, but first I'll try to confirm that a write happens before the timeout on resume.

@bradfitz maybe the docs could mention that writing to a lost connection disables keepalive? Because keepalive is now the default for both dialed and accepted connections.

@pam4

This comment has been minimized.

Copy link

@pam4 pam4 commented Oct 19, 2019

Is there a way to trigger a RST to the client on resume?

If the server receives any segment after closing the connection, it should respond with RST. Most likely either the client transmissions or the server RSTs are being dropped by some stateful filter because they don't match any connection in its table anymore (NAT?). I've observed this with one provider, unfortunately I don't think anything can be done.

@bradfitz maybe the docs could mention that writing to a lost connection disables keepalive? Because keepalive is now the default for both dialed and accepted connections.

I agree. Given a lost connection, writing to it before the keep-alive timeout seems like a very common case to me.
And I don't see such case mentioned anywhere in the issues either, for example keep-alives are discussed here and here.
The docs could at least use a different wording, like changing s/active/idle/ in "keep-alive probes for an active network connection".

TCP_USER_TIMEOUT seems closer to users expectations, see man 7 tcp (I have no idea if it works on anything other than Linux).
It's the max time between a transmission and its ACK. It overrides TCP_KEEPCNT, but the time before and between keep-alives is unaffected.
The timeout is not exact because it is checked only after retry intervals, which may grow exponentially, but seems good enough.

func setTCPTimeout(d time.Duration) func(string, string, syscall.RawConn) error {
    return func(network, address string, c syscall.RawConn) error {
        var sysErr error
        var err = c.Control(func(fd uintptr) {
            sysErr = syscall.SetsockoptInt(int(fd), syscall.SOL_TCP, 0x12,
                int(d.Milliseconds()))
        })
        if sysErr != nil {
            return os.NewSyscallError("setsockopt", sysErr)
        }
        return err
    }
}

var dialer = &net.Dialer{
    KeepAlive: 15 * time.Second,
    Control:   setTCPTimeout(150 * time.Second),
}
@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Oct 19, 2019

A NAT, yes; my clients were connected to the server via a router doing NAT.

Re docs, how about "keepalive ceases to function if a write is made to a dead link, as that kicks the link out of idle mode and retries the write repeatedly for ~15 minutes".

TCP_USER_TIMEOUT

Do you think that default socket config in Go should set this in lieu of (or addition to) SO_KEEPALIVE? I agree that time-until-ack is more generally useful and less noisy than the keepalive protocol. (I'd only need a one-way heartbeat!) Given that keepalive ceases to function after a write to dead link, it seems not useful at all.

The Man7 page references RFC 793 and RFC 5482 for this feature.
Drill-down from Cloudflare: https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/

EDIT: a search for tcp_user_timeout reveals that it's apparently not implemented on Darwin nor Windows :-(

@networkimprov networkimprov changed the title net: net.Dialer.Dial() doesn't respect .KeepAlive net: KeepAlive is disabled by write to dead link Oct 19, 2019
@gopherbot gopherbot removed the OS-Linux label Oct 19, 2019
@pam4

This comment has been minimized.

Copy link

@pam4 pam4 commented Oct 20, 2019

Given that keepalive ceases to function after a write to dead link, it seems not useful at all.

It "ceases to function" because it is not needed anymore.
Keep-alives job is only to keep things moving when they are not moving on their own. If nothing is going out, there is nothing to wait back, and a network failure could remain undetected indefinitely.
(They are also useful in preventing intermediate nodes from killing the connection if there's no traffic for too long.)

The problem is not the scope of keep-alives, but the inability to change limits for non keep-alives:
tcp_keepalive_probes is for keep-alives what tcp_retries2 is for normal packets. Linux defaults are 9 and 15 respectively, but only the former is settable on a per socket basis.
Also wait intervals for keep-alives are linear and configurable whereas for normal packets they are exponential and non-configurable.

TCP_USER_TIMEOUT overrides both tcp_keepalive_probes and tcp_retries2, providing a uniform time limit instead of two separate counter limit.
This would be approximately the maximum amount of time a network failure can remain undetected, in any condition (if keep-alives are enabled).
Yes, I think it would have made a good default for the net package. Too bad for the lack of support in other OSs.

@pam4

This comment has been minimized.

Copy link

@pam4 pam4 commented Oct 23, 2019

About TCP_USER_TIMEOUT, I found vague mentions of TCP_MAXRT and TCP_CONNECTIONTIMEOUT being possible replacements for Windows and MacOS respectively, but I cannot try them because I don't have access to those systems.

In my Linux tests, TCP_USER_TIMEOUT seems to make the connection time out consistently in both the "idle" and "non-idle" case, but I get confusing results in a zero-window situation.

In Debian 9, kernel 4.9.168, Go 1.13:

  • with TCP_USER_TIMEOUT set, the connection dies if it stays in the zero-window state for longer than the specified time, regardless whether the window probes are acknowledged or not;
  • with TCP_USER_TIMEOUT unset, the connection dies after 16 (why not 15?) retries if the window probes are lost, but can live forever if the probes are acknowledged;
  • when the connection is terminated by the timeout, Read returns io.EOF in the zero-window case (normally it returns syscall.ETIMEDOUT).

In an older Debian, I observed TCP_USER_TIMEOUT being completely ignored in the zero-window case, this commit may be relevant, I can't find clear info about it.

I don't like this undocumented and changing behavior of TCP_USER_TIMEOUT. I think dying even when the probes are being acknowledged is a big semantic difference.
I generally want to be able to detect any network failure within 2-3 minutes or so, but a closed window is just like writing to a blocked pipe, I think it's the application's business to decide how long to wait on it, and after the connection is orphaned I would like some other limit to kick in.
Unfortunately TCP_USER_TIMEOUT doesn't seem to make that possible.

The current OS support for TCP user timeout controls seems too divergent/unpredictable/poor to do something with it in the net package :-(
The best we can do is probably to make sure users understand what keep-alives are.

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Oct 24, 2019

Thanks for the investigation!

What was the version of the older Debian which ignores tcp_user_timeout in zero-window?

I don't think it's necessarily a show-stopper if zero-window (sometimes) triggers tcp_user_timeout. The question is whether that behavior is worse than keepalive shut-off on write to a dead link, which is pretty bad.

Hopefully the io.EOF return can be fixed.

@pam4

This comment has been minimized.

Copy link

@pam4 pam4 commented Oct 26, 2019

What was the version of the older Debian which ignores tcp_user_timeout in zero-window?

The one I tested was Debian 8, kernel 3.16.59 (probably anything older does the same).

The question is whether that behavior is worse than keepalive shut-off on write to a dead link, which is pretty bad.

It depends on the application. In many cases, timing out on zero-window shouldn't be a problem because there is no reason for the receiver not to consume its input for a significant amount of time, but there are exceptions.

I'm not saying I wouldn't use TCP_USER_TIMEOUT (AFAIK, is the only way to shorten the retry limit per-socket), I just don't like that some aspects of its behavior are undocumented and silently changing in a way that may break some user.

FYI, commands like netstat -not or ss -not are useful to understand what's happening, specially the Send-Q (unacknowledged byte count) and Timer columns.

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Oct 26, 2019

Maybe we could propose to default to tcp_user_timeout on kernels newer than [x.y] and fall back to the current default keepalive otherwise?

Aren't exception cases known to their authors and clients, who would necessarily need something different than the Go default? The keepalive default in Go exists as training wheels; it's just a starting point.

@pam4

This comment has been minimized.

Copy link

@pam4 pam4 commented Oct 29, 2019

Maybe we could propose to default to tcp_user_timeout on kernels newer than [x.y] and fall back to the current default keepalive otherwise?

It's the newer behavior of TCP_USER_TIMEOUT that could break some use cases, not the older, and the older is still better than nothing, so I don't see the point in making such exception.
Keep-alives are necessary anyway, to cover the "idle" case and refresh middleboxes state.

Aren't exception cases known to their authors and clients, who would necessarily need something different than the Go default?

Currently they don't necessarily need something different than the Go default, but they will if we make TCP_USER_TIMEOUT the new default (because of the zero-window change introduced in newer kernels).

If a knob for TCP_USER_TIMEOUT can be added to the net package, I'm all for it.
We can always think later about turning it on by default. But I don't know about other OSs, and I don't see much interest, if the participation here is any indication...

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Oct 29, 2019

OK, I'll investigate the options you found for Windows and MacOS when I get a chance. If they work, I'll file a proposal to add a field for this in net.Dialer, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants
You can’t perform that action at this time.