Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: document that Conn.Read/Write should return error that Is(context.DeadlineExceeded) for deadline exceeded #31449

Open
networkimprov opened this issue Apr 12, 2019 · 47 comments

Comments

@networkimprov
Copy link

@networkimprov networkimprov commented Apr 12, 2019

This is a bug. A keepalive error is a connection failure, not a deadline event.

net Docs say:

A zero value for [deadline] means I/O operations will not time out.

A zero value for [read deadline] means Read will not time out.

KeepAlive specifies the keep-alive period for an active network connection.
If zero, keep-alives are enabled if supported by the protocol and operating system.
Network protocols or operating systems that do not support keep-alives ignore this field.
If negative, keep-alives are disabled.

For a TLS connection that's been severed, Conn.Read() returns a net.Error with .Timeout()==true due to KeepAlive failure. (Go 1.12, Linux, amd64)

The Error should give .Timeout()==false to comply with the docs. Code that checks for .Timeout()==true would generally assume that an explicit deadline had passed.

The .Error() string should mention keepalive. It's currently:
"read tcp n.n.n.n:p->n.n.n.n:p: read: connection timed out"

Related: net.Dialer.KeepAlive gets 15*time.Second if unset/zero. This isn't documented in package net.

cc @bradfitz

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Apr 16, 2019

Also filed #31490 to report that tls.DialWithDialer() doesn't respect .KeepAlive.

@CAFxX

This comment has been minimized.

Copy link
Contributor

@CAFxX CAFxX commented Apr 17, 2019

Since 1.11, net.Dialer.KeepAlive gets 15*time.Second if unset/zero. This isn't documented in package net.

It is documented that we enable keep-alives if that field is unset/zero. The choice of not specifying 15s in the docs was deliberate and is actually documented in the commit message: 5bd7e9c

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Jun 24, 2019

I believe this is a new bug in 1.12, so a fix should be backported to a 1.12.x release.
Also reported in #32735

cc @ianlancetaylor
@gopherbot add release-blocker

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jun 24, 2019

Do you know what makes this new in 1.12? I'm not seeing it.

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Jun 24, 2019

This decided on keepalive by default in 1.12: #23459. That causes deadline handlers in existing code to see non-deadline (i.e. keepalive) errors.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jun 24, 2019

If I understand you correctly, you are saying that the bug has existed for a long time for programs that enable TCP keepalive by setting the KeepAlive field in net.Dialer, but that it is more likely to occur in 1.12 because now that field is set by default. Is that correct?

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Jun 24, 2019

Yes. It's probably rare to use both deadlines and explicit keepalive, so the bug wasn't reported.

Now any code with long deadlines (relatively common) will wrongly detect deadline events due to implicit keepalive.

@FiloSottile

This comment has been minimized.

Copy link
Member

@FiloSottile FiloSottile commented Jul 15, 2019

Go 1.13 also enables Keep-Alives by default on the net.Listen side (1abf3aa), so this might be worth fixing now, before exposing a new wave of applications to it.

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Jul 16, 2019

@costela, your keepalive patch will trigger this bug...

@gopherbot please open backport 1.12

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Jul 16, 2019

Backport issue(s) opened: #33137 (for 1.12).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@costela

This comment has been minimized.

Copy link
Contributor

@costela costela commented Jul 16, 2019

The Error should give .Timeout()==false to comply with the docs. Code that checks for .Timeout()==true would generally assume that an explicit deadline had passed.

I can't seem to find anything in the docs saying that a keepalive timeout should not set .Timeout()==true. IMHO this is not necessarily obvious and should be clarified in the docs if it's indeed intended.

This isn't to say the new default behavior wouldn't introduce a potentially breaking change, but I'm not sure changing the error to be .Timeout()==false is the right approach. Just as there might be code depending on .Timeout()==true for detecting deadlines, there might be code depending on the same behavior for explicitly set keepalives. Or am I missing something obvious?

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Jul 16, 2019

Docs say "A zero value for [deadline] means I/O operations will not time out."

Don't worry about it; it's been assigned. Sorry to bother you.

@FiloSottile

This comment has been minimized.

Copy link
Member

@FiloSottile FiloSottile commented Jul 17, 2019

Deadlines and keep-alive errors are deeply different: the former are fully recoverable, the latter aren't, for example. It seems unlikely any code would ever want to handle them the same way.

Keep-alives are more akin to the connection dropping, so I don't think marking them as Timeouts makes sense. I'll make this change tomorrow.

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Jul 17, 2019

@FiloSottile, let's make the .Error() string mention keep-alive. It's currently:
"read tcp n.n.n.n:p->n.n.n.n:p: read: connection timed out"

Also I opened a 1.12.x backport issue.

@FiloSottile

This comment has been minimized.

Copy link
Member

@FiloSottile FiloSottile commented Jul 24, 2019

I looked into this, and as far as I can tell, there is no way to single out keep-alive errors: they just surface a ETIMEDOUT from the read().

What we might do, is isolate deadline errors, which as far as I can tell (although I am no expert on that part of the code) are handled by runtime timers, and make those more uniquely recognizable. For example, by making them an exported error type.

We might even decide to make only those have Timeout() true. Currently Timeout() is true for e == EAGAIN || e == EWOULDBLOCK || e == ETIMEDOUT and for some DNS resolution errors. That line has last been touched in 2011, and I don't feel confident about how each of those ids map to a timeout.

If I'm right, this is not a small fix to a specific error value anymore, and I don't think it's something we should ship at the very end of the freeze. /cc @golang/osp-team

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jul 24, 2019

I believe that if a read from a network connection fails due to a deadline, the read will return internal/poll.ErrTimeout. I believe that if a read fails due to a keep-alive error, the read will return syscall.ETIMEDOUT.

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Jul 24, 2019

Ian, can we make net.OpError.Timeout() check internal/poll.ErrTimeout?

Altho we could undo #23378 for 1.13, we still need to fix this in 1.12.x & 1.13 due to #23459.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jul 24, 2019

Seems a fair bit simpler to change internal/poll.ErrTimeout to not return true for the Timeout method. Or just not implement the Timeout method. Might be a good idea to change the name, of course.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jul 30, 2019

That seems workable for 1.13.

@FiloSottile

This comment has been minimized.

Copy link
Member

@FiloSottile FiloSottile commented Aug 1, 2019

First, we are only looking at keep-alive errors now, because they are the only ones tied to a Go 1.13 behavior change. We are not touching anything else months into the freeze.

Second, I am growing unconvinced we can/should fix this.

  1. We don't have any tests for it, nor are we equipped to write them. To test keep-alives we'd have to force a system-side state change, or start dropping packets at the network level.

    • socktest.Filter doesn't look like it would actually test the OS behavior.
    • Existing keep-alive tests just use testHookSetKeepAlive to check setKeepAlive[Period] is called.
  2. ETIMEDOUT is just as unspecified as our net.Error.Timeout(). I couldn't even find any docs that guarantee a keep-alive error causes ETIMEDOUT, and on the other hand there are apparently hundreds of non-keep-alive conditions that cause ETIMEDOUT (from skimming the Linux kernel sources).

The simplest change, removing ETIMEDOUT from syscall.Errno.Timeout(), caused net.TestDialTimeout to fail because it expects a Timeout() == true error from a Dial that ends in ETIMEDOUT. I have no idea what else would break, because I assume the remaining behaviors are just as untestable and hence untested, and the RC is too late to find out by letting it out into the world.

I suppose we could special-case ETIMEDOUT from read or write, and/or only from a TCP connection, but at this point I feel like I am just stabbing in the dark, without tests, on the week of the RC.

From another perspective, though, keep-alives always returned Timeout() == true errors, and we enabled them by default on clients in Go 1.12 and no one noticed. After looking more into the semantics of Timeout(), I think they are unfortunately meaningless, and we should just leave things as is instead of rocking the boat, try to do better with the new errors (see #33411), and eventually deprecate net.Error, os.IsTimeout, and the underspecified Timeout methods.

/cc @golang/osp-team @ianlancetaylor

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Aug 1, 2019

Two issues were filed re spurious deadline events in 1.12. Such bugs can be hard to detect unless you check the elapsed time for deadlines! A lot more folks would notice them if they start appearing in server-side apps in 1.13.

Let's revert both keep-alive commits for 1.13, and back-port a revert to 1.12.

It's reasonable to ask users to explicitly enable keep-alive if required, and many of us already use deadlines as a context-sensitive keep-alive technique.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Aug 1, 2019

@networkimprov Can you point us to the two bugs filed against 1.12? I'm not excited about changing the 1.12 behavior at this point.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Aug 1, 2019

I note that it would be feasible to handle ETIMEDOUT errors from Read and Write specifically. It would mean changing a few functions in internal/poll/fd_unix.go to look for ETIMEDOUT and replace it with some other error. But it does seem risky to do that for 1.13 at this point.

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Aug 1, 2019

#32735 and this one (I encountered spurious deadlines in an app with long-running links to servers.)

This bug won't stop the app, but does change the way it was intended to work, with rather variable consequences.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Aug 2, 2019

Change https://golang.org/cl/188657 mentions this issue: internal/poll: if Read/Write get ETIMEDOUT, return ErrKeepAlive

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Aug 2, 2019

For the record, https://golang.org/cl/188657 shows the change to not return ETIMEDOUT from a read or write system call. It's probably not a good idea to do this for 1.13 at this point.

@rsc

This comment has been minimized.

Copy link
Contributor

@rsc rsc commented Aug 2, 2019

ETIMEDOUT.Timeout() has been true for many releases. It seems quite clear we cannot redefine that now.

I was sympathetic to CL 188657 until I searched the Linux kernel sources for ETIMEDOUT, as Filippo mentioned above. It gets returned from an enormous number of places, most of them having nothing to do with TCP, and many of them having nothing to do with networking. So translating ETIMEDOUT to ErrKeepAlive is clearly incorrect in the overwhelming majority of places where it appears in the kernel source. I don't believe we can be more precise here without a tremendous amount of system-specific code.

Keepalives have been in previous releases and turned into errors with err.Timeout() == true. I realize they are a little more common now, but even so it does not seem terrible to continue the behavior of the past 13 releases into a 14th.

I suggest we leave this alone for Go 1.13.

For Go 1.14 maybe the answer is to change our net.Conn implementations to return an error that Is(context.DeadlineExceeded), to allow a more precise check than Timeout.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Aug 2, 2019

Thanks, moving to 1.14.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.13, Go1.14 Aug 2, 2019
@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Aug 2, 2019

Should I file an issue to revert #23378 and #23459 for 1.13?

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Aug 2, 2019

In my opinion we should cover the issue in the release notes. I think many more people will be helped by turning on keep-alive by default than will be hurt by the confusion around the Timeout method.

@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Aug 2, 2019

A note like: "Code using SetDeadline() should generally disable keep-alive, which is now on by default for all TCP connections, both dialed and accepted" ...?

After many years of keep-alives off by default, I don't understand the urgency for default-on before these errors can be discerned from deadline events.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Aug 2, 2019

Because keep-alive is already on by default for 1.12, and people are using 1.12. Rolling back and then forward again is also confusing.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Aug 2, 2019

Change https://golang.org/cl/188819 mentions this issue: doc/go1.13: mention confusion between keep-alive timeout and deadline

gopherbot pushed a commit that referenced this issue Aug 2, 2019
Updates #31449

Change-Id: I4d7075b20cd8171bc792e40b388f4215264a3317
Reviewed-on: https://go-review.googlesource.com/c/go/+/188819
Reviewed-by: Filippo Valsorda <filippo@golang.org>
@networkimprov

This comment has been minimized.

Copy link
Author

@networkimprov networkimprov commented Aug 3, 2019

The text added to the release notes should also live in the docs, and be referenced by net.Dial() & .Listen(), net.Error.Timeout(), and net.Conn.Set*Deadline().

The docs are currently explicit that only deadlines cause "time out". See quote in issue text.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Aug 9, 2019

Change https://golang.org/cl/189757 mentions this issue: net: document that a keep-alive failure also returns a timeout

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Aug 9, 2019

I sent a CL to update the docs for SetDeadline. That seems to be the relevant place. This kind of detail would be out of place in the other locations mentioned.

gopherbot pushed a commit that referenced this issue Aug 11, 2019
Updates #31449

Change-Id: I76490c5e83eb2f7ba529b387a57ba088428aece5
Reviewed-on: https://go-review.googlesource.com/c/go/+/189757
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Emmanuel Odeke <emm.odeke@gmail.com>
Reviewed-by: Filippo Valsorda <filippo@golang.org>
@rsc

This comment has been minimized.

Copy link
Contributor

@rsc rsc commented Oct 9, 2019

Back in #31449 (comment), I wrote:

For Go 1.14 maybe the answer is to change our net.Conn implementations to return an error that Is(context.DeadlineExceeded), to allow a more precise check than Timeout.

We did not get into this, and we are close to the freeze. It seems like we should probably put this off to next cycle. But I will retitle so it is easier to understand what this is about.

@rsc rsc changed the title net: Conn.Read() returns Error with .Timeout()==true on KeepAlive failure net: document that Conn.Read/Write should return error that Is(context.DeadlineExceeded) for deadline exceeded Oct 9, 2019
@rsc rsc modified the milestones: Go1.14, Go1.15 Oct 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants
You can’t perform that action at this time.