Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: Client.Do returns different errors on context deadline depending on req/conn state #31863

Open
whiteglasses opened this issue May 6, 2019 · 5 comments

Comments

@whiteglasses
Copy link

commented May 6, 2019

What version of Go are you using (go version)?

go version go1.12.4 linux/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/m.laletin/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/m.laletin/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build654146316=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Run multiple times program on my server (24 cpu)
https://play.golang.org/p/d00_wxKDN5f
With single goroutine - problem did not appears.
Problem appeared not every run, but in 3-4 runs at least one was faulty.

What did you expect to see?

all goroutines: done = true, err = Get http://240.0.0.1/test: context deadline exceeded

What did you see instead?

all gouroutines returned at the same time, after 4 seconds, like i specified in parent context.WithTimeout. But some goroutines returned before parentContext.Done was closed (error was i/o timeout), and some gorotines returned after parentCtx.Done was closed (with expected error "context deadline exceeded").

2019-05-06T18:29:37.969098118+03:00 start
2019-05-06T18:29:41.969304132+03:00 done = false, err = Get http://240.0.0.1/test: dial tcp 240.0.0.1:80: i/o timeout
2019-05-06T18:29:41.969347778+03:00 done = false, err = Get http://240.0.0.1/test: dial tcp 240.0.0.1:80: i/o timeout
2019-05-06T18:29:41.969298753+03:00 done = false, err = Get http://240.0.0.1/test: dial tcp 240.0.0.1:80: i/o timeout
2019-05-06T18:29:41.969580442+03:00 done = true, err = Get http://240.0.0.1/test: context deadline exceeded
2019-05-06T18:29:41.969298324+03:00 done = false, err = Get http://240.0.0.1/test: dial tcp 240.0.0.1:80: i/o timeout
2019-05-06T18:29:41.969589394+03:00 done = true, err = Get http://240.0.0.1/test: context deadline exceeded
2019-05-06T18:29:41.969461857+03:00 done = false, err = Get http://240.0.0.1/test: dial tcp 240.0.0.1:80: i/o timeout
2019-05-06T18:29:41.969294915+03:00 done = false, err = Get http://240.0.0.1/test: dial tcp 240.0.0.1:80: i/o timeout
2019-05-06T18:29:41.969336744+03:00 done = false, err = Get http://240.0.0.1/test: dial tcp 240.0.0.1:80: i/o timeout
2019-05-06T18:29:41.969433802+03:00 done = false, err = Get http://240.0.0.1/test: dial tcp 240.0.0.1:80: i/o timeout

@bradfitz bradfitz changed the title http.Client.Do context deadline before Done net/http: Client.Do returns different errors on context deadline depending on req/conn state May 6, 2019

@bradfitz bradfitz added this to the Go1.14 milestone May 6, 2019

@bradfitz bradfitz added the help wanted label May 6, 2019

@MikeBeller

This comment has been minimized.

Copy link

commented May 13, 2019

I thought this was an interesting way to learn more about the go runtime, so I looked into it.
First of all, I determined that this can be traced back to the behavior of net.DialContext.
If you call it with this nonexistent address, and a context that has a deadline, it will
always return with i/o timeout error, but sometimes the context will be Done and sometimes
it will not. So I wrote a much simpler test, that always fails (after a few loops). See the
test TestA1 in the following playground:

https://play.golang.org/p/xhSHGthpa3U

My next question was whether this was really a bug. Does it matter if the context is Done()
or not, as long as the reason for the error is obvious? So I developed the following workaround,
shown in TestA2 of the same playground. It always succeeds.

Just to see whether this was some kind of new behavior, I tested on go versions all the way
back to 1.7, and TestA1 behaves the same way on all versions. It is not a (recent) regression.

At this point i'm staring at runtime/netpoll.go and internal/poll/fd_poll_runtime.go, but I would
need some pointers to make sense of it.

I'm willing to look further in to this if it would be of help but first I'd like to hear from someone
in the know. Based on what I've written here, is it really a bug?

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented May 13, 2019

It sounds like the issue is that context.WithTimeout gets turned into a net package deadline, and the effect is that it's unpredictable whether we return an I/O timeout error or a context done error. That doesn't seem like a bug, exactly, but I suppose it would be nice if we could consistently return a context done error. To make that work I think we would have to make sure that we can reliably distinguish a net package timeout due to setting a deadline based on a context, and a net package timeout due to any other cause. If we could do that then if we get an error due to a deadline based on context, we could safely wait for the context do be done and return the context's error.

@fraenkel

This comment has been minimized.

Copy link
Contributor

commented May 14, 2019

When I took a quick look, the problem was the one who did the dial saw the actual error and then it became a race between the deadline channel and the error propagation through layers of context as to what outcome each request sees. There is no easy way to fix this unless we have some well defined structure on how errors are propagated rather than creating go funcs that wait for 1 of many possible outcomes.

@MikeBeller

This comment has been minimized.

Copy link

commented May 14, 2019

Well, one approach (perhaps consistent with @ianlancetaylor's comment) is just to add a few lines to net/dial.go sysDial.dialSingle. Where it currently says:

	if err != nil {
		return nil, &OpError{Op: "dial", Net: sd.network, Source: la, Addr: ra, Err: err} // c is non-nil interface containing nil pointer
	}

Insert a check for the deadline being past due, and if so, wait on ctx.Done() as follows:

	if err != nil {
		if os.IsTimeout(err) {
			if dl, ok := ctx.Deadline(); !ok || dl.Before(time.Now()) {
				// Issue 31863.  Prefer the context error to the i/o error
				<-ctx.Done()
				return nil, &OpError{Op: "dial", Net: sd.network, Source: la, Addr: ra, Err: ctx.Err()}
			}
		}
		return nil, &OpError{Op: "dial", Net: sd.network, Source: la, Addr: ra, Err: err} // c is non-nil interface containing nil pointer
	}

I tried this and built go. It passes my tests and the standard regression suite.

@MikeBeller

This comment has been minimized.

Copy link

commented May 17, 2019

I'm checking back in on this issue -- I'm happy to prepare a PR with my proposed fix and tests, but didn't want to move forward on it unless folks with a good understanding of the runtime thought such a PR was worth considering.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.