New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
net/http: Dial I/O Timeout even when request context is not canceled #36848
Comments
…ent" Workaround for flaky test "TestNewRateLimitedClient" ref. golang/go#36848 Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
Hello @gyuho, thank you for the question and welcome to the Go project! So the context is meant to control the lifetime of a request and any operations that may need to be controlled by a user. The external view of applications that use a context is to end their operations when context.Done() receives a result and then the corresponding error is checked with context.Err(). Therefore by passing in a timeout of 300us, your code is asking consumers of the context to not process anything after that very short deadline, thus your dial I/O timeout. We document that the context controls the lifetime of the request as per:
I'd highly recommend taking a look at the context blog post https://blog.golang.org/context as well. Thank you. |
When I am using the context as:
I see this context being used in |
300 microseconds is a very short time and when dialling even local connections this can timeout at times, (I've seen some syscalls in the magnitude microseconds and local RPCs on very fast local networks taking about 100 microseconds). That context is being passed directly to net.Dialer.DialContext when using http.DefaultClient. Here is a repro that'll return the I/O timeout and performs a dial directly to a plainly written server that just spits out HTTP/1.1 package main
import (
"bufio"
"context"
"log"
"net"
"net/http"
"net/http/httputil"
"time"
)
func main() {
ln, err := net.Listen("tcp", ":0")
if err != nil {
log.Fatal(err)
}
defer ln.Close()
go func() {
for {
conn, err := ln.Accept()
if err != nil {
log.Fatalf("Failed to accept connection: %v", err)
}
conn.Write([]byte("HTTP/1.1 200 OK\r\nHost: foo.golang.org\r\nConnection: keep-alive\r\nContent-Length: 10\r\n\r\nabcdefghij\r\n"))
}
}()
var dialer net.Dialer
for {
func() {
ctx, cancel := context.WithTimeout(context.Background(), 300*time.Microsecond)
defer cancel()
conn, err := dialer.DialContext(ctx, "tcp", ln.Addr().String())
if err != nil {
log.Fatalf("Failed to read response: %v\nErr isTimeout? %t", err, err.(net.Error).Timeout())
}
defer conn.Close()
br := bufio.NewReader(conn)
res, err := http.ReadResponse(br, nil)
if err != nil {
log.Fatalf("Failed to read response: %v", err)
}
blob, _ := httputil.DumpResponse(res, true)
_ = res.Body.Close()
println(string(blob) + "\n\n")
}()
<-time.After(800 * time.Millisecond)
}
} and when run will show you the I/O timeout on the second request but not the first $ go run dial.go
HTTP/1.1 200 OK
Content-Length: 10
Connection: keep-alive
Host: foo.golang.org
abcdefghij
2020/01/30 10:43:17 Failed to read response: dial tcp [::]:54107: i/o timeout
Err isTimeout? true
exit status 1 I am currently on a long haul flight but when I get home in about 12 hours, I'll take a look at your response, but really perhaps please examine the timeouts that you are passing into the context which controls the lifetime of the request, dials as well as responses. |
@odeke-em I am aware 300µs is very short, even for local test server. Your code doesn't reproduce my issue. My confusion is:
|
Btw, thanks for helping on this! |
@gyuho you may get more/quicker help by posting to golang-nuts. |
Yes, it is being passed in for the dial and is documented as controlling the lifetime of the request. If the dial happens fast enough, you'll get a result back, otherwise you'll get an I/O timeout, which I tried to illustrate with the net dialling snippet above, but that happens non-deterministically.
For a context, as I had mentioned there are 2 ways that it can be rendered moot: 1. If the deadline is exceeded, 2. If cancel is invoked. In your case the deadline is what controls how long the operation is expected to be alive for, and in that short duration the dial wasn't completed hence the "dial TCP I/O timeout", when passed into the net.Dialer as per
@networkimprov while this is true, it rubs off wrong and sounds a bit dismissive of others' efforts. Please try to assume best/good intent and also try to be more productive when replying to answers, as we are working on providing clarity to their answer, with my own time.
Glad to be of help as much as I can :) Thanks! |
@odeke-em I think you misread my comment. This appears to be a question, not a bug report, and usually those are referred to the forums so as not to overburden folks like you! |
Thank you for the clarification with clearer wording Liam, and my apologies
for taking it the wrong way!
…On Mon, Feb 3, 2020 at 3:08 PM Liam ***@***.***> wrote:
@odeke-em <https://github.com/odeke-em> I think you misread my comment.
This appears to be a question, not a bug report, and usually those are
referred to the forums so as not to overburden folks like you!
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#36848?email_source=notifications&email_token=ABFL3V572MSD2W57UETYXPDRBCPYBA5CNFSM4KMXB7WKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEKVXVFY#issuecomment-581663383>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABFL3V6MLK5Q6FIL4XSS3MDRBCPYBANCNFSM4KMXB7WA>
.
|
Sorry for jumping on this one here but it seems to me this could highlight some kind of inconsistency. I have a similar issue when performing multiple GET request, each from a separate go routine, each with a context derived from a parent context with a deadline of 2.5 seconds. If the machine is put under heavy load, some of the requests start failing with a I think it would be more consistent to have a (Sorry if it doesn't, first time commenting on an issue, please be kind :D) In our case this made the difference between investigating a network issue and a performance issue. (Edited to add comment about impact on investigation) |
@giannimassi could you post a runnable code example for the scenario you describe? |
@networkimprov Yes, I should've done it in the first place, thanks for asking. package main
import (
"context"
"fmt"
"net"
"net/http"
"strconv"
"strings"
"sync"
"time"
)
const (
n = 1000
timeout = 10000 * time.Microsecond
)
func main() {
ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(timeout))
defer cancel()
doConcurrentRequests(ctx, n)
}
func doConcurrentRequests(ctx context.Context, n int) {
// Explicitly configure Dial Timeout to 30 seconds
// (this is not necessary since the default client already has this configuration)
transport := http.DefaultTransport
transport.(*http.Transport).DialContext = (&net.Dialer{
Timeout: 30 * time.Second,
KeepAlive: 30 * time.Second,
DualStack: true,
}).DialContext
client := http.DefaultClient
client.Transport = transport
wg := sync.WaitGroup{}
for i := 0; i < n; i++ {
wg.Add(1)
go func(id int) {
defer wg.Done()
now := time.Now()
err := doRequest(client, ctx)
if err != nil && !strings.Contains(err.Error(), "deadline") {
fmt.Println(strconv.Itoa(id) + " - Request took:" + time.Since(now).String() + " " + err.Error())
}
}(i)
}
wg.Wait()
}
func doRequest(client *http.Client, ctx context.Context) error {
req, err := http.NewRequest("GET", "http://example.com", nil)
if err != nil {
return fmt.Errorf("while making new request: %w", err)
}
_, err = client.Do(req.WithContext(ctx))
if err != nil {
return fmt.Errorf("while executing request: %w", err)
}
return nil
} (Btw I know this is far from efficient code |
@giannimassi trying your script on |
@arl makes sense. Might be some syscall return value interpretation issue. It could also be the fact that the behaviour requires different parameters, with the same parameters the behaviour changes a lot on my machine. For the record this is an example output with 706 - Request took:8.420581ms while executing request: Get http://example.com: dial tcp [2606:2800:220:1:248:1893:25c8:1946]:80: i/o timeout
76 - Request took:8.604937ms while executing request: Get http://example.com: dial tcp [2606:2800:220:1:248:1893:25c8:1946]:80: i/o timeout I'm hoping to find some time to investigate this a bit more this weekend, and maybe make my first attempt at a fix in the standard library, although it doesn't look like the simplest piece of code (hints/help/feedback appreciated @networkimprov @cagedmantis). |
Can this be selinux related? I see "i/o timeout" instead of expected context deadline when running on selinux enabled centos 8. On my local machine i get deadline exceeded. |
any update? |
Seems related to 869e576 (c.f. kubernetes-sigs/aws-encryption-provider#61).
Maybe this is an expected behavior... Is it possible to see dial timeouts even when the request context has not been canceled? I am using https://golang.org/pkg/net/http/#RoundTripper with dial timeout 30 seconds and request context is 300µs. If this is possible, can you help understand why this happens?
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
What did you expect to see?
No panic.
What did you see instead?
The text was updated successfully, but these errors were encountered: