Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: TestNotTemporaryRead timeouts on aix/ppc64 #29685

Open
Helflym opened this issue Jan 11, 2019 · 9 comments
Labels
Milestone

Comments

@Helflym
Copy link
Contributor

@Helflym Helflym commented Jan 11, 2019

Hi,
I'm trying to resolve timeouts occurring on aix/ppc64 with net.TestNotTemporaryRead.

https://build.golang.org/log/45540cc03c1d37057e8f725d7f2dd431652ddf4c
https://build.golang.org/log/37d60c3b3cd46cf39d118f84e695049d390da40e
...

This timeout occurs because Accept() seems to be stuck in a infinite loop if the server is already closed. It's only a guess because I can't trigger the bug manually on my local machine. However, a similar behaviorr can be easily made with: https://play.golang.org/p/0IXrHf87i-2.
It does work on linux/amd64 but it times out on aix/ppc64. This might not be the root of this bug but a possible workaround can be to increase the delay on the server.

However, I've several questions:

  • Why is the client doing Accept() and the server doing Dial() ? Is it supposed to be the opposite or it doesn't matter ? This is the case for some others tests of net_test.go.
  • What behavior Go is expected when an accept() is made when the server is already closed (but the port is still listened) ?
    Should it succeed or not ? On aix/ppc64, accept syscall returns EAGAIN (because of O_NONBLOCK flag) and on linux/amd64 it does succeed.

I've also discovered that the behavior of connect is slightly different on AIX than on Linux (I don't know about others OSes). I've tried with the following C code (taken from #6828): accept_after_connect.c.txt. The first connect doesn't return EINPROGRESS as on Linux. It doesn't seem a bug as a connection can result from the listen syscall.
Does Go want EINPROGRESS to be returned ? (*netFD).connect will wait with netpoll if it is.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jan 11, 2019

Why is the client doing Accept() and the server doing Dial() ? Is it supposed to be the opposite or it doesn't matter ? This is the case for some others tests of net_test.go.

The names used are confusing but I can't see that it matters in this case.

What behavior Go is expected when an accept() is made when the server is already closed (but the port is still listened) ?

Go expects that calling the Accept method on a closed Listener will return an error. But I'm not sure that is what you are asking; I'm not sure quite what you are asking. The TestNotTemporaryRead test is intended to see what happens when Read is called on a connected socket, and then the other side of the socket connection is closed.

Does Go want EINPROGRESS to be returned ?

As opposed to what? The Go code will work fine if connect blocks until the connection is complete. It's more efficient if the kernel returns EINPROGRESS, because then we don't have to tie up a thread. But if we tie up a thread waiting for connect to complete, everything should still work.

The stack traces from the trybots suggest that the Accept never completes. But it appears that the Dial does complete. I don't understand how a Dial could complete without the corresponding Accept also completing. Does AIX have anything like strace that could show the exact system calls? What does your test program display on AIX?

@julieqiu julieqiu changed the title net: net.TestNotTemporaryRead timeouts on aix/ppc64 net: TestNotTemporaryRead timeouts on aix/ppc64 Jan 11, 2019
@Helflym

This comment has been minimized.

Copy link
Contributor Author

@Helflym Helflym commented Jan 14, 2019

Does Go want EINPROGRESS to be returned ?

As opposed to what? The Go code will work fine if connect blocks until the connection is complete. It's more efficient if the kernel returns EINPROGRESS, because then we don't have to tie up a thread. But if we tie up a thread waiting for connect to complete, everything should still work.

That's the problem. On AIX, connect doesn't block after a listen. On Linux, man says:

   listen() marks the socket referred to by sockfd as a passive socket, that is, as a socket that will be used to accept incoming connection requests using accept(2).

There is nothing similar on AIX. Therefore, once a socket is opened with listen, every connection is accepted. There is no "passive socket".

Using the C program above, I've the following trace on AIX:

socket(2, 1, 0)                                 = 3
bind(3, 0x0FFFFFFFFFFFEF1C, 16)                 = 0
listen(3, 5)                                    = 0
ngetsockname(3, 0x0FFFFFFFFFFFEF1C, 0x0FFFFFFFFFFFEF30) = 0
socket(2, 1, 0)                                 = 4
kfcntl(4, F_GETFL, 0x0000000110008158)          = 2
kioctl(4, -2147195266, 0x0FFFFFFFFFFFEDF0, 0x0000000000000000) = 0
kioctl(4, -2147195267, 0x0FFFFFFFFFFFEDF0, 0x0000000000000000) = 0
kfcntl(4, F_SETFL, 0x0000000000000006)          = 0
getsockopt(4, 65535, 4104, 0x0FFFFFFFFFFFEE64, 0x0FFFFFFFFFFFEE60) = 0
connext(4, 0x0FFFFFFFFFFFEF1C, 16)              = 0

The last connect (connext here) doesn't return any error, while on Linux it does.

socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
listen(3, 5)                            = 0
getsockname(3, {sa_family=AF_INET, sin_port=htons(36821), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 4
fcntl(4, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
connect(4, {sa_family=AF_INET, sin_port=htons(36821), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)

If you look at TestDialContextCancelRace, the connect function explicitly returns EINPROGRESS if not error is returned. It's the case on AIX.
However, I didn't managed to do something similar on TestNotTemporaryRead...

I don't know if it's possible to change how net.Listen works to match Linux behavior on AIX. Moreover, it seems a little risky to modify it as some AIX users might want this feature as they're used to.

Anyway, this difference only impacts TestNotTemporaryRead at the moment. Moreover, this bug only happens when a machine is very busy. Increasing the sleep duration and adding another sleep in the client should resolve the problem because Accept will be called before the server is closed and thanks to the second sleep, the read will still be issued after it (it might not be needed but it's safer I think).

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jan 14, 2019

On GNU/Linux EINPROGRESS just means that the descriptor was marked as non-blocking but the connect would block. The connect does still succeed, and when it does, the poller will mark the socket as readable, and a second call to connect will complete without error.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Jan 16, 2019

Change https://golang.org/cl/158038 mentions this issue: net: increase TestNotTemporaryRead server sleep

@Helflym

This comment has been minimized.

Copy link
Contributor Author

@Helflym Helflym commented Jan 16, 2019

Well, I've submitted a CL to fix this test. I'll see later if I can provide a more general fix on that issue.

gopherbot pushed a commit that referenced this issue Jan 16, 2019
On aix/ppc64, if the server closes before the client calls Accept,
this test will fail.

Increasing the time before the server closes should resolve this
timeout.

Updates #29685

Change-Id: Iebb849d694fc9c37cf216ce1f0b8741249b98016
Reviewed-on: https://go-review.googlesource.com/c/158038
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@bcmills

This comment has been minimized.

Copy link
Member

@bcmills bcmills commented Jul 8, 2019

Another recent timeout in this test, again on the aix-ppc64 builder:
https://build.golang.org/log/ac2fca1f71734020ee6533f578ccbbcf89fe8cd8

@Helflym

This comment has been minimized.

Copy link
Contributor Author

@Helflym Helflym commented Jul 9, 2019

Yes, it seems the new sleep time is still not enough... This test isn't fully compatible with AIX behavior which is slightly different if an accept is made after a listen. However, I won't have time to work on it before 1.13 release. Is it possible to mark it as flaky until a better fix is made ?

@bcmills

This comment has been minimized.

Copy link
Member

@bcmills bcmills commented Jul 10, 2019

Is it possible to mark it as flaky until a better fix is made ?

Yep, see testenv.SkipFlaky.

if runtime.GOOS == "aix" {
	testenv.SkipFlaky(t, 29685)
}
@bcmills bcmills added NeedsFix and removed WaitingForInfo labels Jul 10, 2019
@bcmills bcmills added this to the Go1.14 milestone Jul 10, 2019
@bcmills bcmills added the Testing label Jul 10, 2019
@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Jul 11, 2019

Change https://golang.org/cl/185717 mentions this issue: net: set TestNotTemporaryRead flaky for AIX

gopherbot pushed a commit that referenced this issue Jul 11, 2019
This test sometimes times out when the machine is busy.
The reason behind is still a bit blurry. But it seems to comes from
the fact that on AIX, once a listen is performed a socket, every
connection will be accepted even before an accept is made (which only
occurs when a machine is busy). On Linux, a socket is created as a
"passive socket" which seems to wait for the accept before allowing
incoming connections.

Updates #29685

Change-Id: I41b053b7d5f5b4420b72d6a217be72e41220d769
Reviewed-on: https://go-review.googlesource.com/c/go/+/185717
Run-TryBot: Clément Chigot <clement.chigot@atos.net>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.