Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: TestNotTemporaryRead timeouts on aix/ppc64 #29685

Open
Helflym opened this Issue Jan 11, 2019 · 5 comments

Comments

Projects
None yet
4 participants
@Helflym
Copy link
Contributor

commented Jan 11, 2019

Hi,
I'm trying to resolve timeouts occurring on aix/ppc64 with net.TestNotTemporaryRead.

https://build.golang.org/log/45540cc03c1d37057e8f725d7f2dd431652ddf4c
https://build.golang.org/log/37d60c3b3cd46cf39d118f84e695049d390da40e
...

This timeout occurs because Accept() seems to be stuck in a infinite loop if the server is already closed. It's only a guess because I can't trigger the bug manually on my local machine. However, a similar behaviorr can be easily made with: https://play.golang.org/p/0IXrHf87i-2.
It does work on linux/amd64 but it times out on aix/ppc64. This might not be the root of this bug but a possible workaround can be to increase the delay on the server.

However, I've several questions:

  • Why is the client doing Accept() and the server doing Dial() ? Is it supposed to be the opposite or it doesn't matter ? This is the case for some others tests of net_test.go.
  • What behavior Go is expected when an accept() is made when the server is already closed (but the port is still listened) ?
    Should it succeed or not ? On aix/ppc64, accept syscall returns EAGAIN (because of O_NONBLOCK flag) and on linux/amd64 it does succeed.

I've also discovered that the behavior of connect is slightly different on AIX than on Linux (I don't know about others OSes). I've tried with the following C code (taken from #6828): accept_after_connect.c.txt. The first connect doesn't return EINPROGRESS as on Linux. It doesn't seem a bug as a connection can result from the listen syscall.
Does Go want EINPROGRESS to be returned ? (*netFD).connect will wait with netpoll if it is.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 11, 2019

Why is the client doing Accept() and the server doing Dial() ? Is it supposed to be the opposite or it doesn't matter ? This is the case for some others tests of net_test.go.

The names used are confusing but I can't see that it matters in this case.

What behavior Go is expected when an accept() is made when the server is already closed (but the port is still listened) ?

Go expects that calling the Accept method on a closed Listener will return an error. But I'm not sure that is what you are asking; I'm not sure quite what you are asking. The TestNotTemporaryRead test is intended to see what happens when Read is called on a connected socket, and then the other side of the socket connection is closed.

Does Go want EINPROGRESS to be returned ?

As opposed to what? The Go code will work fine if connect blocks until the connection is complete. It's more efficient if the kernel returns EINPROGRESS, because then we don't have to tie up a thread. But if we tie up a thread waiting for connect to complete, everything should still work.

The stack traces from the trybots suggest that the Accept never completes. But it appears that the Dial does complete. I don't understand how a Dial could complete without the corresponding Accept also completing. Does AIX have anything like strace that could show the exact system calls? What does your test program display on AIX?

@julieqiu julieqiu changed the title net: net.TestNotTemporaryRead timeouts on aix/ppc64 net: TestNotTemporaryRead timeouts on aix/ppc64 Jan 11, 2019

@Helflym

This comment has been minimized.

Copy link
Contributor Author

commented Jan 14, 2019

Does Go want EINPROGRESS to be returned ?

As opposed to what? The Go code will work fine if connect blocks until the connection is complete. It's more efficient if the kernel returns EINPROGRESS, because then we don't have to tie up a thread. But if we tie up a thread waiting for connect to complete, everything should still work.

That's the problem. On AIX, connect doesn't block after a listen. On Linux, man says:

   listen() marks the socket referred to by sockfd as a passive socket, that is, as a socket that will be used to accept incoming connection requests using accept(2).

There is nothing similar on AIX. Therefore, once a socket is opened with listen, every connection is accepted. There is no "passive socket".

Using the C program above, I've the following trace on AIX:

socket(2, 1, 0)                                 = 3
bind(3, 0x0FFFFFFFFFFFEF1C, 16)                 = 0
listen(3, 5)                                    = 0
ngetsockname(3, 0x0FFFFFFFFFFFEF1C, 0x0FFFFFFFFFFFEF30) = 0
socket(2, 1, 0)                                 = 4
kfcntl(4, F_GETFL, 0x0000000110008158)          = 2
kioctl(4, -2147195266, 0x0FFFFFFFFFFFEDF0, 0x0000000000000000) = 0
kioctl(4, -2147195267, 0x0FFFFFFFFFFFEDF0, 0x0000000000000000) = 0
kfcntl(4, F_SETFL, 0x0000000000000006)          = 0
getsockopt(4, 65535, 4104, 0x0FFFFFFFFFFFEE64, 0x0FFFFFFFFFFFEE60) = 0
connext(4, 0x0FFFFFFFFFFFEF1C, 16)              = 0

The last connect (connext here) doesn't return any error, while on Linux it does.

socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
listen(3, 5)                            = 0
getsockname(3, {sa_family=AF_INET, sin_port=htons(36821), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 4
fcntl(4, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
connect(4, {sa_family=AF_INET, sin_port=htons(36821), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)

If you look at TestDialContextCancelRace, the connect function explicitly returns EINPROGRESS if not error is returned. It's the case on AIX.
However, I didn't managed to do something similar on TestNotTemporaryRead...

I don't know if it's possible to change how net.Listen works to match Linux behavior on AIX. Moreover, it seems a little risky to modify it as some AIX users might want this feature as they're used to.

Anyway, this difference only impacts TestNotTemporaryRead at the moment. Moreover, this bug only happens when a machine is very busy. Increasing the sleep duration and adding another sleep in the client should resolve the problem because Accept will be called before the server is closed and thanks to the second sleep, the read will still be issued after it (it might not be needed but it's safer I think).

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 14, 2019

On GNU/Linux EINPROGRESS just means that the descriptor was marked as non-blocking but the connect would block. The connect does still succeed, and when it does, the poller will mark the socket as readable, and a second call to connect will complete without error.

@gopherbot

This comment has been minimized.

Copy link

commented Jan 16, 2019

Change https://golang.org/cl/158038 mentions this issue: net: increase TestNotTemporaryRead server sleep

@Helflym

This comment has been minimized.

Copy link
Contributor Author

commented Jan 16, 2019

Well, I've submitted a CL to fix this test. I'll see later if I can provide a more general fix on that issue.

gopherbot pushed a commit that referenced this issue Jan 16, 2019

net: increase TestNotTemporaryRead server sleep
On aix/ppc64, if the server closes before the client calls Accept,
this test will fail.

Increasing the time before the server closes should resolve this
timeout.

Updates #29685

Change-Id: Iebb849d694fc9c37cf216ce1f0b8741249b98016
Reviewed-on: https://go-review.googlesource.com/c/158038
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.