net: timeout tests hang on windows #5971
I can reliably reproduce it on my 64-bit Windows 7 machine if I set process affinity to 1 core in runtime.osinit, but I cannot reproduce it on another 64-bit Windows 7 machine. The bad news is that this looks like a kernel bug. Why I think so (aside from the fact that I do not see a bug in our code):

- Whenever it happens, the process hangs in a corrupted state on exit. I cannot kill it by any means; it remains semi-alive with one thread blocked in the kernel.
- When I tried to look at the semi-alive process with Process Explorer, it caused a BSOD several times.

When a process exits, Windows cancels all of its outstanding IO requests. So the hypothesis is that the IO request becomes corrupted and causes some kind of kernel deadlock; when the kernel tries to cancel it during process exit, that hangs as well.

I see that the timeout does fire -- the goroutine blocks in runtime_pollWaitCanceled. And it never unblocks, even if I increase the timeout. The CPU is idle, so it's not a busy loop (and a busy loop would not cause the process exit failure). If I add any logging on the interesting paths, the problem goes away.

Alex, you've said that the request eventually unblocks on your machine if you increase the timeout. One hypothesis is that your version of the OS has some kind of internal timeout for request cancellation or something like this, i.e. the OS eventually resolves the situation. But this does not happen on my version of the OS.

More bad news: I do not see any reasonable workaround. If my hypothesis is correct, then the lockup is caused by a concurrent IO readiness notification and cancellation request, and there is no way to avoid that.
Now I strongly believe that this is a kernel bug. Any attempt to log detailed info eliminates the bug, so I've implemented a fast in-memory binary logging system, and the results confirm that the IO completion notification is not sent for a cancelled IO operation. I've also collected winsock kernel traces, and they show the same. Finally, I've reproduced it in a small C Windows program. I've filed the bug with Microsoft Connect: https://connect.microsoft.com/VisualStudio/feedback/details/795203/cancelioex-hangs-program#tabs and posted to MSDN: http://social.msdn.microsoft.com/Forums/en-US/bbc1e9eb-1f70-4064-ae6a-6fa470c80c27/cancelioex-hangs-program
Dmitriy, I tried this on our windows-amd64 builder:

```
c:\go\src\pkg\net>hg id
87fd884c846b+ tip

c:\go\src\pkg\net>hg diff
diff -r 87fd884c846b src/pkg/net/timeout_test.go
--- a/src/pkg/net/timeout_test.go	Mon Jul 29 01:25:51 2013 +0200
+++ b/src/pkg/net/timeout_test.go	Sun Jul 28 02:03:47 2013 +0000
@@ -423,8 +423,6 @@
 	switch runtime.GOOS {
 	case "plan9":
 		t.Skipf("skipping test on %q", runtime.GOOS)
-	case "windows":
-		t.Skipf("skipping test on %q, see issue #5971", runtime.GOOS)
 	}
 	defer runtime.GOMAXPROCS(runtime.GOMAXPROCS(maxProcs))
@@ -498,7 +496,7 @@
 		clientc <- copyRes{n, err, d}
 	}()
-	const tooLong = 2000 * time.Millisecond
+	const tooLong = 10000 * time.Millisecond
 	select {
 	case res := <-clientc:
 		if isTimeout(res.err) {
```

and the test works fine (I tried it many times):

```
c:\go\src\pkg\net>go test -run=1
PASS
ok  	net	13.250s
```

I can see the CPU go to 100% during the test, so my explanation of "busy wait" and "timer does not get fired" is still possible. I can observe the same behaviour on my PC that used to fail this test with wait=0 in runtime.netpoll - and it still does if I set wait to 0. It also passes if I extend the tooLong deadline in our test. I could not reproduce the failure anywhere else, CPU affinity or not. Could you send me your "set cpu affinity" diff to see if I can use it? I didn't see a BSOD anywhere.

I also noticed our windows-amd64 builder OS has not been updated for a while (probably never). Should we upgrade it to the latest version available?

I doubt you will get much help from Microsoft, but fingers crossed. Perhaps you know someone who knows someone who might push your issue along. Thank you for all your work.

Alex
Something I found: http://msdn.microsoft.com/en-us/windows/hardware/gg487388.aspx. So it looks to me like a driver could be at fault here. Also, http://msdn.microsoft.com/en-us/library/aa480216.aspx talks about a "Driver Hang Verifier" tool that you can run to test your driver. Alex
Hmmm... I've seen these articles. But loopback must be short-circuited at the TCP/IP level; a driver must not be involved. I need to try turning off my antivirus...

What you describe looks different from what I see... Can you try to figure out where exactly it busy loops? A profiler can help, or a stack dump, or logging. I don't know which will be simpler to do on Windows.

Here is how I set affinity:

```
--- a/src/pkg/runtime/os_windows.c	Mon Jul 29 19:43:08 2013 +0400
+++ b/src/pkg/runtime/os_windows.c	Mon Jul 29 19:50:53 2013 +0400
@@ -36,6 +36,9 @@
 #pragma dynimport runtime·WriteFile WriteFile "kernel32.dll"
 #pragma dynimport runtime·NtWaitForSingleObject NtWaitForSingleObject "ntdll.dll"
+#pragma dynimport runtime·SetProcessAffinityMask SetProcessAffinityMask "kernel32.dll"
+extern void *runtime·SetProcessAffinityMask;
+
 extern void *runtime·NtWaitForSingleObject;
 extern void *runtime·CloseHandle;
@@ -78,6 +81,7 @@
 void
 runtime·osinit(void)
 {
+	runtime·stdcall(runtime·SetProcessAffinityMask, 2, (uintptr)-1, (uintptr)1); // -1 = current process, -2 = current thread
 	runtime·stdcall(runtime·DuplicateHandle, 7,
 		(uintptr)-1, (uintptr)-2, (uintptr)-1, &m->thread,
```
Dmitriy, I have done some debugging here (not on the windows-amd64 builder), and I see "timers are late on firing" as I have seen before. If you apply the attached diff to 620fea2e0220, you will see some debug output as well as the required source changes. Perhaps you can give me pointers on what I could do next. Alternatively, you could try to debug the problem on our windows-amd64 builder, if you like. Just let me know. Thank you. Alex
Why do you say that timers do not fire, when the last log shows that the timeout fired and the thread called CancelIoEx?

```
+          0: SetDeadline(1375158649929599400)
+ 2105549600: WSARecv(fd=300): Wait: 5 err=i/o timeout
+ 2105549600: WSARecv(fd=300): CancelIO: 1
+ 2105549600: WSARecv(fd=300): CancelIO: 2
+ 2105549600: WSARecv(fd=300): CancelIO: 3 err=Element not found.
+ 2105549600: WSARecv(fd=300): CancelIO: 4
+ 2105549600: WSARecv(fd=300): WaitCanceled: 1
```

Also, in the trace the thread waits in runtime_pollWaitCanceled. That means the timeout fired, but the IO completion notification is not sent.
I've got to the bottom of this. Windows is so carried away watching two goroutines enthusiastically exchange network packets that it refuses to schedule the timer thread for 2+ seconds. It's not that the thread cannot exit the syscall; it's not scheduled on a CPU at all. There are four preconditions:

1. Goroutines do heavy IO.
2. Dynamic priority boosting is turned on.
3. Server version of Windows.
4. Windows is configured to do server-type scheduling.

When I turn off dynamic priority boosting, the delay drops from 2+ seconds (the thread-starvation prevention threshold) to 400ms (4 100ms quantums). If I reconfigure the scheduler to do client-type scheduling, the delay further drops to 60ms (4 15ms quantums).

We cannot change the scheduler type in a Go process, because it's a system-wide setting. However, we can change dynamic priority boosting; a 400ms delay should be enough for this test to pass. If not, we can make the timeout on Windows somewhat longer.

This should not be a problem in real applications, because they usually do not consume 100% CPU (that's mostly possible only on localhost); the problem is also tolerated by multicore systems; and the timeout fails to fire only while the application is doing useful work, in which case a delayed timeout is most likely not a problem (you want the timeout to fire only when there is no data on the connection).

The fix is coming.
This issue was closed by revision a574822. Status changed to Fixed.
This issue was closed.