-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
windows: cgo is broken #4955
Labels
Milestone
Comments
The problem is that runtime.usleep (called from "textflag 7" lockextra) is not marked as "textflag 7". So, it tries to grow stack and it needs m for that, but m is 0 in TestCthread. I can mark runtime.usleep to be "testflag 7", but that is not enough, because runtime.usleep calls OS and therefore switches stacks. Should runtime.usleep be both "testflag 7" and not switch stacks? I tried to do that, and tests PASS. But I am not certain. Also runtime.usleep needs to be rewritten in asm, because otherwise the stack of "textflag 7" functions becomes too long. I have done it for amd64, I will do it for 386 as well, if my assumptions are valid. Alex Labels changed: added os-windows. |
Issue #4872 has been merged into this issue. |
usleep/osyield must switch stack, because they can be called from a normal goroutine (e.g. runtime.lock), and then stack switch is required. For other OSes textflag7 magically resolve all the issues, but for Windows it seems to be more complicated. We can introduce another pair of functions (or perhaps just one) that can be called w/o m. And use them in lockextra(). The simples impl for Windows will be an empty func. It will affect performance during contention, but should not lead to crashes/deadlocks. |
> usleep/osyield must switch stack, because they can be called from a normal goroutine (e.g. runtime.lock), and then stack switch is required. Fair enough. > For other OSes textflag7 magically resolve all the issues, but for Windows it seems to be more complicated. I do not know about other OSes, but windows syscalls require large stack, because they start with a call to a system dll, which is just a C code that runs in user space and need stack just like any other program. I am not sure how "textflag7 magically resolve all the issues" on other OSes. > We can introduce another pair of functions (or perhaps just one) that can be called w/o m. And use them in lockextra(). The simples impl for Windows will be an empty func. It will affect performance during contention, but should not lead to crashes/deadlocks. Sounds like a plan. I can make these real syscalls. Do you have a good name for them? Is lockextra the only place these should be called from? Should we wait for Russ? Alex |
I believe that it's okay to call ntdll system call functions from the goroutine stack. They do not use additional stack space (perhaps one more word). For example, on 386, if we add this to thread_windows.c: #pragma dynimport runtime�NtYield NtYieldExecution "ntdll.dll" #pragma dynimport runtime�NtWaitForSingleObject NtWaitForSingleObject "ntdll.dll" extern void *runtime�NtWaitForSingleObject; extern void *runtime�NtYield; then I believe the following work as implementations of usleep and osyield: TEXT runtime�osyield(SB),7,$0 MOVL runtime�NtYield(SB), AX MOVL SP, BP CALL AX MOVL BP, SP RET TEXT runtime�usleep(SB),7,$20 MOVL runtime�NtWaitForSingleObject(SB), AX MOVL usec+0(FP), BX MOVL $0, hi-4(SP) MOVL BX, lo-8(SP) MOVL $lo-8(SP), BX MOVL BX, ptime-12(SP) MOVL $0, alertable-16(SP) MOVL $0, handle-20(SP) MOVL SP, BP CALL AX MOVL BP, SP RET On amd64 it should be similar (and probably easier since no stdcall nonsense). Can someone who observed the flakiness on 32-bit try this? I had to comment out the TestCthread test in misc/cgo/test/cgo_test.go, but I that failure is a separate problem. Russ |
because NtWaitForSingleObject and NtYieldExecution are both in nt native syscall list, so their functions in ntdll.dll are really thin shim that forward parameter to the real syscall. x86 windows syscall might require syscall number be passed on stack, thus the requirement of one more word on stack. (however, iirc, the syscall number is passed in ax?) one caveat though: interface of ntdll.dll is not publicly documented and is subject to arbitrary change (though it's fairly unlikely to change). |
Applied this: diff --git a/src/pkg/runtime/sys_windows_386.s b/src/pkg/runtime/sys_windows_386.s --- a/src/pkg/runtime/sys_windows_386.s +++ b/src/pkg/runtime/sys_windows_386.s @@ -314,3 +314,30 @@ MOVL AX, 0(FS) RET + +// void runtime·osyield(void) +// Called from the goroutine stack. +// Assume it does not use additional stack space. +TEXT runtime·osyield(SB),7,$0 + MOVL runtime·NtYieldExecution(SB), AX + MOVL SP, BP + CALL AX + MOVL BP, SP + RET + +// void runtime·usleep(uint32 us) +// Called from the goroutine stack. +// Assume it does not use additional stack space. +TEXT runtime·usleep(SB),7,$20 + MOVL runtime·NtWaitForSingleObject(SB), AX + MOVL usec+0(FP), BX + MOVL $0, hi-4(SP) + MOVL BX, lo-8(SP) + LEAL lo-8(SP), BX + MOVL BX, ptime-12(SP) + MOVL $0, alertable-16(SP) + MOVL $0, handle-20(SP) + MOVL SP, BP + CALL AX + MOVL BP, SP + RET diff --git a/src/pkg/runtime/thread_windows.c b/src/pkg/runtime/thread_windows.c --- a/src/pkg/runtime/thread_windows.c +++ b/src/pkg/runtime/thread_windows.c @@ -31,6 +31,8 @@ #pragma dynimport runtime·timeBeginPeriod timeBeginPeriod "winmm.dll" #pragma dynimport runtime·WaitForSingleObject WaitForSingleObject "kernel32.dll" #pragma dynimport runtime·WriteFile WriteFile "kernel32.dll" +#pragma dynimport runtime·NtYieldExecution NtYieldExecution "ntdll.dll" +#pragma dynimport runtime·NtWaitForSingleObject NtWaitForSingleObject "ntdll.dll" extern void *runtime·CloseHandle; extern void *runtime·CreateEvent; @@ -56,6 +58,8 @@ extern void *runtime·timeBeginPeriod; extern void *runtime·WaitForSingleObject; extern void *runtime·WriteFile; +extern void *runtime·NtYieldExecution; +extern void *runtime·NtWaitForSingleObject; static int32 getproccount(void) @@ -135,22 +139,6 @@ return written; } -#pragma textflag 7 -void -runtime·osyield(void) -{ - runtime·stdcall(runtime·Sleep, 1, (uintptr)0); -} - -void -runtime·usleep(uint32 us) -{ - us /= 1000; - if(us == 0) - us = 1; - runtime·stdcall(runtime·Sleep, 1, (uintptr)us); -} - #define INFINITE ((uintptr)0xFFFFFFFF) int32 diff --git a/src/run.bat b/src/run.bat --- a/src/run.bat +++ b/src/run.bat @@ -72,8 +72,6 @@ echo. :: cgo tests -:: issue #4955 - cgo is broken -goto nocgo if x%CGO_ENABLED% == x0 goto nocgo echo # ..\misc\cgo\life go run %GOROOT%\test\run.go - ..\misc\cgo\life Now go test runtime -short -v -timeout=10m -cpu=1,2,4 command takes 3 times as long (from under 1m to aroung 200s) on my pc. I don't think it is acceptable. Alex |
It is okay to call the NtXxx functions because they are very thin wrappers around the actual kernel system calls. There is a list of the system calls at http://j00ru.vexillium.org/ntapi/. The exact numbers vary from version to version, which is why we need to call via ntdll.dll instead of making the calls ourselves. This will be more efficient as long as we can get the effects we need, and it will use a small enough amount of stack not to need a stack switch. Which tests get slower during the 'go test runtime -short -v'? The most likely reason they'd get slower is that osyield is not actually yielding the CPU, so that locks take much longer when contended. The old implementation did a Sleep(0), and the new implementation is using the NtYieldExecution system call, but perhaps NtYieldExecution is not strong enough. A simple test to do would be to put #define runtime�osyield() runtime�usleep(0) at the bottom of runtime.h, so that all the osyields become Sleep(0) again, and see if that makes the test execution fast again. Russ |
> ... it will use a small enough amount of stack not to need > a stack switch. I am still concerned about our stack usage. Even if new usleep needs just 20-30 bytes of stack, who will check that the room is there? I saw usleep sometimes called from a chain of "textflag 7" functions, so our stack could be on the last leg. And, if you think that 20-30 is on a low side, then think again, because (on amd64) we must reserve room for first 4 parameters + stack must be aligned on 16 (I do not know if these are required for NtXxx functions). Also, what happens, if it works on my pc, and then breaks somewhere else, where NtYieldExecution stack is slightly longer. How are we going to deal with these? The problem we have is with TestCthread test, that test quite uncommon (for the moment) functionality. I would say, lets fix that and move on for now. > ... Which tests get slower during the 'go test runtime -short -v'? I applied a.diff to 9742f722b558. No more NtYieldExecution, just NtWaitForSingleObject. I couldn't use your #define trick, so I did it differently. New version is slower. I ran "go test -short -v" and "go test -cpu=1,2 -v" (see files attached). TestCrashHandler looks suspect, but I am not sure why. > The most likely reason they'd get slower is that osyield is not actually yielding the CPU, so that locks take much longer when contended. The old implementation did a Sleep(0), and the new implementation is using the NtYieldExecution system call, but perhaps NtYieldExecution is not strong enough. A simple test to do would be to put #define runtime?osyield() runtime?usleep(0) at the bottom of runtime.h, so that all the osyields become Sleep(0) again, and see if that makes the test execution fast again. hg id 9742f722b558 tip Attachments:
|
I am not worried about system call dispatch being different on different machines. The set of Windows kernel system calls has been very stable over a period of decades, and there's no reason to believe it will change any time soon. And if it does change, it will be with a new Windows version, which might have other things to update anyway. We sometimes go through similar updates when new OS X versions come out. Here is the definition of the NtWaitForSingleObject function, disassembled using gdb. Note that I don't trust the symbols, but I do trust the memory: 0x774b5464 <ntdll!ZeWaitForSingleObject>: mov $0x160, %eax 0x774b5469 <ntdll!ZeWaitForSingleObject+5>: mov $0x7ffe0300,%edx 0x774b546e <ntdll!ZeWaitForSingleObject+10>: call *(%edx) 0x774b5470 <ntdll!ZeWaitForSingleObject+12>: ret $0xc 0x7ffe0300 0x774b5ca0 0x774b5ca0 <ntdll!KiUserExceptionDispatcher>: mov %esp,%edx 0x774b5ca2 <ntdll!KiUserExceptionDispatcher+2>: sysenter The entire function uses only one extra word of space, for the return PC pushed by the call instruction. I am willing to assume that will not change any time soon. You are correct that the linker does not expect even that one word, and sometimes one word can make a difference. So I propose that our wrapper first call a NOSPLIT no-op function that uses an extra word of space. If the linker doesn't reject the program, then the no-op call is safe, and by implication so is the call to the direct system call wrapper. It is important to me that we get cgo fully running again soon, because I am making other changes to cgo and don't want to be debugging those breakages with this breakage mixed in. For 386, I propose https://golang.org/cl/7563043. It takes the approach I have described. I can run it with -test.cpu=1,2,4 on the win32 builder without any notable slowdown. I will not mail the CL for review until I have added and tested amd64 too, but I'll gladly accept comments any time. Russ |
This issue was closed by revision 8aafb44. Status changed to Fixed. |
This issue was closed.
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
The text was updated successfully, but these errors were encountered: