Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

windows: cgo is broken #4955

Closed
rsc opened this issue Mar 1, 2013 · 17 comments
Closed

windows: cgo is broken #4955

rsc opened this issue Mar 1, 2013 · 17 comments
Milestone

Comments

@rsc
Copy link
Contributor

rsc commented Mar 1, 2013

tests crashing during cgo. disabling for now so builder can test the rest of the tree.
@alexbrainman
Copy link
Member

Comment 1:

The problem is that runtime.usleep (called from "textflag 7" lockextra) is not marked as
"textflag 7". So, it tries to grow stack and it needs m for that, but m is 0 in
TestCthread.
I can mark runtime.usleep to be "testflag 7", but that is not enough, because
runtime.usleep calls OS and therefore switches stacks. Should runtime.usleep be both
"testflag 7" and not switch stacks?
I tried to do that, and tests PASS. But I am not certain. Also runtime.usleep needs to
be rewritten in asm, because otherwise the stack of "textflag 7" functions becomes too
long. I have done it for amd64, I will do it for 386 as well, if my assumptions are
valid.
Alex

Labels changed: added os-windows.

@alexbrainman
Copy link
Member

Comment 2:

Issue #4872 has been merged into this issue.

@dvyukov
Copy link
Member

dvyukov commented Mar 4, 2013

Comment 3:

usleep/osyield must switch stack, because they can be called from a normal goroutine
(e.g. runtime.lock), and then stack switch is required.
For other OSes textflag7 magically resolve all the issues, but for Windows it seems to
be more complicated.
We can introduce another pair of functions (or perhaps just one) that can be called w/o
m. And use them in lockextra(). The simples impl for Windows will be an empty func. It
will affect performance during contention, but should not lead to crashes/deadlocks.

@alexbrainman
Copy link
Member

Comment 4:

> usleep/osyield must switch stack, because they can be called from a normal goroutine
(e.g. runtime.lock), and then stack switch is required.
Fair enough.
> For other OSes textflag7 magically resolve all the issues, but for Windows it seems to
be more complicated.
I do not know about other OSes, but windows syscalls require large stack, because they
start with a call to a system dll, which is just a C code that runs in user space and
need stack just like any other program. I am not sure how "textflag7 magically resolve
all the issues" on other OSes.
> We can introduce another pair of functions (or perhaps just one) that can be called
w/o m. And use them in lockextra(). The simples impl for Windows will be an empty func.
It will affect performance during contention, but should not lead to crashes/deadlocks.
Sounds like a plan. I can make these real syscalls. Do you have a good name for them? Is
lockextra the only place these should be called from? Should we wait for Russ?
Alex

@dvyukov
Copy link
Member

dvyukov commented Mar 4, 2013

Comment 5:

In other OSes usleep/osyield just invoke syscalls from assembly, just a handful of
instructions.
Yes, for now it's the only place where such functions are required. It's always a good
idea to wait for Russ :)
Name? I dont' know. The first thing that comes to mind is - osyieldnom().

@rsc
Copy link
Contributor Author

rsc commented Mar 4, 2013

Comment 6:

Yes, please make usleep and osyield be textflag 7 functions with no stack
switches.
Russ

@rsc
Copy link
Contributor Author

rsc commented Mar 4, 2013

Comment 7:

I believe that it's okay to call ntdll system call functions from the
goroutine stack. They do not use additional stack space (perhaps one more
word).
For example, on 386, if we add this to thread_windows.c:
#pragma dynimport runtime�NtYield NtYieldExecution "ntdll.dll"
#pragma dynimport runtime�NtWaitForSingleObject NtWaitForSingleObject
"ntdll.dll"
extern void *runtime�NtWaitForSingleObject;
extern void *runtime�NtYield;
then I believe the following work as implementations of usleep and osyield:
TEXT runtime�osyield(SB),7,$0
MOVL runtime�NtYield(SB), AX
MOVL SP, BP
CALL AX
MOVL BP, SP
RET
TEXT runtime�usleep(SB),7,$20
MOVL runtime�NtWaitForSingleObject(SB), AX
MOVL usec+0(FP), BX
MOVL $0, hi-4(SP)
MOVL BX, lo-8(SP)
MOVL $lo-8(SP), BX
MOVL BX, ptime-12(SP)
MOVL $0, alertable-16(SP)
MOVL $0, handle-20(SP)
MOVL SP, BP
CALL AX
MOVL BP, SP
RET
On amd64 it should be similar (and probably easier since no stdcall
nonsense).
Can someone who observed the flakiness on 32-bit try this? I had to comment
out the TestCthread test in misc/cgo/test/cgo_test.go, but I that failure
is a separate problem.
Russ

@alexbrainman
Copy link
Member

Comment 8:

Russ,
> I believe that it's okay to call ntdll system call functions from the
> goroutine stack. They do not use additional stack space (perhaps one more
> word).
How did you come up to with this conclusion? Can you provide some references?
Alex

@minux
Copy link
Member

minux commented Mar 4, 2013

Comment 9:

because NtWaitForSingleObject and NtYieldExecution are both in nt native syscall list,
so their functions in ntdll.dll are really thin shim that forward parameter to the real
syscall.
x86 windows syscall might require syscall number be passed on stack, thus the
requirement of one more word on stack. (however, iirc, the syscall number is passed
in ax?)
one caveat though: interface of ntdll.dll is not publicly documented and is subject to
arbitrary change (though it's fairly unlikely to change).

@alexbrainman
Copy link
Member

Comment 10:

Applied this:
diff --git a/src/pkg/runtime/sys_windows_386.s b/src/pkg/runtime/sys_windows_386.s
--- a/src/pkg/runtime/sys_windows_386.s
+++ b/src/pkg/runtime/sys_windows_386.s
@@ -314,3 +314,30 @@
    MOVL    AX, 0(FS)
 
    RET
+
+// void runtime·osyield(void)
+// Called from the goroutine stack.
+// Assume it does not use additional stack space.
+TEXT runtime·osyield(SB),7,$0
+   MOVL    runtime·NtYieldExecution(SB), AX
+   MOVL    SP, BP
+   CALL    AX
+   MOVL    BP, SP
+   RET
+
+// void runtime·usleep(uint32 us)
+// Called from the goroutine stack.
+// Assume it does not use additional stack space.
+TEXT runtime·usleep(SB),7,$20
+   MOVL    runtime·NtWaitForSingleObject(SB), AX
+   MOVL    usec+0(FP), BX
+   MOVL    $0, hi-4(SP)
+   MOVL    BX, lo-8(SP)
+   LEAL    lo-8(SP), BX
+   MOVL    BX, ptime-12(SP)
+   MOVL    $0, alertable-16(SP)
+   MOVL    $0, handle-20(SP)
+   MOVL    SP, BP
+   CALL    AX
+   MOVL    BP, SP
+   RET
diff --git a/src/pkg/runtime/thread_windows.c b/src/pkg/runtime/thread_windows.c
--- a/src/pkg/runtime/thread_windows.c
+++ b/src/pkg/runtime/thread_windows.c
@@ -31,6 +31,8 @@
 #pragma dynimport runtime·timeBeginPeriod timeBeginPeriod "winmm.dll"
 #pragma dynimport runtime·WaitForSingleObject WaitForSingleObject "kernel32.dll"
 #pragma dynimport runtime·WriteFile WriteFile "kernel32.dll"
+#pragma dynimport runtime·NtYieldExecution NtYieldExecution "ntdll.dll"
+#pragma dynimport runtime·NtWaitForSingleObject NtWaitForSingleObject "ntdll.dll"
 
 extern void *runtime·CloseHandle;
 extern void *runtime·CreateEvent;
@@ -56,6 +58,8 @@
 extern void *runtime·timeBeginPeriod;
 extern void *runtime·WaitForSingleObject;
 extern void *runtime·WriteFile;
+extern void *runtime·NtYieldExecution;
+extern void *runtime·NtWaitForSingleObject;
 
 static int32
 getproccount(void)
@@ -135,22 +139,6 @@
    return written;
 }
 
-#pragma textflag 7
-void
-runtime·osyield(void)
-{
-   runtime·stdcall(runtime·Sleep, 1, (uintptr)0);
-}
-
-void
-runtime·usleep(uint32 us)
-{
-   us /= 1000;
-   if(us == 0)
-       us = 1;
-   runtime·stdcall(runtime·Sleep, 1, (uintptr)us);
-}
-
 #define INFINITE ((uintptr)0xFFFFFFFF)
 
 int32
diff --git a/src/run.bat b/src/run.bat
--- a/src/run.bat
+++ b/src/run.bat
@@ -72,8 +72,6 @@
 echo.
 
 :: cgo tests
-:: issue #4955 - cgo is broken
-goto nocgo
 if x%CGO_ENABLED% == x0 goto nocgo
 echo # ..\misc\cgo\life
 go run %GOROOT%\test\run.go - ..\misc\cgo\life
Now
go test runtime -short -v -timeout=10m -cpu=1,2,4
command takes 3 times as long (from under 1m to aroung 200s) on my pc. I don't think it
is acceptable.
Alex

@rsc
Copy link
Contributor Author

rsc commented Mar 5, 2013

Comment 11:

It is okay to call the NtXxx functions because they are very thin wrappers
around the actual kernel system calls. There is a list of the system calls
at http://j00ru.vexillium.org/ntapi/. The exact numbers vary from version
to version, which is why we need to call via ntdll.dll instead of making
the calls ourselves. This will be more efficient as long as we can get the
effects we need, and it will use a small enough amount of stack not to need
a stack switch.
Which tests get slower during the 'go test runtime -short -v'? The most
likely reason they'd get slower is that osyield is not actually yielding
the CPU, so that locks take much longer when contended. The old
implementation did a Sleep(0), and the new implementation is using the
NtYieldExecution system call, but perhaps NtYieldExecution is not strong
enough. A simple test to do would be to put
#define runtime�osyield() runtime�usleep(0)
at the bottom of runtime.h, so that all the osyields become Sleep(0) again,
and see if that makes the test execution fast again.
Russ

@rsc
Copy link
Contributor Author

rsc commented Mar 5, 2013

Comment 12:

I should add that I verified that the NtXxx functions were thin wrappers
around the system calls, in contrast to what we used to call, by
disassembling both using gdb on a stopped Go program.
Russ

@dvyukov
Copy link
Member

dvyukov commented Mar 5, 2013

Comment 13:

AFAIR,
YieldExecution yields to threads in the run queue of the current CPU only.
Sleep(0) yields to threads of the same or higher priority only.
Sleep(1) yields to all threads.

@alexbrainman
Copy link
Member

Comment 14:

> ... it will use a small enough amount of stack not to need
> a stack switch.
I am still concerned about our stack usage. Even if new usleep needs just 20-30 bytes of
stack, who will check that the room is there? I saw usleep sometimes called from a chain
of "textflag 7" functions, so our stack could be on the last leg. And, if you think that
20-30 is on a low side, then think again, because (on amd64) we must reserve room for
first 4 parameters + stack must be aligned on 16 (I do not know if these are required
for NtXxx functions).
Also, what happens, if it works on my pc, and then breaks somewhere else, where
NtYieldExecution stack is slightly longer. How are we going to deal with these?
The problem we have is with TestCthread test, that test quite uncommon (for the moment)
functionality. I would say, lets fix that and move on for now.
> ... Which tests get slower during the 'go test runtime -short -v'?
I applied a.diff to 9742f722b558. No more NtYieldExecution, just NtWaitForSingleObject.
I couldn't use your #define trick, so I did it differently. New version is slower. I ran
"go test -short -v" and "go test -cpu=1,2 -v" (see files attached). TestCrashHandler
looks suspect, but I am not sure why.
> The most
likely reason they'd get slower is that osyield is not actually yielding
the CPU, so that locks take much longer when contended. The old
implementation did a Sleep(0), and the new implementation is using the
NtYieldExecution system call, but perhaps NtYieldExecution is not strong
enough. A simple test to do would be to put
#define runtime?osyield() runtime?usleep(0)
at the bottom of runtime.h, so that all the osyields become Sleep(0) again,
and see if that makes the test execution fast again.
hg id
9742f722b558 tip

Attachments:

  1. a.diff (2344 bytes)
  2. clean_cpu_1_2.txt (6253 bytes)
  3. clean_short.txt (3051 bytes)
  4. patched_cpu_1_2.txt (6254 bytes)
  5. patched_short.txt (3052 bytes)

@minux
Copy link
Member

minux commented Mar 6, 2013

Comment 15:

i think for ntdll.dll, we don't need to reserve stack space for 4
arguments.
the linker will verify that any non-splitstack call will not overflow
the stack so as long as we add correct stack frame size to TEXT,
we should be fine.

@rsc
Copy link
Contributor Author

rsc commented Mar 7, 2013

Comment 16:

I am not worried about system call dispatch being different on different
machines. The set of Windows kernel system calls has been very stable over
a period of decades, and there's no reason to believe it will change any
time soon. And if it does change, it will be with a new Windows version,
which might have other things to update anyway. We sometimes go through
similar updates when new OS X versions come out.
Here is the definition of the NtWaitForSingleObject function, disassembled
using gdb. Note that I don't trust the symbols, but I do trust the memory:
0x774b5464 <ntdll!ZeWaitForSingleObject>:    mov   $0x160, %eax
0x774b5469 <ntdll!ZeWaitForSingleObject+5>:  mov   $0x7ffe0300,%edx
0x774b546e <ntdll!ZeWaitForSingleObject+10>: call  *(%edx)
0x774b5470 <ntdll!ZeWaitForSingleObject+12>: ret   $0xc
0x7ffe0300 0x774b5ca0
0x774b5ca0 <ntdll!KiUserExceptionDispatcher>:    mov  %esp,%edx
0x774b5ca2 <ntdll!KiUserExceptionDispatcher+2>:  sysenter
The entire function uses only one extra word of space, for the return PC
pushed by the call instruction. I am willing to assume that will not change
any time soon. You are correct that the linker does not expect even that
one word, and sometimes one word can make a difference. So I propose that
our wrapper first call a NOSPLIT no-op function that uses an extra word of
space. If the linker doesn't reject the program, then the no-op call is
safe, and by implication so is the call to the direct system call wrapper.
It is important to me that we get cgo fully running again soon, because I
am making other changes to cgo and don't want to be debugging those
breakages with this breakage mixed in.
For 386, I propose https://golang.org/cl/7563043. It takes the
approach I have described. I can run it with -test.cpu=1,2,4 on the win32
builder without any notable slowdown. I will not mail the CL for review
until I have added and tested amd64 too, but I'll gladly accept comments
any time.
Russ

@rsc
Copy link
Contributor Author

rsc commented Mar 7, 2013

Comment 17:

This issue was closed by revision 8aafb44.

Status changed to Fixed.

@rsc rsc added fixed labels Mar 7, 2013
@rsc rsc added this to the Go1.1 milestone Apr 14, 2015
@rsc rsc removed the go1.1 label Apr 14, 2015
@golang golang locked and limited conversation to collaborators Jun 24, 2016
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants