windows: cgo is broken #4955

rsc · 2013-03-01T14:54:42Z

tests crashing during cgo. disabling for now so builder can test the rest of the tree.

alexbrainman · 2013-03-04T05:19:07Z

Comment 1:

The problem is that runtime.usleep (called from "textflag 7" lockextra) is not marked as
"textflag 7". So, it tries to grow stack and it needs m for that, but m is 0 in
TestCthread.
I can mark runtime.usleep to be "testflag 7", but that is not enough, because
runtime.usleep calls OS and therefore switches stacks. Should runtime.usleep be both
"testflag 7" and not switch stacks?
I tried to do that, and tests PASS. But I am not certain. Also runtime.usleep needs to
be rewritten in asm, because otherwise the stack of "textflag 7" functions becomes too
long. I have done it for amd64, I will do it for 386 as well, if my assumptions are
valid.
Alex

Labels changed: added os-windows.

alexbrainman · 2013-03-04T05:19:42Z

Comment 2:

Issue #4872 has been merged into this issue.

dvyukov · 2013-03-04T06:01:43Z

Comment 3:

usleep/osyield must switch stack, because they can be called from a normal goroutine
(e.g. runtime.lock), and then stack switch is required.
For other OSes textflag7 magically resolve all the issues, but for Windows it seems to
be more complicated.
We can introduce another pair of functions (or perhaps just one) that can be called w/o
m. And use them in lockextra(). The simples impl for Windows will be an empty func. It
will affect performance during contention, but should not lead to crashes/deadlocks.

alexbrainman · 2013-03-04T06:22:25Z

Comment 4:

> usleep/osyield must switch stack, because they can be called from a normal goroutine
(e.g. runtime.lock), and then stack switch is required.
Fair enough.
> For other OSes textflag7 magically resolve all the issues, but for Windows it seems to
be more complicated.
I do not know about other OSes, but windows syscalls require large stack, because they
start with a call to a system dll, which is just a C code that runs in user space and
need stack just like any other program. I am not sure how "textflag7 magically resolve
all the issues" on other OSes.
> We can introduce another pair of functions (or perhaps just one) that can be called
w/o m. And use them in lockextra(). The simples impl for Windows will be an empty func.
It will affect performance during contention, but should not lead to crashes/deadlocks.
Sounds like a plan. I can make these real syscalls. Do you have a good name for them? Is
lockextra the only place these should be called from? Should we wait for Russ?
Alex

dvyukov · 2013-03-04T06:29:12Z

Comment 5:

In other OSes usleep/osyield just invoke syscalls from assembly, just a handful of
instructions.
Yes, for now it's the only place where such functions are required. It's always a good
idea to wait for Russ :)
Name? I dont' know. The first thing that comes to mind is - osyieldnom().

rsc · 2013-03-04T15:30:03Z

Comment 6:

Yes, please make usleep and osyield be textflag 7 functions with no stack
switches.
Russ

rsc · 2013-03-04T21:10:15Z

Comment 7:

I believe that it's okay to call ntdll system call functions from the
goroutine stack. They do not use additional stack space (perhaps one more
word).
For example, on 386, if we add this to thread_windows.c:
#pragma dynimport runtime�NtYield NtYieldExecution "ntdll.dll"
#pragma dynimport runtime�NtWaitForSingleObject NtWaitForSingleObject
"ntdll.dll"
extern void *runtime�NtWaitForSingleObject;
extern void *runtime�NtYield;
then I believe the following work as implementations of usleep and osyield:
TEXT runtime�osyield(SB),7,$0
MOVL runtime�NtYield(SB), AX
MOVL SP, BP
CALL AX
MOVL BP, SP
RET
TEXT runtime�usleep(SB),7,$20
MOVL runtime�NtWaitForSingleObject(SB), AX
MOVL usec+0(FP), BX
MOVL $0, hi-4(SP)
MOVL BX, lo-8(SP)
MOVL $lo-8(SP), BX
MOVL BX, ptime-12(SP)
MOVL $0, alertable-16(SP)
MOVL $0, handle-20(SP)
MOVL SP, BP
CALL AX
MOVL BP, SP
RET
On amd64 it should be similar (and probably easier since no stdcall
nonsense).
Can someone who observed the flakiness on 32-bit try this? I had to comment
out the TestCthread test in misc/cgo/test/cgo_test.go, but I that failure
is a separate problem.
Russ

alexbrainman · 2013-03-04T22:29:02Z

Comment 8:

Russ,
> I believe that it's okay to call ntdll system call functions from the
> goroutine stack. They do not use additional stack space (perhaps one more
> word).
How did you come up to with this conclusion? Can you provide some references?
Alex

minux · 2013-03-04T23:02:41Z

Comment 9:

because NtWaitForSingleObject and NtYieldExecution are both in nt native syscall list,
so their functions in ntdll.dll are really thin shim that forward parameter to the real
syscall.
x86 windows syscall might require syscall number be passed on stack, thus the
requirement of one more word on stack. (however, iirc, the syscall number is passed
in ax?)
one caveat though: interface of ntdll.dll is not publicly documented and is subject to
arbitrary change (though it's fairly unlikely to change).

alexbrainman · 2013-03-05T05:55:53Z

Comment 10:

Applied this:
diff --git a/src/pkg/runtime/sys_windows_386.s b/src/pkg/runtime/sys_windows_386.s
--- a/src/pkg/runtime/sys_windows_386.s
+++ b/src/pkg/runtime/sys_windows_386.s
@@ -314,3 +314,30 @@
    MOVL    AX, 0(FS)
 
    RET
+
+// void runtime·osyield(void)
+// Called from the goroutine stack.
+// Assume it does not use additional stack space.
+TEXT runtime·osyield(SB),7,$0
+   MOVL    runtime·NtYieldExecution(SB), AX
+   MOVL    SP, BP
+   CALL    AX
+   MOVL    BP, SP
+   RET
+
+// void runtime·usleep(uint32 us)
+// Called from the goroutine stack.
+// Assume it does not use additional stack space.
+TEXT runtime·usleep(SB),7,$20
+   MOVL    runtime·NtWaitForSingleObject(SB), AX
+   MOVL    usec+0(FP), BX
+   MOVL    $0, hi-4(SP)
+   MOVL    BX, lo-8(SP)
+   LEAL    lo-8(SP), BX
+   MOVL    BX, ptime-12(SP)
+   MOVL    $0, alertable-16(SP)
+   MOVL    $0, handle-20(SP)
+   MOVL    SP, BP
+   CALL    AX
+   MOVL    BP, SP
+   RET
diff --git a/src/pkg/runtime/thread_windows.c b/src/pkg/runtime/thread_windows.c
--- a/src/pkg/runtime/thread_windows.c
+++ b/src/pkg/runtime/thread_windows.c
@@ -31,6 +31,8 @@
 #pragma dynimport runtime·timeBeginPeriod timeBeginPeriod "winmm.dll"
 #pragma dynimport runtime·WaitForSingleObject WaitForSingleObject "kernel32.dll"
 #pragma dynimport runtime·WriteFile WriteFile "kernel32.dll"
+#pragma dynimport runtime·NtYieldExecution NtYieldExecution "ntdll.dll"
+#pragma dynimport runtime·NtWaitForSingleObject NtWaitForSingleObject "ntdll.dll"
 
 extern void *runtime·CloseHandle;
 extern void *runtime·CreateEvent;
@@ -56,6 +58,8 @@
 extern void *runtime·timeBeginPeriod;
 extern void *runtime·WaitForSingleObject;
 extern void *runtime·WriteFile;
+extern void *runtime·NtYieldExecution;
+extern void *runtime·NtWaitForSingleObject;
 
 static int32
 getproccount(void)
@@ -135,22 +139,6 @@
    return written;
 }
 
-#pragma textflag 7
-void
-runtime·osyield(void)
-{
-   runtime·stdcall(runtime·Sleep, 1, (uintptr)0);
-}
-
-void
-runtime·usleep(uint32 us)
-{
-   us /= 1000;
-   if(us == 0)
-       us = 1;
-   runtime·stdcall(runtime·Sleep, 1, (uintptr)us);
-}
-
 #define INFINITE ((uintptr)0xFFFFFFFF)
 
 int32
diff --git a/src/run.bat b/src/run.bat
--- a/src/run.bat
+++ b/src/run.bat
@@ -72,8 +72,6 @@
 echo.
 
 :: cgo tests
-:: issue #4955 - cgo is broken
-goto nocgo
 if x%CGO_ENABLED% == x0 goto nocgo
 echo # ..\misc\cgo\life
 go run %GOROOT%\test\run.go - ..\misc\cgo\life
Now
go test runtime -short -v -timeout=10m -cpu=1,2,4
command takes 3 times as long (from under 1m to aroung 200s) on my pc. I don't think it
is acceptable.
Alex

rsc · 2013-03-05T19:31:59Z

Comment 11:

It is okay to call the NtXxx functions because they are very thin wrappers
around the actual kernel system calls. There is a list of the system calls
at http://j00ru.vexillium.org/ntapi/. The exact numbers vary from version
to version, which is why we need to call via ntdll.dll instead of making
the calls ourselves. This will be more efficient as long as we can get the
effects we need, and it will use a small enough amount of stack not to need
a stack switch.
Which tests get slower during the 'go test runtime -short -v'? The most
likely reason they'd get slower is that osyield is not actually yielding
the CPU, so that locks take much longer when contended. The old
implementation did a Sleep(0), and the new implementation is using the
NtYieldExecution system call, but perhaps NtYieldExecution is not strong
enough. A simple test to do would be to put
#define runtime�osyield() runtime�usleep(0)
at the bottom of runtime.h, so that all the osyields become Sleep(0) again,
and see if that makes the test execution fast again.
Russ

rsc · 2013-03-05T19:33:30Z

Comment 12:

I should add that I verified that the NtXxx functions were thin wrappers
around the system calls, in contrast to what we used to call, by
disassembling both using gdb on a stopped Go program.
Russ

dvyukov · 2013-03-05T20:43:38Z

Comment 13:

AFAIR,
YieldExecution yields to threads in the run queue of the current CPU only.
Sleep(0) yields to threads of the same or higher priority only.
Sleep(1) yields to all threads.

alexbrainman · 2013-03-05T23:53:36Z

Comment 14:

> ... it will use a small enough amount of stack not to need
> a stack switch.
I am still concerned about our stack usage. Even if new usleep needs just 20-30 bytes of
stack, who will check that the room is there? I saw usleep sometimes called from a chain
of "textflag 7" functions, so our stack could be on the last leg. And, if you think that
20-30 is on a low side, then think again, because (on amd64) we must reserve room for
first 4 parameters + stack must be aligned on 16 (I do not know if these are required
for NtXxx functions).
Also, what happens, if it works on my pc, and then breaks somewhere else, where
NtYieldExecution stack is slightly longer. How are we going to deal with these?
The problem we have is with TestCthread test, that test quite uncommon (for the moment)
functionality. I would say, lets fix that and move on for now.
> ... Which tests get slower during the 'go test runtime -short -v'?
I applied a.diff to 9742f722b558. No more NtYieldExecution, just NtWaitForSingleObject.
I couldn't use your #define trick, so I did it differently. New version is slower. I ran
"go test -short -v" and "go test -cpu=1,2 -v" (see files attached). TestCrashHandler
looks suspect, but I am not sure why.
> The most
likely reason they'd get slower is that osyield is not actually yielding
the CPU, so that locks take much longer when contended. The old
implementation did a Sleep(0), and the new implementation is using the
NtYieldExecution system call, but perhaps NtYieldExecution is not strong
enough. A simple test to do would be to put
#define runtime?osyield() runtime?usleep(0)
at the bottom of runtime.h, so that all the osyields become Sleep(0) again,
and see if that makes the test execution fast again.
hg id
9742f722b558 tip

Attachments:

a.diff (2344 bytes)
clean_cpu_1_2.txt (6253 bytes)
clean_short.txt (3051 bytes)
patched_cpu_1_2.txt (6254 bytes)
patched_short.txt (3052 bytes)

minux · 2013-03-06T09:59:51Z

Comment 15:

i think for ntdll.dll, we don't need to reserve stack space for 4
arguments.
the linker will verify that any non-splitstack call will not overflow
the stack so as long as we add correct stack frame size to TEXT,
we should be fine.

rsc · 2013-03-07T04:13:48Z

Comment 16:

I am not worried about system call dispatch being different on different
machines. The set of Windows kernel system calls has been very stable over
a period of decades, and there's no reason to believe it will change any
time soon. And if it does change, it will be with a new Windows version,
which might have other things to update anyway. We sometimes go through
similar updates when new OS X versions come out.
Here is the definition of the NtWaitForSingleObject function, disassembled
using gdb. Note that I don't trust the symbols, but I do trust the memory:
0x774b5464 <ntdll!ZeWaitForSingleObject>:    mov   $0x160, %eax
0x774b5469 <ntdll!ZeWaitForSingleObject+5>:  mov   $0x7ffe0300,%edx
0x774b546e <ntdll!ZeWaitForSingleObject+10>: call  *(%edx)
0x774b5470 <ntdll!ZeWaitForSingleObject+12>: ret   $0xc
0x7ffe0300 0x774b5ca0
0x774b5ca0 <ntdll!KiUserExceptionDispatcher>:    mov  %esp,%edx
0x774b5ca2 <ntdll!KiUserExceptionDispatcher+2>:  sysenter
The entire function uses only one extra word of space, for the return PC
pushed by the call instruction. I am willing to assume that will not change
any time soon. You are correct that the linker does not expect even that
one word, and sometimes one word can make a difference. So I propose that
our wrapper first call a NOSPLIT no-op function that uses an extra word of
space. If the linker doesn't reject the program, then the no-op call is
safe, and by implication so is the call to the direct system call wrapper.
It is important to me that we get cgo fully running again soon, because I
am making other changes to cgo and don't want to be debugging those
breakages with this breakage mixed in.
For 386, I propose https://golang.org/cl/7563043. It takes the
approach I have described. I can run it with -test.cpu=1,2,4 on the win32
builder without any notable slowdown. I will not mail the CL for review
until I have added and tested amd64 too, but I'll gladly accept comments
any time.
Russ

rsc · 2013-03-07T14:18:53Z

Comment 17:

This issue was closed by revision 8aafb44.

Status changed to Fixed.

rsc added fixed labels Mar 7, 2013

rsc added this to the Go1.1 milestone Apr 14, 2015

rsc removed the go1.1 label Apr 14, 2015

golang locked and limited conversation to collaborators Jun 24, 2016

gopherbot added the FrozenDueToAge label Jun 24, 2016

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

windows: cgo is broken #4955

windows: cgo is broken #4955

rsc commented Mar 1, 2013

alexbrainman commented Mar 4, 2013

alexbrainman commented Mar 4, 2013

dvyukov commented Mar 4, 2013

alexbrainman commented Mar 4, 2013

dvyukov commented Mar 4, 2013

rsc commented Mar 4, 2013

rsc commented Mar 4, 2013

alexbrainman commented Mar 4, 2013

minux commented Mar 4, 2013

alexbrainman commented Mar 5, 2013

rsc commented Mar 5, 2013

rsc commented Mar 5, 2013

dvyukov commented Mar 5, 2013

alexbrainman commented Mar 5, 2013

minux commented Mar 6, 2013

rsc commented Mar 7, 2013

rsc commented Mar 7, 2013

windows: cgo is broken #4955

windows: cgo is broken #4955

Comments

rsc commented Mar 1, 2013

alexbrainman commented Mar 4, 2013

alexbrainman commented Mar 4, 2013

dvyukov commented Mar 4, 2013

alexbrainman commented Mar 4, 2013

dvyukov commented Mar 4, 2013

rsc commented Mar 4, 2013

rsc commented Mar 4, 2013

alexbrainman commented Mar 4, 2013

minux commented Mar 4, 2013

alexbrainman commented Mar 5, 2013

rsc commented Mar 5, 2013

rsc commented Mar 5, 2013

dvyukov commented Mar 5, 2013

alexbrainman commented Mar 5, 2013

minux commented Mar 6, 2013

rsc commented Mar 7, 2013

rsc commented Mar 7, 2013