-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
libgo: SEGV in runtime test TestChan on ppc64le #36697
Comments
@ianlancetaylor This continues to happen. I would appreciate some direction on how to fix this. We just found out this error has been hiding at least one other error in runtime that shows up intermittently. The SEGV happens consistently now when testing the Chan tests in runtime. I build the a.out file and run it with -test.run=Chan. Here is the relevant stack:
I did some debugging with gdb and found that the SEGV happens because a bad value is being passed for the length spsize in r4 from doscanstack1 when calling scanstackblock. The call is done from this code:
The bad r4 looks like an address, not a length, and causes the loop to iterate too many times resulting finally in a pointer that is invalid.
I've tried to get more detail using gdb but unfortunately setting breakpoints around some of the calls in the while loop in doscanstack1 changes the behavior.. The SEGV always happens at the same location with a bad r4 but the number of times doscanstack1 and scanstackblock have been called before that can vary. |
My reading of what you are saying is that in the loop in |
That said, it's hard to see why this would happen so specifically for the runtime Chan tests and not in other cases. I can't think of why a stack overrun would lead to this kind of error. What are the bad values for One simple thing to try would be increase the value of |
It is not only happening in the Chan tests but when running all the runtime tests, it stops running after it panics. So the Chan tests are the first and now it consistently fails if you do -test.run=Chan whereas early on the failure was intermittent. gdb output:
I can set breaks on doscanstack1 and runtime.scanstackblock and those breaks are hit 4 times each. On the 4th call to runtime.scanstackblock the value in r4 is shown above and matches what it is r1+56 which is where it should have been loaded from before the call and I believe is spsize. I'm continuing to try to figure out where the bad value is coming from, and according to the source code it could have been set on multiple paths before the call to scanstackblock. It is curious to me that there are two copies of runtime.scanstackblock according to gdb? One in the binary and one in libgo.so. My a.out file was built using 'make runtime/check'. Not sure if that could be part of the problem. I've been trying to set breaks in doscanstack1 to where the bad spsize happens but setting breaks in this code affects the behavior and doesn't seem to stop where I expect it to. I'll keep trying. |
I added some prints to the code that display information when the spsize that is returned is a very large number, and also the value of n at the beginning of scanstackblock. In both cases it is __splitstack_find_context that returns the large value.
./a.out -test.run=Concurrent
|
It's true that in the runtime tests there will be two copies of Can you get the contents of |
Here is the output from some prints at the start of doscanstack1:
|
The problem happens because the first stack_segment structure assigned from __morestack_current_segment gets corrupted. I used gdb to stop at __splitstack_find and displayed __morestack_current_segment and its contents. Then set a watch on the field containing the size.
So the stack is overflowing into the space used to hold the stack_segment structure for the first __morestack_current_segment. |
Thanks. This is Go code that is calling into C code. The C code is (presumably) not compiled with -fsplit-stack. The intent for this kind of code is that the linker will detect the call from split-stack code to non-split-stack code and convert the split-stack function prologue sequence to call |
I see the specs file has this rule for the link command:
I don't see any morestack function? The go-callers.o file has the same objdump. I can see in the build log that -fsplit-stack was used. Not sure if this is the way it works with a shared libgo. Another thing that seems odd is that -fsplit-stack is on the link command when building libgo.so which should force -fuse-ld=gold but if I run the link command from the log by hand with the -v option, it looks like -fsplit-stack was removed before invoking the gold linker. |
This is the branch to where
Since calls to |
OK, I just looked at the code in gold and I'm not sure I fully understand it. Unlike the x86_64 version which I wrote, the PowerPC version never changes the function to call |
I don't understand the x86_64 code fully. If do_calls_non_split finds a "cmpq %fs:112,%rsp", it changes that comparison to always fail. Then there are comparisons against "lea offset(%rsp),%r10" or %r11, and in that case gold makes the offset more negative by split_stack_adjust_size. In all cases the call is changed from __morestack to __morestack_non_split. I think that would mean the lea case would allow a call to a non-split-stack function with stack space of split_stack_adjust_size (plus the tcb guard amount of 256 bytes), if the available stack is more than split_stack_adjust_size, ie. there would not necessarily be a call to __morestack_non_split. However, the cmp case, which gcc emits for functions that require fewer than 256 bytes for their own stack frame, would always call __morestack_non_split. __morestack_non_split ensures 0x100000 bytes of stack! So we have a rather weird effect that small stack frame functions calling non-split-stack functions get much more space, at least with the default --split-stack-adjust-size. In contrast, the ppc64 code always adds split_stack_adjust_size to the current function's requested stack frame size, compares that against the tcb limit, and calls __morestack if not enough stack is available. ppc64 doesn't need a special __morestack_non_split that repeats the stack limit comparison because unlike x86 we don't nop out the comparison. (With a "cmp %fs:112,%rsp" style prologue x86 loses that compare so possible recursion might lead to excessive stacks without the __morestack_non_split comparison.) |
I tried this and didn't help, assuming it just needs to be added to the link of the binary:
|
I added an option to get verbose output on the build of the test:
Not sure if this is a problem but I don't see -fsplit-stack being passed to gold? The options -fsplit-stack appears in COLLECT_GCC_OPTIONS and must be there to cause -fuse-ld=gold to be added. |
@amodra You're right, the approaches used on x86 _64and ppc64 are consistent. The always-fail case is used on x86 because there isn't room to add in adjust_split_stack_size. Originally I thought that the default value for @laboger It's normal that Can you try increasing the value even more? You are using |
I used --split-stack-adjust-size=0x80000 and that worked. Now I just get the MemmoveAtomicity errors reported #41428. I'm running the full go testsuite now with this split-stack-adjust-size value. |
Where can I set --split-stack-adjust-size=0x80000? At compile time?? |
-split-stack-adjust-size is a gold linker option. At link time. |
@amodra submitted a change to gold to increase the default split stack size. I'm trying to verify it now and it seems to fix TestChan but I'm seeing other errors in the runtime test. Mostly something like this:
I'm not sure if -d=checkptr works in gccgo and if it did how I would set it. |
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
This started happening in Go 1.13 and was reported in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92564. Continues to happen in Go 1.14beta1
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
Run the libgo tests
What did you expect to see?
No failures
What did you see instead?
=== RUN TestChan
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x2 addr=0x7214a61c0000 pc=0x1008eb24]
More details, stacks, and gdb information can be found in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92564
The text was updated successfully, but these errors were encountered: