runtime: A non-deterministic behavior of a piece of deterministic code #9906
Comments
Have you run your program under the race detector? |
@ianlancetaylor, thank you for your response. I have recompiled the program with |
How long does it take to reproduce the bug?
Are you using ECC memory?
|
@minux, with the race detector enabled, I have seen it only once so far (EDIT: twice). Without the race detector, it really varies. Sometimes it takes a couple of seconds, sometimes minutes, and sometimes forever, in which case I just restart the program. I cannot say that there is a pattern. About ECC memory, I am not sure. It is a server that I happen to have access to. Is there a way I can check this? |
@cespare, thank you! So, I have the exact situation described on Stack Exchange:
So, ECC should be active. |
I've tried the program on two computers for a long time, but couldn't
reproduce the problem...
Have you tried to compare the incorrect result with the expected one?
What's the difference? Is the difference stable? Does it depend on the fill
value?
|
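A minimal sketch of how such a comparison could be made, assuming the matrices are stored column-wise in flat `[]float64` slices. The `segment` type and the `diffSegments` helper are hypothetical names introduced here for illustration; they are not taken from the original program. The sketch reports, for each row, the contiguous runs of columns where the computed result differs from the expected one, which makes it easy to see whether the damage is scattered or clustered.

```go
package main

import "fmt"

// segment is a hypothetical helper type: a run of consecutive
// mismatched columns within a single row.
type segment struct {
	row, firstCol, lastCol int
}

// diffSegments compares two m-by-n matrices stored column-major
// (element (i, j) lives at index j*m+i) and returns every contiguous
// run of mismatched elements, row by row. The comparison is exact,
// which is reasonable only because the computation is deterministic.
func diffSegments(got, want []float64, m, n int) []segment {
	var segs []segment
	for i := 0; i < m; i++ {
		start := -1 // first column of the current mismatched run, or -1
		for j := 0; j < n; j++ {
			bad := got[j*m+i] != want[j*m+i]
			switch {
			case bad && start < 0:
				start = j
			case !bad && start >= 0:
				segs = append(segs, segment{i, start, j - 1})
				start = -1
			}
		}
		if start >= 0 {
			segs = append(segs, segment{i, start, n - 1})
		}
	}
	return segs
}

func main() {
	m, n := 3, 4
	want := make([]float64, m*n)
	for k := range want {
		want[k] = float64(k + 1)
	}
	got := append([]float64(nil), want...)
	// Simulate a fault: row 1, columns 1 and 2 were never written.
	got[1*m+1] = 0
	got[2*m+1] = 0
	for _, s := range diffSegments(got, want, m, n) {
		fmt.Printf("row %d: columns %d..%d differ\n", s.row, s.firstCol, s.lastCol)
	}
}
```

Logging the fill value next to each reported segment would also show whether the skipped elements simply keep their initial contents.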
There is an interesting pattern in the way the elements in the output matrix get jumped over. If it happens, it is always a continuous segment of a row (somewhere in the middle) or a couple of such segments on different rows. It is never scattered. So, taking into consideration the fact that data are stored columnwise, one can speculate that the middle loop over |
@minux, how many cores do you have? Restarting usually helps; I do not wait for too long. And yes, I tried to compare with the expected matrix element by element, and it is always the |
@IvanUkhov can you reproduce this on a different computer? I also cannot reproduce on my machine after several runs for several minutes. |
Here is the output of |
@cespare, apart from that machine, I have my old MacBook with four cores. I will try to run there. |
Does not fail for me on tip on Intel E5-2690. |
I tested the program (-w=16) on a 16-processor machine for 13 hours,
without a single failure.
|
Just in case, does setting GODEBUG=scavenge=1 env var increase failure rate? |
@dvyukov, I wouldn’t say that GODEBUG affects the failure rate. |
I have tried running the program on two other machines with many cores, and I have not encountered the bug so far. It seems the problem is specific to my first machine, and the bug keeps showing up there again and again. |
Have you tried writing the same code in C with pthreads and running it on that
machine?
Perhaps it's not a Go problem.
|
@minux, yeah, that’s definitely something that would be interesting to have a look at. It might take time though: I am not sure when I last wrote anything in C, let alone with pthreads. |
@minux, I’ll keep experimenting, but no issues so far with the C version. |
I’ve been playing with the C version for quite some time now, and it’s been working just fine. So, either the problem is specific to Go’s runtime running in this particular environment, or the bug hasn’t had a chance to manifest itself yet, presumably because the C version is more lightweight or differs in some implementation details. |
I’ve just compiled Go from the tip. The problem remains. |
This problem might be related to #9875. Both issues have been discovered on the same machine, and I haven’t managed to reproduce them anywhere else yet. What breaks my program here might be breaking Go’s runtime there. |
What if you move the allocation of C up out of the for { } loop, so that there is only one C reused on each iteration? Does it still die? If so, the next step is to run with 'export GOGC=off' in your environment and see if it still dies. I believe that if C is moved out, there will not be any allocations in the for { } loop, so you should be able to run for a while even with garbage collection off. |
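A rough sketch of the suggested change, under assumptions: the `multiply` function, the 2-by-2 matrices, and the bounded iteration count are all hypothetical stand-ins for the original program, which is not reproduced in this thread. The point is only that the output slice `C` is allocated once before the loop and overwritten in place, so the loop body itself performs no allocations.

```go
package main

import "fmt"

// multiply overwrites C with A*B, where A is m-by-p, B is p-by-n,
// and all matrices are stored column-major in flat slices.
// It does not allocate.
func multiply(A, B, C []float64, m, p, n int) {
	for j := 0; j < n; j++ {
		// Clear column j of C before accumulating into it.
		for i := 0; i < m; i++ {
			C[j*m+i] = 0
		}
		for k := 0; k < p; k++ {
			b := B[j*p+k]
			for i := 0; i < m; i++ {
				C[j*m+i] += A[k*m+i] * b
			}
		}
	}
}

func main() {
	m, p, n := 2, 2, 2
	A := []float64{1, 3, 2, 4} // column-major form of [[1, 2], [3, 4]]
	B := []float64{5, 7, 6, 8} // column-major form of [[5, 6], [7, 8]]

	// C is allocated once, outside the loop, and reused on every iteration.
	C := make([]float64, m*n)

	// Bounded here for illustration; the original loop is infinite.
	for iter := 0; iter < 1000; iter++ {
		multiply(A, B, C, m, p, n)
	}
	fmt.Println(C)
}
```

With the allocation hoisted like this, running the program with `GOGC=off` set in the environment should keep memory usage flat, which is what makes the garbage-collector experiment practical.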
@rsc, thanks for your response. First of all, I wanted to ensure that the problem was still present on master (before applying the suggested changes), so I compiled Go from the tip and ran my program. I tortured it for some time by occasionally restarting, and it reported a calculation error just like in February. With C outside of the infinite loop and regardless of GOGC, the program still reports errors, although it subjectively takes longer. To make sure that I was disabling the GC properly, I ran the program with C inside the infinite loop and GOGC=off, and, as expected, it consumed all 23G of RAM quite quickly. So it is probably something other than the GC, and it might not be Go’s fault at all. |
Because this happens with GOGC=off, I'm pretty much certain your hardware is buggy here. |
Hello,
I have a program that spins off a number of goroutines, and each goroutine multiplies two hard-coded matrices in an infinite loop. The result of each multiplication is checked against a tabulated answer.
I am running the program on a machine with 16 cores, and sometimes the check fails: the result of the matrix multiplication deviates from the expected answer. Namely, I observe that some elements of the output matrix do not get computed. It feels like the execution flow of the program skips some inner loops of the matrix-multiplication algorithm.
I am puzzled by what I see, and I am wondering if it is a Go-related problem or a problem with my system. One can try to reproduce the problem by finding a machine similar to mine and letting it run the program for some time, occasionally restarting it.
I would appreciate any feedback. Thank you!
Regards,
Ivan
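For readers without access to the original program, the setup described in this report can be sketched roughly as follows. Everything here is an assumption made for illustration: the names `multiply`, `equal`, and `run`, the 3-by-3 size, the choice of the identity matrix as the second operand (so that the expected answer is simply the first operand), and the bounded number of rounds. The original program uses its own hard-coded matrices, a tabulated answer, and an infinite loop.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const size = 3 // hypothetical matrix dimension

// multiply returns a*b for size-by-size column-major matrices,
// allocating a fresh result on every call.
func multiply(a, b []float64) []float64 {
	c := make([]float64, size*size)
	for j := 0; j < size; j++ {
		for k := 0; k < size; k++ {
			for i := 0; i < size; i++ {
				c[j*size+i] += a[k*size+i] * b[j*size+k]
			}
		}
	}
	return c
}

// equal reports whether x and y match element by element.
func equal(x, y []float64) bool {
	if len(x) != len(y) {
		return false
	}
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// run starts the given number of goroutines; each performs the given
// number of multiplications and checks every result against the
// expected answer. It returns the number of failed checks.
func run(workers, rounds int) int64 {
	a := []float64{1, 2, 3, 4, 5, 6, 7, 8, 9}
	identity := []float64{1, 0, 0, 0, 1, 0, 0, 0, 1}
	var failures int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := 0; r < rounds; r++ {
				// a times the identity must equal a; the inputs are
				// only read, so sharing them between goroutines is safe.
				if !equal(multiply(a, identity), a) {
					atomic.AddInt64(&failures, 1)
				}
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&failures)
}

func main() {
	fmt.Println("failed checks:", run(16, 100000))
}
```

On correctly functioning hardware this count is expected to stay at zero, since no goroutine writes to shared data other than the atomic counter.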