
runtime: A non-deterministic behavior of a piece of deterministic code #9906


Closed
IvanUkhov opened this issue Feb 17, 2015 · 27 comments

@IvanUkhov
Contributor

Hello,

I have a program that spins off a number of goroutines, and each goroutine multiplies two hard-coded matrices in an infinite loop. The result of each multiplication is checked against a tabulated answer.

I am running the program on a machine with 16 cores, and sometimes the check fails: the result of the matrix multiplication deviates from the expected answer. Namely, I observe that some elements of the output matrix do not get computed. It feels like the execution flow of the program skips some inner loops of the matrix-multiplication algorithm.

I am puzzled by what I see, and I am wondering whether it is a Go-related problem or a problem with my system. One can try to reproduce it by finding a machine similar to mine and letting the program run for some time, occasionally restarting it.

I would appreciate any feedback. Thank you!

Regards,
Ivan

$ go version
go version go1.4.1 linux/amd64

$ uname -a
Linux fermi 3.10.10-1-ARCH #1 SMP PREEMPT Fri Aug 30 11:30:06 CEST 2013 x86_64 GNU/Linux

$ cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz
stepping    : 5
microcode   : 0xb
cpu MHz     : 1600.000
cache size  : 8192 KB
physical id : 0
siblings    : 8
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm ida dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips    : 4534.90
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
…
@ianlancetaylor
Contributor

Have you run your program under the race detector?

@IvanUkhov
Contributor Author

@ianlancetaylor, thank you for your response. I have recompiled the program with -race. The bug is still there, and the race detector reports nothing. However, it seems to have become harder to reproduce, probably because the race detector makes the program slower.

@minux
Member

minux commented Feb 17, 2015 via email

@IvanUkhov
Contributor Author

@minux, with the race detector enabled, I have seen it only once so far (EDIT: twice). Without the race detector, it really varies. Sometimes it takes a couple of seconds, sometimes minutes, and sometimes forever, in which case I just restart the program. I cannot say that there is a pattern.

About ECC memory, I am not sure. It is a server that I happen to have access to. Is there a way I can check this?

@cespare
Contributor

cespare commented Feb 17, 2015

@IvanUkhov
Contributor Author

@cespare, thank you! So, I have the exact situation described on Stack Exchange:

Total Width: 72 bits
Data Width: 64 bits

So, ECC should be active.
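For reference, those two width lines come from the DMI memory tables; on a similar Linux box one can check with something like the following (dmidecode availability and root access assumed):

```shell
# ECC presence shows up in the DMI memory tables: a Total Width eight
# bits wider than the Data Width (72 vs. 64) means the extra check
# bits are present.
if command -v dmidecode >/dev/null 2>&1; then
    sudo -n dmidecode --type memory 2>/dev/null \
        | grep -E 'Total Width|Data Width' | sort -u
else
    echo 'dmidecode not installed'
fi
```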

@minux
Member

minux commented Feb 17, 2015 via email

@IvanUkhov
Contributor Author

There is an interesting pattern in the way the elements of the output matrix get skipped. When it happens, it is always a contiguous segment of a row (somewhere in the middle), or a couple of such segments on different rows. It is never scattered.

So, given that the data are stored columnwise, one can speculate that the middle loop over j somehow skips several successive values of j, or that the final assignment is not executed for several successive values of j, potentially together with the innermost loop over k.

@IvanUkhov
Contributor Author

@minux, how many cores do you have? Restarting usually helps; I do not wait too long. And yes, I compared against the expected matrix element by element, and the skipped entries always contain the fill value.

@cespare
Contributor

cespare commented Feb 17, 2015

@IvanUkhov can you reproduce this on a different computer?

I also cannot reproduce on my machine after several runs for several minutes.

@IvanUkhov
Contributor Author

Here is the output of go tool 6g -S main.go data.go, in case it helps. I only cut out the input data for brevity.

@IvanUkhov
Contributor Author

@cespare, apart from that machine, I have my old MacBook with four cores. I will try running it there.

@dvyukov
Member

dvyukov commented Feb 18, 2015

Does not fail for me on tip on Intel E5-2690.

@minux
Member

minux commented Feb 18, 2015 via email

@dvyukov
Member

dvyukov commented Feb 18, 2015

Just in case, does setting GODEBUG=scavenge=1 env var increase failure rate?

@IvanUkhov
Contributor Author

@dvyukov, I wouldn’t say that GODEBUG affects the failure rate.

@IvanUkhov
Contributor Author

I have tried running the program on two other machines with many cores, and I have not encountered the bug so far. It seems the problem is specific to my first machine, and the bug keeps showing up there again and again.

@minux
Member

minux commented Feb 18, 2015 via email

@IvanUkhov
Contributor Author

@minux, yeah, that's definitely something that would be interesting to look at. It might take time, though: I am not sure when I last wrote C, let alone used pthreads.

@IvanUkhov
Contributor Author

@minux, OK, maybe this one will do.

@IvanUkhov
Contributor Author

@minux, I’ll keep experimenting, but no issues so far with the C version.

@IvanUkhov
Contributor Author

I’ve been exercising the C version for quite some time now, and it’s been working just fine. So, either the problem is specific to Go’s runtime in this particular environment, or the bug simply hasn’t had a chance to manifest itself yet, presumably due to the lightness or some implementation detail of the C version.

@IvanUkhov
Contributor Author

I’ve just compiled Go from the tip. The problem remains.

@IvanUkhov
Contributor Author

This problem might be related to #9875. Both issues have been discovered on the same machine, and I haven’t managed to reproduce them anywhere else yet. What breaks my program here might be breaking Go’s runtime there.

@mikioh mikioh changed the title A non-deterministic behavior of a piece of deterministic code runtime: A non-deterministic behavior of a piece of deterministic code Feb 21, 2015
@rsc
Contributor

rsc commented Apr 10, 2015

What if you move the allocation of C up out of the for { } loop, so that there is only one C reused on each iteration? Does it still die?

If so, the next step is to run with 'export GOGC=off' in your environment and see if it still dies. I believe that if C is moved out, there will not be any allocations in the for { } loop, so you should be able to run for a while even with garbage collection off.

@IvanUkhov
Contributor Author

@rsc, thanks for your response. First of all, I wanted to ensure that the problem was still present on master (before applying the suggested changes), so I compiled Go from the tip and ran my program. I tortured it for some time by occasionally restarting it, and it reported a calculation error just as in February.

With C outside of the infinite loop and regardless of GOGC, the program still reports errors, although, subjectively, it takes more time. To make sure I was disabling the GC properly, I ran the program with C inside the infinite loop and GOGC=off, and, as expected, it quickly consumed all 23G of RAM.

Probably it is something other than the GC, and it might not be Go’s fault at all.

@rsc
Contributor

rsc commented Apr 10, 2015

Because this happens with GOGC=off, I'm pretty much certain your hardware is buggy here.

@rsc rsc closed this as completed Apr 10, 2015
@golang golang locked and limited conversation to collaborators Jun 25, 2016