Segfaults or wrong code execution on Intel Skylake / Kaby Lake CPUs with hyperthreading enabled #7452

Closed
vicuna opened this issue Jan 6, 2017 · 73 comments

Comments

@vicuna

commented Jan 6, 2017

Original bug ID: 7452
Reporter: enguerrand
Assigned to: @mshinwell
Status: closed (set by @mshinwell on 2017-06-09T17:02:32Z)
Resolution: not a bug
Priority: normal
Severity: crash
Platform: Linux
OS: Debian
Version: 4.03.0
Target version: later
Category: back end (clambda to assembly)
Monitored by: @gasche @ygrek @yallop @alainfrisch

Bug description

While switching a 4.02.3 codebase to 4.03 recently, we stumbled upon some random crashes of the compiler and, more rarely, occurrences of bad assembly code being generated (which then failed to assemble), or instructions trapping at runtime while the compiler was running.

These problems occur on an OCaml source file generated using the Extprot library.

The problem doesn't seem to happen all the time.
Most of the time the file compiles successfully, but if enough retries are given the compiler eventually crashes. Example dmesg output after a few crashes:

[22241.838551] ocamlopt.opt[48175]: segfault at ffffffffffde7768 ip 000055f75e412e3c sp 00007ffc3ee31de0 error 7 in ocamlopt.opt[55f75e0b6000+613000]
[22985.879907] ocamlopt.opt[48221]: segfault at af8 ip 00005564455169bd sp 00007ffc9f36b130 error 4 in ocamlopt.opt[556445006000+613000]
[23936.341126] ocamlopt.opt[48306]: segfault at 5837 ip 00005641554a16c8 sp 00007ffe1278f8e0 error 4 in ocamlopt.opt[56415514a000+613000]
[25395.780978] ocamlopt.opt[48445]: segfault at ffffffffffde7608 ip 0000557e25ea5cf4 sp 00007ffc2eac79d0 error 5 in ocamlopt.opt[557e25b49000+613000]
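
For reference, these kernel lines can be decoded mechanically. The sketch below assumes the x86 Linux segfault format shown above and the standard x86 page-fault error bits (bit 0: protection violation rather than a not-present page, bit 1: write, bit 2: user mode). The name `decode_segfault` and its field names are illustrative and not part of any tool used in this report.

```python
import re

# Matches the kernel's x86 user-space segfault report, as in the dmesg lines above.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def decode_segfault(line):
    """Return a dict describing one segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "program": m.group("prog"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "object": m.group("obj"),
        # Offset of the faulting instruction inside the mapped object.
        "ip_offset": ip - base,
        # x86 page-fault error code bits.
        "protection_violation": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }

example = ("[22985.879907] ocamlopt.opt[48221]: segfault at af8 ip 00005564455169bd "
           "sp 00007ffc9f36b130 error 4 in ocamlopt.opt[556445006000+613000]")
print(decode_segfault(example))
```

For the second line above (error 4), this reports a user-mode read of a not-present page, with the faulting instruction at offset 0x5109bd into the ocamlopt.opt mapping.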

Backtraces obtained for those crashes do not always seem to show the same thing. Example backtraces can be found in the attached archive.

More rarely, the compiler will generate an assembly file that then fails to assemble:

/tmp/camlasmc92578.s: Assembler messages:
/tmp/camlasmc92578.s:1005308: Error: operand type mismatch for `add'

Where the line 1005308 is: add $2300, $5199

Or:

/tmp/camlasm601e1c.s: Assembler messages:
/tmp/camlasm601e1c.s:820172: Error: operand type mismatch for `or'

Where the line 820172 is: orq $139950828249720, %rax

We haven't noticed as of now any misbehaviour in a successfully compiled and running instance of this file, but the issue is still very new for us so we will be watching it closely.

Steps to reproduce

The problem doesn't seem to happen all the time; at least, it doesn't crash at every build. We sometimes don't witness the crash before 30 minutes of retries.

Steps to reproduce:

OCaml 4.03 and 4.04 have both been seen to trigger the problem.
A sample file is attached as the test case used to reproduce the problem. The Extprot library must be installed in order to compile the file, since it was generated using Extprot (we use the latest version from opam).

Test case can be found in the attachment (test.ml)

To reproduce:
Just compile this file, preferably in a loop, with this command:

while ocamlfind opt -c -g -bin-annot -ccopt -g -ccopt -O2 -ccopt -Wextra -ccopt '-Wstrict-overflow=5' -thread -w +a-4-40..42-44-45-48-58 -w -27-32 -package extprot test.ml -o test.cmx; do echo "ok"; done
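
When the failure can take 30 minutes or more to appear, it helps to record how many runs succeeded and how the failing one died. A minimal sketch of the same retry loop, assuming nothing beyond the Python standard library; `run_until_failure` is a hypothetical helper, and `cmd` would be the ocamlfind invocation above split into an argument list.

```python
import subprocess
import sys

def run_until_failure(cmd, max_runs):
    """Run cmd repeatedly; stop at the first non-zero exit status.

    Returns (attempts, returncode). On POSIX, a negative returncode -N means
    the process was killed by signal N (for example -11 for SIGSEGV).
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            return attempt, result.returncode
    return max_runs, 0

# Harmless stand-in command for demonstration; replace with the real compile command.
print(run_until_failure([sys.executable, "-c", "pass"], max_runs=3))
```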

Additional information

  • Even if the crash doesn't occur for some time, once it has occurred at least once, the probability of the compiler crashing again seems to increase

  • Crash was witnessed running ocamlopt and ocamlopt.opt

File attachments

@vicuna

commented Jan 6, 2017

Comment author: joris

Traces are almost useless, but the memory corruption happens just as frequently in bytecode, with similar traces, making the runtime crash in the GC or elsewhere in the runtime:

corrupted stack:

Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055555557d336 in caml_interprete (prog=<optimized out>,
prog_size=<optimized out>) at interp.c:298
298 accu = sp[0]; Next;
(gdb) bt
#0 0x000055555557d336 in caml_interprete (prog=<optimized out>,
prog_size=<optimized out>) at interp.c:298
Backtrace stopped: Cannot access memory at address 0x7fffffffda68

Corrupted heap:

#0 0x000055687aa72534 in mark_slice_darken (slice_pointers=<optimized out>,
in_ephemeron=0, i=11803660, v=140605172023200, gray_vals_ptr=0x55687c7fead8)
at major_gc.c:237
#1 mark_slice (work=853950) at major_gc.c:407
#2 0x000055687aa73384 in caml_major_collection_slice (howmuch=howmuch@entry=-1)
at major_gc.c:767
#3 0x000055687aa743ff in caml_gc_dispatch () at minor_gc.c:458
#4 0x000055687aa8b807 in caml_interprete (prog=<optimized out>,
prog_size=<optimized out>) at interp.c:673
#5 0x000055687aa8cdf9 in caml_main (argv=0x7ffe02927368) at startup.c:374
#6 0x000055687aa71b1c in main (argc=<optimized out>, argv=<optimized out>)
at main.c:35

This looks like a buffer overrun while walking a large block, but the block is strange as far as I can tell:

(gdb) p (unsigned char)*(((uint64_t *)v) - 1)
$40 = 200 '\310'
(gdb) p *(((uint64_t *)v) - 1) >> 10
$45 = 137309738338
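
To make the two gdb expressions above concrete, here is a small sketch of how a 64-bit OCaml block header splits into its fields. It assumes the standard 4.0x layout without profiling info (tag in the low 8 bits, 2 colour bits, size in words above bit 10), which is consistent with the `>> 10` used above; `decode_header` is an illustrative name.

```python
def decode_header(hd):
    """Split a 64-bit OCaml block header into (wosize, color, tag)."""
    tag = hd & 0xFF           # low 8 bits: block tag
    color = (hd >> 8) & 0x3   # next 2 bits: GC colour
    wosize = hd >> 10         # remaining bits: block size in words
    return wosize, color, tag

WORD_BYTES = 8                    # 64-bit words
suspicious_wosize = 137309738338  # size printed by gdb above
size_bytes = suspicious_wosize * WORD_BYTES
print(size_bytes / 2**30, "GiB")
```

A block of that size would span roughly 1023 GiB, far more than the process could have allocated, which supports reading the header as corrupted rather than as a genuinely large block.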

@vicuna

commented Jan 6, 2017

Comment author: @mshinwell

I will see if I can reproduce this.

In the meantime, have you reproduced this failure on more than one machine?

@vicuna

commented Jan 6, 2017

Comment author: enguerrand

Reproduced it on a few different machines (all running 64-bit Debian), both VMs and physical machines.
I still haven't tried anything other than Linux; I will check whether I can reproduce it under OS X.

@vicuna

commented Jan 6, 2017

Comment author: @mshinwell

Some subset (although not all) of the symptoms exhibited in this report kind of look like stack overflow. To rule out such problems, can you try to reproduce this having run "ulimit -s unlimited" first? (I'm also trying to reproduce it but don't know yet whether I will be successful in doing so.)

@vicuna

commented Jan 6, 2017

Comment author: enguerrand

We suspected a stack overflow too, so we tried to reproduce with both a very large stack size and an unlimited one, and crashes still happened. I tried again just now to be sure, and the result is still the same.

@vicuna

commented Jan 6, 2017

Comment author: @mshinwell

Right, that's what I was expecting.

@vicuna

commented Jan 6, 2017

Comment author: @mshinwell

I wonder if it's because the parameter called "i" of the function mark_slice_darken is of type "int". I think it should be of type "mlsize_t" since it's a field index. I wouldn't be surprised if this gargantuan source file produces a block that has sufficiently many fields for "i" to overflow at the moment.

Can you try changing that in your compiler tree (in 4.04 it's in byterun/major_gc.c, line 232) and seeing if the problem goes away? I haven't managed to reproduce it yet.

@vicuna

commented Jan 6, 2017

Comment author: @mshinwell

(I will produce a GPR once you confirm)

@vicuna

commented Jan 6, 2017

Comment author: @mshinwell

Although actually, if that were to be the bug, I think ocamlopt.opt would use more memory than it does when compiling the file. So that might not be it---but I think it's wrong in any case.

@vicuna

commented Jan 6, 2017

Comment author: joris

Indeed. I've launched some tests to try, but since this is scanning an in-memory block and ocamlopt never uses more than 1.5G RES in this case, I doubt it can overflow i.
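
The arithmetic behind that doubt can be spelled out. This is a back-of-the-envelope sketch assuming a 32-bit C `int` and 8-byte words (as on x86-64 Linux); the 1.5G figure is the resident size quoted above.

```python
INT_MAX = 2**31 - 1   # largest value of a 32-bit C int
WORD_BYTES = 8        # size of one field on a 64-bit system

# Field indices run from 0 to wosize - 1, so the index can only exceed
# INT_MAX if the block has at least INT_MAX + 2 fields.
min_fields = INT_MAX + 2
min_bytes = min_fields * WORD_BYTES

observed_rss_bytes = 1.5 * 2**30   # "1.5G RES" reported for ocamlopt

print(min_bytes / 2**30, "GiB needed for a single block to overflow i")
print(min_bytes / observed_rss_bytes, "times the observed resident size")
```

A single block would need to occupy about 16 GiB before `i` could overflow, roughly ten times the memory the compiler was seen to use.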

@vicuna

commented Jan 6, 2017

Comment author: joris

Reproduced with i changed to mlsize_t: crash in ml_mark_slice, but I don't have much more info, as I forgot to build the runtime with -g this time.

@vicuna

commented Jan 6, 2017

Comment author: @mshinwell

Given you seem to be able to reproduce this easily, can you try to get a valgrind trace?

@vicuna

commented Jan 6, 2017

Comment author: @mshinwell

(Please build the runtime with -g before doing that, in case it gives any more info)

@vicuna

commented Jan 6, 2017

Comment author: @mshinwell

In fact, another thing to try: please adjust the compiler Makefile so that it uses the debug runtime (you need a configure flag, and then I think it's "-runtime-variant d" for compile/link). This might pick something up as it enables a lot of assertions.

@vicuna

commented Jan 6, 2017

Comment author: enguerrand

I will give Valgrind a try a bit later.
It's not that easy to reproduce even on our end: we sometimes wait 30 minutes with 5 concurrent loops to see it happen. It's kind of erratic.

@vicuna

commented Jan 6, 2017

Comment author: @mshinwell

OK, well maybe try the debug runtime first then, since that's probably going to be a bit easier to set up.

@vicuna

commented Jan 6, 2017

Comment author: @mshinwell

It's possible this is something related to the change between 4.02 and 4.03 that allowed the major GC to stop scanning in the middle of blocks and defer the remaining fields until later.

I think it would be worth trying to disable this, since you do seem to have some large blocks (one of the backtraces shows a block with > 5 million fields). This should be achievable by changing byterun/major_gc.c, line 403 (in 4.04): this currently reads "end = size < end ? size : end;" and I think you should change it to "end = size".
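
As an illustration of what that one-line change toggles, here is a toy model of scanning a block in bounded chunks versus in one go. It is not the runtime's code: `scan_block`, `budget`, and the way `end` is first computed are assumptions made for the sketch; only the clamping expression mirrors the line quoted above.

```python
def scan_block(fields, start, budget, chunked=True):
    """Visit fields[start:end] and report where to resume.

    Returns (visited, next_start), with next_start None once the block is done.
    """
    size = len(fields)
    end = start + budget                   # hypothetical per-slice limit
    if chunked:
        end = size if size < end else end  # mirrors: end = size < end ? size : end;
    else:
        end = size                         # the suggested experiment: end = size
    visited = fields[start:end]
    next_start = end if end < size else None
    return visited, next_start

block = list(range(10))
print(scan_block(block, 0, 4))                  # stops early, to be resumed later
print(scan_block(block, 0, 4, chunked=False))   # scans the whole block at once
```

With `chunked=False` the deferred-fields path is never taken, which is the behaviour the experiment was meant to isolate.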

@vicuna

commented Jan 6, 2017

Comment author: @alainfrisch

Could this be related to #7228 / #546 ?

@vicuna

commented Jan 6, 2017

Comment author: enguerrand

We tried to reproduce the issue on some servers to cut down the compilation time, and noticed that we couldn't reproduce it there (so far, after running multiple loops for an hour or so).
We also noticed that every machine on which we were able to reproduce the issue was running a CPU of the Intel Skylake family.
Does someone have such a machine available to test this hypothesis?

@vicuna

commented Jan 6, 2017

Comment author: @Armael

I can reproduce the bug on my laptop, which has a Skylake CPU (i7-6600U).

A single instance of the loop ran without crashing for around 20 minutes. However, shortly after adding 3 more instances in parallel, two of them crashed with "ocamlopt.opt got signal and exited". The one which was running from the start also crashed quickly after that, with "Fatal error: exception File "utils/timings.ml", line 86, characters 27-33: Assertion failed".

Last relevant lines in dmesg are:

[81744.710293] ocamlopt.opt[348]: segfault at 7fe040f1b000 ip 00000000006b1b04 sp 00007fff9f242bb0 error 4 in ocamlopt.opt[400000+36b000]
[81762.064912] ocamlopt.opt[359]: segfault at fffffffffffffff9 ip 00000000004ba582 sp 00007ffd32705ea0 error 5 in ocamlopt.opt[400000+36b000]

@vicuna

commented Jan 6, 2017

Comment author: @Armael

I just realized I was running all the parallel instances of the compiler in the same directory, on the same source file. So that's probably the cause of the assertion failure.

@vicuna

commented Jan 6, 2017

Comment author: @Armael

Just to be sure, I tried with 4 instances of the loop in 4 separate directories. 3 segfaulted almost instantly (after around 15 seconds):

[84373.677406] ocamlopt.opt[1293]: segfault at ffffffffffde530b ip 00000000004b2b34 sp 00007ffd998b1280 error 7 in ocamlopt.opt[400000+36b000]
[84404.445402] ocamlopt.opt[1315]: segfault at b18 ip 00000000004c45a7 sp 00007ffcf29d0fc0 error 4 in ocamlopt.opt[400000+36b000]
[84412.083689] ocamlopt.opt[1325]: segfault at 7f6cb22c3000 ip 00000000006b1b04 sp 00007ffed67919e0 error 4 in ocamlopt.opt[400000+36b000]

The last one finally crashed 2 minutes after:
[84471.527980] ocamlopt.opt[1298]: segfault at 469b61 ip 00007f3485db380b sp 00007ffce2ad6018 error 7 in libc-2.24.so[7f3485c8b000+195000]

@vicuna

commented Jan 6, 2017

Comment author: enguerrand

joris mentioned the possibility that it might be related to the runtime being compiled with -O2 instead of -O1 since 4.03, testing 4.03 with -O1 might be interesting too

@vicuna

commented Jan 7, 2017

Comment author: joris

Indeed, I cannot reproduce after 1h with a runtime built with -O1, while I can reproduce in less than 10 minutes with an -O2 runtime (built with gcc 6.2).
This smells like UB that gcc optimizes wrongly and that causes issues on newer Intel CPUs.
I also tried to build the runtime with -fsanitize=address and -fsanitize=undefined. Interestingly, I fail to reproduce with this binary, but it also reports nothing interesting besides the caml_stat_alloc "leak" and some unaligned reads at startup in roots.c (which should be harmless on x86). This was tested on an i7 6700 (non-K) Skylake.

@vicuna

commented Jan 7, 2017

Comment author: enguerrand

I confirm that after compiling the runtime with -O1 I cannot reproduce after a few hours of retries. (gcc 6.2 too)

@vicuna

commented Jan 7, 2017

Comment author: @xavierleroy

> We also noticed that every machine on which we were able to reproduce the issue was running a CPU of the Intel Skylake family.

Last Spring, another OCaml (industrial) user reported mysterious semireproducible crashes of a big ocamlopt-compiled program. The crashes would occur only on Skylake processors, and only in the presence of hyperthreading.

If it is possible for you, it would be interesting to turn hyperthreading off (in the BIOS) and try to reproduce again.

@vicuna

commented Jan 8, 2017

Comment author: joris

Meh. Indeed, I cannot reproduce with HT disabled after one hour of 4 loops running concurrently...

@vicuna

commented Jan 8, 2017

Comment author: joris

For the record, I upgraded the UEFI firmware and the Intel microcode to the latest versions; it makes no difference.

@vicuna

commented Jan 8, 2017

Comment author: @xavierleroy

Thanks for the quick re-test without hyperthreading. This story is consistent with what was observed at the other industrial user in May. Based on their observations and those in this PR, the problem lies in the combination of:

  • Skylake
  • Hyperthreading
  • OCaml 4.03
  • compiled with gcc -O2

Is it crazy to imagine that gcc -O2 on the OCaml 4.03 runtime produces a specific instruction sequence that causes hardware issues in (some steppings of) Skylake processors with hyperthreading? Perhaps it is crazy. On the other hand, there was already one documented hardware issue with hyperthreading and Skylake: http://arstechnica.com/gadgets/2016/01/intel-skylake-bug-causes-pcs-to-freeze-during-complex-workloads/

@vicuna

commented Jan 8, 2017

Comment author: cullmann

To solve our issue in May, we switched from GCC to clang as the base compiler for OCaml and the other parts of our toolchain.

Since that switch, no such random crash has come up (and we have one Skylake machine that runs longer regression tests in an endless loop all year round: nothing to be seen since the switch to clang, daily crashes before).

The question is if it is feasible for you to:

a) try clang, too
b) if that works, try to search for the difference in e.g. the produced assembly

@vicuna

commented Jan 10, 2017

Comment author: joris

I believe I have found something interesting. At some point I did a careful review of the sweep_slice function and noticed this line:

  work -= Whsize_hd (hd); (major_gc.c:545)

This macro returns an unsigned integer because hd is a header_t. If I understand the C standard correctly, work -= size is equivalent to work = work - size, and the subtraction operands will be converted to unsigned long.

I checked the gcc tree SSA dump and it indeed looks like this is what GCC is doing. I tried to replace this line with

  work = work - (intnat) (Whsize_hd (hd));

It does make some difference in the SSA tree (it just adds a cast and performs the subtraction with signed temporaries), but I must admit I fail to understand why it would cause a segfault in this case, since the generated assembly used signed instructions (notq then addq). It does make some difference in the assembly though: the loop condition is reversed
(it uses testq + jle instead of testq + jg), and it uses an additional register. I do not really understand what difference that makes.

That being said, I have been trying to reproduce this bug for 2 hours and it has not crashed at -O2 with this change yet.
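
A small model shows why the two forms should compute the same value on a two's-complement machine. It assumes 64-bit `intnat`/`uintnat` and models the unsigned-to-signed conversion as wrapping modulo 2**64 (which is what GCC does); the function names are illustrative.

```python
MASK = (1 << 64) - 1

def to_signed64(x):
    """Reinterpret the low 64 bits of x as a two's-complement signed value."""
    x &= MASK
    return x - (1 << 64) if x & (1 << 63) else x

def sub_mixed(work, whsize):
    """work -= whsize with work signed and whsize unsigned: work is converted
    to unsigned, the subtraction wraps modulo 2**64, and the result is
    converted back to signed when stored."""
    return to_signed64(((work & MASK) - whsize) & MASK)

def sub_signed(work, whsize):
    """work = work - (intnat) whsize: the operand is made signed first.
    Wrap-around is modelled here too; it does not occur for realistic sizes."""
    return to_signed64(work - to_signed64(whsize))

for work, whsize in [(10, 3), (2, 5), (-7, 4), (0, 2**40)]:
    print(work, whsize, sub_mixed(work, whsize), sub_signed(work, whsize))
```

Both functions return identical results for every pair, so the source-level change alters instruction selection and layout rather than the value of `work`.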

@vicuna

commented Jan 10, 2017

Comment author: @xavierleroy

I admire your ingenuity in searching for GCC miscompilation issues or undefined behaviors in OCaml's sources. Yet, those issues would produce reproducible crashes, which is not the case here. Also, they would not account for the fact that crashes are observed only with Skylake and hyperthreading.

@vicuna

commented Jan 11, 2017

Comment author: @mshinwell

I am pursuing independently the possibility of this being a CPU bug.
I agree with what @xLeroy said, although having a smaller diff of the assembly code for major_gc.c than you had before would be useful; it looks like you may now be able to obtain that.

@vicuna

commented Jan 11, 2017

Comment author: @alainfrisch

> reproducible crashes

Couldn't there be undefined behaviors at the CPU level, which would lead to non-reproducible situations depending e.g. on physical memory addresses?

Also, it is hard to get a fully reproducible behavior at the OCaml level itself. Simply getting the current pid and printing it to a string (e.g. for logging purpose) can lead to different allocation scheme of the program (depending on the length of the printed pid).

@vicuna

commented Jan 11, 2017

Comment author: joris

Honestly, I found this and spent some time looking at the assembly produced, but it makes no sense to me why it would behave differently. Still, please find attached:

  • major_gc_with_intnat_cast_o2.s: built with gcc -O2 but with the previously described (intnat) cast. It has not crashed in 17 hours.

  • major_gc_without_cast_o2.patch: the diff between this and the assembly produced at -O2 without the patch, which crashes.

As far as I can tell, it just aligns the .text of the hot loop, uses an additional temporary register, and reverses the operands of work -= Whsize_hd(hd). Besides that...

@vicuna

commented Jan 11, 2017

Comment author: joris

> reproducible crashes

By the way, we tried to disable ASLR with setarch x86_64 -R ocamlfind opt... but it didn't help.
Also, it is interesting to note that all the crashes happen during marking, apparently when visiting corrupted memory as a block header. It could be that the problem is always there but we don't see it every time, because sometimes it does not crash but gas complains about invalid input, and sometimes gas warns and compilation is successful.
It could be that the other times there are no visible effects. I can try diffing the emitted assembly files.

@vicuna

commented Jan 11, 2017

Comment author: schommer

During our investigation of the crashes I ran ocaml a few times with the undefined behavior sanitizer of clang, as well as clang-check and most warnings/errors reported were because of unaligned memory access.

@vicuna

commented Jan 12, 2017

Comment author: joris

So you might have to disregard basically everything i said. I kept the patched binary running in a loop and after respectively 28h and 32h two processes crashed... So it might just be some side effect affecting how often the bug is triggered.

@vicuna

commented Jan 12, 2017

Comment author: joris

Just to clarify: the cast patch has crashed, -fno-tree-vrp has not.

@vicuna

commented Jan 25, 2017

Comment author: @mshinwell

I'd like to find out whether this problem manifests itself if the execution of the OCaml compiler is pinned to a particular processor core.

You can probably do this on Linux by using "taskset" as a wrapper around ocamlopt.opt or else by altering the runtime (e.g. in asmrun/startup.c) to call sched_setaffinity. Could you try this on a Skylake system to see if it makes the problem go away?
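
For anyone scripting this experiment, pinning can also be done from a wrapper process, since on Linux the affinity mask is inherited by child processes. A minimal sketch using Python's Linux-only `os.sched_setaffinity`; `pin_to_cpu` is an illustrative name, and the choice of CPU number is up to the experimenter.

```python
import os

def pin_to_cpu(cpu):
    """Restrict the calling process, and children it later spawns, to one logical CPU.

    Returns the resulting affinity set so the caller can confirm the pinning.
    """
    os.sched_setaffinity(0, {cpu})   # pid 0 means "the calling process"
    return os.sched_getaffinity(0)

if hasattr(os, "sched_getaffinity"):
    print("currently allowed CPUs:", sorted(os.sched_getaffinity(0)))
```

After calling `pin_to_cpu`, the compiler can be launched from the same process (for example with `subprocess.run`) and will stay on that logical CPU. From a shell, `taskset -c 1 <command>` has the same effect.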

(As far as I understand it the problem has not been reproduced unless hyperthreading is enabled. If that is correct, hyperthreading should also be enabled for this experiment.)

@vicuna

commented Jan 25, 2017

Comment author: @xavierleroy

Lucky me, my new workstation at Inria is a Skylake Xeon, 4 cores, 8 threads, so I could play with the original repro case.

Without setting processor affinities: it is easy to reproduce the crash (in a few minutes at most) by running at least 5 copies of the compilation task in parallel. With 4 copies the crash happens but takes much longer. With 3 copies or less I didn't observe it in a couple of hours.

By setting processor affinities, I see the crash in a few minutes with only two compilations run in parallel, provided they are mapped on the same physical core (e.g. logical cores 1 and 5 on my machine). Two parallel runs mapped to different physical cores (e.g. logical cores 1 and 2 on my machine) have been running for 1 hour already without a glitch, but I'll let them run overnight. Finally, I'm also trying two parallel runs mapped to the same logical processor, for reference. More results tomorrow.
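
Which logical CPUs share a physical core varies between machines (1 and 5 is specific to the workstation described above). On Linux the pairing can be read from sysfs; the sketch below assumes the usual `thread_siblings_list` file and its comma/range list format, and the helper names are illustrative.

```python
def parse_cpu_list(text):
    """Parse a sysfs CPU list such as "1,5" or "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def siblings_of(cpu):
    """Return the logical CPUs sharing a physical core with `cpu` (Linux only)."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as f:
        return parse_cpu_list(f.read())

print(parse_cpu_list("1,5"))
```

Pinning two compilations to two entries of the same sibling list reproduces the "same physical core" setup; picking CPUs from different lists gives the "different physical cores" setup.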

@vicuna

commented Jan 26, 2017

Comment author: @xavierleroy

More results from my overnight runs:

  • Two runs in parallel, mapped to different physical cores: this crashed eventually but only after 6 hours or so. Note that the machine was not completely idle, so the other logical processors of those physical cores were probably used (lightly).
  • Two runs in parallel mapped to the same logical processor: no crash after 12 hours.
@vicuna

commented Mar 10, 2017

Comment author: @mshinwell

As an update, this is still being investigated at Intel.

@vicuna

commented May 26, 2017

Comment author: @ygrek

Any chance this is what the Intel microcode update talks about?

http://metadata.ftp-master.debian.org/changelogs/non-free/i/intel-microcode/intel-microcode_3.20170511.1_changelog

  Likely fix nightmare-level Skylake erratum SKL150.  Fortunately,
  either this erratum is very-low-hitting, or gcc/clang/icc/msvc
  won't usually issue the affected opcode pattern and it ends up
  being rare.
  SKL150 - Short loops using both the AH/BH/CH/DH registers and
  the corresponding wide register *may* result in unpredictable
  system behavior.  Requires both logical processors of the same
  core (i.e. sibling hyperthreads) to be active to trigger, as
  well as a "complex set of micro-architectural conditions"
@vicuna

commented May 27, 2017

Comment author: joris

Everything in this description matches this issue. I will have to wait until Monday to test this though.
Thank you everyone for the time spent on this problem

@vicuna

commented May 29, 2017

Comment author: joris

The microcode update appears to fix the crash (microcode version 0xba). I believe this issue can be closed.
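
To check which microcode revision a Linux machine is actually running, the `microcode` field of /proc/cpuinfo can be read. A minimal sketch, assuming the x86 Linux cpuinfo format; `microcode_revisions` is an illustrative helper, and 0xba is only the revision reported for the machine above, not a threshold valid for every CPU model.

```python
import re

def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions found in /proc/cpuinfo-style text."""
    found = re.findall(
        r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", cpuinfo_text, re.MULTILINE
    )
    return {int(value, 16) for value in found}

def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file (Linux only)."""
    with open(path) as f:
        return f.read()

sample = "processor\t: 0\nmicrocode\t: 0xba\nprocessor\t: 1\nmicrocode\t: 0xba\n"
print([hex(rev) for rev in microcode_revisions(sample)])
```

On a real system, `microcode_revisions(read_cpuinfo())` should normally return a single revision shared by all logical CPUs.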

@vicuna

commented May 29, 2017

Comment author: @mshinwell

Interesting. I will see if I can get Intel to confirm this. Let's leave this issue open for the moment.

@vicuna

commented May 29, 2017

Comment author: @mshinwell

I looked at the code of sweep_slice, which was conjectured to be one of the functions affected (see above, and the attachment opt.s); it does appear that it might trigger the problem. There is a loop with fewer than 64 instructions using both the %ah register and %rax. The use of %ah is probably quite unusual, but GCC is generating it to deal with the GC tag bits inside a header word.
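
A quick way to look for candidate code in other compiler-generated assembly is to search for the high-byte registers. The sketch below is a crude text scan of AT&T-syntax output (as produced by `gcc -S`); it only lists mentions of %ah/%bh/%ch/%dh and does not establish that they sit in a short loop together with the wide register, so hits still need reading by hand. `high_byte_uses` is an illustrative name.

```python
import re

# %ah, %bh, %ch, %dh in AT&T syntax.
HIGH_BYTE_RE = re.compile(r"%[abcd]h\b")

def high_byte_uses(asm_text):
    """Return (line_number, line) pairs for lines mentioning a high-byte register."""
    hits = []
    for number, line in enumerate(asm_text.splitlines(), start=1):
        code = line.split("#", 1)[0]   # ignore trailing '#' comments
        if HIGH_BYTE_RE.search(code):
            hits.append((number, line.strip()))
    return hits

def scan_file(path):
    """Scan one assembly file, e.g. the output of gcc -S for major_gc.c."""
    with open(path) as f:
        return high_byte_uses(f.read())
```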

@vicuna

commented May 29, 2017

Comment author: @mshinwell

By the way the original Intel description is here, on the page numbered 65:

http://www.intel.co.uk/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf

@vicuna

commented May 29, 2017

Comment author: @xavierleroy

After updating the microcode on my Xeon E3 Skylake, the test that used to crash in a few minutes has been running for 6 hours without a hiccup. I'll leave it running for a few days, but it looks like the problem is nailed down and fixed.

Update: the test ran for 50 hours and produced no failures.

@vicuna

commented May 31, 2017

Comment author: @mshinwell

This problem also affects Kaby Lake systems (erratum KBL095); however, I'm unsure if a microcode fix has been released publicly for such systems. The best solution is to disable hyperthreading for the moment.

Yesterday Fred and I experimented on a Kaby Lake machine by changing the generated assembly from GCC for major_gc.c so that it didn't reference registers such as %ah. The problem was not reproducible after that change, whereas it was almost immediately reproducible before.

If there are no further developments by the start of next week, I think we can close this issue.

@vicuna

commented Jun 2, 2017

Comment author: joris

I see you changed the description which should help people searching for this issue in the future.

It should be noted that since the issue is triggered by the major GC, it's not only the compiler. Any long-running OCaml program has a high chance of triggering this, and it will not always crash. You can get corrupted data in memory without ever crashing.

As an example, we tried to deploy some tool on a large Xeon Skylake cluster, with several hundred processes. They didn't crash for hours, but very quickly we saw corrupted data being sent over the network and written into the database.

So, anyone reading this in the future: don't assume this affects only the compiler, and don't run critical code on Skylake/Kaby Lake without updating the firmware if you don't want to end up in a nightmarish situation.

@vicuna

commented Jun 2, 2017

Comment author: @mshinwell

Agreed, I've updated the title of this issue.

@vicuna

commented Jun 9, 2017

Comment author: @mshinwell

Closing this issue as per the above.
