Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TSan] Random failure of tests/parallel/domain_dls.ml #13059

Closed
dustanddreams opened this issue Mar 29, 2024 · 7 comments
Closed

[TSan] Random failure of tests/parallel/domain_dls.ml #13059

dustanddreams opened this issue Mar 29, 2024 · 7 comments

Comments

@dustanddreams
Copy link
Contributor

On s390x, sometimes (roughly one time out of three), tests/parallel/domain_dls.ml triggers a TSan alarm:

WARNING: ThreadSanitizer: data race (pid=3616815)
  Atomic write of size 8 at 0x03ffa7835fa0 by main thread (mutexes: write M0):
    #0 do_some_marking /home/miod/git/ocaml/runtime/major_gc.c:1048:9 (domain_dls.opt+0x205595)
    #1 mark /home/miod/git/ocaml/runtime/major_gc.c:1150:14 (domain_dls.opt+0x20248f)
    #2 major_collection_slice /home/miod/git/ocaml/runtime/major_gc.c:1742:21 (domain_dls.opt+0x2011a9)
    #3 caml_opportunistic_major_collection_slice /home/miod/git/ocaml/runtime/major_gc.c:1913:3 (domain_dls.opt+0x200111)
    #4 caml_do_opportunistic_major_slice /home/miod/git/ocaml/runtime/minor_gc.c:727:5 (domain_dls.opt+0x21918b)
    #5 stw_handler /home/miod/git/ocaml/runtime/domain.c:1369:9 (domain_dls.opt+0x1d928b)
    #6 handle_incoming /home/miod/git/ocaml/runtime/domain.c:330:5 (domain_dls.opt+0x1d83c5)
    #7 caml_handle_incoming_interrupts /home/miod/git/ocaml/runtime/domain.c:343:3 (domain_dls.opt+0x1d83c5)
    #8 caml_handle_gc_interrupt /home/miod/git/ocaml/runtime/domain.c:1766:5 (domain_dls.opt+0x1d83c5)
    #9 caml_do_pending_actions_exn /home/miod/git/ocaml/runtime/signals.c:342:3 (domain_dls.opt+0x226e89)
    #10 caml_alloc_small_dispatch /home/miod/git/ocaml/runtime/minor_gc.c:863:31 (domain_dls.opt+0x21976f)
    #11 caml_garbage_collection /home/miod/git/ocaml/runtime/signals_nat.c:84:5 (domain_dls.opt+0x2469d9)
    #12 caml_call_gc <null> (domain_dls.opt+0x237ae9)
    #13 camlStdlib__Domain.spawn_734 <null> (domain_dls.opt+0x1990b3)
    #14 camlStdlib__Array.init_291 <null> (domain_dls.opt+0x1864cf)
    #15 camlDomain_dls.check_dls_domain_reuse_336 <null> (domain_dls.opt+0x14ecef)
    #16 camlDomain_dls.entry <null> (domain_dls.opt+0x14f4cd)
    #17 caml_program <null> (domain_dls.opt+0x14a95d)
    #18 caml_start_program <null> (domain_dls.opt+0x237f3f)
    #19 caml_startup_common /home/miod/git/ocaml/runtime/startup_nat.c:128:9 (domain_dls.opt+0x237681)
    #20 caml_startup_exn /home/miod/git/ocaml/runtime/startup_nat.c:135:10 (domain_dls.opt+0x237755)
    #21 caml_startup /home/miod/git/ocaml/runtime/startup_nat.c:140:15 (domain_dls.opt+0x237755)
    #22 caml_main /home/miod/git/ocaml/runtime/startup_nat.c:147:3 (domain_dls.opt+0x237755)
    #23 main /home/miod/git/ocaml/runtime/main.c:37:3 (domain_dls.opt+0x1fec11)

  Previous read of size 1 at 0x03ffa7835fa7 by thread T21 (mutexes: write M1):
    #0 caml_make_vect /home/miod/git/ocaml/runtime/array.c:212:17 (domain_dls.opt+0x1cc4a3)
    #1 caml_c_call <null> (domain_dls.opt+0x237d25)
    #2 camlStdlib__Domain.create_dls_413 <null> (domain_dls.opt+0x1976cf)
    #3 camlStdlib__Domain.body_739 <null> (domain_dls.opt+0x199153)
    #4 caml_start_program <null> (domain_dls.opt+0x237f3f)
    #5 caml_callback_exn /home/miod/git/ocaml/runtime/callback.c:201:12 (domain_dls.opt+0x1d02bd)
    #6 domain_thread_func /home/miod/git/ocaml/runtime/domain.c:1207:24 (domain_dls.opt+0x1d6477)

  Mutex M0 (0x02aa3754b238) created at:
    #0 pthread_mutex_init /home/miod/git/llvm-project/compiler-rt/lib/tsan/rtltsan_interceptors_posix.cpp:1315:3 (domain_dls.opt+0xb2ba3)
    #1 caml_plat_mutex_init /home/miod/git/ocaml/runtime/platform.c:57:8 (domain_dls.opt+0x21c329)
    #2 caml_init_domains /home/miod/git/ocaml/runtime/domain.c:935:5 (domain_dls.opt+0x1d564f)
    #3 caml_init_gc /home/miod/git/ocaml/runtime/gc_ctrl.c:350:3 (domain_dls.opt+0x1ec1d7)
    #4 caml_startup_common /home/miod/git/ocaml/runtime/startup_nat.c:107:3 (domain_dls.opt+0x237453)
    #5 caml_startup_exn /home/miod/git/ocaml/runtime/startup_nat.c:135:10 (domain_dls.opt+0x237755)
    #6 caml_startup /home/miod/git/ocaml/runtime/startup_nat.c:140:15 (domain_dls.opt+0x237755)
    #7 caml_main /home/miod/git/ocaml/runtime/startup_nat.c:147:3 (domain_dls.opt+0x237755)
    #8 main /home/miod/git/ocaml/runtime/main.c:37:3 (domain_dls.opt+0x1fec11)

  Mutex M1 (0x02aa3754b580) created at:
    #0 pthread_mutex_init /home/miod/git/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1315:3 (domain_dls.opt+0xb2ba3)
    #1 caml_plat_mutex_init /home/miod/git/ocaml/runtime/platform.c:57:8 (domain_dls.opt+0x21c329)
    #2 caml_init_domains /home/miod/git/ocaml/runtime/domain.c:935:5 (domain_dls.opt+0x1d564f)
    #3 caml_init_gc /home/miod/git/ocaml/runtime/gc_ctrl.c:350:3 (domain_dls.opt+0x1ec1d7)
    #4 caml_startup_common /home/miod/git/ocaml/runtime/startup_nat.c:107:3 (domain_dls.opt+0x237453)
    #5 caml_startup_exn /home/miod/git/ocaml/runtime/startup_nat.c:135:10 (domain_dls.opt+0x237755)
    #6 caml_startup /home/miod/git/ocaml/runtime/startup_nat.c:140:15 (domain_dls.opt+0x237755)
    #7 caml_main /home/miod/git/ocaml/runtime/startup_nat.c:147:3 (domain_dls.opt+0x237755)
    #8 main /home/miod/git/ocaml/runtime/main.c:37:3 (domain_dls.opt+0x1fec11)

  Thread T21 (tid=3616838, running) created by main thread at:
    #0 pthread_create /home/miod/git/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1022:3 (domain_dls.opt+0x818db)
    #1 caml_domain_spawn /home/miod/git/ocaml/runtime/domain.c:1257:9 (domain_dls.opt+0x1d5e7b)
    #2 caml_c_call <null> (domain_dls.opt+0x237d25)
    #3 camlStdlib__Domain.spawn_734 <null> (domain_dls.opt+0x199057)
    #4 camlStdlib__Array.init_291 <null> (domain_dls.opt+0x1864cf)
    #5 camlDomain_dls.check_dls_domain_reuse_336 <null> (domain_dls.opt+0x14ecef)
    #6 camlDomain_dls.entry <null> (domain_dls.opt+0x14f4cd)
    #7 caml_program <null> (domain_dls.opt+0x14a95d)
    #8 caml_start_program <null> (domain_dls.opt+0x237f3f)
    #9 caml_startup_common /home/miod/git/ocaml/runtime/startup_nat.c:128:9 (domain_dls.opt+0x237681)
    #10 caml_startup_exn /home/miod/git/ocaml/runtime/startup_nat.c:135:10 (domain_dls.opt+0x237755)
    #11 caml_startup /home/miod/git/ocaml/runtime/startup_nat.c:140:15 (domain_dls.opt+0x237755)
    #12 caml_main /home/miod/git/ocaml/runtime/startup_nat.c:147:3 (domain_dls.opt+0x237755)
    #13 main /home/miod/git/ocaml/runtime/main.c:37:3 (domain_dls.opt+0x1fec11)

SUMMARY: ThreadSanitizer: data race /home/miod/git/ocaml/runtime/major_gc.c:1048:9 in do_some_marking

I have not been able to observe a similar failure on other platforms.

@OlivierNicole
Copy link
Contributor

So, in principle, these accesses being concurrent should be ok. The question is why our machinery based volatile * accesses failed to make TSan see it that way. The incriminated read is a use of Tag_val, which loads the least significant byte of a header word. My hypothesis is that because s390x is big-endian, the load is not aligned, and we currently consider unaligned volatile accesses to behave like plain accesses. (Tangent: this code I linked to seems to have an superfluous alignment check now that I’m looking at it.)

My feeling is that the unaligned nature of the load does not change anything to the fact that these accesses are safe In Practice™: the GC marking does not modify the tag. So the question is whether there is way to remove this false report.

@dustanddreams
Copy link
Contributor Author

Well I'd argue that, on platforms requiring strict alignment of data such as s390x (i.e. when allow_unaligned_access = false in arch.ml), loads smaller than the size of a register are atomic (the memory bus transaction is a machine word read, with the proper byte enables set or unset to only return the subset of the word the ALU is interested in).

@gasche
Copy link
Member

gasche commented Mar 29, 2024

If I understand @dustanddreams' point, the recommend fix would be to change the definitions of the tsan_volatile_{read,write} functions in tsan.c on certain architectures.

@jmid
Copy link
Contributor

jmid commented Mar 30, 2024

Unsure if this is related, but 4 days ago I observed a crash in our Domain.DLS trunk tests (on 32-bit though):
ocaml-multicore/multicoretests#446
The striking thing to me was that this was in a sequential test even.
I've not had time to look more into reproducing and reporting it.

@OlivierNicole
Copy link
Contributor

Well I'd argue that, on platforms requiring strict alignment of data such as s390x (i.e. when allow_unaligned_access = false in arch.ml), loads smaller than the size of a register are atomic (the memory bus transaction is a machine word read, with the proper byte enables set or unset to only return the subset of the word the ALU is interested in).

Interesting, I don’t know much about unaligned accesses. How are they different in, for instance, amd64?

@dustanddreams
Copy link
Contributor Author

Interesting, I don’t know much about unaligned accesses. How are they different in, for instance, amd64?

I think we have a misunderstanding of what "unaligned access" means here.

An unaligned access is a memory access of width w at address a, where a is not a multiple of w. Depending on the hardware platform, such an operation may trap, behave differently (e.g. silently accessing a rounded-down address and thus not returning the expected data), or require multiple memory bus transactions to complete (and as such, hardly atomic).

Now if you consider atomicity, aligned read accesses not larger than the width of the memory bus (usually at least as wide as the machine register, but sometimes wider, e.g. the 88110 was a 32-bit processor with a 64-bit memory bus and able to perform atomic loads of 64-bit values into register pairs) are always atomic (well, except on alpha without the BWX extension, but these are not supported by the native compiler since 4.x).

On the other hands, aligned write accesses are atomic only if they have the same width as the memory bus, or if the hardware provides some atomicity guarantees, since smaller-than-width writes will be turned into a 3-step read, modify the subset bytes, write operation. (which is why the x86 architecture has the "LOCK" prefix to require the memory bus to not be released between these steps, in order to appear to be an atomic operation to other devices).

I will work on a relaxed version of __tsan_volatile_read* to demonstrate this.

@dustanddreams
Copy link
Contributor Author

Fixed by #13067.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants