Segfault on MacOSX with trunk #11226

Closed
jmid opened this issue Apr 29, 2022 · 3 comments

@jmid
Contributor

jmid commented Apr 29, 2022

I've been chasing a segfault that is triggered on MacOSX. To set up and reproduce:

I can pretty consistently reproduce by running the following (9/10 times or so):

$ dune exec src/lazy/lazy_lin_test.exe -- -v -s 249901845
random seed: 249901845
generated error fail pass / total     time test name
[ ]   19    0    0   19 /  100    22.4s Linearizable lazy test with DomainSegmentation fault: 11

An attempt at reducing the problem is also available. This does not crash as consistently - but the code is a bit simpler and has fewer dependencies:

$ dune exec src/lazy/lazy_lin_reduced.exe
0 t                                 
1 t
2 t
3 t
4 t
5 t
6 CamlinternalLazy.Undefined
7 CamlinternalLazy.Undefined
8 Segmentation fault: 11

What (I think) I know so far:

First, here is the output of an lldb run without the debug runtime, which stops with an EXC_BAD_ACCESS:

$ lldb _build/default/src/lazy/lazy_lin_reduced.exe
(lldb) target create "_build/default/src/lazy/lazy_lin_reduced.exe"
Current executable set to '/Users/jmi/software/ocaml-04-28-2022-11213/multicoretests/_build/default/src/lazy/lazy_lin_reduced.exe' (x86_64).
(lldb) run
Process 1503 launched: '/Users/jmi/software/ocaml-04-28-2022-11213/multicoretests/_build/default/src/lazy/lazy_lin_reduced.exe' (x86_64)
0 t
1 t
2 t
3 t
4 t
5 t
6 CamlinternalLazy.Undefined
7 CamlinternalLazy.Undefined
8 Process 1503 stopped
* thread #3, name = 'Domain3', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x00000001000b7cf4 lazy_lin_reduced.exe`caml_c_call + 4
lazy_lin_reduced.exe`caml_c_call:
->  0x1000b7cf4 <+4>:  movq   %rsp, (%r10)
    0x1000b7cf7 <+7>:  movq   0x30(%r14), %r11
    0x1000b7cfb <+11>: movq   %rsp, 0x8(%r11)
    0x1000b7cff <+15>: movq   %r10, (%r11)
Target 0: (lazy_lin_reduced.exe) stopped.
(lldb) bt all
lazy_lin_reduced.exe was compiled with optimization - stepping may behave oddly; variables may not be available.
  thread #1, name = 'Domain0', queue = 'com.apple.main-thread'
    frame #0: 0x00007fff20608cce libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff2063be49 libsystem_pthread.dylib`_pthread_cond_wait + 1298
    frame #2: 0x00000001000b2fd8 lazy_lin_reduced.exe`caml_ml_condition_wait [inlined] sync_condvar_wait(c=0x0000000100515290, m=0x0000000100515250) at sync_posix.h:122:10 [opt]
    frame #3: 0x00000001000b2fcd lazy_lin_reduced.exe`caml_ml_condition_wait(wcond=<unavailable>, wmut=<unavailable>) at sync.c:172:13 [opt]
    frame #4: 0x00000001000b7d0b lazy_lin_reduced.exe`caml_c_call + 27
    frame #5: 0x00000001000556bc lazy_lin_reduced.exe`camlStdlib__Domain__loop_718 + 44
    frame #6: 0x000000010005565d lazy_lin_reduced.exe`camlStdlib__Domain__join_713 + 141
    frame #7: 0x000000010000837f lazy_lin_reduced.exe`camlDune__exe__Lazy_lin_reduced__lin_prop_domain_754 + 287
    frame #8: 0x000000010000907f lazy_lin_reduced.exe`camlUtil__repeat_268 + 95
    frame #9: 0x0000000100008522 lazy_lin_reduced.exe`camlDune__exe__Lazy_lin_reduced__exec_test_802 + 146
    frame #10: 0x000000010003b228 lazy_lin_reduced.exe`camlStdlib__List__map_483 + 56
    frame #11: 0x000000010003b23f lazy_lin_reduced.exe`camlStdlib__List__map_483 + 79
    frame #12: 0x000000010003b23f lazy_lin_reduced.exe`camlStdlib__List__map_483 + 79
    frame #13: 0x000000010003b23f lazy_lin_reduced.exe`camlStdlib__List__map_483 + 79
    frame #14: 0x000000010003b23f lazy_lin_reduced.exe`camlStdlib__List__map_483 + 79
    frame #15: 0x000000010003b23f lazy_lin_reduced.exe`camlStdlib__List__map_483 + 79
    frame #16: 0x000000010003b23f lazy_lin_reduced.exe`camlStdlib__List__map_483 + 79
    frame #17: 0x000000010003b23f lazy_lin_reduced.exe`camlStdlib__List__map_483 + 79
    frame #18: 0x000000010003b23f lazy_lin_reduced.exe`camlStdlib__List__map_483 + 79
    frame #19: 0x0000000100008fec lazy_lin_reduced.exe`camlDune__exe__Lazy_lin_reduced__entry + 2012
    frame #20: 0x0000000100002b8b lazy_lin_reduced.exe`caml_program + 747
    frame #21: 0x00000001000b7dc4 lazy_lin_reduced.exe`caml_start_program + 112
    frame #22: 0x00000001000b760b lazy_lin_reduced.exe`caml_main [inlined] caml_startup(argv=<unavailable>) at startup_nat.c:136:7 [opt]
    frame #23: 0x00000001000b7604 lazy_lin_reduced.exe`caml_main(argv=<unavailable>) at startup_nat.c:142:3 [opt]
    frame #24: 0x00000001000a787c lazy_lin_reduced.exe`main(argc=<unavailable>, argv=<unavailable>) at main.c:37:3 [opt]
    frame #25: 0x00007fff20656f3d libdyld.dylib`start + 1
  thread #2, name = 'Backup0'
    frame #0: 0x00000001000af98e lazy_lin_reduced.exe`pool_sweep(local=<unavailable>, plist=<unavailable>, sz=1, release_to_global_pool=1) at shared_heap.c:457:31 [opt]
    frame #1: 0x00000001000af524 lazy_lin_reduced.exe`caml_sweep(local=0x0000000111008200, work=512) at shared_heap.c:545:7 [opt]
    frame #2: 0x00000001000a89f0 lazy_lin_reduced.exe`major_collection_slice(howmuch=<unavailable>, participant_count=0, barrier_participants=0x0000000000000000, mode=Slice_opportunistic) at major_gc.c:1208:14 [opt]
    frame #3: 0x0000000100094e98 lazy_lin_reduced.exe`handle_incoming at domain.c:1248:9 [opt]
    frame #4: 0x0000000100094e59 lazy_lin_reduced.exe`handle_incoming(s=<unavailable>) at domain.c:305:5 [opt]
    frame #5: 0x00000001000970e2 lazy_lin_reduced.exe`backup_thread_func [inlined] caml_handle_incoming_interrupts at domain.c:318:3 [opt]
    frame #6: 0x00000001000970cd lazy_lin_reduced.exe`backup_thread_func(v=0x000000010014a810) at domain.c:956:13 [opt]
    frame #7: 0x00007fff2063b8fc libsystem_pthread.dylib`_pthread_start + 224
    frame #8: 0x00007fff20637443 libsystem_pthread.dylib`thread_start + 15
* thread #3, name = 'Domain3', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x00000001000b7cf4 lazy_lin_reduced.exe`caml_c_call + 4
    frame #1: 0x000000010007beef lazy_lin_reduced.exe`camlStdlib__Format__buffered_out_flush_1279 + 111
    frame #2: 0x000000010007f83e lazy_lin_reduced.exe`camlStdlib__Format__flush_standard_formatters_2002 + 62
    frame #3: 0x0000000100055139 lazy_lin_reduced.exe`camlStdlib__Domain__new_exit_673 + 41
    frame #4: 0x00000001000554a7 lazy_lin_reduced.exe`camlStdlib__Domain__body_706 + 135
    frame #5: 0x00000001000b7dc4 lazy_lin_reduced.exe`caml_start_program + 112
    frame #6: 0x000000010009364e lazy_lin_reduced.exe`caml_callback_exn(closure=<unavailable>, arg=1) at callback.c:169:1 [opt]
    frame #7: 0x0000000100093af9 lazy_lin_reduced.exe`caml_callback(closure=<unavailable>, arg=1) at callback.c:253:34 [opt]
    frame #8: 0x0000000100096151 lazy_lin_reduced.exe`domain_thread_func(v=<unavailable>) at domain.c:1085:5 [opt]
    frame #9: 0x00007fff2063b8fc libsystem_pthread.dylib`_pthread_start + 224
    frame #10: 0x00007fff20637443 libsystem_pthread.dylib`thread_start + 15
  thread #4, name = 'Backup3'
    frame #0: 0x00007fff206084ba libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff206392ab libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 76
    frame #2: 0x00007fff20637192 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 204
    frame #3: 0x0000000100097078 lazy_lin_reduced.exe`backup_thread_func [inlined] caml_plat_lock(m=0x000000010014aca8) at platform.h:144:21 [opt]
    frame #4: 0x0000000100097070 lazy_lin_reduced.exe`backup_thread_func(v=0x000000010014abe8) at domain.c:975:9 [opt]
    frame #5: 0x00007fff2063b8fc libsystem_pthread.dylib`_pthread_start + 224
    frame #6: 0x00007fff20637443 libsystem_pthread.dylib`thread_start + 15
  thread #5, name = 'Domain2'
    frame #0: 0x00000001000969ca lazy_lin_reduced.exe`caml_try_run_on_all_domains_with_spin_work [inlined] caml_wait_interrupt_serviced at domain.c:342:14 [opt]
    frame #1: 0x00000001000969b7 lazy_lin_reduced.exe`caml_try_run_on_all_domains_with_spin_work(handler=(lazy_lin_reduced.exe`caml_stw_empty_minor_heap at minor_gc.c:721), data=<unavailable>, leader_setup=<unavailable>, enter_spin_callback=<unavailable>, enter_spin_data=0x0000000000000000) at domain.c:1429:5 [opt]
    frame #2: 0x00000001000acc1d lazy_lin_reduced.exe`caml_empty_minor_heaps_once [inlined] caml_try_stw_empty_minor_heap_on_all_domains at minor_gc.c:758:10 [opt]
    frame #3: 0x00000001000acbf1 lazy_lin_reduced.exe`caml_empty_minor_heaps_once at minor_gc.c:778:5 [opt]
    frame #4: 0x00000001000961d8 lazy_lin_reduced.exe`domain_thread_func [inlined] domain_terminate at domain.c:1654:5 [opt]
    frame #5: 0x0000000100096151 lazy_lin_reduced.exe`domain_thread_func(v=<unavailable>) at domain.c:1086:5 [opt]
    frame #6: 0x00007fff2063b8fc libsystem_pthread.dylib`_pthread_start + 224
    frame #7: 0x00007fff20637443 libsystem_pthread.dylib`thread_start + 15
  thread #6, name = 'Backup2'
    frame #0: 0x00007fff206084ba libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff206392ab libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 76
    frame #2: 0x00007fff20637192 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 204
    frame #3: 0x0000000100097078 lazy_lin_reduced.exe`backup_thread_func [inlined] caml_plat_lock(m=0x000000010014ab60) at platform.h:144:21 [opt]
    frame #4: 0x0000000100097070 lazy_lin_reduced.exe`backup_thread_func(v=0x000000010014aaa0) at domain.c:975:9 [opt]
    frame #5: 0x00007fff2063b8fc libsystem_pthread.dylib`_pthread_start + 224
    frame #6: 0x00007fff20637443 libsystem_pthread.dylib`thread_start + 15
(lldb) 

And here is another run, this time with the debug runtime, which stops with an EXC_BREAKPOINT:

$ lldb _build/default/src/lazy/lazy_lin_reduced.exe
(lldb) target create "_build/default/src/lazy/lazy_lin_reduced.exe"
Current executable set to '/Users/jmi/software/ocaml-04-28-2022-11213/multicoretests/_build/default/src/lazy/lazy_lin_reduced.exe' (x86_64).
(lldb) run
Process 1548 launched: '/Users/jmi/software/ocaml-04-28-2022-11213/multicoretests/_build/default/src/lazy/lazy_lin_reduced.exe' (x86_64)
### OCaml runtime: debug mode ###
0 t
1 t
2 t
3 t
4 t
5 t
6 CamlinternalLazy.Undefined
7 CamlinternalLazy.Undefined
8 Process 1548 stopped
* thread #5, name = 'Domain3', stop reason = EXC_BREAKPOINT (code=EXC_I386_BPT, subcode=0x0)
    frame #0: 0x00000001000b9cf8 lazy_lin_reduced.exe`caml_c_call + 48
lazy_lin_reduced.exe`caml_c_call:
->  0x1000b9cf8 <+48>: movq   0x20(%r14), %r11
    0x1000b9cfc <+52>: movq   (%r11), %r11
    0x1000b9cff <+55>: cmpq   %r11, 0x8(%rsp)
    0x1000b9d04 <+60>: je     0x1000b9d07               ; <+63>
Target 0: (lazy_lin_reduced.exe) stopped.
(lldb) bt all
lazy_lin_reduced.exe was compiled with optimization - stepping may behave oddly; variables may not be available.
  thread #1, name = 'Domain0', queue = 'com.apple.main-thread'
    frame #0: 0x00000001000b1380 lazy_lin_reduced.exe`pool_sweep(local=<unavailable>, plist=<unavailable>, sz=2, release_to_global_pool=1) at shared_heap.c:0:5 [opt]
    frame #1: 0x00000001000b0d94 lazy_lin_reduced.exe`caml_sweep(local=0x0000000111808200, work=512) at shared_heap.c:545:7 [opt]
    frame #2: 0x00000001000a8e00 lazy_lin_reduced.exe`major_collection_slice(howmuch=<unavailable>, participant_count=0, barrier_participants=0x0000000000000000, mode=Slice_opportunistic) at major_gc.c:1208:14 [opt]
    frame #3: 0x0000000100094e18 lazy_lin_reduced.exe`handle_incoming at domain.c:1248:9 [opt]
    frame #4: 0x0000000100094dd6 lazy_lin_reduced.exe`handle_incoming(s=<unavailable>) at domain.c:305:5 [opt]
    frame #5: 0x0000000100097145 lazy_lin_reduced.exe`caml_handle_gc_interrupt [inlined] caml_handle_incoming_interrupts at domain.c:318:3 [opt]
    frame #6: 0x0000000100097130 lazy_lin_reduced.exe`caml_handle_gc_interrupt at domain.c:1531:5 [opt]
    frame #7: 0x00000001000b2649 lazy_lin_reduced.exe`caml_process_pending_actions at signals.c:236:3 [opt]
    frame #8: 0x00000001000b9785 lazy_lin_reduced.exe`caml_garbage_collection at signals_nat.c:104:7 [opt]
    frame #9: 0x00000001000b9ba1 lazy_lin_reduced.exe`caml_call_gc + 241
    frame #10: 0x00000001000552cd lazy_lin_reduced.exe`camlStdlib__Domain__join_713 + 173
    frame #11: 0x0000000100007fdd lazy_lin_reduced.exe`camlDune__exe__Lazy_lin_reduced__lin_prop_domain_754 + 301
    frame #12: 0x0000000100008ccf lazy_lin_reduced.exe`camlUtil__repeat_268 + 95
    frame #13: 0x0000000100008172 lazy_lin_reduced.exe`camlDune__exe__Lazy_lin_reduced__exec_test_802 + 146
    frame #14: 0x000000010003ae78 lazy_lin_reduced.exe`camlStdlib__List__map_483 + 56
    frame #15: 0x000000010003ae8f lazy_lin_reduced.exe`camlStdlib__List__map_483 + 79
    frame #16: 0x000000010003ae8f lazy_lin_reduced.exe`camlStdlib__List__map_483 + 79
    frame #17: 0x000000010003ae8f lazy_lin_reduced.exe`camlStdlib__List__map_483 + 79
    frame #18: 0x000000010003ae8f lazy_lin_reduced.exe`camlStdlib__List__map_483 + 79
    frame #19: 0x000000010003ae8f lazy_lin_reduced.exe`camlStdlib__List__map_483 + 79
    frame #20: 0x000000010003ae8f lazy_lin_reduced.exe`camlStdlib__List__map_483 + 79
    frame #21: 0x000000010003ae8f lazy_lin_reduced.exe`camlStdlib__List__map_483 + 79
    frame #22: 0x000000010003ae8f lazy_lin_reduced.exe`camlStdlib__List__map_483 + 79
    frame #23: 0x0000000100008c3c lazy_lin_reduced.exe`camlDune__exe__Lazy_lin_reduced__entry + 2012
    frame #24: 0x0000000100002610 lazy_lin_reduced.exe`caml_program + 752
    frame #25: 0x00000001000b9e02 lazy_lin_reduced.exe`caml_start_program + 150
    frame #26: 0x00000001000b955b lazy_lin_reduced.exe`caml_main [inlined] caml_startup(argv=<unavailable>) at startup_nat.c:136:7 [opt]
    frame #27: 0x00000001000b9554 lazy_lin_reduced.exe`caml_main(argv=<unavailable>) at startup_nat.c:142:3 [opt]
    frame #28: 0x00000001000a794c lazy_lin_reduced.exe`main(argc=<unavailable>, argv=<unavailable>) at main.c:37:3 [opt]
    frame #29: 0x00007fff20656f3d libdyld.dylib`start + 1
  thread #2, name = 'Backup0'
    frame #0: 0x00007fff206084ba libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff206392ab libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 76
    frame #2: 0x00007fff20637192 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 204
    frame #3: 0x0000000100097848 lazy_lin_reduced.exe`backup_thread_func [inlined] caml_plat_lock(m=0x000000010014e9f0) at platform.h:144:21 [opt]
    frame #4: 0x0000000100097839 lazy_lin_reduced.exe`backup_thread_func(v=0x000000010014e930) at domain.c:975:9 [opt]
    frame #5: 0x00007fff2063b8fc libsystem_pthread.dylib`_pthread_start + 224
    frame #6: 0x00007fff20637443 libsystem_pthread.dylib`thread_start + 15
  thread #3, name = 'Domain2'
    frame #0: 0x0000000100096eea lazy_lin_reduced.exe`caml_try_run_on_all_domains_with_spin_work [inlined] caml_wait_interrupt_serviced at domain.c:342:14 [opt]
    frame #1: 0x0000000100096ed7 lazy_lin_reduced.exe`caml_try_run_on_all_domains_with_spin_work(handler=(lazy_lin_reduced.exe`caml_stw_empty_minor_heap at minor_gc.c:721), data=<unavailable>, leader_setup=<unavailable>, enter_spin_callback=<unavailable>, enter_spin_data=0x00000001000adce0) at domain.c:1429:5 [opt]
    frame #2: 0x00000001000add86 lazy_lin_reduced.exe`caml_empty_minor_heaps_once [inlined] caml_try_stw_empty_minor_heap_on_all_domains at minor_gc.c:758:10 [opt]
    frame #3: 0x00000001000add5a lazy_lin_reduced.exe`caml_empty_minor_heaps_once at minor_gc.c:778:5 [opt]
    frame #4: 0x0000000100096434 lazy_lin_reduced.exe`domain_thread_func [inlined] domain_terminate at domain.c:1654:5 [opt]
    frame #5: 0x00000001000963ac lazy_lin_reduced.exe`domain_thread_func(v=<unavailable>) at domain.c:1086:5 [opt]
    frame #6: 0x00007fff2063b8fc libsystem_pthread.dylib`_pthread_start + 224
    frame #7: 0x00007fff20637443 libsystem_pthread.dylib`thread_start + 15
  thread #4, name = 'Backup2'
    frame #0: 0x00007fff206084ba libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff206392ab libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 76
    frame #2: 0x00007fff20637192 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 204
    frame #3: 0x0000000100097848 lazy_lin_reduced.exe`backup_thread_func [inlined] caml_plat_lock(m=0x000000010014ec80) at platform.h:144:21 [opt]
    frame #4: 0x0000000100097839 lazy_lin_reduced.exe`backup_thread_func(v=0x000000010014ebc0) at domain.c:975:9 [opt]
    frame #5: 0x00007fff2063b8fc libsystem_pthread.dylib`_pthread_start + 224
    frame #6: 0x00007fff20637443 libsystem_pthread.dylib`thread_start + 15
* thread #5, name = 'Domain3', stop reason = EXC_BREAKPOINT (code=EXC_I386_BPT, subcode=0x0)
  * frame #0: 0x00000001000b9cf8 lazy_lin_reduced.exe`caml_c_call + 48
    frame #1: 0x000000010002f69f lazy_lin_reduced.exe`camlStdlib__output_substring_258 + 79
    frame #2: 0x000000010007bb2e lazy_lin_reduced.exe`camlStdlib__Format__buffered_out_flush_1279 + 94
    frame #3: 0x000000010007f48e lazy_lin_reduced.exe`camlStdlib__Format__flush_standard_formatters_2002 + 62
    frame #4: 0x0000000100054d89 lazy_lin_reduced.exe`camlStdlib__Domain__new_exit_673 + 41
    frame #5: 0x00000001000550f7 lazy_lin_reduced.exe`camlStdlib__Domain__body_706 + 135
    frame #6: 0x00000001000b9e02 lazy_lin_reduced.exe`caml_start_program + 150
    frame #7: 0x000000010009334f lazy_lin_reduced.exe`caml_callback_exn(closure=<unavailable>, arg=1) at callback.c:169:1 [opt]
    frame #8: 0x00000001000938f9 lazy_lin_reduced.exe`caml_callback(closure=<unavailable>, arg=1) at callback.c:253:34 [opt]
    frame #9: 0x00000001000963ac lazy_lin_reduced.exe`domain_thread_func(v=<unavailable>) at domain.c:1085:5 [opt]
    frame #10: 0x00007fff2063b8fc libsystem_pthread.dylib`_pthread_start + 224
    frame #11: 0x00007fff20637443 libsystem_pthread.dylib`thread_start + 15
  thread #6, name = 'Backup3'
    frame #0: 0x00007fff206084ba libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff206392ab libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 76
    frame #2: 0x00007fff20637192 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 204
    frame #3: 0x0000000100097848 lazy_lin_reduced.exe`backup_thread_func [inlined] caml_plat_lock(m=0x000000010014edc8) at platform.h:144:21 [opt]
    frame #4: 0x0000000100097839 lazy_lin_reduced.exe`backup_thread_func(v=0x000000010014ed08) at domain.c:975:9 [opt]
    frame #5: 0x00007fff2063b8fc libsystem_pthread.dylib`_pthread_start + 224
    frame #6: 0x00007fff20637443 libsystem_pthread.dylib`thread_start + 15
(lldb) 
@kayceesrk
Contributor

I've had a preliminary look at this. At caml_c_call where the segfault occurs, I noticed that Caml_state->current_stack is NULL.

movq %rsp, Stack_sp(%r10); \

The other Caml_state fields look reasonable. I am looking at where current_stack is being set to NULL.

kayceesrk added a commit to kayceesrk/ocaml that referenced this issue May 10, 2022
Use a global key to access the per-thread `caml_thread_t` rather than
have a key per domain. This fixes the issue reported in ocaml#11226.
@kayceesrk
Contributor

kayceesrk commented May 10, 2022

Here is what I suspect is the root cause of the bug.

The test itself links against the threads library. This brings in the systhreads hooks for domain creation, termination, enter/leave blocking section, etc, even if the OCaml code itself does not use the Thread module explicitly. The issue is with the systhread initialisation performed at the start of each domain in the function caml_thread_initialize_domain:

st_tls_newkey(&Thread_key);
st_tls_set(Thread_key, (void *) new_thread);

Thread_key is thread_table[Caml_state->id].thread_key. Caml_state->id refers to the slot in the thread_table. The slots are reused when domains are created and destroyed. Hence, the st_tls_newkey call above (line 368) creates the key several times at the same key address. This is not supported. See https://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_key_create.html, "Non-idempotent data key creation".

One would expect the following assertion to hold trivially when placed right after the lines above:

CAMLassert (new_thread == (caml_thread_t)st_tls_get(Thread_key))

However, this assertion fails intermittently on this testcase.
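The hazard of non-idempotent key creation can be illustrated with a small standalone sketch. It is not the runtime's code and does not reproduce the actual race between domains: slot_key, descriptor, and the helper functions are hypothetical names, and the key is re-created from the same thread to keep the outcome deterministic. It only shows that calling pthread_key_create again on the same pthread_key_t address yields a fresh key whose value is NULL, so a set-then-get pair can disagree if a re-creation happens in between.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for one reusable slot in a per-domain table. */
static pthread_key_t slot_key;
static int descriptor;          /* dummy per-thread descriptor */

/* Create a key in the slot. Every call allocates a brand-new key, even
   though the address &slot_key is the same each time. */
static void create_key_in_slot(void)
{
    int rc = pthread_key_create(&slot_key, NULL);
    assert(rc == 0);
    (void)rc;
}

/* Store the descriptor, optionally re-create the key in the same slot (as
   when the slot is reused), then read back through the slot. */
void *read_back(int recreate_key)
{
    create_key_in_slot();
    int rc = pthread_setspecific(slot_key, &descriptor);
    assert(rc == 0);
    (void)rc;

    if (recreate_key)
        create_key_in_slot();   /* old key handle is overwritten and leaked */

    return pthread_getspecific(slot_key);
}

/* 1 if the value read through the slot is the descriptor that was stored. */
int descriptor_survives(int recreate_key)
{
    return read_back(recreate_key) == (void *)&descriptor;
}
```

Calling descriptor_survives(0) models the expected behaviour, while descriptor_survives(1) models a slot whose key was re-created after the descriptor was stored. Each call leaks at least one key, which is acceptable for a demonstration but is another reason not to create keys repeatedly.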

Later, when the systhread state is restored by caml_thread_restore_runtime_state on leaving the blocking section (caml_thread_leave_blocking_section), garbage values are written to the current-stack and C-stack fields of the domain-local state:

static void caml_thread_leave_blocking_section(void)
{
  /* Wait until the runtime is free */
  st_masterlock_acquire(&Thread_main_lock);
  /* Update Current_thread to point to the thread descriptor corresponding to
     the thread currently executing */
  Current_thread = st_tls_get(Thread_key);
  /* Restore the runtime state from the curr_thread descriptor */
  caml_thread_restore_runtime_state();
}

This promptly makes the next external call fail.

Fix

PR #11250 fixes the issue by ensuring that the key is created exactly once per program rather than once every time a domain is created.
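The create-once pattern can be sketched as follows. This is a minimal illustration, not the code from the PR: thread_key, thread_key_once, and the helper functions are hypothetical names, and using pthread_once is an assumption made here to guarantee single creation; the PR may achieve the same effect differently. Short-lived threads stand in for domains that are created and destroyed.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical names: one process-wide key, created at most once. */
static pthread_key_t thread_key;
static pthread_once_t thread_key_once = PTHREAD_ONCE_INIT;

static void make_thread_key(void)
{
    int rc = pthread_key_create(&thread_key, NULL);
    assert(rc == 0);
    (void)rc;
}

/* Safe to call from every new thread: the key is created only the first
   time, later calls just store the per-thread value. */
static void register_descriptor(void *descriptor)
{
    pthread_once(&thread_key_once, make_thread_key);
    int rc = pthread_setspecific(thread_key, descriptor);
    assert(rc == 0);
    (void)rc;
}

/* Thread body: register a private descriptor and check that it reads back. */
static void *worker(void *arg)
{
    int *ok = arg;
    int descriptor;                 /* lives on this thread's stack */
    register_descriptor(&descriptor);
    *ok = (pthread_getspecific(thread_key) == (void *)&descriptor);
    return NULL;
}

/* Run `rounds` short-lived threads one after another and count how many
   read back their own descriptor. Returns -1 if a thread cannot be created. */
int count_consistent_threads(int rounds)
{
    int consistent = 0;
    for (int i = 0; i < rounds; i++) {
        pthread_t t;
        int ok = 0;
        if (pthread_create(&t, NULL, worker, &ok) != 0)
            return -1;
        pthread_join(t, NULL);
        consistent += ok;
    }
    return consistent;
}
```

Because the key handle never changes after the first call, every thread that stores a value under it reads the same value back, regardless of how many threads have come and gone before it.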

@jmid
Contributor Author

jmid commented May 10, 2022

I confirm that this is fixed by PR #11250.

I just completed 20 reruns of src/lazy/lazy_lin_reduced.exe on PR #11250 without a segfault after having gotten segfaults in 5/5 reruns without it.

Running the full multicoretest suite on MacOS has also been a pretty consistent way to trigger this issue (in something like 19/20 runs): https://github.com/jmid/multicoretests/actions/workflows/macosx-500-workflow.yml I've now run it fully 9 times without a segfault - each run takes 10-20min on my local machine.

Thanks @kayceesrk!
