Skip to content

Do not write to event ring after going out of stw participant set.#13529

Merged
sadiqj merged 1 commit intoocaml:trunkfrom
kayceesrk:fix_data_race_ring_write
Oct 5, 2024
Merged

Do not write to event ring after going out of stw participant set.#13529
sadiqj merged 1 commit intoocaml:trunkfrom
kayceesrk:fix_data_race_ring_write

Conversation

@kayceesrk
Copy link
Copy Markdown
Contributor

@kayceesrk kayceesrk commented Oct 5, 2024

When the runtime events ring is destroyed, only the domains that are part of the stop-the-world (stw) participant list are stopped. So it is unsafe to write to ring when the domain is not part of the stw participant set.

This fixes the data race in lib-runtime-events/test_dropped_events. I was able to reliably trigger data race in test_dropped_events on M2 macOS on trunk.

> WARNING: ThreadSanitizer: data race (pid=15812)                                                                    
>   Write of size 8 at 0x00010469c920 by main thread (mutexes: write M0):           
>     #0 stw_teardown_runtime_events runtime_events.c:435 (test_dropped_events.opt:arm64+0x1000df800)
>     #1 caml_try_run_on_all_domains_with_spin_work domain.c:1691 (test_dropped_events.opt:arm64+0x1000aeec0)                                                                                                                             
>     #2 caml_try_run_on_all_domains domain.c:1713 (test_dropped_events.opt:arm64+0x1000acf44)                                                                                                                                            >     #3 caml_runtime_events_destroy runtime_events.c:232 (test_dropped_events.opt:arm64+0x1000df4b4)                                                                                                                                     
>     #4 caml_do_exit sys.c:188 (test_dropped_events.opt:arm64+0x1000e99d4)                                                                                                                                                               
>     #5 main main.c:38 (test_dropped_events.opt:arm64+0x1000cc59c)                                                                                                                                                                       
>                                                                                                                    
>   Previous read of size 8 at 0x00010469c920 by thread T1 (mutexes: write M1, write M2):
>     #0 write_to_ring runtime_events.c:529 (test_dropped_events.opt:arm64+0x1000df568)
>     #1 caml_ev_counter runtime_events.c:635 (test_dropped_events.opt:arm64+0x1000e03d8)                                                                                                                                                 >     #2 update_major_slice_work major_gc.c:733 (test_dropped_events.opt:arm64+0x1000cf8b8)
>     #3 caml_teardown_major_gc major_gc.c:2128 (test_dropped_events.opt:arm64+0x1000cf3c8)
>     #4 domain_thread_func domain.c:1248 (test_dropped_events.opt:arm64+0x1000ae534)
>                                                                                                                                                                                                                                         
>   Location is global 'current_metadata' at 0x00010469c920 (test_dropped_events.opt+0x10017c920)    

With this PR, the data race is gone.

@kayceesrk kayceesrk added the run-thread-sanitizer Makes the CI run the testsuite with TSAN enabled label Oct 5, 2024
@kayceesrk kayceesrk added the thread-sanitizer Related to data races, thread sanitizer label Oct 5, 2024
@kayceesrk kayceesrk changed the title Do not write to ring after going out of stw participant set. Do not write to event ring after going out of stw participant set. Oct 5, 2024
@kayceesrk kayceesrk force-pushed the fix_data_race_ring_write branch from f7b1162 to e3ce324 Compare October 5, 2024 06:53
Copy link
Copy Markdown
Contributor

@sadiqj sadiqj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for finding and fixing this @kayceesrk

One thing - could you add a comment to update_major_slice_work noting that it might be called outside of the stw set and that's also why we have a logging flag (in addition to the opportunistic spam)?

At the moment the only information that is the case is on the caller side.

@kayceesrk kayceesrk force-pushed the fix_data_race_ring_write branch from e3ce324 to 570cf11 Compare October 5, 2024 09:19
When the runtime events ring is destroyed, only the domains that are
part of the stop-the-world (stw) participant list are stopped. So it is
unsafe to write to ring when the domain is not part of the stw
participant set.

This fixes the data race in `lib-runtime-events/test_dropped_events`.
@kayceesrk kayceesrk force-pushed the fix_data_race_ring_write branch from 570cf11 to dfabc40 Compare October 5, 2024 09:20
@kayceesrk
Copy link
Copy Markdown
Contributor Author

Done in dfabc40.

@sadiqj sadiqj merged commit 27c82c4 into ocaml:trunk Oct 5, 2024
@sadiqj
Copy link
Copy Markdown
Contributor

sadiqj commented Oct 5, 2024

Thanks @kayceesrk !

dra27 pushed a commit that referenced this pull request Oct 14, 2024
When the runtime events ring is destroyed, only the domains that are
part of the stop-the-world (stw) participant list are stopped. So it is
unsafe to write to ring when the domain is not part of the stw
participant set.

This fixes the data race in `lib-runtime-events/test_dropped_events`.

(cherry picked from commit 27c82c4)
@dra27
Copy link
Copy Markdown
Member

dra27 commented Oct 14, 2024

Cherry-picked to 5.3 in b190c7c, but I've not cherry-picked to 5.2 on the basis that the failures we saw on trunk were related to work added after 5.2.0 was released (unless you'd really like it in 5.2.1?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-me run-thread-sanitizer Makes the CI run the testsuite with TSAN enabled thread-sanitizer Related to data races, thread sanitizer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants