
Fix data race on global pools arrays of pool_freelist #12755

Merged (4 commits), Nov 27, 2023

Conversation

@fabbing (Contributor) commented Nov 17, 2023

This PR, joint work with @OlivierNicole, addresses data races on the global_avail_pools and global_full_pools members of struct pool_freelist in shared_heap.c.

Domains can access the global_(avail|full)_pools arrays in parallel while trying to adopt a pool with pool_global_adopt, so these arrays must be made of atomic pointers.
Since the pool_freelist struct is not exposed outside the runtime, we chose a proper _Atomic qualifier rather than a volatile one. The downside is that reads and writes must go through the atomic_(load|store)_relaxed functions so as not to cause more synchronisation than intended, which makes the code a bit more verbose.

Co-authored-by: Olivier Nicole <olivier@chnik.fr>
@OlivierNicole (Contributor) commented:

To address some possible confusion: manipulations of this global free list are generally protected by a mutex, with exactly one exception, pool_global_adopt, as @fabbing said, where a first read is made outside of the lock:

ocaml/runtime/shared_heap.c, lines 301 to 309 at 8ec2b3d:

```c
  /* probably no available pools out there to be had */
  if( !pool_freelist.global_avail_pools[sz] &&
      !pool_freelist.global_full_pools[sz] )
    return NULL;

  /* Haven't managed to find a pool locally, try the global ones */
  caml_plat_lock(&pool_freelist.lock);
  if( pool_freelist.global_avail_pools[sz] ) {
```

This read can consequently race with any write to the same arrays, which is technically undefined behaviour.

@gasche (Member) left a comment:

I am not convinced by your choice of relaxed. Except for the one short-cut check that was outside the lock, these accesses only occur (in debug mode, or) in code that runs on the first pool allocation following the termination of another domain, which is very rare. I think the performance gain of relaxed over a plain atomic access is negligible.

```diff
-  if( !pool_freelist.global_avail_pools[sz] &&
-      !pool_freelist.global_full_pools[sz] )
+  if( !atomic_load_acquire(&pool_freelist.global_avail_pools[sz]) &&
+      !atomic_load_acquire(&pool_freelist.global_full_pools[sz]) )
```
A Member replied:

I would expect a relaxed here if the intention is to have a fast, approximate check that corresponds to the previous version. acquire is probably fine though.

@fabbing (Contributor, Author) replied:

My intention was to avoid taking the lock when not necessary, but now that I have to argue for it, I'm no longer sure it achieved that. A relaxed operation will be fine.

@OlivierNicole (Contributor) commented:

> I am not convinced by your choice of relaxed. Except for the one short-cut check that was outside the lock, these accesses only occur (in debug mode, or) in code that runs on the first pool allocation following the termination of another domain, which is very rare.

In principle, I'm not at all opposed to using SC atomics instead of relaxed if the performance cost is negligible: it makes the code much more readable. I see one other place where such SC atomic operations on global pools would occur, however: move_all_pools, which is called by each domain from caml_cycle_heap when completing a major heap cycle. (Although I don't really understand this code, e.g. I don't understand why move_all_pools is called repeatedly on a global pool list that I expect to be empty after the first call.)

@gasche (Member) commented Nov 20, 2023:

caml_cycle_heap is a rare operation; adding a couple of SC atomics is negligible.

@fabbing (Contributor, Author) commented Nov 20, 2023:

> I am not convinced by your choice of relaxed. Except for the one short-cut check that was outside the lock, these accesses only occur (in debug mode, or) in code that runs on the first pool allocation following the termination of another domain, which is very rare. I think the performance gain of relaxed over a plain atomic access is negligible.

These accesses are protected by the pool_freelist.lock mutex, which propagates changes to other domains when the lock is taken and released, so relaxed operations are fine.
The code is more verbose because we have to use atomic_load_relaxed, but this also makes it clear that these operations don't cause synchronisation.

@gasche (Member) commented Nov 20, 2023:

I would have preferred the less verbose code with normal C reads/writes, but if you care about using relaxed operations explicitly, those who do the work decide.

@gasche (Member) commented Nov 20, 2023:

There is a failure on the CI:

```
runtime/shared_heap.c:82:5: error: incompatible integer to pointer conversion initializing '_Atomic(pool *)' with an expression of type 'int' [-Werror,-Wint-conversion]
  { 0, },
    ^
runtime/shared_heap.c:83:5: error: incompatible integer to pointer conversion initializing '_Atomic(pool *)' with an expression of type 'int' [-Werror,-Wint-conversion]
  { 0, },
```
@fabbing (Contributor, Author) commented Nov 20, 2023:

It's nice to have Clang spot these issues. Sorry I didn't notice it before!

@sadiqj (Contributor) commented Nov 25, 2023:

Is this only waiting on a Changes entry?

@gasche (Member) commented Nov 25, 2023:

I guess it is.

@fabbing (Contributor, Author) commented Nov 27, 2023:

> I guess it is.

Changes updated accordingly, and thanks for the review!

@gasche gasche merged commit 8cfd9c5 into ocaml:trunk Nov 27, 2023
9 checks passed
@fabbing mentioned this pull request Dec 5, 2023.