Avoid false-sharing #11986

Open
bartoszmodelski opened this issue Feb 1, 2023 · 5 comments

@bartoszmodelski (Contributor) commented Feb 1, 2023

False-sharing is pretty bad for the performance of lock-free algorithms. There are many known results; this classic paper is a little old, but for example:

  • An 80% reduction in wall-clock time on linear regression is claimed by Intel's cookbook.
  • In the case of multicore OCaml, @anmolsahoo25 found a 23% reduction in wall-clock time on a benchmark with just two threads.

We've had some discussion on lockfree. There are probably at least two acceptable solutions; the rest of this issue describes both. Let me know what you think; I'm happy to open a PR with either.

Expose memory-aligned allocation

A somewhat principled C-style solution is to just call posix_memalign / aligned_alloc (C11) to get a block of memory that occupies entire cache line(s), making it impossible for those lines to be falsely shared. Aligned blocks are also useful for doing math quickly (SSE, AVX).

I claim we can do this relatively easily by sending these allocations into large_allocate with a special path that uses aligned_alloc (rather than malloc). Current threshold for large alloc is 129, while cache line is 64 on most systems, so we're pretty close to using this infra anyway.

See this approach here.
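
For illustration only, here is a minimal C sketch of the kind of allocation such a special path could perform (not the actual runtime code; the helper name is made up). aligned_alloc requires the size to be a multiple of the alignment, so a block that owns its cache lines outright is both 64-byte aligned and a multiple of 64 bytes long:

#include <stdlib.h>

#define CACHE_LINE 64

/* Sketch: allocate a block that occupies entire cache line(s), so no other
   object can share a line with it.  aligned_alloc (C11) requires the size
   to be a multiple of the alignment. */
static void *alloc_cache_line_exclusive(size_t size)
{
    size_t rounded = (size + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
    return aligned_alloc(CACHE_LINE, rounded);
}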

Add padding

A nice workaround is to just add padding. If a block spans multiple cache lines and only the middle is accessed, then alignment doesn't matter. @polytypic implemented it in multicore-magic. A disadvantage is that we cannot pad the area before the atomic (or it wouldn't be a valid Atomic.t), so we can still get accidental false-sharing if some other atomic in the program has not been padded. The positive part is that it's super simple.

If that's the blessed way forward, I would propose adding it to Obj and/or creating Atomic.make_padded.
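
As a rough C analogy of the padding idea (this is not the multicore-magic implementation, just an illustration), the hot field is followed by enough bytes to fill out its cache line; mirroring the OCaml limitation above, nothing is added before the field, so a neighbouring unpadded object can still share its first line:

#include <stdatomic.h>

#define CACHE_LINE 64

/* Padding only after the contended field: the next object cannot share
   this field's cache line, but an unpadded object placed just before
   this one still can, which is the residual risk mentioned above. */
struct padded_counter {
    _Atomic long value;
    char pad[CACHE_LINE - sizeof(_Atomic long)];
};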

Other ideas

@sadiqj mentioned making Atomic.t unboxed to let us pad records like in Java or Golang. It sounds reasonable but I do not have the expertise to create a POC.

cc @kayceesrk

@stedolan (Contributor) commented Feb 1, 2023

> Current threshold for large alloc is 129, while cache line is 64 on most systems, so we're pretty close to using this infra anyway.

This isn't quite right: the threshold for large allocations is 129 words, which is 1032 bytes on 64-bit systems.

In fact, I think the slab allocator should already be cache-aligned for allocations of the right size. Perhaps we should make Atomic.t be a 7-word record, so that with the header it consumes 64 bytes on 64-bit systems?
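
For reference, the arithmetic behind that suggestion: on a 64-bit system a word is 8 bytes, so a 7-word record plus its 1-word header occupies (7 + 1) × 8 = 64 bytes, exactly one cache line on the common 64-byte-line machines mentioned above.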

@stedolan (Contributor) commented Feb 2, 2023

The multicore allocator tries to allocate objects from domain-local pages. So you'll get false sharing if you allocate three objects sequentially and send the middle one to another domain while using the other two locally, but in general you should not get false sharing between objects allocated on one domain and objects allocated on another.

In other words, to avoid false sharing it is good to take a copy of small objects on the receiving domain, but I don't think you necessarily need to pad.

@bartoszmodelski (Contributor, Author) commented Feb 2, 2023

> This isn't quite right: the threshold for large allocations is 129 words, which is 1032 bytes on 64-bit systems.

Rip me. Thanks for pointing that out.

> In fact, I think the slab allocator should already be cache-aligned for allocations of the right size.

I assume that "slabs" refers to the pools in the heap code. As far as I can tell the memory for pools is cache-aligned, but the allocations themselves are not. I tried the following:

value var = caml_alloc_shr(7, 0);
/* var - 8 is the block's header address; it lands 32 bytes past a 64-byte
   boundary, i.e. the allocation is not cache-line-aligned */
assert((var - 8) % 64 == 32); // succeeds

Could be a red herring, but it coincides with POOL_HEADER_WSIZE = 4.

> Perhaps we should make Atomic.t be a 7-word record, so that with the header it consumes 64 bytes on 64-bit systems?

I assume that an 8-word pool will only keep 8-word chunks, even under sweeping or compaction. With pool allocations being aligned, this proposition sounds okay. All the assumptions here seem foundational. Ideally, maybe we could wrap this into a separate call? That would make it easier for users, and easier for the compiler folks to evolve if needed.

However, I'm not sure making every Atomic.t 7 words is optimal. It's a better world than now, but I would not want to lose the current thin atomic. To add to the list, I can see myself using both (padded and thin atomics) even within the same data structure: for example, a queue with contention on FAD-manipulated indices, but which also requires its cells to be atomic for correctness.
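
To make that last example concrete, here is a hypothetical C-style layout (names and sizes invented for illustration): the FAD-contended indices each get a cache line to themselves, while the per-cell atomics stay thin, since padding every cell would multiply the queue's memory footprint:

#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64
#define CAPACITY   1024

/* Hypothetical queue layout mixing padded and thin atomics: head and tail
   are hit by fetch-and-add from many threads, so each sits on its own
   cache line; the cells only need atomicity, not isolation. */
struct ring_queue {
    _Atomic size_t head;
    char pad_head[CACHE_LINE - sizeof(_Atomic size_t)];
    _Atomic size_t tail;
    char pad_tail[CACHE_LINE - sizeof(_Atomic size_t)];
    _Atomic(void *) cells[CAPACITY];
};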

@Octachron (Member) commented

cc @sadiqj, would you have the time to have a look at this issue? I am temporarily assigning you (with the semantics that assignees should try to shepherd discussion about issues but not necessarily fix them), but please feel free to de-assign yourself (or, even better, to assign someone else).

@bartoszmodelski (Contributor, Author) commented Mar 24, 2023 via email
