Avoid false-sharing #11986
This isn't quite right: the threshold for large allocations is 129 words, which is 1032 bytes on 64-bit systems. In fact, I think the slab allocator should already be cache-aligned for allocations of the right size. Perhaps we should make Atomic.t a 7-word record, so that with the header it consumes 64 bytes on 64-bit systems?
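Concretely, such a layout might look like this (a hypothetical sketch, not how the stdlib defines `Atomic.t` today): one payload field plus six filler words gives 7 words, so header plus fields fill exactly one 64-byte cache line.

```ocaml
(* Hypothetical padded layout; the real Atomic.t is a single-field record.
   Six filler fields bring the record to 7 words, so with the 1-word header
   each value occupies one full 64-byte cache line on 64-bit systems. *)
type 'a padded_atomic = {
  mutable contents : 'a;                 (* the atomically-accessed field *)
  _f1 : int; _f2 : int; _f3 : int;
  _f4 : int; _f5 : int; _f6 : int;       (* padding to 7 words total *)
}
```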
The multicore allocator tries to allocate objects from domain-local pages. So you'll get false sharing if you allocate three objects sequentially and send the middle one to another domain while using the other two locally, but in general you should not get false sharing between objects allocated on one domain and objects allocated on another. In other words, to avoid false sharing it is good to take a copy of small objects on the receiving domain, but I don't think you necessarily need to pad.
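To illustrate the copy-on-receive advice (the message type and domain setup here are made up for the example), the receiving domain copies a small message into its own heap pages before working on it:

```ocaml
(* Illustrative only: the receiver copies a small message so the copy is
   allocated from the receiving domain's pages instead of sitting next to
   the sender's neighbouring allocations. *)
let process (msg : int array) =
  let local = Array.copy msg in          (* fresh allocation on this domain *)
  Array.iteri (fun i x -> local.(i) <- x * 2) local;
  local

let () =
  let msg = [| 1; 2; 3 |] in
  let d = Domain.spawn (fun () -> process msg) in
  ignore (Domain.join d)
```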
Rip me. Thanks for pointing that out.
I assume that "slabs" refers to pools in the heap code. As far as I can tell, the memory for pools is cache-aligned, but the allocations themselves are not. I tried the following:
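A probe along these lines (a rough sketch, not the exact code; the `Obj` address trick is unsafe and representation-dependent, good only for poking at the allocator) prints each value's offset within a 64-byte cache line:

```ocaml
(* Rough sketch of an alignment probe. Recovering an address via Obj relies
   on the tagged-value representation and is for debugging only. *)
let address_of (x : 'a) : int =
  assert (Obj.is_block (Obj.repr x));
  2 * (Obj.magic x : int)                (* untag: an even word is a pointer *)

let () =
  let atomics = List.init 8 (fun i -> Atomic.make i) in
  Gc.full_major ();                      (* promote into the shared-heap pools *)
  List.iter
    (fun a -> Printf.printf "offset in cache line: %d\n" (address_of a land 63))
    atomics
```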
Could be a red herring, but it coincides with POOL_HEADER_WSIZE = 4 (a 32-byte offset on 64-bit).
I assume that an 8-word pool will only ever hold 8-word chunks, even under sweeping or compaction. With pool allocations being aligned, this proposition sounds okay. All the assumptions here seem foundational, though; ideally, maybe we can wrap this into a separate call? That would make it easier for users, and easier for the compiler folks to evolve if needed. However, I'm not sure that making every Atomic.t 7 words is optimal. It's a better world than now, but I would not want to lose the current thin atomic. To add to the list, I can see myself using both (padded and thin atomics) even within the same data structure: for example, a queue with contention on the fetch-and-add-manipulated indices, but which also requires cells to be atomic for correctness.
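To make that last example concrete, here is a skeleton of such a queue (illustrative only; `make_padded` is the hypothetical padded constructor discussed in this issue, stubbed with `Atomic.make` so the sketch compiles):

```ocaml
(* Skeleton of a bounded queue mixing both kinds of atomics. *)
let make_padded v = Atomic.make v      (* stand-in for a padded allocation *)

type 'a queue = {
  head : int Atomic.t;                 (* contended via fetch_and_add: pad it *)
  tail : int Atomic.t;                 (* contended via fetch_and_add: pad it *)
  cells : 'a option Atomic.t array;    (* atomic for correctness only: keep thin *)
}

let create n = {
  head = make_padded 0;
  tail = make_padded 0;
  cells = Array.init n (fun _ -> Atomic.make None);
}

let try_push q x =
  let i = Atomic.fetch_and_add q.tail 1 in
  Atomic.compare_and_set q.cells.(i mod Array.length q.cells) None (Some x)
```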
cc @sadiqj, would you have the time to have a look at this issue? I am temporarily assigning you (with the semantics that assignees should try to shepherd discussions about issues, but not necessarily fix them), but please feel free to de-assign yourself (or, even better, to assign someone else).
I’m happy to shoot a PR following @stedolan’s suggestion in a week or two (on holidays now).
False sharing is pretty bad for the performance of lock-free algorithms. There are many known results; one classic (if somewhat old) paper makes the point well.
We've had some discussion on lockfree. There are probably at least two acceptable solutions; the rest of this issue describes both. Let me know what you think; I'm happy to open a PR with either.
Expose memory-aligned allocation
A somewhat principled C-style solution is to just call `posix_memalign`/`aligned_alloc` (C11) to get a block of memory that occupies entire cache line(s). That makes it impossible for the cache line to be falsely shared. It's also useful for doing math quickly (SSE, AVX).

I claim we can do this relatively easily by sending these allocations into `large_allocate` with a special path that uses `aligned_alloc` (rather than `malloc`). The current threshold for large allocations is 129 words, while the cache line is 64 bytes on most systems, so we're pretty close to using this infrastructure anyway. See this approach here.
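If the runtime did hand out cache-aligned blocks, the effect would be directly observable with the probe from earlier in this thread; a minimal check (reusing the same unsafe `Obj` address trick, illustrative only):

```ocaml
(* Minimal check: a block served by an aligned_alloc path would report
   offset 0 within its 64-byte cache line. Unsafe, for experimentation only. *)
let address_of (x : 'a) : int = 2 * (Obj.magic x : int)
let is_cache_aligned (x : 'a) : bool = address_of x land 63 = 0
```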
Add padding
A nice workaround is to just add padding: if a block spans multiple cache lines and only the middle is accessed, then alignment doesn't matter. @polytypic implemented it in multicore-magic. A disadvantage here is that we cannot pad the area before the atomic (or it wouldn't be a valid `Atomic.t`), so we can still get accidental false sharing if some other atomics in the program have not been padded. The positive part is that it's super simple.

If that's the blessed way forward, I would propose adding this to `Obj` and/or creating `Atomic.make_padded`, as sketched below.
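A sketch of the padding trick, in the spirit of multicore-magic's `copy_as_padded` but not its actual source: copy a block into a fresh one whose total size, header included, rounds up to a whole number of cache lines. Callers must use the returned copy, since it is a new value.

```ocaml
(* Pad-by-copying sketch. cache_line_words assumes 64-byte lines and
   8-byte words; the extra fields are left as unit values. *)
let cache_line_words = 8

let copy_as_padded (x : 'a) : 'a =
  let r = Obj.repr x in
  if not (Obj.is_block r) then x
  else begin
    let sz = Obj.size r in
    (* smallest field count such that header + fields is a whole number of lines *)
    let fields =
      (sz + cache_line_words) / cache_line_words * cache_line_words - 1
    in
    let fresh = Obj.new_block (Obj.tag r) fields in
    for i = 0 to sz - 1 do
      Obj.set_field fresh i (Obj.field r i)
    done;
    Obj.obj fresh
  end

let padded_counter : int Atomic.t = copy_as_padded (Atomic.make 0)
```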
.Other ideas
@sadiqj mentioned making `Atomic.t` unboxed, to let us pad records the way one does in Java or Go. It sounds reasonable, but I do not have the expertise to create a POC.

cc @kayceesrk