Avoid false-sharing #11986

Open
bartoszmodelski opened this issue Feb 1, 2023 · 5 comments

@bartoszmodelski (Contributor) commented Feb 1, 2023

False-sharing is pretty bad for the performance of lock-free algorithms. There are many known results; this classic paper is a little old, but for example:

  • An 80% reduction in wall-clock time on linear regression is claimed by Intel's cookbook.
  • In the case of multicore OCaml, @anmolsahoo25 found a 23% reduction in wall-clock time on a benchmark with just two threads.

We've had some discussion on lockfree. There are probably at least two acceptable solutions; the rest of this issue describes both. Let me know what you think; I'm happy to open a PR with either.

Expose memory-aligned allocation

A somewhat principled C-style solution is to just call posix_memalign / aligned_alloc (C11) to get a block of memory that occupies entire cache line(s), making it impossible for those lines to be falsely shared. Aligned blocks are also useful for doing math quickly (SSE, AVX).

I claim we can do this relatively easily by sending these allocations into large_allocate with a special path that uses aligned_alloc (rather than malloc). Current threshold for large alloc is 129, while cache line is 64 on most systems, so we're pretty close to using this infra anyway.

See this approach here.
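
For illustration only, here is a minimal C sketch of the kind of allocation such a special path could perform (not the actual runtime code; the helper name is made up). aligned_alloc requires the size to be a multiple of the alignment, so a block that owns its cache lines outright is both 64-byte aligned and a multiple of 64 bytes long:

#include <stdlib.h>

#define CACHE_LINE 64

/* Sketch: allocate a block that occupies entire cache line(s), so no other
   object can share a line with it.  aligned_alloc (C11) requires the size
   to be a multiple of the alignment. */
static void *alloc_cache_line_exclusive(size_t size)
{
    size_t rounded = (size + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
    return aligned_alloc(CACHE_LINE, rounded);
}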

Add padding

A nice workaround is to just add padding. If a block spans multiple cache lines and only the middle is accessed, then alignment doesn't matter. @polytypic implemented it in multicore-magic. A disadvantage is that we cannot pad the area before the atomic (or it wouldn't be a valid Atomic.t), so we can still get accidental false-sharing if some other atomic in the program has not been padded. The positive part is that it's super simple.

If that's the blessed way forward, I would propose adding it to Obj and/or creating Atomic.make_padded.
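
As a rough C analogy of the padding idea (this is not the multicore-magic implementation, just an illustration), the hot field is followed by enough bytes to fill out its cache line; mirroring the OCaml limitation above, nothing is added before the field, so a neighbouring unpadded object can still share its first line:

#include <stdatomic.h>

#define CACHE_LINE 64

/* Padding only after the contended field: the next object cannot share
   this field's cache line, but an unpadded object placed just before
   this one still can, which is the residual risk mentioned above. */
struct padded_counter {
    _Atomic long value;
    char pad[CACHE_LINE - sizeof(_Atomic long)];
};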

Other ideas

@sadiqj mentioned making Atomic.t unboxed to let us pad records like in Java or Golang. It sounds reasonable but I do not have the expertise to create a POC.

cc @kayceesrk

@stedolan (Contributor) commented Feb 1, 2023

> Current threshold for large alloc is 129, while cache line is 64 on most systems, so we're pretty close to using this infra anyway.

This isn't quite right: the threshold for large allocations is 129 words, which is 1032 bytes on 64-bit systems.

In fact, I think the slab allocator should already be cache-aligned for allocations of the right size. Perhaps we should make Atomic.t be a 7-word record, so that with the header it consumes 64 bytes on 64-bit systems?
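
For reference, the arithmetic behind that suggestion: on a 64-bit system a word is 8 bytes, so a 7-word record plus its 1-word header occupies (7 + 1) × 8 = 64 bytes, exactly one cache line on the common 64-byte-line machines mentioned above.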

@stedolan (Contributor) commented Feb 2, 2023

The multicore allocator tries to allocate objects from domain-local pages. So you'll get false sharing if you allocate three objects sequentially and send the middle one to another domain while using the other two locally, but in general you should not get false sharing between objects allocated on one domain and objects allocated on another.

In other words, to avoid false sharing it is good to take a copy of small objects on the receiving domain, but I don't think you necessarily need to pad.

@bartoszmodelski (Contributor, Author) commented Feb 2, 2023

> This isn't quite right: the threshold for large allocations is 129 words, which is 1032 bytes on 64-bit systems.

Rip me. Thanks for pointing that out.

> In fact, I think the slab allocator should already be cache-aligned for allocations of the right size.

I assume that "slabs" refers to the pools in the heap code. As far as I can tell the memory for pools is cache-aligned, but the allocations themselves are not. I tried the following:

value var = caml_alloc_shr(7, 0);
/* var - 8 is the block's header address; it lands 32 bytes past a 64-byte
   boundary, i.e. the allocation is not cache-line-aligned */
assert((var - 8) % 64 == 32); // succeeds

Could be a red herring, but it coincides with POOL_HEADER_WSIZE = 4.

> Perhaps we should make Atomic.t be a 7-word record, so that with the header it consumes 64 bytes on 64-bit systems?

I assume that an 8-word pool will only keep 8-word chunks, even under sweeping or compaction. With pool allocations being aligned, this proposition sounds okay. All the assumptions here seem foundational. Ideally, maybe we could wrap this into a separate call? That would make it easier for users, and easier for the compiler folks to evolve if needed.

However, I'm not sure making every Atomic.t 7 words is optimal. It's a better world than now, but I would not want to lose the current thin atomic. To add to the list, I can see myself using both (padded and thin atomics) even within the same data structure: for example, a queue with contention on FAD-manipulated indices, but which also requires its cells to be atomic for correctness.
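
To make that last example concrete, here is a hypothetical C-style layout (names and sizes invented for illustration): the FAD-contended indices each get a cache line to themselves, while the per-cell atomics stay thin, since padding every cell would multiply the queue's memory footprint:

#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64
#define CAPACITY   1024

/* Hypothetical queue layout mixing padded and thin atomics: head and tail
   are hit by fetch-and-add from many threads, so each sits on its own
   cache line; the cells only need atomicity, not isolation. */
struct ring_queue {
    _Atomic size_t head;
    char pad_head[CACHE_LINE - sizeof(_Atomic size_t)];
    _Atomic size_t tail;
    char pad_tail[CACHE_LINE - sizeof(_Atomic size_t)];
    _Atomic(void *) cells[CAPACITY];
};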

@Octachron (Member) commented

cc @sadiqj, would you have the time to have a look at this issue? I am temporarily assigning you (with the semantics that assignees should try to shepherd discussion about issues but not necessarily fix them), but please feel free to de-assign yourself (or, even better, to assign someone else).

@bartoszmodelski (Contributor, Author) commented Mar 24, 2023 via email
