Scalable memory management and solving the cactus stack problem #24
Merged
Commits
- Rename to WV_CacheLinePadding. Unfortunately nim-lang/Nim#12722 broke Nim devel in the past 3 days; commit nim-lang/Nim@1b2c1bc is good. Also, C proc signatures changed to csize_t (nim-lang/Nim#12497).
- …s with the new one
- …orm the other one into a state machine to prepare for batching
- … livelock/cache contention on the other channel
  - terminate the batch with a nil
  - the iterator doesn't fail when the cursor is deleted underneath it
  - reorder the append function to prevent a GCC codegen bug
- …psc channels for intrusive lists
  - 20% faster on eager futures and the same speed on lazy ones (much more stable, actually)
  - VERY sensitive to arena size: 32 kB and 64 kB arenas are VERY slow (50% slower than baseline on eager futures), probably due to the 64k aliasing problem
  - can release memory on long-running processes
- … channel + implement batched send
- …r) on fib(40). Still 3x-5x better than TBB, but we will need to regain those sub-200 figures somewhere else.
- …nel documentation
This PR provides an innovative solution to memory management in a multithreading runtime.
It's also an alternative approach to the cactus stack problem, which has been plaguing runtimes for about 20 years (since Cilk).
History
Notoriously, both Go and Rust had trouble handling cactus/segmented stacks and resorted to OS primitives instead.
Cilk's handling of cactus stacks gives Cilk functions a different calling convention, making them unusable from regular C.
And in research:
The changes
So this PR introduces two key memory-management data structures: a memory pool and lookaside buffers, both described below.
Before this PR, the runtime could already handle the following load:
How does that work
There is actually one memory pool per thread.
The memory pool services aligned, fixed-size memory blocks of 256 bytes for tasks (192 bytes) and flowvars (128 or 256 bytes due to cache-line padding of contended fields).
It is much faster than the system allocator for single-threaded workloads and somewhat faster when memory is allocated in one thread and released from another.
The memory pool manages arenas of 16 kB; this magic number may need more benchmarking.
Similar to Mimalloc's extreme free-list sharding, free lists are managed not at the thread-local allocator level but at the arena level: each arena keeps track of its own free lists, one for usable free blocks, one for deferred frees, and one for blocks freed by remote threads.
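As a rough illustration of this layout, here is a minimal sketch of an arena with sharded free lists; all type and field names are assumptions for illustration, not the PR's actual code:

```nim
# A sketch only: type and field names are assumptions, not the PR's code.
import std/atomics

const
  ArenaSize = 16 * 1024    # the PR's 16 kB magic number
  BlockSize = 256          # aligned fixed-size blocks for tasks/flowvars

type
  MemBlock = object
    next: ptr MemBlock     # a free block stores its free-list link in-place

  Arena = object
    ## Mimalloc-style "extreme free-list sharding": the free lists live in
    ## the arena, not in the thread-local allocator.
    free: ptr MemBlock               # usable free blocks (owner thread)
    deferredFree: ptr MemBlock       # local frees collected lazily
    remoteFree: Atomic[ptr MemBlock] # blocks freed by other threads
    used: int32                      # live blocks in this arena
    owner: int32                     # owning thread ID

proc allocBlock(a: var Arena): ptr MemBlock =
  ## Fast path: pop the owner's free list.
  result = a.free
  if result != nil:
    a.free = result.next
    inc a.used
  # Slow path (not shown): drain deferredFree/remoteFree,
  # or request a fresh arena from the pool.
```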
The memory pool generally behaves like a LIFO allocator, as child tasks are resolved before their parents; it's basically a thread-local cactus stack. Unlike TBB there is no depth limitation, so it maintains the busy-leaves property of work stealing. Memory growth is theoretically unbounded (unlike work-first / parent-stealing runtimes like Cilk), but unlike previous research we are heap-allocated.
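For intuition, this is the fork-join shape that produces the LIFO behavior, assuming the project's spawn/sync style of API (exact entry points at the time of this PR may differ):

```nim
# Illustrative fork-join shape (assumes the runtime's spawn/sync API is in
# scope). Each spawned child is synced before its parent frame returns, so
# task and flowvar memory naturally behaves like a LIFO / cactus stack.
proc fib(n: int): int =
  if n < 2:
    return n
  let x = spawn fib(n - 1)  # child task, returns a flowvar/future
  let y = fib(n - 2)
  result = sync(x) + y      # child resolved before the parent completes
```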
The memory pool is sufficient to deal with futures/flowvars since, in general, they are awaited/synced in the context that spawned the child. This scheme, plus a specialized SPSC channel for flowvars, accounted for a 10% overhead reduction compared to the legacy channels with hardcoded cache, on eager-flowvar Fibonacci(40) (i.e. all channels allocated on the heap as opposed to using alloca when possible).
If we allow futures to escape their context (i.e. returning a future), the memory pool should also provide good default performance.
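A hedged sketch of what a single-item SPSC channel for a flowvar can look like; names and memory orderings are illustrative, not the PR's implementation:

```nim
# Sketch of a single-slot SPSC channel in the spirit of the specialized
# flowvar channel; names and orderings are assumptions.
import std/atomics

type
  SpscSingle[T] = object
    ## One producer (the child task), one consumer (the awaiting parent),
    ## one slot: enough for a flowvar.
    full: Atomic[bool]
    value: T

proc trySend[T](chan: var SpscSingle[T], msg: T): bool =
  ## Producer side: publish the child's result exactly once.
  if chan.full.load(moAcquire):
    return false
  chan.value = msg
  chan.full.store(true, moRelease)  # value must be visible before `full`
  true

proc tryRecv[T](chan: var SpscSingle[T], msg: var T): bool =
  ## Consumer side: poll until the child task has delivered.
  if not chan.full.load(moAcquire):
    return false
  msg = chan.value
  true
```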
Unfortunately, this didn't work out very well for tasks: both Fibonacci(40) variants spiked from less than 200 ms (lazy futures) / 400 ms (eager futures) to 2 s / 2.3 s, compared with just storing tasks in an always-growing, never-releasing stack. Why? Because it stressed the memory pool's "slow" multithreaded path: the tasks are so small that they were regularly stolen and then released in another thread.
So a new solution was devised: lookaside buffers (lookaside lists).
They extend the original intrusive stack by supporting task eviction, which helps two use cases; see the sketch below.
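Here is an illustrative sketch of such an intrusive lookaside list; the names and the request counter are assumptions, not the PR's code:

```nim
# Illustrative sketch of a lookaside list: an intrusive LIFO cache of freed
# tasks with support for eviction. Names are assumptions.
type
  Task = object
    next: ptr Task   # intrusive link, reused while the task sits in the cache
    # ... task payload (function pointer, arguments, ...) elided

  LookasideList = object
    top: ptr Task    # LIFO stack of recyclable tasks
    count: int       # current buffer size
    recentAsks: int  # task requests since the last heartbeat

proc push(lal: var LookasideList, t: ptr Task) =
  ## Cache a freed task instead of returning it to the memory pool.
  t.next = lal.top
  lal.top = t
  inc lal.count

proc pop(lal: var LookasideList): ptr Task =
  ## Serve a task allocation from the cache; nil means "ask the memory pool".
  result = lal.top
  if result != nil:
    lal.top = result.next
    dec lal.count
  inc lal.recentAsks
```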
Now the main question is: when do we evict tasks? We want to amortize cost, and we want a solution adaptive to the recent workload. Also, the less metadata/counters we need to increment/decrement on the fast path the better, as recursive workloads like tree search may spawn billions of tasks at once, which may mean billions of increments.
For task eviction, the lookaside buffer hooks into the memory pool heartbeat: when it is time to do amortized, expensive memory maintenance, the memory pool fires a callback that also triggers task eviction in the lookaside buffer, depending on the current buffer size and the recent requests.
Note that it is important for the heartbeat to be triggered on memory allocations: task evictions deallocate, and a deallocation-driven heartbeat would otherwise lead to an avalanche effect.
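Continuing the illustrative sketch above, the heartbeat hook could look roughly like this; the eviction heuristic and the callback field are assumptions, not the PR's interface:

```nim
# Hypothetical heartbeat hook, continuing the LookasideList sketch above.
# The memory pool would call this during its amortized maintenance.
proc evictExcess(lal: var LookasideList) =
  ## Adaptive eviction: keep roughly as many cached tasks as the recent
  ## workload requested, return the rest to the memory pool.
  let target = lal.recentAsks   # snapshot: eviction pops are not "asks"
  while lal.count > target:
    let t = lal.top             # unlink directly, bypassing the ask counter
    lal.top = t.next
    dec lal.count
    discard t                   # recycle(t) would free the block here
  lal.recentAsks = 0

# Registration at thread startup (hypothetical callback field):
# myPool.onHeartbeat = proc () = evictExcess(myLookaside)
```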
How does it perform
Very well on the speed side; actually, the only change that had a noticeable impact (7%) on performance was properly zeroing the task data structure.
Further comparison against the original implementation on long-running producer-consumer workloads is still needed, on both the CPU and the memory-consumption fronts.
What's next?
`recycle`, which frees a memory block back to the memory pool, requires a threadID argument. It could instead use pthread_self, assembly, or the Windows API to get a thread ID, as this requirement currently leaks into the rest of the codebase (see the sketch below).
Given the already very good performance and low overhead of the memory subsystem, this is of lower priority.
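A possible direction, sketched under the assumption that an integer thread ID is acceptable (pthread_t is technically opaque); the bindings and the `getThreadID` helper are illustrative, not part of the codebase:

```nim
# Sketch: derive the thread ID from the OS so `recycle` no longer needs an
# explicit threadID parameter.
when defined(windows):
  proc getCurrentThreadId(): culong
    {.importc: "GetCurrentThreadId", stdcall, header: "<windows.h>".}
  proc getThreadID(): int = int(getCurrentThreadId())
else:
  # POSIX: pthread_t is opaque; treating it as an integer is a simplification.
  proc pthread_self(): culong {.importc, header: "<pthread.h>".}
  proc getThreadID(): int = int(pthread_self())
```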