
Task stealing with CL deques #29

Merged (17 commits, Jun 8, 2021)

Conversation

@ctk21 (Contributor) commented on May 19, 2021:

This PR is an implementation of Chase-Lev deques organised for task stealing, as is traditional in Cilk, concurrent bags, etc.

The basic idea (a rough sketch in code follows this list) is:

  • each domain in the task pool owns a CL deque; it can push without any synchronisation, and pop only needs to synchronise with other domains when there is a single element left in the deque.
  • if a domain has no tasks, it steals tasks from randomly chosen deques belonging to other domains.
  • there is a mechanism for domains to block when they discover there is no work in any deque, by posting a message to a 'waiter channel'. Importantly, when all domains are awake the only extra overhead is a check that the 'waiter channel' is empty.
  • by setting recv_block_spins, the implementation can also be tweaked to operate in a 'non-blocking' mode that spins rather than blocking in the operating system.
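To make the shape concrete, here is a minimal sketch of the scheme under assumed, illustrative names (WS_DEQUE, Pool and get_task are not the actual domainslib API): each domain pushes and pops at its own end of its own deque, and steals from a randomly chosen victim when it has no local work.

```ocaml
(* Hypothetical interface for a per-domain work-stealing deque. *)
module type WS_DEQUE = sig
  type 'a t
  val create : unit -> 'a t
  val push   : 'a t -> 'a -> unit     (* owner end: no synchronisation needed   *)
  val pop    : 'a t -> 'a option      (* owner end: may race when one element   *)
  val steal  : 'a t -> 'a option      (* thief end: synchronises with the owner *)
end

module Pool (D : WS_DEQUE) = struct
  type 'a t = { deques : 'a D.t array }

  let create num_domains =
    { deques = Array.init num_domains (fun _ -> D.create ()) }

  (* Submitting work only ever touches the caller's own deque. *)
  let push pool id task = D.push pool.deques.(id) task

  (* Take local work first; otherwise steal from a randomly chosen victim.
     Sketch assumes more than one domain; the real implementation also posts
     to a 'waiter channel' and blocks instead of spinning forever. *)
  let rec get_task pool id =
    match D.pop pool.deques.(id) with
    | Some task -> task
    | None ->
      let victim = Random.int (Array.length pool.deques) in
      if victim = id then get_task pool id
      else (match D.steal pool.deques.(victim) with
            | Some task -> task
            | None -> get_task pool id)
end
```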

Early benchmarking results (detuned Zen2 with sandmark fb2a38) are encouraging:
[figures: 20210519_big_deque_time, 20210519_big_deque_speedup]

I wouldn't say the implementation is fully finished:

  • I feel we can lean on the OCaml multicore memory model more in the deque rather than wrapping every operation in an Atomic indirection.
  • It might be worth making the implementation fully garbage-free. We could replace the Some/None return from the deque with exceptions.
  • While the benchmark results are good for many cases (LU_decomposition and game_of_life stand out), it would be good to understand the matrix_multiplication regression and take a look at binarytrees5, where the deques might be a bit slower.

The CL deque implementation was built on the queue in lockfree, but I would recommend looking at the C code in 'Correct and efficient work-stealing for weak memory models'.

@ctk21 force-pushed the ctk21/work_stealing_deque_experiment branch from ed28306 to 9f1912f on May 24, 2021 08:04
@ctk21 (Contributor, Author) commented on May 24, 2021:

(rebased to pick up #30)

@ctk21 (Contributor, Author) commented on May 26, 2021:

I had a think about formalizing how multi_channel turns a non-blocking structure into a blocking one. If Thread A is enqueuing data and Thread B is dequeuing, then:

Thread A:
  A1 - enqueue data (non-blocking)
  A2 - check the waiter queue and signal any blocking waiter

Thread B:
  B1 - poll for data (non-blocking)
  B2 - enqueue ourselves on the waiter queue
  B3 - poll for data again (non-blocking) and block with the waiter structure

Each of A1, A2, B1, B2, B3 is ordered, in the sense that for any pair the completions satisfy either Ax < By or By < Ax.
I think we can then construct the following argument to show that Thread B will always pick up Thread A's enqueue:

if A1 < B1 then B will pick up the data at B1
else (B1 < A1):
   if B2 < A1 then A will signal B at A2 and the data will be collected by B
   else (A1 < B2): B3 will execute after A1 and so B will pick up the data
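To illustrate the pattern (this is not the actual multi_channel code), here is a minimal sketch of A1/A2 and B1/B2/B3 using a plain Mutex/Condition pair per waiter, with stdlib Queue values standing in for the real lock-free structures; it is only intended to be correct in the single-producer/single-consumer scenario above.

```ocaml
(* Illustrative sketch only: Queue.t is not thread-safe; the real multi_channel
   uses lock-free structures. Steps are labelled to match A1/A2 and B1/B2/B3. *)
type waiter = {
  mutex : Mutex.t;
  cond : Condition.t;
  mutable signalled : bool;
}

(* Thread A *)
let send data waiters v =
  Queue.push v data;                        (* A1: non-blocking enqueue *)
  match Queue.pop waiters with              (* A2: signal a blocked waiter, if any *)
  | w ->
      Mutex.lock w.mutex;
      w.signalled <- true;
      Condition.signal w.cond;
      Mutex.unlock w.mutex
  | exception Queue.Empty -> ()

(* Thread B *)
let recv data waiters =
  match Queue.pop data with                 (* B1: poll *)
  | v -> v
  | exception Queue.Empty ->
      let w = { mutex = Mutex.create ();
                cond = Condition.create ();
                signalled = false } in
      Queue.push w waiters;                 (* B2: register as a waiter *)
      (match Queue.pop data with            (* B3: re-poll before blocking *)
       | v -> v
       | exception Queue.Empty ->
           Mutex.lock w.mutex;
           while not w.signalled do
             Condition.wait w.cond w.mutex
           done;
           Mutex.unlock w.mutex;
           (* under SPSC, A's element is still present after the signal *)
           Queue.pop data)
```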

@ctk21 (Contributor, Author) commented on May 26, 2021:

I've implemented two improvements:

  • we no longer do an Atomic.set on the elements of the backing array in the deque
  • I've made the ws_deque garbage-free

The big benchmarks now look like this (Zen2, not isolated, sandmark ed00c5; baseline is bc7c44, this PR is e69db8):
[figures: 20210526_sherwood_big_domainslib_pr29_time, 20210526_sherwood_big_domainslib_pr29_speedup]

Broad brush: if you want to scale, task stealing is the way to go.

The benchmarks for the standard sizes show:
[figures: 20210526_sherwood_pr29_time, 20210526_sherwood_pr29_speedup]

There is a single regression, in our evolutionary_algorithm case. I spent some time with this one; there are two things to note:

  • this benchmark does not scale as it has lots of serial sections; the task pool repeatedly flips between all domains running and all domains blocked.
  • this benchmark has safepoint issues; it makes use of iterations and folds over stdlib functions, which in the current multicore can leave big gaps between safepoints.

I think we could do something about the first: we could throttle how we wake up blocked waiters and also limit the number of stealing domains. I'm in two minds about whether we should pursue that in this PR or as follow-ups.

@ctk21 marked this pull request as ready for review on May 26, 2021 15:28
@ctk21 (Contributor, Author) commented on May 27, 2021:

So I did a bit more investigation on the evolutionary_algorithm results and found a memory retention bug. Tasks in the deque arrays will eventually be overwritten, but until a slot is overwritten all the memory held by the task closure in that slot is kept live.

I have implemented a fix where the deque array elements hold a reference to the item; this reference is clobbered when a task is taken. You can't clear the array slot itself in steal because the producer may reuse that slot.
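A rough sketch of the shape of that fix (assumed names, not the exact PR code): each array slot stores a small mutable cell wrapping the task, and the taker clears the cell rather than the array slot, so the closure becomes collectable even though the cell lingers in the array until the owner reuses the slot.

```ocaml
(* Sketch under assumptions: 'cell' is the per-slot indirection described above.
   Only the unique taker of a task calls release; the array slot keeps pointing
   at the (now empty) cell until the owner overwrites the slot with a new push. *)
type 'a cell = 'a option ref

let wrap (task : 'a) : 'a cell = ref (Some task)

let release (cell : 'a cell) : 'a =
  match !cell with
  | Some task ->
      cell := None;      (* clobber the reference so the closure can be GC'd *)
      task
  | None -> assert false (* a task is only ever taken once *)
```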

A rerun of the big benchmarks got me:
[figures: 20210527_ws_deque_gc_fix_big_time, 20210527_ws_deque_gc_fix_big_speedup]

A rerun of the standard size benchmarks got me:
[figures: 20210527_ws_deque_gc_fix_time, 20210527_ws_deque_gc_fix_speedup]

There remains something a bit odd about evolutionary algorithm that I don't fully understand - it looks like we do differing amounts of GC work with the old vs new code. The deques are causing about 20% more minor GCs, which is odd as I can't see significant extra allocations in the domainslib structures themselves:
[figure: 20210527_ws_deque_ea_mgc]

@kayceesrk self-requested a review on June 4, 2021 15:37
@ctk21 requested a review from Sudha247 on June 4, 2021 15:38
@kayceesrk (Contributor) commented:

> I feel we can lean on the OCaml multicore memory model more in the deque rather than wrapping every operation in an Atomic indirection.

Just to connect various folks who might be interested in this, @fpottier and his students were looking at instances where Multicore OCaml explicitly takes advantage of data racy behaviours for improving performance. This looks like one. But @ctk21 mentioned that we need acquire release atomics which the OCaml memory model does not support (yet). @stedolan mentioned that while acquire load fits nicely into the model, store-release does not.

@kayceesrk (Contributor) left a comment:

This PR is very intricate :-). I've only managed to go through ws_deque.ml, and even that not entirely. I'll continue with the rest of the PR next week.

Inline review comment on lib/ws_deque.ml:

    let create () = {
      top = Atomic.make 1;

Why are these 1 and not 0?

@kayceesrk self-requested a review on June 5, 2021 07:37
@kayceesrk (Contributor) commented:

Can you merge with the master again? The diff includes changes already included in master, and it makes it hard to read. See https://github.com/ocaml-multicore/domainslib/pull/29/files#diff-1a51aea47d0ef65c5073e293d19298451614febc83360af128ded914dbb6599bL39

@ocaml-multicore deleted a comment from kayceesrk on Jun 7, 2021
@ctk21 force-pushed the ctk21/work_stealing_deque_experiment branch from f5d95d2 to 2d61ed8 on June 7, 2021 08:39
@ctk21 (Contributor, Author) commented on Jun 7, 2021:

> Can you merge with the master again? The diff includes changes already included in master, and it makes it hard to read. See https://github.com/ocaml-multicore/domainslib/pull/29/files#diff-1a51aea47d0ef65c5073e293d19298451614febc83360af128ded914dbb6599bL39

Rebased to master and pushed!

@ctk21 (Contributor, Author) commented on Jun 7, 2021:

> I feel we can lean on the OCaml multicore memory model more in the deque rather than wrapping every operation in an Atomic indirection.
>
> Just to connect various folks who might be interested in this, @fpottier and his students were looking at instances where Multicore OCaml explicitly takes advantage of data racy behaviours for improving performance. This looks like one. But @ctk21 mentioned that we need acquire release atomics which the OCaml memory model does not support (yet). @stedolan mentioned that while acquire load fits nicely into the model, store-release does not.

To expand a bit: we can implement the CL deque using the Atomic module and the existing OCaml multicore memory model. However, if you look at Figure 1 in 'Correct and efficient work-stealing for weak memory models', we aren't able to express the acquire-release components of that implementation in OCaml multicore today.

In other parallel data structures, for example an SPSC ring buffer, using acquire-release can be important to get a really low-overhead implementation. However, there is a problem to solve: whether the OCaml multicore memory model can be extended to accommodate acquire-release atomics.
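For concreteness, here is a rough sketch (not the PR's ws_deque.ml) of a steal written against today's Atomic module; the comments mark where, in my reading of the paper's Figure 1, the C implementation uses weaker acquire/consume/relaxed orderings that the OCaml memory model cannot currently express.

```ocaml
(* Sketch only: the real ws_deque.ml differs (indexing, resizing, raising on
   empty). Every Atomic operation below is sequentially consistent in OCaml,
   which is correct but stronger than the orderings the C version needs. *)
type 'a t = {
  top    : int Atomic.t;
  bottom : int Atomic.t;
  tab    : 'a option array Atomic.t;   (* backing array, swapped on resize *)
}

let steal q =
  let t = Atomic.get q.top in          (* C: acquire load of top *)
  let b = Atomic.get q.bottom in       (* C: acquire load of bottom *)
  if t < b then begin
    let a = Atomic.get q.tab in        (* C: consume load of the array pointer *)
    let x = a.(t mod Array.length a) in          (* C: relaxed load of the slot *)
    (* C: compare-exchange on top; here None conflates "empty" with "lost a
       race against another thief or the owner". *)
    if Atomic.compare_and_set q.top t (t + 1) then x else None
  end else
    None
```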

@Sudha247 (Contributor) left a comment:

This is great work and the results are very exciting! The overall structure is convincing to me.

Just had a couple of minor questions.
