
Task stealing with CL deques #29

Merged (17 commits, Jun 8, 2021)

Conversation

@ctk21 (Contributor) commented on May 19, 2021:

This PR is an implementation of Chase-Lev deques organised for task stealing, as is traditional in Cilk, concurrent bags, etc.

The basic idea (a rough sketch in code follows this list) is:

  • each domain in the task pool owns a CL deque; it can push without any synchronisation, and pop only needs to synchronise with other domains when there is a single element left in the deque.
  • if a domain has no tasks, it steals tasks from randomly chosen deques belonging to other domains.
  • there is a mechanism for domains to block when they discover there is no work in any deque, by posting a message to a 'waiter channel'. Importantly, when all domains are awake the only extra overhead is a check that the 'waiter channel' is empty.
  • by setting recv_block_spins, the implementation can also be tweaked to operate in a 'non-blocking' mode that spins rather than blocking in the operating system.
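To make the shape concrete, here is a minimal sketch of the scheme under assumed, illustrative names (WS_DEQUE, Pool and get_task are not the actual domainslib API): each domain pushes and pops at its own end of its own deque, and steals from a randomly chosen victim when it has no local work.

```ocaml
(* Hypothetical interface for a per-domain work-stealing deque. *)
module type WS_DEQUE = sig
  type 'a t
  val create : unit -> 'a t
  val push   : 'a t -> 'a -> unit     (* owner end: no synchronisation needed   *)
  val pop    : 'a t -> 'a option      (* owner end: may race when one element   *)
  val steal  : 'a t -> 'a option      (* thief end: synchronises with the owner *)
end

module Pool (D : WS_DEQUE) = struct
  type 'a t = { deques : 'a D.t array }

  let create num_domains =
    { deques = Array.init num_domains (fun _ -> D.create ()) }

  (* Submitting work only ever touches the caller's own deque. *)
  let push pool id task = D.push pool.deques.(id) task

  (* Take local work first; otherwise steal from a randomly chosen victim.
     Sketch assumes more than one domain; the real implementation also posts
     to a 'waiter channel' and blocks instead of spinning forever. *)
  let rec get_task pool id =
    match D.pop pool.deques.(id) with
    | Some task -> task
    | None ->
      let victim = Random.int (Array.length pool.deques) in
      if victim = id then get_task pool id
      else (match D.steal pool.deques.(victim) with
            | Some task -> task
            | None -> get_task pool id)
end
```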

Early benchmarking results (detuned Zen2 with sandmark fb2a38) are encouraging:
[figures: 20210519_big_deque_time, 20210519_big_deque_speedup]

I wouldn't say the implementation is fully finished:

  • I feel we can lean on the OCaml multicore memory model more in the deque rather than wrapping every operation in an Atomic indirection.
  • It might be worth making the implementation fully garbage-free. We could replace the Some/None return from the deque with exceptions.
  • While the benchmark results are good for many cases (LU_decomposition and game_of_life stand out), it would be good to understand the matrix_multiplication regression and take a look at binarytrees5, where the deques might be a bit slower.

The CL deque implementation was built on the queue in lockfree, but I would recommend looking at the C code in 'Correct and efficient work-stealing for weak memory models'.

@ctk21 force-pushed the ctk21/work_stealing_deque_experiment branch from ed28306 to 9f1912f on May 24, 2021 08:04
@ctk21 (Contributor, Author) commented on May 24, 2021:

(rebased to pick up #30)

@ctk21 (Contributor, Author) commented on May 26, 2021:

I had a think about formalizing how multi_channel turns a non-blocking structure into a blocking one. If Thread A is enqueuing data and Thread B is dequeuing, then:

Thread A:
  A1 - enqueue data (non-blocking)
  A2 - check the waiter queue and signal any blocking waiter

Thread B:
  B1 - poll for data (non-blocking)
  B2 - enqueue ourselves on the waiter queue
  B3 - poll for data again (non-blocking) and block with the waiter structure

Each of A1, A2, B1, B2, B3 is ordered, in the sense that for any pair the completions satisfy either Ax < By or By < Ax.
I think we can then construct the following argument to show that Thread B will always pick up Thread A's enqueue:

if A1 < B1 then B will pick up the data at B1
else (B1 < A1):
   if B2 < A1 then A will signal B at A2 and the data will be collected by B
   else (A1 < B2): B3 will execute after A1 and so B will pick up the data
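To illustrate the pattern (this is not the actual multi_channel code), here is a minimal sketch of A1/A2 and B1/B2/B3 using a plain Mutex/Condition pair per waiter, with stdlib Queue values standing in for the real lock-free structures; it is only intended to be correct in the single-producer/single-consumer scenario above.

```ocaml
(* Illustrative sketch only: Queue.t is not thread-safe; the real multi_channel
   uses lock-free structures. Steps are labelled to match A1/A2 and B1/B2/B3. *)
type waiter = {
  mutex : Mutex.t;
  cond : Condition.t;
  mutable signalled : bool;
}

(* Thread A *)
let send data waiters v =
  Queue.push v data;                        (* A1: non-blocking enqueue *)
  match Queue.pop waiters with              (* A2: signal a blocked waiter, if any *)
  | w ->
      Mutex.lock w.mutex;
      w.signalled <- true;
      Condition.signal w.cond;
      Mutex.unlock w.mutex
  | exception Queue.Empty -> ()

(* Thread B *)
let recv data waiters =
  match Queue.pop data with                 (* B1: poll *)
  | v -> v
  | exception Queue.Empty ->
      let w = { mutex = Mutex.create ();
                cond = Condition.create ();
                signalled = false } in
      Queue.push w waiters;                 (* B2: register as a waiter *)
      (match Queue.pop data with            (* B3: re-poll before blocking *)
       | v -> v
       | exception Queue.Empty ->
           Mutex.lock w.mutex;
           while not w.signalled do
             Condition.wait w.cond w.mutex
           done;
           Mutex.unlock w.mutex;
           (* under SPSC, A's element is still present after the signal *)
           Queue.pop data)
```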

@ctk21 (Contributor, Author) commented on May 26, 2021:

I've implemented two improvements:

  • we no longer do an Atomic.set on the elements of the backing array in the deque
  • I've made the ws_deque garbage-free

The big benchmarks now look like this (Zen2, not isolated, sandmark ed00c5; baseline is bc7c44, this PR is e69db8):
[figures: 20210526_sherwood_big_domainslib_pr29_time, 20210526_sherwood_big_domainslib_pr29_speedup]

Broad brush: if you want to scale, task stealing is the way to go.

The benchmarks for the standard sizes show:
[figures: 20210526_sherwood_pr29_time, 20210526_sherwood_pr29_speedup]

There is a single regression, in our evolutionary_algorithm case. I spent some time with this one; there are two things to note:

  • this benchmark does not scale as it has lots of serial sections; the task pool repeatedly flips between all domains running and all domains blocked.
  • this benchmark has safepoint issues; it makes use of iterations and folds over stdlib functions, which in the current multicore can leave big gaps between safepoints.

I think we could do something about the first: we could throttle how we wake up blocked waiters and also limit the number of stealing domains. I'm in two minds about whether we should pursue that in this PR or as follow-ups.

@ctk21 marked this pull request as ready for review on May 26, 2021 15:28
@ctk21 (Contributor, Author) commented on May 27, 2021:

So I did a bit more investigation on the evolutionary_algorithm results and found a memory retention bug. Tasks in the deque arrays will eventually be overwritten, but until a slot is overwritten all the memory held by the task closure in that slot is kept live.

I have implemented a fix where the deque array elements hold a reference to the item; this reference is clobbered when a task is taken. You can't clear the array slot itself in steal because the producer may reuse that slot.
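A rough sketch of the shape of that fix (assumed names, not the exact PR code): each array slot stores a small mutable cell wrapping the task, and the taker clears the cell rather than the array slot, so the closure becomes collectable even though the cell lingers in the array until the owner reuses the slot.

```ocaml
(* Sketch under assumptions: 'cell' is the per-slot indirection described above.
   Only the unique taker of a task calls release; the array slot keeps pointing
   at the (now empty) cell until the owner overwrites the slot with a new push. *)
type 'a cell = 'a option ref

let wrap (task : 'a) : 'a cell = ref (Some task)

let release (cell : 'a cell) : 'a =
  match !cell with
  | Some task ->
      cell := None;      (* clobber the reference so the closure can be GC'd *)
      task
  | None -> assert false (* a task is only ever taken once *)
```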

A rerun of the big benchmarks got me:
[figures: 20210527_ws_deque_gc_fix_big_time, 20210527_ws_deque_gc_fix_big_speedup]

A rerun of the standard size benchmarks got me:
[figures: 20210527_ws_deque_gc_fix_time, 20210527_ws_deque_gc_fix_speedup]

There remains something a bit odd about evolutionary algorithm that I don't fully understand - it looks like we do differing amounts of GC work with the old vs new code. The deques are causing about 20% more minor GCs, which is odd as I can't see significant extra allocations in the domainslib structures themselves:
[figure: 20210527_ws_deque_ea_mgc]

@kayceesrk self-requested a review on June 4, 2021 15:37
@ctk21 requested a review from Sudha247 on June 4, 2021 15:38
@kayceesrk (Contributor) commented:

> I feel we can lean on the OCaml multicore memory model more in the deque rather than wrapping every operation in an Atomic indirection.

Just to connect various folks who might be interested in this, @fpottier and his students were looking at instances where Multicore OCaml explicitly takes advantage of data racy behaviours for improving performance. This looks like one. But @ctk21 mentioned that we need acquire release atomics which the OCaml memory model does not support (yet). @stedolan mentioned that while acquire load fits nicely into the model, store-release does not.

@kayceesrk (Contributor) left a comment:

This PR is very intricate :-). I've only managed to go through ws_deque.ml, and even that not entirely. I'll continue with the rest of the PR next week.

Inline review comment on lib/ws_deque.ml:

    let create () = {
      top = Atomic.make 1;

Why are these 1 and not 0?

@kayceesrk self-requested a review on June 5, 2021 07:37
@kayceesrk (Contributor) commented:

Can you merge with the master again? The diff includes changes already included in master, and it makes it hard to read. See https://github.com/ocaml-multicore/domainslib/pull/29/files#diff-1a51aea47d0ef65c5073e293d19298451614febc83360af128ded914dbb6599bL39

@ocaml-multicore deleted a comment from kayceesrk on Jun 7, 2021
@ctk21 force-pushed the ctk21/work_stealing_deque_experiment branch from f5d95d2 to 2d61ed8 on June 7, 2021 08:39
@ctk21 (Contributor, Author) commented on Jun 7, 2021:

> Can you merge with the master again? The diff includes changes already included in master, and it makes it hard to read. See https://github.com/ocaml-multicore/domainslib/pull/29/files#diff-1a51aea47d0ef65c5073e293d19298451614febc83360af128ded914dbb6599bL39

Rebased to master and pushed!

@ctk21 (Contributor, Author) commented on Jun 7, 2021:

> I feel we can lean on the OCaml multicore memory model more in the deque rather than wrapping every operation in an Atomic indirection.
>
> Just to connect various folks who might be interested in this, @fpottier and his students were looking at instances where Multicore OCaml explicitly takes advantage of data racy behaviours for improving performance. This looks like one. But @ctk21 mentioned that we need acquire release atomics which the OCaml memory model does not support (yet). @stedolan mentioned that while acquire load fits nicely into the model, store-release does not.

To expand a bit: we can implement the CL deque using the Atomic module and the existing OCaml multicore memory model. However, if you look at Figure 1 in 'Correct and efficient work-stealing for weak memory models', we aren't able to express the acquire-release components of that implementation in OCaml multicore today.

In other parallel data structures, for example an SPSC ring buffer, using acquire-release can be important to get a really low-overhead implementation. However, there is a problem to solve: whether the OCaml multicore memory model can be extended to accommodate acquire-release atomics.
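For concreteness, here is a rough sketch (not the PR's ws_deque.ml) of a steal written against today's Atomic module; the comments mark where, in my reading of the paper's Figure 1, the C implementation uses weaker acquire/consume/relaxed orderings that the OCaml memory model cannot currently express.

```ocaml
(* Sketch only: the real ws_deque.ml differs (indexing, resizing, raising on
   empty). Every Atomic operation below is sequentially consistent in OCaml,
   which is correct but stronger than the orderings the C version needs. *)
type 'a t = {
  top    : int Atomic.t;
  bottom : int Atomic.t;
  tab    : 'a option array Atomic.t;   (* backing array, swapped on resize *)
}

let steal q =
  let t = Atomic.get q.top in          (* C: acquire load of top *)
  let b = Atomic.get q.bottom in       (* C: acquire load of bottom *)
  if t < b then begin
    let a = Atomic.get q.tab in        (* C: consume load of the array pointer *)
    let x = a.(t mod Array.length a) in          (* C: relaxed load of the slot *)
    (* C: compare-exchange on top; here None conflates "empty" with "lost a
       race against another thief or the owner". *)
    if Atomic.compare_and_set q.top t (t + 1) then x else None
  end else
    None
```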

@Sudha247 (Contributor) left a comment:

This is great work and the results are very exciting! The overall structure is convincing to me.

Just had a couple of minor questions.
