
Improvement of parallel_for implementation #16

Merged: 3 commits into master, Aug 20, 2020

Conversation

@ctk21 (Contributor) commented Aug 16, 2020

This PR:

  • introduces a recursive divide-and-conquer scheme for distributing work in parallel_for
  • makes the chunk_size a parameter that takes a sensible default (N/(8*num_domains))
  • retains the old parallel_for as parallel_for_sequential

The motivation is two-fold:

  • improve scaling when we use more than 8-16 cores.
  • try to provide a roughly sensible default for the chunk_size rather than forcing the user to pick one, since user-chosen values can often be poor even when they seem sensible.

My design notes for this change are:

Considerations for parallel_for

There are a couple of subtle things that can happen with parallel_for, concerning how work is distributed and how the chunk_size is chosen.

Work distribution

Consider the following work distribution methods:

  • Sequential: the caller creates a single work item for each chunk to be done and distributes each one to the work pool.
  • Divide-and-conquer (DC): if the work is less than chunk_size, do it now without scheduling. If the work is more than chunk_size, split the current work into two, post the first half to the work queue, and recursively process the second half on the current worker.

In the sequential scheme, all the overhead of work queuing, signalling waiting workers and synchronizing on results is taken by the single caller thread. In the DC scheme, while it may generate some extra work items (O(log n) of them), the overhead of work queuing, signalling waiting workers and synchronizing on results is distributed over the workers in the pool.
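To illustrate, here is a minimal sketch of the DC scheme, assuming a domainslib-style Task API with async/await; it is not the PR's literal code:

```ocaml
(* Minimal sketch of divide-and-conquer work distribution, assuming
   Task.async : pool -> (unit -> 'a) -> 'a promise and
   Task.await : pool -> 'a promise -> 'a, as in domainslib. *)
let rec pfor_dc pool ~chunk_size ~start ~finish ~body =
  if finish - start + 1 <= chunk_size then
    (* small enough: execute here without any scheduling *)
    for i = start to finish do body i done
  else begin
    let mid = start + (finish - start) / 2 in
    (* post the first half to the work queue ... *)
    let first_half =
      Task.async pool (fun () ->
        pfor_dc pool ~chunk_size ~start ~finish:mid ~body)
    in
    (* ... and recursively process the second half on this worker *)
    pfor_dc pool ~chunk_size ~start:(mid + 1) ~finish ~body;
    Task.await pool first_half
  end
```

Note how the queuing and synchronization cost is paid a little at a time by whichever worker performs each split, rather than all up front by the caller.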

Benchmarking in Sandmark suggests DC is noticeably better, often by a significant margin at 16 or more domains.

How to choose chunk_size

At first glance, it looks like for balanced workloads you should choose:
n_tasks / n_workers

Firstly, experience suggests it is rare for a workload to be truly balanced. Even a basic piece of code with no garbage collection can experience different run times (e.g. from different paths through the code, caching or NUMA effects), while in the presence of garbage collection (either a minor collection or a major slice) different domains can see different execution times for seemingly identical code.

There is a second issue relating to rounding. Consider sequential distribution in the presence of rounding, with chunk_size set to int(n_tasks / n_workers). Writing
n_tasks = k * chunk_size + r
where
chunk_size = n_tasks / n_workers (integer division)
k = n_tasks / chunk_size
r = n_tasks mod chunk_size
there will then be the following chunks to execute:
k chunks of chunk_size
1 chunk of r (when r > 0)
Since k >= n_workers, there are more chunks than workers whenever r > 0 (or k > n_workers), so you can end up with an under-utilized pool once the first n_workers chunks are finished. Indeed the critical path can often be "time to do chunk_size on one worker" greater than the user expected.
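To make the rounding effect concrete, here is a small hypothetical example (the numbers are mine, not from the PR):

```ocaml
(* Hypothetical numbers illustrating the rounding effect. *)
let n_tasks = 100 and n_workers = 16
let chunk_size = n_tasks / n_workers  (* = 6 *)
let k = n_tasks / chunk_size          (* = 16 chunks of 6 tasks *)
let r = n_tasks mod chunk_size        (* = 4: one extra chunk of 4 *)
(* 17 chunks on 16 workers: one worker runs 6 + 4 = 10 tasks while the
   others run 6, so the critical path is ~10 tasks instead of the
   n_tasks / n_workers = 6.25 the user might expect. *)
```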

It is difficult for users to select a default chunk_size that will work well with all permutations of n_tasks, n_workers, task variability, machine specifics, etc.

With this in mind, I would like to advocate for a default chunk_size of:
n_tasks / (8 * n_workers)
This attempts to exploit any imbalance and reduce the impact of rounding effects. We can't push the factor arbitrarily higher than 8, as there is an overhead in managing the tasks.

(As far as I can tell, this choice of 8 is also in line with other parallel_for implementations out there.)
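As a sketch, the defaulting could be as simple as the following (the max 1 guard against tiny n_tasks is my addition):

```ocaml
(* Sketch of the proposed default chunk size. *)
let default_chunk_size ~n_tasks ~n_workers =
  max 1 (n_tasks / (8 * n_workers))
```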

Benchmarks

In these sandmark benchmarks, we have the following variants:

  • dc_parallel_for: domainslib using this PR
  • dflt_chunksz: sandmark changed to use a chunk_size of n_tasks / (8 * n_workers) where it previously had n_tasks / n_workers

[plot: dflt_and_dist]

The end result of using the default chunk_size and this PR's implementation of parallel_for gives you the blue line. For some benchmarks, the impact can be substantial. Potentially more importantly, with this change it is less likely for a user to oversize their pool and see a reduction in performance.

Commits:
  • default the chunksize to N/(8*num_domains)
  • Make old parallel_for -> parallel_for_sequential
  • unit tests added to sum_par for parallel_for to handle chunk_size defaulting; placing non-optional argument to parallel_for at end of argument list
@ctk21 (Contributor, Author) commented Aug 19, 2020

Following discussion, I've removed parallel_for_sequential.
I've also special-cased num_domains = 1 to result in a simple for loop.
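A rough sketch of how that could look, reusing pfor_dc from the earlier sketch; pool_size is a hypothetical accessor for the number of domains in the pool, and the merged code may differ:

```ocaml
(* Sketch: single-domain pools fall back to a plain loop; otherwise
   default the chunk size and use divide-and-conquer distribution.
   [pool_size] is hypothetical, not domainslib's actual API. *)
let parallel_for ?chunk_size ~start ~finish ~body pool =
  let n_domains = pool_size pool in
  if n_domains = 1 then
    (* a single domain: plain loop, no task-queue or signalling overhead *)
    for i = start to finish do body i done
  else
    let chunk_size =
      match chunk_size with
      | Some c -> c
      | None -> max 1 ((finish - start + 1) / (8 * n_domains))
    in
    pfor_dc pool ~chunk_size ~start ~finish ~body
```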

@ctk21 (Contributor, Author) commented Aug 19, 2020

More results, this time from a larger Zen2 machine:
[plot: zen2_dc_for]
The improvement is not as dramatic as on the Intel machine I used earlier, but there are still gains to be had.

(The curious super-linear scaling for nbody is because the sequential code in sandmark isn't as good as the parallel code in the loop body, which avoids some memory writes.)

@kayceesrk kayceesrk merged commit d110088 into master Aug 20, 2020
@kayceesrk (Contributor) commented

The results look very good. I'm surprised that the work distribution impacts the scalability to this extent. I'll merge this one.

@UnixJunkie commented

There is a funny thing happening starting at ncores >= 16.
It would be interesting to know what the bottleneck is in those "many cores" scenarios.
Maybe some things are still doable to improve scaling at ncores >= 16.
Also, maybe especially for those cases, the problem size should be "big enough".

@UnixJunkie commented

If you have a machine with 64 cores to run those tests, that would be nice.
