
Improvement of parallel_for implementation #16

Merged: 3 commits into master, Aug 20, 2020

Conversation

@ctk21 (Contributor) commented Aug 16, 2020

This PR:

  • introduces a recursive divide-and-conquer scheme for distributing work in parallel_for
  • makes the chunk_size a parameter that takes a sensible default (N/(8*num_domains))
  • retains the old parallel_for as parallel_for_sequential

The motivation is two-fold:

  • improve scaling when we use more than 8-16 cores.
  • try to provide a roughly sensible default for the chunk_size rather than forcing the user to pick one, since user-chosen values can often be poor even when they seem sensible.

My design notes for this change are:

Considerations for parallel_for

There are a couple of subtle things that can happen with parallel_for, concerning how work is distributed and how the chunk_size is chosen.

Work distribution

Consider the following work distribution methods:

  • Sequential: the caller creates a single work item for each chunk to be done and distributes each one to the work pool.
  • Divide-and-conquer (DC): if the work is less than chunk_size, do it now without scheduling. If the work is more than chunk_size, split the current work into two, post the first half to the work queue, and recursively process the second half on the current worker.

In the sequential scheme, all the overhead of work queuing, signalling waiting workers and synchronizing on results is taken by the single caller thread. In the DC scheme, while it may generate some extra work items (O(log n) of them), the overhead of work queuing, signalling waiting workers and synchronizing on results is distributed over the workers in the pool.
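To illustrate, here is a minimal sketch of the DC scheme, assuming a domainslib-style Task API with async/await; it is not the PR's literal code:

```ocaml
(* Minimal sketch of divide-and-conquer work distribution, assuming
   Task.async : pool -> (unit -> 'a) -> 'a promise and
   Task.await : pool -> 'a promise -> 'a, as in domainslib. *)
let rec pfor_dc pool ~chunk_size ~start ~finish ~body =
  if finish - start + 1 <= chunk_size then
    (* small enough: execute here without any scheduling *)
    for i = start to finish do body i done
  else begin
    let mid = start + (finish - start) / 2 in
    (* post the first half to the work queue ... *)
    let first_half =
      Task.async pool (fun () ->
        pfor_dc pool ~chunk_size ~start ~finish:mid ~body)
    in
    (* ... and recursively process the second half on this worker *)
    pfor_dc pool ~chunk_size ~start:(mid + 1) ~finish ~body;
    Task.await pool first_half
  end
```

Note how the queuing and synchronization cost is paid a little at a time by whichever worker performs each split, rather than all up front by the caller.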

Benchmarking in Sandmark suggests DC is noticeably better, often by a significant margin at 16 or more domains.

How to choose chunk_size

At first glance, it looks like for balanced workloads you should choose:
n_tasks / n_workers

Firstly, experience suggests it is rare for a workload to be truly balanced. Even a basic piece of code with no garbage collection can experience different run times (e.g. from different paths through the code, caching or NUMA effects), while in the presence of garbage collection (either a minor collection or a major slice) different domains can see different execution times for seemingly identical code.

There is a second issue relating to rounding. Consider sequential distribution in the presence of rounding, with chunk_size set to int(n_tasks / n_workers). Writing
n_tasks = k * chunk_size + r
where
chunk_size = n_tasks / n_workers (integer division)
k = n_tasks / chunk_size
r = n_tasks mod chunk_size
there will then be the following chunks to execute:
k chunks of chunk_size
1 chunk of r (when r > 0)
Since k >= n_workers, there are more chunks than workers whenever r > 0 (or k > n_workers), so you can end up with an under-utilized pool once the first n_workers chunks are finished. Indeed the critical path can often be "time to do chunk_size on one worker" greater than the user expected.
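To make the rounding effect concrete, here is a small hypothetical example (the numbers are mine, not from the PR):

```ocaml
(* Hypothetical numbers illustrating the rounding effect. *)
let n_tasks = 100 and n_workers = 16
let chunk_size = n_tasks / n_workers  (* = 6 *)
let k = n_tasks / chunk_size          (* = 16 chunks of 6 tasks *)
let r = n_tasks mod chunk_size        (* = 4: one extra chunk of 4 *)
(* 17 chunks on 16 workers: one worker runs 6 + 4 = 10 tasks while the
   others run 6, so the critical path is ~10 tasks instead of the
   n_tasks / n_workers = 6.25 the user might expect. *)
```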

It is difficult for users to select a default chunk_size that will work well with all permutations of n_tasks, n_workers, task variability, machine specifics, etc.

With this in mind, I would like to advocate for a default chunk_size of:
n_tasks / (8 * n_workers)
This attempts to exploit any imbalance and reduce the impact of rounding effects. We can't push the factor arbitrarily higher than 8, as there is an overhead in managing the tasks.

(As far as I can tell, this choice of 8 is also in line with other parallel_for implementations out there.)
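As a sketch, the defaulting could be as simple as the following (the max 1 guard against tiny n_tasks is my addition):

```ocaml
(* Sketch of the proposed default chunk size. *)
let default_chunk_size ~n_tasks ~n_workers =
  max 1 (n_tasks / (8 * n_workers))
```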

Benchmarks

In these sandmark benchmarks, we have the following variants:

  • dc_parallel_for: domainslib using this PR
  • dflt_chunksz: sandmark changed to use a chunk_size of n_tasks / (8 * n_workers) where it previously had n_tasks / n_workers

[plot: dflt_and_dist]

The end result of using the default chunk_size and this PR's implementation of parallel_for gives you the blue line. For some benchmarks, the impact can be substantial. Potentially more importantly, with this change it is less likely for a user to oversize their pool and see a reduction in performance.

Commits:
  • default the chunksize to N/(8*num_domains)
  • Make old parallel_for -> parallel_for_sequential
  • unit tests added to sum_par for parallel_for to handle chunk_size defaulting; placing non-optional argument to parallel_for at end of argument list
@ctk21 (Contributor, Author) commented Aug 19, 2020

Following discussion, I've removed parallel_for_sequential.
I've also special-cased num_domains = 1 to result in a simple for loop.
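A rough sketch of how that could look, reusing pfor_dc from the earlier sketch; pool_size is a hypothetical accessor for the number of domains in the pool, and the merged code may differ:

```ocaml
(* Sketch: single-domain pools fall back to a plain loop; otherwise
   default the chunk size and use divide-and-conquer distribution.
   [pool_size] is hypothetical, not domainslib's actual API. *)
let parallel_for ?chunk_size ~start ~finish ~body pool =
  let n_domains = pool_size pool in
  if n_domains = 1 then
    (* a single domain: plain loop, no task-queue or signalling overhead *)
    for i = start to finish do body i done
  else
    let chunk_size =
      match chunk_size with
      | Some c -> c
      | None -> max 1 ((finish - start + 1) / (8 * n_domains))
    in
    pfor_dc pool ~chunk_size ~start ~finish ~body
```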

@ctk21 (Contributor, Author) commented Aug 19, 2020

More results, this time from a larger Zen2 machine:
[plot: zen2_dc_for]
The improvement is not as dramatic as on the Intel machine I used earlier, but there are still gains to be had.

(The curious super-linear scaling for nbody is because the sequential code in sandmark isn't as good as the parallel code in the loop body, which avoids some memory writes.)

@kayceesrk kayceesrk merged commit d110088 into master Aug 20, 2020
@kayceesrk (Contributor) commented

The results look very good. I'm surprised that the work distribution impacts the scalability to this extent. I'll merge this one.

@UnixJunkie commented

There is a funny thing happening starting at ncores >= 16.
It would be interesting to know what the bottleneck is in those "many cores" scenarios.
Maybe some things are still doable to improve scaling at ncores >= 16.
Also, maybe especially for those cases, the problem size should be "big enough".

@UnixJunkie commented

If you have a machine with 64 cores to run those tests, that would be nice.
