Parallelism within / between groups #270

bdice · 2020-03-11T15:08:43Z

Feature description

tl;dr: We need a way to control parallelism within and between groups. Parallel operation within a group would be "intra-group" and parallel operation between groups would be "inter-group." This behavior would be controlled by the --parallel flag.

Copied from Slack, adapted for brevity:

@bdice: I have two operations, equilibrate and sample, in a group called simulate. Currently only the equilibrate jobs are eligible to run. I want to run equilibrate and then sample, parallelized across jobs (not parallel over both operations at the same time). That is, all jobs run equilibrate and then run sample when that's done. On submission, it requests enough resources to run equilibrate and sample simultaneously instead of sequentially (while still being parallel over jobs). How do I parallelize across groups but operate sequentially within groups?

@b-butler: The current implementation parallelizes across both jobs and operations by design: with only one flag, there was no way to specify parallelizing only part of the submission. We could implement more mutually exclusive flags or make --parallel have a default value.

Proposed solution

Proposal for API from @csadorf:
--parallel=none: No parallel execution, default when no option is provided.
--parallel=inter: Parallel execution across, but not within groups; default when only --parallel is provided.
--parallel=intra: Parallel execution within groups, but not across.
--parallel=all: Parallel execution within and across groups.


csadorf · 2020-03-11T17:08:09Z

I would like to update my proposal:

--parallel=none: No parallel execution, default when no option is provided.
--parallel=inter-groups: Parallel execution between, but not within groups; default when only --parallel is provided.
--parallel=intra-groups: Parallel execution within groups, but not between.
--parallel=full: Parallel execution between and within groups.
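To illustrate the two defaults in this proposal (flag absent versus bare --parallel), here is a minimal sketch using the standard-library argparse module. The parser, its prog name, and the build_parser helper are hypothetical and are not taken from the existing command-line interface; only the four option values come from the proposal.

```python
import argparse

# Option values taken from the updated proposal above.
PARALLEL_CHOICES = ("none", "inter-groups", "intra-groups", "full")


def build_parser():
    """Build a hypothetical parser that only knows the --parallel option."""
    parser = argparse.ArgumentParser(prog="submit-sketch")
    parser.add_argument(
        "--parallel",
        nargs="?",             # the value after --parallel is optional
        choices=PARALLEL_CHOICES,
        default="none",        # used when the flag is not given at all
        const="inter-groups",  # used when the flag is given without a value
    )
    return parser


parser = build_parser()
print(parser.parse_args([]).parallel)
print(parser.parse_args(["--parallel"]).parallel)
print(parser.parse_args(["--parallel=full"]).parallel)
```

With nargs="?", argparse distinguishes the absent flag (default) from the bare flag (const), which is the distinction the proposal relies on.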

ac-optimus · 2020-03-11T17:59:42Z

Hi, I would like to work on this issue.

bdice · 2020-03-11T18:11:45Z

I have talked with @ac-optimus on Slack about this issue and there are some good first steps that could be taken. Some suggestions for implementation:

The parallelism has to be handled by both the "submit" logic (which handles parallelism between groups) and the "run" logic (which handles parallelism within a group).
Resource requests should use max or sum as appropriate. A group whose operations run in parallel must request the sum of its operations' resources; a group run in serial needs only the max. Likewise, multiple groups run in parallel must request the sum over all groups' resources, while groups run in serial need only the max over all groups. That is, for each resource (GPUs, number of processors, etc.), --parallel=none should request something like max(max(op for op in group) for group in groups). In the same way, --parallel=inter-groups would be sum(max(...)), --parallel=intra-groups would be max(sum(...)), and --parallel=full would be sum(sum(...)).
Copy the behavior of #209 ("Issue#38 add ignore_condition for submit and run") in how it implements the options as an IntEnum.
Test this out by overriding the default environment with a SLURM environment and generating a script. It may be easier to check the output after we resolve #252 ("Execute run with pretend in template testing").
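The max/sum rule in the suggestions above can be sketched as follows. This is only an illustration under stated assumptions: the Parallel enum, its member names, and total_request are hypothetical names, and each group is reduced to a plain list of per-operation amounts for a single resource.

```python
from enum import IntEnum


class Parallel(IntEnum):
    """Hypothetical option enum; members mirror the proposed flag values."""

    NONE = 0
    INTER_GROUPS = 1
    INTRA_GROUPS = 2
    FULL = 3


def total_request(groups, mode):
    """Return the total request for one resource (e.g. number of processors).

    `groups` is a list of groups; each group is a list holding the amount
    of the resource that each of its operations needs.
    """
    # Operations within a group: sum if they run in parallel, else max.
    within = sum if mode in (Parallel.INTRA_GROUPS, Parallel.FULL) else max
    # Groups relative to each other: sum if they run in parallel, else max.
    between = sum if mode in (Parallel.INTER_GROUPS, Parallel.FULL) else max
    return between(within(group) for group in groups)


# Two groups with two operations each (made-up numbers).
groups = [[4, 2], [1, 8]]
for mode in Parallel:
    print(mode.name, total_request(groups, mode))
```

For these made-up numbers the four modes give 8, 12, 9, and 15, matching max(max), sum(max), max(sum), and sum(sum) respectively.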

bdice · 2020-03-11T18:13:04Z

@glotzerlab/signac-committers Other suggestions for implementation are welcome. @ac-optimus Since this is fairly complicated, I would like to see a small proposal for the work to be done (which parts of the code would be edited) before beginning a pull request.

csadorf · 2020-03-11T18:14:02Z

Calculating resources for this is non-trivial. How does this relate to #265?

ac-optimus · 2020-03-11T18:20:28Z

> @glotzerlab/signac-committers Other suggestions for implementation are welcome. @ac-optimus Since this is fairly complicated, I would like to see a small proposal for the work to be done (which parts of the code would be edited) before beginning a pull request.

Okay, sure, I will update you on this very soon!

b-butler · 2020-03-11T18:28:50Z

I think that this logic for directives aggregation should take two steps. We need to aggregate directives within a group according to serial or parallel, and after that we need to aggregate again with respect to the inter-group parallelization (or lack thereof).

With respect to #265, @csadorf: maybe this issue is another reason to centralize the directives logic, both to reduce code surface area and because the two aggregations are identical in form and differ only in scope.

I could see the implementation of this more complete aggregation logic being the first step in addressing #265 by separating the logic from the FlowGroup and templates (the templates would still need the total directives, but there is no point in duplicating the aggregation logic).
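A rough sketch of what such a centralized two-step aggregation might look like: one function is applied first within each group and then again across groups. The function names and the dictionary layout are assumptions made for illustration, not existing code. The sketch also treats every directive as a count-like resource (such as np or ngpu); directives that combine differently, for example a time limit, would need their own rule.

```python
def aggregate(directives, parallel):
    """Combine a list of directive dicts into a single dict.

    Values are summed when the items run in parallel and reduced with max
    when they run in serial. Missing keys count as zero.
    """
    combine = sum if parallel else max
    keys = set().union(*directives)
    return {key: combine(d.get(key, 0) for d in directives) for key in keys}


def total_directives(groups, intra_parallel, inter_parallel):
    """Aggregate within each group first, then across the groups."""
    per_group = [aggregate(operations, intra_parallel) for operations in groups]
    return aggregate(per_group, inter_parallel)


# Two groups of operations with made-up directives.
groups = [
    [{"np": 4, "ngpu": 1}, {"np": 2}],
    [{"np": 8}],
]
# Serial within groups, parallel between groups (the proposed inter-groups mode).
print(total_directives(groups, intra_parallel=False, inter_parallel=True))
```

Because the same aggregate function serves both scopes, the logic would only need to live in one place, and a template would receive just the final totals.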

csadorf · 2020-03-11T18:31:46Z

Agreed, refactoring the directives logic must be the first step.

csadorf · 2020-03-11T18:32:17Z

And honestly, it should not be that hard. We just need to allow for customization by environments.

b-butler · 2020-03-11T18:38:25Z

So would the ability to aggregate directives be found in the ComputeEnvironment class?

csadorf · 2020-03-11T18:46:12Z

Or at least it should be associated with a default entity.

bdice · 2020-03-11T18:50:15Z

Yes, refactoring directives seems like a good place to start.

(To clarify, the use of the word "aggregation" here in the context of determining resource requests over a set of flow groups is not related to the "aggregation" feature discussed in #52.)

bdice added the enhancement, groups, and cluster submission labels Mar 11, 2020

bdice added this to the v0.10.0 milestone Mar 11, 2020

bdice assigned ac-optimus Mar 11, 2020

ac-optimus mentioned this issue Mar 25, 2020

Make directives module or class #265

Closed

vyasr mentioned this issue May 26, 2020

Show expected exec commands in submission script. #279

Merged

12 tasks

b-butler mentioned this issue Jun 15, 2020

Remove parallel option for FlowGroup submission #297

Merged

12 tasks

bdice modified the milestones: v0.10.0, v0.11.0 Jun 27, 2020

b-butler modified the milestones: v0.11.0, v0.12.0 Oct 7, 2020

bdice modified the milestones: v0.12.0, v0.13.0 Jan 15, 2021

vyasr mentioned this issue Jan 25, 2021

Add type hints #404

Open

vyasr mentioned this issue Feb 26, 2021

Stampede2 bundled submissions #437

Open

bdice removed this from the v0.13.0 milestone Mar 17, 2021

vyasr mentioned this issue Jan 8, 2022

Clarify limits of submit --bundle --parallel glotzerlab/signac-docs#157

Merged

3 tasks

joaander mentioned this issue Oct 31, 2023

Incorrect SLURM scripts produced with omp_num_threads>1. #777

Closed
