Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelism within / between groups #270

Open
bdice opened this issue Mar 11, 2020 · 12 comments
Open

Parallelism within / between groups #270

bdice opened this issue Mar 11, 2020 · 12 comments
Assignees
Labels
cluster submission Enhancements to the submission process enhancement New feature or request groups

Comments

@bdice
Copy link
Member

bdice commented Mar 11, 2020

Feature description

tl;dr: We need a way to control parallelism within and between groups. Parallel operation within a group would be "intra-group" and parallel operation between groups would be "inter-group." This behavior would be controlled by the --parallel flag.

Copied from Slack, adapted for brevity:

@bdice: I have two operations, equilibrate and sample, in a group called simulate. Currently only the equilibrate jobs are eligible to run. I want to run equilibrate and then sample, parallelized across jobs (not parallel over both operations at the same time). That is, all jobs run equilibrate and then run simulate when that's done. On submission, it is requesting enough resources to run equilibrate and sample simultaneously, instead of sequentially (while still being parallel over jobs). How do I parallelize across groups but operate sequentially within groups?

@b-butler: The current implementation by design parallelizes across both jobs and operations since given only one flag there was not a way to specify only parallelizing part of the submission. We could implement more mutually exclusive flags or make --parallel have a default value.

Proposed solution

Proposal for API from @csadorf:
--parallel=none: No parallel execution, default when no option is provided.
--parallel=inter: Parallel execution across, but not within groups; default when only --parallel is provided.
--parallel=intra: Parallel execution within groups, but not across.
--parallel=all: Parallel execution within and across groups.

@bdice bdice added enhancement New feature or request groups cluster submission Enhancements to the submission process labels Mar 11, 2020
@bdice bdice added this to the v0.10.0 milestone Mar 11, 2020
@csadorf
Copy link
Contributor

csadorf commented Mar 11, 2020

I would like to update my proposal:

  • --parallel=none: No parallel execution, default when no option is provided.
  • --parallel=inter-groups: Parallel execution between, but not within groups; default when only --parallel is provided.
  • --parallel=intra-groups: Parallel execution within groups, but not between.
  • --parallel=full: Parallel execution between and within groups.

@ac-optimus
Copy link
Contributor

hi, I would like to work on this issue?

@bdice
Copy link
Member Author

bdice commented Mar 11, 2020

I have talked with @ac-optimus on Slack about this issue and there are some good first steps that could be taken. Some suggestions for implementation:

  • The parallelism has to be handled by both the "submit" logic (which handles parallelism between groups) and the "run" logic (which handles parallelism within a group).
  • Resources requests should use max or sum appropriately. If you're running a group's operations in parallel, that group needs to request the sum of its resources. Likewise, a group run in serial needs the max of its resources. If you're running multiple groups in parallel, the total request is the sum of all groups' resources. Likewise, multiple groups run in serial need to request the max over all groups' resources. That is, --parallel=none should request something like max(max(op for op in group) for group in groups) for each resource (GPUs, number of processors, etc.). In the same way, --parallel=inter-groups would be sum(max(...)), --parallel=intra-groups would be max(sum(...)), --parallel=full would be sum(sum(...)).
  • Copy the behavior of Issue#38 add ignore_condition for submit and run #209 in how it implements the options as an IntEnum.
  • Test this out by overriding the default environment with a SLURM environment and generating a script. It may be easier to check the output after we resolve Execute run with pretend in template testing #252.

@bdice
Copy link
Member Author

bdice commented Mar 11, 2020

@glotzerlab/signac-committers Other suggestions for implementation are welcome. @ac-optimus Since this is fairly complicated, I would like to see a small proposal for the work to be done (which parts of the code would be edited) before beginning a pull request.

@csadorf
Copy link
Contributor

csadorf commented Mar 11, 2020

Calculating resources for this is non-trivial. How does this relate to #265 ?

@ac-optimus
Copy link
Contributor

@glotzerlab/signac-committers Other suggestions for implementation are welcome. @ac-optimus Since this is fairly complicated, I would like to see a small proposal for the work to be done (which parts of the code would be edited) before beginning a pull request.

okay sure, will update you on this very soon!

@b-butler
Copy link
Member

I think that this logic for directives aggregation should take two steps. We need to aggregate directives within a group according to serial or parallel, and after that we need to aggregate again with respect to the inter-group parallelization (or lack thereof).

With respect to #265 @csadorf, maybe this issue is another reason to centralize this directives logic to reduce code surface area and the fact that the two aggregations are identical though different in scope.

I could see the implementation of this more complete aggregation logic being the first step in addressing #265 by separating the logic from the FlowGroup and templates (the templates would still need the total directives, but there is no point in duplicating the aggregation logic).

@csadorf
Copy link
Contributor

csadorf commented Mar 11, 2020

Agreed, refactoring the directives logic must be the first step.

@csadorf
Copy link
Contributor

csadorf commented Mar 11, 2020

And honestly, it should not be that hard. We just need to allow for customization by environments.

@b-butler
Copy link
Member

So would the ability to aggregate directives be found in the ComputeEnvironment class?

@csadorf
Copy link
Contributor

csadorf commented Mar 11, 2020

Or at least it should be associated with a default entity.

@bdice
Copy link
Member Author

bdice commented Mar 11, 2020

Yes, refactoring directives seems like a good place to start.

(To clarify, the use of the word "aggregation" here in the context of determining resource requests over a set of flow groups is not related to the "aggregation" feature discussed in #52.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cluster submission Enhancements to the submission process enhancement New feature or request groups
Projects
None yet
Development

No branches or pull requests

4 participants