Multi-node bundling of jobs won't subscribe to nodes correctly #239

Closed
vyasr opened this issue Feb 3, 2020 · 2 comments
Labels
cluster submission Enhancements to the submission process

Comments

@vyasr
Contributor

vyasr commented Feb 3, 2020

Feature description

Any submission that bundles more than one operation across multiple nodes will not parallelize correctly when the operations are launched by backgrounding. For example, if a node has 24 cores and we submit 72 operations, our script generation will correctly request 3 nodes with 24 tasks per node. However, each operation is executed by running the normal command and backgrounding it, and I don't believe there is any way for backgrounded operations to be distributed across nodes. As a result, the 25th operation (and every one after it) will simply oversubscribe the cores of the current node.
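To make the arithmetic concrete, here is a minimal sketch of the mismatch, not the project's actual script generation. The function names are hypothetical, and the placement model assumes that the batch script runs on the first node of the allocation, so every backgrounded process starts there.

```python
import math


def request_nodes(num_operations, cores_per_node):
    """Number of nodes needed to give each operation its own core."""
    return math.ceil(num_operations / cores_per_node)


def place_backgrounded(num_operations, num_nodes):
    """Model plain backgrounding: the batch script runs on node 0
    (assumption), so every operation it backgrounds lands on node 0."""
    placement = {node: 0 for node in range(num_nodes)}
    for _ in range(num_operations):
        placement[0] += 1
    return placement


nodes = request_nodes(72, 24)            # 3 nodes requested
placement = place_backgrounded(72, nodes)
oversubscribed = placement[0] - 24       # operations beyond node 0's 24 cores
print(nodes, placement, oversubscribed)
```

Under this model the request is sized correctly, but two of the three nodes stay idle while the first one runs all 72 operations.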

Proposed solution

I believe that using the appropriate submission prefix like ibrun or srun should resolve this issue. Rather than only using these commands when running operations that individually require MPI, we may need to use them for any multi-node job. @joaander may be able to provide additional commentary on potential solutions.
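As a rough sketch of what this could look like, the helper below prefixes every bundled command with a launcher before backgrounding it. The `bundle_lines` function and the `./run_operation.sh` commands are hypothetical. `--nodes=1` and `--ntasks=1` are existing srun options, but whether they place job steps correctly is an assumption that depends on the cluster's scheduler configuration.

```python
import shlex


def bundle_lines(commands, launcher=("srun", "--nodes=1", "--ntasks=1")):
    """Return script lines that run each command through the launcher
    in the background, followed by a final `wait`."""
    prefix = " ".join(shlex.quote(part) for part in launcher)
    lines = [f"{prefix} {command} &" for command in commands]
    lines.append("wait")  # keep the batch script alive until all steps finish
    return lines


# Hypothetical operation commands, for illustration only.
script = bundle_lines(["./run_operation.sh 0", "./run_operation.sh 1"])
print("\n".join(script))
```

Passing a different tuple such as `("ibrun",)` as `launcher` would swap the prefix, though the options each launcher needs will differ between systems.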

@joaander
Member

joaander commented Feb 4, 2020

Any solution will be environment dependent. In principle, srun is supposed to solve this, but in most of the environments where we have tested it, it fails for one reason or another.

@vyasr vyasr added the cluster submission Enhancements to the submission process label Feb 26, 2020
@joaander
Member

Discussion on this topic has been moved to #777.

@joaander joaander closed this as not planned Oct 31, 2023