Fix srun invocation on Great Lakes. #779

joaander · 2023-10-31T15:38:35Z

Description

Switch back to mpirun on Great Lakes.

Motivation and Context

The switch to srun was made in #722. My workflows (hoomd-validation) work correctly with the most current release of flow, but not with the main branch. When flow forks to srun, my environment variables are somehow all lost. For example, one of the errors I got while testing was "singularity not found", even though I checked in the shell that singularity was in the PATH and that the filesystem where singularity is stored was accessible.

Strangely, everything works if I call the srun command directly from the shell:

srun -n 4  /home/joaander/miniforge3/bin/python /gpfs/accounts/sglotzer_root/sglotzer0/joaander/hoomd-validation/hoomd_validation/project.py exec hard_disk_nvt_gpu agg-88ab2b6305a8b249743764fc16e5c22e

I don't know what causes this behavior, but the expedient solution is to switch back to mpirun, which we have used for a long time on Great Lakes with fewer issues.

Checklist:

I am familiar with the Contributing Guidelines.
I agree with the terms of the Contributor Agreement.
My name is on the list of contributors.
The changes introduced by this pull request are covered by existing or newly introduced tests.
The package documentation and framework documentation in signac-docs are up to date with these changes.
I have updated the changelog and added any related issue and pull request numbers for future reference.

codecov · 2023-10-31T16:00:16Z

Codecov Report

Merging #779 (b65d02a) into main (9ba71de) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #779   +/-   ##
=======================================
  Coverage   69.11%   69.11%           
=======================================
  Files          45       45           
  Lines        4348     4348           
  Branches     1055      979   -76     
=======================================
  Hits         3005     3005           
  Misses       1137     1137           
  Partials      206      206

Files	Coverage Δ
flow/environments/umich.py	`85.71% <100.00%> (ø)`

joaander · 2023-11-01T14:57:09Z

Wait, this needs more testing on GPU nodes where srun configures the appropriate cgroups and mpirun does not. I may need to find a fix for the environment variable issue.

Edit: Never mind, I was misinterpreting the output. srun and mpirun on Great Lakes both allocate all assigned GPUs to all ranks on a given node. cgroups limit users to the assigned nodes, but do not assign individual GPUs to ranks.

joaander · 2023-11-01T16:38:31Z

Upon further testing with a 4 V100 GPU job (for which Slurm assigned 3 GPUs on one node and 1 GPU on another node), mpirun assigns multiple tasks to the same GPU while srun correctly distributes the tasks. Note: we still need --gpus-per-task, which I will add separately when working on #777.
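A small diagnostic along the following lines can help compare what each launcher hands to every task. This is only a sketch and not part of flow or hoomd-validation: `describe_rank` is a hypothetical helper, and it assumes the launcher sets `SLURM_PROCID` (srun) or `OMPI_COMM_WORLD_RANK` (Open MPI's mpirun). Without --gpus-per-task, `CUDA_VISIBLE_DEVICES` may be identical for all ranks on a node, in which case the device each rank actually selects is decided by the application and not shown here.

```python
import os
import socket


def describe_rank(env=None):
    """Return a one-line summary of this task's rank, host, and visible GPUs.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    # srun sets SLURM_PROCID; Open MPI's mpirun sets OMPI_COMM_WORLD_RANK.
    rank = env.get("SLURM_PROCID", env.get("OMPI_COMM_WORLD_RANK", "unknown"))
    # The GPUs this process is allowed to see, if the launcher restricted them.
    gpus = env.get("CUDA_VISIBLE_DEVICES", "unset")
    return f"rank={rank} host={socket.gethostname()} CUDA_VISIBLE_DEVICES={gpus}"


if __name__ == "__main__":
    # flush=True so the line appears immediately even if stdout is buffered.
    print(describe_rank(), flush=True)
```

Launching this script once with each launcher in the same allocation gives one line per task that can be compared side by side.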

I found that srun --export=ALL solved the environment variable forwarding issue. I also found that srun buffered stdout/stderr until the job completed, so I also added -u to disable buffering.

srun -u --export=ALL correctly runs hoomd-validation jobs both when hoomd is compiled from source and from the glotzerlab-software container. I am waiting for one more test on the 4 GPU A100 node.
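For reference, the resulting launcher prefix can be sketched as below. This is an illustration under stated assumptions, not the code in flow's Great Lakes template: `build_srun_command` is a hypothetical helper, while `-u`, `--export=ALL`, and `-n` are the srun options discussed above.

```python
import shlex


def build_srun_command(ntasks, command, unbuffered=True, export_all=True):
    """Build an srun command line as a single shell-quoted string.

    Hypothetical helper for illustration only.
    """
    args = ["srun"]
    if unbuffered:
        # -u disables srun's buffering of stdout/stderr.
        args.append("-u")
    if export_all:
        # --export=ALL forwards the submitting environment to the tasks.
        args.append("--export=ALL")
    args += ["-n", str(ntasks)]
    args += list(command)
    # shlex.join quotes any argument that needs it (Python 3.8+).
    return shlex.join(args)


if __name__ == "__main__":
    print(build_srun_command(4, ["python", "project.py", "exec", "hard_disk_nvt_gpu"]))
```

Building the argument list first and quoting it once at the end avoids hand-assembled strings that break on paths containing spaces.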

joaander · 2023-11-01T16:53:59Z

This works on both V100 and A100 nodes on Great Lakes.


joaander · 2023-11-02T20:56:12Z

@b-butler This PR now includes both fixes to the Great Lakes environment necessary to make the template functional for the next release:

Use cpus-per-task to ensure all processes requested via np are placed on one node.
Use gpus-per-task to allow multi-GPU jobs to correctly distribute to the assigned GPUs.

This formulation works only with homogeneous jobs. How do I call homogeneous_openmp_mpi_config to ensure this?

I did not fix --cpus-per-task for all environments, as we are planning a more general fix in #777.
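To make the homogeneity requirement concrete, here is a minimal sketch of the kind of check involved. It is not flow's `homogeneous_openmp_mpi_config` and does not use flow's directive names: `per_task_resources` and the keys `ranks`, `threads`, and `gpus` are hypothetical, and it assumes `gpus` is the total number of GPUs requested per operation.

```python
def per_task_resources(operations):
    """Derive srun-style per-task resources from a list of operations.

    Each operation is a dict with hypothetical keys 'ranks', 'threads', and
    'gpus' (total GPUs for the operation). Raises ValueError unless every
    operation requests the same resources.
    """
    if not operations:
        raise ValueError("At least one operation is required.")
    shapes = {(op["ranks"], op["threads"], op["gpus"]) for op in operations}
    if len(shapes) != 1:
        # A single set of per-task values cannot describe mixed requests.
        raise ValueError(f"Non-homogeneous resource requests: {sorted(shapes)}")
    ranks, threads, gpus = shapes.pop()
    if gpus % ranks != 0:
        raise ValueError("GPUs must divide evenly among the ranks.")
    return {
        "ntasks": ranks,
        "cpus_per_task": threads,
        "gpus_per_task": gpus // ranks,
    }


if __name__ == "__main__":
    print(per_task_resources([{"ranks": 4, "threads": 2, "gpus": 4}] * 2))
```

Raising on mixed requests is the conservative choice here, since a single --cpus-per-task or --gpus-per-task value applies to every task in the submission.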


b-butler

I made some changes to the implementation (moving to the template) since we are refactoring and nranks will be removed. The code is equivalent.

Use mpirun on Great Lakes.

5995cb4

joaander requested review from a team as code owners October 31, 2023 15:38

joaander requested review from vyasr and SchoeniPhlippsn and removed request for a team October 31, 2023 15:38

joaander added 2 commits October 31, 2023 11:45

Update test reference data.

8a66b79

Update reference data again.

3d71c28

joaander marked this pull request as draft November 1, 2023 14:56

joaander marked this pull request as ready for review November 1, 2023 15:05

joaander marked this pull request as draft November 1, 2023 15:57

Use srun, but fix environment export and buffered output.

91fd3a3

Update template reference data.

0692cdc

joaander force-pushed the fix/great-lakes-mpi-launcher branch from fcd7444 to 0692cdc Compare November 1, 2023 16:50

joaander marked this pull request as ready for review November 1, 2023 16:53

joaander changed the title ~~Use mpirun on Great Lakes.~~ Fix srun invocation on Great Lakes. Nov 1, 2023

joaander and others added 3 commits November 2, 2023 16:46

Temporary fix to Great Lakes scheduling.

3022147

[pre-commit.ci] auto fixes from pre-commit.com hooks

b702535

for more information, see https://pre-commit.ci

Update test reference data.

ebdd1cd

b-butler added 2 commits November 8, 2023 10:36

fix: Raise error on non-homogeneous MPI requests

5099399

refactor: Move necessary changes to template.

9334277

Due to planned refactoring of directives this makes more sense.

b-butler approved these changes Nov 8, 2023


Merge branch 'main' into fix/great-lakes-mpi-launcher

b65d02a

b-butler enabled auto-merge (squash) November 8, 2023 18:51

b-butler merged commit 8477337 into main Nov 8, 2023
14 checks passed

b-butler deleted the fix/great-lakes-mpi-launcher branch November 8, 2023 19:33

b-butler added this to the v0.27.0 milestone Jan 15, 2024
