
Incorrect SLURM scripts produced with omp_num_threads>1. #777

Closed
joaander opened this issue Oct 30, 2023 · 6 comments
Comments

@joaander
Member

Description

flow incorrectly computes np = nranks * omp_num_threads and then sets --ntasks=np in the SLURM submission script.
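
A minimal sketch of the mismatch, using the directive values from the reproducer below (illustrative Python only, not flow's actual template code):

nranks = 4
omp_num_threads = 127

# current behavior: the product of ranks and threads is written to --ntasks
ntasks_current = nranks * omp_num_threads  # --ntasks=508

# desired behavior: --ntasks counts MPI ranks, --cpus-per-task counts threads per rank
ntasks_desired = nranks                    # --ntasks=4
cpus_per_task_desired = omp_num_threads    # --cpus-per-task=127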

To reproduce

$ flow test-workflow -g 4 -c 128 -n 1

Then add:

@Project.post.true("hybrid_mpi_omp")
@Project.operation(directives={"nranks": 4, "omp_num_threads": 127})
def hybrid_mpi_omp(job):
    pass

to project.py

$ python project.py submit --pretend -o hybrid_mpi_omp --partition wholenode

Error output

The resulting job script is:

#SBATCH --job-name="Project/b49788b9076312db79994d96758fdcf8/hybrid_mpi_o/6bb6c79496f9cc4858a0c6dd166318b6"
#SBATCH --partition=wholenode
#SBATCH --ntasks=508
...
export OMP_NUM_THREADS=127; mpirun -n 4 ...

A job definition like this does not provide the desired behavior. Tested interactively:

salloc -Admr140129 --ntasks=508 --partition wholenode
...
a937:~/devel/signac-flow ▶ export OMP_NUM_THREADS=127; mpirun -n 4 hostname
a937.anvil.rcac.purdue.edu
a937.anvil.rcac.purdue.edu
a937.anvil.rcac.purdue.edu
a937.anvil.rcac.purdue.edu

Notice how all ranks are run on the same node. Each rank should be on a separate node.

Correct job definition:

salloc -Admr140129 --ntasks=4  --cpus-per-task=127 --partition wholenode -t 1:00:00
...
a937:~/devel/signac-flow ▶ mpirun -n 4 hostname
a937.anvil.rcac.purdue.edu
a965.anvil.rcac.purdue.edu
a966.anvil.rcac.purdue.edu
a971.anvil.rcac.purdue.edu

When ntasks is the number of MPI tasks and --cpus-per-task is set appropriately, the 4 MPI ranks are assigned to different nodes: each rank reserves 127 of the 128 cores on a node in the wholenode partition, so only one rank fits per node.

System configuration

  • Operating System: Linux
  • Version of Python: 3.11
  • Version of signac: 2.1.0
  • Version of signac-flow: 4d4f144

Solution

Fixing this requires passing information on ranks and cpus per rank separately into the submit templates. It also requires that jobs using MPI are homogeneous (there is unused code in flow to verify this).

I would prefer not to hack in a quick fix for this bug. The current flow directives (np, ntasks, ngpus, memory, etc.) map to the actual job request only in a very indirect way. I would strongly prefer a solution that introduces new directives giving the user a more direct route to setting ntasks, cpus-per-task, gpus-per-task, and mem-per-cpu. I am considering possible schemas for directives to achieve this without breaking existing jobs too much. Users will need a way to indicate whether to launch jobs with the parallel runner (e.g. mpirun) or not. For complete generality, we should also allow jobs to request more than one node and yet still launch without mpirun (e.g. when users are running a tool that uses some other library for internode communication).

@joaander joaander self-assigned this Oct 30, 2023
@b-butler
Member

There are definitely bugs and landmines in the submission logic. We would need to brainstorm and come up with the minimum number of settings that support all the workflows we claim to support. If we introduce changes to the directives schema, we also need to decide which options to support (e.g. whether to support both memory and memory-per-cpu, since both are options on schedulers).

Part of the reason (not a defense, just an explanation of design intent) for the directives schema was to provide a simple set of options that we would handle appropriately. The other major reason is that it was developed piecemeal as needed.

This particular bug was introduced in #722, my bad. We moved to --ntasks to avoid submission errors with --cpus-per-task for heterogeneous submissions.

@joaander
Member Author

I understand; setting these per rank is incompatible with heterogeneous jobs, and they would need to be disallowed. I don't know if it is even possible to craft a single SLURM job submission that requests different numbers of cores for different ranks.

My problem with the "simple" set of options is that it is difficult to predict what simple inputs will give the correct complex output. For example, if we attempted to discover the intended mem_per_rank as memory // nrank (after checking that jobs are homogeneous), the user has no feedback on how that calculation was done and no intuition as to why some requests will issue --memory and others will issue --mem-per-rank.

We could potentially support both aggregate (np, nranks, ngpus, memory) and per-rank (nranks, cpus_per_rank, mem_per_rank, gpus_per_rank) directives (they would be mutually exclusive). This unfortunately makes the template scripts more complex. There may be better options.
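
A minimal sketch of how that mutual exclusivity could be validated (the directive names are the hypothetical ones above, not an existing flow API; nranks is shared between the two sets):

AGGREGATE_ONLY = {"np", "ngpus", "memory"}
PER_RANK_ONLY = {"cpus_per_rank", "mem_per_rank", "gpus_per_rank"}

def check_directives(directives):
    """Reject directive sets that mix aggregate and per-rank requests."""
    keys = set(directives)
    aggregate = keys & AGGREGATE_ONLY
    per_rank = keys & PER_RANK_ONLY
    if aggregate and per_rank:
        raise ValueError(
            f"Aggregate directives {sorted(aggregate)} cannot be combined "
            f"with per-rank directives {sorted(per_rank)}."
        )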

@joaander
Member Author

signac has long promised effortless bundling and parallel job execution (see e.g. #270, glotzerlab/signac-docs#157). In practice, this has only worked on a small subset of systems (none of which are in active use by the glotzer group). Many users assume that it works and don't bother checking whether their jobs are using all the resources they requested. I think in this refactor, the submit environments should disallow --parallel by default and enable it only after testing proves that it is both possible and robust. In some cases, --bundle --parallel fails to work even on single-node jobs. On almost all systems, it fails for multi-node jobs. srun promises a working solution, but I have yet to find a cluster where it functions as documented.

I would summarize the possible use-cases as:

  • shell launched (limited to a single node)
    • single operation
    • bundled operations
      • homogeneous
        • serial
        • parallel
      • heterogeneous
        • serial
        • parallel
  • MPI driven (or other launcher) (single node or multi-node)
    • single operation
    • bundled operations
      • homogeneous
        • serial
        • parallel
      • heterogeneous
        • serial
        • parallel

For single-node jobs, np, ngpu, and memory work well enough. Single-node parallel bundles can be enabled on most systems after verifying that the system doesn't implicitly bind each task to 1 CPU core (or we find a way to disable that binding). I propose that the new schema keep np, ngpu, and memory, but error whenever more than one node is requested. This mode would be activated when nrank is None.

example:

# directives = dict(np=8, ngpu=2, memory='4g')

#SBATCH --ntasks 1
#SBATCH --cpus-per-task 8
#SBATCH --gpus=2
#SBATCH --mem=4g

MPI (or another launcher) is required to make multi-node jobs possible (MPI also works on a single node). Parallel bundles will almost always be disabled in this case, unless extensive testing proves that srun can effectively launch bundled jobs. I don't think heterogeneous bundles should be a target use-case for flow. SLURM can efficiently handle many cluster jobs with different requirements in parallel; let it do that work! For MPI (or other launcher) driven jobs, when nrank is not None, I propose ignoring np, ngpu, and memory (we can error if they are not None). Instead, flow would use cpus_per_rank, gpus_per_rank, and memory_per_cpu:

example:

# directives = dict(nrank=32, cpus_per_rank=2, gpus_per_rank=4, memory_per_cpu='2g')

#SBATCH --ntasks 32
#SBATCH --cpus-per-task 2
#SBATCH --gpus-per-task 4
#SBATCH --mem-per-cpu 2g
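
A minimal sketch of how a submit template could branch between the two proposed schemas (a hypothetical helper using the directive names from the two examples above, not flow's current template code):

def sbatch_options(directives):
    """Translate the proposed directives into SBATCH options (illustrative only)."""
    if directives.get("nrank") is None:
        # shell-launched mode: aggregate request, limited to a single node
        return [
            "--ntasks 1",
            f"--cpus-per-task {directives['np']}",
            f"--gpus={directives['ngpu']}",
            f"--mem={directives['memory']}",
        ]
    # MPI (or other launcher) mode: per-rank request
    return [
        f"--ntasks {directives['nrank']}",
        f"--cpus-per-task {directives['cpus_per_rank']}",
        f"--gpus-per-task {directives['gpus_per_rank']}",
        f"--mem-per-cpu {directives['memory_per_cpu']}",
    ]

# sbatch_options(dict(nrank=32, cpus_per_rank=2, gpus_per_rank=4, memory_per_cpu='2g'))
# -> ['--ntasks 32', '--cpus-per-task 2', '--gpus-per-task 4', '--mem-per-cpu 2g']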

This provides simple and scalable syntax well-adapted to the two different classes of use-cases. It enables:

  • shell launched (limited to a single node)
    • single operation
    • bundled operations
      • homogeneous
        • serial
        • parallel (on tested and supported systems)
      • heterogeneous
        • serial
        • parallel (on tested and supported systems)
  • MPI driven (or other launcher) (single node or multi-node)
    • single operation
    • bundled operations
      • homogeneous
        • serial
        • parallel (very rarely, only on very well tested systems - users should use aggregation instead)

The only case removed (which is essentially non-functional now, except possibly on stampede2) is heterogeneous MPI-driven jobs.

Any thoughts or suggestions?

One of the pain points I have with the current system is that setting nrank implicitly enables the mpirun command. It would be nice if this were more explicit, but is this too redundant?

directives_mpi = dict(launcher='mpi', nrank=32, cpus_per_rank=2, gpus_per_rank=4, memory_per_cpu='2g')

@b-butler
Member

b-butler commented Oct 31, 2023

So the resource logic would look something like

           |----error_on_bad_request----- MPI
directives-|
           |----error_on_bad_request-----non-MPI

Also, when would it make sense to have nranks not launch MPI? Or are you suggesting multiple launchers could be targeted with launcher="mpiexec"?

Also for clarity, the proposed changes are:

  1. For bundling
  • Disable --parallel by default
  • Add some flag (probably per environment) that specifies whether --parallel bundling works
  • Always disable --parallel with multi-node submissions (barring extensive testing)
  2. For directives
  • Create two new paths (with some overlapping directives but many exclusive) for resource requests (MPI vs. non-MPI)
  • Possibly add support for launchers other than the environment default

Am I understanding correctly @joaander?

@joaander
Member Author

Yes, that is my current proposal. Happy to discuss in person and refine these ideas.

I wasn't thinking that the launcher would override mpirun / srun usage - we set that per environment after testing. mpirun and mpiexec are synonyms. srun is a little more general, though we currently use it only to launch MPI applications.

When I think about more general support for other launchers, I mean something like charmrun (https://charm.readthedocs.io/en/latest/quickstart.html) or the equivalent for NCCL in non-MPI mode. We aren't currently supporting this directly, but I don't want to design something completely MPI specific to allow for future expansion.

The term "rank" already assumes MPI. As long as the word is rank, then it does not make sense to have nrank not None and launch something other than MPI.

I'd be happy to use a different nomenclature. SLURM uses the work task which is roughly equivalent to a process. As an alternate proposal (one that breaks more user scripts) we could have a more unified approach to resource selection and put the gates on the user only based on the type of launcher they select (i.e. because the shell launcher doesn't know how to distribute tasks beyond one node).

directives_shell = dict(launcher='shell', ntasks=1, cpus_per_task=4, memory_per_cpu='4g')
directives_trivial = dict(ntasks=1) # implies launcher='shell'
directives_mpi = dict(launcher='mpi', ntasks=64, cpus_per_task=4, gpus_per_task=1, memory_per_cpu='4g')

In this case, I mean that launcher='shell' directly invokes python from the controlling shell, while launcher='mpi' invokes the environment-configured MPI launcher (mpirun, srun, jsrun, ibrun, ...).
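
A minimal sketch of that dispatch, assuming the hypothetical launcher and ntasks directives above and a per-environment MPI command (not flow's current API):

def command_prefix(directives, mpi_cmd="mpirun"):
    """Build the prefix used to launch an operation's command (illustrative only)."""
    launcher = directives.get("launcher", "shell")
    if launcher == "shell":
        # run directly from the controlling shell; no multi-node support
        return ""
    if launcher == "mpi":
        # mpi_cmd would come from the environment (mpirun, srun, jsrun, ibrun, ...)
        return f"{mpi_cmd} -n {directives['ntasks']} "
    raise ValueError(f"unknown launcher: {launcher}")

# command_prefix(dict(launcher='mpi', ntasks=64))  -> "mpirun -n 64 "
# command_prefix(dict(ntasks=1))                   -> ""  (implies launcher='shell')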

@joaander joaander removed their assignment Nov 2, 2023
@joaander
Member Author

Duplicate of #785.

@joaander closed this as not planned (duplicate) May 28, 2024