
Incorrect SLURM scripts produced with omp_num_threads>1. #777

Closed
joaander opened this issue Oct 30, 2023 · 6 comments
Comments

@joaander
Member

Description

flow incorrectly computes np = nranks * omp_num_threads and then sets --ntasks=np in the SLURM submission script.
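
A minimal sketch of the mismatch, using the directive values from the reproducer below (illustrative Python only, not flow's actual template code):

nranks = 4
omp_num_threads = 127

# current behavior: the product of ranks and threads is written to --ntasks
ntasks_current = nranks * omp_num_threads  # --ntasks=508

# desired behavior: --ntasks counts MPI ranks, --cpus-per-task counts threads per rank
ntasks_desired = nranks                    # --ntasks=4
cpus_per_task_desired = omp_num_threads    # --cpus-per-task=127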

To reproduce

$ flow test-workflow -g 4 -c 128 -n 1

Then add:

@Project.post.true("hybrid_mpi_omp")
@Project.operation(directives={"nranks": 4, "omp_num_threads": 127})
def hybrid_mpi_omp(job):
    pass

to project.py

$ python project.py submit --pretend -o hybrid_mpi_omp --partition wholenode

Error output

The resulting job script is:

#SBATCH --job-name="Project/b49788b9076312db79994d96758fdcf8/hybrid_mpi_o/6bb6c79496f9cc4858a0c6dd166318b6"
#SBATCH --partition=wholenode
#SBATCH --ntasks=508
...
export OMP_NUM_THREADS=127; mpirun -n 4 ...

A job definition like this does not provide the desired behavior. Tested interactively:

salloc -Admr140129 --ntasks=508 --partition wholenode
...
a937:~/devel/signac-flow ▶ export OMP_NUM_THREADS=127; mpirun -n 4 hostname
a937.anvil.rcac.purdue.edu
a937.anvil.rcac.purdue.edu
a937.anvil.rcac.purdue.edu
a937.anvil.rcac.purdue.edu

Notice how all ranks are run on the same node. Each rank should be on a separate node.

Correct job definition:

salloc -Admr140129 --ntasks=4  --cpus-per-task=127 --partition wholenode -t 1:00:00
...
a937:~/devel/signac-flow ▶ mpirun -n 4 hostname
a937.anvil.rcac.purdue.edu
a965.anvil.rcac.purdue.edu
a966.anvil.rcac.purdue.edu
a971.anvil.rcac.purdue.edu

When ntasks is the number of MPI tasks and --cpus-per-task is set appropriately, the 4 MPI ranks are assigned to different nodes: each rank reserves 127 of the 128 cores on a node in the wholenode partition, so only one rank fits per node.

System configuration

  • Operating System: Linux
  • Version of Python: 3.11
  • Version of signac: 2.1.0
  • Version of signac-flow: 4d4f144

Solution

Fixing this requires passing information on ranks and cpus per rank separately into the submit templates. It also requires that jobs using MPI are homogeneous (there is unused code in flow to verify this).

I would prefer not to hack in a quick fix for this bug. The current flow directives (np, ntasks, ngpus, memory, etc.) map to the actual job request only in a very indirect way. I would strongly prefer a solution that introduces new directives giving the user a more direct route to setting ntasks, cpus-per-task, gpus-per-task, and mem-per-cpu. I am considering possible schemas for directives to achieve this without breaking existing jobs too much. Users will need a way to indicate whether to launch jobs with the parallel runner (e.g. mpirun) or not. For complete generality, we should also allow jobs to request more than one node and yet still launch without mpirun (e.g. when users are running a tool that uses some other library for internode communication).

@joaander joaander self-assigned this Oct 30, 2023
@b-butler
Member

There are definitely bugs and landmines in the submission logic. We would need to brainstorm and come up with the minimum number of settings that support all the workflows we claim to support. If we introduce changes to the directives schema, we also need to decide which options to support (e.g. whether to support both memory and memory-per-cpu, since both are options on schedulers).

Part of the reason (not a defense, just an explanation of design intent) for the directives schema was to provide a simple set of options that we would handle appropriately. The other major reason is that it was developed piecemeal as needed.

This particular bug was introduced in #722, my bad. We moved to --ntasks to avoid submission errors with --cpus-per-task for heterogeneous submissions.

@joaander
Member Author

I understand; setting these per rank is incompatible with heterogeneous jobs, and they would need to be disallowed. I don't know if it is even possible to craft a single SLURM job submission that requests different numbers of cores for different ranks.

My problem with the "simple" set of options is that it is difficult to predict what simple inputs will give the correct complex output. For example, if we attempted to discover the intended mem_per_rank as memory // nrank (after checking that jobs are homogeneous), the user has no feedback on how that calculation was done and no intuition as to why some requests will issue --memory and others will issue --mem-per-rank.

We could potentially support both aggregate (np, nranks, ngpus, memory) and per-rank (nranks, cpus_per_rank, mem_per_rank, gpus_per_rank) directives (they would be mutually exclusive). This unfortunately makes the template scripts more complex. There may be better options.
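
A minimal sketch of how that mutual exclusivity could be validated (the directive names are the hypothetical ones above, not an existing flow API; nranks is shared between the two sets):

AGGREGATE_ONLY = {"np", "ngpus", "memory"}
PER_RANK_ONLY = {"cpus_per_rank", "mem_per_rank", "gpus_per_rank"}

def check_directives(directives):
    """Reject directive sets that mix aggregate and per-rank requests."""
    keys = set(directives)
    aggregate = keys & AGGREGATE_ONLY
    per_rank = keys & PER_RANK_ONLY
    if aggregate and per_rank:
        raise ValueError(
            f"Aggregate directives {sorted(aggregate)} cannot be combined "
            f"with per-rank directives {sorted(per_rank)}."
        )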

@joaander
Member Author

signac has long promised effortless bundling and parallel job execution (see e.g. #270, glotzerlab/signac-docs#157). In practice, this has only worked on a small subset of systems (none of which are in active use by the glotzer group). Many users assume that it works and don't bother checking whether their jobs are using all the resources they requested. I think in this refactor, the submit environments should disallow --parallel by default and enable it only after testing proves that it is both possible and robust. In some cases, --bundle --parallel fails to work even on single-node jobs. On almost all systems, it fails for multi-node jobs. srun promises a working solution, but I have yet to find a cluster where it functions as documented.

I would summarize the possible use-cases as:

  • shell launched (limited to a single node)
    • single operation
    • bundled operations
      • homogeneous
        • serial
        • parallel
      • heterogeneous
        • serial
        • parallel
  • MPI driven (or other launcher) (single node or multi-node)
    • single operation
    • bundled operations
      • homogeneous
        • serial
        • parallel
      • heterogeneous
        • serial
        • parallel

For single-node jobs, np, ngpu, and memory work well enough. Single-node parallel bundles can be enabled on most systems after verifying that the system doesn't implicitly bind each task to 1 CPU core (or we find a way to disable that binding). I propose that the new schema keep np, ngpu, and memory, but error whenever more than one node is requested. This mode would be activated when nrank is None.

example:

# directives = dict(np=8, ngpu=2, memory='4g')

#SBATCH --ntasks 1
#SBATCH --cpus-per-task 8
#SBATCH --gpus=2
#SBATCH --mem=4g

MPI (or another launcher) is required to make multi-node jobs possible (MPI also works on a single node). Parallel bundles will almost always be disabled in this case, unless extensive testing proves that srun can effectively launch bundled jobs. I don't think heterogeneous bundles should be a target use-case for flow. SLURM can efficiently handle many cluster jobs with different requirements in parallel; let it do that work! For MPI (or other launcher) driven jobs, when nrank is not None, I propose ignoring np, ngpu, and memory (we can error if they are not None). Instead, flow would use cpus_per_rank, gpus_per_rank, and memory_per_cpu:

example:

# directives = dict(nrank=32, cpus_per_rank=2, gpus_per_rank=4, memory_per_cpu='2g')

#SBATCH --ntasks 32
#SBATCH --cpus-per-task 2
#SBATCH --gpus-per-task 4
#SBATCH --mem-per-cpu 2g
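
A minimal sketch of how a submit template could branch between the two proposed schemas (a hypothetical helper using the directive names from the two examples above, not flow's current template code):

def sbatch_options(directives):
    """Translate the proposed directives into SBATCH options (illustrative only)."""
    if directives.get("nrank") is None:
        # shell-launched mode: aggregate request, limited to a single node
        return [
            "--ntasks 1",
            f"--cpus-per-task {directives['np']}",
            f"--gpus={directives['ngpu']}",
            f"--mem={directives['memory']}",
        ]
    # MPI (or other launcher) mode: per-rank request
    return [
        f"--ntasks {directives['nrank']}",
        f"--cpus-per-task {directives['cpus_per_rank']}",
        f"--gpus-per-task {directives['gpus_per_rank']}",
        f"--mem-per-cpu {directives['memory_per_cpu']}",
    ]

# sbatch_options(dict(nrank=32, cpus_per_rank=2, gpus_per_rank=4, memory_per_cpu='2g'))
# -> ['--ntasks 32', '--cpus-per-task 2', '--gpus-per-task 4', '--mem-per-cpu 2g']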

This provides simple and scalable syntax well-adapted to the two different classes of use-cases. It enables:

  • shell launched (limited to a single node)
    • single operation
    • bundled operations
      • homogeneous
        • serial
        • parallel (on tested and supported systems)
      • heterogeneous
        • serial
        • parallel (on tested and supported systems)
  • MPI driven (or other launcher) (single node or multi-node)
    • single operation
    • bundled operations
      • homogeneous
        • serial
        • parallel (very rarely, only on very well tested systems - users should use aggregation instead)

The only case removed (which is essentially non-functional now, except possibly on stampede2) is heterogeneous MPI-driven jobs.

Any thoughts or suggestions?

One of the pain points I have with the current system is that setting nrank implicitly enables the mpirun command. It would be nice if this were more explicit, but is this too redundant?

directives_mpi = dict(launcher='mpi', nrank=32, cpus_per_rank=2, gpus_per_rank=4, memory_per_cpu='2g')

@b-butler
Member

b-butler commented Oct 31, 2023

So the resource logic would look something like

           |----error_on_bad_request----- MPI
directives-|
           |----error_on_bad_request-----non-MPI

Also, when would it make sense to have nranks not launch MPI? Or are you suggesting multiple launchers could be targeted with launcher="mpiexec"?

Also for clarity, the proposed changes are:

  1. For bundling
  • Disable --parallel by default
  • Add some flag (probably per environment) that specifies whether --parallel bundling works
  • Always disable --parallel with multi-node submissions (barring extensive testing)
  2. For directives
  • Create two new paths (with some overlapping directives but many exclusive) for resource requests (MPI vs. non-MPI)
  • Possibly add support for launchers other than the environment default

Am I understanding correctly @joaander?

@joaander
Member Author

Yes, that is my current proposal. Happy to discuss in person and refine these ideas.

I wasn't thinking that the launcher would override mpirun / srun usage - we set that per environment after testing. mpirun and mpiexec are synonyms. srun is a little more general, though we currently use it only to launch MPI applications.

When I think about more general support for other launchers, I mean something like charmrun (https://charm.readthedocs.io/en/latest/quickstart.html) or the equivalent for NCCL in non-MPI mode. We aren't currently supporting this directly, but I don't want to design something completely MPI specific to allow for future expansion.

The term "rank" already assumes MPI. As long as the word is rank, then it does not make sense to have nrank not None and launch something other than MPI.

I'd be happy to use a different nomenclature. SLURM uses the work task which is roughly equivalent to a process. As an alternate proposal (one that breaks more user scripts) we could have a more unified approach to resource selection and put the gates on the user only based on the type of launcher they select (i.e. because the shell launcher doesn't know how to distribute tasks beyond one node).

directives_shell = dict(launcher='shell', ntasks=1, cpus_per_task=4, memory_per_cpu='4g')
directives_trivial = dict(ntasks=1) # implies launcher='shell'
directives_mpi = dict(launcher='mpi', ntasks=64, cpus_per_task=4, gpus_per_task=1, memory_per_cpu='4g')

In this case, I mean that launcher='shell' directly invokes python from the controlling shell, while launcher='mpi' invokes the environment-configured MPI launcher (mpirun, srun, jsrun, ibrun, ...).
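
A minimal sketch of that dispatch, assuming the hypothetical launcher and ntasks directives above and a per-environment MPI command (not flow's current API):

def command_prefix(directives, mpi_cmd="mpirun"):
    """Build the prefix used to launch an operation's command (illustrative only)."""
    launcher = directives.get("launcher", "shell")
    if launcher == "shell":
        # run directly from the controlling shell; no multi-node support
        return ""
    if launcher == "mpi":
        # mpi_cmd would come from the environment (mpirun, srun, jsrun, ibrun, ...)
        return f"{mpi_cmd} -n {directives['ntasks']} "
    raise ValueError(f"unknown launcher: {launcher}")

# command_prefix(dict(launcher='mpi', ntasks=64))  -> "mpirun -n 64 "
# command_prefix(dict(ntasks=1))                   -> ""  (implies launcher='shell')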

@joaander joaander removed their assignment Nov 2, 2023
@joaander
Member Author

Duplicate of #785.

@joaander closed this as not planned (duplicate) May 28, 2024