
Salloc + Slurm on Toss 4 (rzwhippet) Fails #133

Closed

dawson6 opened this issue May 8, 2023 · 2 comments


dawson6 commented May 8, 2023

On Toss 3, one could run like so:

salloc -N 3 -p pdebug --exclusive srun -n 1

That runs the atswrapper on one of the allocated nodes, and the wrapper then runs 'srun -n 1' commands on that node to submit all of the jobs.

The benefit of this is that, even though 'atswrapper' is not an MPI application, it keeps the follow-up srun jobs submitted by atswrapper from running on the login node.

This works on Toss 3.
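For context, a minimal sketch of that wrapper pattern (the test names below are placeholders, not part of atswrapper itself):

#!/bin/sh
# Hypothetical sketch: the wrapper runs as a single task inside the allocation
# and launches each test as its own job step on the same node.
srun -n 1 ./test_a &
srun -n 1 ./test_b &
wait    # wait for all job steps to finish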

But on toss4 (rzwhippet) the followup srun jobs all fail with:

srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000000000003000000000000070000000000000300000000000007.
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000000000003000000000000070000000000000300000000000007.
srun: error: Task launch for StepId=1932.2 failed on node rzwhippet40: Unable to satisfy cpu bind request
srun: error: Task launch for StepId=1932.2 failed on node rzwhippet41: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request

Now, one can run like so:

salloc -N 3 -p pdebug --exclusive

and while that runs, it launches the 'srun's from the login node, which looks bad.

Or one can split the run into two steps (see the sketch below):

  1. salloc the nodes somehow
  2. run atswrapper inside that allocation

But combining the salloc ... srun into one line has issues now; it did not with Toss 3.
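One way to read that two-step workaround (the exact commands here are an assumption about the intended workflow):

# 1) allocate the nodes, then get an interactive shell on one of them
salloc -N 3 -p pdebug --exclusive
srun -N 1 -n 1 --pty bash    # lands on an allocated compute node
# 2) from that compute-node shell, run the wrapper
./atswrapper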

@ryanday36

This seems to be a pretty clear bug in how Slurm is setting up binding when you specify cpus-per-task (rzwhippet is running a newer version of Slurm than rzgenie). I'll report it to the Slurm developers so that we can get it fixed.

It looks like you can work around this by setting any valid '--cpu-bind' argument. E.g. here I set SLURM_CPU_BIND=quiet (which is the default cpu-bind behavior), and it rescues my simple reproducer.

[day36@rzwhippet17:salloc_test]$ cat runstuff.sh 
#!/bin/sh
echo "#works"
SLURM_CPU_BIND=quiet srun --mpibind=off --nodes=1 --ntasks=1 --cpus-per-task=1 hostname
echo ""
echo "#fails"
srun --mpibind=off --nodes=1 --ntasks=1 --cpus-per-task=1 hostname

[day36@rzwhippet17:salloc_test]$ salloc -N2 --exclusive srun -N1 -n1 ./runstuff.sh 
salloc: Granted job allocation 2365
salloc: Waiting for resource configuration
salloc: Nodes rzwhippet[16,22] are ready for job
#works
rzwhippet22

#fails
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000000000000000000000000010000000000000000000000000001.
srun: error: Task launch for StepId=2365.2 failed on node rzwhippet22: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted
srun: error: rzwhippet16: task 0: Exited with exit code 192
salloc: Relinquishing job allocation 2365
[day36@rzwhippet17:salloc_test]$


dawson6 commented Aug 16, 2023

Resolved; we just have to teach folks to use --interactive after srun:

salloc -N --exclusive srun --interactive -n 1
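Presumably the original Toss 3 command line then maps onto the resolved form like this (treating the mapping as an assumption; node count and partition are carried over from the original report):

salloc -N 3 -p pdebug --exclusive srun --interactive -n 1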

dawson6 closed this as completed Aug 16, 2023