
Salloc + Slurm on Toss 4 (rzwhippet) Fails #133

Closed

dawson6 opened this issue May 8, 2023 · 2 comments


dawson6 commented May 8, 2023

On Toss 3, one could run like so:

salloc -N 3 -p pdebug --exclusive srun -n 1

That runs the atswrapper on one of the allocated nodes, and the wrapper then runs 'srun -n 1' commands on that node to submit all of the jobs.

The benefit of this is that, even though 'atswrapper' is not an MPI application, it keeps the follow-up srun jobs submitted by atswrapper from running on the login node.

This works on Toss 3.
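For context, a minimal sketch of that wrapper pattern (the test names below are placeholders, not part of atswrapper itself):

#!/bin/sh
# Hypothetical sketch: the wrapper runs as a single task inside the allocation
# and launches each test as its own job step on the same node.
srun -n 1 ./test_a &
srun -n 1 ./test_b &
wait    # wait for all job steps to finish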

But on toss4 (rzwhippet) the followup srun jobs all fail with:

srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000000000003000000000000070000000000000300000000000007.
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000000000003000000000000070000000000000300000000000007.
srun: error: Task launch for StepId=1932.2 failed on node rzwhippet40: Unable to satisfy cpu bind request
srun: error: Task launch for StepId=1932.2 failed on node rzwhippet41: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request

Now, one can run like so:

salloc -N 3 -p pdebug --exclusive

and while that runs, it launches the 'srun's from the login node, which looks bad.

Or one can split the run into two steps (see the sketch below):

  1. salloc the nodes somehow
  2. run atswrapper inside that allocation

But combining the salloc ... srun into one line has issues now; it did not with Toss 3.
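One way to read that two-step workaround (the exact commands here are an assumption about the intended workflow):

# 1) allocate the nodes, then get an interactive shell on one of them
salloc -N 3 -p pdebug --exclusive
srun -N 1 -n 1 --pty bash    # lands on an allocated compute node
# 2) from that compute-node shell, run the wrapper
./atswrapper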

@ryanday36

This seems to be a pretty clear bug in how Slurm is setting up binding when you specify cpus-per-task (rzwhippet is running a newer version of Slurm than rzgenie). I'll report it to the Slurm developers so that we can get it fixed.

It looks like you can work around this by setting any valid '--cpu-bind' argument. E.g. here I set SLURM_CPU_BIND=quiet (which is the default cpu-bind behavior), and it rescues my simple reproducer.

[day36@rzwhippet17:salloc_test]$ cat runstuff.sh 
#!/bin/sh
echo "#works"
SLURM_CPU_BIND=quiet srun --mpibind=off --nodes=1 --ntasks=1 --cpus-per-task=1 hostname
echo ""
echo "#fails"
srun --mpibind=off --nodes=1 --ntasks=1 --cpus-per-task=1 hostname

[day36@rzwhippet17:salloc_test]$ salloc -N2 --exclusive srun -N1 -n1 ./runstuff.sh 
salloc: Granted job allocation 2365
salloc: Waiting for resource configuration
salloc: Nodes rzwhippet[16,22] are ready for job
#works
rzwhippet22

#fails
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000000000000000000000000010000000000000000000000000001.
srun: error: Task launch for StepId=2365.2 failed on node rzwhippet22: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted
srun: error: rzwhippet16: task 0: Exited with exit code 192
salloc: Relinquishing job allocation 2365
[day36@rzwhippet17:salloc_test]$


dawson6 commented Aug 16, 2023

Resolved; we just have to teach folks to use --interactive after srun:

salloc -N --exclusive srun --interactive -n 1
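Presumably the original Toss 3 command line then maps onto the resolved form like this (treating the mapping as an assumption; node count and partition are carried over from the original report):

salloc -N 3 -p pdebug --exclusive srun --interactive -n 1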

dawson6 closed this as completed Aug 16, 2023