Salloc + Slurm on Toss 4 (rzwhippet) Fails #133
Comments
This seems to be a pretty clear bug in how Slurm is setting up binding when you specify cpus-per-task (rzwhippet is running a newer version of Slurm than rzgenie). I'll report it to the Slurm developers so that we can get it fixed. It looks like you can work around this by setting any valid '--cpu-bind' argument. E.g. here I set SLURM_CPU_BIND=quiet (which is the default cpu-bind behavior), and it rescues my simple reproducer.
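A minimal sketch of that workaround, assuming it is applied to the one-line salloc/srun launch from the report. The bare `atswrapper` invocation (no arguments) and the check for `salloc` on the PATH are illustrative assumptions, not part of the original reproducer.

```shell
#!/bin/sh
# Workaround sketch: export a valid cpu-bind value so every srun inherits it.
# SLURM_CPU_BIND is the environment-variable form of srun's --cpu-bind option.
export SLURM_CPU_BIND=quiet

# Assumed launch line: the salloc/srun command from the report, with
# 'atswrapper' (arguments omitted) as the program run on the allocated node.
LAUNCH='salloc -N 3 -p pdebug --exclusive srun -n 1 atswrapper'

if command -v salloc >/dev/null 2>&1; then
    # Slurm is available: run the launch line with the variable exported.
    $LAUNCH
else
    # No Slurm on this machine: only show what would be run.
    echo "salloc not found; would run: SLURM_CPU_BIND=$SLURM_CPU_BIND $LAUNCH"
fi
```

Exporting the variable, rather than passing `--cpu-bind` on a single srun, means the follow-up srun commands started by the wrapper should pick it up as well.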
Resolved, just have to teach folks to use --interactive after srun:
salloc -N --exclusive srun --interactive -n 1
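As a hedged illustration, here is the same pattern with the node count and partition filled in from the original report. The values `3` and `pdebug`, and the argument-free `atswrapper` call, are assumptions for the example and are not part of the resolution comment itself.

```shell
#!/bin/sh
# Sketch of the resolved form: add --interactive to the outer srun.
# Node count (3) and partition (pdebug) are borrowed from the original report;
# 'atswrapper' arguments are omitted.
CMD='salloc -N 3 -p pdebug --exclusive srun --interactive -n 1 atswrapper'

if command -v salloc >/dev/null 2>&1; then
    # Slurm is available: launch the wrapper on one of the allocated nodes.
    $CMD
else
    # No Slurm on this machine: only show what would be run.
    echo "salloc not found; would run: $CMD"
fi
```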
On Toss 3, one could run like so:
salloc -N 3 -p pdebug --exclusive srun -n 1
And that will run the atswrapper on 1 of the allocated nodes, which would then run 'srun -n 1' commands on that node to submit all the jobs.
The benefit of this is that, while 'atswrapper' is not an MPI application, it prevents the followup srun jobs, submitted by atswrapper, from running on the login node.
This works on toss3.
But on toss4 (rzwhippet) the followup srun jobs all fail with:
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000000000003000000000000070000000000000300000000000007.
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000000000003000000000000070000000000000300000000000007.
srun: error: Task launch for StepId=1932.2 failed on node rzwhippet40: Unable to satisfy cpu bind request
srun: error: Task launch for StepId=1932.2 failed on node rzwhippet41: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
Now, one can run like so:
salloc -N 3 -p pdebug --exclusive
and while that runs, it does the 'srun's on the login node, which looks bad.
Or one can run by splitting that into two steps.
But combining the salloc ... srun into one line has issues now; it did not with toss3.