slurmstepd not killing auks process #24
Comments
We have the same problem (SLURM 17.11.5). Thanks for sharing your knowledge, it saved me a lot of stress. Your proposed solution seems to work fine (at least as a workaround). |
I agree that this is due to the change in signal handling in Slurm. slurmstepd now blocks more signals than before, including SIGTERM, and only unblocks them just before starting user tasks. Every process forked by SPANK plugins in the privileged per-task hook inherits the blocked signal mask of its parent, so a SIGTERM sent to it stays pending and is never delivered. I do not understand exactly why this change was introduced in Slurm. You should file a bug in the SchedMD Bugzilla to get more information. Sending SIGKILL as you suggest is a valid workaround and could be a correct replacement. You can send a pull request for that change and I will include it directly if you want. We are still on 17.11.2 (with backports of important patches), so we have not yet encountered it. Thanks for the warning and the detailed information. |
How about resetting the blocked signal mask to none just before the call to fork?
Seems to work. |
This seems like the cleaner solution. Unfortunately I don't have a test environment to test it in. |
I have tested the patch by @kenshin33 and it is working fine. |
@kenshin33 can you make a pull request for your patch ? |
sure thing, will do later today.
On Wed, Oct 24, 2018 at 4:55 PM hautreux wrote:
> @kenshin33 can you make a pull request for your patch ?
|
I would like to have a version bump for this bug fix. This helps to have a clean update process through an RPM/deb repository. I added this in the new PR #33. PS: thanks for the bug fix! |
With recent versions of SLURM (17.11.3 and later, including 18.08.0-0pre1), the auks -R loop process (on the compute node) doesn't seem to terminate properly. This causes slurmctld to believe that a process is still running and leaves the resources allocated. The job goes into a completing state but never actually completes. I believe, but am not 100% certain, that this is related to a change in signal blocking/processing in slurmstepd.c introduced in 17.11.3 per this commit:
-- Make sure the slurmstepd blocks signals like SIGTERM correctly.
In slurm-spank-auks.c, changing:
kill(renewer_pid, SIGTERM);
to:
kill(renewer_pid, SIGKILL);
seems to result in the expected behavior (i.e., auks -R loop exits when the job completes naturally or is scanceled and the resources are freed.) I'm not sure if this is really the best way to go about it though.
Mark