
slurmstepd not killing auks process #24

Closed
astrotazer opened this issue Mar 9, 2018 · 8 comments

Comments

@astrotazer

With recent versions of Slurm (17.11.3 and later, including 18.08.0-0pre1), the auks -R loop process (on the compute node) doesn't seem to terminate properly. This causes slurmctld to believe that a process is still running and leaves the resources allocated. The job goes into a completing state but never actually completes. I believe, but am not 100% certain, that this is related to a change in signal blocking/processing in slurmstepd.c introduced in 17.11.3 per this commit:

-- Make sure the slurmstepd blocks signals like SIGTERM correctly.

In slurm-spank-auks.c, changing:

kill(renewer_pid, SIGTERM);

to:

kill(renewer_pid, SIGKILL);

seems to result in the expected behavior (i.e., the auks -R loop exits when the job completes naturally or is scanceled, and the resources are freed). I'm not sure this is really the best way to go about it, though.

Mark

@robberteggermont
Contributor

We have the same problem (SLURM 17.11.5). Thanks for sharing your knowledge, it saved me a lot of stress. Your proposed solution seems to work fine (at least as a workaround).

@hautreux
Owner

hautreux commented Mar 27, 2018

I agree that this is due to the modification in signal handling in Slurm.

slurmstepd now blocks more signals than before, including SIGTERM, and only unblocks them prior to starting user tasks. Every process forked by spank plugins in the privileged per task hook inherits the blocked signal masks of its parent and thus ignores SIGTERM too.

I do not exactly understand why this modification was introduced in Slurm. You should file a bug in SchedMD's Bugzilla to get more information.

Sending SIGKILL as you suggest is a valid work-around and could be a correct replacement. You can send a pull request for that change and I will include it directly if you want.

We are still on 17.11.2 (with backports of important patches), so we have not yet encountered it. Thanks for the warning and the detailed information.

@kenshin33
Contributor

How about resetting the blocked signal mask to empty in the forked child, just before the exec?

```diff
--- a/src/plugins/slurm/slurm-spank-auks.c
+++ b/src/plugins/slurm/slurm-spank-auks.c
@@ -286,6 +286,9 @@
 		xerror("unable to launch renewer process");
 	}
 	else if ( renewer_pid == 0 ) {
+		sigset_t mask;
+		sigemptyset(&mask);
+		sigprocmask(SIG_SETMASK, &mask, NULL);
 		char *argv[4];
 		argv[0]= BINDIR "/auks" ;
 		argv[1]="-R";argv[2]="loop";
```

seems to work.

@robberteggermont
Contributor

This seems like the cleaner solution. Unfortunately I don't have a test environment to test it in.

@LaHaine

LaHaine commented Aug 21, 2018

I have tested the patch by @kenshin33 and it is working fine.

@hautreux
Owner

@kenshin33 can you make a pull request for your patch?

@kenshin33
Contributor

kenshin33 commented Oct 25, 2018 via email

c-mertes added a commit to c-mertes/auks that referenced this issue Apr 1, 2019
@c-mertes

c-mertes commented Apr 1, 2019

I would like to have a version bump for this bug fix. This helps keep the update process clean through an RPM/deb repository.

I added this in the new PR #33

PS: and thanks for the bug fix!


6 participants