Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix behavior of interrupt (SIGINT, SIGTERM) signals #769

Merged
merged 1 commit into from Jun 12, 2022

Conversation

andreynering
Copy link
Member

@andreynering andreynering commented Jun 12, 2022

Task will now give time for the processes running to do cleanup work

Ref #458
Ref #479
Fixes #728

@andreynering andreynering added the type: bug Something not working as intended. label Jun 12, 2022
Task will now give time for the processes running to do cleanup work

Ref #458
Ref #479
Fixes #728

Co-authored-by: Marco Molteni <marco.molteni@pix4d.com>
Co-authored-by: aliculPix4D <aleksandar.licul_ext@pix4d.com>
@andreynering
Copy link
Member Author

I took some time to understand the problem, but thanks to the sleepit helper @marco-m-pix4d implemented I could do some manual tests figure it out what was the problem.

We had code to intercept the signals and it triggered a cancel of the context, which means the process was actually receiving two signals: one because of the cancel() and the other because the OS was still sending the signals to the sub-processes anyway.

task/cmd/task/task.go

Lines 252 to 262 in c9a582f

func getSignalContext() context.Context {
ch := make(chan os.Signal, 1)
signal.Notify(ch, os.Interrupt, syscall.SIGTERM)
ctx, cancel := context.WithCancel(context.Background())
go func() {
sig := <-ch
log.Printf("task: signal received: %s", sig)
cancel()
}()
return ctx
}

Changed it so it will now only intercept the signal but don't cancel the context, so it's actually sent only once to the sub-processes as expected.

As an addition, I added a force kill behavior if SIGINT is sent for the third time, so users at least have way to force Task to quit if any sub-process happen to be stuck.

Also changed the timeout to send KILL to processes in case of context cancellation from 2 to 15 seconds, which give them more time to do the necessary cleanup.

I'll merge it to master. Please give it a test if possible and let me know if it's working correctly. I'll wait a couple days to release it to be more certain that it is stable.

/cc @marco-m-pix4d @aliculPix4D @ghostsquad

@andreynering andreynering merged commit c1466f8 into master Jun 12, 2022
@andreynering andreynering deleted the fix-signals branch June 12, 2022 02:02
@marco-m-pix4d
Copy link
Contributor

We had code to intercept the signals and it triggered a cancel of the context, which means the process was actually receiving two signals: one because of the cancel() and the other because the OS was still sending the signals to the sub-processes anyway.

This is what I remember I had discovered in my work, yes.

Changed it so it will now only intercept the signal but don't cancel the context, so it's actually sent only once to the sub-processes as expected.

This looks correct to me. I will try to test it today. Thanks for the effort!

@marco-m-pix4d
Copy link
Contributor

Seems to be running fine.

@andreynering
Copy link
Member Author

@marco-m-pix4d Great! I plan to make a release tonight.

We still have that intermittently failing test as before. Hopefully we'll find a solution for that soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Something not working as intended.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ctx/Signal Handling still not totally correct
2 participants