windows/service: implement graceful shutdown when run as windows service #73292
- Fixes kubernetes#72900 The original problem is that os.Exit() is called, which terminates the process too early, before svc.Execute has updated the service status to stopped. The service controller treats this as a service failure and restarts the service if restart-on-fail is configured for the Windows service. svc.Execute already guarantees that the application exits afterwards, so the os.Exit call is unnecessary. This rework also adds graceful shutdown, which resolves the underlying root cause. Graceful shutdown is not guaranteed to succeed, because the service controller can kill the service at any time once the shutdown timeout is exceeded.
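To make the ordering problem concrete, here is a minimal, platform-independent sketch. It does not use the real Windows service library: `execute` is a hypothetical stand-in for the service loop, and the state names are plain strings chosen for illustration. It only shows that the final "Stopped" report happens after the handler returns, so exiting the process from inside the handler would skip it.

```go
package main

import "fmt"

// execute is a hypothetical stand-in for a service loop: it runs the handler
// and reports the final "Stopped" state only after the handler has returned.
func execute(handler func(report func(string)), report func(string)) {
	handler(report)
	report("Stopped")
}

func main() {
	var seen []string
	record := func(s string) { seen = append(seen, s) }

	execute(func(report func(string)) {
		report("Running")
		report("StopPending")
		// Returning here, instead of calling os.Exit, gives execute the
		// chance to report "Stopped" before the process ends.
	}, record)

	fmt.Println(seen)
}
```

If the handler called os.Exit at the marked point, the process would end with "StopPending" as the last reported state, which is the situation described above.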
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Hi @steffengy. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test.
Thanks a lot for taking care of this, @steffengy. I was thinking of raising a SIGINT instead of calling os.Exit, but your implementation looks much cleaner :). Overall LGTM, just one small question: will this work for kube-proxy as well? Maybe I missed it, but I don't see where the signal handlers are being set up.
@aserdean So to match the linux behavior (when systemd kills the service), we could just |
That could work, but it feels like a workaround. The cleanest implementation, in my opinion, would be to add a graceful shutdown mechanic to kube-proxy, at least for Windows. |
Yeah for sure, but I don't see how that would work for windows-only, since that code is shared. |
I think you can hook it over the proxier implementation i.e.: kubernetes/pkg/proxy/winkernel/proxier.go Line 695 in a5e424d
but that wouldn't be that elegant either.
Just stopping would be fine from my perspective (I saw that the logs are flushed on exit, but to be honest I'm not familiar with the code).
pkg/windows/service/service.go
Outdated
// If we do not do this, our main threads won't be notified of the upcoming shutdown.
// Since Windows services do not use any console, we cannot simply generate a CTRL_BREAK_EVENT
// but need a dedicated notification mechanism.
server.RequestShutdown()
i am not a golang expert, but why is this needed? i would think that break loop will allow the process to gracefully exit after you set the status to stopPending
by this, i mean the extra complexity around server.RequestShutdown
@michmike
As the comments describe:
- The thread/goroutine running this communicates with the Windows service controller.
- break loop will only exit that one thread, while the main thread (e.g. kubelet) will continue to run and eventually be forcefully killed by the service controller after a timeout. By doing that we'd also essentially tell the service controller "we're currently shutting down" without ever actually shutting down, which it doesn't like.
- RequestShutdown() basically executes the same code that would have run if a SIGINT signal had been received (which you cannot send or receive in "windows service mode"). Signals don't actually exist on Windows; there are CTRL_BREAK/CTRL_C events, which Go maps to signals like SIGINT. Those cannot be sent to processes without a console (user I/O), which a service by definition does not have. So this is the least intrusive way to add an alternative shutdown-signaling mechanism.
- I also wouldn't call it complex. The alternative (assuming a parallel universe where this would work) is sending the signal manually, which is much more complex than adding a single function that might be of use elsewhere.
Let me know if that clarifies things and what you think.
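The idea in the list above can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: the names `setupShutdownHandler`, `requestShutdown`, and `shutdownRequests` are made up for the example, and it only handles os.Interrupt. The point is that a programmatic request travels through the same channel as a real signal, so the main goroutine cannot tell the difference.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
)

// shutdownRequests carries both real OS signals and programmatic requests.
// It stays nil until setupShutdownHandler has been called.
var shutdownRequests chan os.Signal

// setupShutdownHandler returns a channel that is closed on the first shutdown
// request, whether it came from a signal or from requestShutdown.
func setupShutdownHandler() <-chan struct{} {
	shutdownRequests = make(chan os.Signal, 2)
	stop := make(chan struct{})
	signal.Notify(shutdownRequests, os.Interrupt)
	go func() {
		<-shutdownRequests
		close(stop)
	}()
	return stop
}

// requestShutdown emulates a received signal. It reports false when no handler
// is installed, i.e. when the program does not support graceful shutdown.
func requestShutdown() bool {
	if shutdownRequests == nil {
		return false
	}
	select {
	case shutdownRequests <- os.Interrupt:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println("before setup:", requestShutdown())
	stop := setupShutdownHandler()
	fmt.Println("after setup:", requestShutdown())
	<-stop
	fmt.Println("main goroutine observed the shutdown request")
}
```

The boolean return value lets the caller detect programs that never installed a handler and fall back to some other way of stopping.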
/cc @logicalhan |
@aserdean I implemented the exit-workaround for cases like kube-proxy and |
/lgtm |
/retest |
@steffengy: Cannot trigger testing until a trusted user reviews the PR and leaves an In response to this:
/ok-to-test |
@steffengy thanks for adding the comment regarding /LGTM |
@aserdean: changing LGTM is restricted to assignees, and only kubernetes/kubernetes repo collaborators may be assigned issues. In response to this:
pull-kubernetes-e2e-gce looks spurious. |
/retest |
/approve |
/lgtm on previous LGTM from @PatrickLang |
go func() {
	// Ensure the SCM was notified (The operation above (send to s) was received and communicated to the
	// service control manager - so it doesn't look like the service crashes)
	time.Sleep(1 * time.Second)
Is there any way to explicitly flush or get an ack from the SCM, rather than just waiting for an arbitrary duration?
Not without effort that you really don't want to spend there (you'd have to poll and call the Windows API (= syscalls) manually here).
Also, 1 second is more than enough time: we basically only need to make sure that the function that spawns this goroutine has returned, because the relevant syscall is performed once it returns, which should only take a few hundred nanoseconds.
Also keep in mind that this is the edge case (for programs that do not support graceful shutdown).
The SCM doesn't have a way to ACK; it works in the other direction. This is actually the kubelet's ack to the service control manager that the stop request was received and is being processed. If the service control manager doesn't get this pending state, it will assume the process is hung and forcefully kill it. If the process is still around after the wait (30 seconds, or more if we give it a hint when passing the stop pending status), the service control manager will poll this status again.
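The stop sequence discussed in this thread can be modelled with a small, platform-independent sketch. Everything here is an assumption made for illustration: `handleStop`, the `state` type, and the injected `requestShutdown` and `exit` functions are hypothetical, and the `status` channel merely stands in for however the handler reports to the SCM. The sketch acknowledges the stop request first, then tries a graceful shutdown, and only falls back to a delayed exit when graceful shutdown is unavailable.

```go
package main

import (
	"fmt"
	"time"
)

// state is an illustrative stand-in for a service status value.
type state string

const stopPending state = "StopPending"

// handleStop models a reaction to a stop request. requestShutdown and exit are
// injected so that the sketch can be exercised without ending the process.
func handleStop(status chan<- state, requestShutdown func() bool, exit func(code int), grace time.Duration) {
	// Acknowledge the stop request first, so the service is not treated as hung.
	status <- stopPending

	if requestShutdown() {
		// The main goroutine was notified and will shut down on its own.
		return
	}

	// Fallback for programs without graceful shutdown: exit after a short
	// delay, so the status update above can be delivered first.
	go func() {
		time.Sleep(grace)
		exit(0)
	}()
}

func main() {
	status := make(chan state, 1)
	exited := make(chan int, 1)

	noGracefulShutdown := func() bool { return false }
	handleStop(status, noGracefulShutdown, func(code int) { exited <- code }, 10*time.Millisecond)

	fmt.Println("reported:", <-status)
	fmt.Println("exit code:", <-exited)
}
```

In a real service the delay corresponds to the fixed sleep questioned above; it is a pragmatic choice for the fallback path, not a synchronization guarantee.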
/lgtm |
/assign @brendandburns |
/test pull-kubernetes-cross |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: brendandburns, PatrickLang, steffengy The full list of commands accepted by this bot can be found here. The pull request process is described here
/sig windows
/kind bug
Fixes #72900