
windows/service: implement graceful shutdown when run as windows service #73292

Merged
merged 3 commits into kubernetes:master Feb 20, 2019

Conversation

steffengy (Contributor) commented Jan 24, 2019

The issue here originally is that os.Exit() is called, which exits
the process too early (before svc.Execute updates the status to stopped).
This is picked up as a service error and leads to a restart
if restart-on-fail is configured for the Windows service.
svc.Execute already guarantees that the application exits afterwards,
so the os.Exit call is unnecessary.

This rework also adds graceful shutdown, which resolves the
underlying root cause. The graceful shutdown is not guaranteed
to succeed, since the service controller can decide to kill
the service at any time after a shutdown timeout is exceeded.
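
To make the flow concrete, here is a minimal sketch of that pattern using golang.org/x/sys/windows/svc. It is an illustration rather than this PR's exact code; the stopCh/doneCh wiring stands in for the server package's shutdown plumbing:

package main

import "golang.org/x/sys/windows/svc"

// Channels wiring the service-control goroutine to the application. In the
// real change this role is played by the existing signal-handling code; the
// names here are illustrative.
var (
	stopCh = make(chan struct{}) // closed to request a graceful shutdown
	doneCh = make(chan struct{}) // closed once the application has finished
)

type handler struct{}

func (h *handler) Execute(args []string, r <-chan svc.ChangeRequest, s chan<- svc.Status) (bool, uint32) {
	s <- svc.Status{State: svc.StartPending}
	s <- svc.Status{State: svc.Running, Accepts: svc.AcceptStop | svc.AcceptShutdown}
	for c := range r {
		switch c.Cmd {
		case svc.Interrogate:
			s <- c.CurrentStatus
		case svc.Stop, svc.Shutdown:
			// Report StopPending, trigger the graceful shutdown, and only
			// report Stopped once the app has actually finished. No os.Exit:
			// svc.Run returns after Execute does.
			s <- svc.Status{State: svc.StopPending}
			close(stopCh)
			<-doneCh
			s <- svc.Status{State: svc.Stopped}
			return false, 0
		}
	}
	return false, 0
}

func main() {
	go func() {
		<-stopCh // the application's main loop would watch this channel
		// ... graceful cleanup happens here ...
		close(doneCh)
	}()
	_ = svc.Run("example-service", &handler{}) // blocks until Execute returns
}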

/sig windows
/kind bug

windows: Ensure graceful termination when being run as windows service

Fixes #72900

windows/service: implement graceful shutdown when run as windows service
- Fixes #72900
k8s-ci-robot (Contributor) commented Jan 24, 2019

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot (Contributor) commented Jan 24, 2019

Hi @steffengy. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

aserdean (Contributor) commented Jan 29, 2019

Thanks a lot for taking care of this @steffengy .

I was thinking of raising a SIGINT instead of os.Exit, but your implementation looks much cleaner :).

Overall LGTM, just one small question: will this work for kube-proxy as well? Maybe I missed it, but I don't see where the signal handlers are being set up.

steffengy (Contributor, Author) commented Jan 29, 2019

@aserdean
Kube-proxy indeed does not seem to use that signal handling, so it is unaffected by this.
(There doesn't seem to be any shutdown mechanism; it just loops in SyncLoop until the process is killed.)

So to match the Linux behavior (when systemd kills the service), we could just call os.Exit if SetupSignalHandler was never called?
I'll think about this some more.
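
For illustration, a hedged sketch of that fallback idea (the names and wiring are made up for this example, not the PR's exact code):

package server

import (
	"os"
	"syscall"
)

// shutdownHandler is set by SetupSignalHandler; it stays nil for binaries
// like kube-proxy that never install a signal handler.
var shutdownHandler chan os.Signal

// RequestShutdown reuses the normal SIGINT path when a handler exists and
// reports whether anyone was listening. A caller that gets false back can
// fall back to os.Exit(0), matching what effectively happens on Linux when
// systemd kills the service.
func RequestShutdown() bool {
	if shutdownHandler != nil {
		shutdownHandler <- syscall.SIGINT
		return true
	}
	return false
}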

aserdean (Contributor) commented Jan 29, 2019

That could work, but it feels like a workaround.

The cleanest implementation, in my opinion, would be to add a graceful shutdown mechanism to kube-proxy, at least for Windows.

steffengy (Contributor, Author) commented Jan 29, 2019

Yeah for sure, but I don't see how that would work for Windows only, since that code is shared.
And it opens up some questions, like: do we just stop, or do we also do some cleanup in that case?
Just stopping is probably fine, but it'll still be quite an invasive change compared to the rest of the PR.

aserdean (Contributor) commented Jan 29, 2019

> Yeah for sure, but I don't see how that would work for Windows only, since that code is shared.

I think you could hook it via the proxier implementation, i.e.:

func (proxier *Proxier) SyncLoop() {

but that wouldn't be that elegant either (a rough sketch follows below).

> And it opens up some questions, like: do we just stop, or do we also do some cleanup in that case?
> Just stopping is probably fine, but it'll still be quite an invasive change compared to the rest of the PR.

Just stopping would be fine from my perspective (provided the logs are flushed on exit, but to be honest I'm not familiar with the code).
Indeed, that should probably go in a different PR, and once both are merged it will fully fix
#72900
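
A rough sketch of that hook idea (purely illustrative: the stopCh parameter does not exist on the real proxier, whose SyncLoop takes no arguments and never returns):

package proxy

import "time"

// Minimal stand-in for the Windows proxier; just enough for the sketch.
type Proxier struct {
	syncPeriod time.Duration
}

func (proxier *Proxier) Sync() { /* re-apply the proxy rules */ }

// SyncLoop with a hypothetical stop channel, so kube-proxy could exit
// gracefully instead of looping until the process is killed.
func (proxier *Proxier) SyncLoop(stopCh <-chan struct{}) {
	t := time.NewTicker(proxier.syncPeriod)
	defer t.Stop()
	for {
		select {
		case <-stopCh:
			return // graceful exit
		case <-t.C:
			proxier.Sync()
		}
	}
}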

@PatrickLang PatrickLang moved this from Backlog to In Review in SIG-Windows Jan 29, 2019

// If we do not do this, our main threads won't be notified of the upcoming shutdown.
// Since Windows services do not use any console, we cannot simply generate a CTRL_BREAK_EVENT
// but need a dedicated notification mechanism.
server.RequestShutdown()

michmike commented Jan 31, 2019

I am not a golang expert, but why is this needed? I would think that break loop would allow the process to gracefully exit after you set the status to stopPending.

michmike commented Jan 31, 2019

By this, I mean the extra complexity around server.RequestShutdown.

steffengy (Author, Contributor) commented Jan 31, 2019

@michmike
As the comments describe:

  • The thread/goroutine running this communicates with the Windows service controller.

  • break loop would only exit that one goroutine, while the main thread (e.g. kubelet) would continue to run and
    eventually be (forcefully) timeout-killed by the service controller.
    We would also essentially be telling the service controller "we're currently shutting down" while never actually shutting down, which it doesn't like.

  • RequestShutdown() basically executes the same code that would run if a SIGINT signal were received (which you cannot send/receive in "Windows service mode"). On Windows, signals don't actually exist, but there are CTRL_BREAK/CTRL_C events, which Go maps to signals like SIGINT. Those cannot be sent to processes without a console (user I/O), which a service by definition does not have. So this is the least intrusive approach to adding an alternative shutdown-signaling mechanism.

  • I also wouldn't call it complex: the alternative (assuming a parallel universe where sending signals would work) is sending the signal manually, which is much more complex than adding a single function that might be of use elsewhere.

Let me know if that clarifies things and what you think.
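
For reference, the receiving side that RequestShutdown() piggybacks on looks roughly like the usual Kubernetes signal-handler pattern (a sketch, not the exact source):

package server

import (
	"os"
	"os/signal"
	"syscall"
)

var shutdownSignals = make(chan os.Signal, 2)

// SetupSignalHandler returns a channel that is closed on the first
// SIGINT/SIGTERM; a second signal exits immediately. Feeding
// shutdownSignals from RequestShutdown() is what makes it equivalent to
// receiving a real SIGINT.
func SetupSignalHandler() <-chan struct{} {
	stop := make(chan struct{})
	signal.Notify(shutdownSignals, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		<-shutdownSignals
		close(stop) // begin graceful shutdown
		<-shutdownSignals
		os.Exit(1) // second signal: exit immediately
	}()
	return stop
}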

roycaihw (Member) commented Jan 31, 2019

@k8s-ci-robot k8s-ci-robot requested a review from logicalhan Jan 31, 2019

steffengy (Contributor, Author) commented Feb 1, 2019

@aserdean I implemented the exit workaround for cases like kube-proxy, and
that seems to work okay. I also agree with you that making kube-proxy "signal-aware" is something that, if done, should be done in a separate PR.

steffengy (Contributor, Author) commented Feb 6, 2019

/retest

k8s-ci-robot (Contributor) commented Feb 6, 2019

@steffengy: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

PatrickLang (Contributor) commented Feb 6, 2019

/ok-to-test

PatrickLang (Contributor) commented Feb 6, 2019

@aserdean or @michmike can you approve?

aserdean (Contributor) commented Feb 7, 2019

@steffengy thanks for adding the comment regarding kube-proxy.

/LGTM

k8s-ci-robot (Contributor) commented Feb 7, 2019

@aserdean: changing LGTM is restricted to assignees, and only kubernetes/kubernetes repo collaborators may be assigned issues.

In response to this:

@steffengy thanks for adding the comment regarding kube-proxy.

/LGTM

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the lgtm label Feb 7, 2019

steffengy (Contributor, Author) commented Feb 7, 2019

pull-kubernetes-e2e-gce looks spurious.
pull-kubernetes-verify should be addressed now.
Can I have another retry, @PatrickLang?

PatrickLang (Contributor) commented Feb 7, 2019

/retest
@steffengy I think you can use this command yourself now, since this PR has the ok-to-test label.

PatrickLang (Contributor) commented Feb 7, 2019

/approve

brendandburns (Contributor) commented Feb 8, 2019

/lgtm

based on previous LGTM from @PatrickLang

go func() {
// Ensure the SCM was notified (The operation above (send to s) was received and communicated to the
// service control manager - so it doesn't look like the service crashes)
time.Sleep(1 * time.Second)

mtaufen (Contributor) commented Feb 12, 2019
Is there any way to explicitly flush or get an ack from the SCM, rather than just waiting for an arbitrary duration?

steffengy (Author, Contributor) commented Feb 13, 2019

Not without effort, and you really don't want to do that here (you'd have to poll and make Windows API calls (syscalls) manually).

Also, 1 second is really enough time; we basically only need to make sure that the function that spawns this goroutine has exited, since at that point the relevant syscall has been performed, which should only take a few hundred nanoseconds.
Also keep in mind that this is the edge case (for programs that do not support graceful shutdown).

PatrickLang (Contributor) commented Feb 15, 2019

The SCM doesn't have a way to ACK; it's the other direction. This is actually the kubelet's ack to the service control manager that the stop request was received and is being processed. If the service control manager doesn't get this pending state, it will assume the process is hung and forcefully kill it. If the process is still around after the wait (30 seconds, or more if we give it a hint when passing the stop-pending status), the service control manager will poll this status again.
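
In code terms, that hint is the WaitHint field on the status passed to the SCM; roughly (an illustrative fragment using golang.org/x/sys/windows/svc):

package service

import "golang.org/x/sys/windows/svc"

// reportStopPending tells the SCM that a stop is in progress and hints how
// many milliseconds it should wait before assuming the process is hung.
func reportStopPending(s chan<- svc.Status) {
	s <- svc.Status{
		State:    svc.StopPending,
		WaitHint: 30000, // 30s: the SCM re-polls instead of killing within this window
	}
}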

michmike commented Feb 18, 2019

/lgtm

michmike commented Feb 18, 2019

/assign @brendandburns
It wants an approve from you, not just an lgtm.

michmike commented Feb 18, 2019

/test pull-kubernetes-cross

brendandburns (Contributor) commented Feb 20, 2019

/approve

k8s-ci-robot (Contributor) commented Feb 20, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brendandburns, PatrickLang, steffengy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 296985c into kubernetes:master Feb 20, 2019

14 checks passed

cla/linuxfoundation steffengy authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-cross Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-godeps Job succeeded.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
tide In merge pool.

SIG-Windows automation moved this from In Review to Done (v.1.14) Feb 20, 2019
