windows/service: implement graceful shutdown when run as windows service #73292

steffengy · 2019-01-24T23:39:42Z

The original issue is that os.Exit() is called, which exits
the process too early (before svc.Execute updates the status to stopped).
The service controller treats this as a service error and restarts the service
if restart-on-fail is configured for the windows service.
svc.Execute already guarantees that the application exits afterwards,
so the os.Exit call is unnecessary.

This rework also adds graceful shutdown, which also resolves the
underlying root cause. The graceful shutdown is not guaranteed
to succeed, since the service controller can decide to kill
the service any time after exceeding a shutdown timeout.

/sig windows
/kind bug

windows: Ensure graceful termination when being run as windows service

Fixes #72900
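
The ordering problem described above can be illustrated with a small, platform-independent simulation. This is only a sketch: it does not use the real golang.org/x/sys/windows/svc package, and `runService`, `status`, and the handler signature are hypothetical stand-ins that model a single assumption, namely that the final "stopped" status is reported only after the handler has returned.

```go
package main

import "fmt"

// status is a hypothetical stand-in for the states a service reports to the
// service controller.
type status string

const (
	running     status = "running"
	stopPending status = "stop-pending"
	stopped     status = "stopped"
)

// runService models the wrapper around a service handler: it reports
// "stopped" only after the handler has returned. If the handler terminated
// the process itself (for example via os.Exit), this final report would
// never be sent.
func runService(handler func(report func(status)), report func(status)) {
	handler(report)
	report(stopped)
}

// gracefulHandler acknowledges the stop request and then returns normally,
// leaving the final status update to runService.
func gracefulHandler(report func(status)) {
	report(running)
	report(stopPending)
}

// collectStatuses runs the handler and returns every status that was reported.
func collectStatuses() []status {
	var seen []status
	runService(gracefulHandler, func(s status) { seen = append(seen, s) })
	return seen
}

func main() {
	fmt.Println(collectStatuses())
}
```

Returning from the handler, rather than exiting inside it, is what lets the last status in the sequence be "stopped".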


k8s-ci-robot · 2019-01-24T23:39:45Z

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.

If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
If you have done the above and are still having issues with the CLA being reported as unsigned, please email the CNCF helpdesk: helpdesk@rt.linuxfoundation.org

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot · 2019-01-24T23:39:49Z

Hi @steffengy. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


aserdean · 2019-01-29T15:12:09Z

Thanks a lot for taking care of this @steffengy .

I was thinking to raise a SIGINT instead of os.Exit, but your implementation looks much cleaner :).

Overall LGTM, just one small question: will this work for kube-proxy as well? Maybe I missed it, but I don't see where the signal handlers are being set up.

steffengy · 2019-01-29T15:49:58Z

@aserdean
kube-proxy indeed does not seem to use that signal handling, so it is unaffected by this change.
(There don't seem to be any shutdown mechanisms; it just loops in SyncLoop until the process is killed.)

So to match the linux behavior (when systemd kills the service), we could just os.Exit if SetupSignalHandler was never called?
I'll think about this some more.

aserdean · 2019-01-29T16:56:19Z

That could work, but it feels like a workaround.

The cleanest implementation, in my opinion, would be to add a graceful shutdown mechanic to kube-proxy, at least for Windows.

steffengy · 2019-01-29T17:01:49Z

Yeah for sure, but I don't see how that would work for windows-only, since that code is shared.
And it opens up some questions like: Do we just stop or do we also do some cleanup in that case?
Just stopping is probably fine, but it'll still be quite an invasive change compared to the rest of the PR.

aserdean · 2019-01-29T17:32:48Z

> Yeah for sure, but I don't see how that would work for windows-only, since that code is shared.

I think you can hook it over the proxier implementation i.e.:

kubernetes/pkg/proxy/winkernel/proxier.go

Line 695 in a5e424d

func (proxier *Proxier) SyncLoop() {

but that wouldn't be that elegant either.

> And it opens up some questions like: Do we just stop or do we also do some cleanup in that case?
> Just stopping is probably fine, but it'll still be quite an invasive change compared to the rest of the PR.

Just stopping would be fine from my perspective (I saw that the logs are flushed on exit, but to be honest I'm not familiar with the code).
Indeed, that should probably go in a different PR; once both are merged, it will fully fix:
#72900

michmike · 2019-01-31T07:35:53Z

pkg/windows/service/service.go

+				// If we do not do this, our main threads won't be notified of the upcoming shutdown.
+				// Since Windows services do not use any console, we cannot simply generate a CTRL_BREAK_EVENT
+				// but need a dedicated notification mechanism.
+				server.RequestShutdown()


I am not a golang expert, but why is this needed? I would think that break loop will allow the process to gracefully exit after you set the status to stopPending.

by this, i mean the extra complexity around server.RequestShutdown

steffengy · 2019-01-31

@michmike
As the comments describe:

The thread/goroutine running this communicates with the windows service controller.

break loop will only exit that one thread, while the main thread (e.g. kubelet) will continue to run and
eventually be (forcefully) timeout-killed by the service controller.
By doing that we'd also essentially communicate to the service controller: "We're currently shutting down" but we'll never actually shut down, which it doesn't like.

RequestShutdown() basically executes the same code that would have been executed if a SIGINT signal had been received (which you cannot send or receive in "windows service mode"). In Windows, signals don't actually exist, but there are CTRL_BREAK/CTRL_C events, which golang maps to signals like SIGINT. Those cannot be sent to processes without a console (user I/O), which a service by definition does not have. So this is the least intrusive approach to add an alternative shutdown-signaling mechanism.

I also wouldn't call it complex - the alternative (assuming a parallel universe where this would work) is sending the signal manually, which is much more complex than adding a single function, that might be of use elsewhere.

Let me know if that clarifies things and what you think
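
To make the idea above concrete, here is a minimal sketch of a shutdown channel that can be triggered either by a real OS signal or programmatically. It is not the code from this PR: `setupSignalHandler`, `requestShutdown`, and `shutdownRequests` are illustrative names, and the sketch assumes that returning `false` when no handler was installed is an acceptable way to let the caller fall back to something else.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
)

// shutdownRequests carries both real OS signals and programmatic requests.
// It stays nil until setupSignalHandler is called.
var shutdownRequests chan os.Signal

// setupSignalHandler returns a channel that is closed on the first shutdown
// request, whether it came from the OS or from requestShutdown.
func setupSignalHandler() <-chan struct{} {
	requests := make(chan os.Signal, 2)
	shutdownRequests = requests
	stop := make(chan struct{})
	signal.Notify(requests, os.Interrupt)
	go func() {
		<-requests
		close(stop)
	}()
	return stop
}

// requestShutdown injects a shutdown request as if a signal had arrived.
// It reports false when no handler was set up, so callers can fall back.
func requestShutdown() bool {
	if shutdownRequests == nil {
		return false
	}
	select {
	case shutdownRequests <- os.Interrupt:
		return true
	default:
		// Buffer is full: a request is already pending.
		return false
	}
}

func main() {
	fmt.Println("before setup:", requestShutdown())
	stop := setupSignalHandler()
	fmt.Println("after setup:", requestShutdown())
	<-stop
	fmt.Println("main loop observed the shutdown request")
}
```

The main loop only ever waits on the returned channel, so it does not need to know whether the request came from a console event or from the service control goroutine.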

roycaihw · 2019-01-31T21:07:37Z

/cc @logicalhan


steffengy · 2019-02-01T18:17:48Z

@aserdean I implemented the exit workaround for cases like kube-proxy, and
that seems to work fine. I also agree with you that making kube-proxy "signal-aware" is something that, if done, should be done in a separate PR.

PatrickLang · 2019-02-06T00:13:40Z

/lgtm

steffengy · 2019-02-06T07:25:45Z

/retest

k8s-ci-robot · 2019-02-06T07:25:58Z

@steffengy: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest


PatrickLang · 2019-02-06T23:13:27Z

/ok-to-test

PatrickLang · 2019-02-06T23:21:59Z

@aserdean or @michmike can you approve?

aserdean · 2019-02-07T12:12:05Z

@steffengy thanks for adding the comment regarding kube-proxy.

/LGTM

k8s-ci-robot · 2019-02-07T12:12:14Z

@aserdean: changing LGTM is restricted to assignees, and only kubernetes/kubernetes repo collaborators may be assigned issues.

In response to this:

@steffengy thanks for adding the comment regarding kube-proxy.

/LGTM


steffengy · 2019-02-07T15:59:52Z

pull-kubernetes-e2e-gce looks spurious.
pull-kubernetes-verify should be addressed now.
Can i have another retry @PatrickLang ?

PatrickLang · 2019-02-07T18:55:15Z

/retest
@steffengy I think you can use this command now since this has the ok-to-test label

PatrickLang · 2019-02-07T18:57:56Z

/approve

brendandburns · 2019-02-08T17:59:40Z

/lgtm

on previous LGTM from @PatrickLang

mtaufen · 2019-02-12T21:22:00Z

pkg/windows/service/service.go

+					go func() {
+						// Ensure the SCM was notified (The operation above (send to s) was received and communicated to the
+						// service control manager - so it doesn't look like the service crashes)
+						time.Sleep(1 * time.Second)


Is there any way to explicitly flush or get an ack from the SCM, rather than just waiting for an arbitrary duration?

steffengy · 2019-02-13

Not without effort that you really don't want to spend here (you'd have to poll and call the Windows API (= syscalls) manually).

Also, 1 second is more than enough time: we basically only need to make sure that the function that spawns this goroutine has exited, since the relevant syscall is performed then, which should only take a few hundred nanoseconds.
Also keep in mind that this is the edge case (for programs that do not support graceful shutdowns).

PatrickLang · 2019-02-15

The SCM doesn't have a way to ACK; it's the other direction. This is actually the kubelet's ack to the service control manager that the stop request was received and is being processed. If the service control manager doesn't get this pending state, it will assume the process is hung and forcefully kill it. If the process is still around after the wait (30 seconds, or more if we give it a hint when passing the stop pending status), the service control manager will poll this status again.

michmike · 2019-02-18T22:14:10Z

/lgtm

michmike · 2019-02-18T22:15:36Z

/assign @brendandburns
it wants an approve from you, not just lgtm

michmike · 2019-02-18T22:16:00Z

/test pull-kubernetes-cross

brendandburns · 2019-02-20T01:31:53Z

/approve

k8s-ci-robot · 2019-02-20T01:32:28Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brendandburns, PatrickLang, steffengy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/OWNERS~~ [brendandburns]
~~staging/src/k8s.io/apiserver/OWNERS~~ [brendandburns]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 24, 2019

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 24, 2019

k8s-ci-robot requested review from thockin and wojtek-t January 24, 2019 23:40

PatrickLang added this to Backlog in SIG-Windows Jan 29, 2019

PatrickLang moved this from Backlog to In Review in SIG-Windows Jan 29, 2019

michmike reviewed Jan 31, 2019

View reviewed changes

k8s-ci-robot requested a review from logicalhan January 31, 2019 21:07

windows/svc: workaround-exit mechanism that works for signal-less binaries

afdfe8d

k8s-ci-robot assigned PatrickLang Feb 6, 2019

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 6, 2019

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 6, 2019

windows/svc: address failing test by updating bazel BUILD

c2b771d

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 7, 2019

k8s-ci-robot assigned brendandburns Feb 8, 2019

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2019

mtaufen reviewed Feb 12, 2019

View reviewed changes

k8s-ci-robot assigned michmike Feb 18, 2019

michmike approved these changes Feb 18, 2019

View reviewed changes

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 20, 2019

k8s-ci-robot merged commit 296985c into kubernetes:master Feb 20, 2019

SIG-Windows automation moved this from In Review to Done (v.1.14) Feb 20, 2019
