Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

windows/service: implement graceful shutdown when run as windows service #73292

Merged
merged 3 commits into from
Feb 20, 2019

Conversation

steffengy
Copy link
Contributor

@steffengy steffengy commented Jan 24, 2019

The issue here originally is that os.Exit() is called which exits
the process too early (before svc.Execute updates the status to stopped).
This is picked up as service error and leads to restarting,
if restart-on-fail is configured for the windows service.
svc.Execute already guarantees that the application is exited after,
so that os.Exit call would be unnecessary.

This rework also adds graceful shutdown, which also resolves the
underlying root cause. The graceful shutdown is not guaranteed
to succeed, since the service controller can decide to kill
the service any time after exceeding a shutdown timeout.

/sig windows
/kind bug

windows: Ensure graceful termination when being run as windows service

Fixes #72900

- Fixes kubernetes#72900
The issue here originally is that os.Exit() is called which exits
the process too early (before svc.Execute updates the status to stopped).
This is picked up as service error and leads to restarting,
if restart-on-fail is configured for the windows service.
svc.Execute already guarantees that the application is exited after,
so that os.Exit call would be unnecessary.

This rework also adds graceful shutdown, which also resolves the
underlying root cause. The graceful shutdown is not guaranteed
to succeed, since the service controller can decide to kill
the service any time after exceeding a shutdown timeout.
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. sig/windows Categorizes an issue or PR as relevant to SIG Windows. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 24, 2019
@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 24, 2019
@k8s-ci-robot
Copy link
Contributor

Hi @steffengy. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 24, 2019
@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. release-note Denotes a PR that will be considered when it comes time to generate release notes. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jan 24, 2019
@PatrickLang PatrickLang added this to Backlog in SIG-Windows Jan 29, 2019
@aserdean
Copy link
Contributor

Thanks a lot for taking care of this @steffengy .

I was thinking to raise a SIGINT instead of os.Exit, but your implementation looks much cleaner :).

Overall LGTM, just one small question: will this work for kube-proxy as well? Maybe I missed it but I don't see where the signal handlers are being setup.

@steffengy
Copy link
Contributor Author

steffengy commented Jan 29, 2019

@aserdean
Kube-Proxy indeed does not seem to use that signal handling, so is unaffected by this
(There don't seem to be any shutdown mechanisms, it just loops in SyncLoop until the process is killed)

So to match the linux behavior (when systemd kills the service), we could just os.Exit if SetupSignalHandler was never called?
I'll think about this some more.

@aserdean
Copy link
Contributor

That could work, but it feels like a workaround.

The cleanest implementation, in my opinion, would be to add a graceful shutdown mechanic to kube-proxy, at least for Windows.

@steffengy
Copy link
Contributor Author

Yeah for sure, but I don't see how that would work for windows-only, since that code is shared.
And it opens up some questions like: Do we just stop or do we also do some cleanup in that case?
Just stopping is probably fine, but it'll still be quite an invasive change compared to the rest of the PR.

@aserdean
Copy link
Contributor

Yeah for sure, but I don't see how that would work for windows-only, since that code is shared.

I think you can hook it over the proxier implementation i.e.:

func (proxier *Proxier) SyncLoop() {

but that wouldn't be that elegant either.

And it opens up some questions like: Do we just stop or do we also do some cleanup in that case?
Just stopping is probably fine, but it'll still be quite an invasive change compared to the rest of the PR.

Just stopping would be fine from my perspective (I sent that the logs are flushed on exit, but to be honest I'm not familiar with the code.
Indeed that should probably go in a different PR and once both are merged it will fully fix:
#72900

@PatrickLang PatrickLang moved this from Backlog to In Review in SIG-Windows Jan 29, 2019
// If we do not do this, our main threads won't be notified of the upcoming shutdown.
// Since Windows services do not use any console, we cannot simply generate a CTRL_BREAK_EVENT
// but need a dedicated notification mechanism.
server.RequestShutdown()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not a golang expert, but why is this needed? i would think that break loop will allow the process to gracefully exit after you set the status to stopPending

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by this, i mean the extra complexity around server.RequestShutdown

Copy link
Contributor Author

@steffengy steffengy Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michmike
As the comments describe:

  • The thread/goroutine running this communicates with the windows service controller.

  • break loop will only exit that one thread, while the main thread (e.g. kubelet) will continue to run and
    eventually be (forcefully) timeout-killed by the service controller.
    By doing that we'd also essentially communicate to the service controller: "We're currently shutting down" but we'll never actually shut down, which it doesn't like.

  • RequestShutdown() basically executes the same code that would've been executed if a SIGINT signal was received (which you cannot send/receive in "windows service mode". In windows signals don't actually exist but there's CTRL_BREAK/CTRL_C event which are mapped to signals like SIGINT by golang. Those cannot be sent to processes without a console (User I/O) - which a service by definition does not have. So this is the least-intrusive approach to add a alternative shutdown-signaling mechanism.

  • I also wouldn't call it complex - the alternative (assuming a parallel universe where this would work) is sending the signal manually, which is much more complex than adding a single function, that might be of use elsewhere.

Let me know if that clarifies things and what you think

@roycaihw
Copy link
Member

/cc @logicalhan

@steffengy
Copy link
Contributor Author

@aserdean I implemented the exit-workaround for cases like kube-proxy and
that seems to do okay. I also agree with you, that making kube-proxy "signal-aware" is something, if done, should be done in a separate PR.

@PatrickLang
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 6, 2019
@steffengy
Copy link
Contributor Author

/retest

@k8s-ci-robot
Copy link
Contributor

@steffengy: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@PatrickLang
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 6, 2019
@PatrickLang
Copy link
Contributor

@aserdean or @michmike can you approve?

@aserdean
Copy link
Contributor

aserdean commented Feb 7, 2019

@steffengy thanks for adding the comment regarding kube-proxy.

/LGTM

@k8s-ci-robot
Copy link
Contributor

@aserdean: changing LGTM is restricted to assignees, and only kubernetes/kubernetes repo collaborators may be assigned issues.

In response to this:

@steffengy thanks for adding the comment regarding kube-proxy.

/LGTM

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 7, 2019
@steffengy
Copy link
Contributor Author

pull-kubernetes-e2e-gce looks spurious.
pull-kubernetes-verify should be addressed now.
Can i have another retry @PatrickLang ?

@PatrickLang
Copy link
Contributor

/retest
@steffengy I think you can use this command now since this has the ok-to-test label

@PatrickLang
Copy link
Contributor

/approve

@brendandburns
Copy link
Contributor

/lgtm

on previous LGTM from @PatrickLang

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2019
go func() {
// Ensure the SCM was notified (The operation above (send to s) was received and communicated to the
// service control manager - so it doesn't look like the service crashes)
time.Sleep(1 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to explicitly flush or get an ack from the SCM, rather than just waiting for an arbitrary duration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not without effort, you really don't want to be doing there (and you'd have to be polling and doing windows API (=syscalls) manually here).

Also 1 second is really enough time, we basically only need to make sure that the function that spawns this goroutine has exited, since then the relevant syscall is performed, which should only take a few hundred nanoseconds.
Also keep in mind that this is the edge case (for programs that do not support graceful shutdowns).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the SCM doesn't have a way to ACK, it's the other direction. This is actually the kubelet's ack to the service control manager that the stop request was received, and is processing. If the service control manager doesn't get this pending state, it will assume the process is hung and forcefully kill it. If the process is still around after the wait (30 seconds, or more if we give it a hint when passing the stop pending status), the service control manager will poll this status again.

@michmike
Copy link
Contributor

/lgtm

@michmike
Copy link
Contributor

/assign @brendandburns
it wants an approve from you, not just lgtm

@michmike
Copy link
Contributor

/test pull-kubernetes-cross

@brendandburns
Copy link
Contributor

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brendandburns, PatrickLang, steffengy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 20, 2019
@k8s-ci-robot k8s-ci-robot merged commit 296985c into kubernetes:master Feb 20, 2019
SIG-Windows automation moved this from In Review to Done (v.1.14) Feb 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/windows Categorizes an issue or PR as relevant to SIG Windows. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Does Windows service control integration properly support stopping a service
8 participants