
[WIP] Make testing.StartTestServer close cleanly #50690

Conversation

frobware (Contributor) commented Aug 15, 2017

What this PR does / why we need it:

This PR ensures that the test apiserver closes cleanly. Without this
change there are many repeated reconnection attempts to etcd at
sub-second intervals, accompanied by a lot of log spam indicating that
the connection could not be made. This also results in the
accumulation of many thousands of goroutines, which in turn prevent
effective use of the test server across multiple test functions within
the same process.

This PR introduces Storage.Destroy(), and the test server now closes
all its stores when it is shut down.

Prior to this change, ~1500+ goroutines remain after the server
stops; with the change, ~200 remain. This is a stepping-stone on the
way to reducing that number further.
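
A rough sketch of the shape of the change (type names and wiring here are illustrative; only Storage.Destroy() itself comes from this PR):

```go
package storage

// Destroyer stands in for the hook this PR adds: the actual method is
// Storage.Destroy(). Stores that hold etcd connections or background
// goroutines implement it to release them.
type Destroyer interface {
	Destroy()
}

// TestServer is a hypothetical stand-in for the test apiserver; it
// remembers each store it creates so TearDown can close them.
type TestServer struct {
	stores []Destroyer
}

// TearDown destroys every registered store after the server stops,
// which prevents the etcd reconnect loops and leaked goroutines
// described above.
func (s *TestServer) TearDown() {
	for _, st := range s.stores {
		st.Destroy()
	}
}
```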

Which issue this PR fixes

Fixes #49489

Special notes for your reviewer:

Release note:

k8s-ci-robot (Contributor)

Thanks for your pull request. Before we can look at it, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot added the cncf-cla: no label (indicates the PR's author has not signed the CNCF CLA) on Aug 15, 2017
k8s-ci-robot (Contributor)

Hi @frobware. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) on Aug 15, 2017
k8s-github-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull request has been approved by: frobware
We suggest the following additional approvers: deads2k, nikhiljindal

Assign the PR to them by writing /assign @deads2k @nikhiljindal in a comment when ready.

Associated issue: 49489

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS files:

You can indicate your approval by writing /approve in a comment.
You can cancel your approval by writing /approve cancel in a comment.

k8s-github-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) and the release-note-label-needed label on Aug 15, 2017
ironcladlou (Contributor)

Instead of injecting a stop channel through all these constructors (which then inject the channel into various server structs whose Run methods already accept stop channels), is there a way to pass the stop channel through from the outermost Run invocations?
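
The two wiring styles being contrasted, as a hypothetical sketch (all names here are illustrative, not actual apiserver code):

```go
package server

type Server struct {
	stopCh <-chan struct{}
}

// Style used by the PR: the stop channel is a construction-time
// dependency, threaded through the config into the stores.
func NewServer(stopCh <-chan struct{}) *Server {
	return &Server{stopCh: stopCh}
}

// Style suggested in this comment: construction stays side-effect
// free, and the stop channel appears only at the outermost Run
// invocation, which can also drive teardown when the channel closes.
func (s *Server) Run(stopCh <-chan struct{}) error {
	defer s.teardown()
	<-stopCh
	return nil
}

func (s *Server) teardown() { /* close stores, stop goroutines */ }
```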

k8s-ci-robot added the cncf-cla: yes label and removed the cncf-cla: no label on Aug 15, 2017
frobware (Contributor, Author) commented Aug 15, 2017

@ironcladlou The trouble I had (and still have) with that is that the place that can hold the channel is the various Config types. To me there is a difference between what is effectively static configuration, which could be reused to create another server, and the channel, which is most definitely active. Having said that, some configuration values, if reused to create a new server, could cause creation to fail (e.g., the bind port).

k8s-reviewable

This change is Reviewable

ironcladlou (Contributor)

As far as I can tell, all this wiring to add state to the completedConfig is so that GenericAPIServer.installAPIResources can call Destroy on storage instances... It's not clear to me why NonBlockingRun (which has the stop channel via Run) can't do the teardown. It seems incredibly strange to add the stop channel (which relates only to execution, as pertains to calls to Run) to the config state and to make it a creation dependency. I still maintain the stop channel should propagate via (and ONLY via) Run, and if there are places where that breaks down, we should look very closely at those cases, because there's probably some other refactoring that needs to be done.

frobware force-pushed the fix-49489-make-testing.StartTestServer-close-storage-on-teardown branch from 9ed4f30 to 22b8d5a on August 16, 2017, 18:04
frobware (Contributor, Author)

I was trying to avoid adding state to the GenericAPIServer. We can certainly call an additional Shutdown() or DestroyStorage() after Run() has completed. However, wherever we call InstallAPI (and its ilk), we would have to persist the *APIGroupInfo on the genericapiserver because we need a handle to each store so that we can call store.Destroy().
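
A sketch of what that persistence would look like (type shapes are simplified and hypothetical; only store.Destroy() comes from the PR):

```go
package genericapiserver

// Destroyer stands in for the PR's store.Destroy() hook.
type Destroyer interface {
	Destroy()
}

// APIGroupInfo is simplified here: per version and resource, the map
// holds the REST storage backing that resource.
type APIGroupInfo struct {
	VersionedResourcesStorageMap map[string]map[string]Destroyer
}

// GenericAPIServer would gain extra state: every installed group is
// remembered so shutdown can reach its stores.
type GenericAPIServer struct {
	installedGroups []*APIGroupInfo
}

// InstallAPIGroup installs the group and keeps a handle for teardown.
func (s *GenericAPIServer) InstallAPIGroup(g *APIGroupInfo) {
	s.installedGroups = append(s.installedGroups, g)
}

// DestroyStorage walks every remembered store once Run() has returned.
func (s *GenericAPIServer) DestroyStorage() {
	for _, g := range s.installedGroups {
		for _, version := range g.VersionedResourcesStorageMap {
			for _, store := range version {
				store.Destroy()
			}
		}
	}
}
```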

ironcladlou (Contributor) commented Aug 16, 2017

I was trying to avoid adding state to the GenericAPIServer. We can certainly call an additional Shutdown() or DestroyStorage() after Run() has completed. However, wherever we call InstallAPI (and its ilk), we would have to persist the *APIGroupInfo on the genericapiserver because we need a handle to each store so that we can call store.Destroy().

If calling apiGroupVersion.InstallREST has the side effect of starting things which require stopping for a graceful shutdown, it seems appropriate for GenericAPIServer to track that sort of thing for later cleanup.

Seems to me that any stateful component which allows registration/installation of stuff whose lifecycle the component controls independently should also support de-registration/uninstallation, and cascading destruction of said stuff.

On that note, is GenericAPIServer really intended to control the lifecycle of Stores? Are the Stores we're explicitly shutting down shared with other components?

I'm finding it really difficult to understand the actual lifecycle of most components being wired around the system. I wonder if this change makes the lifecycle more or less opaque. 😟

sttts (Contributor) commented Aug 17, 2017

If calling apiGroupVersion.InstallREST has the side effect of starting things which require stopping for a graceful shutdown, it seems appropriate for GenericAPIServer to track that sort of thing for later cleanup.

I don't agree with that. A stop channel is essentially a context. Passing a context during creation is a good and established pattern, and we use it everywhere. Introducing another pattern for this purpose just for Stores feels wrong. Our plumbing is aligned along creation only; I don't want to double the complexity by adding shutdown logic. A context perfectly merges those two goals.
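
The creation-time pattern described here, in generic form (context.Context standing in for the stop channel; the Store type is illustrative, not actual apiserver code):

```go
package store

import "context"

// Store runs a background loop (think: etcd watch) whose lifetime is
// bound to the context supplied at creation.
type Store struct {
	done chan struct{}
}

// NewStore takes the context up front: cancelling it both stops the
// background work and releases resources, so "run" and "shutdown" are
// one signal rather than two separate plumbing paths.
func NewStore(ctx context.Context) *Store {
	s := &Store{done: make(chan struct{})}
	go func() {
		defer close(s.done) // signal that cleanup finished
		<-ctx.Done()        // a stop channel is essentially this
		// close connections, stop watch loops, etc.
	}()
	return s
}
```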

sttts (Contributor) commented Aug 17, 2017

/ok-to-test

k8s-ci-robot removed the needs-ok-to-test label on Aug 17, 2017
frobware (Contributor, Author)

I did experiment with passing a stop channel all the way through to the stores but it was significantly more intrusive: frobware@2e045be

frobware (Contributor, Author)

/cc @deads2k PTAL. Thanks.

deads2k (Contributor) commented Aug 17, 2017

I don't agree with that. A stop channel is essentially a context. Passing a context during creation is a good and established pattern, and we use it everywhere. Introducing another pattern for this purpose just for Stores feels wrong. Our plumbing is aligned along creation only; I don't want to double the complexity by adding shutdown logic. A context perfectly merges those two goals.

Most of the rest of our code takes a stop channel on a Run method. It strikes me as odd that this doesn't, and instead wires it up during the construction process. Did something get wired weirdly in a way that doesn't allow post-creation running? I'm betting the watch cache, without a separate Run, is driving this?

ironcladlou (Contributor)

@sttts

I don't agree with that. A stop channel is essentially a context. Passing a context during creation is a good and established pattern, and we use it everywhere. Introducing another pattern for this purpose just for Stores feels wrong. Our plumbing is aligned along creation only; I don't want to double the complexity by adding shutdown logic. A context perfectly merges those two goals.

@deads2k

Most of the rest of our code takes a stop channel on a Run method. It strikes me as odd that this doesn't, and instead wires it up during the construction process.

I would agree with @sttts if creation-based context were the actual pattern employed throughout the codebase. Instead, I have only seen context applied via Run-type methods post-creation. I have no objection to switching paradigms if everybody else is fine with having a mixture of patterns. (Although not everybody would agree that passing context around is a good idea in general.)

@deads2k

Did something get wired weirdly in a way that doesn't allow post-creation running? I'm betting the watch cache, without a separate Run, is driving this?

I think you're right about the watch cache at least.

frobware force-pushed the fix-49489-make-testing.StartTestServer-close-storage-on-teardown branch from 22b8d5a to 71107c5 on August 17, 2017, 13:29
frobware (Contributor, Author)

/retest

sttts (Contributor) commented Aug 17, 2017

I have no objection to switching paradigms if everybody else is fine with having a mixture of patterns.

Our mixture in the apiserver is more like "using a stopCh" vs. "leaking running goroutines".

I agree with @deads2k that a stopCh in non-Run routines smells, but I fear it's a major refactoring to fix our wiring so that we can cleanly run without anything creating goroutines beforehand. I don't see that happening very soon, and we need a solution here. IMO this PR is a very pragmatic way to get to a solution with an acceptable amount of ugliness via the new Store.Shutdown() func. I don't like the more thorough stopCh wiring in frobware/kubernetes@2e045be either.

@deads2k brought up the idea of porting @smarterclayton's integration test runner, which launches one process per test. This would solve our problem as well, since we could leak whatever we want without consequences. Is this something we can get soonish (within a few weeks)? If not, I would prefer the solution here in the PR. It's not perfect, a bit ugly, but good enough and it exists now. The additional complexity is very limited, and it's forgiving in the sense that if a destruction chain is slightly wrong it won't kill us, only leave some garbage.

(Although not everybody would agree that passing context around is a good idea in general.)

The author has no solution either, at least none on top of the language. I agree that Go should have better support for process trees and partial shutdown of those, but it doesn't in 1.x. So a context is the best pattern we can get for now.

ironcladlou (Contributor)

@sttts

@deads2k brought up the idea of porting @smarterclayton's integration test runner, which launches one process per test. This would solve our problem as well, since we could leak whatever we want without consequences. Is this something we can get soonish (within a few weeks)?

Thanks for bringing that up. To be honest, I have more confidence in the isolated test process approach than in our ability to get graceful shutdown working (and keep it working) across the board. I don't know that graceful shutdown is something anybody even cares about outside a test context. I'd almost rather see this PR replaced with the per-process test runner if there were widespread acceptance of the idea.

frobware (Contributor, Author)

@deads2k brought up the idea of porting @smarterclayton's integration test runner, which launches one process per test. This would solve our problem as well, since we could leak whatever we want without consequences.

When I was looking at this originally I measured the startup time of the server to be around 8s (on my hardware).

k8s-github-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Sep 8, 2017
k8s-ci-robot (Contributor) commented Sep 8, 2017

@frobware: The following test failed; say /retest to rerun them all:

Test name: pull-kubernetes-verify
Commit: 636e0a3
Rerun command: /test pull-kubernetes-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


k8s-github-robot

@frobware PR needs rebase

k8s-github-robot added the needs-rebase label on Sep 9, 2017
sttts mentioned this pull request on Oct 18, 2017
k8s-github-robot

This PR hasn't been active in 90 days. Closing this PR. Please reopen if you would like to work towards merging this change, if/when the PR is ready for the next round of review.

cc @caesarxuchao @frobware

You can add the 'keep-open' label to prevent this from happening again, or add a comment to keep it open for another 90 days.

MHBauer (Contributor) commented Jan 24, 2018

@frobware are you still around out there somewhere or should I take this and try to drive it forward?

frobware (Contributor, Author)

@frobware are you still around out there somewhere or should I take this and try to drive it forward?

I am, but working on other things ATM. Feel free to drive forward. Thanks.

sttts (Contributor) commented Jan 25, 2018

@MHBauer and assign me for review or ping me for discussion. I'm too busy to drive this myself, but I'm happy to support with review, opinion, and direction as far as I can.

paralin (Contributor) commented Mar 27, 2018

This PR is still mentioned in a comment in the code, but it's closed - is anyone going to finish this?

nikhita (Member) commented Jul 3, 2018

This PR is still mentioned in a comment in the code, but it's closed - is anyone going to finish this?

bump

wojtek-t (Member) commented Apr 5, 2022

FYI - I'm resurrecting this PR in #109303

Labels

cncf-cla: yes (indicates the PR's author has signed the CNCF CLA)
needs-rebase (indicates a PR cannot be merged because it has merge conflicts with HEAD)
release-note (denotes a PR that will be considered when it comes time to generate release notes)
size/L (denotes a PR that changes 100-499 lines, ignoring generated files)
Development

Successfully merging this pull request may close this issue: testing.StartTestServer doesn't tear down cleanly (#49489)