
[WIP] Make testing.StartTestServer close cleanly #50690

Conversation

frobware (Contributor) commented Aug 15, 2017

What this PR does / why we need it:

This PR ensures that the test apiserver closes cleanly. Without this
change there are many repeated reconnection attempts to etcd at
sub-second intervals, accompanied by a lot of log spam indicating that
the connection could not be made. This also results in the
accumulation of many thousands of goroutines, which in turn prevent
effective use of the test server across multiple test functions within
the same process.

This PR introduces Storage.Destroy(), and the test server now closes
all its stores when it is shut down.

Prior to this change, ~1500+ goroutines remain after the server
stops; with the change, ~200 remain. This is a stepping-stone on the
way to reducing that number further.
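
A rough sketch of the shape of the change (type names and wiring here are illustrative; only Storage.Destroy() itself comes from this PR):

```go
package storage

// Destroyer stands in for the hook this PR adds: the actual method is
// Storage.Destroy(). Stores that hold etcd connections or background
// goroutines implement it to release them.
type Destroyer interface {
	Destroy()
}

// TestServer is a hypothetical stand-in for the test apiserver; it
// remembers each store it creates so TearDown can close them.
type TestServer struct {
	stores []Destroyer
}

// TearDown destroys every registered store after the server stops,
// which prevents the etcd reconnect loops and leaked goroutines
// described above.
func (s *TestServer) TearDown() {
	for _, st := range s.stores {
		st.Destroy()
	}
}
```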

Which issue this PR fixes

Fixes #49489

Special notes for your reviewer:

Release note:

k8s-ci-robot (Contributor)

Thanks for your pull request. Before we can look at it, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot added the cncf-cla: no label (indicates the PR's author has not signed the CNCF CLA) on Aug 15, 2017
k8s-ci-robot (Contributor)

Hi @frobware. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) on Aug 15, 2017
k8s-github-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull request has been approved by: frobware
We suggest the following additional approvers: deads2k, nikhiljindal

Assign the PR to them by writing /assign @deads2k @nikhiljindal in a comment when ready.

Associated issue: 49489

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS files:

You can indicate your approval by writing /approve in a comment.
You can cancel your approval by writing /approve cancel in a comment.

k8s-github-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) and the release-note-label-needed label on Aug 15, 2017
ironcladlou (Contributor)

Instead of injecting a stop channel through all these constructors (which then inject the channel into various server structs whose Run methods already accept stop channels), is there a way to pass the stop channel through from the outermost Run invocations?
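
The two wiring styles being contrasted, as a hypothetical sketch (all names here are illustrative, not actual apiserver code):

```go
package server

type Server struct {
	stopCh <-chan struct{}
}

// Style used by the PR: the stop channel is a construction-time
// dependency, threaded through the config into the stores.
func NewServer(stopCh <-chan struct{}) *Server {
	return &Server{stopCh: stopCh}
}

// Style suggested in this comment: construction stays side-effect
// free, and the stop channel appears only at the outermost Run
// invocation, which can also drive teardown when the channel closes.
func (s *Server) Run(stopCh <-chan struct{}) error {
	defer s.teardown()
	<-stopCh
	return nil
}

func (s *Server) teardown() { /* close stores, stop goroutines */ }
```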

k8s-ci-robot added the cncf-cla: yes label and removed the cncf-cla: no label on Aug 15, 2017
frobware (Contributor, Author) commented Aug 15, 2017

@ironcladlou The trouble I had (and still have) with that is that the place that can hold the channel is the various Config types. To me there is a difference between what is effectively static configuration, which could be reused to create another server, and the channel, which is most definitely active. Having said that, some configuration values, if reused to create a new server, could cause creation to fail (e.g., the bind port).

k8s-reviewable

This change is Reviewable

ironcladlou (Contributor)

As far as I can tell, all this wiring to add state to the completedConfig is so that GenericAPIServer.installAPIResources can call Destroy on storage instances... It's not clear to me why NonBlockingRun (which has the stop channel via Run) can't do the teardown. It seems incredibly strange to add the stop channel (which relates only to execution, as pertains to calls to Run) to the config state and to make it a creation dependency. I still maintain the stop channel should propagate via (and ONLY via) Run, and if there are places where that breaks down, we should look very closely at those cases, because there's probably some other refactoring that needs to be done.

frobware force-pushed the fix-49489-make-testing.StartTestServer-close-storage-on-teardown branch from 9ed4f30 to 22b8d5a on August 16, 2017, 18:04
frobware (Contributor, Author)

I was trying to avoid adding state to the GenericAPIServer. We can certainly call an additional Shutdown() or DestroyStorage() after Run() has completed. However, wherever we call InstallAPI (and its ilk), we would have to persist the *APIGroupInfo on the genericapiserver because we need a handle to each store so that we can call store.Destroy().
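
A sketch of what that persistence would look like (type shapes are simplified and hypothetical; only store.Destroy() comes from the PR):

```go
package genericapiserver

// Destroyer stands in for the PR's store.Destroy() hook.
type Destroyer interface {
	Destroy()
}

// APIGroupInfo is simplified here: per version and resource, the map
// holds the REST storage backing that resource.
type APIGroupInfo struct {
	VersionedResourcesStorageMap map[string]map[string]Destroyer
}

// GenericAPIServer would gain extra state: every installed group is
// remembered so shutdown can reach its stores.
type GenericAPIServer struct {
	installedGroups []*APIGroupInfo
}

// InstallAPIGroup installs the group and keeps a handle for teardown.
func (s *GenericAPIServer) InstallAPIGroup(g *APIGroupInfo) {
	s.installedGroups = append(s.installedGroups, g)
}

// DestroyStorage walks every remembered store once Run() has returned.
func (s *GenericAPIServer) DestroyStorage() {
	for _, g := range s.installedGroups {
		for _, version := range g.VersionedResourcesStorageMap {
			for _, store := range version {
				store.Destroy()
			}
		}
	}
}
```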

ironcladlou (Contributor) commented Aug 16, 2017

I was trying to avoid adding state to the GenericAPIServer. We can certainly call an additional Shutdown() or DestroyStorage() after Run() has completed. However, wherever we call InstallAPI (and its ilk), we would have to persist the *APIGroupInfo on the genericapiserver because we need a handle to each store so that we can call store.Destroy().

If calling apiGroupVersion.InstallREST has the side effect of starting things which require stopping for a graceful shutdown, it seems appropriate for GenericAPIServer to track that sort of thing for later cleanup.

Seems to me that any stateful component which allows registration/installation of stuff whose lifecycle the component controls independently should also support de-registration/uninstallation, and cascading destruction of said stuff.

On that note, is GenericAPIServer really intended to control the lifecycle of Stores? Are the Stores we're explicitly shutting down shared with other components?

I'm finding it really difficult to understand the actual lifecycle of most components being wired around the system. I wonder if this change makes the lifecycle more or less opaque. 😟

sttts (Contributor) commented Aug 17, 2017

If calling apiGroupVersion.InstallREST has the side effect of starting things which require stopping for a graceful shutdown, it seems appropriate for GenericAPIServer to track that sort of thing for later cleanup.

I don't agree with that. A stop channel is essentially a context. Passing a context during creation is a good and established pattern, and we use it everywhere. Introducing another pattern for this purpose just for Stores feels wrong. Our plumbing is aligned along creation only; I don't want to double the complexity by adding shutdown logic. A context perfectly merges those two goals.
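
The creation-time pattern described here, in generic form (context.Context standing in for the stop channel; the Store type is illustrative, not actual apiserver code):

```go
package store

import "context"

// Store runs a background loop (think: etcd watch) whose lifetime is
// bound to the context supplied at creation.
type Store struct {
	done chan struct{}
}

// NewStore takes the context up front: cancelling it both stops the
// background work and releases resources, so "run" and "shutdown" are
// one signal rather than two separate plumbing paths.
func NewStore(ctx context.Context) *Store {
	s := &Store{done: make(chan struct{})}
	go func() {
		defer close(s.done) // signal that cleanup finished
		<-ctx.Done()        // a stop channel is essentially this
		// close connections, stop watch loops, etc.
	}()
	return s
}
```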

sttts (Contributor) commented Aug 17, 2017

/ok-to-test

k8s-ci-robot removed the needs-ok-to-test label on Aug 17, 2017
frobware (Contributor, Author)

I did experiment with passing a stop channel all the way through to the stores but it was significantly more intrusive: frobware@2e045be

frobware (Contributor, Author)

/cc @deads2k PTAL. Thanks.

deads2k (Contributor) commented Aug 17, 2017

I don't agree with that. A stop channel is essentially a context. Passing a context during creation is a good and established pattern, and we use it everywhere. Introducing another pattern for this purpose just for Stores feels wrong. Our plumbing is aligned along creation only; I don't want to double the complexity by adding shutdown logic. A context perfectly merges those two goals.

Most of the rest of our code takes a stop channel on a Run method. It strikes me as odd that this doesn't, and instead wires it up during the construction process. Did something get wired weirdly in a way that doesn't allow post-creation running? I'm betting the watch cache, without a separate Run, is driving this?

ironcladlou (Contributor)

@sttts

I don't agree with that. A stop channel is essentially a context. Passing a context during creation is a good and established pattern, and we use it everywhere. Introducing another pattern for this purpose just for Stores feels wrong. Our plumbing is aligned along creation only; I don't want to double the complexity by adding shutdown logic. A context perfectly merges those two goals.

@deads2k

Most of the rest of our code takes a stop channel on a Run method. It strikes me as odd that this doesn't, and instead wires it up during the construction process.

I would agree with @sttts if creation-based context were the actual pattern employed throughout the codebase. Instead, I have only seen context applied via Run-type methods post-creation. I have no objection to switching paradigms if everybody else is fine with having a mixture of patterns. (Although not everybody would agree that passing context around is a good idea in general.)

@deads2k

Did something get wired weirdly in a way that doesn't allow post-creation running? I'm betting the watch cache, without a separate Run, is driving this?

I think you're right about the watch cache at least.

frobware force-pushed the fix-49489-make-testing.StartTestServer-close-storage-on-teardown branch from 22b8d5a to 71107c5 on August 17, 2017, 13:29
frobware (Contributor, Author)

/retest

sttts (Contributor) commented Aug 17, 2017

I have no objection to switching paradigms if everybody else is fine with having a mixture of patterns.

Our mixture in the apiserver is more like "using a stopCh" vs. "leaking running goroutines".

I agree with @deads2k that a stopCh in non-Run routines smells, but I fear it's a major refactoring to fix our wiring so that we can cleanly run without anything creating goroutines beforehand. I don't see that happening very soon, and we need a solution here. IMO this PR is a very pragmatic way to get to a solution with an acceptable amount of ugliness via the new Store.Shutdown() func. I don't like the more thorough stopCh wiring in frobware/kubernetes@2e045be either.

@deads2k brought up the idea of porting @smarterclayton's integration test runner, which launches one process per test. This would solve our problem as well, since we could leak whatever we want without consequences. Is this something we can get soonish (within a few weeks)? If not, I would prefer the solution here in the PR. It's not perfect, a bit ugly, but good enough and it exists now. The additional complexity is very limited, and it's forgiving in the sense that if a destruction chain is slightly wrong it won't kill us, only leave some garbage.

(Although not everybody would agree that passing context around is a good idea in general.)

The author has no solution either, at least none on top of the language. I agree that Go should have better support for process trees and partial shutdown of those, but it doesn't in 1.x. So a context is the best pattern we can get for now.

ironcladlou (Contributor)

@sttts

@deads2k brought up the idea of porting @smarterclayton's integration test runner, which launches one process per test. This would solve our problem as well, since we could leak whatever we want without consequences. Is this something we can get soonish (within a few weeks)?

Thanks for bringing that up. To be honest, I have more confidence in the isolated test process approach than in our ability to get graceful shutdown working (and keep it working) across the board. I don't know that graceful shutdown is something anybody even cares about outside a test context. I'd almost rather see this PR replaced with the per-process test runner if there were widespread acceptance of the idea.

frobware (Contributor, Author)

@deads2k brought up the idea of porting @smarterclayton's integration test runner, which launches one process per test. This would solve our problem as well, since we could leak whatever we want without consequences.

When I was looking at this originally I measured the startup time of the server to be around 8s (on my hardware).

k8s-github-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Sep 8, 2017
k8s-ci-robot (Contributor) commented Sep 8, 2017

@frobware: The following test failed; say /retest to rerun them all:

Test name: pull-kubernetes-verify
Commit: 636e0a3
Rerun command: /test pull-kubernetes-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


k8s-github-robot

@frobware PR needs rebase

k8s-github-robot added the needs-rebase label on Sep 9, 2017
sttts mentioned this pull request on Oct 18, 2017
k8s-github-robot

This PR hasn't been active in 90 days. Closing this PR. Please reopen if you would like to work towards merging this change, if/when the PR is ready for the next round of review.

cc @caesarxuchao @frobware

You can add the 'keep-open' label to prevent this from happening again, or add a comment to keep it open for another 90 days.

MHBauer (Contributor) commented Jan 24, 2018

@frobware are you still around out there somewhere or should I take this and try to drive it forward?

frobware (Contributor, Author)

@frobware are you still around out there somewhere or should I take this and try to drive it forward?

I am, but working on other things ATM. Feel free to drive forward. Thanks.

sttts (Contributor) commented Jan 25, 2018

@MHBauer and assign me for review or ping me for discussion. I'm too busy to drive this myself, but I'm happy to support with review, opinion, and direction as far as I can.

paralin (Contributor) commented Mar 27, 2018

This PR is still mentioned in a comment in the code, but it's closed - is anyone going to finish this?

nikhita (Member) commented Jul 3, 2018

This PR is still mentioned in a comment in the code, but it's closed - is anyone going to finish this?

bump

wojtek-t (Member) commented Apr 5, 2022

FYI - I'm resurrecting this PR in #109303

Labels

cncf-cla: yes (indicates the PR's author has signed the CNCF CLA)
needs-rebase (indicates a PR cannot be merged because it has merge conflicts with HEAD)
release-note (denotes a PR that will be considered when it comes time to generate release notes)
size/L (denotes a PR that changes 100-499 lines, ignoring generated files)
Development

Successfully merging this pull request may close this issue: testing.StartTestServer doesn't tear down cleanly (#49489)