
WaitFor returns immediately when done is closed #72364

Merged
merged 2 commits into kubernetes:master on Jan 11, 2019

Conversation

Contributor

@kdada kdada commented Dec 27, 2018

What type of PR is this?

/kind bug
What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #72357

Special notes for your reviewer:

This PR addresses the review comments from #70277.

Previously, when the done channel was closed, WaitFor only closed stopCh and then kept waiting for the next signal from the wait channel. This PR fixes that: the WaitFor func now returns immediately when the done channel is closed.
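For context, the new control flow looks roughly like this (a sketch, not the exact merged code; WaitFunc, ConditionFunc, and ErrWaitTimeout are the existing names in k8s.io/apimachinery/pkg/util/wait):

func WaitFor(wait WaitFunc, fn ConditionFunc, done <-chan struct{}) error {
	stopCh := make(chan struct{})
	defer close(stopCh)
	c := wait(stopCh)
	for {
		select {
		case _, open := <-c:
			ok, err := fn()
			if err != nil {
				return err
			}
			if ok {
				return nil
			}
			if !open {
				// The WaitFunc closed its channel without the condition ever passing.
				return ErrWaitTimeout
			}
		case <-done:
			// The done channel was closed: return right away instead of waiting
			// for the next value from the WaitFunc channel.
			return ErrWaitTimeout
		}
	}
}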

Does this PR introduce a user-facing change?:

[Breaking change, client-go]: The WaitFor function returns, probably an ErrWaitTimeout, when the done channel is closed, even if the `WaitFunc` doesn't handle the done channel.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 27, 2018
@k8s-ci-robot
Contributor

Hi @kdada. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 27, 2018
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 27, 2018
@kdada
Contributor Author

kdada commented Dec 27, 2018

@caesarxuchao @cheftako PTAL

@liggitt
Member

liggitt commented Dec 27, 2018

/cc @caesarxuchao @cheftako
/uncc

since this is a follow-up of the PR they reviewed

@k8s-ci-robot k8s-ci-robot requested review from caesarxuchao and cheftako and removed request for liggitt December 27, 2018 16:35
}
case <-done:
closeCh()
break FOR
Member

I think it's easier to read if we just return here. Also, can we add a test for this case?

Contributor Author

@kdada kdada Dec 28, 2018

TestWaitForWithClosedChannel and TestWaitForWithEarlyClosingWaitFunc test the two branches that return ErrWaitTimeout.

I also documented the case where both done and c are closed:

// Be careful, when the 'wait' func returns a closed channel and the 'done' is also closed,
// the behavior of this function depends on the golang 'select'.
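As a standalone illustration of that caveat (toy code, not part of this PR): when both channels are ready, Go's select picks one of the ready cases via uniform pseudo-random selection, so either branch may run.

package main

import "fmt"

func main() {
	a := make(chan struct{})
	b := make(chan struct{})
	close(a)
	close(b)

	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		// Both cases can proceed, so select chooses between them pseudo-randomly.
		select {
		case <-a:
			counts["a"]++
		case <-b:
			counts["b"]++
		}
	}
	fmt.Println(counts) // roughly a 50/50 split, e.g. map[a:497 b:503]
}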

Member

For the test, I think cheftako wants to make sure that the close of the done channel can stop the WaitFor, regardless of the WaitFunc. For example, if the WaitFunc doesn't send anything to its returned channel, nor closes the channel, WaitFor should still be stopped by the close of the done channel. [edited] In TestWaitForWithClosedChannel, instead of passing a poller, can you just pass a WaitFunc that returns an open channel and doesn't handle the done channel?
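A minimal sketch of what I have in mind (written inside the wait package; the exact names and assertions are illustrative, not necessarily what should land):

func TestWaitForWithClosedChannel(t *testing.T) {
	stopCh := make(chan struct{})
	close(stopCh)

	// The WaitFunc ignores 'done' and returns a channel that is never written
	// to or closed, so only the already-closed stopCh can end the wait.
	c := make(chan struct{})
	defer close(c)

	start := time.Now()
	err := WaitFor(func(done <-chan struct{}) <-chan struct{} {
		return c
	}, func() (bool, error) {
		return false, nil
	}, stopCh)
	duration := time.Since(start)

	if err != ErrWaitTimeout {
		t.Errorf("expected ErrWaitTimeout, got %v", err)
	}
	// WaitFor should return almost immediately because stopCh is already closed.
	if duration >= ForeverTestTimeout/2 {
		t.Errorf("expected WaitFor to return quickly, took %v", duration)
	}
}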

Contributor Author

Done.

Member

I would like to see an additional test, something like:

stopCh := make(chan struct{})
go func() {
	time.Sleep(time.Second)
	close(stopCh)
}()
start := time.Now()
err := WaitFor(poller(ForeverTestTimeout, ForeverTestTimeout), func() (bool, error) {
	return false, nil
}, stopCh)

This clearly tests only the stopCh channel being closed and does not rely on it starting closed.
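If that test is added, the assertions at the end might look roughly like this (illustrative only, continuing the snippet above; err and start come from the WaitFor call):

duration := time.Since(start)
if err != ErrWaitTimeout {
	t.Errorf("expected ErrWaitTimeout, got %v", err)
}
// The poller would otherwise run for ForeverTestTimeout; returning much sooner
// shows that closing stopCh after ~1s is what ended the wait.
if duration >= ForeverTestTimeout/2 {
	t.Errorf("expected WaitFor to return soon after stopCh was closed, took %v", duration)
}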

@logicalhan
Member

/assign @caesarxuchao

@caesarxuchao
Member

In the release note, can you add "[Breaking change, client-go]:" at the front? That helps client-go owners prepare the client-go release notes. Thanks.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 28, 2018
@kdada
Contributor Author

kdada commented Dec 28, 2018

@caesarxuchao @cheftako PTAL

duration := time.Now().Sub(start)

// The WaitFor should return immediately, so the duration is close to 0s.
if duration >= ForeverTestTimeout/2 {
Member

Why divide it by 2 instead of using ForeverTestTimeout directly? Is the purpose to make the test less prone to false negatives (i.e., less likely to miss a bug)?

Contributor Author

Yes.

@@ -351,22 +351,20 @@ type WaitFunc func(done <-chan struct{}) <-chan struct{}
// WaitFor continually checks 'fn' as driven by 'wait'.
//
// WaitFor gets a channel from 'wait()'', and then invokes 'fn' once for every value
// placed on the channel and once more when the channel is closed.
// placed on the channel and once more when the channel is closed. If the channel is closed
// and 'fn' returns false without error, WaitFor returns ErrWaitTimeout.
//
// If 'fn' returns an error the loop ends and that error is returned, and if
Member

nit: s/, and if/. If/

Just to make the format of the three cases consistent.

// returning true.
//
// Be careful, when the 'wait' func returns a closed channel and the 'done' is also closed,
// the behavior of this function depends on the golang 'select'.
Member

This is a good point! Though I think the following is clearer; what do you think?

When the done channel is closed, because the golang `select` statement is "uniform pseudo-random", the `fn` might still run one or multiple times, though eventually `WaitFor` will return.

Contributor Author

OK

@kdada
Contributor Author

kdada commented Dec 28, 2018

@caesarxuchao PTAL

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 28, 2018
@caesarxuchao
Member

/lgtm

The release note is not accurate; it's actually contrary to the last sentence in the comment of WaitFor. How about something like:

[Breaking change, client-go]: The WaitFor function returns, probably an ErrWaitTimeout, when the done channel is closed, even if the `WaitFunc` doesn't handle the done channel.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 28, 2018
@kdada
Contributor Author

kdada commented Dec 29, 2018

/retest

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Dec 31, 2018
@kdada
Contributor Author

kdada commented Dec 31, 2018

@caesarxuchao
TestGarbageCollectorSync fails because of this PR.

In the test, the sync period is 10ms (at k8s.io/kubernetes/pkg/controller/garbagecollector/garbagecollector_test.go#895):

	go gc.Sync(fakeDiscoveryClient, 10*time.Millisecond, stopCh)

In the Sync method (at k8s.io/kubernetes/pkg/controller/garbagecollector/garbagecollector.go#232):

if !controller.WaitForCacheSync("garbage collector", waitForStopOrTimeout(stopCh, period), gc.dependencyGraphBuilder.IsSynced) {
	utilruntime.HandleError(fmt.Errorf("timed out waiting for dependency graph builder sync during GC sync (attempt %d)", attempt))
	return false, nil
}

The waitForStopOrTimeout function closes stopChWithTimeout after 10ms:

func waitForStopOrTimeout(stopCh <-chan struct{}, timeout time.Duration) <-chan struct{} {
	stopChWithTimeout := make(chan struct{})
	go func() {
		select {
		case <-stopCh:
		case <-time.After(timeout):
		}
		close(stopChWithTimeout)
	}()
	return stopChWithTimeout
}

Then controller.WaitForCacheSync will always return false, because cache.WaitForCacheSync does its first check only after the first 100ms, by which point the 10ms stop channel has already been closed.

const (
	syncedPollPeriod = 100 * time.Millisecond
)

// WaitForCacheSync waits for caches to populate.  It returns true if it was successful, false
// if the controller should shutdown
func WaitForCacheSync(stopCh <-chan struct{}, cacheSyncs ...InformerSynced) bool {
	err := wait.PollUntil(syncedPollPeriod,
		func() (bool, error) {
			for _, syncFunc := range cacheSyncs {
				if !syncFunc() {
					return false, nil
				}
			}
			return true, nil
		},
		stopCh)
	if err != nil {
		klog.V(2).Infof("stop requested")
		return false
	}

	klog.V(4).Infof("caches populated")
	return true
}
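To make the timing interplay concrete, here is a rough standalone illustration (not the GC test itself; it just reproduces the 10ms stop channel vs. 100ms poll period with the public wait API):

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// A stop channel that closes after 10ms, like waitForStopOrTimeout(stopCh, 10*time.Millisecond).
	stopCh := make(chan struct{})
	go func() {
		time.Sleep(10 * time.Millisecond)
		close(stopCh)
	}()

	start := time.Now()
	err := wait.PollUntil(100*time.Millisecond, func() (bool, error) {
		return false, nil // the cache never reports synced in this illustration
	}, stopCh)

	// With the fixed WaitFor this prints ErrWaitTimeout after roughly 10ms,
	// before the first 100ms poll ever fires, so WaitForCacheSync reports false.
	fmt.Println(err, time.Since(start))
}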

I think we need to add more comments to cache.WaitForCacheSync.

@caesarxuchao
Member

Good analysis. The TestGarbageCollectorSync failure demonstrates the backwards incompatibility of this PR. The 10ms timeout in the original test wasn't reasonable.

@@ -856,7 +856,7 @@ func TestGarbageCollectorSync(t *testing.T) {
stopCh := make(chan struct{})
defer close(stopCh)
go gc.Run(1, stopCh)
go gc.Sync(fakeDiscoveryClient, 10*time.Millisecond, stopCh)
go gc.Sync(fakeDiscoveryClient, 100*time.Millisecond, stopCh)
Member

Isn't this 100 ms timeout also going to be flaky, given that the first check also happens after 100ms?

Member

Yes. We should not do anything in tests with such tight time tolerances. The 10ms sync loop period was in contrast to the 1s wait to see if progress was made (i.e., a 100x difference between the loop period and the test wait).

I'm trying to understand what change is being made in this PR that broke this test... is that going to cause problems in real life as well?

Member

The old WaitFor will run the condition check function at least once more after the done channel is closed. The new behavior will most likely return a timeout error, though due to the nature of select, it has a slim chance to run the condition function once more.

I doubt any user would intentionally rely on the old behavior. Also, the old behavior wasn't explicitly documented. Hence, I think the chance to cause real problems is slim.

Member

@liggitt liggitt Jan 4, 2019

Ok, thanks for clarifying

Contributor Author

Changed to 1 second.

@@ -856,7 +856,7 @@ func TestGarbageCollectorSync(t *testing.T) {
stopCh := make(chan struct{})
defer close(stopCh)
go gc.Run(1, stopCh)
go gc.Sync(fakeDiscoveryClient, 10*time.Millisecond, stopCh)
Member

Changing this to 1 second means the 1-second time.Sleep is now too short to reliably catch sync problems. The test should be able to wait significantly longer than the resync period to ensure progress is made when expected. Having to lengthen the resync period this much means having to extend the test wait time significantly, which makes the test take much longer to run than desired.

Member

IIUC, the original 100x difference was to ensure the gc sync was run ~100 times to catch any flakiness.

How about setting it to 200ms here and letting the test run for 10s (by setting the sleep time to 10s)? Then the sync gets tested ~50 times.

Also note that the old test wasn't as reliable as it intended to be. Although the sync period was set to 10ms, because the old WaitFor function didn't handle the closed done channel, the WaitForCacheSync returned 100ms later, after the first poll period. So the old test only ran the sync behavior ~10 times in the 1s test time.

Member

the original 100x difference was to ensure the gc sync was run ~100 times to catch any flakiness

it was actually to make sure the test waited waaaaay longer than the resync period to ensure there was time to complete at least 2 iterations

Contributor Author

@kdada kdada Jan 7, 2019

The pseudo-code of GarbageCollector.Sync():

// In this case, `stopCh` won't be closed. Ignore it.
GarbageCollector.Sync():
    wait.Until() loops with period:
        wait.PollImmediateUntil() loops with 100ms (hardcoded):
            controller.WaitForCacheSync() loops with a channel that will be closed after `period`.
            This loop never returns unless the sync is finished.

A 1-second period makes controller.WaitForCacheSync() try ~10 times to check whether the cache is synced.

A 200ms period makes ~2 executions of controller.WaitForCacheSync() in every wait.PollImmediateUntil() loop. In total, controller.WaitForCacheSync() also executes about 10 times. (Due to the behavior of time.Ticker, it may drop some ticks for slow receivers, so 10 is an upper bound.)

200ms is better because it tests both wait.PollImmediateUntil() and controller.WaitForCacheSync().

Member

ok

Member

that's probably worth documenting in a comment in the test

Member

LGTM.

@kdada could you add the comment as liggitt suggested? The pseudo-code is great; it makes the reasoning much clearer. I added a little more detail to it.

GarbageCollector.Sync():
    wait.Until() loops with `period` until the `stopCh` is closed:
        wait.PollImmediateUntil() loops with 100ms (hardcoded) until the `stopCh` is closed:
            controller.WaitForCacheSync() loops with `syncedPollPeriod` (hardcoded to 100ms), until either its stop channel is closed after `period`, or all caches synced.

The final goal is to make sure that the outermost wait.Until loop runs at least 2 times during the 1s sleep, which ensures that the changes made to the fakeDiscoveryClient are picked up by the Sync.

Contributor Author

@kdada kdada Jan 9, 2019

Contributor

lgtm

@kdada
Contributor Author

kdada commented Jan 7, 2019

/retest

@kdada
Contributor Author

kdada commented Jan 9, 2019

/retest

Member

@caesarxuchao caesarxuchao left a comment

@kdada I still have a few nits regarding the comment. I'll take another look tonight to make the iteration faster :)

// controller.WaitForCacheSync() loops with `syncedPollPeriod` (hardcoded to 100ms),
// until either its stop channel is closed after `period`, or all caches synced.
//
// 200ms of period makes ~2 executions of controller.WaitForCacheSync() in every wait.PollImmediateUntil() loop.
Member

This line isn't accurate. Every loop in wait.PollImmediateUntil() always executes controller.WaitForCacheSync() once.

Can you update it to "Setting the period to 200ms allows the WaitForCacheSync() to check for cache sync ~2 times in every wait.PollImmediateUntil() loop."

@@ -856,7 +856,17 @@ func TestGarbageCollectorSync(t *testing.T) {
stopCh := make(chan struct{})
defer close(stopCh)
go gc.Run(1, stopCh)
go gc.Sync(fakeDiscoveryClient, 10*time.Millisecond, stopCh)
// The pseudo-code of GarbageCollector.Sync():
// GarbageCollector.Sync():
Member

nit: add the input parameters, e.g., GarbageCollector.Sync(client, period, stopCh). Otherwise readers won't know what "period" we are referring to.

// until either its stop channel is closed after `period`, or all caches synced.
//
// 200ms of period makes ~2 executions of controller.WaitForCacheSync() in every wait.PollImmediateUntil() loop.
// Finally controller.WaitForCacheSync() executes about 10 times (Due to the behavior of time.Ticker,
Member

Is it really important to run WaitForCacheSync 10 times? I think the important thing is letting gc.Sync() loop at least twice to ensure the changes made to the fakeDiscoveryClient are picked up.

Contributor Author

@kdada kdada Jan 10, 2019

We can't escape from wait.PollImmediateUntil() unless controller.WaitForCacheSync() returns true.
So there are only two paths:

  1. gc.Sync() -> wait.PollImmediateUntil() -> controller.WaitForCacheSync() returns false -> wait.PollImmediateUntil() loops.
  2. gc.Sync() -> wait.PollImmediateUntil() -> controller.WaitForCacheSync() returns true -> gc.Sync() loops.

Member

Right. My previous comment on "letting gc.Sync() loop at least twice" was wrong. But still, the important thing isn't running WaitForCacheSync() ten times; it is running GetDeletableResources and then resync at least twice, so that changes to the fakeDiscoveryClient are picked up.

Also, I would add one line to the pseudo-code:

GarbageCollector.Sync():
    wait.Until() loops every `period` until the `stopCh` is closed:
        wait.PollImmediateUntil() loops every 100ms (hardcoded) until the `stopCh` is closed:
            GetDeletableResources and gc.resyncMonitors()
            controller.WaitForCacheSync() loops every `syncedPollPeriod` (hardcoded to 100ms), until either its stop channel is closed after `period`, or all caches synced.

Contributor Author

Done

//
// 200ms of period makes ~2 executions of controller.WaitForCacheSync() in every wait.PollImmediateUntil() loop.
// Finally controller.WaitForCacheSync() executes about 10 times (Due to the behavior of time.Ticker,
// it may drop some ticks for slow receivers. So the 10 is the upper bound).
Member

nit: I would drop this line. It's not important for the reader.

@caesarxuchao
Member

/lgtm

Thanks, @kdada

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 10, 2019
@kdada
Contributor Author

kdada commented Jan 11, 2019

@caesarxuchao need an /approve :)

@caesarxuchao
Member

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: caesarxuchao, kdada

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 11, 2019
hh pushed a commit to ii/kubernetes that referenced this pull request Jan 11, 2019
WaitFor returns immediately when done is closed
@caesarxuchao
Member

@kdada it seems you have to rebase.

@k8s-ci-robot k8s-ci-robot merged commit 44b2e84 into kubernetes:master Jan 11, 2019
k8s-ci-robot added a commit that referenced this pull request Jan 25, 2019
followup of #72364, slightly improve the comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/bug Categorizes issue or PR as related to a bug.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
needs-priority Indicates a PR lacks a `priority/foo` label and requires one.
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
release-note Denotes a PR that will be considered when it comes time to generate release notes.
sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.
sig/apps Categorizes an issue or PR as relevant to SIG Apps.
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Follow up #70277, the close of the "done" channel should terminate WaitFor
7 participants