
WaitFor returns immediately when done is closed #72364

Merged
merged 2 commits into kubernetes:master from kdada:master on Jan 11, 2019

Conversation

@kdada
Contributor

kdada commented Dec 27, 2018

What type of PR is this?

Uncomment only one, leave it on its own line:

/kind api-change

/kind bug

/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #72357

Special notes for your reviewer:

This PR addresses the review comments from #70277.

Previously, when the done channel was closed, WaitFor just closed stopCh and kept waiting for the next signal from the wait channel. This PR fixes that issue: the WaitFor func now returns immediately when the done channel is closed.
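
As a rough sketch (not necessarily the exact merged code), the new control flow looks roughly like this, assuming the existing WaitFunc, ConditionFunc, and ErrWaitTimeout definitions from the wait package:

// Sketch of the fixed WaitFor: the select now has a case for 'done',
// so closing 'done' makes the function return instead of blocking on 'c'.
func WaitFor(wait WaitFunc, fn ConditionFunc, done <-chan struct{}) error {
	stopCh := make(chan struct{})
	defer close(stopCh)
	c := wait(stopCh)
	for {
		select {
		case _, open := <-c:
			ok, err := fn()
			if err != nil {
				return err
			}
			if ok {
				return nil
			}
			if !open {
				// The channel from 'wait' was closed without the condition passing.
				return ErrWaitTimeout
			}
		case <-done:
			// 'done' is closed: return immediately instead of waiting for the next value from 'c'.
			return ErrWaitTimeout
		}
	}
}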

Does this PR introduce a user-facing change?:

[Breaking change, client-go]: The WaitFor function returns, probably an ErrWaitTimeout, when the done channel is closed, even if the `WaitFunc` doesn't handle the done channel.
@k8s-ci-robot

Contributor

k8s-ci-robot commented Dec 27, 2018

Hi @kdada. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kdada kdada force-pushed the kdada:master branch from 9761ad9 to f1a5f11 Dec 27, 2018

@liggitt

Member

liggitt commented Dec 27, 2018

/cc @caesarxuchao @cheftako
/uncc

since this is a follow-up of the PR they reviewed

@k8s-ci-robot k8s-ci-robot requested review from caesarxuchao and cheftako and removed request for liggitt Dec 27, 2018

		}
	case <-done:
		closeCh()
		break FOR


@cheftako

cheftako Dec 27, 2018

Member

I think it's easier to read if we just return here. Also, can we add a test for this case?


@kdada

kdada Dec 28, 2018

Author Contributor

TestWaitForWithClosedChannel and TestWaitForWithEarlyClosingWaitFunc test the two branches which return ErrWaitTimeout.

Also, I documented the case where done and c are both closed:

// Be careful, when the 'wait' func returns a closed channel and the 'done' is also closed,
// the behavior of this function depends on the golang 'select'.


@caesarxuchao

caesarxuchao Dec 28, 2018

Member

For the test, I think cheftako wants to make sure that the close of the done channel can stop the WaitFor, regardless of the WaitFunc. For example, if the WaitFunc doesn't send anything to its returned channel, nor close the channel, WaitFor should still be stopped by the close of the done channel. [edited] In TestWaitForWithClosedChannel, instead of passing a poller, can you just pass a WaitFunc that returns an open channel and doesn't handle the done channel?
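
For illustration only, a minimal sketch of what such a test could look like, assuming it lives in the wait package's test file (the exact assertions here are assumptions, not necessarily the test that was added):

func TestWaitForWithClosedChannel(t *testing.T) {
	// done channel that is already closed.
	stopCh := make(chan struct{})
	close(stopCh)
	c := make(chan struct{})
	defer close(c)
	// WaitFunc that returns an open channel and never looks at 'done'.
	err := WaitFor(func(done <-chan struct{}) <-chan struct{} {
		return c
	}, func() (bool, error) {
		return false, nil
	}, stopCh)
	if err != ErrWaitTimeout {
		t.Errorf("expected ErrWaitTimeout, got %v", err)
	}
}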


@kdada

kdada Dec 28, 2018

Author Contributor

Done.


@cheftako

cheftako Dec 28, 2018

Member

I would like to see an additional test, something like:

stopCh := make(chan struct{})
go func() {
	time.Sleep(time.Second)
	close(stopCh)
}()
start := time.Now()
err := WaitFor(poller(ForeverTestTimeout, ForeverTestTimeout), func() (bool, error) {
	return false, nil
}, stopCh)

This clearly tests only the stopCh channel being closed and does not rely on it starting closed.
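
A possible way to finish that sketch with assertions (these exact checks are an assumption, along the lines of what the final test does with ForeverTestTimeout and ErrWaitTimeout):

duration := time.Since(start)
if err != ErrWaitTimeout {
	t.Errorf("expected ErrWaitTimeout, got %v", err)
}
// WaitFor should return roughly when stopCh is closed (~1s here),
// well before ForeverTestTimeout elapses.
if duration >= ForeverTestTimeout/2 {
	t.Errorf("WaitFor took too long to return: %v", duration)
}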

@logicalhan

Contributor

logicalhan commented Dec 27, 2018

/assign @caesarxuchao

@caesarxuchao

Member

caesarxuchao commented Dec 27, 2018

In the release note can you add "[Breaking change, client-go]:" in the front? That helps client-go owners to prepare client-go release notes. Thanks.

@kdada kdada force-pushed the kdada:master branch from f1a5f11 to 807b948 Dec 28, 2018

@k8s-ci-robot k8s-ci-robot added size/M and removed size/S labels Dec 28, 2018

@kdada

Contributor Author

kdada commented Dec 28, 2018

duration := time.Now().Sub(start)

// The WaitFor should return immediately, so the duration is close to 0s.
if duration >= ForeverTestTimeout/2 {


@caesarxuchao

caesarxuchao Dec 28, 2018

Member

Why divide it by 2 instead of using ForeverTestTimeout directly? Is the purpose to make the test less prone to false negatives (i.e., less likely to miss a bug)?


@kdada

kdada Dec 28, 2018

Author Contributor

Yes.

@@ -351,22 +351,20 @@ type WaitFunc func(done <-chan struct{}) <-chan struct{}
// WaitFor continually checks 'fn' as driven by 'wait'.
//
// WaitFor gets a channel from 'wait()'', and then invokes 'fn' once for every value
// placed on the channel and once more when the channel is closed.
// placed on the channel and once more when the channel is closed. If the channel is closed
// and 'fn' returns false without error, WaitFor returns ErrWaitTimeout.
//
// If 'fn' returns an error the loop ends and that error is returned, and if


@caesarxuchao

caesarxuchao Dec 28, 2018

Member

nit: s/, and if/. If/

Just to make the format of the three cases consistent.

// returning true.
//
// Be careful, when the 'wait' func returns a closed channel and the 'done' is also closed,
// the behavior of this function depends on the golang 'select'.


@caesarxuchao

caesarxuchao Dec 28, 2018

Member

This is a good point! Though I think the following is clearer, what do you think?

When the done channel is closed, because the golang `select` statement is "uniform pseudo-random", the `fn` might still run one or multiple times, though eventually `WaitFor` will return.
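
A tiny standalone illustration of that select behavior (purely illustrative, not part of the PR): when both channels are ready, either case may be chosen.

package main

import "fmt"

func main() {
	c := make(chan struct{})
	done := make(chan struct{})
	close(c)
	close(done)
	// Both cases are ready, so Go's select picks one pseudo-randomly:
	// sometimes the 'c' case wins and 'fn' would run once more,
	// sometimes the 'done' case wins and WaitFor returns ErrWaitTimeout.
	select {
	case <-c:
		fmt.Println("c case chosen: fn runs once more")
	case <-done:
		fmt.Println("done case chosen: WaitFor returns")
	}
}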


@kdada

kdada Dec 28, 2018

Author Contributor

OK

@kdada kdada force-pushed the kdada:master branch from 807b948 to a766ed2 Dec 28, 2018

@caesarxuchao

Member

caesarxuchao commented Dec 28, 2018

/lgtm

The release note is not accurate; it actually contradicts the last sentence in the comment of WaitFor. How about something like:

[Breaking change, client-go]: The WaitFor function returns, probably an ErrWaitTimeout, when the done channel is closed, even if the `WaitFunc` doesn't handle the done channel.

@k8s-ci-robot k8s-ci-robot added the lgtm label Dec 28, 2018

@kdada

Contributor Author

kdada commented Dec 29, 2018

/retest

@k8s-ci-robot k8s-ci-robot added sig/apps and removed lgtm labels Dec 31, 2018

@kdada

Contributor Author

kdada commented Dec 31, 2018

@caesarxuchao
TestGarbageCollectorSync fails because of this PR.

In the test, the sync period is 10ms (at k8s.io/kubernetes/pkg/controller/garbagecollector/garbagecollector_test.go#895):

	go gc.Sync(fakeDiscoveryClient, 10*time.Millisecond, stopCh)

In the Sync method (at k8s.io/kubernetes/pkg/controller/garbagecollector/garbagecollector.go#232):

if !controller.WaitForCacheSync("garbage collector", waitForStopOrTimeout(stopCh, period), gc.dependencyGraphBuilder.IsSynced) {
	utilruntime.HandleError(fmt.Errorf("timed out waiting for dependency graph builder sync during GC sync (attempt %d)", attempt))
	return false, nil
}

The waitForStopOrTimeout method will close the stopChWithTimeout after 10ms.

func waitForStopOrTimeout(stopCh <-chan struct{}, timeout time.Duration) <-chan struct{} {
	stopChWithTimeout := make(chan struct{})
	go func() {
		select {
		case <-stopCh:
		case <-time.After(timeout):
		}
		close(stopChWithTimeout)
	}()
	return stopChWithTimeout
}

Then controller.WaitForCacheSync will always return false, because cache.WaitForCacheSync only performs its first check after the first 100ms, by which time the 10ms stop channel has already been closed.

const (
	syncedPollPeriod = 100 * time.Millisecond
)

// WaitForCacheSync waits for caches to populate.  It returns true if it was successful, false
// if the controller should shutdown
func WaitForCacheSync(stopCh <-chan struct{}, cacheSyncs ...InformerSynced) bool {
	err := wait.PollUntil(syncedPollPeriod,
		func() (bool, error) {
			for _, syncFunc := range cacheSyncs {
				if !syncFunc() {
					return false, nil
				}
			}
			return true, nil
		},
		stopCh)
	if err != nil {
		klog.V(2).Infof("stop requested")
		return false
	}

	klog.V(4).Infof("caches populated")
	return true
}

I think we need to add more comments for the cache.WaitForCacheSync.

@caesarxuchao

Member

caesarxuchao commented Jan 3, 2019

Good analysis. The TestGarbageCollectorSync failure demonstrated the backwards incompatibility of this PR. The 10ms timeout in the original test wasn't reasonable.

@@ -856,7 +856,7 @@ func TestGarbageCollectorSync(t *testing.T) {
stopCh := make(chan struct{})
defer close(stopCh)
go gc.Run(1, stopCh)
go gc.Sync(fakeDiscoveryClient, 10*time.Millisecond, stopCh)
go gc.Sync(fakeDiscoveryClient, 100*time.Millisecond, stopCh)


@caesarxuchao

caesarxuchao Jan 3, 2019

Member

Isn't this 100 ms timeout also going to be flaky, given that the first check also happens after 100ms?


@liggitt

liggitt Jan 3, 2019

Member

Yes. We should not do anything in tests with such tight time tolerances. The 10ms sync loop period was in contrast to the 1s wait to see if progress was made (i.e. a 100x difference between the loop period and the test wait).

I'm trying to understand what change is being made in this PR that broke this test... is that going to cause problems in real life as well?


@caesarxuchao

caesarxuchao Jan 3, 2019

Member

The old WaitFor would run the condition check function at least once more after the done channel was closed. The new behavior will most likely return a timeout error, though due to the nature of select, there is a slim chance it runs the condition function once more.

I doubt any user would intentionally rely on the old behavior. Also, the old behavior wasn't explicitly documented. Hence, I think the chance to cause real problems is slim.


@liggitt

liggitt Jan 4, 2019

Member

Ok, thanks for clarifying


@kdada

kdada Jan 4, 2019

Author Contributor

Changed to 1 second.

@kdada kdada force-pushed the kdada:master branch from 8dd6d4f to a11801c Jan 4, 2019

@@ -856,7 +856,7 @@ func TestGarbageCollectorSync(t *testing.T) {
stopCh := make(chan struct{})
defer close(stopCh)
go gc.Run(1, stopCh)
go gc.Sync(fakeDiscoveryClient, 10*time.Millisecond, stopCh)


@liggitt

liggitt Jan 4, 2019

Member

Changing this to 1 second means the time.Sleep of 1 second is now too short to reliably catch sync problems. The test should be able to wait significantly longer than the resync period to ensure progress is made when expected. Having to lengthen the resync period this much means having to extend the test wait time significantly, which makes the test take much longer to run than desired.


@caesarxuchao

caesarxuchao Jan 4, 2019

Member

IIUC, the original 100x difference was to ensure the gc sync was run ~100 times to catch any flakiness.

How about setting it to 200ms here and letting the test run for 10s (by setting the sleep time to 10s), so the sync gets tested 50 times?

Also note that the old test wasn't as reliable as it was intended to be. Although the sync period was set to 10ms, because the old WaitFor function didn't handle the closed done channel, WaitForCacheSync returned 100ms later, after the first poll period. So the old test only ran the sync behavior 10 times in the 1s test time.


@liggitt

liggitt Jan 4, 2019

Member

the original 100x difference was to ensure the gc sync was run ~100 times to catch any flakiness

it was actually to make sure the test waited waaaaay longer than the resync period to ensure there was time to complete at least 2 iterations


@kdada

kdada Jan 7, 2019

Author Contributor

The pseudo-code of GarbageCollector.Sync():

// In this case, `stopCh` won't be closed. Ignore it.
GarbageCollector.Sync():
    wait.Until() loops with period:
        wait.PollImmediateUntil() loops with 100ms (hardcode):
            controller.WaitForCacheSync() loops with a channel which will be closed after the `period`
            This loop never returns unless the sync is finished.

A period of 1 second makes controller.WaitForCacheSync() try ~10 times to check whether the cache is synced.

A period of 200ms makes ~2 executions of controller.WaitForCacheSync() in every wait.PollImmediateUntil() loop. In the end, controller.WaitForCacheSync() also executes about 10 times. (Due to the behavior of time.Ticker, it may drop some ticks for slow receivers, so 10 is an upper bound.)

200ms is better because it tests both wait.PollImmediateUntil() and controller.WaitForCacheSync().

@liggitt

liggitt Jan 8, 2019

Member

that's probably worth documenting in a comment in the test


@caesarxuchao

caesarxuchao Jan 8, 2019

Member

LGTM.

@kdada could you add the comment as liggitt suggested? The pseudo-code is great; it makes the reasoning much clearer. I added a little more detail to it.

GarbageCollector.Sync():
    wait.Until() loops with `period` until the `stopCh` is closed:
        wait.PollImmediateUntil() loops with 100ms (hardcoded) until the `stopCh` is closed:
            controller.WaitForCacheSync() loops with `syncedPollPeriod` (hardcoded to 100ms), until either its stop channel is closed after `period`, or all caches synced.

The final goal is to make sure that the outermost wait.Until loop runs at least 2 times during the 1s sleep, which ensures that the changes made to the fakeDiscoveryClient are picked up by the Sync.


@foxyriver

foxyriver Jan 11, 2019

Contributor

lgtm

@kdada kdada force-pushed the kdada:master branch from a11801c to d0a5766 Jan 7, 2019

@kdada

Contributor Author

kdada commented Jan 7, 2019

/retest

@kdada kdada force-pushed the kdada:master branch from d0a5766 to 1a0ce8d Jan 9, 2019

@kdada

Contributor Author

kdada commented Jan 9, 2019

/retest

@caesarxuchao
Member

caesarxuchao left a comment

@kdada I still have a few nits regarding the comment. I'll take another look tonight to make the iteration faster :)

// controller.WaitForCacheSync() loops with `syncedPollPeriod` (hardcoded to 100ms),
// until either its stop channel is closed after `period`, or all caches synced.
//
// 200ms of period makes ~2 executions of controller.WaitForCacheSync() in every wait.PollImmediateUntil() loop.


@caesarxuchao

caesarxuchao Jan 9, 2019

Member

This line isn't accurate. Every loop in wait.PollImmediateUntil() always executes controller.WaitForCacheSync() once.

Can you update it to "Setting the period to 200ms allows the WaitForCacheSync() to check for cache sync ~2 times in every wait.PollImmediateUntil() loop."

@@ -856,7 +856,17 @@ func TestGarbageCollectorSync(t *testing.T) {
stopCh := make(chan struct{})
defer close(stopCh)
go gc.Run(1, stopCh)
go gc.Sync(fakeDiscoveryClient, 10*time.Millisecond, stopCh)
// The pseudo-code of GarbageCollector.Sync():
// GarbageCollector.Sync():


@caesarxuchao

caesarxuchao Jan 9, 2019

Member

nit: add the input parameters, e.g., GarbageCollector.Sync(client, period, stopCh). Otherwise readers won't know what "period" we are referring to.

// until either its stop channel is closed after `period`, or all caches synced.
//
// 200ms of period makes ~2 executions of controller.WaitForCacheSync() in every wait.PollImmediateUntil() loop.
// Finally controller.WaitForCacheSync() executes about 10 times (Due to the behavior of time.Ticker,


@caesarxuchao

caesarxuchao Jan 9, 2019

Member

Is it really important to run WaitForCacheSync for 10 times? I think the important thing is letting gc.Sync() loop at least twice to ensure the changes made to the fakeDiscoveryClient are picked up.


@kdada

kdada Jan 10, 2019

Author Contributor

We can't escape from wait.PollImmediateUntil() unless controller.WaitForCacheSync() returns true.
So there are only two paths:

  1. gc.Sync() -> wait.PollImmediateUntil() -> controller.WaitForCacheSync() returns false -> wait.PollImmediateUntil() loops.
  2. gc.Sync() -> wait.PollImmediateUntil() -> controller.WaitForCacheSync() returns true -> gc.Sync() loops.


@caesarxuchao

caesarxuchao Jan 10, 2019

Member

Right. My previous comment on "letting gc.Sync() loop at least twice" was wrong. But still, the important thing isn't running WaitForCacheSync() ten times; it is running GetDeletableResources and then resyncing at least twice, so that changes to the fakeDiscoveryClient are picked up.

Also, I would add one line to the pseudo-code:

GarbageCollector.Sync():
    wait.Until() loops every `period` until the `stopCh` is closed:
        wait.PollImmediateUntil() loops every 100ms (hardcoded) until the `stopCh` is closed:
            GetDeletableResources and gc.resyncMonitors()
            controller.WaitForCacheSync() loops every `syncedPollPeriod` (hardcoded to 100ms), until either its stop channel is closed after `period`, or all caches synced.


@kdada

kdada Jan 10, 2019

Author Contributor

Done

//
// 200ms of period makes ~2 executions of controller.WaitForCacheSync() in every wait.PollImmediateUntil() loop.
// Finally controller.WaitForCacheSync() executes about 10 times (Due to the behavior of time.Ticker,
// it may drop some ticks for slow receivers. So the 10 is the upper bound).


@caesarxuchao

caesarxuchao Jan 9, 2019

Member

nit: I would drop this line. It's not important for the reader.

@kdada kdada force-pushed the kdada:master branch from 1a0ce8d to eb2e6a2 Jan 10, 2019

@kdada kdada force-pushed the kdada:master branch from eb2e6a2 to 44b2e84 Jan 10, 2019

@caesarxuchao

Member

caesarxuchao commented Jan 10, 2019

/lgtm

Thanks, @kdada

@kdada

Contributor Author

kdada commented Jan 11, 2019

@caesarxuchao need an /approve :)

@caesarxuchao

Member

caesarxuchao commented Jan 11, 2019

/approve

@k8s-ci-robot

Contributor

k8s-ci-robot commented Jan 11, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: caesarxuchao, kdada

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hh pushed a commit to ii/kubernetes that referenced this pull request Jan 11, 2019

Merge pull request kubernetes#72364 from kdada/master
WaitFor returns immediately when done is closed
@caesarxuchao

Member

caesarxuchao commented Jan 11, 2019

@kdada it seems you have to rebase.

@k8s-ci-robot k8s-ci-robot merged commit 44b2e84 into kubernetes:master Jan 11, 2019

17 of 18 checks passed

tide Not mergeable.
cla/linuxfoundation kdada authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-kops-aws Job succeeded.
pull-kubernetes-e2e-kubeadm-gce Skipped
pull-kubernetes-godeps Job succeeded.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped
pull-kubernetes-local-e2e-containerized Skipped
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.

k8s-ci-robot added a commit that referenced this pull request Jan 25, 2019

Merge pull request #72786 from caesarxuchao/72364-minor-followup
followup of #72364, slightly improve the comment