
WaitFor returns immediately when done is closed #72364

Merged
merged 2 commits into kubernetes:master on Jan 11, 2019

Conversation

Contributor

@kdada kdada commented Dec 27, 2018

What type of PR is this?

/kind bug
What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #72357

Special notes for your reviewer:

This PR addresses the review comments from #70277.

Previously, when the done channel was closed, WaitFor only closed stopCh and then kept waiting for the next signal from the wait channel. This PR fixes that: the WaitFor func now returns immediately when the done channel is closed.
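For context, the new control flow looks roughly like this (a sketch, not the exact merged code; WaitFunc, ConditionFunc, and ErrWaitTimeout are the existing names in k8s.io/apimachinery/pkg/util/wait):

func WaitFor(wait WaitFunc, fn ConditionFunc, done <-chan struct{}) error {
	stopCh := make(chan struct{})
	defer close(stopCh)
	c := wait(stopCh)
	for {
		select {
		case _, open := <-c:
			ok, err := fn()
			if err != nil {
				return err
			}
			if ok {
				return nil
			}
			if !open {
				// The WaitFunc closed its channel without the condition ever passing.
				return ErrWaitTimeout
			}
		case <-done:
			// The done channel was closed: return right away instead of waiting
			// for the next value from the WaitFunc channel.
			return ErrWaitTimeout
		}
	}
}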

Does this PR introduce a user-facing change?:

[Breaking change, client-go]: The WaitFor function returns, probably an ErrWaitTimeout, when the done channel is closed, even if the `WaitFunc` doesn't handle the done channel.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 27, 2018
@k8s-ci-robot
Contributor

Hi @kdada. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 27, 2018
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 27, 2018
@kdada
Contributor Author

kdada commented Dec 27, 2018

@caesarxuchao @cheftako PTAL

@liggitt
Member

liggitt commented Dec 27, 2018

/cc @caesarxuchao @cheftako
/uncc

since this is a follow-up of the PR they reviewed

@k8s-ci-robot k8s-ci-robot requested review from caesarxuchao and cheftako and removed request for liggitt December 27, 2018 16:35
}
case <-done:
closeCh()
break FOR
Member

I think it's easier to read if we just return here. Also, can we add a test for this case?

Contributor Author

@kdada kdada Dec 28, 2018

TestWaitForWithClosedChannel and TestWaitForWithEarlyClosingWaitFunc test the two branches that return ErrWaitTimeout.

I also documented the case where both done and c are closed:

// Be careful, when the 'wait' func returns a closed channel and the 'done' is also closed,
// the behavior of this function depends on the golang 'select'.
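As a standalone illustration of that caveat (toy code, not part of this PR): when both channels are ready, Go's select picks one of the ready cases via uniform pseudo-random selection, so either branch may run.

package main

import "fmt"

func main() {
	a := make(chan struct{})
	b := make(chan struct{})
	close(a)
	close(b)

	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		// Both cases can proceed, so select chooses between them pseudo-randomly.
		select {
		case <-a:
			counts["a"]++
		case <-b:
			counts["b"]++
		}
	}
	fmt.Println(counts) // roughly a 50/50 split, e.g. map[a:497 b:503]
}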

Member

For the test, I think cheftako wants to make sure that the close of the done channel can stop the WaitFor, regardless of the WaitFunc. For example, if the WaitFunc doesn't send anything to its returned channel, nor closes the channel, WaitFor should still be stopped by the close of the done channel. [edited] In TestWaitForWithClosedChannel, instead of passing a poller, can you just pass a WaitFunc that returns an open channel and doesn't handle the done channel?
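A minimal sketch of what I have in mind (written inside the wait package; the exact names and assertions are illustrative, not necessarily what should land):

func TestWaitForWithClosedChannel(t *testing.T) {
	stopCh := make(chan struct{})
	close(stopCh)

	// The WaitFunc ignores 'done' and returns a channel that is never written
	// to or closed, so only the already-closed stopCh can end the wait.
	c := make(chan struct{})
	defer close(c)

	start := time.Now()
	err := WaitFor(func(done <-chan struct{}) <-chan struct{} {
		return c
	}, func() (bool, error) {
		return false, nil
	}, stopCh)
	duration := time.Since(start)

	if err != ErrWaitTimeout {
		t.Errorf("expected ErrWaitTimeout, got %v", err)
	}
	// WaitFor should return almost immediately because stopCh is already closed.
	if duration >= ForeverTestTimeout/2 {
		t.Errorf("expected WaitFor to return quickly, took %v", duration)
	}
}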

Contributor Author

Done.

Member

I would like to see an additional test, something like:

stopCh := make(chan struct{})
go func() {
	time.Sleep(time.Second)
	close(stopCh)
}()
start := time.Now()
err := WaitFor(poller(ForeverTestTimeout, ForeverTestTimeout), func() (bool, error) {
	return false, nil
}, stopCh)

This clearly tests only the stopCh channel being closed and does not rely on it starting closed.
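If that test is added, the assertions at the end might look roughly like this (illustrative only, continuing the snippet above; err and start come from the WaitFor call):

duration := time.Since(start)
if err != ErrWaitTimeout {
	t.Errorf("expected ErrWaitTimeout, got %v", err)
}
// The poller would otherwise run for ForeverTestTimeout; returning much sooner
// shows that closing stopCh after ~1s is what ended the wait.
if duration >= ForeverTestTimeout/2 {
	t.Errorf("expected WaitFor to return soon after stopCh was closed, took %v", duration)
}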

@logicalhan
Member

/assign @caesarxuchao

@caesarxuchao
Member

In the release note, can you add "[Breaking change, client-go]:" at the front? That helps client-go owners prepare the client-go release notes. Thanks.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 28, 2018
@kdada
Contributor Author

kdada commented Dec 28, 2018

@caesarxuchao @cheftako PTAL

duration := time.Now().Sub(start)

// The WaitFor should return immediately, so the duration is close to 0s.
if duration >= ForeverTestTimeout/2 {
Member

Why divide it by 2 instead of using ForeverTestTimeout directly? Is the purpose to make the test less prone to false negatives (i.e., less likely to miss a bug)?

Contributor Author

Yes.

@@ -351,22 +351,20 @@ type WaitFunc func(done <-chan struct{}) <-chan struct{}
// WaitFor continually checks 'fn' as driven by 'wait'.
//
// WaitFor gets a channel from 'wait()'', and then invokes 'fn' once for every value
// placed on the channel and once more when the channel is closed.
// placed on the channel and once more when the channel is closed. If the channel is closed
// and 'fn' returns false without error, WaitFor returns ErrWaitTimeout.
//
// If 'fn' returns an error the loop ends and that error is returned, and if
Member

nit: s/, and if/. If/

Just to make the format of the three cases consistent.

// returning true.
//
// Be careful, when the 'wait' func returns a closed channel and the 'done' is also closed,
// the behavior of this function depends on the golang 'select'.
Member

This is a good point! Though I think the following is clearer; what do you think?

When the done channel is closed, because the golang `select` statement is "uniform pseudo-random", the `fn` might still run one or multiple times, though eventually `WaitFor` will return.

Contributor Author

OK

@kdada
Contributor Author

kdada commented Dec 28, 2018

@caesarxuchao PTAL

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 28, 2018
@caesarxuchao
Member

/lgtm

The release note is not accurate; it's actually contrary to the last sentence in the comment of WaitFor. How about something like:

[Breaking change, client-go]: The WaitFor function returns, probably an ErrWaitTimeout, when the done channel is closed, even if the `WaitFunc` doesn't handle the done channel.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 28, 2018
@kdada
Contributor Author

kdada commented Dec 29, 2018

/retest

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Dec 31, 2018
@kdada
Contributor Author

kdada commented Dec 31, 2018

@caesarxuchao
TestGarbageCollectorSync fails because of this PR.

In the test, the sync period is 10ms (at k8s.io/kubernetes/pkg/controller/garbagecollector/garbagecollector_test.go#895):

	go gc.Sync(fakeDiscoveryClient, 10*time.Millisecond, stopCh)

In the Sync method (at k8s.io/kubernetes/pkg/controller/garbagecollector/garbagecollector.go#232):

if !controller.WaitForCacheSync("garbage collector", waitForStopOrTimeout(stopCh, period), gc.dependencyGraphBuilder.IsSynced) {
	utilruntime.HandleError(fmt.Errorf("timed out waiting for dependency graph builder sync during GC sync (attempt %d)", attempt))
	return false, nil
}

The waitForStopOrTimeout function closes stopChWithTimeout after 10ms:

func waitForStopOrTimeout(stopCh <-chan struct{}, timeout time.Duration) <-chan struct{} {
	stopChWithTimeout := make(chan struct{})
	go func() {
		select {
		case <-stopCh:
		case <-time.After(timeout):
		}
		close(stopChWithTimeout)
	}()
	return stopChWithTimeout
}

Then controller.WaitForCacheSync will always return false, because cache.WaitForCacheSync does its first check only after the first 100ms, by which point the 10ms stop channel has already been closed.

const (
	syncedPollPeriod = 100 * time.Millisecond
)

// WaitForCacheSync waits for caches to populate.  It returns true if it was successful, false
// if the controller should shutdown
func WaitForCacheSync(stopCh <-chan struct{}, cacheSyncs ...InformerSynced) bool {
	err := wait.PollUntil(syncedPollPeriod,
		func() (bool, error) {
			for _, syncFunc := range cacheSyncs {
				if !syncFunc() {
					return false, nil
				}
			}
			return true, nil
		},
		stopCh)
	if err != nil {
		klog.V(2).Infof("stop requested")
		return false
	}

	klog.V(4).Infof("caches populated")
	return true
}
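To make the timing interplay concrete, here is a rough standalone illustration (not the GC test itself; it just reproduces the 10ms stop channel vs. 100ms poll period with the public wait API):

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// A stop channel that closes after 10ms, like waitForStopOrTimeout(stopCh, 10*time.Millisecond).
	stopCh := make(chan struct{})
	go func() {
		time.Sleep(10 * time.Millisecond)
		close(stopCh)
	}()

	start := time.Now()
	err := wait.PollUntil(100*time.Millisecond, func() (bool, error) {
		return false, nil // the cache never reports synced in this illustration
	}, stopCh)

	// With the fixed WaitFor this prints ErrWaitTimeout after roughly 10ms,
	// before the first 100ms poll ever fires, so WaitForCacheSync reports false.
	fmt.Println(err, time.Since(start))
}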

I think we need to add more comments to cache.WaitForCacheSync.

@caesarxuchao
Member

Good analysis. The TestGarbageCollectorSync failure demonstrates the backwards incompatibility of this PR. The 10ms timeout in the original test wasn't reasonable.

@@ -856,7 +856,7 @@ func TestGarbageCollectorSync(t *testing.T) {
stopCh := make(chan struct{})
defer close(stopCh)
go gc.Run(1, stopCh)
go gc.Sync(fakeDiscoveryClient, 10*time.Millisecond, stopCh)
go gc.Sync(fakeDiscoveryClient, 100*time.Millisecond, stopCh)
Member

Isn't this 100 ms timeout also going to be flaky, given that the first check also happens after 100ms?

Member

Yes. We should not do anything in tests with such tight time tolerances. The 10ms sync loop period was in contrast to the 1s wait to see if progress was made (i.e., a 100x difference between the loop period and the test wait).

I'm trying to understand what change is being made in this PR that broke this test... is that going to cause problems in real life as well?

Member

The old WaitFor will run the condition check function at least once more after the done channel is closed. The new behavior will most likely return a timeout error, though due to the nature of select, it has a slim chance to run the condition function once more.

I doubt any user would intentionally rely on the old behavior. Also, the old behavior wasn't explicitly documented. Hence, I think the chance to cause real problems is slim.

Member

@liggitt liggitt Jan 4, 2019

Ok, thanks for clarifying

Contributor Author

Changed to 1 second.

@@ -856,7 +856,7 @@ func TestGarbageCollectorSync(t *testing.T) {
stopCh := make(chan struct{})
defer close(stopCh)
go gc.Run(1, stopCh)
go gc.Sync(fakeDiscoveryClient, 10*time.Millisecond, stopCh)
Member

Changing this to 1 second means the 1-second time.Sleep is now too short to reliably catch sync problems. The test should be able to wait significantly longer than the resync period to ensure progress is made when expected. Having to lengthen the resync period this much means having to extend the test wait time significantly, which makes the test take much longer to run than desired.

Member

IIUC, the original 100x difference was to ensure the gc sync was run ~100 times to catch any flakiness.

How about setting it to 200ms here and letting the test run for 10s (by setting the sleep time to 10s)? Then the sync gets tested ~50 times.

Also note that the old test wasn't as reliable as it intended to be. Although the sync period was set to 10ms, because the old WaitFor function didn't handle the closed done channel, the WaitForCacheSync returned 100ms later, after the first poll period. So the old test only ran the sync behavior ~10 times in the 1s test time.

Member

the original 100x difference was to ensure the gc sync was run ~100 times to catch any flakiness

it was actually to make sure the test waited waaaaay longer than the resync period to ensure there was time to complete at least 2 iterations

Contributor Author

@kdada kdada Jan 7, 2019

The pseudo-code of GarbageCollector.Sync():

// In this case, `stopCh` won't be closed. Ignore it.
GarbageCollector.Sync():
    wait.Until() loops with period:
        wait.PollImmediateUntil() loops with 100ms (hardcoded):
            controller.WaitForCacheSync() loops with a channel that will be closed after `period`.
            This loop never returns unless the sync is finished.

A 1-second period makes controller.WaitForCacheSync() try ~10 times to check whether the cache is synced.

A 200ms period makes ~2 executions of controller.WaitForCacheSync() in every wait.PollImmediateUntil() loop. In total, controller.WaitForCacheSync() also executes about 10 times. (Due to the behavior of time.Ticker, it may drop some ticks for slow receivers, so 10 is an upper bound.)

200ms is better because it tests both wait.PollImmediateUntil() and controller.WaitForCacheSync().

Member

ok

Member

that's probably worth documenting in a comment in the test

Member

LGTM.

@kdada could you add the comment as liggitt suggested? The pseudo-code is great; it makes the reasoning much clearer. I added a little more detail to it.

GarbageCollector.Sync():
    wait.Until() loops with `period` until the `stopCh` is closed:
        wait.PollImmediateUntil() loops with 100ms (hardcoded) until the `stopCh` is closed:
            controller.WaitForCacheSync() loops with `syncedPollPeriod` (hardcoded to 100ms), until either its stop channel is closed after `period`, or all caches synced.

The final goal is to make sure that the outermost wait.Until loop runs at least 2 times during the 1s sleep, which ensures that the changes made to the fakeDiscoveryClient are picked up by the Sync.

Contributor Author

@kdada kdada Jan 9, 2019

Contributor

lgtm

@kdada
Contributor Author

kdada commented Jan 7, 2019

/retest

@kdada
Contributor Author

kdada commented Jan 9, 2019

/retest

Member

@caesarxuchao caesarxuchao left a comment

@kdada I still have a few nits regarding the comment. I'll take another look tonight to make the iteration faster :)

// controller.WaitForCacheSync() loops with `syncedPollPeriod` (hardcoded to 100ms),
// until either its stop channel is closed after `period`, or all caches synced.
//
// 200ms of period makes ~2 executions of controller.WaitForCacheSync() in every wait.PollImmediateUntil() loop.
Member

This line isn't accurate. Every loop in wait.PollImmediateUntil() always executes controller.WaitForCacheSync() once.

Can you update it to "Setting the period to 200ms allows the WaitForCacheSync() to check for cache sync ~2 times in every wait.PollImmediateUntil() loop."

@@ -856,7 +856,17 @@ func TestGarbageCollectorSync(t *testing.T) {
stopCh := make(chan struct{})
defer close(stopCh)
go gc.Run(1, stopCh)
go gc.Sync(fakeDiscoveryClient, 10*time.Millisecond, stopCh)
// The pseudo-code of GarbageCollector.Sync():
// GarbageCollector.Sync():
Member

nit: add the input parameters, e.g., GarbageCollector.Sync(client, period, stopCh). Otherwise readers won't know what "period" we are referring to.

// until either its stop channel is closed after `period`, or all caches synced.
//
// 200ms of period makes ~2 executions of controller.WaitForCacheSync() in every wait.PollImmediateUntil() loop.
// Finally controller.WaitForCacheSync() executes about 10 times (Due to the behavior of time.Ticker,
Member

Is it really important to run WaitForCacheSync 10 times? I think the important thing is letting gc.Sync() loop at least twice to ensure the changes made to the fakeDiscoveryClient are picked up.

Contributor Author

@kdada kdada Jan 10, 2019

We can't escape from wait.PollImmediateUntil() unless controller.WaitForCacheSync() returns true.
So there are only two paths:

  1. gc.Sync() -> wait.PollImmediateUntil() -> controller.WaitForCacheSync() returns false -> wait.PollImmediateUntil() loops.
  2. gc.Sync() -> wait.PollImmediateUntil() -> controller.WaitForCacheSync() returns true -> gc.Sync() loops.

Member

Right. My previous comment on "letting gc.Sync() loop at least twice" was wrong. But still, the important thing isn't running WaitForCacheSync() ten times; it is running GetDeletableResources and then resync at least twice, so that changes to the fakeDiscoveryClient are picked up.

Also, I would add one line to the pseudo-code:

GarbageCollector.Sync():
    wait.Until() loops every `period` until the `stopCh` is closed:
        wait.PollImmediateUntil() loops every 100ms (hardcoded) until the `stopCh` is closed:
            GetDeletableResources and gc.resyncMonitors()
            controller.WaitForCacheSync() loops every `syncedPollPeriod` (hardcoded to 100ms), until either its stop channel is closed after `period`, or all caches synced.

Contributor Author

Done

//
// 200ms of period makes ~2 executions of controller.WaitForCacheSync() in every wait.PollImmediateUntil() loop.
// Finally controller.WaitForCacheSync() executes about 10 times (Due to the behavior of time.Ticker,
// it may drop some ticks for slow receivers. So the 10 is the upper bound).
Member

nit: I would drop this line. It's not important for the reader.

@caesarxuchao
Member

/lgtm

Thanks, @kdada

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 10, 2019
@kdada
Contributor Author

kdada commented Jan 11, 2019

@caesarxuchao need an /approve :)

@caesarxuchao
Member

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: caesarxuchao, kdada

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 11, 2019
hh pushed a commit to ii/kubernetes that referenced this pull request Jan 11, 2019
WaitFor returns immediately when done is closed
@caesarxuchao
Member

@kdada it seems you have to rebase.

@k8s-ci-robot k8s-ci-robot merged commit 44b2e84 into kubernetes:master Jan 11, 2019
k8s-ci-robot added a commit that referenced this pull request Jan 25, 2019
followup of #72364, slightly improve the comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/bug Categorizes issue or PR as related to a bug.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
needs-priority Indicates a PR lacks a `priority/foo` label and requires one.
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
release-note Denotes a PR that will be considered when it comes time to generate release notes.
sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.
sig/apps Categorizes an issue or PR as relevant to SIG Apps.
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Follow up #70277, the close of the "done" channel should terminate WaitFor
7 participants