WaitFor returns immediately when done is closed #72364
Conversation
Hi @kdada. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@caesarxuchao @cheftako PTAL
/cc @caesarxuchao @cheftako since this is a follow-up of the PR they reviewed
```go
	}
case <-done:
	closeCh()
	break FOR
```
I think it's easier to read if we just return here. Also, can we add a test for this case?
TestWaitForWithClosedChannel and TestWaitForWithEarlyClosingWaitFunc test the two branches which return ErrWaitTimeout.
Also, I documented the case when `done` and `c` are closed:

```go
// Be careful, when the 'wait' func returns a closed channel and the 'done' is also closed,
// the behavior of this function depends on the golang 'select'.
```
For the test, I think cheftako wants to make sure that the close of the `done` channel can stop the `WaitFor`, regardless of the `WaitFunc`. For example, if the `WaitFunc` doesn't send anything to its returned channel, nor close the channel, `WaitFor` should still be stopped by the close of the `done` channel. [edited] In TestWaitForWithClosedChannel, instead of passing a poller, can you just pass a `WaitFunc` that returns an open channel, and doesn't handle the `done` channel?
Done.
I would like to see an additional test, something like:

```go
stopCh := make(chan struct{})
go func() {
	time.Sleep(time.Second)
	close(stopCh)
}()
start := time.Now()
err := WaitFor(poller(ForeverTestTimeout, ForeverTestTimeout), func() (bool, error) {
	return false, nil
}, stopCh)
```

This clearly tests only the stopCh channel being closed and does not rely on it starting closed.
/assign @caesarxuchao
In the release note can you add "[Breaking change, client-go]:" in the front? That helps client-go owners to prepare client-go release notes. Thanks.
@caesarxuchao @cheftako PTAL
```go
duration := time.Now().Sub(start)

// The WaitFor should return immediately, so the duration is close to 0s.
if duration >= ForeverTestTimeout/2 {
```
Why divide it by 2 instead of using ForeverTestTimeout directly? Is the purpose to make the test less prone to false negatives (i.e., less likely to miss a bug)?
Yes.
```diff
@@ -351,22 +351,20 @@ type WaitFunc func(done <-chan struct{}) <-chan struct{}
 // WaitFor continually checks 'fn' as driven by 'wait'.
 //
 // WaitFor gets a channel from 'wait()'', and then invokes 'fn' once for every value
-// placed on the channel and once more when the channel is closed.
+// placed on the channel and once more when the channel is closed. If the channel is closed
+// and 'fn' returns false without error, WaitFor returns ErrWaitTimeout.
 //
 // If 'fn' returns an error the loop ends and that error is returned, and if
```
nit: s/, and if/. If/
Just to make the format of the three cases consistent.
```go
// returning true.
//
// Be careful, when the 'wait' func returns a closed channel and the 'done' is also closed,
// the behavior of this function depends on the golang 'select'.
```
This is a good point! Though I think the following is more clear, what do you think?
When the done channel is closed, because the golang `select` statement is "uniform pseudo-random", the `fn` might still run one or multiple times, though eventually `WaitFor` will return.
OK
@caesarxuchao PTAL
/lgtm The release note is not accurate, actually contrary to the last sentence in the comment of

/retest
@caesarxuchao In the test, the sync period is 10ms (at k8s.io/kubernetes/pkg/controller/garbagecollector/garbagecollector_test.go#895):

```go
go gc.Sync(fakeDiscoveryClient, 10*time.Millisecond, stopCh)
```

In the sync loop:

```go
if !controller.WaitForCacheSync("garbage collector", waitForStopOrTimeout(stopCh, period), gc.dependencyGraphBuilder.IsSynced) {
	utilruntime.HandleError(fmt.Errorf("timed out waiting for dependency graph builder sync during GC sync (attempt %d)", attempt))
	return false, nil
}
```

The `waitForStopOrTimeout`:

```go
func waitForStopOrTimeout(stopCh <-chan struct{}, timeout time.Duration) <-chan struct{} {
	stopChWithTimeout := make(chan struct{})
	go func() {
		select {
		case <-stopCh:
		case <-time.After(timeout):
		}
		close(stopChWithTimeout)
	}()
	return stopChWithTimeout
}
```

Then the `WaitForCacheSync`:

```go
const (
	syncedPollPeriod = 100 * time.Millisecond
)

// WaitForCacheSync waits for caches to populate. It returns true if it was successful, false
// if the controller should shutdown
func WaitForCacheSync(stopCh <-chan struct{}, cacheSyncs ...InformerSynced) bool {
	err := wait.PollUntil(syncedPollPeriod,
		func() (bool, error) {
			for _, syncFunc := range cacheSyncs {
				if !syncFunc() {
					return false, nil
				}
			}
			return true, nil
		},
		stopCh)
	if err != nil {
		klog.V(2).Infof("stop requested")
		return false
	}
	klog.V(4).Infof("caches populated")
	return true
}
```

I think we need to add more comments for the test.
Good analysis. The TestGarbageCollectorSync failure demonstrated the backwards incompatibility of this PR. The 10ms timeout in the original test wasn't reasonable.
```diff
@@ -856,7 +856,7 @@ func TestGarbageCollectorSync(t *testing.T) {
 	stopCh := make(chan struct{})
 	defer close(stopCh)
 	go gc.Run(1, stopCh)
-	go gc.Sync(fakeDiscoveryClient, 10*time.Millisecond, stopCh)
+	go gc.Sync(fakeDiscoveryClient, 100*time.Millisecond, stopCh)
```
Isn't this 100 ms timeout also going to be flaky, given that the first check also happens after 100ms?
Yes. We should not do anything in tests with such tight time tolerances. The 10ms sync loop period was in contrast to the 1s wait to see if progress was made (e.g. a 100x difference between loop period and test wait).
I'm trying to understand what change is being made in this PR that broke this test... is that going to cause problems in real life as well?
The old `WaitFor` will run the condition check function at least once more after the `done` channel is closed. The new behavior will most likely return a timeout error, though due to the nature of `select`, it has a slim chance to run the condition function once more.
I doubt any user would intentionally rely on the old behavior. Also, the old behavior wasn't explicitly documented. Hence, I think the chance of causing real problems is slim.
Ok, thanks for clarifying
Changed to 1 second.
```diff
@@ -856,7 +856,7 @@ func TestGarbageCollectorSync(t *testing.T) {
 	stopCh := make(chan struct{})
 	defer close(stopCh)
 	go gc.Run(1, stopCh)
-	go gc.Sync(fakeDiscoveryClient, 10*time.Millisecond, stopCh)
```
Changing this to 1 second means the time.Sleep of 1 second is now too short to reliably catch sync problems. The test should wait significantly longer than the resync period to ensure progress is made when expected. Having to lengthen the resync period this much means having to extend the test wait time significantly, which makes the test take much longer to run than desired.
IIUC, the original 100x difference was to ensure the gc sync ran ~100 times to catch any flakiness.
How about setting it to 200ms here, and letting the test run 10s (by setting the sleep time to 10s)? Then the sync gets tested 50 times.
Also note that the old test wasn't as reliable as it intended to be. Although the sync period was set to 10ms, because the old `WaitFor` function didn't handle the closed `done` channel, the `WaitForCacheSync` returned 100ms later, after the first poll period. So the old test only ran the sync behavior 10 times in the 1s test time.
the original 100x difference was to ensure the gc sync was run ~100 times to catch any flakiness
it was actually to make sure the test waited waaaaay longer than the resync period to ensure there was time to complete at least 2 iterations
The pseudo-code of `GarbageCollector.Sync()`:

```
// In this case, `stopCh` won't be closed. Ignore it.
GarbageCollector.Sync():
  wait.Until() loops with period:
    wait.PollImmediateUntil() loops with 100ms (hardcode):
      controller.WaitForCacheSync() loops with a channel which will be closed after the `period`.
      This loop never returns unless the sync is finished.
```

1 second of period makes controller.WaitForCacheSync() try ~10 times to check if the cache is synced. 200ms of period makes ~2 executions of controller.WaitForCacheSync() in every wait.PollImmediateUntil() loop. Finally controller.WaitForCacheSync() also executes about 10 times. (Due to the behavior of time.Ticker, it may drop some ticks for slow receivers, so 10 is the upper bound.)

200ms is better because it tests both wait.PollImmediateUntil() and controller.WaitForCacheSync().
ok
that's probably worth documenting in a comment in the test
LGTM.
@kdada could you add the comment as liggitt suggested? The pseudo-code is great; it makes the reasoning much clearer. I added a little more detail to it:

```
GarbageCollector.Sync():
  wait.Until() loops with `period` until the `stopCh` is closed:
    wait.PollImmediateUntil() loops with 100ms (hardcode) until the `stopCh` is closed:
      controller.WaitForCacheSync() loops with `syncedPollPeriod` (hardcoded to 100ms),
      until either its stop channel is closed after `period`, or all caches synced.
```

The final goal is to make sure that the outermost wait.Until loop runs at least 2 times during the 1s sleep, which ensures that the changes made to the fakeDiscoveryClient are picked up by the Sync.
@caesarxuchao Done
lgtm
/retest
/retest
@kdada I still have a few nits regarding the comment. I'll take another look tonight to make the iteration faster :)
```go
// controller.WaitForCacheSync() loops with `syncedPollPeriod` (hardcoded to 100ms),
// until either its stop channel is closed after `period`, or all caches synced.
//
// 200ms of period makes ~2 executions of controller.WaitForCacheSync() in every wait.PollImmediateUntil() loop.
```
This line isn't accurate. Every loop in wait.PollImmediateUntil() always executes controller.WaitForCacheSync() once.
Can you update it to "Setting the period to 200ms allows the WaitForCacheSync() to check for cache sync ~2 times in every wait.PollImmediateUntil() loop."
```diff
@@ -856,7 +856,17 @@ func TestGarbageCollectorSync(t *testing.T) {
 	stopCh := make(chan struct{})
 	defer close(stopCh)
 	go gc.Run(1, stopCh)
-	go gc.Sync(fakeDiscoveryClient, 10*time.Millisecond, stopCh)
+	// The pseudo-code of GarbageCollector.Sync():
+	// GarbageCollector.Sync():
```
nit: add the input parameters, e.g., GarbageCollector.Sync(client, period, stopCh). Otherwise readers won't know what "period" we are referring to.
```go
// until either its stop channel is closed after `period`, or all caches synced.
//
// 200ms of period makes ~2 executions of controller.WaitForCacheSync() in every wait.PollImmediateUntil() loop.
// Finally controller.WaitForCacheSync() executes about 10 times (Due to the behavior of time.Ticker,
```
Is it really important to run WaitForCacheSync for 10 times? I think the important thing is letting gc.Sync() loop at least twice to ensure the changes made to the fakeDiscoveryClient are picked up.
We can't escape from wait.PollImmediateUntil() unless controller.WaitForCacheSync() returns true.
So there are only two paths:
- gc.Sync() -> wait.PollImmediateUntil() -> controller.WaitForCacheSync() returns false -> wait.PollImmediateUntil() loops.
- gc.Sync() -> wait.PollImmediateUntil() -> controller.WaitForCacheSync() returns true -> gc.Sync() loops.
Right. My previous comment on "letting gc.Sync() loop at least twice" was wrong. But still, the important thing isn't running WaitForCacheSync() ten times; it is running GetDeletableResources and then resync at least twice, so that changes to the fakeDiscoveryClient are picked up.
Also, I would add one line to the pseudo-code:

```
GarbageCollector.Sync():
  wait.Until() loops every `period` until the `stopCh` is closed:
    wait.PollImmediateUntil() loops every 100ms (hardcode) until the `stopCh` is closed:
      GetDeletableResources and gc.resyncMonitors()
      controller.WaitForCacheSync() loops every `syncedPollPeriod` (hardcoded to 100ms),
      until either its stop channel is closed after `period`, or all caches synced.
```
Done
```go
//
// 200ms of period makes ~2 executions of controller.WaitForCacheSync() in every wait.PollImmediateUntil() loop.
// Finally controller.WaitForCacheSync() executes about 10 times (Due to the behavior of time.Ticker,
// it may drop some ticks for slow receivers. So the 10 is the upper bound).
```
nit: I would drop this line. It's not important for the reader.
/lgtm Thanks, @kdada
@caesarxuchao need an `approve`
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: caesarxuchao, kdada The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
WaitFor returns immediately when done is closed
@kdada it seems you have to rebase.
Follow-up of #72364, slightly improving the comment
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #72357
Special notes for your reviewer:
This PR fixes the comments in #70277.
When the `done` channel is closed, `WaitFor` just closes `stopCh` and waits for the next signal. This PR fixes this issue. Now the `WaitFor` func returns immediately when the `done` channel is closed.

Does this PR introduce a user-facing change?: