Avoid going back in time in watchcache watchers #73845
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: wojtek-t. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 8fb140f to 56b8ee4.
// With some events already sent, update resourceVersion so that
// events that were buffered and not yet processed won't be delivered
// to this watcher second time causing going back in time.
if len(initEvents) > 0 {
slight preference for adjusting watchRV
where we calculate initEvents, and passing a watchRV
that doesn't require modification into newCacheWatcher
Makes sense - done.
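A minimal sketch, under simplified types, of the approach suggested above: pick the watcher's starting resource version at the point where initEvents are computed, so that newCacheWatcher already receives a watchRV that needs no later adjustment. The names startingRV and watchCacheEvent are illustrative stand-ins, not the actual cacher code.

package main

import "fmt"

// watchCacheEvent is a simplified stand-in for the cacher's event type;
// only the resource version matters for this sketch.
type watchCacheEvent struct {
	ResourceVersion uint64
}

// startingRV picks the resource version a new watcher should start from.
// If init events were computed, the watcher starts after the last of them,
// so events already covered by initEvents are never replayed to it.
func startingRV(requestedRV uint64, initEvents []watchCacheEvent) uint64 {
	if len(initEvents) > 0 {
		return initEvents[len(initEvents)-1].ResourceVersion
	}
	return requestedRV
}

func main() {
	initEvents := []watchCacheEvent{{ResourceVersion: 5}, {ResourceVersion: 7}}
	// The client asked to watch from RV=3, but initEvents already cover
	// everything up to RV=7, so the watcher must start from 7.
	fmt.Println(startingRV(3, initEvents)) // 7
}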
}
currentRV = rv
}
case <-time.After(500 * time.Millisecond):
this seems likely to cause flakes or unintentional test truncation under load... suggest 1 second or more?
The good part is that if we hit the timeout, we just won't catch the problem.
So it's the other way around here - in the typical case, the timeout doesn't trigger a failure.
But yeah - I'm fine with changing to 1s.
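A hedged sketch of the test pattern being discussed, not the actual test code; the event type and channel are simplified stand-ins. The select drains events until either enough have arrived or the timeout fires. Hitting the timeout only stops collection early rather than failing the test, so raising it from 500ms to 1s makes the check more thorough under load without adding flakiness.

package main

import (
	"fmt"
	"time"
)

type testEvent struct{ rv uint64 }

// collect drains up to max events, giving up quietly after the timeout.
func collect(ch <-chan testEvent, max int, timeout time.Duration) []testEvent {
	var got []testEvent
	for len(got) < max {
		select {
		case ev, ok := <-ch:
			if !ok {
				return got
			}
			got = append(got, ev)
		case <-time.After(timeout):
			// Timeout: stop collecting; the "no going back in time"
			// assertion then runs only over what was received.
			return got
		}
	}
	return got
}

func main() {
	ch := make(chan testEvent, 3)
	ch <- testEvent{rv: 8}
	ch <- testEvent{rv: 9}
	fmt.Println(len(collect(ch, 5, time.Second))) // 2: timeout ends collection
}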
a couple nits, lgtm overall
Force-pushed from 56b8ee4 to 1b436f1.
Done - PTAL
/lgtm
Great find @wojtek-t! /cc @cheftako @wenjiaswe who had looked at the "dropped watch event" issue in detail in the past.
@wojtek-t will you pick this to 1.11, 1.12, 1.13?
@liggitt - yes, I was waiting for approval on this (and on having it merged, which happened after Friday evening for me). Creating the cherry-pick in a few minutes.
…45-upstream-release-1.13 Automated cherry pick of #73845 upstream release 1.13
Great find.
Before this change, it was possible that after starting a watcher and processing "initEvents", some events that had been buffered in the cacher before that happened were delivered to that watcher a second time, causing the watcher to go back in time.
I suspect that this may be the reason for different issues we have seen in the past (e.g. the scheduler not scheduling pods even though they fit on a node, or vice versa) - we were suspecting missing events, but in fact these might have been "repeated" events that caused that.
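A hypothetical, much-simplified illustration of the failure mode described above (the real cacher is considerably more involved): a watcher whose initEvents already brought it to RV=7 would go back in time if buffered events with RV 6 and 7 were then replayed to it, so events at or below its current resource version have to be filtered out.

package main

import "fmt"

type event struct{ rv uint64 }

func main() {
	currentRV := uint64(7) // last RV already sent via initEvents
	buffered := []event{{6}, {7}, {8}}

	for _, e := range buffered {
		if e.rv <= currentRV {
			// Already covered by initEvents; re-sending it would move
			// the watcher backwards in time.
			continue
		}
		fmt.Println("deliver", e.rv) // only RV=8 is delivered
		currentRV = e.rv
	}
}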
@kubernetes/sig-api-machinery-bugs @kubernetes/sig-scheduling-bugs
@liggitt @jpbetz @cheftako @bsalamat