cacher: do not simply popExpiredWatchers when the cacher hasn't dispatched any event #117014
Conversation
Please note that we're already in Test Freeze. Fast forwards are scheduled to happen every 6 hours; the most recent run was: Thu Mar 30 10:31:11 UTC 2023.

/assign @wojtek-t

/triage accepted

/test
@p0lyn0mial: The following commands are available to trigger optional jobs:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
```diff
@@ -912,8 +912,6 @@ func (c *Cacher) dispatchEvents() {
 	// Never send a bookmark event if we did not see an event here, this is fine
 	// because we don't provide any guarantees on sending bookmarks.
 	if lastProcessedResourceVersion == 0 {
-		// pop expired watchers in case there has been no update
-		c.bookmarkWatchers.popExpiredWatchers()
```
I see this as potentially problematic.

Imagine a use case where no object is actually changing for a long time, but we potentially have a lot of watchers registered. [FWIW - secrets are a good example.]

After this change, until we get the first event, we will be accumulating all watchers in `bookmarkWatchers`. And those currently are not deleted from `bookmarkWatchers` on watches being closed - they are just cleaned up here and on dispatching regular events.

So in this case, we will be accumulating stuff in `bookmarkWatchers`, in a pathological case leading to OOM.

We can't just delete this line - we need a mechanism to prevent that.
Oh, I didn't realise that watchers won't be removed from `bookmarkWatchers` when they are closed.

I think I would change the code to just "pop" closed watchers, would you agree?

I think that the memory footprint will not be huge. Since we are storing pointers to the `cacheWatcher` struct, we should be able to store ~8K entries in just 1MB (assuming the size of the struct is ~100 bytes). Given the above calculation I think we should be fine.

I could provide a new method `c.bookmarkWatchers.popClosedWatchers()` and just call it in line 916.
> I think I would change the code to just "pop" closed watchers, would you agree?

I think we need to ensure that we won't be iterating over too many (still active) watchers every second. So I think it's not as simple as that.

I think that the "time=0" case is somewhat special here:
- For time>0, probably calling `popExpiredWatchers` as it was before is still the right thing to do?
- If we do that, we're probably left only with those with time=0. For these, closing stopped watchers seems good enough, but we need to somehow ensure that we won't be iterating over all of the watchers on every loop...
Actually, I've just realised that the cache receives periodic updates from etcd every X seconds. These periodic updates are reflected in `lastProcessedResourceVersion`.

Since the graduation of `EfficientWatchResumption`, every Kube installation supports progress notifications from etcd. In addition to that, I think that the default notification period is 5 seconds.

So, that means we will be accumulating disconnected watchers for 5 seconds in the worst case. Expired watchers are not re-added again to `watchersBuffer`.

I think we should be fine with what we have atm, wdyt?
Hmm - under the assumption that etcd is configured to really deliver them.

I can imagine a situation of some operator configuring those progress notify events to be delivered every 1h or sth like that, in which case it may become problematic...

I think that even if we assume that those with time=0 aren't problematic, I would feel more comfortable if we keep calling `popExpiredWatchers` for time>0...
OK, updated the PR as we discussed offline.
Force-pushed from ab0877c to 43b0fb7.
Force-pushed from 43b0fb7 to 813f8d8.
```go
if watcher.stopped {
	continue
}
c.bookmarkWatchers.addWatcher(watcher)
```
How do you feel about introducing a new function that would allow for adding an array of watchers?
the benefit would be acquiring the lock just once.
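A minimal sketch of that suggestion, assuming a mutex-protected queue type (`watcherQueue` and `addWatchers` are illustrative names here, not the actual kubernetes/kubernetes code): a batch variant acquires the lock once for the whole slice instead of once per watcher.

```go
package main

import (
	"fmt"
	"sync"
)

// watcher is a stand-in for cacheWatcher (illustrative only).
type watcher struct{ id int }

// watcherQueue sketches a mutex-protected watcher list.
type watcherQueue struct {
	sync.Mutex
	watchers []*watcher
}

// addWatcher appends a single watcher, taking the lock once per call.
func (q *watcherQueue) addWatcher(w *watcher) {
	q.Lock()
	defer q.Unlock()
	q.watchers = append(q.watchers, w)
}

// addWatchers appends a whole batch under a single lock acquisition,
// avoiding one Lock/Unlock pair per watcher.
func (q *watcherQueue) addWatchers(ws []*watcher) {
	q.Lock()
	defer q.Unlock()
	q.watchers = append(q.watchers, ws...)
}

func main() {
	q := &watcherQueue{}
	q.addWatchers([]*watcher{{id: 1}, {id: 2}, {id: 3}})
	fmt.Println("queued:", len(q.watchers))
}
```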
Fine with me
```go
	t.Fatalf("unexpected error waiting for the cache to be ready")
}

opts := storage.ListOptions{Predicate: storage.Everything}
```
nit:

```go
pred := storage.Everything
pred.AllowWatchBookmarks = true
opts := storage.ListOptions{
	Predicate: pred,
	AllowWatchBookmarks: ...
}
```
```go
if watcher.stopped {
	continue
}
c.bookmarkWatchers.addWatcher(watcher)
```
Fine with me
OK - just some minor comments.
```go
require.NoError(t, err, "failed to create watch: %v")
defer w.Stop()

// the default bookmarkTimer is ~1 sec, waiting 2 seconds is enough
```
```go
// Ensure that popExpiredWatchers is called to ensure that our watch isn't removed from bookmarkWatchers.
// We do that every ~1s, so waiting 2 seconds seems enough.
```
Force-pushed from 813f8d8 to 652eaa1.
Force-pushed from 652eaa1 to 98e0825.
/lgtm
Thanks!
The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass. This bot retests PRs for certain kubernetes repos according to the following rules:
You can:
/retest
/hold
A unit test failure looks real.
```diff
-	// pop expired watchers in case there has been no update
-	c.bookmarkWatchers.popExpiredWatchers()
+	var expiredWatchers []*cacheWatcher
+	for _, watchers := range c.bookmarkWatchers.popExpiredWatchers() {
```
This should be called under `c.Lock`.

/lgtm cancel
```diff
-	// so bucketID can hold a negative value
-	nextTime, ok := w.nextBookmarkTime(t.clock.Now(), t.bookmarkFrequency)
-	if !ok {
+	wantsBookmark, bucketID := t.calculateNextBookmarkBucketID(w)
```
Also - can you please move the refactoring you did here to a separate PR?
I really prefer having bug fixes sit in their own PRs.
…ny event

If the cacher hasn't seen any event (when lastProcessedResourceVersion is zero) and the bookmarkTimer has ticked, then we shouldn't popExpiredWatchers. This is because the watchers won't be re-added and will miss future bookmark events when the cacher finally receives an event via the c.incoming chan.
Force-pushed from 98e0825 to 6db4cbf.
I rebased this PR on top of #117410, PTAL.

/hold cancel

/lgtm
LGTM label has been added. Git tree hash: 2ad06be030cd8251f404a2946aed9539cf003ff2
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: p0lyn0mial, wojtek-t The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
What type of PR is this?
/kind bug
What this PR does / why we need it:
If the cacher hasn't seen any event (when `lastProcessedResourceVersion` is zero) and the `bookmarkTimer` has ticked, then we shouldn't just `popExpiredWatchers`. This is because the watchers won't be re-added and will miss future bookmark events when the cacher finally receives an event via the `c.incoming` chan. This PR fixes the issue by re-adding expired watchers back to the queue. However, due to the potential size of the queue, we should seek a more performant solution in the future (added a TODO).

xref: kubernetes/enhancements#3157
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: