Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid going back in time in watchcache watchers #73845

Merged

Conversation

wojtek-t
Copy link
Member

@wojtek-t wojtek-t commented Feb 8, 2019

Before this change, it was possible that after starting a watcher and processing "initEvents", some vents that were bufferred in the cacher before that happened were delivered to that watcher for the second time causing the watcher to go back in time.

I suspect that this may be the reason of different issues we have seen in the past (e.g. scheduler not scheduling pods even though they fit on a node or vice versa) - we were suspecting the missing events, but in fact these might have been "repeated" events that could have caused that.

Fix watch to not send the same set of events multiple times causing watcher to go back in time

@kubernetes/sig-api-machinery-bugs @kubernetes/sig-scheduling-bugs
@liggitt @jpbetz @cheftako @bsalamat

@wojtek-t wojtek-t added kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Feb 8, 2019
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 8, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver labels Feb 8, 2019
@wojtek-t wojtek-t force-pushed the fix_watcher_going_back_in_time branch from 8fb140f to 56b8ee4 Compare February 8, 2019 13:31
// With some events already sent, update resourceVersion so that
// events that were buffered and not yet processed won't be delivered
// to this watcher second time causing going back in time.
if len(initEvents) > 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slight preference for adjusting watchRV where we calculate initEvents, and passing a watchRV that doesn't require modification into newCacheWatcher

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense - done.

}
currentRV = rv
}
case <-time.After(500 * time.Millisecond):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems likely to cause flakes or unintentional test truncation under load... suggest 1 second or more?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The good part is that if we hit timeout, we just won't catch the problem.
So it's the other way around that in typical case - timeout doesn't trigger failure in this one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But yeah - I'm fine with changing to 1s.

@liggitt
Copy link
Member

liggitt commented Feb 8, 2019

a couple nits, lgtm overall

@wojtek-t wojtek-t force-pushed the fix_watcher_going_back_in_time branch from 56b8ee4 to 1b436f1 Compare February 8, 2019 14:56
Copy link
Member Author

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - PTAL

}
currentRV = rv
}
case <-time.After(500 * time.Millisecond):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The good part is that if we hit timeout, we just won't catch the problem.
So it's the other way around that in typical case - timeout doesn't trigger failure in this one.

// With some events already sent, update resourceVersion so that
// events that were buffered and not yet processed won't be delivered
// to this watcher second time causing going back in time.
if len(initEvents) > 0 {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense - done.

}
currentRV = rv
}
case <-time.After(500 * time.Millisecond):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But yeah - I'm fine with changing to 1s.

@liggitt
Copy link
Member

liggitt commented Feb 8, 2019

/lgtm
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2019
@k8s-ci-robot k8s-ci-robot merged commit fd633d1 into kubernetes:master Feb 8, 2019
@jpbetz
Copy link
Contributor

jpbetz commented Feb 8, 2019

Great find @wojtek-t!

/cc @cheftako @wenjiaswe who had looked at the "dropped watch event" issue in detail in the past.

@liggitt
Copy link
Member

liggitt commented Feb 8, 2019

@wojtek-t will you pick this to 1.11, 1.12, 1.13?

@wojtek-t
Copy link
Member Author

@liggitt - yes, I was waiting for approval on this (and having this merged, which happend after Friday evening for me). Creating cherrypick in few minutes.

k8s-ci-robot added a commit that referenced this pull request Feb 11, 2019
…45-upstream-release-1.13

Automated cherry pick of #73845 upstream release 1.13
k8s-ci-robot added a commit that referenced this pull request Feb 12, 2019
#73845-upstream-release-1.12

Automated cherry pick of #70735 #73845 upstream release 1.12
@lavalamp
Copy link
Member

Great find.

k8s-ci-robot added a commit that referenced this pull request Feb 18, 2019
#73845-upstream-release-1.11

Automated cherry pick of #70735 #73845 upstream release 1.11
@wojtek-t wojtek-t deleted the fix_watcher_going_back_in_time branch July 19, 2019 11:43
@wojtek-t wojtek-t restored the fix_watcher_going_back_in_time branch July 19, 2019 11:43
@nonsense nonsense mentioned this pull request Jun 3, 2020
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants