Avoid going back in time in watchcache watchers #73845

Merged
merged 1 commit into from Feb 8, 2019

Member

wojtek-t commented Feb 8, 2019

Before this change, it was possible that, after a watcher started and processed "initEvents", some events that had been buffered in the cacher before that point were delivered to the watcher a second time, causing the watcher to go back in time.

I suspect that this may be the cause of various issues we have seen in the past (e.g. the scheduler not scheduling pods even though they fit on a node, or vice versa). We suspected missing events, but in fact "repeated" events could have caused them.

Fix watch to not send the same set of events multiple times causing watcher to go back in time
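
To make the failure mode concrete, here is a minimal, self-contained sketch of the idea. It is not the actual cacher code: the `event` type and the `adjustWatchRV` and `deliver` helpers are hypothetical names invented for illustration. It assumes a watcher first receives its "initEvents" and then receives buffered events whose resourceVersion is greater than `watchRV`.

```go
package main

import "fmt"

// event is a hypothetical, simplified stand-in for a watch cache event.
type event struct {
	Key string
	RV  uint64 // resourceVersion of the object after this event
}

// adjustWatchRV bumps watchRV to the resourceVersion of the last init event,
// so that buffered events already covered by initEvents are filtered out.
func adjustWatchRV(watchRV uint64, initEvents []event) uint64 {
	if len(initEvents) > 0 {
		return initEvents[len(initEvents)-1].RV
	}
	return watchRV
}

// deliver simulates a watcher: it sends all initEvents first and then every
// buffered event whose resourceVersion is greater than watchRV.
func deliver(watchRV uint64, initEvents, buffered []event) []uint64 {
	var out []uint64
	for _, e := range initEvents {
		out = append(out, e.RV)
	}
	for _, e := range buffered {
		if e.RV > watchRV {
			out = append(out, e.RV)
		}
	}
	return out
}

func main() {
	initEvents := []event{{"a", 11}, {"b", 12}}
	// Events 11 and 12 are still sitting in the buffer when the watcher starts.
	buffered := []event{{"a", 11}, {"b", 12}, {"c", 13}}

	// Without the adjustment, 11 and 12 are delivered twice.
	fmt.Println(deliver(10, initEvents, buffered)) // [11 12 11 12 13]

	// With the adjustment, only the genuinely new event follows initEvents.
	fmt.Println(deliver(adjustWatchRV(10, initEvents), initEvents, buffered)) // [11 12 13]
}
```

In this sketch the adjustment is computed by the caller before the watcher is constructed, which mirrors the review suggestion in this PR to pass in a watchRV that needs no further modification.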

@kubernetes/sig-api-machinery-bugs @kubernetes/sig-scheduling-bugs
@liggitt @jpbetz @cheftako @bsalamat

Contributor

k8s-ci-robot commented Feb 8, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wojtek-t

@wojtek-t wojtek-t force-pushed the wojtek-t:fix_watcher_going_back_in_time branch from 8fb140f to 56b8ee4 Feb 8, 2019

// With some events already sent, update resourceVersion so that
// events that were buffered and not yet processed won't be delivered
// to this watcher second time causing going back in time.
if len(initEvents) > 0 {

liggitt Feb 8, 2019

Member

slight preference for adjusting watchRV where we calculate initEvents, and passing a watchRV that doesn't require modification into newCacheWatcher

wojtek-t Feb 8, 2019

Author Member

Makes sense - done.

}
currentRV = rv
}
case <-time.After(500 * time.Millisecond):

liggitt Feb 8, 2019

Member

this seems likely to cause flakes or unintentional test truncation under load... suggest 1 second or more?

wojtek-t Feb 8, 2019

Author Member

The good part is that if we hit the timeout, we just won't catch the problem.
So it's the other way around than in the typical case - the timeout doesn't trigger a failure in this one.

wojtek-t Feb 8, 2019

Author Member

But yeah - I'm fine with changing to 1s.

Member

liggitt commented Feb 8, 2019

a couple nits, lgtm overall

@wojtek-t wojtek-t force-pushed the wojtek-t:fix_watcher_going_back_in_time branch from 56b8ee4 to 1b436f1 Feb 8, 2019

Member Author

wojtek-t left a comment

Done - PTAL

Member

liggitt commented Feb 8, 2019

/lgtm
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 8, 2019

@k8s-ci-robot k8s-ci-robot merged commit fd633d1 into kubernetes:master Feb 8, 2019

13 checks passed

cla/linuxfoundation: wojtek-t authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-godeps: Job succeeded.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
tide: In merge pool.

Contributor

jpbetz commented Feb 8, 2019

Great find @wojtek-t!

/cc @cheftako @wenjiaswe who had looked at the "dropped watch event" issue in detail in the past.

Member

liggitt commented Feb 8, 2019

@wojtek-t will you pick this to 1.11, 1.12, 1.13?

Member Author

wojtek-t commented Feb 11, 2019

@liggitt - yes, I was waiting for approval on this (and for it to be merged, which happened after Friday evening for me). Creating the cherrypicks in a few minutes.

k8s-ci-robot added a commit that referenced this pull request Feb 11, 2019

Merge pull request #73909 from wojtek-t/automated-cherry-pick-of-#738…
…45-upstream-release-1.13

Automated cherry pick of #73845 upstream release 1.13

k8s-ci-robot added a commit that referenced this pull request Feb 12, 2019

Merge pull request #73915 from wojtek-t/automated-cherry-pick-of-#70735-#73845-upstream-release-1.12

Automated cherry pick of #70735 #73845 upstream release 1.12

Member

lavalamp commented Feb 12, 2019

Great find.

k8s-ci-robot added a commit that referenced this pull request Feb 18, 2019

Merge pull request #73913 from wojtek-t/automated-cherry-pick-of-#70735-#73845-upstream-release-1.11

Automated cherry pick of #70735 #73845 upstream release 1.11