Avoid going back in time in watchcache watchers #73845

wojtek-t · 2019-02-08T11:39:22Z

Before this change, it was possible that after starting a watcher and processing "initEvents", some events that had been buffered in the cacher before that point were delivered to that watcher a second time, causing the watcher to go back in time.

I suspect that this may be the cause of several issues we have seen in the past (e.g. the scheduler not scheduling pods even though they fit on a node, or vice versa). We suspected missing events, but these may in fact have been "repeated" events.

Fix watch so that it does not send the same set of events multiple times, which caused watchers to go back in time
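
For readers who want to see the failure mode in isolation, here is a minimal sketch in Go. It is not the cacher code: the `event` type, `deliver`, and `isMonotonic` are hypothetical names invented for this illustration, and resource versions are modeled as plain integers. It assumes that buffered events are filtered only by the resource version the watcher starts from, so advancing that value to the last init event is what suppresses the replay.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a watch cache event; only the
// resource version matters for this illustration.
type event struct {
	rv uint64
}

// deliver returns the resource versions a watcher would observe: first all
// initEvents, then every buffered event newer than startRV.
func deliver(initEvents, buffered []event, startRV uint64) []uint64 {
	var seen []uint64
	for _, e := range initEvents {
		seen = append(seen, e.rv)
	}
	for _, e := range buffered {
		if e.rv > startRV {
			seen = append(seen, e.rv)
		}
	}
	return seen
}

// isMonotonic reports whether the resource versions are strictly increasing.
func isMonotonic(rvs []uint64) bool {
	for i := 1; i < len(rvs); i++ {
		if rvs[i] <= rvs[i-1] {
			return false
		}
	}
	return true
}

func main() {
	initEvents := []event{{11}, {12}, {13}}
	// Events 12 and 13 are assumed to be still queued when the watcher starts.
	buffered := []event{{12}, {13}, {14}}
	requestedRV := uint64(10)

	// Filtering by the requested RV replays 12 and 13 after the init events.
	before := deliver(initEvents, buffered, requestedRV)
	fmt.Println(before, isMonotonic(before)) // [11 12 13 12 13 14] false

	// Filtering by the RV of the last init event drops the replayed events.
	lastInitRV := initEvents[len(initEvents)-1].rv
	after := deliver(initEvents, buffered, lastInitRV)
	fmt.Println(after, isMonotonic(after)) // [11 12 13 14] true
}
```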

@kubernetes/sig-api-machinery-bugs @kubernetes/sig-scheduling-bugs
@liggitt @jpbetz @cheftako @bsalamat

k8s-ci-robot · 2019-02-08T11:39:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~staging/src/k8s.io/apiserver/pkg/storage/OWNERS~~ [wojtek-t]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

liggitt · 2019-02-08T14:36:56Z

staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go

+	// With some events already sent, update resourceVersion so that
+	// events that were buffered and not yet processed won't be delivered
+	// to this watcher second time causing going back in time.
+	if len(initEvents) > 0 {


slight preference for adjusting watchRV where we calculate initEvents, and passing a watchRV that doesn't require modification into newCacheWatcher

liggitt · 2019-02-08T14:38:58Z

staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher_whitebox_test.go

+				}
+				currentRV = rv
+			}
+		case <-time.After(500 * time.Millisecond):


this seems likely to cause flakes or unintentional test truncation under load... suggest 1 second or more?

liggitt · 2019-02-08T14:40:52Z

a couple nits, lgtm overall

wojtek-t

Done - PTAL

wojtek-t · 2019-02-08T14:49:42Z

staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher_whitebox_test.go

+				}
+				currentRV = rv
+			}
+		case <-time.After(500 * time.Millisecond):


The good part is that if we hit the timeout, we just won't catch the problem.
So this is the opposite of the typical case: the timeout doesn't trigger a failure in this test.

wojtek-t · 2019-02-08T14:53:48Z

staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go

+	// With some events already sent, update resourceVersion so that
+	// events that were buffered and not yet processed won't be delivered
+	// to this watcher second time causing going back in time.
+	if len(initEvents) > 0 {


Makes sense - done.
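
The review suggestion in this thread (compute the final resource version next to initEvents and hand the constructor a value it never has to change) could look roughly like the sketch below. All names here (`event`, `watcher`, `newWatcher`, `eventsSince`) are hypothetical and are not the identifiers used in cacher.go; the sketch also assumes the cached events are ordered by resource version.

```go
package main

import "fmt"

// event is a hypothetical, simplified watch cache event.
type event struct{ rv uint64 }

// watcher is a hypothetical, simplified cache watcher. It stores the
// start RV it is given and never modifies it.
type watcher struct {
	startRV    uint64
	initEvents []event
}

func newWatcher(startRV uint64, initEvents []event) *watcher {
	return &watcher{startRV: startRV, initEvents: initEvents}
}

// accepts reports whether a buffered event should be forwarded to the watcher.
func (w *watcher) accepts(e event) bool { return e.rv > w.startRV }

// eventsSince returns the cached events newer than rv, together with the RV
// from which live delivery should continue. The adjustment happens here,
// where the init events are computed, rather than inside the constructor.
func eventsSince(cache []event, rv uint64) ([]event, uint64) {
	var out []event
	for _, e := range cache {
		if e.rv > rv {
			out = append(out, e)
		}
	}
	if len(out) > 0 {
		rv = out[len(out)-1].rv
	}
	return out, rv
}

func main() {
	cache := []event{{9}, {11}, {12}}
	initEvents, startRV := eventsSince(cache, 10)
	w := newWatcher(startRV, initEvents)
	fmt.Println(startRV, w.accepts(event{12}), w.accepts(event{13})) // 12 false true
}
```

Keeping the adjustment at the call site means the constructor's contract stays simple: whatever RV it receives is the one it filters by.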

wojtek-t · 2019-02-08T14:54:10Z

staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher_whitebox_test.go

+				}
+				currentRV = rv
+			}
+		case <-time.After(500 * time.Millisecond):


But yeah - I'm fine with changing to 1s.

liggitt · 2019-02-08T17:35:06Z

/lgtm
/retest

jpbetz · 2019-02-08T21:05:26Z

Great find @wojtek-t!

/cc @cheftako @wenjiaswe who had looked at the "dropped watch event" issue in detail in the past.

liggitt · 2019-02-08T21:18:57Z

@wojtek-t will you pick this to 1.11, 1.12, 1.13?

wojtek-t · 2019-02-11T07:29:18Z

@liggitt - yes, I was waiting for approval on this (and for it to be merged, which happened after Friday evening for me). Creating the cherry-picks in a few minutes.

lavalamp · 2019-02-12T23:15:03Z

Great find.

wojtek-t added the kind/bug, priority/critical-urgent, and sig/api-machinery labels Feb 8, 2019

wojtek-t assigned jpbetz and liggitt Feb 8, 2019

k8s-ci-robot added the release-note, sig/scheduling, and cncf-cla: yes labels Feb 8, 2019

k8s-ci-robot added the approved and area/apiserver labels Feb 8, 2019

k8s-ci-robot requested review from caesarxuchao and smarterclayton February 8, 2019 11:40

wojtek-t force-pushed the fix_watcher_going_back_in_time branch from 8fb140f to 56b8ee4 Compare February 8, 2019 13:31

liggitt reviewed Feb 8, 2019

Avoid going back in time in watchcache watchers

1b436f1

wojtek-t force-pushed the fix_watcher_going_back_in_time branch from 56b8ee4 to 1b436f1 Compare February 8, 2019 14:56

wojtek-t commented Feb 8, 2019

k8s-ci-robot added the lgtm label Feb 8, 2019

k8s-ci-robot merged commit fd633d1 into kubernetes:master Feb 8, 2019

k8s-ci-robot added a commit that referenced this pull request Feb 11, 2019

Merge pull request #73909 from wojtek-t/automated-cherry-pick-of-#738…

35cc735

…45-upstream-release-1.13 Automated cherry pick of #73845 upstream release 1.13

k8s-ci-robot added a commit that referenced this pull request Feb 12, 2019

Merge pull request #73915 from wojtek-t/automated-cherry-pick-of-#70735-

fce7bb4

#73845-upstream-release-1.12 Automated cherry pick of #70735 #73845 upstream release 1.12

k8s-ci-robot added a commit that referenced this pull request Feb 18, 2019

Merge pull request #73913 from wojtek-t/automated-cherry-pick-of-#70735-

902c3f5

#73845-upstream-release-1.11 Automated cherry pick of #70735 #73845 upstream release 1.11

liggitt mentioned this pull request Apr 15, 2019

Pod reflector getting delete events with nil object #76624

Closed

wojtek-t deleted the fix_watcher_going_back_in_time branch July 19, 2019 11:43

wojtek-t restored the fix_watcher_going_back_in_time branch July 19, 2019 11:43

wojtek-t mentioned this pull request Dec 30, 2019

pods cannot be scheduled, bug nodes has enough resources #86626

Closed

snowplayfire mentioned this pull request Jan 9, 2020

pod scheduler to node, but kubelet admit failed, pod was outOfmemory #86986

Closed

nonsense mentioned this pull request Jun 3, 2020

Kubernetes Autoscaler testground/infra#43

Closed

3 tasks
