cache: Reflector should have the same injected clock as its informer #115077

smarterclayton · 2023-01-14T19:49:19Z

NOTE: This pull is part of a series of changes that introduce new context-cancellation-aware Poll methods, reduce the surface area of the wait package to a smaller set of functions (unused functions are made private, consolidated ones are marked deprecated), unify the underlying loop implementation with better testing, consolidate the backoff manager code into a smaller chunk, and in general address a number of outstanding issues. See #115077, #115116, #115113, #115140, #115064, and #107826.

While refactoring the backoff manager to simplify and unify the code in wait, a race condition was encountered in TestSharedInformerWatchDisruption. The new implementation failed because the fake clock was not propagated to the reflector and its backoff managers (right now the backoff managers in tests use a real clock). After ensuring the reflector, controller, and informer shared the same clock, the test needed to be updated to avoid the race condition by advancing the fake clock and adding real sleeps to wait for asynchronous propagation across the various goroutines in the controller.

Due to the deep structure of informers, it is difficult to inject hooks that would avoid having to perform sleeps. At a minimum, the FakeClock interface should allow a caller to determine the number of waiting timers (to avoid the first sleep).

Included in #115064 but called out separately here for independent tests.

/kind cleanup
/kind flake

NONE

k8s-ci-robot · 2023-01-14T19:49:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~staging/src/k8s.io/client-go/tools/cache/OWNERS~~ [smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

While refactoring the backoff manager to simplify and unify the code in wait, a race condition was encountered in TestSharedInformerWatchDisruption. The new implementation failed because the fake clock was not propagated to the backoff managers when the reflector was used in a controller. After ensuring the managers, reflector, controller, and informer shared the same clock, the test was updated to avoid the race condition by advancing the fake clock and adding real sleeps to wait for asynchronous propagation across the various goroutines in the controller. Due to the deep structure of informers, it is difficult to inject hooks that would avoid having to perform sleeps. At a minimum, the FakeClock interface should allow a caller to determine the number of waiting timers (to avoid the first sleep).

aojea · 2023-01-15T23:53:11Z

staging/src/k8s.io/client-go/tools/cache/shared_informer_test.go

@@ -348,6 +348,18 @@ func TestSharedInformerWatchDisruption(t *testing.T) {
 	// Simulate a connection loss (or even just a too-old-watch)
 	source.ResetWatch()

+	// Wait long enough for the reflector to exit and the backoff function to start waiting


Just curious, what is "long enough" in this context? The time to execute all these actions?

In this test, long enough is "for the Go scheduler to kick in, switch to the waiting goroutine, and then run up until the point we try to get the timer channel from the timer, which registers us with the fake clock so 'Step' actually does something". I.e. on the order of nanoseconds, but we don't currently have a way to inject a deterministic "wait until a timer is created on the fake clock and passes this argument".

aojea · 2023-01-15T23:57:03Z

/test pull-kubernetes-node-e2e-containerd

unrelated failure

E2eNode Suite: [It] [sig-node] Summary API [NodeConformance] when querying /stats/summary should report resource usage through the stats api

/lgtm

stress ./cache.test -test.run ^TestSharedInformerWatchDisruption$
5s: 0 runs so far, 0 failures
10s: 48 runs so far, 0 failures
15s: 96 runs so far, 0 failures
20s: 144 runs so far, 0 failures
25s: 192 runs so far, 0 failures
30s: 192 runs so far, 0 failures
35s: 240 runs so far, 0 failures
40s: 288 runs so far, 0 failures
45s: 336 runs so far, 0 failures
50s: 384 runs so far, 0 failures
55s: 384 runs so far, 0 failures
1m0s: 432 runs so far, 0 failures
1m5s: 480 runs so far, 0 failures
1m10s: 528 runs so far, 0 failures
1m15s: 576 runs so far, 0 failures
1m20s: 576 runs so far, 0 failures
1m25s: 624 runs so far, 0 failures

test doesn't flake with current sleeps

k8s-ci-robot · 2023-01-15T23:57:10Z

LGTM label has been added.

Git tree hash: 4b275cc1517278d2e8a3ea8e7afc723475dad0dc

smarterclayton · 2023-01-16T14:17:35Z

TFTR!

cici37 · 2023-01-17T21:16:51Z

/triage accepted

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 14, 2023

k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 14, 2023

k8s-ci-robot requested review from liggitt and ncdc January 14, 2023 19:50

smarterclayton force-pushed the reflector_mock_clock branch from b78d9f2 to 91b3a81 Compare January 14, 2023 19:50

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 14, 2023

aojea reviewed Jan 15, 2023

k8s-ci-robot assigned aojea Jan 15, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2023

k8s-ci-robot merged commit 4c4d4ad into kubernetes:master Jan 16, 2023

k8s-ci-robot added this to the v1.27 milestone Jan 16, 2023

smarterclayton mentioned this pull request Jan 17, 2023

wait: Use a context implementation for ContextForChannel #115140

Merged

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 17, 2023
