Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache: Reflector should have the same injected clock as its informer #115077

Merged
merged 1 commit into from
Jan 16, 2023

Conversation

smarterclayton
Copy link
Contributor

@smarterclayton smarterclayton commented Jan 14, 2023

NOTE: This pull is part of a series of changes that introduce new context-cancellation-aware Poll methods, reduce the surface area of the wait package to a smaller set of functions (if unused, marked private, if consolidating, marked deprecated), unify the underlying loop implementation with better testing, consolidate the backoff manager code into a smaller chunk, and in general address a number of outstanding issues. See #115077, #115116, #115113, #115140, #115064, and #107826.

While refactoring the backoff manager to simplify and unify the code in wait a race condition was encountered in TestSharedInformerWatchDisruption. The new implementation failed because the fake clock was not propagated to the reflector AND backoff managers (right now the backoff managers in tests would be using a real clock). After ensuring the reflector, controller, and informer shared the same clock the test needed to be updated to avoid the race condition by advancing the fake clock and adding real sleeps to wait for asynchronous propagation of the various goroutines in the controller.

Due to the deep structure of informers it is difficult to inject hooks to avoid having to perform sleeps. At a minimum the FakeClock interface should allow a caller to determine the number of waiting timers (to avoid the first sleep).

Included in #115064 but called out separately here for independent tests.

/kind cleanup
/kind flake

NONE

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 14, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 14, 2023
While refactoring the backoff manager to simplify and unify the code
in wait a race condition was encountered in
TestSharedInformerWatchDisruption. The new implementation failed
because the fake clock was not propagated to the backoff managers
when the reflector was used in a controller. After ensuring the
mangaers, reflector, controller, and informer shared the same
clock the test needed was updated to avoid the race condition by
advancing the fake clock and adding real sleeps to wait for
asynchronous propagation of the various goroutines in the controller.

Due to the deep structure of informers it is difficult to inject
hooks to avoid having to perform sleeps. At a minimum the FakeClock
interface should allow a caller to determine the number of waiting
timers (to avoid the first sleep).
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 14, 2023
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 14, 2023
@@ -348,6 +348,18 @@ func TestSharedInformerWatchDisruption(t *testing.T) {
// Simulate a connection loss (or even just a too-old-watch)
source.ResetWatch()

// Wait long enough for the reflector to exit and the backoff function to start waiting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curios, what is "long enough" in this context? the time to execute all these actions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this test long enough is "for the goscheduler to kick in, switch to the waiting goroutine, and then run up until the point we try to get the timer channel from timer, which registers us with the fake clock so 'Step' actually does something". I.e. on the order of nano seconds but we don't have a way currently to inject a deterministic "wait until a timer is created on the fake clock and passes this argument".

@aojea
Copy link
Member

aojea commented Jan 15, 2023

/test pull-kubernetes-node-e2e-containerd

unrelated failure

E2eNode Suite: [It] [sig-node] Summary API [NodeConformance] when querying /stats/summary should report resource usage through the stats api expand_more

/lgtm

stress ./cache.test -test.run ^TestSharedInformerWatchDisruption$
5s: 0 runs so far, 0 failures
10s: 48 runs so far, 0 failures
15s: 96 runs so far, 0 failures
20s: 144 runs so far, 0 failures
25s: 192 runs so far, 0 failures
30s: 192 runs so far, 0 failures
35s: 240 runs so far, 0 failures
40s: 288 runs so far, 0 failures
45s: 336 runs so far, 0 failures
50s: 384 runs so far, 0 failures
55s: 384 runs so far, 0 failures
1m0s: 432 runs so far, 0 failures
1m5s: 480 runs so far, 0 failures
1m10s: 528 runs so far, 0 failures
1m15s: 576 runs so far, 0 failures
1m20s: 576 runs so far, 0 failures
1m25s: 624 runs so far, 0 failures

test doesn't flake with current sleeps

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 4b275cc1517278d2e8a3ea8e7afc723475dad0dc

@k8s-ci-robot k8s-ci-robot merged commit 4c4d4ad into kubernetes:master Jan 16, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.27 milestone Jan 16, 2023
@smarterclayton
Copy link
Contributor Author

TFTR!

@cici37
Copy link
Contributor

cici37 commented Jan 17, 2023

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants