
deflake: Add retry with timeout to wait for final conditions #116675

Merged

2 commits merged into kubernetes:master on Apr 12, 2023

Conversation

pacoxu
Member

@pacoxu pacoxu commented Mar 16, 2023

What type of PR is this?

/kind flake

What this PR does / why we need it:

  • Add retry with timeout to wait for final conditions
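
For context, a minimal sketch of the retry-with-timeout pattern described above, assuming the k8s.io/apimachinery wait helpers; the condition function is a hypothetical stand-in for the real assertions in the reconciler tests, not the PR's actual code:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Hypothetical condition: in the real tests this would query the actual
	// state of the world (e.g. whether a volume is reported as attached).
	conditionMet := func() bool { return true }

	// Poll every 10ms until the final condition holds or the timeout expires,
	// instead of asserting the state once and racing the reconciler goroutines.
	err := wait.PollImmediate(10*time.Millisecond, 1*time.Second, func() (bool, error) {
		return conditionMet(), nil
	})
	if err != nil {
		fmt.Printf("final condition not reached within timeout: %v\n", err)
		return
	}
	fmt.Println("final condition reached")
}
```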

Which issue(s) this PR fixes:

Fixes #107414
Fixes #116774

Special notes for your reviewer:

Does this PR introduce a user-facing change?

None

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 16, 2023
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 16, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 16, 2023
@MadhavJivrajani
Contributor

/retest

@pacoxu
Member Author

pacoxu commented Mar 16, 2023

/test pull-kubernetes-e2e-kind
/test pull-kubernetes-unit

@pacoxu
Member Author

pacoxu commented Mar 17, 2023

/test pull-kubernetes-unit
to see if it still flakes.

@pacoxu
Member Author

pacoxu commented Mar 17, 2023

/test pull-kubernetes-unit

@pacoxu
Member Author

pacoxu commented Mar 17, 2023

/assign @msau42 @gnufied

    var result bool
    for _, volume := range volumes[nodeName] {
        if volume.Name == volumeName {
            result = true
Member


Can you remove the sleeps in the main test case:

// The first detach will be triggered after at least 50ms (maxWaitForUnmountDuration in test).

I think this can help, but it can still flake because the test case is trying to catch state transitions in between reconciler retries.

Member Author

@pacoxu pacoxu Mar 17, 2023


Can you remove the sleeps in the main test case:

If so, we have to add retries in verifyVolumeAttachedToNode and the other helpers. Let me try.

I think this can help, but it can still flake because the test case is trying to catch state transitions in between reconciler retries.

I don't think the wait can fix everything that is flaky here. The only purpose of using the wait is to make the test flake less often than before.

@pacoxu
Member Author

pacoxu commented Mar 17, 2023

The verify failure is caused by #116705. I will rebase after #116705 is merged.

Rebased.

Member

@liggitt liggitt left a comment


Did you run this with stress (https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0#running-unit-tests-to-reproduce-flakes) to see if it resolved the visible flakes?

@pacoxu pacoxu changed the title deflake: Add retry with timeout to wait for final conditions [WIP]deflake: Add retry with timeout to wait for final conditions Mar 19, 2023
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 19, 2023
@pacoxu
Member Author

pacoxu commented Mar 19, 2023

Did you run this with stress (https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0#running-unit-tests-to-reproduce-flakes) to see if it resolved the visible flakes?

Not yet. I need to do some further tests.

@pacoxu
Member Author

pacoxu commented Mar 21, 2023

Did you run this with stress (https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0#running-unit-tests-to-reproduce-flakes) to see if it resolved the visible flakes?

Not yet. I need to do some further tests.

28m50s: 2442 runs so far, 0 failures

@liggitt @gnufied the comments are addressed and the stress run seems to be good enough.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 21, 2023
@liggitt
Member

liggitt commented Mar 21, 2023

thanks for checking flakes with stress, I'll defer to @msau42 @gnufied for review

@gnufied
Member

gnufied commented Mar 21, 2023

This still flakes, unfortunately. One of the ways I have been able to reproduce the flake is to run the test through a debugger, which causes the code to pause arbitrarily. So the test is very sensitive to pauses (I would assume even GC pressure can cause a pause).

[screenshot attached]

@@ -834,7 +833,7 @@ func Test_Run_OneVolumeAttachAndDetachTimeoutNodesWithReadWriteOnce(t *testing.T
     // Volume is added to asw. Because attach operation fails, volume should not be reported as attached to the node.
     waitForVolumeAddedToNode(t, generatedVolumeName, nodeName1, asw)
     verifyVolumeAttachedToNode(t, generatedVolumeName, nodeName1, cache.AttachStateUncertain, asw)
-    verifyVolumeReportedAsAttachedToNode(t, logger, generatedVolumeName, nodeName1, false, asw)
+    verifyVolumeReportedAsAttachedToNode(t, logger, generatedVolumeName, nodeName1, false, asw, volumeAttachedCheckTimeout)
Member


So now most calls to verifyVolumeNoStatusUpdateNeeded are prone to flaking, because the state they check depends on when the goroutine spawned by the reconciler to detach the volume finishes and on when the reconciler loop itself turns.

I propose we remove all of those calls except the last one: verifyVolumeNoStatusUpdateNeeded(t, logger, generatedVolumeName, nodeName2, asw)

I also think we should remove the intermediate checks: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler_test.go#L756-L765. I know they are useful to have, but if a test flakes even inside a debugger, it will cause a bad dev experience.
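
For illustration, a hedged sketch of the shape this suggests: a helper that polls only the terminal condition against a deadline instead of asserting intermediate states. The names waitForFinalCondition and TestFinalConditionOnly are hypothetical, not helpers from reconciler_test.go:

```go
package sketch

import (
	"testing"
	"time"
)

// waitForFinalCondition polls check until it returns true or the timeout
// expires, failing the test on timeout. It deliberately asserts nothing about
// intermediate states, which can flip-flop while the reconciler loop turns.
func waitForFinalCondition(t *testing.T, check func() bool, timeout time.Duration) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for !check() {
		if time.Now().After(deadline) {
			t.Fatalf("final condition not reached within %v", timeout)
		}
		time.Sleep(10 * time.Millisecond)
	}
}

// Example usage; the condition here is a toy stand-in for something like
// "the volume is no longer reported as attached to the node".
func TestFinalConditionOnly(t *testing.T) {
	start := time.Now()
	waitForFinalCondition(t, func() bool {
		return time.Since(start) > 50*time.Millisecond
	}, 1*time.Second)
}
```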

Contributor


I agree that if there is no obvious benefit to checking the intermediate state, we can remove those checks. I'm wondering why it has started to flake much more than before, and what has changed. Any performance implications?

Member Author


I also think we should remove the intermediate checks: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler_test.go#L756-L765. I know they are useful to have, but if a test flakes even inside a debugger, it will cause a bad dev experience.

I commented out the intermediate-state checks here.

Any performance implications?

As far as I remember, it has been flaky for quite a long time. But yes, it flakes more often now than before (I suspect this is release related, as there are more CI jobs during code-freeze week and the CI load is higher).

Member

@gnufied gnufied Mar 28, 2023


IMO all calls to verifyVolumeNoStatusUpdateNeeded are still prone to flaking. I haven't tested with the debugger yet, but that code still verifies state that can flip-flop while tests are running and hence is prone to flaking.

@pacoxu
Member Author

pacoxu commented Mar 22, 2023

This still flakes, unfortunately. One of the ways I have been able to reproduce the flake is to run the test through a debugger, which causes the code to pause arbitrarily. So the test is very sensitive to pauses (I would assume even GC pressure can cause a pause).

Does it still flake after removing the intermediate-state checks?

After removing the intermediate-state checks, I have been running a stress test that has been consistently successful for several tens of minutes.

17m50s: 1513 runs so far, 0 failures

@pacoxu
Member Author

pacoxu commented Mar 22, 2023

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 22, 2023
This was referenced Mar 28, 2023

    // Add a second pod which tries to attach the volume to the same node.
    // After adding pod to the same node, detach will not be triggered any more.
    generatedVolumeName, podAddErr = dsw.AddPod(types.UniquePodName(podName2), controllervolumetesting.NewPod(podName2, podName2), volumeSpec, nodeName1)
    if podAddErr != nil {
        t.Fatalf("AddPod failed. Expected: <no error> Actual: <%v>", podAddErr)
    }
    // Sleep 1s to verify no detach are triggered after second pod is added in the future.
    time.Sleep(1000 * time.Millisecond)
    // verify no detach are triggered after second pod is added in the future.
Member


IMO we should keep this Sleep call, because we are trying to ascertain the effect that adding a pod has on the reconciler and whether it results in any change. Without the Sleep we are only verifying the immediate state, which is not the same thing.

Member Author

@pacoxu pacoxu Mar 30, 2023


Updated.

  • I'm not sure whether relying on sleep in a unit test will still make it fragile.

Member


We are not "relying" on sleep. We are giving the reconciler loop a chance to turn, so that we are testing the code after the reconciler loop has turned (at least in theory).
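
For illustration, a minimal sketch of that point with hypothetical stand-ins (the loop and counter below are not the real reconciler or fake plugin): the Sleep exists so the negative assertion runs after the loop has had a chance to turn, which is a stronger claim than checking the immediate state.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	var detachCalls int64 // hypothetical stand-in for the fake plugin's detach counter

	stop := make(chan struct{})
	defer close(stop)

	// Hypothetical reconciler loop: wakes up periodically and may trigger a detach.
	go func() {
		ticker := time.NewTicker(100 * time.Millisecond)
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				return
			case <-ticker.C:
				// In the real reconciler a detach could be triggered here,
				// which would do atomic.AddInt64(&detachCalls, 1).
			}
		}
	}()

	// ... add the second pod to the desired state of the world here ...

	// Sleep long enough for the loop to turn at least once, then make the
	// negative assertion. Checking immediately would only describe the state
	// before the reconciler had any chance to act.
	time.Sleep(1 * time.Second)
	if n := atomic.LoadInt64(&detachCalls); n != 0 {
		fmt.Printf("unexpected detach calls: %d\n", n)
		return
	}
	fmt.Println("no detach triggered after the second pod was added")
}
```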

// verifyVolumeReportedAsAttachedToNode will check volume is in the list of volume attached that needs to be updated
// in node status. By calling this function (GetVolumesToReportAttached), node status should be updated, and the volume
// will not need to be updated until new changes are applied (detach is triggered again)
time.Sleep(100 * time.Millisecond)
Member


Same here - we should keep at least one Sleep call here.

Member

@gnufied gnufied left a comment


Some minor nits. Why are we commenting out the code rather than removing it? Do we expect to fix it in a follow-up?

@pacoxu
Member Author

pacoxu commented Mar 30, 2023

Some minor nits. Why are we commenting out the code rather than removing it? Do we expect to fix it in a follow-up?

Let's remove it, since there is no further action planned here.

  • At first, I kept it because it could help developers understand the testing process and comments.

@gnufied
Member

gnufied commented Mar 30, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 30, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 59eb25b8c9e43208078edccfbe4906968523079c

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, pacoxu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 30, 2023
@k8s-ci-robot k8s-ci-robot merged commit 8cdc7fa into kubernetes:master Apr 12, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone Apr 12, 2023
k8s-ci-robot added a commit that referenced this pull request on May 5, 2023: Automated cherry pick of #116675 upstream release 1.27 (…75-upstream-release-1.27)