deflake: Add retry with timeout to wait for final conditions #116675
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest
/test pull-kubernetes-e2e-kind
/test pull-kubernetes-unit
/test pull-kubernetes-unit
```go
var result bool
for _, volume := range volumes[nodeName] {
	if volume.Name == volumeName {
		result = true
```
Can you remove the sleeps in the main test case:

```go
// The first detach will be triggered after at least 50ms (maxWaitForUnmountDuration in test).
```
I think this can help, but it can still be flaky because the test case is trying to catch state transitions in between reconciler retries.
> Can you remove the sleeps in the main test case:

If so, we have to add retry in verifyVolumeAttachedToNode and other helpers. Let me try.

> I think this can help, but it can still be flaky because the test case is trying to catch state transitions in between reconciler retries.

I think the wait here cannot fix all the flakiness. The only purpose of using wait here is to make the test flake less than before.
Rebased.
Did you run this with stress (https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0#running-unit-tests-to-reproduce-flakes) and see if it resolved the visible flakes?
Not yet. I need to do some further tests.
@liggitt @gnufied the comments are addressed and the stress run seems to be good enough.
```diff
@@ -834,7 +833,7 @@ func Test_Run_OneVolumeAttachAndDetachTimeoutNodesWithReadWriteOnce(t *testing.T
 	// Volume is added to asw. Because attach operation fails, volume should not be reported as attached to the node.
 	waitForVolumeAddedToNode(t, generatedVolumeName, nodeName1, asw)
 	verifyVolumeAttachedToNode(t, generatedVolumeName, nodeName1, cache.AttachStateUncertain, asw)
-	verifyVolumeReportedAsAttachedToNode(t, logger, generatedVolumeName, nodeName1, false, asw)
+	verifyVolumeReportedAsAttachedToNode(t, logger, generatedVolumeName, nodeName1, false, asw, volumeAttachedCheckTimeout)
```
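To make the retry-with-timeout idea concrete, here is a minimal sketch of what such a verification helper could look like, built around the volumes[nodeName] loop quoted earlier and the GetVolumesToReportAttached call described later in this thread. The use of wait.PollImmediate, the 10ms poll interval, and the exact signature are illustrative assumptions rather than the PR's literal code:

```go
import (
	"testing"
	"time"

	v1 "k8s.io/api/core/v1"
	k8stypes "k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/klog/v2"

	"k8s.io/kubernetes/pkg/controller/volume/attachdetach/cache"
)

// verifyVolumeReportedAsAttachedToNode polls the actual state of world until the
// volume's reported-as-attached status matches the expectation, instead of
// asserting it exactly once (which races with the reconciler goroutine).
// Sketch only: the signature mirrors the call sites in the diff above, the body
// is illustrative.
func verifyVolumeReportedAsAttachedToNode(
	t *testing.T,
	logger klog.Logger,
	volumeName v1.UniqueVolumeName,
	nodeName k8stypes.NodeName,
	isAttached bool,
	asw cache.ActualStateOfWorld,
	timeout time.Duration,
) {
	err := wait.PollImmediate(10*time.Millisecond, timeout, func() (bool, error) {
		result := false
		volumes := asw.GetVolumesToReportAttached(logger)
		for _, volume := range volumes[nodeName] {
			if volume.Name == volumeName {
				result = true
			}
		}
		// Done when the observed state matches the expected final condition.
		return result == isAttached, nil
	})
	if err != nil {
		t.Fatalf("Timed out waiting for reported-as-attached=%v for volume %q on node %q: %v",
			isAttached, volumeName, nodeName, err)
	}
}
```

Polling until the expected final condition holds (or the timeout expires) tolerates the reconciler loop turning at its own pace, which is exactly the timing the point-in-time assertions were racing against.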
So now most calls to verifyVolumeNoStatusUpdateNeeded are prone to flaking, because the state they check depends on when the goroutine spawned by the reconciler to detach the volume finishes and on when the reconciler loop itself turns.

I propose we remove all of those calls except the last one - verifyVolumeNoStatusUpdateNeeded(t, logger, generatedVolumeName, nodeName2, asw).

I also think we should remove the intermediate checks - https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler_test.go#L756-L765 . I know they are useful to have, but if a test flakes even inside a debugger, it will cause a bad dev experience.
I agree: if there is no obvious benefit to checking intermediate state, we can remove those checks. I'm wondering why it has started to flake much more than before, and what has changed. Any performance implications?
> I also think we should remove the intermediate checks - https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler_test.go#L756-L765 . I know they are useful to have, but if a test flakes even inside a debugger, it will cause a bad dev experience.

I commented out the intermediate state checks here.

> Any performance implications?

It has been flaky for quite a long time, as far as I remember. But yes, it has flaked more often recently than before (I suspect this is release related, as there are more CI jobs during code-freeze week and the CI load is higher).
IMO all calls to verifyVolumeNoStatusUpdateNeeded are still prone to flaking. I haven't tested with the debugger yet, but that code still verifies state that can flip-flop while the test is running, and hence is prone to flaking.

Does it still flake after removing the intermediate state checks? I have been running a stress test that has been consistently successful for several tens of minutes since removing them.
/priority important-longterm
```diff
 	// Add a second pod which tries to attach the volume to the same node.
 	// After adding pod to the same node, detach will not be triggered any more.
 	generatedVolumeName, podAddErr = dsw.AddPod(types.UniquePodName(podName2), controllervolumetesting.NewPod(podName2, podName2), volumeSpec, nodeName1)
 	if podAddErr != nil {
 		t.Fatalf("AddPod failed. Expected: <no error> Actual: <%v>", podAddErr)
 	}
-	// Sleep 1s to verify no detach are triggered after second pod is added in the future.
-	time.Sleep(1000 * time.Millisecond)
+	// verify no detach are triggered after second pod is added in the future.
```
IMO we should keep this Sleep call, because we are trying to ascertain the effect of adding a pod on the reconciler and whether it results in any change. Without the Sleep we are only verifying the immediate state, which is not the same thing.
Updated.

I'm not sure whether relying on sleep in a unit test will still leave it fragile.
We are not "relying" on sleep. We are giving reconciler loop a chance to turn, so as we are testing code after reconciler loop has turned (at least in theory).
```go
	// verifyVolumeReportedAsAttachedToNode will check volume is in the list of volume attached that needs to be updated
	// in node status. By calling this function (GetVolumesToReportAttached), node status should be updated, and the volume
	// will not need to be updated until new changes are applied (detach is triggered again)
	time.Sleep(100 * time.Millisecond)
```
Same here - we should keep at least one Sleep call here.
Some minor nits: why are we commenting out the code rather than removing it? Do we expect to fix it in a follow-up?
Let's remove it, as there is no further action needed here.
/lgtm

LGTM label has been added. Git tree hash: 59eb25b8c9e43208078edccfbe4906968523079c
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, pacoxu

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
…75-upstream-release-1.27 Automated cherry pick of #116675 upstream release 1.27
What type of PR is this?
/kind flake
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #107414
Fixes #116774
Special notes for your reviewer:
Does this PR introduce a user-facing change?