
deflake: Add retry with timeout to wait for final conditions #116675

Merged

2 commits merged into kubernetes:master on Apr 12, 2023

Conversation

pacoxu
Member

@pacoxu pacoxu commented Mar 16, 2023

What type of PR is this?

/kind flake

What this PR does / why we need it:

  • Add retry with timeout to wait for final conditions
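
For context, a minimal sketch of the retry-with-timeout pattern described above, assuming the k8s.io/apimachinery wait helpers; the condition function is a hypothetical stand-in for the real assertions in the reconciler tests, not the PR's actual code:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Hypothetical condition: in the real tests this would query the actual
	// state of the world (e.g. whether a volume is reported as attached).
	conditionMet := func() bool { return true }

	// Poll every 10ms until the final condition holds or the timeout expires,
	// instead of asserting the state once and racing the reconciler goroutines.
	err := wait.PollImmediate(10*time.Millisecond, 1*time.Second, func() (bool, error) {
		return conditionMet(), nil
	})
	if err != nil {
		fmt.Printf("final condition not reached within timeout: %v\n", err)
		return
	}
	fmt.Println("final condition reached")
}
```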

Which issue(s) this PR fixes:

Fixes #107414
Fixes #116774

Special notes for your reviewer:

Does this PR introduce a user-facing change?

None

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 16, 2023
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 16, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Mar 16, 2023
@MadhavJivrajani
Contributor

/retest

@pacoxu
Member Author

pacoxu commented Mar 16, 2023

/test pull-kubernetes-e2e-kind
/test pull-kubernetes-unit

@pacoxu
Member Author

pacoxu commented Mar 17, 2023

/test pull-kubernetes-unit
to see if it still flakes.

@pacoxu
Member Author

pacoxu commented Mar 17, 2023

/test pull-kubernetes-unit

@pacoxu
Member Author

pacoxu commented Mar 17, 2023

/assign @msau42 @gnufied

    var result bool
    for _, volume := range volumes[nodeName] {
        if volume.Name == volumeName {
            result = true
Member


Can you remove the sleeps in the main test case:

// The first detach will be triggered after at least 50ms (maxWaitForUnmountDuration in test).

I think this can help, but it can still flake because the test case is trying to catch state transitions in between reconciler retries.

Member Author

@pacoxu pacoxu Mar 17, 2023


Can you remove the sleeps in the main test case:

If so, we have to add retries in verifyVolumeAttachedToNode and the other helpers. Let me try.

I think this can help, but it can still flake because the test case is trying to catch state transitions in between reconciler retries.

I don't think the wait can fix everything that is flaky here. The only purpose of using the wait is to make the test flake less often than before.

@pacoxu
Member Author

pacoxu commented Mar 17, 2023

The verify failure is caused by #116705. I will rebase after #116705 is merged.

Rebased.

Member

@liggitt liggitt left a comment


Did you run this with stress (https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0#running-unit-tests-to-reproduce-flakes) to see if it resolved the visible flakes?

@pacoxu pacoxu changed the title deflake: Add retry with timeout to wait for final conditions [WIP]deflake: Add retry with timeout to wait for final conditions Mar 19, 2023
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 19, 2023
@pacoxu
Member Author

pacoxu commented Mar 19, 2023

Did you run this with stress (https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0#running-unit-tests-to-reproduce-flakes) to see if it resolved the visible flakes?

Not yet. I need to do some further tests.

@pacoxu
Member Author

pacoxu commented Mar 21, 2023

Did you run this with stress (https://gist.github.com/liggitt/6a3a2217fa5f846b52519acfc0ffece0#running-unit-tests-to-reproduce-flakes) to see if it resolved the visible flakes?

Not yet. I need to do some further tests.

28m50s: 2442 runs so far, 0 failures

@liggitt @gnufied the comments are addressed and the stress run seems to be good enough.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 21, 2023
@liggitt
Member

liggitt commented Mar 21, 2023

thanks for checking flakes with stress, I'll defer to @msau42 @gnufied for review

@gnufied
Member

gnufied commented Mar 21, 2023

This still flakes, unfortunately. One of the ways I have been able to reproduce the flake is to run the test through a debugger, which causes the code to pause arbitrarily. So the test is very sensitive to pauses (I would assume even GC pressure can cause a pause).

[screenshot attached]

@@ -834,7 +833,7 @@ func Test_Run_OneVolumeAttachAndDetachTimeoutNodesWithReadWriteOnce(t *testing.T
     // Volume is added to asw. Because attach operation fails, volume should not be reported as attached to the node.
     waitForVolumeAddedToNode(t, generatedVolumeName, nodeName1, asw)
     verifyVolumeAttachedToNode(t, generatedVolumeName, nodeName1, cache.AttachStateUncertain, asw)
-    verifyVolumeReportedAsAttachedToNode(t, logger, generatedVolumeName, nodeName1, false, asw)
+    verifyVolumeReportedAsAttachedToNode(t, logger, generatedVolumeName, nodeName1, false, asw, volumeAttachedCheckTimeout)
Member


So now most calls to verifyVolumeNoStatusUpdateNeeded are prone to flaking, because the state they check depends on when the goroutine spawned by the reconciler to detach the volume finishes and on when the reconciler loop itself turns.

I propose we remove all of those calls except the last one: verifyVolumeNoStatusUpdateNeeded(t, logger, generatedVolumeName, nodeName2, asw)

I also think we should remove the intermediate checks: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler_test.go#L756-L765. I know they are useful to have, but if a test flakes even inside a debugger, it will cause a bad dev experience.
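
For illustration, a hedged sketch of the shape this suggests: a helper that polls only the terminal condition against a deadline instead of asserting intermediate states. The names waitForFinalCondition and TestFinalConditionOnly are hypothetical, not helpers from reconciler_test.go:

```go
package sketch

import (
	"testing"
	"time"
)

// waitForFinalCondition polls check until it returns true or the timeout
// expires, failing the test on timeout. It deliberately asserts nothing about
// intermediate states, which can flip-flop while the reconciler loop turns.
func waitForFinalCondition(t *testing.T, check func() bool, timeout time.Duration) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for !check() {
		if time.Now().After(deadline) {
			t.Fatalf("final condition not reached within %v", timeout)
		}
		time.Sleep(10 * time.Millisecond)
	}
}

// Example usage; the condition here is a toy stand-in for something like
// "the volume is no longer reported as attached to the node".
func TestFinalConditionOnly(t *testing.T) {
	start := time.Now()
	waitForFinalCondition(t, func() bool {
		return time.Since(start) > 50*time.Millisecond
	}, 1*time.Second)
}
```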

Contributor


I agree that if there is no obvious benefit to checking the intermediate state, we can remove those checks. I'm wondering why it has started to flake much more than before, and what has changed. Any performance implications?

Member Author


I also think we should remove the intermediate checks: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler_test.go#L756-L765. I know they are useful to have, but if a test flakes even inside a debugger, it will cause a bad dev experience.

I commented out the intermediate-state checks here.

Any performance implications?

As far as I remember, it has been flaky for quite a long time. But yes, it flakes more often now than before (I suspect this is release related, as there are more CI jobs during code-freeze week and the CI load is higher).

Member

@gnufied gnufied Mar 28, 2023


IMO all calls to verifyVolumeNoStatusUpdateNeeded are still prone to flaking. I haven't tested with the debugger yet, but that code still verifies state that can flip-flop while tests are running and hence is prone to flaking.

@pacoxu
Member Author

pacoxu commented Mar 22, 2023

This still flakes, unfortunately. One of the ways I have been able to reproduce the flake is to run the test through a debugger, which causes the code to pause arbitrarily. So the test is very sensitive to pauses (I would assume even GC pressure can cause a pause).

Does it still flake after removing the intermediate-state checks?

After removing the intermediate-state checks, I have been running a stress test that has been consistently successful for several tens of minutes.

17m50s: 1513 runs so far, 0 failures

@pacoxu
Member Author

pacoxu commented Mar 22, 2023

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 22, 2023
This was referenced Mar 28, 2023

    // Add a second pod which tries to attach the volume to the same node.
    // After adding pod to the same node, detach will not be triggered any more.
    generatedVolumeName, podAddErr = dsw.AddPod(types.UniquePodName(podName2), controllervolumetesting.NewPod(podName2, podName2), volumeSpec, nodeName1)
    if podAddErr != nil {
        t.Fatalf("AddPod failed. Expected: <no error> Actual: <%v>", podAddErr)
    }
    // Sleep 1s to verify no detach are triggered after second pod is added in the future.
    time.Sleep(1000 * time.Millisecond)
    // verify no detach are triggered after second pod is added in the future.
Member


IMO we should keep this Sleep call, because we are trying to ascertain the effect that adding a pod has on the reconciler and whether it results in any change. Without the Sleep we are only verifying the immediate state, which is not the same thing.

Member Author

@pacoxu pacoxu Mar 30, 2023


Updated.

  • I'm not sure whether relying on sleep in a unit test will still make it fragile.

Member


We are not "relying" on sleep. We are giving the reconciler loop a chance to turn, so that we are testing the code after the reconciler loop has turned (at least in theory).
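
For illustration, a minimal sketch of that point with hypothetical stand-ins (the loop and counter below are not the real reconciler or fake plugin): the Sleep exists so the negative assertion runs after the loop has had a chance to turn, which is a stronger claim than checking the immediate state.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	var detachCalls int64 // hypothetical stand-in for the fake plugin's detach counter

	stop := make(chan struct{})
	defer close(stop)

	// Hypothetical reconciler loop: wakes up periodically and may trigger a detach.
	go func() {
		ticker := time.NewTicker(100 * time.Millisecond)
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				return
			case <-ticker.C:
				// In the real reconciler a detach could be triggered here,
				// which would do atomic.AddInt64(&detachCalls, 1).
			}
		}
	}()

	// ... add the second pod to the desired state of the world here ...

	// Sleep long enough for the loop to turn at least once, then make the
	// negative assertion. Checking immediately would only describe the state
	// before the reconciler had any chance to act.
	time.Sleep(1 * time.Second)
	if n := atomic.LoadInt64(&detachCalls); n != 0 {
		fmt.Printf("unexpected detach calls: %d\n", n)
		return
	}
	fmt.Println("no detach triggered after the second pod was added")
}
```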

// verifyVolumeReportedAsAttachedToNode will check volume is in the list of volume attached that needs to be updated
// in node status. By calling this function (GetVolumesToReportAttached), node status should be updated, and the volume
// will not need to be updated until new changes are applied (detach is triggered again)
time.Sleep(100 * time.Millisecond)
Member


Same here - we should keep at least one Sleep call here.

Member

@gnufied gnufied left a comment


Some minor nits. Why are we commenting out the code rather than removing it? Do we expect to fix it in a follow-up?

@pacoxu
Member Author

pacoxu commented Mar 30, 2023

Some minor nits. Why are we commenting out the code rather than removing it? Do we expect to fix it in a follow-up?

Let's remove it, since there is no further action planned here.

  • At first, I kept it because it could help developers understand the testing process and comments.

@gnufied
Member

gnufied commented Mar 30, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 30, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 59eb25b8c9e43208078edccfbe4906968523079c

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, pacoxu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 30, 2023
@k8s-ci-robot k8s-ci-robot merged commit 8cdc7fa into kubernetes:master Apr 12, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone Apr 12, 2023
k8s-ci-robot added a commit that referenced this pull request on May 5, 2023: Automated cherry pick of #116675 upstream release 1.27 (…75-upstream-release-1.27)