
Conversation

@zhouya0
Contributor

@zhouya0 zhouya0 commented Sep 22, 2020

What type of PR is this?
/kind flake

What this PR does / why we need it:
As the error log shows:

Sep 20 14:59:38.242: INFO: Noticed Pod "taint-eviction-b1" gets evicted.
Sep 20 15:00:26.106: FAIL: Failed to evict all Pods. 1 pod(s) is not evicted.

I think the time we wait for pod eviction is simply not long enough, since:

// 1. Run two pods both with toleration; one with tolerationSeconds=5, the other with 25
pod1 := createPodForTaintsTest(true, additionalWaitPerDeleteSeconds, podGroup+"1", podGroup, ns)
pod2 := createPodForTaintsTest(true, 5*additionalWaitPerDeleteSeconds, podGroup+"2", podGroup, ns)

We had better wait for 5*additionalWaitPerDeleteSeconds, right?
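
To illustrate the idea, here is a minimal sketch of the wait pattern (waitForEvictions and observedDeletions are made-up names for illustration, not the actual test code):

package main

import (
	"fmt"
	"time"
)

// Sketch of the wait pattern only. The point is that the timeout has to cover
// the longest toleration, i.e. 5*additionalWaitPerDeleteSeconds, otherwise the
// test gives up before the second pod can be evicted.
func waitForEvictions(observedDeletions <-chan string, expected int, timeout time.Duration) error {
	deadline := time.NewTimer(timeout)
	defer deadline.Stop()
	for evicted := 0; evicted < expected; evicted++ {
		select {
		case <-deadline.C:
			return fmt.Errorf("failed to evict all pods: %d pod(s) not evicted", expected-evicted)
		case name := <-observedDeletions:
			fmt.Printf("noticed pod %q got evicted\n", name)
		}
	}
	return nil
}

func main() {
	ch := make(chan string, 2)
	go func() {
		time.Sleep(5 * time.Second) // pod1 tolerates the taint for 5s
		ch <- "taint-eviction-b1"
		time.Sleep(20 * time.Second) // pod2 tolerates it for 25s
		ch <- "taint-eviction-b2"
	}()
	// Waiting only ~5s would miss pod2; 5*additionalWaitPerDeleteSeconds (25s)
	// plus some slack for the kubelet covers both evictions.
	if err := waitForEvictions(ch, 2, 35*time.Second); err != nil {
		fmt.Println(err)
	}
}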

Which issue(s) this PR fixes:

Fixes #94931

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

None

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 22, 2020
@oomichi
Member

oomichi commented Sep 22, 2020

The 5*additionalWaitPerDeleteSeconds in

pod2 := createPodForTaintsTest(true, 5*additionalWaitPerDeleteSeconds, podGroup+"2", podGroup, ns)

is passed through to

Tolerations: []v1.Toleration{{Key: "kubernetes.io/e2e-evict-taint-key", Value: "evictTaintVal", Effect: v1.TaintEffectNoExecute, TolerationSeconds: &ts}},

and tolerationSeconds is the field that dictates how long the pod stays bound to the node after the taint is added, according to https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/.
So this change seems straightforward.
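
For concreteness, a minimal sketch of what ends up in the pod spec (assuming additionalWaitPerDeleteSeconds = 5, per the test comment above; tolerationFor is just an illustrative helper, not test code):

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// Sketch only: pod1 gets TolerationSeconds = 5 and pod2 gets 25, so after the
// NoExecute taint is applied at t0, the taint manager evicts pod1 around t0+5s
// and pod2 around t0+25s. A wait sized for a single additionalWaitPerDeleteSeconds
// misses pod2.
func tolerationFor(seconds int64) v1.Toleration {
	return v1.Toleration{
		Key:               "kubernetes.io/e2e-evict-taint-key",
		Value:             "evictTaintVal",
		Effect:            v1.TaintEffectNoExecute,
		TolerationSeconds: &seconds,
	}
}

func main() {
	fmt.Printf("pod1: %+v\n", tolerationFor(5))
	fmt.Printf("pod2: %+v\n", tolerationFor(25))
}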

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 22, 2020
@zhouya0
Contributor Author

zhouya0 commented Sep 23, 2020

/cc @ScrapCodes @hasheddan

@k8s-ci-robot
Contributor

@zhouya0: GitHub didn't allow me to request PR reviews from the following users: ScrapCodes.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @ScrapCodes @hasheddan

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@zhouya0
Contributor Author

zhouya0 commented Sep 24, 2020

/cc @liggitt

@zhouya0
Contributor Author

zhouya0 commented Sep 28, 2020

@oomichi Can you approve this, or let me know who could review this PR?

@michaelgugino
Contributor

I'm not sure this is the right fix. Looking at the logs:

taint-eviction-b2  bootstrap-e2e-minion-group-bn5b  Pending  30s

It seems the second pod never successfully got scheduled to the node. We need to wait for the pods to become ready before applying the taint.
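
For illustration, something along these lines could gate the taint on pod readiness (a sketch only; waitForPodReady is a hypothetical helper, not the e2e framework's API):

package taintstest

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodReady is a hypothetical helper: it polls until the pod's Ready
// condition is True, which is stricter than just checking Phase == Running,
// before the test goes on to taint the node.
func waitForPodReady(ctx context.Context, cs kubernetes.Interface, ns, name string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pod, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, cond := range pod.Status.Conditions {
			if cond.Type == v1.PodReady {
				return cond.Status == v1.ConditionTrue, nil
			}
		}
		return false, nil
	})
}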

@ScrapCodes
Contributor

Isn't it enough to check that the pod is in the Running state before applying the taint? The test already does this:

// Wait for pods to be running state before eviction happens
framework.ExpectNoError(e2epod.WaitForPodRunningInNamespace(cs, pod1))
framework.ExpectNoError(e2epod.WaitForPodRunningInNamespace(cs, pod2))
framework.Logf("Pod2 is running on %v. Tainting Node", nodeName)

// 2. Taint the nodes running those pods with a no-execute taint
ginkgo.By("Trying to apply a taint on the Node")
testTaint := getTestTaint()
e2enode.AddOrUpdateTaintOnNode(cs, nodeName, testTaint)
framework.ExpectNodeHasTaint(cs, nodeName, &testTaint)
defer e2enode.RemoveTaintOffNode(cs, nodeName, testTaint)

What is more surprising is how a pod can be in the Pending phase after it has passed the check that it is Running. I could be missing something here, but I'm happy to be corrected.

@michaelgugino
Contributor

We can see both pods go Running around the same time in the kubelet log (I've included just the b2 log; b1 is right below this line):

I0920 14:59:10.205944    1233 status_manager.go:572] Status for pod "taint-eviction-b2_taint-multiple-pods-7572(919d65d9-5ea5-49c4-887a-242f9d66c575)" updated successfully: (2, {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-20 14:59:08 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-20 14:59:10 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-20 14:59:10 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-20 14:59:08 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.138.0.3 PodIP:10.64.2.172 PodIPs:[{IP:10.64.2.172}] StartTime:2020-09-20 14:59:08 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:pause State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2020-09-20 14:59:09 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:k8s.gcr.io/pause:3.2 ImageID:k8s.gcr.io/pause@sha256:927d98197ec1141a368550822d18fa1c60bdae27b78b0c004f705f548c07814f ContainerID:containerd://c497305368e0e359d8f1bfec1543f899dfe61e994306f35488df4f63f3a4990a Started:0xc001ba9aac}] QOSClass:BestEffort EphemeralContainerStatuses:[]})

So the wait-for-Running check is working as intended.

Later, after both pods receive the DELETE, the kubelet patches them back to Pending.

The gap between the DELETE calls for pods b1 and b2 is about 20 seconds on the kubelet, which matches the 25s − 5s difference in their tolerationSeconds:

I0920 14:59:16.049133 1233 kubelet.go:1848] SyncLoop (DELETE, "api"): "taint-eviction-b1_taint-multiple-pods-7572(9b05492a-5393-42f0-9323-7d264a63b8c1)"

I0920 14:59:36.050043 1233 kubelet.go:1848] SyncLoop (DELETE, "api"): "taint-eviction-b2_taint-multiple-pods-7572(919d65d9-5ea5-49c4-887a-242f9d66c575)"

There are some errors around missing paths and other innocuous things in the kubelet log. It seems that the pods go Pending after they are marked for deletion, but the deletion event is delayed quite a bit until all containers are deleted. I'm concerned that if we adjust the timing, we might be masking something else.

I think the Pending transition is another issue and is possibly documented elsewhere. We probably need sig-node to look at the kubelet logs.

@RobertKielty
Member

RobertKielty commented Nov 17, 2020

@hasheddan @zhouya0 can we get together and seek out someone from @kubernetes/sig-node-bugs to help us out?

We need a root-cause analysis to understand why the pods remain in the Pending state for as long as they do; see @michaelgugino's commentary in the comment above.

Posted an assist request here https://kubernetes.slack.com/archives/C0BP8PW9G/p1605598026147600

Many thanks

@ScrapCodes
Contributor


@michaelgugino Sorry for the very late response, but can you help us chase down that other issue (the Pending transition) you pointed out above?

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pacoxu, zhouya0
To complete the pull request process, please assign oomichi after the PR has been reviewed.
You can assign the PR to them by writing /assign @oomichi in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@SergeyKanzhelev
Member

Should we merge #94958 instead? It has a comment and a bigger timeout.

@dchen1107
Copy link
Member

@zhouya0 Can you reply to what @SergeyKanzhelev requested above?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 12, 2021
@k8s-ci-robot
Contributor

@zhouya0: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@zhouya0
Contributor Author

zhouya0 commented Jan 13, 2021

@zhouya0 Can you reply to what @SergeyKanzhelev requested above?

Seems someone has fixed this.
#96866
/close

@k8s-ci-robot
Contributor

@zhouya0: Closed this PR.

In response to this:

@zhouya0 Can you reply to what @SergeyKanzhelev requested above?

Seems someone has fixed this.
#96866
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


Labels

area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

[Flaky Test][k8s.io] [sig-node] NoExecuteTaintManager Multiple Pods [Serial] evicts pods with minTolerationSeconds [Disruptive] [Conformance]

9 participants