
[Flaky test] [k8s.io] Pods should support pod readiness gates [NodeFeature:PodReadinessGate] in ci-kubernetes-e2e-gce-cos-k8sbeta-default #75355

Open
alejandrox1 opened this Issue Mar 14, 2019 · 8 comments

alejandrox1 commented Mar 14, 2019

Which jobs are failing:
ci-kubernetes-e2e-gce-cos-k8sbeta-default

Which test(s) are failing:
[k8s.io] Pods should support pod readiness gates [NodeFeature:PodReadinessGate] build 9858

Since when has it been failing:
It has been flaking more frequently since 3/13.

Testgrid link:
https://testgrid.k8s.io/sig-release-1.14-blocking#gce-cos-1.14-default

Reason for failure:
The failure appears to be a timeout:

I0314 01:53:57.958] Mar 14 01:53:55.768: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
I0314 01:53:57.959] Mar 14 01:53:57.870: INFO: namespace pods-4212 deletion completed in 46.413084965s
I0314 01:53:57.959]
I0314 01:53:57.959]
I0314 01:53:57.959] • Failure [91.269 seconds]
I0314 01:53:57.959] [k8s.io] Pods
I0314 01:53:57.959] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:691
I0314 01:53:57.959] should support pod readiness gates [NodeFeature:PodReadinessGate] [It]
I0314 01:53:57.959] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/pods.go:746
I0314 01:53:57.959]
I0314 01:53:57.959] Unexpected error:
I0314 01:53:57.959] <*errors.errorString | 0xc000324470>: {
I0314 01:53:57.959] s: "timed out waiting for the condition",
I0314 01:53:57.960] }
I0314 01:53:57.960] timed out waiting for the condition
I0314 01:53:57.960] occurred
I0314 01:53:57.960]
I0314 01:53:57.960] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/pods.go:779
I0314 01:53:57.960] ------------------------------

@kubernetes/sig-node-test-failures
cc @smourapina @alejandrox1 @kacole2 @mortent

/kind flake
/sig node
/priority important-soon
/milestone v1.14
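For context, the feature under test lets a pod declare extra condition types under `spec.readinessGates`; kubelet will not report the pod Ready until every listed condition has been patched to `True` in the pod's status (normally by an external controller, here by the test itself). A minimal manifest showing the shape, with the condition types taken from this test's logs (the image choice is illustrative, not what the e2e test actually runs):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-ready
spec:
  readinessGates:
  - conditionType: "k8s.io/test-condition1"
  - conditionType: "k8s.io/test-condition2"
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1   # illustrative container; the test uses its own image
```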

alejandrox1 commented Mar 14, 2019

Latest PR to touch that test: #67406.
May be due to a readiness probe failing.

mariantalla commented Mar 14, 2019

/remove-kind failing-test

@dashpole sending this your way for triage too - could you have a look please? I wonder if it's a node problem or if timeouts are originating elsewhere 🤔

fwiw, it's failing about 10%-15% of the time in e2e tests: triage

dashpole commented Mar 14, 2019

It looks like updating the status after the patch to the readiness gate takes longer than expected.

Looking at this failure: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-cos-k8sbeta-default/9858

From the test logs:
[It] should support pod readiness gates [NodeFeature:PodReadinessGate]
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/pods.go:746
STEP: submitting the pod to kubernetes
STEP: patching pod status with condition "k8s.io/test-condition1" to true
STEP: patching pod status with condition "k8s.io/test-condition2" to true
Mar 14 01:52:56.720: INFO: Expect the Ready condition of pod "pod-ready" to be true, but got false
Mar 14 01:52:57.731: INFO: Expect the Ready condition of pod "pod-ready" to be true, but got false
Mar 14 01:52:58.725: INFO: Expect the Ready condition of pod "pod-ready" to be true, but got false
Mar 14 01:52:59.707: INFO: Expect the Ready condition of pod "pod-ready" to be true, but got false
Mar 14 01:53:00.721: INFO: Expect the Ready condition of pod "pod-ready" to be true, but got false
Mar 14 01:53:01.708: INFO: Expect the Ready condition of pod "pod-ready" to be true, but got false
Mar 14 01:53:02.753: INFO: Expect the Ready condition of pod "pod-ready" to be true, but got false
Mar 14 01:53:03.707: INFO: Expect the Ready condition of pod "pod-ready" to be true, but got false
Mar 14 01:53:04.713: INFO: Expect the Ready condition of pod "pod-ready" to be true, but got false
Mar 14 01:53:05.709: INFO: Expect the Ready condition of pod "pod-ready" to be true, but got false
Mar 14 01:53:05.767: INFO: Expect the Ready condition of pod "pod-ready" to be true, but got false

From the kubelet logs:

I0314 01:53:12.020475    1266 status_manager.go:499] Patch status for pod "pod-ready_pods-4212(d1d6cfb4-45fb-11e9-af5c-42010a8a0002)" with "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"k8s.io/test-condition2\"},{\"type\":\"k8s.io/test-condition1\"},{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"lastTransitionTime\":\"2019-03-14T01:52:55Z\",\"message\":null,\"reason\":null,\"status\":\"True\",\"type\":\"Ready\"}]}}"

The pod's Ready condition became true at 01:53:12, but the test stopped checking at 01:53:05.

This looks expected, based on the description of the maxReadyStatusUpdateTolerance parameter in the test.

I don't think this should be release-blocking.

/assign @freehan
@freehan owns this feature and can determine how to make the test less flaky. Feel free to close if this is already tracked somewhere else.
/unassign

mariantalla commented Mar 15, 2019

> I don't think this should be release-blocking.

Thanks, good to know.

@freehan if this could be worked on, it would really help stabilize CI, which would be much appreciated, especially for release-blocking jobs.

freehan commented Mar 20, 2019

When this test was added (two or three releases back), the 10-second maxReadyStatusUpdateTolerance timeout was enough for kubelet to react and update the pod's Ready condition.

IIRC, kubelet reconciles pod status every 15s, but when a pod readiness gate condition is updated, kubelet should reconcile immediately instead of waiting out the 15s period.

There might be a bug here.

@mariantalla mariantalla moved this from Flakes to Under investigation (prioritized) in 1.14 CI Signal Mar 20, 2019

mariantalla commented Mar 20, 2019

Thanks @freehan - do you have time to take a look? 🙏 It would help to know (especially if we need to include something in the known issues).

freehan commented Mar 20, 2019

I just did a few manual tests; kubelet reacted very quickly. I was not able to reproduce an update latency of more than 10s.

freehan commented Mar 21, 2019

I have now run the test in a loop more than 50 times on my own e2e cluster, without a single failure.

Could the flake be caused by many tests running in parallel putting high load on kubelet, and hence a longer status update latency?
