Add retry for explain commands due to resource not found flakiness #26385
Conversation
Sometimes the explain command returns a "could not find resource" error and the "[sig-cli] oc explain *" tests fail because of it. This PR adds a retry so that other tests or processes in CI that trigger this error do not impact these tests.
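Below is a minimal sketch of the kind of retry the PR describes, for illustration only: it shells out to a plain `oc` binary and polls with `wait.PollImmediate`. The exact error string matched and the helper shape are assumptions, not the actual diff in this PR.

```go
// Sketch only: retry `oc explain` while it returns the transient
// "could not find resource" error seen in CI.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// explainWithRetry re-runs `oc explain <resource>` until it stops returning
// the transient "not find resource" error or the timeout expires.
func explainWithRetry(resource string) (string, error) {
	var out string
	err := wait.PollImmediate(2*time.Second, 1*time.Minute, func() (bool, error) {
		raw, err := exec.Command("oc", "explain", resource).CombinedOutput()
		out = string(raw)
		if err != nil {
			// Assumed error text; the real tests may match a different string.
			if strings.Contains(out, "find resource") {
				return false, nil // transient: retry
			}
			return false, err // unexpected failure: give up
		}
		return true, nil
	})
	return out, err
}

func main() {
	out, err := explainWithRetry("pods")
	if err != nil {
		fmt.Println("explain failed:", err)
		return
	}
	fmt.Println(out)
}
```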
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: ardaguclu. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment
/test e2e-metal-ipi
/retest
Can you explain this in more detail? The version of what is being updated? How does that affect oc explain? And I see these failures in jobs that aren't upgrading clusters, so I wouldn't expect any version changes, e.g.: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi/1424684437928415232 I don't see any reason this resource would ever "not be found" after installation is complete:
@bparees thanks for reviewing. Let me explain the actual problem:
After investigating why haproxy goes down and comes back up, it is highly likely that some other tests are triggering it (still not proven yet). As examples, there are failed jobs to which this flow applies; and to check the opposite, I picked a green job and there were no haproxy restarts while these tests were running.
/retest
1 similar comment
/retest
/test e2e-gcp-upgrade
@ardaguclu: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Is this check too aggressive?
I'm still puzzled. If the other 2 apiservers were still up/accessible and even the third apiserver didn't actually restart (only its haproxy on the node did), then I don't see why any data would be missing. And if the apiserver endpoint did go down and come back up, and now oc explain has hit that restarted apiserver, the apiserver should still have all the necessary information. If it does not have all the necessary information, it should be using a readiness probe to ensure it doesn't start serving content until it is fully populated. @soltysh @mfojtik can you weigh in here? Something seems off about treating this oc explain error as "expected" just because something on one of our 3 master nodes had a hiccup.
It doesn't seem too aggressive. When I checked the logs, it happens on average 2 and at most 3 times per job (after grouping by time ranges).
After digging into it further, taking https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi/1424597077127598080 as an example:
and within these time ranges, the kube-apiserver-master-0 logs:
AFAICS, the aggregator service starts here: https://github.com/kubernetes/kubernetes/blob/34d1d2a4b98e496145d49aee322a94d8387bfdf7/cmd/kube-apiserver/app/server.go#L173. Maybe I'm missing something, but can we rely on the readiness probe in that case? The aggregator worker runs in a different goroutine and we don't know when it completes aggregation of the specs.
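To make the concern concrete, here is a deliberately simplified, hypothetical sketch (not the actual kube-apiserver code) of the pattern being described: the aggregation worker is started in its own goroutine, the caller returns immediately, and nothing records when the first full pass has finished, so a readiness probe has nothing to observe.

```go
package main

import (
	"fmt"
	"time"
)

// startAggregator launches the (stubbed) spec-aggregation worker and returns
// immediately. Nothing signals when the first aggregation pass has completed.
func startAggregator(stopCh <-chan struct{}) {
	go func() {
		for {
			select {
			case <-stopCh:
				return
			case <-time.After(time.Second):
				// hypothetical work: download and merge per-APIService OpenAPI specs
			}
		}
	}()
}

func main() {
	stop := make(chan struct{})
	startAggregator(stop)
	// The server could report "ready" here even though the aggregated
	// OpenAPI document (what `oc explain` reads) may still be incomplete.
	fmt.Println("ready reported, aggregation may still be in progress")
	close(stop)
}
```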
I'd expect it to happen zero times, unless the apiserver is actually down/broken and needs restarting. If we're just restarting it because it's under heavy load or something, restarting it won't help and just hurts. Is there an important benefit we get from keeping the check so tight, vs loosening it a small amount? (Presumably a longer time to detect real issues, so there is a tradeoff, but it sounds like we might be too far on one side of that tradeoff currently.)
Presumably we could make it signal when it is done and feed that into a readiness probe. Right now it sounds like the apiserver is reporting readiness prematurely, and I'd consider that a bug.
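Continuing the same hypothetical simplification, a sketch of the fix being suggested: the worker closes a channel after its first successful sync, and a readiness handler stays unready until then. The handler name, path, and wiring are illustrative, not the real kube-apiserver healthz plumbing.

```go
package main

import (
	"net/http"
	"time"
)

// firstSyncDone is closed once the initial aggregation pass has completed.
var firstSyncDone = make(chan struct{})

// startAggregator runs the (stubbed) worker and signals first-sync completion.
func startAggregator(stopCh <-chan struct{}) {
	go func() {
		time.Sleep(2 * time.Second) // hypothetical: initial download/merge of specs
		close(firstSyncDone)        // signal: aggregated spec is now populated
		for {
			select {
			case <-stopCh:
				return
			case <-time.After(time.Minute):
				// periodic re-sync
			}
		}
	}()
}

// readyz backs the readiness probe: unready until the first sync is done.
func readyz(w http.ResponseWriter, _ *http.Request) {
	select {
	case <-firstSyncDone:
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok"))
	default:
		w.WriteHeader(http.StatusServiceUnavailable)
		w.Write([]byte("openapi aggregation not yet complete"))
	}
}

func main() {
	stop := make(chan struct{})
	startAggregator(stop)
	http.HandleFunc("/readyz/openapi-aggregation", readyz)
	http.ListenAndServe(":8080", nil)
}
```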
I filed an issue upstream, kubernetes/kubernetes#104324, and opened a PR about this bug. Thank you @bparees for helping investigate the real problem. I'm closing this PR.
Excellent, thanks for digging into it!
Sometimes the explain command returns a "could not find resource" error and the
[sig-cli] oc explain *
tests fail because of it. This PR adds a retry so that other tests or processes in CI that trigger this error do not impact these tests.
According to https://search.ci.openshift.org/chart?search=oc+explain+should+contain+proper+spec&maxAge=24h&type=junit, 1% of failures belong to these tests (and this percentage does not include failures marked as flaky). The actual problem is that some operator (most probably the cluster-version-operator) updates the version, and within this time range the explain command returns a "resource not found" error.