
Add retry for explain commands due to resource not found flakiness #26385

Closed

Conversation

ardaguclu
Member

Sometimes the explain command returns a "could not find resource" error and the
[sig-cli] oc explain * tests fail because of it. This PR adds a retry so that
other tests or processes in CI that cause this error do not impact these tests.

According to https://search.ci.openshift.org/chart?search=oc+explain+should+contain+proper+spec&maxAge=24h&type=junit, 1% of failures belong to these tests (and this percentage does not include failures marked as flaky). The actual problem is that some operator (most probably cluster-version-operator) updates its version, and within this time range the explain command returns a "resource not found" error.
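For illustration, a minimal sketch of the kind of retry described above, assuming a hypothetical helper shape (the closure argument, interval, and timeout are illustrative and not the actual origin test-framework API):

```go
// Sketch of a retry wrapper for a flaky `oc explain` invocation in an e2e test.
package e2e

import (
	"strings"
	"time"
)

// explainWithRetry retries only the transient "couldn't find resource" error,
// which shows up while an apiserver that just came back behind haproxy is
// still re-aggregating its OpenAPI specs. Any other error fails immediately.
func explainWithRetry(run func() (string, error), timeout, interval time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for {
		out, err := run()
		if err == nil {
			return out, nil
		}
		if !strings.Contains(err.Error(), "couldn't find resource") || time.Now().After(deadline) {
			return out, err // permanent failure, or retry budget exhausted
		}
		time.Sleep(interval)
	}
}
```

A caller would pass a closure that shells out to `oc explain <resource> --api-version=<group/version>` and returns the combined output, for example `explainWithRetry(runOcExplain, 30*time.Second, 2*time.Second)`, where `runOcExplain` stands in for whatever helper the test already uses to invoke `oc`.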

@openshift-ci
Contributor

openshift-ci bot commented Aug 9, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ardaguclu
To complete the pull request process, please assign mfojtik after the PR has been reviewed.
You can assign the PR to them by writing /assign @mfojtik in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ardaguclu
Member Author

/test e2e-metal-ipi

@ardaguclu
Member Author

/retest

@bparees
Contributor

bparees commented Aug 9, 2021

> The actual problem is that some operator (most probably cluster-version-operator) updates its version, and within this time range the explain command returns a "resource not found" error.

Can you explain this in more detail? The version of what is being updated? How does that affect oc explain? And I see these failures in jobs that aren't upgrading clusters, so I wouldn't expect any version changes, e.g.: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi/1424684437928415232

I don't see any reason this resource would ever "not be found" after installation is complete:

Aug  9 11:58:57.474: INFO: Error running /usr/local/bin/oc --namespace=e2e-test-oc-explain-bq4j2 --kubeconfig=/tmp/configfile429536836 explain clusterversions --api-version=config.openshift.io/v1:
StdOut>
error: couldn't find resource for "config.openshift.io/v1, Kind=ClusterVersion"

@ardaguclu
Member Author

ardaguclu commented Aug 10, 2021

@bparees thanks for reviewing. Let me explain the actual problem:

  • Haproxy's health checks sometimes fail with Health check for server masters/master-1 failed, reason: Layer6 timeout, check duration: 1001ms, status: 0/3 DOWN. After a health check failure, it takes a couple of seconds for the server to heal and serve again.
  • The [sig-cli] oc explain * tests work as expected even while haproxy is failing, because the apiservers of the other masters are still serving.
  • When haproxy starts serving again (Server masters/master-1 is UP. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.), oc explain tries to access it (for whichever resource the explain command is being executed at that time).
  • The oc explain command fails exactly here: https://github.com/openshift/kubernetes-kubectl/blob/0c5d119ce86414ebbd725f56967987c9cf72e496/pkg/cmd/explain/explain.go#L154 (see the sketch after this list).
  • Within this very small time range, since haproxy has only just come back up, the apiserver has not yet had time to populate the OpenAPI specs, and oc returns a not-found error.
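As a sketch of that failure point (a paraphrase assuming the `k8s.io/kubectl` openapi helper interfaces, not a verbatim quote of the pinned file): the client parses the aggregated /openapi/v2 document it downloaded and looks the requested group/version/kind up locally, so if the apiserver it reached has not finished re-aggregating, the lookup returns nil and the command errors out.

```go
package explainsketch

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/kube-openapi/pkg/util/proto"
	"k8s.io/kubectl/pkg/util/openapi"
)

// lookupOrFail mirrors the shape of the check around explain.go#L154:
// resources is the client-side parse of the aggregated /openapi/v2 document,
// and a GVK that is missing from that document produces the
// "couldn't find resource" error seen in the failed CI jobs.
func lookupOrFail(resources openapi.Resources, gvk schema.GroupVersionKind) (proto.Schema, error) {
	s := resources.LookupResource(gvk)
	if s == nil {
		return nil, fmt.Errorf("couldn't find resource for %q", gvk)
	}
	return s, nil
}
```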

After investigating why haproxy goes down and comes up again, it is highly likely that some other tests are triggering it (still not proven).

As an example, failed jobs to which this flow applies:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi/1424684437928415232
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi/1424597077127598080
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi/1424100859263848448

and to check the opposite, I picked a green job, and there are no haproxy restarts while these tests are running:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi/1424625920471207936

@ardaguclu
Member Author

/retest

1 similar comment
@ardaguclu
Member Author

/retest

@ardaguclu
Member Author

/test e2e-gcp-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Aug 10, 2021

@ardaguclu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-gcp-upgrade | c84687e | link | /test e2e-gcp-upgrade |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@bparees
Contributor

bparees commented Aug 10, 2021

> Haproxy's health checks sometimes fail with Health check for server masters/master-1 failed, reason: Layer6 timeout, check duration: 1001ms, status: 0/3 DOWN. After a health check failure, it takes a couple of seconds for the server to heal and serve again.

Is this check too aggressive?

> When haproxy starts serving again (Server masters/master-1 is UP. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.), oc explain tries to access it (for whichever resource the explain command is being executed at that time).
> The oc explain command fails exactly here: https://github.com/openshift/kubernetes-kubectl/blob/0c5d119ce86414ebbd725f56967987c9cf72e496/pkg/cmd/explain/explain.go#L154
> Within this very small time range, since haproxy has only just come back up, the apiserver has not yet had time to populate the OpenAPI specs, and oc returns a not-found error.

I'm still puzzled. If the other two apiservers were still up/accessible, and even the third apiserver didn't actually restart (only the haproxy on its node did), then I don't see why any data would be missing.

And if the apiserver endpoint did go down and come back up, and oc explain has now hit that restarted apiserver, the apiserver should still have all the necessary information. If it does not have all the necessary information, it should be using a readiness probe to ensure it doesn't start serving content until it is fully populated.

@soltysh @mfojtik can you weigh in here? Something seems off about treating this oc explain error as "expected" just because something on one of our 3 master nodes had a hiccup.

@ardaguclu
Member Author

ardaguclu commented Aug 11, 2021

> > Haproxy's health checks sometimes fail with Health check for server masters/master-1 failed, reason: Layer6 timeout, check duration: 1001ms, status: 0/3 DOWN. After a health check failure, it takes a couple of seconds for the server to heal and serve again.
>
> Is this check too aggressive?

It does not seem too aggressive. When I checked the logs, it happens on average 2 and at most 3 times per job (after grouping by time ranges).

> > When haproxy starts serving again (Server masters/master-1 is UP. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.), oc explain tries to access it (for whichever resource the explain command is being executed at that time).
> > The oc explain command fails exactly here: https://github.com/openshift/kubernetes-kubectl/blob/0c5d119ce86414ebbd725f56967987c9cf72e496/pkg/cmd/explain/explain.go#L154
> > Within this very small time range, since haproxy has only just come back up, the apiserver has not yet had time to populate the OpenAPI specs, and oc returns a not-found error.
>
> I'm still puzzled. If the other two apiservers were still up/accessible, and even the third apiserver didn't actually restart (only the haproxy on its node did), then I don't see why any data would be missing.
>
> And if the apiserver endpoint did go down and come back up, and oc explain has now hit that restarted apiserver, the apiserver should still have all the necessary information. If it does not have all the necessary information, it should be using a readiness probe to ensure it doesn't start serving content until it is fully populated.
>
> @soltysh @mfojtik can you weigh in here? Something seems off about treating this oc explain error as "expected" just because something on one of our 3 master nodes had a hiccup.

After digging into it further, taking https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi/1424597077127598080 as an example:

  • 06:13:18: TerminationGracefulTerminationFinished is fired for kube-apiserver-master-0.
  • 06:13:26: kube-apiserver-check-endpoints stops logging after Starting #1 worker of check-endpoints controller ...
  • 06:13:35: the oc explain should contain proper spec+status for CRDs test starts.
  • 06:13:37: KubeAPIReadyz is fired for kube-apiserver-master-0.
  • 06:13:39: haproxy-master-0 logs Server masters/master-0 is UP. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
  • 06:13:40.044: oc explain is run for operators --api-version=operators.coreos.com/v1.
  • 06:13:40.344: the test fails with couldn't find resource for "operators.coreos.com/v1, Kind=Operator".

Within these time ranges, kube-apiserver-master-0 logs:
2021-08-09T06:13:37.313196974Z I0809 06:13:37.313090 18 controller.go:611] quota admission added evaluator for: endpoints
2021-08-09T06:13:37.320467463Z I0809 06:13:37.320373 18 controller.go:611] quota admission added evaluator for: endpointslices.discovery.k8s.io
2021-08-09T06:13:37.795945394Z I0809 06:13:37.795818 18 aggregator.go:234] Finished OpenAPI spec generation after 954.720961ms
2021-08-09T06:13:37.796029189Z I0809 06:13:37.795982 18 aggregator.go:231] Updating OpenAPI spec because v1.image.openshift.io is updated
2021-08-09T06:13:38.609098725Z I0809 06:13:38.608882 18 aggregator.go:234] Finished OpenAPI spec generation after 812.841114ms
2021-08-09T06:13:38.609098725Z I0809 06:13:38.608981 18 aggregator.go:231] Updating OpenAPI spec because v1.quota.openshift.io is updated
2021-08-09T06:13:39.409034087Z I0809 06:13:39.408876 18 aggregator.go:234] Finished OpenAPI spec generation after 799.869444ms
2021-08-09T06:13:39.409109943Z I0809 06:13:39.409058 18 aggregator.go:231] Updating OpenAPI spec because v1.packages.operators.coreos.com is updated
-- oc explain starts here
2021-08-09T06:13:40.120721163Z I0809 06:13:40.120588 18 aggregator.go:234] Finished OpenAPI spec generation after 711.498895ms
2021-08-09T06:13:40.120774094Z I0809 06:13:40.120736 18 aggregator.go:231] Updating OpenAPI spec because v1.security.openshift.io is updated
-- oc explain fails here
2021-08-09T06:13:40.903857665Z I0809 06:13:40.903757 18 aggregator.go:234] Finished OpenAPI spec generation after 782.9872ms
2021-08-09T06:13:40.903964685Z I0809 06:13:40.903926 18 aggregator.go:231] Updating OpenAPI spec because v1.user.openshift.io is updated
2021-08-09T06:13:41.550374559Z I0809 06:13:41.550222 18 aggregator.go:234] Finished OpenAPI spec generation after 646.243214ms
2021-08-09T06:13:41.550415606Z I0809 06:13:41.550367 18 aggregator.go:231] Updating OpenAPI spec because v1.template.openshift.io is updated
2021-08-09T06:13:42.228136085Z I0809 06:13:42.228021 18 aggregator.go:234] Finished OpenAPI spec generation after 677.622305ms
2021-08-09T06:13:42.228136085Z I0809 06:13:42.228096 18 aggregator.go:231] Updating OpenAPI spec because v1beta1.metrics.k8s.io is updated
2021-08-09T06:13:42.849925095Z I0809 06:13:42.849133 18 aggregator.go:234] Finished OpenAPI spec generation after 621.017972ms
2021-08-09T06:13:42.849925095Z I0809 06:13:42.849232 18 aggregator.go:231] Updating OpenAPI spec because v1.build.openshift.io is updated
2021-08-09T06:13:43.537653077Z I0809 06:13:43.537551 18 aggregator.go:234] Finished OpenAPI spec generation after 688.293549ms
2021-08-09T06:13:43.537682716Z I0809 06:13:43.537647 18 aggregator.go:231] Updating OpenAPI spec because v1.apps.openshift.io is updated
2021-08-09T06:13:44.114553542Z I0809 06:13:44.114447 18 aggregator.go:234] Finished OpenAPI spec generation after 576.776439ms
2021-08-09T06:13:44.114603682Z I0809 06:13:44.114534 18 aggregator.go:231] Updating OpenAPI spec because v1.authorization.openshift.io is updated
2021-08-09T06:13:44.752935265Z I0809 06:13:44.752702 18 aggregator.go:234] Finished OpenAPI spec generation after 638.136251ms
2021-08-09T06:13:44.752935265Z I0809 06:13:44.752887 18 aggregator.go:231] Updating OpenAPI spec because v1.route.openshift.io is updated
2021-08-09T06:13:45.401939388Z I0809 06:13:45.401595 18 aggregator.go:234] Finished OpenAPI spec generation after 648.679504ms
2021-08-09T06:13:45.401939388Z I0809 06:13:45.401676 18 aggregator.go:231] Updating OpenAPI spec because k8s_internal_local_delegation_chain_0000000002 is updated

@ardaguclu
Member Author

ardaguclu commented Aug 11, 2021

AFAICS, the aggregator service is started here: https://github.com/kubernetes/kubernetes/blob/34d1d2a4b98e496145d49aee322a94d8387bfdf7/cmd/kube-apiserver/app/server.go#L173
and it runs a new worker in a goroutine here: https://github.com/kubernetes/kubernetes/blob/07fe9f0192283ffa1ce81f490b02f5240cf4099d/staging/src/k8s.io/kube-aggregator/pkg/controllers/openapi/controller.go#L86.

Maybe I'm missing something, but can we rely on the readiness probe in that case? Because the aggregator worker runs in a different goroutine, we don't know when it completes the aggregation of the specs.
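To illustrate the pattern in the linked controller (a simplified stand-in rather than the real kube-aggregator code): Run launches the worker with `go wait.Until(...)` and returns, so nothing upstream can tell when the first full aggregation pass has finished.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Stand-in for the aggregation controller's Run: the worker is launched in its
// own goroutine, so the caller gets control back immediately and has no signal
// for when the aggregated OpenAPI document is complete.
func runAggregation(stopCh <-chan struct{}) {
	runWorker := func() {
		// The real worker drains a queue of "service X changed, re-aggregate
		// its OpenAPI spec" items; per the apiserver logs above, each item
		// takes on the order of 600-950ms.
		fmt.Println("aggregating one spec...")
		time.Sleep(700 * time.Millisecond)
	}

	// Fire-and-forget: Run does not block on the first full pass.
	go wait.Until(runWorker, time.Second, stopCh)
}

func main() {
	stopCh := make(chan struct{})
	runAggregation(stopCh)
	time.Sleep(3 * time.Second) // let a few worker iterations run
	close(stopCh)
}
```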

@bparees
Contributor

bparees commented Aug 11, 2021

> It does not seem too aggressive. When I checked the logs, it happens on average 2 and at most 3 times per job (after grouping by time ranges).

I'd expect it to happen zero times, unless the apiserver is actually down/broken and needs restarting. If we're just restarting it because it's under heavy load or something, restarting it won't help and just hurts. Is there an important benefit we get from keeping the check so tight, vs. loosening it a small amount? (Presumably longer time to detect real issues, so there is a tradeoff, but it sounds like we might be too far on one side of that tradeoff currently.)

> Maybe I'm missing something, but can we rely on the readiness probe in that case? Because the aggregator worker runs in a different goroutine, we don't know when it completes the aggregation of the specs.

Presumably we could make it signal when it is done and feed that into a readiness probe. Right now it sounds like the apiserver is reporting readiness prematurely, and I'd consider that a bug.
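To make that suggestion concrete, a toy sketch of the idea using plain net/http (not the actual kube-apiserver readyz plumbing): the aggregation worker flips a flag after its first complete pass, and a readiness check refuses to report ready until the flag is set, so the endpoint would not rejoin the load balancer while /openapi/v2 is still incomplete.

```go
package main

import (
	"errors"
	"net/http"
	"sync/atomic"
	"time"
)

// firstAggregationDone would be set by the aggregation worker at the end of
// its first complete pass over all registered API services.
var firstAggregationDone atomic.Bool

func markAggregationDone() { firstAggregationDone.Store(true) }

func checkOpenAPIReady() error {
	if !firstAggregationDone.Load() {
		return errors.New("initial OpenAPI aggregation not finished")
	}
	return nil
}

// openAPIReadyCheck returns 503 until the first aggregation pass has finished,
// and 200 afterwards, mimicking an extra readiness check.
func openAPIReadyCheck(w http.ResponseWriter, r *http.Request) {
	if err := checkOpenAPIReady(); err != nil {
		http.Error(w, err.Error(), http.StatusServiceUnavailable)
		return
	}
	w.Write([]byte("ok"))
}

func main() {
	// Simulate the worker finishing its first pass a little after startup.
	go func() {
		time.Sleep(2 * time.Second)
		markAggregationDone()
	}()
	http.HandleFunc("/readyz/openapi", openAPIReadyCheck)
	_ = http.ListenAndServe(":8080", nil)
}
```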

@ardaguclu
Member Author

I filed an upstream issue, kubernetes/kubernetes#104324, and opened a PR for this bug.

Thank you @bparees for helping to investigate the real problem. I'm closing this PR.

@ardaguclu ardaguclu closed this Aug 16, 2021
@bparees
Contributor

bparees commented Aug 16, 2021

> I filed an upstream issue, kubernetes/kubernetes#104324, and opened a PR for this bug.

Excellent, thanks for digging into it!

@ardaguclu ardaguclu deleted the sig-cli-explain-flakiness branch January 25, 2022 08:27