
Add retry for explain commands due to resource not found flakiness #26385

Closed

Conversation

ardaguclu
Member

Sometimes the explain command returns a "could not find resource" error and the
[sig-cli] oc explain * tests fail because of it. This PR adds a retry so that
other tests or processes in CI that cause this error do not impact these tests.

According to https://search.ci.openshift.org/chart?search=oc+explain+should+contain+proper+spec&maxAge=24h&type=junit, 1% of failures belong to these tests (and this percentage does not include failures marked as flaky). The actual problem is that some operator (most probably cluster-version-operator) updates its version, and within this time range the explain command returns a "resource not found" error.
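For illustration, a minimal sketch of the kind of retry described above, assuming a hypothetical helper shape (the closure argument, interval, and timeout are illustrative and not the actual origin test-framework API):

```go
// Sketch of a retry wrapper for a flaky `oc explain` invocation in an e2e test.
package e2e

import (
	"strings"
	"time"
)

// explainWithRetry retries only the transient "couldn't find resource" error,
// which shows up while an apiserver that just came back behind haproxy is
// still re-aggregating its OpenAPI specs. Any other error fails immediately.
func explainWithRetry(run func() (string, error), timeout, interval time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for {
		out, err := run()
		if err == nil {
			return out, nil
		}
		if !strings.Contains(err.Error(), "couldn't find resource") || time.Now().After(deadline) {
			return out, err // permanent failure, or retry budget exhausted
		}
		time.Sleep(interval)
	}
}
```

A caller would pass a closure that shells out to `oc explain <resource> --api-version=<group/version>` and returns the combined output, for example `explainWithRetry(runOcExplain, 30*time.Second, 2*time.Second)`, where `runOcExplain` stands in for whatever helper the test already uses to invoke `oc`.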

@openshift-ci
Contributor

openshift-ci bot commented Aug 9, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ardaguclu
To complete the pull request process, please assign mfojtik after the PR has been reviewed.
You can assign the PR to them by writing /assign @mfojtik in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ardaguclu
Member Author

/test e2e-metal-ipi

@ardaguclu
Member Author

/retest

@bparees
Contributor

bparees commented Aug 9, 2021

> The actual problem is that some operator (most probably cluster-version-operator) updates its version, and within this time range the explain command returns a "resource not found" error.

Can you explain this in more detail? The version of what is being updated? How does that affect oc explain? And I see these failures in jobs that aren't upgrading clusters, so I wouldn't expect any version changes, e.g.: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi/1424684437928415232

I don't see any reason this resource would ever "not be found" after installation is complete:

Aug  9 11:58:57.474: INFO: Error running /usr/local/bin/oc --namespace=e2e-test-oc-explain-bq4j2 --kubeconfig=/tmp/configfile429536836 explain clusterversions --api-version=config.openshift.io/v1:
StdOut>
error: couldn't find resource for "config.openshift.io/v1, Kind=ClusterVersion"

@ardaguclu
Member Author

ardaguclu commented Aug 10, 2021

@bparees thanks for reviewing. Let me explain the actual problem:

  • Haproxy's health checks sometimes fail with Health check for server masters/master-1 failed, reason: Layer6 timeout, check duration: 1001ms, status: 0/3 DOWN. After a health check failure, it takes a couple of seconds for the server to heal and serve again.
  • The [sig-cli] oc explain * tests work as expected even while haproxy is failing, because the apiservers of the other masters are still serving.
  • When haproxy starts serving again (Server masters/master-1 is UP. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.), oc explain tries to access it (for whichever resource the explain command is being executed at that time).
  • The oc explain command fails exactly here: https://github.com/openshift/kubernetes-kubectl/blob/0c5d119ce86414ebbd725f56967987c9cf72e496/pkg/cmd/explain/explain.go#L154 (see the sketch after this list).
  • Within this very small time range, since haproxy has only just come back up, the apiserver has not yet had time to populate the OpenAPI specs, and oc returns a not-found error.
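As a sketch of that failure point (a paraphrase assuming the `k8s.io/kubectl` openapi helper interfaces, not a verbatim quote of the pinned file): the client parses the aggregated /openapi/v2 document it downloaded and looks the requested group/version/kind up locally, so if the apiserver it reached has not finished re-aggregating, the lookup returns nil and the command errors out.

```go
package explainsketch

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/kube-openapi/pkg/util/proto"
	"k8s.io/kubectl/pkg/util/openapi"
)

// lookupOrFail mirrors the shape of the check around explain.go#L154:
// resources is the client-side parse of the aggregated /openapi/v2 document,
// and a GVK that is missing from that document produces the
// "couldn't find resource" error seen in the failed CI jobs.
func lookupOrFail(resources openapi.Resources, gvk schema.GroupVersionKind) (proto.Schema, error) {
	s := resources.LookupResource(gvk)
	if s == nil {
		return nil, fmt.Errorf("couldn't find resource for %q", gvk)
	}
	return s, nil
}
```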

After investigating why haproxy goes down and comes up again, it is highly likely that some other tests are triggering it (still not proven).

As an example, failed jobs to which this flow applies:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi/1424684437928415232
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi/1424597077127598080
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi/1424100859263848448

and to check the opposite, I picked a green job, and there are no haproxy restarts while these tests are running:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi/1424625920471207936

@ardaguclu
Member Author

/retest

1 similar comment
@ardaguclu
Member Author

/retest

@ardaguclu
Member Author

/test e2e-gcp-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Aug 10, 2021

@ardaguclu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-gcp-upgrade | c84687e | link | /test e2e-gcp-upgrade |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@bparees
Contributor

bparees commented Aug 10, 2021

> Haproxy's health checks sometimes fail with Health check for server masters/master-1 failed, reason: Layer6 timeout, check duration: 1001ms, status: 0/3 DOWN. After a health check failure, it takes a couple of seconds for the server to heal and serve again.

Is this check too aggressive?

> When haproxy starts serving again (Server masters/master-1 is UP. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.), oc explain tries to access it (for whichever resource the explain command is being executed at that time).
> The oc explain command fails exactly here: https://github.com/openshift/kubernetes-kubectl/blob/0c5d119ce86414ebbd725f56967987c9cf72e496/pkg/cmd/explain/explain.go#L154
> Within this very small time range, since haproxy has only just come back up, the apiserver has not yet had time to populate the OpenAPI specs, and oc returns a not-found error.

I'm still puzzled. If the other two apiservers were still up/accessible, and even the third apiserver didn't actually restart (only the haproxy on its node did), then I don't see why any data would be missing.

And if the apiserver endpoint did go down and come back up, and oc explain has now hit that restarted apiserver, the apiserver should still have all the necessary information. If it does not have all the necessary information, it should be using a readiness probe to ensure it doesn't start serving content until it is fully populated.

@soltysh @mfojtik can you weigh in here? Something seems off about treating this oc explain error as "expected" just because something on one of our 3 master nodes had a hiccup.

@ardaguclu
Member Author

ardaguclu commented Aug 11, 2021

> > Haproxy's health checks sometimes fail with Health check for server masters/master-1 failed, reason: Layer6 timeout, check duration: 1001ms, status: 0/3 DOWN. After a health check failure, it takes a couple of seconds for the server to heal and serve again.
>
> Is this check too aggressive?

It does not seem too aggressive. When I checked the logs, it happens on average 2 and at most 3 times per job (after grouping by time ranges).

> > When haproxy starts serving again (Server masters/master-1 is UP. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.), oc explain tries to access it (for whichever resource the explain command is being executed at that time).
> > The oc explain command fails exactly here: https://github.com/openshift/kubernetes-kubectl/blob/0c5d119ce86414ebbd725f56967987c9cf72e496/pkg/cmd/explain/explain.go#L154
> > Within this very small time range, since haproxy has only just come back up, the apiserver has not yet had time to populate the OpenAPI specs, and oc returns a not-found error.
>
> I'm still puzzled. If the other two apiservers were still up/accessible, and even the third apiserver didn't actually restart (only the haproxy on its node did), then I don't see why any data would be missing.
>
> And if the apiserver endpoint did go down and come back up, and oc explain has now hit that restarted apiserver, the apiserver should still have all the necessary information. If it does not have all the necessary information, it should be using a readiness probe to ensure it doesn't start serving content until it is fully populated.
>
> @soltysh @mfojtik can you weigh in here? Something seems off about treating this oc explain error as "expected" just because something on one of our 3 master nodes had a hiccup.

After digging into it further, taking https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi/1424597077127598080 as an example:

  • 06:13:18: TerminationGracefulTerminationFinished is fired for kube-apiserver-master-0.
  • 06:13:26: kube-apiserver-check-endpoints stops logging after Starting #1 worker of check-endpoints controller ...
  • 06:13:35: the oc explain should contain proper spec+status for CRDs test starts.
  • 06:13:37: KubeAPIReadyz is fired for kube-apiserver-master-0.
  • 06:13:39: haproxy-master-0 logs Server masters/master-0 is UP. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
  • 06:13:40.044: oc explain is run for operators --api-version=operators.coreos.com/v1.
  • 06:13:40.344: the test fails with couldn't find resource for "operators.coreos.com/v1, Kind=Operator".

Within these time ranges, kube-apiserver-master-0 logs:
2021-08-09T06:13:37.313196974Z I0809 06:13:37.313090 18 controller.go:611] quota admission added evaluator for: endpoints
2021-08-09T06:13:37.320467463Z I0809 06:13:37.320373 18 controller.go:611] quota admission added evaluator for: endpointslices.discovery.k8s.io
2021-08-09T06:13:37.795945394Z I0809 06:13:37.795818 18 aggregator.go:234] Finished OpenAPI spec generation after 954.720961ms
2021-08-09T06:13:37.796029189Z I0809 06:13:37.795982 18 aggregator.go:231] Updating OpenAPI spec because v1.image.openshift.io is updated
2021-08-09T06:13:38.609098725Z I0809 06:13:38.608882 18 aggregator.go:234] Finished OpenAPI spec generation after 812.841114ms
2021-08-09T06:13:38.609098725Z I0809 06:13:38.608981 18 aggregator.go:231] Updating OpenAPI spec because v1.quota.openshift.io is updated
2021-08-09T06:13:39.409034087Z I0809 06:13:39.408876 18 aggregator.go:234] Finished OpenAPI spec generation after 799.869444ms
2021-08-09T06:13:39.409109943Z I0809 06:13:39.409058 18 aggregator.go:231] Updating OpenAPI spec because v1.packages.operators.coreos.com is updated
-- oc explain starts here
2021-08-09T06:13:40.120721163Z I0809 06:13:40.120588 18 aggregator.go:234] Finished OpenAPI spec generation after 711.498895ms
2021-08-09T06:13:40.120774094Z I0809 06:13:40.120736 18 aggregator.go:231] Updating OpenAPI spec because v1.security.openshift.io is updated
-- oc explain fails here
2021-08-09T06:13:40.903857665Z I0809 06:13:40.903757 18 aggregator.go:234] Finished OpenAPI spec generation after 782.9872ms
2021-08-09T06:13:40.903964685Z I0809 06:13:40.903926 18 aggregator.go:231] Updating OpenAPI spec because v1.user.openshift.io is updated
2021-08-09T06:13:41.550374559Z I0809 06:13:41.550222 18 aggregator.go:234] Finished OpenAPI spec generation after 646.243214ms
2021-08-09T06:13:41.550415606Z I0809 06:13:41.550367 18 aggregator.go:231] Updating OpenAPI spec because v1.template.openshift.io is updated
2021-08-09T06:13:42.228136085Z I0809 06:13:42.228021 18 aggregator.go:234] Finished OpenAPI spec generation after 677.622305ms
2021-08-09T06:13:42.228136085Z I0809 06:13:42.228096 18 aggregator.go:231] Updating OpenAPI spec because v1beta1.metrics.k8s.io is updated
2021-08-09T06:13:42.849925095Z I0809 06:13:42.849133 18 aggregator.go:234] Finished OpenAPI spec generation after 621.017972ms
2021-08-09T06:13:42.849925095Z I0809 06:13:42.849232 18 aggregator.go:231] Updating OpenAPI spec because v1.build.openshift.io is updated
2021-08-09T06:13:43.537653077Z I0809 06:13:43.537551 18 aggregator.go:234] Finished OpenAPI spec generation after 688.293549ms
2021-08-09T06:13:43.537682716Z I0809 06:13:43.537647 18 aggregator.go:231] Updating OpenAPI spec because v1.apps.openshift.io is updated
2021-08-09T06:13:44.114553542Z I0809 06:13:44.114447 18 aggregator.go:234] Finished OpenAPI spec generation after 576.776439ms
2021-08-09T06:13:44.114603682Z I0809 06:13:44.114534 18 aggregator.go:231] Updating OpenAPI spec because v1.authorization.openshift.io is updated
2021-08-09T06:13:44.752935265Z I0809 06:13:44.752702 18 aggregator.go:234] Finished OpenAPI spec generation after 638.136251ms
2021-08-09T06:13:44.752935265Z I0809 06:13:44.752887 18 aggregator.go:231] Updating OpenAPI spec because v1.route.openshift.io is updated
2021-08-09T06:13:45.401939388Z I0809 06:13:45.401595 18 aggregator.go:234] Finished OpenAPI spec generation after 648.679504ms
2021-08-09T06:13:45.401939388Z I0809 06:13:45.401676 18 aggregator.go:231] Updating OpenAPI spec because k8s_internal_local_delegation_chain_0000000002 is updated

@ardaguclu
Member Author

ardaguclu commented Aug 11, 2021

AFAICS, the aggregator service is started here: https://github.com/kubernetes/kubernetes/blob/34d1d2a4b98e496145d49aee322a94d8387bfdf7/cmd/kube-apiserver/app/server.go#L173
and it runs a new worker in a goroutine here: https://github.com/kubernetes/kubernetes/blob/07fe9f0192283ffa1ce81f490b02f5240cf4099d/staging/src/k8s.io/kube-aggregator/pkg/controllers/openapi/controller.go#L86.

Maybe I'm missing something, but can we rely on the readiness probe in that case? Because the aggregator worker runs in a different goroutine, we don't know when it completes the aggregation of the specs.
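To illustrate the pattern in the linked controller (a simplified stand-in rather than the real kube-aggregator code): Run launches the worker with `go wait.Until(...)` and returns, so nothing upstream can tell when the first full aggregation pass has finished.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Stand-in for the aggregation controller's Run: the worker is launched in its
// own goroutine, so the caller gets control back immediately and has no signal
// for when the aggregated OpenAPI document is complete.
func runAggregation(stopCh <-chan struct{}) {
	runWorker := func() {
		// The real worker drains a queue of "service X changed, re-aggregate
		// its OpenAPI spec" items; per the apiserver logs above, each item
		// takes on the order of 600-950ms.
		fmt.Println("aggregating one spec...")
		time.Sleep(700 * time.Millisecond)
	}

	// Fire-and-forget: Run does not block on the first full pass.
	go wait.Until(runWorker, time.Second, stopCh)
}

func main() {
	stopCh := make(chan struct{})
	runAggregation(stopCh)
	time.Sleep(3 * time.Second) // let a few worker iterations run
	close(stopCh)
}
```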

@bparees
Contributor

bparees commented Aug 11, 2021

> It does not seem too aggressive. When I checked the logs, it happens on average 2 and at most 3 times per job (after grouping by time ranges).

I'd expect it to happen zero times, unless the apiserver is actually down/broken and needs restarting. If we're just restarting it because it's under heavy load or something, restarting it won't help and just hurts. Is there an important benefit we get from keeping the check so tight, vs. loosening it a small amount? (Presumably longer time to detect real issues, so there is a tradeoff, but it sounds like we might be too far on one side of that tradeoff currently.)

> Maybe I'm missing something, but can we rely on the readiness probe in that case? Because the aggregator worker runs in a different goroutine, we don't know when it completes the aggregation of the specs.

Presumably we could make it signal when it is done and feed that into a readiness probe. Right now it sounds like the apiserver is reporting readiness prematurely, and I'd consider that a bug.
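To make that suggestion concrete, a toy sketch of the idea using plain net/http (not the actual kube-apiserver readyz plumbing): the aggregation worker flips a flag after its first complete pass, and a readiness check refuses to report ready until the flag is set, so the endpoint would not rejoin the load balancer while /openapi/v2 is still incomplete.

```go
package main

import (
	"errors"
	"net/http"
	"sync/atomic"
	"time"
)

// firstAggregationDone would be set by the aggregation worker at the end of
// its first complete pass over all registered API services.
var firstAggregationDone atomic.Bool

func markAggregationDone() { firstAggregationDone.Store(true) }

func checkOpenAPIReady() error {
	if !firstAggregationDone.Load() {
		return errors.New("initial OpenAPI aggregation not finished")
	}
	return nil
}

// openAPIReadyCheck returns 503 until the first aggregation pass has finished,
// and 200 afterwards, mimicking an extra readiness check.
func openAPIReadyCheck(w http.ResponseWriter, r *http.Request) {
	if err := checkOpenAPIReady(); err != nil {
		http.Error(w, err.Error(), http.StatusServiceUnavailable)
		return
	}
	w.Write([]byte("ok"))
}

func main() {
	// Simulate the worker finishing its first pass a little after startup.
	go func() {
		time.Sleep(2 * time.Second)
		markAggregationDone()
	}()
	http.HandleFunc("/readyz/openapi", openAPIReadyCheck)
	_ = http.ListenAndServe(":8080", nil)
}
```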

@ardaguclu
Member Author

I filed an upstream issue, kubernetes/kubernetes#104324, and opened a PR for this bug.

Thank you @bparees for helping to investigate the real problem. I'm closing this PR.

@ardaguclu ardaguclu closed this Aug 16, 2021
@bparees
Contributor

bparees commented Aug 16, 2021

> I filed an upstream issue, kubernetes/kubernetes#104324, and opened a PR for this bug.

Excellent, thanks for digging into it!

@ardaguclu ardaguclu deleted the sig-cli-explain-flakiness branch January 25, 2022 08:27