Bug 2097297: Don't consider DNS lookup failure disruption #27250

Merged (1 commit) on Jun 15, 2022


dgoodwin

@dgoodwin dgoodwin commented Jun 14, 2022

From TRT-277 we know that the CI build clusters where openshift-tests runs are encountering DNS lookup problems. These surface as disruption intervals, and sometimes as test failures, where the message looks something like "dial tcp: lookup [hostname]: i/o timeout". We also know at this time that the same error is usually reported when sampling a separate external service that has nothing to do with the cluster under test.

We believe this particular error is safe to ignore for the purposes of disruption sampling: it is local to where the tests are running, not to the cluster we're trying to communicate with.

This PR continues to record these intervals, but at warning level instead of error. I have confirmed that warning level intervals are still charted. The PR also attempts to ensure these samples do not make it into the disruption JSON file we ingest into BigQuery, though this seemed to happen by default and I'm not sure why. The disruption tests also appear to already ignore warning level intervals in their calculations.
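As a rough illustration of the level change described above, here is a minimal sketch of how a sampler error message could be mapped to an interval level. The type `IntervalLevel` and the functions `isLocalDNSLookupTimeout` and `levelForSampleError` are hypothetical names for this example, not the identifiers used in the PR, and the substring match is an assumption based only on the message format quoted above.

```go
package main

import "strings"

// IntervalLevel is a hypothetical stand-in for the monitor's interval severity.
type IntervalLevel string

const (
	LevelWarning IntervalLevel = "Warning"
	LevelError   IntervalLevel = "Error"
)

// isLocalDNSLookupTimeout reports whether msg looks like
// "dial tcp: lookup <hostname>: i/o timeout", meaning name resolution timed
// out on the host running the tests before any connection was attempted.
// Assumption: matching on these two substrings is sufficient.
func isLocalDNSLookupTimeout(msg string) bool {
	return strings.Contains(msg, "dial tcp: lookup ") &&
		strings.Contains(msg, "i/o timeout")
}

// levelForSampleError keeps the interval but downgrades it to warning when
// the failure is a local DNS lookup timeout; everything else stays an error.
func levelForSampleError(msg string) IntervalLevel {
	if isLocalDNSLookupTimeout(msg) {
		return LevelWarning
	}
	return LevelError
}
```

With this sketch, a connection timeout such as "dial tcp 10.0.0.1:6443: i/o timeout" stays at error level because no lookup was involved, while a lookup timeout is recorded as a warning.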

An additional test is added to flake if we detect any of these DNS outages. This will help us keep finding job runs where this happened while we debug the build clusters.
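A sketch of how such a flaking check might be reported, assuming the convention that a failure and a success recorded under the same test name are treated as a flake rather than a hard failure. The `testCase` type and `dnsFlakeCases` function are hypothetical and simplified; they are not the types used by openshift-tests.

```go
package main

import "fmt"

// testCase is a hypothetical, simplified JUnit-style result record.
type testCase struct {
	Name    string
	Failure string // empty means the case passed
}

// dnsFlakeCases returns the results to report for the DNS check.
// With no DNS failures it reports a single pass. Otherwise it reports a
// failure followed by a pass under the same name, which (by the assumed
// convention) is interpreted as a flake and does not fail the job.
func dnsFlakeCases(name string, dnsFailureMessages []string) []testCase {
	pass := testCase{Name: name}
	if len(dnsFailureMessages) == 0 {
		return []testCase{pass}
	}
	fail := testCase{
		Name: name,
		Failure: fmt.Sprintf("observed %d DNS lookup failure(s) from the test host; first: %s",
			len(dnsFailureMessages), dnsFailureMessages[0]),
	}
	return []testCase{fail, pass}
}
```

The job stays green either way, but the recorded failure text makes affected runs searchable.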

@dgoodwin

Initial run, I just want to confirm that disruption at warning level still gets graphed in intervals.

@dgoodwin dgoodwin changed the title from "WIP" to "WIP: Don't consider DNS lookup failure disruption" Jun 14, 2022
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Jun 14, 2022
@openshift-ci openshift-ci bot requested review from bparees and csrwng June 14, 2022 15:00
@dgoodwin

dgoodwin commented Jun 14, 2022

Run https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27250/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1536724679006359552 proves that switching all disruption intervals to warning level still results in them showing in the intervals chart. Proceeding with real code change next.

It also seems to prove that warning intervals are being excluded from the disruption data we'll ingest into bigquery by default: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/27250/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1536724679006359552/artifacts/e2e-gcp-upgrade/openshift-e2e-test/artifacts/junit/backend-disruption_20220614-160251.json

// AnnotationFrom extracts annotations of the format key/value from the message/locator string.
// i.e. "disruption/cache-kube-api connection/new stopped responding to GET requests over new connections"
// has annotations "disruption" and "connection"
func AnnotationFrom(message, annotationName string) string {

Contributor Author

Didn't end up actually using this, but I think it's worth keeping.

Contributor

Even though the change you made here is equivalent to what was there before (because "reason" is passed on line 77 and annotations["reason"] is returned), I think line 93 should say return annotations[annotationName].

Contributor Author

Total bug, thanks!
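To make the bug discussed in this thread concrete, here is a minimal sketch of a key/value annotation parser that returns the requested annotation instead of a hard-coded "reason" key. It assumes annotations are whitespace-separated key/value tokens, as in the example in the doc comment above. The names `annotationsFrom` and `annotationFrom` are hypothetical, and the real implementation in the PR may parse differently.

```go
package main

import "strings"

// annotationsFrom collects every whitespace-separated token of the form
// key/value into a map. If a key repeats, the first occurrence wins.
// Assumption: any token containing "/" is an annotation.
func annotationsFrom(message string) map[string]string {
	annotations := map[string]string{}
	for _, token := range strings.Fields(message) {
		key, value, ok := strings.Cut(token, "/")
		if !ok || key == "" || value == "" {
			continue
		}
		if _, seen := annotations[key]; !seen {
			annotations[key] = value
		}
	}
	return annotations
}

// annotationFrom returns the value for the requested annotation name, or ""
// if it is absent. Indexing with annotationName (not a fixed key such as
// "reason") is the fix suggested in the review.
func annotationFrom(message, annotationName string) string {
	return annotationsFrom(message)[annotationName]
}
```

For the example message in the doc comment, asking for "disruption" yields "cache-kube-api" and asking for "connection" yields "new".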

monitorapi.IsDisruptionEvent,
monitorapi.IsErrorEvent, // ignore Warning events, we use these for disruption we don't actually think was from the cluster under test (i.e. DNS)
),
)

Contributor Author

I'm not sure this was needed. I did a trial run first (linked in the comments) where I just moved the intervals from error to warning, and the disruption data file was all 0. I don't know where it's being filtered, though.

Contributor

Looks like it is filtered here:

Contributor Author

Awesome thanks Ken! I'm going to leave it double filtered for now I think, seems safe to do.
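For readers unfamiliar with the filter composition in the snippet above, here is a small sketch of combining two predicates so that only error level disruption events are kept. The `event` type, the `and` combinator and the lower-case predicate names are hypothetical simplifications for this example; they are not the monitorapi definitions.

```go
package main

import "strings"

// event is a hypothetical, simplified monitor event.
type event struct {
	Level   string // "Info", "Warning", or "Error"
	Locator string
	Message string
}

// eventMatcher decides whether an event should be kept.
type eventMatcher func(event) bool

// and returns a matcher that accepts an event only if every matcher accepts it.
func and(matchers ...eventMatcher) eventMatcher {
	return func(e event) bool {
		for _, m := range matchers {
			if !m(e) {
				return false
			}
		}
		return true
	}
}

// isDisruptionEvent assumes disruption events carry a "disruption/" locator prefix.
func isDisruptionEvent(e event) bool { return strings.HasPrefix(e.Locator, "disruption/") }

// isErrorEvent drops Info and Warning events, including downgraded DNS failures.
func isErrorEvent(e event) bool { return e.Level == "Error" }

// filterEvents returns the events accepted by keep, preserving order.
func filterEvents(events []event, keep eventMatcher) []event {
	var kept []event
	for _, e := range events {
		if keep(e) {
			kept = append(kept, e)
		}
	}
	return kept
}
```

Applying the same error-level filter a second time further down the pipeline returns the same set, which is why leaving it double filtered is harmless in this sketch.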

@dgoodwin

In https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27250/pull-ci-openshift-origin-master-e2e-aws-single-node-upgrade/1536766123473637376 which will always have lots of disruption (single node), the real disruption is getting graphed and showing up in the json: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/27250/pull-ci-openshift-origin-master-e2e-aws-single-node-upgrade/1536766123473637376/artifacts/e2e-aws-single-node-upgrade/single-node-e2e-test/artifacts/junit/backend-disruption_20220614-182934.json

gcp upgrade went surprisingly well, just a couple seconds of disruption: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27250/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1536766124006313984 and none of it was DNS lookups.

I cannot find any with dial tcp lookup i/o timeout specifically yet.

I can see the new test running and passing successfully.

Going to run another set of tests and hope to see our DNS problem appear and get handled as expected. However, we may want to merge, given it's not causing problems so far, and let it run on wider CI for results.

@openshift-ci openshift-ci bot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) Jun 15, 2022
This is occurring often in the build clusters at least for the time
being, but we can identify that this specific message is a local DNS
failure, and thus should not be treated as disruption in the cluster
under test.

Change to detect this specific message and treat it as a warning level
interval, not an error level. Tests and the disruption json artifacts we
write both automatically filter to only error level intervals, so these
worked as is, although there is an additional filter added in this PR
just for safety.

Warning intervals will still be graphed on the intervals page.

A new test is added to watch for these DNS lookups and flake if found,
which will help us determine if the problem is still ongoing.
@openshift-ci openshift-ci bot removed the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) Jun 15, 2022
@dgoodwin dgoodwin changed the title from "WIP: Don't consider DNS lookup failure disruption" to "Bug 2097297: Don't consider DNS lookup failure disruption" Jun 15, 2022
@openshift-ci openshift-ci bot added the bugzilla/severity-urgent (Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting.) and bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.) labels and removed the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.) label Jun 15, 2022
@openshift-ci

openshift-ci bot commented Jun 15, 2022

@dgoodwin: This pull request references Bugzilla bug 2097297, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 2097297: Don't consider DNS lookup failure disruption

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stbenjam

@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) Jun 15, 2022
@openshift-ci

openshift-ci bot commented Jun 15, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, stbenjam

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Jun 15, 2022
@openshift-ci

openshift-ci bot commented Jun 15, 2022

@dgoodwin: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-single-node | 3eb3dfa | link | false | /test e2e-aws-single-node |
| ci/prow/e2e-gcp-ovn-rt-upgrade | 3eb3dfa | link | false | /test e2e-gcp-ovn-rt-upgrade |
| ci/prow/e2e-aws-single-node-upgrade | 3eb3dfa | link | false | /test e2e-aws-single-node-upgrade |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot merged commit 2992729 into openshift:master Jun 15, 2022
@openshift-ci

openshift-ci bot commented Jun 15, 2022

@dgoodwin: All pull requests linked via external trackers have merged:

Bugzilla bug 2097297 has been moved to the MODIFIED state.

In response to this:

Bug 2097297: Don't consider DNS lookup failure disruption

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

dgoodwin added a commit to dgoodwin/origin that referenced this pull request Sep 30, 2022
TRT-584

Not including info events caused 0s disruptions to disappear in the
reported data and bigquery since Jun 15. Broken in openshift#27250.
dgoodwin added a commit to dgoodwin/origin that referenced this pull request Oct 3, 2022
TRT-584

Not including info events caused 0s disruptions to disappear in the
reported data and bigquery since Jun 15. Broken in openshift#27250.
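
One plausible way a level filter can make zero-second entries vanish is sketched below. This is an assumption for illustration only: the actual code path in origin is not shown on this page, and the `interval` type and `disruptionSeconds` function are hypothetical. The idea is that a backend only gets an entry in the report if at least one of its intervals survives the filter, so a backend with nothing but info level intervals is absent instead of being reported as 0s.

```go
package main

// interval is a hypothetical, simplified disruption sampling interval.
type interval struct {
	Backend string
	Level   string // "Info", "Warning", or "Error"
	Seconds int
}

// disruptionSeconds totals error level seconds per backend, but only for
// backends that have at least one interval accepted by keep.
func disruptionSeconds(intervals []interval, keep func(interval) bool) map[string]int {
	totals := map[string]int{}
	for _, iv := range intervals {
		if !keep(iv) {
			continue
		}
		if _, ok := totals[iv.Backend]; !ok {
			totals[iv.Backend] = 0 // backend is now known, even with 0s of disruption
		}
		if iv.Level == "Error" {
			totals[iv.Backend] += iv.Seconds
		}
	}
	return totals
}

// errorOnly keeps only error level intervals.
func errorOnly(iv interval) bool { return iv.Level == "Error" }

// errorOrInfo also keeps info level intervals so healthy backends still appear.
func errorOrInfo(iv interval) bool { return iv.Level == "Error" || iv.Level == "Info" }
```

In this sketch, a backend that only ever produced info level intervals is missing from the result under `errorOnly`, and present with a total of 0 under `errorOrInfo`.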