Bug 2097297: Don't consider DNS lookup failure disruption #27250
Conversation
Initial run, I just want to confirm that disruption at warning level still gets graphed in intervals.

Run https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27250/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1536724679006359552 proves that switching all disruption intervals to warning level still results in them showing in the intervals chart. Proceeding with the real code change next. It also seems to prove that warning intervals are excluded by default from the disruption data we'll ingest into BigQuery: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/27250/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1536724679006359552/artifacts/e2e-gcp-upgrade/openshift-e2e-test/artifacts/junit/backend-disruption_20220614-160251.json
```go
// AnnotationFrom extracts annotations of the format key/value from the message/locator string.
// i.e. "disruption/cache-kube-api connection/new stopped responding to GET requests over new connections"
// has annotations "disruption" and "connection"
func AnnotationFrom(message, annotationName string) string {
```
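For context, here is a minimal sketch of how an extractor matching the doc comment above could work; the whitespace-and-slash parsing is an assumption for illustration, not the PR's actual implementation:

```go
import "strings"

// annotationFromSketch is a hypothetical stand-in for AnnotationFrom: it scans
// whitespace-separated tokens and returns the value of the first token whose
// key (the part before the "/") matches annotationName.
func annotationFromSketch(message, annotationName string) string {
	for _, token := range strings.Fields(message) {
		if parts := strings.SplitN(token, "/", 2); len(parts) == 2 && parts[0] == annotationName {
			return parts[1]
		}
	}
	return ""
}
```

With the example locator from the doc comment, annotationFromSketch(locator, "connection") would return "new".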
Didn't end up actually using it, but I think it's worth keeping.
Even though the change you made here is equivalent to what was there before (because "reason" is passed on line 77 and `annotations["reason"]` is returned), I think line 93 should say `return annotations[annotationName]`.
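For clarity, the fix being suggested is a one-line change; a sketch, with the surrounding function abbreviated and the map argument introduced only for illustration:

```go
// Buggy: the annotationName parameter is ignored, so callers always get the
// "reason" annotation back regardless of what they asked for.
func annotationFromBuggy(annotations map[string]string, annotationName string) string {
	return annotations["reason"]
}

// Fixed: honor the parameter, as suggested above.
func annotationFromFixed(annotations map[string]string, annotationName string) string {
	return annotations[annotationName]
}
```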
Total bug, thanks!
```go
		monitorapi.IsDisruptionEvent,
		monitorapi.IsErrorEvent, // ignore Warning events, we use these for disruption we don't actually think was from the cluster under test (i.e. DNS)
	),
)
```
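For readers unfamiliar with the pattern, here is a rough sketch of how composed interval predicates like these could work; the And combinator, the EventInterval shape, and the predicate body are assumptions inferred from the names in the diff, not the actual monitorapi code:

```go
// Assumed shapes, inferred from the names used in the diff above.
type EventLevel int

const (
	Info EventLevel = iota
	Warning
	Error
)

type EventInterval struct {
	Level   EventLevel
	Locator string
	Message string
}

type IntervalFilter func(EventInterval) bool

// And passes an interval only when every child predicate passes.
func And(filters ...IntervalFilter) IntervalFilter {
	return func(i EventInterval) bool {
		for _, f := range filters {
			if !f(i) {
				return false
			}
		}
		return true
	}
}

// An IsErrorEvent-style predicate keeps only Error-level intervals, which is
// why downgrading the DNS-failure intervals to Warning drops them from the
// disruption calculations.
func isErrorEvent(i EventInterval) bool {
	return i.Level == Error
}
```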
I'm not sure this was needed; in the trial run linked in the comments above, where I just moved the intervals from error to warning, the disruption data file was all 0. I don't know where it's filtering, though.
Looks like it is filtered here:
```go
IsErrorEvent,
```
Awesome, thanks Ken! I'm going to leave it double-filtered for now; seems safe to do.
This is occurring often in the build clusters, at least for the time being, but we can identify that this specific message is a local DNS failure and thus should not be treated as disruption in the cluster under test. Change to detect this specific message and treat it as a warning-level interval, not an error-level one. Tests and the disruption JSON artifacts we write both automatically filter to only error-level intervals, so these worked as is, although an additional filter is added in this PR just for safety. Warning intervals will still be graphed on the intervals page. A new test is added to watch for these DNS lookup failures and flake if found, which will help us determine if the problem is still ongoing.
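A rough sketch of the kind of message detection the commit describes; the function name and exact substrings are assumptions based on the error text quoted in the PR description below ("dial tcp: lookup [hostname]: i/o timeout"), not the PR's actual matching code:

```go
import "strings"

// isLocalDNSLookupFailure guesses whether a disruption sample failed because
// of a DNS lookup problem local to the test runner rather than the cluster
// under test. Hypothetical; the real matcher may be stricter.
func isLocalDNSLookupFailure(errMsg string) bool {
	return strings.Contains(errMsg, "dial tcp: lookup") &&
		strings.Contains(errMsg, "i/o timeout")
}
```

A sampler using a check like this would record the matching interval at warning level instead of error.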
Force-pushed from 6a9d361 to 3eb3dfa
@dgoodwin: This pull request references Bugzilla bug 2097297, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Looks like we did hit this in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27250/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1537034279022759936 and it worked correctly.

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, stbenjam

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@dgoodwin: The following tests failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@dgoodwin: All pull requests linked via external trackers have merged: Bugzilla bug 2097297 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
TRT-584 Not including info events caused 0s disruptions to disappear in the reported data and bigquery since Jun 15. Broken in openshift#27250.
From TRT-277 we know that the CI build clusters where openshift-tests runs are encountering DNS lookup problems. These surface as disruption intervals, and sometimes test failures, where the message looks something like "dial tcp: lookup [hostname]: i/o timeout". We also know that the same error will usually be reported while trying to sample a separate external service that has nothing to do with the cluster under test.

We believe this particular error is safe to ignore for the purposes of disruption sampling: it's local to where the tests are running, not the cluster we're trying to communicate with.

This PR continues to record these intervals, but at warning level instead of error. I have confirmed that warning-level intervals still seem to be charted. The PR attempts to ensure these samples do not make it into the disruption JSON file we ingest into BigQuery, though this seemed to happen by default and I'm not sure why. The disruption tests also already look like they ignore warning levels in their calculations.

An additional test is added to flake if we detect any of these DNS outages. This will help us continue to find job runs where this happened while debugging the build clusters.
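To make the last point concrete, here is a rough sketch of what such a flake-on-detection test could look like; the test name, the interval scan, and the flake mechanism (a failing and a passing JUnit case with the same name, which CI tooling conventionally interprets as a flake) are assumptions, not the PR's exact code:

```go
import (
	"fmt"
	"strings"
)

// Minimal stand-in types; the real code uses openshift/origin's monitor and
// JUnit structures.
type EventInterval struct {
	Locator string
	Message string
}

type JUnitTestCase struct {
	Name          string
	FailureOutput string // non-empty means the case failed
}

// testDNSLookupFailures is a hypothetical version of the new test: it scans
// intervals for the local DNS-lookup signature and, if any are found, emits a
// failing case plus a passing case with the same name, so the run is marked
// as a flake rather than a hard failure.
func testDNSLookupFailures(intervals []EventInterval) []JUnitTestCase {
	const testName = "[sig-trt] no local DNS lookup errors should be encountered in disruption samplers"

	var hits []string
	for _, i := range intervals {
		if strings.Contains(i.Message, "dial tcp: lookup") && strings.Contains(i.Message, "i/o timeout") {
			hits = append(hits, fmt.Sprintf("%s: %s", i.Locator, i.Message))
		}
	}
	if len(hits) == 0 {
		return []JUnitTestCase{{Name: testName}}
	}
	return []JUnitTestCase{
		{Name: testName, FailureOutput: strings.Join(hits, "\n")},
		{Name: testName}, // matching pass turns the failure into a flake
	}
}
```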