Bug 2097297: Don't consider DNS lookup failure disruption #27250
Conversation
Initial run, I just want to confirm that disruption at warning level still gets graphed in intervals.

Run https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27250/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1536724679006359552 proves that switching all disruption intervals to warning level still results in them showing in the intervals chart. Proceeding with the real code change next. It also seems to prove that warning intervals are excluded by default from the disruption data we'll ingest into BigQuery: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/27250/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1536724679006359552/artifacts/e2e-gcp-upgrade/openshift-e2e-test/artifacts/junit/backend-disruption_20220614-160251.json
```go
// AnnotationFrom extracts annotations of the format key/value from the message/locator string.
// i.e. "disruption/cache-kube-api connection/new stopped responding to GET requests over new connections"
// has annotations "disruption" and "connection"
func AnnotationFrom(message, annotationName string) string {
```
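For context, here is a minimal sketch of how an extractor matching the doc comment above could work; the whitespace-and-slash parsing is an assumption for illustration, not the PR's actual implementation:

```go
import "strings"

// annotationFromSketch is a hypothetical stand-in for AnnotationFrom: it scans
// whitespace-separated tokens and returns the value of the first token whose
// key (the part before the "/") matches annotationName.
func annotationFromSketch(message, annotationName string) string {
	for _, token := range strings.Fields(message) {
		if parts := strings.SplitN(token, "/", 2); len(parts) == 2 && parts[0] == annotationName {
			return parts[1]
		}
	}
	return ""
}
```

With the example locator from the doc comment, annotationFromSketch(locator, "connection") would return "new".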
Didn't end up actually using it, but I think it's worth keeping.
Even though the change you made here is equivalent to what was there before (because "reason" is passed on line 77 and `annotations["reason"]` is returned), I think line 93 should say `return annotations[annotationName]`.
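For clarity, the fix being suggested is a one-line change; a sketch, with the surrounding function abbreviated and the map argument introduced only for illustration:

```go
// Buggy: the annotationName parameter is ignored, so callers always get the
// "reason" annotation back regardless of what they asked for.
func annotationFromBuggy(annotations map[string]string, annotationName string) string {
	return annotations["reason"]
}

// Fixed: honor the parameter, as suggested above.
func annotationFromFixed(annotations map[string]string, annotationName string) string {
	return annotations[annotationName]
}
```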
Total bug, thanks!
```go
		monitorapi.IsDisruptionEvent,
		monitorapi.IsErrorEvent, // ignore Warning events, we use these for disruption we don't actually think was from the cluster under test (i.e. DNS)
	),
)
```
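For readers unfamiliar with the pattern, here is a rough sketch of how composed interval predicates like these could work; the And combinator, the EventInterval shape, and the predicate body are assumptions inferred from the names in the diff, not the actual monitorapi code:

```go
// Assumed shapes, inferred from the names used in the diff above.
type EventLevel int

const (
	Info EventLevel = iota
	Warning
	Error
)

type EventInterval struct {
	Level   EventLevel
	Locator string
	Message string
}

type IntervalFilter func(EventInterval) bool

// And passes an interval only when every child predicate passes.
func And(filters ...IntervalFilter) IntervalFilter {
	return func(i EventInterval) bool {
		for _, f := range filters {
			if !f(i) {
				return false
			}
		}
		return true
	}
}

// An IsErrorEvent-style predicate keeps only Error-level intervals, which is
// why downgrading the DNS-failure intervals to Warning drops them from the
// disruption calculations.
func isErrorEvent(i EventInterval) bool {
	return i.Level == Error
}
```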
I'm not sure this was needed; in the trial run linked in the comments above, where I just moved the intervals from error to warning, the disruption data file was all 0. I don't know where it's filtering, though.
Looks like it is filtered here:
```go
IsErrorEvent,
```
Awesome, thanks Ken! I'm going to leave it double-filtered for now; seems safe to do.
This is occurring often in the build clusters, at least for the time being, but we can identify that this specific message is a local DNS failure and thus should not be treated as disruption in the cluster under test. Change to detect this specific message and treat it as a warning-level interval, not an error-level one. Tests and the disruption JSON artifacts we write both automatically filter to only error-level intervals, so these worked as is, although an additional filter is added in this PR just for safety. Warning intervals will still be graphed on the intervals page. A new test is added to watch for these DNS lookup failures and flake if found, which will help us determine if the problem is still ongoing.
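A rough sketch of the kind of message detection the commit describes; the function name and exact substrings are assumptions based on the error text quoted in the PR description below ("dial tcp: lookup [hostname]: i/o timeout"), not the PR's actual matching code:

```go
import "strings"

// isLocalDNSLookupFailure guesses whether a disruption sample failed because
// of a DNS lookup problem local to the test runner rather than the cluster
// under test. Hypothetical; the real matcher may be stricter.
func isLocalDNSLookupFailure(errMsg string) bool {
	return strings.Contains(errMsg, "dial tcp: lookup") &&
		strings.Contains(errMsg, "i/o timeout")
}
```

A sampler using a check like this would record the matching interval at warning level instead of error.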
Force-pushed from 6a9d361 to 3eb3dfa
@dgoodwin: This pull request references Bugzilla bug 2097297, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Looks like we did hit this in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27250/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1537034279022759936 and it worked correctly.

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, stbenjam

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@dgoodwin: The following tests failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@dgoodwin: All pull requests linked via external trackers have merged: Bugzilla bug 2097297 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
TRT-584 Not including info events caused 0s disruptions to disappear in the reported data and bigquery since Jun 15. Broken in openshift#27250.
From TRT-277 we know that the CI build clusters where openshift-tests runs are encountering DNS lookup problems. These surface as disruption intervals, and sometimes test failures, where the message looks something like "dial tcp: lookup [hostname]: i/o timeout". We also know that the same error will usually be reported while trying to sample a separate external service that has nothing to do with the cluster under test.

We believe this particular error is safe to ignore for the purposes of disruption sampling: it's local to where the tests are running, not the cluster we're trying to communicate with.

This PR continues to record these intervals, but at warning level instead of error. I have confirmed that warning-level intervals still seem to be charted. The PR attempts to ensure these samples do not make it into the disruption JSON file we ingest into BigQuery, though this seemed to happen by default and I'm not sure why. The disruption tests also already look like they ignore warning levels in their calculations.

An additional test is added to flake if we detect any of these DNS outages. This will help us continue to find job runs where this happened while debugging the build clusters.
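To make the last point concrete, here is a rough sketch of what such a flake-on-detection test could look like; the test name, the interval scan, and the flake mechanism (a failing and a passing JUnit case with the same name, which CI tooling conventionally interprets as a flake) are assumptions, not the PR's exact code:

```go
import (
	"fmt"
	"strings"
)

// Minimal stand-in types; the real code uses openshift/origin's monitor and
// JUnit structures.
type EventInterval struct {
	Locator string
	Message string
}

type JUnitTestCase struct {
	Name          string
	FailureOutput string // non-empty means the case failed
}

// testDNSLookupFailures is a hypothetical version of the new test: it scans
// intervals for the local DNS-lookup signature and, if any are found, emits a
// failing case plus a passing case with the same name, so the run is marked
// as a flake rather than a hard failure.
func testDNSLookupFailures(intervals []EventInterval) []JUnitTestCase {
	const testName = "[sig-trt] no local DNS lookup errors should be encountered in disruption samplers"

	var hits []string
	for _, i := range intervals {
		if strings.Contains(i.Message, "dial tcp: lookup") && strings.Contains(i.Message, "i/o timeout") {
			hits = append(hits, fmt.Sprintf("%s: %s", i.Locator, i.Message))
		}
	}
	if len(hits) == 0 {
		return []JUnitTestCase{{Name: testName}}
	}
	return []JUnitTestCase{
		{Name: testName, FailureOutput: strings.Join(hits, "\n")},
		{Name: testName}, // matching pass turns the failure into a flake
	}
}
```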