Bug 1907475: Add recording rules for ingress traffic and error rate #1019
The overall rates of ingress traffic to the cluster are relevant in assessing health and usage. These rules add calculations for the rate of requests that error (5xx and other types of errors) across the frontends, the total requests passing across the frontends, the bandwidth used in both directions, and the number of open connections. Error rate is split between OpenShift and workloads (as CPU is) to better categorize trends before, during, and after upgrades. This data helps both admins and engineering teams classify the overall health and traffic of the cluster. This adds seven new series to telemetry via the cluster:usage: prefix.
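To make the arithmetic behind an "error rate" concrete, here is a minimal Python sketch of how a per-second rate and an error fraction can be derived from two samples of a cumulative counter. It is illustrative only: the names `CounterSample`, `per_second_rate`, and `error_fraction` are hypothetical, the numbers are made up, and the calculation is a simplification that does not reproduce the actual rule expressions in this PR or the extrapolation Prometheus applies in its own rate functions.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One observation of a cumulative counter (hypothetical type)."""
    timestamp: float  # seconds
    value: float      # cumulative count at that time


def per_second_rate(first: CounterSample, last: CounterSample) -> float:
    """Average per-second increase between two counter samples."""
    elapsed = last.timestamp - first.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = last.value - first.value
    if delta < 0:
        # Assumption: a drop means the counter was reset to zero, so the
        # latest value is the whole increase since the reset.
        delta = last.value
    return delta / elapsed


def error_fraction(error_rate: float, total_rate: float) -> float:
    """Share of requests that errored; 0.0 when there is no traffic."""
    if total_rate == 0:
        return 0.0
    return error_rate / total_rate


# Made-up samples taken five minutes (300 s) apart.
total = per_second_rate(CounterSample(0, 1000), CounterSample(300, 4000))
errors = per_second_rate(CounterSample(0, 10), CounterSample(300, 160))
fraction = error_fraction(errors, total)
```

With these made-up samples, total traffic averages 10 requests per second and errors average 0.5 per second, so the error fraction is 0.05. Computing that fraction separately for platform and workload routes is the kind of split the description above refers to.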
/retest
@smarterclayton: This pull request references Bugzilla bug 1907475, which is invalid:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/bugzilla refresh
@smarterclayton: This pull request references Bugzilla bug 1907475, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
@Miciah can you or someone else on the team take a look? I want to quantify "happy" and "non-disrupted" ingress (this won't catch the nodeport / local traffic problem, but it will catch lower-level problems like ovs)
@sgreene570, could you take a look? @RiRa12621, I figure you might be interested as well.
Thanks for letting me know, @Miciah.
Can one of the network edge reviewers confirm these make sense to capture given the reasoning in the bug and PR? Just want to make sure before I get final review on the rule structure.
Yeah, it makes sense to me. /lgtm
/bugzilla cc-qa
@quarterpin: This pull request references Bugzilla bug 1907475, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact:
/lgtm Verified via the pre-merge verification workflow; further references related to the test can be found in:
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Miciah, quarterpin, s-urbaniak, smarterclayton The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest Please review the full test history for this PR and help us cut down flakes.
@smarterclayton: All pull requests linked via external trackers have merged: Bugzilla bug 1907475 has been moved to the MODIFIED state.