Add alerts for issues with load balancers/ports. #1148
Awesome - simple but helpful. I left some wording suggestions in the comments; again, I'm not sure if they're significantly better.
@@ -89,3 +89,17 @@ spec:
histogram_quantile(0.95, rate(kuryr_cni_request_duration_seconds_bucket[10m])) > 30
labels:
severity: warning
- alert: KuryrLoadBalancerNotReady
annotations:
message: One of the Octavia Load Balancers has timed out with either PENDING_UPDATE/PENDING_CREATE/PENDING_DELETE state
Hm, let's try to make sure we're blaming Octavia properly. ;)
Kuryr noticed an Octavia load balancer unexpectedly stuck in PENDING_* state in the last 20 minutes. Most likely this indicates an issue with OpenStack Octavia.
severity: critical
- alert: KuryrPortNotReady
annotations:
message: One of the Neutron ports has timed out with making it active
And Neutron:
Kuryr noticed that a Neutron port was in unexpected DOWN state in the last 20 minutes. Most likely this indicates an issue with OpenStack Neutron.
Sometimes Neutron ports and/or Octavia load balancers get stuck forever in an unrecoverable state. Currently there is nothing we can do about this from the Kuryr-Kubernetes side except alert.
@@ -121,3 +121,17 @@ spec:
for: 5m
labels:
severity: critical
- alert: KuryrLoadBalancerNotReady
annotations:
message: Kuryr noticed an Octavia load balancer unexpectedly stuck in PENDING_* state in the last 20 minutes. Most likely this indicates an issue with OpenStack Octavia.
Will it only start alerting after 20 min? It might be too long, maybe 10 mins fits better?
It works like this: if the counter changes at any point within the 20-minute window, the alert is triggered. Since we do the reconciliation, and the retry loop takes around 10-15 minutes, the alert stays active because the counter is incremented again within each 20-minute window. The alert clears once the counter stops increasing. Initially @dulek proposed 30 minutes for the window, but IMHO that's a little bit too long.
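To make the window behaviour concrete, here is a minimal Python sketch of how an expression like `changes(metric[20m]) > 0` evaluates over time. It is only an illustration, not Prometheus code: the helper names (`changes`, `alert_firing`), the 5-minute scrape interval, the sample values, and the left-open window boundary are all assumptions made for the example.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, counter value)

WINDOW = 20 * 60  # the 20m range used in the alert expression


def changes(samples: List[Sample], now: int, window: int = WINDOW) -> int:
    """Count value changes between consecutive samples in (now - window, now]."""
    values = [v for t, v in samples if now - window < t <= now]
    return sum(1 for prev, cur in zip(values, values[1:]) if prev != cur)


def alert_firing(samples: List[Sample], now: int, window: int = WINDOW) -> bool:
    """Simplified equivalent of `changes(metric[20m]) > 0`."""
    return changes(samples, now, window) > 0


# Hypothetical data, scraped every 5 minutes: the first timeout is counted at
# t=300s, a retry times out again 15 minutes later (t=1200s), and after that
# the counter stays flat.
VALUES = [0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]
SAMPLES: List[Sample] = [(i * 300, float(v)) for i, v in enumerate(VALUES)]

if __name__ == "__main__":
    for now in range(0, 3601, 300):
        state = "firing" if alert_firing(SAMPLES, now) else "ok"
        print(f"{now // 60:>2} min: {state}")
```

With this made-up data the alert fires from minute 5 through minute 30 and clears at minute 35, once the last increment has dropped out of the 20-minute window. A retry that fails again within 10-15 minutes keeps the alert active without a gap.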
got it, thanks.
@dulek do you still have a preference for 30 min?
Not at all, @gryf convinced me to do 20 minutes.
annotations:
message: Kuryr noticed an Octavia load balancer unexpectedly stuck in PENDING_* state in the last 20 minutes. Most likely this indicates an issue with OpenStack Octavia.
expr: |
changes(kuryr_load_balancer_readiness_total[20m]) > 0
The name of the metric differs from what is registered: kuryr_load_balancer_readiness
Yes, I know :) What's weird is that the Prometheus client produces two child samples for a metric of type counter: one with the suffix "_total" and one with the suffix "_created" [1].
annotations:
message: Kuryr noticed that a Neutron port was in unexpected DOWN state in the last 20 minutes. Most likely this indicates an issue with OpenStack Neutron.
expr: |
changes(kuryr_port_readiness_total[20m]) > 0
ditto
/lgtm
/retest
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: gryf, MaysaMacedo
/retest Please review the full test history for this PR and help us cut down flakes.
/retest-required Please review the full test history for this PR and help us cut down flakes.
/test e2e-openstack-kuryr
/override ci/prow/e2e-gcp-ovn
@abhat: Overrode contexts on behalf of abhat: ci/prow/e2e-gcp-ovn