Add alerts for issues with load balancers/ports. #1148

Merged
merged 1 commit into from Jul 14, 2021

@gryf gryf commented Jul 6, 2021

Sometimes Neutron ports and/or Octavia load balancers get stuck in an
unrecoverable state indefinitely. Currently there is nothing we can do
from the Kuryr-Kubernetes side except alert on it.

@openshift-ci openshift-ci bot requested review from dulek and luis5tb July 6, 2021 12:43
Contributor

@dulek dulek left a comment

Awesome - simple but helpful. I left some wording suggestions in the comments; again, I'm not sure they're significantly better.

@@ -89,3 +89,17 @@ spec:
histogram_quantile(0.95, rate(kuryr_cni_request_duration_seconds_bucket[10m])) > 30
labels:
severity: warning
- alert: KuryrLoadBalancerNotReady
annotations:
message: One of the Octavia Load Balancers has timed out with either PENDING_UPDATE/PENDING_CREATE/PENDING_DELETE state
Contributor

Hm, let's try to make sure we're blaming Octavia properly. ;)

Kuryr noticed an Octavia load balancer unexpectedly stuck in PENDING_* state in the last 20 minutes. Most likely this indicates an issue with OpenStack Octavia.

severity: critical
- alert: KuryrPortNotReady
annotations:
message: One of the Neutron port has timed out with making it active
Contributor

And Neutron:

Kuryr noticed that a Neutron port was in unexpected DOWN state in the last 20 minutes. Most likely this indicates an issue with OpenStack Neutron.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 9, 2021
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 9, 2021
@@ -121,3 +121,17 @@ spec:
for: 5m
labels:
severity: critical
- alert: KuryrLoadBalancerNotReady
annotations:
message: Kuryr noticed an Octavia load balancer unexpectedly stuck in PENDING_* state in the last 20 minutes. Most likely this indicates an issue with OpenStack Octavia.
Contributor

Will it only start alerting after 20 min? It might be too long, maybe 10 mins fits better?

Member Author

It works like this: if the counter changes at all within a 20-minute window, the alert is triggered. Since we do reconciliation and the retry loop takes around 10-15 minutes, the alert stays active because the counter is incremented again within the same 20-minute window. The alert clears once the counter stops increasing. Initially @dulek proposed 30 minutes for the window, but IMHO that's a little bit too long.
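
The window logic described above can be sketched in Python. This is a simplified model and not Prometheus's actual `changes()` implementation: it assumes one scrape per minute, a left-open window, and a hypothetical timeline in which a stuck load balancer is retried at 5, 20 and 35 minutes. The helper names (`changes`, `alert_firing`, `scrape`) are illustrative only.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, counter value)
MIN = 60


def changes(samples: List[Sample], now: int, window: int) -> int:
    """Count value changes between consecutive samples in (now - window, now]."""
    in_window = [v for t, v in samples if now - window < t <= now]
    return sum(1 for prev, cur in zip(in_window, in_window[1:]) if cur != prev)


def alert_firing(samples: List[Sample], now: int, window: int = 20 * MIN) -> bool:
    """Model of the rule `changes(counter[20m]) > 0`."""
    return changes(samples, now, window) > 0


def scrape(increment_times: List[int], end: int, interval: int = MIN) -> List[Sample]:
    """Build periodic samples of a counter incremented at the given times."""
    samples = []
    for t in range(0, end + 1, interval):
        value = float(sum(1 for inc in increment_times if inc <= t))
        samples.append((t, value))
    return samples


# Assumed timeline: the counter is incremented at minutes 5, 20 and 35.
samples = scrape([5 * MIN, 20 * MIN, 35 * MIN], end=70 * MIN)

for minute in (3, 10, 30, 50, 60):
    print(minute, alert_firing(samples, minute * MIN))
```

In this model, retries spaced 15 minutes apart keep at least one counter change inside every 20-minute window, so the alert stays active between retries and clears roughly 20 minutes after the last increment.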

Contributor

got it, thanks.

Contributor

@dulek do you still have a preference for 30 min?

Contributor

Not at all, @gryf convinced me to do 20 minutes.

annotations:
message: Kuryr noticed an Octavia load balancer unexpectedly stuck in PENDING_* state in the last 20 minutes. Most likely this indicates an issue with OpenStack Octavia.
expr: |
changes(kuryr_load_balancer_readiness_total[20m]) > 0
Contributor

The name of the metric differs from the one that is registered: kuryr_load_balancer_readiness

Member Author

Yes, I know :) Oddly enough, the Prometheus client produces two child samples for a metric of type counter: one with the suffix "_total" and one with the suffix "_created" [1].

[1] https://github.com/prometheus/client_python/blob/9a24236695c9ad47f9dc537a922a6d1333d8d093/prometheus_client/metrics.py#L276-L280
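
To illustrate the naming described above, here is a toy stand-in for a counter. It is not the `prometheus_client` API: `ToyCounter` and its `samples()` method are hypothetical and only mimic the behavior described in the comment, where a counter registered under one name is exposed as two samples with the `_total` and `_created` suffixes.

```python
import time
from typing import List, Tuple


class ToyCounter:
    """Hypothetical model of a counter metric; NOT the prometheus_client class."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._value = 0.0
        self._created = time.time()  # creation timestamp of the metric

    def inc(self, amount: float = 1.0) -> None:
        """Counters only go up, so negative increments are rejected."""
        if amount < 0:
            raise ValueError("counters can only be incremented")
        self._value += amount

    def samples(self) -> List[Tuple[str, float]]:
        """Return the two exposed samples: <name>_total and <name>_created."""
        return [
            (self.name + "_total", self._value),
            (self.name + "_created", self._created),
        ]


counter = ToyCounter("kuryr_load_balancer_readiness")
counter.inc()
for sample_name, sample_value in counter.samples():
    print(sample_name, sample_value)
```

Under this model, an alert expression has to query the `_total` sample, which is why the rule uses `kuryr_load_balancer_readiness_total` even though the metric is registered without the suffix.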

annotations:
message: Kuryr noticed that a Neutron port was in unexpected DOWN state in the last 20 minutes. Most likely this indicates an issue with OpenStack Neutron.
expr: |
changes(kuryr_port_readiness_total[20m]) > 0
Contributor

ditto

@MaysaMacedo
Contributor

/lgtm

@MaysaMacedo
Contributor

/retest

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 13, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jul 13, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gryf, MaysaMacedo

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 13, 2021
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments
@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

9 similar comments
@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

4 similar comments
@dulek
Contributor

dulek commented Jul 14, 2021

/test e2e-openstack-kuryr

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@MaysaMacedo
Contributor

/test e2e-openstack-kuryr

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@abhat
Contributor

abhat commented Jul 14, 2021

/override ci/prow/e2e-gcp-ovn

@openshift-ci
Contributor

openshift-ci bot commented Jul 14, 2021

@abhat: Overrode contexts on behalf of abhat: ci/prow/e2e-gcp-ovn

In response to this:

/override ci/prow/e2e-gcp-ovn

@openshift-ci
Contributor

openshift-ci bot commented Jul 14, 2021

@gryf: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-openstack-ovn | 88bd669 | link | /test e2e-openstack-ovn |
| ci/prow/e2e-gcp-ovn-upgrade | 88bd669 | link | /test e2e-gcp-ovn-upgrade |
| ci/prow/e2e-ovn-hybrid-step-registry | 88bd669 | link | /test e2e-ovn-hybrid-step-registry |
| ci/prow/e2e-azure-ovn | 88bd669 | link | /test e2e-azure-ovn |
| ci/prow/e2e-vsphere-windows | 88bd669 | link | /test e2e-vsphere-windows |
| ci/prow/e2e-vsphere-ovn | 88bd669 | link | /test e2e-vsphere-ovn |

@openshift-merge-robot openshift-merge-robot merged commit 18c4ad6 into openshift:master Jul 14, 2021
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
6 participants