Added metrics for failures caused by OpenStack services. #536
Looks good. Just a couple of suggestions:
self.load_balancer_readiness = prometheus_client.Counter(
    'kuryr_load_balancer_readiness',
    'Increasing this metric indicates issues with creating Octavia LB',
How about: "This counter is increased when Kuryr notices that an Octavia load balancer is stuck in an unexpected state"?
    registry=self.registry)
self.port_readiness = prometheus_client.Counter(
    'kuryr_port_readiness', 'Increasing this metric indicates issues '
Maybe "This counter is increased when Kuryr times out waiting for Neutron to move port to ACTIVE"?
Looks good, just minor remarks on wording. I'm not 100% confident mine's better though.
In Kuryr-kubernetes we mainly rely on two services: Neutron for port creation and Octavia for load balancers. Sometimes we observe that either ports or load balancers hang in some state indefinitely. In such cases, after the timeout, the Kuryr controller crashes, leaving an ambiguous log message. This patch introduces two new Prometheus metrics, one for ports and the other for load balancers, which are updated in such situations.
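As a rough illustration of the pattern described above (not the code from this patch), the sketch below polls a resource until it reaches a wanted state and bumps a failure counter when the wait times out. The names `wait_for_status`, `ResourceNotReady` and `StandInCounter` are hypothetical; the stand-in only mimics the `inc()` call that a Prometheus counter such as `port_readiness` would expose.

```python
import time


class ResourceNotReady(Exception):
    """Raised when a resource does not reach the wanted state in time."""


class StandInCounter:
    """Hypothetical stand-in exposing the same inc() call as a counter."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        self.value += amount


def wait_for_status(get_status, wanted, failure_counter,
                    timeout=5.0, interval=0.5):
    """Poll get_status() until it returns `wanted` or `timeout` elapses.

    On timeout the counter is incremented before raising, so the failure
    shows up in metrics and not only in the controller log.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_status() == wanted:
            return
        if time.monotonic() >= deadline:
            failure_counter.inc()
            raise ResourceNotReady(
                'resource did not reach %s within %s seconds'
                % (wanted, timeout))
        time.sleep(interval)
```

In the controller, the counter argument would presumably be one of the two new metrics, and `get_status` a call that asks Neutron or Octavia for the current state; both are assumptions here.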
Looks good and the OCP installation passed, but destroy failed, probably due to the Octavia issue.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: gryf, MaysaMacedo