OCPBUGS-18340: Update runbooks for ovn-ic #136
base: master
Conversation
@bpickard22: This pull request references Jira Issue OCPBUGS-18340, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: bpickard22. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/assign @martinkennelly
@bpickard22 After a quick look, it seems you aren't accounting for versioning. Unfortunately, this repo doesn't have versioning, so you must include both 4.13 and below and 4.14 and above :(
I'll take another look once you've updated the runbooks with versioning and they're passing the lint.
Any questions? DM me or ask here.
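Since the repo has no versioning, the split the reviewer asks for could be expressed with per-version sections inside each runbook. A sketch of one possible layout (the heading wording here is illustrative, not a repo convention):

```markdown
## 4.14 and above (ovn-ic)

Diagnosis and mitigation steps for clusters running the interconnect
architecture, where the control plane runs in ovnkube-control-plane pods.

## 4.13 and below (legacy)

Steps for clusters still running the centralized ovnkube-master architecture.
```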
remove runbooks for alerts no longer reported with implementation of ovn-ic and update runbooks related to alerts still reported to be up to date with ovn-ic Signed-off-by: Ben Pickard <bpickard@redhat.com>
Force-pushed from 66a092b to ec2594e (Compare)
/jira refresh
@bpickard22: This pull request references Jira Issue OCPBUGS-18340, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (weliang@redhat.com), skipping review request. In response to this:
Add a legacy section to runbooks with info needed for pre-ic clusters. Signed-off-by: Ben Pickard <bpickard@redhat.com>
@bpickard22: all tests passed! Full PR test history. Your PR dashboard.
first pass
### If one of the ovnkube-control-plane pods is not running

The ovnkube-cluster-manager container in the ovn kubernetes control-plane pod
This seems more like diagnosis than mitigation. I think in this mitigation section, we should list a few error scenarios and then what to do for each scenario. Users can get why the leader election failed from the logs in the diagnosis steps.
- Leader election failed because of timeout: in this scenario, we want to check two things: whether the API server is overloaded, and whether the cluster manager can reach the API server. If it's overloaded, reduce load on the API server. If it's not overloaded, check the connection between the node and the API server endpoints.
- Leader election failed because of connection refused: in this scenario, the API server is not serving at the specific IP and port. Restore API server functionality.
We can add more when more scenarios arise.
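The two scenarios above can be triaged directly from saved pod logs. A minimal sketch (the error strings below are illustrative assumptions, not verbatim ovn-kubernetes output; real messages vary by version):

```shell
# Sample lines standing in for the output of:
#   oc logs -n openshift-ovn-kubernetes ovnkube-control-plane-xxxxx --all-containers | grep elect
# These exact messages are hypothetical placeholders for real log content.
logs='error retrieving resource lock: context deadline exceeded
failed to acquire lease: connection refused'

# Timeouts suggest an overloaded or unreachable API server: check API server
# load, and the connection between the node and the API server endpoints.
timeouts=$(printf '%s\n' "$logs" | grep -c 'deadline exceeded')

# "connection refused" means nothing is serving at the API server IP:port:
# restore API server functionality first.
refused=$(printf '%s\n' "$logs" | grep -c 'connection refused')

echo "timeouts=$timeouts refused=$refused"
```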
Maybe search for past examples of why leader election failed, to find more scenarios.
oc logs -n openshift-ovn-kubernetes ovnkube-control-plane-xxxxx --all-containers | grep elect
## Mitigation
I think overall in this section, we should list some scenarios found in the previous section's diagnoses and offer some actions. We don't need to get into nodes not being Ready, or the pod not being ready, I think. WDYT?
Yeah, I think that makes sense. Let's go with that.
[Running][PodRunning]
This is a critical-level alert if no OVN-Kubernetes master control plane pods
This is a critical-level alert if no OVN-Kubernetes control plane pods
In the impact section, I think we should put in the services that may not be working. I listed them in a previous comment.
`holder` shown above, contains the node name where the leader pod
resides.
Check the logs for any of the running ovnkube-control-plane to see if there is
I think just stick with the one found in the lease, IF there is even one set there.
Also, "any of the running ovnkube-control-plane" should read "any of the running ovnkube-control-plane pods".
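Reading the holder out of the lease, as the reviewer suggests, could look like the sketch below. The jsonpath expression is standard `oc` usage, but the lease name varies, so list the leases first; the offline extraction at the end uses a made-up sample document so the parsing step is reproducible:

```shell
# On a live cluster (lease name is cluster-specific; discover it first):
#   oc get lease -n openshift-ovn-kubernetes
#   oc get lease <lease-name> -n openshift-ovn-kubernetes -o jsonpath='{.spec.holderIdentity}'

# Offline equivalent: extract holderIdentity from saved lease JSON.
# The JSON below is a hypothetical sample, not real cluster output.
lease_json='{"spec":{"holderIdentity":"master-0","leaseDurationSeconds":60}}'
holder=$(printf '%s' "$lease_json" | sed -n 's/.*"holderIdentity":"\([^"]*\)".*/\1/p')
echo "holder: $holder"
```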
`holder` shown above, contains the node name where the leader pod
resides.
Check the logs for any of the running ovnkube-control-plane to see if there is
leader election happened and if there is an error occurred.
This is bad English; rephrase.
### If all the ovnkube-control-plane pods are not running

Check the status of the ovnkube-control-plane pods, and follow the
I'd remove this. Cluster admins or SREs understand this pod lifecycle.
### If all the ovnkube-control-plane pods are running

Follow the steps above: [OVN-Kubernetes master pods](#ovn-kubernetes-control-plane-pods)
control plane pods
## Mitigation

### If the control plane nodes are not running
Why do you want to follow the disaster and recovery doc here? I looked at the criteria for it, and I don't know if the CM not being able to become a leader warrants using this doc.
@bpickard22: PTAL, the bug has been open too long.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now, please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now, please do so with /close. /lifecycle rotten
/remove-lifecycle rotten
Ben, we should proceed on this. Let me know when you're ready for review.