OCPBUGS-18340: Update runbooks for ovn-ic #136

bpickard22 · 2023-09-13T22:49:11Z

Remove runbooks for alerts that are no longer reported now that ovn-ic is implemented, and update the runbooks for alerts that are still reported so they are up to date with ovn-ic.

openshift-ci-robot · 2023-09-13T22:49:16Z

@bpickard22: This pull request references Jira Issue OCPBUGS-18340, which is invalid:

expected the bug to target the "4.15.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Remove runbooks for alerts that are no longer reported now that ovn-ic is implemented, and update the runbooks for alerts that are still reported so they are up to date with ovn-ic.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2023-09-13T22:49:27Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bpickard22
Once this PR has been reviewed and has the lgtm label, please assign nautilux for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

martinkennelly · 2023-09-14T10:51:07Z

/assign @martinkennelly
Can you fix the markdown lint?

martinkennelly · 2023-09-18T08:22:28Z

@bpickard22 After a quick look, it seems you aren't accounting for versioning. Unfortunately, this repo doesn't have versioning, so the runbooks must cover both 4.13 and below and 4.14 and above :(
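To make the version split concrete, here is a minimal sketch of how a reader might decide which part of a runbook applies to a given cluster version. The function name `runbook_section_for_version` and the labels `legacy` and `ovn-ic` are hypothetical and chosen only for illustration; the 4.14 boundary is the one stated in the comment above.

```shell
# Hypothetical helper, not part of the runbooks repo: map an OpenShift
# version string (for example "4.13.9") to the runbook section that applies.
# Assumption: 4.14 is the first release covered by the ovn-ic instructions.
runbook_section_for_version() {
  version=$1
  major=${version%%.*}   # text before the first dot
  rest=${version#*.}     # text after the first dot
  minor=${rest%%.*}      # minor version, ignoring any patch level
  if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 14 ]; }; then
    echo "ovn-ic"
  else
    echo "legacy"
  fi
}
```

For example, `runbook_section_for_version 4.13.9` selects the pre-ic instructions, while any 4.14 or later version selects the ovn-ic instructions.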

martinkennelly · 2023-09-18T08:23:12Z

I'll take another look when you've updated the runbooks with versioning and the lint is passing.

martinkennelly · 2023-09-18T08:23:24Z

Any questions - DM me or ask here.

remove runbooks for alerts no longer reported with implementation of ovn-ic and update runbooks related to alerts still reported to be up to date with ovn-ic Signed-off-by: Ben Pickard <bpickard@redhat.com>

bpickard22 · 2023-09-21T00:40:10Z

/jira refresh

openshift-ci-robot · 2023-09-21T00:40:16Z

@bpickard22: This pull request references Jira Issue OCPBUGS-18340, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.15.0) matches configured target version for branch (4.15.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (weliang@redhat.com), skipping review request.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Add a legacy section to runbooks with info needed for pre-ic clusters Signed-off-by: Ben Pickard <bpickard@redhat.com>

openshift-ci · 2023-09-28T21:11:00Z

@bpickard22: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

martinkennelly

first pass

alerts/cluster-network-operator/NoOvnClusterManagerLeader.md

martinkennelly · 2023-10-25T12:02:42Z

alerts/cluster-network-operator/NoOvnClusterManagerLeader.md

+
+### If one of the ovnkube-control-plane pods is not running
+
+The ovnkube-cluster-manager container in the ovn kubernetes control-plane pod


This reads more like diagnosis than mitigation. In the mitigation section, I think we should list a few error scenarios and say what to do for each one. Users can find out why the leader election failed from the logs, using the steps in the diagnosis section.

Leader election failed because of a timeout
In this scenario, we want to check two things: whether the API server is overloaded, and whether the cluster manager (CM) can reach the API server. If it is overloaded, reduce the load on the API server. If it is not overloaded, check the connection between the node and the API server endpoints.

Leader election failed because of connection refused
In this scenario, the API server is not serving at the specific IP and port. Restore API server functionality.

We can add more scenarios as they arise.

It may also help to look at why leader election failed in the past to find more examples.
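The two scenarios above could be sketched as a small log classifier. This is only an illustration: the function name `classify_election_failure` is hypothetical, and the matched strings (`connection refused`, `timeout`, `timed out`, `deadline exceeded`) are assumptions about typical Go client error text, not confirmed ovnkube-cluster-manager log output.

```shell
# Hypothetical helper: read leader-election log lines on stdin and print
# which of the two scenarios they most likely match.
# Assumption: the error text contains one of the patterns below.
classify_election_failure() {
  logs=$(cat)
  if printf '%s\n' "$logs" | grep -qi 'connection refused'; then
    # API server is not serving at that IP and port: restore the API server.
    echo "connection-refused"
  elif printf '%s\n' "$logs" | grep -qiE 'timeout|timed out|deadline exceeded'; then
    # Check API server load first, then node-to-API-server connectivity.
    echo "timeout"
  else
    echo "unknown"
  fi
}

# Possible usage with the log command from the runbook's diagnosis steps:
#   oc logs -n openshift-ovn-kubernetes ovnkube-control-plane-xxxxx \
#     --all-containers | grep elect | classify_election_failure
```

Connection refused is checked first because it points at a single, specific fix; anything that matches neither pattern is reported as `unknown` so that new scenarios can be added later.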

martinkennelly · 2023-10-25T12:05:20Z

alerts/cluster-network-operator/NoOvnClusterManagerLeader.md

+
+    oc logs -n openshift-ovn-kubernetes ovnkube-control-plane-xxxxx --all-containers | grep elect
+
+## Mitigation


Overall, I think this section should list some of the scenarios found by the diagnosis in the previous section and offer an action for each. I don't think we need to get into nodes not being Ready or the pod not being ready. WDYT?

bpickard22 · 2023-10-30

Yeah, I think that makes sense. Let's go with that.

martinkennelly · 2023-10-25T12:10:46Z

alerts/cluster-network-operator/NoRunningOvnControlPlane.md

 [Running][PodRunning]
-This is a critical-level alert if no OVN-Kubernetes master control plane pods
+This is a critical-level alert if no OVN-Kubernetes control plane pods


In the impact section, I think you should list the services that may not be working. I listed them in a previous comment.

martinkennelly · 2023-10-25T14:43:48Z

alerts/cluster-network-operator/NoOvnClusterManagerLeader.md

+
+`holder` shown above, contains the node name where the leader pod
+resides.
+Check the logs for any of the running ovnkube-control-plane to see if there is


I think we should just stick with the one found in the lease, if one is even set there.
Also: "any of the running ovnkube-control-plane pods".

martinkennelly · 2023-10-25T14:44:31Z

alerts/cluster-network-operator/NoOvnClusterManagerLeader.md

+`holder` shown above, contains the node name where the leader pod
+resides.
+Check the logs for any of the running ovnkube-control-plane to see if there is
+leader election happened and if there is an error occurred.


This is bad English; please rephrase.

martinkennelly · 2023-10-25T14:48:22Z

alerts/cluster-network-operator/NoOvnClusterManagerLeader.md

+
+### If all the ovnkube-control-plane pods are not running
+
+Check the status of the ovnkube-control-plane pods, and follow the


I'd remove this. Cluster admins and SREs understand the pod lifecycle.

martinkennelly · 2023-10-25T14:48:42Z

alerts/cluster-network-operator/NoOvnClusterManagerLeader.md

+
+### If all the ovnkube-control-plane pods are running
+
+Follow the steps above: [OVN-Kubernetes master pods](#ovn-kubernetes-control-plane-pods)


"control plane pods", not "master pods".

martinkennelly · 2023-10-25T14:50:15Z

alerts/cluster-network-operator/NoOvnClusterManagerLeader.md

+
+## Mitigation
+
+### If the control plane nodes are not running


Why do you want to follow the disaster and recovery doc here? I looked at its criteria, and I don't know if CM not being able to become leader warrants using that doc.

tssurya · 2024-01-29T13:39:52Z

@bpickard22: PTAL, this bug has been open too long.

openshift-bot · 2024-04-29T01:00:30Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2024-05-29T08:31:01Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

martinkennelly · 2024-05-29T08:39:57Z

/remove-lifecycle rotten

martinkennelly · 2024-05-29T08:40:33Z

Ben, we should proceed on this. Let me know when you're ready for review.

openshift-ci bot requested review from npinaeva and ravitri September 13, 2023 22:49

openshift-ci bot assigned martinkennelly Sep 14, 2023

OCPBUGS-18340: Update runbooks for ovn-ic

ec2594e

remove runbooks for alerts no longer reported with implementation of ovn-ic and update runbooks related to alerts still reported to be up to date with ovn-ic Signed-off-by: Ben Pickard <bpickard@redhat.com>

bpickard22 force-pushed the ic-runbooks branch from 66a092b to ec2594e Compare September 21, 2023 00:36

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 21, 2023

Add legacy info to updated alerts

b56eda8

Add a legacy section to runbooks with info needed for pre-ic clusters Signed-off-by: Ben Pickard <bpickard@redhat.com>

martinkennelly reviewed Oct 25, 2023

openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2024

openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 29, 2024

openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 29, 2024
