SDN-3444: OVNKubernetesControllerDisconnectedSouthboundDatabase runbook #68

kyrtapz · 2022-09-06T15:25:32Z

Signed-off-by: Patryk Diak <pdiak@redhat.com>

npinaeva

Looks good, some minor questions

alerts/cluster-network-operator/OVNKubernetesControllerDisconnectedSouthboundDatabase.md

tssurya

LGTM, thanks @kyrtapz! Just a few suggestions, and I'm OK even if they aren't addressed since they aren't major concerns. Overall I like this runbook idea, but I feel a bit bad that we can't provide many concrete steps in the mitigation part, since these are complex problems to solve and there is no single step we could take.

alerts/cluster-network-operator/OVNKubernetesControllerDisconnectedSouthboundDatabase.md


npinaeva · 2022-09-08T11:08:30Z

alerts/cluster-network-operator/OVNKubernetesControllerDisconnectedSouthboundDatabase.md

+
+- [NoRunningOvnMaster](./NoRunningOvnMaster.md)
+
+### OVN-kubernetes master pods


Maybe call this section something like "Check sbdb". We need to find the sbdb leader and check only its logs, because that instance will accept connections.
We can find the sbdb leader with a network-tools command, oc adm must-gather --image=quay.io/openshift/origin-network-tools:latest -- network-tools ovn-get leaders, and then check the sbdb logs on the returned pod only, instead of on all pods. Also, if this command finds no leader, this is a good place to create a bug (and hopefully we will soon have a separate alert and runbook for no db leader).
wdyt?

I think that OVN controller can connect to any sbdb instance:

for pod in $(oc get pod -n openshift-ovn-kubernetes -l app=ovnkube-node -o jsonpath={..metadata.name})
do
  echo "${pod}:"
  oc logs ${pod} -n openshift-ovn-kubernetes -c ovn-controller | grep "ssl.*connected"
done

ovnkube-node-2bmrc:
2022-09-08T09:12:57.096Z|00010|reconnect|INFO|ssl:10.0.128.155:9642: connected
ovnkube-node-68pwf:
2022-09-08T09:07:06.821Z|00024|reconnect|INFO|ssl:10.0.128.155:9642: connected
ovnkube-node-dhnks:
2022-09-08T09:12:54.577Z|00008|reconnect|INFO|ssl:10.0.168.127:9642: connected
ovnkube-node-ppjbq:
2022-09-08T09:07:06.812Z|00024|reconnect|INFO|ssl:10.0.149.143:9642: connected
ovnkube-node-psdrj:
2022-09-08T09:13:00.102Z|00008|reconnect|INFO|ssl:10.0.168.127:9642: connected
ovnkube-node-rp8hv:
2022-09-08T09:07:08.338Z|00024|reconnect|INFO|ssl:10.0.128.155:9642: connected

So I think it is still worth checking all of the SBDBs, as any of them can be picked as an endpoint.

Correct. To avoid scaling issues, 1/3rd of the controllers connect to sbdbA, 1/3rd to sbdbB, and 1/3rd to sbdbC.
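To see how the controllers are spread across the SBDB instances, the "connected" lines from the loop above can be tallied per endpoint. This is only a rough sketch: count_sbdb_endpoints is a hypothetical helper name, and it assumes the log format shown in this thread with IPv4 endpoints.

```shell
#!/usr/bin/env bash
# Tally ovn-controller "connected" log lines per SBDB endpoint.
# Reads log lines on stdin in the format shown in this thread, e.g.
#   2022-09-08T09:12:57.096Z|00010|reconnect|INFO|ssl:10.0.128.155:9642: connected
# count_sbdb_endpoints is a hypothetical helper, not part of oc or network-tools.
# Assumption: endpoints are IPv4 addresses of the form ssl:<ip>:<port>.
count_sbdb_endpoints() {
  grep -o 'ssl:[0-9.]*:[0-9]*: connected' \
    | sed -e 's/^ssl://' -e 's/: connected$//' \
    | sort | uniq -c \
    | awk '{ print $2, $1 }'   # print "<endpoint> <count>"
}

# Possible usage against a live cluster (not executed here). Only the most
# recent "connected" line per pod is kept, so each controller counts once:
#
# for pod in $(oc get pod -n openshift-ovn-kubernetes -l app=ovnkube-node \
#                -o jsonpath='{..metadata.name}'); do
#   oc logs "${pod}" -n openshift-ovn-kubernetes -c ovn-controller \
#     | grep "ssl.*connected" | tail -n 1
# done | count_sbdb_endpoints
```

An endpoint that is missing from the tally, or that has far fewer controllers than the others, would be a reasonable place to start looking at the sbdb logs.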

tssurya · 2022-09-08T15:28:45Z

/lgtm

kyrtapz · 2022-09-21T13:25:11Z

@martinkennelly ptal

openshift-bot · 2022-12-21T01:00:51Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kyrtapz · 2022-12-21T08:29:52Z

/remove-lifecycle stale

martinkennelly · 2023-03-15T12:49:23Z

/lgtm

martinkennelly · 2023-03-15T12:49:35Z

/approve

openshift-ci · 2023-03-15T12:49:50Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kyrtapz, martinkennelly, tssurya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~alerts/cluster-network-operator/OWNERS~~ [martinkennelly]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2023-03-15T12:56:49Z

@kyrtapz: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci bot requested review from martinkennelly, npinaeva and tssurya September 6, 2022 15:25

kyrtapz force-pushed the controller_disconnect branch from 9f487e7 to 697d4a8 on September 6, 2022 15:46

npinaeva reviewed Sep 7, 2022


tssurya reviewed Sep 7, 2022


kyrtapz force-pushed the controller_disconnect branch 2 times, most recently from 79ac8dc to a19b5b6, on September 7, 2022 12:21

kyrtapz mentioned this pull request Sep 8, 2022

SDN-3444: Add runbook url for SBDB connectivity alert openshift/cluster-network-operator#1553

Merged

Add a runbook for OVNKubernetesControllerDisconnectedSouthboundDatabase alert

Signed-off-by: Patryk Diak <pdiak@redhat.com>

b0a2192

kyrtapz force-pushed the controller_disconnect branch from a19b5b6 to b0a2192 on September 8, 2022 11:00

npinaeva reviewed Sep 8, 2022


openshift-ci bot assigned tssurya Sep 8, 2022

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) Sep 8, 2022

openshift-ci bot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) Dec 21, 2022

openshift-ci bot removed the lifecycle/stale label Dec 21, 2022

openshift-ci bot assigned martinkennelly Mar 15, 2023

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Mar 15, 2023

openshift-merge-robot merged commit dd028cf into openshift:master Mar 15, 2023
