SDN-3444: OVNKubernetesControllerDisconnectedSouthboundDatabase runbook #68

Merged

Conversation

kyrtapz
Contributor

@kyrtapz kyrtapz commented Sep 6, 2022

/cc @tssurya @martinkennelly @npinaeva

Signed-off-by: Patryk Diak <pdiak@redhat.com>

Member

@npinaeva npinaeva left a comment

Looks good, some minor questions


@tssurya tssurya left a comment

LGTM, thanks @kyrtapz! Just a few suggestions, which I'm OK with even if they are not addressed, since they aren't major concerns. Overall I like this runbook idea, but I feel a bit bad that we can't provide many concrete steps in the mitigation part, since these are complex problems to solve and there is no single step we could take.

…se alert

Signed-off-by: Patryk Diak <pdiak@redhat.com>

- [NoRunningOvnMaster](./NoRunningOvnMaster.md)

### OVN-kubernetes master pods
Member

Maybe call this section something like "Check sbdb". We need to find the sbdb leader and check only its logs, because that is the instance that accepts connections.
We can find the sbdb leader with a network-tools command, oc adm must-gather --image=quay.io/openshift/origin-network-tools:latest -- network-tools ovn-get leaders, and then check the sbdb logs on the returned pod only instead of on all pods. Also, if this command finds no leader, that is a good point to file a bug (hopefully we will soon have a separate alert and runbook for a missing db leader).
WDYT?

Contributor Author

@kyrtapz kyrtapz Sep 8, 2022

I think the OVN controller can connect to any sbdb instance:

# Show which SBDB endpoint each ovn-controller is connected to
for pod in $(oc get pod -n openshift-ovn-kubernetes -l app=ovnkube-node -o jsonpath='{..metadata.name}')
do
  echo "${pod}:"
  oc logs "${pod}" -n openshift-ovn-kubernetes -c ovn-controller | grep "ssl.*connected"
done
ovnkube-node-2bmrc:
2022-09-08T09:12:57.096Z|00010|reconnect|INFO|ssl:10.0.128.155:9642: connected
ovnkube-node-68pwf:
2022-09-08T09:07:06.821Z|00024|reconnect|INFO|ssl:10.0.128.155:9642: connected
ovnkube-node-dhnks:
2022-09-08T09:12:54.577Z|00008|reconnect|INFO|ssl:10.0.168.127:9642: connected
ovnkube-node-ppjbq:
2022-09-08T09:07:06.812Z|00024|reconnect|INFO|ssl:10.0.149.143:9642: connected
ovnkube-node-psdrj:
2022-09-08T09:13:00.102Z|00008|reconnect|INFO|ssl:10.0.168.127:9642: connected
ovnkube-node-rp8hv:
2022-09-08T09:07:08.338Z|00024|reconnect|INFO|ssl:10.0.128.155:9642: connected

So I think it is still worth checking all of the SBDBs, since any of them can potentially be picked as an endpoint.
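
As a rough illustration of checking every SBDB instance rather than only the leader, the loop above could be adapted along these lines. This is a minimal sketch, not the runbook's text: the `app=ovnkube-master` label, the `sbdb` container name, the `--tail=50` limit, and the helper name `check_all_sbdb_logs` are all assumptions and may need adjusting for a given cluster.

```shell
# Sketch only: the app=ovnkube-master label and the sbdb container name are
# assumptions about the cluster layout; adjust them to match your deployment.
OC="${OC:-oc}"
NAMESPACE="openshift-ovn-kubernetes"

# Print the most recent sbdb log lines from every master pod, one pod at a time.
check_all_sbdb_logs() {
  for pod in $("$OC" get pod -n "$NAMESPACE" -l app=ovnkube-master -o jsonpath='{..metadata.name}'); do
    echo "${pod}:"
    "$OC" logs "$pod" -n "$NAMESPACE" -c sbdb --tail=50
  done
}
```

Calling `check_all_sbdb_logs` prints each pod name followed by its recent sbdb log lines. The `OC` variable is only there so that the command can be swapped out, for example for a stub when trying the function without a cluster.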

Correct. To avoid scaling issues, one third of the controllers connect to sbdbA, one third to sbdbB, and one third to sbdbC.
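
To see how the controllers are actually spread across the SBDB endpoints, the "connected" log lines from the loop earlier in this thread can be tallied per endpoint. The following is a minimal sketch; the function name `count_sbdb_connections` is hypothetical, and it assumes the pipe-separated ovn-controller log format shown in the output above.

```shell
# Count how many ovn-controller instances connected to each SBDB endpoint.
# Reads log lines on stdin. Lines whose fifth pipe-separated field is not
# "ssl:<ip>:<port>: connected" (such as the "pod-name:" headers) are ignored.
count_sbdb_connections() {
  awk -F'|' '
    $5 ~ /^ssl:.*: connected$/ {
      endpoint = $5
      sub(/^ssl:/, "", endpoint)
      sub(/: connected$/, "", endpoint)
      count[endpoint]++
    }
    END { for (e in count) print count[e], e }
  ' | sort -k1,1nr -k2,2
}
```

Piping the output of that loop into `count_sbdb_connections` prints one line per endpoint, with the connection count first and the busiest endpoint at the top. For the six log lines posted above, that gives three controllers on 10.0.128.155:9642, two on 10.0.168.127:9642, and one on 10.0.149.143:9642.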

@tssurya

tssurya commented Sep 8, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 8, 2022
@kyrtapz
Contributor Author

kyrtapz commented Sep 21, 2022

@martinkennelly ptal

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 21, 2022
@kyrtapz
Contributor Author

kyrtapz commented Dec 21, 2022

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 21, 2022
@martinkennelly
Contributor

/lgtm

@martinkennelly
Contributor

/approve

@openshift-ci
Contributor

openshift-ci bot commented Mar 15, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kyrtapz, martinkennelly, tssurya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 15, 2023
@openshift-ci
Contributor

openshift-ci bot commented Mar 15, 2023

@kyrtapz: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit dd028cf into openshift:master Mar 15, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
6 participants