SDN-3444: OVNKubernetesControllerDisconnectedSouthboundDatabase runbook #68
Updated from 9f487e7 to 697d4a8.
Looks good, some minor questions
2 resolved review threads on alerts/cluster-network-operator/OVNKubernetesControllerDisconnectedSouthboundDatabase.md
LGTM, thanks @kyrtapz! Just a few suggestions, which I'm OK with even if they aren't addressed, since they aren't major concerns. Overall I like this runbook idea, but I feel a bit bad that we can't provide many concrete steps in the mitigation section, since these are complex problems to solve and there is no single step we could take.
3 resolved review threads on alerts/cluster-network-operator/OVNKubernetesControllerDisconnectedSouthboundDatabase.md (one of them on an outdated diff)
Updated from 79ac8dc to a19b5b6.
…se alert (Signed-off-by: Patryk Diak <pdiak@redhat.com>)
Updated from a19b5b6 to b0a2192.
```markdown
- [NoRunningOvnMaster](./NoRunningOvnMaster.md)

### OVN-kubernetes master pods
```
Maybe call this section something like "Check sbdb". We need to find the sbdb leader and check only its logs, because that instance will accept connections.
We can find the sbdb leader with the network-tools command:

```shell
oc adm must-gather --image=quay.io/openshift/origin-network-tools:latest -- network-tools ovn-get leaders
```

and then check the sbdb logs only on the returned pod instead of on all pods? Also, if this command finds no leader, that is a good point at which to file a bug (hopefully we will soon have a separate alert and runbook for a missing DB leader).
WDYT?
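A minimal sketch of the leader-only check, assuming the leader's pod name has already been read from the `network-tools ovn-get leaders` output (that output format is not parsed here) and that the southbound DB runs in a container named `sbdb` in the `openshift-ovn-kubernetes` namespace. `show_sbdb_logs` is a hypothetical helper, not part of network-tools:

```shell
# Hypothetical helper: print the last LINES lines (default 100) of the sbdb
# container log for a single ovnkube-master pod, e.g. the SBDB leader.
# Assumption: the southbound DB container is named "sbdb".
show_sbdb_logs() {
  local pod="$1" lines="${2:-100}"
  if [ -z "$pod" ]; then
    echo "usage: show_sbdb_logs <ovnkube-master pod> [lines]" >&2
    return 1
  fi
  oc logs "$pod" -n openshift-ovn-kubernetes -c sbdb --tail="$lines"
}
```

For example, `show_sbdb_logs <leader pod> 200` would print the last 200 lines of that pod's sbdb log.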
I think ovn-controller can connect to any sbdb instance:

```shell
# Print the SBDB endpoint(s) that each node's ovn-controller reports connecting to
for pod in $(oc get pod -n openshift-ovn-kubernetes -l app=ovnkube-node -o jsonpath='{..metadata.name}')
do
  echo "${pod}:"
  oc logs "${pod}" -n openshift-ovn-kubernetes -c ovn-controller | grep "ssl.*connected"
done
```
```text
ovnkube-node-2bmrc:
2022-09-08T09:12:57.096Z|00010|reconnect|INFO|ssl:10.0.128.155:9642: connected
ovnkube-node-68pwf:
2022-09-08T09:07:06.821Z|00024|reconnect|INFO|ssl:10.0.128.155:9642: connected
ovnkube-node-dhnks:
2022-09-08T09:12:54.577Z|00008|reconnect|INFO|ssl:10.0.168.127:9642: connected
ovnkube-node-ppjbq:
2022-09-08T09:07:06.812Z|00024|reconnect|INFO|ssl:10.0.149.143:9642: connected
ovnkube-node-psdrj:
2022-09-08T09:13:00.102Z|00008|reconnect|INFO|ssl:10.0.168.127:9642: connected
ovnkube-node-rp8hv:
2022-09-08T09:07:08.338Z|00024|reconnect|INFO|ssl:10.0.128.155:9642: connected
```
So I think it is still worth checking all of the SBDBs, as any of them can potentially be picked as an endpoint.
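If every SBDB instance is checked instead, the node-side loop above can be mirrored on the master side. This is a sketch that assumes the ovnkube-master pods carry the label `app=ovnkube-master` and that the southbound DB container is named `sbdb`; `show_all_sbdb_logs` is a hypothetical name:

```shell
# Hypothetical helper: print the tail (default 100 lines) of the sbdb container
# log of every ovnkube-master pod, because any SBDB instance may be serving
# some of the ovn-controllers.
# Assumptions: label app=ovnkube-master, container name "sbdb".
show_all_sbdb_logs() {
  local lines="${1:-100}" pod
  for pod in $(oc get pod -n openshift-ovn-kubernetes -l app=ovnkube-master \
      -o jsonpath='{..metadata.name}'); do
    echo "${pod}:"
    oc logs "$pod" -n openshift-ovn-kubernetes -c sbdb --tail="$lines"
  done
}
```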
Correct: to avoid scaling issues, about one third of the controllers connect to sbdbA, one third to sbdbB, and one third to sbdbC.
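To see how the controllers are actually spread across the SBDB instances, the output of the `ovn-controller` loop above can be summarised per endpoint. A minimal sketch; `count_sbdb_endpoints` is a hypothetical helper that assumes the `<pod>:` / log-line layout shown above and keeps only the most recent `connected` line per pod:

```shell
# Hypothetical helper: read the "<pod>:" / log-line output of the ovn-controller
# loop on stdin and print "<sbdb endpoint> <controller count>", sorted by endpoint.
# Only the last "connected" line seen for each pod is counted.
count_sbdb_endpoints() {
  awk '
    # A pod header line looks like "ovnkube-node-2bmrc:" (no "|" separators).
    /^[^|]+:$/ { pod = substr($0, 1, length($0) - 1); next }
    # An ovn-controller line ends with "ssl:<ip>:<port>: connected".
    match($0, /ssl:[0-9.]+:[0-9]+: connected/) {
      last[pod] = substr($0, RSTART, RLENGTH - length(": connected"))
    }
    END {
      for (p in last) count[last[p]]++
      for (ep in count) print ep, count[ep]
    }
  ' | LC_ALL=C sort
}
```

Usage: pipe the loop into it (`for pod in ...; done | count_sbdb_endpoints`). Fed the six-pod sample above, it reports 3 controllers on 10.0.128.155, 1 on 10.0.149.143 and 2 on 10.0.168.127, so the split is approximately, not exactly, even.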
/lgtm
@martinkennelly PTAL
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED.
This pull request has been approved by: kyrtapz, martinkennelly, tssurya.
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@kyrtapz: all tests passed!
/cc @tssurya @martinkennelly @npinaeva
Signed-off-by: Patryk Diak <pdiak@redhat.com>