
[release-4.11] OCPBUGS-3490: OVN-Kubernetes: Prefer oldest nodes #1641

Merged

Commits on Nov 28, 2022

  1. ovn: prefer oldest nodes for RAFT cluster

    Sometimes the number of masters changes, such as in the etcd test:
    
    etcd [apigroup:config.openshift.io] is able to vertically scale up and down with a single node
    
    This leads to problems like:
    
    I0909 11:16:02.221234       1 ovn_kubernetes.go:938] Waiting to complete OVN bootstrap: found (4) master nodes out of (3) expected: timing out in 235 seconds
    
    ovsdb-server only ever wants an odd number of members to ensure consensus in
    RAFT clusters. If we have 4 members and one of them is dead (like when the
    4th one gets deleted) the RAFT cluster becomes unstable.
    
    The CNO currently renders the ovnkube master pods with the IP addresses of all
    master nodes, regardless of how many control plane nodes were actually
    requested at install time. That's not cool. Don't do that.
    
    Instead, take the oldest master nodes (sorted by creation time) as the
    RAFT cluster members. Tell any NB/SB containers that aren't in the list
    to do nothing for a really long time (to prevent CrashLoopBackOff due to
    early exits from the container script) and not join the cluster.
    
    If this really is a master replacement, then the cluster will shift over
    to the new master when the original one is finally removed.
    
    Signed-off-by: Dan Williams <dcbw@redhat.com>
    (cherry picked from commit c0c317e)
    dcbw authored and kyrtapz committed Nov 28, 2022
    SHA: 6063452
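    The oldest-nodes selection described above can be sketched in Go. This is a hypothetical, self-contained illustration (the `node` type and `oldestNodes` function are made up for this sketch; the CNO works with `corev1.Node` objects): sort candidates by creation timestamp and truncate to the expected member count, so a newly added 4th master is excluded from the RAFT membership until an older one is removed.

    ```go
    package main

    import (
    	"fmt"
    	"sort"
    	"time"
    )

    // node is a hypothetical stand-in for a Kubernetes Node's metadata.
    type node struct {
    	name    string
    	created time.Time
    }

    // oldestNodes returns the n oldest nodes by creation time, mirroring
    // the RAFT-member selection described in the commit message.
    func oldestNodes(nodes []node, n int) []node {
    	sorted := append([]node(nil), nodes...) // don't mutate the caller's slice
    	sort.Slice(sorted, func(i, j int) bool {
    		return sorted[i].created.Before(sorted[j].created)
    	})
    	if n > len(sorted) {
    		n = len(sorted)
    	}
    	return sorted[:n]
    }

    func main() {
    	base := time.Date(2022, 9, 9, 11, 0, 0, 0, time.UTC)
    	nodes := []node{
    		{"master-3", base.Add(3 * time.Hour)}, // newly added replacement node
    		{"master-0", base},
    		{"master-2", base.Add(2 * time.Hour)},
    		{"master-1", base.Add(1 * time.Hour)},
    	}
    	// With 3 expected members, the newest node (master-3) is left out.
    	for _, n := range oldestNodes(nodes, 3) {
    		fmt.Println(n.name)
    	}
    }
    ```

    Sorting by creation time gives a deterministic membership across reconciles, which is what keeps the cluster at the requested odd size during a master replacement.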
  2. ovn: remove unused variable

    (cherry picked from commit 9d22f87)
    dcbw authored and kyrtapz committed Nov 28, 2022
    SHA: b7f0002
  3. ovn: make DB startup wait longer for cluster upgradability

    When the postStart hooks fail, kubelet kills the DB containers with a 30s
    grace period. If the DBs started at different times (because they're on
    different nodes, have different kubelets, etc) they may not have enough
    runtime overlap to establish the RAFT cluster before one or more of them
    get killed by kubelet.
    
    First, make the postStart scripts wait longer by retrying their checks
    more times until the cluster is established.
    
    Second, wrap the IPsec enable/disable in a retry loop too and make it exit
    with an error if it fails instead of ignoring the problem.
    
    Third, add an IPsec check to the SB postStart to wait a bit more time for
    the SB cluster to establish, if needed.
    
    (cherry picked from commit d994351)
    dcbw authored and kyrtapz committed Nov 28, 2022
    SHA: c14fa49