
[release-4.11] OCPBUGS-3490: OVN-Kubernetes: Prefer oldest nodes #1641

Merged

Commits on Nov 28, 2022

  1. ovn: prefer oldest nodes for RAFT cluster

    Sometimes the number of masters changes, such as in the etcd test:
    
    etcd [apigroup:config.openshift.io] is able to vertically scale up and down with a single node
    
    This leads to problems like:
    
    I0909 11:16:02.221234       1 ovn_kubernetes.go:938] Waiting to complete OVN bootstrap: found (4) master nodes out of (3) expected: timing out in 235 seconds
    
    ovsdb-server only ever wants an odd number of members to ensure consensus in
    RAFT clusters. If we have 4 members and one of them is dead (like when the
    4th one gets deleted) the RAFT cluster becomes unstable.
    
    The CNO currently renders the ovnkube master pods with the IP addresses of all
    master nodes, regardless of how many control plane nodes were actually
    requested at install time. That's not cool. Don't do that.
    
    Instead, take the oldest master nodes (sorted by creation time) as the
    RAFT cluster members. Tell any NB/SB containers that aren't in the list
    to do nothing for a really long time (to prevent CrashLoopBackOff due to
    early exits from the container script) and not join the cluster.
    
    If this really is a master replacement, then the cluster will shift over
    to the new master when the original one is finally removed.
    
    Signed-off-by: Dan Williams <dcbw@redhat.com>
    (cherry picked from commit c0c317e)
    dcbw authored and kyrtapz committed Nov 28, 2022
    SHA: 6063452
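    The oldest-nodes selection described above can be sketched in Go. This is a hypothetical, self-contained illustration (the `node` type and `oldestNodes` function are made up for this sketch; the CNO works with `corev1.Node` objects): sort candidates by creation timestamp and truncate to the expected member count, so a newly added 4th master is excluded from the RAFT membership until an older one is removed.

    ```go
    package main

    import (
    	"fmt"
    	"sort"
    	"time"
    )

    // node is a hypothetical stand-in for a Kubernetes Node's metadata.
    type node struct {
    	name    string
    	created time.Time
    }

    // oldestNodes returns the n oldest nodes by creation time, mirroring
    // the RAFT-member selection described in the commit message.
    func oldestNodes(nodes []node, n int) []node {
    	sorted := append([]node(nil), nodes...) // don't mutate the caller's slice
    	sort.Slice(sorted, func(i, j int) bool {
    		return sorted[i].created.Before(sorted[j].created)
    	})
    	if n > len(sorted) {
    		n = len(sorted)
    	}
    	return sorted[:n]
    }

    func main() {
    	base := time.Date(2022, 9, 9, 11, 0, 0, 0, time.UTC)
    	nodes := []node{
    		{"master-3", base.Add(3 * time.Hour)}, // newly added replacement node
    		{"master-0", base},
    		{"master-2", base.Add(2 * time.Hour)},
    		{"master-1", base.Add(1 * time.Hour)},
    	}
    	// With 3 expected members, the newest node (master-3) is left out.
    	for _, n := range oldestNodes(nodes, 3) {
    		fmt.Println(n.name)
    	}
    }
    ```

    Sorting by creation time gives a deterministic membership across reconciles, which is what keeps the cluster at the requested odd size during a master replacement.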
  2. ovn: remove unused variable

    (cherry picked from commit 9d22f87)
    dcbw authored and kyrtapz committed Nov 28, 2022
    SHA: b7f0002
  3. ovn: make DB startup wait longer for cluster upgradability

    When the postStart hooks fail, kubelet kills the DB containers with a 30s
    grace period. If the DBs started at different times (because they're on
    different nodes, have different kubelets, etc) they may not have enough
    runtime overlap to establish the RAFT cluster before one or more of them
    get killed by kubelet.
    
    First, make the postStart scripts wait longer by retrying their checks
    more times until the cluster is established.
    
    Second, wrap the IPsec enable/disable in a retry loop too and make it exit
    with an error if it fails instead of ignoring the problem.
    
    Third, add an IPsec check to the SB postStart to wait a bit more time for
    the SB cluster to establish, if needed.
    
    (cherry picked from commit d994351)
    dcbw authored and kyrtapz committed Nov 28, 2022
    SHA: c14fa49