Bug 1880591: Fixes race during 4.5->4.6 upgrade with ovn node and master #307
When an ovnkube-node upgrades, it annotates its gateway config onto the Kubernetes node object. When ovnkube-master sees this node add/update event, it tries to sync that node's gateway configuration in OVN. In ovnkube-master 4.6, if we see a node that is in a 4.5 state but whose pod is a 4.6 ovnkube pod, we skip OVN gateway configuration for it.

However, this patch addresses the case where a 4.6 ovnkube-node upgraded, started, and annotated its node before the ovnkube-master pod was upgraded to 4.6. In that case the old 4.5 ovnkube-master code tries to create a gateway interface for the updated node, and we were never setting that interface ID. As a result, OVN gets a new port with an empty name in addition to its already configured br-local port and does not know which one to use. If it picks the new port, kapi access stops working.

This patch works around the race by having 4.6 ovnkube-node set the node annotation to the old interface value, so that if we hit the race, the old ovnkube-master code does not create an additional port with an empty name in OVN.

Signed-off-by: Tim Rozet <trozet@redhat.com>
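The workaround described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `gatewayConfig` type, the `interface-id` JSON key, the `buildGatewayAnnotation` and `portNameFromAnnotation` helpers, and the example interface value are all hypothetical names chosen for the sketch. The idea is only that the node side never leaves the interface ID empty, so an older master that names a port from the annotation does not end up with an empty port name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gatewayConfig is a hypothetical, simplified stand-in for the gateway
// annotation payload that ovnkube-node writes onto its node object.
type gatewayConfig struct {
	Mode        string `json:"mode"`
	InterfaceID string `json:"interface-id,omitempty"`
}

// buildGatewayAnnotation models the node side of the workaround: if no
// interface ID would otherwise be set, fall back to the old (pre-upgrade)
// interface value so the annotation never carries an empty ID.
func buildGatewayAnnotation(mode, interfaceID, legacyID string) (string, error) {
	if interfaceID == "" {
		interfaceID = legacyID
	}
	raw, err := json.Marshal(gatewayConfig{Mode: mode, InterfaceID: interfaceID})
	if err != nil {
		return "", fmt.Errorf("marshal gateway config: %w", err)
	}
	return string(raw), nil
}

// portNameFromAnnotation models an older master that names the gateway port
// after whatever interface ID it finds in the annotation (assumed behavior
// for this sketch). An empty result corresponds to the empty port name
// described in the PR text.
func portNameFromAnnotation(annotation string) (string, error) {
	var cfg gatewayConfig
	if err := json.Unmarshal([]byte(annotation), &cfg); err != nil {
		return "", fmt.Errorf("unmarshal gateway config: %w", err)
	}
	return cfg.InterfaceID, nil
}

func main() {
	// "br-local_node1" is an illustrative placeholder, not the real value.
	annotation, err := buildGatewayAnnotation("local", "", "br-local_node1")
	if err != nil {
		panic(err)
	}
	name, err := portNameFromAnnotation(annotation)
	if err != nil {
		panic(err)
	}
	fmt.Printf("annotation=%s port=%q\n", annotation, name)
}
```

With the fallback in place, the port name derived by the modeled old master is the legacy value rather than an empty string, which is the property the patch relies on during the window before ovnkube-master is upgraded.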
@trozet: No Bugzilla bug is referenced in the title of this pull request.
@trozet: This pull request references Bugzilla bug 1880591, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 6 validation(s) were run on this bug
/lgtm
/retest Please review the full test history for this PR and help us cut down flakes.
/hold
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: abhat, danwinship, trozet
/test e2e-vsphere-ovn
/retest
/hold cancel
A manual upgrade yesterday on metal (thanks @rbbratta) worked. There is a period during the upgrade, after the network is rolled out, when openshift-api-server becomes unavailable and clusterversion reports a warning. About 5 minutes later openshift-api-server becomes available again and the upgrade completes. I'm not sure what causes this or whether it is expected, but it's something else we need to look into. For now, this fix does resolve the race.
/retest Please review the full test history for this PR and help us cut down flakes.
@trozet: All pull requests linked via external trackers have merged:
Bugzilla bug 1880591 has been moved to the MODIFIED state.