Bug 1996201: Fixes cases of timed out while waiting for OVS port binding #686
Conversation
trozet
commented
Aug 24, 2021
- Add back checking for pod flows (to ensure the port that gets ovn_installed is the latest port for the pod)
- Cancel the oldest sandbox request if the pod's MAC or UID changes. Basically, if ovnkube is serving a CNI request and the MAC was changed due to a pod delete/add event, cancel this request and move on to the newer pod.
The runtime might call ovnkube to set up the pod sandbox before the pod informer has received the pod from the apiserver. Currently the code simply returns an error and expects kubelet to retry. Instead let's be nicer and wait a short bit of time for the pod to show up before erroring out. Signed-off-by: Dan Williams <dcbw@redhat.com>
We see at scale that this can happen:
1. CNI delete
2. OVN is so busy it takes 30 seconds to remove the old logical port
3. CNI ADD within 30 seconds
4. ovn-controller sees the old logical switch port, binds it, and considers the new pod up, but no traffic works
5. Sometime later OVN gets updated, ovn-controller updates the pod with the new flows, and traffic finally works

To solve this problem we need a minimal check that the right flows are present for the pod before we check whether ovn_installed is true. This change adds back the checks for the MAC address and OpenFlow port number. Signed-off-by: Tim Rozet <trozet@redhat.com>
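The flow check this commit describes can be sketched roughly as follows. `podFlowsPresent` and the sample flow-dump strings are hypothetical illustrations; the real code inspects actual `ovs-ofctl` output for br-int:

```go
package main

import (
	"fmt"
	"strings"
)

// podFlowsPresent is a minimal sketch of the check described above:
// before trusting ovn_installed, confirm the flow dump for the pod's
// bridge contains both the pod's MAC address and its OpenFlow port
// number, so flows left over from a stale logical switch port can't be
// mistaken for the new pod's flows.
func podFlowsPresent(flowDump, podMAC string, ofPort int) bool {
	return strings.Contains(flowDump, podMAC) &&
		strings.Contains(flowDump, fmt.Sprintf("in_port=%d", ofPort))
}

func main() {
	stale := "table=0, in_port=7,dl_src=0a:58:0a:00:00:05 actions=..."
	fresh := "table=0, in_port=9,dl_src=0a:58:0a:00:00:09 actions=..."
	fmt.Println(podFlowsPresent(stale, "0a:58:0a:00:00:09", 9)) // false: old pod's flows
	fmt.Println(podFlowsPresent(fresh, "0a:58:0a:00:00:09", 9)) // true: new pod's flows
}
```

Checking both the MAC and the OpenFlow port number is what distinguishes step 4 (stale binding, no traffic) from a genuinely ready pod.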
If the pod's UID changes, that means the pod was deleted and re-created. There's no point in continuing this sandbox request, as kubelet will just be tearing it down soon anyway. If the pod's MAC changes, that means the master was behind and set the IPAM annotation on a new instance of the pod (since the master just uses Patch with namespace+name and ignores UID), and this sandbox will be torn down soon as well so kubelet can start the newer one. Signed-off-by: Dan Williams <dcbw@redhat.com>
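The cancellation decision above boils down to two comparisons. This is an illustrative sketch; `checkCancelSandbox` is a hypothetical name, not the function in the PR:

```go
package main

import "fmt"

// checkCancelSandbox sketches the logic described above: if the live
// pod's UID differs from the one this sandbox request started with, the
// pod was deleted and re-created; if the annotated MAC differs, the
// master re-ran IPAM for a newer pod instance. Either way the sandbox
// is about to be torn down, so the request should be cancelled.
func checkCancelSandbox(reqUID, liveUID, reqMAC, liveMAC string) (cancel bool, reason string) {
	if reqUID != "" && liveUID != reqUID {
		return true, "pod UID changed; pod was deleted and re-created"
	}
	if reqMAC != "" && liveMAC != reqMAC {
		return true, "pod MAC changed; annotation belongs to a newer pod instance"
	}
	return false, ""
}

func main() {
	cancel, reason := checkCancelSandbox("uid-1", "uid-2", "0a:58:0a:00:00:05", "0a:58:0a:00:00:05")
	fmt.Println(cancel, reason)
}
```

Cancelling early rather than finishing the setup avoids wiring a sandbox that kubelet will delete moments later, and frees the serialized CNI handler to work on the newer pod.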
Use a pod UID from the runtime to close the race between when kubelet starts the sandbox request and when cni.go gets the pod from the informer cache or the apiserver. If the pod was deleted and recreated during that window the sandbox could be configured for the new pod instance, only to be torn down soon and possibly confuse the new sandbox ADD for the new pod instance. Signed-off-by: Dan Williams <dcbw@redhat.com>
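The runtime-supplied UID arrives through the CNI_ARGS environment variable as semicolon-separated KEY=VALUE pairs, which is where the K8S_POD_UID comparison described above gets its left-hand side. The parsing helper below is a hypothetical sketch:

```go
package main

import (
	"fmt"
	"strings"
)

// podUIDFromCNIArgs extracts K8S_POD_UID from a CNI_ARGS string
// (semicolon-separated KEY=VALUE pairs). Comparing this runtime-supplied
// UID against the UID of the pod fetched from the informer cache or the
// apiserver closes the window in which the pod could have been deleted
// and re-created between kubelet starting the sandbox request and
// cni.go looking the pod up.
func podUIDFromCNIArgs(cniArgs string) string {
	for _, kv := range strings.Split(cniArgs, ";") {
		if strings.HasPrefix(kv, "K8S_POD_UID=") {
			return strings.TrimPrefix(kv, "K8S_POD_UID=")
		}
	}
	return ""
}

func main() {
	args := "K8S_POD_NAMESPACE=default;K8S_POD_NAME=web;K8S_POD_UID=abc-123"
	runtimeUID := podUIDFromCNIArgs(args)
	informerUID := "def-456" // the UID of a re-created pod instance
	fmt.Println(runtimeUID, runtimeUID == informerUID)
}
```

When the two UIDs disagree, the sandbox belongs to a dead pod instance and configuring it would only risk confusing the ADD for the new instance.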
Passing the Kube API authentication data via the CNI config file has two problems:
1) the CA file path might be different for the cniserver (because it's containerized) than for the cnishim running outside a container
2) it's better not to leak authentication info into the host filesystem, even though the CNI config file should have restricted permissions

To solve these two issues, pass the Kube API authentication data back from the cniserver (running in ovnkube-node) to the cnishim in the JSON response instead of writing it to a file on disk. This commit reverts parts of: d397166 cni: cancel pod sandbox add requests if the pod's UID or MAC changes Signed-off-by: Dan Williams <dcbw@redhat.com>
We need to pass the CA data itself between ovnkube-node and the cnishim since the node is containerized and the shim is not, and the path could be different between the two since they have different filesystem namespaces. So we might as well just read the CA file and pass data around internally, rather than using a file path. Signed-off-by: Dan Williams <dcbw@redhat.com>
@trozet: This pull request references Bugzilla bug 1996201, which is invalid:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/bugzilla refresh
@trozet: This pull request references Bugzilla bug 1996201, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 6 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Bugzilla (anusaxen@redhat.com), skipping review request. In response to this:
/assign @dcbw
/retest
The gcp-ovn-upgrade job failures look a lot like what we see in the periodic job.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dcbw, trozet The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest-required Please review the full test history for this PR and help us cut down flakes.
7 similar comments
11 similar comments
@trozet: The following test failed, say
Full PR test history. Your PR dashboard.
/retest-required Please review the full test history for this PR and help us cut down flakes.
5 similar comments
@trozet: All pull requests linked via external trackers have merged: Bugzilla bug 1996201 has been moved to the MODIFIED state. In response to this:
Commit fixing conflicts and other issues for backporting openshift#686 to 4.7 without the smartnic code Signed-off-by: astoycos <astoycos@redhat.com>