Bug 1996201: Fixes cases of timed out while waiting for OVS port binding #686

Merged
merged 6 commits into from Sep 11, 2021

Conversation

trozet
Contributor

@trozet trozet commented Aug 24, 2021

  1. Add back checking for pod flows (to ensure the port that gets ovn_installed is the latest port for the pod)
  2. Cancel the oldest sandbox request if the pod's MAC or UID changes. If ovnkube is serving a CNI request and the MAC changed because of a pod delete/add event, cancel that request and move on to the newer pod.

dcbw and others added 6 commits August 24, 2021 16:14
The runtime might call ovnkube to set up the pod sandbox before
the pod informer has received the pod from the apiserver. Currently
the code simply returns an error and expects kubelet to retry.

Instead let's be nicer and wait a short bit of time for the pod
to show up before erroring out.

Signed-off-by: Dan Williams <dcbw@redhat.com>
We see at scale that this can happen:
1. CNI delete
2. OVN is so busy it takes 30 seconds to remove the old logical port
3. CNI ADD within 30 seconds
4. ovn-controller sees old logical switchport, binds and considers new
   pod up, but no traffic works
5. sometime later OVN gets updated, and ovn-controller updates the pod
   with the new flows and traffic finally works

To solve this problem we need a minimal check that the right flows
are present for the pod before we check whether ovn_installed is
true. This change adds back the checks for the MAC address and the
OpenFlow port number.

Signed-off-by: Tim Rozet <trozet@redhat.com>
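As an illustration of the ordering described above (flows first, then ovn_installed), here is a minimal sketch. It assumes the flow dump is already available as a slice of strings and that the relevant match fields are `in_port=` and `dl_src=`; the function names and the sample flows are hypothetical, not taken from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// tokens splits one flow line into its comma- or space-separated fields.
func tokens(flow string) []string {
	return strings.FieldsFunc(flow, func(r rune) bool { return r == ',' || r == ' ' })
}

// podFlowsPresent reports whether the dump contains a flow matching the pod's
// current OpenFlow port and a flow matching its current MAC address.
// Whole-token comparison keeps in_port=1 from matching in_port=12.
func podFlowsPresent(flows []string, mac string, ofPort int) bool {
	wantPort := fmt.Sprintf("in_port=%d", ofPort)
	wantMAC := "dl_src=" + strings.ToLower(mac)
	var sawPort, sawMAC bool
	for _, flow := range flows {
		for _, tok := range tokens(flow) {
			switch strings.ToLower(tok) {
			case wantPort:
				sawPort = true
			case wantMAC:
				sawMAC = true
			}
		}
	}
	return sawPort && sawMAC
}

// podReady only trusts ovn_installed once the flows match the new port,
// so a binding left over from the old logical switch port is not enough.
func podReady(flows []string, mac string, ofPort int, ovnInstalled bool) bool {
	return podFlowsPresent(flows, mac, ofPort) && ovnInstalled
}

func main() {
	// Hypothetical flows left behind by the previous instance of the pod.
	stale := []string{
		"table=0, priority=100,in_port=7 actions=resubmit(,8)",
		"table=9, priority=50,dl_src=0a:58:0a:80:00:05 actions=resubmit(,10)",
	}
	fmt.Println(podReady(stale, "0a:58:0a:80:00:09", 12, true)) // stale flows only
	fresh := append(stale,
		"table=0, priority=100,in_port=12 actions=resubmit(,8)",
		"table=9, priority=50,dl_src=0a:58:0a:80:00:09 actions=resubmit(,10)",
	)
	fmt.Println(podReady(fresh, "0a:58:0a:80:00:09", 12, true))
}
```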
If the pod's UID changes that means the pod was deleted
and re-created. There's no point in continuing this sandbox
request as kubelet will just be tearing it down soon anyway.

If the pod's MAC changes, that means the master was behind and
set the IPAM annotation on a new instance of the pod (since
the master just uses Patch with namespace+name and ignores
UID), and this sandbox will be torn down soon as well so
kubelet can start the newer one.

Signed-off-by: Dan Williams <dcbw@redhat.com>
Use a pod UID from the runtime to close the race between when kubelet
starts the sandbox request and when cni.go gets the pod from the
informer cache or the apiserver. If the pod was deleted and
recreated during that window the sandbox could be configured for
the new pod instance, only to be torn down soon and possibly
confuse the new sandbox ADD for the new pod instance.

Signed-off-by: Dan Williams <dcbw@redhat.com>
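One way to picture the UID check from this commit message: take the UID the runtime supplied with the request and compare it with the UID of the pod returned by the informer cache or the apiserver. The sketch assumes the runtime passes the UID as a `K8S_POD_UID` key in the semicolon-separated CNI args string, and that an empty value means the runtime did not provide one; both the key handling and the helper names are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// podUIDFromCNIArgs extracts K8S_POD_UID from a "K=V;K=V" CNI args string.
// It returns "" if the key is absent (assumed: runtime does not supply a UID).
func podUIDFromCNIArgs(cniArgs string) string {
	for _, kv := range strings.Split(cniArgs, ";") {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) == 2 && parts[0] == "K8S_POD_UID" {
			return parts[1]
		}
	}
	return ""
}

// checkPodUID fails when the pod we looked up is a different instance from
// the one the runtime asked us to set up.
func checkPodUID(runtimeUID, lookedUpUID string) error {
	if runtimeUID != "" && runtimeUID != lookedUpUID {
		return fmt.Errorf("pod UID mismatch: runtime requested %q but found %q", runtimeUID, lookedUpUID)
	}
	return nil
}

func main() {
	args := "IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=web-0;K8S_POD_UID=uid-1"
	uid := podUIDFromCNIArgs(args)
	fmt.Println(checkPodUID(uid, "uid-1")) // same instance
	fmt.Println(checkPodUID(uid, "uid-2")) // pod was re-created in the window
}
```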
Passing the Kube API authentication data via the CNI config file
has two problems:

1) the CA file path might differ between the cniserver (which is
containerized) and the cnishim, which runs outside a
container

2) it's better not to leak authentication info into the host
filesystem, even though the CNI config file should have restricted
permissions

To solve these two issues, pass the Kube API authentication data
back from the cniserver (running in ovnkube-node) to the cnishim
in the JSON response instead of writing it to a file on-disk.

This commit reverts parts of:
d397166
cni: cancel pod sandbox add requests if the pod's UID or MAC changes

Signed-off-by: Dan Williams <dcbw@redhat.com>
We need to pass the CA data itself between ovnkube-node and the cnishim
since the node is containerized and the shim is not, and the path could
be different between the two since they have different filesystem namespaces.

So we might as well just read the CA file and pass data around internally,
rather than using a file path.

Signed-off-by: Dan Williams <dcbw@redhat.com>
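The last two commit messages describe returning the Kube API authentication data, including the CA contents rather than a CA path, in the JSON response from the cniserver to the cnishim. A minimal sketch of that shape follows; the type names, field names, and JSON keys are hypothetical, and the sample values are placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// kubeAPIAuth carries the data the shim needs to talk to the apiserver.
// CAData holds the certificate bytes themselves, so no path has to be valid
// in both the container's and the host's filesystem namespace.
type kubeAPIAuth struct {
	APIServer string `json:"kube-api-server,omitempty"`
	Token     string `json:"kube-api-token,omitempty"`
	CAData    []byte `json:"kube-ca-data,omitempty"` // encoding/json base64-encodes []byte
}

// response is a hypothetical cniserver reply that embeds the auth data.
type response struct {
	KubeAuth *kubeAPIAuth `json:"kubeAuth,omitempty"`
}

// encodeResponse is what the server side would do before replying.
func encodeResponse(auth *kubeAPIAuth) ([]byte, error) {
	return json.Marshal(response{KubeAuth: auth})
}

// decodeResponse is what the shim side would do with the reply body.
func decodeResponse(body []byte) (*kubeAPIAuth, error) {
	var r response
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, err
	}
	return r.KubeAuth, nil
}

func main() {
	// The server would read the CA file once inside its own container;
	// placeholder bytes are used here instead of touching the filesystem.
	ca := []byte("-----BEGIN CERTIFICATE-----\nplaceholder\n-----END CERTIFICATE-----\n")
	body, err := encodeResponse(&kubeAPIAuth{APIServer: "https://api.example:6443", Token: "placeholder", CAData: ca})
	if err != nil {
		panic(err)
	}
	auth, err := decodeResponse(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(auth.APIServer, len(auth.CAData))
}
```

Because the data travels only in the response, nothing is written to the host filesystem.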
@openshift-ci
Contributor

openshift-ci bot commented Aug 24, 2021

@trozet: This pull request references Bugzilla bug 1996201, which is invalid:

  • expected dependent Bugzilla bug 1952846 to target a release in 4.9.0, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1996201: Fixes cases of timed out while waiting for OVS port binding

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Aug 24, 2021
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 24, 2021
@trozet
Contributor Author

trozet commented Aug 24, 2021

/bugzilla refresh

@openshift-ci
Contributor

openshift-ci bot commented Aug 24, 2021

@trozet: This pull request references Bugzilla bug 1996201, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.z) matches configured target release for branch (4.8.z)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
  • dependent bug Bugzilla bug 1959200 is in the state VERIFIED, which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), CLOSED (CURRENTRELEASE))
  • dependent Bugzilla bug 1959200 targets the "4.9.0" release, which is one of the valid target releases: 4.9.0
  • bug has dependents

No GitHub users were found matching the public email listed for the QA contact in Bugzilla (anusaxen@redhat.com), skipping review request.

In response to this:

/bugzilla refresh


@openshift-ci openshift-ci bot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Aug 24, 2021
@trozet
Contributor Author

trozet commented Aug 24, 2021

/assign @dcbw

@jluhrsen
Contributor

jluhrsen commented Sep 2, 2021

/retest

the gcp-ovn-upgrade job failures look a lot like what we see in the periodic job
the metal jobs have failures that we see in the periodic jobs, but those should be able to pass once in a while
not sure about the hybrid-step-registry job. it looks like it fails 50% of the time, although I didn't dig into why
each job is failing.

@dcbw
Member

dcbw commented Sep 2, 2021

/lgtm
retest

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 2, 2021
@openshift-ci
Contributor

openshift-ci bot commented Sep 2, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dcbw, trozet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.


@openshift-ci
Contributor

openshift-ci bot commented Sep 10, 2021

@trozet: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-ipi-ovn-dualstack 7156032 link /test e2e-metal-ipi-ovn-dualstack

Full PR test history. Your PR dashboard.



@openshift-merge-robot openshift-merge-robot merged commit 04a34e6 into openshift:release-4.8 Sep 11, 2021
@openshift-ci
Contributor

openshift-ci bot commented Sep 11, 2021

@trozet: All pull requests linked via external trackers have merged:

Bugzilla bug 1996201 has been moved to the MODIFIED state.

In response to this:

Bug 1996201: Fixes cases of timed out while waiting for OVS port binding


astoycos added a commit to astoycos/ovn-kubernetes-1 that referenced this pull request Nov 29, 2021
Commit fixing conflicts and other issues for
backporting openshift#686

to 4.7 without the smartnic code

Signed-off-by: astoycos <astoycos@redhat.com>
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. lgtm Indicates that a PR is ready to be merged.