Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1959200: Adds back checking OF flows for CNI #555

Closed
wants to merge 4 commits into from

Conversation

dcbw
Copy link
Contributor

@dcbw dcbw commented May 25, 2021

We see at scale that this can happen:

  1. CNI delete
  2. OVN is so busy it takes 30 seconds to remove the old logical port
  3. CNI ADD within 30 seconds
  4. ovn-controller sees old logical switchport, binds and considers new
    pod up, but no traffic works
  5. sometime later OVN gets updated, and ovn-controller updates the pod
    with the new flows and traffic finally works

To solve this problem we need to have a minimal check to ensure the
right flows are present for the pod before we check if ovn_installed is
true. This change adds back the checks for mac address and of port
number.

(cherry picked from commit 22bed6a)

4.8 backport of ovn-org/ovn-kubernetes#2220

@trozet

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 25, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dcbw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot requested review from abhat and squeed May 25, 2021 19:21
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 25, 2021
@trozet
Copy link
Contributor

trozet commented May 25, 2021

/hold

we need to bump OVN with the relevant fix also in this PR. https://bugzilla.redhat.com/show_bug.cgi?id=1959200 depends on:
https://bugzilla.redhat.com/show_bug.cgi?id=1952846

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 25, 2021
@dcbw dcbw changed the title Adds back checking OF flows for CNI Bug 1959200: Adds back checking OF flows for CNI May 25, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 25, 2021

@dcbw: This pull request references Bugzilla bug 1959200, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla (anusaxen@redhat.com), skipping review request.

In response to this:

Bug 1959200: Adds back checking OF flows for CNI

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 25, 2021
@dcbw
Copy link
Contributor Author

dcbw commented Jun 2, 2021

ovn2.13-20.12.0-135 has the necessary fixes now; need that pulled into OCP.

@trozet
Copy link
Contributor

trozet commented Jun 4, 2021

@dcbw https://bugzilla.redhat.com/show_bug.cgi?id=1952846 has been verified, can you tag and pull that into this PR?

@dcbw
Copy link
Contributor Author

dcbw commented Jun 7, 2021

/hold cancel
Bumped OVN with required fixes.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 7, 2021
@dcbw
Copy link
Contributor Author

dcbw commented Jun 7, 2021

/retest
perhaps the mirrors have synced now

@trozet
Copy link
Contributor

trozet commented Jun 7, 2021

/retest

1 similar comment
@dcbw
Copy link
Contributor Author

dcbw commented Jun 7, 2021

/retest

@dcbw
Copy link
Contributor Author

dcbw commented Jun 7, 2021

/test ci/prow/images

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 7, 2021

@dcbw: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

  • /test 4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-windows
  • /test e2e-azure-ovn
  • /test e2e-gcp-ovn
  • /test e2e-gcp-ovn-upgrade
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-openstack-ovn
  • /test e2e-ovn-hybrid-step-registry
  • /test e2e-vsphere-ovn
  • /test e2e-vsphere-windows
  • /test images
  • /test okd-e2e-gcp-ovn
  • /test okd-images

Use /test all to run all jobs.

In response to this:

/test ci/prow/images

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dcbw
Copy link
Contributor Author

dcbw commented Jun 7, 2021

/test images

4 similar comments
@dcbw
Copy link
Contributor Author

dcbw commented Jun 7, 2021

/test images

@dcbw
Copy link
Contributor Author

dcbw commented Jun 7, 2021

/test images

@dcbw
Copy link
Contributor Author

dcbw commented Jun 7, 2021

/test images

@dcbw
Copy link
Contributor Author

dcbw commented Jun 7, 2021

/test images

@dcbw
Copy link
Contributor Author

dcbw commented Jun 7, 2021

/retest

2 similar comments
@trozet
Copy link
Contributor

trozet commented Jun 7, 2021

/retest

@dcbw
Copy link
Contributor Author

dcbw commented Jun 7, 2021

/retest

@trozet
Copy link
Contributor

trozet commented Jun 8, 2021

Hybrid step:

ns/e2e-statefulset-937 pod/ss-0 node/ip-10-0-230-171.ec2.internal - 24.72 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_ss-0_e2e-statefulset-937_3c9e8983-67c2-495c-936c-c16447924956_0(f902bb70527f34146fd8f98fdb0c5f6e4e9ea9c135db55ec754bad0105b6adb4): [e2e-statefulset-937/ss-0:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[e2e-statefulset-937/ss-0 f902bb70527f34146fd8f98fdb0c5f6e4e9ea9c135db55ec754bad0105b6adb4] [e2e-statefulset-937/ss-0 f902bb70527f34146fd8f98fdb0c5f6e4e9ea9c135db55ec754bad0105b6adb4] failed to configure pod interface: error while waiting on OVS.Interface.external-ids:ovn-installed for pod: timed out while waiting for OVS port binding

@dcbw
Copy link
Contributor Author

dcbw commented Jun 8, 2021

/test okd-e2e-gcp-ovn

�[36mINFO�[0m[2021-06-07T21:07:12Z] Wrote inspect data to /logs/artifacts/audit-logs. 
�[36mINFO�[0m[2021-06-07T21:07:12Z] error running backup collection: errors ocurred while gathering data: 
�[36mINFO�[0m[2021-06-07T21:07:12Z]     skipping gathering endpoints/host-etcd-2 due to error: endpoints "host-etcd-2" not foundError from server (Forbidden): pods "must-gather-" is forbidden: error looking up service account openshift-must-gather-wr98q/default: serviceaccount "default" not found 
�[36mINFO�[0m[2021-06-07T21:07:12Z] {"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"prow/entrypoint/run.go:80","func":"k8s.io/test-infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing test process","severity":"error","time":"2021-06-07T21:07:02Z"} 
�[36mINFO�[0m[2021-06-07T21:07:12Z] error: failed to execute wrapped command: exit status 1 
�[36mINFO�[0m[2021-06-07T21:07:12Z]                                              
�[36mINFO�[0m[2021-06-07T21:07:12Z] Step e2e-gcp-ovn-gather-audit-logs failed after 2m10s. 

@dcbw
Copy link
Contributor Author

dcbw commented Jun 8, 2021

/retest

@dcbw
Copy link
Contributor Author

dcbw commented Jun 8, 2021

/override ci/prow/e2e-metal-ipi-ovn-dualstack

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 8, 2021

@dcbw: Overrode contexts on behalf of dcbw: ci/prow/e2e-metal-ipi-ovn-dualstack

In response to this:

/override ci/prow/e2e-metal-ipi-ovn-dualstack

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@trozet
Copy link
Contributor

trozet commented Jun 8, 2021

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 8, 2021
@dcbw dcbw force-pushed the check_of_flows_cni branch 2 times, most recently from ac6f1a4 to 90626bf Compare June 9, 2021 16:55
@dcbw
Copy link
Contributor Author

dcbw commented Jun 9, 2021

/test all

trozet and others added 4 commits June 10, 2021 10:47
We see at scale that this can happen:
1. CNI delete
2. OVN is so busy it takes 30 seconds to remove the old logical port
3. CNI ADD within 30 seconds
4. ovn-controller sees old logical switchport, binds and considers new
   pod up, but no traffic works
5. sometime later OVN gets updated, and ovn-controller updates the pod
   with the new flows and traffic finally works

To solve this problem we need to have a minimal check to ensure the
right flows are present for the pod before we check if ovn_installed is
true. This change adds back the checks for mac address and of port
number.

Signed-off-by: Tim Rozet <trozet@redhat.com>
(cherry picked from commit 22bed6a)
-140 fixes ovn-installed status (rhbz#1952846), backports the
lflow cache bounding, and fixes IPv6 ECMP symmetric flows
(rhbz#1959008).
The runtime might call ovnkube to set up the pod sandbox before
the pod informer has received the pod from the apiserver. Currently
the code simply returns an error and expects kubelet to retry.

Instead let's be nicer and wait a short bit of time for the pod
to show up before erroring out.

Signed-off-by: Dan Williams <dcbw@redhat.com>
(cherry picked from commit adaed36)
If the the pod's MAC annotation changes, that means the pod was
deleted and re-created. We should cancel any outstanding pod
sandbox ADD request that doesn't match the latest MAC address
for the pod.

Signed-off-by: Dan Williams <dcbw@redhat.com>
@dcbw
Copy link
Contributor Author

dcbw commented Jun 10, 2021

/retest

@dcbw
Copy link
Contributor Author

dcbw commented Jun 10, 2021

/test all

@dcbw
Copy link
Contributor Author

dcbw commented Jun 10, 2021

/retest

1 similar comment
@dcbw
Copy link
Contributor Author

dcbw commented Jun 11, 2021

/retest

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 11, 2021

@dcbw: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-openstack-ovn a872814 link /test e2e-openstack-ovn
ci/prow/e2e-vsphere-windows a872814 link /test e2e-vsphere-windows
ci/prow/e2e-metal-ipi-ovn-dualstack a872814 link /test e2e-metal-ipi-ovn-dualstack
ci/prow/e2e-gcp-ovn-upgrade a872814 link /test e2e-gcp-ovn-upgrade
ci/prow/e2e-aws-ovn-windows a872814 link /test e2e-aws-ovn-windows
ci/prow/4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade a872814 link /test 4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 20, 2021

@dcbw: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 20, 2021
@dcbw
Copy link
Contributor Author

dcbw commented Jul 6, 2021

ovn-org/ovn-kubernetes#2275 should fix this instead.

@dcbw dcbw closed this Jul 6, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 6, 2021

@dcbw: This pull request references Bugzilla bug 1959200. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to this:

Bug 1959200: Adds back checking OF flows for CNI

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants