Bug 2118717: [release-4.11] Fix race when adding and removing pod with same name #1247


flavio-fernandes
Contributor

@flavio-fernandes flavio-fernandes commented Aug 16, 2022

Deleting a pod and adding one with the same name on a different node, back to back, can end up in a race
where the corresponding logical switch port remains in the wrong logical switch and never gets
properly removed. For this to happen, both logical switch ports must have the same name,
which is <namespace>_<podName>.

This PR backports the two fixes needed to address this race.
They were merged to downstream (D/S) master via PR #1237.
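To make the race described above concrete, here is a minimal Go sketch under stated assumptions. It is a hypothetical in-memory model, not ovn-kubernetes code: `nbModel` stands in for the northbound database, `portCache` for a controller cache keyed only by port name, and every function name is invented for illustration. The UID check in `deletePodGuarded` is one possible way to avoid the collision and is not necessarily what the backported commits do.

```go
package main

import "fmt"

// lspName follows the naming convention from the description: a pod's
// logical switch port is <namespace>_<podName>, so a recreated pod with the
// same name reuses the same port name.
func lspName(namespace, podName string) string {
	return namespace + "_" + podName
}

// nbModel is a hypothetical stand-in for the OVN northbound database:
// logical switch (one per node) -> port name -> UID of the owning pod.
type nbModel map[string]map[string]string

// portEntry is a hypothetical cache entry: where the port lives and which
// pod instance created it.
type portEntry struct {
	node string
	uid  string
}

type controller struct {
	nb        nbModel
	portCache map[string]portEntry // keyed by port name only, so same-named pods collide
}

func newController() *controller {
	return &controller{nb: nbModel{}, portCache: map[string]portEntry{}}
}

func (c *controller) addPod(ns, name, uid, node string) {
	port := lspName(ns, name)
	if c.nb[node] == nil {
		c.nb[node] = map[string]string{}
	}
	c.nb[node][port] = uid
	c.portCache[port] = portEntry{node: node, uid: uid} // overwrites the older pod's entry
}

// deletePodNaive trusts the name-keyed cache. If the replacement pod was
// added first, it removes the NEW pod's port and leaves the old one behind.
func (c *controller) deletePodNaive(ns, name string) {
	port := lspName(ns, name)
	if entry, ok := c.portCache[port]; ok {
		delete(c.nb[entry.node], port)
		delete(c.portCache, port)
	}
}

// deletePodGuarded takes the node and UID from the deleted pod object and
// only removes state that still belongs to that pod.
func (c *controller) deletePodGuarded(ns, name, uid, node string) {
	port := lspName(ns, name)
	if c.nb[node][port] == uid {
		delete(c.nb[node], port)
	}
	if c.portCache[port].uid == uid {
		delete(c.portCache, port)
	}
}

func main() {
	// Pod ns/web (uid-1) on node1 is replaced by a same-named pod (uid-2)
	// on node2, and the add is handled before the delete.
	naive := newController()
	naive.addPod("ns", "web", "uid-1", "node1")
	naive.addPod("ns", "web", "uid-2", "node2")
	naive.deletePodNaive("ns", "web")
	fmt.Println(naive.nb) // map[node1:map[ns_web:uid-1] node2:map[]]

	guarded := newController()
	guarded.addPod("ns", "web", "uid-1", "node1")
	guarded.addPod("ns", "web", "uid-2", "node2")
	guarded.deletePodGuarded("ns", "web", "uid-1", "node1")
	fmt.Println(guarded.nb) // map[node1:map[] node2:map[ns_web:uid-2]]
}
```

In the naive path the stale port stays on `node1` while the live pod loses its port on `node2`, the kind of leftover port the description reports. Tying each delete to the pod instance (here via its UID and node) keeps the two incarnations apart.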

…e informer cache

When processing an object in a terminal state, there is a chance that it was already removed from
the API server. Since delete events for objects in a terminal state are skipped, delete it here.

Conflicts:
  go-controller/pkg/ovn/obj_retry.go

Signed-off-by: Patryk Diak <pdiak@redhat.com>
(cherry picked from commit f1be8d2)
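The retry-path behavior described in this commit message can be sketched as follows. This is a hypothetical model, not the actual `obj_retry.go` code: `retryQueue`, `podObj`, and the plain map standing in for the informer cache are invented for illustration; only the phase names `Succeeded` and `Failed` are real Kubernetes pod phases.

```go
package main

import "fmt"

// Kubernetes pod phases; Succeeded and Failed are terminal.
const (
	phaseRunning   = "Running"
	phaseSucceeded = "Succeeded"
	phaseFailed    = "Failed"
)

// podObj is a hypothetical, trimmed-down pod: a namespace/name key and a phase.
type podObj struct {
	key   string
	phase string
}

func isTerminal(p podObj) bool {
	return p.phase == phaseSucceeded || p.phase == phaseFailed
}

// retryQueue is a hypothetical retry loop. informerCache stands in for the
// shared informer's store; retried and cleanedUp record what was done.
type retryQueue struct {
	pending       map[string]podObj
	informerCache map[string]podObj
	retried       []string
	cleanedUp     []string
}

func (r *retryQueue) iterate() {
	for key, obj := range r.pending {
		_, inCache := r.informerCache[key]
		switch {
		case inCache:
			// Still known to the API server: retry the normal add/update path.
			r.retried = append(r.retried, key)
		case isTerminal(obj):
			// Already removed from the API server, and its delete event is
			// skipped because the object is terminal: clean up here.
			r.cleanedUp = append(r.cleanedUp, key)
		default:
			// Non-terminal and gone: assume the delete event handler cleans up.
		}
		// Deleting the current key while ranging over a map is safe in Go.
		delete(r.pending, key)
	}
}

func main() {
	r := &retryQueue{
		pending: map[string]podObj{
			"ns/done": {key: "ns/done", phase: phaseSucceeded},
			"ns/live": {key: "ns/live", phase: phaseRunning},
		},
		informerCache: map[string]podObj{
			"ns/live": {key: "ns/live", phase: phaseRunning},
		},
	}
	r.iterate()
	fmt.Println("retried:", r.retried, "cleaned up:", r.cleanedUp)
	// prints: retried: [ns/live] cleaned up: [ns/done]
}
```

Because no delete event will arrive for the terminal object, the retry loop is the last point at which its resources can be released; a non-terminal object missing from the cache is left to the regular delete handler.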
@openshift-ci openshift-ci bot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Aug 16, 2022
@openshift-ci
Contributor

openshift-ci bot commented Aug 16, 2022

@flavio-fernandes: This pull request references Bugzilla bug 2118717, which is invalid:

  • expected dependent Bugzilla bug 2117310 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), CLOSED (CURRENTRELEASE), but it is ON_QA instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 2118717: [release-4.11] Fix race when adding and removing pod with same name

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@flavio-fernandes
Contributor Author

/assign @kyrtapz
/assign @trozet

@openshift-ci
Contributor

openshift-ci bot commented Aug 16, 2022

@flavio-fernandes: This pull request references Bugzilla bug 2118717, which is invalid:

  • expected dependent Bugzilla bug 2117310 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), CLOSED (CURRENTRELEASE), but it is ON_QA instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 2118717: [release-4.11] Fix race when adding and removing pod with same name


@openshift-ci openshift-ci bot requested review from dcbw and tssurya August 16, 2022 15:20
@kyrtapz
Contributor

kyrtapz commented Aug 16, 2022

/lgtm

The first commit's cherry-pick was not clean, but the changes are pretty much identical.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 16, 2022
@trozet
Contributor

trozet commented Aug 16, 2022

/approve
/label backport-risk-assessed

@openshift-ci openshift-ci bot added backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 16, 2022
@flavio-fernandes
Contributor Author

/retest-required

@flavio-fernandes
Contributor Author

/test e2e-openstack-ovn

@openshift-bot
Contributor

/bugzilla refresh

Recalculating validity in case the underlying Bugzilla bug has changed.

@openshift-ci
Contributor

openshift-ci bot commented Aug 17, 2022

@openshift-bot: This pull request references Bugzilla bug 2118717, which is invalid:

  • expected dependent Bugzilla bug 2117310 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), CLOSED (CURRENTRELEASE), but it is ON_QA instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

Recalculating validity in case the underlying Bugzilla bug has changed.


@trozet
Contributor

trozet commented Aug 17, 2022

@flavio-fernandes unit test is failing:

go test -mod=vendor -test.v -race  github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn -ginkgo.v  -ginkgo.reportFile /go/src/github.com/openshift/ovn-kubernetes/go-controller/_artifacts/junit-pkg_ovn.xml
# github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn [github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.test]
pkg/ovn/pods_test.go:958:5: undefined: checkRetryObjectEventually
pkg/ovn/pods_test.go:1514:5: undefined: checkRetryObjectEventually

@flavio-fernandes
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 17, 2022
Deleting a pod and adding one with the same name on a different node, back to back, can end up in a race
where the corresponding logical switch port remains in the wrong logical switch and never gets
properly removed. For this to happen, both logical switch ports must have the same name,
which is <namespace>_<podName>.

Conflicts:
    go-controller/pkg/ovn/pods_test.go

Signed-off-by: Flavio Fernandes <flaviof@redhat.com>
Co-authored-by: Tim Rozet <trozet@redhat.com>
(cherry picked from commit be8786a)
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 17, 2022
@flavio-fernandes
Contributor Author

/remove-hold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 17, 2022
@flavio-fernandes
Contributor Author

> @flavio-fernandes unit test is failing:
>
> go test -mod=vendor -test.v -race  github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn -ginkgo.v  -ginkgo.reportFile /go/src/github.com/openshift/ovn-kubernetes/go-controller/_artifacts/junit-pkg_ovn.xml
> # github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn [github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.test]
> pkg/ovn/pods_test.go:958:5: undefined: checkRetryObjectEventually
> pkg/ovn/pods_test.go:1514:5: undefined: checkRetryObjectEventually

My bad. The clean patch apply made me think that the unit tests were OK.
Fixed now.

@flavio-fernandes
Contributor Author

/bugzilla refresh

@openshift-ci openshift-ci bot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 17, 2022
@openshift-ci openshift-ci bot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Aug 17, 2022
@openshift-ci
Contributor

openshift-ci bot commented Aug 17, 2022

@flavio-fernandes: This pull request references Bugzilla bug 2118717, which is valid.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.z) matches configured target release for branch (4.11.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
  • dependent bug Bugzilla bug 2117310 is in the state VERIFIED, which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), CLOSED (CURRENTRELEASE))
  • dependent Bugzilla bug 2117310 targets the "4.12.0" release, which is one of the valid target releases: 4.12.0
  • bug has dependents

Requesting review from QA contact:
/cc @rbbratta

In response to this:

/bugzilla refresh


@openshift-ci openshift-ci bot requested a review from rbbratta August 17, 2022 15:05
@abhat
Contributor

abhat commented Aug 17, 2022

@anuragthehatter @rbbratta for FastFix verification and cherry-pick-approval.

@rbbratta
Contributor

/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Aug 17, 2022
@flavio-fernandes
Contributor Author

/test e2e-metal-ipi-ovn-dualstack

@flavio-fernandes
Contributor Author

Backport to 4.10 links:
https://issues.redhat.com//browse/OCPBUGSM-47974
#1250

@kyrtapz
Contributor

kyrtapz commented Aug 18, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 18, 2022
@openshift-ci
Contributor

openshift-ci bot commented Aug 18, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: flavio-fernandes, kyrtapz, trozet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 2 against base HEAD b42cfc1 and 8 for PR HEAD 88ebb96 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 1 against base HEAD b42cfc1 and 7 for PR HEAD 88ebb96 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD b42cfc1 and 6 for PR HEAD 88ebb96 in total

@flavio-fernandes
Contributor Author

/test e2e-aws-ovn-local-gateway
/test e2e-metal-ipi-ovn-ipv6

@rbbratta
Contributor

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Aug 19, 2022
@dcbw
Member

dcbw commented Aug 19, 2022

/test e2e-aws-ovn-local-gateway

 info: Loading sha256:fbd422ea7d52e89f86703ce283e48ea149f7b85cdffd7788f3e115927caf34cd vsphere-problem-detector
error: unable to extract layer sha256:0176ab762ec7e13d774bd3b034676186a9c089737a7d8aeadb028fadb09dfc03 from registry.build03.ci.openshift.org/ci-op-5lfml957/stable@sha256:bd8d66a05d0a1a116c4654a2594265d588cb9dd04eabcb9c7a5bb622bd9800e7: unexpected EOF
{"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"k8s.io/test-infra/prow/entrypoint/run.go:80","func":"k8s.io/test-infra/prow/entrypoint.Options.Run","level":"error","msg":"Error executing test process","severity":"error","time":"2022-08-18T18:54:05Z"} 

@dcbw
Member

dcbw commented Aug 19, 2022

/test e2e-metal-ipi-ovn-ipv6

@flavio-fernandes
Contributor Author

/test e2e-metal-ipi-ovn-ipv6

@flavio-fernandes
Contributor Author

/test e2e-openstack-ovn

@flavio-fernandes
Contributor Author

flavio-fernandes commented Aug 19, 2022

@dcbw @trozet may we override this test?

/override ci/prow/e2e-metal-ipi-ovn-ipv6

CI failure:

"e2e-metal-ipi-ovn-ipv6" pod "e2e-metal-ipi-ovn-ipv6-baremetalds-devscripts-setup" failed: the pod ci-op-cjgkyci9/e2e-metal-ipi-ovn-ipv6-baremetalds-devscripts-setup failed after 9m6s (failed containers: test): ContainerFailed one or more containers exited Container test exited with code 1,


  * could not run steps: step e2e-metal-ipi-ovn-ipv6 failed: "e2e-metal-ipi-ovn-ipv6" pre steps failed: "e2e-metal-ipi-ovn-ipv6" pod "e2e-metal-ipi-ovn-ipv6-baremetalds-devscripts-setup" failed: the pod ci-op-cjgkyci9/e2e-metal-ipi-ovn-ipv6-baremetalds-devscripts-setup failed after 9m6s (failed containers: test): ContainerFailed one or more containers exited
     "stderr": "WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.\nERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.\n\nWe recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.\n\nrequests-oauthlib 1.3.1 requires oauthlib>=3.0.0, but you'll have oauthlib 2.1.0 which is incompatible.\nflask-oauthlib 0.9.6 requires requests-oauthlib<1.2.0,>=0.6.2, but you'll have requests-oauthlib 1.3.1 which is incompatible.\n",
    "stderr_lines": [
        "WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.",
        "ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.", 

@openshift-ci
Contributor

openshift-ci bot commented Aug 19, 2022

@flavio-fernandes: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-openstack-ovn | 88ebb96 | link | false | /test e2e-openstack-ovn |

Full PR test history. Your PR dashboard.


@openshift-merge-robot openshift-merge-robot merged commit 2c605e0 into openshift:release-4.11 Aug 19, 2022
25 of 26 checks passed
@openshift-ci
Contributor

openshift-ci bot commented Aug 19, 2022

@flavio-fernandes: All pull requests linked via external trackers have merged:

Bugzilla bug 2118717 has been moved to the MODIFIED state.

In response to this:

Bug 2118717: [release-4.11] Fix race when adding and removing pod with same name


@anuragthehatter

/bugzila refresh

@anuragthehatter

/bugzilla refresh

@openshift-ci
Contributor

openshift-ci bot commented Aug 25, 2022

@anuragthehatter: Bugzilla bug 2118717 is in an unrecognized state (ON_QA) and will not be moved to the MODIFIED state.

In response to this:

/bugzilla refresh


Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • backport-risk-assessed: Indicates a PR to a release branch has been evaluated and considered safe to accept.
  • bugzilla/severity-urgent: Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting.
  • bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
  • cherry-pick-approved: Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.
  • lgtm: Indicates that a PR is ready to be merged.
  • qe-approved: Signifies that QE has signed off on this PR.

10 participants