
[release-4.10] Bug 2113861: reconcile-node-lbs #1227

Merged
Merged 11 commits into openshift:release-4.10 on Aug 25, 2022

Conversation

maiqueb
Contributor

maiqueb commented Aug 2, 2022

- What this PR does and why is it needed

services controller: reconcile on node deletion

A node load balancer is a load balancer associated with any of the following
services:

  • service with NodePort set
  • service with host-network endpoints
  • service with ExternalTrafficPolicy=Local
  • service with InternalTrafficPolicy=Local

This commit forces reconciliation of the load balancers upon node
deletion. Since the node was deleted, the newly generated list of load
balancers will not include the deleted node's load balancers, which
leads to their deletion.

Leaking these load balancers causes an issue when the deleted node is re-added
(as part of a remediation procedure, for instance): the newly created node
logical switch does not point to the leaked load balancers, which breaks
connectivity from pods on the node to the services listed above.
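
A minimal Go sketch of the mechanism described above — the controller shape, onNodeDelete, and serviceKeys are illustrative stand-ins, not the PR's actual code:

```go
package services

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/util/workqueue"
	"k8s.io/klog/v2"
)

// serviceController is an illustrative stand-in for the services controller.
type serviceController struct {
	queue       workqueue.Interface // service keys pending reconciliation
	serviceKeys func() []string     // lists every tracked "namespace/name" key
}

// onNodeDelete forces a full service reconciliation. The regenerated set of
// load balancers no longer contains the deleted node's entries, so its stale
// per-node load balancers are removed from the NB database instead of leaking.
func (c *serviceController) onNodeDelete(node *v1.Node) {
	klog.Infof("node %s deleted, requeueing all services", node.Name)
	for _, key := range c.serviceKeys() {
		c.queue.Add(key)
	}
}
```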

- Special notes for reviewers

A clean pick of the commits included in the ab7f0e636 merge commit on the master branch.

- How to verify it

  1. Back up the configuration of one of your nodes - kubectl get node nodexyz -o yaml > nodexyz-backup.yaml
  2. Delete said node - kubectl delete node nodexyz
  3. Ensure the node's load balancers are deleted from the OVN NB database; their names end with the nodexyz suffix (see the sketch after this list)
  4. Add the node back - kubectl apply -f nodexyz-backup.yaml
  5. Ensure the node's load balancers are re-created in the OVN NB database; their names again end with the nodexyz suffix
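
A hedged Go helper for steps 3 and 5: it shells out to ovn-nbctl, so it must run where the NB database is reachable (inside an ovnkube pod, for instance), and the nodexyz suffix is the same placeholder used in the steps above:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// nodeLoadBalancers returns the "ovn-nbctl lb-list" rows that mention suffix.
func nodeLoadBalancers(suffix string) ([]string, error) {
	out, err := exec.Command("ovn-nbctl", "lb-list").CombinedOutput()
	if err != nil {
		return nil, fmt.Errorf("ovn-nbctl lb-list: %v: %s", err, out)
	}
	var matches []string
	for _, line := range strings.Split(string(out), "\n") {
		if strings.Contains(line, suffix) {
			matches = append(matches, line)
		}
	}
	return matches, nil
}

func main() {
	lbs, err := nodeLoadBalancers("nodexyz")
	if err != nil {
		panic(err)
	}
	// Expected: empty after step 2, populated again after step 4.
	fmt.Printf("%d node load balancers:\n%s\n", len(lbs), strings.Join(lbs, "\n"))
}
```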

- Description for the changelog

Force service reconciliation whenever a node is deleted.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
(cherry picked from commit 26abb48)
(cherry picked from commit 621da2c)
Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
(cherry picked from commit 0837503)
(cherry picked from commit 5aa338b)
Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
(cherry picked from commit bca6be0)
(cherry picked from commit 231840b)
Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
(cherry picked from commit dd9425c)
(cherry picked from commit 9d21222)
Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
(cherry picked from commit f8b822c)
(cherry picked from commit 1a33e16)
Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
(cherry picked from commit 5d5aef0)
(cherry picked from commit 3b3d587)
This commit adds a new unit test showing that the node load balancers are
not removed upon node removal.

If the node is re-added (as part of any remediation procedure), the
newly created node logical switches will *not* point to the existing
load balancers, since the ovn-k master ensures the load balancers are
not re-created. The existing node logical switches are only mutated
to point at these load balancers upon load balancer creation.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
(cherry picked from commit 12d2b26)
(cherry picked from commit 92a1fa0)
A node load balancer is a load balancer associated with any of the following
services:
- service with NodePort set
- service with host-network endpoints
- service with ExternalTrafficPolicy=Local
- service with InternalTrafficPolicy=Local

This commit forces reconciliation of the load balancers upon node
deletion. Since the node was deleted, the newly generated list of load
balancers will *not* include the deleted node's load balancers, which
leads to their deletion.

Leaking these load balancers causes an issue when the deleted node is re-added
(as part of a remediation procedure, for instance): the newly created node
logical switch does not point to the leaked load balancers, which breaks
connectivity from pods on the node to the services listed above.

This commit fixes bug [0].

[0] - https://bugzilla.redhat.com/show_bug.cgi?id=2068910

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
(cherry picked from commit bbe7133)
(cherry picked from commit 85291b5)
It is next to impossible to see the differences between expected and
actual results without the full structures being printed.

This commit adds a function that temporarily disables the maximum length
of the diff and returns a function that, when invoked, restores the
original value.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
(cherry picked from commit 7309fc4)
(cherry picked from commit f881bfa)
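
A minimal sketch of that restore-function pattern, assuming Gomega's format.MaxLength is the truncation knob (the PR's actual helper may differ):

```go
package util

import "github.com/onsi/gomega/format"

// DisableDiffTruncation lifts Gomega's output-length cap so the full
// expected / actual structures are printed, and returns a function that
// restores the previous value. Typical use: defer DisableDiffTruncation()()
func DisableDiffTruncation() func() {
	previous := format.MaxLength
	format.MaxLength = 0 // 0 disables truncation
	return func() { format.MaxLength = previous }
}
```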
The test that asserts the expected NBDB state previously matched while
ignoring UUIDs, because the ovsdb test framework does not provide
any sort of weak-reference cleanup - as such, it cannot remove
the references to load balancers that were deleted from
logical switches / routers.

In order to make the test assert the NBDB state *with* UUIDs, we need
to patch up the UUIDs in the expected state - since the test framework
is unable to correlate the name of the load balancer to its expected
UUID once the load balancer is deleted.

This occurs because the production code only dissociates the load
balancers from the node's logical switches / GW routers for load
balancers that **are not** going to be deleted.

Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
(cherry picked from commit d3107e4)
(cherry picked from commit ba01630)
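
Illustratively, the UUID patching this commit describes could look like the sketch below; the type and helper are hypothetical stand-ins for the real test code:

```go
package services_test

// loadBalancer is a pared-down stand-in for an NBDB load-balancer row.
type loadBalancer struct {
	UUID string
	Name string
}

// patchExpectedUUIDs rewrites the expected rows with the UUIDs the test
// framework actually assigned, correlated by load-balancer name, so the
// NBDB state can be asserted *with* UUIDs.
func patchExpectedUUIDs(expected []loadBalancer, actualUUIDByName map[string]string) {
	for i := range expected {
		if uuid, ok := actualUUIDByName[expected[i].Name]; ok {
			expected[i].UUID = uuid
		}
	}
}
```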
Signed-off-by: Miguel Duarte Barroso <mdbarroso@redhat.com>
(cherry picked from commit 9d2fa9a)
(cherry picked from commit 1448bc4)
openshift-ci bot added the bugzilla/severity-medium (Referenced Bugzilla bug's severity is medium for the branch this PR is targeting.) and bugzilla/invalid-bug (Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting.) labels Aug 2, 2022
@openshift-ci
Contributor

openshift-ci bot commented Aug 2, 2022

@maiqueb: This pull request references Bugzilla bug 2113861, which is invalid:

  • expected dependent Bugzilla bug 2113860 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), CLOSED (CURRENTRELEASE), but it is ASSIGNED instead
  • expected dependent Bugzilla bug 2113860 to target a release in 4.11.0, but it targets "4.11.z" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

[release-4.10] Bug 2113861: reconcile-node-lbs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot requested review from abhat and jcaamano on August 2, 2022 at 10:16
@maiqueb
Contributor Author

maiqueb commented Aug 2, 2022

/test ci/prow/images

@openshift-ci
Contributor

openshift-ci bot commented Aug 2, 2022

@maiqueb: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test 4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade
  • /test 4.10-upgrade-from-stable-4.9-images
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-local-gateway
  • /test e2e-aws-ovn-local-to-shared-gateway-mode-migration
  • /test e2e-aws-ovn-shared-to-local-gateway-mode-migration
  • /test e2e-aws-ovn-upgrade
  • /test e2e-gcp-ovn
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test images
  • /test okd-images

The following commands are available to trigger optional jobs:

  • /test e2e-aws-ovn-serial
  • /test e2e-aws-ovn-windows
  • /test e2e-azure-ovn
  • /test e2e-metal-ipi-ovn-ipv4
  • /test e2e-openstack-ovn
  • /test e2e-ovn-hybrid-step-registry
  • /test e2e-vsphere-ovn
  • /test e2e-vsphere-windows
  • /test okd-e2e-gcp-ovn

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-ovn-kubernetes-release-4.10-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade
  • pull-ci-openshift-ovn-kubernetes-release-4.10-4.10-upgrade-from-stable-4.9-images
  • pull-ci-openshift-ovn-kubernetes-release-4.10-e2e-aws-ovn
  • pull-ci-openshift-ovn-kubernetes-release-4.10-e2e-aws-ovn-local-gateway
  • pull-ci-openshift-ovn-kubernetes-release-4.10-e2e-aws-ovn-local-to-shared-gateway-mode-migration
  • pull-ci-openshift-ovn-kubernetes-release-4.10-e2e-aws-ovn-serial
  • pull-ci-openshift-ovn-kubernetes-release-4.10-e2e-aws-ovn-shared-to-local-gateway-mode-migration
  • pull-ci-openshift-ovn-kubernetes-release-4.10-e2e-aws-ovn-upgrade
  • pull-ci-openshift-ovn-kubernetes-release-4.10-e2e-aws-ovn-windows
  • pull-ci-openshift-ovn-kubernetes-release-4.10-e2e-azure-ovn
  • pull-ci-openshift-ovn-kubernetes-release-4.10-e2e-gcp-ovn
  • pull-ci-openshift-ovn-kubernetes-release-4.10-e2e-metal-ipi-ovn-dualstack
  • pull-ci-openshift-ovn-kubernetes-release-4.10-e2e-metal-ipi-ovn-ipv6
  • pull-ci-openshift-ovn-kubernetes-release-4.10-e2e-openstack-ovn
  • pull-ci-openshift-ovn-kubernetes-release-4.10-e2e-ovn-hybrid-step-registry
  • pull-ci-openshift-ovn-kubernetes-release-4.10-e2e-vsphere-ovn
  • pull-ci-openshift-ovn-kubernetes-release-4.10-e2e-vsphere-windows
  • pull-ci-openshift-ovn-kubernetes-release-4.10-images
  • pull-ci-openshift-ovn-kubernetes-release-4.10-okd-e2e-gcp-ovn
  • pull-ci-openshift-ovn-kubernetes-release-4.10-okd-images

In response to this:

/test ci/prow/images

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@maiqueb
Contributor Author

maiqueb commented Aug 2, 2022

/test images
/test okd-images

@maiqueb
Contributor Author

maiqueb commented Aug 17, 2022

/bugzilla refresh

@openshift-ci
Contributor

openshift-ci bot commented Aug 17, 2022

@maiqueb: This pull request references Bugzilla bug 2113861, which is invalid:

  • expected dependent Bugzilla bug 2113860 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), CLOSED (CURRENTRELEASE), but it is MODIFIED instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@maiqueb
Contributor Author

maiqueb commented Aug 23, 2022

/bugzilla refresh
/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Aug 23, 2022

@maiqueb: An error was encountered querying GitHub for users with public email (anusaxen@redhat.com) for bug 2113861 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details.

Full error message: Post "http://ghproxy/graphql": dial tcp 172.30.229.2:80: i/o timeout

Please contact an administrator to resolve this issue, then request a bug refresh with /bugzilla refresh.

In response to this:

/bugzilla refresh
/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@maiqueb
Contributor Author

maiqueb commented Aug 23, 2022

/bugzilla refresh

@openshift-ci
Contributor

openshift-ci bot commented Aug 23, 2022

@maiqueb: An error was encountered querying GitHub for users with public email (anusaxen@redhat.com) for bug 2113861 on the Bugzilla server at https://bugzilla.redhat.com. No known errors were detected, please see the full error message for details.

Full error message: Post "http://ghproxy/graphql": dial tcp 172.30.229.2:80: i/o timeout

Please contact an administrator to resolve this issue, then request a bug refresh with /bugzilla refresh.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

maiqueb changed the title from [release-4.10] Bug 2113861: reconcile-node-lbs to [release-4.10] OCPBUGS-2113861: reconcile-node-lbs on Aug 24, 2022
@openshift-ci-robot
Contributor

@maiqueb: No Jira issue with key OCPBUGS-2113861 exists in the tracker at https://issues.redhat.com/.
Once a valid bug is referenced in the title of this pull request, request a bug refresh with /jira refresh.

In response to this:

[release-4.10] OCPBUGS-2113861: reconcile-node-lbs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot removed the bugzilla/severity-medium (Referenced Bugzilla bug's severity is medium for the branch this PR is targeting.) and bugzilla/invalid-bug (Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting.) labels Aug 24, 2022
@openshift-ci
Contributor

openshift-ci bot commented Aug 24, 2022

@maiqueb: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[release-4.10] OCPBUGS-2113861: reconcile-node-lbs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

maiqueb changed the title from [release-4.10] OCPBUGS-2113861: reconcile-node-lbs to [release-4.10] OCPBUGS-46843: reconcile-node-lbs on Aug 24, 2022
@openshift-ci-robot
Contributor

@maiqueb: No Jira issue with key OCPBUGS-46843 exists in the tracker at https://issues.redhat.com/.
Once a valid bug is referenced in the title of this pull request, request a bug refresh with /jira refresh.

In response to this:

[release-4.10] OCPBUGS-46843: reconcile-node-lbs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Contributor

openshift-ci bot commented Aug 24, 2022

@maiqueb: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[release-4.10] OCPBUGS-46843: reconcile-node-lbs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@npinaeva
Member

/rename [release-4.10] Bug 2113861: reconcile-node-lbs

maiqueb changed the title from [release-4.10] OCPBUGS-46843: reconcile-node-lbs to [release-4.10] Bug 2113861: reconcile-node-lbs on Aug 24, 2022
openshift-ci bot added the bugzilla/severity-medium (Referenced Bugzilla bug's severity is medium for the branch this PR is targeting.) and bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.) labels Aug 24, 2022
@openshift-ci
Contributor

openshift-ci bot commented Aug 24, 2022

@maiqueb: This pull request references Bugzilla bug 2113861, which is valid.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.10.z) matches configured target release for branch (4.10.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
  • dependent bug Bugzilla bug 2113860 is in the state VERIFIED, which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), CLOSED (CURRENTRELEASE))
  • dependent Bugzilla bug 2113860 targets the "4.11.z" release, which is one of the valid target releases: 4.11.0, 4.11.z
  • bug has dependents

Requesting review from QA contact:
/cc @anuragthehatter

In response to this:

[release-4.10] Bug 2113861: reconcile-node-lbs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@maiqueb
Contributor Author

maiqueb commented Aug 24, 2022

/retest-required

@maiqueb
Contributor Author

maiqueb commented Aug 24, 2022

/test e2e-azure-ovn
/test e2e-openstack-ovn
/test e2e-vsphere-ovn

@maiqueb
Contributor Author

maiqueb commented Aug 24, 2022

/test e2e-openstack-ovn
/test e2e-vsphere-ovn

@openshift-ci
Contributor

openshift-ci bot commented Aug 24, 2022

@maiqueb: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-ovn-hybrid-step-registry | 0dd1168 | link | false | /test e2e-ovn-hybrid-step-registry |
| ci/prow/e2e-vsphere-windows | 0dd1168 | link | false | /test e2e-vsphere-windows |
| ci/prow/e2e-aws-ovn-serial | 0dd1168 | link | false | /test e2e-aws-ovn-serial |
| ci/prow/e2e-aws-ovn-windows | 0dd1168 | link | false | /test e2e-aws-ovn-windows |
| ci/prow/e2e-openstack-ovn | 0dd1168 | link | false | /test e2e-openstack-ovn |
| ci/prow/e2e-vsphere-ovn | 0dd1168 | link | false | /test e2e-vsphere-ovn |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jcaamano
Contributor

/lgtm
/approve

openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) Aug 25, 2022
@openshift-ci
Contributor

openshift-ci bot commented Aug 25, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jcaamano, maiqueb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Aug 25, 2022
@jcaamano
Contributor

/label backport-risk-assessed

openshift-ci bot added the backport-risk-assessed label (Indicates a PR to a release branch has been evaluated and considered safe to accept.) Aug 25, 2022
@anuragthehatter

/label cherry-pick-approved

openshift-ci bot added the cherry-pick-approved label (Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.) Aug 25, 2022
openshift-merge-robot merged commit 6e36803 into openshift:release-4.10 on Aug 25, 2022
@openshift-ci
Contributor

openshift-ci bot commented Aug 25, 2022

@maiqueb: All pull requests linked via external trackers have merged:

Bugzilla bug 2113861 has been moved to the MODIFIED state.

In response to this:

[release-4.10] Bug 2113861: reconcile-node-lbs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
