[release 4.14] OCPBUGS-16267: Fix controller reboot bug #155

@liornoy liornoy commented Nov 1, 2023

This PR changes the behavior of the service reconciler
to fix the following bug:

An LB service has a specific IP assigned to it through an annotation,
and other LB services in the cluster are still "pending".
When MetalLB's controller restarts and comes back up,
it loops over the services, encounters a "pending" LB service first,
and assigns it the IP that was already assigned to the annotated service.
To fix this, we make the reconciler ignore the services until the first
reprocessAll event, where we handle only the services that already have
an IP assigned to them.

Make the ConditionStatus function private,
as it is only used internally by IsNetworkUnavailable.

Also switch the order of the functions so that the exported
one is at the top.

Signed-off-by: Lior Noy <lnoy@redhat.com>
* Add a test case that creates 4 services and asserts that, when the
controller restarts, the first two services keep the same IPs
rather than having them removed or changed.

* Move the functions "validateDesiredLB" and "getIngressIPs" to a different
package so that they can be reused in the l2tests.

Signed-off-by: Lior Noy <lnoy@redhat.com>
This commit changes the behavior of the service reconciler
to fix a bug where the controller de-assigns a service's IP
after a reboot.

Make the service reconciler initially ignore the services
until the first reprocessAll event finishes, in which we sort
the services and handle those with an assigned IP first.
By doing so, we make the controller aware of the LB services with
existing external IPs and sync its internal state.
Only after all services have been reprocessed once, and we know which
services are allocated and which IPs are in use, does the controller
return to working as normal.
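
The ordering described above can be illustrated with a minimal Go sketch. It is not the MetalLB implementation: the `Service`, `Reconciler`, `Reconcile`, `ReprocessAll` and `sync` names, the `initialLoadDone` flag, the first-free-IP allocator and the example addresses are all assumptions made for illustration. It only shows the idea of ignoring single-service events until one full pass has recorded the existing allocations, with services that already hold an IP handled first.

```go
package main

import (
	"fmt"
	"sort"
)

// Service is a simplified stand-in for a LoadBalancer service.
// IngressIP is empty while the service is "pending".
type Service struct {
	Name      string
	IngressIP string
}

// Reconciler is a hypothetical, stripped-down reconciler; none of these
// names are taken from the MetalLB code base.
type Reconciler struct {
	initialLoadDone bool              // false until the first ReprocessAll completes
	inUse           map[string]string // allocated IP -> owning service name
	pool            []string          // allocatable IPs, in order
}

func NewReconciler(pool []string) *Reconciler {
	return &Reconciler{inUse: map[string]string{}, pool: pool}
}

// Reconcile handles a single-service event. It reports false, and does
// nothing, while the initial full pass has not happened yet.
func (r *Reconciler) Reconcile(svc *Service) bool {
	if !r.initialLoadDone {
		return false
	}
	r.sync(svc)
	return true
}

// ReprocessAll handles every service, those with an assigned IP first, so
// the in-memory state learns about existing allocations before any pending
// service can claim an address.
func (r *Reconciler) ReprocessAll(svcs []*Service) {
	ordered := append([]*Service(nil), svcs...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].IngressIP != "" && ordered[j].IngressIP == ""
	})
	for _, svc := range ordered {
		r.sync(svc)
	}
	r.initialLoadDone = true
}

// sync records an existing allocation or assigns the first free pool IP.
func (r *Reconciler) sync(svc *Service) {
	if svc.IngressIP != "" {
		owner, taken := r.inUse[svc.IngressIP]
		if !taken || owner == svc.Name {
			r.inUse[svc.IngressIP] = svc.Name
			return
		}
		svc.IngressIP = "" // address owned by another service: reallocate
	}
	for _, ip := range r.pool {
		if _, taken := r.inUse[ip]; !taken {
			r.inUse[ip] = svc.Name
			svc.IngressIP = ip
			return
		}
	}
}

func main() {
	// State right after a restart: the reconciler's internal state is empty,
	// and the pending service happens to be listed first.
	pending := &Service{Name: "pending"}
	annotated := &Service{Name: "annotated", IngressIP: "192.168.1.10"}
	r := NewReconciler([]string{"192.168.1.10", "192.168.1.11"})

	fmt.Println(r.Reconcile(pending)) // false: ignored before the first full pass
	r.ReprocessAll([]*Service{pending, annotated})
	fmt.Println(annotated.IngressIP, pending.IngressIP) // 192.168.1.10 192.168.1.11
}
```

Without the early return in `Reconcile` and the sort in `ReprocessAll`, this toy allocator would hand 192.168.1.10 to the pending service, because it is seen first and the restarted reconciler does not yet know the address is taken.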

Add unit tests for the service controller

Add unit test cases to cover the FirstConfiguration flag.
Testcase 1: Test the service reconcile with the flag set to true.
Testcase 2: Test reprocessAll with the flag set to true:
validate that the controller changes the value to false.

Signed-off-by: liornoy <lnoy@redhat.com>
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 1, 2023
@openshift-ci-robot

@liornoy: This pull request references Jira Issue OCPBUGS-16267, which is invalid:

  • expected Jira Issue OCPBUGS-16267 to depend on a bug targeting a version in 4.15.0 and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from dcbw and fedepaol November 1, 2023 09:27
@liornoy
Author

liornoy commented Nov 1, 2023

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 1, 2023
@openshift-ci-robot

@liornoy: This pull request references Jira Issue OCPBUGS-16267, which is valid.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.z) matches configured target version for branch (4.14.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-19745 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-19745 targets the "4.15.0" version, which is one of the valid target versions: 4.15.0
  • bug has dependents

Requesting review from QA contact:
/cc @asood-rh


@openshift-ci openshift-ci bot requested a review from asood-rh November 1, 2023 09:28

openshift-ci bot commented Nov 1, 2023

@liornoy: all tests passed!


@fedepaol
Member

fedepaol commented Nov 3, 2023

/approve
/lgtm

@fedepaol
Member

fedepaol commented Nov 3, 2023

/label backport-risk-assessed

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Nov 3, 2023
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 3, 2023

openshift-ci bot commented Nov 3, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fedepaol, liornoy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 3, 2023
@asood-rh

asood-rh commented Nov 6, 2023

/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Nov 6, 2023
@openshift-merge-bot openshift-merge-bot bot merged commit a09f95c into openshift:release-4.14 Nov 6, 2023
4 checks passed
@openshift-ci-robot

@liornoy: Jira Issue OCPBUGS-16267: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-16267 has been moved to the MODIFIED state.


@liornoy liornoy deleted the cherry-pick-controller-reboot-fix branch November 6, 2023 14:13
@liornoy
Author

liornoy commented Nov 10, 2023

/cherry-pick release-4.13

@openshift-cherrypick-robot

@liornoy: #155 failed to apply on top of branch "release-4.13":

Applying: Make ConditionStatus private
Using index info to reconstruct a base tree...
A	internal/k8s/nodes/nodes.go
Falling back to patching base and 3-way merge...
CONFLICT (modify/delete): internal/k8s/nodes/nodes.go deleted in HEAD and modified in Make ConditionStatus private. Version Make ConditionStatus private of internal/k8s/nodes/nodes.go left in tree.
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Make ConditionStatus private
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
backport-risk-assessed: Indicates a PR to a release branch has been evaluated and considered safe to accept.
cherry-pick-approved: Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.

5 participants