Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make controller preserve internal state after reboot #2004

Merged
merged 3 commits into from Sep 21, 2023

Conversation

liornoy
Copy link
Contributor

@liornoy liornoy commented Jul 17, 2023

This PR changes the behavior of the service reconciler
to fix the following bug:

  • There is an LB service with a specific IP (annotated) assigned to it.
  • Also there are other LB services in the cluster on "pending".
  • MetalLB's controller resets and when it goes back up again,
    it loops over the services, sees first the "pending" LB service,
    and assigns it the IP that was assigned to the annotated service.

Here we make the reconciler ignore the services, up until the first
reprocessAll event, where we handle only the services with IP assigned
to them already.

fixes #1984

@liornoy liornoy force-pushed the fix-controller-reboot branch 5 times, most recently from 99f16ca to c0c8b0a Compare July 19, 2023 07:39
@liornoy liornoy changed the title WIP: Fix controller reboot Fix specific ip allocation with controller reboot Jul 19, 2023
@fedepaol
Copy link
Member

I haven't checked the tests yet but a few comments on the implementation. The implementation works, but it's a bit hacky and could be made more explicit by:

  • letting the pool handler understand if it's the first time we receive a configuration (
    func (c *controller) SetPools(l log.Logger, pools *config.Pools) controllers.SyncState {
    )
  • adding a new SyncState value (reprocessAssignedFirst)
  • propagating the information (reloadAssignedFirst) through the channel adding a field to the reloadEvent
  • handling the flag accordingly in the reprocessAll function

This has the advantage that the flow is clear(er), and the controller implementation is in charge of the mechanism

@liornoy
Copy link
Contributor Author

liornoy commented Jul 25, 2023

I haven't checked the tests yet but a few comments on the implementation. The implementation works, but it's a bit hacky and could be made more explicit by:

  • letting the pool handler understand if it's the first time we receive a configuration (
    func (c *controller) SetPools(l log.Logger, pools *config.Pools) controllers.SyncState {

    )
  • adding a new SyncState value (reprocessAssignedFirst)
  • propagating the information (reloadAssignedFirst) through the channel adding a field to the reloadEvent
  • handling the flag accordingly in the reprocessAll function

This has the advantage that the flow is clear(er), and the controller implementation is in charge of the mechanism

@fedepaol without using some kind of manual boolean "isFirstReload" flag, how would you make the pool handler understand if it's the first time we receive a configuration?

I mean, the first time we run SetPools the pools aren't empty.
A different approach is inspecting the c.ips but still, having no allocated ips is a valid state even not the first time.

@fedepaol
Copy link
Member

This has the advantage that the flow is clear(er), and the controller implementation is in charge of the mechanism

@fedepaol without using some kind of manual boolean "isFirstReload" flag, how would you make the pool handler understand if it's the first time we receive a configuration?

Having the flag is legit I guess, something like "firstConfiguration" to be checked and reset together when we get the first non nil set of pools.

@liornoy
Copy link
Contributor Author

liornoy commented Jul 25, 2023

This has the advantage that the flow is clear(er), and the controller implementation is in charge of the mechanism

@fedepaol without using some kind of manual boolean "isFirstReload" flag, how would you make the pool handler understand if it's the first time we receive a configuration?

Having the flag is legit I guess, something like "firstConfiguration" to be checked and reset together when we get the first non nil set of pools.

Ok, thanks for the reply. I was just making sure that this flag wasn't the hacky element of the implementation.

@fedepaol
Copy link
Member

This has the advantage that the flow is clear(er), and the controller implementation is in charge of the mechanism

@fedepaol without using some kind of manual boolean "isFirstReload" flag, how would you make the pool handler understand if it's the first time we receive a configuration?

Having the flag is legit I guess, something like "firstConfiguration" to be checked and reset together when we get the first non nil set of pools.

Ok, thanks for the reply. I was just making sure that this flag wasn't the hacky element of the implementation.

Nope, what I think is hacky is doing a sequence of actions from the wrapping site (the controller) which has no control on how the items are handled

@fedepaol
Copy link
Member

As discussed offline, another possible solution is to always loop over the services with an IP first, and those without the IP afterwards. This will remove a lot of the complexity of understanding when to do what, but it has the downside of having this ad hoc logic in the wrapping controller instead of inside the business logic.
Thoughts?
Also, @oribon wdyt?

@liornoy
Copy link
Contributor Author

liornoy commented Jul 25, 2023

I'll expand on this and say that we drift from the approach of signaling the "firstReload" from within the ReloadEvent struct because it is not normally reachable from the reconcile function and forcing it there will add unnecessary complexity.

@liornoy liornoy force-pushed the fix-controller-reboot branch 4 times, most recently from 37c06a2 to fd553b6 Compare August 2, 2023 12:59
@liornoy liornoy force-pushed the fix-controller-reboot branch 2 times, most recently from 40a5f7c to fec81d0 Compare August 10, 2023 11:29
@liornoy liornoy changed the title Fix specific ip allocation with controller reboot Make controller preserve internal state after reboot Aug 10, 2023
Copy link
Member

@oribon oribon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few nits, will go over the e2es later

internal/k8s/controllers/service_controller.go Outdated Show resolved Hide resolved
internal/k8s/controllers/service_controller_reload.go Outdated Show resolved Hide resolved
internal/k8s/controllers/service_controller_reload.go Outdated Show resolved Hide resolved
internal/k8s/controllers/service_controller_test.go Outdated Show resolved Hide resolved
internal/k8s/controllers/service_controller_test.go Outdated Show resolved Hide resolved
@liornoy liornoy force-pushed the fix-controller-reboot branch 2 times, most recently from 9526c3d to 3593a15 Compare August 16, 2023 09:15
@liornoy
Copy link
Contributor Author

liornoy commented Aug 16, 2023

@oribon the ci failures here are being worked on:
actions/runner-images#8080 (comment)

internal/k8s/controllers/service_controller.go Outdated Show resolved Hide resolved
internal/k8s/controllers/service_controller.go Outdated Show resolved Hide resolved
internal/k8s/controllers/service_controller_test.go Outdated Show resolved Hide resolved
e2etest/l2tests/assignment.go Outdated Show resolved Hide resolved
e2etest/pkg/metallb/metallb.go Show resolved Hide resolved
internal/k8s/nodes/nodes.go Show resolved Hide resolved
e2etest/pkg/service/validate.go Outdated Show resolved Hide resolved
e2etest/l2tests/assignment.go Outdated Show resolved Hide resolved
e2etest/l2tests/assignment.go Outdated Show resolved Hide resolved
@liornoy liornoy force-pushed the fix-controller-reboot branch 3 times, most recently from 3c11ae8 to bc211b9 Compare August 22, 2023 20:18
e2etest/l2tests/assignment.go Outdated Show resolved Hide resolved
internal/k8s/controllers/service_controller.go Outdated Show resolved Hide resolved
internal/k8s/controllers/service_controller.go Outdated Show resolved Hide resolved
This commit changes the behavior of the service reconciler
to fix a bug that the controller de-assign an ip for a service
after reboot.

Make the service reconciler initially ignore the services,
up until the first reprocessAll event finishes, where we sort
and handle all of the services with assigned IP first.
By doing so, we make the controller aware of the LB services with
existing external IPs and sync the internal state.
Only after we reprocessed all services once, and know what
services are allocated and what ips are in use, return to
work as normal.

Add unit tests for the service controller

Add unit test cases to cover the FirstCongifurtaion flag.
Testcase 1: Testing the service reconcile with the flag set to true.
Testcase 2: Testing the reprocessAll with the flag set to true:
validate that the value is modifeid to false by the controller.

Signed-off-by: liornoy <lnoy@redhat.com>
* We add a test case where we create 4 services
and assert that when the controller restarts, it keeps assigning
the first two services the same IPs, and not removing/changing them.

* Move functions "validateDesiredLB" and "getIngressIPs" to a different
package, to be able to reuse in the l2tests.

Signed-off-by: Lior Noy <lnoy@redhat.com>
Make the ConditionStatus function private
as it's only being used internally by IsNetworkUnavailable

Also switch the order of the functions making the exported
one at the top.

Signed-off-by: Lior Noy <lnoy@redhat.com>
@fedepaol
Copy link
Member

lgtm, thanks for fixing this!

@fedepaol fedepaol added this pull request to the merge queue Sep 21, 2023
Merged via the queue into metallb:main with commit adf1274 Sep 21, 2023
25 checks passed
@liornoy liornoy deleted the fix-controller-reboot branch November 6, 2023 09:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]Request specific IPs not working
3 participants