operator: LeaderElectionReleaseOnCancel #556

Merged
merged 3 commits into from
May 2, 2024

Member

@zeeke zeeke commented Dec 5, 2023

When manually restarting the operator, the leader election
takes 5+ minutes to acquire the lease on startup:

I1205 16:06:02.101302       1 leaderelection.go:245] attempting to acquire leader lease openshift-sriov-network-operator/a56def2a.openshift.io...
...
I1205 16:08:40.133558       1 leaderelection.go:255] successfully acquired lease openshift-sriov-network-operator/a56def2a.openshift.io

This PR makes sure the lease is released when the operator shuts down.

github-actions bot commented Dec 5, 2023

Thanks for your PR,
To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

coveralls commented Dec 5, 2023

Pull Request Test Coverage Report for Build 8325705141

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 3 unchanged lines in 1 file lost coverage.
  • Overall coverage remained the same at 37.718%

Files with Coverage Reduction      New Missed Lines   %
controllers/drain_controller.go    3                  70.68%

Totals
Change from base Build 8324682841: 0.0%
Covered Lines: 4837
Relevant Lines: 12824

💛 - Coveralls

@zeeke zeeke force-pushed the fast-leader-election branch 2 times, most recently from 9472a4c to 361c81b Compare December 6, 2023 08:55

Member Author

zeeke commented Dec 6, 2023

I validated this change with a test commit. This comes from the related CI logs:

// operator logs
I1206 10:27:04.430837       1 leaderelection.go:250] attempting to acquire leader lease sriov-network-operator/a56def2a.openshift.io...
I1206 10:27:04.435885       1 leaderelection.go:260] successfully acquired lease sriov-network-operator/a56def2a.openshift.io

// events.json
{
            "metadata": {
                "name": "a56def2a.openshift.io.179e37328a9a0420",
                "namespace": "sriov-network-operator",
                ...
            },
            "involvedObject": {
                "kind": "Lease",
                "namespace": "sriov-network-operator",
                "name": "a56def2a.openshift.io",
                ...
            },
            "reason": "LeaderElection",
            "message": "sriov-network-operator-787ddd7794-xl66c_d5427070-808c-4c52-ae03-c53bf37c6e9f became leader",
            "source": {
                "component": "sriov-network-operator-787ddd7794-xl66c_d5427070-808c-4c52-ae03-c53bf37c6e9f"
            },
            "firstTimestamp": "2023-12-06T10:26:39Z",
            ...
            "reportingComponent": "sriov-network-operator-787ddd7794-xl66c_d5427070-808c-4c52-ae03-c53bf37c6e9f",
            "reportingInstance": ""
        },
        {
            "metadata": {
                "name": "a56def2a.openshift.io.179e37382af3762a",
                "namespace": "sriov-network-operator",
                ...
            },
            "involvedObject": {
                "kind": "Lease",
                "namespace": "sriov-network-operator",
                "name": "a56def2a.openshift.io",
                ...
            },
            "reason": "LeaderElection",
            "message": "sriov-network-operator-787ddd7794-xl66c_d5427070-808c-4c52-ae03-c53bf37c6e9f stopped leading",
            "source": {
                "component": "sriov-network-operator-787ddd7794-xl66c_d5427070-808c-4c52-ae03-c53bf37c6e9f"
            },
            "firstTimestamp": "2023-12-06T10:27:03Z",
            ...
            "reportingComponent": "sriov-network-operator-787ddd7794-xl66c_d5427070-808c-4c52-ae03-c53bf37c6e9f",
            "reportingInstance": ""
        },
        {
            "metadata": {
                "name": "a56def2a.openshift.io.179e37385e817eda",
                "namespace": "sriov-network-operator",
                ...
            },
            "involvedObject": {
                "kind": "Lease",
                "namespace": "sriov-network-operator",
                "name": "a56def2a.openshift.io",
                ...
            },
            "reason": "LeaderElection",
            "message": "sriov-network-operator-787ddd7794-vmsck_a6f02db2-11d5-4dc2-b4b6-8c615ba7f452 became leader",
            "source": {
                "component": "sriov-network-operator-787ddd7794-vmsck_a6f02db2-11d5-4dc2-b4b6-8c615ba7f452"
            },
            "firstTimestamp": "2023-12-06T10:27:04Z",
            ...
            "reportingComponent": "sriov-network-operator-787ddd7794-vmsck_a6f02db2-11d5-4dc2-b4b6-8c615ba7f452",
        },

@zeeke zeeke marked this pull request as draft December 6, 2023 17:05

@adrianchiris
Collaborator

The controller-runtime defaults seem much shorter.

I see we override them with OpenShift-inspired defaults.

The lease is 137s and renew is 26s, so acquiring it should take no longer than 2.7 min; in your logs it's 2.6 min.

Given it's not 5+ min, do we really need this?

main.go Outdated
	setupLog.Info("starting leader election manager")
	if err := leaderElectionMgr.Start(leaderElectionContext); err != nil {
		setupLog.Error(err, "Leader Election Manager exited non-zero")
		os.Exit(1)
Collaborator

Something seems off to me with the use of os.Exit() here and below.

Maybe we just need to use a WaitGroup to wait for these managers to complete.

What if we call os.Exit() before any deferred statements are executed? It may not really be an issue, IDK.
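
For reference, os.Exit terminates the process immediately, so functions deferred in the calling goroutine never run. A common way around this, sketched below with hypothetical names (`run`, `cleanedUp`), is to keep the deferred work in a function that returns an error and to call os.Exit only after that function has returned. This is an illustration, not the code in main.go.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// cleanedUp records whether the deferred cleanup ran. It stands in for
// releasing the lease or any other shutdown step.
var cleanedUp bool

// run holds the logic that would otherwise live in main. Because it returns
// an error instead of calling os.Exit, its deferred functions always run.
func run(fail bool) error {
	defer func() { cleanedUp = true }()
	if fail {
		return errors.New("manager exited non-zero")
	}
	return nil
}

func main() {
	if err := run(false); err != nil {
		fmt.Println("error:", err)
		os.Exit(1) // safe: run has already returned, so its defers have run
	}
	fmt.Println("cleanup ran:", cleanedUp)
}
```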

Contributor

+1

Member Author

TBH, I don't know: if one manager returns an error while the others are running, it should signal a stop and wait. But it could become a little bit complicated, and there is a risk of a deadlock as far as I can see.
I can explore a little more if you think it's worth it.

Collaborator

+1 on looking how to streamline this

@zeeke zeeke marked this pull request as ready for review December 19, 2023 13:47
		setupLog.Error(err, "unable to set up ready check")
		os.Exit(1)
	}
	leaderElectionContext, cancelLeaderElection := context.WithCancel(context.Background())
Contributor

I think it's better to create this WithCancel context with the stopCh context (created below) as its parent, instead of context.Background().

Member Author

That way leaderElectionContext would stop as soon as the stop context is Done. Which is what I'm trying to achieve with this PR.

I tried using nested contexts here
https://go.dev/play/p/6dFfBQyXlW1

Contributor

By using the stop context as the parent of this one, the leaderElectionMgr.Start function will be cancelled if a SIGINT or SIGTERM is received.
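
To illustrate the two options discussed here: a context derived from the stop context is cancelled as soon as its parent is, while one derived from context.Background() stays alive until its own cancel function is called. The sketch below uses hypothetical names (`afterStop`, `isDone`) and simulates the signal by calling the parent's cancel function directly.

```go
package main

import (
	"context"
	"fmt"
)

// isDone reports whether ctx has already been cancelled.
func isDone(ctx context.Context) bool {
	select {
	case <-ctx.Done():
		return true
	default:
		return false
	}
}

// afterStop cancels a stand-in for the stop context and reports the state of
// a child context and of an independent context right afterwards.
func afterStop() (childDone, independentDone bool) {
	stopCtx, stop := context.WithCancel(context.Background())

	// Variant 1: child of stopCtx, cancelled together with it.
	child, cancelChild := context.WithCancel(stopCtx)
	defer cancelChild()

	// Variant 2: independent context, cancelled only explicitly.
	independent, cancelIndependent := context.WithCancel(context.Background())
	defer cancelIndependent()

	stop() // simulate SIGINT/SIGTERM
	return isDone(child), isDone(independent)
}

func main() {
	child, independent := afterStop()
	fmt.Println("child done:", child)             // true
	fmt.Println("independent done:", independent) // false
}
```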

Member Author

Yes, that's what I don't want to happen. Here's the ending sequence:

  • SIGINT arrives
  • stopCh is Done
  • mgr and mgrGlobals finish their work and return
  • as mgr is not in a goroutine, when it returns the function completes, triggering all the defers
  • the utils.Shutdown() defer goes first and does the cleanup (the leader election is still running here, so we still hold the lease lock)
  • the cancelLeaderElection() defer goes second, stopping the manager, which in turn releases the lock

Am I missing something here?
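
The sequence above depends on Go running deferred calls in last-in, first-out order, so the defer registered first runs last. A minimal sketch, with string labels standing in for the real calls (`shutdownOrder` is a hypothetical name):

```go
package main

import "fmt"

// shutdownOrder returns the order in which the deferred steps run.
func shutdownOrder() (order []string) {
	// Registered first, runs last: release the lease.
	defer func() { order = append(order, "cancelLeaderElection") }()
	// Registered last, runs first: cleanup while the lease is still held.
	defer func() { order = append(order, "utils.Shutdown") }()
	// The blocking manager start would sit here until the signal arrives.
	return nil
}

func main() {
	fmt.Println(shutdownOrder()) // [utils.Shutdown cancelLeaderElection]
}
```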

Contributor

Right, it would be the same as using the ReleaseOnCancel option.

My concern here is that after calling cancelLeaderElection() (with the defer), the program exits, so nothing ensures that leaderElectionMgr really finishes. There is a race condition.

The internal implementation of the ReleaseOnCancel option uses a channel to ensure that the routine has stopped before continuing with the shutdown. Perhaps you can do the same here.
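
One way to close that race, sketched under the assumption that the election loop returns once its context is cancelled: run it in a goroutine that closes a channel on exit, then cancel and block on that channel before returning. `fakeElection` is a hypothetical stand-in for leaderElectionMgr.Start; this is not the controller-runtime implementation.

```go
package main

import (
	"context"
	"fmt"
)

// fakeElection blocks until ctx is cancelled and then marks the lease as
// released. It stands in for the leader election manager.
func fakeElection(ctx context.Context, released *bool) {
	<-ctx.Done()
	*released = true
}

// runAndStop starts the election, cancels it, and waits for it to finish
// before reporting whether the lease was released.
func runAndStop() bool {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	released := false

	go func() {
		defer close(done) // signals that the goroutine has fully stopped
		fakeElection(ctx, &released)
	}()

	cancel() // ask the election to stop
	<-done   // do not return (and exit) until it actually has
	return released
}

func main() {
	fmt.Println("lease released before exit:", runAndStop())
}
```

Note the order: cancel first, then wait. Waiting before cancelling would block forever.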

Member Author

> My concern here is that after calling cancelLeaderElection() (with the defer), the program exits, so nothing ensures that leaderElectionMgr really finishes. There is a race condition.

Added a wait group to ensure leaderElectionMgr is correctly stopped

Contributor

It seems to me that this never exits. The leader election manager is stopped with a defer, but before that you are waiting for it to stop, so it won't happen?

Member Author

fixed

Member Author

zeeke commented Dec 19, 2023

> The controller-runtime defaults seem much shorter.
>
> I see we override them with OpenShift-inspired defaults.
>
> The lease is 137s and renew is 26s, so acquiring it should take no longer than 2.7 min; in your logs it's 2.6 min.
>
> Given it's not 5+ min, do we really need this?

You're right: regular clusters may get a delay of 2.7 min at most. With Single Node OpenShift (SNO in this context) it can grow to 5 minutes.

func leaderElectionSingleNodeConfig(config leaderelection.LeaderElectionConfig) leaderelection.LeaderElectionConfig {

openshift/library-go@2612981#diff-61dd95c7fd45fa18038e825205fbfab8a803f1970068157608b6b1e9e6c27248R127

I'm taking that math for granted, and one of our customers is experiencing that 5-minute restart during upgrades on SNO.

I1123 13:20:44.792821       1 leaderelection.go:245] attempting to acquire leader lease openshift-sriov-network-operator/a56def2a.openshift.io...
I1123 13:25:56.033843       1 leaderelection.go:255] successfully acquired lease openshift-sriov-network-operator/a56def2a.openshift.io

Collaborator

SchSeba commented Dec 20, 2023

Hi @zeeke @adrianchiris, not directly related to this PR, but a question: do you know why we have two managers? And now three?

Member Author

zeeke commented Dec 20, 2023

> Hi @zeeke @adrianchiris, not directly related to this PR, but a question: do you know why we have two managers? And now three?

AFAIU, mgr watches resources in the operator's namespace. mgrGlobal watches every namespace, as it runs the Sriov[IB]Network reconcilers, which react to NetworkAttachmentDefinitions.

I guess we can unify those managers into one that watches all namespaces, but we might get some performance degradation.

Also, I'm investigating the shutdown logic to see if we can get rid of some bits and pieces, as I'm not very happy with the current election-proxy-manager proposal.

Member Author

zeeke commented Mar 5, 2024

Shutdown logic for the long run will be discussed in

@SchSeba, @e0ne, @adrianchiris, please take another look at this.

I added an end-to-end test to verify that the operator restarts fast enough.

Collaborator

@SchSeba SchSeba left a comment

I left some small comments. Nice work :)

		g.Expect(err).ToNot(HaveOccurred())

		g.Expect(newLease.Spec.HolderIdentity).ToNot(Equal(oldLease.Spec.HolderIdentity))
	}, 45*time.Second, 5*time.Second).Should(Succeed())
Collaborator

This one is a complicated test. If the operator starts on another master, for example, we need to add the time it takes for the image to get pulled onto the node.

Maybe we wait for the new pod to be running and then do the leader election check?
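
A sketch of that two-phase wait, using only the standard library instead of Gomega. `pollUntil` and `waitForNewLeader` are hypothetical helpers, the two callbacks stand in for the real pod and Lease lookups, and the timeouts are scaled down for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// pollUntil calls cond every interval until it returns true or timeout elapses.
func pollUntil(cond func() bool, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out")
		}
		time.Sleep(interval)
	}
}

// waitForNewLeader first waits for the replacement pod to be running (which
// may include an image pull), and only then starts the shorter budget for
// the lease holder to change.
func waitForNewLeader(podRunning, holderChanged func() bool) error {
	if err := pollUntil(podRunning, 2*time.Second, 10*time.Millisecond); err != nil {
		return fmt.Errorf("pod not running: %w", err)
	}
	if err := pollUntil(holderChanged, time.Second, 10*time.Millisecond); err != nil {
		return fmt.Errorf("lease holder unchanged: %w", err)
	}
	return nil
}

func main() {
	start := time.Now()
	// Pretend the pod needs 200ms to start and the lease changes right after.
	podRunning := func() bool { return time.Since(start) > 200*time.Millisecond }
	holderChanged := func() bool { return true }
	fmt.Println("error:", waitForNewLeader(podRunning, holderChanged))
}
```

Keeping the two budgets separate means a slow image pull does not eat into the time allowed for the lease handover.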

Member Author

good point! This will save us a lot of flakes.
Fixing

Collaborator

@SchSeba SchSeba left a comment

LGTM

Collaborator

SchSeba commented Mar 18, 2024

Please rebase this one and open the issue so we don't forget :)

When manually restarting the operator, the leader election may
take 5+ minutes to acquire the lease on startup:

```
I1205 16:06:02.101302       1 leaderelection.go:245] attempting to acquire leader lease openshift-sriov-network-operator/a56def2a.openshift.io...
...
I1205 16:08:40.133558       1 leaderelection.go:255] successfully acquired lease openshift-sriov-network-operator/a56def2a.openshift.io
```

The manager's option `LeaderElectionReleaseOnCancel` would solve this
problem, but it's not safe as the shutdown cleanup procedures
(inhibiting webhooks and removing finalizers) would run without any
leader guard.

This commit moves the LeaderElection mechanism from the namespaced
manager to a dedicated, no-op controller manager. This approach has been
preferred to directly dealing with the LeaderElection API as:
- It leverages library code that has been proved to be stable
- It includes recording k8s Events about the Lease process
- The election process must come after setting up the health probe.
  Doing it manually would involve handling the healthz endpoint as well.

Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
Add CoordinationV1 to `test/util/clients.go` to make
assertions on `coordination.k8s.io/Lease` objects.

Add `OPERATOR_LEADER_ELECTION_ENABLE` environment variable
to `deploy/operator.yaml` to let user enable leader election
on the operator.

Signed-off-by: Andrea Panattoni <apanatto@redhat.com>

Member Author

zeeke commented Mar 18, 2024

> Please rebase this one and open the issue so we don't forget :)

rebased! The discussion about shutdown is here:

Collaborator

SchSeba commented Mar 21, 2024

@e0ne @ykulazhenkov if you have time, please take a look at this PR.

Collaborator

@ykulazhenkov ykulazhenkov left a comment

> Hi @zeeke @adrianchiris, not directly related to this PR, but a question: do you know why we have two managers? And now three?
>
> AFAIU, mgr watches resources in the operator's namespace. mgrGlobal watches every namespace, as it runs the Sriov[IB]Network reconcilers, which react to NetworkAttachmentDefinitions.
>
> I guess we can unify those managers into one that watches all namespaces, but we might get some performance degradation.
>
> Also, I'm investigating the shutdown logic to see if we can get rid of some bits and pieces, as I'm not very happy with the current election-proxy-manager proposal.

@zeeke

TBH, I think we should try to use a single manager and rely on the cancellation logic from the controller-runtime package.
Recent versions of the controller-runtime package provide a way to configure caches/watches in a very granular way. It is possible to watch some resources in all namespaces and others only in a specific namespace.

Here is an example:
https://github.com/Mellanox/nvidia-k8s-ipam/blob/a34b4a2547815b7c016df7d563a2e04000f0add3/cmd/ipam-node/app/app.go#L146

WDYT?

Member Author

zeeke commented Mar 25, 2024

Hi @ykulazhenkov, thank you for your feedback.

I agree with you that we should simplify as much as we can. With your suggestion, we can merge the namespaced and the non-namespaced managers into a single one, but that's not the problem this PR is trying to solve.

The problem here is that the operator has shutdown logic with two constraints:
a. it must run with the controllers turned off
b. it must run as the cluster leader, i.e. no other sriov-network-operator instance is supposed to run any logic during that shutdown

The cleanest solution I found (apart from getting rid of the shutdown logic, see discussions/608) is to add a new manager that controls only the leader election.

I admit it's probably not the best solution, but it solves an issue found in a production environment (OCPBUGS-23795).

@ykulazhenkov
Collaborator

thx, for the clarification

Collaborator

@ykulazhenkov ykulazhenkov left a comment

LGTM

Collaborator

@SchSeba SchSeba left a comment

LGTM

Thanks!

@SchSeba SchSeba merged commit c2d9e32 into k8snetworkplumbingwg:master May 2, 2024
12 checks passed
6 participants