Set the node as the owner reference of the enactment #808
Conversation
Hi @AlonaKaplan. Thanks for your PR. I'm waiting for a nmstate member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 0c979ec to 85a44e8.
Thanks!
/approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: rhrazdil. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
/retest
/hold
```diff
 OwnerReferences: []metav1.OwnerReference{
-	{Name: policy.Name, Kind: policy.TypeMeta.Kind, APIVersion: policy.TypeMeta.APIVersion, UID: policy.UID},
+	{Name: policy.Name, Kind: policy.Kind, APIVersion: policy.APIVersion, UID: policy.UID},
```
Doesn't having two references mean that it will be removed only when both of them are gone? What we want is to remove it as soon as one of them is gone. Please keep me honest here.
It should, based on a little Google research of mine.
E2E tests should keep us covered, as we're testing that NNCEs are cleaned up on NNCP deletion.
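For reference, a minimal sketch of what such an e2e check could look like, assuming a Ginkgo/Gomega suite with a controller-runtime client the way the kubernetes-nmstate tests are laid out. `testClient`, `deletePolicy`, the policy name, and the API package path are illustrative assumptions, not the repo's actual helpers:

```go
package e2e

import (
	"context"
	"time"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
	"sigs.k8s.io/controller-runtime/pkg/client"

	nmstatev1beta1 "github.com/nmstate/kubernetes-nmstate/api/v1beta1"
)

// testClient is assumed to be initialized by the suite setup.
var testClient client.Client

// deletePolicy is a hypothetical helper that deletes the named NNCP.
func deletePolicy(name string) {
	policy := nmstatev1beta1.NodeNetworkConfigurationPolicy{}
	policy.Name = name
	Expect(testClient.Delete(context.TODO(), &policy)).To(Succeed())
}

var _ = Describe("NNCP deletion", func() {
	It("should garbage collect the policy's enactments", func() {
		deletePolicy("linux-bridge-policy") // hypothetical policy name

		// Poll until every NNCE belonging to the policy has been removed.
		Eventually(func() (int, error) {
			enactments := nmstatev1beta1.NodeNetworkConfigurationEnactmentList{}
			if err := testClient.List(context.TODO(), &enactments); err != nil {
				return 0, err
			}
			return len(enactments.Items), nil
		}, 2*time.Minute, time.Second).Should(Equal(0))
	})
})
```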
Oh, that's awesome then, nice cleanup. Sorry about the raised hold.
No need to apologize, looks like your concerns are correct!
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/nmstate_kubernetes-nmstate/808/pull-kubernetes-nmstate-e2e-handler-k8s/1429712164502900736
Bless the tests \o/
yep, you're right!
/ok-to-test
/hold cancel
/lgtm cancel Looks like NNCEs are indeed not removed when only one of the owner-referenced resources is removed.
Force-pushed from 85a44e8 to 45cc57d.
If the node is removed, the enactment should be garbage collected. Currently we have a leak. Setting both the node and the policy as owner references of the enactment means that the enactment will be removed only when both the node and the policy are removed. Therefore, we have to choose only one of the two to be the owner ref. The removal of the other one should be handled by a controller. We don't have a centralized controller; node-specific controllers run on each node. It doesn't make sense to handle a node removal with a controller running on the removed node. Therefore, the node was chosen to be the owner ref, while the policy removal will be handled by the policy controller. Signed-off-by: Alona Kaplan <alkaplan@redhat.com>
Force-pushed from 45cc57d to 5060e81.
Thanks!
The result is that the NNCE on the node that was rebooted is still present, because session key of the handler. IMO it's not a blocker for the PR, but it's something we should be aware of at the very least.
/lgtm
Adding hold to give Petr a chance, since he's started the review.
I won't get a chance to re-review it today. Please feel free to continue without me.
After offline discussion with @AlonaKaplan, we agreed to address the issue with node restart in a follow-up PR by checking for orphaned NNCEs at handler startup. Unholding this.
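A rough sketch of what that startup check might look like. Everything here (function name, client wiring, API package path) is an assumption for illustration, not the code of the follow-up PR:

```go
package handler

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"

	nmstatev1beta1 "github.com/nmstate/kubernetes-nmstate/api/v1beta1"
)

// cleanupOrphanedEnactments deletes NNCEs whose owning Node no longer exists,
// or whose owner UID no longer matches the live Node (e.g. the node was
// re-created after a restart, so the garbage collector never fired).
func cleanupOrphanedEnactments(ctx context.Context, c client.Client) error {
	enactments := nmstatev1beta1.NodeNetworkConfigurationEnactmentList{}
	if err := c.List(ctx, &enactments); err != nil {
		return err
	}
	for i := range enactments.Items {
		nnce := &enactments.Items[i]
		for _, owner := range nnce.OwnerReferences {
			if owner.Kind != "Node" {
				continue
			}
			node := corev1.Node{}
			err := c.Get(ctx, types.NamespacedName{Name: owner.Name}, &node)
			if err != nil && !apierrors.IsNotFound(err) {
				return err
			}
			orphaned := apierrors.IsNotFound(err) || (err == nil && node.UID != owner.UID)
			if orphaned {
				if err := c.Delete(ctx, nnce); err != nil && !apierrors.IsNotFound(err) {
					return err
				}
			}
		}
	}
	return nil
}
```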
Good point! I think we can add a finalizer to the policy to solve it. The problem is that we have the same bug if the policy node selector is updated during the node's downtime, and a finalizer won't solve that. If that's OK, I will do it in a separate PR.
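For illustration, such a finalizer could look roughly like the sketch below. The finalizer name and both functions are hypothetical; the only repo-specific assumption is the "<node>.<policy>" enactment naming convention:

```go
package policy

import (
	"context"
	"strings"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	nmstatev1beta1 "github.com/nmstate/kubernetes-nmstate/api/v1beta1"
)

// Hypothetical finalizer name; not an identifier from this repo.
const enactmentCleanupFinalizer = "nmstate.io/enactment-cleanup"

// ensureEnactmentCleanup pins the policy with a finalizer while it is live
// and, once the policy is being deleted, removes its enactments first.
func ensureEnactmentCleanup(ctx context.Context, c client.Client, policy *nmstatev1beta1.NodeNetworkConfigurationPolicy) error {
	if policy.DeletionTimestamp.IsZero() {
		// Policy is live: make sure the finalizer is present.
		if !controllerutil.ContainsFinalizer(policy, enactmentCleanupFinalizer) {
			controllerutil.AddFinalizer(policy, enactmentCleanupFinalizer)
			return c.Update(ctx, policy)
		}
		return nil
	}
	// Policy is being deleted: clean up its enactments, then release it.
	if err := deleteEnactmentsForPolicy(ctx, c, policy.Name); err != nil {
		return err
	}
	controllerutil.RemoveFinalizer(policy, enactmentCleanupFinalizer)
	return c.Update(ctx, policy)
}

// deleteEnactmentsForPolicy removes all NNCEs belonging to the policy,
// relying on the "<node>.<policy>" enactment naming convention.
func deleteEnactmentsForPolicy(ctx context.Context, c client.Client, policyName string) error {
	enactments := nmstatev1beta1.NodeNetworkConfigurationEnactmentList{}
	if err := c.List(ctx, &enactments); err != nil {
		return err
	}
	for i := range enactments.Items {
		if !strings.HasSuffix(enactments.Items[i].Name, "."+policyName) {
			continue
		}
		if err := c.Delete(ctx, &enactments.Items[i]); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}
```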
* Squash nns controller into node controller (#806): The reconcile loops of both controllers were practically the same. The only difference was that the NodeController stored the previous state and updated the NNS only if the state had changed. When squashing, this logic had to be changed, since the removal of an existing NNS could be skipped. Now the NNS is updated/re-created if it doesn't exist or the state has changed. The main changes to the existing flow are: 1. The existence of the NNS is verified in any case (so in the common use case of the periodic update that happens every minute, there is one extra, probably redundant, API call; I believe it is worth the duplication removal). 2. If the state has not changed since the last Reconcile, the NNS won't be updated, even in case of force-update. Signed-off-by: Alona Kaplan <alkaplan@redhat.com>
* handler,webhook: use variable to set replicas (#810): In order to set the number of webhook replicas dynamically, the operator yaml should accept a variable for the number of replicas. An example use case would be setting the number of webhook replicas to 1 when running on a Single-Node-OpenShift cluster, and 2 otherwise. This commit introduces the variable and sets it to a fixed value of 2. In follow-up work this could be made completely dynamic. Signed-off-by: alonsadan <asadan@redhat.com>
* Set the node as the owner reference of the enactment (#808): see this PR's description below. Signed-off-by: Alona Kaplan <alkaplan@redhat.com>
* vendor: Bump protobuf to v1.3.2 (#811). Signed-off-by: Ram Lavi <ralavi@redhat.com>
* Use nmstate API for setting port VLAN trunks in favor of the vlan-filtering script (#793):
  * Add a test suite file to pkg/helper; its tests were not executed because the suite file was missing.
  * Add the sjson package for manipulating JSON strings.
  * Set default VLAN config on linux-bridge ports in the API: instead of applying the VLAN configuration after setting desiredState using a script, add the default configuration to desiredState using the nmstate API. The default VLAN configuration is set on the policy enactment desiredState in Reconcile, so that the created NNCE desiredState contains the applied default values.
  * Update desiredState defaults within enactment init: to save one API call, the enactment desiredState defaults are updated in the initializeEnactment function.
  * Remove the unused vlan-filtering script: VLAN configuration is done via the nmstate API, so there is no need to keep the vlan_filtering script. Update the test that relied on renaming the vlan_filtering script to make the NNCP configuration fail and trigger rollback; instead, pass incorrect YAML.
  * Add an example for linux-bridge with port VLAN.
  Signed-off-by: Radim Hrazdil <rhrazdil@redhat.com>
* handler,webhook: use variable to set min replicas (#812): In order to set the minimum number of webhook replicas dynamically, the operator yaml should accept a variable for it. An example use case would be setting the minimum number of webhook replicas to 0 when running on a Single-Node-OpenShift cluster, so as not to interfere with the upgrade process, where the number of active webhook pods is reduced from 1 to 0 to perform the upgrade. This commit introduces the variable and sets it to a fixed value of 1. In follow-up work this could be made completely dynamic. Signed-off-by: alonsadan <asadan@redhat.com>
* operator: Fix deletion of newer NMState CRs (#816): The operator deletes a new NMState CR if one already exists. The problem, however, is that the current CRs are not always listed in the correct order, which means the logic sometimes fails to delete the new CR when another already exists, causing e2e tests to flake. This commit sorts the listed NMState CRs when there is more than one, and compares the name of the reconciled CR with the oldest. Signed-off-by: Radim Hrazdil <rhrazdil@redhat.com>
* Delete unused function (#798). Signed-off-by: Radim Hrazdil <rhrazdil@redhat.com>
* Try updating the NNCP with a non-cached client as backup (#815): After applying a policy, the cached client may return an outdated resource if the network configuration breaks connectivity, which may take up to a minute to refresh. A previous commit addressed this by adding a custom backoff timeout that waited for the cached client to return the updated resource, but that solution is sub-optimal since the wait may take a long time, leaving the handler effectively stuck. To improve this, the NNCP is now updated in two steps: first with the cached client, then with a non-cached client if all attempts with the cached client fail on conflict. This way the handler doesn't block reconciliation for a long period of time. (A sketch of this pattern follows the list.) Signed-off-by: Radim Hrazdil <rhrazdil@redhat.com>
* Fix Dockerfile.openshift (#817): The Dockerfile still attempted to copy the build/bin directory, but that folder has been removed. Signed-off-by: Radim Hrazdil <rhrazdil@redhat.com>

Co-authored-by: Alona Paz <alkaplan@redhat.com>
Co-authored-by: Alon Sadan <46392127+alonSadan@users.noreply.github.com>
Co-authored-by: RamLavi <ralavi@redhat.com>
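As referenced in the #815 item above, here is a minimal sketch of that two-step update pattern under assumed names (`cachedClient` from the manager, `apiClient` built directly with `client.New`); it is not the repo's actual code:

```go
package policy

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	nmstatev1beta1 "github.com/nmstate/kubernetes-nmstate/api/v1beta1"
)

// updateStatus retries the NNCP status update with the cached client and, if
// every attempt fails on conflict (stale cache after connectivity loss),
// repeats the attempt with a direct, non-cached API client.
func updateStatus(ctx context.Context, cachedClient, apiClient client.Client,
	key types.NamespacedName, mutate func(*nmstatev1beta1.NodeNetworkConfigurationPolicy)) error {

	update := func(c client.Client) error {
		return retry.RetryOnConflict(retry.DefaultRetry, func() error {
			// Re-read the policy on every attempt so a conflict can resolve.
			policy := nmstatev1beta1.NodeNetworkConfigurationPolicy{}
			if err := c.Get(ctx, key, &policy); err != nil {
				return err
			}
			mutate(&policy)
			return c.Status().Update(ctx, &policy)
		})
	}

	err := update(cachedClient)
	if apierrors.IsConflict(err) {
		// The cache kept serving a stale object; bypass it entirely.
		err = update(apiClient)
	}
	return err
}
```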
Is this a BUG FIX or a FEATURE?:
/kind bug
What this PR does / why we need it:
If the node is removed, the enactment should be garbage collected.
Currently we have a leak.
Setting both the node and the policy as owner references of the
enactment means that the enactment will be removed only when both the node
and the policy are removed. Therefore, we have to choose only one of the
two to be the owner ref.
The removal of the other one should be handled by a controller.
We don't have a centralized controller; node-specific controllers
run on each node.
It doesn't make sense to handle a node removal with a controller running on
the removed node.
Therefore, the node was chosen to be the owner ref, while the policy removal
will be handled by the policy controller.
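In code terms, the change boils down to something like the sketch below. The constructor name is made up; the NNCE naming convention ("<node>.<policy>") matches how kubernetes-nmstate names enactments, but the exact repo code may differ. The design choice is that Kubernetes GC collects a dependent only after all of its owners are gone, so listing the policy as a second owner would defeat the node-removal cleanup:

```go
package enactment

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	nmstatev1beta1 "github.com/nmstate/kubernetes-nmstate/api/v1beta1"
)

// newEnactmentForNode is a hypothetical constructor illustrating the PR:
// the Node is the single owner, so GC removes the NNCE as soon as the node
// is deleted; policy deletion is handled by the policy controller instead.
func newEnactmentForNode(node *corev1.Node, policyName string) nmstatev1beta1.NodeNetworkConfigurationEnactment {
	return nmstatev1beta1.NodeNetworkConfigurationEnactment{
		ObjectMeta: metav1.ObjectMeta{
			Name: node.Name + "." + policyName, // NNCEs are named "<node>.<policy>"
			// A single owner reference: if both the Node and the policy were
			// listed, GC would delete the NNCE only after BOTH were gone.
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "v1",
				Kind:       "Node",
				Name:       node.Name,
				UID:        node.UID,
			}},
		},
	}
}
```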
Special notes for your reviewer:
Release note: