Set the node as the owner reference of the enactment #808
Conversation
Hi @AlonaKaplan. Thanks for your PR. I'm waiting for a nmstate member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 0c979ec to 85a44e8.
Thanks!
/approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: rhrazdil. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
/retest
/hold
```diff
 OwnerReferences: []metav1.OwnerReference{
-	{Name: policy.Name, Kind: policy.TypeMeta.Kind, APIVersion: policy.TypeMeta.APIVersion, UID: policy.UID},
+	{Name: policy.Name, Kind: policy.Kind, APIVersion: policy.APIVersion, UID: policy.UID},
```
Doesn't having two references mean that it will be removed only when both of them are gone? What we want is to remove it as soon as one of them is gone. Please keep me honest here.
It should, based on a little Google research of mine.
E2E tests should keep us covered, as we're testing that NNCEs are cleaned up on NNCP deletion.
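For reference, a minimal sketch of what such an e2e check could look like, assuming a Ginkgo/Gomega suite with a controller-runtime client the way the kubernetes-nmstate tests are laid out. `testClient`, `deletePolicy`, the policy name, and the API package path are illustrative assumptions, not the repo's actual helpers:

```go
package e2e

import (
	"context"
	"time"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
	"sigs.k8s.io/controller-runtime/pkg/client"

	nmstatev1beta1 "github.com/nmstate/kubernetes-nmstate/api/v1beta1"
)

// testClient is assumed to be initialized by the suite setup.
var testClient client.Client

// deletePolicy is a hypothetical helper that deletes the named NNCP.
func deletePolicy(name string) {
	policy := nmstatev1beta1.NodeNetworkConfigurationPolicy{}
	policy.Name = name
	Expect(testClient.Delete(context.TODO(), &policy)).To(Succeed())
}

var _ = Describe("NNCP deletion", func() {
	It("should garbage collect the policy's enactments", func() {
		deletePolicy("linux-bridge-policy") // hypothetical policy name

		// Poll until every NNCE belonging to the policy has been removed.
		Eventually(func() (int, error) {
			enactments := nmstatev1beta1.NodeNetworkConfigurationEnactmentList{}
			if err := testClient.List(context.TODO(), &enactments); err != nil {
				return 0, err
			}
			return len(enactments.Items), nil
		}, 2*time.Minute, time.Second).Should(Equal(0))
	})
})
```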
Oh, that's awesome then, nice cleanup. Sorry about the raised hold.
No need to apologize, looks like your concerns are correct!
https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/nmstate_kubernetes-nmstate/808/pull-kubernetes-nmstate-e2e-handler-k8s/1429712164502900736
Bless the tests \o/
yep, you're right!
/ok-to-test
/hold cancel
/lgtm cancel Looks like NNCEs are indeed not removed when only one of the owner-referenced resources is removed.
Force-pushed from 85a44e8 to 45cc57d.
If the node is removed, the enactment should be garbage collected. Currently we have a leak. Setting both the node and the policy as owner references of the enactment means that the enactment will be removed only when both the node and the policy are removed. Therefore, we have to choose only one of the two to be the owner ref. The removal of the other one should be handled by a controller. We don't have a centralized controller; node-specific controllers run on each node. It doesn't make sense to handle a node removal with a controller running on the removed node. Therefore, the node was chosen to be the owner ref, while the policy removal will be handled by the policy controller. Signed-off-by: Alona Kaplan <alkaplan@redhat.com>
Force-pushed from 45cc57d to 5060e81.
Thanks!
The result is that the NNCE on the node that was rebooted is still present, because session key of the handler. IMO it's not a blocker for the PR, but it's something we should be aware of at the very least.
/lgtm
Adding hold to give Petr a chance, since he's started the review.
I won't get a chance to re-review it today. Please feel free to continue without me.
After offline discussion with @AlonaKaplan, we agreed to address the issue with node restart in a follow-up PR by checking for orphaned NNCEs at handler startup. Unholding this.
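A rough sketch of what that startup check might look like. Everything here (function name, client wiring, API package path) is an assumption for illustration, not the code of the follow-up PR:

```go
package handler

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"

	nmstatev1beta1 "github.com/nmstate/kubernetes-nmstate/api/v1beta1"
)

// cleanupOrphanedEnactments deletes NNCEs whose owning Node no longer exists,
// or whose owner UID no longer matches the live Node (e.g. the node was
// re-created after a restart, so the garbage collector never fired).
func cleanupOrphanedEnactments(ctx context.Context, c client.Client) error {
	enactments := nmstatev1beta1.NodeNetworkConfigurationEnactmentList{}
	if err := c.List(ctx, &enactments); err != nil {
		return err
	}
	for i := range enactments.Items {
		nnce := &enactments.Items[i]
		for _, owner := range nnce.OwnerReferences {
			if owner.Kind != "Node" {
				continue
			}
			node := corev1.Node{}
			err := c.Get(ctx, types.NamespacedName{Name: owner.Name}, &node)
			if err != nil && !apierrors.IsNotFound(err) {
				return err
			}
			orphaned := apierrors.IsNotFound(err) || (err == nil && node.UID != owner.UID)
			if orphaned {
				if err := c.Delete(ctx, nnce); err != nil && !apierrors.IsNotFound(err) {
					return err
				}
			}
		}
	}
	return nil
}
```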
Good point! I think we can add a finalizer to the policy to solve it. The problem is that we have the same bug if the policy node selector is updated during the node's downtime, and a finalizer won't solve that. If that's OK, I will do it in a separate PR.
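For illustration, such a finalizer could look roughly like the sketch below. The finalizer name and both functions are hypothetical; the only repo-specific assumption is the "<node>.<policy>" enactment naming convention:

```go
package policy

import (
	"context"
	"strings"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	nmstatev1beta1 "github.com/nmstate/kubernetes-nmstate/api/v1beta1"
)

// Hypothetical finalizer name; not an identifier from this repo.
const enactmentCleanupFinalizer = "nmstate.io/enactment-cleanup"

// ensureEnactmentCleanup pins the policy with a finalizer while it is live
// and, once the policy is being deleted, removes its enactments first.
func ensureEnactmentCleanup(ctx context.Context, c client.Client, policy *nmstatev1beta1.NodeNetworkConfigurationPolicy) error {
	if policy.DeletionTimestamp.IsZero() {
		// Policy is live: make sure the finalizer is present.
		if !controllerutil.ContainsFinalizer(policy, enactmentCleanupFinalizer) {
			controllerutil.AddFinalizer(policy, enactmentCleanupFinalizer)
			return c.Update(ctx, policy)
		}
		return nil
	}
	// Policy is being deleted: clean up its enactments, then release it.
	if err := deleteEnactmentsForPolicy(ctx, c, policy.Name); err != nil {
		return err
	}
	controllerutil.RemoveFinalizer(policy, enactmentCleanupFinalizer)
	return c.Update(ctx, policy)
}

// deleteEnactmentsForPolicy removes all NNCEs belonging to the policy,
// relying on the "<node>.<policy>" enactment naming convention.
func deleteEnactmentsForPolicy(ctx context.Context, c client.Client, policyName string) error {
	enactments := nmstatev1beta1.NodeNetworkConfigurationEnactmentList{}
	if err := c.List(ctx, &enactments); err != nil {
		return err
	}
	for i := range enactments.Items {
		if !strings.HasSuffix(enactments.Items[i].Name, "."+policyName) {
			continue
		}
		if err := c.Delete(ctx, &enactments.Items[i]); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}
```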
* Squash nns controller into node controller (#806): The reconcile loops of both controllers were practically the same. The only difference was that the NodeController stored the previous state and updated the NNS only if the state had changed. When squashing, this logic had to be changed, since the removal of an existing NNS could be skipped. Now the NNS is updated/re-created if it doesn't exist or the state has changed. The main changes to the existing flow are: 1. The existence of the NNS is verified in any case (so in the common use case of the periodic update that happens every minute, there is one extra, probably redundant, API call; I believe it is worth the duplication removal). 2. If the state has not changed since the last Reconcile, the NNS won't be updated, even in case of force-update. Signed-off-by: Alona Kaplan <alkaplan@redhat.com>
* handler,webhook: use variable to set replicas (#810): In order to set the number of webhook replicas dynamically, the operator yaml should accept a variable for the number of replicas. An example use case would be setting the number of webhook replicas to 1 when running on a Single-Node-OpenShift cluster, and 2 otherwise. This commit introduces the variable and sets it to a fixed value of 2. In follow-up work this could be made completely dynamic. Signed-off-by: alonsadan <asadan@redhat.com>
* Set the node as the owner reference of the enactment (#808): see this PR's description below. Signed-off-by: Alona Kaplan <alkaplan@redhat.com>
* vendor: Bump protobuf to v1.3.2 (#811). Signed-off-by: Ram Lavi <ralavi@redhat.com>
* Use nmstate API for setting port VLAN trunks in favor of the vlan-filtering script (#793):
  * Add a test suite file to pkg/helper; its tests were not executed because the suite file was missing.
  * Add the sjson package for manipulating JSON strings.
  * Set default VLAN config on linux-bridge ports in the API: instead of applying the VLAN configuration after setting desiredState using a script, add the default configuration to desiredState using the nmstate API. The default VLAN configuration is set on the policy enactment desiredState in Reconcile, so that the created NNCE desiredState contains the applied default values.
  * Update desiredState defaults within enactment init: to save one API call, the enactment desiredState defaults are updated in the initializeEnactment function.
  * Remove the unused vlan-filtering script: VLAN configuration is done via the nmstate API, so there is no need to keep the vlan_filtering script. Update the test that relied on renaming the vlan_filtering script to make the NNCP configuration fail and trigger rollback; instead, pass incorrect YAML.
  * Add an example for linux-bridge with port VLAN.
  Signed-off-by: Radim Hrazdil <rhrazdil@redhat.com>
* handler,webhook: use variable to set min replicas (#812): In order to set the minimum number of webhook replicas dynamically, the operator yaml should accept a variable for it. An example use case would be setting the minimum number of webhook replicas to 0 when running on a Single-Node-OpenShift cluster, so as not to interfere with the upgrade process, where the number of active webhook pods is reduced from 1 to 0 to perform the upgrade. This commit introduces the variable and sets it to a fixed value of 1. In follow-up work this could be made completely dynamic. Signed-off-by: alonsadan <asadan@redhat.com>
* operator: Fix deletion of newer NMState CRs (#816): The operator deletes a new NMState CR if one already exists. The problem, however, is that the current CRs are not always listed in the correct order, which means the logic sometimes fails to delete the new CR when another already exists, causing e2e tests to flake. This commit sorts the listed NMState CRs when there is more than one, and compares the name of the reconciled CR with the oldest. Signed-off-by: Radim Hrazdil <rhrazdil@redhat.com>
* Delete unused function (#798). Signed-off-by: Radim Hrazdil <rhrazdil@redhat.com>
* Try updating the NNCP with a non-cached client as backup (#815): After applying a policy, the cached client may return an outdated resource if the network configuration breaks connectivity, which may take up to a minute to refresh. A previous commit addressed this by adding a custom backoff timeout that waited for the cached client to return the updated resource, but that solution is sub-optimal since the wait may take a long time, leaving the handler effectively stuck. To improve this, the NNCP is now updated in two steps: first with the cached client, then with a non-cached client if all attempts with the cached client fail on conflict. This way the handler doesn't block reconciliation for a long period of time. (A sketch of this pattern follows the list.) Signed-off-by: Radim Hrazdil <rhrazdil@redhat.com>
* Fix Dockerfile.openshift (#817): The Dockerfile still attempted to copy the build/bin directory, but that folder has been removed. Signed-off-by: Radim Hrazdil <rhrazdil@redhat.com>

Co-authored-by: Alona Paz <alkaplan@redhat.com>
Co-authored-by: Alon Sadan <46392127+alonSadan@users.noreply.github.com>
Co-authored-by: RamLavi <ralavi@redhat.com>
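As referenced in the #815 item above, here is a minimal sketch of that two-step update pattern under assumed names (`cachedClient` from the manager, `apiClient` built directly with `client.New`); it is not the repo's actual code:

```go
package policy

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	nmstatev1beta1 "github.com/nmstate/kubernetes-nmstate/api/v1beta1"
)

// updateStatus retries the NNCP status update with the cached client and, if
// every attempt fails on conflict (stale cache after connectivity loss),
// repeats the attempt with a direct, non-cached API client.
func updateStatus(ctx context.Context, cachedClient, apiClient client.Client,
	key types.NamespacedName, mutate func(*nmstatev1beta1.NodeNetworkConfigurationPolicy)) error {

	update := func(c client.Client) error {
		return retry.RetryOnConflict(retry.DefaultRetry, func() error {
			// Re-read the policy on every attempt so a conflict can resolve.
			policy := nmstatev1beta1.NodeNetworkConfigurationPolicy{}
			if err := c.Get(ctx, key, &policy); err != nil {
				return err
			}
			mutate(&policy)
			return c.Status().Update(ctx, &policy)
		})
	}

	err := update(cachedClient)
	if apierrors.IsConflict(err) {
		// The cache kept serving a stale object; bypass it entirely.
		err = update(apiClient)
	}
	return err
}
```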
Is this a BUG FIX or a FEATURE?:
/kind bug
What this PR does / why we need it:
If the node is removed, the enactment should be garbage collected.
Currently we have a leak.
Setting both the node and the policy as owner references of the
enactment means that the enactment will be removed only when both the node
and the policy are removed. Therefore, we have to choose only one of the
two to be the owner ref.
The removal of the other one should be handled by a controller.
We don't have a centralized controller; node-specific controllers
run on each node.
It doesn't make sense to handle a node removal with a controller running on
the removed node.
Therefore, the node was chosen to be the owner ref, while the policy removal
will be handled by the policy controller.
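In code terms, the change boils down to something like the sketch below. The constructor name is made up; the NNCE naming convention ("<node>.<policy>") matches how kubernetes-nmstate names enactments, but the exact repo code may differ. The design choice is that Kubernetes GC collects a dependent only after all of its owners are gone, so listing the policy as a second owner would defeat the node-removal cleanup:

```go
package enactment

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	nmstatev1beta1 "github.com/nmstate/kubernetes-nmstate/api/v1beta1"
)

// newEnactmentForNode is a hypothetical constructor illustrating the PR:
// the Node is the single owner, so GC removes the NNCE as soon as the node
// is deleted; policy deletion is handled by the policy controller instead.
func newEnactmentForNode(node *corev1.Node, policyName string) nmstatev1beta1.NodeNetworkConfigurationEnactment {
	return nmstatev1beta1.NodeNetworkConfigurationEnactment{
		ObjectMeta: metav1.ObjectMeta{
			Name: node.Name + "." + policyName, // NNCEs are named "<node>.<policy>"
			// A single owner reference: if both the Node and the policy were
			// listed, GC would delete the NNCE only after BOTH were gone.
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "v1",
				Kind:       "Node",
				Name:       node.Name,
				UID:        node.UID,
			}},
		},
	}
}
```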
Special notes for your reviewer:
Release note: