Fix change condition conflict in reconcileDelete #3157

richardchen331 · 2022-02-05T19:09:09Z

What type of PR is this?
/kind bug

What this PR does / why we need it:
#2180 showed up consistently when I looked at the controller logs in https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-cluster-api-provider-aws-e2e-eks

I investigated this a bit, and I think the root cause is the following:

In AWSMCP's reconcileDelete, the controller calls DeleteSecurityGroups and then DeleteNetwork.
DeleteSecurityGroups first marks ClusterSecurityGroupsReadyCondition as Deleting, patches it, and at the end marks ClusterSecurityGroupsReadyCondition as Deleted.
DeleteNetwork patches again, which propagates the change to ClusterSecurityGroupsReadyCondition (now Deleted) to the management cluster.
If ClusterSecurityGroupsReadyCondition is already Deleted at the start of reconcileDelete (for example, because an earlier reconcileDelete finished deleting the security groups but did not complete the remaining steps), then a single reconcileDelete loop patches the condition from Deleted to Deleting and then back to Deleted. This triggers an error here (https://github.com/kubernetes-sigs/cluster-api/blob/v1.0.0/util/conditions/patch.go#L167) when the controller attempts to patch the condition from Deleting to Deleted: latestCondition is Deleting, but conditionPatch.Before is Deleted (the value retrieved from the management cluster at the beginning of reconcileDelete) and conditionPatch.After is also Deleted (the value the controller is trying to patch).

This PR fixes the issue by checking whether ClusterSecurityGroupsReadyCondition is already Deleted and, if so, skipping DeleteSecurityGroups.

The same issue affects a number of other conditions (e.g. RouteTablesReady, NatGatewaysReady, InternetGatewayReady). Only ClusterSecurityGroupsReadyCondition shows up in the error because the error is triggered by the first PatchObject call in DeleteNetwork (https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/pkg/cloud/services/network/network.go#L103), so the controller never gets the chance to set the other conditions from Deleted to Deleting and back to Deleted.

If the fix looks good, I can apply the same fix to the other conditions.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2180

Special notes for your reviewer:

Checklist:

squashed commits
includes documentation
adds unit tests
adds or updates e2e tests

Release note:

Fix change condition conflict in reconcileDelete

k8s-ci-robot · 2022-02-05T19:09:16Z

@richardchen331: This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2022-02-05T19:09:17Z

Hi @richardchen331. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

sedefsavas · 2022-02-08T04:49:56Z

/ok-to-test

sedefsavas · 2022-02-08T05:00:17Z

Adding logic that relies on a condition set by the very same controller is not preferred.

Does it make sense to set this condition only during the DeleteSecurityGroups() call? DeleteNetwork() already does that based on the response coming from it.

richardcase · 2022-02-08T09:14:57Z

controlplane/eks/controllers/awsmanagedcontrolplane_controller.go

-	if err := sgService.DeleteSecurityGroups(); err != nil {
-		log.Error(err, "error deleting general security groups for AWSManagedControlPlane", "namespace", controlPlane.Namespace, "name", controlPlane.Name)
-		return reconcile.Result{}, err
+	if conditions.GetReason(managedScope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition) != clusterv1.DeletedReason {


I don't think using the condition as part of the reconciliation logic is the right way to go, at least not within the same reconciliation process.

Would something like this work?

Change DeleteSecurityGroups:

If the length of clusterGroups is 0, exit early.

If the length of clusterGroups is greater than 0, set clusterv1.DeletingReason.

In any error handling block, just return the error.

Change AWSMCP reconcileDelete:

In the error handler block for sgService.DeleteSecurityGroups(), set the DeletionFailedReason.

If successful, set the DeletedReason.

richardchen331 · 2022-02-15

Thanks for the feedback! That sounds good. I updated the logic according to your suggestion.

richardchen331 · 2022-02-15T05:36:13Z

/retest

richardchen331 · 2022-02-15T17:29:40Z

/retest

pydctw · 2022-02-16T21:15:05Z

/test?

k8s-ci-robot · 2022-02-16T21:15:06Z

@pydctw: The following commands are available to trigger required jobs:

/test pull-cluster-api-provider-aws-build
/test pull-cluster-api-provider-aws-test
/test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-blocking
/test pull-cluster-api-provider-aws-e2e-conformance
/test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
/test pull-cluster-api-provider-aws-e2e-eks

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-provider-aws-build
pull-cluster-api-provider-aws-test
pull-cluster-api-provider-aws-verify

In response to this:

/test?

pydctw · 2022-02-16T21:15:17Z

/test pull-cluster-api-provider-aws-e2e-eks

pydctw · 2022-02-16T21:28:35Z

pkg/cloud/services/securitygroup/securitygroups.go

-	conditions.MarkFalse(s.scope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition, clusterv1.DeletedReason, clusterv1.ConditionSeverityInfo, "")
-
-	return nil
+	return err


Shouldn't we return nil here? (I know err is nil here but it's hard to understand the code)

richardchen331 · 2022-03-01

Makes sense. Let me change it to nil.

pydctw · 2022-02-16T22:59:30Z

pkg/cloud/services/securitygroup/securitygroups.go

+	if len(clusterGroups) == 0 {
+		return nil
+	}
+	conditions.MarkFalse(s.scope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition, clusterv1.DeletingReason, clusterv1.ConditionSeverityInfo, "")


Wouldn't this clusterv1.DeletingReason never be used? It is not patched inside the function, and when DeleteSecurityGroups returns, ClusterSecurityGroupsReadyCondition is set again.

richardchen331 · 2022-03-01

You're right. Let me remove it.

pydctw · 2022-02-16T23:03:49Z

I've also observed this patch condition conflict and am wondering why this is observed only in EKS. DeleteSecurityGroups is used both for regular and EKS clusters but I haven't seen it with a regular cluster. Any thoughts, @richardcase?

If we decide to go with this solution, awscluster_controller.go also needs a change, as DeleteSecurityGroups is a common function.

pydctw · 2022-02-18T18:20:29Z

@richardchen331, FYI, I opened a PR, #3234 (Add ClusterSecurityGroupsReadyCondition to managedcontrolplane's patchObject).

I still think that checking the length of clusterGroups before setting the condition to DeletingReason, as you implemented in this PR, is needed, because it prevents the condition from going back from Deleted to Deleting.

richardchen331 · 2022-03-01T02:38:25Z

Hi @pydctw, I updated the PR according to your comments and also updated awscluster_controller.go. Could you take a look? Thanks!

richardcase · 2022-03-01T16:07:46Z

/test pull-cluster-api-provider-aws-e2e-eks
/test pull-cluster-api-provider-aws-e2e

sedefsavas · 2022-03-01T22:15:00Z

controlplane/eks/controllers/awsmanagedcontrolplane_controller.go

@@ -285,8 +285,10 @@ func (r *AWSManagedControlPlaneReconciler) reconcileDelete(ctx context.Context,

 	if err := sgService.DeleteSecurityGroups(); err != nil {
 		log.Error(err, "error deleting general security groups for AWSManagedControlPlane", "namespace", controlPlane.Namespace, "name", controlPlane.Name)
+		conditions.MarkFalse(managedScope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition, clusterv1.DeletionFailedReason, clusterv1.ConditionSeverityWarning, err.Error())


Deletion has not failed in every scenario where DeleteSecurityGroups() returns an error.
Setting the condition to deleting and returning the error is more accurate.
For example, describeClusterOwnedSecurityGroups() could fail, and that doesn't mean the deletion process failed.

richardchen331 · 2022-03-01

I agree with you that a failed call to describeClusterOwnedSecurityGroups and a failed call to revokeAllSecurityGroupIngressRules are different.

However, if we mark the condition as Deleting first, we could run into the same issue that this PR is trying to address (patching the same condition multiple times in one reconciliation loop).

One option is to mark conditions explicitly inside DeleteSecurityGroups (e.g. set the reason to something like DescribeFailed if describeClusterOwnedSecurityGroups fails, and to DeleteFailed if revokeAllSecurityGroupIngressRules fails) and not mark the condition in awsmanagedcontrolplane_controller.go. WDYT?

sedefsavas · 2022-03-01T22:16:37Z

pkg/cloud/services/securitygroup/securitygroups.go

@@ -252,27 +252,26 @@ func (s *Service) ec2SecurityGroupToSecurityGroup(ec2SecurityGroup *ec2.Security

 // DeleteSecurityGroups will delete a service's security groups.
 func (s *Service) DeleteSecurityGroups() error {
-	conditions.MarkFalse(s.scope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition, clusterv1.DeletingReason, clusterv1.ConditionSeverityInfo, "")


I'd rather keep this deleting state. If something fails between this point and the actual deletion, we will know exactly which section is problematic.

pydctw · 2022-03-01T23:45:47Z

Hi @richardchen331, I am not sure if you had a chance to look at the PR I linked, #3234, but the error patching conditions seems to be gone with it.

I think the scope of this PR can change to solve the problem of ClusterSecurityGroupsReadyCondition cycling between Deleted <-> Deleting, which I have observed many times.

We can rearrange the code to set the Deleting condition only when len(clusterGroups) > 0 and patch it right away.

if len(clusterGroups) == 0 {
	return nil
}

conditions.MarkFalse(s.scope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition, clusterv1.DeletingReason, clusterv1.ConditionSeverityInfo, "")
if err := s.scope.PatchObject(); err != nil {
	return err
}

Thoughts?
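
A self-contained Go sketch of why this ordering avoids the Deleted -> Deleting flip on a retry is shown below. The fakeScope type and the string reasons are hypothetical stand-ins for the real scope and condition helpers, and marking Deleted at the end of the function is an assumption made only for this sketch.

```go
package main

import "fmt"

// fakeScope is a hypothetical stand-in for the cluster scope; it records
// every reason that gets patched so the sequence can be inspected.
type fakeScope struct {
	reason  string
	patched []string
}

// PatchObject records the reason that would be sent to the management cluster.
func (s *fakeScope) PatchObject() error {
	s.patched = append(s.patched, s.reason)
	return nil
}

// deleteSecurityGroups follows the ordering suggested above: return early
// when there are no groups, otherwise mark Deleting and patch right away.
func deleteSecurityGroups(scope *fakeScope, clusterGroups []string) error {
	if len(clusterGroups) == 0 {
		return nil
	}
	scope.reason = "Deleting"
	if err := scope.PatchObject(); err != nil {
		return err
	}
	// ... the actual deletion of clusterGroups would happen here ...
	scope.reason = "Deleted" // assumption: Deleted is marked once deletion succeeds
	return nil
}

func main() {
	scope := &fakeScope{}
	_ = deleteSecurityGroups(scope, []string{"sg-1"}) // first pass: groups exist
	_ = deleteSecurityGroups(scope, nil)              // retry: nothing left to delete
	fmt.Println(scope.reason, scope.patched)
}
```

On the retry the early return leaves the reason at Deleted and records no further Deleting patch.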

richardchen331 · 2022-03-02T01:07:58Z

For some reason I missed your PR. After looking at it, I think your suggestion makes perfect sense :) I updated the PR to incorporate your feedback. Could you take another look?

pydctw · 2022-03-02T01:13:54Z

/lgtm

richardchen331 · 2022-03-02T01:24:54Z

/retest

sedefsavas · 2022-03-02T08:03:41Z

The oscillation between the deleting and deleted reasons is not unusual and could happen for any condition we set in DeleteNetwork(), as we set deleting right before the delete method is called.

sedefsavas · 2022-03-02T08:03:53Z

/test ?

k8s-ci-robot · 2022-03-02T08:03:54Z

@sedefsavas: The following commands are available to trigger required jobs:

/test pull-cluster-api-provider-aws-build
/test pull-cluster-api-provider-aws-test
/test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-blocking
/test pull-cluster-api-provider-aws-e2e-conformance
/test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
/test pull-cluster-api-provider-aws-e2e-eks

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-provider-aws-build
pull-cluster-api-provider-aws-test
pull-cluster-api-provider-aws-verify

In response to this:

/test ?

sedefsavas · 2022-03-02T08:04:18Z

/test pull-cluster-api-provider-aws-test

sedefsavas · 2022-03-02T08:24:48Z

/approve

k8s-ci-robot · 2022-03-02T08:25:00Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sedefsavas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [sedefsavas]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Fix change condition conflict in reconcileDelete

1331736

k8s-ci-robot added the release-note, kind/bug, cncf-cla: yes, and needs-priority labels Feb 5, 2022

k8s-ci-robot added the needs-triage label Feb 5, 2022

k8s-ci-robot added the needs-ok-to-test label Feb 5, 2022

k8s-ci-robot requested review from dlipovetsky and shivi28 February 5, 2022 19:09

k8s-ci-robot added the size/XS label Feb 5, 2022

richardchen331 mentioned this pull request Feb 7, 2022

Conditions error on deleting AWSManagedControlPlane #2180

Closed

k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Feb 8, 2022

richardcase reviewed Feb 8, 2022

sedefsavas added this to the v1.4.0 milestone Feb 8, 2022

address feedback

e75ca19

k8s-ci-robot added the size/S label and removed the size/XS label Feb 15, 2022

richardchen331 requested a review from richardcase February 15, 2022 02:54

pydctw reviewed Feb 16, 2022

pydctw mentioned this pull request Feb 18, 2022

Add ClusterSecurityGroupsReadyCondition to managedcontrolplane's patchObject #3234

Merged

Address review comments

5f5e1c7

richardchen331 requested a review from pydctw March 1, 2022 02:37

sedefsavas reviewed Mar 1, 2022

richardchen331 requested a review from sedefsavas March 1, 2022 23:21

richardchen331 added 3 commits March 1, 2022 17:03

incorporate PR feedback

6154132

add backed Deleted

f6eac09

add new line

95c2dc2

k8s-ci-robot assigned pydctw Mar 2, 2022

k8s-ci-robot added the lgtm label Mar 2, 2022

k8s-ci-robot added the approved label Mar 2, 2022

k8s-ci-robot merged commit 0eee277 into kubernetes-sigs:main Mar 2, 2022

k8s-ci-robot modified the milestones: v1.4.0, v1.x Mar 2, 2022
