Fix change condition conflict in reconcileDelete #3157
Conversation
@richardchen331: This issue is currently awaiting triage. If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the `triage/accepted` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @richardchen331. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
Adding logic that relies on a condition that is set by the very same controller is not preferred. Does it make sense to set this condition only during
```go
if err := sgService.DeleteSecurityGroups(); err != nil {
	log.Error(err, "error deleting general security groups for AWSManagedControlPlane", "namespace", controlPlane.Namespace, "name", controlPlane.Name)
	return reconcile.Result{}, err

if conditions.GetReason(managedScope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition) != clusterv1.DeletedReason {
```
I don't think using the condition as part of the reconciliation logic is the right way to go when the condition is set in the same reconciliation process.
Would something like this work:
- change `DeleteSecurityGroups`:
  - If the length of `clusterGroups` is 0, then exit early.
  - If the length of `clusterGroups` is greater than 0, then set `clusterv1.DeletingReason`.
  - In any error handling blocks, just return the error.
- change AWSMCP `reconcileDelete`:
  - In the error handler block for `sgService.DeleteSecurityGroups()`, set the `DeletingFailed` reason.
  - If successful, set the `DeletedReason`.

A rough sketch of this shape follows below.
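This is a paraphrase, not the repo's actual code: `describeClusterOwnedSecurityGroups` and `deleteSecurityGroup` are assumed helpers on the securitygroup `Service`, and the existing `clusterv1.DeletionFailedReason` constant stands in for the `DeletingFailed` reason named above.

```go
import (
	infrav1 "sigs.k8s.io/cluster-api-provider-aws/api/v1beta1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// DeleteSecurityGroups, restructured per the suggestion above.
func (s *Service) DeleteSecurityGroups() error {
	clusterGroups, err := s.describeClusterOwnedSecurityGroups()
	if err != nil {
		// Error handling blocks just return the error; the caller records the reason.
		return err
	}
	// Length 0: nothing to delete, so exit early without touching the condition.
	if len(clusterGroups) == 0 {
		return nil
	}
	// Length > 0: there is real work to do, so flag Deleting now.
	conditions.MarkFalse(s.scope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition,
		clusterv1.DeletingReason, clusterv1.ConditionSeverityInfo, "")
	for i := range clusterGroups {
		if err := s.deleteSecurityGroup(&clusterGroups[i]); err != nil {
			return err
		}
	}
	return nil
}
```

and on the AWSManagedControlPlane side:

```go
// reconcileDelete: the caller owns the terminal reasons.
if err := sgService.DeleteSecurityGroups(); err != nil {
	conditions.MarkFalse(managedScope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition,
		clusterv1.DeletionFailedReason, clusterv1.ConditionSeverityWarning, err.Error())
	return reconcile.Result{}, err
}
conditions.MarkFalse(managedScope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition,
	clusterv1.DeletedReason, clusterv1.ConditionSeverityInfo, "")
```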
Thanks for the feedback! That sounds good. Updated logic according to your suggestion.
/retest
/test?
@pydctw: The following commands are available to trigger required jobs:
The following commands are available to trigger optional jobs:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test pull-cluster-api-provider-aws-e2e-eks
```diff
 conditions.MarkFalse(s.scope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition, clusterv1.DeletedReason, clusterv1.ConditionSeverityInfo, "")

-return nil
+return err
```
Shouldn't we return nil here? (I know err is nil here but it's hard to understand the code)
Makes sense. Let me change it to nil.
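The tail of the function would then read roughly like this (a sketch, not the exact diff):

```go
// On the success path, mark the condition Deleted...
conditions.MarkFalse(s.scope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition,
	clusterv1.DeletedReason, clusterv1.ConditionSeverityInfo, "")
// ...and return the literal nil: err happens to be nil here, but spelling it
// out makes the success case readable without tracing err's value.
return nil
```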
```go
if len(clusterGroups) == 0 {
	return nil
}
conditions.MarkFalse(s.scope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition, clusterv1.DeletingReason, clusterv1.ConditionSeverityInfo, "")
```
Wouldn't this condition reason, `clusterv1.DeletingReason`, never be used, since it is not patched inside the function, and when the `DeleteSecurityGroups` function returns, the `ClusterSecurityGroupsReadyCondition` condition is set again?
You're right. Let me remove it.
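For context on why the reason is never observed: `conditions.MarkFalse` only mutates the object in memory; nothing reaches the API server until the scope's patch helper runs. A hedged sketch, assuming the scope exposes a `PatchObject()` helper as CAPA scopes do:

```go
// Mutates only the in-memory object; not yet visible on the management cluster.
conditions.MarkFalse(s.scope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition,
	clusterv1.DeletingReason, clusterv1.ConditionSeverityInfo, "")

// Without an intervening patch like this, the Deleted mark at the end of the
// function simply overwrites the in-memory Deleting reason before anything
// is persisted.
if err := s.scope.PatchObject(); err != nil {
	return err
}
```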
I've also observed this patch condition conflict and am wondering why it is observed only in EKS. `DeleteSecurityGroups` is used both for regular and EKS clusters, but I haven't seen it with a regular cluster. Any thoughts, @richardcase? If we decide to go with this solution, awscluster_controller.go also needs a change, as `DeleteSecurityGroups` is a common function.
@richardchen331, FYI, I opened a PR to add `ClusterSecurityGroupsReadyCondition` to managedcontrolplane's `patchObject`. I still think checking the length of `clusterGroups` before setting the condition to `DeletingReason`, as you implemented in this PR, is needed, as it will prevent the condition from going back from Deleted -> Deleting.
Hi @pydctw, I updated the PR according to your comments and updated awscluster_controller.go as well. Could you take a look? Thanks!
/test pull-cluster-api-provider-aws-e2e-eks
```diff
@@ -285,8 +285,10 @@ func (r *AWSManagedControlPlaneReconciler) reconcileDelete(ctx context.Context,

 	if err := sgService.DeleteSecurityGroups(); err != nil {
 		log.Error(err, "error deleting general security groups for AWSManagedControlPlane", "namespace", controlPlane.Namespace, "name", controlPlane.Name)
+		conditions.MarkFalse(managedScope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition, clusterv1.DeletionFailedReason, clusterv1.ConditionSeverityWarning, err.Error())
```
Deletion has not failed in every scenario where `DeleteSecurityGroups()` returns an error, so keeping the condition set as deleting while surfacing the error is more accurate. For example, `describeClusterOwnedSecurityGroups()` could fail, and that doesn't mean the deletion process failed.
I agree with you that a failed call to `describeClusterOwnedSecurityGroups` and a failed call to `revokeAllSecurityGroupIngressRules` are different. However, if we mark the condition as Deleting first, we could run into the same issue that this PR is trying to address (patching the same condition multiple times in one reconciliation loop). One option is to mark conditions explicitly inside `DeleteSecurityGroups` (e.g. set the reason to something like `DescribeFailed` if `describeClusterOwnedSecurityGroups` failed, and set the reason to `DeleteFailed` if `revokeAllSecurityGroupIngressRules` failed) and not mark the condition in awsmanagedcontrolplane_controller.go. WDYT?
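A sketch of what that option could look like. `DescribeFailed` and `DeleteFailed` are hypothetical reason strings (they do not exist in `clusterv1`), and the helper names follow the ones discussed above:

```go
// Hypothetical, CAPA-defined reason constants (not part of clusterv1).
const (
	securityGroupDescribeFailedReason = "DescribeFailed"
	securityGroupDeleteFailedReason   = "DeleteFailed"
)

func (s *Service) DeleteSecurityGroups() error {
	clusterGroups, err := s.describeClusterOwnedSecurityGroups()
	if err != nil {
		// Describing failed; deletion itself never started, so record a
		// describe-specific reason rather than a deletion failure.
		conditions.MarkFalse(s.scope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition,
			securityGroupDescribeFailedReason, clusterv1.ConditionSeverityWarning, err.Error())
		return err
	}
	for i := range clusterGroups {
		if err := s.revokeAllSecurityGroupIngressRules(clusterGroups[i].ID); err != nil {
			// The deletion path failed; record a deletion-specific reason.
			conditions.MarkFalse(s.scope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition,
				securityGroupDeleteFailedReason, clusterv1.ConditionSeverityWarning, err.Error())
			return err
		}
	}
	return nil
}
```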
```diff
@@ -252,27 +252,26 @@ func (s *Service) ec2SecurityGroupToSecurityGroup(ec2SecurityGroup *ec2.Security

 // DeleteSecurityGroups will delete a service's security groups.
 func (s *Service) DeleteSecurityGroups() error {
-	conditions.MarkFalse(s.scope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition, clusterv1.DeletingReason, clusterv1.ConditionSeverityInfo, "")
```
I'd rather keep this deleting state; if something fails between this and the actual deletion, we will know exactly which section is problematic.
Hi @richardchen331, I am not sure if you had a chance to look at the PR I linked, #3234, but I think the scope of this PR can change to solve the problem of the condition oscillating between Deleting and Deleted. We can rearrange the code to set the Deleting condition only when len(clusterGroups) > 0 and patch it right away.

Thoughts?
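A minimal sketch of that rearrangement (again assuming a `PatchObject()` helper on the scope, and the `describeClusterOwnedSecurityGroups` helper named earlier):

```go
func (s *Service) DeleteSecurityGroups() error {
	clusterGroups, err := s.describeClusterOwnedSecurityGroups()
	if err != nil {
		return err
	}
	if len(clusterGroups) == 0 {
		// Nothing to delete: leave the persisted Deleted reason untouched,
		// so the condition can never flip back from Deleted to Deleting.
		return nil
	}
	conditions.MarkFalse(s.scope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition,
		clusterv1.DeletingReason, clusterv1.ConditionSeverityInfo, "")
	// Patch right away, so the later Deleted mark is a single forward
	// transition within this reconciliation rather than a second flip.
	if err := s.scope.PatchObject(); err != nil {
		return err
	}
	// ... proceed with the actual deletion ...
	return nil
}
```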
For some reason I missed your PR. After looking at it, I think your suggestion makes perfect sense :) Updated the PR to incorporate your feedback. Could you help take another look?
/lgtm
/retest
The oscillation between the Deleting and Deleted reasons is not unusual and could happen for any condition we set in
/test ?
@sedefsavas: The following commands are available to trigger required jobs:
The following commands are available to trigger optional jobs:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test pull-cluster-api-provider-aws-test
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: sedefsavas. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
What type of PR is this?
/kind bug
What this PR does / why we need it:
#2180 is occurring consistently; I saw it when I looked at the controller logs in https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-cluster-api-provider-aws-e2e-eks

I investigated this a bit, and I think the root cause is the following:

- In `reconcileDelete`, it calls `DeleteSecurityGroups` and then `DeleteNetwork`.
- In `DeleteSecurityGroups`, it first marks `ClusterSecurityGroupsReadyCondition` to `Deleting`, patches it, and at the end marks `ClusterSecurityGroupsReadyCondition` to `Deleted`.
- In `DeleteNetwork`, it patches again, which propagates the change to `ClusterSecurityGroupsReadyCondition` (changing it to `Deleted`) to the management cluster.
- If, at the beginning of `reconcileDelete`, `ClusterSecurityGroupsReadyCondition` is already `Deleted` (this could happen, for example, if the controller tried `reconcileDelete` earlier and finished deleting security groups but didn't complete all the steps afterwards), then we are essentially patching `ClusterSecurityGroupsReadyCondition` from `Deleted` to `Deleting` and then to `Deleted` again in one `reconcileDelete` loop. This triggers an error here (https://github.com/kubernetes-sigs/cluster-api/blob/v1.0.0/util/conditions/patch.go#L167) when the controller attempts to patch `ClusterSecurityGroupsReadyCondition` from `Deleting` to `Deleted`, because `latestCondition` is `Deleting`, while `conditionPatch.Before` is `Deleted` (the value retrieved from the management cluster at the beginning of `reconcileDelete`) and `conditionPatch.After` is `Deleted` (the value the controller tries to patch).

This PR fixes the issue by checking whether `ClusterSecurityGroupsReadyCondition` is already `Deleted`; if yes, it skips `DeleteSecurityGroups` (see the sketch below).

The same issue happens to a number of other conditions (e.g. `RouteTablesReady`, `NatGatewaysReady`, `InternetGatewayReady`). The reason that only `ClusterSecurityGroupsReadyCondition` is observed is that the error is triggered in the first `PatchObject` call in `DeleteNetwork` (https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/pkg/cloud/services/network/network.go#L103), so the controller doesn't get the chance to set the other conditions from `Deleted` to `Deleting` and back to `Deleted`.

If the fix looks good, I can apply the same fix to other conditions.
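For reference, the guard shape under discussion, paraphrasing the diff earlier in the thread (not the final merged code):

```go
// Skip security group deletion entirely when the condition already reads
// Deleted, so it never moves Deleted -> Deleting -> Deleted in one loop.
if conditions.GetReason(managedScope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition) != clusterv1.DeletedReason {
	if err := sgService.DeleteSecurityGroups(); err != nil {
		log.Error(err, "error deleting general security groups for AWSManagedControlPlane",
			"namespace", controlPlane.Namespace, "name", controlPlane.Name)
		return reconcile.Result{}, err
	}
}
```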
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):

Fixes #2180
Special notes for your reviewer:
Checklist:
Release note: