Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix change condition conflict in reconcileDelete #3157

Conversation

richardchen331
Copy link
Contributor

@richardchen331 richardchen331 commented Feb 5, 2022

What type of PR is this?
/kind bug

What this PR does / why we need it:
#2180 is occurring consistently when I looked at controller logs in https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-cluster-api-provider-aws-e2e-eks

I investigated into this a bit, and I think the root cause is that:

  • In AWSMCP's reconcileDelete, it calls DeleteSecurityGroups and then DeleteNetwork.
  • In DeleteSecurityGroups, it first marks ClusterSecurityGroupsReadyCondition to Deleting, patch it, and at the end, marks ClusterSecurityGroupsReadyCondition to Deleted.
  • In DeleteNetwork, it patches again, which propagate the change to ClusterSecurityGroupsReadyCondition (changing it to Deleted) to the management cluster.
  • If at the start of reconcileDelete, ClusterSecurityGroupsReadyCondition is already Deleted (this could happen, for example, if the controller tried reconcileDelete earlier, finished deleting security groups but didn't complete all the steps afterwards), then we are essentially patching ClusterSecurityGroupsReadyCondition from Deleted to Deleting then to Deleted again in one reconcileDelete loop. This triggers an error here (https://github.com/kubernetes-sigs/cluster-api/blob/v1.0.0/util/conditions/patch.go#L167) when the controller attempts to patch ClusterSecurityGroupsReadyCondition from Deleting to Deleted because latestCondition is Deleting, however conditionPatch.Before is Deleted (value retrieved from management cluster at the beginning of reconcileDelete) and conditionPatch.After is Deleted (the value the controller tries to patch).

This PR fixes the issue by checking if ClusterSecurityGroupsReadyCondition is already Deleted, if yes, then skip DeleteSecurityGroups.

The same issue happens to a number of other conditions (e.g. RouteTablesReady, NatGatewaysReady, InternetGatewayReady). The reason that only ClusterSecurityGroupsReadyCondition is observed is that the error is triggered in the first PatchObject call in DeleteNetwork (https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/pkg/cloud/services/network/network.go#L103), and the controller doesn't have the change to set other conditions from Deleted to Deleting then back to Deleted.

If the fix looks good, I can apply the same fix to other conditions.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2180

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

Release note:

Fix change condition conflict in reconcileDelete

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority labels Feb 5, 2022
@k8s-ci-robot
Copy link
Contributor

@richardchen331: This issue is currently awaiting triage.

If CAPA/CAPI contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Feb 5, 2022
@k8s-ci-robot
Copy link
Contributor

Hi @richardchen331. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 5, 2022
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Feb 5, 2022
@sedefsavas
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 8, 2022
@sedefsavas
Copy link
Contributor

Adding logic that relies on the condition that is set by the very same controller is not preferred.

Does it make sense to set this condition only during DeleteSecurityGroups() call? Because DeleteNetworks() does that based on the response coming from it.

if err := sgService.DeleteSecurityGroups(); err != nil {
log.Error(err, "error deleting general security groups for AWSManagedControlPlane", "namespace", controlPlane.Namespace, "name", controlPlane.Name)
return reconcile.Result{}, err
if conditions.GetReason(managedScope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition) != clusterv1.DeletedReason {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think using the condition as part of reconciliation logic is the right way to go....when in the same reconciliation process.

Would something like this work:

  • change DeleteSecurityGroup :
    • If the length of clustergroups is 0 then exit early
    • Then if length of clustergroups is greater than set clusterv1.DeletingReason
    • In any error handling blocks just return the error
  • change AWSMCP reconcileDelete:
    • In the error handler block for sgService.DeleteSecurityGroups() set the DeletingFailed reason
    • if successful set the DeletedReason

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback! That sounds good. Updated logic according to your suggestion.

@sedefsavas sedefsavas added this to the v1.4.0 milestone Feb 8, 2022
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 15, 2022
@richardchen331
Copy link
Contributor Author

/retest

1 similar comment
@richardchen331
Copy link
Contributor Author

/retest

@pydctw
Copy link
Contributor

pydctw commented Feb 16, 2022

/test?

@k8s-ci-robot
Copy link
Contributor

@pydctw: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-aws-build
  • /test pull-cluster-api-provider-aws-test
  • /test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-aws-e2e
  • /test pull-cluster-api-provider-aws-e2e-blocking
  • /test pull-cluster-api-provider-aws-e2e-conformance
  • /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-aws-e2e-eks

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-aws-build
  • pull-cluster-api-provider-aws-test
  • pull-cluster-api-provider-aws-verify

In response to this:

/test?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pydctw
Copy link
Contributor

pydctw commented Feb 16, 2022

/test pull-cluster-api-provider-aws-e2e-eks

conditions.MarkFalse(s.scope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition, clusterv1.DeletedReason, clusterv1.ConditionSeverityInfo, "")

return nil
return err
Copy link
Contributor

@pydctw pydctw Feb 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we return nil here? (I know err is nil here but it's hard to understand the code)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Let me change it to nil.

if len(clusterGroups) == 0 {
return nil
}
conditions.MarkFalse(s.scope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition, clusterv1.DeletingReason, clusterv1.ConditionSeverityInfo, "")
Copy link
Contributor

@pydctw pydctw Feb 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this condition, clusterv1.DeletingReason never used as it is not patched inside the function and when DeleteSecurityGroups function returns, ClusterSecurityGroupsReadyCondition condition is set again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Let me remove it.

@pydctw
Copy link
Contributor

pydctw commented Feb 16, 2022

I've also observed this patch condition conflict and am wondering why this is observed only in EKS. DeleteSecurityGroups is used both for regular and EKS clusters but I haven't seen it with a regular cluster. Any thoughts, @richardcase?

If we decide to go with the solution, awscluster_controller.go also needs a change as DeleteSecurityGroups is a common function.

@pydctw
Copy link
Contributor

pydctw commented Feb 18, 2022

@richardchen331, FYI, I opened a PR to Add ClusterSecurityGroupsReadyCondition to managedcontrolplane's patchObject

I still think checking the length of clusterGroups before setting the condition to DeletingReason as you implemented in the PR is needed as it will prevent condition going back from Deleted -> Deleting.

@richardchen331
Copy link
Contributor Author

Hi @pydctw , I updated the PR according to your comments, and also updated awscluster_controller.go as well. Could you take a look? Thanks!

@richardcase
Copy link
Member

/test pull-cluster-api-provider-aws-e2e-eks
/test pull-cluster-api-provider-aws-e2e

@@ -285,8 +285,10 @@ func (r *AWSManagedControlPlaneReconciler) reconcileDelete(ctx context.Context,

if err := sgService.DeleteSecurityGroups(); err != nil {
log.Error(err, "error deleting general security groups for AWSManagedControlPlane", "namespace", controlPlane.Namespace, "name", controlPlane.Name)
conditions.MarkFalse(managedScope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition, clusterv1.DeletionFailedReason, clusterv1.ConditionSeverityWarning, err.Error())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deletion is not failed in all scenarios where DeleteSecurityGroups() returns error.
Having the condition set as deleting and erroring is more accurate.
For example, describeClusterOwnedSecurityGroups() could fail and this doesn't mean deletion process failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you that failed call to describeClusterOwnedSecurityGroups and failed call to revokeAllSecurityGroupIngressRules are different.

However if we mark the condition as Deleting first, we could run into the same issue that this PR is trying to address (patching the same condition multiple times in one reconciliation loop).

One option is to mark conditions explicitly inside DeleteSecurityGroups (e.g. set the reason to something like DescribeFailed if describeClusterOwnedSecurityGroups failed and set the reason to DeleteFailed if revokeAllSecurityGroupIngressRules failed) and does not mark the condition in awsmanagedcontrolplane_controller.go. WDYT?

@@ -252,27 +252,26 @@ func (s *Service) ec2SecurityGroupToSecurityGroup(ec2SecurityGroup *ec2.Security

// DeleteSecurityGroups will delete a service's security groups.
func (s *Service) DeleteSecurityGroups() error {
conditions.MarkFalse(s.scope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition, clusterv1.DeletingReason, clusterv1.ConditionSeverityInfo, "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather keep this deleting state, if something fails between this and actual deletion, we will know exactly which section is problematic.

@pydctw
Copy link
Contributor

pydctw commented Mar 1, 2022

Hi @pydctw , I updated the PR according to your comments, and also updated awscluster_controller.go as well. Could you take a look? Thanks!

Hi @richardchen331, I am not sure if you had a chance to look at the PR I linked, #3234, but the error patching conditions seems to be gone with it.

I think the scope of this PR can change to solve the problem of ClusterSecurityGroupsReadyCondition circling between Deleted <-> Deleting, which I have observed many times.

We can rearrange the code to set Deleting condition only when len(clusterGroups) > 0 and patch it right away.

if len(clusterGroups) == 0 {
	return nil
}

conditions.MarkFalse(s.scope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition, clusterv1.DeletingReason, clusterv1.ConditionSeverityInfo, "")
if err := s.scope.PatchObject(); err != nil {
		return err
}

Thoughts?

@richardchen331
Copy link
Contributor Author

Hi @pydctw , I updated the PR according to your comments, and also updated awscluster_controller.go as well. Could you take a look? Thanks!

Hi @richardchen331, I am not sure if you had a chance to look at the PR I linked, #3234, but the error patching conditions seems to be gone with it.

I think the scope of this PR can change to solve the problem of ClusterSecurityGroupsReadyCondition circling between Deleted <-> Deleting, which I have observed many times.

We can rearrange the code to set Deleting condition only when len(clusterGroups) > 0 and patch it right away.

if len(clusterGroups) == 0 {
	return nil
}

conditions.MarkFalse(s.scope.InfraCluster(), infrav1.ClusterSecurityGroupsReadyCondition, clusterv1.DeletingReason, clusterv1.ConditionSeverityInfo, "")
if err := s.scope.PatchObject(); err != nil {
		return err
}

Thoughts?

For some reason i missed your PR. After looking at it I think your suggestion makes perfect sense :) Updated the PR to incorporate your feedback. Could you help take another look?

@pydctw
Copy link
Contributor

pydctw commented Mar 2, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 2, 2022
@richardchen331
Copy link
Contributor Author

/retest

@sedefsavas
Copy link
Contributor

The oscillation between deleting and deleted reason is not unusual and could happen for any condition we set in DeleteNetwork() as we are setting deleting right before delete method is called.

@sedefsavas
Copy link
Contributor

/test ?

@k8s-ci-robot
Copy link
Contributor

@sedefsavas: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-aws-build
  • /test pull-cluster-api-provider-aws-test
  • /test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-aws-e2e
  • /test pull-cluster-api-provider-aws-e2e-blocking
  • /test pull-cluster-api-provider-aws-e2e-conformance
  • /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-aws-e2e-eks

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-aws-build
  • pull-cluster-api-provider-aws-test
  • pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sedefsavas
Copy link
Contributor

/test pull-cluster-api-provider-aws-test

@sedefsavas
Copy link
Contributor

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sedefsavas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 2, 2022
@k8s-ci-robot k8s-ci-robot merged commit 0eee277 into kubernetes-sigs:main Mar 2, 2022
@k8s-ci-robot k8s-ci-robot modified the milestones: v1.4.0, v1.x Mar 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Conditions error on deleting AWSManagedControlPlane
5 participants