Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: external load balancer garbage collection (part 2) - new gc service #3610

Merged
merged 3 commits into from
Jul 29, 2022

Conversation

richardcase
Copy link
Member

@richardcase richardcase commented Jul 21, 2022

What type of PR is this?

/kind feature

What this PR does / why we need it:

This introduces a new garbage collection service which will be used later by the controllers to do clean up of AWS resources for workload clusters that were created via the CCM. Initially, we support deleting load balancers (classic elb, nlb, alb), target groups and security groups.

The service is design to be called by the infra cluster reconcilers when a cluster is delete. The entry point is ReconcileDelete into the service.

When a cluster is deleted the cluster controllers will call ReconcileDelete. The first job is to determine if the workload cluster being deleted should be garbage collected. A cluster will be garbage collected if either of these are true:

  • the aws.cluster.x-k8s.io/external-resource-gc annotation is absent
  • the aws.cluster.x-k8s.io/external-resource-gc annotation exists and its value is set to true

If a cluster is to be garbage collected then we need to identify the AWS resources that where created by the CCM in the workload cluster. This is done by using the AWS resource tagging api and getting resources with the Kubernetes cluster owned label (i.e. kubernetes.io/cluster/[CLUSTERNAME]). The resources that are returned are converted and then the list of them is passed to the defined cleanup functions in order. The order of the functions matter because some AWS resources cannot be delete before others. For example, target groups can only be deleted after load balancers. The cleanup function will then iterate over the resources and decide if it wants to handle that resource and delete it from AWS.

The service will be use by a later PR to the controllers and so is initially unused.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Relates #1718

Special notes for your reviewer:

This part 2 of 4 of changes (i.e. a stack) to implement the garbage collection. This is work to split up the original pr #3518

Checklist:

  • squashed commits
  • adds unit tests

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 21, 2022
@k8s-ci-robot
Copy link
Contributor

@richardcase: This issue is currently awaiting triage.

If CAPA/CAPI contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jul 21, 2022
@richardcase
Copy link
Member Author

/test ?

@k8s-ci-robot
Copy link
Contributor

@richardcase: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-aws-build
  • /test pull-cluster-api-provider-aws-test
  • /test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-aws-apidiff-main
  • /test pull-cluster-api-provider-aws-e2e
  • /test pull-cluster-api-provider-aws-e2e-blocking
  • /test pull-cluster-api-provider-aws-e2e-conformance
  • /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-aws-e2e-eks

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-aws-apidiff-main
  • pull-cluster-api-provider-aws-build
  • pull-cluster-api-provider-aws-test
  • pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@richardcase
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

Comment on lines 51 to 58
eksClusterName := getTagValue(eksClusterNameTag, mapping)
if eksClusterName != "" {
s.scope.V(2).Info("Security group created by EKS directly, skipping deletion", "cluster_name", eksClusterName)

return nil
}

//TODO: should we check for the security group name start with k8s-elb-
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it confusing that the caller filters the set of resources to pass to the clean up function, e.g., https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/3610/files#diff-ac45a93b8f5e87ab26c8de0dd3165846262b1cb89b0e318586f6ed2b33f61627R127, and then the clean up function filters the set further.

Can we make either the caller, or this function, responsible for all the filtering?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caller is only filtering based on the resource type (i.e. security-group). I'm not sure this will be any clearer:

if strings.HasPrefix(parsedARN.Resource, "security-group/") {
  eksClusterName := getTagValue(eksClusterNameTag, mapping)
  if eksClusterName != "" {
    s.scope.V(2).Info("Security group created by EKS directly, skipping deletion", "cluster_name", eksClusterName)
     continue
  }
  s.scope.V(2).Info("Deleting Security group", "arn", parsedARN.String())
  return s.deleteSecurityGroup(ctx, &parsedARN, res)
}

But happy to change if you want.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what I am trying to say is this: if I had to call this function, I'm not how I would know that my input is correct.

Right now, the caller needs to filter resources before calling this function, but I think that's not enforced by the compiler.

For example, what happens if I call the function as it currently stands and pass a resource that is NOT a security group? Does groupID get an invalid value, and then the DeleteSecurityGroupWithContext call fails?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, i see what you are saying. I guess i am relying on the fact that the only exportable function is ReconcileDelete and from here we control what calls what.

The current call to deleteSecurityGroup is protected with a check (i.e. strings.HasPrefix(parsedARN.Resource, "security-group/") ). But what you are saying is that someone could come along later and make a change in the gc package somewhere and call deleteSecurityGroup without this check.

Its a bit late in the day for me...will ponder this tomorrow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dlipovetsky - i have changed this now so that the clean up functions get passed the list of all the resources, they then decide to filter the resources and delete if required.

@dlipovetsky
Copy link
Contributor

Really nice job, especially with all the new tests, which are complex to write! 🎉

I looked at everything, except the Reconcile and ReconcileDelete in pkg/cloud/services/gc/cleanup.go, because I asked about their design in the proposal PR.

@richardcase
Copy link
Member Author

I looked at everything, except the Reconcile and ReconcileDelete in pkg/cloud/services/gc/cleanup.go, because I asked about their design in the proposal PR.

@dlipovetsky - i have updated this in line with the latest changes from the proposal (which was updated based on your suggestions).

@richardcase
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e-eks

@richardcase
Copy link
Member Author

Updated based on latest proposal changes....removing the extra finalizer we were using.

@richardcase
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e-eks

@richardcase richardcase force-pushed the 1718_gc_new_service branch 3 times, most recently from 80a3a16 to 5cde6ae Compare July 28, 2022 14:43
The the cleanup functions now accepted the full list of AWS resources
and then they filter which resources they want to delete themselves.

Signed-off-by: Richard Case <richard.case@outlook.com>
@Skarlso
Copy link
Contributor

Skarlso commented Jul 28, 2022

Great work! Really cool. :) I can't seem to resolve my comments. :( So sorry about them litering too much.

@richardcase
Copy link
Member Author

Great work! Really cool. :) I can't seem to resolve my comments. :( So sorry about them litering too much.

Thanks for the review @Skarlso - some great ideas on how to improve the implementation 🙇

@sedefsavas
Copy link
Contributor

Started reviewing this one, will leave comments by EOD today.

@sedefsavas
Copy link
Contributor

Great work @richardcase.
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 29, 2022
@richardcase
Copy link
Member Author

With 3 different reviews lets merge. We can make changes if needed in the future.

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: richardcase

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 29, 2022
@k8s-ci-robot k8s-ci-robot merged commit daab273 into kubernetes-sigs:main Jul 29, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.x milestone Jul 29, 2022
@k8s-infra-cherrypick-robot

@richardcase: #3610 failed to apply on top of branch "release-1.5":

Applying: feat: new garbage collection servive
Applying: refactor: gc opt in/out any time before delete
Using index info to reconstruct a base tree...
M	exp/api/v1beta1/types.go
A	pkg/cloud/services/awsnode/cni_test.go
Falling back to patching base and 3-way merge...
CONFLICT (modify/delete): pkg/cloud/services/awsnode/cni_test.go deleted in HEAD and modified in refactor: gc opt in/out any time before delete. Version refactor: gc opt in/out any time before delete of pkg/cloud/services/awsnode/cni_test.go left in tree.
Auto-merging exp/api/v1beta1/types.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 refactor: gc opt in/out any time before delete
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

And we will want all GC change in 1.5 so:

/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@richardcase
Copy link
Member Author

/cherry-pick release-1.5

@k8s-infra-cherrypick-robot

@richardcase: #3610 failed to apply on top of branch "release-1.5":

Applying: feat: new garbage collection servive
Applying: refactor: gc opt in/out any time before delete
Using index info to reconstruct a base tree...
M	exp/api/v1beta1/types.go
A	pkg/cloud/services/awsnode/cni_test.go
Falling back to patching base and 3-way merge...
CONFLICT (modify/delete): pkg/cloud/services/awsnode/cni_test.go deleted in HEAD and modified in refactor: gc opt in/out any time before delete. Version refactor: gc opt in/out any time before delete of pkg/cloud/services/awsnode/cni_test.go left in tree.
Auto-merging exp/api/v1beta1/types.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 refactor: gc opt in/out any time before delete
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants