Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: external load balancer garbage collection (part 1) - proposal #3609

Merged
merged 2 commits into from
Jul 27, 2022

Conversation

richardcase
Copy link
Member

@richardcase richardcase commented Jul 21, 2022

What type of PR is this?

/kind documentation

What this PR does / why we need it:

This is the proposal for the new garbage collection functionality. It was written post implementation to make sure we captured the intent of the change and to aid with the review.

Even though its written post implementation it may merge first as the gc work has been split into smaller PRs to aid review.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Relates #1718

Special notes for your reviewer:

This part 1 of 4 of changes (i.e. a stack) to implement the garbage collection. This is work to split up the original pr #3518

Checklist:

  • squashed commits
  • includes documentation

@k8s-ci-robot k8s-ci-robot added kind/documentation Categorizes issue or PR as related to documentation. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 21, 2022
@k8s-ci-robot
Copy link
Contributor

@richardcase: This issue is currently awaiting triage.

If CAPA/CAPI contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 21, 2022
@Skarlso
Copy link
Contributor

Skarlso commented Jul 21, 2022

tag

Copy link
Contributor

@Skarlso Skarlso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First initial read-through. :D

I will review the structure next, but I think this is looking good so far. :)

docs/proposal/20220712-garbage-collection.md Outdated Show resolved Hide resolved
docs/proposal/20220712-garbage-collection.md Outdated Show resolved Hide resolved
docs/proposal/20220712-garbage-collection.md Outdated Show resolved Hide resolved
docs/proposal/20220712-garbage-collection.md Outdated Show resolved Hide resolved
docs/proposal/20220712-garbage-collection.md Outdated Show resolved Hide resolved
docs/proposal/20220712-garbage-collection.md Outdated Show resolved Hide resolved
docs/proposal/20220712-garbage-collection.md Outdated Show resolved Hide resolved
docs/proposal/20220712-garbage-collection.md Outdated Show resolved Hide resolved
docs/proposal/20220712-garbage-collection.md Outdated Show resolved Hide resolved
docs/proposal/20220712-garbage-collection.md Outdated Show resolved Hide resolved
This is the proposal for the new garbage collection functionality. It
was written post implementation to make sure we captured the intent of
the change and to aid with the review.

Even though its written post implementation it may merge first as the gc
work has been split into smaller PRs to aid review.

Signed-off-by: Richard Case <richard.case@outlook.com>
Copy link
Contributor

@pydctw pydctw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial review - looks good with minor nits.

One question - I have noticed the term, a child or a tenant cluster is used. Is there a reason that these terms are used instead of a workload cluster, which is commonly used in CAPI (i.e. CAPI quickstart guide)?

docs/proposal/20220712-garbage-collection.md Outdated Show resolved Hide resolved
docs/proposal/20220712-garbage-collection.md Outdated Show resolved Hide resolved
docs/proposal/20220712-garbage-collection.md Outdated Show resolved Hide resolved
docs/proposal/20220712-garbage-collection.md Outdated Show resolved Hide resolved
docs/proposal/20220712-garbage-collection.md Outdated Show resolved Hide resolved
Copy link
Contributor

@dlipovetsky dlipovetsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I looked at the original PR, I formed a different model, perhaps by mistake. At the time, I said:

The garbage collection controller always applies its finalizer, and always removes it, but it decides whether to perform garbage collection based on the annotation.
-- #3518 (comment)

This proposal describes a different mental model: The annotation controls whether the finalizer is applied.

I find this model more complex, because it allows these scenarios:

  1. User creates cluster without the gc annotation.
  2. Controller applies the gc finalizer.
  3. User wants to disable the gc, so they add the gc annotation, its value set to false.
  4. User deletes the cluster.

Observe that the gc runs during step 4, even though the user might believe that what they do in step 3 disables the gc.

  1. User creates cluster with the gc annotation, its value set to false .
  2. Controller does not apply the gc finalizer.
  3. User wants to enable the gc, so they remove the gc annotation, or modify its value to true.
  4. User deletes the cluster.

Observe that the gc does not run during step 4, even though the user might believe that what they do in step 3 enables the gc.

  1. User creates cluster with the gc annotation, its value set to false.
  2. Controller does not apply the gc finalizer.
  3. User wants to enable the gc, so they add the gc finalizer.
  4. User deletes the cluster.

Observe that the gc does run during step 4, just as the user wants, even though the annotation's value is false.

I think these scenarios violate the Principle of Least Surprise.


What do you think of the model I had in mind, where the controller always applies the gc finalizer, but, at the time it reconciles the cluster delete, evaluates the annotation and decides how to act on the gc finalizer, i.e. to clean up AWS resources, or to skip them.

I think another way of distinguishing the two models is this: In the proposal, the user decides whether to enable the gc when they create the cluster. In the model I describe, the user decides whether to enable the gc at any time before they delete the cluster.

@richardcase
Copy link
Member Author

One question - I have noticed the term, a child or a tenant cluster is used. Is there a reason that these terms are used instead of a workload cluster, which is commonly used in CAPI (i.e. CAPI quickstart guide)?

Thanks @pydctw . No reason other than we use that term a lot.....i have changed it to workload.

@richardcase
Copy link
Member Author

In the model I describe, the user decides whether to enable the gc at any time before they delete the cluster.

I agree this is a better model.

I think we could simplify the implementation further. As we aren't using a separate controller and the GC logic is being added to the existing reconciliation of the awsc & awsmcp then we don't specifically need another finalizer. If the gc feature is enabled, in the reconcileDelete of the 2 controllers we will call the gc service. The gc service will then decide at that point if it should do GC based on the existence of the annotation.

Although having a separate finalizer does allow us to remove it after doing gc so that if reconcileDelete occurs again we don't try and do gc again.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jul 26, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jul 26, 2022
@richardcase
Copy link
Member Author

@dlipovetsky - i have updated the proposal with the suggested model changes. Any chance you could take a look?

@dlipovetsky
Copy link
Contributor

I think we could simplify the implementation further. As we aren't using a separate controller and the GC logic is being added to the existing reconciliation of the awsc & awsmcp then we don't specifically need another finalizer.

Agreed!

If the gc feature is enabled, in the reconcileDelete of the 2 controllers we will call the gc service. The gc service will then decide at that point if it should do GC based on the existence of the annotation.

👍

Although having a separate finalizer does allow us to remove it after doing gc so that if reconcileDelete occurs again we don't try and do gc again.

While we can save a few "discover" API calls if we skip, I would prefer to call the gc every time; it has to be idempotent, since the controller could crash at any time.

@dlipovetsky
Copy link
Contributor

@dlipovetsky - i have updated the proposal with the suggested model changes. Any chance you could take a look?

Looks very good! Thank you for incorporating the feedback.

I think not using a separate finalizer would be an improvement, so I'll wait to see if you want to make that change before giving an lgtm.

@Skarlso
Copy link
Contributor

Skarlso commented Jul 27, 2022

That was some very good thinking @dlipovetsky. I shall endeavour to be more vigilant. :)

@richardcase
Copy link
Member Author

@dlipovetsky @Skarlso - i have now removed the use of the additional finalizer based on the discussion above.

@Skarlso
Copy link
Contributor

Skarlso commented Jul 27, 2022

I'll re-read the entire thing in a bit with that in mind.

@Skarlso
Copy link
Contributor

Skarlso commented Jul 27, 2022

Seems like you have the same ci problem as this PR: #3483

@richardcase
Copy link
Member Author

/retest

@richardcase
Copy link
Member Author

/test pull-cluster-api-provider-aws-test

This proposal has been changed so that a user can decide if they want a
workload cluster to opt in or out of garbage collection anytime before
its deleted. The default is that if the gc feature is enabled then a
workload cluster will be deleted.

This change is based on feedback from review: To quote dlipovetsky:

"In the proposal, the user decides whether to enable the gc when they
create the cluster. In the model I describe, the user decides whether
to enable the gc at any time before they delete the cluster."

Signed-off-by: Richard Case <richard.case@outlook.com>
@dlipovetsky
Copy link
Contributor

Thanks for a great proposal!

/lgtm

I just had one more thought: Would we ever want the user to choose which resources are garbage collected? If so, we may want to include the resource category in the annotation key. For example,

external-resource-gc.aws.cluster.x-k8s.io/elb: true
external-resource-gc.aws.cluster.x-k8s.io/ebs: false

Since this is an experimental feature, I think we can make this change later, if necessary.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 27, 2022
@richardcase
Copy link
Member Author

I just had one more thought: Would we ever want the user to choose which resources are garbage collected?

Yes i think we would and have created #3541 to look at this type customization.

@richardcase
Copy link
Member Author

Thanks everyone for taking the time to review.

And special thanks to @dlipovetsky for his diligence in the review and helping to drive us to a better overall solution 🙇

/approve

And we will want all GC change in 1.5 so:

/cherry-pick release-1.5

@k8s-infra-cherrypick-robot

@richardcase: once the present PR merges, I will cherry-pick it on top of release-1.5 in a new PR and assign it to you.

In response to this:

Thanks everyone for taking the time to review.

And special thanks to @dlipovetsky for his diligence in the review and helping to drive us to a better overall solution 🙇

/approve

And we will want all GC change in 1.5 so:

/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: richardcase

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 27, 2022
@k8s-ci-robot k8s-ci-robot merged commit d6ea609 into kubernetes-sigs:main Jul 27, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.x milestone Jul 27, 2022
@k8s-infra-cherrypick-robot

@richardcase: new pull request created: #3625

In response to this:

Thanks everyone for taking the time to review.

And special thanks to @dlipovetsky for his diligence in the review and helping to drive us to a better overall solution 🙇

/approve

And we will want all GC change in 1.5 so:

/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Skarlso
Copy link
Contributor

Skarlso commented Jul 27, 2022

Very nice work! Sorry I didn't get to re-reading it. :/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants