
Fork the tagging controller into generic node customization controller #833

Closed
mmerkes opened this issue Jan 21, 2024 · 3 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@mmerkes
Contributor

mmerkes commented Jan 21, 2024

I intend to begin work on this feature on 2/5/24, and I've discussed it in detail with my colleagues, but I wanted to post an issue here for tracking purposes and for feedback from the community.

What would you like to be added?

In 2022, a tagging controller was added to the cloud provider, which is responsible for tagging and untagging node resources as they join and leave the cluster. This controller runs in clusters managed by EKS by default and it can be run elsewhere.

The tagging controller is a very focused controller at this point. While there have been discussions about expanding the resources it supports, it's still very limited. However, its usefulness could be expanded by converting it into a generic node customization controller. It already watches node events, so we could take advantage of the machinery that's already running and extend it further. There are other feature requests, like #300, that this would simplify, and it would let operators run only a single controller.
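
For illustration only (this is not the actual cloud-provider-aws code), a minimal sketch of what a generic node customization controller loop could look like: one shared node informer driving several customization hooks. The `Customizer` type and function names below are hypothetical.

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// Customizer is a hypothetical hook applied to every node the controller sees,
// e.g. one hook for EC2 tagging and another for extra labels.
type Customizer func(node *v1.Node) error

// runNodeCustomizationController wires all customizers to a single node informer.
func runNodeCustomizationController(clientset kubernetes.Interface, customizers []Customizer, stopCh <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(clientset, 0)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			node := obj.(*v1.Node)
			for _, customize := range customizers {
				_ = customize(node) // a real controller would requeue on error
			}
		},
	})
	factory.Start(stopCh)
}
```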

Why is this needed?

There's a feature request to add support for labeling nodes with the AWS zone ID along with the zone name. The zone name is configured as the node's zone today, and that can create complications for users running in multiple AWS accounts. In the largest AWS regions, like us-east-1 and us-west-2, zone names, like us-east-1a, do not always map to the same zone IDs, like use1-az2, across accounts, so if you're trying to assess impact to a particular zone, you first need to normalize the zone names across accounts.
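
To make the normalization problem concrete, here's a small standalone example (not part of the proposal itself) that prints the zone-name-to-zone-ID mapping using the EC2 DescribeAvailabilityZones API via the AWS SDK for Go v2; the same zone name can print a different zone ID depending on which account's credentials you run it with.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)
	out, err := client.DescribeAvailabilityZones(ctx, &ec2.DescribeAvailabilityZonesInput{})
	if err != nil {
		log.Fatal(err)
	}
	for _, az := range out.AvailabilityZones {
		// The ZoneName -> ZoneId mapping printed here differs across accounts.
		fmt.Printf("%s -> %s\n", *az.ZoneName, *az.ZoneId)
	}
}
```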

There's no hook in the existing cloud provider to provide an additional zone label, and changing the node's zone to be a zone ID rather than a zone name would be a breaking, backwards-incompatible change. Any expectations around the zone name in existing clusters, whether from resources deployed there or from tooling, would break, and the migration could cause unexpected disruption to scheduling, PDBs, etc. As a result, the only practical option is to put this in a controller, and building on the tagging controller, albeit by forking and extending it, means that operators only need to run one controller for both purposes.

We've also heard requests to label nodes with the accelerator family, which would be easy to add with this change. I can imagine other labels being added in the future that could also be useful.

FAQ

What's being added to the tagging controller?

Support for adding additional labels on nodes, specifically zone IDs (#300). The label would be something like topology.k8s.aws/zone-id.
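
As a rough sketch of what applying such a label could look like with client-go (the label key is the example from this issue; the function name and its nodeName/zoneID inputs are hypothetical):

```go
package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// labelNodeWithZoneID patches the proposed zone-id label onto a node.
// nodeName and zoneID are assumed to come from the controller's node and EC2 lookups.
func labelNodeWithZoneID(ctx context.Context, clientset kubernetes.Interface, nodeName, zoneID string) error {
	patch := []byte(fmt.Sprintf(`{"metadata":{"labels":{"topology.k8s.aws/zone-id":%q}}}`, zoneID))
	_, err := clientset.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```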

Why not just add it to the tagging controller?
Right now, the tagging controller is pretty lightweight, so we can migrate to a more appropriate name now with little controversy.

Can I just keep using the tagging controller?
Yes. The change would be done in a way that reuses that code but keeps the existing tagging controller command, to avoid breaking clusters that are unaware of the change. I would add a log message recommending migration to the new controller, and possibly deprecate the tagging controller some day, but that can be discussed in the future.

How would cluster admins migrate from the tagging controller to the new controller?
Cluster admins would swap out the tagging controller for the new controller as a direct replacement. The intention is for the operation to be simple and quiet.

Do we need to add any new permissions?
The CCM should already have all of the permissions needed to support the mentioned functionality.

/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 21, 2024
@mmerkes
Contributor Author

mmerkes commented Jan 21, 2024

/assign mmerkes

@mmerkes
Contributor Author

mmerkes commented Jan 21, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 21, 2024
@mmerkes
Contributor Author

mmerkes commented Feb 20, 2024

Resolving this issue in favor of a different solution: kubernetes/kubernetes#123223

@mmerkes mmerkes closed this as completed Feb 20, 2024