
Node Labelling from AWS Tags #110

Closed
TBBle opened this issue Jun 23, 2020 · 20 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments


TBBle commented Jun 23, 2020

What would you like to be added:

I'd like the AWS Cloud Provider to be able to apply labels from Instance tags (probably inherited from ASG or Launch Template) to the Node at creation time.

Why is this needed:

Since k8s 1.16, kubelet is not allowed to set node labels (with specific exceptions) in the *.kubernetes.io namespace. A specific instance of this is the role labels, node-role.kubernetes.io/X= and kubernetes.io/role=X.

Tools such as eksctl currently rely on EC2 user_data to pass labels to kubelet to apply as self-labels, so that user-specified labels are available on nodes started from an AutoScalingGroup, which have no direct interaction with eksctl. As a result, eksctl is currently forced to reject labels that kubelet cannot self-apply.

Per my comment on the eksctl issue, setting kubelet-disallowed labels on nodes from eksctl should not be blocked by this change, as this is a "user" action, not a "kubelet" action.

It seems to me that the AWS Cloud Provider (really the Node controller) is the right place to apply some other source of information, i.e. AWS tags (as used by, e.g., Cluster Autoscaler), in order to add labels to incoming Nodes.

That said, poking at CloudNodeController doesn't show an obvious place where arbitrary labels can be applied to an incoming Node, so this might not be as simple as I'd hoped, or even the right place to do this.
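
Wherever it ends up hooking in, the mechanical part seems small. Here is a minimal client-go sketch of applying tag-derived labels to a registered Node; the applyTagLabels helper and its name are hypothetical, not existing cloud-provider-aws code:

```go
// Hypothetical sketch: patch tag-derived labels onto a Node with client-go.
// This is illustrative only, not how cloud-provider-aws actually does it.
package sketch

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func applyTagLabels(ctx context.Context, client kubernetes.Interface, nodeName string, tagLabels map[string]string) error {
	// Build a strategic-merge patch that only touches metadata.labels, so
	// labels set by kubelet or other controllers are left alone.
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"labels": tagLabels,
		},
	}
	patchBytes, err := json.Marshal(patch)
	if err != nil {
		return fmt.Errorf("building label patch: %w", err)
	}
	_, err = client.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patchBytes, metav1.PatchOptions{})
	return err
}
```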

Per aws/containers-roadmap#733, EKS's Managed Node Groups feature is also blocked from applying kubelet-forbidden labels to its nodes. Separately, metal3.io have encountered the same issue, and are contemplating support in their CAPM3 controller to manage this, which seems the equivalent of the AWS Cloud Provider in a metal3 Cluster API rollout.

Separately, this would open a discussion about whether node labels should be updated if the corresponding tags change while the nodes are alive.

/kind feature

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 21, 2020

TBBle commented Sep 21, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 21, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 20, 2020

TBBle commented Dec 20, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 20, 2020

chiefy commented Feb 17, 2021

Just want to add that this would be a huge help in the AWS-CNI bootstrapping process when you use custom networking (ENIConfig CRDs, which are AZ-dependent) to label the nodes with the proper corresponding ENIConfig. Currently we have to set up a systemd unit that annotates the node on startup, which is... not great.

@nckturner
Contributor

/assign


ayberk commented Mar 24, 2021

I'm tentatively starting to look into this, but I'd appreciate some input before I dive deeper:

  1. Are we OK with only applying certain predefined tags as labels? If we want to support arbitrary tags, there are a lot of edge cases we need to consider because of formatting and limits. I'd say let's start small for initial implementation.
  2. Assuming we apply the "role" label, anyone who can launch an instance would be able to assume that "role" in the cluster, potentially creating a new attack surface. Is this a big enough concern to not implement this feature?


TBBle commented Mar 25, 2021

The 'role' label risk applies to all arbitrary labels, because any label might be the label used by the cluster admins to put sensitive workloads on tightly-controlled nodes.

That's the rationale for removing this feature from kubelet in the first place, except for the node.kubernetes.io/… namespace: cluster administrators should now be aware that a node can self-apply labels in that namespace, and hence should not use it for selecting nodes unless they are willing to let anyone who starts a node control its membership of that selection.

So the equivalent security posture would be that all EC2 tag-sourced labels go under, e.g., tags.ec2.aws.amazon.com/..., so that administrators positively know that those labels may be assigned by anyone who can edit EC2 tags for a node, and conversely that EC2 tag-editing does not allow setting labels outside that namespace.

Then for @chiefy's use-case, the ENI_CONFIG_LABEL_DEF env-var would be something like tags.ec2.aws.amazon.com/eni-config, and it'd be pretty clear that it was trusting the 'eni-config' tag from the EC2 instance.
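
To make the namespacing idea concrete, here is a rough sketch of filtering instance tags into labels under that prefix. The tags.ec2.aws.amazon.com/ key is only the suggestion above, not an agreed-upon namespace, and the helper name is an assumption:

```go
// Illustrative only: map EC2 instance tags into node labels under a single,
// clearly-delegated prefix. Nothing here is shipped by cloud-provider-aws.
package sketch

import (
	"strings"

	"k8s.io/apimachinery/pkg/util/validation"
)

const tagLabelPrefix = "tags.ec2.aws.amazon.com/" // suggested namespace from this discussion

// labelsFromTags keeps only tags whose name and value form a valid label, and
// namespaces every resulting key under tagLabelPrefix so cluster admins know
// the value was controlled by whoever can edit EC2 tags on the instance.
func labelsFromTags(tags map[string]string) map[string]string {
	labels := make(map[string]string, len(tags))
	for name, value := range tags {
		// Skip AWS-internal tags such as aws:autoscaling:groupName.
		if strings.HasPrefix(name, "aws:") {
			continue
		}
		key := tagLabelPrefix + name
		if len(validation.IsQualifiedName(key)) != 0 {
			continue // tag name doesn't fit the label-key grammar
		}
		if len(validation.IsValidLabelValue(value)) != 0 {
			continue // tag value doesn't fit the label-value grammar
		}
		labels[key] = value
	}
	return labels
}
```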


ayberk commented Mar 26, 2021

Appreciate the input. That makes sense to me.

As an additional check, how do you feel about hiding this behind a flag? It'd be false by default and the cluster admin would have to explicitly allow the tag labeling, forcing them to acknowledge this behavior.


TBBle commented Mar 27, 2021

I think it would make sense to:

  • Have a single kill-switch for cloud-provider-aws supporting node labelling. I'd probably default this on, but "off" for the "kOps/CAPI/etc. is managing this already, thank you" use-case.
  • Automatically handle a known tag namespace, like the tags.ec2.aws.amazon.com/... prefix I mentioned above.
    • Anyone who was already using that namespace, and isn't ec2.aws.amazon.com, was just asking for trouble anyway.
  • Optionally handle cluster-autoscaler's node-template tag namespace k8s.io/cluster-autoscaler/node-template/label/... (a sketch of parsing these tags follows this list). I'd default this to 'on', but it might make sense to default it to 'off' to avoid conflicts with kOps or other existing solutions? For an EKS-managed AWS Cloud Provider, I'd turn it on and document it clearly.
    • Perhaps also support tainting from cluster-autoscaler's node-template taint tags k8s.io/cluster-autoscaler/node-template/taint? This is probably important, to avoid the labelled-before-tainted mis-scheduling issue, as this seems like it might move labelling earlier than it is now.
    • This would also mean the AWS Cloud Provider could easily fulfil the CA docs' "It is your responsibility to ensure such labels and/or taints are applied via the node's kubelet configuration at startup.".
  • Support (but not enable by default) other administrator-provided tag or tag-prefixes to honour.
    • This covers the node-role.kubernetes.io/X case, and also allows cases like ENI_CONFIG_LABEL_DEF to use custom labels outside the pre-defined tags.ec2.aws.amazon.com/... namespace.
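
As a purely illustrative sketch of the cluster-autoscaler tag handling from the third dot-point, assuming the k8s.io/cluster-autoscaler/node-template/label/<name> and .../taint/<key>=<value>:<effect> tag convention described in the CA scale-from-zero docs (the parsing code itself is hypothetical):

```go
// Rough sketch of splitting Cluster Autoscaler node-template tags into the
// labels and taints the node is expected to come up with. Illustrative only.
package sketch

import (
	"strings"

	v1 "k8s.io/api/core/v1"
)

const (
	caLabelTagPrefix = "k8s.io/cluster-autoscaler/node-template/label/"
	caTaintTagPrefix = "k8s.io/cluster-autoscaler/node-template/taint/"
)

func nodeTemplateFromTags(tags map[string]string) (map[string]string, []v1.Taint) {
	labels := map[string]string{}
	var taints []v1.Taint
	for key, value := range tags {
		switch {
		case strings.HasPrefix(key, caLabelTagPrefix):
			labels[strings.TrimPrefix(key, caLabelTagPrefix)] = value
		case strings.HasPrefix(key, caTaintTagPrefix):
			// Taint tag values look like "<value>:<effect>", e.g. "true:NoSchedule".
			parts := strings.SplitN(value, ":", 2)
			if len(parts) != 2 {
				continue // not in the expected value:effect form; skip it
			}
			taints = append(taints, v1.Taint{
				Key:    strings.TrimPrefix(key, caTaintTagPrefix),
				Value:  parts[0],
				Effect: v1.TaintEffect(parts[1]),
			})
		}
	}
	return labels, taints
}
```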

Supporting Cluster Autoscaler's Auto-Discovery tags is particularly nice, because right now I have to duplicate those values in eksctl, once as tags and again as labels/taints for the Nodegroup. eksctl-io/eksctl#1481 (comment) is a request to have eksctl auto-copy nodegroup labels and taints to the ASG tags, but it never actually advanced, probably because you can solve it manually in the config.


All that said, I just realised that CA's tags are on the ASG, and are not required to propagate to the EC2 Instance for CA to work correctly. So to implement the k8s.io/cluster-autoscaler/node-template/... tags, we need to scan the instance's ASG. In the kOps implementation, they're tracking a kOps-specific label (kops.k8s.io/instancegroup) to look up the instance group in the kOps config, which I am assuming was the same data used to build the ASG.

I'm not actually sure if the AWS API lets you get an ASG from an EC2 Instance, so this might be painful to implement.
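
If it helps, one route that looks plausible is the Auto Scaling DescribeAutoScalingInstances API (and I believe instances launched by an ASG also carry an aws:autoscaling:groupName tag). An untested aws-sdk-go sketch:

```go
// Sketch (untested): look up the owning ASG for an instance via the
// Auto Scaling API. Error handling is deliberately minimal.
package sketch

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func asgNameForInstance(instanceID string) (string, error) {
	sess, err := session.NewSession()
	if err != nil {
		return "", err
	}
	client := autoscaling.New(sess)
	out, err := client.DescribeAutoScalingInstances(&autoscaling.DescribeAutoScalingInstancesInput{
		InstanceIds: []*string{aws.String(instanceID)},
	})
	if err != nil {
		return "", err
	}
	if len(out.AutoScalingInstances) == 0 {
		return "", fmt.Errorf("instance %s is not part of an Auto Scaling group", instanceID)
	}
	return aws.StringValue(out.AutoScalingInstances[0].AutoScalingGroupName), nil
}
```

From the returned group name, DescribeAutoScalingGroups (or DescribeTags) would then expose the ASG's node-template tags.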

For Managed Nodegroups, the strongly-related aws/containers-roadmap#724 aims to teach Cluster Autoscaler to pull the list of labels and taints from the Managed Nodegroup API instead, because ASGs are limited to 50 tags, although that limit is not a new problem for non-managed Nodegroups in Cluster Autoscaler.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 27, 2021

TBBle commented Jun 28, 2021

Following up on my notes above: the third dot-point ("Optionally handle cluster-autoscaler's node-template tag namespace") wouldn't be necessary for Managed Nodegroups, since per aws/containers-roadmap#724 (comment) Managed Nodegroups will not be using AWS tags to feed labels/taints to Cluster Autoscaler. So perhaps this would be better defaulted to "off" after all, to avoid surprises, particularly surprise interactions with Managed Nodegroups.

Then you have the choice between duplicating the tags into labels/taints locally, e.g. as I do now with eksctl, or turning on the automatic application of that tag namespace by cloud-provider-aws.


jfoy commented Jul 13, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 13, 2021

TBBle commented Jul 26, 2021

Since this came back on my radar, I thought I'd mention that the fourth dot-point in my earlier suggestion ("Support (but not enable by default) other administrator-provided tags or tag-prefixes to honour") carries a high risk of introducing a security vulnerability. It probably needs, for example, an option to block using AWS tags to mark a node as node-role.kubernetes.io/master=true: if a node can steal the master role, it can attract and subvert the AWS Cloud Provider itself, at which point any other allow/deny options are out the window.

On EKS, I imagine it doesn't make sense to mark EC2 nodes as 'master', as EKS provides that. I haven't used non-EKS setups on EC2, so I'm not sure if it would make sense to allow this tagging there, or if it could just be default-blocked in all cases.
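
For illustration, a hypothetical guard along those lines; the exact deny-list is precisely the open question here, so these keys are assumptions rather than a proposal:

```go
// Hypothetical guard, sketching the "default-block the master role" idea above.
package sketch

var blockedRoleLabels = map[string]bool{
	"node-role.kubernetes.io/master":        true,
	"node-role.kubernetes.io/control-plane": true,
}

// filterControlPlaneRoles drops tag-derived labels that would let an instance
// claim the control-plane role, even when an administrator has opted into
// honouring custom tag prefixes such as node-role.kubernetes.io/.
func filterControlPlaneRoles(labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for key, value := range labels {
		if blockedRoleLabels[key] {
			continue
		}
		if key == "kubernetes.io/role" && (value == "master" || value == "control-plane") {
			continue
		}
		out[key] = value
	}
	return out
}
```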

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 24, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 23, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@olemarkus
Member

I just saw this one. It's an interesting topic.
As node-role.kubernetes.io/X= and kubernetes.io/role=X were specifically mentioned, I'd like to point out that these labels are owned by the Kubernetes installer. So, e.g., on kOps clusters these labels are owned by kops-controller, and kOps indeed supports the feature described here. If CCM were to implement support for this, it could conflict with what installers want to set.
