Investigate usage/requirements for ClusterID #12

Closed
andrewsykim opened this issue Feb 20, 2019 · 15 comments

Labels: lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.), P2 (Priority 2)

andrewsykim (Member) commented Feb 20, 2019

In kubernetes/kubernetes#48954 & kubernetes/kubernetes#49215 we made ClusterID a requirement and added an --allow-untagged-cloud flag to the kube-controller-manager. The intention was to let clusters get away with not setting a ClusterID for a few releases, but to eventually make it a requirement. It seems we never followed through with cleaning up the --allow-untagged-cloud flag.
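
For reference, here is a minimal sketch (paraphrased for illustration, not the exact upstream code) of the check the flag gates: the cloud provider reports whether a ClusterID was found via HasClusterID(), and --allow-untagged-cloud only downgrades the resulting startup error to a warning.

```go
// A minimal sketch of the ClusterID check that --allow-untagged-cloud gates.
// Paraphrased for illustration only; not the exact kube-controller-manager code.
package clusteridcheck

import (
	"fmt"
	"log" // the real code logs via klog

	cloudprovider "k8s.io/cloud-provider"
)

// checkClusterID mirrors the startup behaviour: without a ClusterID the
// controller manager refuses to start, unless --allow-untagged-cloud is set,
// in which case it only emits a warning.
func checkClusterID(cloud cloudprovider.Interface, allowUntaggedCloud bool) error {
	if cloud == nil || cloud.HasClusterID() {
		return nil
	}
	if allowUntaggedCloud {
		log.Println("detected a cluster without a ClusterID; please tag your cluster, a ClusterID may become mandatory")
		return nil
	}
	return fmt.Errorf("no ClusterID found; a ClusterID is required for the cloud provider to function properly")
}
```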

More interestingly, it's not exactly clear how ClusterID is being consumed by both in-tree and out-of-tree cloud providers. It seems it's critical to AWS/GCE but not really used by others. Do we still need ClusterID? Should we use a more generic approach with labels/annotations? If we need it, should we go ahead and remove the --allow-untagged-cloud flag?

If the plan is to continue to support ClusterID, we should at least add better documentation for how this works.

cc @justinsb @rrati

@andrewsykim andrewsykim added this to the Next milestone Feb 20, 2019
justinsb (Member) commented:

So on AWS the ClusterID is used as a tag to allow multiple clusters to coexist in the same "scope". AWS doesn't have a project concept like GCP does, so originally this was needed to allow multiple clusters to exist in the same VPC at all.

Drilling down, there are two use cases:

  • isolation: Cluster A shouldn't see cluster B's resources in e.g. the cloudprovider code
  • cleanup: We should be able to clean up any resources that cluster A created when we delete it, without cleaning up cluster B resources

Cleanup is a nice-to-have - in theory if you shut down k8s cleanly you don't need it, but it is nice.

Isolation is trickier:

  • We have some resources that we tag (e.g. RouteTables, SecurityGroups, Subnets), but we could instead pass them in explicitly.
  • We also have a lot of places where we still refer to instances by NodeName and expect those to be unique in the scope, but that isn't generally a safe assumption on AWS. It is safe today because we tag and because the NodeName is the PrivateDnsName, which is derived from the internal IP; we could actually get away with just looking at the VPC + PrivateDnsName, but the effort to date has been about figuring out how to make NodeName != PrivateDnsName. It would be great (IMO) to change the cloudprovider interface to pass in the Node object instead of a NodeName (see the sketch after this list) - it would also be much more efficient, as it would save a lookup. But then we run into the future plans to lock down how kubelets register and how much of this information can be done server-side. Not sure what our plans are there.
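
To make the interface idea concrete, here is an illustrative sketch (a simplification, not the actual k8s.io/cloud-provider definition) contrasting the current NodeName-keyed lookup with a hypothetical Node-object variant:

```go
// Illustrative sketch only: a simplified comparison of the current
// NodeName-keyed lookups with a hypothetical Node-object-based variant.
// Neither type below is the actual k8s.io/cloud-provider interface.
package instancesketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// InstancesByName captures today's shape: every call has to resolve a bare
// NodeName to a cloud instance, which on AWS means searching within the
// ClusterID-tagged scope.
type InstancesByName interface {
	NodeAddresses(ctx context.Context, name types.NodeName) ([]v1.NodeAddress, error)
	InstanceID(ctx context.Context, name types.NodeName) (string, error)
}

// InstancesByNode is the hypothetical shape discussed above: with the full
// Node object available, a provider can read node.Spec.ProviderID directly
// and skip the name-based search entirely.
type InstancesByNode interface {
	NodeAddresses(ctx context.Context, node *v1.Node) ([]v1.NodeAddress, error)
	InstanceID(ctx context.Context, node *v1.Node) (string, error)
}
```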

AFAICT, what GCE calls the cluster-id is currently only used in the load balancer, as part of the name for backend rules. I'm not sure if that is for cleanup, isolation, or both.

justinsb (Member) commented Feb 20, 2019

I put up a WIP PR to show what is required to move to passing around v1.Node, BTW: kubernetes/kubernetes#74304. It's just one call, and it already looks bad, but the only complicated thing was the desiredStateOfWorld map on node-update - and the volume reconciler is where I've always gotten stuck before.

andrewsykim (Member, Author) commented:

  • We also have a lot of places where we still refer to instances by NodeName and expect those to be unique in the scope, but that isn't generally a safe assumption on AWS. It is safe today because we tag and because the NodeName is the PrivateDnsName which is derived from the internal IP; we actually could get away with just looking at the VPC + PrivateDnsName, but the effort to date has been about figuring out how to make the NodeName != PrivateDnsName. It would be great (IMO) to change the cloudprovider interface to pass in the Node object instead of a NodeName - and also much more efficient as it would save a lookup. But now we run into the future plans to lock-down how kubelets register and how much of this information can be done server-side. Not sure what our plans are there.

@justinsb is it safe to say that if we replace the places where we pass in only the node name with the entire node object, this would remove the need for ClusterID? That seems like a reasonable thing to do to me, because there are a lot of places where we have to "search node by name" when we could just get the instance ID from spec.ProviderID.
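
As a rough sketch of what that would look like (the providerID value and parsing shown assume the AWS-style format; other providers use their own schemes):

```go
// A minimal sketch of reading the instance ID from spec.providerID instead
// of searching for the instance by node name. The example value and the
// parsing below assume the AWS-style "aws:///<zone>/<instance-id>" format;
// other providers use their own schemes.
package provideridsketch

import (
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
)

// instanceIDFromNode returns the last path segment of spec.providerID,
// e.g. "aws:///us-west-2a/i-0123456789abcdef0" -> "i-0123456789abcdef0".
func instanceIDFromNode(node *v1.Node) (string, error) {
	pid := node.Spec.ProviderID
	if pid == "" {
		return "", fmt.Errorf("node %s has no spec.providerID set", node.Name)
	}
	parts := strings.Split(strings.TrimSuffix(pid, "/"), "/")
	return parts[len(parts)-1], nil
}
```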

andrewsykim (Member, Author) commented Feb 27, 2019

AFAICT, what GCE calls the cluster-id is currently only used in the load balancer, as part of the name for backend rules. I'm not sure if that is for cleanup, isolation, or both.

For the GCE case (and please correct me if I'm wrong), it seems like it supports cluster ID, but not in the same way as AWS. GCE generates its own cluster ID dynamically and stores the value in a ConfigMap. Its main purpose is isolation, given that the cluster ID value is propagated to LB resources like you mentioned. For what it's worth, it seems like cluster ID is an internal implementation detail and not something that requires configuration on the control plane, in which case removing "Cluster ID" as a feature wouldn't really change things for GCE.
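
For illustration, reading that value client-side might look like the sketch below; the ConfigMap namespace, name, and key are assumptions based on my reading of the GCE provider code, not a documented contract.

```go
// A rough sketch of reading the GCE-generated cluster ID from the cluster.
// The ConfigMap namespace/name/key used here ("kube-system"/"ingress-uid",
// key "uid") are assumptions based on a reading of the GCE provider code,
// not a documented contract.
package gceclusteridsketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func readGCEClusterID(ctx context.Context, cs kubernetes.Interface) (string, error) {
	cm, err := cs.CoreV1().ConfigMaps("kube-system").Get(ctx, "ingress-uid", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	// The GCE provider propagates this value into load-balancer resource names.
	return cm.Data["uid"], nil
}
```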

andrewsykim (Member, Author) commented:

TODO: investigate if other providers need this

/milestone Next
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Mar 20, 2019
@andrewsykim andrewsykim added P1 Priority 1 P2 Priority 2 and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. P1 Priority 1 labels Mar 28, 2019
timoreimann (Contributor) commented:

FWIW, DigitalOcean is similar to AWS in the sense that we do not have a dedicated notion of a cluster scope / project. We currently use a cluster ID (injected via an environment variable, since we weren't certain about the future of the interface-provided cluster ID) only for billing / grouping purposes, translating it to tags on DO resources like load balancers.

The isolation and cleanup goals could help us as well to better deal with certain edge cases. That said, a more reliable way to look up nodes, as suggested by @justinsb, may already suffice for us. Provided that approach works, I feel that DO does not require a SIG-blessed cluster ID.

andrewsykim (Member, Author) commented:

cc @craiglpeters @jaypipes

andrewsykim (Member, Author) commented:

/assign @nckturner

fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2019
cheftako (Member) commented Jan 2, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 2, 2020
cheftako (Member) commented Jan 2, 2020

@nckturner Did you have a chance to investigate this?

fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 2, 2020
fejta-bot commented:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 2, 2020
fejta-bot commented:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot (Contributor) commented:

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

brandond added a commit to brandond/k3s that referenced this issue Nov 3, 2020
Resolves warning 2 from k3s-io#2471.

As per kubernetes/cloud-provider#12 the
ClusterID requirement was never really followed through on, so the
flag is probably going to be removed in the future along with the
warning it currently triggers.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
brandond added a commit to brandond/k3s that referenced this issue Nov 4, 2020
brandond added a commit to brandond/k3s that referenced this issue Nov 4, 2020
brandond added a commit to brandond/k3s that referenced this issue Nov 4, 2020
brandond added a commit to k3s-io/k3s that referenced this issue Nov 5, 2020
Resolves warning 2 from #2471.

As per kubernetes/cloud-provider#12 the
ClusterID requirement was never really followed through on, so the
flag is probably going to be removed in the future.

One side-effect of this is that the core k8s cloud-controller-manager
also wants to watch nodes, and needs RBAC to do so.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
invidian added a commit to kinvolk/packet-ccm that referenced this issue Nov 13, 2020
As per kubernetes/cloud-provider#12 the
ClusterID requirement was never really followed through on, so the
flag is probably going to be removed in the future.

See also
brandond/k3s@e109f13

I tested this and I didn't notice any side-effects.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit to kinvolk/packet-ccm that referenced this issue Nov 13, 2020
invidian added a commit to kinvolk/packet-ccm that referenced this issue Dec 1, 2020
briandowns pushed a commit to briandowns/k3s that referenced this issue Jan 14, 2021