Investigate usage/requirements for ClusterID #12

Closed
andrewsykim opened this issue Feb 20, 2019 · 15 comments

Labels: lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.), P2 (Priority 2)

andrewsykim (Member) commented Feb 20, 2019

In kubernetes/kubernetes#48954 & kubernetes/kubernetes#49215 we made ClusterID a requirement and added an --allow-untagged-cloud flag to the kube-controller-manager. The intention was to let clusters get away with not setting a ClusterID for a few releases, but to eventually make it a requirement. It seems we never followed through with cleaning up the --allow-untagged-cloud flag.
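
For reference, here is a minimal sketch (paraphrased for illustration, not the exact upstream code) of the check the flag gates: the cloud provider reports whether a ClusterID was found via HasClusterID(), and --allow-untagged-cloud only downgrades the resulting startup error to a warning.

```go
// A minimal sketch of the ClusterID check that --allow-untagged-cloud gates.
// Paraphrased for illustration only; not the exact kube-controller-manager code.
package clusteridcheck

import (
	"fmt"
	"log" // the real code logs via klog

	cloudprovider "k8s.io/cloud-provider"
)

// checkClusterID mirrors the startup behaviour: without a ClusterID the
// controller manager refuses to start, unless --allow-untagged-cloud is set,
// in which case it only emits a warning.
func checkClusterID(cloud cloudprovider.Interface, allowUntaggedCloud bool) error {
	if cloud == nil || cloud.HasClusterID() {
		return nil
	}
	if allowUntaggedCloud {
		log.Println("detected a cluster without a ClusterID; please tag your cluster, a ClusterID may become mandatory")
		return nil
	}
	return fmt.Errorf("no ClusterID found; a ClusterID is required for the cloud provider to function properly")
}
```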

More interestingly, it's not exactly clear how ClusterID is being consumed by both in-tree and out-of-tree cloud providers. It seems it's critical to AWS/GCE but not really used by others. Do we still need ClusterID? Should we use a more generic approach with labels/annotations? If we need it, should we go ahead and remove the --allow-untagged-cloud flag?

If the plan is to continue to support ClusterID, we should at least add better documentation for how this works.

cc @justinsb @rrati

@andrewsykim andrewsykim added this to the Next milestone Feb 20, 2019
justinsb (Member) commented:

So on AWS the ClusterID is used as a tag to allow multiple clusters to coexist in the same "scope". AWS doesn't have a project concept like GCP does, so originally this was needed to allow multiple clusters to exist in the same VPC at all.

Drilling down, there are two use cases:

  • isolation: Cluster A shouldn't see cluster B's resources in e.g. the cloudprovider code
  • cleanup: We should be able to clean up any resources that cluster A created when we delete it, without cleaning up cluster B resources

Cleanup is a nice-to-have - in theory if you shut down k8s cleanly you don't need it, but it is nice.

Isolation is trickier:

  • We have some resources that we tag (e.g. RouteTables, SecurityGroups, Subnets), but we could instead pass them in explicitly.
  • We also have a lot of places where we still refer to instances by NodeName and expect those to be unique in the scope, but that isn't generally a safe assumption on AWS. It is safe today because we tag and because the NodeName is the PrivateDnsName, which is derived from the internal IP; we could actually get away with just looking at the VPC + PrivateDnsName, but the effort to date has been about figuring out how to make NodeName != PrivateDnsName. It would be great (IMO) to change the cloudprovider interface to pass in the Node object instead of a NodeName (see the sketch after this list) - it would also be much more efficient, as it would save a lookup. But then we run into the future plans to lock down how kubelets register and how much of this information can be done server-side. Not sure what our plans are there.
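
To make the interface idea concrete, here is an illustrative sketch (a simplification, not the actual k8s.io/cloud-provider definition) contrasting the current NodeName-keyed lookup with a hypothetical Node-object variant:

```go
// Illustrative sketch only: a simplified comparison of the current
// NodeName-keyed lookups with a hypothetical Node-object-based variant.
// Neither type below is the actual k8s.io/cloud-provider interface.
package instancesketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// InstancesByName captures today's shape: every call has to resolve a bare
// NodeName to a cloud instance, which on AWS means searching within the
// ClusterID-tagged scope.
type InstancesByName interface {
	NodeAddresses(ctx context.Context, name types.NodeName) ([]v1.NodeAddress, error)
	InstanceID(ctx context.Context, name types.NodeName) (string, error)
}

// InstancesByNode is the hypothetical shape discussed above: with the full
// Node object available, a provider can read node.Spec.ProviderID directly
// and skip the name-based search entirely.
type InstancesByNode interface {
	NodeAddresses(ctx context.Context, node *v1.Node) ([]v1.NodeAddress, error)
	InstanceID(ctx context.Context, node *v1.Node) (string, error)
}
```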

AFAICT, what GCE calls the cluster-id is currently only used in the load balancer, as part of the name for backend rules. I'm not sure if that is for cleanup, isolation, or both.

justinsb (Member) commented Feb 20, 2019

I put up a WIP PR to show what is required to move to passing around v1.Node, BTW: kubernetes/kubernetes#74304. It's just one call, and it already looks bad, but the only complicated thing was the desiredStateOfWorld map on node-update - and the volume reconciler is where I've always gotten stuck before.

andrewsykim (Member, Author) commented:

  • We also have a lot of places where we still refer to instances by NodeName and expect those to be unique in the scope, but that isn't generally a safe assumption on AWS. It is safe today because we tag and because the NodeName is the PrivateDnsName which is derived from the internal IP; we actually could get away with just looking at the VPC + PrivateDnsName, but the effort to date has been about figuring out how to make the NodeName != PrivateDnsName. It would be great (IMO) to change the cloudprovider interface to pass in the Node object instead of a NodeName - and also much more efficient as it would save a lookup. But now we run into the future plans to lock-down how kubelets register and how much of this information can be done server-side. Not sure what our plans are there.

@justinsb is it safe to say that if we replace the places where we pass in only the node name with the entire node object, this would remove the need for ClusterID? That seems like a reasonable thing to do to me, because there are a lot of places where we have to "search node by name" when we could just get the instance ID from spec.ProviderID.
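
As a rough sketch of what that would look like (the providerID value and parsing shown assume the AWS-style format; other providers use their own schemes):

```go
// A minimal sketch of reading the instance ID from spec.providerID instead
// of searching for the instance by node name. The example value and the
// parsing below assume the AWS-style "aws:///<zone>/<instance-id>" format;
// other providers use their own schemes.
package provideridsketch

import (
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
)

// instanceIDFromNode returns the last path segment of spec.providerID,
// e.g. "aws:///us-west-2a/i-0123456789abcdef0" -> "i-0123456789abcdef0".
func instanceIDFromNode(node *v1.Node) (string, error) {
	pid := node.Spec.ProviderID
	if pid == "" {
		return "", fmt.Errorf("node %s has no spec.providerID set", node.Name)
	}
	parts := strings.Split(strings.TrimSuffix(pid, "/"), "/")
	return parts[len(parts)-1], nil
}
```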

andrewsykim (Member, Author) commented Feb 27, 2019

AFAICT, what GCE calls the cluster-id is currently only used in the load balancer, as part of the name for backend rules. I'm not sure if that is for cleanup, isolation, or both.

For the GCE case (and please correct me if I'm wrong), it seems like it supports cluster ID, but not in the same way as AWS. GCE generates its own cluster ID dynamically and stores the value in a ConfigMap. Its main purpose is isolation, given that the cluster ID value is propagated to LB resources like you mentioned. For what it's worth, it seems like cluster ID is an internal implementation detail and not something that requires configuration on the control plane, in which case removing "Cluster ID" as a feature wouldn't really change things for GCE.
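
For illustration, reading that value client-side might look like the sketch below; the ConfigMap namespace, name, and key are assumptions based on my reading of the GCE provider code, not a documented contract.

```go
// A rough sketch of reading the GCE-generated cluster ID from the cluster.
// The ConfigMap namespace/name/key used here ("kube-system"/"ingress-uid",
// key "uid") are assumptions based on a reading of the GCE provider code,
// not a documented contract.
package gceclusteridsketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func readGCEClusterID(ctx context.Context, cs kubernetes.Interface) (string, error) {
	cm, err := cs.CoreV1().ConfigMaps("kube-system").Get(ctx, "ingress-uid", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	// The GCE provider propagates this value into load-balancer resource names.
	return cm.Data["uid"], nil
}
```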

andrewsykim (Member, Author) commented:

TODO: investigate if other providers need this

/milestone Next
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Mar 20, 2019
@andrewsykim andrewsykim added P1 Priority 1 P2 Priority 2 and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. P1 Priority 1 labels Mar 28, 2019
timoreimann (Contributor) commented:

FWIW, DigitalOcean is similar to AWS in the sense that we do not have a dedicated notion of a cluster scope / project. We currently use a cluster ID (injected via an environment variable, since we weren't certain about the future of the interface-provided cluster ID) only for billing / grouping purposes, translating it to tags on DO resources like load balancers.

The isolation and cleanup goals could help us as well to better deal with certain edge cases. That said, a more reliable way to look up nodes, as suggested by @justinsb, may already suffice for us. Provided that approach works, I feel that DO does not require a SIG-blessed cluster ID.

andrewsykim (Member, Author) commented:

cc @craiglpeters @jaypipes

andrewsykim (Member, Author) commented:

/assign @nckturner

fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2019
cheftako (Member) commented Jan 2, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 2, 2020
cheftako (Member) commented Jan 2, 2020

@nckturner Did you have a chance to investigate this?

fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 2, 2020
fejta-bot commented:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 2, 2020
fejta-bot commented:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot (Contributor) commented:

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

brandond added a commit to brandond/k3s that referenced this issue Nov 3, 2020
Resolves warning 2 from k3s-io#2471.

As per kubernetes/cloud-provider#12 the
ClusterID requirement was never really followed through on, so the
flag is probably going to be removed in the future along with the
warning it currently triggers.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
brandond added a commit to brandond/k3s that referenced this issue Nov 4, 2020
brandond added a commit to brandond/k3s that referenced this issue Nov 4, 2020
brandond added a commit to brandond/k3s that referenced this issue Nov 4, 2020
brandond added a commit to k3s-io/k3s that referenced this issue Nov 5, 2020
Resolves warning 2 from #2471.

As per kubernetes/cloud-provider#12 the
ClusterID requirement was never really followed through on, so the
flag is probably going to be removed in the future.

One side-effect of this is that the core k8s cloud-controller-manager
also wants to watch nodes, and needs RBAC to do so.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
invidian added a commit to kinvolk/packet-ccm that referenced this issue Nov 13, 2020
As per kubernetes/cloud-provider#12 the
ClusterID requirement was never really followed through on, so the
flag is probably going to be removed in the future.

See also
brandond/k3s@e109f13

I tested this and I didn't notice any side-effects.

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit to kinvolk/packet-ccm that referenced this issue Nov 13, 2020
invidian added a commit to kinvolk/packet-ccm that referenced this issue Dec 1, 2020
briandowns pushed a commit to briandowns/k3s that referenced this issue Jan 14, 2021