Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Alternative Node Naming Convention #61878

Closed
wants to merge 3 commits into from

Conversation

nathanjsweet
Copy link

@nathanjsweet nathanjsweet commented Mar 29, 2018

What this PR does / why we need it:
Currently node names in aws are derived from the private-dns entry of the ec2 instance.
This PR offers administrators, who might have differing dns toplogies in their VPCs,
the option to manually configure their node names. This PR also directly addresses a
bug in K8s that prevents the Node Admission Controller from working properly when the AWS cloud provider is selected.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #11543
Fixes #47695
Fixes #52241

Addresses #48929

Is there someone where I need to document this?

Allow node names in AWS to be configured explicitly with a tag

/sig aws

Currently node names in aws are derived from the private-dns entry of the ec2 instance.
This PR offers administrators, who might have differing dns toplogies in their VPCs,
the option to manually configure their node names. This PR also directly addresses a
bug in K8s that prevents the Node Admission Controller from working properly when the AWS cloud provider is selected.
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. sig/aws size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 29, 2018
@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 29, 2018
@DanyC97
Copy link

DanyC97 commented Apr 3, 2018

@liggitt is this part of the work you mentioned #54482 (comment) ?

@liggitt
Copy link
Member

liggitt commented Apr 3, 2018

I hadn't seen this PR in particular, but this type of change is what I was anticipating (though I didn't know whether @kubernetes/sig-aws-misc wanted to drive this while the cloudprovider was still in-tree)

@nathanjsweet nathanjsweet changed the title WIP: Adding Alternative Node Naming Convention Adding Alternative Node Naming Convention Apr 20, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 20, 2018
@nathanjsweet
Copy link
Author

Okay, I can confirm this is working now. It took me a while to get to testing it end-to-end.
@jsafrane can I assign this to you?

@nathanjsweet
Copy link
Author

Oh yeah the cloud config looks like this:

[Global]
ExplicitNodeNames=true

}
}
glog.Infof("Tag %s not found for instance %s", TagNameKubernetesNodeNamePrefix, aws.StringValue(i.InstanceId))
}
return types.NodeName(aws.StringValue(i.PrivateDnsName))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume a custom tag value is mutable on an instance. If so, changing the value of the custom tag after a node has been added to the cluster will cause problems... docs should make that clear

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point me to where the docs are?

Copy link
Member

@liggitt liggitt Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant the docs to be added around this new option should call out the sharp edges :)

if len(value) > 0 {
return types.NodeName(aws.StringValue(tag.Value))
}
glog.Infof("Tag %s was found for instance %s, but had no value", TagNameKubernetesNodeNamePrefix, aws.StringValue(i.InstanceId))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will silently falling back to the private dns name lead to scenarios where there's a race to apply the custom tag before the kubelet starts on up on the node and registers itself using the private dns name?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enabling this feature and not having the tags in place will cause problems, but I think if you decide to enable this feature it's on you to make sure you don't mess up the names. This is already the case with the cloud-provider flag which requires the node-name and hostname to match (something that is undocumented).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, my guess is that kubelet doesn't register if this happens.

Copy link
Member

@liggitt liggitt Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already the case with the cloud-provider flag which requires the node-name and hostname to match

in that case, the names aren't mutable... the race between kubelet start and tag addition is my main concern

my guess is that kubelet doesn't register if this happens

if it's the first startup in a tls-bootstrapping setup, the kublet will use the private dns name to get it's initial credential and register, then start failing as soon as the tag appears and it's idea of who it is changes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enabling this feature and not having the tags in place will cause problems

This must be clearly documented. And IMO kubelet should exit when it starts on a node with ExplicitNodeNames enabled and without the right tag on the node.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And IMO kubelet should exit when it starts on a node with ExplicitNodeNames enabled

Yeah, okay. I'll make that change.

@jsafrane
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 23, 2018
@jsafrane
Copy link
Member

Is there someone where I need to document this?

Yes! I could not find any documentation about AWS cloud.conf format and keys/values. I am afraid it's you who's supposed to create one.

It belongs here: https://kubernetes.io/docs/concepts/cluster-administration/cloud-providers/#aws
(source: https://github.com/kubernetes/website/tree/master/docs/concepts/cluster-administration)

@nathanjsweet
Copy link
Author

@jsafrane okay I gave mapInstanceToNodeName and error. I'm fairly confident that should address the issues.

@liggitt
Copy link
Member

liggitt commented May 1, 2018

Does this PR resolve #52241 ?

@nathanjsweet
Copy link
Author

@liggitt Yes it would.

Copy link
Contributor

@nckturner nckturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be an awesome improvement, but if it is possible I'd like to see it done without additional configuration options. I think this is possible by defaulting to private DNS if tags are not present nodes, but I'd love to hear other's thoughts.

func mapInstanceToNodeName(i *ec2.Instance) types.NodeName {
return types.NodeName(aws.StringValue(i.PrivateDnsName))
func mapInstanceToNodeName(i *ec2.Instance, explicitNames bool) (types.NodeName, error) {
if explicitNames {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the tags idea, but rather than differentiating between the old way (private DNS name) and the new way (tagged name) via a config option, I'd prefer if we can allow the new behavior in a backwards compatible way, without the need for a new config option. I.e., default to private DNS name but use a tagged name if present on the node.

return types.NodeName(aws.StringValue(i.PrivateDnsName))
func mapInstanceToNodeName(i *ec2.Instance, explicitNames bool) (types.NodeName, error) {
if explicitNames {
for _, tag := range i.Tags {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I.e. check to see if a tag exists that matches kubernetes.io/nodeName, and if not, we fall back to private DNS instead of erroring.

filters := []*ec2.Filter{
newEc2Filter("private-dns-name", privateDNSName),
newEc2Filter(nodeFilter, privateDNSName),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we would have to attempt with the tag filter, check to see if any instances were returned, then try private DNS.

@liggitt
Copy link
Member

liggitt commented May 2, 2018

I think this is possible by defaulting to private DNS if tags are not present nodes, but I'd love to hear other's thoughts.

switching a node name after the fact is disruptive enough that it doesn't seem great to infer and fallback this way. see discussion (now hidden) #61878 (comment) of the race conditions between adding tags and kubelet startup that falling back to the dns name can cause

@nathanjsweet
Copy link
Author

Sorry for the bombardment, I accidentally rebased higher up the tree than I should have.

@nathanjsweet
Copy link
Author

I can no longer carry this PR forward, as I have left Datica Inc. Datica has signed the CNCF agreement though, so I don't know if that means a maintainer can fork this and carry it forward. Sorry for the inconvenience.

@@ -141,7 +141,10 @@ func (c *Cloud) checkIfAttachedToNode(diskName KubernetesVolumeID, nodeName type
}

awsDiskInfo.ec2Instance = instanceInfo
awsDiskInfo.nodeName = mapInstanceToNodeName(instanceInfo)
awsDiskInfo.nodeName, err = mapInstanceToNodeName(instanceInfo, c.cfg.Global.ExplicitNodeNames)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be a problem, if someone switches naming topology midway through, then disks could potentially get orphaned.

@karlhungus
Copy link
Contributor

/retest

@karlhungus
Copy link
Contributor

I was going to offer help here, but I can't see any comments that haven't already been addressed

@karlhungus
Copy link
Contributor

I'm happy to take a stab at addressing this: #61878 (comment) (documentation) comment

@nathanjsweet
Copy link
Author

@karlhungus There needs to be dire warnings in the documentation that the node naming convention can never be switched after cloud resources have been deployed by Kubernetes without orphaning those resources (I suppose you could just manually delete them), however disks would get orphaned.

// as the assigned node name, and want to explicitly set the names of their nodes, via
// tags. This configuration indicates that all node names have been explicitly
// set with the tag "kubernetes.io/node-name".
ExplicitNodeNames bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at this cold, a name like useTagsForNodeNames might be clearer

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 1, 2018
@v1k0d3n
Copy link

v1k0d3n commented Sep 6, 2018

as a service provider, it would be a great option to see implemented. is there any status update, or is there a way we can throw our hat in the ring and lend a hand with this issue? it seems like this fell off the wagon...

@liggitt
Copy link
Member

liggitt commented Sep 6, 2018

I'd recommend getting on the agenda in a sig-aws and/or sig-cloudprovider meeting for visibility and blessing to help get it over the finish line. Would also be good to make sure it's a pattern that could be imitated by other providers and make the maintainers of those providers aware of this effort.

@nickperry
Copy link

What's happening with this one? We really need this at our site.

@nathanjsweet
Copy link
Author

@liggitt With the benefit of time I am starting to think that this PR is a bad idea. I can't think of another K8s feature that requires the admin to understand such a significant amount of how the underlying cloud-controller works. I think a better solution would be for the cloud controller to include an interface for changing node names (this would require them to persist) so that if the admin decides to change the node naming convention it is easy to do so. Another reason I think that is a better solution is that this current PR only fixes the problem for new clusters. Old clusters really can't make use of this PR. I'd be willing to draft something like this up (I'll be working on K8s full time again in January), but WG should I propose this idea to?

@liggitt
Copy link
Member

liggitt commented Nov 27, 2018

I'd be willing to draft something like this up (I'll be working on K8s full time again in January), but WG should I propose this idea to?

cloud providers have been consolidated under sig-cloud-provider, to improve consistency in how k8s interfaces with cloud providers. That group would be the venue for discussing this

@liggitt
Copy link
Member

liggitt commented Nov 27, 2018

I think a better solution would be for the cloud controller to include an interface for changing node names (this would require them to persist) so that if the admin decides to change the node naming convention it is easy to do so.

as an aside, changing an existing node's name is likely to be just as disruptive as removing and creating a new node. API object names are immutable, as is the spec.nodeName field in pods.

@justinsb
Copy link
Member

I think it would be great to have the cloudprovider pass Node objects, instead of NodeNames, because then we could get the instance id from the node object (the ProviderID field IIRC). Then the cloudprovider wouldn't care about the Node Name at all.

It's also good because it avoids having to map from a NodeName -> an AWS instance ID, which requires caching and/or API calls.

As @liggitt points out, we can work on this as part of the cloudprovider extraction; and we probably won't be able to rename Nodes, but we can at least remove the cloudprovider constraint on the Name. I think at that point it's easy to wrap kubelet in a script to build a name from tags, IP, instance ID, etc.

@sigil66
Copy link

sigil66 commented Nov 29, 2018

@justinsb is there a target release for the cloud provider extraction project?

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 27, 2019
@nathanjsweet
Copy link
Author

This PR is stale and not happening.

@andrewsykim
Copy link
Member

andrewsykim commented Feb 27, 2019

@justinsb is there a target release for the cloud provider extraction project?

Provider extraction efforts are still ongoing and will be ongoing for the next few releases. I personally am not opposed to addressing this along-side the extraction effort. Feel free to add this to the next SIG cloud provider meeting agenda

@pierluigilenoci
Copy link

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 28, 2019
@cabrinha
Copy link

@justinsb is there a target release for the cloud provider extraction project?

Provider extraction efforts are still ongoing and will be ongoing for the next few releases. I personally am not opposed to addressing this along-side the extraction effort. Feel free to add this to the next SIG cloud provider meeting agenda

Can you add it to that agenda? It seems like this issue is getting lost in the noise.

@andrewsykim
Copy link
Member

@cabrinha best place to continue this discussion would be our mailing list https://groups.google.com/forum/#!forum/kubernetes-sig-cloud-provider

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet