Cloud Controller Manager doesn't query cloud provider for node name, causing the node to be removed #70897
+1 I've also observed this happening if the hostname isn't set properly.
Anyone looking into this?
Hi @yifan-gu. It's expected that the CCM is able to find a node either by its name or by its provider ID. It'll be hard to break this assumption, unfortunately. If the hostname does not match the node name, then we have to expect the user to override the hostname with --hostname-override or use --provider-id.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle frozen |
This remains an issue on latest k8s. Updated code path in kubelet: kubernetes/cmd/kubelet/app/server.go, lines 992 to 994 at 53a7922.
I think this also poses a problem for the out-of-tree migration. It seems to me that e.g. the OpenStack legacy cloud provider doesn't guarantee that the node name and hostname match: kubernetes/staging/src/k8s.io/legacy-cloud-providers/openstack/openstack_instances.go, lines 71 to 79 at 53a7922.
So, if one tries to upgrade a node from the in-tree provider to the external CCM, the node name and hostname may not match. From openshift/machine-config-operator#2401 (comment) it sounds like this also affects VMware and AWS depending on configuration (perhaps @nckturner can confirm impact on AWS). Wanted to check if this is on the external cloud provider migration radar and whether there is a blessed migration path?
This issue affects both the AWS and OpenStack in-tree cloud providers, both of which return instance.Name in CurrentNodeName(), which is not necessarily the same as kubelet's default of the FQDN hostname. We can work around this by setting the hostname to be whatever the in-tree cloud provider previously returned. However, this feels like a kludge.
An idea I've seen is to request it from the CCM. However, that may have a bootstrapping problem: as kubelet communicates with the CCM via the API, it would have to identify itself somehow in order to retrieve the correct information. I wonder if a simpler idea might be to persist kubelet's node name as local state. Or even simpler, launch kubelet with --hostname-override set to the previous node name.
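The persist-as-local-state idea above could be sketched roughly like this. Nothing in this thread establishes a state-file location, so the path used by the caller is purely hypothetical:

```shell
# Sketch only: persist the node name kubelet is currently registered
# under, so a later kubelet invocation can reuse it. Where the state
# file lives is an open question, not an agreed-upon convention.
persist_node_name() {
    # $1: current node name, $2: state file to write
    printf '%s\n' "$1" > "$2"
}

read_node_name() {
    # $1: state file; prints the persisted name, or nothing if absent
    [ -f "$1" ] && cat "$1"
}
```

A pre-upgrade step would call `persist_node_name`, and the post-upgrade kubelet launcher would call `read_node_name` to recover the old name.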
/area provider/openstack Please add additional ones as affected :)
/area provider/aws
IMO the right approach is for all kubelets to set the provider ID.
Would you use the provider id as the node name? If not, how would you use it to resolve the node name?
The provider ID maps to the instance in the cloud provider, so the CCM can resolve the node without relying on its hostname.
@andrewsykim does this mean that setting --provider-id would be sufficient here?
Nvm, as you mentioned above, users may or may not have this set. I am interested in this suggestion by @mdbooth, which sounds like it could get us through the upgrade without breaking existing kubelets.
Just to be clear, this was just for addressing the delete case; existing kubelets should not need --provider-id, and even new kubelets do not need it. However, having --provider-id set means the CCM can still find the instance even when the node name doesn't resolve in the cloud provider.
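As an illustration of what having --provider-id set involves on AWS: provider IDs there conventionally look like aws:///&lt;availability-zone&gt;/&lt;instance-id&gt;, both of which an instance can read from its metadata service. A sketch; treat the exact format as something to verify against your provider version:

```shell
# Sketch: assemble the value for kubelet's --provider-id flag on AWS.
# The aws:///<az>/<instance-id> shape matches the in-tree provider's
# convention, but verify it for your deployment before relying on it.
aws_provider_id() {
    # $1: availability zone, e.g. us-west-1a
    # $2: instance id, e.g. i-0123456789abcdef0
    printf 'aws:///%s/%s\n' "$1" "$2"
}

# On a real instance the inputs would come from the metadata service
# (IMDSv1, as quoted elsewhere in this thread):
#   az=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
#   id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
#   kubelet --cloud-provider=external --provider-id="$(aws_provider_id "$az" "$id")" ...
```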
So the above makes it sound like existing nodes will be deleted upon upgrade.
Am I misunderstanding when this might occur? I read it as: on upgrade, if the hostname does not match the node name. Edit: ok, the existing kubelet case makes sense.
This seems error-prone to me and I'd prefer we not add hacks in the kubelet to work around a cloud-provider issue.
Since it's common in AWS for instances to be named after the private DNS name, I feel like the AWS CCM should just be updated to query nodes both by instance name and by private DNS name. Would that solve this issue?
That might solve it for AWS, but this doesn't just affect AWS; see the examples above involving OpenStack. I am not sure if other cloud providers are also affected.
I think we're mixing up the delete case you mentioned above. It sounds like this issue has been hijacked by a similar but different issue. Sorry! In the in-tree -> external CCM case, the external CCM is not involved in the bug at all, so unfortunately it's not useful for fixing anything. The issue is in the difference of behaviour between kubelet running with an in-tree cloud provider and kubelet running without one.
When kubelet does not have an in-tree cloud provider, its default behaviour is to assume that its Node object's name is the FQDN hostname of the host. This is not true if kubelet was previously using the AWS or OpenStack in-tree provider, which provided unqualified names. The result is that kubelet cannot find its Node object and does not start cleanly. For example, because kubelet stops pinging the Node object, it goes stale and the Node is marked NotReady. Static pods are not started (specifically the local coredns; that was a deep and fruitless rabbit hole). Probably other things I didn't get to. It's generally quite sad in ways that the CCM can't fix.
/cc @leilajal
Yes, this is an unfortunate result of the differing default naming conventions when a kubelet is providerless and when it's using the AWS or OpenStack cloud provider. And unfortunately, you can't override the hostname on AWS; see #54482. I think this use case is not common enough, and the solution would be sufficiently complex, that we wouldn't try to fix this, but I'm open to hearing other suggestions / alternatives.
Herein lies the problem. When upgrading from in-tree to CCM, which we plan to do some time soon, it will affect 100% of AWS/OpenStack users.
Worth clarifying that I was specifically referring to when kubelet goes from the in-tree cloud provider to --cloud-provider=external.
To the best of my understanding, ...
The CCM can't help kubelet find the node object, but it can give kubelet the flexibility to use different naming formats as supported by the cloud provider. So in the AWS case, if you switch from the in-tree provider to the external CCM, the CCM could accept either the instance name or the private DNS name as the node name.
Excellent! I was thinking we could update the script which launches kubelet to look for a local file containing the node name.
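That launcher-script idea might look roughly like this. Both the state-file path and the use of kubelet's --hostname-override flag are assumptions here, not anything this thread has settled on:

```shell
# Sketch of the proposed launch wrapper: if a previously persisted node
# name exists, hand it to kubelet so the Node object keeps its old name.
# NODE_NAME_FILE is a hypothetical path, not an established convention.
NODE_NAME_FILE="${NODE_NAME_FILE:-/var/lib/kubelet/nodename}"

kubelet_name_flag() {
    # Prints --hostname-override=<name> if the state file exists,
    # otherwise prints nothing (kubelet falls back to its default).
    if [ -f "$NODE_NAME_FILE" ]; then
        printf '%s' "--hostname-override=$(cat "$NODE_NAME_FILE")"
    fi
}

# The launcher would then do something like:
#   exec kubelet --cloud-provider=external $(kubelet_name_flag) "$@"
```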
I think this is up to the cluster admin / operator to figure out, since this pertains to how kubelet eventually ends up running on a given node. The kubelet currently has --hostname-override for this.
@andrewsykim Sorry for confusing these 2 issues earlier, btw. Looks like we're going to go with a solution based on --hostname-override.
This issue has not been updated in over 1 year, and should be re-triaged. For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label.
What happened:
Launched CoreOS-stable-1911.3.0 on AWS, with customized ignition configs. The node's hostname is ip-10-3-18-1 instead of the full private DNS name ip-10-3-18-1.us-west-1.compute.internal, because /etc/hostname is not set. The kubelet runs with --cloud-provider=external and skips this code path, so it's not able to set the private DNS name as the node name. The CCM then calls GetInstanceProviderID() and fails, then calls getNodeAddressesByProviderIDOrName() and fails too.
What you expected to happen:
Since the kubelet now runs with --cloud-provider=external, no one is executing that code path to query the cloud provider for the node name anymore. However, this code path still needs to be executed by someone to get the correct node name from the cloud provider for the node. I think the CCM might need to query the cloud provider for the full node hostname in case the hostname given by the kubelet is not the full hostname (as in the AWS case).
How to reproduce it (as minimally and precisely as possible):
Launch a Container Linux instance with a non-empty ignition config in the user data; the hostname then won't be the full private DNS name. Then launch kubelet with --cloud-provider=external and the CCM will reproduce the issue described above.
Anything else we need to know?:
Environment:
- Kernel: Linux ip-10-3-20-13 4.14.78-coreos #1 SMP Mon Nov 5 17:42:07 UTC 2018 x86_64 Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz GenuineIntel GNU/Linux
- Install tools: Internal k8s installer tool based on terraform
This issue can be mitigated by telling the ignition config to set /etc/hostname to the private DNS name (obtained via curl http://169.254.169.254/latest/meta-data/hostname), or by just using the coreos-metadata service.
/cc @Quentin-M
@kubernetes/sig-aws-misc
@andrewsykim
/kind bug
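The /etc/hostname mitigation described in the report can be sketched as a small pre-kubelet step. The target file is parameterized so the helper can be exercised without root; on a real instance it would be /etc/hostname:

```shell
# Sketch of the mitigation: write the EC2 private DNS name into
# /etc/hostname before kubelet starts, so kubelet's default node name
# matches what the cloud provider reports for the instance.
write_hostname_file() {
    # $1: hostname (on EC2, fetched with
    #     curl -s http://169.254.169.254/latest/meta-data/hostname)
    # $2: target file (normally /etc/hostname; a parameter here so the
    #     sketch can run without root)
    printf '%s\n' "$1" > "$2"
}
```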