Fix Kubelet starting before hostname set on FCOS AWS #766

Merged
dghubble merged 1 commit into master from aws-kubelet-and-hostname-race on Jun 19, 2020


@dghubble commented Jun 19, 2020

  • On Fedora CoreOS, kubelet.service can start before the hostname
    is set. Kubelet reads the hostname to determine the node name to
    register. If it reads the hostname as localhost, it keeps trying
    to register as localhost, even after the real hostname is set
  • This race manifests as a node that appears NotReady: the Kubelet
    is trying to register as localhost, while the host itself (by then)
    has an AWS-provided hostname. Restarting kubelet.service is a
    manual fix, since it makes Kubelet re-read the hostname
  • This race could only be reproduced on AWS, not on Google Cloud or
    Azure, despite attempts. Bare-metal and DigitalOcean differ and
    use hostname-override (e.g. afterburn), so they're not affected
  • Wait for nodes to have a non-localhost hostname in the oneshot
    that awaits /etc/resolv.conf. Typhoon has no valid cases for a
    node hostname being localhost (not even single-node clusters)
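
For illustration only, the wait described in the last bullet can be sketched as a small polling loop. This is not the contents of the oneshot unit changed by this PR; it is a Python sketch of the same idea, and the names `wait_for_hostname`, `get_hostname`, `poll_seconds`, and `sleep` are hypothetical, chosen here so the loop can be exercised without touching the real host.

```python
import socket
import time
from typing import Callable


def wait_for_hostname(
    get_hostname: Callable[[], str] = socket.gethostname,
    poll_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Block until the short hostname is something other than localhost.

    Assumption: a node should never legitimately be named "localhost",
    so the loop has no timeout and simply keeps polling.
    """
    while True:
        # Compare only the short name, so "localhost.localdomain"
        # is still treated as unset.
        short = get_hostname().split(".")[0]
        if short and short != "localhost":
            return short
        sleep(poll_seconds)
```

A unit that must run before Kubelet would call `wait_for_hostname()` and exit once it returns, so anything ordered after it only starts once the hostname is final. Passing `get_hostname` and `sleep` as parameters is purely a testing convenience in this sketch.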

Related OpenShift: openshift/machine-config-operator#1813
Close #765

@dghubble force-pushed the aws-kubelet-and-hostname-race branch from 56c748e to 4cfafea on June 19, 2020 07:21
@dghubble merged commit 4cfafea into master on Jun 19, 2020
@dghubble deleted the aws-kubelet-and-hostname-race branch on June 19, 2020 07:23
Successfully merging this pull request may close these issues.

Nodes are going into Not Ready State