[OCPCLOUD-1106] Add afterburn task to update AWS hostname to match instance metadata #2401
Force-pushed from 506bf60 to 8dc8532.
Hi Danil, what is the context of this PR? I don't believe any of the cloud platforms today use afterburn directly to acquire their hostname (other than GCP, to work around GCP hostname truncation issues). Is there a bug you are trying to resolve?
Is this related to the cloud provider plugin requiring that the hostname and the node name match for admission to the cluster? I know this is a problem for VMware, and if we have to do this for AWS, I fear we'll be doing it for the other clouds too.
What happened is that I tried an automated reconfiguration for kubelet with |
Force-pushed from 8dc8532 to bed87a5.
/retest
RemainAfterExit=yes
ExecStartPre=/usr/bin/afterburn --provider aws --hostname=/run/afterburn.hostname
# Manually set current hostname at /proc/sys/kernel/hostname to correct value for kubelet.service
ExecStart=/bin/bash -c "cp /run/afterburn.hostname /proc/sys/kernel/hostname; cp /run/afterburn.hostname /etc/hostname"
`/proc/sys/kernel/hostname` is the default and only place where golang `os.Hostname`, used in kubelet, gets its value. The `node-valid-hostname.service` `hostnamectl` call fails to update the current hostname on the running machine, so there is a manual copy.
Could not set property: Connection timed out
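To see what kubelet would observe, the kernel hostname can be compared with what the usual tools report. This is a minimal sketch, assuming a Linux host. `kernel_hostname` is a hypothetical helper name, not part of this PR, and Go's `os.Hostname` is expected to reflect the same kernel value.

```shell
#!/bin/bash
# Hypothetical helper (not part of the PR): print the kernel's current hostname.
# Falls back to `uname -n` when /proc is not readable.
kernel_hostname() {
    if [ -r /proc/sys/kernel/hostname ]; then
        cat /proc/sys/kernel/hostname
    else
        uname -n
    fi
}

# Both lines should show the same name on a running machine.
echo "kernel hostname: $(kernel_hostname)"
echo "uname -n:        $(uname -n)"
```

If the two values differ from the content of `/etc/hostname`, only the static hostname file was updated and the running kernel value was not.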
@rphillips Here is an issue with hostnamectl I observed, re: #2401 (comment)
@darkmuggle @yuqi-zhang PTAL.
/retest
We need to fix this in the external cloud providers. There are too many platforms this can affect.
@rphillips I tested it in GCP, Azure, migration works fine on those platforms. vSphere has a fix, same for OpenStack, so it is possible this is the only place left to be changed. We can't really fix it for There is no way for standard |
/retest
@darkmuggle I tested it in other clouds, does not seem to be the case. Only AWS is affected. So there is a fix.
Upstream issue: kubernetes/kubernetes#70897
We should use the following script/function to set the hostname, which uses `hostnamectl`: machine-config-operator/templates/common/_base/files/usr-local-sbin-set-valid-hostname.yaml (line 38 in 522f0fa)
The script gets installed at |
Force-pushed from bed87a5 to e56f075.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Danil-Grigorev The full list of commands accepted by this bot can be found here.
@rphillips Your feedback should be addressed now, could you take a look? The script is now using |
@darkmuggle @yuqi-zhang PTAL. The AWS provider is the only one affected and, as @rphillips pointed out, it is a CoreOS issue: kubernetes/kubernetes#70897. They propose using afterburn for managing such tasks, which is exactly what this PR does.
Hi, sorry for the late comment, I'm not the most comfortable with hostnames so I defer to @darkmuggle if he has the time to take a look. Just a quick question, was comparing this to:
That service appears to not run on firstboot. Should this do the same? |
The change looks fine, it's consistent with other platforms. |
i was able to manually run the commands from the afterburn-hostname, and those did work:
[root@ip-10-0-152-55 sbin]# source /usr/local/sbin/set-valid-hostname.sh
[root@ip-10-0-152-55 sbin]# set_valid_hostname `cat /run/afterburn.hostname`
exit
[core@ip-10-0-152-55 sbin]$ echo $?
0
[core@ip-10-0-152-55 sbin]$ hostname
ip-10-0-152-55.us-east-2.compute.internal
not sure why this is failing during boot though. perhaps there is another dependency we need to wait on, or add more retries?
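The "add more retries" idea could look roughly like the sketch below. `retry` is a hypothetical helper, not something taken from `set-valid-hostname.sh`, and the attempt count and delay are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts. Returns 0 on the first success, 1 if every attempt fails.
retry() {
    local attempts="$1"
    local delay="$2"
    local i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $i/$attempts failed: $*" >&2
        # Skip the sleep after the final attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For example, after sourcing `/usr/local/sbin/set-valid-hostname.sh`, the unit could call `retry 5 2 set_valid_hostname "$(cat /run/afterburn.hostname)"` to try up to five times, two seconds apart. Whether retries address the boot-time failure is untested here.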
Force-pushed from e56f075 to debdd84.
Type=oneshot
RemainAfterExit=yes
ExecStartPre=/usr/bin/afterburn --provider aws --hostname=/run/afterburn.hostname
ExecStart=/bin/bash -c "source /usr/local/sbin/set-valid-hostname.sh; set_valid_hostname `cat /run/afterburn.hostname`"
This will need reworking to be based on #2618
Also, this needs to do the same thing as we're doing on GCP here: https://github.com/openshift/machine-config-operator/blob/master/templates/common/gcp/files/etc-networkmanager-conf.d-hostname.yaml
I guess just copy that file into `templates/common/aws`; longer term it'd be good to dedup them.
Otherwise the hostname change may get reverted when NM renews the DHCP lease.
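A minimal sketch of what such a drop-in could contain, assuming the GCP file sets `hostname-mode=none` in NetworkManager's `[main]` section (check the linked file before relying on this). The function name and the directory argument are illustrative only.

```shell
#!/bin/bash
# Hypothetical helper: write a NetworkManager drop-in that stops NM from
# managing the system hostname. The directory is a parameter so the sketch
# can be tried in a scratch location; on a real host it would be
# /etc/NetworkManager/conf.d.
write_nm_hostname_dropin() {
    local conf_dir="$1"
    mkdir -p "${conf_dir}"
    cat > "${conf_dir}/hostname.conf" <<'EOF'
[main]
hostname-mode=none
EOF
}

# Try it against a temporary directory and show the result.
scratch="$(mktemp -d)"
write_nm_hostname_dropin "${scratch}"
cat "${scratch}/hostname.conf"
```

With `hostname-mode=none`, NetworkManager leaves the hostname alone, so a DHCP lease renewal should not overwrite the value set from instance metadata.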
i created a small patch on top of this pr based off of @cgwalters suggestions and it seems to be working well for me.
when i install a cluster with this patch in place, i see machines joining and nodes created with the proper names. here is an example of the CSRs after bootstrap:
Force-pushed from debdd84 to a20b5b9.
just a minor nit, but otherwise this looks good and i know it works ;)
templates/common/aws/files/etc-networkmanager-conf.d-hostname.yaml
LGTM.
(At some point I'd be curious about the larger backstory on why DHCP in AWS doesn't give us the hostname we need.)
Force-pushed from 108035e to 55478e9.
Force-pushed from 55478e9 to 1e2c932.
- Add AWS support to usr-local-bin-mco-hostname.yaml based on @elmiko implementation
Co-authored-by: Michael McCune <msm@opbstudios.com>
Force-pushed from 1e2c932 to cef5683.
and adding lgtm for Ben & Colin's reviews
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cgwalters, Danil-Grigorev, darkmuggle, kikisdeliveryservice The full list of commands accepted by this bot can be found here. The pull request process is described here
/test e2e-agnostic-upgrade |
/test e2e-agnostic-upgrade |
- What I did
During kubelet migration from `--cloud-provider=aws` to `--cloud-provider=external`, kubelet determines the `Node` name from the hostname of the machine where it is running. The AWS hostname is mismatched: usually it is a reverse DNS mapping of the machine IP, like `ip-10-0-195-176`. This results in a node name mismatch in kubelet, and the API server admission plugin rejects changes for the object. This fixes the issue.
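The mismatch described above can be checked mechanically. This is a rough sketch, assuming node names are expected to look like the full private DNS name seen earlier in this thread (`ip-10-0-152-55.us-east-2.compute.internal`). `is_full_aws_hostname` is a hypothetical helper, and the pattern does not cover setups that use a different domain suffix.

```shell
#!/bin/bash
# Hypothetical helper: succeed when the name looks like
# ip-<address>.<region>.compute.internal, fail for a short name.
is_full_aws_hostname() {
    case "$1" in
        ip-*.*.compute.internal) return 0 ;;
        *) return 1 ;;
    esac
}

# Compare the short form from the description with the full form.
for name in ip-10-0-195-176 ip-10-0-152-55.us-east-2.compute.internal; do
    if is_full_aws_hostname "${name}"; then
        echo "${name}: full private DNS name"
    else
        echo "${name}: short name"
    fi
done
```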
- How to verify it
`<region>.compute.internal` even when the cloud-provider is set to `external`.
- Description for the changelog