
[OCPCLOUD-1106] Add afterburn task to update AWS hostname to match instance metadata #2401

Merged

Conversation

@Danil-Grigorev (Contributor) commented Feb 10, 2021

- What I did

During the kubelet migration from --cloud-provider=aws to --cloud-provider=external, the kubelet determines the Node name from the hostname of the machine it is running on. On AWS the hostname does not match: it is usually the short reverse-DNS form of the machine IP, such as ip-10-0-195-176. This results in a node name mismatch in the kubelet, and the API server admission plugin rejects changes to the Node object.

This fixes the issue.

- How to verify it

  • Ensure that new nodes provisioned by the kubelet report the AWS-style hostname, typically the reverse IP mapping plus <region>.compute.internal, even when the cloud provider is set to external.
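A hedged way to check this (the node names below are illustrative, matching the CSR listing later in this thread):

# Node names should be the full EC2 private DNS names rather than the short ip-10-0-x-x form:
oc get nodes -o name
# node/ip-10-0-152-55.us-east-2.compute.internal
# ...

# Kubelet serving CSRs should be requested under those same names:
oc get csr | grep kubelet-serving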

- Description for the changelog

@yuqi-zhang
Contributor

Hi Danil,

What is the context of this PR? I don't believe any of the cloud platforms today use afterburn directly to acquire their hostname (other than GCP, to work around its hostname-truncation issue). Is there a bug you are trying to resolve?

@darkmuggle
Contributor

Is this related to the cloud provider plugin requiring that the hostname and the node name match for admission to the cluster? I know this is a problem for VMware and if we have to do this for AWS....I fear we'll be doing it for the other clouds too.

@Danil-Grigorev
Contributor Author

What happened is that I tried an automated reconfiguration of the kubelet with the --cloud-provider=external flag on AWS and hit this admission lock on the Node resource in the API server. The hostname was different, so the Node was not accepted from the kubelet, which meant registration never completed. I'm still trying to verify that this fixes it for AWS, but yes, it is closely related to what @darkmuggle mentioned. The same attempt on GCP worked, and the external configuration provisioned the Node without issues.

@Danil-Grigorev
Contributor Author

/retest

RemainAfterExit=yes
ExecStartPre=/usr/bin/afterburn --provider aws --hostname=/run/afterburn.hostname
# Manually set current hostname at /proc/sys/kernel/hostname to correct value for kubelet.service
ExecStart=/bin/bash -c "cp /run/afterburn.hostname /proc/sys/kernel/hostname; cp /run/afterburn.hostname /etc/hostname"
Contributor Author

/proc/sys/kernel/hostname is the default and only place from which Go's os.Hostname, as used by the kubelet, picks up the value. The hostnamectl call in node-valid-hostname.service fails to update the current hostname on a running machine, so the value is copied manually. The failure looks like this:

Could not set property: Connection timed out
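For readability, an expanded sketch of that ExecStart one-liner (same commands as the unit above, just spelled out):

#!/bin/bash
# Expanded sketch of the ExecStart above: used because the hostnamectl call in
# node-valid-hostname.service fails with "Could not set property: Connection timed out".
set -euo pipefail

# /run/afterburn.hostname is written by the afterburn ExecStartPre step.
# Writing /proc/sys/kernel/hostname updates the running kernel hostname, which is
# what Go's os.Hostname -- and therefore the kubelet -- reads.
cp /run/afterburn.hostname /proc/sys/kernel/hostname
# Also persist it as the static hostname.
cp /run/afterburn.hostname /etc/hostname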

Contributor Author

@rphillips Here is an issue with hostnamectl I observed, re: #2401 (comment)

@Danil-Grigorev
Contributor Author

@darkmuggle @yuqi-zhang PTAL.

@Danil-Grigorev
Contributor Author

/retest

@rphillips
Contributor

We need to fix this in the external cloud providers. There are too many platforms this can affect.

@Danil-Grigorev
Contributor Author

@rphillips I tested it on GCP and Azure; migration works fine on those platforms. vSphere has a fix, and so does OpenStack, so this may be the only place left to change. We can't really fix it in the external cloud providers - the behavior is common to all platform-agnostic Kubernetes configurations. The issue happens to be here: https://github.com/kubernetes/kubernetes/blob/master/pkg/util/node/node.go#L56

There is no way for the standard Go os library to default to the FQDN hostname, and I'm not sure a change to that core functionality would be easily accepted.
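For illustration, the mismatch can be seen directly on an affected instance; a hedged sketch (assumes the standard EC2 instance metadata endpoint is reachable):

# What Go's os.Hostname -- and therefore the kubelet -- sees: the short kernel hostname.
cat /proc/sys/kernel/hostname
# ip-10-0-195-176

# What the external AWS cloud provider expects the node name to match: the private DNS name.
curl -s http://169.254.169.254/latest/meta-data/local-hostname
# ip-10-0-195-176.us-east-2.compute.internal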

@Danil-Grigorev
Contributor Author

/retest

@Danil-Grigorev
Contributor Author

Is this related to the cloud provider plugin requiring that the hostname and the node name match for admission to the cluster? I know this is a problem for VMware and if we have to do this for AWS....I fear we'll be doing it for the other clouds too.

@darkmuggle I tested it on the other clouds and that does not seem to be the case there; only AWS is affected. So here is a fix.

@Danil-Grigorev changed the title from "Add afterburn task to update AWS hostname to match instance metadata" to "[OCPCLOUD-1106] Add afterburn task to update AWS hostname to match instance metadata" on Mar 3, 2021
@rphillips
Contributor

Upstream issue: kubernetes/kubernetes#70897

@rphillips
Contributor

We should use the following script/function to set the hostname, which uses hostnamectl:

@rphillips
Contributor

The script gets installed at /usr/local/sbin/set-valid-hostname.sh
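The helper's contents aren't shown in this thread; a hedged sketch of its likely shape and of how the unit consumes it (the invocation matches the ExecStart shown further down; the function body is a guess based on the hostnamectl suggestion above):

# Hypothetical minimal shape of set_valid_hostname -- the shipped script may do more
# (validation, length checks); shown only to make the call below concrete.
set_valid_hostname() {
    local host_name="${1}"
    echo "setting hostname to ${host_name}"
    /bin/hostnamectl --transient set-hostname "${host_name}"
}

# How the service ends up using it, with the hostname afterburn wrote earlier:
source /usr/local/sbin/set-valid-hostname.sh
set_valid_hostname "$(cat /run/afterburn.hostname)"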

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Danil-Grigorev
To complete the pull request process, please assign sinnykumari after the PR has been reviewed.
You can assign the PR to them by writing /assign @sinnykumari in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Danil-Grigorev
Contributor Author

@rphillips Your feedback should be addressed now, could you take a look? The service now uses set-valid-hostname.sh.
/retest

@Danil-Grigorev
Contributor Author

@darkmuggle @yuqi-zhang PTAL. The AWS provider is the only one affected, and as @rphillips pointed out it is a CoreOS issue - kubernetes/kubernetes#70897. They propose using Afterburn to manage such tasks, which is exactly what this PR does.

@yuqi-zhang
Contributor

Hi, sorry for the late comment. I'm not the most comfortable with hostnames, so I defer to @darkmuggle if he has the time to take a look. Just a quick question: I was comparing this to:

ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json

That service appears not to run on firstboot. Should this one do the same?
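If skipping firstboot is wanted here as well, a hedged sketch of the addition to this PR's unit (hypothetical; the question above is whether it should be added at all):

# In the [Unit] section of the new hostname service, mirroring the condition quoted above:
ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json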

@darkmuggle
Contributor

The change looks fine, it's consistent with other platforms.
/approve
/lgtm
(other than the failing CI)

@openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on May 22, 2021
@elmiko
Contributor

elmiko commented Jun 21, 2021

i was able to manually run the commands from the afterburn-hostname service, and those did work:

[root@ip-10-0-152-55 sbin]# source /usr/local/sbin/set-valid-hostname.sh
[root@ip-10-0-152-55 sbin]# set_valid_hostname `cat /run/afterburn.hostname`
exit
[core@ip-10-0-152-55 sbin]$ echo $?
0
[core@ip-10-0-152-55 sbin]$ hostname
ip-10-0-152-55.us-east-2.compute.internal

not sure why this is failing during boot though. perhaps there is another dependency we need to wait on, or add more retries?

@openshift-ci bot removed the lgtm label (Indicates that a PR is ready to be merged.) on Jun 22, 2021
Type=oneshot
RemainAfterExit=yes
ExecStartPre=/usr/bin/afterburn --provider aws --hostname=/run/afterburn.hostname
ExecStart=/bin/bash -c "source /usr/local/sbin/set-valid-hostname.sh; set_valid_hostname `cat /run/afterburn.hostname`"
Member

This will need reworking to be based on #2618

@cgwalters (Member) left a comment

Also, this needs to do the same thing as we're doing on GCP here: https://github.com/openshift/machine-config-operator/blob/master/templates/common/gcp/files/etc-networkmanager-conf.d-hostname.yaml

I guess just copy that file into templates/common/aws; longer term it'd be good to dedup them.

Otherwise the hostname change may get reverted when NM renews the DHCP lease.

@cgwalters
Member

@elmiko
Contributor

elmiko commented Jun 22, 2021

i created a small patch on top of this pr based on @cgwalters' suggestions and it seems to be working well for me.

diff --git a/templates/common/_base/files/usr-local-bin-mco-hostname.yaml b/templates/common/_base/files/usr-local-bin-mco-hostname.yaml
index 0ad56d8c..d3a398bf 100644
--- a/templates/common/_base/files/usr-local-bin-mco-hostname.yaml
+++ b/templates/common/_base/files/usr-local-bin-mco-hostname.yaml
@@ -22,6 +22,16 @@ contents:
         exit 0
     }
 
+    set_aws_hostname() {
+        /usr/bin/afterburn --provider aws --hostname=/run/afterburn.hostname
+
+        local host_name=$(cat /run/afterburn.hostname)
+
+        echo "setting transient hostname to ${host_name}"
+        /bin/hostnamectl --transient set-hostname "${host_name}"
+        exit 0
+    }
+
     set_gcp_hostname() {
         /usr/bin/afterburn --provider gcp --hostname=/run/afterburn.hostname
 
@@ -58,6 +68,7 @@ contents:
     arg=${1}; shift;
     case "${arg}" in
         --wait) wait_localhost;;
+        --aws) set_aws_hostname;;
         --gcp) set_gcp_hostname;;
         *) echo "Unhandled arg $arg"; exit 1
     esac
diff --git a/templates/common/aws/files/etc-networkmanager-conf.d-hostname.yaml b/templates/common/aws/files/etc-networkmanager-conf.d-hostname.yaml
new file mode 100644
index 00000000..de974ae2
--- /dev/null
+++ b/templates/common/aws/files/etc-networkmanager-conf.d-hostname.yaml
@@ -0,0 +1,10 @@
+mode: 0644
+path: "/etc/NetworkManager/conf.d/hostname.conf"
+contents:
+  inline: |
+    # The following configuration allows 90-long-hostname.sh
+    # to manage setting transient hostname instead of NetworkManager itself.
+    # See: https://developer.gnome.org/NetworkManager/stable/NetworkManager.conf.html
+    #      https://bugzilla.redhat.com/show_bug.cgi?id=1872885
+    [main]
+    hostname-mode=none
diff --git a/templates/common/aws/units/afterburn-hostname.service.yaml b/templates/common/aws/units/afterburn-hostname.service.yaml
deleted file mode 100644
index 6126d872..00000000
--- a/templates/common/aws/units/afterburn-hostname.service.yaml
+++ /dev/null
@@ -1,18 +0,0 @@
-name: afterburn-hostname.service
-enabled: true
-contents: |
-  [Unit]
-  Description=Afterburn Hostname for AWS
-  After=NetworkManager-wait-online.service
-  Before=node-valid-hostname.service kubelet.service
-
-  [Service]
-  Restart=on-failure
-  RestartSec=15
-  Type=oneshot
-  RemainAfterExit=yes
-  ExecStartPre=/usr/bin/afterburn --provider aws --hostname=/run/afterburn.hostname
-  ExecStart=/bin/bash -c "source /usr/local/sbin/set-valid-hostname.sh; set_valid_hostname `cat /run/afterburn.hostname`"
-
-  [Install]
-  WantedBy=network-online.target
diff --git a/templates/common/aws/units/aws-hostname.service.yaml b/templates/common/aws/units/aws-hostname.service.yaml
new file mode 100644
index 00000000..00b80017
--- /dev/null
+++ b/templates/common/aws/units/aws-hostname.service.yaml
@@ -0,0 +1,20 @@
+name: aws-hostname.service
+enabled: true
+contents: |
+  [Unit]
+  Description=Set AWS Transient Hostname
+  # Block services relying on networking being up.
+  Before=network-online.target
+  # Wait for NetworkManager to report it's online
+  After=NetworkManager-wait-online.service
+  # Run before hostname checks
+  Before=node-valid-hostname.service
+
+  [Service]
+  Type=oneshot
+  RemainAfterExit=yes
+  ExecStart=/usr/local/bin/mco-hostname --aws
+
+  [Install]
+  WantedBy=multi-user.target
+  WantedBy=network-online.target

when i install a cluster with this patch in place, i see machines joining and nodes created with the proper names. here is an example of the CSRs after bootstrap:

NAME                                       AGE   SIGNERNAME                                    REQUESTOR                                                                         CONDITION
csr-5zbw8                                  50m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-196-71.us-east-2.compute.internal                             Approved,Issued
csr-fgt7s                                  50m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         Approved,Issued
csr-gddbc                                  57m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         Approved,Issued
csr-hb24h                                  57m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-171-34.us-east-2.compute.internal                             Approved,Issued
csr-hhnh8                                  57m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-213-252.us-east-2.compute.internal                            Approved,Issued
csr-jz5g7                                  51m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-161-224.us-east-2.compute.internal                            Approved,Issued
csr-kzd2k                                  51m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         Approved,Issued
csr-nzc9m                                  51m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-156-210.us-east-2.compute.internal                            Approved,Issued
csr-t6wf4                                  57m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-155-68.us-east-2.compute.internal                             Approved,Issued
csr-ttm2t                                  57m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         Approved,Issued
csr-wtvsn                                  57m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         Approved,Issued
csr-xvs8h                                  51m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         Approved,Issued
system:openshift:openshift-authenticator   56m   kubernetes.io/kube-apiserver-client           system:serviceaccount:openshift-authentication-operator:authentication-operator   Approved,Issued

@elmiko (Contributor) left a comment

just a minor nit, but otherwise this looks good and i know it works ;)

@cgwalters (Member) left a comment

LGTM.

(At some point I'm curious about the larger backstory of why DHCP on AWS doesn't give us the hostname we need.)

@Danil-Grigorev force-pushed the aws-correct-hostname branch 2 times, most recently from 108035e to 55478e9 on June 24, 2021 17:46
- Add AWS support to usr-local-bin-mco-hostname.yaml based on @elmiko implementation

Co-authored-by: Michael McCune <msm@opbstudios.com>
@kikisdeliveryservice (Contributor) left a comment

and adding lgtm for Ben & Colin's reviews

/lgtm

@openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on Jun 24, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jun 24, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, Danil-Grigorev, darkmuggle, kikisdeliveryservice

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [kikisdeliveryservice]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Jun 24, 2021
@sinnykumari
Contributor

/test e2e-agnostic-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Jun 25, 2021

@Danil-Grigorev: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/okd-e2e-aws cef5683 link /test okd-e2e-aws
ci/prow/e2e-aws-disruptive cef5683 link /test e2e-aws-disruptive
ci/prow/e2e-ovn-step-registry cef5683 link /test e2e-ovn-step-registry
ci/prow/e2e-metal-ipi cef5683 link /test e2e-metal-ipi
ci/prow/e2e-vsphere-upgrade cef5683 link /test e2e-vsphere-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@Danil-Grigorev
Contributor Author

/test e2e-agnostic-upgrade
