
[OCPCLOUD-1106] Add afterburn task to update AWS hostname to match instance metadata #2401

Merged

Conversation

@Danil-Grigorev (Contributor) commented Feb 10, 2021

- What I did

During the kubelet migration from --cloud-provider=aws to --cloud-provider=external, the kubelet determines the Node name from the hostname of the machine it is running on. On AWS the hostname does not match: it is usually the short reverse-DNS form of the machine IP, such as ip-10-0-195-176. This results in a node name mismatch in the kubelet, and the API server admission plugin rejects changes to the Node object.

This fixes the issue.

- How to verify it

  • Ensure that new nodes provisioned by the kubelet report the AWS-style hostname, typically the reverse IP mapping plus <region>.compute.internal, even when the cloud provider is set to external.
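A hedged way to check this (the node names below are illustrative, matching the CSR listing later in this thread):

# Node names should be the full EC2 private DNS names rather than the short ip-10-0-x-x form:
oc get nodes -o name
# node/ip-10-0-152-55.us-east-2.compute.internal
# ...

# Kubelet serving CSRs should be requested under those same names:
oc get csr | grep kubelet-serving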

- Description for the changelog

@yuqi-zhang
Contributor

Hi Danil,

What is the context of this PR? I don't believe any of the cloud platforms today use afterburn directly to acquire their hostname (other than GCP, to work around its hostname-truncation issue). Is there a bug you are trying to resolve?

@darkmuggle
Contributor

Is this related to the cloud provider plugin requiring that the hostname and the node name match for admission to the cluster? I know this is a problem for VMware and if we have to do this for AWS....I fear we'll be doing it for the other clouds too.

@Danil-Grigorev
Contributor Author

What happened is that I tried an automated reconfiguration of the kubelet with the --cloud-provider=external flag on AWS and hit this admission lock on the Node resource in the API server. The hostname was different, so the Node was not accepted from the kubelet, which meant registration never completed. I'm still trying to verify that this fixes it for AWS, but yes, it is closely related to what @darkmuggle mentioned. The same attempt on GCP worked, and the external configuration provisioned the Node without issues.

@Danil-Grigorev
Contributor Author

/retest

RemainAfterExit=yes
ExecStartPre=/usr/bin/afterburn --provider aws --hostname=/run/afterburn.hostname
# Manually set current hostname at /proc/sys/kernel/hostname to correct value for kubelet.service
ExecStart=/bin/bash -c "cp /run/afterburn.hostname /proc/sys/kernel/hostname; cp /run/afterburn.hostname /etc/hostname"
Contributor Author

/proc/sys/kernel/hostname is the default and only place from which Go's os.Hostname, as used by the kubelet, picks up the value. The hostnamectl call in node-valid-hostname.service fails to update the current hostname on a running machine, so the value is copied manually. The failure looks like this:

Could not set property: Connection timed out
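For readability, an expanded sketch of that ExecStart one-liner (same commands as the unit above, just spelled out):

#!/bin/bash
# Expanded sketch of the ExecStart above: used because the hostnamectl call in
# node-valid-hostname.service fails with "Could not set property: Connection timed out".
set -euo pipefail

# /run/afterburn.hostname is written by the afterburn ExecStartPre step.
# Writing /proc/sys/kernel/hostname updates the running kernel hostname, which is
# what Go's os.Hostname -- and therefore the kubelet -- reads.
cp /run/afterburn.hostname /proc/sys/kernel/hostname
# Also persist it as the static hostname.
cp /run/afterburn.hostname /etc/hostname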

Contributor Author

@rphillips Here is an issue with hostnamectl I observed, re: #2401 (comment)

@Danil-Grigorev
Contributor Author

@darkmuggle @yuqi-zhang PTAL.

@Danil-Grigorev
Contributor Author

/retest

@rphillips
Contributor

We need to fix this in the external cloud providers. There are too many platforms this can affect.

@Danil-Grigorev
Contributor Author

@rphillips I tested it on GCP and Azure; migration works fine on those platforms. vSphere has a fix, and so does OpenStack, so this may be the only place left to change. We can't really fix it in the external cloud providers - the behavior is common to all platform-agnostic Kubernetes configurations. The issue happens to be here: https://github.com/kubernetes/kubernetes/blob/master/pkg/util/node/node.go#L56

There is no way for the standard Go os library to default to the FQDN hostname, and I'm not sure a change to that core functionality would be easily accepted.
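For illustration, the mismatch can be seen directly on an affected instance; a hedged sketch (assumes the standard EC2 instance metadata endpoint is reachable):

# What Go's os.Hostname -- and therefore the kubelet -- sees: the short kernel hostname.
cat /proc/sys/kernel/hostname
# ip-10-0-195-176

# What the external AWS cloud provider expects the node name to match: the private DNS name.
curl -s http://169.254.169.254/latest/meta-data/local-hostname
# ip-10-0-195-176.us-east-2.compute.internal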

@Danil-Grigorev
Contributor Author

/retest

@Danil-Grigorev
Contributor Author

Is this related to the cloud provider plugin requiring that the hostname and the node name match for admission to the cluster? I know this is a problem for VMware and if we have to do this for AWS....I fear we'll be doing it for the other clouds too.

@darkmuggle I tested it on the other clouds and that does not seem to be the case there; only AWS is affected. So here is a fix.

@Danil-Grigorev changed the title from "Add afterburn task to update AWS hostname to match instance metadata" to "[OCPCLOUD-1106] Add afterburn task to update AWS hostname to match instance metadata" on Mar 3, 2021
@rphillips
Contributor

Upstream issue: kubernetes/kubernetes#70897

@rphillips
Contributor

We should use the following script/function to set the hostname, which uses hostnamectl:

@rphillips
Contributor

The script gets installed at /usr/local/sbin/set-valid-hostname.sh
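The helper's contents aren't shown in this thread; a hedged sketch of its likely shape and of how the unit consumes it (the invocation matches the ExecStart shown further down; the function body is a guess based on the hostnamectl suggestion above):

# Hypothetical minimal shape of set_valid_hostname -- the shipped script may do more
# (validation, length checks); shown only to make the call below concrete.
set_valid_hostname() {
    local host_name="${1}"
    echo "setting hostname to ${host_name}"
    /bin/hostnamectl --transient set-hostname "${host_name}"
}

# How the service ends up using it, with the hostname afterburn wrote earlier:
source /usr/local/sbin/set-valid-hostname.sh
set_valid_hostname "$(cat /run/afterburn.hostname)"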

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Danil-Grigorev
To complete the pull request process, please assign sinnykumari after the PR has been reviewed.
You can assign the PR to them by writing /assign @sinnykumari in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Danil-Grigorev
Contributor Author

@rphillips Your feedback should be addressed now, could you take a look? The service now uses set-valid-hostname.sh.
/retest

@Danil-Grigorev
Contributor Author

@darkmuggle @yuqi-zhang PTAL. The AWS provider is the only one affected, and as @rphillips pointed out it is a CoreOS issue - kubernetes/kubernetes#70897. They propose using Afterburn to manage such tasks, which is exactly what this PR does.

@yuqi-zhang
Contributor

Hi, sorry for the late comment. I'm not the most comfortable with hostnames, so I defer to @darkmuggle if he has the time to take a look. Just a quick question: I was comparing this to:

ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json

That service appears not to run on firstboot. Should this one do the same?
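If skipping firstboot is wanted here as well, a hedged sketch of the addition to this PR's unit (hypothetical; the question above is whether it should be added at all):

# In the [Unit] section of the new hostname service, mirroring the condition quoted above:
ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json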

@darkmuggle
Contributor

The change looks fine, it's consistent with other platforms.
/approve
/lgtm
(other than the failing CI)

@openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on May 22, 2021
@elmiko
Contributor

elmiko commented Jun 21, 2021

i was able to manually run the commands from the afterburn-hostname service, and those did work:

[root@ip-10-0-152-55 sbin]# source /usr/local/sbin/set-valid-hostname.sh
[root@ip-10-0-152-55 sbin]# set_valid_hostname `cat /run/afterburn.hostname`
exit
[core@ip-10-0-152-55 sbin]$ echo $?
0
[core@ip-10-0-152-55 sbin]$ hostname
ip-10-0-152-55.us-east-2.compute.internal

not sure why this is failing during boot though. perhaps there is another dependency we need to wait on, or add more retries?

@openshift-ci bot removed the lgtm label (Indicates that a PR is ready to be merged.) on Jun 22, 2021
Type=oneshot
RemainAfterExit=yes
ExecStartPre=/usr/bin/afterburn --provider aws --hostname=/run/afterburn.hostname
ExecStart=/bin/bash -c "source /usr/local/sbin/set-valid-hostname.sh; set_valid_hostname `cat /run/afterburn.hostname`"
Member

This will need reworking to be based on #2618

@cgwalters (Member) left a comment

Also, this needs to do the same thing as we're doing on GCP here: https://github.com/openshift/machine-config-operator/blob/master/templates/common/gcp/files/etc-networkmanager-conf.d-hostname.yaml

I guess just copy that file into templates/common/aws; longer term it'd be good to dedup them.

Otherwise the hostname change may get reverted when NM renews the DHCP lease.

@cgwalters
Member

@elmiko
Contributor

elmiko commented Jun 22, 2021

i created a small patch on top of this pr based on @cgwalters' suggestions and it seems to be working well for me.

diff --git a/templates/common/_base/files/usr-local-bin-mco-hostname.yaml b/templates/common/_base/files/usr-local-bin-mco-hostname.yaml
index 0ad56d8c..d3a398bf 100644
--- a/templates/common/_base/files/usr-local-bin-mco-hostname.yaml
+++ b/templates/common/_base/files/usr-local-bin-mco-hostname.yaml
@@ -22,6 +22,16 @@ contents:
         exit 0
     }
 
+    set_aws_hostname() {
+        /usr/bin/afterburn --provider aws --hostname=/run/afterburn.hostname
+
+        local host_name=$(cat /run/afterburn.hostname)
+
+        echo "setting transient hostname to ${host_name}"
+        /bin/hostnamectl --transient set-hostname "${host_name}"
+        exit 0
+    }
+
     set_gcp_hostname() {
         /usr/bin/afterburn --provider gcp --hostname=/run/afterburn.hostname
 
@@ -58,6 +68,7 @@ contents:
     arg=${1}; shift;
     case "${arg}" in
         --wait) wait_localhost;;
+        --aws) set_aws_hostname;;
         --gcp) set_gcp_hostname;;
         *) echo "Unhandled arg $arg"; exit 1
     esac
diff --git a/templates/common/aws/files/etc-networkmanager-conf.d-hostname.yaml b/templates/common/aws/files/etc-networkmanager-conf.d-hostname.yaml
new file mode 100644
index 00000000..de974ae2
--- /dev/null
+++ b/templates/common/aws/files/etc-networkmanager-conf.d-hostname.yaml
@@ -0,0 +1,10 @@
+mode: 0644
+path: "/etc/NetworkManager/conf.d/hostname.conf"
+contents:
+  inline: |
+    # The following configuration allows 90-long-hostname.sh
+    # to manage setting transient hostname instead of NetworkManager itself.
+    # See: https://developer.gnome.org/NetworkManager/stable/NetworkManager.conf.html
+    #      https://bugzilla.redhat.com/show_bug.cgi?id=1872885
+    [main]
+    hostname-mode=none
diff --git a/templates/common/aws/units/afterburn-hostname.service.yaml b/templates/common/aws/units/afterburn-hostname.service.yaml
deleted file mode 100644
index 6126d872..00000000
--- a/templates/common/aws/units/afterburn-hostname.service.yaml
+++ /dev/null
@@ -1,18 +0,0 @@
-name: afterburn-hostname.service
-enabled: true
-contents: |
-  [Unit]
-  Description=Afterburn Hostname for AWS
-  After=NetworkManager-wait-online.service
-  Before=node-valid-hostname.service kubelet.service
-
-  [Service]
-  Restart=on-failure
-  RestartSec=15
-  Type=oneshot
-  RemainAfterExit=yes
-  ExecStartPre=/usr/bin/afterburn --provider aws --hostname=/run/afterburn.hostname
-  ExecStart=/bin/bash -c "source /usr/local/sbin/set-valid-hostname.sh; set_valid_hostname `cat /run/afterburn.hostname`"
-
-  [Install]
-  WantedBy=network-online.target
diff --git a/templates/common/aws/units/aws-hostname.service.yaml b/templates/common/aws/units/aws-hostname.service.yaml
new file mode 100644
index 00000000..00b80017
--- /dev/null
+++ b/templates/common/aws/units/aws-hostname.service.yaml
@@ -0,0 +1,20 @@
+name: aws-hostname.service
+enabled: true
+contents: |
+  [Unit]
+  Description=Set AWS Transient Hostname
+  # Block services relying on networking being up.
+  Before=network-online.target
+  # Wait for NetworkManager to report it's online
+  After=NetworkManager-wait-online.service
+  # Run before hostname checks
+  Before=node-valid-hostname.service
+
+  [Service]
+  Type=oneshot
+  RemainAfterExit=yes
+  ExecStart=/usr/local/bin/mco-hostname --aws
+
+  [Install]
+  WantedBy=multi-user.target
+  WantedBy=network-online.target

when i install a cluster with this patch in place, i see machines joining and nodes created with the proper names. here is an example of the CSRs after bootstrap:

NAME                                       AGE   SIGNERNAME                                    REQUESTOR                                                                         CONDITION
csr-5zbw8                                  50m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-196-71.us-east-2.compute.internal                             Approved,Issued
csr-fgt7s                                  50m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         Approved,Issued
csr-gddbc                                  57m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         Approved,Issued
csr-hb24h                                  57m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-171-34.us-east-2.compute.internal                             Approved,Issued
csr-hhnh8                                  57m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-213-252.us-east-2.compute.internal                            Approved,Issued
csr-jz5g7                                  51m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-161-224.us-east-2.compute.internal                            Approved,Issued
csr-kzd2k                                  51m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         Approved,Issued
csr-nzc9m                                  51m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-156-210.us-east-2.compute.internal                            Approved,Issued
csr-t6wf4                                  57m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-155-68.us-east-2.compute.internal                             Approved,Issued
csr-ttm2t                                  57m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         Approved,Issued
csr-wtvsn                                  57m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         Approved,Issued
csr-xvs8h                                  51m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         Approved,Issued
system:openshift:openshift-authenticator   56m   kubernetes.io/kube-apiserver-client           system:serviceaccount:openshift-authentication-operator:authentication-operator   Approved,Issued

@elmiko (Contributor) left a comment

just a minor nit, but otherwise this looks good and i know it works ;)

@cgwalters (Member) left a comment

LGTM.

(At some point I'm curious about the larger backstory of why DHCP on AWS doesn't give us the hostname we need.)

@Danil-Grigorev force-pushed the aws-correct-hostname branch 2 times, most recently from 108035e to 55478e9 on June 24, 2021 17:46
- Add AWS support to usr-local-bin-mco-hostname.yaml based on @elmiko implementation

Co-authored-by: Michael McCune <msm@opbstudios.com>
@kikisdeliveryservice (Contributor) left a comment

and adding lgtm for Ben & Colin's reviews

/lgtm

@openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on Jun 24, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jun 24, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, Danil-Grigorev, darkmuggle, kikisdeliveryservice

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [kikisdeliveryservice]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Jun 24, 2021
@sinnykumari
Contributor

/test e2e-agnostic-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Jun 25, 2021

@Danil-Grigorev: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/okd-e2e-aws cef5683 link /test okd-e2e-aws
ci/prow/e2e-aws-disruptive cef5683 link /test e2e-aws-disruptive
ci/prow/e2e-ovn-step-registry cef5683 link /test e2e-ovn-step-registry
ci/prow/e2e-metal-ipi cef5683 link /test e2e-metal-ipi
ci/prow/e2e-vsphere-upgrade cef5683 link /test e2e-vsphere-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@Danil-Grigorev
Contributor Author

/test e2e-agnostic-upgrade
