Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

templates: use Afterburn for setting GCP hostnames #2217

Conversation

darkmuggle
Copy link
Contributor

The only platform where over-log hostnames have been encountered is on
GCP. The code has proven buggy, racy, and caused a bunch of BZ.

To unwind this mess, on GCP, the new behavior:

  • leaves disabling NetworkManager on GCP
  • on each boot run Afterburn to fetch the metadata name
  • uses truncation code for ensuring the name is valid

Finally, this DROPS the NetworkManager dispatcher. FCOS/RHCOS is
pursuing a more permanent one.

@darkmuggle
Copy link
Contributor Author

darkmuggle commented Nov 12, 2020

/hold

I'm NOT sure I like this much better than the other code....its a hack, albeit a lesser hack. I would need to be convinced by others that this isn't horrible.

My objection to this code is that it does not fix the problem in a better place; it's still using a hack script to accomplish the same goal. The reason I submitted this code is that it gets us out of using the NetworkManager dispatcher script which has been buggy. Using Afterburn to write the hostname to an ephemeral location and run every boot is using Afterburn in unattended ways; this alone makes it a hack.

In terms of the carrying cost to MCO is much better than the existing code. So if we're carrying hacks, this narrowly scopes the hack to GCP, we stop fighting NetworkManager, stop checking magic strings, and it runs once at boot.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 12, 2020
@darkmuggle
Copy link
Contributor Author

darkmuggle commented Nov 12, 2020

With RHCOS, Afterburn is enabled for GCP hostnames in 4.7. This patch would only be applicable for upgrading clusters and would be a bridge between the NetworkManager fix landing in RH/RHCOS and now. A nice side effect is that it fixes OKD.

@darkmuggle darkmuggle force-pushed the pr/hostnames-are-really-really-hard branch from 3ae201b to 1fa53eb Compare November 12, 2020 03:54
@darkmuggle
Copy link
Contributor Author

/retest

@LorbusChris
Copy link
Member

Thank you for putting this up @darkmuggle.

Question: The node-valid-hostname.service that calls set-valid-hostname.sh still exists in this repo, even after this PR. Is the goal to get rid of that too eventually and handle it elsewhere (in NM/Afterburn)?

@darkmuggle
Copy link
Contributor Author

Question: The node-valid-hostname.service that calls set-valid-hostname.sh still exists in this repo, even after this PR. Is the goal to get rid of that too eventually and handle it elsewhere (in NM/Afterburn)?

With this change, the code shifts from checking magic strings to ensuring each GCP node gets a hostname. valid-hostnames is needed to wait for the hostname to be set. The reason both this PR and the original code is in the MCO is under the argument that a Linux valid hostname is not a valid cluster-name (i.e 64 character hostnames are valid for Linux, but not the cluster or that is valid for Linux not for the cluster.)

Looking at this, I have half a mind to push this unit into FCOS and thereby RHCOS. When we put this into the MCO, we assumed that there was no place where we could fix the hostnames in the OS. We would still need to leave the valid-hostname code in the MCO since we're blocking CRI-O and the Kublet until a non-local name is found.

The argument for NOT including this in FCOS/RHCOS is that the OS does not require a hostname, while the cluster does.

@darkmuggle darkmuggle force-pushed the pr/hostnames-are-really-really-hard branch from 1fa53eb to 578f98d Compare November 12, 2020 13:57
The only platform where over-log hostnames have been encountered is on
GCP. The code has proven buggy, racy and caused a bunch of BZ.

To unwind this mess, on GCP, the new behavior:
- leaves disabling NetworkManager on GCP
- on each boot run Afterburn to fetch the hostname and writes it an
  ephemeral location
- uses the existing checks to truncate the length

Finally, this DROPS the NetworkManager dispatcher. FCOS/RHCOS is
pursuing a more permanent solution.

Signed-off-by: Ben Howard <ben.howard@redhat.com>
@darkmuggle darkmuggle force-pushed the pr/hostnames-are-really-really-hard branch from 578f98d to 631db89 Compare November 12, 2020 18:22
@vrutkovs
Copy link
Member

/test okd-e2e-gcp-op

@LorbusChris
Copy link
Member

/cherry-pick release-4.6

@openshift-cherrypick-robot

@LorbusChris: once the present PR merges, I will cherry-pick it on top of release-4.6 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vrutkovs
Copy link
Member

/test okd-e2e-upgrade

@darkmuggle
Copy link
Contributor Author

Logs on a GCP cluster look good:

Nov 12 20:25:42 localhost systemd[1]: Started Network Manager Wait Online.
Nov 12 20:25:42 localhost systemd[1]: Starting Set GCP Transient Hostname...
Nov 12 20:25:42 localhost afterburn[1702]: Nov 12 20:25:42.497 INFO Fetching http://metadata.google.internal/computeMetadata/v1/instance/hostname: Attempt #1
Nov 12 20:25:42 localhost afterburn[1702]: Nov 12 20:25:42.502 INFO Fetch successful
Nov 12 20:25:42 localhost systemd[1]: Starting Generate console-login-helper-messages issue snippet...
Nov 12 20:25:42 localhost systemd[1]: Started Generate console-login-helper-messages issue snippet.
Nov 12 20:25:42 localhost bash[1745]: ci-ln-sn85n7t-f76d1-mcv2k-master-2.c.openshift-gce-devel-ci.internal is longer than 63 characters, using trunacated hostname
Nov 12 20:25:42 localhost bash[1749]: setting transient hostname to ci-ln-sn85n7t-f76d1-mcv2k-master-2
Nov 12 20:25:42 localhost systemd[1]: console-login-helper-messages-issuegen.service: Consumed 20ms CPU time
Nov 12 20:25:42 ci-ln-sn85n7t-f76d1-mcv2k-master-2 systemd-hostnamed[1687]: Changed host name to 'ci-ln-sn85n7t-f76d1-mcv2k-master-2'
Nov 12 20:25:46 ci-ln-sn85n7t-f76d1-mcv2k-master-2 chronyd[1440]: Selected source 169.254.169.254
Nov 12 20:25:46 ci-ln-sn85n7t-f76d1-mcv2k-master-2 chronyd[1440]: System clock TAI offset set to 37 seconds
Nov 12 20:25:52 ci-ln-sn85n7t-f76d1-mcv2k-master-2 systemd[1]: NetworkManager-dispatcher.service: Consumed 100ms CPU time
Nov 12 20:26:07 ci-ln-sn85n7t-f76d1-mcv2k-master-2 bash[1729]: Could not set property: Connection timed out
Nov 12 20:26:07 ci-ln-sn85n7t-f76d1-mcv2k-master-2 systemd[1]: Started Set GCP Transient Hostname.
Nov 12 20:26:07 ci-ln-sn85n7t-f76d1-mcv2k-master-2 systemd[1]: Starting Ensure the node hostname is valid for the cluster...
Nov 12 20:26:07 ci-ln-sn85n7t-f76d1-mcv2k-master-2 bash[1756]: waiting for non-localhost hostname to be assigned
Nov 12 20:26:07 ci-ln-sn85n7t-f76d1-mcv2k-master-2 bash[1761]: node identified as ci-ln-sn85n7t-f76d1-mcv2k-master-2

@vrutkovs
Copy link
Member

okd-e2e-gcp-op reached install phase, okd-e2e-upgrade failed to install initial cluster (unsurprisingly, shouldn't have started that).

👍 from my side

@darkmuggle
Copy link
Contributor Author

To unblock OKD, this "hack" is better, and @lucab indicated in Slack that this acceptable. Afterburn will write an ephemeral hosts name and no files are modified so removal is straightforward. We get out of checking magic strings, and ensure that GCP has a good hostname.

/unhold

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 12, 2020
@vrutkovs
Copy link
Member

/test e2e-aws-serial
/test e2e-agnostic-upgrade
/skip

Copy link
Contributor

@yuqi-zhang yuqi-zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the above context I am tentatively in favour of this approach. Should do some upgrade tests etc. on GCP as well

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 12, 2020
@vrutkovs
Copy link
Member

/retest
/refresh

@openshift-ci-robot
Copy link
Contributor

@darkmuggle: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws 631db89 link /test e2e-aws
ci/prow/e2e-metal-ipi 631db89 link /test e2e-metal-ipi
ci/prow/okd-e2e-upgrade 631db89 link /test okd-e2e-upgrade
ci/prow/okd-e2e-gcp-op 631db89 link /test okd-e2e-gcp-op
ci/prow/e2e-gcp-op 631db89 link /test e2e-gcp-op
ci/prow/e2e-aws-serial 631db89 link /test e2e-aws-serial
ci/prow/e2e-aws-workers-rhel7 631db89 link /test e2e-aws-workers-rhel7
ci/prow/okd-e2e-aws 631db89 link /test okd-e2e-aws
ci/prow/e2e-ovn-step-registry 631db89 link /test e2e-ovn-step-registry

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@vrutkovs
Copy link
Member

/test e2e-aws-serial
/skip

vrutkovs added a commit to vrutkovs/okd-machine-os that referenced this pull request Nov 25, 2020
@LorbusChris
Copy link
Member

/test okd-e2e-aws

@ashcrow
Copy link
Member

ashcrow commented Nov 30, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 30, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, darkmuggle, sinnykumari, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [ashcrow,sinnykumari,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

5 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@LorbusChris
Copy link
Member

e2e-gcp-op timed out again, maybe it's time to bump the timeout a little again?
/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@LorbusChris
Copy link
Member

cluster failed to be provisioned
/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@darkmuggle
Copy link
Contributor Author

/retest

3 similar comments
@LorbusChris
Copy link
Member

/retest

@darkmuggle
Copy link
Contributor Author

/retest

@LorbusChris
Copy link
Member

/retest

@LorbusChris
Copy link
Member

/test e2e-aws

@LorbusChris
Copy link
Member

/test okd-e2e-gcp-op

@darkmuggle
Copy link
Contributor Author

/retest

The failures in CI are all unrelated to the change....sigh.

@darkmuggle
Copy link
Contributor Author

/retest

@openshift-merge-robot
Copy link
Contributor

openshift-merge-robot commented Dec 3, 2020

@darkmuggle: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/okd-e2e-upgrade 631db89 link /test okd-e2e-upgrade
ci/prow/okd-e2e-gcp-op 0a646d2 link /test okd-e2e-gcp-op

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-cherrypick-robot

@LorbusChris: new pull request created: #2286

In response to this:

/cherry-pick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants