Bug 1809345: templates: add etc-networkmanager-dispatcher.d-90-long-hostname.yaml #1711

darkmuggle · 2020-05-04T17:35:30Z

On GCP it is not uncommon to have DNS names that are too long.
Linux restricts the kernel hostname to being less than 63 characters.
The new template simply ensures that a hostname is set in the event that
NetworkManager is unable to do so.

Bug: 1809345

templates/common/_base/files/etc-networkmanager-dispatcher.d-90-long-hostname.yaml

darkmuggle · 2020-05-04T17:44:02Z

I'm going to file a bug with NM and start the ball rolling over there to get this fixed properly. In the meantime, we need something like this to fix GCP

cgwalters · 2020-05-04T17:50:13Z

But per discussion in coreos/ignition-dracut#156 it's required that the hostnames are routable. Will GCP also handle lookups for these similarly truncated names?

cgwalters · 2020-05-04T17:57:48Z

For at least IPI GCP we should be in control of the network setup and ideally able to control what the DHCP server is sending the instance.

I only briefly glanced at https://cloud.google.com/compute/docs/internal-dns and haven't looked at what the installer terraform is doing, but I'd assume we could configure the infrastructure to fix this.

darkmuggle · 2020-05-04T17:59:29Z

But per discussion in coreos/ignition-dracut#156 it's required that the hostnames are routable. Will GCP also handle lookups for these similarly truncated names?

As I understood that discussion -- and plenty of slack commentaries -- the hostname needs to be resolvable. The hostname is still resolvable within the search path which is properly set in /etc/resolv.conf

@abhinavdahiya Can you confirm that truncating the hostname that the node level (NOT cluster) will work?

darkmuggle · 2020-05-04T18:01:56Z

For at least IPI GCP we should be in control of the network setup and ideally able to control what the DHCP server is sending the instance.

I only briefly glanced at https://cloud.google.com/compute/docs/internal-dns and haven't looked at what the installer terraform is doing, but I'd assume we could configure the infrastructure to fix this.

You are correct. Using internal DNS will work for new bootstraps; the request was for an OS solution for upgrades. @abhinavdahiya argued strongly against using it as the solution.

yuqi-zhang

If this is GCP specific I think it can go into /common/gcp.

I'm also not familiar enough with cluster-level routing so I'd suggest we check with someone who knows GCP well enough to ensure this will work

templates/common/_base/files/etc-networkmanager-dispatcher.d-90-long-hostname.yaml

cgwalters · 2020-05-04T18:13:39Z

The hostname is still resolvable within the search path which is properly set in /etc/resolv.conf

If that's the case, then GCP must be using the same truncation algorithm.

You are correct. Using internal DNS will work for new bootstraps; the request was for an OS solution for upgrades.

Ahhh...yes, that is an issue.

OK to be clear, I am totally 👍 on this if we've tested it to work (did you?). And I guess that's the tricky aspect; we'd need to roll a new release image with this PR in it, and then try to create a cluster with openshift-install with a cluster name that provokes name length issues.

Or maybe it'd be possible to test this with an existing GCP cluster by editing the DNS setup to trigger it?

darkmuggle · 2020-05-04T18:19:34Z

I tested this outside the scope of a cluster by:

creating 90 character FQDN
dropping the file in manually
rebooting

Before the new dispatcher, I would get "localhost" as the hostname with a resolv.conf containing the search path and dns resolvers. After this new dispatcher, the hostname was set to the truncated named the same resolv.conf.

👍 to the testing path.

abhinavdahiya · 2020-05-04T18:27:17Z

Few things:

Does it make sense to add this to all platforms or just GCP? Because currently it does it for all.
Is this workaround, something we should be configure in MCO or should this be in rhcos

ashcrow

Overall looks good but followed on to @yuqi-zhang's comment with a possible extra safeguard.

yuqi-zhang · 2020-05-04T18:32:22Z

Does it make sense to add this to all platforms or just GCP? Because currently it does it for all.

Since the issue itself arises from Linux kernel restrictions I it might be fine living under common. That said,

If that's the case, then GCP must be using the same truncation algorithm.

Might not be the case for every platform.

Is this workaround, something we should be configure in MCO or should this be in rhcos

Today we have this: https://github.com/coreos/afterburn/blob/master/dracut/30afterburn/afterburn-hostname.service

Maybe that would be a better place for this to live?

cgwalters · 2020-05-04T18:51:55Z

Might not be the case for every platform.

Right, we can only truncate the names if the DNS server is doing the same thing. And I'm sure that's not the default.

Today we have this: https://github.com/coreos/afterburn/blob/master/dracut/30afterburn/afterburn-hostname.service

Hmm...I think that's only used when we need to use an out-of-band (non-DHCP) mechanism to get the hostname.

darkmuggle · 2020-05-04T20:05:25Z

I went back and forth on where to carry this I ultimately chose the MCO for a couple of reasons:

changing this NM is way harder and has a wider blast surface
providing this via the MCO means that we don't have the FCOS/RHCOS delta problem
if/when we get a proper "fix" elsewhere this is much easier to drop
Afterburn doesn't handle the discovered hostname which is where this is coming from.

The bigger issue that OCP is relying on implicit hostnames via the PTR records when nothing else sets the hostname. But per [1] we have a collision:

we MUST handle hostnames up to 63 characters
we SHOULD handle hostnames up to 255 characters.

The kernel and Systemd are compliant with the "must" requirement, but not with the "should". The issue is how NM is setting the hostname by not ensuring that the components under it handle the assignment. Since NM is using PTR records to discover the hostname (I'm not sure if this behavior is covered by a spec) we end up in this gray area.

IMO we'll need this fix for pre-4.6 OCP and until we get proper NM/Systemd/Kernel support for the "should" part of the RFC.

Finally, the reason I didn't scope the patch at GCP specifically, is that this condition can happen generically and is most likely to happen on bare-metal: just use the PTR record hostname with a character count more than 63.

[1] https://tools.ietf.org/html/rfc1123#page-13

darkmuggle · 2020-05-04T20:47:14Z

Few things:

Does it make sense to add this to all platforms or just GCP? Because currently it does it for all.

Taxonomically speaking, this condition is most often encountered on GCP, but it NOT unique to GCP. Any environment where the host is using PTR record hostname discovery can result in this situation (such as UPI).

I seriously considered putting within the scope of GCP, but per RFC 1123, anything less than 255 is valid, and software is NOT required to handle it. Fixing this just for GCP seems a bit short-sighted.

Is this workaround, something we should be configure in MCO or should this be in rhcos

Ultimately, I would like to see the upstreams of RHCOS (FCOS, NetworkManager or Systemd) come up with a proper fix, but given the blast radius and release times, its unlikely to happen soon. This fixes the immediate pain in a way that can be reversed quickly without baking it into the bootimages.

abhinavdahiya · 2020-05-04T21:36:33Z

@darkmuggle
#1711 (comment)
and #1711 (comment) answer all my concerns from #1711 (comment)

Thanks! i'm fine with the change. I can try and test this longer hostnames locally to verify.

openshift-ci-robot · 2020-05-04T22:10:10Z

@darkmuggle: This pull request references Bugzilla bug 1809345, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.5.0) matches configured target release for branch (4.5.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1809345: templates: add etc-networkmanager-dispatcher.d-90-long-hostname.yaml

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kikisdeliveryservice · 2020-05-04T22:10:11Z

/bugzilla refresh

openshift-ci-robot · 2020-05-04T22:10:14Z

@kikisdeliveryservice: This pull request references Bugzilla bug 1809345, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.5.0) matches configured target release for branch (4.5.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

abhinavdahiya · 2020-05-04T22:28:47Z

Here's a test that is testing the long names using cluster-bot

/msg @cluster-bot launch openshift/installer#3544,openshift/machine-config-operator#1711 gcp
/msg @cluster-bot test e2e openshift/installer#3544,openshift/machine-config-operator#1711 gcp

here are jobs currently running https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/703 and https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/704

these should help figure out if the change is going to work..

abhinavdahiya · 2020-05-04T22:59:29Z

both the CI jobs
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/704
and
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/703

failed to complete installation because the works did not join the clusters, so maybe this fix is not yet working...

abhinavdahiya · 2020-05-04T23:31:15Z

Okay i think I know what's up, the CSRs for the worker nodes are not approved because the machine-api doesn't add the vm name to the InternalDNS, therefore it doesn't get approved by the in-cluster-approver.

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/703/artifacts/launch/csr.json

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/703/artifacts/launch/pods/openshift-cluster-machine-approver_machine-approver-68c87cd9c9-nxgss_machine-approver-controller.log

I0504 22:34:29.994328       1 main.go:147] CSR csr-thcx7 added
I0504 22:34:30.008009       1 main.go:182] CSR csr-thcx7 not authorized: failed to find machine for node ci-ln-g2vb4fb-f76d1-mst78-w-c-h7vsw
I0504 22:34:30.008044       1 main.go:218] Error syncing csr csr-thcx7: failed to find machine for node ci-ln-g2vb4fb-f76d1-mst78-w-c-h7vsw

openshift-ci-robot · 2020-05-05T00:24:08Z

@darkmuggle: The /retest command does not accept any targets.
The following commands are available to trigger jobs:

/test e2e-aws
/test e2e-aws-disruptive
/test e2e-aws-scaleup-rhel7
/test e2e-gcp-op
/test e2e-gcp-upgrade
/test e2e-metal-ipi
/test e2e-openstack
/test e2e-ovirt
/test e2e-vsphere
/test images
/test unit
/test verify

Use /test all to run all jobs.

In response to this:

/retest ci/prow/e2e-metal-ipi

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kikisdeliveryservice · 2020-05-05T00:30:21Z

/retest ci/prow/e2e-metal-ipi

just fyi: the e2e-metal-ipi seems kind of flaky rn across prs (and not required) unless you think this failure is directly related to your PR..

/test e2e-metal-ipi

abhinavdahiya · 2020-05-05T02:33:36Z

Okay, nice this works with change from cluster-api-provider-gcp

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/708
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/707

Both succeeded.. so we need to merge openshift/cluster-api-provider-gcp#88 with this change and we should be good.

NautiluX · 2020-05-05T11:14:46Z

Nice! Thanks for following up on this issue.

On GCP it is not uncommon to have DNS names that are too long. Linux restricts the kernel hostname to being less than 63 characters. The new template simply ensures that a hostname is set in the event that NetworkManager is unable to do so. Bug: 1809345 Signed-off-by: Ben Howard <ben.howard@redhat.com>

ashcrow · 2020-05-05T18:02:18Z

templates/common/_base/files/etc-networkmanager-dispatcher.d-90-long-hostname.yaml

+
+    default_host="${DHCP4_HOST_NAME:-$DHCP6_HOST_NAME}"
+    # truncate the hostname to the first dot and than 64 characters.
+    host=$(echo ${default_host} | cut -f1 -d'.' | cut -c -63)


$ echo "averylonghostnamewhichshouldprobablybelessbecausereallyitsprettycrazy.to.have.such.long.names" | cut -f1 -d'.' | cut -c -63 averylonghostnamewhichshouldprobablybelessbecausereallyitsprett $ echo -n "averylonghostnamewhichshouldprobablybelessbecausereallyitsprettycrazy.to.have.such.long.names" | cut -f1 -d'.' | cut -c -63 | wc -c 64

Enforces 64 char limit. However, the next check on default host is >63. Should this check really be limiting to 63 instead?

Other than this, LGTM!

Ha! I did the same thing:

$ echo hi | cut -c -2 | wc -c 3

wc -c counts from 1

It warms my heart that we both thought of and ran the same test.

ashcrow · 2020-05-05T18:29:58Z

/lgtm

openshift-ci-robot · 2020-05-05T18:30:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, cgwalters, darkmuggle

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ashcrow,cgwalters]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2020-05-05T19:43:19Z

@darkmuggle: Some pull requests linked via external trackers have merged: . The following pull requests linked via external trackers have not merged:

openshift/cluster-api-provider-gcp#88 is open
Bugzilla bug 1809345 has been moved to the MODIFIED state.

In response to this:

Bug 1809345: templates: add etc-networkmanager-dispatcher.d-90-long-hostname.yaml

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

abhinavdahiya · 2020-05-18T21:26:25Z

/cherry-pick release-4.4
/cherry-pick release-4.3

openshift-cherrypick-robot · 2020-05-18T21:26:35Z

@abhinavdahiya: new pull request created: #1737

In response to this:

/cherry-pick release-4.4
/cherry-pick release-4.3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

abhinavdahiya · 2020-05-18T21:27:27Z

/cherry-pick release-4.3

openshift-cherrypick-robot · 2020-05-18T21:27:34Z

@abhinavdahiya: new pull request created: #1738

In response to this:

/cherry-pick release-4.3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

The infra id of the clusters on GCP was reduced to 12 in openshift#2088 because we couldn't handle the hostname seen by rhcos machine to be greater than 64. More details on this are available in https://bugzilla.redhat.com/show_bug.cgi?id=1809345 now since BZ 1809345 is fixed by openshift/machine-config-operator#1711 and openshift/cluster-api-provider-gcp#88 the installer can relax the restriction on the infra-id to match the other platforms. Why is it important? On GCP all resources are prefixed with infra-id, which currently is 12 chars with 6 chars used by random bit, leaving only 6 chars from cluster name. This causes trouble associating the cluster to jobs in CI as most of the identifyable characters are dropped from the resource names in CI due to this restriction. Also because of the previous restriction, only one char are used from pool's name, making is higly likely to collide in cases there are more.

This mirrors changes to GCP IPI in openshift#3544 The infra id of the clusters on GCP was reduced to 12 in openshift#2088 because we couldn't handle the hostname seen by rhcos machine to be greater than 64. More details on this are available in https://bugzilla.redhat.com/show_bug.cgi?id=1809345 now since BZ 1809345 is fixed by openshift/machine-config-operator#1711 and openshift/cluster-api-provider-gcp#88 the installer can relax the restriction on the infra-id to match the other platforms. Why is it important? On GCP all resources are prefixed with infra-id, which currently is 12 chars with 6 chars used by random bit, leaving only 6 chars from cluster name. This causes trouble associating the cluster to jobs in CI as most of the identifyable characters are dropped from the resource names in CI due to this restriction. Also because of the previous restriction, only one char are used from pool's name, making is higly likely to collide in cases there are more.

darkmuggle requested a review from cgwalters May 4, 2020 17:35

openshift-ci-robot requested review from ashcrow and yuqi-zhang May 4, 2020 17:35

tedyu reviewed May 4, 2020

View reviewed changes

templates/common/_base/files/etc-networkmanager-dispatcher.d-90-long-hostname.yaml Outdated Show resolved Hide resolved

yuqi-zhang reviewed May 4, 2020

View reviewed changes

templates/common/_base/files/etc-networkmanager-dispatcher.d-90-long-hostname.yaml Outdated Show resolved Hide resolved

ashcrow reviewed May 4, 2020

View reviewed changes

cgwalters approved these changes May 4, 2020

View reviewed changes

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 4, 2020

kikisdeliveryservice changed the title ~~templates: add etc-networkmanager-dispatcher.d-90-long-hostname.yaml~~ Bug 1809345: templates: add etc-networkmanager-dispatcher.d-90-long-hostname.yaml May 4, 2020

openshift-ci-robot added the bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. label May 4, 2020

openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label May 4, 2020

darkmuggle force-pushed the master branch from ad6ed33 to 25f476a Compare May 5, 2020 16:46

darkmuggle force-pushed the master branch from 25f476a to afb1f1d Compare May 5, 2020 16:54

ashcrow reviewed May 5, 2020

View reviewed changes

ashcrow approved these changes May 5, 2020

View reviewed changes

openshift-ci-robot assigned ashcrow May 5, 2020

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 5, 2020

openshift-merge-robot merged commit 07246eb into openshift:master May 5, 2020

enxebre mentioned this pull request May 6, 2020

Bug 1809345: Add the machine's name as a known NodeInternalDNS openshift/cluster-api-provider-gcp#88

Merged

openshift-cherrypick-robot mentioned this pull request May 18, 2020

Bug 1837122: templates: add etc-networkmanager-dispatcher.d-90-long-hostname.yaml #1737

Merged

openshift-cherrypick-robot mentioned this pull request May 18, 2020

Bug 1837124: templates: add etc-networkmanager-dispatcher.d-90-long-hostname.yaml #1738

Merged

abhinavdahiya mentioned this pull request May 19, 2020

Bug 1837642: installconfig: relax GCP cluster id restriction openshift/installer#3544

Merged

abhinavdahiya mentioned this pull request May 26, 2020

[release-4.4] Bug 1838885: installconfig: relax GCP cluster id restriction openshift/installer#3662

Merged

jstuever mentioned this pull request Jun 3, 2020

GCP UPI: relax cluster name openshift/installer#3713

Merged

Bug 1809345: templates: add etc-networkmanager-dispatcher.d-90-long-hostname.yaml #1711

Bug 1809345: templates: add etc-networkmanager-dispatcher.d-90-long-hostname.yaml #1711

Conversation

darkmuggle commented May 4, 2020

darkmuggle commented May 4, 2020

cgwalters commented May 4, 2020

cgwalters commented May 4, 2020

darkmuggle commented May 4, 2020

darkmuggle commented May 4, 2020

yuqi-zhang left a comment

Choose a reason for hiding this comment

cgwalters commented May 4, 2020

darkmuggle commented May 4, 2020

abhinavdahiya commented May 4, 2020

ashcrow left a comment

Choose a reason for hiding this comment

yuqi-zhang commented May 4, 2020

cgwalters commented May 4, 2020

darkmuggle commented May 4, 2020 • edited

darkmuggle commented May 4, 2020 • edited

abhinavdahiya commented May 4, 2020

openshift-ci-robot commented May 4, 2020

kikisdeliveryservice commented May 4, 2020

openshift-ci-robot commented May 4, 2020

abhinavdahiya commented May 4, 2020

abhinavdahiya commented May 4, 2020

abhinavdahiya commented May 4, 2020 • edited

openshift-ci-robot commented May 5, 2020

kikisdeliveryservice commented May 5, 2020

abhinavdahiya commented May 5, 2020

NautiluX commented May 5, 2020

ashcrow May 5, 2020

Choose a reason for hiding this comment

ashcrow May 5, 2020

Choose a reason for hiding this comment

darkmuggle May 5, 2020 • edited

Choose a reason for hiding this comment

ashcrow May 5, 2020

Choose a reason for hiding this comment

ashcrow commented May 5, 2020

openshift-ci-robot commented May 5, 2020

openshift-ci-robot commented May 5, 2020

abhinavdahiya commented May 18, 2020

openshift-cherrypick-robot commented May 18, 2020

abhinavdahiya commented May 18, 2020

openshift-cherrypick-robot commented May 18, 2020

darkmuggle commented May 4, 2020 •

edited

darkmuggle commented May 4, 2020 •

edited

abhinavdahiya commented May 4, 2020 •

edited

darkmuggle May 5, 2020 •

edited