Bug 1809345: templates: add etc-networkmanager-dispatcher.d-90-long-hostname.yaml #1711
templates/common/_base/files/etc-networkmanager-dispatcher.d-90-long-hostname.yaml
I'm going to file a bug with NM and start the ball rolling over there to get this fixed properly. In the meantime, we need something like this to fix GCP |
But per discussion in coreos/ignition-dracut#156 it's required that the hostnames are routable. Will GCP also handle lookups for these similarly truncated names? |
For at least IPI GCP we should be in control of the network setup and ideally able to control what the DHCP server is sending the instance. I only briefly glanced at https://cloud.google.com/compute/docs/internal-dns and haven't looked at what the installer terraform is doing, but I'd assume we could configure the infrastructure to fix this. |
As I understood that discussion -- and plenty of Slack commentary -- the hostname needs to be resolvable. The hostname is still resolvable within the search path, which is properly set. @abhinavdahiya Can you confirm that truncating the hostname at the node level (NOT cluster) will work? |
You are correct. Using internal DNS will work for new bootstraps; the request was for an OS solution for upgrades. @abhinavdahiya argued strongly against using it as the solution. |
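To make the search-path point concrete, here is a minimal sketch of how a short (truncated) name gets expanded against the search domains. The helper `expand_search` and the example domains are hypothetical, and the sketch ignores `ndots` and other resolver options; it only shows which fully qualified names would be tried, not whether the DNS server has records for them.

```shell
# Hypothetical helper: print the FQDNs tried for a short name, given the
# domains from a resolv.conf "search" line. Runs in a subshell so that
# no variables leak into the caller.
expand_search() (
    short="$1"
    shift
    for domain in "$@"; do
        printf '%s.%s\n' "$short" "$domain"
    done
)

# Example domains are made up for illustration.
expand_search "worker-0" "c.my-project.internal" "example.internal"
```

The truncated name only helps if one of these expanded names actually resolves, which is the open question in this thread.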
If this is GCP specific I think it can go into /common/gcp.
I'm also not familiar enough with cluster-level routing so I'd suggest we check with someone who knows GCP well enough to ensure this will work
templates/common/_base/files/etc-networkmanager-dispatcher.d-90-long-hostname.yaml
If that's the case, then GCP must be using the same truncation algorithm.
Ahhh...yes, that is an issue. OK to be clear, I am totally 👍 on this if we've tested it to work (did you?). And I guess that's the tricky aspect; we'd need to roll a new release image with this PR in it, and then try to create a cluster with openshift-install with a cluster name that provokes name length issues. Or maybe it'd be possible to test this with an existing GCP cluster by editing the DNS setup to trigger it? |
I tested this outside the scope of a cluster by:
Before the new dispatcher, I would get "localhost" as the hostname with a resolv.conf containing the search path and DNS resolvers. After this new dispatcher, the hostname was set to the truncated name with the same resolv.conf. 👍 to the testing path. |
Few things:
Overall looks good but followed on to @yuqi-zhang's comment with a possible extra safeguard.
Since the issue itself arises from Linux kernel restrictions, I think it might be fine living under common. That said,
Might not be the case for every platform.
Today we have this: https://github.com/coreos/afterburn/blob/master/dracut/30afterburn/afterburn-hostname.service Maybe that would be a better place for this to live? |
Right, we can only truncate the names if the DNS server is doing the same thing. And I'm sure that's not the default.
Hmm...I think that's only used when we need to use an out-of-band (non-DHCP) mechanism to get the hostname. |
I went back and forth on where to carry this; I ultimately chose the MCO for a couple of reasons:
The bigger issue is that OCP is relying on implicit hostnames via the PTR records when nothing else sets the hostname. But per [1] we have a collision:
The kernel and Systemd are compliant with the "must" requirement, but not with the "should". The issue is how NM is setting the hostname by not ensuring that the components under it handle the assignment. Since NM is using PTR records to discover the hostname (I'm not sure if this behavior is covered by a spec) we end up in this gray area. IMO we'll need this fix for pre-4.6 OCP and until we get proper NM/Systemd/Kernel support for the "should" part of the RFC. Finally, the reason I didn't scope the patch at GCP specifically, is that this condition can happen generically and is most likely to happen on bare-metal: just use the PTR record hostname with a character count more than 63. |
Taxonomically speaking, this condition is most often encountered on GCP, but it is NOT unique to GCP. Any environment where the host is using PTR record hostname discovery can result in this situation (such as UPI). I seriously considered putting it within the scope of GCP, but per RFC 1123, anything less than 255 is valid, and software is NOT required to handle it. Fixing this just for GCP seems a bit short-sighted.
Ultimately, I would like to see the upstreams of RHCOS (FCOS, NetworkManager or Systemd) come up with a proper fix, but given the blast radius and release times, it's unlikely to happen soon. This fixes the immediate pain in a way that can be reversed quickly without baking it into the bootimages. |
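For reference, the gray area described above can be sketched as a small classifier. The limits are 63 octets per DNS label and 255 per full name (RFC 1035 and RFC 1123), and 64 bytes for the Linux kernel hostname (`HOST_NAME_MAX`). The function name `classify_hostname` and its messages are hypothetical, and `${#name}` counts characters, which equals bytes only for ASCII hostnames.

```shell
# Hypothetical helper: report which limit, if any, a hostname exceeds.
# Runs in a subshell so that no variables leak into the caller.
classify_hostname() (
    name="$1"
    first_label="${name%%.*}"   # everything before the first dot
    if [ "${#name}" -gt 255 ]; then
        echo "invalid: name longer than 255"
    elif [ "${#first_label}" -gt 63 ]; then
        echo "invalid: first label longer than 63"
    elif [ "${#name}" -gt 64 ]; then
        echo "valid DNS name, too long for the kernel hostname"
    else
        echo "fits"
    fi
)

classify_hostname "worker-0.example.com"
```

The third branch is the case this PR targets: a name that DNS accepts but that cannot be set as the kernel hostname in full.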
@darkmuggle Thanks! I'm fine with the change. I can try to test this with longer hostnames locally to verify. |
@darkmuggle: This pull request references Bugzilla bug 1809345, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/bugzilla refresh |
@kikisdeliveryservice: This pull request references Bugzilla bug 1809345, which is valid. 3 validation(s) were run on this bug
Here's a test that is testing the long names using cluster-bot
here are jobs currently running https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/703 and https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/704 these should help figure out if the change is going to work.. |
Both CI jobs failed to complete installation because the workers did not join the clusters, so maybe this fix is not yet working... |
Okay, I think I know what's up: the CSRs for the worker nodes are not approved because the machine-api doesn't add the VM name to the InternalDNS, therefore they don't get approved by the in-cluster-approver.
just fyi: the e2e-metal-ipi seems kind of flaky rn across prs (and not required) unless you think this failure is directly related to your PR.. /test e2e-metal-ipi |
Okay, nice this works with change from cluster-api-provider-gcp https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/708 Both succeeded.. so we need to merge openshift/cluster-api-provider-gcp#88 with this change and we should be good. |
Nice! Thanks for following up on this issue. |
On GCP it is not uncommon to have DNS names that are too long.
Linux restricts the kernel hostname to at most 64 characters.
The new template simply ensures that a hostname is set in the event that
NetworkManager is unable to do so.
Bug: 1809345
Signed-off-by: Ben Howard <ben.howard@redhat.com>
default_host="${DHCP4_HOST_NAME:-$DHCP6_HOST_NAME}"
# truncate the hostname at the first dot, then to at most 63 characters.
host=$(echo "${default_host}" | cut -f1 -d'.' | cut -c -63)
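As a side note, the same truncation can be sketched without the `echo | cut | cut` pipeline. This is only an illustration under stated assumptions, not the template's actual code: `truncate_hostname` is a hypothetical name, and the final `echo` stands in for whatever the dispatcher script really does to set the hostname. `DHCP4_HOST_NAME` and `DHCP6_HOST_NAME` are the variables from the snippet above.

```shell
# Sketch: keep the first DNS label and cap it at 63 characters.
truncate_hostname() {
    # ${1%%.*} drops everything from the first dot onward;
    # the %.63s precision keeps at most 63 characters of the result.
    printf '%.63s\n' "${1%%.*}"
}

default_host="${DHCP4_HOST_NAME:-${DHCP6_HOST_NAME:-}}"
if [ -n "$default_host" ] && [ "${#default_host}" -gt 63 ]; then
    host="$(truncate_hostname "$default_host")"
    # Placeholder only: a real dispatcher script would set the hostname here.
    echo "would set hostname to: $host"
fi
```

Using parameter expansion and `printf` avoids word splitting on the unquoted variable and keeps the length handling in one place.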
$ echo "averylonghostnamewhichshouldprobablybelessbecausereallyitsprettycrazy.to.have.such.long.names" | cut -f1 -d'.' | cut -c -63
averylonghostnamewhichshouldprobablybelessbecausereallyitsprett
$ echo -n "averylonghostnamewhichshouldprobablybelessbecausereallyitsprettycrazy.to.have.such.long.names" | cut -f1 -d'.' | cut -c -63 | wc -c
64
Enforces 64 char limit. However, the next check on default host is >63. Should this check really be limiting to 63 instead?
Other than this, LGTM!
Ha! I did the same thing:
$ echo hi | cut -c -2 | wc -c
3
wc -c also counts the trailing newline, so it reports one more than the number of characters.
It warms my heart that we both thought of and ran the same test.
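For anyone repeating this check, a small sketch of where the extra byte comes from and one way to measure the length without it. The variable names are only for illustration.

```shell
# cut terminates its output with a newline, and wc -c counts that byte too.
with_newline=$(echo hi | cut -c -2 | wc -c | tr -d ' ')

# Command substitution strips the trailing newline, so ${#short} is the
# real character count of the truncated value.
short=$(echo hi | cut -c -2)
without_newline=${#short}

echo "wc -c: $with_newline, \${#short}: $without_newline"
```

Here `wc -c` reports 3 while `${#short}` is 2, so `cut -c -63` yields a 63-character name even though `wc -c` shows 64.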
Ahh 👍
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ashcrow, cgwalters, darkmuggle The full list of commands accepted by this bot can be found here. The pull request process is described here
@darkmuggle: Some pull requests linked via external trackers have merged: . The following pull requests linked via external trackers have not merged:
/cherry-pick release-4.4 |
@abhinavdahiya: new pull request created: #1737
/cherry-pick release-4.3 |
@abhinavdahiya: new pull request created: #1738
The infra id of the clusters on GCP was reduced to 12 in openshift#2088 because we couldn't handle the hostname seen by the RHCOS machine being greater than 64. More details on this are available in https://bugzilla.redhat.com/show_bug.cgi?id=1809345. Now that BZ 1809345 is fixed by openshift/machine-config-operator#1711 and openshift/cluster-api-provider-gcp#88, the installer can relax the restriction on the infra-id to match the other platforms. Why is it important? On GCP all resources are prefixed with the infra-id, which currently is 12 chars with 6 chars used by the random bit, leaving only 6 chars from the cluster name. This causes trouble associating the cluster to jobs in CI, as most of the identifiable characters are dropped from the resource names in CI due to this restriction. Also, because of the previous restriction, only one char is used from the pool's name, making it highly likely to collide in cases where there are more.
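The length budget described above can be sketched as follows. This is a hypothetical illustration, not the installer's code: `infra_id` is a made-up helper, the 6 "random" characters are assumed to be a 5-character suffix plus a `-` separator, and the larger limit of 27 in the second call is an assumed value chosen only to show the effect of relaxing the restriction.

```shell
# Hypothetical helper: build an infra id from a cluster name, a maximum
# total length, and a random suffix. Runs in a subshell so that no
# variables leak into the caller.
infra_id() (
    # The suffix and its "-" separator come out of the budget first;
    # the cluster name gets whatever is left.
    name_budget=$(( $2 - ${#3} - 1 ))
    printf "%.${name_budget}s-%s\n" "$1" "$3"
)

infra_id "mycluster-prod" 12 "x7k2q"   # only "myclus" survives from the name
infra_id "mycluster-prod" 27 "x7k2q"   # the whole name fits
```

With a limit of 12 only six characters of the cluster name are kept, which is why resource names become hard to tell apart in CI.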
This mirrors changes to GCP IPI in openshift#3544. The infra id of the clusters on GCP was reduced to 12 in openshift#2088 because we couldn't handle the hostname seen by the RHCOS machine being greater than 64. More details on this are available in https://bugzilla.redhat.com/show_bug.cgi?id=1809345. Now that BZ 1809345 is fixed by openshift/machine-config-operator#1711 and openshift/cluster-api-provider-gcp#88, the installer can relax the restriction on the infra-id to match the other platforms. Why is it important? On GCP all resources are prefixed with the infra-id, which currently is 12 chars with 6 chars used by the random bit, leaving only 6 chars from the cluster name. This causes trouble associating the cluster to jobs in CI, as most of the identifiable characters are dropped from the resource names in CI due to this restriction. Also, because of the previous restriction, only one char is used from the pool's name, making it highly likely to collide in cases where there are more.