Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1809345: templates: add etc-networkmanager-dispatcher.d-90-long-hostname.yaml #1711

Merged
merged 1 commit into from May 5, 2020

Conversation

darkmuggle
Copy link
Contributor

On GCP it is not uncommon to have DNS names that are too long.
Linux restricts the kernel hostname to being less than 63 characters.
The new template simply ensures that a hostname is set in the event that
NetworkManager is unable to do so.

Bug: 1809345

@darkmuggle
Copy link
Contributor Author

I'm going to file a bug with NM and start the ball rolling over there to get this fixed properly. In the meantime, we need something like this to fix GCP

@cgwalters
Copy link
Member

But per discussion in coreos/ignition-dracut#156 it's required that the hostnames are routable. Will GCP also handle lookups for these similarly truncated names?

@cgwalters
Copy link
Member

For at least IPI GCP we should be in control of the network setup and ideally able to control what the DHCP server is sending the instance.

I only briefly glanced at https://cloud.google.com/compute/docs/internal-dns and haven't looked at what the installer terraform is doing, but I'd assume we could configure the infrastructure to fix this.

@darkmuggle
Copy link
Contributor Author

But per discussion in coreos/ignition-dracut#156 it's required that the hostnames are routable. Will GCP also handle lookups for these similarly truncated names?

As I understood that discussion -- and plenty of slack commentaries -- the hostname needs to be resolvable. The hostname is still resolvable within the search path which is properly set in /etc/resolv.conf

@abhinavdahiya Can you confirm that truncating the hostname that the node level (NOT cluster) will work?

@darkmuggle
Copy link
Contributor Author

For at least IPI GCP we should be in control of the network setup and ideally able to control what the DHCP server is sending the instance.

I only briefly glanced at https://cloud.google.com/compute/docs/internal-dns and haven't looked at what the installer terraform is doing, but I'd assume we could configure the infrastructure to fix this.

You are correct. Using internal DNS will work for new bootstraps; the request was for an OS solution for upgrades. @abhinavdahiya argued strongly against using it as the solution.

Copy link
Contributor

@yuqi-zhang yuqi-zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is GCP specific I think it can go into /common/gcp.

I'm also not familiar enough with cluster-level routing so I'd suggest we check with someone who knows GCP well enough to ensure this will work

@cgwalters
Copy link
Member

The hostname is still resolvable within the search path which is properly set in /etc/resolv.conf

If that's the case, then GCP must be using the same truncation algorithm.

You are correct. Using internal DNS will work for new bootstraps; the request was for an OS solution for upgrades.

Ahhh...yes, that is an issue.

OK to be clear, I am totally 👍 on this if we've tested it to work (did you?). And I guess that's the tricky aspect; we'd need to roll a new release image with this PR in it, and then try to create a cluster with openshift-install with a cluster name that provokes name length issues.

Or maybe it'd be possible to test this with an existing GCP cluster by editing the DNS setup to trigger it?

@darkmuggle
Copy link
Contributor Author

I tested this outside the scope of a cluster by:

  • creating 90 character FQDN
  • dropping the file in manually
  • rebooting

Before the new dispatcher, I would get "localhost" as the hostname with a resolv.conf containing the search path and dns resolvers. After this new dispatcher, the hostname was set to the truncated named the same resolv.conf.

👍 to the testing path.

@abhinavdahiya
Copy link
Contributor

Few things:

  1. Does it make sense to add this to all platforms or just GCP? Because currently it does it for all.

  2. Is this workaround, something we should be configure in MCO or should this be in rhcos

Copy link
Member

@ashcrow ashcrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good but followed on to @yuqi-zhang's comment with a possible extra safeguard.

@yuqi-zhang
Copy link
Contributor

Does it make sense to add this to all platforms or just GCP? Because currently it does it for all.

Since the issue itself arises from Linux kernel restrictions I it might be fine living under common. That said,

If that's the case, then GCP must be using the same truncation algorithm.

Might not be the case for every platform.

Is this workaround, something we should be configure in MCO or should this be in rhcos

Today we have this: https://github.com/coreos/afterburn/blob/master/dracut/30afterburn/afterburn-hostname.service

Maybe that would be a better place for this to live?

@cgwalters
Copy link
Member

Might not be the case for every platform.

Right, we can only truncate the names if the DNS server is doing the same thing. And I'm sure that's not the default.

Today we have this: https://github.com/coreos/afterburn/blob/master/dracut/30afterburn/afterburn-hostname.service

Hmm...I think that's only used when we need to use an out-of-band (non-DHCP) mechanism to get the hostname.

@darkmuggle
Copy link
Contributor Author

darkmuggle commented May 4, 2020

I went back and forth on where to carry this I ultimately chose the MCO for a couple of reasons:

  • changing this NM is way harder and has a wider blast surface
  • providing this via the MCO means that we don't have the FCOS/RHCOS delta problem
  • if/when we get a proper "fix" elsewhere this is much easier to drop
  • Afterburn doesn't handle the discovered hostname which is where this is coming from.

The bigger issue that OCP is relying on implicit hostnames via the PTR records when nothing else sets the hostname. But per [1] we have a collision:

  • we MUST handle hostnames up to 63 characters
  • we SHOULD handle hostnames up to 255 characters.

The kernel and Systemd are compliant with the "must" requirement, but not with the "should". The issue is how NM is setting the hostname by not ensuring that the components under it handle the assignment. Since NM is using PTR records to discover the hostname (I'm not sure if this behavior is covered by a spec) we end up in this gray area.

IMO we'll need this fix for pre-4.6 OCP and until we get proper NM/Systemd/Kernel support for the "should" part of the RFC.

Finally, the reason I didn't scope the patch at GCP specifically, is that this condition can happen generically and is most likely to happen on bare-metal: just use the PTR record hostname with a character count more than 63.

[1] https://tools.ietf.org/html/rfc1123#page-13

@darkmuggle
Copy link
Contributor Author

darkmuggle commented May 4, 2020

Few things:

  1. Does it make sense to add this to all platforms or just GCP? Because currently it does it for all.

Taxonomically speaking, this condition is most often encountered on GCP, but it NOT unique to GCP. Any environment where the host is using PTR record hostname discovery can result in this situation (such as UPI).

I seriously considered putting within the scope of GCP, but per RFC 1123, anything less than 255 is valid, and software is NOT required to handle it. Fixing this just for GCP seems a bit short-sighted.

  1. Is this workaround, something we should be configure in MCO or should this be in rhcos

Ultimately, I would like to see the upstreams of RHCOS (FCOS, NetworkManager or Systemd) come up with a proper fix, but given the blast radius and release times, its unlikely to happen soon. This fixes the immediate pain in a way that can be reversed quickly without baking it into the bootimages.

@abhinavdahiya
Copy link
Contributor

@darkmuggle
#1711 (comment)
and #1711 (comment) answer all my concerns from #1711 (comment)

Thanks! i'm fine with the change. I can try and test this longer hostnames locally to verify.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 4, 2020
@kikisdeliveryservice kikisdeliveryservice changed the title templates: add etc-networkmanager-dispatcher.d-90-long-hostname.yaml Bug 1809345: templates: add etc-networkmanager-dispatcher.d-90-long-hostname.yaml May 4, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. label May 4, 2020
@openshift-ci-robot
Copy link
Contributor

@darkmuggle: This pull request references Bugzilla bug 1809345, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1809345: templates: add etc-networkmanager-dispatcher.d-90-long-hostname.yaml

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label May 4, 2020
@kikisdeliveryservice
Copy link
Contributor

/bugzilla refresh

@openshift-ci-robot
Copy link
Contributor

@kikisdeliveryservice: This pull request references Bugzilla bug 1809345, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abhinavdahiya
Copy link
Contributor

Here's a test that is testing the long names using cluster-bot

/msg @cluster-bot launch openshift/installer#3544,openshift/machine-config-operator#1711 gcp
/msg @cluster-bot test e2e openshift/installer#3544,openshift/machine-config-operator#1711 gcp

here are jobs currently running https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/703 and https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/704

these should help figure out if the change is going to work..

@abhinavdahiya
Copy link
Contributor

both the CI jobs
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/704
and
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/703

failed to complete installation because the works did not join the clusters, so maybe this fix is not yet working...

@abhinavdahiya
Copy link
Contributor

abhinavdahiya commented May 4, 2020

Okay i think I know what's up, the CSRs for the worker nodes are not approved because the machine-api doesn't add the vm name to the InternalDNS, therefore it doesn't get approved by the in-cluster-approver.

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/703/artifacts/launch/csr.json

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/703/artifacts/launch/pods/openshift-cluster-machine-approver_machine-approver-68c87cd9c9-nxgss_machine-approver-controller.log

I0504 22:34:29.994328       1 main.go:147] CSR csr-thcx7 added
I0504 22:34:30.008009       1 main.go:182] CSR csr-thcx7 not authorized: failed to find machine for node ci-ln-g2vb4fb-f76d1-mst78-w-c-h7vsw
I0504 22:34:30.008044       1 main.go:218] Error syncing csr csr-thcx7: failed to find machine for node ci-ln-g2vb4fb-f76d1-mst78-w-c-h7vsw

@openshift-ci-robot
Copy link
Contributor

@darkmuggle: The /retest command does not accept any targets.
The following commands are available to trigger jobs:

  • /test e2e-aws
  • /test e2e-aws-disruptive
  • /test e2e-aws-scaleup-rhel7
  • /test e2e-gcp-op
  • /test e2e-gcp-upgrade
  • /test e2e-metal-ipi
  • /test e2e-openstack
  • /test e2e-ovirt
  • /test e2e-vsphere
  • /test images
  • /test unit
  • /test verify

Use /test all to run all jobs.

In response to this:

/retest ci/prow/e2e-metal-ipi

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kikisdeliveryservice
Copy link
Contributor

/retest ci/prow/e2e-metal-ipi

just fyi: the e2e-metal-ipi seems kind of flaky rn across prs (and not required) unless you think this failure is directly related to your PR..

/test e2e-metal-ipi

@abhinavdahiya
Copy link
Contributor

@NautiluX
Copy link

NautiluX commented May 5, 2020

Nice! Thanks for following up on this issue.

On GCP it is not uncommon to have DNS names that are too long.
Linux restricts the kernel hostname to being less than 63 characters.
The new template simply ensures that a hostname is set in the event that
NetworkManager is unable to do so.

Bug: 1809345
Signed-off-by: Ben Howard <ben.howard@redhat.com>

default_host="${DHCP4_HOST_NAME:-$DHCP6_HOST_NAME}"
# truncate the hostname to the first dot and than 64 characters.
host=$(echo ${default_host} | cut -f1 -d'.' | cut -c -63)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ echo "averylonghostnamewhichshouldprobablybelessbecausereallyitsprettycrazy.to.have.such.long.names" | cut -f1 -d'.' | cut -c -63
averylonghostnamewhichshouldprobablybelessbecausereallyitsprett
$ echo -n "averylonghostnamewhichshouldprobablybelessbecausereallyitsprettycrazy.to.have.such.long.names" | cut -f1 -d'.' | cut -c -63 | wc -c
64

Enforces 64 char limit. However, the next check on default host is >63. Should this check really be limiting to 63 instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than this, LGTM!

Copy link
Contributor Author

@darkmuggle darkmuggle May 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha! I did the same thing:

$ echo hi | cut -c -2 | wc -c
3

wc -c counts from 1

It warms my heart that we both thought of and ran the same test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh 👍

@ashcrow
Copy link
Member

ashcrow commented May 5, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 5, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, cgwalters, darkmuggle

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 07246eb into openshift:master May 5, 2020
@openshift-ci-robot
Copy link
Contributor

@darkmuggle: Some pull requests linked via external trackers have merged: . The following pull requests linked via external trackers have not merged:

In response to this:

Bug 1809345: templates: add etc-networkmanager-dispatcher.d-90-long-hostname.yaml

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abhinavdahiya
Copy link
Contributor

/cherry-pick release-4.4
/cherry-pick release-4.3

@openshift-cherrypick-robot

@abhinavdahiya: new pull request created: #1737

In response to this:

/cherry-pick release-4.4
/cherry-pick release-4.3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abhinavdahiya
Copy link
Contributor

/cherry-pick release-4.3

@openshift-cherrypick-robot

@abhinavdahiya: new pull request created: #1738

In response to this:

/cherry-pick release-4.3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

abhinavdahiya added a commit to abhinavdahiya/installer that referenced this pull request May 19, 2020
The infra id of the clusters on GCP was reduced to 12 in openshift#2088 because we couldn't handle the hostname seen by rhcos machine to be greater than 64.
More details on this are available in https://bugzilla.redhat.com/show_bug.cgi?id=1809345

now since BZ 1809345 is fixed by openshift/machine-config-operator#1711 and openshift/cluster-api-provider-gcp#88 the installer can relax the restriction on the infra-id to match the other platforms.

Why is it important?

On GCP all resources are prefixed with infra-id, which currently is 12 chars with 6 chars used by random bit, leaving only 6 chars from cluster name. This causes trouble associating the cluster to jobs in CI as most of the identifyable characters are dropped from the resource names in CI due to this restriction.

Also because of the previous restriction, only one char are used from pool's name, making is higly likely to collide in cases there are more.
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/installer that referenced this pull request May 22, 2020
The infra id of the clusters on GCP was reduced to 12 in openshift#2088 because we couldn't handle the hostname seen by rhcos machine to be greater than 64.
More details on this are available in https://bugzilla.redhat.com/show_bug.cgi?id=1809345

now since BZ 1809345 is fixed by openshift/machine-config-operator#1711 and openshift/cluster-api-provider-gcp#88 the installer can relax the restriction on the infra-id to match the other platforms.

Why is it important?

On GCP all resources are prefixed with infra-id, which currently is 12 chars with 6 chars used by random bit, leaving only 6 chars from cluster name. This causes trouble associating the cluster to jobs in CI as most of the identifyable characters are dropped from the resource names in CI due to this restriction.

Also because of the previous restriction, only one char are used from pool's name, making is higly likely to collide in cases there are more.
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/installer that referenced this pull request May 22, 2020
The infra id of the clusters on GCP was reduced to 12 in openshift#2088 because we couldn't handle the hostname seen by rhcos machine to be greater than 64.
More details on this are available in https://bugzilla.redhat.com/show_bug.cgi?id=1809345

now since BZ 1809345 is fixed by openshift/machine-config-operator#1711 and openshift/cluster-api-provider-gcp#88 the installer can relax the restriction on the infra-id to match the other platforms.

Why is it important?

On GCP all resources are prefixed with infra-id, which currently is 12 chars with 6 chars used by random bit, leaving only 6 chars from cluster name. This causes trouble associating the cluster to jobs in CI as most of the identifyable characters are dropped from the resource names in CI due to this restriction.

Also because of the previous restriction, only one char are used from pool's name, making is higly likely to collide in cases there are more.
jstuever added a commit to jstuever/openshift-installer that referenced this pull request Jun 3, 2020
This mirrors changes to GCP IPI in openshift#3544

The infra id of the clusters on GCP was reduced to 12 in openshift#2088 because
we couldn't handle the hostname seen by rhcos machine to be greater than
64.
More details on this are available in
https://bugzilla.redhat.com/show_bug.cgi?id=1809345

now since BZ 1809345 is fixed by openshift/machine-config-operator#1711
and openshift/cluster-api-provider-gcp#88 the installer can relax the
restriction on the infra-id to match the other platforms.

Why is it important?

On GCP all resources are prefixed with infra-id, which currently is 12
chars with 6 chars used by random bit, leaving only 6 chars from cluster
name. This causes trouble associating the cluster to jobs in CI as most
of the identifyable characters are dropped from the resource names in CI
due to this restriction.

Also because of the previous restriction, only one char are used from
pool's name, making is higly likely to collide in cases there are more.
patrickdillon pushed a commit to patrickdillon/installer that referenced this pull request Jun 17, 2020
This mirrors changes to GCP IPI in openshift#3544

The infra id of the clusters on GCP was reduced to 12 in openshift#2088 because
we couldn't handle the hostname seen by rhcos machine to be greater than
64.
More details on this are available in
https://bugzilla.redhat.com/show_bug.cgi?id=1809345

now since BZ 1809345 is fixed by openshift/machine-config-operator#1711
and openshift/cluster-api-provider-gcp#88 the installer can relax the
restriction on the infra-id to match the other platforms.

Why is it important?

On GCP all resources are prefixed with infra-id, which currently is 12
chars with 6 chars used by random bit, leaving only 6 chars from cluster
name. This causes trouble associating the cluster to jobs in CI as most
of the identifyable characters are dropped from the resource names in CI
due to this restriction.

Also because of the previous restriction, only one char are used from
pool's name, making is higly likely to collide in cases there are more.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet