OCPBUGS-11997: Prevent NM from unsetting the hostname #3794

mkowalski · 2023-07-12T10:08:02Z

It has been discovered that in environments that use revDNS for setting the hostname instead of DHCP Option 12 it is possible to intermittently lose it and try to register as localhost.localdomain.

It happens in a scenario when we obtain the hostname from the DNS server (PTR record for the IP address of the node), later we lose DNS for any reason and during this time we reload the NM configuration. Because of its design, if the revDNS is not available anymore instead of persisting the previous name, NetworkManger will unset it completely leaving the node visible as localhost.localdomain.

In order to prevent this behaviour we are modifying node-valid-hostname.service so that instead of only waiting for the non-localhost name to appear once (without detecting if it can intermittently disappear in the future) it also sets it using hostnamectl set-hostname so that NetworkManager will not try unsetting it even if no valid form is available anymore.

The change is valid as a platform-agnostic change because we never allow to change the hostname during lifetime of the node. Thus, a scenario where we would like NetworkManager to update the hostname based on a changed revDNS record is not a valid one.

This change is valid for DHCP Option 12 and revDNS-based hostnames because changing hostname via updating Option 12 field is not allowed as well.

Fixes: OCPBUGS-11997

openshift-ci-robot · 2023-07-12T10:08:04Z

@mkowalski: An error was encountered searching for bug OCPBUGS-11997 on the Jira server at https://issues.redhat.com/. No known errors were detected, please see the full error message for details.

Full error message.


You do not have the permission to see the specified issue.: request failed. Please analyze the request body for more details. Status code: 401:

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.

In response to this:

It has been discovered that in environments that use revDNS for setting the hostname instead of DHCP Option 12 it is possible to intermittently lose it and try to register as localhost.localdomain.

It happens in a scenario when we obtain the hostname from the DNS server (PTR record for the IP address of the node), later we lose DNS for any reason and during this time we reload the NM configuration. Because of its design, if the revDNS is not available anymore instead of persisting the previous name, NetworkManger will unset it completely leaving the node visible as localhost.localdomain.

In order to prevent this behaviour we are modifying node-valid-hostname.service so that instead of only waiting for the non-localhost name to appear once (without detecting if it can intermittently disappear in the future) it also sets it using hostnamectl set-hostname so that NetworkManager will not try unsetting it even if no valid form is available anymore.

The change is valid as a platform-agnostic change because we never allow to change the hostname during lifetime of the node. Thus, a scenario where we would like NetworkManager to update the hostname based on a changed revDNS record is not a valid one.

This change is valid for DHCP Option 12 and revDNS-based hostnames because changing hostname via updating Option 12 field is not allowed as well.

Fixes: OCPBUGS-11997

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2023-07-12T10:08:07Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

mkowalski · 2023-07-12T10:09:02Z

/jira refresh

openshift-ci-robot · 2023-07-12T10:09:08Z

@mkowalski: This pull request references Jira Issue OCPBUGS-11997, which is invalid:

expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mkowalski · 2023-07-12T10:09:42Z

/jira refresh

openshift-ci-robot · 2023-07-12T10:09:49Z

@mkowalski: This pull request references Jira Issue OCPBUGS-11997, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.14.0) matches configured target version for branch (4.14.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @mike-nguyen

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mkowalski · 2023-07-12T10:14:24Z

/test all

mkowalski · 2023-07-12T12:15:27Z

Manual test for IPv4-only cluster succeeded

sh-5.1# systemctl status node-valid-hostname.service
● node-valid-hostname.service - Wait for a non-localhost hostname
     Loaded: loaded (/etc/systemd/system/node-valid-hostname.service; enabled; preset: disabled)
     Active: active (exited) since Wed 2023-07-12 11:28:00 UTC; 35min ago
   Main PID: 3755 (code=exited, status=0/SUCCESS)
        CPU: 8ms

Jul 12 11:28:00 master-0 systemd[1]: Starting Wait for a non-localhost hostname...
Jul 12 11:28:00 master-0 mco-hostname[3755]: waiting for non-localhost hostname to be assigned
Jul 12 11:28:00 master-0 mco-hostname[3755]: node identified as master-0
Jul 12 11:28:00 master-0 mco-hostname[3755]: saving hostname to prevent NetworkManager from ever unsetting it
Jul 12 11:28:00 master-0 systemd[1]: Finished Wait for a non-localhost hostname.
sh-5.1# hostname
master-0
sh-5.1# hostname -f
master-0.ostest.test.metalkube.org

# oc get nodes
NAME       STATUS   ROLES                         AGE   VERSION
master-0   Ready    control-plane,master,worker   36m   v1.27.3+af29f64
master-1   Ready    control-plane,master,worker   36m   v1.27.3+af29f64
master-2   Ready    control-plane,master,worker   37m   v1.27.3+af29f64

mkowalski · 2023-07-12T13:30:37Z

Manual test for IPv6-only good

sh-5.1# systemctl status node-valid-hostname.service
● node-valid-hostname.service - Wait for a non-localhost hostname
     Loaded: loaded (/etc/systemd/system/node-valid-hostname.service; enabled; preset: disabled)
     Active: active (exited) since Wed 2023-07-12 12:32:40 UTC; 57min ago
   Main PID: 3655 (code=exited, status=0/SUCCESS)
        CPU: 9ms

Jul 12 12:32:39 master-0.ostest.test.metalkube.org systemd[1]: Starting Wait for a non-localhost hostname...
Jul 12 12:32:39 master-0.ostest.test.metalkube.org mco-hostname[3655]: waiting for non-localhost hostname to be assigned
Jul 12 12:32:39 master-0.ostest.test.metalkube.org mco-hostname[3655]: node identified as master-0.ostest.test.metalkube.org
Jul 12 12:32:39 master-0.ostest.test.metalkube.org mco-hostname[3655]: saving hostname to prevent NetworkManager from ever unsetting it
Jul 12 12:32:40 master-0.ostest.test.metalkube.org systemd[1]: Finished Wait for a non-localhost hostname.
sh-5.1# hostname
master-0.ostest.test.metalkube.org
sh-5.1# hostname -f
master-0.ostest.test.metalkube.org

# oc get nodes
NAME                                 STATUS   ROLES                         AGE   VERSION
master-0.ostest.test.metalkube.org   Ready    control-plane,master,worker   56m   v1.27.3+af29f64
master-1.ostest.test.metalkube.org   Ready    control-plane,master,worker   55m   v1.27.3+af29f64
master-2.ostest.test.metalkube.org   Ready    control-plane,master,worker   55m   v1.27.3+af29f64

It has been discovered that in environments that use revDNS for setting the hostname instead of DHCP Option 12 it is possible to intermittently lose it and try to register as `localhost.localdomain`. It happens in a scenario when we obtain the hostname from the DNS server (PTR record for the IP address of the node), later we lose DNS for any reason and during this time we reload the NM configuration. Because of its design, if the revDNS is not available anymore instead of persisting the previous name, NetworkManger will unset it completely leaving the node visible as `localhost.localdomain`. In order to prevent this behaviour we are modifying `node-valid-hostname.service` so that instead of only waiting for the non-localhost name to appear once (without detecting if it can intermittently disappear in the future) it also sets it using `hostnamectl set-hostname` so that NetworkManager will not try unsetting it even if no valid form is available anymore. The change is valid as a platform-agnostic change because we never allow to change the hostname during lifetime of the node. Thus, a scenario where we would like NetworkManager to update the hostname based on a changed revDNS record is not a valid one. This change is valid for DHCP Option 12 and revDNS-based hostnames because changing hostname via updating Option 12 field is not allowed as well. Fixes: OCPBUGS-11997

mkowalski · 2023-07-12T15:56:39Z

/retitle OCPBUGS-11997, OCPBUGS-14692, OCPBUGS-14918: Prevent NM from unsetting the hostname

openshift-ci-robot · 2023-07-12T15:56:56Z

@mkowalski: This pull request references Jira Issue OCPBUGS-14918, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.14.0) matches configured target version for branch (4.14.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

It has been discovered that in environments that use revDNS for setting the hostname instead of DHCP Option 12 it is possible to intermittently lose it and try to register as localhost.localdomain.

It happens in a scenario when we obtain the hostname from the DNS server (PTR record for the IP address of the node), later we lose DNS for any reason and during this time we reload the NM configuration. Because of its design, if the revDNS is not available anymore instead of persisting the previous name, NetworkManger will unset it completely leaving the node visible as localhost.localdomain.

In order to prevent this behaviour we are modifying node-valid-hostname.service so that instead of only waiting for the non-localhost name to appear once (without detecting if it can intermittently disappear in the future) it also sets it using hostnamectl set-hostname so that NetworkManager will not try unsetting it even if no valid form is available anymore.

The change is valid as a platform-agnostic change because we never allow to change the hostname during lifetime of the node. Thus, a scenario where we would like NetworkManager to update the hostname based on a changed revDNS record is not a valid one.

This change is valid for DHCP Option 12 and revDNS-based hostnames because changing hostname via updating Option 12 field is not allowed as well.

Fixes: OCPBUGS-11997

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mkowalski · 2023-07-12T15:57:21Z

/jira refresh

openshift-ci-robot · 2023-07-12T15:57:24Z

@mkowalski: This pull request references Jira Issue OCPBUGS-14918, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.14.0) matches configured target version for branch (4.14.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mkowalski · 2023-07-13T08:26:22Z

/retitle OCPBUGS-11997: Prevent NM from unsetting the hostname

openshift-ci-robot · 2023-07-13T08:26:30Z

@mkowalski: This pull request references Jira Issue OCPBUGS-11997, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.14.0) matches configured target version for branch (4.14.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @mike-nguyen

In response to this:

It has been discovered that in environments that use revDNS for setting the hostname instead of DHCP Option 12 it is possible to intermittently lose it and try to register as localhost.localdomain.

It happens in a scenario when we obtain the hostname from the DNS server (PTR record for the IP address of the node), later we lose DNS for any reason and during this time we reload the NM configuration. Because of its design, if the revDNS is not available anymore instead of persisting the previous name, NetworkManger will unset it completely leaving the node visible as localhost.localdomain.

In order to prevent this behaviour we are modifying node-valid-hostname.service so that instead of only waiting for the non-localhost name to appear once (without detecting if it can intermittently disappear in the future) it also sets it using hostnamectl set-hostname so that NetworkManager will not try unsetting it even if no valid form is available anymore.

The change is valid as a platform-agnostic change because we never allow to change the hostname during lifetime of the node. Thus, a scenario where we would like NetworkManager to update the hostname based on a changed revDNS record is not a valid one.

This change is valid for DHCP Option 12 and revDNS-based hostnames because changing hostname via updating Option 12 field is not allowed as well.

Fixes: OCPBUGS-11997

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

cybertron · 2023-07-13T15:29:55Z

/test e2e-metal-ipi-ovn-dualstack
/lgtm

jkyros · 2023-07-14T03:49:20Z

this makes sense, hopefully CI will clear up
/approve

openshift-ci · 2023-07-14T03:49:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cybertron, jkyros, mkowalski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jkyros]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2023-07-14T04:23:17Z

/retest-required

Remaining retests: 0 against base HEAD 2069e55 and 2 for PR HEAD 5ec8f1b in total

openshift-ci · 2023-07-14T06:49:39Z

@mkowalski: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/okd-scos-e2e-aws-ovn	`5ec8f1b`	link	false	`/test okd-scos-e2e-aws-ovn`
ci/prow/e2e-metal-ipi-ovn-dualstack	`5ec8f1b`	link	false	`/test e2e-metal-ipi-ovn-dualstack`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci-robot · 2023-07-14T06:55:09Z

@mkowalski: Jira Issue OCPBUGS-11997: Some pull requests linked via external trackers have merged:

openshift/machine-config-operator#3794

The following pull requests linked via external trackers have not merged:

openshift/machine-config-operator#3801 is open

These pull request must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-11997 has not been moved to the MODIFIED state.

In response to this:

It has been discovered that in environments that use revDNS for setting the hostname instead of DHCP Option 12 it is possible to intermittently lose it and try to register as localhost.localdomain.

It happens in a scenario when we obtain the hostname from the DNS server (PTR record for the IP address of the node), later we lose DNS for any reason and during this time we reload the NM configuration. Because of its design, if the revDNS is not available anymore instead of persisting the previous name, NetworkManger will unset it completely leaving the node visible as localhost.localdomain.

In order to prevent this behaviour we are modifying node-valid-hostname.service so that instead of only waiting for the non-localhost name to appear once (without detecting if it can intermittently disappear in the future) it also sets it using hostnamectl set-hostname so that NetworkManager will not try unsetting it even if no valid form is available anymore.

The change is valid as a platform-agnostic change because we never allow to change the hostname during lifetime of the node. Thus, a scenario where we would like NetworkManager to update the hostname based on a changed revDNS record is not a valid one.

This change is valid for DHCP Option 12 and revDNS-based hostnames because changing hostname via updating Option 12 field is not allowed as well.

Fixes: OCPBUGS-11997

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mkowalski · 2023-07-14T07:37:55Z

/cherry-pick release-4.13

openshift-cherrypick-robot · 2023-07-14T07:38:41Z

@mkowalski: new pull request created: #3805

In response to this:

/cherry-pick release-4.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

cgwalters · 2023-07-14T15:00:17Z

I know I'm late but I still think we should strongly encourage customers that were implicitly relying on rDNS to stop doing so.

Instead there are two possibilities that just make everything more reliable:

Fix the DNS server to give a hostname
Inject the hostname statically via Ignition

cgwalters · 2023-07-14T17:25:58Z

Hmm...in this wait loop, what is it that will end up causing the rDNS lookup to succeed? IOW when will NetworkManager query rDNS in this case? Is it only when the DHCP lease expires?

In environments with ipv6 (both single and dual stack) we explicitly set the hostname to the FQDN. However, the recent change to make the current hostname static in the node-valid-hostname service[0] has created a conflict with the resolv-prepender script where we force the FQDN. If node-valid-hostname runs before the "up" event for the interface which provides the hostname, it's possible for it to statically set a short name, which prevents the prepender script from later changing it to an FQDN. This can happen because the up event is asynchronous. The network is marked up before the event is dispatched, and as a result we currently have a race between the event and the node-valid-hostname service. To ensure proper ordering, this change adds a new dependency to node-valid-hostname which is a service that simply wait for the br-ex up event. Once we've received that, we should have run the prepender script and set the FQDN if necessary. 0: openshift#3794

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 12, 2023

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jul 12, 2023

openshift-ci bot requested a review from mike-nguyen July 12, 2023 10:10

mkowalski marked this pull request as ready for review July 12, 2023 13:30

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 12, 2023

openshift-ci bot requested review from cdoern and yuqi-zhang July 12, 2023 13:31

mkowalski force-pushed the OCPBUGS-11997 branch from cc72c6e to 5ec8f1b Compare July 12, 2023 14:47

openshift-ci bot changed the title ~~OCPBUGS-11997: Prevent NM from unsetting the hostname~~ OCPBUGS-11997, OCPBUGS-14692, OCPBUGS-14918: Prevent NM from unsetting the hostname Jul 12, 2023

openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. and removed jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. labels Jul 12, 2023

openshift-ci bot changed the title ~~OCPBUGS-11997, OCPBUGS-14692, OCPBUGS-14918: Prevent NM from unsetting the hostname~~ OCPBUGS-11997: Prevent NM from unsetting the hostname Jul 13, 2023

openshift-ci-robot removed the jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. label Jul 13, 2023

openshift-ci-robot added the jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. label Jul 13, 2023

openshift-ci bot assigned cybertron Jul 13, 2023

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 13, 2023

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 14, 2023

openshift-merge-robot merged commit d0c0670 into openshift:master Jul 14, 2023
12 of 14 checks passed

mkowalski deleted the OCPBUGS-11997 branch July 14, 2023 07:37

openshift-cherrypick-robot mentioned this pull request Jul 14, 2023

[release-4.13] OCPBUGS-16218: Prevent NM from unsetting the hostname #3805

Merged

cybertron mentioned this pull request Nov 10, 2023

OCPBUGS-22324: Wait for br-ex up event before node-valid-hostname #4028

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OCPBUGS-11997: Prevent NM from unsetting the hostname #3794

OCPBUGS-11997: Prevent NM from unsetting the hostname #3794

mkowalski commented Jul 12, 2023

openshift-ci-robot commented Jul 12, 2023

openshift-ci bot commented Jul 12, 2023

mkowalski commented Jul 12, 2023

openshift-ci-robot commented Jul 12, 2023

mkowalski commented Jul 12, 2023

openshift-ci-robot commented Jul 12, 2023

mkowalski commented Jul 12, 2023

mkowalski commented Jul 12, 2023

mkowalski commented Jul 12, 2023

mkowalski commented Jul 12, 2023

openshift-ci-robot commented Jul 12, 2023

mkowalski commented Jul 12, 2023

openshift-ci-robot commented Jul 12, 2023

mkowalski commented Jul 13, 2023

openshift-ci-robot commented Jul 13, 2023

cybertron commented Jul 13, 2023

jkyros commented Jul 14, 2023

openshift-ci bot commented Jul 14, 2023

openshift-ci-robot commented Jul 14, 2023

openshift-ci bot commented Jul 14, 2023

openshift-ci-robot commented Jul 14, 2023

mkowalski commented Jul 14, 2023

openshift-cherrypick-robot commented Jul 14, 2023

cgwalters commented Jul 14, 2023

cgwalters commented Jul 14, 2023

OCPBUGS-11997: Prevent NM from unsetting the hostname #3794

OCPBUGS-11997: Prevent NM from unsetting the hostname #3794

Conversation

mkowalski commented Jul 12, 2023

openshift-ci-robot commented Jul 12, 2023

openshift-ci bot commented Jul 12, 2023

mkowalski commented Jul 12, 2023

openshift-ci-robot commented Jul 12, 2023

mkowalski commented Jul 12, 2023

openshift-ci-robot commented Jul 12, 2023

mkowalski commented Jul 12, 2023

mkowalski commented Jul 12, 2023

mkowalski commented Jul 12, 2023

mkowalski commented Jul 12, 2023

openshift-ci-robot commented Jul 12, 2023

mkowalski commented Jul 12, 2023

openshift-ci-robot commented Jul 12, 2023

mkowalski commented Jul 13, 2023

openshift-ci-robot commented Jul 13, 2023

cybertron commented Jul 13, 2023

jkyros commented Jul 14, 2023

openshift-ci bot commented Jul 14, 2023

openshift-ci-robot commented Jul 14, 2023

openshift-ci bot commented Jul 14, 2023

openshift-ci-robot commented Jul 14, 2023

mkowalski commented Jul 14, 2023

openshift-cherrypick-robot commented Jul 14, 2023

cgwalters commented Jul 14, 2023

cgwalters commented Jul 14, 2023