Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-11997: Prevent NM from unsetting the hostname #3794

Merged
merged 1 commit into from Jul 14, 2023

Conversation

mkowalski
Copy link
Contributor

It has been discovered that in environments that use revDNS for setting the hostname instead of DHCP Option 12 it is possible to intermittently lose it and try to register as localhost.localdomain.

It happens in a scenario when we obtain the hostname from the DNS server (PTR record for the IP address of the node), later we lose DNS for any reason and during this time we reload the NM configuration. Because of its design, if the revDNS is not available anymore instead of persisting the previous name, NetworkManger will unset it completely leaving the node visible as localhost.localdomain.

In order to prevent this behaviour we are modifying node-valid-hostname.service so that instead of only waiting for the non-localhost name to appear once (without detecting if it can intermittently disappear in the future) it also sets it using hostnamectl set-hostname so that NetworkManager will not try unsetting it even if no valid form is available anymore.

The change is valid as a platform-agnostic change because we never allow to change the hostname during lifetime of the node. Thus, a scenario where we would like NetworkManager to update the hostname based on a changed revDNS record is not a valid one.

This change is valid for DHCP Option 12 and revDNS-based hostnames because changing hostname via updating Option 12 field is not allowed as well.

Fixes: OCPBUGS-11997

@openshift-ci-robot
Copy link
Contributor

@mkowalski: An error was encountered searching for bug OCPBUGS-11997 on the Jira server at https://issues.redhat.com/. No known errors were detected, please see the full error message for details.

Full error message. You do not have the permission to see the specified issue.: request failed. Please analyze the request body for more details. Status code: 401:

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.

In response to this:

It has been discovered that in environments that use revDNS for setting the hostname instead of DHCP Option 12 it is possible to intermittently lose it and try to register as localhost.localdomain.

It happens in a scenario when we obtain the hostname from the DNS server (PTR record for the IP address of the node), later we lose DNS for any reason and during this time we reload the NM configuration. Because of its design, if the revDNS is not available anymore instead of persisting the previous name, NetworkManger will unset it completely leaving the node visible as localhost.localdomain.

In order to prevent this behaviour we are modifying node-valid-hostname.service so that instead of only waiting for the non-localhost name to appear once (without detecting if it can intermittently disappear in the future) it also sets it using hostnamectl set-hostname so that NetworkManager will not try unsetting it even if no valid form is available anymore.

The change is valid as a platform-agnostic change because we never allow to change the hostname during lifetime of the node. Thus, a scenario where we would like NetworkManager to update the hostname based on a changed revDNS record is not a valid one.

This change is valid for DHCP Option 12 and revDNS-based hostnames because changing hostname via updating Option 12 field is not allowed as well.

Fixes: OCPBUGS-11997

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 12, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 12, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@mkowalski
Copy link
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jul 12, 2023
@openshift-ci-robot
Copy link
Contributor

@mkowalski: This pull request references Jira Issue OCPBUGS-11997, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mkowalski
Copy link
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jul 12, 2023
@openshift-ci-robot
Copy link
Contributor

@mkowalski: This pull request references Jira Issue OCPBUGS-11997, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @mike-nguyen

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from mike-nguyen July 12, 2023 10:10
@mkowalski
Copy link
Contributor Author

/test all

@mkowalski
Copy link
Contributor Author

Manual test for IPv4-only cluster succeeded

sh-5.1# systemctl status node-valid-hostname.service
● node-valid-hostname.service - Wait for a non-localhost hostname
     Loaded: loaded (/etc/systemd/system/node-valid-hostname.service; enabled; preset: disabled)
     Active: active (exited) since Wed 2023-07-12 11:28:00 UTC; 35min ago
   Main PID: 3755 (code=exited, status=0/SUCCESS)
        CPU: 8ms

Jul 12 11:28:00 master-0 systemd[1]: Starting Wait for a non-localhost hostname...
Jul 12 11:28:00 master-0 mco-hostname[3755]: waiting for non-localhost hostname to be assigned
Jul 12 11:28:00 master-0 mco-hostname[3755]: node identified as master-0
Jul 12 11:28:00 master-0 mco-hostname[3755]: saving hostname to prevent NetworkManager from ever unsetting it
Jul 12 11:28:00 master-0 systemd[1]: Finished Wait for a non-localhost hostname.
sh-5.1# hostname
master-0
sh-5.1# hostname -f
master-0.ostest.test.metalkube.org
# oc get nodes
NAME       STATUS   ROLES                         AGE   VERSION
master-0   Ready    control-plane,master,worker   36m   v1.27.3+af29f64
master-1   Ready    control-plane,master,worker   36m   v1.27.3+af29f64
master-2   Ready    control-plane,master,worker   37m   v1.27.3+af29f64

@mkowalski
Copy link
Contributor Author

Manual test for IPv6-only good

sh-5.1# systemctl status node-valid-hostname.service
● node-valid-hostname.service - Wait for a non-localhost hostname
     Loaded: loaded (/etc/systemd/system/node-valid-hostname.service; enabled; preset: disabled)
     Active: active (exited) since Wed 2023-07-12 12:32:40 UTC; 57min ago
   Main PID: 3655 (code=exited, status=0/SUCCESS)
        CPU: 9ms

Jul 12 12:32:39 master-0.ostest.test.metalkube.org systemd[1]: Starting Wait for a non-localhost hostname...
Jul 12 12:32:39 master-0.ostest.test.metalkube.org mco-hostname[3655]: waiting for non-localhost hostname to be assigned
Jul 12 12:32:39 master-0.ostest.test.metalkube.org mco-hostname[3655]: node identified as master-0.ostest.test.metalkube.org
Jul 12 12:32:39 master-0.ostest.test.metalkube.org mco-hostname[3655]: saving hostname to prevent NetworkManager from ever unsetting it
Jul 12 12:32:40 master-0.ostest.test.metalkube.org systemd[1]: Finished Wait for a non-localhost hostname.
sh-5.1# hostname
master-0.ostest.test.metalkube.org
sh-5.1# hostname -f
master-0.ostest.test.metalkube.org
# oc get nodes
NAME                                 STATUS   ROLES                         AGE   VERSION
master-0.ostest.test.metalkube.org   Ready    control-plane,master,worker   56m   v1.27.3+af29f64
master-1.ostest.test.metalkube.org   Ready    control-plane,master,worker   55m   v1.27.3+af29f64
master-2.ostest.test.metalkube.org   Ready    control-plane,master,worker   55m   v1.27.3+af29f64

@mkowalski mkowalski marked this pull request as ready for review July 12, 2023 13:30
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 12, 2023
@openshift-ci openshift-ci bot requested review from cdoern and yuqi-zhang July 12, 2023 13:31
It has been discovered that in environments that use revDNS for setting
the hostname instead of DHCP Option 12 it is possible to intermittently
lose it and try to register as `localhost.localdomain`.

It happens in a scenario when we obtain the hostname from the DNS server
(PTR record for the IP address of the node), later we lose DNS for any
reason and during this time we reload the NM configuration. Because of
its design, if the revDNS is not available anymore instead of persisting
the previous name, NetworkManger will unset it completely leaving the
node visible as `localhost.localdomain`.

In order to prevent this behaviour we are modifying
`node-valid-hostname.service` so that instead of only waiting for the
non-localhost name to appear once (without detecting if it can
intermittently disappear in the future) it also sets it using
`hostnamectl set-hostname` so that NetworkManager will not try unsetting
it even if no valid form is available anymore.

The change is valid as a platform-agnostic change because we never allow
to change the hostname during lifetime of the node. Thus, a scenario
where we would like NetworkManager to update the hostname based on a
changed revDNS record is not a valid one.

This change is valid for DHCP Option 12 and revDNS-based hostnames
because changing hostname via updating Option 12 field is not allowed as
well.

Fixes: OCPBUGS-11997
@mkowalski
Copy link
Contributor Author

/retitle OCPBUGS-11997, OCPBUGS-14692, OCPBUGS-14918: Prevent NM from unsetting the hostname

@openshift-ci openshift-ci bot changed the title OCPBUGS-11997: Prevent NM from unsetting the hostname OCPBUGS-11997, OCPBUGS-14692, OCPBUGS-14918: Prevent NM from unsetting the hostname Jul 12, 2023
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. and removed jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. labels Jul 12, 2023
@openshift-ci-robot
Copy link
Contributor

@mkowalski: This pull request references Jira Issue OCPBUGS-14918, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

It has been discovered that in environments that use revDNS for setting the hostname instead of DHCP Option 12 it is possible to intermittently lose it and try to register as localhost.localdomain.

It happens in a scenario when we obtain the hostname from the DNS server (PTR record for the IP address of the node), later we lose DNS for any reason and during this time we reload the NM configuration. Because of its design, if the revDNS is not available anymore instead of persisting the previous name, NetworkManger will unset it completely leaving the node visible as localhost.localdomain.

In order to prevent this behaviour we are modifying node-valid-hostname.service so that instead of only waiting for the non-localhost name to appear once (without detecting if it can intermittently disappear in the future) it also sets it using hostnamectl set-hostname so that NetworkManager will not try unsetting it even if no valid form is available anymore.

The change is valid as a platform-agnostic change because we never allow to change the hostname during lifetime of the node. Thus, a scenario where we would like NetworkManager to update the hostname based on a changed revDNS record is not a valid one.

This change is valid for DHCP Option 12 and revDNS-based hostnames because changing hostname via updating Option 12 field is not allowed as well.

Fixes: OCPBUGS-11997

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mkowalski
Copy link
Contributor Author

/jira refresh

@openshift-ci-robot
Copy link
Contributor

@mkowalski: This pull request references Jira Issue OCPBUGS-14918, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mkowalski
Copy link
Contributor Author

/retitle OCPBUGS-11997: Prevent NM from unsetting the hostname

@openshift-ci openshift-ci bot changed the title OCPBUGS-11997, OCPBUGS-14692, OCPBUGS-14918: Prevent NM from unsetting the hostname OCPBUGS-11997: Prevent NM from unsetting the hostname Jul 13, 2023
@openshift-ci-robot openshift-ci-robot removed the jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. label Jul 13, 2023
@openshift-ci-robot
Copy link
Contributor

@mkowalski: This pull request references Jira Issue OCPBUGS-11997, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @mike-nguyen

In response to this:

It has been discovered that in environments that use revDNS for setting the hostname instead of DHCP Option 12 it is possible to intermittently lose it and try to register as localhost.localdomain.

It happens in a scenario when we obtain the hostname from the DNS server (PTR record for the IP address of the node), later we lose DNS for any reason and during this time we reload the NM configuration. Because of its design, if the revDNS is not available anymore instead of persisting the previous name, NetworkManger will unset it completely leaving the node visible as localhost.localdomain.

In order to prevent this behaviour we are modifying node-valid-hostname.service so that instead of only waiting for the non-localhost name to appear once (without detecting if it can intermittently disappear in the future) it also sets it using hostnamectl set-hostname so that NetworkManager will not try unsetting it even if no valid form is available anymore.

The change is valid as a platform-agnostic change because we never allow to change the hostname during lifetime of the node. Thus, a scenario where we would like NetworkManager to update the hostname based on a changed revDNS record is not a valid one.

This change is valid for DHCP Option 12 and revDNS-based hostnames because changing hostname via updating Option 12 field is not allowed as well.

Fixes: OCPBUGS-11997

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. label Jul 13, 2023
@cybertron
Copy link
Member

/test e2e-metal-ipi-ovn-dualstack
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 13, 2023
@jkyros
Copy link
Contributor

jkyros commented Jul 14, 2023

this makes sense, hopefully CI will clear up
/approve

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 14, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cybertron, jkyros, mkowalski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 14, 2023
@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 2069e55 and 2 for PR HEAD 5ec8f1b in total

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 14, 2023

@mkowalski: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn 5ec8f1b link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-metal-ipi-ovn-dualstack 5ec8f1b link false /test e2e-metal-ipi-ovn-dualstack

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit d0c0670 into openshift:master Jul 14, 2023
12 of 14 checks passed
@openshift-ci-robot
Copy link
Contributor

@mkowalski: Jira Issue OCPBUGS-11997: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull request must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-11997 has not been moved to the MODIFIED state.

In response to this:

It has been discovered that in environments that use revDNS for setting the hostname instead of DHCP Option 12 it is possible to intermittently lose it and try to register as localhost.localdomain.

It happens in a scenario when we obtain the hostname from the DNS server (PTR record for the IP address of the node), later we lose DNS for any reason and during this time we reload the NM configuration. Because of its design, if the revDNS is not available anymore instead of persisting the previous name, NetworkManger will unset it completely leaving the node visible as localhost.localdomain.

In order to prevent this behaviour we are modifying node-valid-hostname.service so that instead of only waiting for the non-localhost name to appear once (without detecting if it can intermittently disappear in the future) it also sets it using hostnamectl set-hostname so that NetworkManager will not try unsetting it even if no valid form is available anymore.

The change is valid as a platform-agnostic change because we never allow to change the hostname during lifetime of the node. Thus, a scenario where we would like NetworkManager to update the hostname based on a changed revDNS record is not a valid one.

This change is valid for DHCP Option 12 and revDNS-based hostnames because changing hostname via updating Option 12 field is not allowed as well.

Fixes: OCPBUGS-11997

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mkowalski mkowalski deleted the OCPBUGS-11997 branch July 14, 2023 07:37
@mkowalski
Copy link
Contributor Author

/cherry-pick release-4.13

@openshift-cherrypick-robot

@mkowalski: new pull request created: #3805

In response to this:

/cherry-pick release-4.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Copy link
Member

I know I'm late but I still think we should strongly encourage customers that were implicitly relying on rDNS to stop doing so.

Instead there are two possibilities that just make everything more reliable:

  • Fix the DNS server to give a hostname
  • Inject the hostname statically via Ignition

@cgwalters
Copy link
Member

Hmm...in this wait loop, what is it that will end up causing the rDNS lookup to succeed? IOW when will NetworkManager query rDNS in this case? Is it only when the DHCP lease expires?

cybertron added a commit to cybertron/machine-config-operator that referenced this pull request Nov 10, 2023
In environments with ipv6 (both single and dual stack) we explicitly
set the hostname to the FQDN. However, the recent change to make the
current hostname static in the node-valid-hostname service[0] has
created a conflict with the resolv-prepender script where we force
the FQDN. If node-valid-hostname runs before the "up" event for the
interface which provides the hostname, it's possible for it to
statically set a short name, which prevents the prepender script
from later changing it to an FQDN.

This can happen because the up event is asynchronous. The network
is marked up before the event is dispatched, and as a result we
currently have a race between the event and the node-valid-hostname
service. To ensure proper ordering, this change adds a new
dependency to node-valid-hostname which is a service that simply
wait for the br-ex up event. Once we've received that, we should
have run the prepender script and set the FQDN if necessary.

0: openshift#3794
mkowalski pushed a commit to mkowalski/machine-config-operator that referenced this pull request Mar 4, 2024
In environments with ipv6 (both single and dual stack) we explicitly
set the hostname to the FQDN. However, the recent change to make the
current hostname static in the node-valid-hostname service[0] has
created a conflict with the resolv-prepender script where we force
the FQDN. If node-valid-hostname runs before the "up" event for the
interface which provides the hostname, it's possible for it to
statically set a short name, which prevents the prepender script
from later changing it to an FQDN.

This can happen because the up event is asynchronous. The network
is marked up before the event is dispatched, and as a result we
currently have a race between the event and the node-valid-hostname
service. To ensure proper ordering, this change adds a new
dependency to node-valid-hostname which is a service that simply
wait for the br-ex up event. Once we've received that, we should
have run the prepender script and set the FQDN if necessary.

0: openshift#3794
mkowalski pushed a commit to mkowalski/machine-config-operator that referenced this pull request Mar 4, 2024
In environments with ipv6 (both single and dual stack) we explicitly
set the hostname to the FQDN. However, the recent change to make the
current hostname static in the node-valid-hostname service[0] has
created a conflict with the resolv-prepender script where we force
the FQDN. If node-valid-hostname runs before the "up" event for the
interface which provides the hostname, it's possible for it to
statically set a short name, which prevents the prepender script
from later changing it to an FQDN.

This can happen because the up event is asynchronous. The network
is marked up before the event is dispatched, and as a result we
currently have a race between the event and the node-valid-hostname
service. To ensure proper ordering, this change adds a new
dependency to node-valid-hostname which is a service that simply
wait for the br-ex up event. Once we've received that, we should
have run the prepender script and set the FQDN if necessary.

0: openshift#3794
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants