OCPBUGS-25753: Run resolv-prepender entirely async
Currently the resolv-prepender dispatcher script starts the systemd
service and then waits for it to complete. This can cause the
dispatcher script to time out if the runtimecfg image pull is slow
or if resolv.conf does not get populated in a timely fashion (it's
not entirely clear to me why the latter happens, but it does). This
can cause configure-ovs to time out if there are a large number of
interfaces on the system triggering the dispatcher script, such as
when there are many VLANs configured.

To avoid this, we can stop waiting for the systemd service in the
dispatcher script. In fact, there's an argument that we shouldn't
wait since we need to be able to handle asynchronous execution
anyway for the slow image pull case (which was the entire reason the
script was split into a service the way it is).

I have found a few possible issues with async execution, however:
* If we start the service with an empty $DHCP6_FQDN_FQDN value and
  then later get a new value for that, we may not correctly apply
  the new value if the service is still running because we only
  ever "systemctl start" the service, which is a noop if the service
  is already running.
* Similarly, if new IP4/6_DOMAINS values come in on a later
  connection, they may not be reflected in the service either.
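
The stale-value problem in the two bullets above can be illustrated with a
toy model. This is only a sketch: `start_service`, `running`, and
`service_env` are hypothetical names that imitate a unit which reads its
environment once at start, and no real systemd unit is involved.

```shell
#!/usr/bin/env bash
# Toy model (not systemd): a "service" that captures its env once at start.
running=0
service_env=""

start_service() {
    # Imitates the noop behaviour of starting an already-running unit.
    if (( running )); then
        return 0
    fi
    running=1
    service_env="$1"
}

start_service "IP4_DOMAINS="             # first event: empty value
start_service "IP4_DOMAINS=example.com"  # later event: ignored, already running
echo "service sees: $service_env"
```

The second call returns without updating anything, so the model service
keeps the value from the first call.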

Even though these may sound like the same problem, I mention them
separately on purpose because the solutions are different:
* For the DHCP6 case, we can move that logic back into the dispatcher
  script so we will always set the hostname no matter what happens
  with the prepender code. One could argue that this should be in
  its own script anyway since it's largely unrelated to resolv.conf.
* For the domains case, we do need to restart the service since the
  domains are involved in resolv.conf generation. However, we do not
  want to restart the service every time since that may be unnecessary
  and if we restart in the middle of the image pull it could result
  in a corrupt image (the whole thing we were trying to avoid by
  running this as a service in the first place).

  To avoid problems with restarting the service when we don't want to,
  I've added logic that only restarts the service if there are
  changed env values AND the runtimecfg image has already been pulled.
  This should mean the worst case scenario is that we don't properly
  set the domains and resolv.conf is temporarily generated with an
  incorrect search line. This should be resolved the next time any
  event that triggers the dispatcher script happens.
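
The restart condition described above can be sketched as a standalone
function. This is a minimal illustration under stated assumptions, not the
code from this commit: `should_restart` is a hypothetical helper, it uses
`cmp -s` where the commit uses `diff -q`, and the image check is passed in
as an arbitrary command so the logic can be exercised without podman.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) only when the current env file
# differs from the new one AND the supplied image-check command succeeds.
# Usage: should_restart ENV_FILE NEW_ENV_FILE CHECK_CMD [ARGS...]
should_restart() {
    local env_file=$1 new_env_file=$2
    shift 2
    # cmp -s is silent and exits non-zero when the files differ or one is
    # missing. "$@" stands in for something like
    # "podman image exists <image>".
    if ! cmp -s "$env_file" "$new_env_file" && "$@"; then
        return 0
    fi
    return 1
}
```

For example, `should_restart /run/resolv-prepender/env /run/resolv-prepender/env.new true`
succeeds only if the two files differ, while passing `false` as the check
command models the case where the image has not been pulled yet and the
restart is skipped.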
cybertron committed Jan 12, 2024
1 parent e0dda82 commit 10a4774
Showing 2 changed files with 24 additions and 21 deletions.
34 changes: 24 additions & 10 deletions templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml
@@ -10,18 +10,22 @@ contents:
 function resolv_prepender {
 mkdir -p /run/resolv-prepender
-echo "DHCP6_FQDN_FQDN=$DHCP6_FQDN_FQDN" > /run/resolv-prepender/env
-echo "IP4_DOMAINS=$IP4_DOMAINS" >> /run/resolv-prepender/env
-echo "IP6_DOMAINS=$IP6_DOMAINS" >> /run/resolv-prepender/env
-systemctl start on-prem-resolv-prepender
-# Wait for the service to complete so we don't mark the network up too soon
-while systemctl is-active on-prem-resolv-prepender
-do
-sleep 1
-done
+echo "IP4_DOMAINS=$IP4_DOMAINS" > /run/resolv-prepender/env.new
+echo "IP6_DOMAINS=$IP6_DOMAINS" >> /run/resolv-prepender/env.new
+# If we changed the environment, we should restart the service to pick up the
+# new values. However, if the image hasn't been pulled successfully yet we can't
+# restart the service or we may interrupt the pull and end up with a corrupt image.
+# We're better off with incorrect search domains for a while than wedging the
+# system with a bad image.
+if ! diff -q /run/resolv-prepender/env /run/resolv-prepender/env.new && /usr/bin/podman image exists {{ .Images.baremetalRuntimeCfgImage }}; then
+>&2 echo "NM resolv-prepender: Environment variable(s) changed. Restarting service."
+systemctl is-active on-prem-resolv-prepender && systemctl kill on-prem-resolv-prepender
+fi
+mv -f /run/resolv-prepender/env.new /run/resolv-prepender/env
+systemctl start --no-block on-prem-resolv-prepender
 }
-export DHCP6_FQDN_FQDN IP4_DOMAINS IP6_DOMAINS
+export IP4_DOMAINS IP6_DOMAINS
 export -f resolv_prepender
 # Given an overall Network Manager dispatcher timeout of 90 seconds, and multiple events which
 # may occur within this time period, we must enforce a time limit for each event. As some
@@ -34,6 +38,16 @@ contents:
 >&2 echo "NM resolv-prepender: Timeout occurred"
 exit 1
 fi
+# If $DHCP6_FQDN_FQDN is not empty and is not localhost.localdomain and static hostname was not already set
+if [[ -n "$DHCP6_FQDN_FQDN" && "$DHCP6_FQDN_FQDN" != "localhost.localdomain" && "$DHCP6_FQDN_FQDN" =~ "." ]] ; then
+STATIC_HOSTNAME="$(test ! -e /etc/hostname && echo -n || cat /etc/hostname | xargs)"
+if [[ -z "$STATIC_HOSTNAME" || "$STATIC_HOSTNAME" == "localhost.localdomain" ]] ; then
+# run with systemd-run to avoid selinux problems
+systemd-run --property=Type=oneshot --unit resolve-prepender-hostnamectl -Pq \
+hostnamectl set-hostname --static --transient $DHCP6_FQDN_FQDN
+fi
+fi
 ;;
 *)
 ;;
11 changes: 0 additions & 11 deletions templates/common/on-prem/files/resolv-prepender.yaml
@@ -33,17 +33,6 @@ contents:
 }
 function resolv_prepender {
-# If $DHCP6_FQDN_FQDN is not empty and is not localhost.localdomain and static hostname was not already set
-if [[ -n "$DHCP6_FQDN_FQDN" && "$DHCP6_FQDN_FQDN" != "localhost.localdomain" && "$DHCP6_FQDN_FQDN" =~ "." ]] ; then
-STATIC_HOSTNAME="$(test ! -e /etc/hostname && echo -n || cat /etc/hostname | xargs)"
-if [[ -z "$STATIC_HOSTNAME" || "$STATIC_HOSTNAME" == "localhost.localdomain" ]] ; then
-# run with systemd-run to avoid selinux problems
-systemd-run --property=Type=oneshot --unit resolve-prepender-hostnamectl -Pq \
-hostnamectl set-hostname --static --transient $DHCP6_FQDN_FQDN
-fi
-fi
 # In DHCP connections, the resolv.conf content may be late, thus we wait for nameservers
 while ! grep nameserver /var/run/NetworkManager/resolv.conf; do
 >&2 echo "NM resolv-prepender: NM resolv.conf still empty of nameserver"