Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GracefulNodeShutdown tests failing due to connection with dbus #121124

Closed
kannon92 opened this issue Oct 10, 2023 · 10 comments · Fixed by #121506
Closed

GracefulNodeShutdown tests failing due to connection with dbus #121124

kannon92 opened this issue Oct 10, 2023 · 10 comments · Fixed by #121506
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@kannon92
Copy link
Contributor

Which jobs are failing?

Any test that runs Gracefulshutdown seems to be having issues.

Which tests are failing?

Reference: #120726

In this case, they are failing consistently.

Since when has it been failing?

The tests were flaky but as of 9/29, they have been failing ever since.

Testgrid link

https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd

Reason for failure (if possible)

@rphillips and I were looking into these failures. We are able to reproduce on a GCP instance but we are seeing failures to sending dbus shutdown signals.

We have looked into this code but it does not look like any code is responsible for this failure. One thing we did see is that there was a new release of COS around then.

We are seeing failures in the logs.

Running this test locally we can sometimes see issues with dbus.

E1010 20:47:08.348826    3687 nodeshutdown_manager_linux.go:193] "Unable to watch the node for shutdown events" err="failed reading InhibitDelayMaxUSec property from logind: Failed to activate service 'org.freedesktop.login1': timed out (service_start_timeout=25000ms)"

Running this locally (after copying the image config):

Anything else we need to know?

export SSH_USER=core && make test-e2e-node INSTANCE_PREFIX=kehannon REMOTE=true IMAGE_CONFIG_FILE="/home/kehannon/Downloads/image-config-serial.yaml" CLEANUP=false RUNTIME=remote CONTAINER_RUNTIME_ENDPOINT="unix:///run/containerd/containerd.sock" FOCUS="after restart dbus, should be able to gracefully shutdown" SKIP="" TEST_ARGS='--kubelet-flags="--cgroup-driver=systemd --cgroups-per-qos=true --cgroup-root=/ --runtime-cgroups=/system.slice/containerd.service --kubelet-cgroups=/system.slice/kubelet.service" --extra-log="{\"name\": \"containerd.log\", \"journalctl\": [\"-u\", \"containerd*\"]}"'

Relevant SIG(s)

/sig node

@kannon92 kannon92 added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Oct 10, 2023
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 10, 2023
@kannon92
Copy link
Contributor Author

/cc @rphillips @sairameshv

@mmiranda96 mmiranda96 moved this from Triage to Issues - To do in SIG Node CI/Test Board Oct 11, 2023
@mmiranda96
Copy link
Contributor

/cc

@mmiranda96
Copy link
Contributor

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 11, 2023
@kannon92
Copy link
Contributor Author

kannon92 commented Oct 11, 2023

Looking in the logs of the last successful job: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet-serial-containerd/1707529282919600128/artifacts/n1-standard-2-cos-stable-105-17412-156-63-c7409603-system.log

And viewing the kernel logs, I don't see any errors related to dbus.

Last successful job linux version:

Sep 28 23:10:42 localhost kernel: Linux version 5.15.120+ (builder@6a7eb91cf3ef) (Chromium OS 15.0_pre458507_p20220602-r18 clang version 15.0.0 (/var/tmp/portage/sys-devel/llvm-15.0_pre458507_p20220602-r18/work/llvm-15.0_pre458507_p20220602/clang a58d0af058038595c93de961b725f86997cf8d4a), LLD 15.0.0) #1 SMP Fri Sep 22 11:05:48 UTC 2023

The next job fails and the main difference seems to be a linux kernel update.

Sep 29 03:09:56 localhost kernel: Linux version 6.1.42+ (builder@a797a1cc3764) (Chromium OS 16.0_pre484197_p20230405-r10 clang version 16.0.0 (/var/tmp/portage/sys-devel/llvm-16.0_pre484197_p20230405-r10/work/llvm-16.0_pre484197_p20230405/clang 2916b99182752b1aece8cc4479d8d6a20b5e02da), LLD 16.0.0) #1 SMP PREEMPT_DYNAMIC Sun Sep 24 13:27:25 UTC 2023

@SergeyKanzhelev
Copy link
Member

/cc

1 similar comment
@bobbypage
Copy link
Member

/cc

@bobbypage
Copy link
Member

bobbypage commented Oct 18, 2023

I was able to repro this on COS M109 as follows, while the below works fine on COS M105.

$ gcloud compute instances create cos-109 --machine-type=n1-standard-1  --image="cos-109-17800-0-51"     --image-project="cos-cloud"     --zone=us-central1-c --project=MY_PROJECT

$ gcloud compute ssh cos-109
cos-109 ~ # gdbus introspect --system --dest org.freedesktop.login1 --object-path /org/freedesktop/login1
node /org/freedesktop/login1  
# works
cos-109 ~ # systemctl restart dbus
cos-109 ~ #  gdbus introspect --system --dest org.freedesktop.login1 --object-path /org/freedesktop/login1
node /org/freedesktop/login1  # will fail forever with "Timeout was reached" error
Error: Timeout was reached

Will dig a bit deeper if this is an issue in COS or systemd upstream regression.

@kwilczynski
Copy link
Contributor

@bobbypage, thank you for looking into this!

(adding @wzshiming who authored the Dbus restart code)

What we are seeing here would be something other than a regression. This problem affects Google's COS, Fedora CoreOS (and its RHEL-based equivalent), Fedora Workstations, and most likely any other Linux distribution, whether container optimised or not, that is not Debian or based on Debian, such as Ubuntu. Both Debian and Ubuntu are unaffected as they handle Dbus restarts slightly differently.

TL;DR
Restarting the Dbus broker service at runtime is generally not recommended or supported. Many systemd components (but also other software and services) that rely on Dbus wouldn't reconnect to a new Unix socket, which would have the same path (location) as before, following the Dbus broker restart.

This can lead to anything from authentication timeouts (since the systemd-logind and the polkitd services would be impacted) to the system generally broken following Dbus broker restart - this is especially the case for desktop (such as Fedora Workstation) systems where there are a lot of variety of services that rely on working Dbus connectivity.

Different software might handle this differently and choose to reconnect on a socket write or read timeout. However, systemd at large is not one of these.

Lennart has confirmed this on a number of occasions per (selected from a sample of issues):

Debian and its derivatives, such as Ubuntu or Linux Mint, etc., are not affected as they opt to trigger the restart (or activation, if you wish) of key system services (such as the systemd-logind) following a Dbus broker restart.

I recommend removing the Dbus broker restart from the test code base to fix the failing tests. If there were no objections, then I would be happy to do this work.

Additionally, we could see about porting Debian's behaviour to Google's COS and Fedora CoreOS (and, as such, the Red Hat RHEL-based equivalent) around restarting systemd-logind on Dbus restart. But that would be a bonus work and outside of this issue's scope.


Typically on a Fedora CoreOS distribution, the default list of Dbus connections just after system startup might look as follows:

Dbus connections list on Fedora CoreOS after startup
# busctl --list
NAME                            PID PROCESS         USER            CONNECTION    UNIT                      SESSION DESCRIPTION
:1.0                            756 systemd-resolve systemd-resolve :1.0          systemd-resolved.service  -       -          
:1.1                              1 systemd         root            :1.1          init.scope                -       -          
:1.11                          1104 zincati         zincati         :1.11         zincati.service           -       -          
:1.13                          1121 polkitd         polkitd         :1.13         polkit.service            -       -          
:1.2                            764 dbus-broker-lau root            :1.2          dbus-broker.service       -       -          
:1.23                          1260 systemd         core            :1.23         user@1000.service         -       -          
:1.25                          1283 systemd-hostnam root            :1.25         systemd-hostnamed.service -       -          
:1.29                          1323 busctl          root            :1.29         session-2.scope           2       -          
:1.3                            787 NetworkManager  root            :1.3          NetworkManager.service    -       -          
:1.4                            809 systemd-homed   root            :1.4          systemd-homed.service     -       -          
:1.5                            818 systemd-logind  root            :1.5          systemd-logind.service    -       -          
org.coreos.zincati             1104 zincati         zincati         :1.11         zincati.service           -       -          
org.freedesktop.DBus              1 systemd         root            -             init.scope                -       -          
org.freedesktop.NetworkManager  787 NetworkManager  root            :1.3          NetworkManager.service    -       -          
org.freedesktop.PolicyKit1     1121 polkitd         polkitd         :1.13         polkit.service            -       -          
org.freedesktop.fwupd             - -               -               (activatable) -                         -       -          
org.freedesktop.home1           809 systemd-homed   root            :1.4          systemd-homed.service     -       -          
org.freedesktop.hostname1      1283 systemd-hostnam root            :1.25         systemd-hostnamed.service -       -          
org.freedesktop.import1           - -               -               (activatable) -                         -       -          
org.freedesktop.locale1           - -               -               (activatable) -                         -       -          
org.freedesktop.login1          818 systemd-logind  root            :1.5          systemd-logind.service    -       -          
org.freedesktop.machine1          - -               -               (activatable) -                         -       -          
org.freedesktop.nm_dispatcher     - -               -               (activatable) -                         -       -          
org.freedesktop.nm_priv_helper    - -               -               (activatable) -                         -       -          
org.freedesktop.oom1              - -               -               (activatable) -                         -       -          
org.freedesktop.portable1         - -               -               (activatable) -                         -       -          
org.freedesktop.resolve1        756 systemd-resolve systemd-resolve :1.0          systemd-resolved.service  -       -          
org.freedesktop.systemd1          1 systemd         root            :1.1          init.scope                -       -          
org.freedesktop.timedate1         - -               -               (activatable) -                         -       -          
org.freedesktop.timesync1         - -               -               (activatable) -                         -       -          
org.projectatomic.rpmostree1      - -               -               (activatable) -                         -       -  

Several key systemd services are connected, such as the systemd-logind and the systemd-homed, plus the polkitd service.

Restarting Dbus would reduce this list to the following:

Dbus connections list on Fedora CoreOS after Dbus broker restart
# busctl --list
NAME                            PID PROCESS         USER CONNECTION    UNIT                SESSION DESCRIPTION
:1.0                              1 systemd         root :1.0          init.scope          -       -          
:1.1                           1344 dbus-broker-lau root :1.1          dbus-broker.service -       -          
:1.2                           1350 busctl          root :1.2          session-2.scope     2       -          
org.freedesktop.DBus              1 systemd         root -             init.scope          -       -          
org.freedesktop.PolicyKit1        - -               -    (activatable) -                   -       -          
org.freedesktop.fwupd             - -               -    (activatable) -                   -       -          
org.freedesktop.home1             - -               -    (activatable) -                   -       -          
org.freedesktop.hostname1         - -               -    (activatable) -                   -       -          
org.freedesktop.import1           - -               -    (activatable) -                   -       -          
org.freedesktop.locale1           - -               -    (activatable) -                   -       -          
org.freedesktop.login1            - -               -    (activatable) -                   -       -          
org.freedesktop.machine1          - -               -    (activatable) -                   -       -          
org.freedesktop.nm_dispatcher     - -               -    (activatable) -                   -       -          
org.freedesktop.nm_priv_helper    - -               -    (activatable) -                   -       -          
org.freedesktop.oom1              - -               -    (activatable) -                   -       -          
org.freedesktop.portable1         - -               -    (activatable) -                   -       -          
org.freedesktop.resolve1          - -               -    (activatable) -                   -       -          
org.freedesktop.systemd1          1 systemd         root :1.0          init.scope          -       -          
org.freedesktop.timedate1         - -               -    (activatable) -                   -       -          
org.freedesktop.timesync1         - -               -    (activatable) -                   -       -          
org.projectatomic.rpmostree1      - -               -    (activatable) -                   -       -   

There is almost nothing connected following the restart. Any attempt to introspect the /org/freedesktop/login1 object would now fail per:

Object introspection run
# time busctl introspect --verbose --system org.freedesktop.login1 /org/freedesktop/login1 > /dev/null
Failed to introspect object /org/freedesktop/login1 of service org.freedesktop.login1: Connection timed out

real	0m25.031s
user	0m0.001s
sys	0m0.005s

Restarting the systemd-logind service causes a fresh process to reconnect to the Dbus bus per:

Dbus connections list on Fedora CoreOS after systemd-logind restart
# systemctl restart systemd-logind
# busctl --list
NAME                            PID PROCESS         USER CONNECTION    UNIT                   SESSION DESCRIPTION
:1.1                           1344 dbus-broker-lau root :1.1          dbus-broker.service    -       -          
:1.11                             1 systemd         root :1.11         init.scope             -       -          
:1.15                          1598 systemd-logind  root :1.15         systemd-logind.service -       -          
:1.16                          1601 busctl          root :1.16         session-2.scope        2       -          
org.freedesktop.DBus              1 systemd         root -             init.scope             -       -          
org.freedesktop.PolicyKit1        - -               -    (activatable) -                      -       -          
org.freedesktop.fwupd             - -               -    (activatable) -                      -       -          
org.freedesktop.home1             - -               -    (activatable) -                      -       -          
org.freedesktop.hostname1         - -               -    (activatable) -                      -       -          
org.freedesktop.import1           - -               -    (activatable) -                      -       -          
org.freedesktop.locale1           - -               -    (activatable) -                      -       -          
org.freedesktop.login1         1598 systemd-logind  root :1.15         systemd-logind.service -       -          
org.freedesktop.machine1          - -               -    (activatable) -                      -       -          
org.freedesktop.nm_dispatcher     - -               -    (activatable) -                      -       -          
org.freedesktop.nm_priv_helper    - -               -    (activatable) -                      -       -          
org.freedesktop.oom1              - -               -    (activatable) -                      -       -          
org.freedesktop.portable1         - -               -    (activatable) -                      -       -          
org.freedesktop.resolve1          - -               -    (activatable) -                      -       -          
org.freedesktop.systemd1          1 systemd         root :1.11         init.scope             -       -          
org.freedesktop.timedate1         - -               -    (activatable) -                      -       -          
org.freedesktop.timesync1         - -               -    (activatable) -                      -       -          
org.projectatomic.rpmostree1      - -               -    (activatable) -                      -       -   

This resolves the problem of accessing the /org/freedesktop/login1 object per:

Object introspection run
# time busctl introspect --verbose --system org.freedesktop.login1 /org/freedesktop/login1 > /dev/null

real	0m0.010s
user	0m0.004s
sys	0m0.003s

However, other services might still be affected and would need to be restarted.

I mentioned that Debian and its derivatives are not affected as they trigger the restart of key systemd services, especially the systemd-logind service. This can be seen in the following output - pay close attention to the PID of the systemd-logind service on the list (using Debian 12 "bookworm" as an example):

Dbus connections list on Debian 12 following Dbus broker restart
# busctl --list
NAME                                 PID PROCESS        USER        CONNECTION    UNIT                   SESSION DESCRIPTION
:1.1                                   1 systemd        root        :1.1          init.scope             -       -          
:1.2                                4425 systemd-logind root        :1.2          systemd-logind.service -       -          
:1.3                                1544 wireplumber    kwilczynski :1.3          user@1000.service      -       -          
:1.4                                4427 busctl         root        :1.4          session-4.scope        4       -          
com.ubuntu.SoftwareProperties          - -              -           (activatable) -                      -       -          
fi.w1.wpa_supplicant1                  - -              -           (activatable) -                      -       -          
net.hadess.PowerProfiles               - -              -           (activatable) -                      -       -          
org.bluez                              - -              -           (activatable) -                      -       -          
org.freedesktop.Accounts               - -              -           (activatable) -                      -       -          
org.freedesktop.Avahi                  - -              -           (activatable) -                      -       -          
org.freedesktop.ColorManager           - -              -           (activatable) -                      -       -          
org.freedesktop.DBus                   1 systemd        root        -             init.scope             -       -          
org.freedesktop.GeoClue2               - -              -           (activatable) -                      -       -          
org.freedesktop.ModemManager1          - -              -           (activatable) -                      -       -          
org.freedesktop.PackageKit             - -              -           (activatable) -                      -       -          
org.freedesktop.PolicyKit1             - -              -           (activatable) -                      -       -          
org.freedesktop.RealtimeKit1           - -              -           (activatable) -                      -       -          
org.freedesktop.UDisks2                - -              -           (activatable) -                      -       -          
org.freedesktop.UPower                 - -              -           (activatable) -                      -       -          
org.freedesktop.bolt                   - -              -           (activatable) -                      -       -          
org.freedesktop.fwupd                  - -              -           (activatable) -                      -       -          
org.freedesktop.hostname1              - -              -           (activatable) -                      -       -          
org.freedesktop.locale1                - -              -           (activatable) -                      -       -          
org.freedesktop.login1              4425 systemd-logind root        :1.2          systemd-logind.service -       -          
org.freedesktop.network1               - -              -           (activatable) -                      -       -          
org.freedesktop.nm_dispatcher          - -              -           (activatable) -                      -       -          
org.freedesktop.nm_priv_helper         - -              -           (activatable) -                      -       -          
org.freedesktop.realmd                 - -              -           (activatable) -                      -       -          
org.freedesktop.systemd1               1 systemd        root        :1.1          init.scope             -       -          
org.freedesktop.timedate1              - -              -           (activatable) -                      -       -          
org.opensuse.CupsPkHelper.Mechanism    - -              -           (activatable) -                      -       -          
# systemctl restart dbus
# busctl --list
NAME                                 PID PROCESS        USER        CONNECTION    UNIT                   SESSION DESCRIPTION
:1.0                                   1 systemd        root        :1.0          init.scope             -       -          
:1.2                                4773 systemd-logind root        :1.2          systemd-logind.service -       -          
:1.3                                1544 wireplumber    kwilczynski :1.3          user@1000.service      -       -          
:1.4                                4775 busctl         root        :1.4          session-4.scope        4       -          
com.ubuntu.SoftwareProperties          - -              -           (activatable) -                      -       -          
fi.w1.wpa_supplicant1                  - -              -           (activatable) -                      -       -          
net.hadess.PowerProfiles               - -              -           (activatable) -                      -       -          
org.bluez                              - -              -           (activatable) -                      -       -          
org.freedesktop.Accounts               - -              -           (activatable) -                      -       -          
org.freedesktop.Avahi                  - -              -           (activatable) -                      -       -          
org.freedesktop.ColorManager           - -              -           (activatable) -                      -       -          
org.freedesktop.DBus                   1 systemd        root        -             init.scope             -       -          
org.freedesktop.GeoClue2               - -              -           (activatable) -                      -       -          
org.freedesktop.ModemManager1          - -              -           (activatable) -                      -       -          
org.freedesktop.PackageKit             - -              -           (activatable) -                      -       -          
org.freedesktop.PolicyKit1             - -              -           (activatable) -                      -       -          
org.freedesktop.RealtimeKit1           - -              -           (activatable) -                      -       -          
org.freedesktop.UDisks2                - -              -           (activatable) -                      -       -          
org.freedesktop.UPower                 - -              -           (activatable) -                      -       -          
org.freedesktop.bolt                   - -              -           (activatable) -                      -       -          
org.freedesktop.fwupd                  - -              -           (activatable) -                      -       -          
org.freedesktop.hostname1              - -              -           (activatable) -                      -       -          
org.freedesktop.locale1                - -              -           (activatable) -                      -       -          
org.freedesktop.login1              4773 systemd-logind root        :1.2          systemd-logind.service -       -          
org.freedesktop.network1               - -              -           (activatable) -                      -       -          
org.freedesktop.nm_dispatcher          - -              -           (activatable) -                      -       -          
org.freedesktop.nm_priv_helper         - -              -           (activatable) -                      -       -          
org.freedesktop.realmd                 - -              -           (activatable) -                      -       -          
org.freedesktop.systemd1               1 systemd        root        :1.0          init.scope             -       -          
org.freedesktop.timedate1              - -              -           (activatable) -                      -       -          
org.opensuse.CupsPkHelper.Mechanism    - -              -           (activatable) -                      -       -   

Note that for this particular Debian 12 system, the initial state of the connections was different. The following is the state immediately after the system starts up, this is a desktop flavour with a lot of other software that relies on Dbus, and that will also fail to reconnect correctly per:

Dbus connections list on Debian 12 after startup
# busctl --list
NAME                                 PID PROCESS         USER        CONNECTION    UNIT                          SESSION DESCRIPTION
:1.0                                 486 avahi-daemon    avahi       :1.0          avahi-daemon.service          -       -          
:1.1                                 497 polkitd         polkitd     :1.1          polkit.service                -       -          
:1.10                                506 NetworkManager  root        :1.10         NetworkManager.service        -       -          
:1.11                                533 ModemManager    root        :1.11         ModemManager.service          -       -          
:1.15                                589 cupsd           root        :1.15         cups.service                  -       -          
:1.16                                609 cups-browsed    root        :1.16         cups-browsed.service          -       -          
:1.17                                609 cups-browsed    root        :1.17         cups-browsed.service          -       -          
:1.18                                622 gdm3            root        :1.18         gdm.service                   -       -          
:1.2                                   1 systemd         root        :1.2          init.scope                    -       -          
:1.21                                705 systemd         Debian-gdm  :1.21         user@115.service              -       -          
:1.22                                913 wireplumber     Debian-gdm  :1.22         user@115.service              -       -          
:1.23                                919 pipewire-pulse  Debian-gdm  :1.23         user@115.service              -       -          
:1.24                                909 pipewire        Debian-gdm  :1.24         user@115.service              -       -          
:1.25                                915 gdm-wayland-ses Debian-gdm  :1.25         session-c1.scope              c1      -          
:1.26                                926 rtkit-daemon    root        :1.26         rtkit-daemon.service          -       -          
:1.27                                924 gnome-session-b Debian-gdm  :1.27         session-c1.scope              c1      -          
:1.28                                913 wireplumber     Debian-gdm  :1.28         user@115.service              -       -          
:1.29                                961 gnome-shell     Debian-gdm  :1.29         session-c1.scope              c1      -          
:1.3                                 501 udisksd         root        :1.3          udisks2.service               -       -          
:1.31                               1002 gvfs-udisks2-vo Debian-gdm  :1.31         user@115.service              -       -          
:1.32                               1024 goa-daemon      Debian-gdm  :1.32         user@115.service              -       -          
:1.33                                980 tracker-miner-f Debian-gdm  :1.33         user@115.service              -       -          
:1.34                               1047 upowerd         root        :1.34         upower.service                -       -          
:1.37                               1065 packagekitd     root        :1.37         packagekit.service            -       -          
:1.38                               1072 colord          colord      :1.38         colord.service                -       -          
:1.4                                 498 power-profiles- root        :1.4          power-profiles-daemon.service -       -          
:1.44                               1101 gsd-rfkill      Debian-gdm  :1.44         session-c1.scope              c1      -          
:1.45                               1100 gsd-print-notif Debian-gdm  :1.45         session-c1.scope              c1      -          
:1.46                               1084 gsd-sharing     Debian-gdm  :1.46         session-c1.scope              c1      -          
:1.47                               1169 gsd-printer     Debian-gdm  :1.47         session-c1.scope              c1      -          
:1.49                               1260 realmd          root        :1.49         realmd.service                -       -          
:1.5                                 499 switcheroo-cont root        :1.5          switcheroo-control. service    -       -          
:1.50                               1108 gsd-media-keys  Debian-gdm  :1.50         session-c1.scope              c1      -          
:1.51                               1095 gsd-keyboard    Debian-gdm  :1.51         session-c1.scope              c1      -          
:1.52                               1128 gsd-power       Debian-gdm  :1.52         session-c1.scope              c1      -          
:1.53                               1277 ibus-dconf      Debian-gdm  :1.53         session-c1.scope              c1      -          
:1.56                               1327 systemd         kwilczynski :1.56         user@1000.service             -       -          
:1.57                               1348 pipewire        kwilczynski :1.57         user@1000.service             -       -          
:1.58                               1365 pipewire-pulse  kwilczynski :1.58         user@1000.service             -       -          
:1.59                               1364 wireplumber     kwilczynski :1.59         user@1000.service             -       -          
:1.6                                 482 accounts-daemon root        :1.6          accounts-daemon.service       -       -          
:1.60                               1364 wireplumber     kwilczynski :1.60         user@1000.service             -       -          
:1.61                               1402 gvfs-udisks2-vo kwilczynski :1.61         user@1000.service             -       -          
:1.62                               1419 goa-daemon      kwilczynski :1.62         user@1000.service             -       -          
:1.63                               1397 tracker-miner-f kwilczynski :1.63         user@1000.service             -       -          
:1.65                               1450 busctl          root        :1.65         session-2.scope               2       -          
:1.7                                 512 wpa_supplicant  root        :1.7          wpa_supplicant.service        -       -          
:1.8                                 500 systemd-logind  root        :1.8          systemd-logind.service        -       -          
:1.9                                 496 low-memory-moni root        :1.9          low-memory-monitor.service    -       -          
com.redhat.NewPrinterNotification   1169 gsd-printer     Debian-gdm  :1.47         session-c1.scope              c1      -          
com.redhat.PrinterDriversInstaller  1169 gsd-printer     Debian-gdm  :1.47         session-c1.scope              c1      -          
com.ubuntu.SoftwareProperties          - -               -           (activatable) -                             -       -          
fi.w1.wpa_supplicant1                512 wpa_supplicant  root        :1.7          wpa_supplicant.service        -       -          
net.hadess.PowerProfiles             498 power-profiles- root        :1.4          power-profiles-daemon.service -       -          
net.hadess.SwitcherooControl         499 switcheroo-cont root        :1.5          switcheroo-control.service    -       -          
org.bluez                              - -               -           (activatable) -                             -       -          
org.freedesktop.Accounts             482 accounts-daemon root:1.6          accounts-daemon.service       -       -          
org.freedesktop.Avahi                486 avahi-daemon    avahi       :1.0          avahi-daemon.service          -       -          
org.freedesktop.ColorManager        1072 colord          colord      :1.38         colord.service                -       -          
org.freedesktop.DBus                   1 systemd         root        -             init.scope                    -       -          
org.freedesktop.GeoClue2               - -               -           (activatable) -                             -       -          
org.freedesktop.LowMemoryMonitor     496 low-memory-moni root        :1.9          low-memory-monitor.service    -       -          
org.freedesktop.ModemManager1        533 ModemManager    root        :1.11         ModemManager.service          -       -          
org.freedesktop.NetworkManager       506 NetworkManager  root        :1.10         NetworkManager.service        -       -          
org.freedesktop.PackageKit          1065 packagekitd     root        :1.37         packagekit.service            -       -          
org.freedesktop.PolicyKit1           497 polkitd         polkitd     :1.1          polkit.service                -       -          
org.freedesktop.RealtimeKit1         926 rtkit-daemon    root        :1.26         rtkit-daemon.service          -       -          
org.freedesktop.UDisks2              501 udisksd         root        :1.3          udisks2.service               -       -          
org.freedesktop.UPower              1047 upowerd         root        :1.34         upower.service                -       -          
org.freedesktop.bolt                   - -               -           (activatable) -                             -       -          
org.freedesktop.fwupd                  - -               -           (activatable) -                             -       -          
org.freedesktop.hostname1              - -               -           (activatable) -                             -       -          
org.freedesktop.locale1                - -               -           (activatable) -                             -       -          
org.freedesktop.login1               500 systemd-logind  root        :1.8          systemd-logind.service        -       -          
org.freedesktop.network1               - -               -           (activatable) -                             -       -          
org.freedesktop.nm_dispatcher          - -               -           (activatable) -                             -       -          
org.freedesktop.nm_priv_helper         - -               -           (activatable) -                             -       -          
org.freedesktop.realmd              1260 realmd          root        :1.49         realmd.service                -       -          
org.freedesktop.systemd1               1 systemd         root        :1.2          init.scope                    -       -          
org.freedesktop.timedate1              - -               -           (activatable) -                             -       -          
org.gnome.DisplayManager             622 gdm3            root        :1.18         gdm.service                   -       -          
org.opensuse.CupsPkHelper.Mechanism    - -               -           (activatable) -                             -       -

When the Dbus broker is restarted, and a service that exposes an object hasn't reconnected to the new Unix socket, the following breakdown in connection between the client, the Dbus bus, and the service of interest in question can be seen.

Any attempt to access the /org/freedesktop/login1 object will not work:

Object introspection output
‣ Type=method_call  Endian=l  Flags=0  Version=1 Cookie=1  Timestamp="Wed 2023-10-25 01:29:32.367014 UTC"
  Sender=:1.28  Destination=org.freedesktop.DBus  Path=/org/freedesktop/DBus  Interface=org.freedesktop.DBus  Member=Hello
  UniqueName=:1.28
  MESSAGE "" {
  };

‣ Type=method_return  Endian=l  Flags=1  Version=1 Cookie=-1  ReplyCookie=1  Timestamp="Wed 2023-10-25 01:29:32.367295 UTC"
  Sender=org.freedesktop.DBus  Destination=:1.28
  MESSAGE "s" {
          STRING ":1.28";
  };

‣ Type=signal  Endian=l  Flags=1  Version=1 Cookie=-1  Timestamp="Wed 2023-10-25 01:29:32.367419 UTC"
  Sender=org.freedesktop.DBus  Path=/org/freedesktop/DBus  Interface=org.freedesktop.DBus  Member=NameOwnerChanged
  MESSAGE "sss" {
          STRING ":1.28";
          STRING "";
          STRING ":1.28";
  };

‣ Type=signal  Endian=l  Flags=1  Version=1 Cookie=-1  Timestamp="Wed 2023-10-25 01:29:32.367571 UTC"
  Sender=org.freedesktop.DBus  Destination=:1.28  Path=/org/freedesktop/DBus  Interface=org.freedesktop.DBus  Member=NameAcquired
  MESSAGE "s" {
          STRING ":1.28";
  };

‣ Type=method_call  Endian=l  Flags=0  Version=1 Cookie=2  Timestamp="Wed 2023-10-25 01:29:32.367697 UTC"
  Sender=:1.28  Destination=org.freedesktop.login1  Path=/org/freedesktop/login1  Interface=org.freedesktop.DBus.Introspectable  Member=Introspect
  UniqueName=:1.28
  MESSAGE "" {
  };

‣ Type=signal  Endian=l  Flags=1  Version=1 Cookie=-1  Timestamp="Wed 2023-10-25 01:29:35.371152 UTC"
  Sender=org.freedesktop.DBus  Destination=:1.28  Path=/org/freedesktop/DBus  Interface=org.freedesktop.DBus  Member=NameLost
  MESSAGE "s" {
          STRING ":1.28";
  };

‣ Type=signal  Endian=l  Flags=1  Version=1 Cookie=-1  Timestamp="Wed 2023-10-25 01:29:35.371281 UTC"
  Sender=org.freedesktop.DBus  Path=/org/freedesktop/DBus  Interface=org.freedesktop.DBus  Member=NameOwnerChanged
  MESSAGE "sss" {
          STRING ":1.28";
          STRING ":1.28";
          STRING "";
  };

This failure can also be seen on a bit lower level too, per:

The busctl strace output
# strace -f -s 65535 -tt -e network -- busctl introspect --verbose --system org.freedesktop.login1 /org/freedesktop/login1 
06:59:35.217564 socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
06:59:35.217721 getsockopt(3, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
06:59:35.217841 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [8388608], 4) = 0
06:59:35.217961 getsockopt(3, SOL_SOCKET, SO_RCVBUF, [425984], [4]) = 0
06:59:35.218018 setsockopt(3, SOL_SOCKET, SO_RCVBUFFORCE, [8388608], 4) = 0
06:59:35.218095 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
06:59:35.218220 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [8388608], 4) = 0
06:59:35.218349 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [425984], [4]) = 0
06:59:35.218495 setsockopt(3, SOL_SOCKET, SO_SNDBUFFORCE, [8388608], 4) = 0
06:59:35.218574 connect(3, {sa_family=AF_UNIX, sun_path="/run/dbus/system_bus_socket"}, 30) = 0
06:59:35.218823 getsockopt(3, SOL_SOCKET, SO_PEERCRED, {pid=1, uid=0, gid=0}, [12]) = 0
06:59:35.218961 getsockopt(3, SOL_SOCKET, SO_PEERSEC, "system_u:system_r:system_dbusd_t:s0-s0:c0.c1023\0", [64 => 48]) = 0
06:59:35.219128 getsockopt(3, SOL_SOCKET, SO_PEERGROUPS, 0x563a7b711b10, [256 => 0]) = 0
06:59:35.219311 getsockopt(3, SOL_SOCKET, SO_ACCEPTCONN, [0], [4]) = 0
06:59:35.219375 getsockname(3, {sa_family=AF_UNIX}, [128 => 2]) = 0
06:59:35.219497 sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0AUTH EXTERNAL\r\nDATA\r\n", iov_len=22}, {iov_base="NEGOTIATE_UNIX_FD\r\n", iov_len=19}, {iov_base="BEGIN\r\n", iov_len=7}], msg_iovlen=3, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 48
06:59:35.220067 recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="DATA\r\nOK 0b9e39cd320c22eeea94bce5f17c4629\r\nAGREE_UNIX_FD\r\n", iov_len=256}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 58
06:59:35.220285 sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\1\0\1\0\0\0\0\1\0\0\0m\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\3\1s\0\5\0\0\0Hello\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0", iov_len=128}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 128
06:59:35.220535 recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\2\1\1\t\0\0\0\377\377\377\377?\0\0\0\5\1u\0\1\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
06:59:35.220783 recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\7\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\6\1s\0\4\0\0\0:1.4\0\0\0\0\10\1g\0\1s\0\0\4\0\0\0:1.4\0", iov_len=65}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 65
06:59:35.220978 sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\1\0\1\0\0\0\0\2\0\0\0\207\0\0\0\1\1o\0\27\0\0\0/org/freedesktop/login1\0\3\1s\0\n\0\0\0Introspect\0\0\0\0\0\0\2\1s\0#\0\0\0org.freedesktop.DBus.Introspectable\0\0\0\0\0\6\1s\0\26\0\0\0org.freedesktop.login1\0\0", iov_len=152}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 152
06:59:35.221186 recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="l\4\1\1\t\0\0\0\377\377\377\377\217\0\0\0\7\1s\0\24\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 24
06:59:35.221391 recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="org.freedesktop.DBus\0\0\0\0\6\1s\0\4\0\0\0:1.4\0\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\3\1s\0\f\0\0\0NameAcquired\0\0\0\0\10\1g\0\1s\0\0\4\0\0\0:1.4\0", iov_len=145}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = 145
06:59:35.221598 recvmsg(3, {msg_namelen=0}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
Failed to introspect object /org/freedesktop/login1 of service org.freedesktop.login1: Connection timed out
07:00:00.247471 +++ exited with 1 +++

A test has been added in the past to catch issues following the Dbus broker service restart as part of a genuine issue that consisted of adding support for reconnecting to the Dbus socket in case of errors or connection failure. This test would work fine on Debian and Ubuntu, as these Linux distributions aren't affected as much as we already learned.

Specific commit: 990d094

This was in response to a reported issue:

Given that tests are nowadays commonly run on either Googe Container-optimised OS (COS) or Fedora CoreOS, the addition of the Dbus broker restart would have side effects:

  • Any use of sudo would either appear as hang or take a long time to time out
  • Attempt to sign into a given compute instance using SSH would have taken a long time or appear as a hang connection
  • Dispatching requests to Dbus to send a signal to some other participating services can also take a long time before eventually timing out

Some combination of the above might help explain why a lot of our tests were timing out either on connection (SSH) or execution (Ginko, etc.).

Related code:

func restartDbus() error {
cmd := "systemctl restart dbus"
_, err := runCommand("sh", "-c", cmd)
return err
}

ginkgo.It("after restart dbus, should be able to gracefully shutdown", func(ctx context.Context) {
// allows manual restart of dbus to work in Ubuntu.
err := overlayDbusConfig()
framework.ExpectNoError(err)
defer func() {
err := restoreDbusConfig()
framework.ExpectNoError(err)
}()
ginkgo.By("Restart Dbus")
err = restartDbus()
framework.ExpectNoError(err)
gomega.Eventually(ctx, func(ctx context.Context) error {
// re-send the shutdown signal in case the dbus restart is not done
ginkgo.By("Emitting Shutdown signal")
err = emitSignalPrepareForShutdown(true)
if err != nil {
return err
}
isReady := getNodeReadyStatus(ctx, f)
if isReady {
return fmt.Errorf("node did not become shutdown as expected")
}
return nil
}, nodeStatusUpdateTimeout, pollInterval).Should(gomega.Succeed())
})
})

The code responsible for sending a signal (SIGHUP) to systemd-logind to ask it to reload its configuration:

func (bus *DBusCon) ReloadLogindConf() error {
systemdService := "org.freedesktop.systemd1"
systemdObject := "/org/freedesktop/systemd1"
systemdInterface := "org.freedesktop.systemd1.Manager"
obj := bus.SystemBus.Object(systemdService, dbus.ObjectPath(systemdObject))
unit := "systemd-logind.service"
who := "all"
var signal int32 = 1 // SIGHUP
call := obj.Call(systemdInterface+".KillUnit", 0, unit, who, signal)
if call.Err != nil {
return fmt.Errorf("unable to reload logind conf: %w", call.Err)
}
return nil
}

The above is an interesting take on sending the SIGHUP signal to a process since we don't need to find out what PID the process in question has and just ask Dbus to do the work for us - a bit unorthodox and relies on the service that when breaks would render other interactions with the system and other services unreliable.

Related upstream issues and Pull Requests for reference:


Linux distributions used for testing:

A list of Linux distributions showing the /etc/os-release file content
# cat /etc/os-release 
NAME="Fedora Linux"
VERSION="38.20231002.3.1 (CoreOS)"
ID=fedora
VERSION_ID=38
VERSION_CODENAME=""
PLATFORM_ID="platform:f38"
PRETTY_NAME="Fedora CoreOS 38.20231002.3.1"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:38"
HOME_URL="https://getfedora.org/coreos/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora-coreos/"
SUPPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
BUG_REPORT_URL="https://github.com/coreos/fedora-coreos-tracker/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=38
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=38
SUPPORT_END=2024-05-14
VARIANT="CoreOS"
VARIANT_ID=coreos
OSTREE_VERSION='38.20231002.3.1'
# cat /etc/os-release 
NAME="Fedora Linux"
VERSION="38 (Workstation Edition)"
ID=fedora
VERSION_ID=38
VERSION_CODENAME=""
PLATFORM_ID="platform:f38"
PRETTY_NAME="Fedora Linux 38 (Workstation Edition)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:38"
DEFAULT_HOSTNAME="fedora"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f38/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=38
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=38
SUPPORT_END=2024-05-14
VARIANT="Workstation Edition"
VARIANT_ID=workstation
# cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
# cat /etc/os-release 
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

@wzshiming
Copy link
Member

I took the time to diagnose it this week, and I tried to find a way around it. #120728

The previous purpose was to fix that kubelet could not work after dbus restart. Currently, this issue has been fixed in kubelet for a long time, and I have no opinion on deleting this test.
#100328 (comment)

@kwilczynski
Copy link
Contributor

Hi @wzshiming,

The previous purpose was to fix that kubelet could not work after dbus restart. Currently, this issue has been fixed in kubelet for a long time, and I have no opinion on deleting this test. #100328 (comment)

This is working without issues. The issue is not between kubelet and Dbus, but between Dbus and other services that do not handle the Dbus broker being restarted. We are good on the kubelet front. 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Development

Successfully merging a pull request may close this issue.

7 participants