GracefulNodeShutdown tests failing due to connection with dbus #121124
/cc
/triage accepted
Looking at the logs of the last successful job (https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet-serial-containerd/1707529282919600128/artifacts/n1-standard-2-cos-stable-105-17412-156-63-c7409603-system.log) and viewing the kernel logs, I don't see any errors related to dbus. Last successful job Linux kernel version:
The next job fails, and the main difference appears to be a Linux kernel update.
/cc
/cc
I was able to repro this on COS M109 as follows, while the same steps work fine on COS M105.
I will dig a bit deeper to determine whether this is an issue in COS or a systemd upstream regression.
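The repro snippet itself isn't shown above, but the steps described amount to restarting the Dbus broker and then trying to reach systemd-logind over the bus. Below is a minimal sketch of that kind of check, using the same godbus/dbus/v5 library the kubelet uses; the program structure and output strings are illustrative, and it assumes root on the image under test:

```go
// A rough sketch of the reproduction described above (the exact commands
// are not shown in the thread; this is an illustrative guess).
package main

import (
	"fmt"
	"os/exec"

	"github.com/godbus/dbus/v5"
)

func main() {
	// Restart the Dbus broker, as the failing test does.
	if out, err := exec.Command("systemctl", "restart", "dbus").CombinedOutput(); err != nil {
		fmt.Printf("dbus restart failed: %v: %s\n", err, out)
		return
	}

	// Then try any logind call over the new bus socket. Per the thread,
	// on COS M109 this fails because systemd-logind never reconnects to
	// the restarted broker, while on COS M105 it still works.
	conn, err := dbus.SystemBus()
	if err != nil {
		fmt.Printf("bus connect failed: %v\n", err)
		return
	}
	obj := conn.Object("org.freedesktop.login1", "/org/freedesktop/login1")
	call := obj.Call("org.freedesktop.login1.Manager.ListSessions", 0)
	if call.Err != nil {
		fmt.Printf("logind unreachable after dbus restart: %v\n", call.Err)
		return
	}
	fmt.Println("logind reachable")
}
```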
@bobbypage, thank you for looking into this! (Adding @wzshiming, who authored the Dbus restart code.)

What we are seeing here is not a regression. This problem affects Google's COS, Fedora CoreOS (and its RHEL-based equivalent), Fedora Workstation, and most likely any other Linux distribution, whether container-optimised or not, that is not Debian or a Debian derivative (such as Ubuntu). Both Debian and Ubuntu are unaffected, as they handle Dbus restarts slightly differently.

TL;DR: This can lead to anything from authentication timeouts (since the systemd-logind and polkitd services would be impacted) to the system being generally broken following a Dbus broker restart. This is especially the case for desktop systems (such as Fedora Workstation), where a wide variety of services rely on working Dbus connectivity. Some software handles this by reconnecting on a socket write or read timeout; systemd, by and large, does not. Lennart has confirmed this on a number of occasions (selected from a sample of issues):
Debian and its derivatives, such as Ubuntu or Linux Mint, are not affected, as they opt to trigger the restart (or activation, if you wish) of key system services (such as systemd-logind) following a Dbus broker restart.

I recommend removing the Dbus broker restart from the test code base to fix the failing tests. If there are no objections, I would be happy to do this work. Additionally, we could look at porting Debian's behaviour around restarting systemd-logind on Dbus restart to Google's COS and Fedora CoreOS (and, as such, the RHEL-based equivalent). But that would be bonus work, outside the scope of this issue.

Typically, on Fedora CoreOS, the default list of Dbus connections just after system startup might look as follows: Dbus connections list on Fedora CoreOS after startup
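The connection listings referenced here and below are captured as console output. For reference, a rough sketch of how the same information can be enumerated programmatically (roughly what `busctl list` reports), again using godbus/dbus/v5; nothing here is from the thread itself:

```go
// List every name currently owned on the system bus. Well-known names
// such as org.freedesktop.login1 drop off this list after a broker
// restart until the owning service reconnects.
package main

import (
	"fmt"

	"github.com/godbus/dbus/v5"
)

func main() {
	conn, err := dbus.SystemBus()
	if err != nil {
		panic(err)
	}
	var names []string
	// org.freedesktop.DBus.ListNames is the standard bus-daemon method
	// behind tools like busctl list.
	if err := conn.BusObject().Call("org.freedesktop.DBus.ListNames", 0).Store(&names); err != nil {
		panic(err)
	}
	for _, name := range names {
		fmt.Println(name)
	}
}
```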
Several key systemd services are connected, such as the systemd-logind and the systemd-homed, plus the polkitd service. Restarting Dbus would reduce this list to the following: Dbus connections list on Fedora CoreOS after Dbus broker restart
There is almost nothing connected following the restart. Any attempt to introspect the /org/freedesktop/login1 object would now fail per: Object introspection run
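A minimal sketch of that failing introspection, assuming the Dbus broker has just been restarted; the exact error returned (for instance, ServiceUnknown when org.freedesktop.login1 has no owner on the new bus) is an assumption and may vary depending on whether the name is activatable on the image:

```go
// Attempt to introspect the login1 object; after a broker restart this
// fails instead of returning the object's introspection XML.
package main

import (
	"fmt"

	"github.com/godbus/dbus/v5"
)

func main() {
	conn, err := dbus.SystemBus()
	if err != nil {
		panic(err)
	}
	obj := conn.Object("org.freedesktop.login1", "/org/freedesktop/login1")
	var xml string
	err = obj.Call("org.freedesktop.DBus.Introspectable.Introspect", 0).Store(&xml)
	if err != nil {
		fmt.Printf("introspection failed: %v\n", err)
		return
	}
	fmt.Println(xml)
}
```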
Restarting the systemd-logind service causes a fresh process to reconnect to the Dbus bus per: Dbus connections list on Fedora CoreOS after systemd-logind restart
This resolves the problem of accessing the /org/freedesktop/login1 object per: Object introspection run
However, other services might still be affected and would need to be restarted. I mentioned that Debian and its derivatives are not affected as they trigger the restart of key systemd services, especially the systemd-logind service. This can be seen in the following output - pay close attention to the PID of the systemd-logind service on the list (using Debian 12 "bookworm" as an example): Dbus connections list on Debian 12 following Dbus broker restart
Note that for this particular Debian 12 system, the initial state of the connections was different. The following is the state immediately after the system starts up; this is a desktop flavour with a lot of other software that relies on Dbus and that will also fail to reconnect correctly: Dbus connections list on Debian 12 after startup
When the Dbus broker is restarted and a service that exposes an object hasn't reconnected to the new Unix socket, the following breakdown in connectivity between the client, the Dbus bus, and the service in question can be seen. Any attempt to access the /org/freedesktop/login1 object will not work: Object introspection output
This failure can also be seen at a lower level, per: The busctl strace output
A test was added in the past to catch issues following a Dbus broker restart, as part of a genuine fix that added support for reconnecting to the Dbus socket in case of errors or connection failure. This test works fine on Debian and Ubuntu, as these distributions aren't affected as much, as we have already learned. Specific commit: 990d094. This was in response to a reported issue:
Given that tests nowadays commonly run on either Google Container-Optimised OS (COS) or Fedora CoreOS, the addition of the Dbus broker restart has side effects:
Some combination of the above might help explain why a lot of our tests were timing out either on connection (SSH) or execution (Ginkgo, etc.). Related code: kubernetes/test/e2e_node/node_shutdown_linux_test.go Lines 683 to 687 in 925a8dd
kubernetes/test/e2e_node/node_shutdown_linux_test.go Lines 378 to 406 in 8453eb0
The code responsible for sending a signal (SIGHUP) to systemd-logind to ask it to reload its configuration: kubernetes/pkg/kubelet/nodeshutdown/systemd/inhibit_linux.go Lines 118 to 134 in 925a8dd
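For illustration, here is a minimal standalone sketch of that mechanism: asking systemd's manager over Dbus to deliver SIGHUP to the unit's main process, so no PID lookup is needed on our side. This is a simplification using the same godbus/dbus/v5 library, not the kubelet's actual code:

```go
// Ask systemd (over Dbus) to send SIGHUP to systemd-logind, rather
// than resolving the PID and signalling it ourselves.
package main

import (
	"fmt"
	"syscall"

	"github.com/godbus/dbus/v5"
)

func main() {
	conn, err := dbus.SystemBus()
	if err != nil {
		panic(err)
	}
	// org.freedesktop.systemd1.Manager.KillUnit(name, whom, signal):
	// "main" targets only the unit's main process.
	obj := conn.Object("org.freedesktop.systemd1", "/org/freedesktop/systemd1")
	call := obj.Call("org.freedesktop.systemd1.Manager.KillUnit", 0,
		"systemd-logind.service", "main", int32(syscall.SIGHUP))
	if call.Err != nil {
		fmt.Printf("KillUnit failed: %v\n", call.Err)
		return
	}
	fmt.Println("sent SIGHUP to systemd-logind via Dbus")
}
```

Note that this path runs through the broker itself, which is exactly the fragility discussed next.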
The above is an interesting take on sending the SIGHUP signal to a process: we don't need to find out the PID of the process in question and can just ask Dbus to do the work for us. It is a bit unorthodox, though, and relies on a service which, when broken, renders other interactions with the system and other services unreliable. Related upstream issues and Pull Requests for reference:
Linux distributions used for testing: A list of Linux distributions showing the /etc/os-release file content
I took the time to diagnose this during the week and tried to find a way around it: #120728. The original purpose was to fix kubelet being unable to work after a dbus restart. That issue has been fixed in kubelet for a long time now, and I have no objection to deleting this test.
Hi @wzshiming,
Kubelet itself is working without issues. The problem is not between the kubelet and Dbus, but between Dbus and other services that do not handle the Dbus broker being restarted. We are good on the kubelet front. 😄
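For context, the kubelet-side fix discussed above boils down to retrying with a fresh private bus connection when a call fails. A simplified sketch of that reconnect-on-failure pattern follows; the type and method names are illustrative, not the actual kubelet identifiers:

```go
// A simplified sketch of reconnect-on-failure around logind Dbus calls.
package main

import (
	"fmt"

	"github.com/godbus/dbus/v5"
)

type dbusCon struct {
	conn *dbus.Conn
}

// call invokes a logind method, reconnecting once if the call fails,
// which is what happens when the Dbus broker has been restarted and
// the old socket is gone.
func (d *dbusCon) call(method string, args ...interface{}) (*dbus.Call, error) {
	obj := d.conn.Object("org.freedesktop.login1", "/org/freedesktop/login1")
	c := obj.Call(method, 0, args...)
	if c.Err == nil {
		return c, nil
	}
	// The broker may have restarted; establish a fresh private
	// connection (authenticate, then say Hello) and retry once.
	conn, err := dbus.SystemBusPrivate()
	if err != nil {
		return nil, fmt.Errorf("reconnect failed: %w", err)
	}
	if err := conn.Auth(nil); err != nil {
		return nil, err
	}
	if err := conn.Hello(); err != nil {
		return nil, err
	}
	d.conn = conn
	obj = d.conn.Object("org.freedesktop.login1", "/org/freedesktop/login1")
	return obj.Call(method, 0, args...), nil
}
```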
Which jobs are failing?
Any test that runs GracefulNodeShutdown seems to be having issues.
Which tests are failing?
Reference: #120726
In this case, they are failing consistently.
Since when has it been failing?
The tests were flaky, but as of 9/29 they have been failing consistently.
Testgrid link
https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd
Reason for failure (if possible)
@rphillips and I were looking into these failures. We are able to reproduce them on a GCP instance and are seeing failures when sending dbus shutdown signals.
We have looked into this code, but no code change appears to be responsible for the failure. One thing we did see is that a new release of COS landed around that time.
We are seeing the failures in the logs.
Running this test locally, we can sometimes see issues with dbus.
Running it locally (after copying the image config):
Anything else we need to know?
Relevant SIG(s)
/sig node