
Bug Report: Supervised system is reported as Unhealthy system - not privileged every time after supervisor upgrade. #342

Open
Lockszmith-GH opened this issue Dec 19, 2023 · 13 comments
Labels
bug Something isn't working

Comments

@Lockszmith-GH

OS Version

Debian GNU/Linux 12 (bookworm)

System Information

Linux ha 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux

What happened?

Whenever the supervisor is updated, the system reports Unhealthy system - Not privileged.
Resolving this is easy: all I need to do is restart the supervisor service from the OS prompt. I actually have a cron job that does this once a day.

I've been reading the threads, but I was not able to pinpoint the root cause of this issue.

This bug report has 2 goals:

  1. If there is a known root cause, one that I can test for, identify it and remove it - I would love to know.
  2. I would really like a sensor stating whether the system is healthy or not. Then I could automate log collection, probably work around the issue, and pinpoint the exact set of circumstances under which this happens.

The only method I have right now is:

docker logs hassio_supervisor 2>&1 \
| grep -q "System is running in an unhealthy state and needs manual intervention!" \
&& sudo systemctl restart hassio-supervisor.service

It works, but it's very crude. I put it in a cron job because this has become really annoying.
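For what it's worth, the crude workaround above can be hardened a little without changing its approach. A minimal sketch, assuming the sentinel log line and the standard service/container names are unchanged across supervisor versions; the `ha-health` syslog tag is invented for illustration:

```shell
#!/bin/sh
# Sketch of the cron-job workaround above: restart only when the sentinel
# line is present, and leave a note in syslog so restarts are traceable.
# Assumption: the sentinel string matches your supervisor version's output.

SENTINEL='System is running in an unhealthy state and needs manual intervention!'

unhealthy() {
    # True when the log text on stdin contains the sentinel line.
    grep -qF "$SENTINEL"
}

# Guarded so the script is a harmless no-op on hosts without docker.
if command -v docker >/dev/null 2>&1; then
    if docker logs hassio_supervisor 2>&1 | unhealthy; then
        logger -t ha-health 'supervisor unhealthy; restarting hassio-supervisor'
        sudo systemctl restart hassio-supervisor.service
    fi
fi
```

Keeping the match in a function makes it easy to extend the check later (e.g. to also look for the "not healthy - privileged" job-blocked message) without touching the restart logic.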

Machine Type

generic-x86-64

Installer output

N/A

Relevant log output

Dec 19 10:05:54 ha hassio_supervisor[621]: 23-12-19 10:05:54 CRITICAL (MainThread) [supervisor.core] System is running in an unhealthy state and needs manual intervention!

ADR

  • I have read through the ADR and have confirmed that my system is compliant with the requirements
  • I understand that if my system is found to not be compliant, my issue will be closed immediately without further investigation

Code of Conduct

@Lockszmith-GH Lockszmith-GH added the bug Something isn't working label Dec 19, 2023
@ikifar2012
Member

The main issue this is being tracked under is home-assistant/supervisor#4381. It appears to happen only under bookworm, and no root cause has been identified yet. The only thing I can say for certain is that it seems to happen whenever the container is recreated, and it resolves itself with a restart - until the next supervisor update, when the container is recreated yet again.

@Lockszmith-GH
Author

Thanks for confirming, @ikifar2012.

So it seems that, at least at this point, I should ask the maintainers for a system state that reports this unhealthy condition - something I can send an alert on, and track.
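In the meantime, a poll-based check along those lines can be sketched with the ha CLI. This is an assumption-laden sketch: it parses the plain-text `healthy:` field that `ha supervisor info` prints (the field name could change between versions), and the `ha-health` syslog tag is made up for illustration:

```shell
#!/bin/sh
# Poll `ha supervisor info` and log when the supervisor reports unhealthy.
# Assumption: the output contains a line of the form "healthy: true|false".

health_field() {
    # Extract the value of the "healthy:" line from text on stdin.
    awk '$1 == "healthy:" { print $2 }'
}

# Guarded so this is a no-op on hosts without the ha CLI.
if command -v ha >/dev/null 2>&1; then
    state=$(ha supervisor info 2>&1 | health_field)
    if [ "$state" = "false" ]; then
        logger -t ha-health 'supervisor reports healthy: false'
    fi
fi
```

Dropped into cron (or a systemd timer), this gives a crude alertable signal until a proper sensor exists.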

@donaldguy

See home-assistant/supervisor#4839 for my investigation of the concerning things that seem to be happening here.

@donaldguy

donaldguy commented Jan 28, 2024

The "short" version of that is

after a reboot with an existing hassio_supervisor container:

$ uptime
 18:44:13 up 1 min,  1 user,  load average: 1.65, 0.75, 0.28
$ docker logs hassio_supervisor 2>&1 | tail -2
24-01-25 18:44:04 INFO (MainThread) [supervisor.resolution.fixup] System autofix complete
24-01-25 18:44:04 INFO (MainThread) [supervisor.host.manager] Host information reload completed
$ ha supervisor info 2>&1 | grep 'healthy\|supported\|^version'
healthy: true
supported: true
version: 2023.12.1
version_latest: 2023.12.1
$ docker inspect hassio_supervisor -f '{{.HostConfig.SecurityOpt}}'
[apparmor=hassio-supervisor label=disable]

big "but"

docker inspect hassio_supervisor --format='{{.State.Pid}}' | \
    xargs pstree -p |  grep -o '([0-9]\+)' | sed -e 's/(/-p /' -e 's/)//' | paste -sd ' ' | \
    xargs sudo ps Z

(or, if you prefer / don't want to install pstree, the single-depth version: docker inspect hassio_supervisor --format='{{.State.Pid}}' | xargs ps Z --ppid)

gives

LABEL                               PID TTY      STAT   TIME COMMAND
unconfined                         2503 ?        Ss     0:00 /package/admin/s6/command/s6-svscan -d4 -- /run/service
unconfined                         2638 ?        S      0:00 s6-supervise s6-linux-init-shutdownd
unconfined                         2641 ?        Ss     0:00 /package/admin/s6-linux-init/command/s6-linux-init-shutdownd -d3 -c /run/s6/basedir -g 300
unconfined                         2666 ?        S      0:00 s6-supervise s6rc-fdholder
unconfined                         2667 ?        S      0:00 s6-supervise s6rc-oneshot-runner
unconfined                         2680 ?        Ss     0:00 /package/admin/s6/command/s6-ipcserverd -1 -- /package/admin/s6/command/s6-ipcserver-acces
unconfined                         2722 ?        S      0:00 s6-supervise supervisor
unconfined                         2724 ?        S      0:00 s6-supervise watchdog
unconfined                         2726 ?        Ss     0:00 bash /usr/bin/bashio ./run watchdog
unconfined                         2727 ?        Ssl    0:06 python3 -m supervisor
unconfined                         7424 ?        S      0:00 sleep 300

That is: the AppArmor profile is loaded BUT NOT APPLIED to any supervisor process.

but after supervisor upgrade

(or any other event which causes recreation of hassio_supervisor container on a running system)

odroidn2:~:# docker inspect hassio_supervisor --format='{{.State.Pid}}' | \
  xargs pstree -p |  grep -o '([0-9]\+)' | sed -e 's/(/-p /' -e 's/)//' | paste -sd ' ' | \
  xargs ps Z
LABEL                               PID TTY      STAT   TIME COMMAND
hassio-supervisor (enforce)        9035 ?        Ss     0:00 /package/admin/s6/command/s6-svscan -d4 -- /run/service
hassio-supervisor (enforce)        9077 ?        S      0:00 s6-supervise s6-linux-init-shutdownd
hassio-supervisor (enforce)        9079 ?        Ss     0:00 /package/admin/s6-linux-init/command/s6-linux-init-shutdownd -d3 -c /run/s6/basedir -g 300
hassio-supervisor (enforce)        9086 ?        S      0:00 s6-supervise s6rc-fdholder
hassio-supervisor (enforce)        9087 ?        S      0:00 s6-supervise s6rc-oneshot-runner
hassio-supervisor (enforce)        9095 ?        Ss     0:00 /package/admin/s6/command/s6-ipcserverd -1 -- /package/admin/s6/command/s6-ipcserver-acces
hassio-supervisor (enforce)        9135 ?        S      0:00 s6-supervise supervisor
hassio-supervisor (enforce)        9137 ?        S      0:00 s6-supervise watchdog
hassio-supervisor (enforce)        9138 ?        Ssl    0:03 python3 -m supervisor
hassio-supervisor (enforce)        9142 ?        Ss     0:00 bash /usr/bin/bashio ./run watchdog
hassio-supervisor (enforce)        9773 ?        S      0:00 sleep 30

odroidn2:~:# ha supervisor info 2>&1 | grep 'healthy\|supported\|^version'
healthy: false
supported: true
version: 2023.12.1
version_latest: 2023.12.1

odroidn2:~:# docker logs hassio_supervisor 2>&1 | tail -3
24-01-25 20:07:52 INFO (MainThread) [supervisor.host.manager] Host information reload completed
24-01-25 20:07:52 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
24-01-25 20:07:52 INFO (MainThread) [supervisor.jobs] 'ResolutionFixup.run_autofix' blocked from execution, system is not healthy - privileged

odroidn2:~:# docker logs hassio_supervisor 2>&1 | grep 'CRITICAL\|ERROR\|WARNING'
24-01-25 20:07:42 CRITICAL (MainThread) [supervisor.hardware.monitor] Not privileged to run udev monitor!
24-01-25 20:07:42 WARNING (MainThread) [supervisor.dbus.rauc] Host has no rauc support. OTA updates have been disabled.
24-01-25 20:07:42 WARNING (MainThread) [supervisor.dbus.agent] No OS-Agent support on the host. Some Host functions have been disabled.
24-01-25 20:07:52 CRITICAL (MainThread) [supervisor.core] System is running in an unhealthy state and needs manual intervention!

The AppArmor profile is now in effect - and it is not sufficient for udev_monitor_new_from_netlink (3) (specifically, the socket (2) call socket(PF_NETLINK, SOCK_RAW|SOCK_CLOEXEC|SOCK_NONBLOCK, NETLINK_KOBJECT_UEVENT) inside it).

@donaldguy

Thus there appear to be two important issues:

  1. Why is the apparmor profile, as written, not sufficient for this udev socket open on Debian 12?

    • (while remaining sufficient on HAOS ... one assumes/hopes)
    • My best guess (which I should test, but haven't yet) is that it's the deny network raw on line 8; that, or the policy needs to be applied to the actual pyc in pyudev and/or libudev.so, not just the python interpreter; or there is a new capability needed?
  2. Why is the apparmor profile not being applied to the container's processes on first start after boot? (Also, for me, no longer to siblings of ns pid1 spawned by docker exec.)

    • (Is this a docker/containerd bug? A kernel bug? A systemd bug?)

    • (Or is it the initialization chain of hassio-apparmor being racy? As I demonstrated in this comment, it does at least seem like containerd will not start the container when the profile is not loaded - but I'm thinking maybe there is a race between allocation of the profile's name and restore of its contents? [And that the apparmor module short-circuits application of a profile if it's found empty at the time?])

@donaldguy

donaldguy commented Jan 28, 2024

I think I ended up leaving it out of the version posted in there, but it's perhaps worth noting that with the profile applied, pyudev.Context().list_devices() still works fine [that is, I believe: udev_enumerate_new (3) and udev_list_entry_get_next (3)] in the supervisor process itself (with debug/debug_wait).

So the problem seems to be less the udev part writ large than the (monitor_new_from_)netlink part in particular.

(udevadm monitor in the supervisor container (as well as in the hassio_audio container under docker-default apparmor) still works - but that appears to be by virtue of the Ux vs. ix the policy designates it with.)

@donaldguy

donaldguy commented Jan 28, 2024

speaking to OP's workaround:

For my part, systemctl restart hassio-supervisor et al. are not sufficient (at least not consistently1) to get from "bad" to "good" - only a full system reboot is. But this may have changed at some point since I started encountering the issue.

Conversely, while I was writing home-assistant/supervisor#4839, I could have sworn that at some point a "soft" restart was initially, but not repeatably, sufficient to get from "good" to "bad" (though that could, for example, be something like /run/supervisor/startup-marker causing a recreate instead of a docker stop / docker start).

Footnotes

  1. Though one early time I was looking into this, the act of editing config.json or using ha supervisor options to set debug and debug_block, and then bootstrapping my first supervisor vscode environment, did appear to immediately sidestep the failure - launching it to full heisenbug status. I do not know if that suggests an even more subtle timing issue (such as cleanup of a last lingering TCP_WAIT-ish supervisor socket?) or if I just, somewhere along the way that time, e.g. rebooted the machine without really noting it as part of what I did.

@donaldguy

But, to stop my thread-spamming: I would be very curious whether other people experiencing this sort of issue can replicate (or refute) my catching the processes unconstrained.

(using docker inspect hassio_supervisor --format='{{.State.Pid}}' | xargs ps Z --ppid, the deeper pstree|sed version, or just aa-status/apparmor_status - though with those you are looking for the presence/absence of the hassio-supervisor profile in the list of processes)
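One low-dependency way to run that check is to read the AppArmor label straight from /proc rather than via ps Z. A sketch, assuming the standard hassio_supervisor container name; aa_label is a hypothetical helper name, not an existing tool:

```shell
#!/bin/sh
# Read the AppArmor label of the supervisor container's init process from
# /proc. Expect "hassio-supervisor (enforce)" when the profile is applied,
# "unconfined" when it is not (matching the ps Z output shown above).

aa_label() {
    # AppArmor label of a pid; "unknown" when the attr file is unreadable.
    cat "/proc/$1/attr/current" 2>/dev/null || echo unknown
}

# Guarded so the script is a harmless no-op on hosts without docker.
if command -v docker >/dev/null 2>&1; then
    pid=$(docker inspect hassio_supervisor --format '{{.State.Pid}}' 2>/dev/null)
    if [ -n "$pid" ]; then
        echo "supervisor pid $pid: $(aa_label "$pid")"
    fi
fi
```

This only inspects the namespace pid 1; to cover the whole tree (and the docker exec siblings mentioned above), you would still want the pstree variant.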

@rlue

rlue commented Feb 17, 2024

I just installed hassio supervised on a fresh Debian (bookworm) system and I'm happy to help debug, but much of this thread is over my head. What I can say with certainty is:

For my part systemctl restart hassio-supervisor et al are not sufficient (at least not consistently) to get from "bad" to "good", only a full system reboot

This was also my experience. I'm happy to run any commands and provide their output for corroboration.


There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates.
Please make sure to update to the latest version and check if that solves the issue. Let us know if that works for you by adding a comment 👍
This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

@okin

okin commented Apr 17, 2024

home-assistant/version#370 fixed home-assistant/supervisor#4381.
Is this issue fixed by that PR as well?

@donaldguy

I would fully expect so - but, you know, it needs to roll out first.

For what it's worth, I belatedly applied my suggested workaround of pulling the profile entirely and haven't had an issue since, though technically that's a different thing.

@ikifar2012 ikifar2012 assigned ikifar2012 and unassigned ikifar2012 Apr 22, 2024
@ikifar2012
Member

home-assistant/version#370 fixed home-assistant/supervisor#4381. Is this issue fixed by the PR aswell?

Still awaiting home-assistant/version#372 to be merged as well... We will see after the next supervisor update.
