Upgrade to OS 5.9 on OVA caused corrupted EFI folder #1125
After repairing the boot partition, the EFI file system seemed intact, but I am stuck on the barebox bootloader with 100% CPU usage. Update: booting manually via the GRUB command line as specified in system1 reveals:
There are two partitions (A/B update system); you might have booted the old 5.8 release. Did you by chance have to reset/force power off the VM? Can you reproduce the issue? Yours is not the only report along those lines, see #1092. I use libvirt (which uses KVM underneath) and did a bunch of updates using the OVA, but I wasn't able to reproduce this issue.
Which version did you upgrade from?
5.8 to 5.9 via the UI. I booted system0 and system1 via GRUB; let me know if there are other procedures to follow for booting specific versions. The VM was not forced off by me: it did the update, corrupted the EFI, and rebooted. When I looked at VNC, it said it cannot find the boot entry.
Just to clarify, from my perspective there are the following multiple failures:
If there are any log files I can provide, let me know. I will try to reproduce this issue to extract some more data. I should also mention that the initial EFI corruption broke OVMF detection on Proxmox 6.2; the disk had to be migrated to a new VM to be detected, even with the repaired EFI.
MAJOR UPDATE:
The ONLY issue was the EFI corruption, although the cause remains unknown.
HassOS EFI Recovery Guide
If your EFI is corrupted (you get a message like "cannot find QEMU HARDDISK", etc.), this procedure may help:
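The detailed steps of the recovery guide are not reproduced in this excerpt. As a rough sketch of this kind of ESP recovery (not necessarily the author's exact procedure; the device names, the partition label, and the idea of copying boot files from a fresh release image are assumptions):

```sh
# Run from a live Linux ISO attached to the broken VM; device names are examples.
# 1. Identify the EFI System Partition (the small FAT partition).
lsblk -o NAME,SIZE,FSTYPE,LABEL /dev/sda

# 2. Recreate the FAT file system on the corrupted ESP (destructive!).
mkfs.vfat -F 32 -n hassos-boot /dev/sda1

# 3. Attach a known-good HAOS disk image of the same release as a second disk
#    (here /dev/sdb) and copy its boot partition contents over.
mkdir -p /mnt/good /mnt/broken
mount -o ro /dev/sdb1 /mnt/good
mount /dev/sda1 /mnt/broken
cp -a /mnt/good/. /mnt/broken/
sync
umount /mnt/good /mnt/broken
```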
Thanks for posting the update and the instructions on how to recover the boot partition! I changed to use the …
I am unable to reproduce the issue. The only thing I could suggest is to implement a sanity check after writing to the boot partition: look for the bootx64.efi file and check whether the write failed (?). I did a clean install of 5.8 and updated via the CLI to 5.9; it worked normally on the latest Proxmox. Very strange. Could it be some incompatibility between specific old versions of Proxmox and barebox, maybe?
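As a hedged illustration of the suggested sanity check (this is not existing HAOS code; the partition label and mount point are assumptions), a post-write verification could look like this:

```sh
#!/bin/sh
# Hypothetical check run right after the boot partition has been rewritten.
BOOT_DEV=/dev/disk/by-label/hassos-boot   # assumed label of the ESP
MNT=$(mktemp -d)

mount -o ro "$BOOT_DEV" "$MNT" || { echo "boot partition not mountable"; exit 1; }

# The UEFI fallback loader must be present, otherwise the firmware has nothing to boot.
if [ ! -f "$MNT/EFI/BOOT/BOOTX64.EFI" ]; then
    echo "sanity check failed: BOOTX64.EFI missing, refusing to reboot"
    umount "$MNT"
    exit 1
fi

umount "$MNT"
echo "boot partition sanity check passed"
```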
Same issue here running on a libvirt VM. |
Same thing happened to me running HassOS as a VM on an Unraid server (6.8.3). The fix from @mj-sakellaropoulos worked like a charm. (Thank you!) So my hass-boot partition was also corrupted. When reverting to a backup image, I did end up reproducing the problem. This was when I updated OS 3.12 to 5.9. All seemed fine while updating, but after power cycling the VM, nothing. It wouldn't boot any more. So it could be, as @mj-sakellaropoulos suggested, compatibility issues between older versions?
I had the same issue upgrading from 4.17 running on an ESXi VM. I didn't bother recovering the EFI and simply rolled back my hard drive image to the backup.
There are incident reports on the internet where people report that fsck.(v)fat actually leads to problems rather than file system fixes. Around the time when Home Assistant OS added fsck.fat for the boot partition, reports of empty boot partitions or files with weird filenames started to appear. This could be caused by fsck.fat. Disable fsck on the boot partition.
Same again on 5.10 (as iopenguin mentioned already). Interestingly, this time the machine booted fine after the update and was stable, but after a power cut it failed to come back online. I guess the EFI wasn't needed for a soft reboot after the update.
I don't think that's an fsck issue, but a failure to properly sync the data from the write cache of the VM to the storage while the VM shuts down. I've seen this before with libvirt and fixed it in my setups by running a sync, and waiting a while with sleep after an update has been completed. Otherwise the system partition might end up looking corrupted.
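A minimal sketch of the sync-and-wait approach described above (the exact commands aren't quoted in the comment, so treat this as an assumption):

```sh
# Inside the guest, right after an OS update has finished writing:
sync        # flush dirty buffers for the virtual disk
sleep 10    # give the hypervisor/host write cache time to reach stable storage
reboot
```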
I'm running a VM on Proxmox and this presents as clearing the MBR but leaving GPT intact. From a Proxmox shell I can use fdisk on the zvol that hosts the VM and it will rebuild the MBR, which lets the VM boot. But I have to do this after every HA OS upgrade. |
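For reference, rewriting the partition table with fdisk from the Proxmox host could look roughly like the following; the zvol path is only an example, and writing the unchanged table back is what recreates the protective MBR on a GPT disk:

```sh
# On the Proxmox host; the zvol path depends on your pool and VM ID.
fdisk /dev/zvol/rpool/data/vm-100-disk-0
# Inside fdisk: 'p' prints the (GPT) partition table, 'w' writes it back
# unchanged, which restores the protective MBR in sector 0.
```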
Having the same issue under Proxmox. Happens every other week. |
Then just write out the MBR again and try again. |
That is what I thought as well, but we also see it on Intel NUCs. Also, when rebooting, a proper sync should be done anyway, and at least some people claimed they did a proper reboot but still experienced the issue...
This is essentially what #1101 does, by mounting the whole partition sync. That went into 5.9, but the issue still appeared afterwards. @ahknight just to clarify: the image uses UEFI to boot, there is no "MBR". MBR is a DOS partition table/BIOS concept. In UEFI, there is just a FAT partition called the EFI System Partition (ESP), which has to have the right files in the right place. The UEFI BIOS then picks up the boot loader from there. No "magic" master boot record (MBR) needed. I guess you refer to the ESP here. @GJT to fix a qcow2 image, you can follow the instructions in #1125 (comment).
@agners interesting. We might experience two separate issues here: Consumer-grade SSDs do have a write cache which is not protected by a battery backup. If the shutdown process is (basically) too fast, we might write this to the write cache and cut the power to the device before the SSD had time to flush it to permanent storage. There are some Intel SSDs which have a power backup built in to avoid this; they call this "enhanced power-loss data protection". It's probably pretty racy in most setups, so we might have this issue everywhere, but it only shows symptoms with a very small probability. We could debug this by writing a file unsynced when the shutdown is initiated. If it's gone when we start up, we know something fishy is going on; if it's still there, we delete it. Anyway, I think we could mitigate this issue if we use a hook on shutdown when the FSes are unmounted. If we just add a, say, 5-second sleep afterwards, even the slowest SSD should have plenty of time to write everything from the write cache to the disk.
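A hedged sketch of the canary-file idea described above (file path and hook placement are assumptions, not existing HAOS hooks):

```sh
# Shutdown hook: write a canary file, intentionally without calling sync.
date > /mnt/data/.shutdown-canary

# Startup check on the next boot:
if [ -f /mnt/data/.shutdown-canary ]; then
    rm /mnt/data/.shutdown-canary    # the unsynced write survived, caching looks fine
else
    echo "shutdown canary missing: unsynced writes were lost" | logger -t canary
fi
```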
Referenced in home-assistant#1190: Disable fsck on the boot partition (same commit message as quoted above).
@agners I know how the startup process is supposed to work. However, I'm explaining what I did. Proxmox got stuck in an EFI boot prompt loop until I SSHed into the Proxmox host, ran fdisk against the zvol, and wrote the partition table back out. We can argue about "should" forever, but that did do it. Repeatedly.
For me this issue occurs every 2-4 weeks: the system becomes unresponsive out of the blue and I'm greeted with the corrupt EFI on a reset. It even occurs on different Proxmox cluster nodes that use different storage systems.
Thanks @mj-sakellaropoulos, your recovery instructions worked well, and I managed to recover a botched upgrade.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I've just experienced this, and I last upgraded from 6.0 to 6.1. I'm not convinced that an upgrade is causing this for me, as I successfully rebooted the VM multiple times after the upgrade with no issues. What I did notice yesterday is that the filesystem had gone read-only and required a reboot. After the reboot it seemed to run fine, but when I woke up this morning the VM was powered off and required rebuilding the partition table in order to boot again. This seems to be a semi-weekly occurrence for me.
There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates. |
I'm late to the party, but I still experience this regularly when rebooting the host machine. My temporary fix is:
Did this happen with a recent OS version? |
100% for /dev/root is normal since we use a read-only squashfs as root file system. |
There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates. |
@agners wrote:
Yeah. The issue is outside of the VM. The write request gets cached for "performance reasons", including the fsync. When the machine is turned off, the cache won't get flushed but instead just discarded. I've seen this a lot of times in KVM; I'm not sure if that's a Linux kernel bug or something in the emulation layer of the disk itself. The last time I saw this was about 3 years ago. I just keep the machines running for several minutes before rebooting, which fixed this for me.
"gets cached" by whom? If it's hardware, then it's broken hardware. The OS needs to be able to rely on flushes reaching the underlying non-volatile storage, otherwise the whole house of cards falls apart (journaling file systems won't be able to implement consistency guarantees, databases' ACID properties break). If it's the VM's virtual disk driver, then that VM disk driver is buggy or reckless. Granted, you might want such an option so you can trade reliability for performance if you really don't need any reliability (e.g. for testing). But it shouldn't be the default, and it should not be configured for Home Assistant OS :) KVM/QEMU has quite a few tunables in that domain. SUSE seems to have a nice write-up about the options. I highly doubt, though, that "non-safe" options are used in Proxmox by default...
So my assumption is that the virtual hard drive does write-back caching for performance reasons and does not fully flush the cache before the virtual hard drive is destroyed. Linux also writes something out like:
On certain hardware. Which is just not true for SD cards, which do some kind of write-back caching.
Yeah, not by intention but because of a bug somewhere. Making sure that writes are atomic without data journaling / copy-on-write is kinda hard. I stopped using ext4 on LVM for this reason and switched to ZFS, and the issue went away.
Btw, ext4 has a mount option to fix some application issues: auto_da_alloc. But I don't think this will cover block-based replacements.
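For reference, auto_da_alloc is a standard ext4 mount option (and has been enabled by default for a long time); enabling it explicitly would look like this, with the device and mount point as placeholders:

```sh
# Enable ext4's auto_da_alloc heuristic on an existing mount.
mount -o remount,auto_da_alloc /dev/sda3 /mnt/data

# Or persistently via /etc/fstab:
# /dev/sda3  /mnt/data  ext4  defaults,auto_da_alloc  0  2
```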
Yeah, that would explain it, but it would be a big fat bug IMHO. I mean, just throwing away caches when the VM gets destroyed seems like a major oversight. I doubt that this is what is going on.
This issue is about virtual machines though. Also, SD cards are exposed as mmcblk. I don't think that the kernel makes such assumptions for those types of devices.
Keep in mind that the boot partition is FAT. Also, it is mounted sync now, so writes should go out immediately today. With OS 8.x we switch to the GRUB2 boot loader and to the latest Linux kernel; let's see if reports appear with that combination.
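A quick way to confirm the sync mount from a HAOS shell (the exact device and mount point may differ, so treat the example output as an assumption):

```sh
# List FAT mounts and their options; the boot/EFI partition should show "sync".
grep vfat /proc/mounts
# Example output: /dev/sda1 /mnt/boot vfat rw,sync,... 0 0
```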
I tried to update from Hass OS 8.2 to 8.4 today (Proxmox VM). |
I haven't noticed this issue recently, but the following has always worked when falling into the EFI shell: #1125 (comment)
I have no folder /dev/zvol/rpool/. Do I need to do it for the data disk, or also for the EFI disk (disk-1)?
I've only ever done it on the data disk, but my disk was ZFS. Be sure to take a backup of the VM before trying it on your LVM disk.
Odd that nobody else with LVM had this issue and could give a hint.
Just had the issue again, after a long time without problems, when upgrading from 8.2 to 8.4.
We really don't do anything special with that partition other than writing some files to it right before rebooting. Rebooting should properly unmount the disk, which should cause all buffers to be properly flushed. Can you check if the file system checks were all good before the upgrade, e.g. using the following command in the console/HAOS SSH shell:
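The command itself is not preserved in this excerpt; a non-destructive check along those lines might be the following, assuming the boot partition carries the usual hassos-boot label:

```sh
# Read-only FAT consistency check of the boot/EFI partition (-n = make no changes).
fsck.fat -n /dev/disk/by-label/hassos-boot
```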
Hi there. I must report that this bug still seems to be present. After a host (Proxmox 7.3-6) reboot, my Home Assistant VM was not able to boot anymore. I got the message: … The solution, as mentioned there, was to check the disk table with: … After that, the VM was able to boot again.
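The command is not quoted above, but earlier comments in this thread describe inspecting and rewriting the partition table from the host; one non-destructive way to verify a GPT is sgdisk (from the gdisk package), with the device path here only an example:

```sh
# Verify the GPT and protective MBR of the VM disk from the host.
sgdisk --verify /dev/zvol/rpool/data/vm-100-disk-0
```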
Was that a graceful reboot or a power cut? If the former, can you reproduce this with each reboot? |
This was a graceful host reboot. I have not rebooted since (I'm a bit afraid of not being able to properly recover the VM this time), but I will test. FYI, this is the first time I have had this problem in about 3 years and many, many HA updates.
Just updated to OS 5.9 via the UI; the VM no longer boots.
VM is on Proxmox 6.2, OVA, UEFI OVMF.
Upon investigation in Ubuntu, garbage data was found in the EFI folder:
Will dd the boot partition from the release image and report back; I suspect the update process is somehow broken?
(As first reported here: whiskerz007/proxmox_hassos_install#96)