Upgrade to OS 5.9 on OVA caused corrupted EFI folder #1125

Closed
mj-sakellaropoulos opened this issue Dec 23, 2020 · 48 comments
Labels
board/ova Open Virtual Appliance (Virtual Machine) stale

Comments

@mj-sakellaropoulos

Just updated to OS 5.9 via the UI; the VM no longer boots.
VM is on Proxmox 6.2, OVA, UEFI OVMF

Upon investigation in Ubuntu, garbage data was found in the EFI folder.

I will dd the boot partition from the release page and report back; I suspect the update process is somehow broken.

(as first reported here: whiskerz007/proxmox_hassos_install#96)

@mj-sakellaropoulos
Author

mj-sakellaropoulos commented Dec 23, 2020

After repairing the boot partition, the EFI file system seemed intact, but I am stuck on the barebox bootloader with 100% CPU usage.

Update: Booting system1 manually via the GRUB command line reveals the system is completely broken; the update never completed (os-release still shows 5.8). docker, homeassistant, networkmanager, and other services do not start.

@agners
Member

agners commented Dec 23, 2020

There are two system partitions (A/B update system); you might have booted the old 5.8 release.

Did you by chance have to reset/force power off the VM? Can you reproduce the issue? Yours is not the only report along those lines, see #1092. I use libvirt (which uses KVM underneath) and did a bunch of updates using the OVA, but I wasn't able to reproduce this issue.

@agners agners added the board/ova Open Virtual Appliance (Virtual Machine) label Dec 23, 2020
@agners
Member

agners commented Dec 23, 2020

Which version did you upgrade from?

@mj-sakellaropoulos
Author

5.8 to 5.9 via UI

I booted system0 and system1 via GRUB; let me know if there are other procedures to follow for booting specific versions.

The VM was not forced off by me; it did the update, corrupted the EFI, and rebooted. When I looked at the VNC console, it said it could not find a boot entry.

@mj-sakellaropoulos
Author

mj-sakellaropoulos commented Dec 23, 2020

Just to clarify, from my perspective there are multiple failures here.

If there are any log files I can provide, let me know.

I will try to repro this issue to extract some more data.

I should also mention that the initial EFI corruption broke OVMF detection on Proxmox 6.2; the disk had to be migrated to a new VM to be detected, even with the repaired EFI.

I have updated Proxmox to the latest version (6.3).
I have installed 5.8 by importing the qcow2 into Proxmox; the barebox bootloader is still broken.

@mj-sakellaropoulos
Author

mj-sakellaropoulos commented Dec 23, 2020

MAJOR UPDATE:

  • The bootloader hang is caused by having an IDE DVD-ROM attached for debugging; remove the IDE device and barebox works properly
  • The system update worked, and the system is fully functional as long as it is booted by barebox (I just need to manually start Docker with systemctl start docker)

The ONLY issue was the EFI corruption, although the cause remains unknown.
Some hints: the directory listing of the corrupted EFI contains strings like "Attempt 7", which are found in NvVars and are also part of the barebox boot process (?)

HassOS EFI Recovery Guide

If your EFI is corrupted (you get a message like "cannot find QEMU HARDDISK"), this procedure may help:

  • Attach an Ubuntu live CD ISO to the VM, boot to the desktop, and open a terminal
  • Run sudo fdisk -l /dev/sda and make sure hassos-boot is /dev/sda1 and is the same size as /dev/nbd0p1 later on (see the size-check sketch after this list)
  • Follow these steps to recover the EFI partition:
wget https://github.com/home-assistant/operating-system/releases/download/5.9/hassos_ova-5.9.qcow2.xz
sudo su
apt install qemu-utils
modprobe nbd max_part=10
xz -d -v ./hassos_ova-5.9.qcow2.xz
qemu-nbd --connect=/dev/nbd0 ./hassos_ova-5.9.qcow2

# Dangerous step is next, double check partition size and BACKUP your disk FIRST !
fdisk -l
dd if=/dev/nbd0p1 of=/dev/sda1
  • Now mount the repaired /dev/sda1 to verify its contents:
mkdir /mnt/hass-boot
mount /dev/sda1 /mnt/hass-boot
ls -al /mnt/hass-boot/EFI
  • If everything looks good, unmount and shut down
umount /mnt/hass-boot
qemu-nbd --disconnect /dev/nbd0
shutdown now
  • IMPORTANT: Remove the ISO and IDE DVD from the VM before rebooting
  • The VM should boot normally, you may need to systemctl start docker to get the hassio CLI working
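
A minimal sketch of the size check from the fdisk step above, assuming /dev/sda1 is hassos-boot and /dev/nbd0p1 is the boot partition from the downloaded release image:

# Compare the two partition sizes in bytes; only run the dd step if they match.
blockdev --getsize64 /dev/nbd0p1
blockdev --getsize64 /dev/sda1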

@agners
Member

agners commented Dec 24, 2020

Thanks for posting the update and the instructions on how to recover the boot partition!

I changed to using the sync mount option when mounting the boot partition, hoping things get written out immediately after an update (see #1101). Although, since you did not force off/force reboot, there must have been something else causing the corruption. Maybe it is some sort of kernel bug. If it is that, then I hope the latest Linux kernel stable update (part of 5.9) fixes it. But if you have hints/ideas about what could have caused the corruption in the first place (or if you have a process to reproduce it), I would be very interested to hear them.
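
For illustration, mounting a FAT boot partition with the sync option looks roughly like this (the device path is an example, not the actual HAOS mount unit):

mount -t vfat -o sync /dev/sda1 /mnt/boot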

@mj-sakellaropoulos
Author

I am unable to reproduce the issue. The only thing I could suggest is to implement a sanity check after writing to the boot partition: look for the bootx64.efi file and flag the update as failed if it is missing (?)

I did a clean install of 5.8 and updated via the CLI to 5.9; it worked normally on the latest Proxmox. Very strange. Could it be some incompatibility between specific old versions of Proxmox and barebox, maybe?
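
A hypothetical post-update sanity check along those lines could look like the following sketch (the mount point is illustrative and this is not the actual updater code):

# Mount the freshly written boot partition read-only and verify the EFI loader exists.
mkdir -p /tmp/boot-check
mount -o ro /dev/disk/by-label/hassos-boot /tmp/boot-check
if [ ! -s /tmp/boot-check/EFI/BOOT/bootx64.efi ]; then
    echo "boot partition verification failed after update" >&2
fi
umount /tmp/boot-check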

@tumd

tumd commented Jan 5, 2021

Same issue here, running on a libvirt VM.
I seem to have been able to restore the boot partition with help from @mj-sakellaropoulos's informative post.

@markkamp

markkamp commented Jan 8, 2021

Same thing happened to me running HassOS as a VM on an Unraid server (6.8.3). The fix from @mj-sakellaropoulos worked like a charm (thank you!). So my hass-boot partition was also corrupted.

When reverting to a backup image, I did end up reproducing the problem. This was when I updated OS 3.12 to 5.9. All seemed fine while updating, but after power cycling the VM, nothing; it wouldn't boot any more. So it could be, as @mj-sakellaropoulos suggested, a compatibility issue between older versions?

@lexathon

I had the same thing upgrading from 4.17 on an ESXi VM. I didn't bother recovering the EFI and simply rolled back my hard drive image to the backup.

agners added a commit to agners/operating-system that referenced this issue Jan 28, 2021
There are incident reports on the internet where people report that
fsck.(v)fat actually leads to problems rather than file system fixes. Around
the time when Home Assistant OS added fsck.fat for the boot partition,
reports of empty boot partitions or files with weird filenames started
to appear. This could be caused by fsck.fat.

Disable fsck on the boot partition.
@lexathon

Same again on 5.10 (as iopenguin already mentioned). Interestingly, this time the machine booted fine after the update and was stable, but after a power cut it failed to come back online. I guess the EFI wasn't needed for a soft reboot after the update.
I used @mj-sakellaropoulos's workaround (using 5.10) to recover the EFI on this occasion, as I'd made some changes I wanted to keep. Thanks for that, by the way.

agners added a commit that referenced this issue Jan 29, 2021
agners added a commit that referenced this issue Jan 30, 2021
@RubenKelevra

RubenKelevra commented Feb 8, 2021

I don't think that's an fsck issue, but a failure to properly sync the data from the write cache of the VM to the storage while the VM shuts down.

I've seen this before with libvirt and fixed it in my setups by running a sync as root on the machines after anything has been updated.

And by waiting a while with sleep after an update has completed.

Otherwise the system partition might end up looking like this: [screenshots of garbled directory listings]
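
The workaround described above boils down to something like this, run as root inside the guest right after an update (the 5-second value is an arbitrary choice):

sync      # flush dirty pages to the virtual disk
sleep 5   # give the host storage stack time to commit its write cache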

@ahknight

I'm running a VM on Proxmox, and this presents as clearing the MBR but leaving the GPT intact. From a Proxmox shell I can run fdisk on the zvol that hosts the VM and it will rebuild the MBR, which lets the VM boot. But I have to do this after every HA OS upgrade.

@GJT

GJT commented Feb 22, 2021

Having the same issue under Proxmox. Happens every other week.
@ahknight What command do you run exactly to fix it?

@ahknight

$ gdisk /dev/zd##

Then just write out the MBR again and try again.
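
For reference, the interactive session looks roughly like this (## is the zvol number; the GPT is still intact, so gdisk rebuilds the protective MBR in memory and w writes it back out):

gdisk /dev/zd##
Command (? for help): w
Do you want to proceed? (Y/N): y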

@agners
Member

agners commented Feb 25, 2021

@RubenKelevra

I don't think that's an fsck issue, but a failure to properly sync the data from the write cache of the VM to the storage while the VM shuts down.

That is what I thought as well, but we also see it on Intel NUCs. Also, a proper sync should be done on a reboot anyway, and at least some people claimed they did a proper reboot but still experienced the issue...

I've seen this before with libvirt and fixed it in my setups by running a sync as root on the machines after anything as been updated.

This is essentially what #1101 does, by mounting the whole partition sync. That went into 5.9, but the issue still appeared afterwards.

@ahknight just to clarify: the image uses UEFI to boot, so there is no "MBR". MBR is a DOS partition table/BIOS concept. In UEFI, there is just a FAT partition called the EFI System Partition (ESP), which has to have the right files in the right place. The UEFI firmware then picks up the boot loader from there. No "magic" master boot record (MBR) is needed. I guess you are referring to the ESP here.

@GJT to fix a qcow2 image, you can follow the instructions in #1125 (comment).

@RubenKelevra

@agners interesting.

We might experience two separate issues here:

Consumer-grade SSDs have a write cache which is not protected by a battery backup.

If the shutdown process is (basically) too fast, we might write this to the write cache and cut the power to the device before the SSD has had time to flush it to permanent storage.

There are some Intel SSDs which have a power backup built in to avoid this.

They call this "enhanced power-loss data protection".

It's probably pretty racy in most setups, so we might have this issue everywhere, but it only shows symptoms with a very small probability.

We could debug this by writing a file, unsynced, when the shutdown is initiated. If it's gone when we start up, we know something fishy is going on.

If it's still there, we delete it.

Anyway, I think we could mitigate this issue with a hook on shutdown, after the file systems are unmounted: if we just add a, say, 5-second sleep afterwards, even the slowest SSD should have plenty of time to write everything from its write cache to the disk.
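
A sketch of the canary-file debugging idea above (the marker path is illustrative):

# At shutdown: write a marker file without an explicit sync.
date -u +%s > /mnt/data/.shutdown-canary
# At the next boot: a missing marker means unsynced writes were lost.
if [ -f /mnt/data/.shutdown-canary ]; then
    rm /mnt/data/.shutdown-canary
else
    echo "previous shutdown lost unsynced writes" | systemd-cat -p warning
fi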

adeepn pushed a commit to jethome-ru/homeassistant-operating-system that referenced this issue Mar 1, 2021
…ome-assistant#1190)

@ahknight

ahknight commented Mar 5, 2021

@agners I know how the startup process is supposed to work. However, I'm explaining what I did. Proxmox got stuck in an EFI boot prompt loop until I SSHed into the Proxmox host, ran gdisk on the zvol, read the error message that said the MBR was missing, wrote out the MBR that it recovered from the GPT tables, and then started the VM again. Suddenly it worked.

We can argue about what should happen forever, but that did fix it. Repeatedly.

@GJT

GJT commented Mar 21, 2021

For me this issue occurs every 2-4 weeks: the system becomes unresponsive out of the blue and I am greeted with the corrupt EFI on a reset.
I usually roll back to my working snapshot (OS 5.12), which I can reboot as much as I like. But after some time it gets corrupted again, without any updates or changes to the system.

It even occurs on different Proxmox cluster nodes that use different storage systems.

@hcooper

hcooper commented Mar 28, 2021

HassOS EFI Recovery Guide

Thanks @mj-sakellaropoulos, your recovery instructions worked well, and I managed to recover from a botched upgrade.

@stale

stale bot commented Jun 11, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@Eeems

Eeems commented Jul 6, 2021

I've just experienced this, and I last upgraded from 6.0 to 6.1. I'm not convinced that an upgrade is causing this for me, as I successfully rebooted the VM multiple times after the upgrade with no issues. What I did notice yesterday is that the file system had gone read-only and required a reboot. After the reboot it seemed to run fine, but when I woke up this morning the VM was powered off and required rebuilding the partition table in order to boot again.

This seems to be a semi-weekly occurrence for me.

@github-actions

github-actions bot commented Oct 6, 2021

There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates.
Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍
This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

@meichthys

meichthys commented Dec 10, 2021

I'm late to the party, but I still experience this regularly when rebooting the host machine. My temporary fix is:

## Make sure VM is disabled:
ha-manager set vm:<VMID> --state disabled
## Open GDISK to modify disk partition map
gdisk /dev/zvol/rpool/vm-<VMID>-disk-<DISK#>
## Once GDISK opens, just use the `w` command to re-write the partition map
## Re-enable (start) VM to verify the VM boots using the disk
ha-manager set vm:<VMID> --state enabled

@agners
Member

agners commented Dec 10, 2021

Did this happen with a recent OS version?

@agners agners reopened this Dec 10, 2021
@github-actions github-actions bot removed the stale label Dec 10, 2021
@meichthys

meichthys commented Dec 10, 2021

Did this happen with a recent OS version?

Yes: [screenshot showing the current OS version]

For me it happened somewhat regularly after the host machine is rebooted, but last night I noticed it happened even without a host machine reboot.

I did notice this: [screenshot showing /dev/root at 100% use]

@agners
Member

agners commented Dec 10, 2021

100% for /dev/root is normal since we use a read-only squashfs as root file system.

@github-actions

There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates.
Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍
This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

@RubenKelevra

@agners wrote:

This is essentially what #1101 does, by mounting the whole partition sync. That went into 5.9, but the issue still appeared afterwards.

Yeah. The issue is outside of the VM. The write request gets cached for "performance reasons" including the fsync. When the machine is turned off, the cache won't get flushed but instead just discarded.

I've seen this many times with KVM; I'm not sure if that's a Linux kernel bug or something in the emulation layer of the disk itself.

The last time I saw this was about 3 years ago. I just keep the machines running for several minutes before rebooting, which fixed this for me.

@agners
Member

agners commented May 9, 2022

The issue is outside of the VM. The write request gets cached for "performance reasons" including the fsync.

"gets cached" by whom?

If it's hardware, then it's broken hardware. The OS needs to be able to rely on flushes reaching the underlying non-volatile storage, otherwise the whole house of cards falls apart (journaling file systems can't implement their consistency guarantees, and databases' ACID properties break).

If it's the VM's virtual disk driver, then that disk driver is buggy or reckless. Granted, you might want such an option so you can trade reliability for performance if you really don't need any reliability (e.g. for testing). But it shouldn't be the default, and it should not be configured for Home Assistant OS :)

KVM/QEMU has quite a few tunables in that domain. SUSE seems to have a nice write-up about the options. I highly doubt, though, that unsafe options are used in Proxmox by default...
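
For context, the knob in question in plain QEMU is the drive cache mode. A minimal illustration (file name and options are placeholders, not Proxmox's actual configuration):

qemu-system-x86_64 -m 2G -machine q35 \
  -drive file=haos_ova.qcow2,if=virtio,cache=writeback
# cache=writeback (the default) and cache=none still honor guest flush requests;
# cache=unsafe ignores flushes and can lose data in exactly the way described above.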

@RubenKelevra

RubenKelevra commented May 9, 2022

"gets cached" by whom?

So my assumption is that the virtual hard drive does write-back caching for performance reasons and does not fully flush the cache before the virtual hard drive is destroyed.

Linux also writes something out like

sd 0:0:0:0: [sda] No Caching mode page found
sd 0:0:0:0: [sda] Assuming drive cache: write through

on certain hardware, which is just not true for SD cards, which do some kind of write-back caching.

KVM/Qemu has quite some tunables in that domain. SUSE seems to have a nice write-up about the options. I highly doubt though that "non-safe" options are used in Proxmox by default...

Yeah, not by intention but because of a bug somewhere. Making sure that writes are atomic without data journaling / copy-on-write is kind of hard.

I stopped using ext4 on LVM for this reason and switched to ZFS, and the issue went away.

@RubenKelevra

Btw, ext4 has a mount option to fix some application issues: auto_da_alloc. But I don't think this will cover block-based replacements.

Many broken applications don't use fsync() when replacing existing files via patterns such as

    fd = open("foo.new")/write(fd,...)/close(fd)/ rename("foo.new", "foo")

or worse yet

    fd = open("foo", O_TRUNC)/write(fd,...)/close(fd).

If auto_da_alloc is enabled, ext4 will detect the replace-via-rename and replace-via-truncate patterns and force that any delayed allocation blocks are allocated such that at the next journal commit, in the default data=ordered mode, the data blocks of the new file are forced to disk before the rename() operation is committed. This provides roughly the same level of guarantees as ext3, and avoids the "zero-length" problem that can happen when a system crashes before the delayed allocation blocks are forced to disk.

@agners
Member

agners commented May 11, 2022

So my assumption is that the virtual hard drive does write-back caching for performance reasons and does not fully flush the cache before the virtual hard drive is destroyed.

Yeah, that would explain it, but it would be a big fat bug IMHO. I mean, just throwing away caches when the VM gets destroyed seems like a major oversight. I doubt that this is what is going on.

on certain hardware, which is just not true for SD cards, which do some kind of write-back caching.

This issue is about virtual machines, though. Also, SD cards are exposed as mmcblk. I don't think that the kernel makes such assumptions for those types of devices.

I stopped using ext4 on LVM for this reason and switched to ZFS, and the issue went away.

Keep in mind that the boot partition is FAT. Also, it is mounted sync now, so writes should go out immediately today.

With OS 8.x we switched to the GRUB2 boot loader and to the latest Linux kernel; let's see if reports still appear with that combination.

@Sesshoumaru-sama

I tried to update from HassOS 8.2 to 8.4 today (Proxmox VM).
The system did not boot after that, landing in the EFI shell. I had to restore a previously made snapshot and now it's up again. This issue is really worrisome and persistent.

@meichthys

I haven't noticed this issue recently, but the following has always worked when falling into the EFI shell: #1125 (comment)

@Sesshoumaru-sama

Sesshoumaru-sama commented Jul 26, 2022

I haven't noticed this issue recently, but the following has always worked when falling into the EFI shell: #1125 (comment)

I have no folder /dev/zvol/rpool/.
Is it just the path to the VM disks? I have them on LVM, so this?
/dev/pve/vm-100-disk-0 (mapped to ../dm-10)

Do I need to do it for the data disk or also for the EFI disk (disk-1)?

@meichthys

I've only ever done it on the data disk, but my disk was on ZFS. Be sure to take a backup of the VM before trying it on your LVM disk.

@Sesshoumaru-sama

Odd that nobody else with LVM had this issue and could give a hint.
I will try to restore the VM on another Proxmox instance and see what happens. It is really frustrating to have such low-level issues that things don't boot...

@GJT

GJT commented Aug 3, 2022

Just had the issue again, after a long time without problems, when upgrading from 8.2 to 8.4.
Proxmox/ZFS

@agners
Member

agners commented Aug 4, 2022

We really don't do anything special with that partition other than writing some files to it right before rebooting. Rebooting should properly unmount the disk, which should cause all buffers to be properly flushed. Can you check whether the file system checks were all good before the upgrade, e.g. using the following commands in the console/HAOS SSH shell:

journalctl -u "systemd-fsck@dev-disk-by\x2dlabel-hassos\x2dboot.service"
journalctl -u mnt-boot.mount

@GJT

GJT commented Aug 4, 2022

Unfortunately, both only contain entries from after the upgrade.


Going to check next time before an upgrade.

@sylarevan

Hi there. I must report that this bug still seems to be present. After a host (Proxmox 7.3-6) reboot, my Home Assistant VM was not able to boot anymore. I got the message:
BdsDxe: failed to load Boot0001 "UEFI QEMU HARDDISK QM00005" from PciRoot(0x0)/Pci(0x7,0x0)/Sata(0x0,0xFFFF,0x0): Not Found

The solution, as mentioned there, was to check the partition table with:
gdisk /dev/pve/vm-101-disk-1
and then simply
w

After that, the VM was able to boot again.

@agners
Member

agners commented Apr 18, 2023

After a host (proxmox 7.3-6) reboot

Was that a graceful reboot or a power cut?

If the former, can you reproduce this with each reboot?

@sylarevan

This was a graceful host reboot. I have not rebooted since (I'm a bit afraid of not being able to properly recover the VM this time), but I will test. FYI, this is the first time I have had this problem in about 3 years and many, many HA updates.
