Upgrade to OS 5.9 on OVA caused corrupted EFI folder #1125

Closed
mj-sakellaropoulos opened this issue Dec 23, 2020 · 48 comments
Labels
board/ova Open Virtual Appliance (Virtual Machine) stale

Comments

@mj-sakellaropoulos

Just updated to OS 5.9 via the UI; the VM no longer boots.
VM is on Proxmox 6.2, OVA, UEFI OVMF

Upon investigation in Ubuntu, garbage data was found in the EFI folder.

I will dd the boot partition from the release page and report back; I suspect the update process is somehow broken.

(as first reported here: whiskerz007/proxmox_hassos_install#96)

@mj-sakellaropoulos
Author

mj-sakellaropoulos commented Dec 23, 2020

After repairing the boot partition, the EFI file system seemed intact, but I am stuck on the barebox bootloader with 100% CPU usage.

Update: Booting system1 manually via the GRUB command line reveals the system is completely broken; the update never completed (os-release still shows 5.8). docker, homeassistant, networkmanager, and other services do not start.

@agners
Member

agners commented Dec 23, 2020

There are two system partitions (A/B update system); you might have booted the old 5.8 release.

Did you by chance have to reset/force power off the VM? Can you reproduce the issue? Yours is not the only report along those lines, see #1092. I use libvirt (which uses KVM underneath) and did a bunch of updates using the OVA, but I wasn't able to reproduce this issue.

@agners agners added the board/ova Open Virtual Appliance (Virtual Machine) label Dec 23, 2020
@agners
Member

agners commented Dec 23, 2020

Which version did you upgrade from?

@mj-sakellaropoulos
Author

5.8 to 5.9 via UI

I booted system0 and system1 via GRUB; let me know if there are other procedures to follow for booting specific versions.

The VM was not forced off by me; it did the update, corrupted the EFI, and rebooted. When I looked at the VNC console, it said it could not find a boot entry.

@mj-sakellaropoulos
Author

mj-sakellaropoulos commented Dec 23, 2020

Just to clarify, from my perspective there are multiple failures here.

If there are any log files I can provide, let me know.

I will try to repro this issue to extract some more data.

I should also mention that the initial EFI corruption broke OVMF detection on Proxmox 6.2; the disk had to be migrated to a new VM to be detected, even with the repaired EFI.

I have updated Proxmox to the latest version (6.3).
I have installed 5.8 by importing the qcow2 into Proxmox; the barebox bootloader is still broken.

@mj-sakellaropoulos
Author

mj-sakellaropoulos commented Dec 23, 2020

MAJOR UPDATE:

  • The bootloader hang is caused by having an IDE DVD-ROM attached for debugging; remove the IDE device and barebox works properly
  • The system update worked, and the system is fully functional as long as it is booted by barebox (I just need to manually start Docker with systemctl start docker)

The ONLY issue was the EFI corruption, although the cause remains unknown.
Some hints: the directory listing of the corrupted EFI contains strings like "Attempt 7", which are found in NvVars and are also part of the barebox boot process (?)

HassOS EFI Recovery Guide

If your EFI is corrupted (you get a message like "cannot find QEMU HARDDISK"), this procedure may help:

  • Attach an Ubuntu live CD ISO to the VM, boot to the desktop, and open a terminal
  • Run sudo fdisk -l /dev/sda and make sure hassos-boot is /dev/sda1 and is the same size as /dev/nbd0p1 later on (see the size-check sketch after this list)
  • Follow these steps to recover the EFI partition:
wget https://github.com/home-assistant/operating-system/releases/download/5.9/hassos_ova-5.9.qcow2.xz
sudo su
apt install qemu-utils
modprobe nbd max_part=10
xz -d -v ./hassos_ova-5.9.qcow2.xz
qemu-nbd --connect=/dev/nbd0 ./hassos_ova-5.9.qcow2

# Dangerous step is next, double check partition size and BACKUP your disk FIRST !
fdisk -l
dd if=/dev/nbd0p1 of=/dev/sda1
  • Now mount the repaired /dev/sda1 to verify its contents:
mkdir /mnt/hass-boot
mount /dev/sda1 /mnt/hass-boot
ls -al /mnt/hass-boot/EFI
  • If everything looks good, unmount and shut down
umount /mnt/hass-boot
qemu-nbd --disconnect /dev/nbd0
shutdown now
  • IMPORTANT: Remove the ISO and IDE DVD from the VM before rebooting
  • The VM should boot normally, you may need to systemctl start docker to get the hassio CLI working
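
A minimal sketch of the size check from the fdisk step above, assuming /dev/sda1 is hassos-boot and /dev/nbd0p1 is the boot partition from the downloaded release image:

# Compare the two partition sizes in bytes; only run the dd step if they match.
blockdev --getsize64 /dev/nbd0p1
blockdev --getsize64 /dev/sda1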

@agners
Member

agners commented Dec 24, 2020

Thanks for posting the update and the instructions on how to recover the boot partition!

I changed to using the sync mount option when mounting the boot partition, hoping things get written out immediately after an update (see #1101). Although, since you did not force off/force reboot, there must have been something else causing the corruption. Maybe it is some sort of kernel bug. If it is that, then I hope the latest Linux kernel stable update (part of 5.9) fixes it. But if you have hints/ideas about what could have caused the corruption in the first place (or if you have a process to reproduce it), I would be very interested to hear them.
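
For illustration, mounting a FAT boot partition with the sync option looks roughly like this (the device path is an example, not the actual HAOS mount unit):

mount -t vfat -o sync /dev/sda1 /mnt/boot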

@mj-sakellaropoulos
Author

I am unable to reproduce the issue. The only thing I could suggest is to implement a sanity check after writing to the boot partition: look for the bootx64.efi file and flag the update as failed if it is missing (?)

I did a clean install of 5.8 and updated via the CLI to 5.9; it worked normally on the latest Proxmox. Very strange. Could it be some incompatibility between specific old versions of Proxmox and barebox, maybe?
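
A hypothetical post-update sanity check along those lines could look like the following sketch (the mount point is illustrative and this is not the actual updater code):

# Mount the freshly written boot partition read-only and verify the EFI loader exists.
mkdir -p /tmp/boot-check
mount -o ro /dev/disk/by-label/hassos-boot /tmp/boot-check
if [ ! -s /tmp/boot-check/EFI/BOOT/bootx64.efi ]; then
    echo "boot partition verification failed after update" >&2
fi
umount /tmp/boot-check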

@tumd

tumd commented Jan 5, 2021

Same issue here, running on a libvirt VM.
I seem to have been able to restore the boot partition with help from @mj-sakellaropoulos's informative post.

@markkamp

markkamp commented Jan 8, 2021

Same thing happened to me running HassOS as a VM on an Unraid server (6.8.3). The fix from @mj-sakellaropoulos worked like a charm (thank you!). So my hass-boot partition was also corrupted.

When reverting to a backup image, I did end up reproducing the problem. This was when I updated OS 3.12 to 5.9. All seemed fine while updating, but after power cycling the VM, nothing; it wouldn't boot any more. So it could be, as @mj-sakellaropoulos suggested, a compatibility issue between older versions?

@lexathon

I had the same thing upgrading from 4.17 on an ESXi VM. I didn't bother recovering the EFI and simply rolled back my hard drive image to the backup.

agners added a commit to agners/operating-system that referenced this issue Jan 28, 2021
There are incident reports on the internet where people report that
fsck.(v)fat actually leads to problems rather than file system fixes. Around
the time when Home Assistant OS added fsck.fat for the boot partition,
reports of empty boot partitions or files with weird filenames started
to appear. This could be caused by fsck.fat.

Disable fsck on the boot partition.
@lexathon

Same again on 5.10 (as iopenguin already mentioned). Interestingly, this time the machine booted fine after the update and was stable, but after a power cut it failed to come back online. I guess the EFI wasn't needed for a soft reboot after the update.
I used @mj-sakellaropoulos's workaround (using 5.10) to recover the EFI on this occasion, as I'd made some changes I wanted to keep. Thanks for that, by the way.

agners added a commit that referenced this issue Jan 29, 2021
agners added a commit that referenced this issue Jan 30, 2021
@RubenKelevra

RubenKelevra commented Feb 8, 2021

I don't think that's an fsck issue, but a failure to properly sync the data from the write cache of the VM to the storage while the VM shuts down.

I've seen this before with libvirt and fixed it in my setups by running a sync as root on the machines after anything has been updated.

And by waiting a while with sleep after an update has completed.

Otherwise the system partition might end up looking like this: [screenshots of garbled directory listings]
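
The workaround described above boils down to something like this, run as root inside the guest right after an update (the 5-second value is an arbitrary choice):

sync      # flush dirty pages to the virtual disk
sleep 5   # give the host storage stack time to commit its write cache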

@ahknight

I'm running a VM on Proxmox, and this presents as clearing the MBR but leaving the GPT intact. From a Proxmox shell I can run fdisk on the zvol that hosts the VM and it will rebuild the MBR, which lets the VM boot. But I have to do this after every HA OS upgrade.

@GJT

GJT commented Feb 22, 2021

Having the same issue under Proxmox. Happens every other week.
@ahknight What command do you run exactly to fix it?

@ahknight

$ gdisk /dev/zd##

Then just write out the MBR again and try again.
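
For reference, the interactive session looks roughly like this (## is the zvol number; the GPT is still intact, so gdisk rebuilds the protective MBR in memory and w writes it back out):

gdisk /dev/zd##
Command (? for help): w
Do you want to proceed? (Y/N): y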

@agners
Member

agners commented Feb 25, 2021

@RubenKelevra

I don't think that's an fsck issue, but a failure to properly sync the data from the write cache of the VM to the storage while the VM shuts down.

That is what I thought as well, but we also see it on Intel NUCs. Also, a proper sync should be done on a reboot anyway, and at least some people claimed they did a proper reboot but still experienced the issue...

I've seen this before with libvirt and fixed it in my setups by running a sync as root on the machines after anything as been updated.

This is essentially what #1101 does, by mounting the whole partition sync. That went into 5.9, but the issue still appeared afterwards.

@ahknight just to clarify: the image uses UEFI to boot, so there is no "MBR". MBR is a DOS partition table/BIOS concept. In UEFI, there is just a FAT partition called the EFI System Partition (ESP), which has to have the right files in the right place. The UEFI firmware then picks up the boot loader from there. No "magic" master boot record (MBR) is needed. I guess you are referring to the ESP here.

@GJT to fix a qcow2 image, you can follow the instructions in #1125 (comment).

@RubenKelevra

@agners interesting.

We might experience two separate issues here:

Consumer-grade SSDs have a write cache which is not protected by a battery backup.

If the shutdown process is (basically) too fast, we might write this to the write cache and cut the power to the device before the SSD has had time to flush it to permanent storage.

There are some Intel SSDs which have a power backup built in to avoid this.

They call this "enhanced power-loss data protection".

It's probably pretty racy in most setups, so we might have this issue everywhere, but it only shows symptoms with a very small probability.

We could debug this by writing a file, unsynced, when the shutdown is initiated. If it's gone when we start up, we know something fishy is going on.

If it's still there, we delete it.

Anyway, I think we could mitigate this issue with a hook on shutdown, after the file systems are unmounted: if we just add a, say, 5-second sleep afterwards, even the slowest SSD should have plenty of time to write everything from its write cache to the disk.
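
A sketch of the canary-file debugging idea above (the marker path is illustrative):

# At shutdown: write a marker file without an explicit sync.
date -u +%s > /mnt/data/.shutdown-canary
# At the next boot: a missing marker means unsynced writes were lost.
if [ -f /mnt/data/.shutdown-canary ]; then
    rm /mnt/data/.shutdown-canary
else
    echo "previous shutdown lost unsynced writes" | systemd-cat -p warning
fi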

adeepn pushed a commit to jethome-ru/homeassistant-operating-system that referenced this issue Mar 1, 2021
…ome-assistant#1190)

@ahknight

ahknight commented Mar 5, 2021

@agners I know how the startup process is supposed to work. However, I'm explaining what I did. Proxmox got stuck in an EFI boot prompt loop until I SSHed into the Proxmox host, ran gdisk on the zvol, read the error message that said the MBR was missing, wrote out the MBR that it recovered from the GPT tables, and then started the VM again. Suddenly it worked.

We can argue about what should happen forever, but that did fix it. Repeatedly.

@GJT

GJT commented Mar 21, 2021

For me this issue occurs every 2-4 weeks: the system becomes unresponsive out of the blue and I am greeted with the corrupt EFI on a reset.
I usually roll back to my working snapshot (OS 5.12), which I can reboot as much as I like. But after some time it gets corrupted again, without any updates or changes to the system.

It even occurs on different Proxmox cluster nodes that use different storage systems.

@hcooper

hcooper commented Mar 28, 2021

HassOS EFI Recovery Guide

Thanks @mj-sakellaropoulos, your recovery instructions worked well, and I managed to recover from a botched upgrade.

@stale

stale bot commented Jun 11, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@Eeems

Eeems commented Jul 6, 2021

I've just experienced this, and I last upgraded from 6.0 to 6.1. I'm not convinced that an upgrade is causing this for me, as I successfully rebooted the VM multiple times after the upgrade with no issues. What I did notice yesterday is that the file system had gone read-only and required a reboot. After the reboot it seemed to run fine, but when I woke up this morning the VM was powered off and required rebuilding the partition table in order to boot again.

This seems to be a semi-weekly occurrence for me.

@github-actions

github-actions bot commented Oct 6, 2021

There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates.
Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍
This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

@meichthys

meichthys commented Dec 10, 2021

I'm late to the party, but I still experience this regularly when rebooting the host machine. My temporary fix is:

## Make sure VM is disabled:
ha-manager set vm:<VMID> --state disabled
## Open GDISK to modify disk partition map
gdisk /dev/zvol/rpool/vm-<VMID>-disk-<DISK#>
## Once GDISK opens, just use the `w` command to re-write the partition map
## Re-enable (start) VM to verify the VM boots using the disk
ha-manager set vm:<VMID> --state enabled

@agners
Member

agners commented Dec 10, 2021

Did this happen with a recent OS version?

@agners agners reopened this Dec 10, 2021
@github-actions github-actions bot removed the stale label Dec 10, 2021
@meichthys

meichthys commented Dec 10, 2021

Did this happen with a recent OS version?

Yes: [screenshot showing the current OS version]

For me it happened somewhat regularly after the host machine is rebooted, but last night I noticed it happened even without a host machine reboot.

I did notice this: [screenshot showing /dev/root at 100% use]

@agners
Member

agners commented Dec 10, 2021

100% for /dev/root is normal since we use a read-only squashfs as root file system.

@github-actions

There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates.
Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍
This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

@RubenKelevra

@agners wrote:

This is essentially what #1101 does, by mounting the whole partition sync. That went into 5.9, but the issue still appeared afterwards.

Yeah. The issue is outside of the VM. The write request gets cached for "performance reasons" including the fsync. When the machine is turned off, the cache won't get flushed but instead just discarded.

I've seen this many times with KVM; I'm not sure if that's a Linux kernel bug or something in the emulation layer of the disk itself.

The last time I saw this was about 3 years ago. I just keep the machines running for several minutes before rebooting, which fixed this for me.

@agners
Member

agners commented May 9, 2022

The issue is outside of the VM. The write request gets cached for "performance reasons" including the fsync.

"gets cached" by whom?

If it's hardware, then it's broken hardware. The OS needs to be able to rely on flushes reaching the underlying non-volatile storage, otherwise the whole house of cards falls apart (journaling file systems can't implement their consistency guarantees, and databases' ACID properties break).

If it's the VM's virtual disk driver, then that disk driver is buggy or reckless. Granted, you might want such an option so you can trade reliability for performance if you really don't need any reliability (e.g. for testing). But it shouldn't be the default, and it should not be configured for Home Assistant OS :)

KVM/QEMU has quite a few tunables in that domain. SUSE seems to have a nice write-up about the options. I highly doubt, though, that unsafe options are used in Proxmox by default...
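
For context, the knob in question in plain QEMU is the drive cache mode. A minimal illustration (file name and options are placeholders, not Proxmox's actual configuration):

qemu-system-x86_64 -m 2G -machine q35 \
  -drive file=haos_ova.qcow2,if=virtio,cache=writeback
# cache=writeback (the default) and cache=none still honor guest flush requests;
# cache=unsafe ignores flushes and can lose data in exactly the way described above.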

@RubenKelevra

RubenKelevra commented May 9, 2022

"gets cached" by whom?

So my assumption is that the virtual hard drive does write-back caching for performance reasons and does not fully flush the cache before the virtual hard drive is destroyed.

Linux also writes something out like

sd 0:0:0:0: [sda] No Caching mode page found
sd 0:0:0:0: [sda] Assuming drive cache: write through

on certain hardware, which is just not true for SD cards, which do some kind of write-back caching.

KVM/Qemu has quite some tunables in that domain. SUSE seems to have a nice write-up about the options. I highly doubt though that "non-safe" options are used in Proxmox by default...

Yeah, not by intention but because of a bug somewhere. Making sure that writes are atomic without data journaling / copy-on-write is kind of hard.

I stopped using ext4 on LVM for this reason and switched to ZFS, and the issue went away.

@RubenKelevra

Btw, ext4 has a mount option to fix some application issues: auto_da_alloc. But I don't think this will cover block-based replacements.

Many broken applications don't use fsync() when replacing existing files via patterns such as

    fd = open("foo.new")/write(fd,...)/close(fd)/ rename("foo.new", "foo")

or worse yet

    fd = open("foo", O_TRUNC)/write(fd,...)/close(fd).

If auto_da_alloc is enabled, ext4 will detect the replace-via-rename and replace-via-truncate patterns and force that any delayed allocation blocks are allocated such that at the next journal commit, in the default data=ordered mode, the data blocks of the new file are forced to disk before the rename() operation is committed. This provides roughly the same level of guarantees as ext3, and avoids the "zero-length" problem that can happen when a system crashes before the delayed allocation blocks are forced to disk.

@agners
Member

agners commented May 11, 2022

So my assumption is that the virtual hard drive does write-back caching for performance reasons and does not fully flush the cache before the virtual hard drive is destroyed.

Yeah, that would explain it, but it would be a big fat bug IMHO. I mean, just throwing away caches when the VM gets destroyed seems like a major oversight. I doubt that this is what is going on.

on certain hardware, which is just not true for SD cards, which do some kind of write-back caching.

This issue is about virtual machines, though. Also, SD cards are exposed as mmcblk. I don't think that the kernel makes such assumptions for those types of devices.

I stopped using ext4 on LVM for this reason and switched to ZFS, and the issue went away.

Keep in mind that the boot partition is FAT. Also, it is mounted sync now, so writes should go out immediately today.

With OS 8.x we switched to the GRUB2 boot loader and to the latest Linux kernel; let's see if reports still appear with that combination.

@Sesshoumaru-sama

I tried to update from HassOS 8.2 to 8.4 today (Proxmox VM).
The system did not boot after that, landing in the EFI shell. I had to restore a previously made snapshot and now it's up again. This issue is really worrisome and persistent.

@meichthys

I haven't noticed this issue recently, but the following has always worked when falling into the EFI shell: #1125 (comment)

@Sesshoumaru-sama

Sesshoumaru-sama commented Jul 26, 2022

I haven't noticed this issue recently, but the following has always worked when falling into the EFI shell: #1125 (comment)

I have no folder /dev/zvol/rpool/.
Is it just the path to the VM disks? I have them on LVM, so this?
/dev/pve/vm-100-disk-0 (mapped to ../dm-10)

Do I need to do it for the data disk or also for the EFI disk (disk-1)?

@meichthys

I've only ever done it on the data disk, but my disk was on ZFS. Be sure to take a backup of the VM before trying it on your LVM disk.

@Sesshoumaru-sama

Odd that nobody else with LVM had this issue and could give a hint.
I will try to restore the VM on another Proxmox instance and see what happens. It is really frustrating to have such low-level issues that things don't boot...

@GJT

GJT commented Aug 3, 2022

Just had the issue again, after a long time without problems, when upgrading from 8.2 to 8.4.
Proxmox/ZFS

@agners
Member

agners commented Aug 4, 2022

We really don't do anything special with that partition other than writing some files to it right before rebooting. Rebooting should properly unmount the disk, which should cause all buffers to be properly flushed. Can you check whether the file system checks were all good before the upgrade, e.g. using the following commands in the console/HAOS SSH shell:

journalctl -u "systemd-fsck@dev-disk-by\x2dlabel-hassos\x2dboot.service"
journalctl -u mnt-boot.mount

@GJT

GJT commented Aug 4, 2022

Unfortunately, both only contain entries from after the upgrade.


Going to check next time before an upgrade.

@sylarevan

Hi there. I must report that this bug still seems to be present. After a host (Proxmox 7.3-6) reboot, my Home Assistant VM was not able to boot anymore. I got the message:
BdsDxe: failed to load Boot0001 "UEFI QEMU HARDDISK QM00005" from PciRoot(0x0)/Pci(0x7,0x0)/Sata(0x0,0xFFFF,0x0): Not Found

The solution, as mentioned there, was to check the partition table with:
gdisk /dev/pve/vm-101-disk-1
and then simply
w

After that, the VM was able to boot again.

@agners
Member

agners commented Apr 18, 2023

After a host (proxmox 7.3-6) reboot

Was that a graceful reboot or a power cut?

If the former, can you reproduce this with each reboot?

@sylarevan

This was a graceful host reboot. I have not rebooted since (I'm a bit afraid of not being able to properly recover the VM this time), but I will test. FYI, this is the first time I have had this problem in about 3 years and many, many HA updates.
