
Operating System 5.8 update breaking VHDX image #1092

Closed
1 task
tonyjobson opened this issue Dec 15, 2020 · 16 comments
Labels
board/ova Open Virtual Appliance (Virtual Machine) stale

Comments

tonyjobson commented Dec 15, 2020

Hardware Environment

  • OVA (Open Virtualization Appliance, on Intel NUC or any other hardware, please add the Hypervisor you are using)
Windows Hyper-V running on Windows Server x86.

Home Assistant OS release:

  • [ ] Updated from version = 2.12 and updated ever since.
  • Additional information (if accessible):

Supervisor logs:

System Health

version: 2020.12.0
installation_type: Home Assistant OS
dev: false
hassio: true
docker: true
virtualenv: false
python_version: 3.8.6
os_name: Linux
os_version: 5.4.77
arch: x86_64
timezone: Europe/London

logged_in: true
subscription_expiration: December 22, 2020, 12:00 AM
relayer_connected: true
remote_enabled: false
remote_connected: false
alexa_enabled: true
google_enabled: true
can_reach_cert_server: ok
can_reach_cloud_auth: ok
can_reach_cloud: ok

host_os: HassOS 4.17
update_channel: stable
supervisor_version: 2020.12.6
docker_version: 19.03.12
disk_total: 8.2 GB
disk_used: 3.8 GB
healthy: true
supported: true
board: ova
supervisor_api: ok
version_api: ok
installed_addons: Samba share (9.3.0), Node-RED (7.2.8), Mosquitto broker (5.1), File editor (5.2.0), Check Home Assistant configuration (3.6.0)

dashboards: 1
mode: auto-gen
resources: 0

If I apply the System Update to Operating System 5.8 from 4.17, the update succeeds. The instance reboots and everything looks good.

If I then reboot again, the VM won't boot any longer: the virtual disk is no longer bootable and it hangs looking for a DHCP/PXE boot.

I have replicated this twice (I take a snapshot before I update anything, as sketched below, so I was able to recover and replicate the problem).
I have been able to reboot 4 times in a row having updated Home Assistant but not the host OS, so I'm fairly sure it's a host OS issue.

  • Is the problem reproducible? Yes.
  • Has this been working before (is this a regression?) Yes.
  • Has there been an attempt to rule out hardware issues? It's a VM.
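For anyone following the same snapshot-before-update workflow on Hyper-V, a minimal sketch using the standard Hyper-V PowerShell cmdlets (the VM name "hassos-vm" is a placeholder, not anything from this report):

    # Take a checkpoint of the VM before applying the OS update
    # ("hassos-vm" is hypothetical; substitute your VM's actual name)
    Checkpoint-VM -Name "hassos-vm" -SnapshotName "pre-os-5.8"

    # If the update leaves the disk unbootable, roll back to the checkpoint
    Get-VMSnapshot -VMName "hassos-vm" -Name "pre-os-5.8" |
        Restore-VMSnapshot -Confirm:$false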
agners added the board/ova Open Virtual Appliance (Virtual Machine) label Dec 15, 2020
agners (Member) commented Dec 15, 2020

Hm, I did test the upgrade from release 4 to 5.8 using KVM and haven't noticed this.

Is this also reproducible if you import/install a new 5.8 installation as a new machine?

tonyjobson (Author) commented:

I have not tried yet, but I could do over the weekend.
If I import a fresh image of 5.8, the OS upgrade won't run, as it will already be up to date, and I think it's the OS upgrade which is breaking the image.

Also, the key here is that the upgrade worked and the first reboot went through fine. It was only a second reboot after the OS upgrade which failed.
(Very odd, I agree. Perhaps this second reboot slipped through the testing, which is why I mention it.)

(I only caught the problem because I was rebooting in a vain effort to get something unrelated working, and happened to have done the upgrade only an hour or so before, so I still had a snapshot to recover from.)

One more point, which occurred to me after I logged the case:
I have performed a VHDX expansion on the disk image when I ran low on free space. I'm not sure if this is a supported operation; it seemed to be, given that the HA install recognized the full expanded space straight away.
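For reference, growing a VHDX offline on a Hyper-V host can be done with the Resize-VHD cmdlet; a sketch, with a hypothetical path and target size:

    # Grow the virtual disk while the VM is powered off
    # (path and size are placeholders)
    Resize-VHD -Path "C:\VMs\hassos.vhdx" -SizeBytes 48GB

HassOS is expected to grow its data partition into the new space on the next boot, which would explain why the install recognized the extra space straight away.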


PerWeimann commented Dec 16, 2020

I see the exact same behavior as tonyjobson.

When updating from OS 4.17 to 5.8, the VHDX image fails to load the OS. I have tested this twice (on Core 0.118.4 and Core 2020.12.0) and can reproduce it.

version: 0.118.4
installation_type: Home Assistant OS
dev: false
hassio: true
docker: true
virtualenv: false
python_version: 3.8.6
os_name: Linux
os_version: 5.4.77
arch: x86_64
timezone: Europe/Copenhagen

Home Assistant Cloud

logged_in: false
can_reach_cert_server: ok
can_reach_cloud_auth: ok
can_reach_cloud: ok

Hass.io

host_os: HassOS 4.17
update_channel: stable
supervisor_version: 2020.12.6
docker_version: 19.03.12
disk_total: 5.2 GB
disk_used: 2.8 GB
healthy: true
supported: true
board: ova
supervisor_api: ok
version_api: ok
installed_addons: Samba share (9.3.0), AppDaemon 4 (0.3.2), Node-RED (7.2.11), File editor (5.2.0), Terminal & SSH (8.10.0)


dkebler commented Dec 16, 2020

I had a similar no-boot issue on three HassOS VMs I run when upgrading. They would not boot from saved state, but you could reload the last snapshot (after the upgrade) and they would boot. Then a further restart (from saved state) and the issue would happen again. It dumps you to the EFI shell, and when you look in the EFI directory there is nothing there, no .efi file to boot. Needing a solution, I made a HA snapshot, started from scratch with a fresh release copy of 5.8, and restored my HA snapshot; now all is good. Still, it's a bummer that the upgrade failed. Maybe on major upgrades the upgrade button in the UI should warn and suggest doing what I ultimately had to do.

This post's answer could explain why one can't boot a second time: https://askubuntu.com/questions/454557/virtualbox-virtual-machines-wont-boot-after-cloning

I also found this video, which explains how to boot when the firmware can't find the .efi to boot: https://www.youtube.com/watch?v=YCegkcVheJA
I followed his instructions, but although the FS0 partition has an EFI directory, there is nothing in there, so I couldn't run the .efi like he does... stuck.

[screenshot: EFI shell showing the boot partition with an empty EFI directory]

So apparently the virtual NVRAM settings are getting borked per that post, and the EFI partition doesn't have the .efi boot file (per the EFI shell), and/or the UUIDs were changed during the upgrade process.

Related: after some further investigation, whenever I change the UUID of the virtual disk, or clone it (which does the same thing), I get a similar issue. This is pointed out in many places about poorly designed VMs: if the image has a "hard-coded" UUID, then changing it will cause problems.

It seems going forward that HassOS should support changing the UUID of the filesystem partition. Normally, if I did this on hardware, I would burn the new filesystem image, change the UUID, then go into fstab and change the UUID accordingly. In the case of a VM, I have no idea how to mount the partitions of a virtual drive without booting it. I suppose you'd have to get another support VM running, mount your virtual drive there, and then edit what needs to be edited (one way to do that is sketched below).
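One way to mount the partitions of a virtual drive without booting it, as mentioned above, is to attach the image to a Linux machine (or helper VM) through qemu's network block device driver. A sketch, assuming qemu-utils is installed; the image path is a placeholder:

    # Load the nbd module and attach the VHDX
    sudo modprobe nbd max_part=8
    sudo qemu-nbd --format=vhdx --connect=/dev/nbd0 /path/to/hassos.vhdx

    # Partitions show up as /dev/nbd0p1, /dev/nbd0p2, ...
    # The first one is the EFI system partition (hassos-boot)
    sudo mount /dev/nbd0p1 /mnt
    ls /mnt/EFI/BOOT   # a healthy image has a bootx64.efi (or similar) here

    # Detach when done
    sudo umount /mnt
    sudo qemu-nbd --disconnect /dev/nbd0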

agners (Member) commented Dec 16, 2020

@dkebler the update doesn't change the UUID. In fact, our image has used b3dd0952-733c-4c88-8cba-cab9b8b4377f since the beginning. Your screenshot shows the correct boot partition (UUID), and it is considered the main file system (FS0), which it would boot from.

However, the EFI directory definitely should not be empty. My best guess is that something went wrong while writing the update. I'd love to reproduce this; since, as you say, it happened in three independent instances, it sounds like it should be reproducible. I just tried upgrading from 4.11 and 4.17 to 5.8, and in both situations things worked. I also tried taking a snapshot. Which version were you using before?

agners (Member) commented Dec 16, 2020

@PerWeimann are you using Hyper-V then too?

So far the only way I was able to corrupt my EFI partition was to force-power-off the machine right after the update (which, obviously, is not a good idea). I can probably improve the update process to shorten the window in which a forceful power-off corrupts the EFI partition.

Did you do a proper reboot (not a system reset) as well as a proper power-off?

agners added a commit to agners/operating-system that referenced this issue Dec 16, 2020
When we write the update to the boot partition, there is nothing which
makes sure that data is written to disk. This leaves a rather large
window (probably around 30s) where a machine reset/poweroff can lead
to a corrupted boot partition. Use the sync mount option to minimize the
corruption window.

Note that sync is not ideal for flash drives normally. But since we
write very little, and typically only on OS update, to the boot partition,
this shouldn't be a problem.
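For illustration, the fix amounts to mounting the boot partition with synchronous writes, roughly like this (the mount point and label follow HassOS conventions, but treat the exact line as an assumption):

    # With the sync option, every write to the boot partition is flushed
    # to disk immediately instead of sitting in the page cache
    mount -o sync LABEL=hassos-boot /mnt/boot

Without the option, the freshly written update files can sit unflushed for tens of seconds, which is exactly the corruption window the commit message describes.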
heretocopycode commented:

It happened to me on VirtualBox. I could not get it to boot after a power-off, and with a fresh install it's booting very slowly.

pvizeli pushed a commit that referenced this issue Dec 17, 2020

dkebler commented Dec 17, 2020

I use VirtualBox, and started with a VHDX I had converted to a VDI (although the OP had the same issue without doing that). I was at 4.17 before I did the upgrade. I had no issue earlier when I did 4.x upgrades; this only happened jumping to 5.x. I usually do an ACPI shutdown so it shuts down normally. I did not restart the VM during the upgrade, and it came back fine showing the upgrade to HassOS 5.8. It is on the next start of the VM that the issue arises.

Per the other part of my comment: since this project has complete control over the partition labels and /dev names, I suggest you mount the root file system and other partitions based either on the /dev name or a label you give them. That way it's not dependent on a unique UUID, and the disk can be cloned and still boot. As you said, the UUID is supposedly not the issue (but maybe it is?). Either way, as a virtual drive it's easy to change the UUID, whereas on a metal install a write to parts of the filesystem partition isn't going to change the UUID, and if you move the disk it will be to other hardware where the UUID would not conflict. With VMs it's possible to have alternate copies on the same host, and if you don't change the UUID you can't even load the VM copy, because VirtualBox complains about the duplicate UUID. Thus you must clone it, and when you try to boot the clone it fails for the reasons I've already stated. (A sketch of the UUID change follows below.)
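For reference, the UUID change on a VirtualBox disk mentioned above is a built-in operation; a sketch, with a placeholder file name:

    # Assign a fresh disk UUID to a copied image so VirtualBox will
    # register it alongside the original (which keeps the old UUID)
    VBoxManage internalcommands sethduuid copy-of-hassos.vdi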

agners (Member) commented Dec 18, 2020

Per the other part of my comment: since this project has complete control over the partition labels and /dev names, I suggest you mount the root file system and other partitions based either on the /dev name or a label you give them. That way it's not dependent on a unique UUID, and the disk can be cloned and still boot.

We do mount by file system label, so from an OS perspective it doesn't matter.

I'm not sure what the firmware (UEFI BIOS) is doing; it might be that it tries to do something with the GPT partition UUIDs (like remembering which partition UUID you booted from last time, or similar). I am just saying, we have always shipped with the same UUID, since we ship as an image. The update also didn't change the UUID... But in theory it shouldn't even matter, since we use the FS label.
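To illustrate the mount-by-label point, this is roughly what a label-based fstab entry looks like; the labels match HassOS partition naming, but treat the exact lines as an assumption:

    # /etc/fstab: mounting by filesystem label means a changed partition
    # UUID (e.g. after cloning the disk) does not break the mount
    LABEL=hassos-boot    /mnt/boot    vfat    ro,defaults    0  2
    LABEL=hassos-data    /mnt/data    ext4    defaults       0  2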

agners added a commit to agners/operating-system that referenced this issue Dec 18, 2020
adeepn pushed a commit to adeepn/home-assistant-operating-system that referenced this issue Dec 19, 2020
nhorvath commented:

This update broke mine too. I didn't want to deal with the EFI stuff, and I have daily snapshots backed up to Google Drive, so I just downloaded a new image and restored onto that. I had to resync some integrations, but it was still probably faster than fighting the corrupted EFI directory. Next time I'll snapshot in VirtualBox before doing an OS upgrade (my last one was from February and it couldn't restore the snapshot).
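On the VirtualBox side, the pre-upgrade snapshot is a one-liner; a sketch with a hypothetical VM name:

    # Snapshot the VM before the OS upgrade
    VBoxManage snapshot "hassos-vm" take "pre-os-upgrade"

    # Roll back if the upgrade leaves the disk unbootable
    VBoxManage snapshot "hassos-vm" restore "pre-os-upgrade"

(Restoring requires the VM to be powered off or in a saved state.)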

adamoutler commented:

I got promising results by changing EEPROM settings. HA at least got through more of its boot sequence. I'm not sure about the problem now, but I think I need to adjust the systemd timeout.
Is there a way to change the timeout on HassOS?


scyto commented Jan 10, 2021

I have had no issues updating my VHDX on Hyper-V.

PerWeimann commented:

@agners - Sorry for the late reply!

Yes, running Hyper-V on Server 2016. The virtual machine is using Hyper-V configuration version 8.0.

I'm still able to reproduce this when upgrading from OS 4.17 to OS 5.10.
Running core-2021.1.4 and supervisor-2021.01.5

agners (Member) commented Feb 8, 2021

@PerWeimann is this reproducible with 5.11? It is currently in the beta channel, but you can install it directly with ha os update --version 5.11


stale bot commented May 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label May 12, 2021
agners removed the wontfix label Oct 6, 2021

github-actions bot commented Jan 4, 2022

There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates.
Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍
This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.
