
Operating System 5.8 update breaking VHDX image #1092

Closed
1 task
tonyjobson opened this issue Dec 15, 2020 · 16 comments
Labels
board/ova Open Virtual Appliance (Virtual Machine) stale

Comments

tonyjobson commented Dec 15, 2020

Hardware Environment

  • OVA (Open Virtualization Appliance, on Intel NUC or any other hardware, please add the Hypervisor you are using)
Windows Hyper-V running on Windows Server x86.

Home Assistant OS release:

  • [ ] Updated from version = 2.12 and updated ever since.
  • Additional information (if accessible):

Supervisor logs:

System Health

version: 2020.12.0
installation_type: Home Assistant OS
dev: false
hassio: true
docker: true
virtualenv: false
python_version: 3.8.6
os_name: Linux
os_version: 5.4.77
arch: x86_64
timezone: Europe/London

logged_in: true
subscription_expiration: December 22, 2020, 12:00 AM
relayer_connected: true
remote_enabled: false
remote_connected: false
alexa_enabled: true
google_enabled: true
can_reach_cert_server: ok
can_reach_cloud_auth: ok
can_reach_cloud: ok

host_os: HassOS 4.17
update_channel: stable
supervisor_version: 2020.12.6
docker_version: 19.03.12
disk_total: 8.2 GB
disk_used: 3.8 GB
healthy: true
supported: true
board: ova
supervisor_api: ok
version_api: ok
installed_addons: Samba share (9.3.0), Node-RED (7.2.8), Mosquitto broker (5.1), File editor (5.2.0), Check Home Assistant configuration (3.6.0)

dashboards: 1
mode: auto-gen
resources: 0

If I apply the System Update to Operating System 5.8 from 4.17, the update succeeds. The instance reboots and everything looks good.

If I then reboot again, the VM won't boot any longer: the virtual disk is no longer bootable and it hangs looking for a DHCP/PXE boot.

I have replicated this twice (I take a snapshot before I update anything, as sketched below, so I was able to recover and replicate the problem).
I have been able to reboot 4 times in a row having updated Home Assistant but not the host OS, so I'm fairly sure it's a host OS issue.

  • Is the problem reproducible? Yes.
  • Has this been working before (is this a regression?) Yes.
  • Has there been an attempt to rule out hardware issues? It's a VM.
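For anyone following the same snapshot-before-update workflow on Hyper-V, a minimal sketch using the standard Hyper-V PowerShell cmdlets (the VM name "hassos-vm" is a placeholder, not anything from this report):

    # Take a checkpoint of the VM before applying the OS update
    # ("hassos-vm" is hypothetical; substitute your VM's actual name)
    Checkpoint-VM -Name "hassos-vm" -SnapshotName "pre-os-5.8"

    # If the update leaves the disk unbootable, roll back to the checkpoint
    Get-VMSnapshot -VMName "hassos-vm" -Name "pre-os-5.8" |
        Restore-VMSnapshot -Confirm:$false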
agners added the board/ova Open Virtual Appliance (Virtual Machine) label Dec 15, 2020
agners (Member) commented Dec 15, 2020

Hm, I did test the upgrade from release 4 to 5.8 using KVM and haven't noticed this.

Is this also reproducible if you import/install a new 5.8 installation as a new machine?

tonyjobson (Author) commented:

I have not tried yet, but I could do over the weekend.
If I import a fresh image of 5.8, the OS upgrade won't run, as it will already be up to date, and I think it's the OS upgrade which is breaking the image.

Also, the key here is that the upgrade worked and the first reboot went through fine. It was only a second reboot after the OS upgrade which failed.
(Very odd, I agree. Perhaps this second reboot slipped through the testing, which is why I mention it.)

(I only caught the problem because I was rebooting in a vain effort to get something unrelated working, and happened to have done the upgrade only an hour or so before, so I still had a snapshot to recover from.)

One more point, which occurred to me after I logged the case:
I have performed a VHDX expansion on the disk image when I ran low on free space. I'm not sure if this is a supported operation; it seemed to be, given that the HA install recognized the full expanded space straight away.
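For reference, growing a VHDX offline on a Hyper-V host can be done with the Resize-VHD cmdlet; a sketch, with a hypothetical path and target size:

    # Grow the virtual disk while the VM is powered off
    # (path and size are placeholders)
    Resize-VHD -Path "C:\VMs\hassos.vhdx" -SizeBytes 48GB

HassOS is expected to grow its data partition into the new space on the next boot, which would explain why the install recognized the extra space straight away.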


PerWeimann commented Dec 16, 2020

I see the exact same behavior as tonyjobson.

When updating from OS 4.17 to 5.8, the VHDX image fails to load the OS. I have tested this twice (on Core 0.118.4 and Core 2020.12.0) and can reproduce it.

version: 0.118.4
installation_type: Home Assistant OS
dev: false
hassio: true
docker: true
virtualenv: false
python_version: 3.8.6
os_name: Linux
os_version: 5.4.77
arch: x86_64
timezone: Europe/Copenhagen

Home Assistant Cloud

logged_in: false
can_reach_cert_server: ok
can_reach_cloud_auth: ok
can_reach_cloud: ok

Hass.io

host_os: HassOS 4.17
update_channel: stable
supervisor_version: 2020.12.6
docker_version: 19.03.12
disk_total: 5.2 GB
disk_used: 2.8 GB
healthy: true
supported: true
board: ova
supervisor_api: ok
version_api: ok
installed_addons: Samba share (9.3.0), AppDaemon 4 (0.3.2), Node-RED (7.2.11), File editor (5.2.0), Terminal & SSH (8.10.0)


dkebler commented Dec 16, 2020

I had a similar no-boot issue on three HassOS VMs I run when upgrading. They would not boot from saved state, but you could reload the last snapshot (after the upgrade) and they would boot. Then a further restart (from saved state) and the issue would happen again. It dumps you to the EFI shell, and when you look in the EFI directory there is nothing there, no .efi file to boot. Needing a solution, I made a HA snapshot, started from scratch with a fresh release copy of 5.8, and restored my HA snapshot; now all is good. Still, it's a bummer that the upgrade failed. Maybe on major upgrades the upgrade button in the UI should warn and suggest doing what I ultimately had to do.

This post's answer could explain why one can't boot a second time: https://askubuntu.com/questions/454557/virtualbox-virtual-machines-wont-boot-after-cloning

I also found this video, which explains how to boot when the firmware can't find the .efi to boot: https://www.youtube.com/watch?v=YCegkcVheJA
I followed his instructions, but although the FS0 partition has an EFI directory, there is nothing in there, so I couldn't run the .efi like he does... stuck.

[screenshot: EFI shell showing the boot partition with an empty EFI directory]

So apparently the virtual NVRAM settings are getting borked per that post, and the EFI partition doesn't have the .efi boot file (per the EFI shell), and/or the UUIDs were changed during the upgrade process.

Related: after some further investigation, whenever I change the UUID of the virtual disk, or clone it (which does the same thing), I get a similar issue. This is pointed out in many places about poorly designed VMs: if the image has a "hard-coded" UUID, then changing it will cause problems.

It seems going forward that HassOS should support changing the UUID of the filesystem partition. Normally, if I did this on hardware, I would burn the new filesystem image, change the UUID, then go into fstab and change the UUID accordingly. In the case of a VM, I have no idea how to mount the partitions of a virtual drive without booting it. I suppose you'd have to get another support VM running, mount your virtual drive there, and then edit what needs to be edited (one way to do that is sketched below).
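One way to mount the partitions of a virtual drive without booting it, as mentioned above, is to attach the image to a Linux machine (or helper VM) through qemu's network block device driver. A sketch, assuming qemu-utils is installed; the image path is a placeholder:

    # Load the nbd module and attach the VHDX
    sudo modprobe nbd max_part=8
    sudo qemu-nbd --format=vhdx --connect=/dev/nbd0 /path/to/hassos.vhdx

    # Partitions show up as /dev/nbd0p1, /dev/nbd0p2, ...
    # The first one is the EFI system partition (hassos-boot)
    sudo mount /dev/nbd0p1 /mnt
    ls /mnt/EFI/BOOT   # a healthy image has a bootx64.efi (or similar) here

    # Detach when done
    sudo umount /mnt
    sudo qemu-nbd --disconnect /dev/nbd0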

agners (Member) commented Dec 16, 2020

@dkebler the update doesn't change the UUID. In fact, our image has used b3dd0952-733c-4c88-8cba-cab9b8b4377f since the beginning. Your screenshot shows the correct boot partition (UUID), and it is considered the main file system (FS0), which it would boot from.

However, the EFI directory definitely should not be empty. My best guess is that something went wrong while writing the update. I'd love to reproduce this; since, as you say, it happened in three independent instances, it sounds like it should be reproducible. I just tried upgrading from 4.11 and 4.17 to 5.8, and in both situations things worked. I also tried taking a snapshot. Which version were you using before?

agners (Member) commented Dec 16, 2020

@PerWeimann are you using Hyper-V then too?

So far the only way I was able to corrupt my EFI partition was to force-power-off the machine right after the update (which, obviously, is not a good idea). I can probably improve the update process to shorten the window in which a forceful power-off corrupts the EFI partition.

Did you do a proper reboot (not a system reset) as well as a proper power-off?

agners added a commit to agners/operating-system that referenced this issue Dec 16, 2020
When we write the update to the boot partition, there is nothing which
makes sure that data is written to disk. This leaves a rather large
window (probably around 30s) where a machine reset/poweroff can lead
to a corrupted boot partition. Use the sync mount option to minimize the
corruption window.

Note that sync is not ideal for flash drives normally. But since we
write very little, and typically only on OS update, to the boot partition,
this shouldn't be a problem.
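For illustration, the fix amounts to mounting the boot partition with synchronous writes, roughly like this (the mount point and label follow HassOS conventions, but treat the exact line as an assumption):

    # With the sync option, every write to the boot partition is flushed
    # to disk immediately instead of sitting in the page cache
    mount -o sync LABEL=hassos-boot /mnt/boot

Without the option, the freshly written update files can sit unflushed for tens of seconds, which is exactly the corruption window the commit message describes.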
heretocopycode commented:

It happened to me on VirtualBox. I could not get it to boot after a power-off, and with a fresh install it's booting very slowly.

pvizeli pushed a commit that referenced this issue Dec 17, 2020

dkebler commented Dec 17, 2020

I use VirtualBox, and started with a VHDX I had converted to a VDI (although the OP had the same issue without doing that). I was at 4.17 before I did the upgrade. I had no issue earlier when I did 4.x upgrades; this only happened jumping to 5.x. I usually do an ACPI shutdown so it shuts down normally. I did not restart the VM during the upgrade, and it came back fine showing the upgrade to HassOS 5.8. It is on the next start of the VM that the issue arises.

Per the other part of my comment: since this project has complete control over the partition labels and /dev names, I suggest you mount the root file system and other partitions based either on the /dev name or a label you give them. That way it's not dependent on a unique UUID, and the disk can be cloned and still boot. As you said, the UUID is supposedly not the issue (but maybe it is?). Either way, as a virtual drive it's easy to change the UUID, whereas on a metal install a write to parts of the filesystem partition isn't going to change the UUID, and if you move the disk it will be to other hardware where the UUID would not conflict. With VMs it's possible to have alternate copies on the same host, and if you don't change the UUID you can't even load the VM copy, because VirtualBox complains about the duplicate UUID. Thus you must clone it, and when you try to boot the clone it fails for the reasons I've already stated. (A sketch of the UUID change follows below.)
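For reference, the UUID change on a VirtualBox disk mentioned above is a built-in operation; a sketch, with a placeholder file name:

    # Assign a fresh disk UUID to a copied image so VirtualBox will
    # register it alongside the original (which keeps the old UUID)
    VBoxManage internalcommands sethduuid copy-of-hassos.vdi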

agners (Member) commented Dec 18, 2020

Per the other part of my comment: since this project has complete control over the partition labels and /dev names, I suggest you mount the root file system and other partitions based either on the /dev name or a label you give them. That way it's not dependent on a unique UUID, and the disk can be cloned and still boot.

We do mount by file system label, so from an OS perspective it doesn't matter.

I'm not sure what the firmware (UEFI BIOS) is doing; it might be that it tries to do something with the GPT partition UUIDs (like remembering which partition UUID you booted from last time, or similar). I am just saying, we have always shipped with the same UUID, since we ship as an image. The update also didn't change the UUID... But in theory it shouldn't even matter, since we use the FS label.
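To illustrate the mount-by-label point, this is roughly what a label-based fstab entry looks like; the labels match HassOS partition naming, but treat the exact lines as an assumption:

    # /etc/fstab: mounting by filesystem label means a changed partition
    # UUID (e.g. after cloning the disk) does not break the mount
    LABEL=hassos-boot    /mnt/boot    vfat    ro,defaults    0  2
    LABEL=hassos-data    /mnt/data    ext4    defaults       0  2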

agners added a commit to agners/operating-system that referenced this issue Dec 18, 2020
adeepn pushed a commit to adeepn/home-assistant-operating-system that referenced this issue Dec 19, 2020
nhorvath commented:

This update broke mine too. I didn't want to deal with the EFI stuff, and I have daily snapshots backed up to Google Drive, so I just downloaded a new image and restored onto that. I had to resync some integrations, but it was still probably faster than fighting the corrupted EFI directory. Next time I'll snapshot in VirtualBox before doing an OS upgrade (my last one was from February and it couldn't restore the snapshot).
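On the VirtualBox side, the pre-upgrade snapshot is a one-liner; a sketch with a hypothetical VM name:

    # Snapshot the VM before the OS upgrade
    VBoxManage snapshot "hassos-vm" take "pre-os-upgrade"

    # Roll back if the upgrade leaves the disk unbootable
    VBoxManage snapshot "hassos-vm" restore "pre-os-upgrade"

(Restoring requires the VM to be powered off or in a saved state.)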

adamoutler commented:

I got promising results by changing EEPROM settings. HA at least got through more of its boot sequence. I'm not sure about the problem now, but I think I need to adjust the systemd timeout.
Is there a way to change the timeout on HassOS?


scyto commented Jan 10, 2021

I have had no issues updating my VHDX on Hyper-V.

PerWeimann commented:

@agners - Sorry for the late reply!

Yes, running Hyper-V on Server 2016. The virtual machine is using Hyper-V configuration version 8.0.

I'm still able to reproduce this when upgrading from OS 4.17 to OS 5.10.
Running core-2021.1.4 and supervisor-2021.01.5

agners (Member) commented Feb 8, 2021

@PerWeimann is this reproducible with 5.11? It is currently in the beta channel, but you can install it directly with ha os update --version 5.11


stale bot commented May 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label May 12, 2021
agners removed the wontfix label Oct 6, 2021

github-actions bot commented Jan 4, 2022

There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates.
Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍
This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.
