[BUG] v1.2.0 Interactive ISO Fails to Install On Some Bare-Metal Devices #4510

Closed
irishgordo opened this issue Sep 8, 2023 · 20 comments
Labels
  • area/installer
  • backport-needed/1.1.3
  • kind/bug: Issues that are defects reported by users or that we know have reached a real release
  • priority/0: Must be fixed in this release
  • reproduce/often: Reproducible 10% to 99% of the time
  • require/doc: Improvements or additions to documentation
  • severity/1: Function broken (a critical incident with very high impact)

@irishgordo

irishgordo commented Sep 8, 2023

Describe the bug
The interactive ISO installer fails to install on some bare-metal devices.

Devices reported so far:

  • AMD Ryzen 9 7940HS "nuc-like" machine
  • AMD Ryzen 5900 w/ 64GB
  • Intel 770T i7 CPU (i7 vPro 7th Gen) ThinkCentre M910q w/ 32GB

Two paths seem to take place:

Path "A":

  • Once the GRUB boot entry starts, the dmesg/journalctl output begins but hangs at or right after: squashfs: version 4.0 (2009/01/31) Phillip Lougher

Path "B":

  • press "e" at the GRUB boot menu to edit the boot entry, remove console=ttyS0, then press Ctrl+X to boot
  • boot continues
  • boot stalls on "A start job is running for /dev/mapper/live-rw" until it hits the 50-minute limit
  • this results in:
timed out waiting for device /dev/mapper/live-rw
dependency failed for /sysroot
dependency failed for cOS system initramfs setup before switch root
dependency failed for initrd default target
dependency failed for migrate config to new version

Related to:

[ 3002.862138] localhost systemd[1]: systemd-ask-password-console.path: Deactivated successfully.
[ 3002.876793] localhost systemd[1]: Stopped Dispatch Password Requests to Console Directory Watch.
[ 3002.877445] localhost systemd[1]: Stopped target Basic System.
[ 3002.878015] localhost systemd[1]: Stopped target System Initialization.
[ 3002.878491] localhost systemd[1]: dracut-pre-mount.service: Deactivated successfully.
[ 3002.878601] localhost systemd[1]: Stopped dracut pre-mount hook.
[ 3002.879091] localhost systemd[1]: dracut-initqueue.service: Deactivated successfully.
[ 3002.879145] localhost systemd[1]: Stopped dracut initqueue hook.
[ 3002.879628] localhost systemd[1]: dracut-pre-trigger.service: Deactivated successfully.
[ 3002.879678] localhost systemd[1]: Stopped dracut pre-trigger hook.
[ 3002.880124] localhost systemd[1]: dracut-pre-udev.service: Deactivated successfully.
[ 3002.880160] localhost systemd[1]: dracut-pre-udev.service: Unit process 706 (rpcbind) remains running after unit stopped.
[ 3002.880176] localhost systemd[1]: dracut-pre-udev.service: Unit process 710 (rpc.statd) remains running after unit stopped.
[ 3002.880190] localhost systemd[1]: dracut-pre-udev.service: Unit process 715 (rpc.idmapd) remains running after unit stopped.
[ 3002.880254] localhost systemd[1]: Stopped dracut pre-udev hook.
[ 3002.880742] localhost systemd[1]: dracut-cmdline.service: Deactivated successfully.
[ 3002.880805] localhost systemd[1]: Stopped dracut cmdline hook.
[ 3002.881866] localhost systemd[1]: Started Emergency Shell.
[ 3002.882341] localhost systemd[1]: Reached target Emergency Mode.
[ 3002.882790] localhost systemd[1]: Reached target Initrd Root File System.
[ 3002.883747] localhost systemd[1]: Starting cOS system early rootfs setup...
[ 3002.910231] localhost elemental[1113]: INFO[2023-09-08T20:07:42Z] Starting elemental version 0.3.1
[ 3002.910231] localhost elemental[1113]: INFO[2023-09-08T20:07:42Z] reading configuration form '/etc/elemental'
[ 3002.910455] localhost elemental[1113]: INFO[2023-09-08T20:07:42Z] Running stage: rootfs.before
[ 3002.910455] localhost elemental[1113]: INFO[2023-09-08T20:07:42Z] Done executing stage 'rootfs.before'
[ 3002.910455] localhost elemental[1113]: INFO[2023-09-08T20:07:42Z] Running stage: rootfs
[ 3002.910455] localhost elemental[1113]: INFO[2023-09-08T20:07:42Z] Done executing stage 'rootfs'
[ 3002.910455] localhost elemental[1113]: INFO[2023-09-08T20:07:42Z] Running stage: rootfs.after
[ 3002.910455] localhost elemental[1113]: INFO[2023-09-08T20:07:42Z] Done executing stage 'rootfs.after'
[ 3002.910520] localhost elemental[1113]: INFO[2023-09-08T20:07:42Z] Running stage: rootfs.before
[ 3002.910584] localhost elemental[1113]: INFO[2023-09-08T20:07:42Z] Done executing stage 'rootfs.before'
[ 3002.910584] localhost elemental[1113]: INFO[2023-09-08T20:07:42Z] Running stage: rootfs
[ 3002.910662] localhost elemental[1113]: INFO[2023-09-08T20:07:42Z] Done executing stage 'rootfs'
[ 3002.910662] localhost elemental[1113]: INFO[2023-09-08T20:07:42Z] Running stage: rootfs.after
[ 3002.910730] localhost elemental[1113]: INFO[2023-09-08T20:07:42Z] Done executing stage 'rootfs.after'
[ 3002.911691] localhost systemd[1]: Finished cOS system early rootfs setup.
[ 3002.912704] localhost systemd[1]: Starting cOS system immutable rootfs mounts...
[ 3002.916317] localhost systemctl[1122]: Failed to stop oem.mount: Unit oem.mount not loaded.
[ 3002.923074] localhost systemd[1]: Finished cOS system immutable rootfs mounts.
[ 3002.923497] localhost systemd[1]: Reached target Initrd File Systems.
[ 3002.923872] localhost systemd[1]: Startup finished in 23.595s (firmware) + 3min 16.756s (loader) + 2.664s (kernel) + 0 (initrd) + 50min 259ms (userspace) = 53min 43.276s.

Resulting in:

Generating "/run/initramfs/rdsosreport.txt"

Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report.

Press Enter for maintenance
(or press control-d to continue)

To Reproduce
Pre-Reqs:

  • have a machine similar to the devices listed above
  • the AMD 5900 & Intel i7 vPro 7th Gen machines were reproduced/tested in UEFI boot mode, not legacy BIOS

Steps to reproduce the behavior:

  1. Have a bootable USB stick with the interactive ISO flashed to it
  2. Attempt to boot (either exercise Path B or let Path A run its course)

Expected behavior
The installer should not hit:

[ 3002.911691] localhost systemd[1]: Finished cOS system early rootfs setup.
[ 3002.912704] localhost systemd[1]: Starting cOS system immutable rootfs mounts...
[ 3002.916317] localhost systemctl[1122]: Failed to stop oem.mount: Unit oem.mount not loaded.
[ 3002.923074] localhost systemd[1]: Finished cOS system immutable rootfs mounts.

Nor should it hit the other failure points shown above.
It should let the user reach the first page of the interactive ISO install.

Environment
NOTE:
This is not reproducible with v1.2.0-rc5.
But is reproducible with v1.2.0-rc6.

  • Harvester ISO version: v1.2.0 & v1.2.0-rc6
  • Underlying Infrastructure: Bare-metal only; could not reproduce on an HP ProLiant server or in qemu/kvm

Additional context

Attaching some logs:

dmesg-logs.log
etc-initrd-release.log
etc-os-release.log
journalctllogs.log
rdsosreport.txt

And additional pictures:

PXL_20230908_184608227
PXL_20230908_203503129

Update

  • this can be reproduced in a hybrid virtualized setup: flash a USB stick with rc6 or v1.2.0, pass the USB device through to the VM, then set the boot order so the VM boots from the USB stick

Other Update

  • this is also reproducible with BIOS, not just UEFI

@irishgordo added the kind/bug, severity/1, and reproduce/often labels Sep 8, 2023
@slackspace-io

slackspace-io commented Sep 9, 2023

Same issue;

Dell Optiplex 3080
i5-12500

I've reset the BIOS, upgraded the BIOS, disabled SATA/M.2, and tried every USB port. I also tried the few BIOS options I've seen suggested for similar issues; all give the exact same behaviour as this.

I compared a v1.1.2 ISO to the v1.2.0 ISO and noticed that bootx64.efi is not executable in v1.2.0 but was in v1.1.2. I have no idea whether this can be a problem; I just noticed it during my uneducated attempts to compare the two. There is also a change from using kernel.xz and rootfs.xz to initrd. But I really have no idea if any of this matters.

@anixon604

I have the exact same issue on an Intel NUC12 i5-1240P.

@mirceanton

Another related (duplicate) issue: #4472

@slackspace-io

I think I found the cause!!
It seems we changed from kernel.xz to just kernel. I have no idea why or what that means :)

I noticed the search at the top of grub.cfg, though, is still looking for kernel.xz!!
search --no-floppy --file --set=root /boot/kernel.xz
It should be (I think?)
search --no-floppy --file --set=root /boot/kernel
At the same time as discovering this, I tried booting with all paths specified manually: (hd0,msdos1) for kernel and initrd, as well as the location for COS_LIVE.

I thought this made sense, and immediately wrote the above. But I've been retesting, and the only change I really had to make was root=live:/dev/sda1. That makes me wonder whether the search line is actually not the problem at all :) I would have thought both kernel and initrd would have failed to load, given the search was looking for kernel.xz?

My ultimate workaround, however, was just specifying the partition for the rd.live root, changing
$linux ($root)/boot/kernel cdroot root=live:CDLABEL=COS_LIVE rd.live.dir=/ rd.live.squashimg=rootfs.squashfs console=tty1 console=ttyS0 rd.cos.disable net.ifnames=1
to
$linux ($root)/boot/kernel cdroot root=live:/dev/sdX1 rd.live.dir=/ rd.live.squashimg=rootfs.squashfs console=tty1 console=ttyS0 rd.cos.disable net.ifnames=1

Warning: I am changing from a label-based lookup to a hard-coded device path. I know that in my case the USB stick is /dev/sda, but if the system had a SATA disk it could have been /dev/sdb, etc. Do not do this unless you know your device name, as this is a less reliable method of locating the image.
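
For anyone scripting the same edit instead of typing it at the GRUB prompt, here is a minimal Python sketch of the substitution. It only manipulates the argument string quoted above; the helper name set_live_root is hypothetical and not part of Harvester, GRUB, or dracut, and /dev/sda1 is just the device from my machine.

```python
def set_live_root(cmdline: str, device: str) -> str:
    """Return cmdline with its root=live:... argument pointing at `device`."""
    args = cmdline.split()
    replaced = False
    for i, arg in enumerate(args):
        if arg.startswith("root=live:"):
            # Swap the label-based lookup for an explicit device path.
            args[i] = f"root=live:{device}"
            replaced = True
    if not replaced:
        raise ValueError("no root=live: argument found")
    return " ".join(args)


# Kernel arguments as quoted in this comment (without the $linux prefix).
original = (
    "cdroot root=live:CDLABEL=COS_LIVE rd.live.dir=/ "
    "rd.live.squashimg=rootfs.squashfs console=tty1 console=ttyS0 "
    "rd.cos.disable net.ifnames=1"
)
patched = set_live_root(original, "/dev/sda1")  # assumed device name
print(patched)
```

The same caveat applies: the device path has to match wherever the USB stick shows up on the machine being booted.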

@Vicente-Cheng
Contributor

This is not reproducible with v1.2.0-rc5. But is reproducible with v1.2.0-rc6.

Thanks, @slackspace-io, for looking into this.
As @irishgordo reported, it works on v1.2.0-rc5 and fails from rc6 onward, including the formal release.

I checked rc5 and the later versions. It looks like they all use /boot/kernel.
But you may have found an issue here, so we will take a look at it as well.

So after you make this change, you can boot into the installer as usual,
and if you do not change anything you hit the same issue described here. Did I understand that right?

Thanks!

@slackspace-io

Yes, the only change required was the root=live portion. It was not necessary to set hd0,msdos1 on the ($root) portions, although setting these did not break anything.

When I press 'e' to edit the config, I do not get access to the 'search' line, so I was not able to change the search line itself and test.

Would the incorrect 'search' line cause the label-based root=live: to not be found?

I was able to successfully install on both a NUC gen 8 and an Optiplex 3080 by setting root=live:/dev/sda1 instead of the label-based identification.

@bk201
Member

bk201 commented Sep 11, 2023

@slackspace-io Thanks. We can reproduce it. Your workaround works quite well! Specifying the real partition name or UUID path works, just not sure why it breaks with the label.

@Vicente-Cheng
Contributor

Here is an update on the current status.

We found that the USB stick ends up with two partitions labeled COS_LIVE.
During boot, a checking script hangs because of the wrong partition.

The default timeout is 3000 seconds, which is why we have to wait 50 minutes.
For the timeout setup, see https://github.com/haraldh/dracut/blob/master/modules.d/90dmsquash-live/dmsquash-generator.sh#L75-L80.
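
As a quick sanity check of that figure, the conversion can be written out in Python. The only input is the 3000-second default mentioned above; nothing here is read from dracut itself.

```python
TIMEOUT_SECONDS = 3000  # default timeout quoted in this comment

# Convert to whole minutes plus leftover seconds.
minutes, seconds = divmod(TIMEOUT_SECONDS, 60)
print(f"{TIMEOUT_SECONDS}s = {minutes}min {seconds}s")
```

This lines up with the "50min 259ms (userspace)" figure in the startup summary from the original report.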

So that might be the root cause of this situation. I also tried the original ISO (that is, without the repack), and it works well.

NOTE: We repack the ISO because we need to support legacy BIOS boot.

BTW, we also found that some ISO writing tools (I tried Rufus) avoid this problem because they change the original partition layout.
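
To check a flashed USB stick for the duplicated label described above, something like the following Python sketch could help. It assumes the input is the JSON that lsblk prints with --json -o NAME,LABEL; the function name find_duplicate_labels and the sample data are illustrative only and were not captured from a real device.

```python
import json
from collections import defaultdict


def find_duplicate_labels(lsblk_json: str) -> dict:
    """Map each filesystem label that appears more than once to its devices."""
    by_label = defaultdict(list)

    def walk(nodes):
        for node in nodes:
            label = node.get("label")
            if label:
                by_label[label].append(node["name"])
            # Partitions are nested under their parent disk as "children".
            walk(node.get("children", []))

    walk(json.loads(lsblk_json)["blockdevices"])
    return {label: names for label, names in by_label.items() if len(names) > 1}


# Illustrative sample shaped like `lsblk --json -o NAME,LABEL` output.
SAMPLE = """
{"blockdevices": [
  {"name": "sda", "label": null, "children": [
    {"name": "sda1", "label": "COS_LIVE"},
    {"name": "sda2", "label": "COS_LIVE"}
  ]}
]}
"""
print(find_duplicate_labels(SAMPLE))
```

On a real system the JSON string would come from running lsblk against the USB device; an empty result would mean no label is shared between partitions.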

@jeff-radick-suse

Some things confuse me about this.

  1. When I look at my USB stick with the 1.2.0 image on it, I only see a single COS_LIVE partition. I'm examining it plugged into my laptop just to look at it, not from within the system that's trying to run it to do the install.
  2. If there is a partitioning/layout problem, shouldn't this be deterministic and repeatable on any attempt to boot from this ISO image on any hardware or in a VM?

The compressed vs uncompressed kernel ought not to matter. If that was the problem then the boot would fail at the grub boot prompt and you wouldn't get to the point where the kernel is running and trying to do stuff.

There's a lot I don't yet know about how this ISO is constructed so I'm sure there's something important I'm missing here.

@jeff-radick-suse

I was mistaken about the partition labeling. If I use lsblk -O, it shows more info than things like parted -l or fdisk -l do, and when I do that, it shows the EFI partition for the rc6 image as having the COS_LIVE label. That is strange.

I ran builds from my desktop for rc5 and the release, and have been comparing the build output; it appears that the actions at the end to create the ISO image look substantially different, though I do not yet understand these differences, or where they come from. A lot about the build process is not evident to me yet.

I am however now fairly convinced that some late change in the way the ISO is created is the cause of this problem.

@irishgordo
Author

@Vicente-Cheng I updated the initial description as this seems to be both reproducible with UEFI & BIOS

@guangbochen added the require/doc and priority/0 labels Sep 13, 2023
@bk201 added this to the v1.2.0 milestone Sep 14, 2023
@harvesterhci-io-github-bot

harvesterhci-io-github-bot commented Sep 14, 2023

Pre Ready-For-Testing Checklist

  • If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted?
    The HEP PR is at:

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

    Perform installation with the following methods:

    • Boot from the ISO.
    • Flash the ISO to a USB stick and boot from it. Note: use balenaEtcher or dd rather than Rufus, because an image written with Rufus does not show the issue.
    • iPXE method.
  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Have the backend code been merged (harvester, harvester-installer, etc) (including backport-needed/*)?
    The PR is at:

    • Does the PR include the explanation for the fix or the feature?

    • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
      The PR for the YAML change is at:
      The PR for the chart change is at:

  • If labeled: area/ui Has the UI issue filed or ready to be merged?
    The UI issue/PR is at:

  • If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged?
    The documentation/KB PR is at:

  • If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?

    • The automation skeleton PR is at:
    • The automation test case PR is at:
  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at:

@harvesterhci-io-github-bot

Automation e2e test issue: harvester/tests#937

@irishgordo
Author

Validated that the workaround provided through the docs PR looks good on qemu/kvm, in both BIOS & UEFI modes, with a non-virtualized USB host device.
Screenshot from 2023-09-14 09-42-26
Screenshot from 2023-09-14 09-41-28
Screenshot from 2023-09-14 09-37-56
Screenshot from 2023-09-14 09-37-06

I will follow up on this once I am back at my apartment and can validate on bare-metal 😄 that the workaround from the docs PR also works directly on UEFI & BIOS bare-metal (nothing virtualized).

@irishgordo
Author

Validated that this workaround also looks good on bare-metal 😄 :

  • 5900, 64GB, custom tower w/ UEFI
  • ThinkCentre, w/ BIOS

@irishgordo
Author

@Vicente-Cheng based on the documentation update working for both solutions in kvm/qemu & on bare-metal (consumer-grade) in what was mentioned above, I feel comfortable closing this out. Thanks again for the doc update on the workaround 😄 !

cc: @bk201

@Roguito

Roguito commented Sep 19, 2023

So where are we at? I just flashed 1.2 from the current releases with this bug. Is 1.2.0-patch1 available?

@irishgordo
Author

Hi @Roguito, the docs have been updated with harvester/docs#435, which adds content to the USB Installation section, including a link to where a user can download the patched ISO 😄
Screenshot from 2023-09-19 14-09-31

@Roguito

Roguito commented Sep 19, 2023

That's what I get for focusing on github over the documentation. Thank you so much!

@harvesterhci-io-github-bot

added backport-needed/1.1.3 issue: #4549.
