testiso failing in 4.11 using RHEL 8.5; race mounting rootfs #666

Closed
cgwalters opened this issue Nov 10, 2021 · 13 comments

Comments

@cgwalters
Member

cgwalters commented Nov 10, 2021

Our Live ISO relies on loopback mounting a squashfs.

It's failing in 4.10; looks like this may be a kernel change:

[    8.185550] systemd[1]: Mounting /sysroot...
[    8.200864] loop1: detected capacity change from 0 to 972305408
[    8.208027] loop_set_status: loop1 () has still dirty pages (nrpages=2)
[    8.212371] mount[758]: mount: /sysroot: failed to setup loop device for /run/media/iso/images/pxeboot/rootfs.img.
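
For anyone triaging console logs for this signature, here is a minimal sketch of a check. The function name `has_loop_race` is hypothetical, and it only looks for the two messages shown in the log above (the kernel "dirty pages" line and the mount failure); it does not prove the root cause.

```shell
# Hypothetical helper: reads a console log on stdin and returns 0 only if
# both the kernel "dirty pages" message and the mount loop-setup failure
# appear somewhere in it.
has_loop_race() {
    log=$(cat)
    printf '%s\n' "$log" | grep -q 'loop_set_status: .* has still dirty pages' &&
    printf '%s\n' "$log" | grep -q 'failed to setup loop device for'
}
```

Usage would be along the lines of `has_loop_race < console.log && echo "matches the loop race signature"`, where `console.log` is an assumed file name.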
@cgwalters
Member Author

One possibility here is that mksquashfs (or more generally, something in the tooling that generates disk images) changed in Fedora in a way that breaks mounting the image on RHEL 8. If the problem were rooted in coreos-assembler (Fedora), that would explain why it's appearing on multiple branches. This update was pushed recently.

@miabbott
Member

We are seeing this also on internal 4.9 CI jobs with the same error, but the coreos-assembler image that was used to build the image maps to coreos/coreos-assembler@9730a0b from Oct 6, so it doesn't seem likely that a change in cosa content is the culprit.

@miabbott
Member

We had one successful build + test of 4.10 using the RHEL 8.4 EUS bits (the repo location we had previously been using switched to RHEL 8.5). I suspect there was something in the 8.5 bits that was causing the error, but the pkgdiff for the CI jobs isn't working, so we don't have a good snapshot of what changed.
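
In the absence of a working pkgdiff, a rough manual comparison is possible if package lists from both builds are available. The sketch below assumes two plain-text files with one package NVR per line (for example, saved output of `rpm -qa`); the function name `pkg_changes` and the file names are hypothetical.

```shell
# pkg_changes OLD NEW: print packages found only in OLD (prefixed "-")
# and only in NEW (prefixed "+"). Inputs need not be pre-sorted.
pkg_changes() {
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    sort "$1" > "$old_sorted"
    sort "$2" > "$new_sorted"
    comm -23 "$old_sorted" "$new_sorted" | sed 's/^/-/'
    comm -13 "$old_sorted" "$new_sorted" | sed 's/^/+/'
    rm -f "$old_sorted" "$new_sorted"
}
```

For example, `pkg_changes build-8.4.txt build-8.5.txt | grep -E 'kernel|util-linux'` would narrow the output to the packages suspected here.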

@miabbott
Member

miabbott commented Nov 10, 2021

We've had additional success building RHCOS 4.10 + 4.9 using RHEL 8.4 EUS sources.

Some additional investigation around RHCOS 4.10 + RHEL 8.5 kernel would be a good place to start.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale label Feb 8, 2022
@miabbott
Member

miabbott commented Feb 9, 2022

This issue seems to have resolved itself, though we don't really understand why.

Please re-open if this happens again.

/close

@openshift-ci
Contributor

openshift-ci bot commented Feb 9, 2022

@miabbott: Closing this issue.

@bgilbert
Contributor

This showed up again in #718.

The kernel message "has still dirty pages" comes from torvalds/linux@5db470e, and is printed right before returning -EAGAIN from LOOP_SET_STATUS, LOOP_SET_STATUS64, or LOOP_SET_BLOCK_SIZE. That commit is included in RHEL ≥ 8.1.

The util-linux message "failed to setup loop device for" is only generated from one place, loopcxt_setup_device(), which didn't handle EAGAIN from LOOP_SET_STATUS64 until util-linux/util-linux@eab90ef. That commit is not included in any RHEL 8 util-linux.

The util-linux commit mentions that the failure only occurs with a non-zero file offset. That's not a common setting, so it's unsurprising that no one noticed. But CoreOS uses a non-zero offset to mount the rootfs squashfs directly out of the cpio archive. (Added in RHCOS 4.6 in coreos/fedora-coreos-config@18a2c51.)

A reproducer based on the util-linux commit message triggers pretty quickly for me:

truncate -s 100M disk
mkdir point
losetup -o 239 /dev/loop3 disk
mkfs.ext4 /dev/loop3
losetup -d /dev/loop3
while mount -o loop,offset=239 disk point && umount point; do :; done

We should ask for util-linux/util-linux@eab90ef to be backported to 8.4+.
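
Until a patched util-linux is available, a caller-side retry is one possible stopgap. The sketch below is a crude stand-in rather than what util-linux does: the helper name `retry` is hypothetical, and it reruns the command on any failure, not specifically on EAGAIN.

```shell
# retry MAX DELAY CMD...: run CMD, rerunning it after DELAY seconds on
# failure, for at most MAX attempts in total. Returns 0 as soon as CMD
# succeeds and 1 if every attempt fails.
retry() {
    max=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        [ "$n" -ge "$max" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}
```

With the reproducer's `disk` and `point`, this would be invoked (as root) as `retry 5 1 mount -o loop,offset=239 disk point`.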

@cgwalters
Member Author

Awesome work on that investigation 🕵️ !

We should ask for util-linux/util-linux@eab90ef to be backported to 8.4+.

Filed https://bugzilla.redhat.com/show_bug.cgi?id=2058176

miabbott added a commit to miabbott/os that referenced this issue Feb 24, 2022
All of the ISO-based `testiso` scenarios are failing due to openshift#666
so we are going to disable them until we get `util-linux` patched.

See also: https://bugzilla.redhat.com/show_bug.cgi?id=2058176
@miabbott miabbott changed the title from "testiso failing in 4.10" to "testiso failing in 4.11 using RHEL 8.5; race mounting rootfs" Feb 24, 2022
@miabbott
Member

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale label Feb 24, 2022
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale label May 26, 2022
@miabbott
Member

/remove-lifecycle stale
/lifecycle frozen

@openshift-ci openshift-ci bot added lifecycle/frozen and removed lifecycle/stale labels May 26, 2022
@cgwalters cgwalters removed the lifecycle/frozen label Jun 1, 2022
@cgwalters
Member Author

The fix tracked in https://bugzilla.redhat.com/show_bug.cgi?id=2058176 shipped in RHEL 8.6.
