Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nvlist_find_value() rewrite causes regression in grub2-probe on non-ZFS systems #1

Open
ryao opened this issue Apr 24, 2013 · 6 comments

Comments

@ryao
Copy link

ryao commented Apr 24, 2013

The nvlist_find_value() rewrite caused a regression in grub2-probe on non-ZFS systems. It seems that systems that happen to have a correct magic number in the right place will be handled as if they had proper nvlists, even when the data is nonsense. This can cause an infinite loop. Here is the Gentoo bug:

https://bugs.gentoo.org/show_bug.cgi?id=462740

@maxximino
Copy link
Owner

I think that I've identified a possibile cause. Fix proposed in commit 83aabc1
The current code moves to the next nvpair by moving the pointer by the size of the current nvpair.
If we are reading from non-nvlist-data (or bad formed nvlist), the current nvpair might be zero.
So the next nvpair is also the current nvpair, and this repeats for an unlimited number of times.
Reading from non-nvlist-data can happen when looking for ZFS labels on device that are not ZFS vdevs, like in Gentoo bug #462740.

@LazyGentooGuy
Copy link

I'm a Sabayon user that encountered this problem as well. I confirm that sys-boot/grub-2.00-r2 led to a grub2-probe hang when running grub2-mkconfig. Reverting to sys-boot/grub-2.00-r1 and upgrading to sys-boot/grub-2.00-r3 both resolved this issue for me. Here's the Sabayon bug for reference: https://bugs.sabayon.org/show_bug.cgi?id=4178

@maxximino
Copy link
Owner

@LazyGentooGuy can you please test if sys-boot/grub-2.00-r2 with the commit 83aabc1 fixes the problem? I'm unable to reproduce it. Also, any information useful to reproduce the problem is more than welcome. Which filesystem and/or md/dm-raid are you using? Can you provide a debug log of grub2-probe from a working execution?
You should be able to test (also in Sabayon) with:

mkdir -p /etc/portage/patches/sys-boot/grub-2.00-r2/
curl https://github.com/maxximino/grub2/commit/83aabc190e95860437fdb848461b16202b8cbb79.patch >/etc/portage/patches/sys-boot/grub-2.00-r2/grub-2.00-zfs.patch
emerge -1v =sys-boot/grub-2.00-r2

@maxximino
Copy link
Owner

@csiden, as you are the original author of the code, what do you think about the proposed fix? I think that an unlucky (or specially crafted) partition could start an infinite loop also in Grub1 in IllumOS.
If encode_size ( https://hg.openindiana.org/upstream/illumos/illumos-gate/rev/2889e2596bd6#l10.363 ) is zero, nvlist_next_nvpair would return the same pointer as in it's input, so the cycle at
https://hg.openindiana.org/upstream/illumos/illumos-gate/rev/2889e2596bd6#l10.508 would never end.
As I haven't been able to reproduce the problem, this is my best guess about the cause of this problem.

@csiden
Copy link

csiden commented May 6, 2013

@maxximino It's been a while since I looked at this code, but going through it with @ahrens it looks like we should be verifying that this is a valid ZFS file system before starting to process the nvlist, not just trying to handle the nvlist better. The proposed fix only seems to address the case where the invalid value is <= 0, but an invalid positive number could lead us to offset into invalid memory.

It seems like the correct fix would be to have check_pool_label() validate that vdev->vl_vdev_phys.vp_zbt is correct (it has type zio_eck_t defined in zio.h which consists of a magic number, ZEC_MAGIC, and a 256-bit checksum of the label). It looks like the grub implementation of zfs_mount() just completely ignores this value. Verifying that the magic number is correct in check_pool_label() before calling nvlist_unpack() would probably be enough to fix this problem.

@LazyGentooGuy
Copy link

@maxximino I have tested the patch as requested and it did seem to resolve the issue. grub2-probe did not hang on mdadm as it did previously. I reverted back to the current Sabayon grub-2.00-r2 and the failures appeared again

I have made some changes to my config over the past couple of weeks in order to get my system booting. Originally, I had ext3 /boot on /dev/md0 as 0.90 mdadm RAID-1 over /dev/sd{a,b}1 with /dev/sd{c,d}1 as spares. I keep ext4 / on /dev/md1 as 0.90 mdadm RAID-10 over /dev/sd{a..d}3. I first noticed the problem when I tried to 'equo upgrade' all packages and checked after an hour it hadn't moved - that's when I noticed it was stuck on grub2-probe. I killed that process and the rest of the update packages installed.

On next boot, however, I was getting "Block device /dev/md1 is not a valid root device". I was able to boot on an older 3.6 kernel to troubleshoot. Turns out aside from the grub2-probe bug, I also had to add domdadm to my kernel parameters when previously that wasn't needed. The interesting thing though is that once that was added, bootup took longer to construct the arrays. And it was building /dev/md127 and /dev/md126 instead of /dev/md0, with a /dev/sda1 and /dev/sdc1 in one and /dev/sdb1 and /dev/sdd1 in the other. Since they all had same uuid though, i found out that rebooting during troubleshooting would give me different /boot partitions, so some tests of grub.cfg would disappear across reboots - very frustrating until I figured out what was happening. I also noticed with new grub2-mkconfig the devices were being referenced by /dev/md1 instead of root=UUID=xxxxx...

So I eventually did away with RAID-1 with two spares, and instead did just RAID-1 with /dev/sda1 and /dev/sdb1, zeroing out the superblock on /dev/sd{c,d}1. While I was at it I also backed up /boot, formatted to ext4 and replaced files.

I think the problem I had with mdadm was due to removal of udev and ability to reference devices by UUID

So that's my current config and the reasons for getting there. As mentioned above I still get grub2-probe error when using current Sabayon grub-2.00-r2 package, but using the grub-2.00-zfs.patch seems to have resolved it for me.

Hope this helps but let me know if you need any more info...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants