Resizing of volume groups causes mounting issues on RHEL 8 #691
Hi @Chris-Long02, sorry for the delayed response, I was out on vacation for a couple weeks. Did you figure out the issue and get things working? If so, would you mind posting the solution, in case someone else runs into it later? |
I unfortunately never found a fix. I closed it as the latest release doesn't have the issue. |
Hmm, it's been fairly quiet between the April and May releases, nothing comes to mind as far as any changes in this project or its dependencies. The May release should just have patch updates, compared to April, but should otherwise be the same... |
@lorengordon I take back what I said before. I've run into the issue again on the spel-minimal-rhel-8-hvm-2024.05.1.x86_64-gp3 release. With no configurations other than what was posted in the original message, there are still mounting issues. |
This morning, when I reformatted the issue-opening information to make the code more easily excerptible, I did launch an EC2 from the stated AMI using the provided code. I wasn't able to replicate the issue (the script's geometry changes applied and the instance rebooted right back to a running state). That said, the code as provided didn't actually function without modification. Note: I was using a Nitro-based instance-type (specifically, a …
With respect to the third point, if you use a Nitro-based instance-type, the AMI's baked-in SSM packages/services make it so that you can access the rescue-mode prompt from the EC2 web UI, provided an appropriate instance-role is attached to the EC2 and the UI user has an appropriate access-policy attached. |
@ferricoxide I realize now that I left out somewhat important info, sorry for that. I am using a t3.large and launching with 50 GB of storage. I tried running from a t3.medium and the script still ran fine. Before the script is run, the output of lsblk is: … After the script is run, the output of lsblk is: … Output as the script runs is: … The issue is normally tripped when I make an AMI of the instance that ran the script and launch from the new AMI; the instance that ran the script tends to still be fine after a reboot. The really tricky part is that the issue doesn't occur every time. I have to launch a small batch of maybe 5 instances, and maybe 1 or 2 will succeed with the rest failing. |
Ok. Without knowing your child-AMI process and seeing output from that failing process, there's only so much to go on. If you can get us some diagnostic data, we might be able to help out with your specific consistency-problem or, if there's something truly problematic with the AMI(s), fix the automation used to produce them. Something to consider, over using our AMIs as a starting point, is using this automation to originate your own, more-suitable-to-you AMIs. The automation has enough configurational flexibility in it to do so (I'm super lazy, so I parameterized a lot of the plumbing to allow for things like geometry-customization and installation of custom RPM manifests). |
This is the log file from a failed instance: |
Ok, it's not being super helpful with those logged-errors, eh?
If we want to see more, we're probably going to need to log in and see what it's annoyed about (using the prescribed diagnostic command). This may require setting a password on the root account (doable via user-data payload, especially using a |
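For anyone reading along later: a minimal cloud-config sketch for setting a temporary root password from a user-data payload, assuming cloud-init's `chpasswd` module is available in the image. The password value here is a placeholder and should be rotated or removed immediately after the rescue work is done:

```
#cloud-config
chpasswd:
  expire: false
  list: |
    root:ChangeMeAfterRescue
```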
Interesting, it's saying the mount doesn't exist. Mounting /var/log/audit... |
Ok... but it successfully mounted |
Right, there should've been. I found another post about this, going to try implementing it. |
Good luck. Let us know if you're able to isolate anything and if there's anything you suspect could be added into the images to help with the issue. That linked-post makes it sound like what you're seeing could be another of those "this is the downside of 'stable-release' distros like Red Hat Enterprise Linux"? I mean, at least with EL8, they have been doing more-frequent rebasing of tools rather than only patching packages that the X.0 release shipped with. |
Haven't had any success yet, but did find out more info. /usr/bin/mount is exiting with status 32 for /boot. Investigating the cause of that. |
I may be having the same/similar issue. I am attempting to use … One thing I noticed though is that if I keep rebooting, sometimes it boots correctly and I can get back in. If I reboot again, it goes back to failed boots. I've been digging through the logs; I see the mount fail with that code when a device is busy while the system is shutting down:
After comparing successful boots with unsuccessful ones, my guess is that a systemd reload is breaking the mounting process. Here are two examples of failed boots where a … |
The boots without that reloading statement in the mounting process succeed. |
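A quick way to spot that pattern is to filter the journal for the reload and the mount failure together. This is just a sketch over canned, hypothetical journal lines (the unit name and message text are illustrative); on a real instance you would pipe `journalctl -b -1` into the same `grep`:

```shell
# Hypothetical journal excerpt from a failed boot; on a live system:
#   journalctl -b -1 | grep -E 'Reloading|status=32|Failed to mount'
MATCHES=$(printf '%s\n' \
  'systemd[1]: Reloading.' \
  'systemd[1]: var-log-audit.mount: Mount process exited, code=exited, status=32/n/a' \
  'systemd[1]: Failed to mount /var/log/audit.' |
  grep -cE 'Reloading|status=32|Failed to mount')
echo "$MATCHES"  # all three canned lines match, so this prints 3
```

If a `Reloading.` line lands between a mount unit starting and finishing, that would line up with the "reload breaks the mounting process" theory above.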
@mrabe142 Are your instances failing from |
I use the AMI directly (no modifications/derivatives). |
Have you found a fix by any chance? I haven't had any success yet. |
I did not. I know some systemd stuff can execute in parallel during boot, so maybe there is just an unfortunate ordering to things in this setup, where the reload is being triggered at a bad time by something else. I didn't get a chance to experiment with adding other attributes to the fstab entries to see if they would make any difference, either. |
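For reference, the fstab attributes usually reached for in this class of race are `nofail` (boot proceeds even if the mount fails) and `x-systemd.device-timeout=` (how long systemd waits for the backing device). A hypothetical entry for the audit volume; the VG name `RootVG` and the `xfs` type are illustrative rather than taken from this thread, and note that `nofail` on an audit volume may conflict with hardening baselines that require the mount to be present:

```
/dev/mapper/RootVG-auditVol  /var/log/audit  xfs  defaults,nofail,x-systemd.device-timeout=30s  0 0
```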
What follows is mostly "grabbing at straws" in nature (since I haven't encountered the issue myself, and it seems like the issues you've encountered haven't been 100% repeatable)… Unfortunately, with the interaction between … At any rate… Because a lot of people would die of hives if … That said, I know that at one point in the hardening-content cycles for RHEL (at this point, I can't specifically remember if it was EL7, EL8 or EL9, but I think it was in the earliest iterations for EL7), the remediation-content tried to convert … I mention the hardening-content because most of those we're aware of that use the spel-generated images pair that usage with … The other thing I would ask, particularly given @mrabe142's mention of:
"Are either/both of you expanding the root disk/VG?" Basically, I'm wondering if doing so might be introducing disjoint volume-composition that might be slowing down the … As a final mention: the Red Hat AMIs (and Azure VM-templates) are tagged with Red Hat "pay as you go" entitlements (part of why EC2s launched from RHEL AMIs have higher hourly charges than those launched from the CentOS/CentOS Stream or Oracle Linux AMIs). The primary purpose of these entitlements is providing access to the official dnf repositories maintained by Red Hat. That said, those entitlements also entitle an AWS (or Azure) account-holder to limited OS support (via the CSP's case-management system). In the past, I've been able to open support tickets with AWS using those tagged entitlements. That may be a pursuit-avenue available to you to help identify the underlying problem. |
To summarize my observations:
|
I've experienced this issue with many different configurations/modifications, the most minimal being one that only expands the root disk/VG, with no hardening. I have yet to have the issue occur on a boot directly from the SPEL AMI, though. I tried using a mount wrapper to stop race conditions, but it didn't work, although it's very possible that there's an issue with the mount wrapper itself. |
Just for clarification: you were successfully using the RHEL 8 AMIs prior to the … Also: do you have a userData payload (or other provisioning-automation) that can be borrowed to try to replicate the problem? Otherwise, I'm kind of blind. |
That's correct. The "Steps to expand volumes" from the original post is the only user data I've been using, and the issue still occurs. Using a C5 instance-type might provide you more luck in recreating the problem, as it's been a lot more consistent in exhibiting the issue in my recent testing. |
Bugger… Ok, did the February one work for you (
Note that after the |
Guess I should have verified the AWS region: the above is us-east-1. As noted previously, the 3.1 releases were marked "deprecated", so, when I do a search from my normal (GC) dev account, I get:
At any rate, if your issues are in GC, please check whether they still manifest with the 03.2 AMI. |
I remember having an issue with a previous AMI, but I think it was the deprecated one. I have made images based off the 03.2 release without issue, but I'll try on the February one. |
Haven't had any failed launches with |
Ok, so looking back at your userData payload:

```shell
DISK=$(lsblk -l | sort | awk '{ if ($6 == "disk") { print $1 }}' | tail -1)
DISKPART=$(lsblk -l | sort | awk '{ if ($6 == "part") { print $1 }}' | tail -1)
PARTNUM=$(echo "$DISKPART" | grep -oP "[0-9]+$")
sudo growpart /dev/$DISK $PARTNUM
sudo pvresize /dev/$DISKPART
VOL=$(lsblk -l | sort | awk '{ if ($7 == "/") { print $1 }}' | tail -1)
GROUP=$(echo $VOL | grep -oP "^[^-]*")
sudo lvextend -r -L 4G /dev/$GROUP/homeVol
sudo lvextend -r -L 8G /dev/$GROUP/varVol
sudo lvextend -r -L 6G /dev/$GROUP/logVol
sudo lvextend -r -L 10G /dev/$GROUP/auditVol
sudo lvextend -r -l +100%FREE /dev/$GROUP/rootVol
```

My first recommendation would be to change from using a simple, … payload. In particular, this stanza:

```shell
DISK=$(lsblk -l | sort | awk '{ if ($6 == "disk") { print $1 }}' | tail -1)
DISKPART=$(lsblk -l | sort | awk '{ if ($6 == "part") { print $1 }}' | tail -1)
PARTNUM=$(echo "$DISKPART" | grep -oP "[0-9]+$")
sudo growpart /dev/$DISK $PARTNUM
```

Using a multipart mixed-MIME payload, this stanza's logic could be replaced with a `growpart:` directive:

```
growpart:
  mode: auto
  devices: [
    '/dev/nvme0n1',
    '/dev/nvme0n1p3',
    '/dev/nvme0n1p4',
  ]
```

The snippet I post above is just an "I might be using any Nitro or non-Nitro instance type to host a spel-derived OS that may or may not have UEFI-boot support" safety-list. At any rate, having moved the EBS geometry-change logic to a …:

```shell
DISKPART=$(
  lsblk -l | \
  sort | \
  awk '{ if ( $6 == "part" ) { print $1 }}' | \
  tail -1
)
growpart /dev/$DISK $PARTNUM
pvresize /dev/$DISKPART
VOLPATH=$( lsblk --noheadings -l | sed -n '/ \/$/p' | cut -d " " -f 1 )
VOLGROUP="${VOLPATH//-*/}"
lvextend -r -L 4G "/dev/$VOLGROUP/homeVol"
lvextend -r -L 8G "/dev/$VOLGROUP/varVol"
lvextend -r -L 6G "/dev/$VOLGROUP/logVol"
lvextend -r -L 10G "/dev/$VOLGROUP/auditVol"
lvextend -r -l +100%FREE "/dev/$VOLGROUP/rootVol"
```

Ultimately, I took what you originally posted and created a multipart mixed-MIME userData payload of: …

And then, as a quick test, I used a … |
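As an aside on why I'd steer away from the `lsblk -l | sort | … | tail -1` selection logic: as soon as a second EBS volume is attached, `tail -1` can pick the wrong device. A runnable sketch against canned, hypothetical `lsblk -l` output (column 6 is TYPE, as in the payload's `awk`):

```shell
# Hypothetical 'lsblk -l' output for a Nitro instance with a second EBS
# volume attached. Columns: NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
fake_lsblk() {
  printf '%s\n' \
    'nvme0n1   259:0 0  50G 0 disk' \
    'nvme0n1p1 259:1 0 200M 0 part /boot/efi' \
    'nvme0n1p2 259:2 0   1G 0 part /boot' \
    'nvme0n1p3 259:3 0  48G 0 part' \
    'nvme1n1   259:4 0 100G 0 disk'
}
# The payload's selection logic, run against the canned output:
DISK=$(fake_lsblk | sort | awk '{ if ($6 == "disk") { print $1 }}' | tail -1)
echo "$DISK"  # prints "nvme1n1" -- the secondary volume, not the root disk
```

With only one volume attached, the same pipeline happens to return the root disk, which is presumably why the payload works most of the time.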
20-for-20: all good… |
Batch of 25: all good… Note that what I'm doing to create these batches is:

```shell
mapfile -t INSTANCES < <(
  aws ec2 run-instances \
    --image-id ami-0455bbb8b742553ba \
    --instance-type c5.large \
    --subnet-id <PRIVATE> \
    --security-group-id <PRIVATE> \
    --iam-instance-profile 'Name=<PRIVATE>' \
    --key-name <PRIVATE> \
    --block-device-mappings 'DeviceName=/dev/sda1,Ebs={DeleteOnTermination=true,VolumeType=gp3,VolumeSize=50,Encrypted=false}' \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=Testing: GitHub Issue - spel #691}]' \
    --user-data file:///tmp/userData.spel_691 \
    --count 25 \
    --query 'Instances[].InstanceId' \
    --output text | \
  tr '\t' '\n'
)
```

Note 1: I'm wrapping the launch in …
Note 2: The account I'm testing in is limited for address space within the AZ I'm using, so "25" is about the most I can launch in one go (I'd probably soon be running up against EC2 type-limits anyway, if IP space wasn't killing me). |
Ok, looks like I can launch at least 30 into an empty subnet… At any rate, at this point, I've batch-launched over 100 ( |
I used the payload you provided above and I'm still getting the same result. Launching from the original AMI has still not caused me any issues, but once I create an image of the modified … |
Unfortunately, I can't really provide guidance on what deltas between our AMI and your derived AMI might be responsible for the issue. If your AMIs are still RHUI-enabled/entitled, you should be able to open a support request with AWS (who can rope in Red Hat's OS-support group for you). If you're not sure your AMIs are RHUI-enabled/entitled, you can do:

```shell
aws ec2 describe-images \
  --image-id <YOUR_IMAGE_ID> \
  --query 'sort_by(Images, &CreationDate)[].[UsageOperation]'
```

Entitled AMIs will produce output like:

```
[
    [
        "RunInstances:0010"
    ]
]
```

AMIs that lack entitlement will produce output like:

```
[
    [
        "RunInstances"
    ]
]
```

Similarly, from within a running EC2, you can execute:

```shell
curl -s http://169.254.169.254/latest/dynamic/instance-identity/document/ | \
  python3 -c 'import json,sys ; print( json.load(sys.stdin)["billingProducts"] )'
```

If your EC2 has entitlement, you'll get back a result like `['bp-6fa54006']`; otherwise, that command will return `None`. |
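To see exactly what that one-liner keys on without touching the metadata service, you can feed it a canned identity document (the values here are hypothetical):

```shell
# Canned instance-identity document; a real one comes back from
# http://169.254.169.254/latest/dynamic/instance-identity/document/
RESULT=$(printf '%s' '{"billingProducts": ["bp-6fa54006"], "region": "us-east-1"}' | \
  python3 -c 'import json,sys ; print( json.load(sys.stdin)["billingProducts"] )')
echo "$RESULT"  # prints "['bp-6fa54006']"
```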
Okay, thank you for all the help! I really appreciate it. |
Just wanted to update on what I have observed so far after a bit of testing. To clarify, I have no issues launching the AMI. I am only running into issues after I have done the initial SSH to the instance and rebooted. I spun up seven different instances:
The first one started to have boot-volume mounting issues after the second reboot, with no modifications/updates. … In summary, 3, 5, and 7 do not exhibit reboot mounting issues (as of yet). 2 and 7 used a different version of the AMI. I have not tried to resize any volumes yet. On comparing, I do see some differences in the devices: 1 and 4 have … I will continue to test after doing some resizing and hardening to see if that changes anything. |
Bleah: not much in the way of consistency in what you're describing. It's going to make reproducing significantly more difficult to do. I can re-do my tests and run random reboots against them to see if I can provoke similar behaviors, but it's unlikely to be until next week that I can set aside any time to do so. As to:
Device-node references are a function of the instance-type you selected:
Probably neither here nor there, but it's generally recommended to not use pre-Nitro instance-types if you can avoid it:
In fairness, first and second bullet are heavily supposition-based. |
So, before I started working any other tasks/projects today, I fired up 30 EC2s in the …:

```shell
$ mapfile -t INSTANCES < <(
    aws ec2 run-instances \
      --image-id ami-0455bbb8b742553ba \
      --instance-type t3.medium \
      --subnet-id <PRIVATE_INFO> \
      --security-group-id <PRIVATE_INFO> \
      --iam-instance-profile 'Name=<PRIVATE_INFO>' \
      --key-name <PRIVATE_INFO> \
      --block-device-mappings 'DeviceName=/dev/sda1,Ebs={DeleteOnTermination=true,VolumeType=gp3,VolumeSize=50,Encrypted=false}' \
      --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=Testing: GitHub Issue - spel #691}]' \
      --user-data file:///tmp/userData.spel_691 \
      --count 30 \
      --query 'Instances[].InstanceId' \
      --output text | \
    tr '\t' '\n'
  )
```

Note that the …

I then set up a simple, "do forever" loop to iterate over those 30 instances, rebooting them every two or so minutes:

```shell
while true
do
  for INSTANCE in "${INSTANCES[@]}"
  do
    timeout 5 ssh "maintuser@${INSTANCE}" "sudo systemctl reboot"
    echo
  done
  sleep 120
done
```

And let that run for about an hour. Over the course of that hour-long iteration: …
In short, I can't reproduce the issue with the configuration information relayed to me to date. I either need more in the way of "how do I reproduce your problem" input, or I'm at the end of what I can think to do as troubleshooting steps. |
Just a heads up: if I don't see any further comments from either @Chris-Long02 or @mrabe142 before Wednesday next (26 June 2025), I'm going to close this issue. If either of you is able to provide me more information on reproducing your respective problems, I can either continue trying to resolve this ticket or, if that info comes in after closing, I'll re-open. |
Just wanted to provide an update. I am not currently blocked on this, so there is no urgent need on my end; you can close it unless you want to investigate further. I did another round of testing. I tried 5 different VMs, each with these configurations:
I used the following 5 AMIs with the above configuration:
This is what I found for each configuration:
Since I am able to use configurations 1 and 3, I am able to continue with what I need to do. All VMs are using the same CPU/Mem/EBS configurations and all were updated to the latest packages so it is not clear to me what is causing the issues with the newer AMIs. |
@mrabe142 said:
I can investigate further: are all of the above Nitro instance-types? Also, do you have specific, consistent automation that you're running that can be invoked to try to provoke the issues? |
I think the t3 instance-types are Nitro-capable, but I don't think I selected anything about Nitro when I launched them; I just launch on-demand instances from the … I am not running any automation at this point to do the initial setup; when the instance comes up, I SSH to it. The first thing I try is a … For the …
For the other AMIs that have 4 partitions:
I try rebooting again after applying those. If they reboot a couple times without boot errors, they are usually stable at that point for all the rest of the configuration I apply to them. |
Continuing the conversation with @mrabe142 in (new) issue #695: it may be a separate issue, and there's probably no need to spam @Chris-Long02 with stuff that may not be relevant. Will re-link if anything turns up in the new ticket. |
Expected behavior
Resize volume groups using lvextend, create an AMI, launch and log into AMI with the resized volume groups.
Actual behavior
After the volume groups are resized and an image is made, instances launched from the new image fail to mount one or more volumes, most commonly /var/log/audit.
Steps to expand volumes
Context/Specifications
OS/VERSION: RHEL 8
AMI: spel-minimal-rhel-8-hvm-2024.04.1.x86_64-gp3
Any help would be greatly appreciated.