Mount kubelet and container runtime rootdir on LSSD #93305
Conversation
cluster/gce/gci/configure-helper.sh
Outdated
if [ -e "${ssd}" ]; then | ||
# This workaround to find if the NVMe device is a disk is required because | ||
# the existing Google images does not expose NVMe devices in /dev/disk/by-id | ||
if [[ `udevadm info --query=property --name=${ssd} | grep DEVTYPE | sed "s/DEVTYPE=//"` == "disk" ]]; then |
Does this handle the case where PD could be nvme?
It was copied over from ensure-local-ssds
I think the existing logic doesn't handle PD as nvme and assumes all nvme devices are local SSDs. @mattcary do you want to restrict this new ephemeral option to only scsi PD boot disk for now, and update this logic to handle nvme PD boot disk later?
We want nvme support. Is there any doc on how to detect nvme PDs?
using ID_MODEL=nvme_card
Updated to lsblk
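For context, a minimal sketch of what an lsblk-based check might look like (an assumption for illustration, not the exact code from the updated revision); it relies on the hint above that local SSDs report a model of nvme_card, so NVMe PDs with a different model get skipped:

```sh
# Illustrative sketch only: collect NVMe local SSDs, skipping NVMe PDs.
# Assumes local SSDs report MODEL "nvme_card", per the comment above.
ephemeral_devices=()
for ssd in /dev/nvme*n*; do              # hypothetical glob over NVMe namespaces
  if [ -e "${ssd}" ]; then
    model="$(lsblk -dno MODEL "${ssd}" | xargs)"   # -d: no partitions, -n: no header
    if [[ "${model}" == "nvme_card" ]]; then
      ephemeral_devices+=("${ssd}")
    fi
  fi
done
```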
cluster/gce/gci/configure-helper.sh
Outdated
md_device="/dev/md/0" | ||
echo "y" | mdadm --create "${md_device}" --level=0 --raid-devices=${#devices[@]} ${devices[@]} | ||
fi | ||
local ephemeral_mountpoint="/mnt/disks/kube-ephemeral-ssd" |
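For orientation, a hedged sketch of the overall flow this hunk is part of: assemble the local SSDs into one RAID 0 device, then format and mount it. The mkfs/mount details are assumptions for illustration; a real implementation would only format an unformatted device, and the mountpoint shown is this revision's path, which moved after the review below.

```sh
# Sketch of the surrounding flow; names follow the diff above.
if [ "${#devices[@]}" -gt 1 ]; then
  md_device="/dev/md/0"
  echo "y" | mdadm --create "${md_device}" --level=0 \
    --raid-devices=${#devices[@]} "${devices[@]}"
else
  md_device="${devices[0]}"              # a single SSD: no array is created
fi
ephemeral_mountpoint="/mnt/disks/kube-ephemeral-ssd"   # this revision's path
mkfs.ext4 -F "${md_device}"              # assumption: real code formats only when needed
mkdir -p "${ephemeral_mountpoint}"
mount "${md_device}" "${ephemeral_mountpoint}"
```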
Can we mount this somewhere other than /mnt/disks? This is currently used as the default discovery directory for local PVs, so if we ever want to support both at the same time, this will conflict.
The rest of /mnt is read-only. I see this for /mnt/disks:
tmpfs /mnt/disks tmpfs rw,relatime,size=256k,mode=755 0 0
One possibility would be /mnt/stateful_partition/ephemeral_storage, but then we are putting it inside the boot disk.
Done
Force-pushed becefd1 to 17ca5f1
Force-pushed 17ca5f1 to e632ad7
/lgtm
/lgtm
/retest
Friendly ping @cheftako Given the freeze, we are not looking to merge it now, but just to get early feedback
Force-pushed e632ad7 to a7cc28b
cluster/gce/util.sh
Outdated
@@ -1215,6 +1215,7 @@ CONTAINER_RUNTIME_TEST_HANDLER: $(yaml-quote ${CONTAINER_RUNTIME_TEST_HANDLER:-})
UBUNTU_INSTALL_CONTAINERD_VERSION: $(yaml-quote ${UBUNTU_INSTALL_CONTAINERD_VERSION:-})
UBUNTU_INSTALL_RUNC_VERSION: $(yaml-quote ${UBUNTU_INSTALL_RUNC_VERSION:-})
NODE_LOCAL_SSDS_EXT: $(yaml-quote ${NODE_LOCAL_SSDS_EXT:-})
NODE_LOCAL_SSDS_EPHEMERAL: $(yaml-quote ${NODE_LOCAL_SSDS_EPHEMERAL:-})
Thanks :D
/lgtm
/retest
# Move the container runtime's directory to the new location to preserve
# preloaded images.
if [ ! -d "${ephemeral_mountpoint}/${container_runtime}" ]; then
  mv "/var/lib/${container_runtime}" "${ephemeral_mountpoint}/${container_runtime}"
We unmounted "/var/lib/${container_runtime}" above. Is there any data here to move?
They are rw remounts. I have yet to verify in Ubuntu, though. I'll get back to this.
Sorry, I don't follow. Do you mean that the cases where there are preloaded images, /var/lib/docker isn't mounted so the unmount is a no-op?
I rechecked. The OS images that mount it are the containerd ones, and it's just a RW remount on the same disk. We need to unmount prior to moving the folder.
This is independent of whether there are preloaded images.
I'm still not following. Why do we need to unmount? If we're concerned about someone writing to it during the move, the umount isn't enough because it will silently fail if the dir is already in use (ie there's a race).
We need to unmount to be able to move the data. Otherwise we would get a "resource busy" error. Nothing is writing to it, as the container runtime is stopped.
finally I understand, bind-mounted dirs make that resource busy error. thx
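To summarize the exchange, a hedged sketch of the stop / unmount / move / bind-mount sequence (the ordering and the helper-free form are assumptions; the actual function in the PR may differ):

```sh
# Sketch: relocate the runtime's state onto the SSD mountpoint while the
# runtime is stopped, then bind-mount it back so the usual path still works.
container_runtime="${CONTAINER_RUNTIME:-docker}"
systemctl stop "${container_runtime}"
# Some OS images RW-remount /var/lib/<runtime>; unmount it first, otherwise
# mv fails with "resource busy" on the (bind-)mounted directory.
if mountpoint -q "/var/lib/${container_runtime}"; then
  umount "/var/lib/${container_runtime}"
fi
# Move once, preserving preloaded images already under /var/lib/<runtime>.
if [ ! -d "${ephemeral_mountpoint}/${container_runtime}" ]; then
  mv "/var/lib/${container_runtime}" "${ephemeral_mountpoint}/${container_runtime}"
fi
mkdir -p "/var/lib/${container_runtime}"
mount --bind "${ephemeral_mountpoint}/${container_runtime}" "/var/lib/${container_runtime}"
systemctl start "${container_runtime}"
```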
Force-pushed f7aa584 to 0ea9665
There are new shellcheck requirements. PTAL at the last 2 commits
seen_arrays=(/dev/md/*)
device=${seen_arrays[0]}
echo "Setting RAID array with local SSDs on device ${device}"
if [ ! -e "$device" ]; then
Why can we assume an existing raid array has to be our local SSDs?
Answer: we have to, because it's too complicated to figure out where an existing RAID came from when the node is restarted.
There's no other mechanism that creates RAID arrays in the startup script.
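A short sketch of the reuse-on-restart behaviour being discussed (the shape is assumed; the real code carries the device list from the detection step above):

```sh
# Sketch: if a RAID array already exists under /dev/md (e.g. after a node
# restart), reuse it; otherwise create one from the detected local SSDs.
# This relies on nothing else in the startup script creating arrays.
seen_arrays=(/dev/md/*)
device="${seen_arrays[0]}"               # literal "/dev/md/*" when no array exists
if [ ! -e "${device}" ]; then
  device="/dev/md/0"
  echo "y" | mdadm --create "${device}" --level=0 \
    --raid-devices=${#devices[@]} "${devices[@]}"
fi
echo "Setting RAID array with local SSDs on device ${device}"
```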
cluster/gce/gci/configure-helper.sh
Outdated
# mount container runtime root dir on SSD
local container_runtime="${CONTAINER_RUNTIME:-docker}"
systemctl stop "$container_runtime"
# Some images mount the container runtime root dir.
Would it be more precise to say "remount" here? That gives a better hint as to what's going on IMHO.
Done
Force-pushed 0ea9665 to add0519
/lgtm fwiw
Force-pushed add0519 to 072a544
Rebased and squashed
/retest
Still
When environment variable NODE_LOCAL_SSDS_EPHEMERAL=true, create a RAID 0 array on all attached SSDs to mount:
- kubelet root dir
- container runtime root dir
- pod logs dir
Those directories account for all ephemeral storage. An array is not created when there is only one SSD.
Change-Id: I22137f1d83fc19e9ef58a556d7461da43e4ab9bd
Signed-off-by: Aldo Culquicondor <acondor@google.com>
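A hedged sketch of how the other two directories in that list could end up on the array (the paths are the kubelet defaults, /var/lib/kubelet and /var/log/pods, used here as assumptions rather than quotes from the diff):

```sh
# Sketch: keep kubelet state and pod logs on the SSD array via bind mounts.
for dir in /var/lib/kubelet /var/log/pods; do
  target="${ephemeral_mountpoint}${dir}"   # e.g. ${ephemeral_mountpoint}/var/lib/kubelet
  mkdir -p "${target}" "${dir}"
  mount --bind "${target}" "${dir}"
done
```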
Force-pushed 072a544 to 2ae4eeb
ping @cheftako
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alculquicondor, cheftako
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
When environment variable NODE_LOCAL_SSDS_EPHEMERAL=true, create a RAID 0 array on all attached SSDs to mount:
- kubelet root dir
- container runtime root dir
- pod logs dir
Those directories account for all ephemeral storage. An array is not created when there is only one SSD.
OSS: kubernetes#93305
Signed-off-by: Aldo Culquicondor <acondor@google.com>
Change-Id: Ib15524d6e6fab7a5fadda7bc1a64765f1364327f
What type of PR is this?
/kind feature
What this PR does / why we need it:
When environment variable NODE_LOCAL_SSDS_EPHEMERAL=true, create a RAID 0 array on all attached Local SSDs on NVMe interfaces to mount:
- kubelet root dir
- container runtime root dir
- pod logs dir
Those directories account for all ephemeral storage.
Does this PR introduce a user-facing change?:
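For completeness, a hypothetical way to exercise the new flag with the GCE scripts (the invocation and the NODE_LOCAL_SSDS_EXT value are assumptions, not usage documented in this PR):

```sh
# Hypothetical kube-up invocation; assumes NVMe local SSDs are requested via
# the pre-existing NODE_LOCAL_SSDS_EXT mechanism shown in the util.sh diff.
export KUBERNETES_PROVIDER=gce
export NODE_LOCAL_SSDS_EXT="2,nvme,block"    # assumed format: count,interface,format
export NODE_LOCAL_SSDS_EPHEMERAL=true        # the flag introduced by this PR
cluster/kube-up.sh
```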