
Mount kubelet and container runtime rootdir on LSSD #93305

Merged
merged 1 commit into kubernetes:master from lssd-ephemeral on Sep 22, 2020

Conversation

@alculquicondor (Member) commented Jul 21, 2020

What type of PR is this?

/kind feature

What this PR does / why we need it:

When the environment variable NODE_LOCAL_SSDS_EPHEMERAL=true,
create a RAID 0 array from all Local SSDs attached over NVMe, on which to mount:

  • kubelet root dir
  • container runtime root dir
  • pod logs dir

Those directories account for all ephemeral storage.

Does this PR introduce a user-facing change?:

NONE
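
For orientation, a minimal sketch of the boot-time flow this PR adds to cluster/gce/gci/configure-helper.sh (the device glob, mountpoint path, and filesystem choice are illustrative assumptions, not the merged code):

devices=()
for ssd in /dev/nvme0n*; do            # assumed naming for NVMe Local SSDs on GCE
  [ -e "${ssd}" ] && devices+=("${ssd}")
done
if [ "${#devices[@]}" -gt 1 ]; then
  # Stripe all Local SSDs into a single block device.
  echo "y" | mdadm --create /dev/md/0 --level=0 --raid-devices="${#devices[@]}" "${devices[@]}"
  device="/dev/md/0"
else
  device="${devices[0]}"               # a single SSD is used directly, no array
fi
mkfs.ext4 -F "${device}"
mountpoint="/mnt/kube-ephemeral-ssd"   # hypothetical path; see the mountpoint discussion below
mkdir -p "${mountpoint}"
mount "${device}" "${mountpoint}"
# The kubelet root dir, container runtime root dir, and pod logs dir are
# then relocated onto this mount (see the review threads below).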

@k8s-ci-robot added the release-note-none, kind/feature, size/M, cncf-cla: yes, needs-sig, needs-priority, and area/provider/gcp labels Jul 21, 2020
@k8s-ci-robot added the sig/cloud-provider label and removed the needs-sig label Jul 21, 2020
@alculquicondor (Member Author):

/assign @mattcary @msau42

if [ -e "${ssd}" ]; then
  # This workaround to check whether the NVMe device is a disk is required
  # because the existing Google images do not expose NVMe devices in /dev/disk/by-id.
  if [[ "$(udevadm info --query=property --name="${ssd}" | grep DEVTYPE | sed "s/DEVTYPE=//")" == "disk" ]]; then
Member:

Does this handle the case where PD could be nvme?

@alculquicondor (Member Author):

It was copied over from ensure-local-ssds

Member:

I think the existing logic doesn't handle PD as nvme and assumes all nvme devices are local SSDs. @mattcary do you want to restrict this new ephemeral option to only scsi PD boot disk for now, and update this logic to handle nvme PD boot disk later?

@alculquicondor (Member Author):

We want nvme support. Is there any doc on how to detect nvme PDs?

@alculquicondor (Member Author):

using ID_MODEL=nvme_card

@alculquicondor (Member Author):

Updated to lsblk
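
For reference, a hedged sketch of an lsblk-based check (it assumes GCE NVMe Local SSDs report the nvme_card model noted above; the merged invocation may differ):

ssd="/dev/nvme0n1"                      # illustrative device
if [[ "$(lsblk -o MODEL -dn "${ssd}")" == "nvme_card" ]]; then
  # Local SSD: NVMe PDs report a different MODEL string.
  echo "${ssd} is a Local SSD"
fi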

  md_device="/dev/md/0"
  echo "y" | mdadm --create "${md_device}" --level=0 --raid-devices="${#devices[@]}" "${devices[@]}"
fi
local ephemeral_mountpoint="/mnt/disks/kube-ephemeral-ssd"
Member:

Can we mount this somewhere other than /mnt/disks? This is currently used as the default discovery directory for local PVs, so if we ever want to support both at the same time, this will conflict.

@alculquicondor (Member Author):

The rest of /mnt is read-only. I see this for /mnt/disks:

tmpfs /mnt/disks tmpfs rw,relatime,size=256k,mode=755 0 0

One possibility would be /mnt/stateful_partition/ephemeral_storage, but then we are putting it inside the boot disk.

@alculquicondor (Member Author):

Done
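
A minimal sketch of the relocated mount, assuming the array from above and an illustrative path outside /mnt/disks:

ephemeral_mountpoint="/mnt/stateful_partition/kube-ephemeral-ssd"  # assumed path, avoids /mnt/disks
# (a real script would skip mkfs when the device already has a filesystem)
mkfs.ext4 -F "${md_device}"
mkdir -p "${ephemeral_mountpoint}"
mount "${md_device}" "${ephemeral_mountpoint}"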

@alculquicondor force-pushed the lssd-ephemeral branch 2 times, most recently from becefd1 to 17ca5f1 on July 23, 2020 17:44
@msau42 (Member) commented Jul 24, 2020

/lgtm
/assign @cheftako

@k8s-ci-robot added the lgtm label Jul 24, 2020
@mattcary (Contributor):

/lgtm
as well

@alculquicondor (Member Author):

/retest

@alculquicondor (Member Author):

Friendly ping @cheftako

Given the code freeze, we are not looking to merge this now, just to get early feedback.

@k8s-ci-robot removed the lgtm label Jul 29, 2020
@@ -1215,6 +1215,7 @@ CONTAINER_RUNTIME_TEST_HANDLER: $(yaml-quote ${CONTAINER_RUNTIME_TEST_HANDLER:-}
UBUNTU_INSTALL_CONTAINERD_VERSION: $(yaml-quote ${UBUNTU_INSTALL_CONTAINERD_VERSION:-})
UBUNTU_INSTALL_RUNC_VERSION: $(yaml-quote ${UBUNTU_INSTALL_RUNC_VERSION:-})
NODE_LOCAL_SSDS_EXT: $(yaml-quote ${NODE_LOCAL_SSDS_EXT:-})
NODE_LOCAL_SSDS_EPHEMERAL: $(yaml-quote ${NODE_LOCAL_SSDS_EPHEMERAL:-})
Member:

Thanks :D

@cheftako (Member):

/lgtm

@k8s-ci-robot added the lgtm label Sep 1, 2020
@alculquicondor (Member Author):

/retest

# Move the container runtime's directory to the new location to preserve
# preloaded images.
if [ ! -d "${ephemeral_mountpoint}/${container_runtime}" ]; then
  mv "/var/lib/${container_runtime}" "${ephemeral_mountpoint}/${container_runtime}"
Member:

We unmounted "/var/lib/${container_runtime}" above. Is there any data here to move?

@alculquicondor (Member Author):

They are rw remounts. I have yet to verify in Ubuntu, though. I'll get back to this.

Contributor:

Sorry, I don't follow. Do you mean that in the cases where there are preloaded images, /var/lib/docker isn't mounted, so the unmount is a no-op?

@alculquicondor (Member Author):

I rechecked. The OS images that mount it are the containerd ones, and it's just a RW remount on the same disk. We need to unmount prior to moving the folder.

This is independent of having preloaded images or not.

Contributor:

I'm still not following. Why do we need to unmount? If we're concerned about someone writing to it during the move, the umount isn't enough because it will silently fail if the dir is already in use (ie there's a race).

@alculquicondor (Member Author):

We need to unmount to be able to move the data. Otherwise we would get a "resource busy" error. Nothing is writing to it, as the container runtime is stopped.

Contributor:

Finally I understand: bind-mounted dirs cause that resource-busy error. Thanks!
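
To summarize the thread, a hedged sketch of the move (the bind-mount step and exact ordering are assumptions based on the discussion, not the merged code):

container_runtime="${CONTAINER_RUNTIME:-docker}"
systemctl stop "${container_runtime}"
# Drop the RW remount first, or mv fails with "resource busy".
umount "/var/lib/${container_runtime}" || true
if [ ! -d "${ephemeral_mountpoint}/${container_runtime}" ]; then
  mv "/var/lib/${container_runtime}" "${ephemeral_mountpoint}/${container_runtime}"
fi
# Re-expose the moved directory at its original path.
mkdir -p "/var/lib/${container_runtime}"
mount --bind "${ephemeral_mountpoint}/${container_runtime}" "/var/lib/${container_runtime}"
systemctl start "${container_runtime}"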

@k8s-ci-robot removed the lgtm label Sep 2, 2020
@alculquicondor (Member Author):

There are new shellcheck requirements. PTAL at the last 2 commits.

seen_arrays=(/dev/md/*)
device=${seen_arrays[0]}
echo "Setting RAID array with local SSDs on device ${device}"
if [ ! -e "$device" ]; then
Contributor:

Why can we assume an existing RAID array has to be our local SSDs?

Contributor:

Answer: we have to, because it's too complicated to figure out where an existing RAID came from when the node is restarted.

@alculquicondor (Member Author):

There's no other mechanism that creates RAID arrays in the startup script.
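
In other words, any pre-existing array is reused as-is; a commented sketch of that idempotency check (variable names follow the snippet above):

seen_arrays=(/dev/md/*)   # with no match, the glob stays literal and the -e test fails
device="${seen_arrays[0]}"
if [ ! -e "${device}" ]; then
  # First boot: no array yet, so build one from the detected Local SSDs.
  device="/dev/md/0"
  echo "y" | mdadm --create "${device}" --level=0 --raid-devices="${#devices[@]}" "${devices[@]}"
fi
echo "Using RAID array ${device} for ephemeral storage"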

# mount container runtime root dir on SSD
local container_runtime="${CONTAINER_RUNTIME:-docker}"
systemctl stop "$container_runtime"
# Some images mount the container runtime root dir.
Contributor:

Would it be more precise to say "remount" here? That gives a better hint as to what's going on IMHO.

@alculquicondor (Member Author):

Done

@mattcary (Contributor) commented Sep 2, 2020

/lgtm

fwiw

@k8s-ci-robot added the lgtm label Sep 2, 2020
@k8s-ci-robot removed the lgtm label Sep 8, 2020
@alculquicondor (Member Author):

Rebased and squashed

@cheftako

@alculquicondor (Member Author):

/retest

@mattcary (Contributor) commented Sep 8, 2020

Still
/lgtm

@k8s-ci-robot added the lgtm and needs-rebase labels Sep 8, 2020
When the environment variable NODE_LOCAL_SSDS_EPHEMERAL=true,
create a RAID 0 array on all attached SSDs to mount:

- kubelet root dir
- container runtime root dir
- pod logs dir

Those directories account for all ephemeral storage.
An array is not created when there is only one SSD.

Change-Id: I22137f1d83fc19e9ef58a556d7461da43e4ab9bd
Signed-off-by: Aldo Culquicondor <acondor@google.com>
@k8s-ci-robot removed the lgtm and needs-rebase labels Sep 14, 2020
@alculquicondor (Member Author):

ping @cheftako

/retest

@cheftako (Member):

/lgtm
/approve

@k8s-ci-robot added the lgtm label Sep 16, 2020
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, cheftako

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Sep 16, 2020
@k8s-ci-robot merged commit 6b39cdf into kubernetes:master on Sep 22, 2020
@k8s-ci-robot added this to the v1.20 milestone on Sep 22, 2020
yuga711 pushed a commit to yuga711/kubernetes that referenced this pull request on Dec 3, 2020
When the environment variable NODE_LOCAL_SSDS_EPHEMERAL=true,
create a RAID 0 array on all attached SSDs to mount:

- kubelet root dir
- container runtime root dir
- pod logs dir

Those directories account for all ephemeral storage.
An array is not created when there is only one SSD.

OSS: kubernetes#93305

Signed-off-by: Aldo Culquicondor <acondor@google.com>
Change-Id: Ib15524d6e6fab7a5fadda7bc1a64765f1364327f