Containerized subpath #63143

Merged (9 commits) on Jun 1, 2018

jsafrane (Member) commented Apr 25, 2018

What this PR does / why we need it:
Containerized kubelet needs a different implementation of PrepareSafeSubpath than kubelet running directly on the host.

On the host we safely open the subpath and then bind-mount /proc/<pidof kubelet>/fd/<descriptor of opened subpath>.

With kubelet running in a container, /proc/xxx/fd/yy on the host contains a path that works only inside the container, i.e. /rootfs/path/to/subpath, and thus any bind-mount on the host fails.
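
To make the /proc/<pid>/fd/<descriptor> idea concrete, here is a minimal sketch (Linux only). The helper name procFdPath is hypothetical and is not taken from the PR; it only shows how such a path is formed and what the link points at.

```go
package main

import (
	"fmt"
	"os"
)

// procFdPath returns the /proc entry for descriptor fd of process pid.
// Hypothetical helper for illustration; not part of the kubelet code.
func procFdPath(pid int, fd uintptr) string {
	return fmt.Sprintf("/proc/%d/fd/%d", pid, fd)
}

func main() {
	f, err := os.Open("/")
	if err != nil {
		fmt.Println("open failed:", err)
		return
	}
	defer f.Close()

	p := procFdPath(os.Getpid(), f.Fd())
	// On Linux the entry reads back as a symlink to the opened path as this
	// process sees it, so a containerized process reports container paths.
	target, err := os.Readlink(p)
	if err != nil {
		fmt.Println("readlink failed (not Linux?):", err)
		return
	}
	fmt.Println(p, "->", target)
}
```

Run inside a container that mounts the host root at /rootfs, the link target for a host file would start with /rootfs, which is the mismatch the description refers to.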

Solution:

  • safely open the subpath and get its device ID and inode number
  • blindly bind-mount the subpath to /var/lib/kubelet/pods/<uid>/volume-subpaths/<name of container>/<id of mount>. This is potentially unsafe, because a user can change the subpath source to a link to a bad place (say /run/docker.sock) just before the bind-mount.
  • get the device ID and inode number of the destination. Typical users can't modify this file, as it lies in /var/lib/kubelet on the host.
  • compare the two pairs of device IDs and inode numbers. A mismatch means the subpath source was swapped before the bind-mount.
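
The comparison step can be sketched as follows. This is an illustration under assumptions, not the PR's code: fileID, statID and demoCompare are hypothetical names, and a hard link stands in for the bind-mount so the example runs without root. For an already opened descriptor, syscall.Fstat would be the analogous call.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// fileID is the (device, inode) pair that identifies a file on a Unix host.
type fileID struct {
	dev, ino uint64
}

// statID returns the device ID and inode number of path, following symlinks.
func statID(path string) (fileID, error) {
	var st syscall.Stat_t
	if err := syscall.Stat(path, &st); err != nil {
		return fileID{}, err
	}
	return fileID{dev: uint64(st.Dev), ino: uint64(st.Ino)}, nil
}

// demoCompare creates a file, a hard link to it, and an unrelated file in a
// temporary directory. It reports whether the hard link and the unrelated
// file have the same identity as the original file.
func demoCompare() (linkMatches, otherMatches bool, err error) {
	dir, err := os.MkdirTemp("", "subpath-demo")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	src := filepath.Join(dir, "src")
	link := filepath.Join(dir, "link")
	other := filepath.Join(dir, "other")
	for _, p := range []string{src, other} {
		if err := os.WriteFile(p, nil, 0o600); err != nil {
			return false, false, err
		}
	}
	// The hard link plays the role of the bind-mount target in this sketch.
	if err := os.Link(src, link); err != nil {
		return false, false, err
	}

	ids := make([]fileID, 0, 3)
	for _, p := range []string{src, link, other} {
		id, err := statID(p)
		if err != nil {
			return false, false, err
		}
		ids = append(ids, id)
	}
	return ids[0] == ids[1], ids[0] == ids[2], nil
}

func main() {
	linkMatches, otherMatches, err := demoCompare()
	fmt.Println(linkMatches, otherMatches, err)
}
```

Both numbers are needed: inode numbers are only unique within one filesystem, so the device ID has to match as well.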

Which issue(s) this PR fixes
Fixes #61456

Special notes for your reviewer:

The PR contains some refactoring of doBindSubPath to extract the common code. New doNsEnterBindSubPath is added for the nsenter related parts.

Release note:

NONE

k8s-ci-robot added the labels size/L (denotes a PR that changes 100-499 lines, ignoring generated files), release-note-none (denotes a PR that doesn't merit a release note), and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) on Apr 25, 2018
@jsafrane
Member Author

/test pull-kubernetes-cross
Calling explicitly to check Windows mounter

@jsafrane
Member Author

/sig storage
@msau42 PTAL
@cofyc, you've played with nsenter recently, would you mind taking a look?

BTW, we need a real unit test for the whole nsenter mounter.

k8s-ci-robot added the label sig/storage (categorizes an issue or PR as relevant to SIG Storage) on Apr 25, 2018
@msau42
Member

msau42 commented Apr 25, 2018

/assign

@@ -112,6 +112,11 @@ type Interface interface {
// It returns the same path as it gets unless kubelet runs in a container.
// Then it returns "/rootfs/" + <path>
KubeletPath(path string) string
// EvalSymlinks returns the path name after the evaluation of any symbolic
Member

returns the path name on the host

@@ -175,7 +174,8 @@ func makeMounts(pod *v1.Pod, podDir string, container *v1.Container, hostName, h
return nil, cleanupAction, fmt.Errorf("unable to provision SubPath `%s`: %v", mount.SubPath, err)
}

fileinfo, err := os.Lstat(hostPath)
kubeletHostPath := mounter.KubeletPath(hostPath)
Member

Instead of exposing KubeletPath as an interface, would it be better to expose Lstat as an interface, and have EvalSymlinks return container path?

Member Author

Actually the code here is wrong. It resolves any symlinks in the hostPath in kubelet, which is not possible.

And that's why Lstat is not easy: we need to resolve the symlinks on the host and then do os.Lstat("/rootfs/" + resolved_path) in kubelet.
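
A minimal sketch of the second half of that idea, assuming the host path has already been resolved on the host. The names lstatOnHost and demoLstat are hypothetical, and a temporary directory stands in for /rootfs so the example runs anywhere.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// lstatOnHost stats a host path from inside the kubelet container.
// rootfs is where the host's / is mounted in the container ("/rootfs" in this
// discussion). resolvedHostPath must already have its symlinks resolved on
// the host, because symlink targets are host paths, not container paths.
func lstatOnHost(rootfs, resolvedHostPath string) (os.FileInfo, error) {
	return os.Lstat(filepath.Join(rootfs, resolvedHostPath))
}

// demoLstat builds a fake rootfs containing /mnt/volume and stats it through
// lstatOnHost, returning the name of the entry that was found.
func demoLstat() (string, error) {
	rootfs, err := os.MkdirTemp("", "rootfs-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(rootfs)

	if err := os.MkdirAll(filepath.Join(rootfs, "mnt", "volume"), 0o755); err != nil {
		return "", err
	}
	fi, err := lstatOnHost(rootfs, "/mnt/volume")
	if err != nil {
		return "", err
	}
	return fi.Name(), nil
}

func main() {
	name, err := demoLstat()
	fmt.Println(name, err)
}
```

Resolving symlinks inside the container instead would follow absolute link targets relative to the container's root rather than the host's, which is why the resolution has to happen on the host first.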

Member Author

Fixed by some refactoring.

k8s-ci-robot added the label size/XL (denotes a PR that changes 500-999 lines, ignoring generated files) and removed the label size/L (denotes a PR that changes 100-499 lines, ignoring generated files) on Apr 25, 2018
@jsafrane
Member Author

Pushed 2nd version.

kubelet_pods.go now passes all paths to the mounter as paths on the host, and the mounter translates them itself when needed.

I changed the semantics of SafeMakeDir
from SafeMakeDir(*fullPath* string, base string, perm os.FileMode)
to SafeMakeDir(*subdirectoryInBase* string, base string, perm os.FileMode)
to simplify its implementation a bit.
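
To illustrate the new calling convention, here is a hedged sketch of joining a subdirectory onto its base. joinInBase is a hypothetical helper, and its check is purely lexical: it does not guard against symlinks inside base, which is the part the real SafeMakeDir has to handle.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// joinInBase joins subdir onto base and rejects results that lexically
// escape base (for example via ".."). Symlinks are NOT considered here.
func joinInBase(base, subdir string) (string, error) {
	full := filepath.Join(base, subdir)
	rel, err := filepath.Rel(base, full)
	if err != nil {
		return "", err
	}
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("subdirectory %q escapes base %q", subdir, base)
	}
	return full, nil
}

func main() {
	fmt.Println(joinInBase("/mnt/volume", "a/b"))
	fmt.Println(joinInBase("/mnt/volume", "../etc"))
}
```

Taking the subdirectory relative to base means the caller can no longer pass a full path that lies outside the volume by mistake.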

@andyzhangx, I touched the Windows part a bit, please review.

jsafrane force-pushed the containerized-subpath branch 4 times, most recently from e683d06 to 7026920, on April 27, 2018 12:52
@jsafrane
Member Author

@kubernetes/sig-storage-pr-reviews, any review is highly appreciated, we don't want more regressions in subpath!

@@ -33,7 +33,7 @@ import (

const (
// hostProcMountsPath is the default mount path for rootfs
- hostProcMountsPath = "/rootfs/proc/1/mounts"
+ HostProcMountsPath = "/rootfs/proc/1/mounts"
Member

Does this need to be public?

Member Author

It does not need to be public if we introduce Nsenter.HostPath() as in https://github.com/kubernetes/kubernetes/pull/62903/files#diff-f6ffeca7a3b942208d644eaa4ec99642R129

Member

I only see it being used in this file.

- func (mounter *Mounter) SafeMakeDir(pathname string, base string, perm os.FileMode) error {
- return doSafeMakeDir(pathname, base, perm)
+ func (mounter *Mounter) SafeMakeDir(subdir string, base string, perm os.FileMode) error {
+ fullPath := filepath.Join(base, subdir)
Member

Does windows not need to eval symlinks?

Member Author

good catch, fixed

}
defer syscall.Close(fd)

glog.Infof("JSAF before prepare")
Member

Cleanup

Member Author

Cleaned up

return false, err
}
kubeletPath := mounter.getKubeletPath(evaluatedPath)
_, err = os.Lstat(kubeletPath)
Member

what was wrong with mounter.Lstat()?

Member Author

Reworked mounter.Lstat() to return NotExist when the target does not exist (it returned some indistinguishable error before).
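
The value of a distinguishable NotExist error can be shown with a small sketch. existsPath is a hypothetical helper built on the standard library, not the mounter method from the PR.

```go
package main

import (
	"fmt"
	"os"
)

// existsPath reports whether path exists. A missing path is a normal
// (false, nil) answer; any other Lstat failure is returned as an error so
// the caller does not mistake "cannot tell" for "does not exist".
func existsPath(path string) (bool, error) {
	_, err := os.Lstat(path)
	if err == nil {
		return true, nil
	}
	if os.IsNotExist(err) {
		return false, nil
	}
	return false, err
}

func main() {
	fmt.Println(existsPath("/"))
	fmt.Println(existsPath("/no/such/path/for/this/sketch"))
}
```

If Lstat returned one generic error for every failure, the middle branch would be impossible to write, which is the problem the rework addresses.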

@@ -800,7 +800,7 @@ function start_kubelet {
all_kubelet_flags+=(--containerized)

docker run --rm --name kubelet \
- --volume=/:/rootfs:ro,rslave \
+ --volume=/:/rootfs:rw,rslave \
Member

does this mean existing deployments will break if they upgrade to this fix and don't change this setting?

Previously, safemkdir still worked for non-hostpath types because it uses /var/lib/kubelet/.... instead of /rootfs/var/lib/kubelet

Member Author

That's right, it will break.

I could pass podDir into SafeMakeDir() as a new parameter. NsenterMounter.SafeMakeDir could then check whether the directory to create is a subdirectory of podDir and add /rootfs only when it is not.

Any better ideas?

Member Author

I added a new argument to NewNsenterMounter so the mounter knows the location of /var/lib/kubelet and can take a shortcut in SafeMakeDir. That was the easiest way I could find. PTAL.
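
The shortcut can be sketched roughly like this. The function name kubeletPath and its parameters are hypothetical, and the sketch assumes the kubelet directory is mounted into the container at the same path it has on the host.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// kubeletPath translates a host path into a path usable inside the kubelet
// container. Paths under kubeletDir are assumed to be visible at the same
// location in the container; everything else is reached through rootfs.
func kubeletPath(hostPath, kubeletDir, rootfs string) string {
	hostPath = filepath.Clean(hostPath)
	kubeletDir = filepath.Clean(kubeletDir)
	sep := string(filepath.Separator)
	if hostPath == kubeletDir || strings.HasPrefix(hostPath, kubeletDir+sep) {
		return hostPath
	}
	return filepath.Join(rootfs, hostPath)
}

func main() {
	fmt.Println(kubeletPath("/var/lib/kubelet/pods/uid/volumes", "/var/lib/kubelet", "/rootfs"))
	fmt.Println(kubeletPath("/mnt/data", "/var/lib/kubelet", "/rootfs"))
}
```

Matching on kubeletDir plus a separator, rather than on the bare prefix, keeps a sibling such as /var/lib/kubelet-other from being treated as part of the kubelet directory.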

Member Author

And I removed changes in hack/local-up-cluster.sh

}
return strconv.Atoi(string(matches[1]))
kubeletBase := mounter.getKubeletPath(evaluatedBase)
Member

Do the paths need to be cleaned?

Member Author

Added to doSafeMakeDir, just to be sure.


// Check it's not already bind-mounted
// prepareSubpathTarget creates target for bind-mount of subpath. It returns
Member

Add a comment that this is also used by nsenter

Member Author

Comment added

defer syscall.Close(fd)

glog.Infof("JSAF before prepare")
alreadyMounted, bindPathTarget, err := prepareSubpathTarget(mounter, subpath)
Member

should this be done first?

Member Author

IMO it does not really matter if we do safeOpenSubPath or prepareSubpathTarget first, I wanted the defer that cleans up volumes to affect as little code as possible.

}
defer syscall.Close(fd)

alreadyMounted, bindPathTarget, err := prepareSubpathTarget(mounter, subpath)
Member

Check this first?

// path: /mnt/volume/non/existing/directory. /mnt/volume exists and
// non/existing/directory does not exist. It resolves symlinks in /mnt/volume
// to say /mnt/foo and returns /mnt/foo/non/existing/directory.
func (mounter *NsenterMounter) evalHostSymlinks(path string, mustExist bool) (string, error) {
Member

I had added a similar function in https://github.com/kubernetes/kubernetes/pull/62903/files#diff-f6ffeca7a3b942208d644eaa4ec99642R127 as you and @msau42 suggested. How about merging #62903 first, then using Nsenter.HostPath to get the path name on the host after evaluating symlinks on the host (and adding a mustExist parameter)?
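
The behavior described in the code comment above (resolve symlinks in the existing part of a path, keep the non-existing remainder) can be sketched in-process as follows. evalExistingPrefix and demoEval are hypothetical names; the function under review works against the host's mount namespace, which this sketch does not attempt.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// evalExistingPrefix resolves symlinks in the longest existing prefix of
// path and appends the non-existing remainder unchanged. A dangling symlink
// is treated like a non-existing component in this sketch.
func evalExistingPrefix(path string) (string, error) {
	cur := filepath.Clean(path)
	rest := ""
	for {
		resolved, err := filepath.EvalSymlinks(cur)
		if err == nil {
			return filepath.Join(resolved, rest), nil
		}
		if !os.IsNotExist(err) {
			return "", err
		}
		parent := filepath.Dir(cur)
		if parent == cur {
			return "", err // reached the top without finding anything
		}
		rest = filepath.Join(filepath.Base(cur), rest)
		cur = parent
	}
}

// demoEval recreates the example from the comment: "volume" is a symlink to
// "foo", and non/existing/directory does not exist below it.
func demoEval() (got, want string, err error) {
	dir, err := os.MkdirTemp("", "eval-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	// The temporary directory itself may sit behind a symlink.
	realDir, err := filepath.EvalSymlinks(dir)
	if err != nil {
		return "", "", err
	}
	if err := os.Mkdir(filepath.Join(realDir, "foo"), 0o755); err != nil {
		return "", "", err
	}
	if err := os.Symlink("foo", filepath.Join(realDir, "volume")); err != nil {
		return "", "", err
	}
	got, err = evalExistingPrefix(filepath.Join(realDir, "volume", "non", "existing", "directory"))
	if err != nil {
		return "", "", err
	}
	want = filepath.Join(realDir, "foo", "non", "existing", "directory")
	return got, want, nil
}

func main() {
	fmt.Println(demoEval())
}
```

A mustExist flag, as proposed in the thread, would correspond to returning the first NotExist error instead of walking up to the parent.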

k8s-ci-robot added the label needs-rebase (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Apr 28, 2018
@dims
Member

dims commented Apr 30, 2018

@jsafrane if you rebase to master, we can try the new e2e job

k8s-ci-robot removed the label needs-rebase (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Apr 30, 2018
@liggitt
Member

liggitt commented May 29, 2018

The current e2e (non containerized) has hostpath test cases but not with an existing path with the kubelet image

if we support the containerized kubelet, it seems like we should exercise it in CI, especially given the complexity around how it has to be handled

// that's under user control. User must not be able to use symlinks to
// escape the volume to create directories somewhere else.
SafeMakeDir(subdir string, base string, perm os.FileMode) error
// Will operate in the host mount namespace if kubelet is running in a container.
Member

It's weird that a seemingly generic interface in a generic lib (pkg/util) has any comprehension of "kubelet" at all. Can we decouple these layered ideas?

Member Author

Kubelet is the only user of this API and it makes the whole picture more visible. It would be harder to go through all the layers and see how they fit together if they contain just generic comments.

We could move the big picture to a separate file / documentation, however, these tend to rot.

I filed #64603 for that.

Member

SafeFormatAndMount is a very delicate and tricky operation, and kops uses it (and this package). Another vote from me for keeping the layers well-factored here.

Member

There is ongoing work to refactor the mount library to separate out k8s specific operations (like subpath processing), and general mounting/formatting utilities. #68513

@thockin
Member

thockin commented May 31, 2018

I'm very grumpy that pkg/util contains the word "kubelet" at all. How can we fix this? I will approve the PR since it fixes a bug and doesn't make things worse, but that's a genericity problem.

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane, msau42, thockin

k8s-ci-robot added the label approved (indicates a PR has been approved by an approver from all required OWNERS files) on May 31, 2018
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@jsafrane
Member Author

jsafrane commented Jun 1, 2018

/kind bug
/priority important-soon

k8s-ci-robot added the labels kind/bug (categorizes issue or PR as related to a bug) and priority/important-soon (must be staffed and worked on either currently, or very soon, ideally in time for the next release) on Jun 1, 2018
@jsafrane
Member Author

jsafrane commented Jun 1, 2018

@saad-ali @childsb, please approve for 1.11

@k8s-ci-robot
Contributor

k8s-ci-robot commented Jun 1, 2018

@jsafrane: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-local-e2e | fdb50b589f2aa73c02ec33c4423b49d1ff7c2497 | link | /test pull-kubernetes-local-e2e |

@dims
Member

dims commented Jun 1, 2018

/status approved-for-milestone

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@brendandburns @dchen1107 @jsafrane @lavalamp @msau42 @smarterclayton @thockin

Pull Request Labels
  • sig/storage: Pull Request will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the pull request owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/bug: Fixes a bug discovered during the current release.

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 63348, 63839, 63143, 64447, 64567). If you want to cherry-pick this change to another branch, please follow the instructions here.

k8s-github-robot merged commit d2495b8 into kubernetes:master on Jun 1, 2018
cofyc mentioned this pull request on Jun 2, 2018
@munnerz
Member

munnerz commented Jul 2, 2018

@msau42 I'm running kubelet under rkt using the kubelet wrapper, with kube 1.11.0.

> To fix this, there is one new requirement in containerized kubelet deployment that /rootfs needs to be mounted rw

My systemd unit does not currently have rootfs mounted into the kubelet container, so I tried adding it to take advantage of this patch. When I do so, kubelet refuses to start with:

failed to run Kubelet: could not detect clock speed from output: ""

I'm wondering if I'm missing something obvious here? I am able to mount standard host path volumes fine, but when using the prometheus operator (which enforces a subPath be used), it will not work.

Is there any more detailed info on how I should configure this? I'd imagine (hope!) that tectonic doesn't suffer from this same problem, however I'm unsure where to dig to find this out, and I'd prefer not to have to spin up an entire tectonic cluster just to find a couple of arguments! 😄

@jsafrane
Member Author

jsafrane commented Jul 2, 2018

@munnerz, the list of directories that must be available in the kubelet container is here:

all_kubelet_volumes=(
--volume=/:/rootfs:ro,rslave \
--volume=/var/run:/var/run:rw \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:rslave \
--volume=/var/lib/kubelet/:/var/lib/kubelet:rslave \
--volume=/dev:/dev \
--volume=/run/xtables.lock:/run/xtables.lock:rw \

Please report if there is anything missing.

@lichuqiang
Contributor

@jsafrane
I ran into a problem when running the unit test in "nsenter_mount_test.go",
and got error like this:

--- FAIL: TestNsenterExistsFile (0.01s)
    nsenter_mount_test.go:343: Test "simple non-accessible file": expected error, got none
    nsenter_mount_test.go:347: Test "simple non-accessible file": expected return value false, got true
E0719 09:26:23.220671   67952 mount_linux.go:495] format of disk "/dev/foo" failed: type:("ext4") target:("/tmp/mount681850900") options:(["defaults"])error:(formatting failed)
E0719 09:26:23.221027   67952 mount_linux.go:529] Could not determine if disk "/dev/foo" is formatted (exit 4)
FAIL
FAIL    k8s.io/kubernetes/pkg/util/mount    0.077s

I switched to a non-root user, and it passed.
Is it required to manually switch to a non-root user to run this test case?

openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this pull request Nov 5, 2018
This is part of upstream PRs kubernetes#62903 and kubernetes#63143.

Origin-commit: dbee4f01db7eba2aff89e8cdde8b06de4bc159f0
Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • sig/storage: Categorizes an issue or PR as relevant to SIG Storage.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

hostPath volume with subPath volume mount does not work with containerized kubelets