
Node-Level UserNamespace implementation #64005

Open
wants to merge 5 commits into master from vikaschoudhary16:userns-new
Conversation

@vikaschoudhary16
Member

vikaschoudhary16 commented May 18, 2018

What this PR does / why we need it:
Implements kubernetes/community#2067

Updated Proposal: kubernetes/community#2595

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:
e2e tests will be added in a separate PR

Adds node-level user-namespace support behind a new feature gate, `NodeUserNamespace`. This gate supersedes the existing `ExperimentalHostUserNamespaceDefaulting` feature gate, which is now deprecated, unsupported, and will be removed in the next release.

DEMO: https://www.youtube.com/watch?v=2sEmceRNa6Y
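
For reference, kubelet code guarding the new behaviour would typically check the gate like the sketch below. The gate name comes from the release note above; the `features.NodeUserNamespace` constant is an assumption, not quoted from this PR's diff.

```go
import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"

	"k8s.io/kubernetes/pkg/features"
)

// nodeUserNamespaceEnabled reports whether the NodeUserNamespace feature gate
// is on. The constant name is assumed from the release note; the PR's actual
// diff may name it differently.
func nodeUserNamespaceEnabled() bool {
	return utilfeature.DefaultFeatureGate.Enabled(features.NodeUserNamespace)
}
```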

/cc @mrunalp @adelton @derekwaynecarr @sjenning @pweil- @euank @Random-Liu @yujuhong @dchen1107 @vishh @kubernetes/sig-node-pr-reviews
/sig node

@vikaschoudhary16 vikaschoudhary16 changed the title Node-Level UserNamespace implementation [WIP]Node-Level UserNamespace implementation May 18, 2018

@vikaschoudhary16 vikaschoudhary16 changed the title [WIP]Node-Level UserNamespace implementation Node-Level UserNamespace implementation May 18, 2018

@vikaschoudhary16

Member Author

vikaschoudhary16 commented May 18, 2018

/assign @derekwaynecarr

@vikaschoudhary16 vikaschoudhary16 force-pushed the vikaschoudhary16:userns-new branch from 29d78f2 to 0e5cc96 May 18, 2018

@vikaschoudhary16

Member Author

vikaschoudhary16 commented May 18, 2018

/test pull-kubernetes-e2e-kops-aws

@@ -1370,3 +1370,29 @@ func TestSetDefaultHostPathVolumeSource(t *testing.T) {
t.Errorf("Expected v1.HostPathVolumeSource default type %v, got %v", expectedType, defaultType)
}
}

func TestSetDefaultPodSpecHostUserNamespace(t *testing.T) {
in := v1.PodSpec{}


@wgliang

wgliang May 18, 2018

Member

Maybe we can use table driven tests here!
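
Picking up this suggestion, the new test could take a table-driven shape like the sketch below. It is illustrative only: it reuses the surrounding test file's imports, assumes `SetObjectDefaults_Pod` is the defaulting entry point in scope (the existing tests may use a roundTrip helper instead), and leaves the assertion as a comment because the defaulted user-namespace field is specific to this PR.

```go
func TestSetDefaultPodSpecHostUserNamespace(t *testing.T) {
	// Table-driven sketch: one case per pod-spec shape whose defaulting we
	// want to cover.
	tests := []struct {
		name string
		spec v1.PodSpec
	}{
		{name: "empty pod spec", spec: v1.PodSpec{}},
		{name: "host network pod", spec: v1.PodSpec{HostNetwork: true}},
		{name: "host PID pod", spec: v1.PodSpec{HostPID: true}},
	}
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			pod := &v1.Pod{Spec: tc.spec}
			// Assumed defaulting entry point; adjust to match how this
			// test file applies defaults.
			SetObjectDefaults_Pod(pod)
			// TODO: assert on the user-namespace field this PR defaults,
			// e.g. whether the pod ends up in the host user namespace.
		})
	}
}
```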

@vikaschoudhary16

Member Author

vikaschoudhary16 commented May 18, 2018

/test pull-kubernetes-local-e2e-containerized

@vikaschoudhary16

Member Author

vikaschoudhary16 commented May 21, 2018

/test pull-kubernetes-local-e2e-containerized

@@ -91,6 +91,9 @@ service RuntimeService {

// Status returns the status of the runtime.
rpc Status(StatusRequest) returns (StatusResponse) {}

// GetRuntimeConfigInfo returns the configuration details of the runtime.
rpc GetRuntimeConfigInfo(GetRuntimeConfigInfoRequest) returns (GetRuntimeConfigInfoResponse) {}


@feiskyer

feiskyer May 21, 2018

Member

Could this be part of Status()?


@derekwaynecarr

derekwaynecarr May 21, 2018

Member

@feiskyer do you have a suggestion on how you would map the data into Status()?

right now, i view Status as information useful to drive debug, but not change behavior.


@yujuhong

yujuhong May 24, 2018

Member

Status right now is used primarily to check the health of the runtime, and is called periodically. Using a separate call is cleaner and may avoid hitting performance problems.


message LinuxUserNamespaceConfig {
// is_enabled, if true indicates that user-namespaces are supported and enabled in the container runtime
bool is_enabled = 1;


@feiskyer

feiskyer May 21, 2018

Member

I don't think this field is required. An empty list of mappings also means user-namespace is not enabled


@derekwaynecarr

derekwaynecarr May 21, 2018

Member

agree w/ @feiskyer that this is not required.
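
To illustrate the reviewers' point, the message could drop the boolean and rely on the mapping list alone. This is only a sketch: the `LinuxIDMapping` message, field names, and field numbers are illustrative, not a finalized CRI API.

```protobuf
// Sketch: user-namespace configuration reported by the runtime.
// An empty mapping list means user namespaces are not enabled, so no
// separate is_enabled flag is needed.
message LinuxUserNamespaceConfig {
    // uid_mappings holds the node-wide uid remapping ranges.
    repeated LinuxIDMapping uid_mappings = 1;
    // gid_mappings holds the node-wide gid remapping ranges.
    repeated LinuxIDMapping gid_mappings = 2;
}

// Sketch of a single remapping entry (container ID range mapped to a host ID range).
message LinuxIDMapping {
    uint32 container_id = 1;
    uint32 host_id = 2;
    uint32 size = 3;
}
```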

@derekwaynecarr
Member

derekwaynecarr left a comment

see the comments throughout.

i think we should not poll runtime config info.
the pr also needs more test cases in identified areas.


@@ -100,6 +100,7 @@ type RuntimeService interface {
UpdateRuntimeConfig(runtimeConfig *runtimeapi.RuntimeConfig) error
// Status returns the status of the runtime.
Status() (*runtimeapi.RuntimeStatus, error)
GetRuntimeConfigInfo() (*runtimeapi.GetRuntimeConfig, error)


@derekwaynecarr

derekwaynecarr May 21, 2018

Member

nit: add the Godoc
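
Roughly what the nit asks for (method signature copied from the diff above; the comment wording is only illustrative):

```go
// Excerpt sketch of the RuntimeService interface with Godoc on the new method.
type RuntimeService interface {
	// Status returns the status of the runtime.
	Status() (*runtimeapi.RuntimeStatus, error)
	// GetRuntimeConfigInfo returns configuration details of the container
	// runtime, such as its node-wide user-namespace ID mappings.
	GetRuntimeConfigInfo() (*runtimeapi.GetRuntimeConfig, error)
	// ... remaining methods elided ...
}
```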

@@ -122,6 +122,7 @@ type Runtime interface {
// This method just proxies a new runtimeConfig with the updated
// CIDR value down to the runtime shim.
UpdatePodCIDR(podCIDR string) error
GetRuntimeConfigInfo() (*RuntimeConfigInfo, error)


@derekwaynecarr

derekwaynecarr May 21, 2018

Member

nit: add godoc

@@ -491,6 +492,34 @@ type RuntimeStatus struct {
Conditions []RuntimeCondition
}

type RuntimeConfigInfo struct {


@derekwaynecarr

derekwaynecarr May 21, 2018

Member

added public structs should have docs. it pains me when the existing file is so well documented and we don't keep that going with new stuff ;-)


@liggitt

liggitt Jul 19, 2018

Member

fields still need documentation
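
For example, the documentation being asked for might look like the sketch below. Illustrative only: the actual field set is defined by this PR's diff; the names here are hypothetical stand-ins showing the level of documentation requested.

```go
// RuntimeConfigInfo describes runtime configuration that the kubelet learns
// from the container runtime (sketch; field set is hypothetical).
type RuntimeConfigInfo struct {
	// UserNamespaceConfig reports whether and how the runtime remaps
	// user and group IDs at the node level.
	UserNamespaceConfig UserNamespaceConfigInfo
}

// UserNamespaceConfigInfo holds the node-wide uid/gid remapping ranges
// reported by the runtime (hypothetical field names).
type UserNamespaceConfigInfo struct {
	// UidMappings are the uid ranges remapped into the node-level user namespace.
	UidMappings []IDMapping
	// GidMappings are the gid ranges remapped into the node-level user namespace.
	GidMappings []IDMapping
}

// IDMapping is a single container-ID-to-host-ID remapping range.
type IDMapping struct {
	ContainerID int64
	HostID      int64
	Size        int64
}
```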

@@ -2077,6 +2105,28 @@ func (kl *Kubelet) updateRuntimeUp() {
glog.Errorf("Container runtime not ready: %v", runtimeReady)
return
}

ci, err := kl.containerRuntime.GetRuntimeConfigInfo()


@derekwaynecarr

derekwaynecarr May 21, 2018

Member

yeah, this basically has us calling docker info every 5s if we don't cache the behavior in the shim.

i can see here why @feiskyer is suggesting putting the call in Status as a result so we can avoid making two grpc operations every 5s. given this is the intended usage, i see a compelling argument for merging the information.

/cc @yujuhong @dchen1107 input appreciated.


@derekwaynecarr

derekwaynecarr May 21, 2018

Member

honestly, i dont think this information can change over the life of a kubelet.

calling the operation once, and caching the information should be sufficient imho. operators that dynamically change the config will have had to drain the node already and most likely restart the kubelet.

so if we call this operation once, and cache the info, it doesnt need to merge into status.
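
A minimal sketch of the call-once-and-cache idea (the types come from the diffs above; the cache wrapper itself is illustrative, not part of this PR):

```go
import (
	"sync"

	kubecontainer "k8s.io/kubernetes/pkg/kubelet/container"
)

// runtimeConfigCache fetches the runtime config once and reuses it for the
// lifetime of the kubelet, instead of issuing a GetRuntimeConfigInfo call on
// every updateRuntimeUp tick (~5s). Note that sync.Once also caches a failed
// first call; a retry-until-success variant may be preferable in practice.
type runtimeConfigCache struct {
	once sync.Once
	info *kubecontainer.RuntimeConfigInfo
	err  error
}

func (c *runtimeConfigCache) get(runtime kubecontainer.Runtime) (*kubecontainer.RuntimeConfigInfo, error) {
	c.once.Do(func() {
		c.info, c.err = runtime.GetRuntimeConfigInfo()
	})
	return c.info, c.err
}
```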

func (kl *Kubelet) getRemappedIDs() (error, int, int) {
ci, err := kl.containerRuntime.GetRuntimeConfigInfo()
if err != nil {
return fmt.Errorf("Container runtime info get failed: %v", err), -1, -1


@derekwaynecarr

derekwaynecarr May 21, 2018

Member

nit: lowercase the error string here


@derekwaynecarr

derekwaynecarr May 21, 2018

Member

if we cached the remappedIDs value, we could avoid recomputing it all the time. it feels like this should be cached.

@@ -474,11 +473,17 @@ func (kl *Kubelet) GenerateRunContainerOptions(pod *v1.Pod, container *v1.Contai
glog.Errorf("Error on creating %q: %v", p, err)
} else {
opts.PodContainerDir = p
if err := kl.chownDirForRemappedIDs(p); err != nil {
glog.Errorf("Error while chowning %s", p)


@derekwaynecarr

derekwaynecarr May 21, 2018

Member

shouldnt this return an error rather than log an error?

func hasHostNamespace(pod *v1.Pod) bool {
if pod.Spec.SecurityContext == nil {
return false
//if pod.Spec.SecurityContext == nil {


@derekwaynecarr

derekwaynecarr May 21, 2018

Member

why remove this? shouldnt this come after the HostUserNamespace check?

// namespacesForPod returns the runtimeapi.NamespaceOption for a given pod.
// An empty or nil pod can be used to get the namespace defaults for v1.Pod.
func namespacesForPod(pod *v1.Pod) *runtimeapi.NamespaceOption {
return &runtimeapi.NamespaceOption{
Ipc: ipcNamespaceForPod(pod),
Network: networkNamespaceForPod(pod),
Pid: pidNamespaceForPod(pod),
User: userNamespaceForPod(pod),


@derekwaynecarr

derekwaynecarr May 21, 2018

Member

this should have an update in TestNamespacesForPod

@vikaschoudhary16 vikaschoudhary16 force-pushed the vikaschoudhary16:userns-new branch 2 times, most recently from 8016c8d to 6efef2f Dec 19, 2018

@k8s-ci-robot

Contributor

k8s-ci-robot commented Dec 19, 2018

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: derekwaynecarr, vikaschoudhary16
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: lavalamp

If they are not already assigned, you can assign the PR to them by writing /assign @lavalamp in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vikaschoudhary16 vikaschoudhary16 force-pushed the vikaschoudhary16:userns-new branch from 6efef2f to a4f7eea Dec 19, 2018

@vikaschoudhary16

Member Author

vikaschoudhary16 commented Dec 21, 2018

/test pull-kubernetes-local-e2e-containerized

@vikaschoudhary16 vikaschoudhary16 force-pushed the vikaschoudhary16:userns-new branch 2 times, most recently from 745cb29 to 6396a57 Dec 24, 2018

vikaschoudhary16 added some commits May 10, 2018

@vikaschoudhary16 vikaschoudhary16 force-pushed the vikaschoudhary16:userns-new branch from 6396a57 to d2dd92b Dec 25, 2018

@vikaschoudhary16

Member Author

vikaschoudhary16 commented Dec 26, 2018

/test pull-kubernetes-local-e2e-containerized

@k8s-ci-robot

Contributor

k8s-ci-robot commented Dec 26, 2018

@vikaschoudhary16: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-cross c6b8d2f link /test pull-kubernetes-cross
pull-kubernetes-local-e2e-containerized d2dd92b link /test pull-kubernetes-local-e2e-containerized

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@Random-Liu

Member

Random-Liu commented Jan 8, 2019

If I understand the proposal properly, we plan to support per-namespace and per-pod user namespace mapping in the future. And I think the uid/gid mapping would be managed by kubelet in the future? Kubelet will manage the namespace/pod <-> uid/gid mapping, and pass the corresponding uid/gid mapping when creating a pod sandbox.

If that is the future direction, can we make the CRI change to be aligned with that?

  1. Add uid/gid mapping into PodSandboxConfig;
  2. Remove the uid/gid mapping discovery from container runtime, because it is going to come from kubelet in the future.

I think we can achieve what we want for alpha stage with the API change above + some kubelet configuration.

  1. Add a kubelet flag for the uid/gid mapping.
    a) If the container runtime supports per-pod/per-namespace uid/gid mapping, it just works;
    b) If the container runtime doesn't support it (e.g. docker), the system admin just needs to make sure the kubelet flag is consistent with the corresponding runtime config. This does introduce a bit of management overhead, but it seems manageable for an alpha feature.
  2. Do not introduce NODE_WIDE_REMAPPED, just use POD. For the alpha stage, we just need to make sure kubelet specifies POD and passes the same node level uid/gid mapping to all pods which don't use host user namespace.
    a) If the container runtime supports per-pod/per-namespace uid/gid mapping, this just works;
    b) If the container runtime doesn't support it, it can check whether the uid/gid mapping matches its node level mapping, and returns error if it doesn't.

In this way, we don't need to change CRI again when we switch to per-pod or per-namespace user namespace. We just need to change the kubelet implementation and pass in different uid/gid mapping.

And please note that containerd already supports per-snapshot uid/gid mappings. If I understand correctly, we can already implement 1.a) and 2.a) there.
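
A rough sketch of the CRI direction described here, with the kubelet supplying the mappings in the sandbox config instead of discovering them from the runtime. Message placement, field names, and field numbers are illustrative (reusing the `LinuxIDMapping` message sketched earlier in this thread), not an agreed API.

```protobuf
// Sketch: ID mappings passed by the kubelet when creating a pod sandbox.
message LinuxSandboxSecurityContext {
    // ... existing fields elided ...

    // uid_mappings/gid_mappings describe the user namespace the runtime should
    // set up for this sandbox. At alpha the kubelet could pass the same
    // node-wide ranges for every pod that does not use the host user
    // namespace; later it could pass per-pod or per-namespace ranges without
    // changing the CRI again.
    repeated LinuxIDMapping uid_mappings = 100;
    repeated LinuxIDMapping gid_mappings = 101;
}
```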

@vikaschoudhary16

Member Author

vikaschoudhary16 commented Jan 9, 2019

@Random-Liu Thanks a lot for the suggestions. I agree with most of the changes you suggested; it is just the kubelet-flags part that I want to discuss a bit more.
Introducing kubelet flags, as you said, adds management overhead and increases the chance of configuration errors, while not seeming to add any value over learning the runtime's mappings through the CRI: both approaches give the kubelet ID mappings that it then has to divide and allocate internally. Even if a runtime supports per-pod namespace mappings, is it incorrect or too much to assume that the runtime will still have an option to configure default mappings?

@Random-Liu

Member

Random-Liu commented Jan 10, 2019

@vikaschoudhary16 Oops, it seems that some part of your comments got copied twice, which makes it a bit hard to understand. :P

What I mean is that:

  1. For the alpha stage (node level remap), we have the kubelet node level remap flag. System admin needs to make sure it matches the docker remap config; kubelet just passes the whole node remap to each pod, which is equivalent to what is suggested in this PR. This avoids the necessity for kubelet to learn the mapping from the container runtime at the alpha stage.
  2. For future per-pod/per-namespace remap, we still need a flag to tell kubelet which uid/gid range it can use to divide and allocate to different namespaces/pods, it could be the same flag from 1) or not. However, the container runtime won't have its own remap any more, it just receives remap in PodSandboxConfig and does it for the pod. Kubelet doesn't need to learn the mapping from the container runtime.
@vikaschoudhary16

Member Author

vikaschoudhary16 commented Jan 10, 2019

@Random-Liu sorry about that :)
From your reply i can imagine that you got it right :D.

Kubelet doesn't need to learn the mapping from the container runtime.

Initially we started with passing ranges through kubelet flags only, but then we got feedback that learning the configuration from the runtime might be a better idea. I see the following two benefits if the kubelet learns the mappings:

  1. It avoids adding yet more kubelet configuration knobs.
  2. It reduces the chance of failure due to a configuration mismatch between the kubelet and the runtime. Runtimes like Docker and CRI-O support default mapping ranges.

Why do we want to have kubelet flags over dynamic learning?
Just want to make sure that we have discussed enough before reverting back. Thoughts?

/cc @mrunalp

@Random-Liu

Member

Random-Liu commented Jan 12, 2019

Just want to make sure that we have discussed enough before reverting back. Thoughts?

Don't revert yet, this is just my opinion. Let's continue discussing.

It avoids adding yet more kubelet configuration knobs.

Actually, kubelet configuration is much easier to manage than runtime configuration today. We have dynamic kubelet config management, with which you can manage the config in Kubernetes.

In the future, the overall uid/gid range to use has to be configured somewhere, either in the kubelet or in the container runtime. Given that 1) the kubelet is going to manage and divide the mapping, and 2) kubelet config is easier to manage, I see no point in configuring it in the container runtime and having the kubelet discover it.

It reduces the chance of failure due to a configuration mismatch between the kubelet and the runtime. Runtimes like Docker and CRI-O support default mapping ranges.

We'll only have to configure this in two places for the alpha stage, which seems fine to me. For beta, we'll only need to configure it in the kubelet.

@prbinu


prbinu commented Feb 13, 2019

@vikaschoudhary16 @Random-Liu
It would be really helpful if we can get this feature into v1.14 release.

@vikaschoudhary16

Member Author

vikaschoudhary16 commented Feb 14, 2019

@prbinu I am working to get approvals on the updated proposal, kubernetes/community#2595.

@liggitt liggitt added this to the v1.14 milestone Feb 15, 2019

@nathanjsweet

nathanjsweet left a comment

I really like this overall approach, but I worry a bit about the actual remapping logic. There are already efforts within the various runtimes to solve the problem of actually shifting the uids/gids. One benefit of letting them do this, especially if something like shiftfs comes along, is that keeping track of all of this stays stateless.

What do I mean? I see that you have commented out the data-directory stuff, which makes sense: if there is ever a node failure before cleanup, there is no way to track what the original uid(s)/gid(s) in the data directory were. If something like shiftfs were ever to get merged, this would no longer be a problem.

I think this general approach is great, but the code should be written to future-proof the possibility that the runtimes handle the "chowning problem" themselves.

Sorry if I'm spouting something that you're already aware of. This feature is particularly important to me and I want to make sure that I can use it in a way that ensures that nodes can still be treated as disposable objects (or not thought of at all).

Thanks for doing this work!

@@ -1364,6 +1442,10 @@ func (kl *Kubelet) initializeModules() error {

// initializeRuntimeDependentModules will initialize internal modules that require the container runtime to be up.
func (kl *Kubelet) initializeRuntimeDependentModules() {
if err := kl.chownDirForRemappedIDs(kl.getPodsDir()); err != nil {


@nathanjsweet

nathanjsweet Feb 19, 2019

It is very conceivable that there are situations where this does not need to happen. For example, if shiftfs ever gets merged, this approach becomes disadvantageous.

@krmayankk

Contributor

krmayankk commented Mar 3, 2019

@vikaschoudhary16 are you working on this?

@nikopen

Member

nikopen commented Mar 4, 2019

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.14 milestone Mar 4, 2019
