rkt: Support per-pod stage1-image #23944

Closed
aaronlevy opened this Issue Apr 6, 2016 · 13 comments


aaronlevy commented Apr 6, 2016

It would be useful to be able to set the rkt stage1-image on a per-pod basis (overriding the global --rkt-stage1-image flag on the kubelet).

An example use case would be using the rkt fly stage1 when running a self-hosted kubelet, while using the global stage1 for all other pods.

One option might be to use annotations on the pod, which the rkt runtime could parse for the stage1 image. This would be somewhat lightweight in that no API changes would need to be made, and it could eventually be replaced if/when runtime-specific pod configuration becomes available.

A concern with using annotations is that the stage1 could essentially run the pod with escalated privileges. It might mean that, on the kubelet side, pods with a stage1 annotation would need to be rejected unless they also set the privileged=true flag (the annotation would not be enforceable from the api-server --allow-privileged side, however).
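
To make the shape of that concrete: a pod opting into a different stage1 via an annotation might look like the sketch below. The annotation key is the one the implementation at the end of this thread (#25177) eventually adopted, and the privileged security context reflects the enforcement idea described above; both are illustrative here, not settled API.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fly-example
  annotations:
    # Hypothetical at this point in the discussion; this key is what
    # PR #25177 eventually adopted.
    rkt.alpha.kubernetes.io/stage1-name-override: "coreos.com/rkt/stage1-fly:1.3.0"
spec:
  containers:
    - name: main
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
      securityContext:
        # Per the concern above, the kubelet would reject the stage1
        # annotation unless the pod also runs privileged.
        privileged: true
```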

@yifan-gu @philips


yifan-gu commented Apr 6, 2016

@bgrant0607 This raises the question of how we can effectively add runtime-specific options to the pod spec.

@saad-ali saad-ali added the sig/node label Apr 7, 2016


bgrant0607 commented Apr 7, 2016

@aaronlevy Please explain more about what this means and why we'd want it.

cc @kubernetes/sig-node


philips commented Apr 7, 2016

@bgrant0607 Two use cases:

  • Have rkt use kvm isolation instead of namespace/cgroups
  • Have rkt use less isolation, essentially just a chroot, for running really privileged stuff like the kubelet
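
In terms of the annotation key that #25177 later adopted, those two use cases would map onto rkt's stock kvm and fly stage1 flavors, roughly as sketched here (the version tag is illustrative):

```yaml
# Stronger isolation: run the pod's containers inside a KVM virtual machine.
annotations:
  rkt.alpha.kubernetes.io/stage1-name-override: "coreos.com/rkt/stage1-kvm:1.3.0"
---
# Weaker isolation: the fly stage1 is essentially a chroot, for highly
# privileged workloads such as the kubelet itself.
annotations:
  rkt.alpha.kubernetes.io/stage1-name-override: "coreos.com/rkt/stage1-fly:1.3.0"
```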

@aaronlevy aaronlevy closed this Apr 7, 2016

@aaronlevy aaronlevy reopened this Apr 7, 2016



aaronlevy commented Apr 19, 2016

To add a bit more detail to this request:

From a selfish "I want this" perspective:

This comes from the need to be able to run a kubelet pod using specific rkt runtime options (in this case, using a different stage1-image). This would currently be necessary for running a self-hosted kubelet with rkt as the container runtime.

For example:

Starting from a self-hosted kubelet proposal: #23343

  1. Bootstrap kubelet (on host) starts with --container-runtime=rkt.
  2. Bootstrap kubelet starts receiving pod definitions from the api-server and runs them using the default stage1-image (namespace/cgroup isolation).
  3. Bootstrap kubelet receives the pod definition for a "self-hosted" kubelet, which must run with a different --stage1-image (rkt fly) because of its different isolation needs (it has to share host namespaces); see the sketch after this list.
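
A sketch of the pod in step 3, assuming the annotation mechanism discussed above (the annotation key is the one #25177 later adopted; the image name, kubelet flags, and mounts are illustrative placeholders, not part of any actual proposal):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: self-hosted-kubelet
  annotations:
    # Ask the rkt runtime for the fly stage1 (host namespaces, chroot-only).
    rkt.alpha.kubernetes.io/stage1-name-override: "coreos.com/rkt/stage1-fly:1.3.0"
spec:
  hostNetwork: true
  containers:
    - name: kubelet
      image: quay.io/coreos/hyperkube:v1.2.0   # illustrative image/tag
      command:
        - /hyperkube
        - kubelet
        - --api-servers=https://127.0.0.1:443  # illustrative flag
      securityContext:
        privileged: true
      volumeMounts:
        - name: var-lib-kubelet
          mountPath: /var/lib/kubelet
  volumes:
    - name: var-lib-kubelet
      hostPath:
        path: /var/lib/kubelet
```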

From a more general perspective:

There might be some other options that would allow a self-hosted kubelet with rkt as the container runtime, potentially without needing to specify per-pod runtime options.

One possibility is setting mount propagation in the spec (#20698) and creating a rkt stage1 that does not run systemd as pid 1 in the container (#23692), so that hostPID works as expected.

However, it seems like container runtime configuration (potentially at the pod level) is going to be somewhat necessary no matter what. Some of the options in the pod spec are already runtime-specific and may not map well to every container runtime. For example, what does "hostPID" mean when the runtime could be VM-based (e.g. hyper or kvm isolation)? And how or why would some runtime configuration be blessed while other configuration is not?

So, back to the originally posted (and selfish) issue: being able to specify a runtime-specific option (the stage1 image, in this case) as part of the pod would allow for situations where we need to change the isolation mechanics of a pod (in a concrete use case: self-hosting the kubelet).


philips commented Apr 20, 2016

@aaronlevy What is an example of runtime configuration?


aaronlevy commented Apr 20, 2016

I think the pod/container securityContexts are a good example of pod-based runtime configuration.
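
For reference, a minimal sketch of what that looks like in the existing v1 API (values illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-context-demo
spec:
  securityContext:
    runAsUser: 1000          # pod-level: applies to all containers
  containers:
    - name: main
      image: busybox
      command: ["sh", "-c", "id && sleep 3600"]
      securityContext:
        privileged: false    # container-level override
        capabilities:
          drop: ["NET_RAW"]  # the runtime must map this onto its isolation model
```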


bgrant0607 commented Apr 20, 2016

Ref #17064


bgrant0607 commented Apr 20, 2016

@yifan-gu @philips For now, experimental runtime-specific options could be passed via annotations on the pod. Something like rkt.alpha.kubernetes.io/image-stage: stage1. It would be non-portable, though, and might be subsumed by future first-class API features. #17064 is a good place to discuss that issue.


yifan-gu commented Apr 21, 2016

> @yifan-gu @philips For now, experimental runtime-specific options could be passed via annotations on the pod. Something like rkt.alpha.kubernetes.io/image-stage: stage1. It would be non-portable, though, and might be subsumed by future first-class API features. #17064 is a good place to discuss that issue.

I am happy to implement something like this.

@yifan-gu yifan-gu added this to the rktnetes-v1.0 milestone Apr 21, 2016


yifan-gu commented Apr 21, 2016

@bgrant0607 But the problem with using an annotation for the stage1 image in rkt, as mentioned in #23944 (comment), is that it could grant users privilege without the API server noticing (users could run an arbitrary stage1 unless we verify it in the kubelet/rkt), and the API server has no way to check for and prevent that.

And if we let the kubelet fail those pods with an illegal stage1 image annotation, then we would probably see a crash loop that keeps retrying... though that might not be too bad for now?


aaronlevy commented Apr 21, 2016

@yifan-gu It seems like this would eventually be a good candidate for an admission controller that can validate runtime options (e.g. the admission handling of securityContext).

If the crash loop / events made it clear that the pod is failing to run because a privileged=true context is required, that would seem reasonable enough. Another option might be something like an --rkt-approved-stage1-images flag on the kubelet, which would restrict the set of stage1 images a user could specify. But if --privileged is already assumed to essentially mean root, I'm not sure this adds much extra protection.

k8s-merge-robot added a commit that referenced this issue May 25, 2016

Merge pull request #25177 from euank/rkt-alternate-stage1
Automatic merge from submit-queue

rkt: Support alternate stage1's via annotation

This provides a basic implementation for setting a stage1 on a per-pod basis via an annotation. See discussion here for how this approach was arrived at: #23944 (comment)

It's possible this feature should be gated behind additional knobs, such as a kubelet flag to filter allowed stage1s, or a check akin to what privileged gets in the apiserver. Currently it checks `AllowPrivileged` as a means to let people disable this feature, though overloading one flag for both stage1 and privileged isn't ideal.

Fixes #23944

Testing done (note, unfortunately done with some additional ./cluster changes merged in):

```
$ cat examples/stage1-fly/fly-me-to-the-moon.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    name: exit
  name: exit-fast
  annotations: {"rkt.alpha.kubernetes.io/stage1-name-override": "coreos.com/rkt/stage1-fly:1.3.0"}
spec:
  restartPolicy: Never
  containers:
    - name: exit
      image: busybox
      command: ["sh", "-c", "ps aux"]
$ kubectl create -f examples/stage1-fly
$ ssh core@minion systemctl status -l --no-pager k8s_2f169b2e-c32a-49e9-a5fb-29ae1f6b4783.service
...
failed
...
May 04 23:33:03 minion rkt[2525]: stage0: error writing /etc/rkt-resolv.conf: open /var/lib/rkt/pods/run/2f169b2e-c32a-49e9-a5fb-29ae1f6b4783/stage1/rootfs/etc/rkt-resolv.conf: no such file or directory
...
# Restart kubelet with allow-privileged=false
$ kubectl create -f examples/stage1-fly
$ kubectl describe pod exit-fast
...
  1m		19s		5	{kubelet euank-e2e-test-minion-dv3u}	spec.containers{exit}	Warning		Failed		Failed to create rkt container with error: cannot make "exit-fast_default(17050ce9-1252-11e6-a52a-42010af00002)": running a custom stage1 requires a privileged security context
....
```

Note as well that the "success" here is rkt spitting out an [error message](rkt/rkt#2141) that at least indicates the right stage1 was being used.

cc @yifan-gu @aaronlevy