This repository was archived by the owner on Dec 19, 2019. It is now read-only.

Conversation


@yifan-gu
Contributor Author

After sending out the PR, I feel like it might have too much wording.
Anyway, feel free to comment :)

3. The third approach is essentially a combination of the first two.
It has the benefits of both worlds: rktlet lives outside of the Kubelet, so it could lead to higher iteration speed, a smaller Kubelet, and better modularity.
On the other side, the 'rkt API service' is compiled with rkt and communicates with rktlet through gRPC, so we can still use a lot of the existing rkt facilities.
The potential con is that instead of one gRPC service, we now introduce two, which may or may not cause some overhead.

This also increases blast radius. I'm 👎 on 3


It'd be nice to know how much overhead there is, just for reference. I think @feiskyer did some latency measurements before?


@feiskyer commented Aug 24, 2016


Yes, there is a measurement of the logging API over the JSON-based HTTP API, but gRPC should have better performance.
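For a rough sense of that overhead, one can time round trips against a trivial gRPC service on a unix socket. The sketch below is not from this thread: it uses grpc-go's stock health-check service just to have something to call, the socket path and iteration count are arbitrary, and the result only bounds the transport/encode cost of an extra hop, not real rkt work.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"os"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	const sock = "/tmp/rktlet-bench.sock" // arbitrary scratch path
	_ = os.Remove(sock)                   // clean up a stale socket from a previous run

	// Server side: a gRPC server on a unix socket, serving only the health API.
	lis, err := net.Listen("unix", sock)
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	healthpb.RegisterHealthServer(srv, health.NewServer())
	go srv.Serve(lis)
	defer srv.Stop()

	// Client side: dial the unix socket with a custom dialer.
	conn, err := grpc.Dial(sock,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", addr)
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	client := healthpb.NewHealthClient(conn)

	ctx := context.Background()
	req := &healthpb.HealthCheckRequest{}

	// Warm up the connection, then time n round trips.
	for i := 0; i < 100; i++ {
		if _, err := client.Check(ctx, req); err != nil {
			log.Fatal(err)
		}
	}
	const n = 5000
	start := time.Now()
	for i := 0; i < n; i++ {
		if _, err := client.Check(ctx, req); err != nil {
			log.Fatal(err)
		}
	}
	fmt.Printf("average gRPC round trip over a unix socket: %v\n", time.Since(start)/n)
}
```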

@lucab
Contributor

lucab commented Aug 24, 2016

As a general sidenote, this CRI and the client/server proposals seem to go in the exact opposite direction of the goals we initially set: one (or more) new long-running high-privileged services are being introduced here, becoming a SPOF for the whole node. I'm sure this is not the first time this concern is being raised, but here it becomes quite apparent.

@yifan-gu
Contributor Author

yifan-gu commented Aug 24, 2016

> As a general sidenote, this CRI and the client/server proposals seem to go in the exact opposite direction of the goals we initially set: one (or more) new long-running high-privileged services are being introduced here, becoming a SPOF for the whole node. I'm sure this is not the first time this concern is being raised, but here it becomes quite apparent.

@lucab A non-privileged service + setuid + exec'ing the rkt binary to solve this? And for listing/introspection tasks, we could use a library, or something like today's rkt api-service?

@euank
Contributor

euank commented Aug 24, 2016

@lucab this should be thought of as an extension of the kubelet and potentially could be vendored into the kubelet to reduce the SPOF concern; that being said, there's also benefit in decoupling and defining clear API boundaries, which is a trend seen among other Kubernetes components.

The kubelet will be a highly privileged SPOF no matter what; that's baked into its design. This is an extension of that, and we can potentially work to drop privileges/re-exec in order to reduce the privileges here. But if we frame this as more a component of the kubelet than a component of rkt, it is less clearly at odds with our original goals and more a continuation of the current state of things.

Additionally, the SPOF should be limited in that the lifecycle of rktlet should be completely decoupled from that of pods by ensuring they're parented elsewhere, so at least updates and restarts should have minimal impact.


On to my opinion on what the notable pros/cons of each given approach are. There are two axes to consider: 1) whether it's a separate binary or vendored, and 2) how it interacts with rkt (library vs exec).

Separate process (vs vendored)

Pros

  • No need to re-vendor (see cadvisor for an example of the minor pain/friction that results)
  • Can have an independent/decoupled release cycle in the future
  • Can update independently without need to restart kubelet
  • Users can deploy fewer total bits of binaries if they don't want rkt
  • Easier to split up the decision of containerizing kubelet vs containerizing rktlet (e.g. to not containerize rktlet)
  • "plays nicer" with the kubelet/CRI ecosystem (oci-runtime, hyper-runtime both intend to integrate as services)
  • Node team encourages this level of integration (cc @yujuhong / @dchen1107 to double check)

Cons

  • Users need to deploy more different parts and likely more total bits if they do want rkt
  • Need to provide a means of distribution + management; can't be a kubelet static pod unless we have a bootstrapping in-process runtime
  • Need to figure out an auth/security story we're happy with, whether that be a unix socket or localhost address or proper auth
  • Increased blast radius; kubelet + rktlet
  • Increased superprivileged radius; kubelet + rktlet
  • Increased chance of messy error handling and latency (but since it's over loopback, latency should be a tiny concern)
  • Increased CPU/mem due to encode/decode grpc vs function passing, and the typical logging/server/etc overhead paid down twice

Library-like vendoring (vs process exec)

Pros

  • No need to deal with the rkt-bin version (be it too old, new, or what have you)
  • Some amount of the work already done via api-service
  • Not exec-ing binaries seems a little "cleaner", less overhead for some parts
  • No need for rkt to provide a machine-readable output or to parse the tabular output
  • Slightly easier deployment story
  • Encourages rkt to be refactored in a way where it can be more generally useful as libs

Cons

  • The CLI is the only really official way to interact with the rkt datadir
  • Still need a way to re-parent off /init, can't be fully library-based (maybe re-exec self nullifies this concern?)
  • Harder to separate privileges (no or fewer exec boundaries)
  • Possible to have different datadir versions for rktlet-library vs rkt-binary
  • More work total probably
  • Couples release cycle for users, they can't just update rkt to fix bugs
  • Constant re-vendoring of rkt could be painful in terms of churn

Conclusion

It's possible that I missed details above, and it's also up for interpretation which axes matter.

I have a preference for integrating as a service in no small part because the node team, iiuc, encourages that. The improved boundary / separation of responsibilities is nice, though we really get that either way.

Library vs exec is a hard one, but I feel like being able to avoid version-skew issues makes a library, or distributing a specific version of the rkt binary along with the rktlet, quite attractive.

I worry that some of the values of rkt's design aren't aligned with being used as a library by a long-lived daemon which exclusively manages an owned data-directory, but I can also individually see attractive qualities of each of those choices.

The extreme opposite, conforming to CRI via this code but re-vendoring into kubelet and then exec-ing rkt for everything, is still afaik a viable option to discuss, @lucab, and seems in line with the current state of things (other than introducing the app-level concept and a related layer of abstraction).

@s-urbaniak

Thanks @euank for the write-up. Since most are leaning towards having a separate process (variant 1), replacing the current rkt api-service, I'd like to throw in another idea for variant 1.

Instead of deleting/pruning the rkt api-service from the rkt code base and migrating it here as rktlet, we could rewrite the existing rkt api-service as a new subcommand called rkt kubelet-service, using the gRPC protobufs described in https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/runtime-client-server.md (which, as far as I understand, would also be used here in rktlet).

We would also be relieved of the "burden" of providing a rkt library, as things would continue to function mechanically as with the rkt api-service, only with different protobufs.

Existing distribution models would continue working: CoreOS simply continues shipping rkt, and users who have rkt installed on their machines can start the "new" rkt kubelet-service as they did the rkt api-service.

@lucab
Contributor

lucab commented Aug 29, 2016

Thanks for the analysis @euank, I agree that those are the two main decision points. It took me a bit to digest all of this, and I also went digging through existing k8s proposals to get a better understanding.

As a casual/naive observer, these points raised my original concerns:

  • each node must have two services deployed
  • both services must be working in order to run pods
  • both services are running as privileged
  • k8s only knows about a single node manager, not two separate services

But I see they come from the server/client proposal and are intended to be designed exactly like this.
So, if this is the design we have to follow, my pick would be @yifan-gu's "design proposal 1", possibly library-based. The way to get there would be 1) to improve the rkt library interface to be a first-class citizen and 2) to grow the api-service to its full potential (perhaps standalone in this repo).

If we can play outside of that proposal, then my wishlist would be:

  • a single privileged service
  • getting rid of shell-outs
  • splitting this out of the rkt main binary

@yifan-gu
Contributor Author

Discussed OOB with @euank. It sounds like the setuid bit for rkt might cause security concerns we are not certain about.

Another idea is creating a unix socket that is only accessible by a certain group (say, a kubernetes group), and having the kubelet run with that group ID.
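A minimal sketch of what that could look like, purely as an illustration (the socket path and the "kubernetes" group name are made up for the example, and the actual service registration is left as a comment): the privileged service creates the socket, hands group ownership to the dedicated group, and sets 0660 permissions, so a non-root kubelet running with that GID can connect while other users cannot.

```go
package main

import (
	"log"
	"net"
	"os"
	"os/user"
	"strconv"

	"google.golang.org/grpc"
)

func main() {
	const sock = "/var/run/rktlet.sock" // hypothetical socket path

	grp, err := user.LookupGroup("kubernetes") // hypothetical dedicated group
	if err != nil {
		log.Fatal(err)
	}
	gid, err := strconv.Atoi(grp.Gid)
	if err != nil {
		log.Fatal(err)
	}

	_ = os.Remove(sock) // clean up a stale socket from a previous run
	lis, err := net.Listen("unix", sock)
	if err != nil {
		log.Fatal(err)
	}
	// root keeps ownership; the dedicated group gets read/write on the
	// socket, everyone else gets nothing.
	if err := os.Chown(sock, 0, gid); err != nil {
		log.Fatal(err)
	}
	if err := os.Chmod(sock, 0660); err != nil {
		log.Fatal(err)
	}

	srv := grpc.NewServer()
	// The runtime/image gRPC services would be registered on srv here.
	log.Fatal(srv.Serve(lis))
}
```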

Other than that, are there any ideas for how we can have the kubelet contact a separate process/service that has root privilege (in order to fetch images and launch containers), while keeping the kubelet non-root?

cc @philips @gtank @mjg59

@tmrts

tmrts commented Aug 31, 2016

@euank @yifan-gu maybe we should consider distributing multiple binaries with k8s releases:
one for the default runtime (same as before), plus binaries that contain the machinery to start other supported runtimes. It would eliminate a good number of pain points plus configuration for us.

After all, including an additional binary is no different from supporting a runtime and creating a repo for it under the organization.

@euank
Contributor

euank commented Aug 31, 2016

I think the security concerns, while important to consider, aren't the primary focus of this issue. Let's just proceed now with the assumption that any communication can be a root-only unix socket and if we need to break it down further we can.

Based on @lucab's comments and some further OOB discussion (and an attempt to raise this discussion at sig-node), it seems likely to me that the most productive course of action would be to implement this as a client/server, two-process approach.

This does kick distribution issues down the road, and I have no doubt there will be some problems to deal with there, but it's definitely the quickest way to start getting something functional out there (especially given that the other runtime integrations work in a similar way).

I'll assert that converting from a standalone service to a vendored library will also be possible if we implement with that in mind, while the reverse would be quite difficult, so this also isn't a bad path for ending up in either state.

Based on the above, I think that the right course of action is to implement it as a client/server runtime for the moment.

The exec vs library discussion is a more difficult one to pick a horse in. One of the valuable things to consider is what the rkt project thinks is the right approach here, because, well, they're going to end up determining how well supported (or not) it is.
The exec approach is probably more expedient for development and I personally don't see it as an unreasonable option.

I'll lean towards an initial implementation of client/server + exec, with the knowledge that coming back and merging in rkt code to end up with a lib approach is possible.

I'll also propose that our exec-ing of rkt should work when rktlet is containerized, and should encourage the use of a dedicated rkt binary and data directory, both for version-skew reasons and for simplicity of recognizing and handling broken pods.
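As a very rough sketch of that exec shape (not rktlet's actual code), shelling out to a pinned rkt binary with its own data directory might look like the following. The binary path and data directory are hypothetical, and `rkt version` stands in for the real fetch/run/status invocations that would be needed; the data directory is passed via rkt's global `--dir` flag.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os/exec"
	"time"
)

const (
	rktBin  = "/opt/rktlet/bin/rkt"  // hypothetical: a rkt binary shipped alongside rktlet
	dataDir = "/var/lib/rktlet/data" // hypothetical: a data directory owned by rktlet
)

// runRkt invokes the pinned rkt binary against the dedicated data directory
// and returns its combined output.
func runRkt(ctx context.Context, args ...string) (string, error) {
	cmd := exec.CommandContext(ctx, rktBin, append([]string{"--dir=" + dataDir}, args...)...)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("rkt %v failed: %v: %s", args, err, out)
	}
	return string(out), nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	out, err := runRkt(ctx, "version")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(out)
}
```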

@lucab
Contributor

lucab commented Aug 31, 2016

> Let's just proceed now with the assumption that any communication can be a root-only unix socket and if we need to break it down further we can.

This is reasonable and not far from what was already floating around (a root-like group ownership). It keeps things simple and can be evolved later.

> Based on the above, I think that the right course of action is to implement it as a client/server runtime for the moment.

Ack. Do you prefer to start from the current api-service codebase or build from scratch?

> The exec approach is probably more expedient for development and I personally don't see it as an unreasonable option.

Let's start this way, as I think it will hit fewer blockers on the rkt side. In parallel, we'll try to polish the interface/modules exported by rkt a bit and see whether it is feasible to have one less binary to distribute.

@euank
Contributor

euank commented Aug 31, 2016

@lucab I think it's better to start from the work @tmrts did, and disregard the api-service.

If we use exec, the api-service has no value for us (it was really an initial version of the "use rkt as a library" approach, and I'd argue it's less supported than the equivalent rkt CLI operations). But it also doesn't save much for the library approach either, because it uses private methods, and after a refactor that makes rkt usable as a library I expect the logic it had to implement to be off the mark.

@yifan-gu
Contributor Author

yifan-gu commented Sep 6, 2016

closing in favor of #4

@yifan-gu closed this Sep 6, 2016
@yifan-gu deleted the design_proposal branch September 6, 2016 20:43
@thockin

thockin commented Sep 7, 2016

This is a long thread, but let me just say this: this is a decision that can be revisited if need be. I lean towards building it in, but developing it out of the core tree. I think this gives the best testing and the least version-skew hell. But that can always be revisited, and the rktlet (OMG, better name needed) moved out.

@euank removed their assignment Nov 18, 2016
@iaguis added this to the v0.1.0 milestone Nov 15, 2017