This repository was archived by the owner on Dec 19, 2019. It is now read-only.

Conversation


@yifan-gu
Contributor Author

After sending out the PR, I feel like it might have too much wording.
Anyway, feel free to comment :)

3. The third approach is essentially a combination of the first two.
It has the benefits of both worlds: rktlet lives outside of the Kubelet, so it could lead to higher iteration speed, a smaller Kubelet, and better modularity.
On the other side, the 'rkt API service' is compiled with rkt and communicates with rktlet through gRPC, so we can still use a lot of the existing rkt facilities.
The potential con is that instead of one gRPC service, we now introduce two, which may or may not cause some overhead.

This also increases blast radius. I'm 👎 on 3


It'd be nice to know how much overhead there is, just for reference. I think @feiskyer did some latency measurements before?


@feiskyer commented Aug 24, 2016


Yes, there is a measurement of the logging API over the JSON-based HTTP API, but gRPC should have better performance.
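For a rough sense of that overhead, one can time round trips against a trivial gRPC service on a unix socket. The sketch below is not from this thread: it uses grpc-go's stock health-check service just to have something to call, the socket path and iteration count are arbitrary, and the result only bounds the transport/encode cost of an extra hop, not real rkt work.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"os"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	const sock = "/tmp/rktlet-bench.sock" // arbitrary scratch path
	_ = os.Remove(sock)                   // clean up a stale socket from a previous run

	// Server side: a gRPC server on a unix socket, serving only the health API.
	lis, err := net.Listen("unix", sock)
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	healthpb.RegisterHealthServer(srv, health.NewServer())
	go srv.Serve(lis)
	defer srv.Stop()

	// Client side: dial the unix socket with a custom dialer.
	conn, err := grpc.Dial(sock,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", addr)
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	client := healthpb.NewHealthClient(conn)

	ctx := context.Background()
	req := &healthpb.HealthCheckRequest{}

	// Warm up the connection, then time n round trips.
	for i := 0; i < 100; i++ {
		if _, err := client.Check(ctx, req); err != nil {
			log.Fatal(err)
		}
	}
	const n = 5000
	start := time.Now()
	for i := 0; i < n; i++ {
		if _, err := client.Check(ctx, req); err != nil {
			log.Fatal(err)
		}
	}
	fmt.Printf("average gRPC round trip over a unix socket: %v\n", time.Since(start)/n)
}
```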

@lucab
Contributor

lucab commented Aug 24, 2016

As a general sidenote, this CRI and the client/server proposals seem to go in the exact opposite direction of the goals we initially set: one (or more) new long-running high-privileged services are being introduced here, becoming a SPOF for the whole node. I'm sure this is not the first time this concern is being raised, but here it becomes quite apparent.

@yifan-gu
Contributor Author

yifan-gu commented Aug 24, 2016

> As a general sidenote, this CRI and the client/server proposals seem to go in the exact opposite direction of the goals we initially set: one (or more) new long-running high-privileged services are being introduced here, becoming a SPOF for the whole node. I'm sure this is not the first time this concern is being raised, but here it becomes quite apparent.

@lucab A non-privileged service + setuid + exec'ing the rkt binary to solve this? And for listing/introspection tasks, we could use a library, or something like today's rkt api-service?

@euank
Contributor

euank commented Aug 24, 2016

@lucab this should be thought of as an extension of the kubelet and potentially could be vendored into the kubelet to reduce the SPOF concern; that being said, there's also benefit in decoupling and defining clear API boundaries, which is a trend seen among other Kubernetes components.

The kubelet will be a highly privileged SPOF no matter what; that's baked into its design. This is an extension of that, and we can potentially work to drop privileges/re-exec in order to reduce the privileges here. But if we frame this as more a component of the kubelet than a component of rkt, it is less clearly at odds with our original goals and more a continuation of the current state of things.

Additionally, the SPOF should be limited in that the lifecycle of rktlet should be completely decoupled from that of pods by ensuring they're parented elsewhere, so at least updates and restarts should have minimal impact.


On to my opinion on what the notable pros/cons of each given approach are. There are two axes to consider: 1) whether it's a separate binary or vendored, and 2) how it interacts with rkt (library vs exec).

Separate process (vs vendored)

Pros

  • No need to re-vendor (see cadvisor for an example of the minor pain/friction that results)
  • Can have an independent/decoupled release cycle in the future
  • Can update independently without need to restart kubelet
  • Users can deploy fewer total bits of binaries if they don't want rkt
  • Easier to split up the decision of containerizing kubelet vs containerizing rktlet (e.g. to not containerize rktlet)
  • "plays nicer" with the kubelet/CRI ecosystem (oci-runtime, hyper-runtime both intend to integrate as services)
  • Node team encourages this level of integration (cc @yujuhong / @dchen1107 to double check)

Cons

  • Users need to deploy more different parts and likely more total bits if they do want rkt
  • Need to provide a means of distribution + management; can't be a kubelet static pod unless we have a bootstrapping in-process runtime
  • Need to figure out an auth/security story we're happy with, whether that be a unix socket or localhost address or proper auth
  • Increased blast radius; kubelet + rktlet
  • Increased superprivileged radius; kubelet + rktlet
  • Increased chance of messy error handling and latency (but since it's over loopback, latency should be a tiny concern)
  • Increased CPU/mem due to encode/decode grpc vs function passing, and the typical logging/server/etc overhead paid down twice

Library-like vendoring (vs process exec)

Pros

  • No need to deal with the rkt-bin version (be it too old, new, or what have you)
  • Some amount of the work already done via api-service
  • Not exec-ing binaries seems a little "cleaner", less overhead for some parts
  • No need for rkt to provide a machine-readable output or to parse the tabular output
  • Slightly easier deployment story
  • Encourages rkt to be refactored in a way where it can be more generally useful as libs

Cons

  • The CLI is the only really official way to interact with the rkt datadir
  • Still need a way to re-parent off /init, can't be fully library-based (maybe re-exec self nullifies this concern?)
  • Harder to separate privileges (no or fewer exec boundaries)
  • Possible to have different datadir versions for rktlet-library vs rkt-binary
  • More work total probably
  • Couples release cycle for users, they can't just update rkt to fix bugs
  • Constant re-vendoring of rkt could be painful in terms of churn

Conclusion

It's possible that I missed details above, and it's also up for interpretation which axes matter.

I have a preference for integrating as a service in no small part because the node team, iiuc, encourages that. The improved boundary / separation of responsibilities is nice, though we really get that either way.

Library vs exec is a hard one, but I feel like being able to avoid version-skew issues makes a library, or distributing a specific version of the rkt binary along with the rktlet, quite attractive.

I worry that some of the values of rkt's design aren't aligned with being used as a library by a long-lived daemon which exclusively manages an owned data-directory, but I can also individually see attractive qualities of each of those choices.

The extreme opposite, conforming to CRI via this code but re-vendoring into kubelet and then exec-ing rkt for everything, is still afaik a viable option to discuss, @lucab, and seems in line with the current state of things (other than introducing the app-level concept and a related layer of abstraction).

@s-urbaniak

Thanks @euank for the write-up. Since most are leaning towards having a separate process (variant 1), replacing the current rkt api-service, I'd like to throw in another idea for variant 1.

Instead of deleting/pruning the rkt api-service from the rkt code base and migrating it here as rktlet, we could rewrite the existing rkt api-service as a new subcommand called rkt kubelet-service, using the gRPC protobufs described in https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/runtime-client-server.md (which, as far as I understand, would also be used here in rktlet).

We would also be relieved of the "burden" of providing a rkt library, as things would continue to function mechanically as with the rkt api-service, only with different protobufs.

Existing distribution models would continue working: CoreOS simply continues shipping rkt, and users who have rkt installed on their machines can start the "new" rkt kubelet-service as they did the rkt api-service.

@lucab
Contributor

lucab commented Aug 29, 2016

Thanks for the analysis @euank, I agree that those are the two main decision points. It took me a bit to digest all of this, and I also went digging through existing k8s proposals to get a better understanding.

As a casual/naive observer, these points raised my original concerns:

  • each node must have two services deployed
  • both services must be working in order to run pods
  • both services are running as privileged
  • k8s only knows about a single node manager, not two separate services

But I see they come from the server/client proposal and are intended to be designed exactly like this.
So, if this is the design we have to follow, my pick would be @yifan-gu's "design proposal 1", possibly library-based. The way to get there would be 1) to improve the rkt library interface to be a first-class citizen and 2) to grow the api-service to its full potential (perhaps standalone in this repo).

If we can play outside of that proposal, then my wishlist would be:

  • a single privileged service
  • getting rid of shell-outs
  • splitting this out of the rkt main binary

@yifan-gu
Contributor Author

Discussed OOB with @euank. It sounds like the setuid bit for rkt might cause security concerns we are not certain about.

Another idea is creating a unix socket that is only accessible by a certain group (say, a kubernetes group), and having the kubelet run with that group ID.
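A minimal sketch of what that could look like, purely as an illustration (the socket path and the "kubernetes" group name are made up for the example, and the actual service registration is left as a comment): the privileged service creates the socket, hands group ownership to the dedicated group, and sets 0660 permissions, so a non-root kubelet running with that GID can connect while other users cannot.

```go
package main

import (
	"log"
	"net"
	"os"
	"os/user"
	"strconv"

	"google.golang.org/grpc"
)

func main() {
	const sock = "/var/run/rktlet.sock" // hypothetical socket path

	grp, err := user.LookupGroup("kubernetes") // hypothetical dedicated group
	if err != nil {
		log.Fatal(err)
	}
	gid, err := strconv.Atoi(grp.Gid)
	if err != nil {
		log.Fatal(err)
	}

	_ = os.Remove(sock) // clean up a stale socket from a previous run
	lis, err := net.Listen("unix", sock)
	if err != nil {
		log.Fatal(err)
	}
	// root keeps ownership; the dedicated group gets read/write on the
	// socket, everyone else gets nothing.
	if err := os.Chown(sock, 0, gid); err != nil {
		log.Fatal(err)
	}
	if err := os.Chmod(sock, 0660); err != nil {
		log.Fatal(err)
	}

	srv := grpc.NewServer()
	// The runtime/image gRPC services would be registered on srv here.
	log.Fatal(srv.Serve(lis))
}
```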

Other than that, are there any ideas for how we can have the kubelet contact a separate process/service that has root privilege (in order to fetch images and launch containers), while keeping the kubelet non-root?

cc @philips @gtank @mjg59

@tmrts

tmrts commented Aug 31, 2016

@euank @yifan-gu maybe we should consider distributing multiple binaries with k8s releases:
one for the default runtime (same as before), plus binaries that contain the machinery to start other supported runtimes. It would eliminate a good number of pain points plus configuration for us.

After all, including an additional binary is no different from supporting a runtime and creating a repo for it under the organization.

@euank
Contributor

euank commented Aug 31, 2016

I think the security concerns, while important to consider, aren't the primary focus of this issue. Let's just proceed now with the assumption that any communication can be a root-only unix socket and if we need to break it down further we can.

Based on @lucab's comments and some further OOB discussion (and an attempt to raise this discussion at sig-node), it seems likely to me that the most productive course of action would be to implement this as a client/server, two-process approach.

This does kick distribution issues down the road, and I have no doubt there will be some problems to deal with there, but it's definitely the quickest way to start getting something functional out there (especially given that the other runtime integrations work in a similar way).

I'll assert that converting from a standalone service to a vendored library will also be possible if we implement with that in mind, while the reverse would be quite difficult, so this also isn't a bad path for ending up in either state.

Based on the above, I think that the right course of action is to implement it as a client/server runtime for the moment.

The exec vs library discussion is a more difficult one to pick a horse in. One of the valuable things to consider is what the rkt project thinks is the right approach here, because, well, they're going to end up determining how well supported (or not) it is.
The exec approach is probably more expedient for development and I personally don't see it as an unreasonable option.

I'll lean towards an initial implementation of client/server + exec, with the knowledge that coming back and merging in rkt code to end up with a lib approach is possible.

I'll also propose that our exec-ing of rkt should work when rktlet is containerized, and should encourage the use of a dedicated rkt binary and data directory, both for version-skew reasons and for simplicity of recognizing and handling broken pods.
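As a very rough sketch of that exec shape (not rktlet's actual code), shelling out to a pinned rkt binary with its own data directory might look like the following. The binary path and data directory are hypothetical, and `rkt version` stands in for the real fetch/run/status invocations that would be needed; the data directory is passed via rkt's global `--dir` flag.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os/exec"
	"time"
)

const (
	rktBin  = "/opt/rktlet/bin/rkt"  // hypothetical: a rkt binary shipped alongside rktlet
	dataDir = "/var/lib/rktlet/data" // hypothetical: a data directory owned by rktlet
)

// runRkt invokes the pinned rkt binary against the dedicated data directory
// and returns its combined output.
func runRkt(ctx context.Context, args ...string) (string, error) {
	cmd := exec.CommandContext(ctx, rktBin, append([]string{"--dir=" + dataDir}, args...)...)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("rkt %v failed: %v: %s", args, err, out)
	}
	return string(out), nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	out, err := runRkt(ctx, "version")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(out)
}
```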

@lucab
Contributor

lucab commented Aug 31, 2016

> Let's just proceed now with the assumption that any communication can be a root-only unix socket and if we need to break it down further we can.

This is reasonable and not far from what was already floating around (a root-like group ownership). It keeps things simple and can be evolved later.

> Based on the above, I think that the right course of action is to implement it as a client/server runtime for the moment.

Ack. Do you prefer to start from the current api-service codebase or build from scratch?

> The exec approach is probably more expedient for development and I personally don't see it as an unreasonable option.

Let's start this way, as I think it will hit fewer blockers on the rkt side. In parallel, we'll try to polish the interface/modules exported by rkt a bit and see whether it is feasible to have one less binary to distribute.

@euank
Contributor

euank commented Aug 31, 2016

@lucab I think it's better to start from the work @tmrts did, and disregard the api-service.

If we use exec, the api-service has no value for us (it was really an initial version of the "use rkt as a library" approach, and I'd argue it's less supported than the equivalent rkt CLI operations). But it also doesn't save much for the library approach either, because it uses private methods, and after a refactor that makes rkt usable as a library I expect the logic it had to implement to be off the mark.

@yifan-gu
Contributor Author

yifan-gu commented Sep 6, 2016

closing in favor of #4

@yifan-gu closed this Sep 6, 2016
@yifan-gu deleted the design_proposal branch September 6, 2016 20:43
@thockin

thockin commented Sep 7, 2016

This is a long thread, but let me just say this: this is a decision that can be revisited if need be. I lean towards building it in, but developing it out of the core tree. I think this gives the best testing and the least version-skew hell. But that can always be revisited, and the rktlet (OMG, better name needed) moved out.

@euank removed their assignment Nov 18, 2016
@iaguis added this to the v0.1.0 milestone Nov 15, 2017