Places for hooks #3585

Closed
erictune opened this issue Jan 16, 2015 · 55 comments
Labels
area/api area/extensibility kind/design priority/important-soon sig/api-machinery

Comments

@erictune
Member

There are lots of reasons why the "system" wants to look at an object, like a pod, and modify and/or act on it. There are also a number of places where we can put in "hooks" for these actions. We'll end up with a better system in the long run, I think, if we put some thought into what hooks to use in what situations.

Places for hooks

  1. kubectl, other clients
  2. proxy in front of apiserver
  3. apiserver
  4. post-apiserver

Within apiserver, there are both hardcoded actions on objects, and extensible ones (such as what @derekwaynecarr has implemented with "admission control", #3472 and #3319).

The last one is a somewhat new concept, so it deserves a bit of explanation. After a pod is POST'ed to the apiserver, and persisted to etcd, some other component (a different process than apiserver), which is watching for new objects, will see it, and act on it and/or modify it. It should not start running on a Kubelet until all the things that need to act on it have done so. @smarterclayton called this a "finalizer". The scheduler can be viewed as a finalizer that takes pods that have everything except a HostIP set, and it sets the HostIP.
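To make the pattern concrete, here is a minimal sketch of a finalizer loop in Go, assuming a watch stream and an apiserver update call with compare-and-swap semantics; `runFinalizer`, `events`, `update`, and `pickHost` are hypothetical stand-ins for the real client machinery, not an existing API:

```go
package main

import "fmt"

// Pared-down stand-in for api.Pod; only the fields this sketch needs.
// An empty HostIP means "not yet acted on by this finalizer".
type Pod struct {
	Name            string
	HostIP          string
	ResourceVersion string
}

// runFinalizer consumes a stream of pod events, acts on any pod that
// still lacks a HostIP, and writes it back. A conflicting concurrent
// write is not fatal: the pod reappears on the watch and is retried.
func runFinalizer(events <-chan Pod, update func(Pod) error, pickHost func(Pod) string) {
	for pod := range events {
		if pod.HostIP != "" {
			continue // nothing left for this finalizer to do
		}
		pod.HostIP = pickHost(pod) // the "finalizing" action
		if err := update(pod); err != nil {
			fmt.Println("update failed, will retry on next event:", err)
		}
	}
}

func main() {
	events := make(chan Pod, 1)
	events <- Pod{Name: "example", ResourceVersion: "1"}
	close(events)
	runFinalizer(events,
		func(p Pod) error { fmt.Printf("persisted %s with HostIP %s\n", p.Name, p.HostIP); return nil },
		func(Pod) string { return "10.0.0.5" })
}
```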

Pros/cons of each hook location

  • kubectl, other clients
    • Hooks won't be mandatory; users could just use a different client.
    • Updating all clients to change behavior is very difficult.
    • Can interact with the user's local directory, templates, etc.
  • proxy
    • Can be made mandatory.
    • May be a pain to set up?
  • apiserver
    • Can prevent an object from ever being persisted to storage.
    • Currently the only place where an atomic read-then-write is possible.
    • Adding an action requires modifying the apiserver binary.
    • Adding an action requires checking changes into the kubernetes project or maintaining a branch.
  • post-apiserver (finalizer)
    • Easy for users to customize by maintaining their own component, in a separate repository if necessary.
    • Can't "reject" an object; hard for the user to understand if the POST succeeds but the pod never runs.

Use cases for hooked actions

  1. Set default resource limits on pods
  2. Reject pods with resource limits above or below sane levels. Admission control plugins: LimitRanger and ResourceQuota #3057
  3. Limit the aggregate amount of resources used by a tenant, or by objects matching some selector. Admission control plugins: LimitRanger and ResourceQuota #3057
  4. Reject pods whose resource limits are difficult shapes for the system to schedule (e.g. lots of RAM and very little CPU, or vice versa)
  5. Prevent creation of too many objects of any kind, either because it is obviously user error, or because it will hurt the system. Admission control plugins: LimitRanger and ResourceQuota #3057
  6. Schedule pods to nodes (the existing scheduler)
  7. Set up network routes for pods (@pmorie is working on something like this for OpenShift, I think)
  8. Custom allocator for IP addresses, VLANs, etc. (Proposal: decouple networking for segmentation and other use cases #3350)
  9. Something that distributes secrets to nodes via some side channel for the pods to use
  10. Pod limit auto-adjuster

Which hooks to use for what

Suggested guidelines for what hooks to use for what types of actions

  • Prefer the apiserver if you need to prevent some object from ever "executing".
    • Prefer to say no as early as possible, for debugging.
    • e.g. resource quotas in apiserver
  • Put it in the apiserver if the act of persisting the object could itself be harmful.
    • Protect apiserver storage space.
    • e.g. object quotas in apiserver
  • Prefer a finalizer (outside apiserver) otherwise.
    • No need to recompile the apiserver or commit to GitHub.
    • Separation of responsibilities.

Next steps

  • debate the above proposal

On finalizers from #3586

Read #3585 too.

We should have a general framework for a pod or other object to be POST'ed in an incomplete state, and persisted to etcd, and then subsequently to be handled by a series of "finalizers" that fill in missing fields. Once all are filled in, the object can be picked up by a kubelet and run.
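As a sketch of what "incomplete" could mean mechanically, the kubelet would gate on a readiness predicate over the finalizer-owned fields; the field list below is illustrative only:

```go
package main

import "fmt"

// Minimal stand-in for api.Pod; the fields, and the convention that a
// zero value means "not yet defaulted", are illustrative assumptions.
type Pod struct {
	HostIP   string
	PodIP    string
	CPULimit int64 // millicores; 0 means no finalizer has defaulted it yet
}

// readyToRun is a hypothetical gate: a kubelet would only start a pod
// once every finalizer-owned field has been filled in.
func readyToRun(p Pod) bool {
	return p.HostIP != "" && p.PodIP != "" && p.CPULimit > 0
}

func main() {
	fmt.Println(readyToRun(Pod{HostIP: "node-1", PodIP: "10.1.2.3", CPULimit: 100})) // true
	fmt.Println(readyToRun(Pod{HostIP: "node-1"}))                                   // false: awaiting IP allocator and defaulter
}
```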

Use cases

Use cases for filling in fields in pods after they are stored.

  • Set default resource limits on pods
  • Schedule pods to nodes (the existing scheduler sets the HostIP field)
  • Custom allocator for PodIP addresses (and setting up VLANs, etc.; Proposal: decouple networking for segmentation and other use cases #3350)
  • Pod limit auto-adjuster: sets unspecified CPU and memory limits
  • Pod template: a permanently underspecified pod could be a template for a replication controller

Availability

Availability of the cluster is limited by the finalizers, so they need to be replicated. Fortunately, because they typically act on one object at a time and can use resource versioning, it should be easy to parallelize them.
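A toy illustration of why resource versioning makes replication safe, assuming the apiserver rejects writes made against a stale ResourceVersion; the in-memory `store` below stands in for the apiserver:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// store mimics the apiserver's compare-and-swap write: an update is
// accepted only if the caller read the latest version.
type store struct {
	mu      sync.Mutex
	version int
	hostIP  string
}

var errConflict = errors.New("resource version conflict")

func (s *store) update(readVersion int, hostIP string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if readVersion != s.version {
		return errConflict // another replica got there first
	}
	s.version++
	s.hostIP = hostIP
	return nil
}

func main() {
	s := &store{}
	// Two finalizer replicas race to act on the same object at version 0.
	fmt.Println(s.update(0, "10.0.0.1")) // <nil>: first write wins
	fmt.Println(s.update(0, "10.0.0.2")) // conflict: the loser re-reads and sees the work is done
}
```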

Bootstrapping

There has to be some way to get pods onto minions without waiting for finalizers to act on them, when turning up a cluster, or upgrading a finalizer. The scheduler is a special case of this.

Some options:

  • Allow a privileged user to write Pods to the apiserver with all fields finalized, including HostIP, so that kubelets pick them up immediately
  • Talk directly to a particular kubelet and make it start a pod. We should make kubelets accept pods to run in "api.Pod" format.
  • Do rolling updates of finalizers wherever possible, so that there is always at least one good one around to help out.

State and sequence

  • What phase/condition/reason pods are in as they work their way through finalizers
  • Sequencing and composing multiple finalizers; how a finalizer knows when it is its turn to act
@erictune
Member Author

We discussed this today in the meeting @derekwaynecarr @smarterclayton

We discussed this in our office yesterday @bgrant0607

@thockin
Member

thockin commented Jan 16, 2015

Do default-value expansions count as one of these? They want to (but don't yet) operate on versioned API structs before conversion to the internal API. Or maybe that's just so fundamental it is considered part of the API versioning scaffolding.

Why must apiserver hooks be built in? Could we not support exec hooks that were installed a posteriori?

@erictune
Member Author

@thockin
can you phrase those questions in the form of answers?

@derekwaynecarr
Member

LGTM, maybe this can become a design doc for future reference?

@thockin
Member

thockin commented Jan 16, 2015

I am on cell, so brief. I thought about defaulting, but I think it is too fundamental. That said, if there was another use for a hook-point between receiving a versioned struct and converting to internal, this might use the same infra.

I think exec hooks are reasonable, with the constraint that exec is much slower than a function call. That may be totally acceptable in many cases.

I think ~all extension points should have an exec side-load.

@davidopp
Member

I basically like this proposal, just a few comments/questions

  • I didn't catch whether the admission controller is a finalizer or not. If so, what prevents scheduler from scheduling before admission controller has acted? (IIRC it's not a finalizer, but wanted to make sure.)
  • I'm not sure there are any good examples of the first item in your bullet list (Prefer in apiserver if need to prevent some object from ever "executing"). It seems that everything can be structured as a finalizer (modulo your second bullet, where you want to put a quota on object creation), as long as you have a way to prevent an object from "executing" until it has been finalized. I would suggest instead that this bullet point should be about wanting to be able to synchronously return success/failure to the user, which IIUC is possible if you put the logic in the api server request handling path but not if you put it in a finalizer (please correct me if I'm wrong).
  • Related to the previous point, there can be some subtlety about sequencing finalizers. Assume all finalizers operate by taking a PodSpec and filling in some field that was blank. How does a finalizer know the PodSpec is ready for it to process, especially when the set of finalizers that you will have in the system isn't known at the time you write the finalizer? In some cases a finalizer is completely orthogonal to all other possible finalizers, which is fine. But sometimes it is not. For example, how does the scheduler know when a pod is ready to schedule? You could say it waits until there is enough information for it to schedule (e.g. resources have been filled in), but there may be unrelated fields that need to be filled in by some finalizer that is installed in some systems but not in others. Related to this is how you want the system to function if one or more finalizer processes are down -- in some cases you want to block executing a Pod (or block execution of the next finalizer in some sequence), but in other cases the finalizer is optional and you want the system to gracefully degrade (e.g. time out and continue finalizing and then eventually scheduling).

@derekwaynecarr
Member

@davidopp - Admission control is not a finalizer for the reason you stated - it provides immediate success/failure response to the end-user.

@erictune
Member Author

@davidopp
Regarding the third question, I'm guessing you have some experience in this area and maybe want to suggest solutions, perhaps in issue #3586 which is about the finer points of finalizers.

@bgrant0607
Member

Another option: webhooks

@bgrant0607
Member

See also: #1502 (comment)

@davidopp
Member

Another option: webhooks

Meaning, the API server makes some call out to an HTTP endpoint? This could work, but I believe it requires putting the logic about handoff, graceful degradation, etc. inside the API server rather than the finalizer, which may not be good for modularity. The API server would need to know the identity of all of the finalizers. Of course that could be configurable (though you'd need some way to have it re-read the config when it changes), whereas the finalizer approach Eric described allows the API server to not need to know anything about the finalizers.

@bgrant0607 added the kind/design label on Jan 17, 2015
@bgrant0607
Member

We have a lot of experience with the "finalizer" approach. It permits loosely coupled, choreographed interaction between components. Main cons are understandability when things go wrong, as @erictune mentioned, operational complexity, auth. complexity (who's allowed to munge what), lack of centralized control (if you want/need that), latency, and bootstrapping complexity.

@bgrant0607
Member

BTW, I don't mean to sound down on finalizers. On balance, they've worked pretty well.

@bgrant0607
Member

I think "Put it in the apiserver if the act of persisting of the object could be harmful" is the only strong justification to not use the finalizer approach (assuming we use it for anything).

We already support async operations. We could refrain from declaring creation operations done until finalization. We've had requests for this from people working on deployment tools, similar to the use cases in #1899. Certainly higher-level deployment, scheduling, and workflow systems will want to be able to respond to conditions such as pods not scheduling.

Also, I'd like a Suspend field in every object, so we could create inert objects. One use would be pod templates (#170), but it could also be used to await finalization. Determining when an object has been finalized might require a centralized decision, but I really want to avoid building in complex dependency logic, since experience has taught me that there's no end to the level of sophistication users will need. A small generalization, a dependency count, might be fairly useful, though. The object would be enabled when the count reached 0. (This has the side benefit of enabling objects by default.)
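A toy sketch of the dependency-count generalization; the `PendingCount` field and its semantics are hypothetical, just restating the suggestion above:

```go
package main

import "fmt"

// Hypothetical generalization of a Suspend field: an object carries a
// count of outstanding finalizers and is inert until it reaches zero.
type Object struct {
	Name         string
	PendingCount int // set at creation to the number of expected finalizers
}

// done is called by a finalizer once it has filled in its fields.
func (o *Object) done() {
	if o.PendingCount > 0 {
		o.PendingCount--
	}
}

// enabled: zero outstanding dependencies. Objects that declare no
// finalizers are therefore enabled by default.
func (o *Object) enabled() bool { return o.PendingCount == 0 }

func main() {
	pod := &Object{Name: "web", PendingCount: 2} // awaiting defaulter + IP allocator
	fmt.Println(pod.enabled())                   // false: still inert
	pod.done()
	pod.done()
	fmt.Println(pod.enabled()) // true: the kubelet may now act
}
```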

@bgrant0607
Member

API plugins are related: #991

@davidopp
Member

I think probably the finalizer and call-out approaches end up with the same set of issues. In fact, if you structure the call-outs as API server plug-ins, then there's probably not much difference between the two approaches at all. It's just a question of whether the logic for how these pieces of logic interact (e.g. for sequencing) is running in one process (the api server) or is distributed across multiple processes (finalizers). In Google's cluster management system we use a mixture of the two approaches and I'm not sure I'd say one is obviously better than the other.

@smarterclayton
Contributor

Sounds like we need a good proposal for finalizer (concrete, lays out how the scheduler knows when to run, hashes out what a small cluster solution would look like), and then we should consider doing that post 1.0?

Also, authorization is another operation that could be done via call-out early in admission control for user actions (vs as a proxy). We talked about remoting the authz check for some use cases without requiring a proxy, and at that point it looks a lot more like a call out. Call out actions do not necessarily require auth, since they are a higher power.

And after creation, authorization is the only thing that has control over what mutations are made post-creation, so authorization checks have to be per field (and distinguish between mutating a field and setting it). I.e., my post-creation watcher might set volumes but not networking, but if I shouldn't be able to edit networking, the authz check has to discriminate between "field changed" and "field unchanged". I don't see an easy way to do that unless you have the existing object loaded, which means a proxy won't have the necessary info available.
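A minimal sketch of that field-level check, assuming the handler has the stored object loaded; the types and the `mayEditNetworking` flag are hypothetical:

```go
package main

import "fmt"

// Pared-down stand-in for a pod spec with one "networking" field.
type PodSpec struct {
	Volumes []string
	PodIP   string
}

// authorizeUpdate compares the stored object against the incoming one
// field by field, so echoing a field back unchanged never requires
// permission. A proxy that sees only the request body cannot do this.
func authorizeUpdate(old, updated PodSpec, mayEditNetworking bool) error {
	if updated.PodIP != old.PodIP && !mayEditNetworking {
		return fmt.Errorf("not authorized to mutate networking fields")
	}
	return nil
}

func main() {
	stored := PodSpec{PodIP: "10.1.1.1"}
	// A volume-managing watcher changes Volumes but leaves PodIP
	// untouched, so its update passes without networking rights.
	update := stored
	update.Volumes = append(update.Volumes, "data")
	fmt.Println(authorizeUpdate(stored, update, false)) // <nil>
	update.PodIP = "10.2.2.2"
	fmt.Println(authorizeUpdate(stored, update, false)) // error: field changed
}
```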

@erictune
Member Author

@brendandburns suggested using a "finalizer" for allocating IP addresses in issue #3435

@erictune
Member Author

Finalizers need to be replicated for availability. When one fails, or is overloaded, the others need to pick up the work it was doing without leaving anything forgotten.

In some cases, there may be multiple finalizers that could work on an object. We could impose a specific ordering, but if one is not necessary, we could allow them the freedom to work in an arbitrary order. They need to not waste time doing work, though, only to find that the object has changed underneath them.

One pattern that would address both problems is if finalizers could take out "leases" on objects, for say a few seconds, while they work on them. Apiserver would enforce leases.
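A toy sketch of what apiserver-enforced leases might look like; the lease fields and `tryLease` semantics are assumptions, not an existing API:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical lease fields the apiserver would enforce: a write that
// claims the lease succeeds only if no unexpired lease is held.
type Object struct {
	LeaseHolder string
	LeaseExpiry time.Time
}

// tryLease claims the object for ttl; it fails while another holder's
// lease is still live, so replicas don't duplicate each other's work.
func tryLease(o *Object, holder string, ttl time.Duration, now time.Time) bool {
	if o.LeaseHolder != "" && now.Before(o.LeaseExpiry) {
		return false
	}
	o.LeaseHolder = holder
	o.LeaseExpiry = now.Add(ttl)
	return true
}

func main() {
	obj := &Object{}
	now := time.Now()
	fmt.Println(tryLease(obj, "finalizer-a", 3*time.Second, now)) // true: claimed
	fmt.Println(tryLease(obj, "finalizer-b", 3*time.Second, now)) // false: a holds the lease
	fmt.Println(tryLease(obj, "finalizer-b", 3*time.Second,
		now.Add(5*time.Second))) // true: a's lease expired
}
```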

Needs more thought.

@jainvipin

Thanks @erictune for putting this proposal up for review

+1 for the use cases discussed here, especially admissions control and networking related (custom ip allocation, setting up vlans/bridged-networks, etc.)

For my clarification on the implementation of these hooks, are they expected to be:

  • Implemented via well-defined extensions/plugins (i.e. foreign code that runs as part of the aforementioned APIs), or as a separate binary listening to various events?
  • Synchronous or asynchronous? Synchronous hooks can ensure a well-defined, ordered flow of state transitions, whereas asynchronous hooks may perform better but without that guarantee, for example when a certain network policy must be instantiated before a Pod is brought up.

@erictune
Member Author

@jainvipin

As @smarterclayton said, we still need to have a more concrete proposal for that, and work on that should wait until after we ship 1.0.

But, if you want to hack something up before then as a proof of concept, with the understanding that we might change our minds, that would be cool. I suggest that you do the following:

  • Make it a separate binary.
  • Be synchronous (details follow).
  • Watch for new pods that have been created and had their HostIP assigned, e.g. using the Reflector and Store classes from pkg/client/cache, the same way that pkg/kubelet/config/api.go does.
  • Change the apiserver to not set PodIP, since your tool will handle it.
  • When you see a pod which you haven't handled before, create whatever network artifacts it needs, and then call the apiserver to update the pod: set PodIP to what you allocated, and record any metadata you need to remember in the pod's Annotations (see the sketch after this list).
  • Teach the kubelet to not start the pod until it has both a HostIP and a PodIP.
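A sketch of that flow with the Reflector/Store plumbing elided; `podEvents`, `allocateIP`, `updatePod`, and the annotation key are hypothetical stand-ins:

```go
package main

import "fmt"

// Pared-down stand-in for api.Pod with just the fields the PoC touches.
type Pod struct {
	Name        string
	HostIP      string // set by the scheduler
	PodIP       string // set by this tool
	Annotations map[string]string
}

// allocatorLoop acts on pods the scheduler has placed (HostIP set) but
// that this tool hasn't handled yet (PodIP empty), then writes them back.
func allocatorLoop(podEvents <-chan Pod, allocateIP func() string, updatePod func(Pod) error) {
	for pod := range podEvents {
		if pod.HostIP == "" || pod.PodIP != "" {
			continue // not scheduled yet, or already handled
		}
		pod.PodIP = allocateIP()
		if pod.Annotations == nil {
			pod.Annotations = map[string]string{}
		}
		pod.Annotations["example.com/ip-allocator"] = "v1" // bookkeeping to remember later
		if err := updatePod(pod); err != nil {
			continue // conflict or transient error; the watch will redeliver
		}
	}
}

func main() {
	events := make(chan Pod, 1)
	events <- Pod{Name: "web", HostIP: "10.0.0.5"}
	close(events)
	allocatorLoop(events,
		func() string { return "172.16.0.2" },
		func(p Pod) error { fmt.Printf("%s: PodIP=%s\n", p.Name, p.PodIP); return nil })
}
```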

@alex-mohr
Contributor

FWIW from my experience: it's worth erring on the side of centralized control and only making things loosely coupled when you actually have something that's totally async and orthogonal to the rest of the system.

Building these systems is complex, and making them loosely coupled adds to that complexity. The points raised by a few people above are worth reiterating: it's a PITA to figure out what's happening when things don't work as you expect. I suspect that mostly comes from the number of possible execution orders -- it's not A then B then C, it's some arbitrary order -- so convincing yourself that the code works as intended is hard. You have to make sure not only that an execution order is correct, but that all possible ones are.

@jainvipin

@alex-mohr - your concern is well founded. Trying to keep it too generic/futuristic is neither needed nor worth it. That said, I'd not confuse that with taking complexity out of the core function - I believe simplifying the core so it does not get muddled with unwanted details, and keeping it stable against forever-changing requirements, is a very good goal. Let's take an example: having the following two or three hooks for networking would suffice for the current implementation and requirements:

  • Create Network
  • Attach/Detach a Pod to a Network
  • Externally exposed endpoints/ports

While keeping the default simplicity of the network model as is, this can allow somewhat advanced/different use cases (#3550) to be implemented in plugins, with the help of hooks, such that the core is not affected. To me, this keeps the design stable and simple for the longer run.

I hope you'd agree that the networking use cases I brought up are worth addressing in order to increase the adoption of Kubernetes.

@derekwaynecarr
Member

@bgrant0607 - I have not had time to think about how synchronous delegation would work yet, and may not until we look at 1.2.

@bgrant0607
Member

@derekwaynecarr We're definitely not going to look at this until 1.2. Our current leaning is to do as much as possible asynchronously, though.

@derekwaynecarr
Member

@bgrant0607 - I was asked about a use case today to see if it would be possible for a third party to reject pods whose docker image declared volumes but whose pod spec had no associated volume declaration. This seemed like a possible use case for initializers that needed to perform some image analysis prior to allowing a pod to become active in the system. Doing that synchronously today in the current admission controller space was heavy, but it seems like an area where asynchronous initializers would provide future value. Recording here just to denote the type of use case that could be supported. In that scenario, the general proposal outlined in an earlier comment would have worked.

That said, I continue to get asked for a sequential set of URL hooks that can be registered as call-outs to do something as part of normal admission control.

I think a good goal for 1.2 is to make the initial-resources autosizing work asynchronous, as a proving ground for the pattern.

@fabiand
Contributor

fabiand commented Dec 12, 2016

On the dependencies/ordering side, I wonder if the semantics of systemd's unit files would also make sense in this context. Unit files can express pure dependencies (which don't enforce an ordering) and 'liberal' ordering (which only enforces ordering if the dependency is present). Combining both yields hard dependencies with ordering. IMHO this is a nice framework which is pretty flexible.

However, the ownership issue remains: preventing multiple finalizers/initializers/hooks from fighting over the same resource or field path. Thus I like @bgrant0607's idea of per-field-path ownership. The question is how and where this registration process could happen.

My main interest in this topic are primarily admission controllers.

@derekmahar
Contributor

I think the systemd dependency model in unit files is both simple and effective, in particular the useful distinction between strict Requires and the more liberal Wants.

@smarterclayton
Contributor

This is effectively part of 1.7; future work there will improve it. Closing as part of 1.7 (see kubernetes/community#132).
