
Initial Daemon controller proposal #10210

Closed · wants to merge 1 commit

Conversation

@rjnagal (Contributor) commented Jun 23, 2015

Transcribing Daemon controller proposal from @AnanyaKumar

@rjnagal (Contributor, Author) commented Jun 23, 2015

@bgrant0607 @davidopp @AnanyaKumar @vmarmol @vishh @erictune

Saving the daemon controller discussion as a proposal here.

@k8s-bot commented Jun 23, 2015

GCE e2e build/test failed for commit 4395a996c248282f2d7ebc7cf42e5ae1f1df4dca.

@k8s-bot commented Jun 23, 2015

GCE e2e build/test passed for commit 36a97c071860a90caa4f30ef497f447788309074.

@AnanyaKumar (Contributor)

Thanks a lot @rjnagal !

@j3ffml added the kind/design label Jun 23, 2015
@mbforbes (Contributor)

@rjnagal can you pick a reviewer, please?

@mbforbes (Contributor)

@bgrant0607 for the interim

@bgrant0607 (Member)

Ref #1518

Logging: Some users want a way to collect statistics and logs about nodes in a cluster and send them to an external database. For example, system administrators might want to know if their machines are performing as expected, if they need to add more machines to the cluster, or if they should switch cloud providers. The daemon controller can be used to run a data collection service (for example, fluentd) that sends the data to a service like Elasticsearch for analysis.

### Cluster-Level Applications
Datastore: Users might want to implement a sharded datastore in their cluster. A few nodes in the cluster, labeled ‘datastore’, might be responsible for storing data shards, and pods running on these nodes might serve data. This architecture requires a way to bind pods to specific nodes, so it cannot be achieved using a Replication Controller. A daemon controller is a convenient way to implement such a datastore. At Google, D is an example of such an application.
Review comment (Member):

I don't think we've published D, but the GFS chunkserver was described in the GFS paper:
http://static.googleusercontent.com/media/research.google.com/en/us/archive/gfs-sosp2003.pdf


1. Include the daemon in the machine image
2. Use config files to launch daemons
3. Use static pod manifests to launch daemon pods when the node initializes
Review comment (Member):

Maybe some of the text from #10093 would be useful in describing this alternative.
cc @jsafrane

- Listens for the addition of new nodes to the cluster by setting up a framework.NewInformer that watches for the creation of Node API objects. When a new node is added, the daemon manager loops through each daemon controller; if the node's labels match the daemon controller's selector, the daemon manager creates the corresponding daemon pod on the new node.
- The daemon manager creates a pod on a node by sending a request to the API server, asking for the pod to be bound to that node (the node is specified via its hostname).
#### Kubelet
- Does not need to be modified, but it health-checks the daemon pods and revives them if they are killed (we will set the pod restartPolicy to Always). We reject Daemon Controller objects with pod templates that don’t have restartPolicy set to Always.
Review comment (Member):

Validation of restartPolicy belongs under Apiserver.

Reply (Contributor):

Yup, I realized when I implemented the Apiserver part of the daemon controller :) Thanks!
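
For illustration, here is a minimal Go sketch of the kind of apiserver-side check being discussed, using simplified stand-in types rather than the real api package (the type and field names are assumptions for this sketch, not the actual Kubernetes API):

```go
package validation

import "fmt"

// Simplified stand-in types: the real DaemonController API object would live
// in pkg/api/v1/types.go and reuse the existing pod template types.
type RestartPolicy string

const RestartPolicyAlways RestartPolicy = "Always"

type PodTemplateSpec struct {
	RestartPolicy RestartPolicy
}

type DaemonControllerSpec struct {
	Selector map[string]string // node label selector
	Template PodTemplateSpec   // pod to run on every matching node
}

// validateDaemonControllerSpec rejects pod templates whose restartPolicy is
// not Always, since the kubelet is relied on to keep daemon pods running.
func validateDaemonControllerSpec(spec *DaemonControllerSpec) error {
	if spec.Template.RestartPolicy != RestartPolicyAlways {
		return fmt.Errorf("daemon controller pod template must set restartPolicy to Always, got %q",
			spec.Template.RestartPolicy)
	}
	return nil
}
```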

@bgrant0607 (Member)

cc @bprashanth @gmarek

@ravigadde (Contributor)

@bgrant0607

Thanks. My only remaining question is whether showing aggregated status makes sense for higher-level objects like the Replication Controller and Daemon Controller. Yes, pod status can be looked at for all of the pods controlled by one of these controllers, but if the controller's status encoded a union of all its pods' statuses, it would be helpful to the end user, IMO.

@bgrant0607 (Member)

@ravigadde Aggregated status: yes. I agree it would be helpful.

@eparis (Contributor) commented Jun 27, 2015

I know I'm talking way out of turn here... And I do NOT want to derail real good forward progress work. This is a real proposal for something I really want.

But I really like the start of the scheduler thoughts that @timothysc had. They were some relatively simple (to conceptualize) ideas for changing the normal replication controller. His proposal can likely solve this too.

Then again, his proposal is just starting to work through concepts and this one is getting to the coding point, so I'd rather take a simple thing like this now and get a more flexible solution over there to solve things where the 'daemon controller' isn't simple...

His proposal had at least two of the three key concepts required. (Tim, what's that PR?)

  1. Better label selection: !, ||, Exists, (). Right now we only have && and ==. It might be nice here too: the ability to say "all nodes with 3com NICs but not nodes with the 3c509b NIC". Yes, I can create another set of labels on the nodes, so maybe this isn't a P0 to solve his problem.
  2. SPREAD_BY(label) and GROUP_BY(label_selector) scheduling concepts. After the selector has decided the set of nodes in question, the _BY hints would cause the scheduler to either co-locate pods where the label value is the same, or make sure no two pods land on the same node where the label is the same. (The exact syntax of these is quite vague in my mind.)
  3. A new rc 'replicas' option, some form of $NUM_NODES(label_selector), so the system dynamically picks the number of replicas, much like you propose here.

In Tim's thoughts your proposal would look like an rc with

replicas = NUM_NODES(EXISTS("kubernetes.io/hostname"))
SPREAD_BY("kubernetes.io/hostname")

To run a pod adding iSCSI offload to all of the 3com NICs (but not the trusty old 3c509b) you'd have something like:

replicas = NUM_NODES(nic_man == 3com && nic != 3c509b)
SPREAD_BY("kubernetes.io/hostname")

I think these are the exact problems you are trying to solve here? SPREAD_BY("kubernetes.io/hostname") seems to be the same thing in his thoughts.

But let's not derail here. I want this functionality badly. Let's just not try to overengineer; something simple and to the point, which we can re-use when we have more flexible scheduling.

@smarterclayton (Contributor)

One reason to prefer a separate object (even though we can represent a wide variety of scenarios with a more powerful policy) is that it makes it easier for users and tools to reason about the use cases for choosing one or the other. The use cases for daemon controllers almost all overlap with some characteristic of the underlying host: file system, device, network, or location. Those use cases require elevated privilege, and in some cases also undermine the assumption that topology and placement are details users do not need to care about. By contrast, replication controllers represent a stronger need for a particular number of an item (for performance, or reliability). A daemon controller is satisfied by covering a set; a replication controller that is oversubscribed to cluster resources may indicate the cluster should scale up.

It's not that we couldn't represent the spectrum of choice between the two positions ("I need to guarantee that all nodes that look like this can run this process" through "I need a certain number of these processes that may need these nodes") with a single object, but that we perhaps shouldn't, because those distinctions are important for users and tools to understand and choose between.

@davidopp (Member)

(My responses below are the same whether we implement this as a separate object or as an RC. But I think the arguments for putting the DC functionality inside the node controller are compelling, and justify implementing this as a separate object.)

@eparis With respect to your point (1), it seems that if we expand the label selector language over time, the feature as described in the proposal will eventually work the way you want. And I think we surely will want to make the label selector language more sophisticated in time (I'm sure I've seen an issue for this...)

Your point (2) is a different beast than (1), along two axes: it is a "soft constraint" rather than a hard one (try to do this rather than you must do this) and it describes a relationship among multiple pods rather than something that can be evaluated in isolation. The first part of that is easy, the second part is not as easy. It seems like these SPREAD_BY and GROUP_BY rules have to be defined at a cluster level, not a controller level. Otherwise what happens if you have one controller that says to spread pods by label X and another controller that says to group pods by label X?

I agree on point (3). At some point in the past, I thought this proposal allowed you to specify a % of machines (that match the selector) to deploy onto. Maybe that's not required for the first version but I think it's a useful feature.
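
Concretely, such a percentage knob would just turn the count of matching nodes into a replica count; a toy Go sketch under that assumption (the percent parameter is hypothetical, it is not in the proposal):

```go
package daemonmanager

import "math"

// desiredDaemonPods shows how a hypothetical "percent of matching nodes"
// field could map to a pod count: e.g. 50% of 7 matching nodes -> 4 pods.
func desiredDaemonPods(matchingNodes int, percent float64) int {
	return int(math.Ceil(float64(matchingNodes) * percent / 100.0))
}
```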

@eparis (Contributor) commented Jun 28, 2015

@davidopp (2) I was not thinking of something 'cluster' level; definitely controller level. I could say 'create 15 replicas of my DB and spread them by the "row=" label', which is very, very similar to this daemon controller.

Although, as I think about it more, a group_by concept (although not the one I was intending to outline) does seem like it could be more interesting at the cluster level instead of the controller level, as you say. I was thinking GROUP_BY(label=rack), so all of the pods would land on a single rack for low network latency. But I can totally see how I may want to group my db pods and my web server pods against each other. That isn't what I was originally thinking, though...

@davidopp (Member)

I didn't say I was opposed to preemption, only to numerical priorities. Daemons don't make much sense, IMO, unless the placement can be guaranteed, so they should be able to preempt anything else. We might want to perform such preemptions gracefully (e.g., in a rolling fashion), however.

There are three cases:
(1) new nodes
(2) add a new daemon pod, or increase resource limit of existing daemon pod
(3) daemon pod was deleted (perhaps accidentally) and you want it back

Integrating with node controller solves (1). For (2) and (3) we could say this is a rare event and you have to drain the machine, remove it, and re-add it, which makes it a case of (1).

I'm not saying we shouldn't have preemptions, I'm just saying I don't think they're strictly required for daemon controller even if we want guaranteed scheduling.

My view on priorities (since it came up and I can never resist offering my opinion on things) is that it should be a layered concept. The lowest level should be based on numerical values but the knob we offer users could be higher-level (like different qualitative categories or whatever). The "everything is part of the API" concept makes the distinction between internal and external a bit muddled, though.

- API object for Daemon Controller will be created in api/v1/types.go and api/v1/register.go
#### DaemonManager
- Creates new daemon controllers when requested. Launches the corresponding daemon pod on all nodes with labels matching the new daemon controller’s selector.
- Listens for the addition of new nodes to the cluster by setting up a framework.NewInformer that watches for the creation of Node API objects. When a new node is added, the daemon manager loops through each daemon controller; if the node's labels match the daemon controller's selector, the daemon manager creates the corresponding daemon pod on the new node.
Review comment (Contributor):

Hmm, I feel like you want another point here. How does the DCManager notice daemon death, and what pod states does it ignore? What does api.PodSuccess even mean for a daemon? Are we just going to rely on the kubelet restarting the daemon forever?

Review comment (Contributor):

+1 to @bprashanth's point. You wrote somewhere above that the DC won't monitor the health of the daemon, but it'd be good to mention that the NC won't be evicting daemons when a node dies. We'll also need to add some garbage collection for such forever-hanging daemons.
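
To make the DaemonManager behavior quoted above concrete, here is a minimal Go sketch of the node-added handling, using simplified stand-in types instead of framework.NewInformer and the real API objects (matchesSelector, onNodeAdded, and createPod are hypothetical helpers, not code from the proposal):

```go
package daemonmanager

// Simplified stand-in types: the real manager would receive Node objects from
// a framework.NewInformer watch and create pods through the API server client.
type Node struct {
	Name   string
	Labels map[string]string
}

type DaemonController struct {
	Name     string
	Selector map[string]string // node label selector from the daemon controller spec
}

// matchesSelector reports whether every key/value pair in the selector is
// present in the node's labels (only equality-based selection is assumed here).
func matchesSelector(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// onNodeAdded is a sketch of the informer's AddFunc: loop over every daemon
// controller and create its pod, bound to the new node by hostname, whenever
// the node's labels satisfy the controller's selector.
func onNodeAdded(node Node, controllers []DaemonController, createPod func(dc DaemonController, nodeName string)) {
	for _, dc := range controllers {
		if matchesSelector(dc.Selector, node.Labels) {
			// The real implementation would ask the API server to create a
			// pod with the node name pre-set, bypassing the scheduler.
			createPod(dc, node.Name)
		}
	}
}
```

Deletion of daemon controllers and node death would need analogous handlers, which is where the health-checking and garbage-collection questions raised above come in.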

@gmarek (Contributor) commented Jun 29, 2015

I really wanted to read through this proposal, but literally didn't have time today :/ After a first read I had mostly two questions:

  1. How do we handle multiple schedulers? (IIUC the design moved from writing a separate controller to extending the NodeController, so it's probably no longer an issue.)
  2. How do we handle machine drains?

If it's addressed somewhere, please shout at me. I'll make rereading this proposal my P0 for tomorrow.

@bgrant0607 (Member)

@gmarek Multiple schedulers or multiple controllers? What issue are you concerned about?

Machine drains: Again, what specific issue(s) are you concerned about? We don't support machine drains yet.

@davidopp (Member)

He asked me about this last night so I think I know what he is asking.

(1) Multiple schedulers: Say a daemon pod is accidentally deleted from a machine (not crashed and restarted locally -- accidentally deleted in etcd). Now the node controller/scheduler will try to put the daemon pod back at the same time as the regular scheduler may try to schedule a regular pod on the machine. Do we have a mechanism to avoid race conditions between the two schedulers (e.g., two schedulers each see the machine as free, then each sends a binding for a pod that takes up the whole machine -- will it be accepted?)? We have resourceVersion, but that is on the binding -- do we need something like in Omega, where we "touch" the machine when we write a binding so that subsequent bindings that don't take this binding into account will be rejected? (I believe this would require a multi-object etcd transaction.)

(2) Machine drains: Without preemption, the only way to increase the resources of a daemon pod or add a new daemon pod (in a guaranteed-to-work way) is to drain the machine. I think he is just observing that we don't have machine drain yet. But you can simply remove the machine, which will evict everything (and then re-add it), so we do have machine drain, just not a graceful one.
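
As a rough illustration of the optimistic-concurrency idea being raised in (1) (not how the apiserver actually implements bindings), the Go sketch below does a compare-and-swap on a per-node version, so the second of two racing binders is rejected and has to re-read the node:

```go
package bindrace

import (
	"errors"
	"sync"
)

// nodeRecord is a toy stand-in for the node state a scheduler reads before
// binding: a version number plus the remaining unallocated capacity.
type nodeRecord struct {
	resourceVersion int64
	freeMilliCPU    int64
}

var (
	mu    sync.Mutex
	nodes = map[string]*nodeRecord{}
)

var errConflict = errors.New("node changed since it was read; re-read and retry")

// bind reserves capacity on a node only if the node still has the
// resourceVersion the caller observed. Two schedulers that both saw the node
// as free will race; the loser gets errConflict instead of overcommitting.
func bind(nodeName string, observedVersion, requestMilliCPU int64) error {
	mu.Lock()
	defer mu.Unlock()

	n, ok := nodes[nodeName]
	if !ok {
		return errors.New("no such node")
	}
	if n.resourceVersion != observedVersion {
		return errConflict // the stale binding is rejected, Omega-style
	}
	if n.freeMilliCPU < requestMilliCPU {
		return errors.New("insufficient capacity")
	}
	n.freeMilliCPU -= requestMilliCPU
	n.resourceVersion++ // every successful binding "touches" the node
	return nil
}
```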

@gmarek (Contributor) commented Jun 29, 2015

Yes, thanks David.

  1. is exactly as David wrote. I meant DaemonController of course.

  2. I just recall some discussion on horizontal node autoscaling, where someone assumed that nodes that can be removed are completely empty. I have no idea what the status of that is, but I'm sure something like it was discussed. The question is how/whether we want to support eviction of daemons from nodes, as I don't recall seeing it in the doc (but I might have missed it). Of course, if we don't support graceful drains, then it's OK to ignore this.

@bgrant0607 added this to the v1.0 milestone Jun 30, 2015
@bgrant0607 (Member)

@rjnagal Please just ensure the proposal accurately reflects what we're doing, and then we can merge the proposal. It can cite this PR and the relevant issues (#1518, #3058) for people to read the full discussion.

I'll respond to the comments.

@bgrant0607 modified the milestones: v1.0-post, v1.0 Jul 6, 2015
@timothysc (Member)

Sorry for the lag, I'm still catching up from vacation.

@eparis I don't believe the two ideas are mutually exclusive, and it's entirely possible one could be built atop the other.

However, I do believe this proposal jumps the shark (so to speak) before preemption and priority schemas are created.

+1 to @davidopp's layered approach to priority.

@bgrant0607 removed this from the v1.0-post milestone Jul 24, 2015
@bgrant0607 added the area/api label Jul 30, 2015
@k8s-bot commented Jul 31, 2015

GCE e2e build/test failed for commit e495dad.

@gmarek (Contributor) commented Aug 24, 2015

@bgrant0607 - what are the plans for this PR? Will we merge it or drop it?

@AnanyaKumar (Contributor)

@gmarek Let's close this, I'll send a PR to add a slightly updated doc to docs/design soon.

@gmarek (Contributor) commented Aug 24, 2015

Thanks, closing.

@timothysc (Member)

@AnanyaKumar where is this at now?

@AnanyaKumar (Contributor)

@timothysc it's been moved to #13368!
