Docker Clustering: Design proposal. #8859

Closed
wants to merge 1 commit into base: master

Conversation

@aluzzardi
Contributor

aluzzardi commented Oct 30, 2014

Authors: @aluzzardi and @vieux.

The goal and scope of this change is to allow docker to manage a cluster
of docker hosts.

The target audience for clustering is modeled after Docker itself:
developers and devops first, then enterprise later. Enterprise requires
additional features such as authentication, ACLs, auditing and tooling,
which will come at a later time.

The system is designed to handle workloads in the same way as Docker: it
can run both long-running and one-off tasks. Batch processing can be
built on top of the API using the one-off primitive.

The architecture is based on an evented model where the master queries
the registered slaves within the system and checks the current state of
the slaves against the requested state by the user. It reconciles any
differences and updates the cluster state with the required changes.

Signed-off-by: Andrea Luzzardi aluzzardi@gmail.com
Signed-off-by: Victor Vieux vieux@docker.com

Docker Clustering: Design proposal.
Authors: @aluzzardi and @vieux.

The goal and scope of this change is to allow docker to manage a cluster
of docker hosts.

It is designed to scale to ~100 machines in a master/slave architecture.
When you have more than this number of machines, you have different
requirements and are probably implementing your own infrastructure-based
software to deal with your specific challenges.

The target audience for clustering is modeled after Docker itself:
developers and devops first, then enterprise later. Enterprise requires
additional features such as authentication, ACLs, auditing and tooling,
which will come at a later time.

The system is designed to handle workloads in the same way as Docker: it
can run both long-running and one-off tasks. Batch processing can be
built on top of the API using the one-off primitive.

The architecture is based on an evented model where the master queries
the registered slaves within the system and checks the current state of
the slaves against the requested state by the user.  It reconciles any
differences and updates the cluster state with the required changes.

Signed-off-by: Andrea Luzzardi <aluzzardi@gmail.com>
Signed-off-by: Victor Vieux <vieux@docker.com>
@LK4D4


Contributor

LK4D4 commented Oct 30, 2014

Some technical details should probably be described here, such as how the master <-> slave relationship is organized, how the master learns about new slaves, etc.

$ docker -d --master --discovery https://discovery.hub.docker.com/u/demo-user/cluster-1
Eventually, clustering will provide a built-in leader election algorithm making the `--master` flag obsolete.
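Taken together with the node and client commands quoted elsewhere in this proposal, a cluster bring-up would look roughly like this (a sketch assembled from the proposal's own commands; nothing here beyond those commands is part of the proposal):

```
# On the engine that should act as master:
$ docker -d --master --discovery https://discovery.hub.docker.com/u/demo-user/cluster-1

# On every other engine, register against the same discovery URL:
$ docker -d --discovery https://discovery.hub.docker.com/u/demo-user/cluster-1
```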


@jessfraz

jessfraz Oct 30, 2014

Contributor

Would this use a gossip protocol to assign a new master? Probably out of scope for the initial proposal, just wondering.


@gaberger

gaberger Oct 30, 2014

I assume the leader election would be through an implementation of Raft, as mentioned earlier today? I think a gossip protocol like Serf might be useful for failure detection and state distribution, to route important messages around failed nodes or network partitions.

-g



@discordianfish

discordianfish Oct 31, 2014

Contributor

Please don't require a gossip protocol. Given that we have a discovery service, there is also no need: once all nodes are connected to it, it knows about all nodes. I would try to avoid gossip protocols if possible because they're complex to reason about. A single master (or single discovery endpoint) is much simpler.


@BhargavaRamM

BhargavaRamM Apr 3, 2015

What are the available options for fault tolerance in the cluster? (If the master goes down, one of the nodes has to be picked and given its privileges.) I want to use the Docker Swarm orchestration technology, and Docker Swarm doesn't provide any fault tolerance.

Constraints are key/value pairs associated to particular nodes. You can see them as *node tags*.
When creating a container, the user can select a subset of nodes that should be considered
for scheduling by specifying one or more sets of matching key/value pairs.
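As an illustration (a sketch based on the node-startup commands shown later in this proposal; the container-side selection syntax is not spelled out in this excerpt, so the `run` line below is hypothetical):

```
# Tag two nodes differently at startup:
$ docker -d --discovery https://discovery.hub.docker.com/u/demo-user/cluster-1 --constraint storage=ssd
$ docker -d --discovery https://discovery.hub.docker.com/u/demo-user/cluster-1 --constraint storage=disk

# Hypothetical: ask the scheduler to only consider nodes tagged storage=ssd
$ docker run -d -P --constraint storage=ssd redis
```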


@ynachiket

ynachiket Oct 30, 2014

What are your thoughts on providing some sort of hierarchy model for constraints? On a very high level there could be hard constraints and soft constraints (good to have but non blocking).


@discordianfish

discordianfish Oct 31, 2014

Contributor

Sounds like a good idea. For example, you want to schedule your cache close to your application, but if there aren't resources available on the same host/rack you still want to schedule it somewhere.

f8b693db9cd6 redis:2.8 "redis-server" About a minute running 192.168.0.43:49178->6379/tcp node-2 stoic_albattani
Upon rebalancing, the scheduler will look at the shape of your container (resource requirements, constraints...) and search for an available node.
If there is no such node, the container will remain in `pending` state until all conditions are met.
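As a hypothetical illustration of that `pending` state after a node failure (the column layout and the STATUS wording are assumptions, not something specified in this proposal):

```
# node-2 went down and no remaining node can fit the container yet:
$ docker ps
CONTAINER ID IMAGE     COMMAND        STATUS  NODE NAMES
f8b693db9cd6 redis:2.8 "redis-server" pending      stoic_albattani
```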


@ynachiket

ynachiket Oct 30, 2014

The soft limits approach I mentioned in https://github.com/docker/docker/pull/8859/files#r19632723 would also ease this condition a bit. A failover could be handled with fewer hurdles if we allow soft limits (non-blocking).

For instance, let's start `node-1` with the `storage=ssd` tag:
```
$ docker -d --discovery https://discovery.hub.docker.com/u/demo-user/cluster-1 --constraint storage=ssd


@wallnerryan

wallnerryan Oct 30, 2014

Contributor

Do you see that node-specific values could possibly be useful here? I saw the mention of kernel version and operating system in default facts; can these be provided per node, specific to that node, like Facter facts (https://docs.puppetlabs.com/facter/1.6/core_facts.html) can be? E.g. it would be really great to be able to write custom constraints specific to the system rather than key=value at node startup.

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NODE NAMES
f8b693db9cd6 redis:2.8 "redis-server" Up About a minute running 192.168.0.42:49178->6379/tcp node-1 redis
The commands you would use for single host will work in clustering mode.
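For example (a sketch; these are ordinary single-host commands issued against the master, and the output shown is illustrative):

```
$ docker port redis
6379/tcp -> 192.168.0.42:49178
$ docker logs redis
$ docker stop redis
```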


@thaJeztah

thaJeztah Oct 30, 2014

Member

Wondering; is there a way to uniquely address a container in a cluster? Are container-ids and -names guaranteed to be unique in a cluster, or is there another way to address a container?


@thaJeztah

thaJeztah Oct 30, 2014

Member

Just saw a part of the demo and noticed that container names are prefixed with the node name, so that answers my question?


@brendandburns

brendandburns Oct 31, 2014

If I run docker ps locally on one of the machines in my cluster, will I see the same name for the container? Will it include the "node-n" prefix?

```
$ docker run -d -P -m 10g redis
2014/10/29 00:33:20 Error response from daemon: no resources availalble to schedule container


@thaJeztah

thaJeztah Oct 30, 2014

Member

I know it's just a proposal, but this error message is extremely vague; which resource(s) are not available? Memory? CPUs? Port(s)?

Also typo; s/availalble/available/

* Tagging nodes based on their physical location (`region=us-east`, to force containers to run on a given location).
* Logical cluster partitioning (`environment=production`, to split a cluster into sub-clusters with different properties).
To tag a node with a specific set of key/value pairs, one must pass a list of `--constraint` options at node startup time.
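For example (a sketch using the tag names from the bullets above and the demo discovery URL used throughout this proposal):

```
# A node located in us-east that should only run production workloads:
$ docker -d --discovery https://discovery.hub.docker.com/u/demo-user/cluster-1 \
    --constraint region=us-east --constraint environment=production
```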


@thaJeztah

thaJeztah Oct 30, 2014

Member

It would be useful to manage these key/value pairs on a running node, not only at start. These values are not tied to the actual hardware/specs of the node and are just custom properties that can be assigned. For example, if I want to "promote" a node to be my "production" node, I should be able to do so.

Since "discovery" is already handled through hub.docker.com, would it make sense to be able to manage these properties via the website? I.e. offer a control panel to manage key/value pairs (besides being able to manage them through the API).

$ docker -d --discovery https://discovery.hub.docker.com/u/demo-user/cluster-1 --constraint storage=disk
```
Once the nodes are registered with the cluster, the master pulls their respective tags and will take them into account when scheduling new containers.


@thaJeztah

thaJeztah Oct 30, 2014

Member

Is it necessary that the master collects these? Should this be part of the global "discovery" mechanism? e.g. An etcd like storage?


@discordianfish

discordianfish Oct 31, 2014

Contributor

No, the discovery mechanism is only for discovering docker hosts. Considerations around containers and where they run are the job of the master. Whether the master replicates the state to other docker hosts for failover etc. is still up for discussion, but likely necessary.

able to reschedule your containers to another machine. Lets shutdown our **node-1**
to see how Docker handles an entire node failure.
$ ssh node-1 && poweroff


@thaJeztah

thaJeztah Oct 30, 2014

Member

What happens if I ssh master && poweroff? Unless I overlooked, there's no failover for the master, is this a single point of failure?


@tpires

tpires Oct 31, 2014

How would this work with docker run --restart policy?

If I set docker run --restart=on-failure:3, the node by itself will restart the container (max: 3 times) if it exits with a non-zero exit code. So master rebalancing will only work after the 3rd time? Or will it only work if I set it to the default --restart=no?


@discordianfish

discordianfish Oct 31, 2014

Contributor

It would require --restart=no. But agreed, that should be part of the cluster docs (unless we just disable this if you run docker in clustered mode)

Constraints are key/value pairs associated to particular nodes. You can see them as *node tags*.
When creating a container, the user can select a subset of nodes that should be considered


@thaJeztah

thaJeztah Oct 30, 2014

Member

Any thoughts on being able to specify conditions for an image and allow specifying them in a Dockerfile? (@cpuguy83 has a proposal for specifying "dependencies" in Dockerfiles, I don't have the issue number at hand, but possibly this is useful.)

One problem is that the constraints in this proposal are user defined; for interoperability, pre-defined constraints should exist, so that I can pull an image off the Hub, configure my nodes (using the pre-defined constraints) and have the cluster automatically distribute my containers based on those constraints.

To deploy a container to the cluster you can use your existing Docker CLI to issue a
run command against the master node.
$ export DOCKER_HOST=tcp://master-address:1234
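Continuing that example, still pointed at the master via DOCKER_HOST (a sketch; the container ID, ports and node assignment shown are illustrative):

```
$ docker run -d -P --name redis redis
$ docker ps
CONTAINER ID IMAGE     COMMAND        ... PORTS                        NODE   NAMES
f8b693db9cd6 redis:2.8 "redis-server" ... 192.168.0.42:49178->6379/tcp node-1 redis
```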


@thaJeztah

thaJeztah Oct 30, 2014

Member

Having to specify an environment-variable to select the cluster is a bit clunky. What if I need to manage multiple clusters? Perhaps a command to select/switch cluster from the docker client itself?

Additionally, if discovery is already offered via hub.docker.com, would it be possible to access my cluster by name after being logged in? hub.docker.com could then return the IP-address of the master of the cluster.


@bfirsh

bfirsh Oct 31, 2014

Contributor

Host management will let you select/switch clusters. In the same way you can point at a single Docker host, you can point at a Docker master.

As hinted above, there will likely be a UI via the Docker Hub, in which case you could indeed select a host/cluster by name. This could also be exposed through the Docker CLI.


@discordianfish

discordianfish Oct 31, 2014

Contributor

As far as I know, the idea is to have all docker hosts proxy/redirect requests to the current master. In this case I would just set up a DNS RR name with all docker hosts and configure that on your clients. This way each request selects a random docker host which will redirect to the master anyway.

@thaJeztah


Member

thaJeztah commented Oct 30, 2014

Interesting proposal. I must admit I have no experience with cluster management, but decided to give my thoughts anyway (hope I haven't made a complete fool of myself 😄)

Some additional thoughts;

how does this play together with the container groups proposal (#8637)? Will it be able to distribute groups across a cluster?

My second thought is harder to answer; is this something that should be implemented by Docker itself? The Docker Eco-system / community already has invested a lot in building solutions like Flocker and CoreOS Fleet (among others). Although no one solution will "fit all", I wonder if the docker community will appreciate it if docker itself is going to compete with them.

Just my thoughts, wdyt?

@mkb


mkb commented Oct 30, 2014

Does this need to be part of the main Docker project? Orchestration is a whole other problem beyond container management. I'd rather see these implemented in a parallel project so Docker remains simple and Docker users can choose between Docker orchestration and the other available options.

@jessfraz


Contributor

jessfraz commented Oct 30, 2014

@mkb do you have reasons why not? I think this is so seamlessly integrated into the current docker cli commands/functionality, that I can't think of a reason why it shouldn't, but I would be curious to know.

@mkb


mkb commented Oct 30, 2014

@jfrazelle It comes down to style, really. I usually favor the Unix approach of small, composable tools that do one thing well. The less a tool does, the easier it is for new users (or new maintainers) to ramp up.

From a purely personal standpoint, I love the idea of having the option of using Docker's cluster management, but wouldn't want to be obligated to use those tools over some other option.

To be clear, I'm not saying any of this is a bad idea at all. I'm just suggesting it might be better in a separate project.

@cpuguy83


Contributor

cpuguy83 commented Oct 30, 2014

@mkb did you see the demo?

@thaJeztah


Member

thaJeztah commented Oct 30, 2014

@jfrazelle I think it depends on how you look at Docker; is Docker the "end product" or a "framework" to build an end-product on? Currently, it seems to be both; it's usable out of the box to get working with containers (end product), but acts as a "framework/library" backing other end-products (as mentioned in my previous comment).

Having cluster management out of the box certainly is convenient, the question is; will that make other solutions obsolete? Will cluster management offered by docker itself be just "bare bones" and will other solutions still be required for more advanced needs?

And (to summarise my previous comment): is the goal of Docker to build an Eco-system "around it", or to be the Eco-system (for lack of better words)?

I don't know the answer to that, and don't know if these are "mutually exclusive".

No bad feelings, just thinking out loud here.

@thaJeztah


Member

thaJeztah commented Oct 30, 2014

@cpuguy83 where? Is it viewable online?

@vieux


Collaborator

vieux commented Oct 30, 2014

@thaJeztah the demo will be posted soon on the proposal.

@thaJeztah


Member

thaJeztah commented Oct 30, 2014

@vieux thanks! I'll watch this ticket, quite interested to see it.

@asbjornenge


Contributor

asbjornenge commented Oct 30, 2014

@vieux @aluzzardi really top notch stuff!

You mentioned in your demo that the cluster will diff the current running state with the new state and, I assume, create some sort of execution plan. Would it be out of scope for this system to handle transitioning between more complex states? So that I can describe (or dump?) some desired configuration and easily return to it?

@thaJeztah you can watch it here https://docker.com/community/globalhackday about 1 hour in.

@vieux


Collaborator

vieux commented Oct 31, 2014

Hi everyone, thank you for your feedback,

I added the video from the Docker Global Hack Day in the proposal.

@lukemarsden


Contributor

lukemarsden commented Oct 31, 2014

Awesome work @vieux & @aluzzardi. I'm interested to see the code for the demo, is that available on GitHub somewhere? In particular, did you modify the Docker API at all to support multi-host?

@SamSaffron


SamSaffron commented Oct 31, 2014

how can you discuss this stuff in a pull request, this is madness

@SamSaffron


SamSaffron commented Oct 31, 2014

The big problem with clustering is that getting something basic up is trivial; getting something correct up is fiendishly complex, requiring leader election, Raft-like protocols, dealing with partitions and so on.

I feel this is biting off way more than what the building blocks of Docker should provide. If this is a direction Docker wants to take, then the first step should be adding the building blocks into the infrastructure (e.g. bundling etcd or something), but I am not convinced this is Docker's role.

@andreaturli


Contributor

andreaturli commented Oct 31, 2014

What kind of inter-container connectivity does your solution offer? If the scheduler decides to place my container1 on host1 and container2 on host2, are they part of the same subnet?

To add a new or existing Docker
Engine to your newly created cluster use the provided URL from the Hub
with the `--discovery` flag when you start Docker in daemon mode.


@bfirsh

bfirsh Oct 31, 2014

Contributor

It's not immediately clear that --discovery has anything to do with clustering. How about --cluster? It then makes it really clear that you're enabling clustering mode. The URL seems to be the unique identifier for a cluster, so this makes sense semantically.


@discordianfish

discordianfish Oct 31, 2014

Contributor

+1 for --cluster

@bfirsh


Contributor

bfirsh commented Oct 31, 2014

How are images managed across multiple nodes? E.g.:

  • What does docker images output? Does it say what nodes the images are on, or is the list of images global to the cluster?
  • If I do docker build -t myimage . can I then do docker run myimage without having to think about what node the build ran on?
  • How can I make sure that the redis running on one node is the same as the redis running on another node?

... and stuff like that.

## Constraints
Constraints are key/value pairs associated to particular nodes. You can see them as *node tags*.


@bfirsh

bfirsh Oct 31, 2014

Contributor

I don't think "constraint" is the right word for this. The "constraint" is a rule applied to something based on the key/value pairs, not the key/value pairs themselves.

A better word could be, as you suggest, "node tags", and perhaps the option is --node-tags or just --tags. Or perhaps they're "node labels" because they're a bit like container labels, not sure.

The "constraint" could then be created against a "tag".


@discordianfish

discordianfish Oct 31, 2014

Contributor

Agreed, that's indeed confusing. The constraint refers to key/value pairs (also not tag, tags are imo single words) but the k/v pairs aren't constraints.
I would prefer labels or attributes.


@brendandburns

brendandburns Oct 31, 2014

+1 for node labels, esp. since they're already in use in Kubernetes, and standardizing label support inside of Docker would be a big win.

@timothysc

timothysc Nov 3, 2014

+1 re: standardizing around labels.


@ravigadde

ravigadde Nov 7, 2014

Labels with properties, an example property being anti-affinity.

### Standard Constraints
Additionally, a standard set of constraints can be used when scheduling containers without specifying them when starting the node.
Those tags are sourced from `docker info` and currently include:


@bfirsh

bfirsh Oct 31, 2014

Contributor

Perhaps standard constraints should be namespaced? I can see it being a problem adding new standard constraints in future because they might collide with user-defined constraints.

@discordianfish


Contributor

discordianfish commented Oct 31, 2014

Cool! So I would love to join forces with Kubernetes on this. Even if both projects stay independent, there is a lot of design that is already shared, and similar discussions around things like constraints etc.

**cluster-1**. After creating a cluster you should be able to see and manage
the nodes that are currently registered. After creating a cluster you
should receive a URL that looks similar to
`https://discovery.hub.docker.com/u/demo-user/cluster-1` with your Hub


@brendandburns

brendandburns Oct 31, 2014

Is there a public API for this endpoint? What if I want to set up a rally point inside my private network, or use different auth for managing my cluster rather than my docker images?

Ah I see the "In the future" note below, could you expand to include plans like spec-ing the API, etc?


@timothysc

timothysc Nov 1, 2014

I think most folks would just prefer to specify a master, vs. proxy through hub.

nodes to a master from the web interface. You can also statically assign one
of the Docker Engines with the `--master` flag.
$ docker -d --master --discovery https://discovery.hub.docker.com/u/demo-user/cluster-1


@brendandburns

brendandburns Oct 31, 2014

How do I discover who the master is from my client? Do I need to hard-code the IP address of the master as my $DOCKER_HOST?

(ah, I see equivalent questions below)

that information.
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NODE NAMES


@brendandburns

brendandburns Oct 31, 2014

How does the docker client know to display 'NODE' in the clustering case, and not in the local client case?

The commands you would use for single host will work in clustering mode.
```
$ docker port redis


@brendandburns

brendandburns Oct 31, 2014

should this be:

docker port node-1/redis

It seems like only allowing a single item named 'redis' in your cluster is probably going to lead to namespace conflicts.

f8b693db9cd6 redis:2.8 "redis-server" Up About a minute running 192.168.0.42:49178->6379/tcp node-1 prickly_engelbart
```
The default scheduler uses bin packing to avoid resource fragmentation. If we ask for **1GB** of ram again, the container will be placed on the same node:
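For illustration (a sketch of the described bin-packing behavior; the second container's ID and name are invented for the example):

```
$ docker run -d -P -m 1g redis
$ docker ps
CONTAINER ID IMAGE     COMMAND        ... NODE   NAMES
f8b693db9cd6 redis:2.8 "redis-server" ... node-1 prickly_engelbart
1b7a42d8f9c0 redis:2.8 "redis-server" ... node-1 angry_pasteur
```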


@brendandburns

brendandburns Oct 31, 2014

This will help with external fragmentation, but how will you cope with internal fragmentation (e.g. after I stop a container, it's going to leave a hole that may or may not fit future containers)

@brendandburns


brendandburns commented Oct 31, 2014

@SamSaffron @discordianfish +1 to building modular things and re-using concepts where we can. I have a separate proposal here: #8781 to add the idea of a Pod (co-scheduled containers) into Docker proper.

When you start scheduling containers onto multiple hosts, you need an atomic element that must schedule onto a single machine. I don't think that that element is a single container. It is a group of symbiotic containers, or a hierarchy of containers. (of course there is a degenerate group that consists of a single container)

An example of this is a web server and a side-car job that syncs code from github, or an application composed of two containers talking via shared memory. Neither container can operate properly without the other being co-located on the same machine. I worry that the clustering approach proposed in this PR prevents this kind of atomic co-scheduling. I would much rather see Pods get baked into Docker and then the community can build many different schedulers on top of that primitive.

My other concern relative to this proposal, is that the current Docker API is imperative. I believe that a scheduling system should be declarative. I don't really want to imperatively say: "run this" to my cluster, I want to declaratively say: "this thing is running". In this way, the operational responsibility for keeping my application running is a part of the scheduling system, rather than part of a layer I have to build on top of the scheduling system.

@titanous


Contributor

titanous commented Nov 8, 2014

@discordianfish

Basically, the discovery system will be more similar to etcd's cluster discovery than to a actual registry your docker hosts constantly have to talk to.

How will leader election work?

@nathanleclaire


Contributor

nathanleclaire commented Nov 8, 2014

I imagine this will be akin to the open source registry where Docker provides a hosted service, but also provides the basic functionality via open source.

+1 for this, it should be accounted for right from the start. Hub as a hard dep for deployments or infrastructure will send some otherwise interested parties running away screaming.

@inthecloud247


inthecloud247 commented Nov 8, 2014

+1 agreed.

It would also be nice to be able to turn it off completely, so there are no worries at all about unexpected outgoing requests going out to Docker Hub.

@ibuildthecloud


Contributor

ibuildthecloud commented Nov 8, 2014

@titanous can you offer more useful feedback than basically saying that Docker shouldn't do this? Docker, as an open source project and a company, is bound to eventually include features that expand beyond the scope of one host. How did you envision that happening? That's not a rhetorical question; I'm really quite interested if you have an alternate proposal for how Docker could do something like this.

Also, by cross server links, I mean container links that connect two containers across two servers. As I'm sure you're well aware, Docker links can only work with containers on the same server. This greatly limits their use and many have asked for the ability to link across servers. I personally would like this.

@ibuildthecloud


Contributor

ibuildthecloud commented Nov 8, 2014

Nested Proposal: Multi-Server Primitives

Motivation

There are some people who are opposed to this proposal and I understand their concerns. I believe the basic concern is that Docker is unnecessarily expanding its scope and taking on very difficult problems. By going down this route Docker runs the risk of hastily introducing a model that cannot be well executed and alienates other projects in the ecosystem that might provide a better solution.

While I understand the concerns, I also see there are some basic gaps in the existing Docker model. Docker is a single server technology today. For Docker to grow it must have some multi server concepts. Some could argue that the multi server concepts can be added solely by 3rd party wrappers and additions to Docker, but I would argue that that model runs the risk of creating a very confusing experience for the user and the user experience of Docker is paramount to its success.

Docker walks this very fine line where it needs to provide just enough functionality to allow the ecosystem to move forward. If Docker does too much and takes too much control it alienates other projects. If Docker doesn't do enough, the ecosystem fragments eventually hurting the user.

Objective: Multi Server User Experience

The main objective is to provide a simple multi-server user experience. This topic was covered in DockerCon 2014. Ever since the launch of Docker there has been a flurry of activity in the orchestration space. They are a lot of Docker orchestration tools. This is great and I don't think anybody wants to discourage this type of innovation. We do want to make the space a little less confusing for the user though. The basic premise is that today a user does docker run and it runs a container on the server local to the daemon. How can we change that such that a user can do docker run and it runs on one of many servers.

Not Clustering

The term clustering brings with it a lot of implied baggage. Again, I think the approach should be to provide a multi server experience, not necessarily clustering. Instead of building a full clustering solution, lets include the primitives into Docker to provide a multi server experience and then those primitives can be used to build a full clustering solution.

Plugins

It is absolutely critical, in my opinion, to the future success of Docker that we have a true pluggable design. That work is being designed in #8968. Assume all functionality that I will describe will be pluggable. Also, not just pluggable from a coding perspective, but also from a packaging perspective. Plugins can be artifacts packaged and delivered outside of the main Docker binary.

Primitives

The following is what I believe are the primitives lacking in Docker today. From an implementation perspective I do think it may make sense to spin off another library, libhost, to encapsulate some of this functionality. For the purpose of this proposal I will refer to libhost as the library that provides the interface and plugin points for these primitives and Docker is the daemon calling libhost.

Host Registration

Upon starting the Docker daemon, Docker should delegate to libhost to invoke a host registration plugin. The purpose of this plugin is to notify external entities of the presence of this server. The Docker daemon should be configured with a host registration URL. The URL is intended to be opaque to Docker and should be interpreted by the implementation of the host registration plugin.

Implementations of this plugin could do things such as register the host in a metadata server such as Apache ZooKeeper, Etcd, or Consul. It could also start an embedded implementation of Raft or Paxos. Since Docker wants to provide a simple functional implementation that covers most use cases, Docker will provide a default implementation. The default implementation of the host registration will call into Docker Hub. The approach is very simple and avoids having to run a consensus algorithm such as paxos or raft.

Host Discovery

In order to provide multi-server capabilities, you fundamentally need to know what servers exist. libhost will expose a plugin that Docker calles to list the available hosts. An opaque URL should be used to identify the group of hosts. Neither Docker or the user needs to understand the construction of the URL but only that that URL identifies their groups of hosts. The URL should be interpreted by the implementation of this plugin.

Implementations of this plugin could do things such as query a meta data server or get a list of cluster members from Raft or Paxos. The default implemenation of this plugin will query Docker Hub. Again, this is done for simplicity's sake. The approach and implementation is quite simple. Obviously some will disagree with the notion of a centralized service provide by Docker, Inc, but that is exactly why this is pluggable. Other implementations can be provided.

Host Connection Delegation

Once a host is determined, if Docker wants to issue commands to that host, it must need a connection to the daemon. libhost will expose a plugin that will give the caller a URL to the Docker daemon. Again, the URL should be opaque to Docker, all that it can assume is that if it opens a connection to that URL it will be connected to the Docker API.

Implementations of this plugin should provide means of doing a secure connection to the Docker host. Because the URL is opaque to Docker the plugin can embed authentication information in the URL such that the connection is properly authenticated.

Host API aggregation

Once you have a list of hosts at your command you would like the Docker CLI commands to return information for multiple hosts, not just one. For example, docker ps would list containers on all servers and docker host 42 ps would do ps for just host 42.

The implementation of the API aggregation will probably not be pluggable, but instead uses the pluggable primitives in libhost to accomplish it. If the Docker daemon is started in a "host aggregation" mode it will return multi host results. For example, docker ps would do essentially the following.

  1. Call libhost to get a list of hosts
  2. For each host call libhost to obtain the URL connection to the host
  3. Issue a docker ps to the host URL
  4. Aggregate the results and return to the client

Host Event Aggregation

This really is an implication of Host API aggregation, but should be specifically called out. docker events today gives you a stream of events for that individual server. If you are running against the Docker daemon in "host aggregation" mode it will return all events for all hosts. This is accomplished by simply issuing docker events to each host using constructs already available in libhost.

Conclusion

When I read the Clustering proposal, I interpret it as wanting to add these basic primitives. There may be more that would be useful, but I think this is a good starting point. The key point is that all of these things are pluggable and the interface to them is intended to be high level which should allow a high degree of flexibility in the implementation of the plugin.

## Discovery
Before we can start deploying our container's to a Docker cluster we need to

@SvenDowideit

SvenDowideit Nov 10, 2014

Contributor

containers :) not possessive

@SvenDowideit

Contributor

SvenDowideit commented Nov 10, 2014

nice - please remember to re-jig to 80 char lines some time tho

@brendandburns

brendandburns commented Nov 11, 2014

I was just thinking about this some more, how are you going to handle data containers?

Suppose I have a container Foo that wants to use volumes-from with data container A and data container B.

How can I make sure that data container A and data container B land on the same machine? Am I required to manually use machine constraints to make sure that they both land in the same place?

Suppose I do (hypothetical syntax):

docker run --constraint=machine1 data-container-a
docker run --constraint=machine1 data-container-b
docker run --constraint=machine1 app-container-foo

How can I make sure that my app container will successfully schedule onto the same place? What if I have two app containers foo and bar, how can I make sure that they will correctly schedule onto a machine with sufficient resources?

I think this is yet another strong argument for the fact that you need to schedule pods of containers, not individual containers.

@discordianfish

Contributor

discordianfish commented Nov 11, 2014

@brendandburns AFAIK the idea is that a volumes-from will be a hard scheduling constraint: Your container will be only scheduled to where that data volume container exists. But right, this should be part of the proposal.

@brendandburns

brendandburns commented Nov 11, 2014

Sure, I understand it's a hard constraint, but if you don't make it an atomic unit, then you will easily get into situations where your container + volume container isn't schedulable.

Even worse, you can essentially deadlock your cluster where the dependencies are such that it can never be schedulable.

@discordianfish

Contributor

discordianfish commented Nov 11, 2014

@brendandburns Got it, right. I agree that we need some atomic scheduling unit for such things and, given that we advise people to run a container per service, this can't be individual containers. Whether or not they are called pods, I'm all for that.

@shykes

Collaborator

shykes commented Nov 12, 2014

@brendandburns we discussed the topic of "declarative vs imperative" in today's live review. (aka "Docker tuesday" :).

Here is my conclusion:

  • Yes, any sound clustering solution must 1) distinguish "wanted state" from "actual state", 2) expose both to the user, 3) allow users to change wanted state, and 4) implement a mechanism for resolving differences.
  • This proposal is no exception, and clearly implements all 4 points above. As for how well this is designed and implemented, I trust the instincts of @aluzzardi and @vieux since they have the combined operational experience of Google SRE, Dotcloud platform engineering, and Microsoft Bing. All 3 operated at large scale (although different axis of scale for each), and each with different set of requirements. We are talking about people who know how the real world works, so let's give them the benefit of the doubt on implementation.
  • How to implement point 3 ("allow users to change wanted state") is a discussion of UI and design philosophy. It has nothing to do with "right" or "wrong" way to do clustering. One way is to edit a yaml file describing the entire wanted state, upload it in bulk, and let the tool infer what changed (this is how Kubernetes and Dotcloud do it). Another way is to specify discrete changes to the wanted state - "add this", "remove that", "change this value", etc. (this is how Docker does it, and this proposal preserves that behavior). Again: this is a UI decision, and does not impact the underlying architecture: in both cases, "the operational responsibility for keeping my application running is a part of the scheduling system, rather than part of a layer I have to build on top of the scheduling system". So your requirement is met.

TLDR: we are in violent agreement on underlying architecture and separation of concerns. What is left is a matter of bikeshedding^W UI preference.
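
As an illustration of the mechanism behind points 1 and 4 above -- comparing wanted state against actual state and resolving the differences -- here is a minimal Go sketch of a reconcile loop. All names are invented; this is not the proposal's actual scheduler.

package main

import "fmt"

// State maps a container name to whether it should (or does) exist.
type State map[string]bool

// reconcile starts anything that is wanted but missing and stops anything
// that is running but no longer wanted.
func reconcile(wanted, actual State, start, stop func(name string)) {
	for name := range wanted {
		if !actual[name] {
			start(name) // wanted but not running
		}
	}
	for name := range actual {
		if !wanted[name] {
			stop(name) // running but no longer wanted
		}
	}
}

func main() {
	wanted := State{"web": true, "db": true}
	actual := State{"db": true, "old-worker": true}
	// A real master would run this periodically or on cluster events.
	reconcile(wanted, actual,
		func(n string) { fmt.Println("start", n) },
		func(n string) { fmt.Println("stop", n) })
}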

@jbeda

Contributor

jbeda commented Nov 12, 2014

@shykes Thanks for writing this up.

Totally agree on your 1-4 points above. Based on experiences and mistakes in the past we explicitly call out desired vs. actual state in the API. I would say that if your API does handle desired vs. actual state, you are building a declarative API.

Pure RESTful vs. custom verbs

However, I do think that the way that you modify the desired state is more than just a UI difference. There are pros and cons to each method. When thinking about this in REST terms, the question is if you support modifications via replacing the resource or via custom verbs. In the GCE API, I went with the custom verb route and I'm coming to think that it was a mistake due the sheer number of verbs.

It is important, I think, to differentiate the UI presented to the user from how this is modeled in the API. Just because you model the API as more RESTful with "full update" over custom verbs doesn't mean that you can't have affordances in your UX for incremental updates. The full update model doesn't require or imply YAML/JSON files. You can have a config file based system calling an imperative API (and doing reconciliation client side) and an imperative experience on top of a declarative system.

Error Handling

One thing that is worth calling out (and does relate to this proposal) is how you handle errors. In a truly declarative system failure isn't as clear cut as it might be on a single node. Specifically, if I try to schedule a container and there isn't enough capacity in the cluster, what is the appropriate action -- you could either fail immediately or you could accept the request and fail to converge desired state and actual state. The proposal above suggests that this would error immediately. But what if you can add capacity to the cluster? Shouldn't the desire to run the container be recorded immediately?

This situation is equivalent in many ways to having a container running and then having the cluster shrink due to hardware failure (or admin action). In that case you may now no longer have enough capacity to run all of the work that is desired. Some containers will go into a 'pending' state. Why is it possible to go pending due to cluster shrinkage but not when the cluster isn't large enough?

I think the desire to be able to return success or failure for an operation drives the need, in many ways, to have custom verbs as the update method. If this is really desired, one solution that we've considered is that the user would submit a new version of the resource and the API would return a set of operations that can succeed or fail (with some time bound, etc). That way the user can slam a new version in and the system breaks it down to the things that can succeed or fail.
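
One possible shape for that idea, sketched as Go types purely for illustration (none of these names come from the proposal): the client submits the full desired state, and the API answers with a list of operations that can individually succeed, fail, or stay pending.

package api

// DesiredCluster is a full "wanted state" submission.
type DesiredCluster struct {
	Containers []ContainerSpec `json:"containers"`
}

type ContainerSpec struct {
	Name   string `json:"name"`
	Image  string `json:"image"`
	Memory int64  `json:"memory"` // bytes
}

// Operation is one unit of work the system derived from the submission;
// the client can poll each one until it settles within its time bound.
type Operation struct {
	Kind     string `json:"kind"`     // "create", "remove", "update"
	Target   string `json:"target"`   // container name
	Status   string `json:"status"`   // "pending", "succeeded", "failed"
	Deadline string `json:"deadline,omitempty"`
}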

Serial configuration vs. Settling

One thing that comes out of a multi-resource declarative system is that the order of objects doesn't matter nearly as much. This is something that I think we got wrong in places in the GCE API. If you do this right, you can reference an object before it is ready -- or even created. You can specify resources out of order and let the system settle as things become configured. This doesn't matter as much on a single node as most operations there are relatively time bounded. But with larger and larger systems (some of which may include a ticket to the network ops team) latencies increase and it is easier to use if you don't have to run an outside workflow/state machine to achieve a result. I can give more examples here if you like.

Hopefully this gives you some food for thought. We are building Kubernetes with this stuff in mind.

@inthecloud247

inthecloud247 commented Nov 12, 2014

Thought it would be useful to add info about current relevant and really interesting work with container groups and stack composition by @aanand @crosbymichael and @nathanleclaire . Saw this on the docker-dev list:

"Hi all. I've been working on two new features: container grouping (docker groups)
and stack composition (docker up). Together, they will eventually form a complete
replacement for Fig in Docker itself."

https://github.com/aanand/docker/compare/composition
https://gist.github.com/aanand/9e7ac7185ffd64c1a91a

https://groups.google.com/forum/#!topic/docker-dev/th3yKNKbCWM

I think it's incredibly exciting to have so many different options for orchestration and service discovery. Before Docker, how much innovation and excitement was there in either field? And how many options have popped up out of the woodwork over the last few months. It's amazing to be able to choose among so many solid and creative projects. DNS-based service discovery ftw! And omg (!!!) even now some new alerting/logging systems after decades of stagnation.

But I think it's simply too early to start choosing the winners here and integrating these features into core. IMO, until the plugin system is ready, these features either need to be implemented as separate utilities or wrappers, and even then should tread carefully to not upset the developer ecosystem that has sprung up around Docker. There was a certain chilling effect I noticed after Fig was re-released as the 'official' docker way to run multiple containers on a host. Fig is cool, but some of the other competing systems were also great.

@timothysc

timothysc commented Nov 12, 2014

@jbeda I chatted with @vieux and @aluzzardi yesterday, and I wrote the following notes. I'll let them add color, or greater clarity, where they think it's needed.


  • Clustering will be done via a plugin whose API is TBD but due out later...
  • The Docker CLI is meant to be the primary interface to support Docker clustering, and the plugin will extend the CLI via options (TBD). Ideally we'll rally around both the interface and options to provide consistency across different backends. e.g. --constraints=TBD-fubar
  • Kubernetes, Mesos, or possibly Marathon, could be added as a back end via the plugin API.
  • Any concepts, such as first-classing pods, are a docker-core constraint. But imho it should be added, otherwise I believe clustering will become a mess.

I would hope to see as much common ground and re-usability "where possible", such that we can leverage linus's law, "given enough eyeballs, all bugs are shallow". For example:

Do we need to have two separate master worker implementations?
Do we need to have two label/tag/attribute systems?
Do we need to have two separate monitoring schemes?
Don't get me started on constraints (╯°□°)╯︵ ┻━┻
The list goes on...

The models atop could support different use cases, whether they are imperative or declarative depends on the use cases.

I trust @vieux and @aluzzardi, as we travel in similar circles, and I look forward to working together to make sure we can support multiple models with a common core.

@kelseyhightower

kelseyhightower commented Nov 12, 2014

I'm loving this open discussion around Docker clustering. After reading the proposal and watching the design review on YouTube, I really think etcd can offer a lot of value in the following ways:

Cluster Bootstrapping

The etcd discovery protocol can be used for cluster bootstrapping. Based on the desired user experience I think the discovery protocol is a perfect fit. Docker, Inc can host their own public discovery service dedicated to Docker clustering. Since the discovery service has an HTTP interface, auth can easily be layered on top at anytime.

Master election and Cluster state management

This is something we have been doing for a long time; it is heavily used in etcd, fleet, and flannel. Ideally this proposal can leverage the same stuff and focus on the Docker UI.

Open to collaboration

We are happy to discuss the usage/bundling of etcd or help reusing our raft implementation. We moved away from goraft, but we have made our new raft implementation available as a standalone package.

import "github.com/coreos/etcd/raft"

See the docs for more details.

@LK4D4

Contributor

LK4D4 commented Nov 12, 2014

Using raft in docker would be nice, I think. It is also an interesting thing to do.

@brendandburns

brendandburns commented Nov 13, 2014

@shykes I wanted to clarify my concerns with this proposal into a concise comment, since as you say, the declarative vs. imperative thing is somewhat of a distraction (although an important one) from the larger concerns around the proposal.

Concern 1: Atomicity

As it is currently stated, the atomic unit for clustering is a single Docker container. This is insufficient, as many real world applications consist of multiple groups of containers that must be co-scheduled to operate correctly. The easiest example of this from the Docker world is a data container and the app container. In the current approach it is easy to imagine examples where an attempt to serially schedule the data container and then the app container in two different API calls will result in a failure to schedule.

For example, consider a data container that takes up 1G of disk, and an app container that takes up 0.5G of disk. I have two machines:

  • M1: 1.25G of disk free
  • M2: 10G of disk free

Given the current proposal, you would issue two separate commands. First
docker run -d my/data-container

Let's say that this schedules the data container onto M1

Now you issue a second command:
docker run -d --volumes-from data-container my/app-container

This container is unschedulable, since it is required to schedule onto the same machine as the data container (M1) but there is insufficient space to schedule there.

The correct answer of course is to schedule both containers onto M2, but unless you treat the two containers as an atomic unit, you will never be able to achieve that.
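
To make the failure mode concrete, here is a tiny Go sketch of the M1/M2 example: naive one-container-at-a-time placement strands the data container on M1, while treating the pair as a single atomic unit finds the valid placement on M2. This only illustrates the argument; it is not any scheduler's real algorithm.

package main

import "fmt"

type host struct {
	name string
	free float64 // GB of free disk
}

// place returns the index of the first host with enough free disk,
// mirroring a naive one-container-at-a-time scheduler.
func place(hosts []host, need float64) int {
	for i, h := range hosts {
		if h.free >= need {
			return i
		}
	}
	return -1
}

func main() {
	hosts := []host{{"M1", 1.25}, {"M2", 10}}

	// Serial placement: the 1G data container lands on M1...
	i := place(hosts, 1.0)
	hosts[i].free -= 1.0
	fmt.Println("data container ->", hosts[i].name)

	// ...and the 0.5G app container must follow it (volumes-from), but M1
	// only has 0.25G left, so it is unschedulable.
	if hosts[i].free < 0.5 {
		fmt.Println("app container -> unschedulable on", hosts[i].name)
	}

	// Atomic placement: schedule the 1.5G pair as one unit instead.
	hosts = []host{{"M1", 1.25}, {"M2", 10}}
	j := place(hosts, 1.0+0.5)
	fmt.Println("pod of both containers ->", hosts[j].name) // M2
}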

Concern 2: A single API that mixes concepts between node and cluster

The existing Docker API is focused on a single machine (aka a node). In trying to keep the same api between node and cluster, you are going to introduce extra fields and concepts into the API (e.g. the "NODE" field in the UX suggested in the proposal, or the scheduling constraints) that are irrelevant or unused in either the clustering side of the API or the node side of the API. You are effectively mashing two APIs together, because there is some degree of overlap. A better approach is to extract the type that is common (basically the container [or hopefully pod] definition) and introduce two different APIs, one that is focused on the node, and one that is focused on the cluster.

Concern 3: A lack of support for Labels or tags

If we take containers as the lowest primitive, people are going to want to build higher level primitives on top of those individual containers. Examples of this include replica sets, or load balancing groups.

In Kubernetes we have used labels and label selectors (queries) extensively to concisely express sets of containers/pods that make up a replica set or a load balancing group.

If you don't have something like a label for your containers people are either forced to

  • encode labels into the name of the container (super hacky)
  • maintain a separate parallel set of data structures for each container (ugly, hard to sync)
  • keep explicit lists of containers by name (hard to self-heal via introspecting the API)
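
For readers unfamiliar with the label/selector model referenced above, the core of it is just a map-subset match. A minimal Go sketch, with invented names:

package main

import "fmt"

// matches reports whether a container's labels satisfy a selector:
// every key/value pair in the selector must be present on the container.
func matches(labels, selector map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	containers := map[string]map[string]string{
		"web-1": {"app": "web", "tier": "frontend"},
		"web-2": {"app": "web", "tier": "frontend"},
		"db-1":  {"app": "db", "tier": "backend"},
	}
	// e.g. the members of a load balancing group
	selector := map[string]string{"tier": "frontend"}
	for name, labels := range containers {
		if matches(labels, selector) {
			fmt.Println("selected:", name)
		}
	}
}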

Concern 4: An insufficient health model

The extent to which Docker currently monitors the health of a container is by ensuring that the process is running, but there are lots of application failures, such as server deadlocks, that are only visible if you do application-level (for example HTTP) health checking. If these kinds of health checks aren't part of the low-level API, then users are again going to be forced to create their own set of API objects to drive these kinds of concepts down to the node.

Concern 5: An inability to update existing containers

The Docker API as it currently stands lacks the ability to update any aspect of the container (#6323). This is a large part of what I mean when I say that it is an imperative API: if I want to (for example) adjust the memory that is available to the process, I have to kill the existing container and start a new container with the higher memory limit. This extends to other meta-information, like labels (if they existed), where it is extremely useful to be able to update a container without killing and restarting it. While dynamic updates are useful on the node, they become increasingly important in a scheduling system, as you want to be able to dynamically adjust container limits to match actual usage so that you can pack more containers onto any individual machine.

Those are my highest level concerns about the existing proposal. There are a few other smaller scale things...

I think that adopting Pods for scheduling (#8781) and adding labels would go a long way towards making this proposal more usable, but I would second @jbeda 's request for a clean delineation in the API objects between the desired state and the current state as well, and I truly believe that it's better to extract out the common objects and build a separate clustering API, rather than trying to smash it into the existing Docker API.

@bfirsh

Contributor

bfirsh commented Nov 13, 2014

In case anyone missed it, here's the design review mentioned in a few comments above: https://www.youtube.com/watch?v=4etqZ4ghZus

@johngossman

Contributor

johngossman commented Nov 14, 2014

@brendandburns +1 to your comment about atomicity. And as you all know, these scheduling constraints get really hard...think putting up hundreds or thousands of containers (VMs) in some sort of HA configuration with all the affinity and anti-affinity rules. You need to know the final state of the dependency graph in order to be able to solve the constraint problem. Or at a minimum, you need to know when the end is coming because you can't resolve the system every time it changes, so you need transactions (and the simplest transaction is to provide the whole model in one gulp). You can still have the "incremental" edits to the model, which is what I think people mean when they say imperative, but you also need a "bulk" edit mode.

The above is all theoretical for a simple scheduler that can't solve complex constraints anyway...but I believe part of the proposal is to allow plugin schedulers, placement algorithms, etc. Some of these will not work without transactions.

@johngossman

Contributor

johngossman commented Nov 14, 2014

@kelseyhightower As much as I like etcd, I hope discovery, leader election and state management are pluggable in Docker. As much as I like raft, I hope the team doesn't try to reimplement these features starting at that level. Batteries and all that...

Though I should add, as much as I like plugins, I like getting something running and then refactoring it.

@glynd

glynd commented Nov 17, 2014

The main point I'd like to add is that assuming a master/standby solution is not necessarily the best approach - for scale, geographic reach or reliability. I can understand the desire to do this in an initial v1 version, but it would be a good plan for any APIs or similar to be designed to cater for multiple coordinators being configured.

In a similar vein, it would be useful if a Docker node could be asked "please do this if your state is currently as I think it is" - ie conditional launches/shutdowns. This would also aid towards a distributed approach.

@inthecloud247

inthecloud247 commented Nov 17, 2014

@bfirsh thanks for the youtube link. Didn't know about the recorded design review sessions... interesting.

@inthecloud247

inthecloud247 commented Nov 17, 2014

If there's a built-in link to docker-hub, I hope it's possible to specify a new default url using a command-line option / config file variable / environment variable. Inevitably, Enterprise developers using Docker will start storing cluster metadata on dockerhub, and it'd be important for devop/IT orgs to manage enterprise-wide docker usage through management of default configuration values. It'd be simple to use boxen/chef/ansible/salt to push out and manage safe default values.

I know there was discussion about allowing federation/mirroring of the docker index to help increase availability. Are there similar plans for ensuring availability of this centralized docker cluster service? One simple solution would be to add this functionality to the docker registry project https://github.com/docker/docker-registry to allow self-hosting.

@discordianfish

Contributor

discordianfish commented Nov 17, 2014

@glynd I don't think this should be in scope for Docker clustering: as soon as your cluster is so large that you need to scale the master/coordinator, you should create multiple clusters per (availability) zone with individual masters. In practice this seems to be the more robust approach. At that scale, you would probably build some multi-cluster deployment layer on top of Docker.

@glynd

glynd commented Nov 18, 2014

@discordianfish If you go down the approach of a cluster per AZ you then have to have something to manage the different AZs. What happens if an AZ (or DC) goes down and you need to increase your capacity at other sites? If you don't have a global view on such things you can't act.

I'm saying the APIs and command lines should allow for this from the start, and take it into account in how the tooling is configured / used. Not that it is actually implemented like that from the start.

Then if someone does build a cluster approach they can easily layer it on, and use the same docker tooling and APIs - as well as any other 3rd party tools which have been written with the docker APIs in mind.

@discordianfish

Contributor

discordianfish commented Nov 18, 2014

@glynd Right, you need some management on top of your clusters. I argue this is so site-specific (and complex) that it's not reasonable for Docker itself to address it. At some point we might revisit that, but I think accounting for those use cases right now just drives up complexity for a sane first implementation.

@glynd

glynd commented Nov 18, 2014

@discordianfish Does it? On the API / tooling side it can be as simple as allowing more than one server to be configured instead of just taking a single hostname. That and how your server responds under partial success (i.e. the comments around Atomicity above when stretched to multiple servers.)

@sthulb

sthulb commented Nov 26, 2014

@aluzzardi @vieux, how would one run non-daemonised containers through the cluster? I noticed that all the examples show daemonised containers. Would a non-daemonised container proxy stdout to the master or whichever host started the container?

@dbason

dbason commented Dec 1, 2014

+1 to @titanous ' issues. To me this seems to break separation of concerns. I don't have an issue with Docker developing a cross-server solution; I do, however, have a problem with this being integrated into the containerization engine. I like docker because it is a simple building block and I feel this takes away from that. If we want cross-host options, it should be our choice what we use and how we implement them.

@dbason

dbason commented Dec 2, 2014

To expand on this here's my use case:
I'm currently using docker with one of the schedulers out there in the community. I don't want to use Docker Hub for host registration/discovery inside of docker. I don't want to plug something else in because this is already implemented in the scheduler I'm running over the top. Will I be able to turn all of that off and just run docker in a standalone fashion as it is now?

@shykes

Collaborator

shykes commented Dec 2, 2014

Yes.

@boonkerz

boonkerz commented Jan 4, 2015

When containers are rebalanced, take as an example a typical webapp:
docker run -d --name mysql mysql
docker run -d --name elasticsearch elasticsearch
docker run -d --name web --link mysql:mysql --link elasticsearch:elasticsearch webserver
If elasticsearch goes down, does the cluster also reconfigure the link on the webserver container?
And when mysql is started on host 1 and the webserver on host 2, does the cluster connect the right hosts together?

@vieux

Collaborator

vieux commented Jan 13, 2015

@boonkerz right now swarm doesn't support links.
There is a huge effort underway to improve the networking model directly in the docker engine.
With these improvements, links will work between 2 hosts.
We are waiting for that before using it.

@vieux

Collaborator

vieux commented Jan 13, 2015

Hi everyone,

As you probably figured out, this design proposal was an early version of Swarm

Please redirect all your concerns to the Swarm issue tracker

Thanks!

@vieux vieux closed this Jan 13, 2015

@aluzzardi aluzzardi deleted the aluzzardi:clustering-proposal branch May 7, 2015
