Proposal: Native Docker Multi-Host Networking #8951

Closed
nerdalert opened this Issue Nov 4, 2014 · 145 comments

@nerdalert
Contributor

nerdalert commented Nov 4, 2014

Native Docker Multi-Host Networking

TL;DR Practical SDN for Docker

Authors: @dave-tucker, @mavenugo and @nerdalert.

Background

Application virtualization will have a significant impact on the future of data center networks. Compute virtualization has driven the edge of the network into the server, and more specifically into the virtual switch. The compute workload efficiencies derived from Docker containers will dramatically increase the density of network requirements in the server. Scaling this density will require reliable network fundamentals, while also ensuring the developer has as much or as little interaction with the network as is desired.

A tightly coupled, native integration with Docker will ensure a base functionality that is capable of integrating into the vast majority of data center network architectures today, and will help reduce the barriers to Docker adoption. Just as important for the diverse user base is making Docker networking dead simple for the user to integrate, provision and troubleshoot.

The first step is a native Docker networking solution that can handle multi-host environments, scales to production requirements, and works well with existing network deployments and operations.

Problem Statement

Though there are a few existing multi-host networking solutions, they are currently designed as over-the-top solutions layered on Docker that:

  1. Address a specific use case
  2. Address a specific orchestration system deployment
  3. Do not scale to production requirements
  4. Do not work well with existing production networks and operations

The core of this proposal is to bring multi-host networking as a native part of Docker that handles most of the use-cases, scales and works well with the existing production network and operations. With this provided as a native Docker solution, every orchestration system can enjoy the benefits alike.

There are three ways to approach multi-host networking in docker:

  1. NAT-based: hide the containers behind the Docker host IP address. Job done.
  2. IP-based: each container has its own unique IP address.
  3. Hybrid: a mix of the above.

NAT-based

The first option (NAT-based) works by hiding the containers behind a Docker host IP address. The TCP port exposed by a given Docker container is mapped to a unique port on the host machine.

Since the mapped host port has to be unique, containers using well-known port numbers are therefore forced to use ephemeral ports. This adds complexity in network operations, network visibility, troubleshooting and deployment.

For example, consider the configuration of a front-end load-balancer for a DNS service hosted in a Docker cluster:

Service Address:

  • 1.2.3.4:53

Servers:

  • 10.1.10.1:65321
  • 10.36.45.2:64123
  • 10.44.3.1:54219

If you have firewalls or IDS/IPS devices behind the load-balancer, these also need to know that the DNS service is being hosted on these addresses and port numbers.
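The port-mapping pain above can be sketched with a toy allocator. This is an illustrative model only, not Docker's actual port allocator; the starting port numbers are taken from the example above and are otherwise arbitrary:

```python
# Toy model of NAT-based publishing: each container's well-known port is
# remapped to a unique ephemeral port on its Docker host, so the
# load-balancer pool ends up full of arbitrary port numbers.
import itertools

class DockerHost:
    def __init__(self, ip, ephemeral_start):
        self.ip = ip
        self._ports = itertools.count(ephemeral_start)

    def publish(self, container_port):
        # The well-known container port (e.g. 53) is hidden behind
        # whatever host port happens to be free.
        host_port = next(self._ports)
        return (self.ip, host_port)

hosts = [DockerHost("10.1.10.1", 65321),
         DockerHost("10.36.45.2", 64123),
         DockerHost("10.44.3.1", 54219)]

# Three DNS containers, all listening on port 53 internally:
pool = [h.publish(53) for h in hosts]
for ip, port in pool:
    print(f"{ip}:{port}")   # every backend gets a different, unpredictable port
```

Any downstream load-balancer, firewall, or IDS/IPS now has to track this per-container port table rather than a single well-known port.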

IP-based

The second option (IP-based) works by assigning a unique IP address to each container, avoiding the need for port-mapping and solving the issues with downstream load-balancers and firewalls by using well-known ports in pre-determined subnets.
However, this exposes a different set of issues.

  • Reachability: which containers are on which host?
    • GCE uses a /24 per host for this reason, but solutions outside of GCE will require an overlay network like Flannel
    • Even a GCE-style architecture will make firewall management difficult
  • Flexible Addressing / IP Address Management (IPAM)
    • Who assigns IP addresses to containers?
      • Static? A flag in docker run?
      • DHCP/IPAM? A proper DHCP server or IPAM solution?
      • Docker? A local DHCP solution using Docker?
      • Orchestration system? Via docker run or another API?
  • Deployability and migration concerns
    • Some clouds do not play well with routers (like EC2)
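The GCE-style per-host-subnet approach mentioned above can be sketched with the standard ipaddress module. The supernet, prefix lengths, and host names below are illustrative assumptions, not part of the proposal:

```python
# Sketch of per-host-subnet IPAM: each host owns a /24 carved from a
# cluster supernet, and containers draw addresses from their host's slice.
# "Which containers are on which host" then reduces to one route per /24.
import ipaddress

class HostSubnetIPAM:
    def __init__(self, supernet="10.244.0.0/16", per_host_prefix=24):
        self._subnets = ipaddress.ip_network(supernet).subnets(new_prefix=per_host_prefix)
        self.routes = {}          # subnet -> host: the reachability answer
        self._pools = {}          # host -> iterator of free container IPs

    def register_host(self, host):
        subnet = next(self._subnets)
        self.routes[subnet] = host
        pool = subnet.hosts()
        next(pool)                # skip .1, conventionally the bridge/gateway
        self._pools[host] = pool
        return subnet

    def allocate(self, host):
        return next(self._pools[host])

ipam = HostSubnetIPAM()
print(ipam.register_host("node-a"))   # 10.244.0.0/24
print(ipam.register_host("node-b"))   # 10.244.1.0/24
print(ipam.allocate("node-a"))        # 10.244.0.2
```

The trade-off the bullets describe is visible here: routing stays simple (one prefix per host), but firewall rules keyed on individual container addresses remain awkward, and the scheme assumes the underlay can route the supernet.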

Proposal

We are proposing a Native Multi-Host networking solution to Docker that handles various production-grade deployment scenarios and use cases.

The power of Docker is its simplicity, yet it scales to the demands of hyper-scale deployments. The same cannot be said today for the native networking solution in Docker. This proposal aims to bridge that gap. The intent is to implement a production-ready, reliable multi-host networking solution that is native to Docker, while remaining laser-focused on the user-friendly developer experience that is at the heart of the Docker transformation.

The new edge of the network is the vSwitch. The virtual port density that application virtualization will drive is an even larger multiplier than the explosion of virtual ports created by OS virtualization. This will create port density far beyond anything to date. In order to scale, the network cannot be seen as merely the existing 2-tier spine/leaf physical architecture; it must also incorporate the virtual edge. Having Docker natively incorporate clear, scalable architectures will avoid the all-too-common problem of the network blocking innovation.

Solution Components

1. Programmable vSwitch

To implement this solution we require a programmable vSwitch.
This will allow us to configure the necessary bridges, ports and tunnels to support a wide range of networking use cases.

Our initial focus will be to develop an API to implement the primitives required of the vSwitch for multi-host networking with a focus on delivering an implementation for Open vSwitch first.
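The primitive set could take a shape like the following interface sketch. The names (init_bridge, create_port, create_tunnel) and the in-memory backend are hypothetical stand-ins for illustration; the actual driver API is the subject of the companion proposal:

```python
# Illustrative sketch of vSwitch primitives a multi-host driver could
# target. A concrete backend (e.g. Open vSwitch managed via OVSDB) would
# implement these by issuing the equivalent switch-management calls.
from abc import ABC, abstractmethod

class VSwitchDriver(ABC):
    @abstractmethod
    def init_bridge(self, name): ...

    @abstractmethod
    def create_port(self, bridge, container_id): ...

    @abstractmethod
    def create_tunnel(self, bridge, kind, remote_ip): ...  # kind: "vxlan", "gre", ...

class InMemoryDriver(VSwitchDriver):
    """Stand-in backend used here only to exercise the interface."""
    def __init__(self):
        self.bridges = {}

    def init_bridge(self, name):
        self.bridges[name] = {"ports": [], "tunnels": []}

    def create_port(self, bridge, container_id):
        port = f"veth-{container_id[:8]}"
        self.bridges[bridge]["ports"].append(port)
        return port

    def create_tunnel(self, bridge, kind, remote_ip):
        self.bridges[bridge]["tunnels"].append((kind, remote_ip))

drv = InMemoryDriver()
drv.init_bridge("docker0-ovs")
print(drv.create_port("docker0-ovs", "f00dcafebeef"))  # veth-f00dcafe
drv.create_tunnel("docker0-ovs", "vxlan", "192.0.2.11")
```

Keeping the primitives this small is what lets a first OVS implementation coexist with other backends later.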

The WHY-OVS link covers the rationale for choosing OVS and why it is important to the Docker ecosystem and virtual networking as a whole. Open vSwitch has a mature kernel data-plane (upstream since 3.7) with a rich set of features that addresses the requirements of multi-host networking. In addition to the data-plane performance and functionality, Open vSwitch also has an integrated management-plane called OVSDB that abstracts the switch as a database for applications to make use of.

With this proposal the native implementation in Docker will:

  • Provide an API for implementing multi-host networking
  • Provide an implementation for an Open vSwitch datapath
  • Implement a native control plane to address the scenarios mentioned in this proposal

2. Network Integration

The scenarios that this proposal deals with range from the existing port-mapping solution, to VXLAN-based overlays, to native underlay network integration. There are real deployment scenarios for each of these use cases.

The solution should facilitate the common application HA scenario of a service needing a 1:1 NAT mapping between the container's back-end IP address and a front-end IP address from a routable address pool. Alternatively, the containers can also be reachable globally, depending on the user's IP addressing strategy.
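The 1:1 NAT scenario can be sketched as a simple mapping from a routable front-end pool to container back-end addresses. The addresses below are illustrative (documentation ranges), and the class is a toy model of the mapping, not a packet-rewriting implementation:

```python
# Sketch of 1:1 NAT for service HA: each published service gets a
# front-end IP from a routable pool, mapped to the container's
# back-end IP with no port rewriting at all -- well-known ports survive.
class OneToOneNAT:
    def __init__(self, frontend_pool):
        self._free = list(frontend_pool)
        self.mappings = {}    # front-end IP -> back-end container IP

    def publish(self, backend_ip):
        frontend_ip = self._free.pop(0)
        self.mappings[frontend_ip] = backend_ip
        return frontend_ip

nat = OneToOneNAT(["203.0.113.10", "203.0.113.11"])
vip = nat.publish("172.17.0.5")
print(vip, "->", nat.mappings[vip])   # 203.0.113.10 -> 172.17.0.5
```

Because ports are untouched, the load-balancer and firewall configuration from the NAT-based example collapses back to well-known ports on predictable addresses.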

3. Flexible Addressing / IP Address Management (IPAM)

In a multi-host environment, the IP addressing strategy becomes crucial. Some of the use cases, as we will see, will also require reasonable IPAM to be in place. This discussion will also lead to the production-grade scale requirements of Layer 2 vs. Layer 3 networks.

4. Host Discovery
Though it is obvious, it is important to mention the host discovery requirements that are inherent in any multi-host solution. We believe that such a host/service discovery mechanism is a generic requirement that is not specific to multi-host networking, and as such we are backing the Docker Clustering proposal for this purpose.

5. Multi-Tenancy
Another important consideration is to provide the architectural white-space for Multi-Tenancy solutions that may either be introduced in Docker Natively or by external orchestration systems.

Single Host Network Deployment Scenarios

  • Parity with existing Docker Single-Host solution

This is the native single-host Docker networking model as of today. It is the most basic scenario, which the solution we are proposing must address seamlessly. This scenario brings in the basic Open vSwitch integration into Docker, which we can build on for the multi-host scenarios that follow.

Figure - 1

  • Addition of Flexible Addressing

This scenario adds a flexible addressing scheme to the basic single-host use case, where we can provide IP addressing from one of many different sources.

Figure - 2

Multi Host Network Deployment Scenarios

The following scenarios enable back-end Docker containers to communicate with one another across multiple hosts. This fulfills the need for highly available applications to survive beyond a single node failure.

  • Overlay Tunnels (VXLAN, GRE, Geneve, etc.)

For environments which need to abstract the physical network, overlay networks create a virtual datapath using supported tunneling encapsulations (VXLAN, GRE, etc.). It is just as important for these networks to be as reliable and consistent as the underlying network. Our experience leads us towards using a similar consistency protocol, such as a tenant-aware BGP, in order to achieve the worry-free environment developers and operators desire. This also presents an evolvable architecture if a tighter coupling with the native network is of value in the future.

The overlay datapath is provisioned between tunnel endpoints residing in the Docker hosts, which gives the appearance that all hosts within a given provider segment are directly connected to one another, as depicted in Figure 3.

Figure - 3

As a new container comes online, its prefix is updated in the routing protocol, announcing its location via a tunnel endpoint. As the other Docker hosts receive the updates, a forwarding entry mapping the prefix to its tunnel endpoint is installed into OVS. When the container is deprovisioned, a similar process occurs and the tunnel-endpoint Docker hosts remove the forwarding entry for the deprovisioned container.
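The announce/withdraw cycle just described can be sketched as a toy control plane. This is a stand-in for the BGP-style protocol, not a real implementation; the host names, VTEP addresses, and prefix are assumptions for illustration:

```python
# Toy overlay control plane: when a container comes up, its prefix is
# announced with the tunnel endpoint (VTEP) of its host; peers install
# prefix -> VTEP forwarding state, and a withdrawal removes it.
class OverlayHost:
    def __init__(self, name, vtep_ip, peers):
        self.name, self.vtep_ip = name, vtep_ip
        self.peers = peers            # shared registry of all hosts
        self.fib = {}                 # container prefix -> remote VTEP
        peers.append(self)

    def container_up(self, prefix):
        # Announce: every other host learns "prefix lives behind my VTEP".
        for peer in self.peers:
            if peer is not self:
                peer.fib[prefix] = self.vtep_ip

    def container_down(self, prefix):
        # Withdraw: forwarding entries for the prefix are removed.
        for peer in self.peers:
            peer.fib.pop(prefix, None)

cluster = []
h1 = OverlayHost("host1", "192.0.2.1", cluster)
h2 = OverlayHost("host2", "192.0.2.2", cluster)

h1.container_up("10.10.1.5/32")
print(h2.fib)   # {'10.10.1.5/32': '192.0.2.1'}
h1.container_down("10.10.1.5/32")
print(h2.fib)   # {}
```

In the real system, the "install" step would program the OVS datapath so traffic for the prefix is encapsulated toward the right tunnel endpoint.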

  • Underlay Network integration

The back-end can also simply be bridged into a network's broadcast domain and rely on upstream networking to provide reachability. Traditional L2 bridging has significant scaling issues, but it is still very common in data centers with flat VLAN architectures to facilitate live workload migration of VMs.

This model is fairly critical for DC architectures that require a tight coupling of network and compute, as opposed to the "ships in the night" design of overlays abstracting the physical network.

The underlay network integration can be designed with a specific network architecture in mind, and hence we see models like Google Compute Engine, where every host is assigned a dedicated subnet and each pod gets an IP address from that subnet.

Figure - 4 - One dedicated static subnet per host

The entire back-end container space can be advertised into the underlying network for IP reachability. IPv6 is becoming attractive for many in this scenario due to IPv4 address constraints.

Extending L3 to the true edge of the network in the vSwitch enables proven network scale while still retaining the ability to perform disaggregated network services at the edge. Extending gateway protocols to the host will play a significant role in scaling a tight coupling with the network architecture.

Alternatively, underlay integration can also provide flexible addressing combined with /32 host-route updates to the network in order to provide subnet flexibility.

Figure - 5

Summary

Implementing the above solution provides flexible, scalable multi-host networking as a native part of Docker. This implementation adds a strong networking foundation that is intent on providing an evolvable network architecture for the future.

@thockin

Contributor

thockin commented Nov 4, 2014

This sounds good. What I am not seeing is the API and performance. How does one go about setting this up? How much does it hurt performance?

One of the things we are trying to do in GCE is drive container network perf -> native. veth is awful from a perf perspective. We're working on networking (what you call underlay) without veth and a vbridge at all.

@shykes

Collaborator

shykes commented Nov 4, 2014

I like the idea of underlay networking in Docker. The first question is: how much can be bundled by default? Does an ovs+vxlan solution make sense as a default, in replacement of veth + regular bridge? Or should they be reserved for opt-in plugins?

@thockin do you have opinions on the best system mechanism to use?

@thockin

Contributor

thockin commented Nov 4, 2014

What exactly do you mean by "system mechanism" ?

@shykes

Collaborator

shykes commented Nov 4, 2014

vxlan vs pcap/userland encapsulation vs nat with netfilter vs veth/bridge vs macvlan... use ovs by default vs. keep it out of the core.. Things like that.

@thockin

Contributor

thockin commented Nov 4, 2014

Ah. My experience is somewhat limited.

Google has made good use of OVS internally.

veth pair performance is awful and unlikely to get better.

I have not played with macvlan, but I understand it is ~wire speed, albeit a bit awkward to use.

We have a patch cooking that fills the need for macvlan-like perf without actually being VLAN (more like old-skool eth0:0 aliases).

If we're going to pick a default, I don't think OVS is the worst choice - it can't be worse perf than veth. But it's maybe more dependency heavy? Not sure.

@mavenugo


Contributor

mavenugo commented Nov 5, 2014

@thockin @shykes Thanks for the comments.
Agreed on the veth performance issues. Our proposal is to use OVS ports.
The companion proposal : #8952 covers details on how we are planning to use OVS.
(Please refer to the Open vSwitch Backend section of #8952 which covers performance details of veth vs OVS port).

OVS provides the flexibility of using VXLAN for overlay deployments or native network integration for underlay deployments without sacrificing performance or scale.

I haven't done much work with macvlan to give an answer on how it stacks up to an overall solution that includes functionality, manageability, performance, scale and network operations.

We believe that a native Docker networking solution should be flexible enough to accommodate L2, L3 and overlay network architectures.

@jainvipin

jainvipin commented Nov 5, 2014

Hi Madhu, Dave and Team:

Definitely a wholesome view of the problem. Thanks for putting it out there. Few questions and comments (on both proposals [0] and [1], as they tie into each other quite a bit):

Comments and Questions on proposal on Native-Docker Multi-Host Networking:

[a] OVS Integration: The proposal to natively instantiate OVS from Docker is good.

  • Versioning and dependency between the networking component and the compute part of Docker: assuming the driver APIs (proposed in [1]) will change and be refined as we go, an obvious implication of such an implementation inside Docker is that the Docker version that implements those APIs would be tied to the user of the APIs (aka the orchestrator), and all must be compatible and upgraded together.
  • Providing native data-path integration: if native OVSDB API calls are made via Docker, wouldn't it be inefficient (an extra hop) to make these API calls via Docker?
  • Datapath OF integration: OVS also provides a complete OF datapath using a controller (ODL, for example). Are you proposing that for a use case that requires OF API calls, the API calls are also made through docker (native integration)? Assuming not, if the datapath programming to the switch is done from outside docker, then why keep part of the OVS bridge manipulation inside docker (via the driver) and a part outside? It would seem that doing the network operations completely outside in an orchestration entity would be a good choice, provided a simple basic mechanism like [2] exists to allow the outside systems to attach network namespaces during container creation.
  • Provide API for implementation for Multi-Host Networking:
    Question: Can you please clarify if the APIs proposed here are eventually consumed by the driver calls defined in [1]? Assuming yes, to keep docker-interface transparent to plugin-specific content of these APIs, what is the proposed method? Say, a plugin-specific parsable-network-configuration for each of the proposed API calls in [1].
  • Provide native control plane:
    Question: Can you please elaborate the intention of this integration. Is this to allow inserting a control plane entity (aka router or routing layer, as illustrated in Figure 4 forming routing adjacency)? If so, does the entity sit inside or outside docker? The confusion comes from the bullet in section 1 “o Implement native control plane to address the scenarios mentioned in this proposal.”

[b]
+1 on the flexibility being talked about (single host, vs. overlays, to native underlay integration). I am wondering if there is anything specific being proposed here, or is it something that naturally comes from the OVS integration?

[c]
+1 on the flexibility on IPAM (use of perhaps DHCP for certain containers vs. auto-configured for the rest, mostly useful in multi-tenant scenarios). I am wondering if there is anything specific being proposed here, or is it something that naturally comes from the OVS integration?

[e]
Multi-tenancy is an important consideration indeed; associating a profile as in [1], which specifies arbitrary parsed network configuration, seems to suffice for providing a tenant context.

[f]
Regarding DNS/DDNS updates (exposing) for the host: assuming this is done outside (by the orchestrator), then part of the networking is done outside Docker and part inside (the rest of the native Docker integration proposed here).

Comments and Questions on proposal on ‘Network Drivers’:

[g] Multiple vNICs inside a container: do the APIs proposed here (CreatePort) handle creation of multiple vNICs inside a container?

[h] Updates to network configuration: say a bridge is added with a VXLAN VNID or a VLAN; would your suggestion be to call ‘InitBridge’, or should this be done during PortCreate() if the VLAN/tunnel/other parameters needed for port creation do not exist?

[j] Driver API performance/scale requirements: It would be good to state an upfront design target for scale/performance.

As always, will be happy to collaborate on this with you and other developers.

Cheers,
--Vipin

[0] #8951
[1] #8952
[2] #8216


@dave-tucker

Contributor

dave-tucker commented Nov 5, 2014

@thockin on the macvlan performance, are there any published figures?
@shykes @mavenugo I've done a very rough-and-ready comparison, and so far OVS seems to be leading the pack in my scenario, which is iperf between two netns on the same host.
See code and environment here
[Screenshot: iperf benchmark results, 2014-11-05]

From an underlay integration standpoint, I'd imagine that having a bridge would be much easier to manage, as you could trunk all VLANs to the vSwitch and place the container port in the appropriate VLAN... otherwise, with a load of MAC addresses loose on your underlay, you'd need to configure your underlay edge switches to apply a VLAN based on a MAC address (which won't be known in advance).

I feel like I'm missing something though, so please feel free to correct me if I haven't quite grokked the macvlan use case.

@dave-tucker

Contributor

dave-tucker commented Nov 5, 2014

@jainvipin thanks for the mega feedback. I think the answer to a lot of your questions lies in these simple statements. I firmly believe that all network configuration should be done natively, as a part of Docker. I also believe that docker run shouldn't be polluted with operational semantics, especially if this impacts the ability of docker run to be used with libswarm (e.g making assumptions on the environment) or adds complexity for devs using docker.

Orchestration systems populating netns and/or bridge details on the host, then asking Docker to plumb this in to the container doesn't seem right to me. I'd much rather see orchestration systems converge on, or create a driver in this framework (or one like it) that does the necessary configuration in Docker itself.

For multi-host, the Network Driver API will be extended to support the required primitives for programming the dataplane. This could take the form of OF datapath programming in the case of OVS, but it could also be adding plain old ip routes in the kernel. This is really up to the driver.

To that end, all of the improvements we're suggesting here for multi-host are designed to be agnostic to the backend used to deliver them.

@thockin

This comment has been minimized.

Show comment
Hide comment
@thockin

thockin Nov 5, 2014

Contributor

The caveat here is that Docker can not be everything to everyone, and the
more we try to make it do everything, the more likely it is to blow up in
our faces.

Having networking be externalized with a clean plugin interface (i.e. exec)
is powerful. Network setup isn't exactly fast-path, so popping out to an
external tool would probably be fine.

On Tue, Nov 4, 2014 at 6:44 PM, Dave Tucker notifications@github.com
wrote:

@jainvipin https://github.com/jainvipin thanks for the mega feedback. I
think the answer to a lot of your questions lies in these simple
statements. I firmly believe that all network configuration should be done
natively, as a part of Docker. I also believe that docker run shouldn't
be polluted with operational semantics, especially if this impacts the
ability of docker run to be used with libswarm (e.g making assumptions on
the environment) or adds complexity for devs using docker.

Orchestration systems populating netns and/or bridge details on the host,
then asking Docker to plumb this in to the container doesn't seem right to
me. I'd much rather see orchestration systems converge on, or create a
driver in this framework (or one like it) that does the necessary
configuration in Docker itself.

For multi-host, the Network Driver API will be extended to support the
required primitives for programming the dataplane. This could take the form
of OF datapath programming in the case of OVS, but it could also be adding
plain old ip routes in the kernel. This is really up to the driver.

To that end, all of the improvements we're suggesting here for multi-host
designed to be agnostic to the backend used to deliver them.

Reply to this email directly or view it on GitHub
#8951 (comment).
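The exec-style plugin hand-off suggested above (network setup isn't fast-path, so popping out to an external tool is fine) could be sketched roughly as follows. The JSON request/reply shape and the inline stand-in "plugin" are assumptions for illustration, not any real plugin protocol: the daemon serializes the request, execs an external tool, and parses a JSON reply from its stdout.

```python
# Sketch of an exec-based network plugin hand-off (hypothetical protocol).
import json
import subprocess
import sys

# Stand-in external plugin: reads a JSON request on stdin, replies on stdout.
PLUGIN = [sys.executable, "-c",
          "import json,sys; req = json.load(sys.stdin); "
          "print(json.dumps({'status': 'ok', 'endpoint': req['container']}))"]

def configure_network(container_id):
    # The daemon pops out to the external tool once per container start.
    request = json.dumps({"container": container_id, "action": "join"})
    proc = subprocess.run(PLUGIN, input=request, capture_output=True, text=True)
    return json.loads(proc.stdout)

reply = configure_network("c1")
print(reply)  # {'status': 'ok', 'endpoint': 'c1'}
```

Because the interface is just "exec a binary with JSON on stdin", the plugin can be written in any language and swapped without rebuilding the daemon.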


jainvipin Nov 5, 2014

@dave-tucker There are trade-offs of pulling everything (management, data-plane, and control-plane) in docker. While you highlighted the advantages (and I agree with some as indicated in my comment), I was noting a few disadvantages (versioning/compatibility, inefficiency, docker performance, etc.) so we can weigh it better. This is based on my understanding of things reading the proposal (no experimentation yet).

In contrast, if we can incorporate a small change (#8216) in docker, it can perhaps give scheduler/orchestrator/controller a good way to spawn the containers while allowing them to do networking related things themselves, and not have to move all networking natively inside docker – IMHO a good balance for what the pain point is and yet not make docker very heavy.

'docker run' already has 20-25 options, some of which provide further sub-options (e.g. '-a' or '--security-opt'). I don't think it will stay at 25 in the near term; it is likely to grow rapidly into a flat, unstructured set. The growth would come from valid use-cases (networking or non-networking), but must we solve that problem here in this proposal?

I think libswarm can work with either of the two models, where an orchestrator has to play a role of spawning ‘swarmd’ with appropriate network glue points.


nkratzke Nov 5, 2014

What about weave (https://github.com/zettio/weave)? Weave provides a very convenient SDN solution for Docker from my point of view, and it provides encryption out of the box, which is a true plus. So far it is the only open-source solution we have found with out-of-the-box encryption.

Nevertheless, weave's impact on network performance in HTTP-based and REST-like protocols is substantial: about 30% performance loss for small message sizes (< 1,000 bytes) and up to 70% performance loss for big message sizes (> 200,000 bytes). Performance losses were measured for the indicators time per request, transfer rate, and requests per second, using apachebench against a simple ping-pong system exchanging data over an HTTP-based REST-like protocol.

We are writing a paper for the next CLOSER conference to present our performance results. There are some options to optimize weave performance (e.g. not containerizing the weave router should bring a 10% to 15% performance gain according to our data).


shykes Nov 5, 2014

Collaborator

@thockin absolutely we will need to couple this with a plugin architecture. See #8968 for first steps in that direction :)

At the same time, Docker will always have a default. Ideally that default should be enough for 80% of use cases, with plugins as a solution for the rest. When I ask about ovs as a viable default, it's in the context of this "batteries included but removable" model.


shykes Nov 5, 2014

Collaborator

Ping @erikh


Lukasa Nov 5, 2014

@dave-tucker, @mavenugo and @nerdalert (and indeed @ everyone else):

It's really exciting to see this proposal for Docker! The lack of multi-host networking has been a glaring gap in Docker's solution for a while now.

I just want to quickly propose an alternative, lighter-weight model that my colleagues and I have been working on. The OVS approach proposed here is great if it's necessary to put containers in layer 2 broadcast domains, but it's not immediately clear to me that this will be necessary for the majority of containerized workloads.

An alternative approach is to pursue network virtualization at Layer 3. A good reference example is Project Calico. This approach uses BGP and ACLs to route traffic between endpoints (in this case containers). This is a much lighter-weight approach, so long as you can accept certain limitations: IP only, and no IP address overlap. Both of these feel like extremely reasonable limitations for a default Docker case.

We've prototyped Calico's approach with Docker, and it works perfectly, so the approach is simple to implement for Docker.

Docker is in a unique position to take advantage of lighter-weight approaches to virtual networking because it doesn't have the legacy weight of hypervisor approaches. It would be a shame to simply follow the path laid by hypervisors without evaluating alternative approaches.

(NB: I spotted #8952 and will comment there as well, I'd like the Calico approach to be viable for integration with Docker regardless of whether it's the default.)
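The L3 approach described in this comment can be illustrated with a toy simulation: each host advertises a /32 host route per container (as a BGP speaker would), and every peer installs it, so no L2 broadcast domain is needed. This is a hypothetical illustration of the idea, not Project Calico's actual mechanism; the class and field names are made up.

```python
# Toy model of L3 container networking: per-container /32 routes
# advertised host-to-host, next-hop = the hosting machine's IP.

class Host:
    def __init__(self, name, ip):
        self.name, self.ip = name, ip
        self.routes = {}  # container prefix -> next-hop host IP

    def advertise(self, container_ip, peers):
        # A real deployment would do this via a BGP session; here we
        # just install the host route directly into each peer's table.
        for peer in peers:
            peer.routes[container_ip + "/32"] = self.ip

h1 = Host("h1", "192.168.1.1")
h2 = Host("h2", "192.168.1.2")
h1.advertise("10.1.0.5", [h2])   # container living on h1
h2.advertise("10.2.0.7", [h1])   # container living on h2

print(h2.routes)  # {'10.1.0.5/32': '192.168.1.1'}
```

Traffic to 10.1.0.5 from anywhere is simply routed to 192.168.1.1, which is why the approach needs no overlay encapsulation but cannot tolerate overlapping container IPs.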


erikh Nov 5, 2014

Contributor

I have some simple opinions here but they may be misguided, so please feel free to correct my assumptions. Sorry if this seems overly simplistic but plenty of this is very new to me, so I’ll focus on how I think this should fit into docker instead. I’m not entirely sure what you wanted me to weigh in on @shykes, so I’m trying to cover everything from a design angle.

I’ll weigh in on the nitty-gritty of the architecture after some more experimentation with openvswitch (you know, when I have a clue :).

After some consideration, I think weave, or something like it, should be the default networking system in docker. While this may ruffle some feathers, we absolutely have to support the simple use case. I think it’s safe to say developers don’t care about openvswitch, they care that they can start postgres and rails and they just work together. Weave brings this capability without a lot of dependencies at the cost of performance, and it’s very possible to embed directly into docker, with some collaborative work between us and the zettio team.

That said, openvswitch should definitely be available and first-class for production use (weave does not appear at a glance to be made for especially demanding workloads) and ops professionals will appreciate the necessary complexity with the bonus flexibility. The socketplane guys seem extremely skilled and knowledgeable with openvswitch and we should fully leverage that, standing on the shoulders of giants.

In general, I am all for anything that gets rid of this iptables/veth mess we have now. The code is very brittle and racy, with tons of problems, and basically makes life for ops a lot harder than it needs to be even in trivial deployments. At the end of the day, if ops teams can’t scale docker because of a poor network implementation it simply won’t get adopted in a lot of institutions.

The downside to all of this is if we execute on the above, that we have two first-class network solutions, both of which have to be meticulously maintained regularly, and devs and ops may have an impedance mismatch between dev and prod. I think that’s an acceptable trade for “it just works” on the dev side, as painful as it might end up being for docker maintainers. Ops can always create a staging environment (As they should) if they need to test network capabilities between alternatives, or help devs configure openvswitch if that’s absolutely necessary.

I would like to take plugin discussion to the relevant pull requests instead of here; I think it's distracting from the discussion. Additionally, I think the people behind the work on the plugin system are not specifically focused on networking but on a wider goal, so the best place to have that discussion is there.

I hope this was useful. :)

-Erik


mavenugo Nov 5, 2014

Contributor

@thockin @jainvipin @shykes I just want to bring to your attention that this proposal tries to bring in a solid foundation for network plumbing and in no way precludes higher-order orchestrators from adding more value on top. I think adding more details on the API and integration will help clarify some of these concerns.

From the past, we have some deep scars from approaches that let non-native solutions dictate the basic plumbing model, leading to crippled default behavior and a fractured community.
This proposal is to make sure we have considered all the defaults that must be native to Docker and are not dependent on external orchestrators to define the basic network plumbing. Docker being the common platform, everyone should be able to contribute to the default feature-set and benefit from it.


mavenugo Nov 5, 2014

Contributor

@Lukasa Please refer to a couple of important points in this proposal that address exactly the points you raise:

"Our experience leads us towards using similar consistency protocol such as a tenant aware BGP in order to achieve the worry free environment developers and operators desire. This also presents an evolvable architecture if a tighter coupling into the native network is of value in the future."

"By extending L3 to the true edge of the network in the vSwitch it enables a proven network scale while still retaining the ability to perform disaggregated network services on the edge. Extending gateway protocols to the host will play a significant role in scaling a tight coupling to the network architecture."

Please refer to #8952 which provides the details on how a driver / plugin can help in choosing appropriate networking backend. I believe that is the right place to bring the discussion on including an alternative choice of another backend that will fit best in a certain scenarios.

This proposal is to explore all the multi-host networking options and the native Docker integration of those features.


mavenugo Nov 5, 2014

Contributor

@erikh Thanks for weighing in. Is there anything specific in the proposal that leads you to believe it will make the application developer's life more complex? We wanted to provide a holistic view of the network operations and choices in a multi-host production deployment, and hence the proposal description became network-operations heavy. I just wanted to assure you that it will in no way expose any complexity to the application developers.

One of the primary goals of Docker is to provide seamless and consistent mechanism from dev to production. Any impedance mismatch between dev and production should be discouraged.

+1 to "I think it’s safe to say developers don’t care about openvswitch, they care that they can start postgres and rails and they just work together."
The discussion on OVS vs Linux Bridge + IPTables is purely a infra level discussion and shouldn't impact the application developers in any way. Also that discussion should be kept under #8952.

This proposal is to bring multi-host networking Native to Docker, Transparent to Developers and Friendly to Operations.


rade Nov 5, 2014

@shykes

absolutely we will need to couple this with a plugin architecture

+1

I reckon that architecturally there are three layers here...

  1. generic docker plug-in system
  2. networking plug-in API, sitting on top of 1)
  3. specific implementation of 2), e.g. based on OVS, user-space, docker's existing bridge approach, our own (weave), etc.

Crucially, 2) must make as few assumptions as possible about what docker networking looks like, such as to not artificially constrain/exclude different approaches.

As a strawman for 2), how about wiring a ConfigureContainerNetworking(<container>) plug-in invocation into docker's container startup workflow just after the docker container process (and hence network namespace) has been created?

@dave-tucker Is this broadly compatible with your thinking on #8952?
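The strawman above (a ConfigureContainerNetworking hook invoked right after the container process and its network namespace exist) could be wired into the startup workflow roughly like this. All names here are illustrative; this is not Docker's real startup code.

```python
# Strawman: a pluggable networking hook in the container startup workflow.

events = []

def default_networking_plugin(container):
    # A real plugin would enter the netns and plumb interfaces here.
    events.append(f"configured networking for {container}")

def start_container(container, networking_plugin=default_networking_plugin):
    # Step 1: container process created, hence the netns exists.
    events.append(f"created netns for {container}")
    # Step 2: the plug-in point proposed above.
    networking_plugin(container)
    # Step 3: container entrypoint runs with networking in place.
    events.append(f"started {container}")

start_container("c1")
print(events)
```

The key property is ordering: the hook fires after the namespace exists but before the workload starts, so any backend (OVS, bridge, weave, etc.) can be substituted by passing a different callable.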


MalteJ Nov 5, 2014

Contributor

I would like to see a simple but secure standard network solution (e.g. preventing ARP spoofing; the current default config is vulnerable to this). It should be easy to replace by something more comprehensive, and there should be an API that you can connect to your network management solution.
I don't want to put everything into Docker; that sounds like a big monolithic monstrosity.
I am OK with a simple default OpenVSwitch setup.
With OVS the user will find lots of documentation and lots of configuration possibilities, if they like to dig in.


titanous Nov 5, 2014

Contributor

I'd like to see this as a composable external tool that works well when wrapped up as a Docker plugin, but doesn't assume anything about the containers it is working with. There's no reason why this needs to be specific to Docker. This also will require service discovery and cluster communication to work effectively, which should be a pluggable layer.


dave-tucker Nov 5, 2014

Contributor

@erikh "developers don't care about openvswitch" - I agree.

Our solution is designed to be totally transparent to developers such that they can deploy their rails or postgres containers safe in the knowledge that the plumbing will be taken care of.

The other point of note here is that the backend doesn't have to be Open vSwitch - it could be whatever so long as it honours the API. You could theoretically have multi-host networking using this control plane, but linux bridge, iptables and whatever in the backend.

We prefer OVS, the only downside being that we require "openvswitch" to be installed on the host, but we've wrapped up all the userland elements in a docker container - the kernel module is available in 3.7+


dave-tucker Nov 5, 2014

Contributor

@rade yep, the philosophy is exactly the same. Let's head on over to #8952 to discuss.


nerdalert Nov 5, 2014

Contributor

Hi @MalteJ, Thanks for the feedback.
"And there should be an API that you can connect to your network management solution."

  • A loosely coupled management plane is definitely something that shouldn't affect the potential race conditions, performance or scale of deployments, other than some policy float.
  • The basic building blocks proposed are to ensure a container can have networking provisioned with as little latency as possible, which is ultimately local to the node. Once provisioned, the instance is eventually consistent with updates to its peers.
  • The potential network density in a host is a virtual port density multiplier beyond anything to date in a server, and is typically solved in networking today with purpose-built network ASICs for packet forwarding. This is why we are very passionate about Docker having the fundamental capabilities of an L3 switch, complete with an in-kernel fastpath or OVS actuated in hardware (e.g. Intel), along with L4 flow services in OVS for performance and manageability, to reduce as much risk as possible. The reasonable simplicity of a well-known network consistency model feels very right to those of us who have ever been measured on service uptime. Implementing natively in Docker captures a handful of the dominant network architectures out of the box, which reflects a Docker community core value of being easy to deploy, develop against and operate.
@maceip

maceip commented Nov 5, 2014

Wanted to drop in and mention an alternative to VxLAN: GUE -> an in-kernel, L3 encap solution recently (soon to be?) merged into Linux: torvalds/linux@6106253
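For anyone wanting to experiment with it, GUE can be driven from userspace through iproute2's foo-over-UDP support on kernels >= 3.18. A minimal sketch (requires root; all ports and addresses here are illustrative):

```shell
# Receive side: terminate GUE on UDP port 5555
ip fou add port 5555 gue

# Transmit side: an IPIP tunnel whose packets are GUE-encapsulated in UDP
ip link add name gue0 type ipip remote 10.0.0.2 local 10.0.0.1 \
    encap gue encap-sport auto encap-dport 5555
ip link set gue0 up
ip addr add 192.168.100.1/24 dev gue0
```

A mirrored configuration on the peer (swapping local/remote) gives an L3 point-to-point overlay with no OVS involved.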


@c4milo

c4milo commented Nov 5, 2014

@maceip agreed with you. It seems to me that an efficient and minimal approach to networking in Docker would be using VXLAN + DOVE extensions or, even better, GUE. I'm inclined to think that OVS is too much for containers but I might be just biased.


@maceip

maceip commented Nov 5, 2014

Given my limited experience, I don't see a compelling reason to do anything in L2 (ovs/vxlan). Is there an argument explaining why people want this? Generic UDP Encapsulation (GUE) seems to provide a simple, performant solution to this network overlay problem, and scales across various environments/providers.


@shykes

shykes commented Nov 5, 2014

Collaborator

@maceip @c4milo isn't GUE super new and poorly supported in the wild? Regarding vxlan+dove, I believe OVS can be used to manage it. Do you think we would be better off hitting the kernel directly? I can see the benefits of not carrying the entire footprint of OVS if we only use a small part of it - but that should be weighed against the difficulty of writing and maintaining new code. We faced a similar tradeoff between continuing to wrap lxc, or carrying our own implementation with libcontainer. Definitely not a no-brainer either way.


@maceip

maceip commented Nov 5, 2014

@shykes correct, GUE would only be available in kernels >= 3.18. I realize this limits its applicability but wanted to make sure it was on your radar nonetheless. OVS is a nightmare; it's like they reimplemented libc...


@c4milo

c4milo commented Nov 5, 2014

@shykes What do you mean by poorly supported? It just landed in the mainline kernel about a month ago and it is being worked on by Google.

Regarding VXLAN+DOVE, it certainly can be managed by OVS and I believe work to integrate it into OVS already started as well as into OpenDaylight.

I guess the decision comes down to the sort of networking Docker wants to provide. You can get as crazy as you want with things like Opendaylight, OpenContrail, OVS and the like, or use something simpler/lighter like VXLAN+DOVE or GUE which wouldn't have a fancy control plane or monitoring but that gets the job done too.


@shykes

shykes commented Nov 5, 2014

Collaborator

By "poorly supported" I simply mean very few machines with Docker installed currently support it.


@rmustacc

rmustacc commented Nov 5, 2014

As someone else who's at the coalface of building overlay networks based
on vxlan and thinking through some of the abstractions that might make
sense, I think that this is a useful first step. As a result, it's
raised a bunch of questions for me that I'd like to discuss, if this
should instead be directed elsewhere, let me know, but it feels pretty
central to the issue of Multi-Host Networking.

I'd like to approach this issue from a slightly different perspective
and focus less on how this is implemented in terms of the data plane of
the networking stack, but rather start from the perspective of a user
and what they'd actually like to build based on what our users are
building today at Joyent.

As folks are trying to migrate their existing applications into the
world of docker containers, there are a bunch of things that they do
from a networking perspective that aren't quite captured in multi-host
network deployment scenarios. The first is representing the following
classic networking topology that involves every instance existing on two
networks, with a distinct lb vlan, web vlan, and db vlan:

                   +----------+
                   | Internet |
                   +----------+
                  /            \
         +------------+    +------------+
         |            |    |            |
         |    Load    |    |    Load    |
         |  Balancer  |    |  Balancer  |
         |            |    |            |
         +------------+    +------------+
               |                  |
               |                  |
 +--------------------------------------------------------+
( )  VLAN                                                 |
 +--------------------------------------------------------+
      |           |            |            |           |
      |           |            |            |           |
  +------+    +------+     +------+     +------+    +------+
  | Web  |    | Web  |     | Web  |     | Web  |    | Web  |
  | Head |    | Head |     | Head |     | Head |    | Head |
  +------+    +------+     +------+     +------+    +------+
      |           |            |            |           |
      |           |            |            |           |
 +--------------------------------------------------------+
( )  VLAN                                                 |
 +--------------------------------------------------------+
            |               |              |
      +----------+    +----------+    +----------+
      | Database |    | Database |    | Database |
      +----------+    +----------+    +----------+

I believe that this use case is actually highly prevalent for a lot of
applications and represents a very common deployment model. As we move
to the world of Multi-Host Networking, these actually become important
and I think it's worth us taking a critical look at that before we bake
the backend implementation, as it may foreclose us on actually being
able to enable these cases.

From our observations, there are a bunch of open questions in the world
of Multi-Host Networking:

  • How do we specify multiple interfaces to a container to allow it to be
    on multiple networks?
  • How do we assign IP addresses or leave them to the IP Address
    Management System talked about in the proposal?

One of the abstractions that other orchestration and cloud providers
have is the notion of a logical network, which consists of some IPv4
or IPv6 subnet, a set of IPs that are usable inside of that subnet for
Virtual Machines and containers, and optional information that applies
to that network, such as gateways, additional routes, resolvers, etc.
Whether docker wants to have an abstraction like this that can be
integrated or just work in terms of the raw pieces like it does today,
seems like an open question.

From what we've done and what we've had customers ask us for, they often
want to be able to logically create those networks, but not always
manage it. Most are pretty happy with the IP address management system
assigning IPs, but some also want to select the IP address directly.
So before we go too much further into discussion about which technology
we should use in the backend, let's spend some time thinking about how
we want to actually use this from a CLI perspective when we're in the
world of Multi-Host Networking and our overlay networks allow us to have
multiple independent virtual L2 and L3 domains on the same host.

So in conclusion, while the discussion about how all this can be
implemented and the different overlay technologies we have available is
rather useful, we need to really step back and ask ourselves, what is it
we want our users to be able to do with this functionality first.
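To make the CLI question concrete, a hypothetical command set for the logical-network abstraction sketched above could look something like the following. Every command name and flag here is invented purely for illustration and is not an existing Docker interface:

```shell
# Create a logical network: the subnet, usable range, gateway and
# resolvers all live with the network object, not the container
docker network create --subnet 10.1.0.0/24 --gateway 10.1.0.1 \
    --dns 10.1.0.53 web-vlan
docker network create --subnet 10.2.0.0/24 db-vlan

# Attach a container to two networks (two vNICs), letting IPAM pick
# one address and pinning the other explicitly
docker run --net web-vlan --net db-vlan --ip db-vlan:10.2.0.15 myapp
```

Something in this shape would let most users stay with IPAM-assigned addresses while still covering the "I want to pick the IP" case.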


@jainvipin

jainvipin commented Nov 5, 2014

@mavenugo I am convinced that the proposal doesn't preclude higher-order orchestration from adding more value; on the contrary, this proposal may require an orchestrator to do that (which I am okay with).
The point I was bringing up is that if we need to bring the entire data/control/management plane inside Docker natively, then the technical trade-offs should be discussed. Maybe what I am concerned about are not technical/architectural issues at all, if you or someone can address the concerns. So far I am hearing from Dave, Brent and you that you 'believe' in native integration, and I trust that your assessment is based on good technical merits; it is just that I want to know and be convinced of the reasons too. The first three specifics to discuss could be:

  • Increasing the code footprint of Docker: almost all of the benefits you talked about in this proposal can be achieved with OVS without natively integrating the data/control/management plane, assuming the work is done outside in an orchestrator. Can you point out some that are otherwise not possible?
  • Compatibility/versioning: do we require the Docker version to be compatible with the orchestrator's API version?
  • Inefficiency due to the extra hop through Docker: if I have to manage an OVS via the OVSDB/OF-CTL interface, why take an extra hop via Docker, especially if Docker doesn't need to parse/understand the network configuration?

@thockin @jainvipin @shykes I just want to bring your attention to the point that this proposal tries to bring in solid foundation for network plumbing and is in no way precludes higher order orchestrators to add more value on top. I think adding more details on the API and integration will help clarify some of these concerns.
From the past, we have some deep scars in approaches that lets non-native solutions dictate the basic plumbing model, leading to a crippled default behavior and it fractures the community.
This proposal is to make sure we have considered all the defaults that must be native to Docker and not dependent on external orchestrators to define the basic network plumbing. Docker being the common platform, everyone should be able to contribute to the Default feature-set and benefit out of it.


@jainvipin

jainvipin commented Nov 5, 2014

@nerdalert:
+1 on the problem (potential virtual port density, etc.), the use of OVS for performance/manageability in a feature-rich, production-grade system, and possible HW leverage.

Would all those benefits not come if the OVS control/data/mgmt plane is not natively integrated into Docker but is completely orchestrated from outside to provide the network intent? Given that the solution requires some network orchestrator/controller to talk to it, the simplicity comes from that entity/integration and perhaps not from native Docker integration. Having OVS as the default Docker bridge is good, but that may still not require full native integration.

The potential network density in a host is a virtual port density multiplier beyond anything to date in a server, and is typically solved in networking today with purpose-built network ASICs for packet forwarding. This is why we are very passionate about Docker having the fundamental capabilities of an L3 switch, complete with a fastpath in kernel or OVS actuated in HW (e.g. Intel), along with L4 flow services in OVS for performance/manageability, to reduce as much risk as possible. The reasonable simplicity of a well-known network consistency model feels very right to those of us who have ever been measured on service uptime. Implementing natively to Docker captures a handful of the dominant network architectures out of the box, which reflects a Docker community core value of being easy to deploy, develop against and operate.


@MalteJ

MalteJ commented Nov 5, 2014

Contributor

@jainvipin agree
Also, I think part of Docker's success is that you can use it as a tool: use it for different things and in different ways, just as you like.
Docker shouldn't get too big. If you want to add functionality, add APIs and build an ecosystem (and maybe earn some money with that).


@mavenugo

mavenugo commented Nov 5, 2014

Contributor

@jainvipin @MalteJ I can see the disconnect with your understanding of the proposed solution. I will update the proposal with these details.

When we say native, we mean native control of the network backend (linux bridge / IP-Tables or OVS or other backend) from the plugin layer (#8952).
The control and mgmt mechanisms such as netlink, ovsdb, OF are all APIs exposed by the backends and will be used by the plugin/driver in order to manage that backend. We don’t need an external orchestrator to manage them.

This will keep the footprint small and free of external dependencies in order to get the network plumbing taken care of.
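For illustration, the operations such a driver would issue over netlink and OVSDB are the same ones these standard CLI tools perform today (requires root; the interface and bridge names below are examples, not the proposed implementation):

```shell
# Linux bridge backend: these `ip` commands are thin wrappers over netlink
ip link add name br-demo type bridge
ip link add veth-host type veth peer name veth-cont
ip link set veth-host master br-demo
ip link set br-demo up

# OVS backend: ovs-vsctl speaks OVSDB to the local switch daemon
ovs-vsctl add-br ovs-demo
ovs-vsctl add-port ovs-demo veth-host
```

A native driver would make these calls directly over the netlink socket or the OVSDB management protocol rather than shelling out, but the plumbing being driven is identical.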


@mavenugo

mavenugo commented Nov 5, 2014

Contributor

@jainvipin

"Given that the solution requires some network orchestrator/controller to talk to it,"

Is there anything specific in the proposal that made you believe that the solution is based on a controller?


@mavenugo

mavenugo commented Nov 5, 2014

Contributor

@c4milo

"I guess the decision comes down to the sort of networking Docker wants to provide. You can get as crazy as you want with things like Opendaylight, OpenContrail and the like, or use something simpler like VXLAN+DOVE or GUE which wouldn't have a fancy control plane or monitoring but that gets the job done too."

The proposal is trying to find that simplicity for Multi-Host Docker Networking without the need for external controllers to manage network plumbing, and at the same time not sacrificing functionality and performance. Please refer to @dave-tucker's comment on the performance comparisons. (We have more data to share on these comparisons shortly.)

Also I would recommend jumping to #8952 to discuss on the actual back-end choices via plugin model and we can hash out the API details together.


@thewmf

thewmf commented Dec 3, 2014

@liljenstolpe You can do L3-only over VXLAN if you want; choosing a different encapsulation format will disable hardware offload. Likewise OVS with learning disabled can be used as an L3-only vRouter.

Semantics and implementation are orthogonal in many ways, so maybe we should have a more focused discussion on desired semantics for the "batteries included" plugin first and then worry about the implementation. Obvious semantic questions are:

  • L2 vs. L3
  • multicast enabled or not
  • overlapping vs. global IP addressing
  • subnet-based or group-based connectivity
  • 1 vNIC vs. multiple vNICs per container

(Disclosure: IBM. We make SDN-VE and OpenDOVE.)
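On the L3-only-over-VXLAN point above: the kernel's native VXLAN device can be created with data-plane MAC learning disabled and remote endpoints programmed statically from a control plane. A sketch with standard iproute2 commands (requires root; the VNI and addresses are illustrative):

```shell
# VXLAN endpoint with MAC learning turned off
ip link add vx0 type vxlan id 42 dstport 4789 local 10.0.0.1 nolearning
ip link set vx0 up

# Forwarding entries are pushed rather than learned; the all-zeros MAC
# acts as a default/flood entry toward a remote VTEP
bridge fdb append 00:00:00:00:00:00 dev vx0 dst 10.0.0.2
```

This keeps the VXLAN encapsulation (and its hardware offload) while behaving as a control-plane-driven L3-style fabric rather than a flood-and-learn L2 domain.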


@danehans

danehans commented Dec 4, 2014

@liljenstolpe I agree and was not trying to imply OVS or VXLAN are the only considerations. I agree with @NetCubist that kernel VXLAN + SD can be a good enough solution. My preferred direction is to leave default networking as-is and use the plugins model to implement any additional networking functionality, but it sounds like Docker has already made their decision.


@liljenstolpe

liljenstolpe commented Dec 16, 2014

@thewmf The question is, in an L3-only network, do you NEED an overlay? In L2 networks you certainly do, and in some cases (such as L3 address overlap) an overlay network can address "issues"; however, overlays are not the only solution, and in the general case (say 90% of the traffic in a scale-out environment) they are probably not necessary. Therefore, do we want to assume that they will be present? It's an additional "cost" that may not always be justified.

@danehans The question is whether we think that overlays are the base. If so, it burdens the environment when it's not always necessary.


danehans commented Dec 16, 2014

@liljenstolpe I think it's hard to define what is needed without detailed requirements to build against. One cloud provider may say that supporting overlapping IPs is a requirement but another may say it's not needed. This is a good example of why we, as a community, need to clearly define the requirements. Thus far, high-level analogies are the only thing to build against.

thewmf commented Dec 16, 2014

@liljenstolpe @danehans Agreed. Different requirements will lead to different implementations, which is why I suggested that we discuss requirements. I don't think it makes sense to lock in any technology unless it is needed.

I am working in an environment where we want to allow customers to bring their own possibly-overlapping IP addresses so we are definitely looking at overlays, but we can use a plug-in for that. But I'd like to hear people's opinions on the future of default networking. I would like to see Docker move away from NAT and port mapping, but I'm not sure how to do that on random developers' laptops. Maybe IPv6 ULAs... can people stomach that?

unclejack commented Jan 9, 2015

There's an official proposal for networking drivers which can be found at #9983.
The architecture presented in the proposal would also enable multi-host networking for multiple Docker daemons.

This new proposal implements an architecture which has been discussed quite a bit. Implementing a proof of concept of the network drivers was also part of this effort.
We're not suggesting that the previous proposals were of lower quality or required less effort. However, in addition to being good, the design also had to be accepted by everyone and validated with a proof of concept.

Should you discover something is confusing or missing from the new proposal, please feel free to comment.
If you'd like to continue the discussion, please comment on #9983. Please make sure to stay on topic and try to avoid writing long comments (or too many). This would help make it easier for everyone who's following the discussion.

Questions and lengthy discussions are more adequate for the #docker-network channel on freenode. Should you just want to talk about this, that is a better place to have the conversation.

We'd like to thank everyone who's provided input, especially those who've sent proposals. I will close this proposal now.

unclejack closed this Jan 9, 2015

phemmer commented Jan 9, 2015

Does this mean docker has no intention of developing/supporting multi-host networking natively? #9983 is just for the creation of a driver scheme, and not the specific goal of multi-host networking. If multi-host networking is still a goal, I would have expected this proposal to remain open, and for it to utilize #9983.

erikh commented Jan 9, 2015

@phemmer we have a vxlan implementation in our PoC already. It's not very good, but yes, this is intended to be supported first-class.

erikh commented Jan 9, 2015

We're reopening this after some discussion with @mavenugo pointing out that our proposal is not a solution for everything in here -- and it should be much closer.

We want this in docker and we don't want to communicate otherwise. So, until we can at least mostly incorporate this proposal into our new extension architecture, we will leave it open and solicit comments.

erikh reopened this Jan 9, 2015

c4milo commented Jan 9, 2015

@erikh would you mind giving us the main takeaways after your discussion with @mavenugo?

mavenugo commented Jan 9, 2015

@c4milo following is the docker-network IRC log between us regarding reopening the proposal.

madhu: erikh: backjlack thanks for all the great work
[06:12am] madhu: on closing the proposals
[06:13am] madhu: 9983 replaces 8952 and hence closing is accurate
[06:13am] madhu: but imho 8951 should be still open because it is beyond just drivers
[06:13am] madhu: but a generic architecture for all the considerations for a multi-host scenario
[06:14am] madhu: we can close it once all the scenarios are addressed. through other proposals or through 8951
[06:14am] backjlack: madhu: Personally, I'd rather see 9983 implemented and then revisit 8951 to request an update.
[06:15am] madhu: backjlack: okay. if that is the preferred approach sure
[06:15am] erikh: gh#8951
[06:15am] erikh: hmm.
[06:15am] erikh: need to fix that.
[06:15am] confounds joined the chat room.
[06:15am] madhu: keeping it open is actually better imho
[06:15am] erikh: hmm
[06:16am] erikh: backjlack: do you have any objections to keeping it open? madhu does have a pretty good point here.
[06:16am] erikh: we can incorporate it and close it if we feel necessary later
[06:16am] madhu: exactly. that way we can easily answer the questions that are raised
[06:17am] backjlack: erikh: My main concern is that it's more of a discussion around adding OVS support.
[06:17am] erikh: hmm
[06:17am] erikh: ok. let me review and get back to you guys.
[06:17am] madhu: thanks erikh backjlack
[06:17am] madhu: backjlack: just curious. is there any trouble in keeping it open vs closed ?
[06:18am] erikh: hmm
[06:19am] erikh: the only concern I have is that with several networking proposals that we're accidentally misleading our users
[06:19am] backjlack: madhu: If it's open, people leave comments like this one: #8952 (comment)
[06:19am] backjlack: They're under the impression nobody cares about implementing that and it's very confusing.
[06:20am] erikh: hmm
[06:20am] erikh: backjlack: let's leave it open for now
[06:20am] madhu: backjlack: okay good point
[06:20am] madhu: but we were waiting on the extensions to be available
[06:20am] erikh: if we incorporate everything into the new proposal, we will close it.
[06:20am] erikh: (And we can work together to fit that goal)
[06:20am] madhu: now that we are having the momentum, there will be code backing this all up
[06:20am] jodok joined the chat room.
[06:20am] madhu: thanks erikh that would be my suggestion too
[06:21am] erikh: backjlack: WDYT? I think it's reasonable to let people know (by example) we're trying to solve the problem, even if our answers don't necessarily line up with that proposal
[06:22am] backjlack: erikh: Sure, we can reopen the issue and update the top level text to let people know this is going to be addressed after #9983 gets implemented.
[06:22am] erikh: yeah, that's a good idea.
[06:22am] erikh: madhu: can you drive updating the proposal and referencing our new one as well?
[06:23am] erikh: I'll reopen it.
[06:23am] madhu: yes sir.
[06:23am] madhu: thanks guys. appreciate it

c4milo commented Jan 9, 2015

@mavenugo nice, thank you, it makes more sense now :)

bmullan commented Feb 5, 2015

Related to VxLAN and network overlays: the stumbling block to implementation/deployment was always the requirement for multicast to be enabled in the network... which is rare.

Last year Cumulus Networks and MetaCloud open sourced VXFLD to implement VxLAN with unicast and UDP.

They also submitted it for consideration as a standard.

MetaCloud has since been acquired by Cisco Systems.

VXFLD consists of 2 components that work together to solve the BUM (Broadcast, Unknown unicast & Multicast) problem with VxLAN by using unicast instead of the traditional multicast.

The 2 components are called VXSND and VXRD.

VXSND provides:
  • unicast BUM packet flooding via the Service Node Daemon (the SND in VXSND)
  • VTEP (Virtual Tunnel End-Point) "learning"

VXRD provides:
  • a simple Registration Daemon (the RD in VXRD) designed to register local VTEPs with a remote vxsnd daemon

The source for VXFLD is on GitHub: https://github.com/CumulusNetworks/vxfld

Be sure to read the two .RST files in the GitHub VXFLD directory, as they describe the two VXFLD daemons, VXRD and VXSND, in more detail.

I thought I'd mention VXFLD as it could potentially solve part of your proposal and... the code already exists.

If you use Debian or Ubuntu, Cumulus also provides three pre-packaged .deb files for VXFLD:

http://repo.cumulusnetworks.com/pool/CumulusLinux-2.2/main/vxfld-common_1.0-cl2.2~1_all.deb

http://repo.cumulusnetworks.com/pool/CumulusLinux-2.2/main/vxfld-vxrd_1.0-cl2.2~1_all.deb

and
http://repo.cumulusnetworks.com/pool/CumulusLinux-2.2/main/vxfld-vxsnd_1.0-cl2.2~1_all.deb
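For comparison with what VXFLD automates, the Linux kernel's vxlan driver can already be run in unicast mode by hand. A minimal sketch (the host addresses 192.0.2.10/192.0.2.20 and the VNI are illustrative), using a static FDB entry in place of VXFLD's dynamic VTEP registration:

```shell
# On host A (192.0.2.10): create a VXLAN interface with no multicast group.
# "nolearning" disables source-address learning; forwarding entries are
# managed statically instead.
ip link add vxlan42 type vxlan id 42 dstport 4789 local 192.0.2.10 nolearning
# Flood BUM traffic to the one known remote VTEP: the all-zero MAC acts as
# the default flood entry.
bridge fdb append 00:00:00:00:00:00 dev vxlan42 dst 192.0.2.20
ip link set vxlan42 up
# Host B mirrors this with the local/dst addresses swapped.
```

VXFLD's vxsnd/vxrd pair effectively maintains such flood entries for you as VTEPs come and go.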

rcarmo commented Feb 28, 2015

I'd like to chime in on this. I've been trying to put together a few arguments for and against doing this transparently to the user, and coming from a telco/"purist SDN" background it's hard to strike a middle ground between ease of use for small deployments and the kind of infrastructure we need to have it scale up into (and integrate with) datacenter solutions.

(I'm rather partial to the OpenVSwitch approach, really, but I understand how weave and pipework can be appealing to a lot of people)

So here are my notes:


This is just a high-level overview of how software-defined networking might work in a Docker/Swarm/Compose environment, written largely from a devops/IaaS perspective but with a fair degree of background on datacenter/telco networking infrastructure, which is fast converging towards full SDN.

There are two sides to the SDN story:

  • Sysadmins running Docker in a typical IaaS environment, where a lot of the networking is already provided for (and largely abstracted away) but where there's a clear need for communicating between Docker containers in different hosts.
  • On-premises telco/datacenter solutions where architects need deeper insight/control into application traffic or where hardware-based routing/load balancing/traffic shaping/QoS is already being enforced.

This document will focus largely on the first scenario and a set of user stories, with hints towards the second one at the bottom.

Offhand, there are two possible approaches from an end-user perspective:

  • Extending the CLI linking syntax and have the system build the extra bridge interfaces and tunnels "magically" (preserves the existing environment variable semantics inside containers)
  • Exposing networks as separate entities and make users aware of the underlying complexity (requires extra work for simple linking, may need extra environment variables to facilitate discovery, etc.).

This is largely described in http://www.slideshare.net/adrienblind/docker-networking-basics-using-software-defined-networks already, and is what pipework was designed to do.

Arguments for Keeping Things Simple (Sticking to Port Mapping)

Docker's primary networking abstraction is essentially port mapping/linking, with links exposed as environment variables to the containers involved - that makes application configuration very easy, as well as lessening CLI complexity.

Steering substantially away from that will shift the balance towards "full" networking, which is not necessarily the best way to go when you're focused on applications/processes rather than VMs.

Some IaaS providers (like Azure) provide a single network interface by default (which is then NATed to a public IP or tied to a load balancer, etc.), so the underlying transport shouldn't require extra network interfaces to work.

Arguments for Increasing Complexity (Creating Networks)

Docker does not exist in a vacuum. Docker containers invariably have to talk to services hosted in more conventional infrastructure, and Docker is increasingly being used (or at least proposed) by network/datacenter vendors as a way to package and deploy fairly low-level functionality (like traffic inspection, shaping, even routing) using solutions like OpenVSwitch and custom bridges.

Furthermore, containers can already see each other internally to a host - each is provided with a 172.17.0.0/16 IP address, which is accessible from other containers. Allowing users to define networks and bind containers to networks rather than solely ports may greatly simplify establishing connectivity between sets of containers.

Middle Ground

However, using Linux kernel plumbing (or OpenVSwitch) to provide Docker containers with what amount to fully-functional network interfaces implies a number of additional considerations (like messing with brctl) that may have unforeseen (and dangerous) consequences in terms of security, not to mention the need to eventually deal with routing and ACLs (which are currently largely the host's concern).

On the other hand, there is an obvious need to restrict container (outbound) traffic to some extent, and a number of additional benefits that stem from providing limited visibility onto a network segment, internal or otherwise.

Minimal Requirements:

There are a few requirements that seem fairly obvious:

  • Docker containers should be able to talk to each other inside a swarm (i.e., a pre-defined set of hosts managed by Swarm) regardless of in which host they run.
  • That communication should have the least possible overhead (but, ideally, use a common enough form of encapsulation - GRE, IPoIP - that allows network teams to inspect and debug on the LAN using common, low-complexity tools)
  • One should be able to completely restrict outbound communications (there is a strong case to do that by default, in fact, since a compromised container may be used to generate potentially damaging traffic and affect the remainder of the infrastructure).

Improvements (Step 1):

  • Encrypted links when linking between Swarm hosts on open networks (which requires extra setup effort)
  • Limiting outbound traffic from containers to specific networks or hosts (rather than outright on/off) is also desirable (but, again, requires extra setup)

Further Improvements (Step 2):

  • Custom addressing and bridging for allowing interop with existing DC solutions
  • APIs for orchestrating and managing bridges, vendor interop.

Likely Approaches (none favored at this point):

  • Wrap OpenVSwitch (or abstract it away) into a Docker tool
  • Have two tiers of network support, i.e., beef up pipework (or weave) until it's easier to use and allow for custom OpenVSwitch-like solutions
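The "wrap OpenVSwitch" approach in the last list can be sketched in a few commands (the bridge name and the peer address 192.0.2.20 are illustrative): a GRE port on a per-host OVS bridge stitches the container bridges on two hosts into one L2 segment, and GRE keeps the traffic inspectable with ordinary capture tools, per the debuggability requirement above.

```shell
# On host A: create the OVS bridge that containers will attach to, then
# add a GRE tunnel port pointing at host B.
ovs-vsctl add-br ovsbr0
ovs-vsctl add-port ovsbr0 gre0 -- set interface gre0 type=gre options:remote_ip=192.0.2.20
# Host B runs the same two commands with remote_ip pointing back at host A.
# Container veth interfaces are then attached to ovsbr0 (e.g. via pipework).
```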
mk-qi commented Mar 20, 2015

Hello everyone,

[image: docker-muilt]

I set up docker0 on hosta and hostb in the same network via VXLAN, and the hosts can ping each other, but Docker always allocates the same IPs on hosta and hostb. Is there any way, plugin, or hack that would help me check whether an IP already exists?

thockin commented Mar 20, 2015

You need to pre-provision each docker0 with a different subnet range. Even then you probably will not be able to ping across them unless you also add your eth0 as a slave on docker0.

Read this: http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/

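The per-host subnet pre-provisioning described above can be done with the daemon's --bip flag, which sets docker0's address and netmask (the subnets shown are illustrative; docker -d is the daemon invocation of this era):

```shell
# hosta: docker0 and its containers live in 172.17.1.0/24
docker -d --bip=172.17.1.1/24
# hostb: a disjoint range, so container IPs cannot collide across hosts
docker -d --bip=172.17.2.1/24
```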

fzansari Mar 20, 2015

@mk-qi : You can use "arping" which is essentially a utility to discover if an IP is already in use within a network. Thats the way you can make sure docker does not use the same set of IPs when its "over" multiple Hosts.
Or another way is to statically assign IPs yourself to each docker

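The arping suggestion could be wrapped into a pre-allocation check. A hedged Python sketch, assuming the iputils `arping` binary is installed and the process has the privileges to send ARP probes; the interface name and fallback behaviour here are my assumptions, not anything Docker provides:

```python
import shutil
import subprocess

def ip_in_use(ip, iface="docker0", count=2):
    """Probe the local segment for `ip` with arping's duplicate
    address detection (-D). Returns True/False, or None when the
    arping binary is unavailable."""
    if shutil.which("arping") is None:
        return None  # cannot probe; fall back to static bookkeeping

    # With -D, arping exits 0 when no reply was received (address
    # appears free) and non-zero when some host answered.
    res = subprocess.run(
        ["arping", "-D", "-q", "-c", str(count), "-I", iface, ip])
    return res.returncode != 0

print(ip_in_use("203.0.113.1"))
```

An allocator would call this just before assigning an address and retry with the next candidate when it returns True. Note the race remains if two hosts probe the same address simultaneously, which is why a shared allocation store is the more robust fix.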

@mk-qi


mk-qi commented Mar 20, 2015

@thockin sorry, I did not draw the picture clearly. In fact eth0 is a slave of docker0, and, as I said before, the hosts can ping each other...

@shykes I saw your fork https://github.com/shykes/docker/tree/extensions/extensions/simplebridge. It looks like it pings an IP before actually assigning it, but I am not sure. Could you give more information?


@mk-qi


mk-qi commented Mar 20, 2015

@fzansari thanks for the reply. Static IP allocation is OK; in fact we have been using pipework + macvlan (+ DHCP) for some small clusters, but with many containers it is very painful to manage the IPs. Of course we could write tools, but I think hacking Docker to solve the IP conflict problem directly would make things much simpler, if that is possible.


@SamSaffron


SamSaffron commented Apr 30, 2015

Having just implemented keepalived internally, I think there would be an enormous benefit in simply implementing an interoperable VRRP protocol. It would allow Docker to "play nice" without forcing it on every machine in the network.

For example:

Host 1 (IP address 10.0.0.1):

docker run --vrrp eth0 -p 10.0.0.100:80:80 --priority 100 --network-id 10 web

Host 2 (IP address 10.0.0.2, backup service):

docker run --vrrp eth0 -p 10.0.0.100:80:80 --priority 50 --network-id 10 web

Supporting VRRP gives a very clean failover story and allows you to simply assign an IP to a service. It would take a lot to flesh out the details, but I do think it would be an amazing change.

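The failover semantics in this proposal follow VRRP's election rule: the highest priority wins, with the higher primary interface address as tie-breaker. A small illustrative Python sketch of that rule (not an actual VRRP implementation; the `--vrrp`/`--priority` flags above are the proposal's hypothetical syntax, and the priorities below mirror them):

```python
import ipaddress

def elect_master(routers):
    """Pick the VRRP master: highest priority wins; ties are
    broken by the highest primary interface address."""
    return max(routers, key=lambda r: (r["priority"],
                                       ipaddress.ip_address(r["ip"])))

# Priorities match the hypothetical docker run flags above.
hosts = [
    {"name": "host1", "ip": "10.0.0.1", "priority": 100},
    {"name": "host2", "ip": "10.0.0.2", "priority": 50},
]

print(elect_master(hosts)["name"])      # -> host1
# If host1 stops advertising, host2 takes over the virtual IP:
print(elect_master(hosts[1:])["name"])  # -> host2
```

In real VRRP the backup only takes over after the master's periodic advertisements time out; the sketch above shows only the election logic, not the advertisement timers.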

@cpuguy83


cpuguy83 commented Apr 18, 2016

Contributor

Closing, since multi-host networking, plugins, etc. have all been available since Docker 1.9.

