
Swarm/mesh workaround: --publish mode=host,target=80,published=80,protocol=tcp #34161

Closed · jefflill opened this issue Jul 18, 2017 · 6 comments

@jefflill commented Jul 18, 2017

Some folks (including me) seem to be having serious problems with the routing mesh/ingress network (for example, see #32195 and #34136). These appear in the stable releases 17.03.0-ce, 17.03.2-ce, and 17.06.0-ce. I haven't tried the intermediate edge releases.

I'm going to try to work around this by manually deploying my ingress services to all nodes as simple containers. Unfortunately, containers don't support secrets, so I'll need to modify the containers to accept secrets as environment variables. I'll also need to roll out container updates manually.

It would be nice if the docker service create command had an option that disabled the routing mesh feature, something like:

    docker service create --mesh=false --mode global --publish 80:80 ...

This would simply expose the service port on the Docker host machine and disable the fancy networking. We could then use a combination of the mode and constraint options to control where our ingress services actually run, while still getting the other advantages of services (secrets, scheduling, updates, ...).

Mesh routing is a fantastic feature, but it really needs to be airtight. It's been more than a year since 1.12 introduced it, and I'm not sure it's ever been stable.

@thaJeztah (Member) commented Jul 18, 2017

This is already possible with host-mode publishing;

docker service create \
  --publish mode=host,target=80,published=80,protocol=tcp \
  --name=web \
  --mode=global \
  nginx:alpine

Host-mode publishing publishes the container's ports directly on the host, skipping the ingress network (and thus the routing mesh).
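
One way to confirm that a service's ports really are being published in host mode (a quick sketch, assuming the web service from the example above):

# The port spec should report "PublishMode": "host" rather than "ingress".

docker service inspect web --format '{{json .Spec.EndpointSpec.Ports}}'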

@jefflill (Author) commented Jul 18, 2017

YAY! Learn something new every day!

jefflill closed this Jul 18, 2017

@thaJeztah (Member) commented Jul 18, 2017

Docs can be found here: https://docs.docker.com/engine/swarm/services/#publish-ports, but they may be a bit hard to find, so I just opened docker/docker.github.io#3913 to request better docs (contributions welcome 😄 )

jefflill changed the title from "Swarm/mesh workaround: add option service create --mesh=false" to "Swarm/mesh workaround: --publish mode=host,target=80,published=80,protocol=tcp" Jul 19, 2017

@jefflill (Author) commented Nov 29, 2018

A follow-up note for future generations:

I was finally able to fix the original ingress network problem by reducing the ingress network MTU from its default value of 1500 to 1492 bytes. This problem happens when the Docker nodes are hosted as virtual machines on hypervisors like Hyper-V and XenServer. I'm guessing that the Docker ingress network works like this:

  1. The ingress network MTU defaults to 1500 bytes.
  2. The ingress network actually reports its MTU to attached containers as 8 bytes less (the size of a VXLAN header), so 1492 bytes.
  3. The ingress network adds an 8-byte VXLAN header to each packet it routes, so the maximum packet size is 1492+8=1500 bytes. These route fine on most LANs.
  4. But when the Docker hosts run as VMs on Hyper-V or XenServer, another 8-byte VXLAN header is added and the maximum packet size becomes 1508 bytes, which is too big for most LANs.
  5. I'm guessing that the no-fragment flag must be set on these packets, so the ingress network has to send an ICMP packet back to the source asking it to split the packet, and that these ICMP packets are lost or delayed enough that we see random-seeming timeouts.

I haven't actually confirmed the theory above, but changing the MTU to 1492 fixed my problem.
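
For anyone who wants to probe for this kind of MTU problem before rebuilding the ingress network, here's a sketch of a standard path-MTU test with ping across the overlay (the target address is just a placeholder for another container or host):

# 1472 bytes of payload + 28 bytes of IP/ICMP headers = a 1500-byte packet.
# With the don't-fragment flag set, this should fail with "message too long"
# if anything on the path has an effective MTU below 1500.

ping -M do -s 1472 10.255.0.5

# Backing off by the suspected extra VXLAN overhead (1464 + 28 = 1492)
# should then succeed.

ping -M do -s 1464 10.255.0.5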

Here's the script I used to recreate the ingress network. Note that I needed to add a bit of delay between removing and recreating it to give Docker a chance to actually delete the network. The subnet and gateway settings below match the Docker defaults.

# Delete the [ingress] network.

docker network rm ingress << EOF
y
EOF

# Give the network a chance to actually be deleted.

sleep 10

# Recreate the [ingress] network with the new settings.

docker network create \
   --driver overlay \
   --ingress \
   --subnet=10.255.0.0/16 \
   --gateway=10.255.0.1 \
   --opt com.docker.network.mtu=1492 \
   ingress
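
# To double-check that the new MTU actually took, the driver option set
# above should show up when inspecting the recreated network; this should
# print something like {"com.docker.network.mtu":"1492"}.

docker network inspect ingress --format '{{json .Options}}'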

@thaJeztah (Member) commented Nov 29, 2018

For containers, the MTU is set to 1450; https://github.com/docker/libnetwork/blob/5d113d19f93b14b7b6a2b2d1f2fc1a2c1b6f9b0d/drivers/overlay/joinleave.go#L67-L70

	// Set the container interface and its peer MTU to 1450 to allow
	// for 50 bytes vxlan encap (inner eth header(14) + outer IP(20) +
	// outer UDP(8) + vxlan header(8))
	mtu := n.maxMTU()

For the ingress network, no option is set explicitly;

moby/daemon/network.go

Lines 202 to 224 in cf72051

func (daemon *Daemon) setupIngress(create *clustertypes.NetworkCreateRequest, ip net.IP, staleID string) {
	controller := daemon.netController
	controller.AgentInitWait()
	if staleID != "" && staleID != create.ID {
		daemon.releaseIngress(staleID)
	}
	if _, err := daemon.createNetwork(create.NetworkCreateRequest, create.ID, true); err != nil {
		// If it is any other error other than already
		// exists error log error and return.
		if _, ok := err.(libnetwork.NetworkNameError); !ok {
			logrus.Errorf("Failed creating ingress network: %v", err)
			return
		}
		// Otherwise continue down the call to create or recreate sandbox.
	}
	_, err := daemon.GetNetworkByID(create.ID)
	if err != nil {
		logrus.Errorf("Failed getting ingress network by id after creating: %v", err)
	}
}

moby/daemon/network.go

Lines 293 to 384 in cf72051

func (daemon *Daemon) createNetwork(create types.NetworkCreateRequest, id string, agent bool) (*types.NetworkCreateResponse, error) {
	if runconfig.IsPreDefinedNetwork(create.Name) {
		return nil, PredefinedNetworkError(create.Name)
	}
	var warning string
	nw, err := daemon.GetNetworkByName(create.Name)
	if err != nil {
		if _, ok := err.(libnetwork.ErrNoSuchNetwork); !ok {
			return nil, err
		}
	}
	if nw != nil {
		// check if user defined CheckDuplicate, if set true, return err
		// otherwise prepare a warning message
		if create.CheckDuplicate {
			if !agent || nw.Info().Dynamic() {
				return nil, libnetwork.NetworkNameError(create.Name)
			}
		}
		warning = fmt.Sprintf("Network with name %s (id : %s) already exists", nw.Name(), nw.ID())
	}
	c := daemon.netController
	driver := create.Driver
	if driver == "" {
		driver = c.Config().Daemon.DefaultDriver
	}
	nwOptions := []libnetwork.NetworkOption{
		libnetwork.NetworkOptionEnableIPv6(create.EnableIPv6),
		libnetwork.NetworkOptionDriverOpts(create.Options),
		libnetwork.NetworkOptionLabels(create.Labels),
		libnetwork.NetworkOptionAttachable(create.Attachable),
		libnetwork.NetworkOptionIngress(create.Ingress),
		libnetwork.NetworkOptionScope(create.Scope),
	}
	if create.ConfigOnly {
		nwOptions = append(nwOptions, libnetwork.NetworkOptionConfigOnly())
	}
	if create.IPAM != nil {
		ipam := create.IPAM
		v4Conf, v6Conf, err := getIpamConfig(ipam.Config)
		if err != nil {
			return nil, err
		}
		nwOptions = append(nwOptions, libnetwork.NetworkOptionIpam(ipam.Driver, "", v4Conf, v6Conf, ipam.Options))
	}
	if create.Internal {
		nwOptions = append(nwOptions, libnetwork.NetworkOptionInternalNetwork())
	}
	if agent {
		nwOptions = append(nwOptions, libnetwork.NetworkOptionDynamic())
		nwOptions = append(nwOptions, libnetwork.NetworkOptionPersist(false))
	}
	if create.ConfigFrom != nil {
		nwOptions = append(nwOptions, libnetwork.NetworkOptionConfigFrom(create.ConfigFrom.Network))
	}
	if agent && driver == "overlay" {
		nodeIP, exists := daemon.GetAttachmentStore().GetIPForNetwork(id)
		if !exists {
			return nil, fmt.Errorf("Failed to find a load balancer IP to use for network: %v", id)
		}
		nwOptions = append(nwOptions, libnetwork.NetworkOptionLBEndpoint(nodeIP))
	}
	n, err := c.NewNetwork(driver, create.Name, id, nwOptions...)
	if err != nil {
		if _, ok := err.(libnetwork.ErrDataStoreNotInitialized); ok {
			// nolint: golint
			return nil, errors.New("This node is not a swarm manager. Use \"docker swarm init\" or \"docker swarm join\" to connect this node to swarm and try again.")
		}
		return nil, err
	}
	daemon.pluginRefCount(driver, driverapi.NetworkPluginEndpointType, plugingetter.Acquire)
	if create.IPAM != nil {
		daemon.pluginRefCount(create.IPAM.Driver, ipamapi.PluginEndpointType, plugingetter.Acquire)
	}
	daemon.LogNetworkEvent(n, "create")
	return &types.NetworkCreateResponse{
		ID:      n.ID(),
		Warning: warning,
	}, nil
}
In that case, I think maxMTU() will also be used.
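
A quick way to see the MTU that a running task actually got — a sketch, with the container picked out by a name filter as an example:

# Read the interface MTU from inside a swarm task; with the defaults
# described above this reports 1450 for the overlay interface.

docker exec "$(docker ps -q --filter name=web | head -n1)" cat /sys/class/net/eth0/mtu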

@jefflill (Author) commented Nov 29, 2018

Interesting. I just read this guy's experience with the same issue from August:

https://medium.com/@sylwit/how-we-spent-a-full-day-figuring-out-a-mtu-issue-with-docker-4d81fdfe2caf

...and I've been playing around with Docker's daemon.json file, setting mtu to 1492 and redeploying a Docker swarm.

The documentation for mtu says: "Set the containers network MTU." I inspected the ingress network and a couple of other networks I created, and they all report MTU=1492. So I'm betting this might be the best way to configure the MTU, because it ensures that any networks created after swarm setup (e.g. via Docker stacks) also get the new MTU by default.
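
For reference, the daemon-level setting I've been experimenting with looks like this (restart the daemon afterwards; as I understand it, networks that already exist keep their old MTU, so the ingress network still has to be recreated separately):

# Write /etc/docker/daemon.json and restart the daemon.

cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "mtu": 1492
}
EOF

sudo systemctl restart docker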
