Bridge interface with Docker breaks Path MTU Discovery when host is using IPSEC #12565

Open
itwasntandy opened this Issue Apr 20, 2015 · 15 comments


Description of problem:
When the host system uses IPsec (libreswan) to encrypt communications, applications running inside Docker run into trouble serving files larger than (MTU - IPsec overhead): requests for such files time out.
The same applications (e.g. nginx) running outside of Docker do not have this problem.
Equally, files under (MTU - IPsec overhead), i.e. under 8920 bytes in our case, are served fine.

I think it is possible that PMTUD under Docker is simply broken, full stop, but it only becomes visible under IPsec, where the host can't just fragment the traffic.
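To make the ~8920-byte threshold concrete, here is a rough sketch of the arithmetic; the 40-byte ESP transport-mode overhead and the plain 40-byte IP+TCP headers are assumptions, not measured values:

```shell
# Why ~8920 bytes is roughly the failure threshold on this host (values assumed).
MTU=9001                 # AWS jumbo-frame MTU on eth0/docker0
ESP_OVERHEAD=40          # approx. IPsec ESP overhead in transport mode
IP_TCP_HEADERS=40        # 20-byte IPv4 header + 20-byte TCP header, no options
EFFECTIVE_MTU=$((MTU - ESP_OVERHEAD))            # largest packet the tunnel carries
MAX_PAYLOAD=$((EFFECTIVE_MTU - IP_TCP_HEADERS))  # approx. largest safe TCP payload
echo "$EFFECTIVE_MTU $MAX_PAYLOAD"               # prints: 8961 8921
```

With TCP timestamps the per-segment payload shrinks a little further, which matches files "under 8920 bytes" being served fine.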

docker version:

Client version: 1.6.0
Client API version: 1.18
Go version (client): go1.4.2
Git commit (client): 4749651
OS/Arch (client): linux/amd64
Server version: 1.6.0
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): 4749651
OS/Arch (server): linux/amd64

docker info:

Containers: 10
Images: 28
Storage Driver: aufs
 Root Dir: /mounts/xvdf/appdata/docker/aufs
 Backing Filesystem: extfs
 Dirs: 48
 Dirperm1 Supported: false
Execution Driver: native-0.2
Kernel Version: 3.13.0-46-generic
Operating System: Ubuntu precise (12.04.5 LTS)
CPUs: 1
Total Memory: 3.676 GiB
Name: ipsectest01-uw2a
ID: TBAE:WRIE:X3XH:HXZH:CNTC:VFLV:L2AS:IWQX:VPKO:6TYI:ASJ5:IYNX
WARNING: No swap limit support

uname -a:

Linux ipsectest01-uw2a 3.13.0-46-generic #75~precise1-Ubuntu SMP Wed Feb 11 19:21:25 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Environment details (AWS, VirtualBox, physical, etc.):

AWS, on m3.medium instances.
eth0 interface configured with default MTU (9001 bytes)

$ /sbin/ifconfig 
docker0   Link encap:Ethernet  HWaddr 56:84:7a:fe:97:99  
          inet addr:172.17.42.1  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::5484:7aff:fefe:9799/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9001  Metric:1
          RX packets:5276 errors:0 dropped:0 overruns:0 frame:0
          TX packets:12079 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1713691 (1.7 MB)  TX bytes:26215400 (26.2 MB)

eth0      Link encap:Ethernet  HWaddr 06:39:78:8b:30:e5  
          inet addr:172.31.16.237  Bcast:172.31.31.255  Mask:255.255.240.0
          inet6 addr: fe80::439:78ff:fe8b:30e5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9001  Metric:1
          RX packets:3558290 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2956460 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2542373946 (2.5 GB)  TX bytes:667374951 (667.3 MB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:1212019 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1212019 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:269564051 (269.5 MB)  TX bytes:269564051 (269.5 MB)

How reproducible:
Always

Steps to Reproduce:

  • Install libreswan on two host systems and configure a relationship between them: https://libreswan.org/wiki/Host_to_host_VPN
  • Ensure the IPsec tunnels are up: sudo ipsec status | grep established
  • Create a file larger than the MTU on both servers: dd if=/dev/urandom of=/var/tmp/nginx/testfile bs=512 count=24
  • Install nginx directly on one server and configure it to serve up the freshly created testfile: apt-get install nginx ; sudo start nginx ; sudo cp /var/tmp/nginx/testfile /usr/share/nginx/html
  • Bring up a Docker container with nginx on the other server: https://registry.hub.docker.com/_/nginx/
  • sudo docker run -p 80:80 -v /var/tmp/nginx:/var/www/html dockerfile/nginx
  • Retrieve the file from the nginx-only server over IPsec: curl -o /dev/null http://onlynginx/testfile
  • Attempt to retrieve the file from the nginx-under-Docker server: curl -o /dev/null http://nginxdocker/testfile
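As a sanity check, the dd step above produces a file comfortably larger than the usable payload behind IPsec (the /tmp path here is only an illustration; the repro uses /var/tmp/nginx):

```shell
# Recreate the repro's test payload and confirm its size: 512 * 24 = 12288 bytes,
# i.e. well above the ~8921-byte usable TCP payload behind the IPsec tunnel.
mkdir -p /tmp/nginx-demo
dd if=/dev/urandom of=/tmp/nginx-demo/testfile bs=512 count=24 2>/dev/null
SIZE=$(wc -c < /tmp/nginx-demo/testfile)
echo "$SIZE"   # prints: 12288
```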

Actual Results:

Curl from the plain nginx host works fine, and the file downloads very quickly:

~$ curl -o /dev/null http://onlynginx/testfile
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12288  100 12288    0     0  2002k      0 --:--:-- --:--:-- --:--:-- 2400k

Curl from the nginx under Docker never responds:

$ curl -o /dev/null http://nginxdocker/testfile
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:17:30 --:--:--     0

tcpdump on the docker0 interface of the server shows it attempting to send packets of 9001 bytes, with a TCP maximum segment size of 8961 bytes.

$ sudo tcpdump -nvv -i docker0
tcpdump: listening on docker0, link-type EN10MB (Ethernet), capture size 65535 bytes
19:44:46.082719 IP (tos 0x0, ttl 63, id 15325, offset 0, flags [DF], proto TCP (6), length 60)
    172.31.15.88.43265 > 172.17.0.1.80: Flags [S], cksum 0x90ed (correct), seq 2992186550, win 26883, options [mss 8961,sackOK,TS val 694951787 ecr 0,nop,wscale 7], length 0
19:44:46.082758 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.17.0.1.80 > 172.31.15.88.43265: Flags [S.], cksum 0x67b8 (incorrect -> 0xcb5f), seq 1826792915, ack 2992186551, win 26847, options [mss 8961,sackOK,TS val 197175082 ecr 694951787,nop,wscale 7], length 0
19:44:46.083976 IP (tos 0x0, ttl 63, id 15326, offset 0, flags [DF], proto TCP (6), length 52)
    172.31.15.88.43265 > 172.17.0.1.80: Flags [.], cksum 0x7f84 (correct), seq 1, ack 1, win 211, options [nop,nop,TS val 694951788 ecr 197175082], length 0
19:44:46.090379 IP (tos 0x0, ttl 63, id 15327, offset 0, flags [DF], proto TCP (6), length 228)
    172.31.15.88.43265 > 172.17.0.1.80: Flags [P.], cksum 0xd22f (correct), seq 1:177, ack 1, win 211, options [nop,nop,TS val 694951789 ecr 197175082], length 176
19:44:46.090395 IP (tos 0x0, ttl 64, id 766, offset 0, flags [DF], proto TCP (6), length 52)
    172.17.0.1.80 > 172.31.15.88.43265: Flags [.], cksum 0x67b0 (incorrect -> 0x7ec9), seq 1, ack 177, win 219, options [nop,nop,TS val 197175084 ecr 694951789], length 0
19:44:46.090514 IP (tos 0x0, ttl 64, id 767, offset 0, flags [DF], proto TCP (6), length 9001)
    172.17.0.1.80 > 172.31.15.88.43265: Flags [.], cksum 0x8aa5 (incorrect -> 0x7024), seq 1:8950, ack 177, win 219, options [nop,nop,TS val 197175084 ecr 694951789], length 8949
19:44:46.090599 IP (tos 0x0, ttl 64, id 768, offset 0, flags [DF], proto TCP (6), length 3646)
    172.17.0.1.80 > 172.31.15.88.43265: Flags [P.], cksum 0x75ba (incorrect -> 0xc625), seq 8950:12544, ack 177, win 219, options [nop,nop,TS val 197175084 ecr 694951789], length 3594
19:44:46.091831 IP (tos 0x0, ttl 63, id 15328, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.43265 > 172.17.0.1.80: Flags [.], cksum 0xcecb (correct), seq 177, ack 1, win 350, options [nop,nop,TS val 694951790 ecr 197175084,nop,nop,sack 1 {8950:12544}], length 0
19:44:46.093634 IP (tos 0x0, ttl 64, id 769, offset 0, flags [DF], proto TCP (6), length 9001)
    172.17.0.1.80 > 172.31.15.88.43265: Flags [.], cksum 0x8aa5 (incorrect -> 0x7022), seq 1:8950, ack 177, win 219, options [nop,nop,TS val 197175085 ecr 694951790], length 8949
19:44:46.297635 IP (tos 0x0, ttl 64, id 770, offset 0, flags [DF], proto TCP (6), length 9001)
    172.17.0.1.80 > 172.31.15.88.43265: Flags [.], cksum 0x8aa5 (incorrect -> 0x6fef), seq 1:8950, ack 177, win 219, options [nop,nop,TS val 197175136 ecr 694951790], length 8949
19:44:46.705639 IP (tos 0x0, ttl 64, id 771, offset 0, flags [DF], proto TCP (6), length 9001)

Expected Results:

nginx (or any network server) should work the same under docker as it does when running directly on a host.

There are some workarounds, but they should not be necessary, seeing as they are not required when running natively.

Additional info:

Some workarounds:

  1. Partial: performance is severely degraded, and it does not always work.

set sysctl net.ipv4.tcp_mtu_probing=1
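A sketch of applying and persisting that setting; the /etc/sysctl.conf path is an assumption for this Ubuntu host:

```shell
# Apply immediately (requires root):
sudo sysctl -w net.ipv4.tcp_mtu_probing=1
# Persist across reboots (path assumed; a /etc/sysctl.d fragment also works):
echo 'net.ipv4.tcp_mtu_probing = 1' | sudo tee -a /etc/sysctl.conf
```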

$ curl -o /dev/null http://nginxdocker/testfile
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12288  100 12288    0     0   3948      0  0:00:03  0:00:03 --:--:--  3952

Note that with this, performance is very poor: it takes 3 seconds to download a file that was otherwise served in under a second.

Additionally, whilst this workaround works with nginx, it doesn't seem to work with other services (e.g. Java-based web servers).

You can see from the tcpdump output that initially it does as before, sending packets of 9001 bytes with a TCP maximum segment size of 8961 bytes.

After a few retries it gives up and sends the data in small 564-byte packets instead.

20:04:55.280669 IP (tos 0x0, ttl 63, id 28674, offset 0, flags [DF], proto TCP (6), length 60)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [S], cksum 0x8396 (correct), seq 1940900933, win 26883, options [mss 8961,sackOK,TS val 695254087 ecr 0,nop,wscale 7], length 0
20:04:55.280708 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [S.], cksum 0x67b8 (incorrect -> 0x7c2c), seq 72659806, ack 1940900934, win 26847, options [mss 8961,sackOK,TS val 197477381 ecr 695254087,nop,wscale 7], length 0
20:04:55.282082 IP (tos 0x0, ttl 63, id 28675, offset 0, flags [DF], proto TCP (6), length 52)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x3052 (correct), seq 1, ack 1, win 211, options [nop,nop,TS val 695254087 ecr 197477381], length 0
20:04:55.299373 IP (tos 0x0, ttl 63, id 28676, offset 0, flags [DF], proto TCP (6), length 228)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [P.], cksum 0x82fa (correct), seq 1:177, ack 1, win 211, options [nop,nop,TS val 695254091 ecr 197477381], length 176
20:04:55.299388 IP (tos 0x0, ttl 64, id 63321, offset 0, flags [DF], proto TCP (6), length 52)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x67b0 (incorrect -> 0x2f91), seq 1, ack 177, win 219, options [nop,nop,TS val 197477386 ecr 695254091], length 0
20:04:55.299506 IP (tos 0x0, ttl 64, id 63322, offset 0, flags [DF], proto TCP (6), length 9001)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x8aa5 (incorrect -> 0x2eea), seq 1:8950, ack 177, win 219, options [nop,nop,TS val 197477386 ecr 695254091], length 8949
20:04:55.299593 IP (tos 0x0, ttl 64, id 63323, offset 0, flags [DF], proto TCP (6), length 3646)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [P.], cksum 0x75ba (incorrect -> 0x76ed), seq 8950:12544, ack 177, win 219, options [nop,nop,TS val 197477386 ecr 695254091], length 3594
20:04:55.300861 IP (tos 0x0, ttl 63, id 28677, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x359a (correct), seq 177, ack 1, win 350, options [nop,nop,TS val 695254092 ecr 197477386,nop,nop,sack 1 {8950:12544}], length 0
20:04:55.301634 IP (tos 0x0, ttl 64, id 63324, offset 0, flags [DF], proto TCP (6), length 9001)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x8aa5 (incorrect -> 0x2ee8), seq 1:8950, ack 177, win 219, options [nop,nop,TS val 197477387 ecr 695254092], length 8949
20:04:55.505633 IP (tos 0x0, ttl 64, id 63325, offset 0, flags [DF], proto TCP (6), length 9001)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x8aa5 (incorrect -> 0x2eb5), seq 1:8950, ack 177, win 219, options [nop,nop,TS val 197477438 ecr 695254092], length 8949
20:04:55.913648 IP (tos 0x0, ttl 64, id 63326, offset 0, flags [DF], proto TCP (6), length 9001)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x8aa5 (incorrect -> 0x2e4f), seq 1:8950, ack 177, win 219, options [nop,nop,TS val 197477540 ecr 695254092], length 8949
20:04:56.729645 IP (tos 0x0, ttl 64, id 63327, offset 0, flags [DF], proto TCP (6), length 9001)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x8aa5 (incorrect -> 0x2d83), seq 1:8950, ack 177, win 219, options [nop,nop,TS val 197477744 ecr 695254092], length 8949
20:04:58.365655 IP (tos 0x0, ttl 64, id 63328, offset 0, flags [DF], proto TCP (6), length 564)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x1374), seq 1:513, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254092], length 512
20:04:58.366950 IP (tos 0x0, ttl 63, id 28678, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x2d94 (correct), seq 177, ack 513, win 359, options [nop,nop,TS val 695254858 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.366974 IP (tos 0x0, ttl 64, id 63329, offset 0, flags [DF], proto TCP (6), length 564)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0xf426), seq 513:1025, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254858], length 512
20:04:58.366976 IP (tos 0x0, ttl 64, id 63330, offset 0, flags [DF], proto TCP (6), length 564)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x5619), seq 1025:1537, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254858], length 512
20:04:58.368187 IP (tos 0x0, ttl 63, id 28679, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x2b8b (correct), seq 177, ack 1025, win 367, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.368200 IP (tos 0x0, ttl 64, id 63331, offset 0, flags [DF], proto TCP (6), length 564)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x5a24), seq 1537:2049, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.368202 IP (tos 0x0, ttl 64, id 63332, offset 0, flags [DF], proto TCP (6), length 564)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0xce4b), seq 2049:2561, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.368238 IP (tos 0x0, ttl 63, id 28680, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x2983 (correct), seq 177, ack 1537, win 375, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.368247 IP (tos 0x0, ttl 64, id 63333, offset 0, flags [DF], proto TCP (6), length 564)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x9184), seq 2561:3073, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.368248 IP (tos 0x0, ttl 64, id 63334, offset 0, flags [DF], proto TCP (6), length 564)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x1ed3), seq 3073:3585, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.369430 IP (tos 0x0, ttl 63, id 28681, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x277b (correct), seq 177, ack 2049, win 383, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.369442 IP (tos 0x0, ttl 64, id 63335, offset 0, flags [DF], proto TCP (6), length 564)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x6a4c), seq 3585:4097, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.369444 IP (tos 0x0, ttl 64, id 63336, offset 0, flags [DF], proto TCP (6), length 564)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0xa2ad), seq 4097:4609, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.369484 IP (tos 0x0, ttl 63, id 28682, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x2573 (correct), seq 177, ack 2561, win 391, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.369492 IP (tos 0x0, ttl 64, id 63337, offset 0, flags [DF], proto TCP (6), length 564)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x72fc), seq 4609:5121, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.369493 IP (tos 0x0, ttl 64, id 63338, offset 0, flags [DF], proto TCP (6), length 564)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0xd764), seq 5121:5633, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.369525 IP (tos 0x0, ttl 63, id 28683, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x236b (correct), seq 177, ack 3073, win 399, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.369531 IP (tos 0x0, ttl 63, id 28684, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x2163 (correct), seq 177, ack 3585, win 407, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.369538 IP (tos 0x0, ttl 64, id 63339, offset 0, flags [DF], proto TCP (6), length 564)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x9ad5), seq 5633:6145, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.369540 IP (tos 0x0, ttl 64, id 63340, offset 0, flags [DF], proto TCP (6), length 564)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x42fa), seq 6145:6657, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.369544 IP (tos 0x0, ttl 64, id 63341, offset 0, flags [DF], proto TCP (6), length 564)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x02b3), seq 6657:7169, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.369546 IP (tos 0x0, ttl 64, id 63342, offset 0, flags [DF], proto TCP (6), length 564)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0xcf93), seq 7169:7681, ack 177, win 219, options [nop,nop,TS val 197478153 ecr 695254859], length 512
20:04:58.370746 IP (tos 0x0, ttl 63, id 28685, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x1f5b (correct), seq 177, ack 4097, win 415, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.370759 IP (tos 0x0, ttl 64, id 63343, offset 0, flags [DF], proto TCP (6), length 564)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0x6d86), seq 7681:8193, ack 177, win 219, options [nop,nop,TS val 197478154 ecr 695254859], length 512
20:04:58.370831 IP (tos 0x0, ttl 63, id 28686, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x1d53 (correct), seq 177, ack 4609, win 423, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.370839 IP (tos 0x0, ttl 63, id 28687, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x1b4b (correct), seq 177, ack 5121, win 431, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.370865 IP (tos 0x0, ttl 63, id 28688, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x1943 (correct), seq 177, ack 5633, win 439, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.370872 IP (tos 0x0, ttl 63, id 28689, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x173b (correct), seq 177, ack 6145, win 447, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.370854 IP (tos 0x0, ttl 64, id 63344, offset 0, flags [DF], proto TCP (6), length 564)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x69b0 (incorrect -> 0xd9e7), seq 8193:8705, ack 177, win 219, options [nop,nop,TS val 197478154 ecr 695254859], length 512
20:04:58.370858 IP (tos 0x0, ttl 64, id 63345, offset 0, flags [DF], proto TCP (6), length 297)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [.], cksum 0x68a5 (incorrect -> 0x3740), seq 8705:8950, ack 177, win 219, options [nop,nop,TS val 197478154 ecr 695254859], length 245
20:04:58.370916 IP (tos 0x0, ttl 63, id 28690, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x1533 (correct), seq 177, ack 6657, win 455, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.370922 IP (tos 0x0, ttl 63, id 28691, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x132b (correct), seq 177, ack 7169, win 463, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.370937 IP (tos 0x0, ttl 63, id 28692, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x1123 (correct), seq 177, ack 7681, win 471, options [nop,nop,TS val 695254859 ecr 197478153,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.371910 IP (tos 0x0, ttl 63, id 28693, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x0f19 (correct), seq 177, ack 8193, win 479, options [nop,nop,TS val 695254860 ecr 197478154,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.372054 IP (tos 0x0, ttl 63, id 28694, offset 0, flags [DF], proto TCP (6), length 64)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0x0d11 (correct), seq 177, ack 8705, win 487, options [nop,nop,TS val 695254860 ecr 197478154,nop,nop,sack 1 {8950:12544}], length 0
20:04:58.372114 IP (tos 0x0, ttl 63, id 28695, offset 0, flags [DF], proto TCP (6), length 52)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0xf77c (correct), seq 177, ack 12544, win 495, options [nop,nop,TS val 695254860 ecr 197478154], length 0
20:04:58.372204 IP (tos 0x0, ttl 63, id 28696, offset 0, flags [DF], proto TCP (6), length 52)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [F.], cksum 0xf77b (correct), seq 177, ack 12544, win 495, options [nop,nop,TS val 695254860 ecr 197478154], length 0
20:04:58.372287 IP (tos 0x0, ttl 64, id 63346, offset 0, flags [DF], proto TCP (6), length 52)
    172.17.0.1.80 > 172.31.15.88.45202: Flags [F.], cksum 0x67b0 (incorrect -> 0xf88e), seq 12544, ack 178, win 219, options [nop,nop,TS val 197478154 ecr 695254860], length 0
20:04:58.373422 IP (tos 0x0, ttl 63, id 28697, offset 0, flags [DF], proto TCP (6), length 52)
    172.31.15.88.45202 > 172.17.0.1.80: Flags [.], cksum 0xf77a (correct), seq 178, ack 12545, win 495, options [nop,nop,TS val 695254860 ecr 197478154], length 0
  2. Full, but requires manual configuration; provides native performance.

Adjust the MTU of the docker0 interface accordingly: allow 40 bytes for IPsec overhead (we're using transport mode), or 60 bytes if using tunnel mode.
In our case we set the MTU to 8960:
DOCKER_OPTS=" --host=unix:///var/run/docker.sock --mtu=8960 --storage-driver=aufs -g $(readlink -f /data/appdata/docker)"
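The value follows from the interface MTU minus the assumed IPsec overhead (we rounded 8961 down to 8960):

```shell
# Derive a safe docker0 MTU from the NIC MTU and the assumed ESP overhead.
MTU=9001
TRANSPORT_OVERHEAD=40   # ESP transport mode, as used here
TUNNEL_OVERHEAD=60      # ESP tunnel mode would need more headroom
SAFE_TRANSPORT=$((MTU - TRANSPORT_OVERHEAD))   # 8961, rounded down to 8960 above
SAFE_TUNNEL=$((MTU - TUNNEL_OVERHEAD))         # 8941
echo "$SAFE_TRANSPORT $SAFE_TUNNEL"            # prints: 8961 8941
```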

Re-run the curl:

$ curl -o /dev/null http://nginxdocker/testfile
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12288  100 12288    0     0  2242k      0 --:--:-- --:--:-- --:--:-- 3000k

This restores performance to the same level as outside of Docker.

@cpuguy83 (Contributor) commented Apr 20, 2015

You can set the MTU for containers when configuring the daemon:
docker -d --mtu <value>
I thought at one point this was available per container as well, but I don't see it...


@itwasntandy commented Apr 20, 2015

Yes, I mentioned setting the MTU in the listed workarounds under Additional info; I am aware my post was quite long :)

It certainly works, but it requires manual, environment specific configuration.

I would like this not to be required.


@itwasntandy commented Apr 20, 2015

Further to my last comment: setting the MTU works within an environment where everything is on the same network and all nodes have the same MTU.

However, with interregional traffic and devices with a 1500-byte MTU in between the two servers, this would break again, and yet it would work if the app were running outside of Docker.

Hence my conclusion that Path MTU Discovery is broken under Docker.


@itwasntandy itwasntandy changed the title from Docker breaks Path MTU Discovery when host is using IPSEC to Bridge interface with Docker breaks Path MTU Discovery when host is using IPSEC Apr 22, 2015

@itwasntandy commented Apr 22, 2015

Been thinking some more on this.

What is curious is that in the tcpdumps of the docker0 interface above, you don't see any ICMP "fragmentation needed" messages. From some reading, it seems that because a Linux bridge does not route, there is nothing to send those ICMP messages.

see: http://www.linuxfoundation.org/collaborate/workgroups/networking/bridge#What_can_be_bridged.3F
"All devices share the same maximum packet size (MTU). The bridge doesn't fragment packets."

So the issue isn't in Docker itself, but is a direct result of it using bridging.

Switching to routing through the host instead of bridging would be the solution; maybe something to consider with the planned Docker networking enhancements…
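One way to check this by hand is a DF-set ping probe: a router along a narrower path answers with "fragmentation needed", while a bridge just drops the oversized frame. A sketch, with the probe size computed from the 9001-byte MTU (the hostname is a placeholder):

```shell
# ICMP payload size that makes ping emit a datagram exactly as large as the MTU.
MTU=9001
ICMP_IP_HEADERS=28      # 20-byte IPv4 header + 8-byte ICMP header
PROBE=$((MTU - ICMP_IP_HEADERS))
echo "$PROBE"           # prints: 8973
# With DF set (-M do), run against the container host (placeholder name):
#   ping -c 3 -M do -s "$PROBE" nginxdocker
```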


@mrjana (Contributor) commented Apr 22, 2015

@itwasntandy Are you sure you don't have this set to 1?
mrjana@dev-1:~$ cat /proc/sys/net/ipv4/ip_no_pmtu_disc
0
This is needed for PMTU discovery. In your case nginx on the host works fine because it knows the MTU of the IPsec tunnel interface, so TCP would never try to send more than 8960 bytes even without PMTUD.


@itwasntandy commented Apr 22, 2015

@mrjana yes.

andrewmulholland@ipsectest01-uw2a:/var/log$ sysctl net.ipv4.ip_no_pmtu_disc
net.ipv4.ip_no_pmtu_disc = 0

However, as per my last comment, Linux when bridging does not pass on ICMP "fragmentation needed" messages, and does not fragment packets.

Oh, and we're using netkey for IPsec, so there's no separate ipsec0 interface; it's all done through eth0.


@cookandy commented Sep 20, 2016

hi @itwasntandy - did you ever find a solution to this issue? I am experiencing a similar issue using IPSec in transport mode: #26473

In my case, I need to adjust the MTU to below 1500 to suit our network. However, as soon as I set the MTU on the host with Docker running, it fails.


@cookandy commented Oct 18, 2016

@itwasntandy:

I am also having trouble with what I think is missing ICMP fragmentation messages, and was wondering if you had any ideas...

First, due to IPsec overhead, I must adjust my MTU to 1460 in order to access services in Docker containers. Since setting the MTU didn't work (see #26473), I adjusted it on the firewall with:

iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -o eth1 -j TCPMSS --set-mss 1460 --clamp-mss-to-pmtu

This works fine for almost all of my Docker applications. However, when I try to run an Elasticsearch cluster over 2 nodes, it fails whenever I try to load a data sample larger than the MTU. If I load the data against a single node, it works fine. This makes me think the traffic between ES nodes isn't honoring the MTU somehow.

Can you think of any reason why MTU sizes would still be a problem after clamping the mss?
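As an aside on the rule quoted above: iptables' TCPMSS target treats --set-mss and --clamp-mss-to-pmtu as mutually exclusive, so a rule would normally use one form or the other; and with a 1460-byte path MTU, an explicit MSS would typically be 1460 - 40 = 1420. A sketch of the two variants (interface name taken from the comment above; not tested here):

```shell
# Either clamp MSS to the discovered path MTU on outgoing SYNs...
iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -o eth1 \
  -j TCPMSS --clamp-mss-to-pmtu
# ...or pin an explicit MSS (assumed MTU 1460 minus 40 bytes of IPv4+TCP headers):
iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -o eth1 \
  -j TCPMSS --set-mss 1420
```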


@itwasntandy


itwasntandy Oct 18, 2016

@cookandy as per my comment on #26473 you were setting the MTU on the wrong interface, which is why that was not working.
You need to leave eth1's MTU alone, and then use the --mtu parameter with docker to set it for docker0.

As for why MSS clamping is not working for you, you'd need to run some packet captures to see what is going on. If you can provide them, I can take a look and help analyze them.
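For reference, a minimal sketch of what that looks like (flag spelling per the Docker daemon documentation; the 1460 value assumes a 1500-byte link minus ~40 bytes of IPSec overhead):

```shell
# Leave eth1 at its native MTU and lower only the container side.
# Option A: pass the flag when starting the daemon
# ("docker -d --mtu 1460" on 1.6-era releases, "dockerd --mtu 1460" later):
dockerd --mtu 1460

# Option B: persist it in /etc/docker/daemon.json and restart the daemon:
# {
#   "mtu": 1460
# }
```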


@cookandy


cookandy Oct 18, 2016

@itwasntandy:

Yeah, I've since gone back and adjusted the MTU on the docker0 interface by passing the --mtu parameter to Docker. However, it didn't seem to make any difference. The iptables rule is what ultimately fixed my original problem.

I really appreciate the offer to help analyze my packet captures for the ES problem. To give you some background, here is the issue:

elastic/elasticsearch#20657

I'll need to install wireshark and try to decrypt the captures so I can send them to you. Before I do that, is there anything in particular you'd like for me to capture? I'm thinking the traffic between the two ES nodes (while trying to load the data) would be the best place to start...


@itwasntandy


itwasntandy Oct 18, 2016

You don't need Wireshark, and you won't need to decrypt the captures: packet captures taken on the docker0 interface sit outside your IPSec tunnel, so they're already in the clear.

If you run sudo tcpdump -nvv -w /var/tmp/node1.pcap -i docker0 on the first node
and sudo tcpdump -nvv -w /var/tmp/node2.pcap -i docker0 on the second node, and then put the files somewhere I can grab them, I can take a look.


@cookandy


cookandy Oct 18, 2016

Hi @itwasntandy:

I ran a tcpdump against the docker0 interface as you suggested. First, I ran a single node and was able to import the sample data: Node 1 working

Next, I started a second node and formed a cluster. I then tried to re-import the sample data (it failed). I left it for about 10 seconds before I stopped the captures:

Node 1 broken
Node 2 broken

I am noticing some packets that are 7306 bytes in length. Not sure if this is what's causing the issue...

Just for reference, this is the iptables rule I have in place:

iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -o eth1 -j TCPMSS --set-mss 1460 --clamp-mss-to-pmtu


@itwasntandy


itwasntandy Oct 19, 2016

Ok, so I've gone and re-read your other tickets as well to get more of an understanding on what's going on.

So it seems you're running on DigitalOcean, which, judging by your pastes, doesn't support jumbo frames, so telling Docker to set the MTU to 8960 was never going to work (it would have to be less than the MTU of your normal interface, i.e. 1500).

Therefore, in your case, if you pass --mtu 1460 to docker as a parameter, it should work as desired. (To calculate this, take the MTU of your normal network interface (eth1 in your case, 1500 bytes) and subtract 40 bytes for IPSec overhead; that's how I got to 1460.)
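Spelling the arithmetic out (the 40-byte figure is an approximation of ESP overhead; the exact number depends on cipher and mode):

```shell
# Compute the MTU to hand to Docker from the link MTU and IPSec overhead.
LINK_MTU=1500        # MTU of the underlying interface (eth1 here)
IPSEC_OVERHEAD=40    # approximate ESP overhead, as assumed above
DOCKER_MTU=$(( LINK_MTU - IPSEC_OVERHEAD ))
echo "$DOCKER_MTU"   # prints 1460
```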

I set up a test 2-node ES cluster with host-to-host IPSec just now to test a few things. It's in EC2, but I disabled jumbo frames on the interfaces (set the MTU to 1500 for eth0 on each node) to mimic your setup, and ran through a few scenarios.

Steps used each time:
elastic search run with the following parameters:
Host1:
-Des.network.host=0.0.0.0 -Des.cluster.name=bob -Des.discovery.zen.ping.unicast.hosts=172.16.3.73,172.16.3.41 -Des.network.publish_host=172.16.3.43 -Des.discovery.zen.minimum_master_nodes=2
Host2:
-Des.network.host=0.0.0.0 -Des.cluster.name=bob -Des.discovery.zen.ping.unicast.hosts=172.16.3.73,172.16.3.41 -Des.network.publish_host=172.16.3.73 -Des.discovery.zen.minimum_master_nodes=2

Data load done via:
curl -XPOST '172.16.3.73:9200/bank/account/_bulk?pretty' --data-binary "@accounts.json"

Verification of documents replicated done via:
curl '172.16.3.41:9200/_cat/indices?v'

  1. Default configuration: cluster state yellow, the two nodes can't complete replication. Nothing is replicated
    • This is expected, due to PMTUD not working on a Linux bridge interface.
  2. Ensuring docker daemon runs with --mtu 1460
    • Cluster state turns Green, able to replicate sample data.
  3. MTU at defaults, but mss clamping set to 1460 as per your setting:
    • iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -o eth1 -j TCPMSS --set-mss 1460 --clamp-mss-to-pmtu
    • Cluster starts, turns green, but large posts hang as per your example
    • This is expected, as you're doing the MSS clamping on eth1, not on the docker interface, and you're clamping to PMTU, which won't work over IPSec. Not 100% sure why it's working for you at all.
  4. MTU at defaults, but mss clamping set a little differently
    • iptables -t mangle -A FORWARD -o docker0 -p tcp -m tcp --tcp-flags SYN,RST SYN -m tcpmss --mss 1460:9001 -j TCPMSS --set-mss 1460
    • Cluster starts, turns green
    • Large posts work as expected.

Hope this is of help to you. FWIW, I'd recommend option 2, as MSS clamping only works for TCP, so you may potentially see other issues.
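Put together, a sketch of options 2 and 4 combined (interface names and the 1460/9001 values are taken from the test matrix above; adjust them to your own link MTU):

```shell
# Option 2: make containers use a 1460-byte MTU on the docker0 side.
dockerd --mtu 1460

# Option 4: belt-and-braces MSS clamp on traffic forwarded into docker0,
# catching anything that still negotiates an oversized MSS.
iptables -t mangle -A FORWARD -o docker0 -p tcp -m tcp \
  --tcp-flags SYN,RST SYN -m tcpmss --mss 1460:9001 \
  -j TCPMSS --set-mss 1460
```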


@cookandy


cookandy Oct 19, 2016

@itwasntandy

Thank you so much for the detailed write up - it's been a tremendous help! I was able to get the ES cluster working with your instructions, and the networking stack makes much more sense to me now.

I think in my case I'll need to do 2 and 4, as some of my other Docker applications don't seem to work correctly even with just the --mtu 1460 setting.

Quick question: if 3 is set on the external interface (eth1), shouldn't the MTU already be adjusted (with --set-mss 1460) before it hits docker0?

Thanks again for the help!!


@darvids0n


darvids0n Jun 13, 2017

Is there any chance this issue can be moved up to a new milestone? On PaaS setups such as App Engine we cannot set MTU for the host, and setting MTU per container is patchwork at best.

