
Docker swarm - encrypted network overlay - stops working. #30727

Closed
ventz opened this Issue Feb 3, 2017 · 29 comments


ventz commented Feb 3, 2017

Description
After creating a 3-node swarm (all managers) and then creating an encrypted overlay, we have noticed that the overlay network randomly drops out.

Steps to reproduce the issue:

  1. Create docker swarm cluster of at least 3 nodes
  2. Create overlay with:
docker network create --attachable --opt encrypted -d overlay networkname"

NOTE: Making it attachable to test easily

  1. Start an alpine container (easy test) on 2 nodes:
docker run -it --rm --net=networkname alpine /bin/ash

4.) Find the IPs (ifconfig) of each, and ping across.
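For convenience, a rough end-to-end sketch of the reproduction (node IPs, the join token, and the network name are placeholders):

# On the first node: initialize the swarm
docker swarm init --advertise-addr <MANAGER-IP>

# On the other two nodes: join as managers, using the token printed by
# "docker swarm join-token manager" on the first node
docker swarm join --token <SWMTKN-1-...> <MANAGER-IP>:2377

# On any manager: create the encrypted, attachable overlay
docker network create --attachable --opt encrypted -d overlay networkname

# On two different nodes: start a test container, note its IP (ifconfig), and ping across
docker run -it --rm --net=networkname alpine /bin/ash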

Describe the results you received:
It works, and then it randomly stops. The firewall allows everything between the 3 nodes (both IP protocol 50 and all other traffic is any/any).

Describe the results you expected:
To work all the time :)

Additional information you deem important (e.g. issue happens only occasionally):
It happens almost randomly. After a reboot, it starts working again.

Output of docker version:

Client:
 Version:      1.13.0
 API version:  1.25
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Tue Jan 17 09:58:26 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.0
 API version:  1.25 (minimum version 1.12)
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Tue Jan 17 09:58:26 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 8
Server Version: 1.13.0
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 34
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: 4y3mi5goxun18p0rif8hdrt5o
 Is Manager: true
 ClusterID: vcwzg0mebqw4kp58pz8ynm0cn
 Managers: 3
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: PUB#1
 Manager Addresses:
  PUB#1:2377
  PUB#2:2377
  PUB#2:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 2f7393a47307a16f8cee44a37b262e8b81021e3e
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-59-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 24
Total Memory: 100 GiB
Name: swarmhost01
ID: GVD4:VFPH:ELAN:X2CK:CLFZ:MFDC:C5LT:RLTU:DWKE:KDKY:HT6M:BAC2
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
 nfs=yes
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):
The environment is a mix of physical and virtual systems. We have changed it around to be only virtual and only physical, with the same results. The systems are located in 3 different regions, on 3 different public IP spaces.


ventz commented Feb 3, 2017

Additional information: the overlay seems to be aware of the other IPs, since when launching new containers on the different swarm subnets, it keeps increasing the last octet by 1 (e.g. 10.0.0.3, 10.0.0.4, etc.).

Also, from a few tests so far, it seems to be related specifically to encrypted overlays -- we have not seen this happen on non-encrypted overlays. In fact, if you remove the overlay (when encrypted) and re-create it non-encrypted, it starts working again.

Also, if you remove and re-create the encrypted overlay, it still will not work until the host(s) are rebooted.

Contributor

aboch commented Feb 3, 2017

@ventz Few questions:

  • How long does it take to hit the issue?
  • Do you have one container on one node ping the container on another node, or both pinging each other?
  • Once the ping stops working, have you tried capturing on the rx container's node interface with tcpdump to see if the other node is still sending the packets (tcpdump -i <iface> -p esp)?
    Just to see whether the issue is at the tx node or at the rx node.

Thanks
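For reference, a capture of just the ESP traffic on the receiving node's underlay interface would look something like this (interface name is a placeholder):

# Show only ESP (IP protocol 50) packets arriving on / leaving the host interface
sudo tcpdump -ni eth0 esp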


ventz commented Feb 3, 2017

@aboch

  • It seems "random". For example, on the latest cluster it worked for a few days (given no containers on it) before it hit this state again. I have seen it happen in as little as an hour, and the "average" (if you can even call it that) seems to be a few hours to less than a couple of days. I can start monitoring this more carefully.

  • I have one container on each node (when I test), but since I first test with just ICMP (easiest), it doesn't matter which way I initiate it. If I get both ICMP echo + reply, the overlay works. As soon as I stop getting "replies", that indicates a problem. Once it hits this state, I do test from all sides to see exactly which way it fails. I have noticed a few times that the issue seems to center on the leader node. Ex: in the case that just happened, as soon as I rebooted the leader, it started working again.

  • This is the interesting scenario -- you can see ESP coming in on the Docker host (not the container) while pinging, but it will not send ESP back (or originate it if you try to originate only). It's almost like some part responsible for this just collapses. No error logs so far in /var/log/system.

Here is what the tcpdump looks like:

NOTE: again, PUB#1 happened to be the leader in this case
NOTE2: tcpdump taken _on_ NODE#3 (hence "private#3"):

16:09:54.836702 IP PUB#1 > private#3: ESP(spi=0xd3052651,seq=0x13), length 140
16:09:55.836722 IP PUB#1 > private#3: ESP(spi=0xd3052651,seq=0x14), length 140
16:09:56.836727 IP PUB#1 > private#3: ESP(spi=0xd3052651,seq=0x15), length 140
16:09:57.836676 IP PUB#1 > private#3: ESP(spi=0xd3052651,seq=0x16), length 140
16:09:58.836693 IP PUB#1 > private#3: ESP(spi=0xd3052651,seq=0x17), length 140
16:09:59.836695 IP PUB#1 > private#3: ESP(spi=0xd3052651,seq=0x18), length 140
16:10:00.836763 IP PUB#1 > private#3: ESP(spi=0xd3052651,seq=0x19), length 140

As you can see, the system is not even trying to originate protocol 50 traffic back.

Contributor

aboch commented Feb 3, 2017

Thanks @ventz
It looks like the system is currently in the erroneous state. If so, it would be interesting to do some debugging:

Can you please check if the container running on node#3 is actually receiving the icmp echo request packets and that it is generating the responses back? (For this, either run tcpdump via docker exec if the container image has it, or nsenter into the container netns from the host: nsenter --net=<value of SandboxKey from docker inspect <container>>)
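A sketch of that check (the container name and the interface name inside the container are placeholders):

# Find the container's network namespace path (the SandboxKey field)
NETNS=$(docker inspect --format '{{ .NetworkSettings.SandboxKey }}' <container>)

# Enter the container's netns from the host and watch for ICMP requests/replies
sudo nsenter --net="$NETNS" tcpdump -n -i eth0 icmp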


ventz commented Feb 3, 2017

@aboch Hi - sadly, this is the tcpdump I took earlier today when it happened. I have already rebooted that node.

That said, the container on private#3 was generating back the icmp replies, but they didn't show up on the component responsible for receiving/processing/sending ESP.

It's almost like something isolated to the ESP connection is failing.

To give you an example, when this happened, I removed the containers and removed the overlay. Then re-created the overlay (encrypted), and it still didn't work. Tried removing+re-adding one more time. At that point, I re-added it without encryption, and it worked. I removed it, and re-added it with encryption again, nothing. But then rebooting the leader node fixed it.

Contributor

aboch commented Feb 3, 2017

It's almost like something isolated to the ESP connection is failing.

Looks like it.

I have done this before, but let me again run a few nodes on AWS overnight with containers pinging over an encrypted network. I will have them all be managers, as in your test, and will report back my findings.


ventz commented Feb 3, 2017

I did get it to happen on AWS, but if possible, try one at a local datacenter or on a different network. I've noticed it happening a lot more when traffic goes over public lines -- so it might be a disruption or blip that ends up causing it.

I'll keep an eye out for it again too, although for now the latest cluster seems to be working.

I know this sucks for debugging -- it's clearly broken in some way, but capturing exactly what is not easy, especially from someone going "hey, it just fails, oh and it's random" :)


ventz commented Feb 4, 2017

@aboch I have an environment where this is happening right now.

Contributor

aboch commented Feb 4, 2017

@ventz
Check with ip -s xfrm state whether the counters are increasing for the proper SAs (check the src and dst IPs and the SPI). On each node you will find 4 states for each pair of nodes: 3 RX states (dst is this node) and one TX state (src is this node).
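For example (addresses are placeholders):

# Show IPsec SAs with statistics; the "lifetime current" packet/byte counters
# of the TX state (src = this node) and of the matching RX state on the peer
# should keep increasing while the containers are pinging
sudo ip -s xfrm state

# Narrow the output to the SAs between two specific nodes
sudo ip -s xfrm state list src <this-node-IP> dst <peer-IP>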


ventz commented Feb 4, 2017

Ok, this is kind of interesting:

The overlay subnet seems to be 10.0.0.0/24:

04:06:30.845862 IP 10.0.0.5 > 10.0.0.13: ICMP echo request, id 3328, seq 0, length 64
04:06:31.846990 IP 10.0.0.5 > 10.0.0.13: ICMP echo request, id 3328, seq 1, length 64
04:06:32.848206 IP 10.0.0.5 > 10.0.0.13: ICMP echo request, id 3328, seq 2, length 64
04:06:33.848998 IP 10.0.0.5 > 10.0.0.13: ICMP echo request, id 3328, seq 3, length 64
04:06:34.849709 IP 10.0.0.5 > 10.0.0.13: ICMP echo request, id 3328, seq 4, length 64
04:06:35.849833 ARP, Request who-has 10.0.0.13 tell 10.0.0.5, length 28
04:06:35.849892 ARP, Reply 10.0.0.13 is-at 02:42:0a:00:00:0d, length 28
04:06:35.850421 IP 10.0.0.5 > 10.0.0.13: ICMP echo request, id 3328, seq 5, length 64
04:06:36.851073 IP 10.0.0.5 > 10.0.0.13: ICMP echo request, id 3328, seq 6, length 64
04:06:37.851824 IP 10.0.0.5 > 10.0.0.13: ICMP echo request, id 3328, seq 7, length 64
04:06:38.852600 IP 10.0.0.5 > 10.0.0.13: ICMP echo request, id 3328, seq 8, length 64

Trying from PUB#1 -> PUB#3 (out of 5, where the first 3 are managers, and 2 are workers)
The interesting part is that the ARP reply had the right MAC (00:00:0d) for .13, which I verified on .13 to be the case. But .13 was not seeing any traffic. Same in the reverse direction.

However, I can actually see ESP exchanges in this case. Also, some part of the overlay works -- I can reach some of the other nodes, so it's not completely shot. It's just the connection between these 2 hosts (even though they are both online).

Contributor

aboch commented Feb 4, 2017

Also, please check on the node which is not generating the responses back whether the counter is increasing for the rule that the following command returns:

sudo iptables -t mangle -nvL OUTPUT
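A quick way to compare the counters over time (a sketch; the exact rules present will depend on the setup):

# Snapshot the mangle OUTPUT chain counters twice, a few seconds apart, while
# the ping is running; the rule matching the remote node should show a rising
# "pkts" counter if this host is attempting to send encrypted replies
sudo iptables -t mangle -nvL OUTPUT > /tmp/mangle-1
sleep 5
sudo iptables -t mangle -nvL OUTPUT > /tmp/mangle-2
diff /tmp/mangle-1 /tmp/mangle-2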

ventz commented Feb 4, 2017

@aboch any way we can chat in real time (irc?) - I can also share access into the env via a screen session.

Contributor

aboch commented Feb 4, 2017

#docker-network irc channel


ventz commented Feb 4, 2017

Thanks - pinged you there. We can update the solution here eventually for others, but it would be easier to debug in real time.


ventz commented Feb 4, 2017

@aboch - Thanks again for the help! Seems like there is an issue, but you have it now :)

For anyone else who sees this, for now you can run:

ip xfrm state flush; ip xfrm policy flush

on each host, and then start and remove a container to pull the overlay back onto that host. That fixes the issue.
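A sketch of that per-host workaround, using the network name from this issue (adjust to your own):

# Flush all IPsec states and policies on this host
# (note: this removes every xfrm state/policy, not only Docker's)
sudo ip xfrm state flush
sudo ip xfrm policy flush

# Re-trigger the overlay programming on this host by briefly attaching
# a throwaway container to the encrypted network
docker run --rm --net=networkname alpine true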

Contributor

aboch commented Feb 4, 2017

Thanks @ventz for the Friday night debug session in IRC ;-)
That was really helpful to spot the problem.

The problem was caused by stale SAs from a previous cluster life that were being selected to encrypt the packets. (The corresponding stale SAs were not on the other node, most likely because that node did not have any container on the encrypted network in the previous cluster.)

There are two workarounds:
One is to selectively remove the stale entries (you can recognize them by their installation date) via ip xfrm state del.
The other, more practical one (which we opted for) is to flush the table with ip xfrm state flush and then re-trigger the state programming by starting a container on the encrypted network on each node.
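A sketch of the selective approach (the src/dst addresses and SPI are placeholders taken from the stale entry you want to remove):

# Inspect current SAs; stale entries can be recognized by the "add" timestamp
# shown in their lifetime information
sudo ip -s xfrm state

# Delete one specific stale SA by its identifying tuple
sudo ip xfrm state delete src 10.10.101.11 dst 10.10.102.12 proto esp spi 0x1304f906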

I have an open PR (docker/libnetwork/pull/1354) to age the stale entries out, but the aging time can only be 3 times the key rotation time + delta (36hrs + delta), so it would not have helped you right away.

I think something better can be done to avoid this. I will work on a fix.

@aboch aboch self-assigned this Feb 4, 2017


schikin commented Feb 4, 2017

Hello all. Can you check my last comment on this issue: #30595? It seems like a related one - networking starts failing after some time, but only for TCP. I also have a cluster in that state right now. Your help would be highly appreciated.


ventz commented Feb 6, 2017

It seems like there might be a larger issue around this. On a new cluster (3 nodes, all managers), here is the time span between the overlay working perfectly and it spontaneously stopping to work:

2017-02-05 1:57:38 -> working
2017-02-05 20:15:51 -> failed

After being launched, this cluster was not touched.

The only reason we know the encrypted overlay failed is because there is a service that uses the overlay to sync every 5 seconds, and it failed + did not recover after that point.


rusenask commented Feb 6, 2017

I have also noticed this happening. I am automatically launching new services from within the swarm cluster, where the new service has to connect to the parent service, and sometimes it appears that the node's networking fails, since the new service fails to connect.
The node is visible in the cluster and Docker seems to think the node is okay, since new services are still scheduled there. Rebooting the node solves the issue, but you can only detect it if you have some workload there that relies on networking.


pascalandy commented Feb 7, 2017

I believe --opt encrypted was messing with the network between my CMS and MySQL. I stopped using it.

I feel this should be enabled by default once it is stable, but I don't know about the CPU overhead.
Cheers!


ventz commented Feb 7, 2017

@aboch - See panic dump: #30800
(especially towards the bottom)

It might shed light on an additional bug that has to do with the encrypted overlay.

Contributor

aboch commented Feb 11, 2017

Thanks @ventz, I found where the issue is and pushed the fix ^. It affects only the use case where the advertise address is outside of the box (the 1-1 NAT case), which is why I could not initially reproduce it. It happens during key rotations. It is expected that a node reload, or removing and restarting the container on the node, fixes it, but only until the next datapath key rotation happens.


ventz commented Feb 11, 2017

@aboch - That's great! Thanks.

What do you think is the ETA for the next update that will include this?

@aboch aboch added this to the 1.13.2 milestone Feb 13, 2017

@ventz ventz changed the title from Docker swarm - encrypted network overlay - randomly stops working. to Docker swarm - encrypted network overlay - stops working. Feb 16, 2017


ventz commented Mar 6, 2017

@aboch It looks like a bug was introduced in 17.03 (which was not present in 1.13) related to the encrypted overlay when NAT is involved: when 2+ systems on different networks can talk internally but advertise public IPs, since they need to communicate with, let's say, a 3rd external system.

Just testing:

System A = 10.10.101.11 = PUB1
System B = 10.10.102.12 = PUB2
Encrypted overlay (attachable)

When you launch containers on A (10.0.0.2) and B (10.0.0.3) and ping from the container on B to the one on A, you get:

On A:

01:49:54.194927 IP 10.10.102.12 > 10.10.101.11: ESP(spi=0x1304f906,seq=0x1), length 140
01:49:55.195686 IP 10.10.102.12 > 10.10.101.11: ESP(spi=0x1304f906,seq=0x2), length 140
01:49:56.196093 IP 10.10.102.12 > 10.10.101.11: ESP(spi=0x1304f906,seq=0x3), length 140
01:49:57.196470 IP 10.10.102.12 > 10.10.101.11: ESP(spi=0x1304f906,seq=0x4), length 140
01:49:58.196834 IP 10.10.102.12 > 10.10.101.11: ESP(spi=0x1304f906,seq=0x5), length 140

On B:

01:49:54.193550 IP 10.10.102.12 > PUB2: ESP(spi=0x1304f906,seq=0x1), length 140
01:49:55.194218 IP 10.10.102.12 > PUB2: ESP(spi=0x1304f906,seq=0x2), length 140
01:49:56.194586 IP 10.10.102.12 > PUB2: ESP(spi=0x1304f906,seq=0x3), length 140
01:49:57.194948 IP 10.10.102.12 > PUB2: ESP(spi=0x1304f906,seq=0x4), length 140
01:49:58.195281 IP 10.10.102.12 > PUB2: ESP(spi=0x1304f906,seq=0x5), length 140

Nothing has changed config-wise or launch-wise since upgrading from 1.13.
(Otherwise, it seems to work in the other scenarios --> one system per datacenter / non-routable RFC 1918 space.)

I can confirm that if you create it non-encrypted, everything works.

Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 11:02:43 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   60ccb22
 Built:        Thu Feb 23 11:02:43 2017
 OS/Arch:      linux/amd64
 Experimental: false

ventz commented Mar 6, 2017

@aboch And possibly related

System A has:

Mar  6 02:03:52 A dockerd[5207]: time="2017-03-06T02:03:52.600013459-05:00" level=info msg="Initializing Libnetwork Agent Listen-Addr=10.10.101.11 Local-addr=10.10.101.11 Adv-addr=PUB1 Remote-addr ="
Mar  6 02:03:52 A dockerd[5207]: time="2017-03-06T02:03:52.600607214-05:00" level=info msg="Initializing Libnetwork Agent Listen-Addr=10.10.101.11 Local-addr=10.10.101.11 Adv-addr=PUB1 Remote-addr ="

vs

System B has:

Mar  6 02:06:47 B dockerd[5689]: time="2017-03-06T02:06:47.848564573-05:00" level=info msg="Initializing Libnetwork Agent Listen-Addr=10.10.102.12 Local-addr=10.10.102.12 Adv-addr=PUB2 Remote-addr =PUB2"
Mar  6 02:06:47 B dockerd[5689]: time="2017-03-06T02:06:47.866718260-05:00" level=info msg="Initializing Libnetwork Agent Listen-Addr=10.10.102.12 Local-addr=10.10.102.12 Adv-addr=PUB2 Remote-addr =PUB2"

^ Note that System B (the joining one) gets the "Remote-addr" as the --advertise-addr (public IP), but System A does not see it.

The commands used to create and join the two are:

System A:

docker swarm init --advertise-addr PUB1 --listen-addr 10.10.101.11:2377

System B:

docker swarm join \
--token SWMTKN-1-... \
--advertise-addr PUB2 --listen-addr 10.10.102.12:2377 PUB1:2377

Just to be exact -- it doesn't work from the moment it is created, and this same setup worked in 1.13.
If you join a 3rd node outside of the datacenter (System C), both A and B can talk with it individually. But A still can't talk to B and vice versa.


ventz commented Mar 6, 2017

Last possibly interesting update - on a new swarm create + encrypted overlay, it seems you can generate ESP packets one way (no matter whether you FIRST try from A->B or B->A), but there is never a return. From that point on, the other direction does not work. So if I start with the container on A pinging the container on B -- I'll see pri(A)->PUB(B), and on B packets coming in, but never an attempt to reply. If I stop and reverse direction, there won't be anything. If I leave the swarm, start again, and try in the reverse direction, the traffic again shows up, but only in that direction, and the reverse does not work. It seems to be something related to the way the encryption is handling the NAT/advertised IP.

Contributor

aboch commented Mar 6, 2017

Hi @ventz, yes something looks wrong from your logs.

On System A you should see outgoing ESP packets as PRI1 -> PUB2 and incoming as PUB2->PRI1.
Instead from your logs

01:49:54.194927 IP 10.10.102.12 > 10.10.101.11: ESP(spi=0x1304f906,seq=0x1), length 140

it looks like System A is aware of the private IP of System B.
This is one problem that needs to be investigated. You should not see any IPsec tunnels with the PRI IP of any remote node in a swarm created with that configuration.

I am not aware of other changes in that area which could have caused this. Also, I have a 3-node swarm running on AWS with encryption via public IPs, and it has been running smoothly for more than 2 weeks.

Note that System B (the joining one) gets the "Remote-addr" as the --advertise-addr (public IP), but System A does not see it.

That is expected to be seen only on the node that joins the swarm. During the encryption setup, the remote node's PUB IP is learned from the network control plane, not from that parameter.

As of now I cannot think of what could have caused your issue. I'll get back to you if I get any ideas.
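A quick way to check on each node which peer addresses the IPsec tunnels were actually programmed with (only PUB addresses are expected when --advertise-addr is a public IP):

# The first line of each SA shows the tunnel endpoints that were programmed
sudo ip xfrm state | grep '^src'

# The policies show the corresponding endpoint information as well
sudo ip xfrm policy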


emcgee commented May 19, 2018

So we're seeing something quite similar to what is described above. It may also be related to #33133

We've noticed that we're unable to send any data traffic through an encrypted overlay to an endpoint behind AWS/GCE NAT on Debian 9, Ubuntu 18.04, or anything with a kernel > 4.4.

The only modern Debian-based variant that works is Ubuntu 16.04 with the 4.4 kernel.


Simple repro: node-1 on DigitalOcean, node-2 on GCE/AWS.

  1. Spin up a Debian Stretch host (4.9 kernel) on both DigitalOcean and GCE or AWS with the latest stable Docker-CE (18.03.1)

  2. Initialize the swarm on DigitalOcean node-1 using --advertise-addr external_ip

  3. By default, DigitalOcean has no firewall so node-1 is wide open. Open TCP 2377, TCP/UDP 7946, UDP 4789, Protocol 50 on node-2 in the AWS Security Group/GCP VPC Firewall Rules and join the swarm as a worker using --advertise-addr node-2-external-ip

  4. docker network create --attachable --driver overlay --opt encrypted encryption_test

  5. docker run -ti --rm --network encryption_test debian bash

If you do this with a non-encrypted overlay, traffic flows with no issues. We're able to ping between containers, run iperf3; all is well. But on the encrypted overlay, the traffic simply won't transmit.

ip xfrm state gives:

src 172.31.38.243 dst 207.106.235.21
	proto esp spi 0x3e7a19d6 reqid 13681891 mode transport
	replay-window 0
	aead rfc4106(gcm(aes)) 0xdcb885a138afc1d801f86a6b379dd22e3e7a19d6 64
	anti-replay context: seq 0x0, oseq 0xa, bitmap 0x00000000
	sel src 0.0.0.0/0 dst 0.0.0.0/0
src 207.106.235.21 dst 172.31.38.243
	proto esp spi 0x617bc0ba reqid 13681891 mode transport
	replay-window 0
	aead rfc4106(gcm(aes)) 0xdcb885a138afc1d801f86a6b379dd22e617bc0ba 64
	anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000
	sel src 0.0.0.0/0 dst 0.0.0.0/0

We can see ESP traffic on the remote host (not the one doing the pinging) but there is no return:

18:31:24.652640 IP ec2-18-176-21-33.us-east-2.compute.amazonaws.com > hostname.mydomain.com: ESP(spi=0x3e7a19d6,seq=0x1), length 140
18:31:25.653615 IP ec2-18-176-21-33.us-east-2.compute.amazonaws.com > hostname.mydomain.com: ESP(spi=0x3e7a19d6,seq=0x2), length 140
18:31:26.654775 IP ec2-18-176-21-33.us-east-2.compute.amazonaws.com > hostname.mydomain.com: ESP(spi=0x3e7a19d6,seq=0x3), length 140
18:31:27.655940 IP ec2-18-176-21-33.us-east-2.compute.amazonaws.com > hostname.mydomain.com: ESP(spi=0x3e7a19d6,seq=0x4), length 140
18:31:28.657156 IP ec2-18-176-21-33.us-east-2.compute.amazonaws.com > hostname.mydomain.com: ESP(spi=0x3e7a19d6,seq=0x5), length 140

There seems to be a change here in how the kernel is processing the traffic in > 4.4 - has anyone else seen this?

/cc @aboch @ventz
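One way to check whether the kernel itself is rejecting the inbound ESP on the newer kernels (a guess at the failure mode; requires CONFIG_XFRM_STATISTICS):

# XFRM error counters; values such as XfrmInNoStates or XfrmInStateProtoError
# rising while the ping runs would mean the ESP packets arrive but fail
# state lookup / decryption
cat /proc/net/xfrm_stat

# Watch SA and policy events live, e.g. across a key rotation
sudo ip xfrm monitor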


Member

thaJeztah commented May 19, 2018

@emcgee could you open a new ticket so that it doesn't get lost? (This issue is closed, so comments easily get overlooked.)
