Docker Swarm encrypted overlay network don't work with current Debian kernel 5.10.103-1 #43359

Closed
MichaelMFrost opened this issue Mar 11, 2022 · 19 comments

Comments

@MichaelMFrost

MichaelMFrost commented Mar 11, 2022

Description

After upgrading the kernel from 5.10.0-11-amd64 #1 SMP Debian 5.10.92-2 to 5.10.0-12-amd64 #1 SMP Debian 5.10.103-1, the encrypted
overlay network between the nodes fails with errors.

Steps to reproduce the issue:

  1. Use: The Current Debian Kernel (5.10.0-12-amd64 SMP Debian 5.10.103-1)
  2. Use: Docker 20.10.13
  3. docker network create -d overlay --opt encrypted=true TestNet
  4. Start containers with the network
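
For anyone who wants to try this, the steps above can be sketched as a small script. Only the `docker network create` line is taken from the report. The service name `web`, the `nginx:alpine` image, `--mode global`, and the `DRY_RUN` switch are assumptions added for illustration, and the sketch assumes a swarm with at least two nodes already exists.

```shell
#!/bin/sh
# Sketch only: reproduce the report on an existing swarm.
# DRY_RUN is a hypothetical switch (default 1) that prints commands instead of running them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

repro() {
    # Step 3 from the report: create the encrypted overlay network.
    run docker network create -d overlay --opt encrypted=true TestNet
    # Step 4: start containers on that network. A global service places one
    # task on every node, so traffic has to cross hosts (assumed setup).
    run docker service create --name web --mode global --network TestNet nginx:alpine
}
```

Calling `repro` prints the two commands. Setting `DRY_RUN=0` before calling it on a manager node would execute them.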

Describe the results you received:

  • No TCP connection is possible between the containers
  • Log output:

Mar 10 14:18:31 srv01 dockerd[1297]: time="2022-03-10T14:18:31.277303894+01:00" level=warning msg="Failed Adding rSA{Dst: 10.55.2.11, Src: 10.55.2.10, Proto: esp, Mode: transport, SPI: 0xd457eb22, ReqID: 0xd0c4e3, ReplayWindow: 0, Mark: , OutputMark: 0, Ifid: 0, Auth: , Crypt: , Aead: {Name: rfc4106(gcm(aes)), Key: , ICV length: 64}, Encap: , ESN: false}: invalid argument"

Mar 10 14:18:31 srv01 dockerd[1297]: time="2022-03-10T14:18:31.277371111+01:00" level=warning msg="Failed Adding fSA{Dst: 10.55.2.10, Src: 10.55.2.11, Proto: esp, Mode: transport, SPI: 0x29ad0c9a, ReqID: 0xd0c4e3, ReplayWindow: 0, Mark: , OutputMark: 0, Ifid: 0, Auth: , Crypt: , Aead: {Name: rfc4106(gcm(aes)), Key: , ICV length: 64}, Encap: , ESN: false}: invalid argument."

Mar 10 14:18:31 srv01 dockerd[1297]: time="2022-03-10T14:18:31.277415765+01:00" level=warning msg="Adding fSP{{Dst: 10.55.2.10/32, Src: 10.55.2.11/32, Proto: 17, DstPort: 4789, SrcPort: 0, Dir: dir out, Priority: 0, Index: 0, Action: allow, Ifindex: 0, Ifid: 0, Mark: (0xd0c4e3,0xffffffff), Tmpls: [{Dst: 10.55.2.10, Src: 10.55.2.11, Proto: esp, Mode: transport, Spi: 0x29ad0c9a, Reqid: 0xd0c4e3}]}}: invalid argument"
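
To check whether a node is hitting the same failure, the daemon log can be filtered for these messages. This is a minimal sketch: the function name `xfrm_failures` is made up here, and the patterns are derived only from the three log lines quoted above, so other wordings would be missed.

```shell
#!/bin/sh
# Sketch: keep only dockerd log lines that report a failed IPsec SA/SP installation.
# Reads log text on stdin and prints the matching lines.
xfrm_failures() {
    grep -E '(Failed Adding [rf]SA|Adding [rf]SP).*invalid argument'
}
```

Assuming the daemon runs as a systemd unit named `docker`, usage could look like `journalctl -u docker | xfrm_failures`. Running `ip xfrm state` and `ip xfrm policy` as root on the node shows which security associations and policies the kernel actually accepted.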

Describe the results you expected:

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:
Docker-ce 20.10.13

Additional environment details (AWS, VirtualBox, physical, etc.):

@MichaelMFrost MichaelMFrost changed the title Docker Swarm encrypted overlay network don't work with Debian kernel 5.10.103-1 Docker Swarm encrypted overlay network don't work with current Debian kernel 5.10.103-1 Mar 14, 2022
@grichner

Hi Michael,

the same problem occurs on Amazon Linux 2.
Version 5.10.93 and all earlier kernels work with IPsec encryption in a swarm multi-host network,
but since version 5.10.96 (08.02.2022) no traffic is sent to other nodes, so we pinned the kernel to the last working one.

Here is a sample to reproduce the problem on Amazon Linux 2:

  1. Set up 2 instances with Amazon Linux 2:
yum update -y ## in case the image is old and didn't have the latest SEC-updates
yum install -y yum-utils ##needs:restarting package
## always get the latest AWS Software even if the ami does (or does not)  
amazon-linux-extras install -y epel kernel-ng
yum install -y aws-cfn-bootstrap hibagent mlocate vim-enhanced bind-utils jq wget docker s3fs-fuse pv mc iotop net-tools nload amazon-efs-utils parallel fpart deltarpm
systemctl set-default multi-user.target ## default runlevel to non graphical
systemctl enable docker
systemctl restart docker

Reboot the instances, then run a swarm init on one node and the corresponding join on the second node.
You now have the latest Linux kernel installed (5.10.102 at the time of writing).

  2. Now we can build a cluster with a simple, slightly extended container.
    docker-compose.yaml:
# To start Docker in Swarm mode, you need to run `docker swarm init`
# To deploy the Grid, `docker stack deploy -c docker-compose.yaml aws-grid --with-registry-auth --prune`
# Stop with `docker stack rm aws-grid`
 
version: '3.9'
 
networks:
  default:
    driver: overlay
    driver_opts:
      encrypted: "true"
 
services:
  node01:
    image: grichner/httpd-test
    deploy:
      placement:
        constraints:
          - node.hostname==node1.hostname.goes.here
      replicas: 1
      update_config:
        parallelism: 1
        delay: 120s
        order: start-first
      resources:
        reservations:
          memory: 300M
 
  node02:
    image: grichner/httpd-test
    deploy:
      placement:
        constraints:
          - node.hostname==node2.hostname.goes.here
      replicas: 1
      update_config:
        parallelism: 1
        delay: 120s
        order: start-first
      resources:
        reservations:
          memory: 300M

The above image was built the following way.
Dockerfile:

FROM httpd:latest
# Install curl and tcpdump for connectivity tests (update and install in one layer)
RUN apt-get update && apt-get install -y curl tcpdump

(docker build .) or use mine from docker.io if you don't want to build it yourself.

  3. Deploy the stack:
    docker stack deploy -c docker-compose.yaml aws-grid --with-registry-auth --prune
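
Once the stack is deployed, it can be worth confirming that the overlay network really carries the `encrypted` option. A hedged sketch: the function name is hypothetical, and the network name `aws-grid_default` assumes the usual `<stack>_<network>` naming for the stack deployed above.

```shell
#!/bin/sh
# Sketch: succeed if the given network has the "encrypted" driver option set.
# Must run on a manager node where the network is visible.
network_is_encrypted() {
    docker network inspect -f '{{json .Options}}' "$1" | grep -q '"encrypted"'
}
```

For example, `network_is_encrypted aws-grid_default && echo encrypted` on a manager node.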

  4. Get the running kernel version of all nodes:

[screenshot]

  5. Then execute the following on one of the node instances:

[screenshot]

[screenshot]

... as you can see, it does **not** work.
  6. Now we put the last working kernel back in place:
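
The exact commands from the screenshots are not recoverable from the text, so the following is only a guess at an equivalent check: run curl inside the local `node01` container against the `node02` service. The function name, the 5-second timeout, and the container name filter are assumptions. The service names come from the compose file above.

```shell
#!/bin/sh
# Sketch: from a local container of one service, request the other service over the overlay.
# $1: name filter for the local container, e.g. aws-grid_node01
# $2: service name to reach, e.g. node02
# Prints the HTTP status code, or fails if curl cannot connect within 5 seconds.
peer_check() {
    cid="$(docker ps -q -f "name=$1" | head -n 1)"
    if [ -z "$cid" ]; then
        echo "no local container matching $1" >&2
        return 2
    fi
    docker exec "$cid" curl -sS -m 5 -o /dev/null -w '%{http_code}\n' "http://$2/"
}
```

On the instance that runs `node01`, `peer_check aws-grid_node01 node02` should print an HTTP status when the overlay works and time out when it does not.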
# Install kernel in case not yet available
sudo yum install -y kernel-5.10.93-87.444.amzn2.x86_64
# Set the default boot image
sudo grubby --set-default /boot/vmlinuz-5.10.93-87.444.amzn2.x86_64
# Set up yum to ignore kernel* updates
sudo sh -c 'echo "exclude=kernel*" >> /etc/yum.conf'

Reboot all node instances (init 6)
and
check the new running kernel version:
[screenshot]
[screenshot]

  7. Then execute the curl again against the other node:

[screenshot]

As you can see, it works now.

The problem seems to correlate with kernel versions newer than 5.10.93.
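
A rough way to sort a 5.10.x kernel release string into the ranges reported in this thread is sketched below. The function names are made up, the boundaries are only the data points mentioned here (5.10.93 worked, 5.10.96 and 5.10.102 failed, 5.10.108 works again), and `sort -V` assumes GNU coreutils. Note that Debian's `uname -r` (for example `5.10.0-12-amd64`) hides the upstream version, so there the version from `uname -v` has to be passed instead.

```shell
#!/bin/sh
# Strip the distro suffix: "5.10.93-87.444.amzn2.x86_64" -> "5.10.93".
upstream_version() {
    printf '%s\n' "${1%%-*}"
}

# True if $1 sorts strictly before $2 in version order (GNU sort -V).
version_lt() {
    [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Classify a 5.10.x kernel using only the reports in this thread (not authoritative).
classify_5_10() {
    v="$(upstream_version "$1")"
    case "$v" in
        5.10.*) ;;
        *) echo "unknown"; return 0 ;;
    esac
    if version_lt "$v" 5.10.94; then
        echo "reported working"
    elif version_lt "$v" 5.10.108; then
        echo "possibly affected"
    else
        echo "reported fixed"
    fi
}
```

On Amazon Linux 2 this could be called as `classify_5_10 "$(uname -r)"`.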

@grichner

The following kernel patch will fix this issue, probably in the next stable release:

https://lore.kernel.org/netdev/20220307082245.GA1791239@gauss3.secunet.de/T/

so this issue can be closed thereafter ...

@Nowheresly

We have the same issue. Our setup:
docker version 20.10.13
ubuntu focal kernel 5.4.0-105 (kernel 5.4.0-104 is fine)

@MichaelMFrost
Author

Thank you for the info, grichner.

@anpieber

Has anyone already tested, or can anyone confirm, that this fix has been included in any of the newer Ubuntu focal 5.4.0 kernel releases?

Thank you very much & Kind regards

  • Andreas

@Nowheresly

Hi @anpieber
so far I have tested Ubuntu focal kernels 5.4.0-105 and 5.4.0-107 (currently the latest available), and neither works.

A kernel rollback has been applied for Fedora CoreOS (as seen here: coreos/fedora-coreos-tracker#1111), so I suspect Ubuntu will do the same...

@grichner

My information on the kernel update is
that the fix will be backported to the upstream stable kernels down to 5.10,
but I may be wrong.

@UchihaYuki

I can confirm, it doesn't work on 5.4.0-109-generic either.

@grichner

grichner commented Apr 25, 2022

[screenshot]

With upstream kernel 5.10.108 and later, it works again as it did before 5.10.94 (see picture above).

@Nowheresly

For ubuntu focal, the fix is currently planned for kernel version 5.4.0-110

See here

@sbhc68

sbhc68 commented May 2, 2022

Hello,

I just downloaded version 5.4.0-110-generic for Ubuntu 20.04 and it doesn't fix the communication issue on encrypted overlay networks.

@Nowheresly

I may be wrong, but as far as I can see here, 5.4.0-110-generic is not yet released (status: proposed).

@UchihaYuki

It works when I run Ubuntu 20.04 through Multipass and then downgrade to kernel 5.4.0-104-generic.
But it doesn't work when I run Ubuntu 20.04 on DigitalOcean and then downgrade to kernel 5.4.0-104-generic.

@grichner

grichner commented May 3, 2022

the ESP fix is in 5.10.108:

commit 9248694dac20eda06e22d8503364dc9d03df4e2f
Author: Steffen Klassert <steffen.klassert@secunet.com>
Date:   Mon Mar 7 13:11:39 2022 +0100

    esp: Fix possible buffer overflow in ESP transformation
    
    commit ebe48d368e97d007bfeb76fcb065d6cfc4c96645 upstream.
    
    The maximum message size that can be send is bigger than
    the  maximum site that skb_page_frag_refill can allocate.
    So it is possible to write beyond the allocated buffer.
    
    Fix this by doing a fallback to COW in that case.
    
    v2:
    
    Avoid get get_order() costs as suggested by Linus Torvalds.
    
    Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible")
    Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible")
    Reported-by: valis <sec@valis.email>
    Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
    Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

@UchihaYuki

UchihaYuki commented May 4, 2022

Is there any way to make Digital Ocean Ubuntu 20.04 and Hetzner Ubuntu 20.04 work? No matter which kernel I downgrade to, it won't work. I tried 5.4.0-89-generic, 5.4.0-99-generic and 5.4.0-104-generic.
When I disable encrypted overlay, it works again...

I turned ufw off on all nodes and opened the following ports in the DigitalOcean firewall:
image

@s4ke
Contributor

s4ke commented Jun 12, 2022

For us, networking in a Hetzner swarm did not work with linux-image-5.4.0-117-generic, but worked with linux-image-5.4.0-113-generic. Is the patch already merged?

@UchihaYuki

For us, networking in a Hetzner swarm did not work with linux-image-5.4.0-117-generic, but worked with linux-image-5.4.0-113-generic. Is the patch already merged?

Any update? I use Hetzner and Ubuntu 20.04. I've updated my kernel to 5.15.0-86-generic. When I turn on the encrypted option, everything stops working.

@grichner

If your kernel was built after the ESP fix landed in 5.10.108 (commit 9248694dac20eda06e22d8503364dc9d03df4e2f, quoted in full in my comment of May 3, 2022 above), it should work.
Also make sure your VMs can reach each other on the required UDP/TCP ports:
https://docs.docker.com/engine/swarm/swarm-tutorial/
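
Besides the TCP/UDP ports, Docker's swarm documentation also lists IP protocol 50 (ESP) as required for encrypted overlay networks, which a firewall that only has TCP/UDP rules may not pass. The sketch below just prints candidate iptables rules for review and does not apply anything. The function name and the choice of the `INPUT` chain are assumptions, and cloud firewalls have to be configured through the provider's own interface.

```shell
#!/bin/sh
# Sketch: print (not execute) iptables rules that admit swarm traffic from one peer node.
# $1: peer IP address
# $2: "yes" if encrypted overlay networks are in use (default: yes)
swarm_rules() {
    peer="$1"
    encrypted="${2:-yes}"
    echo "iptables -A INPUT -s $peer -p tcp --dport 2377 -j ACCEPT"  # cluster management (managers)
    echo "iptables -A INPUT -s $peer -p tcp --dport 7946 -j ACCEPT"  # node-to-node communication
    echo "iptables -A INPUT -s $peer -p udp --dport 7946 -j ACCEPT"  # node-to-node communication
    echo "iptables -A INPUT -s $peer -p udp --dport 4789 -j ACCEPT"  # VXLAN overlay traffic
    if [ "$encrypted" = "yes" ]; then
        echo "iptables -A INPUT -s $peer -p esp -j ACCEPT"           # IPsec ESP (IP protocol 50)
    fi
}
```

For example, `swarm_rules 10.55.2.11` prints five rules for that peer. To see whether ESP packets arrive at all, something like `tcpdump -ni eth0 'ip proto 50'` on the receiving node can help (the interface name `eth0` is an assumption).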
