IPv6 address pool subnet smaller than /80 causes dockerd to consume all available RAM #40275

Open
bluikko opened this issue Dec 1, 2019 · 22 comments · May be fixed by #47768
Labels
area/networking/ipv6 · area/networking · kind/bug · version/19.03

Comments

bluikko commented Dec 1, 2019

Description

It is documented that the IPv6 pool "should" be at least a /80 so that the MAC address can fit in the last 48 bits.

Using a default-address-pools size larger than 80 causes dockerd to consume too much RAM - the longer the prefix (the smaller the subnet), the more RAM dockerd will use:

  • In the /81 - /90 range the RAM usage increase is negligible, in the range of a few GB.
  • In the /94 - /96 range the RAM usage increase is in the tens to hundreds of GB.

The pool prefix length could be set longer than 80 by a typo or a mistake, and if that leads to dockerd consuming copious amounts of RAM the administrator may lose time troubleshooting the situation.

It looks like a prefix length such as /96 is completely unusable, and dockerd should refuse to start instead of allocating ridiculous amounts of RAM.

At a minimum, a warning message should be printed.

Steps to reproduce the issue:

  1. Set an IPv6 address pool prefix length longer than 80:
  "default-address-pools": [
    { "base": "192.0.2.0/16", "size": 24 },
    { "base": "2001:db8:1:1f00::/64", "size": 96 }
  ],
  2. Start Docker.
  3. Watch the server grind to a halt and the kernel OOM killer being invoked.
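
For a rough sense of scale, here is a back-of-the-envelope calculation of why this blows up (a sketch only; the 64 bytes per pre-allocated subnet entry is an assumed illustrative figure, not a measured one):

package main

import "fmt"

func main() {
	// A /64 base split into /96 subnets yields 2^(96-64) = 2^32 subnets.
	subnets := uint64(1) << (96 - 64)
	// Assume each eagerly pre-allocated pool entry costs on the order of 64 bytes.
	const bytesPerEntry = 64
	fmt.Printf("%d subnets, ~%d GiB if all of them are materialised up front\n",
		subnets, (subnets*bytesPerEntry)>>30)
}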

Describe the results you received:
dockerd consumes very large amounts of RAM (tens of GB).

Describe the results you expected:
Either IPv6 pool prefix lengths longer than 80 should work, or dockerd should refuse to start with a configuration that cannot be used.

At minimum a warning message should be printed for prefix lengths longer than 80.

The documentation does not mention the RAM usage effect either:

The subnet for Docker containers should at least have a size of /80, so that an IPv6 address can end with the container’s MAC address and you prevent NDP neighbor cache invalidation issues in the Docker layer.

Additional information you deem important (e.g. issue happens only occasionally):
100% reproducible.

Output of docker version:

Docker version 19.03.5, build 633a0ea

Output of docker info:

Client:
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 19.03.5
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: fluentd
 Cgroup Driver: systemd
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: b34a5c8af56e510852c35414db4c1f4fa6172339
 runc version: 3e425f80a8c931f88e6d94a8c831b9d5aa481657
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
  selinux
 Kernel Version: 3.10.0-1062.4.3.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 24
 Total Memory: 47.15GiB
 Name: docker.domain
 ID: xxx
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

@thaJeztah added the area/networking and kind/bug labels Dec 2, 2019
@thaJeztah
Member

ping @selansen @arkodg PTAL - looks related to the comments on https://github.com/docker/libnetwork/pull/2058/files#r166658653

arkodg (Contributor) commented Dec 3, 2019

Yes, the issue is that we are allocating too much space: https://github.com/docker/libnetwork/blob/1680ce717394f8aa9ba6de26b851b7e02699d490/ipamutils/utils.go#L114
We should maybe limit n to 20 bits / 1M of space.
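
A hypothetical guard along those lines (a sketch only, not the actual libnetwork code; checkPool and maxPoolBits are made-up names):

package main

import "fmt"

// maxPoolBits caps how many bits a pool entry may be subdivided by,
// i.e. the "20 bits / 1M subnets" limit suggested above.
const maxPoolBits = 20

// checkPool validates a default-address-pools entry of the form base/size.
func checkPool(baseLen, size int) error {
	n := size - baseLen // bits borrowed for subnetting => 2^n subnets
	switch {
	case n < 0:
		return fmt.Errorf("pool size /%d is shorter than the base prefix /%d", size, baseLen)
	case n > maxPoolBits:
		return fmt.Errorf("splitting a /%d base into /%d subnets yields 2^%d entries, above the 2^%d limit", baseLen, size, n, maxPoolBits)
	}
	return nil
}

func main() {
	fmt.Println(checkPool(64, 80)) // 2^16 subnets: <nil>
	fmt.Println(checkPool(64, 96)) // 2^32 subnets: error
}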

dukekautington3rd commented Oct 15, 2020

+1 as this happened to me and caused my server to crash. I tried to limit the allocated IPv6 space by leveraging "default-address-pools" in daemon.json.

# dockerd --version
Docker version 19.03.8, build afacb8b7f0
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "bip": "172.17.17.1/24",
  "ipv6": true,
  "fixed-cidr-v6": "2600:####:####:###1::/64",
  "dns": [
    "172.16.10.1",
    "2600:####:####:###0::1"
  ],
  "dns-search": [
    "kaut.io"
  ],
  "default-address-pools": [
    {
      "base": "2600:####:####:###1::/64",
      "size": 80
    }
  ]
}

Size 80 works fine but 112 would not allow dockerd to start, and 96 crippled my server.

It is very interesting to me that this happens with the default Docker bridge, but you can easily achieve this (no crashing) with the IPAM driver (shown in docker-compose.yml):

networks:
  custombr0:
    driver: bridge
    enable_ipv6: true
    driver_opts:
      com.docker.network.bridge.name: "custombr0"
      com.docker.network.bridge.enable_icc: "true"
      com.docker.network.bridge.enable_ip_masquerade: "true"
      com.docker.network.bridge.host_binding_ipv4: "0.0.0.0"
      com.docker.network.driver.mtu: "1500"
      com.docker.network.enable_ipv6: "true"
    ipam:
      driver: default
      config:
        - subnet: 172.17.18.0/24
          ip_range: 172.17.18.32/28
          gateway: 172.17.18.1
        - subnet: "2600:####:####:###f::/64"
          ip_range: "2600:####:####:###f::/112"
          gateway: "2600:####:####:###f:face::1"

Workaround in place but hope it gets fixed soon. 😃
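
For reference, the workaround above can also be expressed as a user-defined network from the CLI (a rough equivalent only; the 2600:… prefixes are the same masked placeholders as in the compose file above and must be replaced with real ones):

docker network create --driver bridge --ipv6 \
  --subnet 172.17.18.0/24 --ip-range 172.17.18.32/28 --gateway 172.17.18.1 \
  --subnet "2600:####:####:###f::/64" --ip-range "2600:####:####:###f::/112" \
  --gateway "2600:####:####:###f:face::1" \
  -o "com.docker.network.bridge.name=custombr0" \
  custombr0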

@matthijskooijman

Is this still an issue? I just tried the below config:

        "ipv6": true,
        "fixed-cidr-v6": "fd00:dead:beef::/64",
        "default-address-pools": [
                {
                        "base": "172.80.0.0/16",
                        "size": 24
                },
                {
                        "base": "172.90.0.0/16",
                        "size": 24
                },
                {
                        "base": "fd00:dead:beef::/48",
                        "size": 64
                }
        ]

which starts Docker without problems. I cannot actually get any IPv6 addresses auto-assigned to custom-created networks, though, so maybe there's some other issue in my setup that prevents this issue from triggering.

$ docker --version
Docker version 20.10.2, build 2291f61

In the /81 - /90 range the RAM usage increase is negligible, in the range of a few GB.

Is this a typo? "a few GB of extra RAM" does not sound negligible to me? Or do you mean the total usage is a few GB with or without this address pool?

bluikko (Author) commented Jan 15, 2021

Is this a typo? "a few GB of extra RAM" does not sound negligible to me? Or do you mean the total usage is a few GB with or without this address pool?

It was not a typo. It means the increase due to the pool configuration was a few GB.

The "negligible" may be debatable - a few GB may be a lot usually but compared to the increase of hundreds of GB for the smaller subnets it could be said to be negligible.

All of it is obviously a bug; even a few GB of increase just from using a /81 subnet makes no sense.

@matthijskooijman

The "negligible" may be debatable - a few GB may be a lot usually but compared to the increase of hundreds of GB for the smaller subnets it could be said to be negligible.

It sounds like we might be working in different types of environments, then. In my environment, servers have a couple of GB of RAM, maybe 8G if I'm lucky. Sounds like "just a few GB" might be acceptable in your environment; from where I'm standing, this would mean it's not even near usable at all. Anyway, let's not waste too much time on this, thanks for confirming in any case :-)

@digitalresistor

AWS is delegating a prefix up to a /80 per instance: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-prefix-eni.html#ec2-prefix-basics

GCP is delegating a prefix of /96 per instance: https://cloud.google.com/compute/docs/ip-addresses/configure-ipv6-address#ipv6-assignment

There are other cloud providers that are offering even smaller sizes for their prefix delegations.

Docker should work with smaller allocations too.

@olljanat
Contributor

LOL. So there is a bug like this, while on the other hand Calico only allows IPv6 prefixes of /116 - /128: https://docs.projectcalico.org/networking/change-block-size

I opened PR projectcalico/node#1337 to improve the situation on the Calico side, but what is the plan here? I think that at least /96, which GCP supports, should be supported by Docker too.

akerouanton added a commit to akerouanton/docker that referenced this issue Nov 15, 2021
This commit resolves moby#40275 by implementing a custom iterator
named NetworkSplitter. It splits a set of NetworkToSplit into smaller
subnets on demand by calling its Get method.

Prior to this change, the list of NetworkToSplit was split into smaller
subnets when ConfigLocalScopeDefaultNetworks or
ConfigGlobalScopeDefaultNetworks were called or when the ipamutils package
was loaded. When one of the Config functions was called with an IPv6 net
to split into small subnets, all the available memory was consumed. For
instance, fd00::/8 split into /96 would take ~5*10^27 bytes.

Although this change trades memory consumption for computation cost, the
NetworkSplitter is used by the libnetwork/ipam package in such a way that it
only has to compute the address of the next subnet. When the
NetworkSplitter reaches the end of NetworkToSplit, it is reset by
libnetwork/ipam only if some subnets were released beforehand. In
that case, the ipam package might iterate over all the subnets before
finding an available one. This is the worst case, but I believe it is
not really impactful: either many subnets exist (more than one host
can handle), or not that many subnets exist and a full iteration over the
NetworkSplitter is fast.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
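
To illustrate the on-demand approach described above, here is a minimal, self-contained sketch of a lazy subnet iterator (illustrative only - names like lazySplitter are made up and this is not the actual moby/libnetwork code):

package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
)

// lazySplitter enumerates the /size subnets of an IPv6 base prefix one at a
// time instead of materialising all of them up front (IPv6 only, for brevity).
type lazySplitter struct {
	base  netip.Prefix
	size  int
	count uint64 // subnets handed out so far
}

// Get returns the next subnet, or ok=false once the base prefix is exhausted.
func (s *lazySplitter) Get() (netip.Prefix, bool) {
	n := s.size - s.base.Bits() // the base holds 2^n subnets of the requested size
	if n < 0 || (n < 64 && s.count >= 1<<uint(n)) {
		return netip.Prefix{}, false
	}
	// start address = base + count * 2^(128-size), computed as 128-bit arithmetic
	// split across two uint64 halves so large shifts cannot overflow.
	a := s.base.Masked().Addr().As16()
	hi := binary.BigEndian.Uint64(a[:8])
	lo := binary.BigEndian.Uint64(a[8:])
	shift := uint(128 - s.size)
	if shift < 64 {
		step := s.count << shift
		if lo+step < lo { // carry into the high half
			hi++
		}
		lo += step
		hi += s.count >> (64 - shift)
	} else {
		hi += s.count << (shift - 64)
	}
	binary.BigEndian.PutUint64(a[:8], hi)
	binary.BigEndian.PutUint64(a[8:], lo)
	s.count++
	return netip.PrefixFrom(netip.AddrFrom16(a), s.size), true
}

func main() {
	s := &lazySplitter{base: netip.MustParsePrefix("fd00::/8"), size: 96}
	for i := 0; i < 3; i++ {
		p, _ := s.Get()
		fmt.Println(p) // successive /96 subnets, computed in O(1) memory
	}
}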
gucki commented Oct 11, 2022

Still happening with current docker 20.10.18 👎

akerouanton added a commit to akerouanton/docker that referenced this issue Apr 11, 2023
A new Subnetter structure is added to lazily sub-divide an address pool
into subnets. This fixes moby#40275.

Prior to this change, the list of NetworkToSplit was eagerly split into
smaller subnets when the ipamutils package was loaded, when
ConfigGlobalScopeDefaultNetworks was called, or when the function
SetDefaultIPAddressPool from the default IPAM driver was called. In the
latter case, if the list of NetworkToSplit contained an IPv6 prefix,
eagerly enumerating all subnets could eat all the available memory. For
instance, fd00::/8 split into /96 would take ~5*10^27 bytes.

Although this change trades memory consumption for computation cost, the
Subnetter is used by the libnetwork/ipam package in such a way that it
only has to compute the address of the next subnet. When
the Subnetter reaches the end of NetworkToSplit, it is reset by
libnetwork/ipam only if some subnets were released beforehand. In
that case, the ipam package might iterate over all the subnets before
finding an available one.

Also, the Subnetter leverages the newly introduced ipbits package, which
handles IPv6 addresses correctly. Before this commit, a bitwise shift
was overflowing uint64 and thus only a single subnet could be enumerated
from an IPv6 prefix. This fixes moby#42801.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
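
A tiny illustration of the uint64 truncation mentioned in the last paragraph (illustrative only; in Go, shifting a uint64 by 64 or more bits yields 0):

package main

import "fmt"

func main() {
	// Computing "base + ordinal * subnetStep" for IPv6 needs 128-bit arithmetic.
	// In a single uint64, any shift of 64 bits or more simply yields 0:
	ordinal, shift := uint64(3), uint(64)
	fmt.Println(ordinal << shift) // prints 0 - the offset vanishes, so every "next" subnet collapses onto the base
	// Splitting the 128-bit address across two uint64 halves (or using a helper
	// such as the ipbits package mentioned above) avoids this truncation.
}
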
akerouanton added a commit to akerouanton/docker that referenced this issue Apr 11, 2023
A new Subnetter structure is added to lazily sub-divide an address pool
into subnets. This fixes moby#40275.

Prior to this change, the list of NetworkToSplit was eagerly split into
smaller subnets when ipamutils package was loaded, when
ConfigGlobalScopeDefaultNetworks was called or when the function
SetDefaultIPAddressPool from the default IPAM driver was called. In the
latter case, if the list of NetworkToSplit contained an IPv6 prefix,
eagerly enumerating all subnets could eat all the available memory. For
instance, fd00::/8 split into /96 would take ~5*10^27 bytes.

Although this change trades memory consumption for computation cost, the
Subnetter is used by libnetwork/ipam package in such a way that it
only have to compute the address of the next subnet. When
the Subnetter reach the end of NetworkToSplit, it's resetted by
libnetwork/ipam only if there were some subnets released beforehand. In
such case, ipam package might iterate over all the subnets before
finding one available.

Also, the Subnetter leverages the newly introduced ipbits package, which
handles IPv6 addresses correctly. Before this commit, a bitwise shift
was overflowing uint64 and thus only a single subnet could be enumerated
from an IPv6 prefix. This fixes moby#42801.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
polarathene (Contributor) commented Apr 12, 2023

It is documented that the IPv6 pool "should" be at least a /80 so that the MAC address can fit in the last 48 bits.

Can a link be provided? No mention in the current Docker IPv6 docs page

EDIT: From old Docker docs authored Oct 2015:

Often servers or virtual machines get a /64 IPv6 subnet assigned (e.g. 2001:db8:23:42::/64).
In this case you can split it up further and provide Docker a /80 subnet while using a separate /80 subnet for other applications on the host.

...

Remember the subnet for Docker containers should at least have a size of /80.
This way an IPv6 address can end with the container’s MAC address and you prevent NDP neighbor cache invalidation issues in the Docker layer.

So if you have a /64 for your whole environment use /78 subnets for the hosts and /80 for the containers.
This way you can use 4096 hosts with 16 /80 subnets each.

All of that was stripped away in a Feb 2018 rewrite, but has been cached by various other sites which show up in search engine results for queries with Docker + IPv6.

I am curious if that's still applicable or outdated information given this tidbit:

The original design of IPv6 allocation was for 80 bits of network addressing and 48 bits of host addressing.
After that was shared with the IEEE, they pointed out that future ethernet-follow-on protocols would use a 64 bit EUI rather than a 48 bit MAC.
Thus the IPv6 network:host split was moved from 80:48 to 64:64.

That would seem to potentially align with why a prefix of /80 was advised in the older Docker docs? Today that would be /64?


UPDATE: Looked into this further.

  • Only the default docker0 bridge seems to have the /80 MAC behaviour, which AFAIK limits the network's IP range to /80 in size anyway; anything smaller like /81 would truncate the MAC address usage, hence the NDP cache issues. However, when you use a subnet size of /81 or smaller, the IP address assignment seems to behave like user-defined network bridges (dropping the MAC value and just incrementing). Since the default bridge is considered legacy, I doubt it matters then?

Using a default-address-pools size larger than 80 causes dockerd to consume too much RAM - the longer the prefix (smaller subnet), the more RAM dockerd will use

  "default-address-pools": [
    { "base": "192.0.2.0/16", "size": 24 },
    { "base": "2001:db8:1:1f00::/64", "size": 96 }
  ],
  • An IPv6 /64 address block (GUA and ULA assign this as the network prefix (routing prefix + subnet ID), while the remaining 64 bits are a network device ID; in this context representing an individual container).
  • Split into over 4 billion subnets of size /96 (2^64 / 2^32 ≈ 4 billion; a /96 prefix on 128-bit addresses leaves 2^(128-96) = 2^32 addresses per subnet).
  • The IPv4 address block of 2^16 addresses is more sensible, supporting 256 sub-networks each offering 256 IP addresses (2^(32-16) / 2^(32-24) => 2^16 / 2^8 => 256).

Size 80 works fine but 112 would not allow dockerd to start, and 96 crippled my server.

An IPv6 address block with a /64 CIDR prefix split into subnets of size /80 (48-bit blocks) results in 65k networks: 2^(128-64) / 2^(128-80) => 2^64 / 2^48 => 2^16.

That is much smaller than 4 billion /96 subnets (or for /112 subnets (2^16 hosts each), a total of 2^48 networks! 😬 )
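
A quick way to sanity-check those counts (plain arithmetic, nothing Docker-specific):

package main

import "fmt"

func main() {
	// Number of subnets when a /base prefix is carved into /size subnets: 2^(size-base).
	for _, c := range []struct{ base, size int }{{64, 80}, {64, 96}, {64, 112}} {
		fmt.Printf("/%d split into /%d subnets -> 2^%d = %d\n",
			c.base, c.size, c.size-c.base, uint64(1)<<uint(c.size-c.base))
	}
}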

polarathene (Contributor) commented Apr 12, 2023

There are other cloud providers that are offering even smaller sizes for their prefix delegations.

What you've linked to is vendor docs for IP address assignment within a given CIDR prefix range, not the full CIDR block?

Similar to a host/server with a single public IPv4 address, you can bind your containers' ports to that address, but the containers may be using subnets of IPv4 private range addresses internally. You can likewise have a public IPv6 address and have containers use private range ULA addresses that you can subnet too. Or bind to the public IPv6 address(es) your server NIC has available.

The problem described in this issue AFAIK is that Docker is being given a large block ("base" in the config) to divide into billions of subnets (due to the "size" prefix reducing the number of hosts per subnet).


Docker should work with smaller allocations too.

I don't think that is the actual problem?

This isn't my forte, but I'd assume you could provide a longer prefix length for "base" (fewer networks), or a lower "size" value (a larger IP range per network). I don't think the Docker docs explain the settings too well, but there are some decent articles on the subject that make up for that.

Someone might want to correct me, but this might work: fd00:dead:beef:cafe:feed:face::/96 (base) + 102 (size) would presumably provide 64 subnets, each supporting 2^26 hosts (over 67 million). Although AFAIK this may break things with /96 instead of /64?

With IPv6, a subnet is not meant to be smaller than /64 (as in no fewer than 2^64 hosts per subnet; /80 would lower that to 2^48 hosts per subnet). The other 64 bits are for the interface ID, and using an IPv6 prefix like /80 would break IPv6 features:

"Normal" subnets are never expected to be any narrower (longer prefix) than /64.
No subnets are ever expected to be wider (shorter prefix) than /64

Using a subnet prefix length other than a /64 will break many features of IPv6,
including Neighbor Discovery (ND), Secure Neighbor Discovery (SEND) [RFC3971], privacy extensions [RFC4941], parts of Mobile IPv6 [RFC4866], Protocol Independent Multicast - Sparse Mode (PIM-SM) with Embedded-RP [RFC3956], and Site Multihoming by IPv6 Intermediation (SHIM6) [SHIM6], among others.
A number of other features currently in development, or being proposed, also rely on /64 subnet prefixes.

And Wikipedia on the address formats:

  • The network prefix (the routing prefix combined with the subnet id) is contained in the most significant 64 bits of the address.
  • The size of the routing prefix may vary; a larger prefix size means a smaller subnet id size.
  • The bits of the subnet id field are available to the network administrator to define subnets within the given network.

UPDATE: I might be mistaken?

There is this Google blogpost about using a ULA /48 prefix per VPC, which allows for 2^16 subnets of /64. They then depict that each /64 subnet provides 4 billion /96 ranges for VM interfaces, each with 4 billion IPv6 addresses available:

[Figure: Google IPv6 VPC assignment breakdown]

It's still a /64 subnet, while the /96 is bridged with DHCPv6?:

When you enable IPv6 on a VM, the VM is assigned a /96 range from the subnet that it is connected to.
The first IP address in that range is assigned to the primary interface using DHCPv6.

You don't configure whether a VM gets internal or external IPv6 addresses.
The VM inherits the IPv6 access type from the subnet that it is connected to.

If you use Docker on one of those VMs, then perhaps the networks it creates become "subnets" smaller than /96, but from what I've read the IPv6 issues with subnets smaller than /64 aren't as applicable when DHCPv6 is used and you have something like Docker managing its networks / IP assignments?

But these are different from the /64 subnet routing wise? 🤷‍♂️

@akerouanton added the area/networking/ipv6 label Apr 12, 2023
polarathene (Contributor) commented Apr 13, 2023

Just for reference, since the config examples shared in this discussion use a variety of IP addresses.

  1. IPv6 address pool subnet smaller than /80 causes dockerd to consume all available RAM #40275 (comment)
    Original post example is fine, showing examples with IP addresses reserved for documentation.
  2. IPv6 address pool subnet smaller than /80 causes dockerd to consume all available RAM #40275 (comment)
    • Private / public mix: an IPv4 address in the private range combined with a public IPv6 prefix for a single subnet (but they then subdivide that into /80?).
    • Then a compose example configures that /64 subnet with an ip_range of /112. AFAIK that is not equivalent (1 explicit subnet is configured, and the allowed range is reduced to /112 aka 2^16 IP addresses), but you can't use another /112 range with the same /64 subnet that way? (at least not with equivalent options in docker network create)
  3. IPv6 address pool subnet smaller than /80 causes dockerd to consume all available RAM #40275 (comment)
    Private / public mix:
    • Public IPv4 (unclear if intentional; 172.x.y.z is sometimes mistaken as entirely private range, but only a subset of it is).
    • IPv6 ULA address with prefix /48, allowing for 2^16 subnets (approx 65k) of /64 (but presently will only add a single /64 into the pool).

Since the default default-address-pools use IPv4 private range addresses, I'd assume it would typically make sense to do the same for IPv6 by using ULA here?

  "default-address-pools": [
    {
      "base": "fd00:face:feed::/48",
      "size": 64
    }
  ]

Documentation is lacking a bit, and a few bugs that have been unresolved for years (and are undocumented) add to the confusion. I get the impression that the above snippet is how IPv6 subnet pools should be configured, but as detailed below it is not practical due to bugs.

Behaviour (with gotchas)

This will kind of work. You can't use docker network create --ipv6 as it'll complain about no IPv4 available (presumably a bug? 🤷‍♂️ ), so add IPv4 into the mix:

  "default-address-pools": [
    {
      "base": "172.17.0.0/12",
      "size": 16
    },
    {
      "base": "fd00:face:feed::/48",
      "size": 64
    }
  ]

With that resolved, if you create an IPv6-enabled network, it'll be assigned two subnets: a /16 (IPv4) + a /64 (IPv6).

If you enable IPv6 on the default bridge (daemon.json with ipv6: true + fixed-cidr-v6), the fixed-cidr-v6 setting acts like docker network create --subnet, but specifically for the default docker0 bridge. The explicit subnet is not restricted to the default-address-pools, but if you assign the bridge the only /64 the pool makes available (due to the bug), then there are no subnets left in the IPv6 pool.
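
A minimal illustrative daemon.json along those lines (the ULA prefixes are made-up placeholders; keeping fixed-cidr-v6 outside the pool's base avoids handing the default bridge the only /64 the pool currently yields):

{
  "ipv6": true,
  "fixed-cidr-v6": "fd00:beef:cafe:1::/64",
  "default-address-pools": [
    { "base": "172.17.0.0/12", "size": 16 },
    { "base": "fd00:face:feed::/48", "size": 64 }
  ]
}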

AFAIK docker network create will choose a subnet from default-address-pools config when you don't provide --subnet, but with IPv6:

  • The size value only starts working from 64 and higher (splitting a /64; e.g. size: 65 will allow creating two networks, each with half the /64 subnet space). I don't think IPv6 subnets are meant to be any larger (or smaller?) than /64.
  • The base value is effectively treated as a /64, so a /48 won't equate to 2^16 subnets of /64. This appears to be a bug.

Perhaps I misunderstood how default-address-pools is meant to be used, or got confused by the differences in IPv6 addressing / subnets, since the network prefix includes a subnet ID and the resources I have linked all advise avoiding subnets smaller than /64 (e.g. /65)? (Maybe that's different in the context of containers?)

docker network create --subnet "fd00:face:feed:cafe::/64" test-ipv6 with the --ip-range fd00:face:feed:cafe::/65 option specifies a single subnet and a subrange of IP addresses, but you cannot assign another network the same --subnet value with a different --ip-range (unclear if that's a bug). docker network inspect would show the network has a /64 subnet, while daemon.json with a /64 base and size: 65 would show a network created with a /65 subnet.
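
For reference, letting Docker pick subnets from the pool and then checking what it chose (assuming the pools shown earlier in this comment):

docker network create --ipv6 pool-ipv6
docker network inspect pool-ipv6 --format '{{range .IPAM.Config}}{{.Subnet}} {{end}}'
# expect one IPv4 subnet from the 172.17.0.0/12 pool and one IPv6 /64 from the ULA pool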


Could this be clarified?

  • XXXX:YYYY:ZZZZ::/48 to XXXX:YYYY:ZZZZ::/64 is a 2^16 range for "Subnet ID" for IPv6 GUA and ULA address formats in their Network prefix (GUA Subnet ID range is lower when a larger Routing prefix is used).
  • fd00:face:feed:cafe::/64 is a single ULA subnet that supports 2^64 IPv6 addresses, and splitting that into smaller networks may break some IPv6 features?
  • How does Docker managing networks play into this? When creating subnets that are smaller than /64, does this somehow avoid the negative side effects mentioned from /65 or higher? (is the old docs /80 subnet size concern still relevant?)
  • Is there a good reason to use IPv6 subnets smaller than /64? Wouldn't a pool of /64 subnets from an IPv6 with prefix length /48 as base make sense? (at least with ULA) That seems to be the equivalent of using private IPv4 range for address pools in subnets? (but this is not equivalent to a subnet with --ip-range?)

Related issues

Found these during the above investigation write-up 😅

scyto commented Sep 4, 2023

I am a plain old regular user (PORU?). I have read the current live docs and the threads linked above, and I am still utterly confused on the scenarios vs the best practice. The docs IMO really need to be clear on:

  • When to use ip6tables vs not, and WHY
  • When you want to use some private IPv6 address space inside the Docker networks (i.e. you only want routing between containers over IPv6 vs having them globally routable).
  • When the containers will be IPv6 NAT-like vs globally routed
  • Not assuming that a /48 is the only good example (I have a /56)

For me, I am stumped. I have a swarm.

The questions I have are:

  1. Is IPv6 supported in a swarm (will the overlay networks get auto-created correctly)?
  2. If I set the default CIDR to one /64, what should I be setting the default IPv6 address pools to (if at all), and how does this vary between wanting globally routed vs not?
  3. What do I need to do about RAs if I want full external connectivity for the containers, or want to ping containers from my LAN, etc.?

I see no reason I wouldn't want my containers globally routable like any other host on my network.
As such I think my json entries should be as follows, but I am not entirely sure. (I think I needed the 172.x and 192.x entries, as adding the IPv6 pool seemed to break the native base pools?)

I think ip6tables is basically required if I want connectivity from the containers to the outside world.

I also set sysctl net.ipv6.conf.all.proxy_ndp=1 and net.ipv6.conf.all.accept_ra = 1 based on some documentation I found, but I am really not sure how and when they are needed.

I assume ip6tables effectively NATs or firewalls the incoming packets, OR that the host doesn't advertise routes, so no client on my network can get ingress to the containers with IPv6?

{
  "ipv6": true,
  "fixed-cidr-v6": "xxxx:xxxxx:xxxxx:d1::/64",
  "experimental": true,
  "ip6tables": true,
  "default-address-pools": [
    {"base" : "172.17.0.0/12","size" : 16},
    {"base" : "192.168.0.0/16","size" : 20},
    {"base" : "xxxx:xxxx:xxxx:d2::/64", "size" : 80}
  ]
}

bluikko (Author) commented Sep 4, 2023

and I am still utterly confused on the scenarios vs the best practice

I've not had to look at the Docker documents in years, lucky me, but I assume they still only try to vaguely describe basic concepts like networking and omit all the technical details specific to Docker that us administrators would really need.
It is the most frustrating documentation I have seen.

I have a swarm.

Swarm does not support IPv6?

You should probably open a new issue as it has nothing to do with the issue described in the OP.

scyto commented Sep 4, 2023

Swarm does not support IPv6?

I don't know, it was supposed to be a question, not a statement :-) I just mean the documentation needs to be clear, that's all, which is why I posted on the documentation thread.

It seems the docker_gwbridge does get scope local but doesn't get assigned anything from the main bridge IPv6 range or the pool range (docker0 did). I am working through about 15 open tabs with different guidance ... at least I am now at a point where a docker run -it --rm container on the default bridge has similar levels of IPv4 and IPv6 connectivity...

Next up: what happens when I push a service to that node....

bluikko (Author) commented Sep 4, 2023

Swarm does not support IPv6?

I don't know, it was supposed to be a question not a statement

I phrased that poorly. I meant that a few years ago swarm did not really support IPv6, even if you might see references to it working with IPv6 in the fine documentation.

And as recently as a few months ago I have seen comments by those "in the know" that swarm does not work -- or does not work "right" -- with IPv6.

scyto commented Sep 4, 2023

And as recently as a few months ago I have seen comments by those "in the know" that swarm does not work -- or does not work "right" -- with IPv6.

Thanks, that's good to know; I can stop banging my head, and I guess I need to read the docs better (but IPv6 is documented in many places; it would have been good if this was documented in the main IPv6 section, as well as the compose reference (I just found it)).

[screenshot of the Compose file reference]

Though that formatting is terrible; maybe I will do a PR on the docs if my OCD gets the better of me :-)

bluikko (Author) commented Sep 4, 2023

I guess I need to read the docs better

As mentioned in my reply above, I'm fairly certain the docs used to proudly and explicitly state that swarm supports IPv6 while it really did not. I would wager it's still like that in one way or another; if not explicitly stating that IPv6 works then at least implicitly.

You might have better luck trawling the issues here, I'm sure there are some very lengthy comment threads on swarm & IPv6.

scyto commented Sep 4, 2023

Yeah, that was what I was doing. For example, even if I take swarm out of the equation, the example in the compose 3 reference only shows this. Who on earth wants to statically IPv6-address anything but the most critical hosts (DNS servers - like, I mean, that's it, those are the only ones)? Routers should be discovered by RAs. Everything else should be dynamic. Also, what was the point of me setting a base pool if the ipam commands can't use it? Or maybe if I didn't put a v6 address it would pick from the pool? Who knows, the docs don't say, lol. Anyway, enough of my griping; I will go see what I can find with more searches - this IPv6 / Docker task has been on my to-do list for ages - just my sort of holiday weekend research :-)

networks:
  app_net:
    ipam:
      driver: default
      config:
        - subnet: "172.16.238.0/24"
        - subnet: "2001:3984:3989::/64"

scyto commented Sep 4, 2023

Well, cool: docker network create --ipv6 ip6net worked, and each time I created a new one it incremented the subnet! Nice, so I assume I can leave the IPv6 addressing out of the compose too! Not sure I have seen a single howto that describes this behavior - time for me to do a blog maybe? And now I have to go file a bug with Portainer as they force me to manually specify.... sigh...
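
For what it's worth, a network created that way can then be referenced from a Compose file instead of hard-coding subnets (a minimal sketch; ip6net is the network created above):

services:
  app:
    image: alpine
    command: sleep infinity
    networks:
      - ip6net

networks:
  ip6net:
    external: true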


polarathene (Contributor) commented Sep 4, 2023

if I take swarm out of the equation the example in the compose 3 reference only shows this

I can't advise on swarm, and my experience with a GUA network had various gotchas that I didn't find time to document better, but you may find these IPv6 with Docker docs I wrote helpful?

It shows how to set up with the Docker CLI or Docker Compose. The official Docker IPv6 docs were in worse shape until recently (May), when they received a big revision (I provided some review feedback). My unofficial docs might provide a helpful resource though 😅

You can definitely create an IPv6 network via the CLI and reference it via compose.yaml. My linked docs should mention that IIRC (NOTE: the link is not entirely stable as it's waiting on a v13 release of the project, while the linked edge version in future will probably break when the docs are moved around).


Here's a preview:

[screenshot preview of the linked IPv6 docs]

If you're using IPv4 NAT (the default), IPv6 ULA works well at providing IPv6 networking that behaves the same way between containers and with the userland-proxy that is enabled by default.

IPv6 ULA benefit over IPv6 GUA:

  • Containers aren't going to like binding to a public IPv4 interface when another container has already bound the same public port, which AFAIK makes the benefits of IPv6 GUA less useful unless you don't need to be publicly reachable via IPv4? (Presently, last I checked, you can't opt out of IPv4 assignment to a container.)
    • You could of course also use a reverse proxy, but I'm not sure why you'd assign each container its own public IPv6 address if a reverse proxy is in use for IPv4?
  • IPv6 ULA makes more sense as a preferred default for those who don't know any better, as users less familiar with IPv6 tend to get confused about GUA publicly exposing their containers (or containers not being accessible due to the firewall, while IPv4 / ULA bypass the firewall via directly managed iptables rules).

@funnelfiasco

From a comment on 43033, it seems that #46755 may provide a fix for this issue. Can anyone verify one way or another?
