libnet/ipams/default: introduce a linear allocator
The previous allocator was subnetting address pools eagerly
when the daemon started, and would then just iterate over that
list whenever RequestPool was called. This led to high memory
usage whenever IPv6 pools were configured with a target subnet
size that differed too much from the pool's prefix size.

For instance: pool = fd00::/8, target size = /64 -- 2^(64-8) = 2^56
subnets would be generated upfront. At 128 bits per subnet address,
this would take approx. 9 * 10^18 bits -- way too much for any
computer in 2024.
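
The figure above can be sanity-checked with a few lines of Go. This is
only a check of the arithmetic, assuming each generated subnet is stored
as a bare 128-bit IPv6 address; the helper name `subnetCount` is made up
for this note and is not part of the change.

```go
package main

import (
	"fmt"
	"math/big"
)

// subnetCount returns how many subnets of length targetBits fit in a pool of
// length poolBits, i.e. 2^(targetBits-poolBits).
func subnetCount(poolBits, targetBits uint) *big.Int {
	return new(big.Int).Lsh(big.NewInt(1), targetBits-poolBits)
}

func main() {
	n := subnetCount(8, 64) // pool fd00::/8 split into /64 subnets
	// Assumption: one 128-bit address is stored per subnet.
	bits := new(big.Int).Mul(n, big.NewInt(128))
	fmt.Println("subnets:", n)  // subnets: 72057594037927936
	fmt.Println("bits:", bits) // bits: 9223372036854775808
}
```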

Another noteworthy issue: the previous implementation was allocating
a subnet, and then, in another layer, checking whether the
allocation conflicted with some 'reserved networks'. If so,
the allocation would be retried, and so on. To make it worse,
'reserved networks' would be recomputed on every iteration. This is
totally inefficient as there could be 'reserved networks' that fully
overlap a given address pool (or many!).

To fix this issue, a new field `Exclude` is added to `PoolRequest`,
the argument of `RequestPool`. It's up to each driver to take it into
account. Since we don't know whether this retry loop is useful for
some remote IPAM driver, it's reimplemented bug-for-bug directly in
the remote driver.

The new allocator uses a linear-search algorithm. It takes advantage
of all lists (predefined pools, allocated subnets and reserved
networks) being sorted and logically combines 'allocated' and
'reserved' through a 'double cursor' to iterate on both lists at the
same time while preserving the total order. At the same time, it
iterates over 'predefined' pools and looks for the first empty space
that would be a good fit.

Currently, the size of the allocated subnet is still dictated by
each 'predefined' pool. We should consider hardcoding that size
instead, and let users specify what subnet size they want. This
wasn't possible before as the subnets were generated upfront. This
new allocator should be able to deal with this easily.

The method used for static allocation has been updated to make sure
the ascending order of 'allocated' is preserved. It's bug-for-bug
compatible with the previous implementation.

One consequence of this new algorithm is that we don't keep track
of where the last allocation happened; we just allocate the first
free subnet we find.

Before:

- Allocate: 10.0.1.0/24, 10.0.2.0/24 ; Deallocate: 10.0.1.0/24 ;
  Allocate: 10.0.3.0/24.

Now, the 3rd allocation would yield 10.0.1.0/24 once again.

As this doesn't change the semantics of the allocator, there's no
reason to worry about it.

Finally, about 'reserved networks'. The heuristics we use are
now properly documented. It was discovered that we don't check
routes for IPv6 allocations -- this can't be changed because
there's no such thing as on-link routes for IPv6.

(Kudos to Rob Murray for coming up with the linear-search idea.)

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
akerouanton committed Apr 26, 2024
1 parent c5376e5 commit 9c6196f
Showing 29 changed files with 1,173 additions and 689 deletions.
7 changes: 4 additions & 3 deletions daemon/config/config_test.go
@@ -2,6 +2,7 @@ package config // import "github.com/docker/docker/daemon/config"

import (
"encoding/json"
"net/netip"
"os"
"path/filepath"
"reflect"
@@ -157,7 +158,7 @@ func TestDaemonConfigurationMergeDefaultAddressPools(t *testing.T) {
emptyConfigFile := makeConfigFile(t, `{}`)
configFile := makeConfigFile(t, `{"default-address-pools":[{"base": "10.123.0.0/16", "size": 24 }]}`)

-expected := []*ipamutils.NetworkToSplit{{Base: "10.123.0.0/16", Size: 24}}
+expected := []*ipamutils.NetworkToSplit{{Base: netip.MustParsePrefix("10.123.0.0/16"), Size: 24}}

t.Run("empty config file", func(t *testing.T) {
conf := Config{}
@@ -167,7 +168,7 @@ func TestDaemonConfigurationMergeDefaultAddressPools(t *testing.T) {

config, err := MergeDaemonConfigurations(&conf, flags, emptyConfigFile)
assert.NilError(t, err)
-assert.DeepEqual(t, config.DefaultAddressPools.Value(), expected)
+assert.DeepEqual(t, config.DefaultAddressPools.Value(), expected, cmpopts.EquateComparable(netip.Prefix{}))
})

t.Run("config file", func(t *testing.T) {
@@ -177,7 +178,7 @@ func TestDaemonConfigurationMergeDefaultAddressPools(t *testing.T) {

config, err := MergeDaemonConfigurations(&conf, flags, configFile)
assert.NilError(t, err)
-assert.DeepEqual(t, config.DefaultAddressPools.Value(), expected)
+assert.DeepEqual(t, config.DefaultAddressPools.Value(), expected, cmpopts.EquateComparable(netip.Prefix{}))
})

t.Run("with conflicting options", func(t *testing.T) {
2 changes: 1 addition & 1 deletion daemon/info.go
@@ -258,7 +258,7 @@ func (daemon *Daemon) fillDefaultAddressPools(ctx context.Context, v *system.Inf
defer span.End()
for _, pool := range cfg.DefaultAddressPools.Value() {
v.DefaultAddressPools = append(v.DefaultAddressPools, system.NetworkAddressPool{
-Base: pool.Base,
+Base: pool.Base.String(),
Size: pool.Size,
})
}
8 changes: 7 additions & 1 deletion libnetwork/cnmallocator/drivers_ipam.go
@@ -2,6 +2,8 @@ package cnmallocator

import (
"context"
"fmt"
"net/netip"
"strconv"
"strings"

@@ -22,8 +24,12 @@ func initIPAMDrivers(r ipamapi.Registerer, netConfig *networkallocator.Config) e
// happens with default address pool option
if netConfig != nil {
for _, p := range netConfig.DefaultAddrPool {
+base, err := netip.ParsePrefix(p)
+if err != nil {
+return fmt.Errorf("invalid prefix %q: %w", p, err)
+}
addressPool = append(addressPool, &ipamutils.NetworkToSplit{
-Base: p,
+Base: base,
Size: int(netConfig.SubnetSize),
})
str.WriteString(p + ",")
27 changes: 17 additions & 10 deletions libnetwork/drivers/bridge/bridge_linux_test.go
@@ -12,6 +12,9 @@ import (

"github.com/docker/docker/internal/testutils/netnsutils"
"github.com/docker/docker/libnetwork/driverapi"
"github.com/docker/docker/libnetwork/internal/netiputil"
"github.com/docker/docker/libnetwork/ipamapi"
"github.com/docker/docker/libnetwork/ipams/defaultipam"
"github.com/docker/docker/libnetwork/ipamutils"
"github.com/docker/docker/libnetwork/iptables"
"github.com/docker/docker/libnetwork/netlabel"
@@ -206,17 +209,21 @@ func compareBindings(a, b []types.PortBinding) bool {
return true
}

+var a, _ = defaultipam.NewAllocator(ipamutils.GetLocalScopeDefaultNetworks(), []*ipamutils.NetworkToSplit(nil))
+
func getIPv4Data(t *testing.T) []driverapi.IPAMData {
-ipd := driverapi.IPAMData{AddressSpace: "full"}
-nw, err := netutils.FindAvailableNetwork(ipamutils.GetLocalScopeDefaultNetworks())
-if err != nil {
-t.Fatal(err)
-}
-ipd.Pool = nw
-// Set network gateway to X.X.X.1
-ipd.Gateway = types.GetIPNetCopy(nw)
-ipd.Gateway.IP[len(ipd.Gateway.IP)-1] = 1
-return []driverapi.IPAMData{ipd}
+t.Helper()
+
+alloc, err := a.RequestPool(ipamapi.PoolRequest{
+AddressSpace: "LocalDefault",
+Exclude: netutils.InferReservedNetworks(false),
+})
+assert.NilError(t, err)
+
+gw, _, err := a.RequestAddress(alloc.PoolID, nil, nil)
+assert.NilError(t, err)
+
+return []driverapi.IPAMData{{AddressSpace: "LocalDefault", Pool: netiputil.ToIPNet(alloc.Pool), Gateway: gw}}
}

func getIPv6Data(t *testing.T) []driverapi.IPAMData {
29 changes: 29 additions & 0 deletions libnetwork/internal/netiputil/netiputil.go
@@ -59,3 +59,32 @@ func AddrPortFromNet(addr net.Addr) netip.AddrPort {
}
return netip.AddrPort{}
}

// LastAddr returns the last address of prefix 'p'.
func LastAddr(p netip.Prefix) netip.Addr {
return ipbits.Add(ipbits.Sub(p.Addr(), 1, 0), 1, uint(p.Addr().BitLen()-p.Bits()))
}

// Compare two prefixes and return a negative, 0, or a positive integer as
// required by [slices.SortFunc]. When two prefixes with the same address are
// provided, the shortest one will be sorted first.
func Compare(a, b netip.Prefix) int {
cmp := a.Addr().Compare(b.Addr())
if cmp != 0 {
return cmp
}
return a.Bits() - b.Bits()
}

// PrefixAfter returns the prefix of size 'sz' right after 'prev'.
func PrefixAfter(prev netip.Prefix, sz int) netip.Prefix {
s := sz
if prev.Bits() < sz {
s = prev.Bits()
}
addr := ipbits.Add(prev.Addr(), 1, uint(prev.Addr().BitLen()-s))
if addr.IsUnspecified() {
return netip.Prefix{}
}
return netip.PrefixFrom(addr, sz).Masked()
}
46 changes: 46 additions & 0 deletions libnetwork/internal/netiputil/netiputil_test.go
@@ -0,0 +1,46 @@
package netiputil

import (
"net/netip"
"testing"

"gotest.tools/v3/assert"
)

func TestLastAddr(t *testing.T) {
testcases := []struct {
p netip.Prefix
want netip.Addr
}{
{netip.MustParsePrefix("10.0.0.0/24"), netip.MustParseAddr("10.0.0.255")},
{netip.MustParsePrefix("10.0.0.0/8"), netip.MustParseAddr("10.255.255.255")},
{netip.MustParsePrefix("fd00::/64"), netip.MustParseAddr("fd00::ffff:ffff:ffff:ffff")},
{netip.MustParsePrefix("fd00::/16"), netip.MustParseAddr("fd00:ffff:ffff:ffff:ffff:ffff:ffff:ffff")},
{netip.MustParsePrefix("ffff::/16"), netip.MustParseAddr("ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff")},
}

for _, tc := range testcases {
last := LastAddr(tc.p)
assert.Check(t, last == tc.want, "LastAddr(%q) = %s; want: %s", tc.p, last, tc.want)
}
}

func TestPrefixAfter(t *testing.T) {
testcases := []struct {
prev netip.Prefix
sz int
want netip.Prefix
}{
{netip.MustParsePrefix("10.0.10.0/24"), 24, netip.MustParsePrefix("10.0.11.0/24")},
{netip.MustParsePrefix("10.0.10.0/24"), 16, netip.MustParsePrefix("10.1.0.0/16")},
{netip.MustParsePrefix("10.10.0.0/16"), 24, netip.MustParsePrefix("10.11.0.0/24")},
{netip.MustParsePrefix("2001:db8:feed:cafe:b000:dead::/96"), 16, netip.MustParsePrefix("2002::/16")},
{netip.MustParsePrefix("ffff::/16"), 16, netip.Prefix{}},
{netip.MustParsePrefix("2001:db8:1::/48"), 64, netip.MustParsePrefix("2001:db8:2::/64")},
}

for _, tc := range testcases {
next := PrefixAfter(tc.prev, tc.sz)
assert.Check(t, next == tc.want, "PrefixAfter(%q, %d) = %s; want: %s", tc.prev, tc.sz, next, tc.want)
}
}
5 changes: 5 additions & 0 deletions libnetwork/ipamapi/contract.go
@@ -35,6 +35,7 @@ var (
ErrIPOutOfRange = types.InvalidParameterErrorf("requested address is out of range")
ErrPoolOverlap = types.ForbiddenErrorf("Pool overlaps with other one on this address space")
ErrBadPool = types.InvalidParameterErrorf("address space does not contain specified address pool")
ErrNoMoreSubnets = types.InvalidParameterErrorf("all predefined address pools have been fully subnetted")
)

// Ipam represents the interface the IPAM service plugins must implement
@@ -73,6 +74,10 @@ type PoolRequest struct {
// Options is a map of opaque k/v passed to the driver. It's non-mandatory.
// Drivers are free to ignore it.
Options map[string]string
// Exclude is a list of prefixes the requester does not want to be
// dynamically allocated (i.e. when Pool isn't specified). It's up to the
// IPAM driver to take it into account, or totally ignore it.
Exclude []netip.Prefix
// V6 indicates which address family should be used to dynamically allocate
// a prefix (ie. when Pool isn't specified).
V6 bool
