libnet/ipams/default: introduce a linear allocator #47768
base: master
Conversation
force-pushed from aae4040 to 9c6196f
force-pushed from 9c6196f to 59e3d2a
LGTM!
```diff
 }
-	if p.Addr().Is4() {
-		v4 = append(v4, p)
+	if n.Base.Addr().Is4() {
```
This hasn't changed, but `Is4` is false for an IPv4-mapped IPv6 address. It might be worth checking for that and storing the unmapped prefix in the IPv4 list, or just bailing out?
(We don't deal with mapped addresses in command line options either, but it might be less obvious here; the address pool just won't do anything useful. I'm not quite sure why someone would want to write IPv4 addresses as IPv6, but there's an issue somewhere asking us to allow it.)
@corhere Please make sure this won't break Swarm in any way (e.g. when a new leader is elected, etc.).
force-pushed from 59e3d2a to 304e175
Review WIP; I haven't even finished with address_space.go. I'll be back tomorrow.
```go
var last *ipamutils.NetworkToSplit
var discarded int
for i, imax := 0, len(predefined); i < imax; i++ {
	p := predefined[i-discarded]
	if last != nil && last.Overlaps(p.Base) {
		predefined = slices.Delete(predefined, i-discarded, i-discarded+1)
		discarded++
		continue
	}
	last = p
}
```
Since the `slices` package is already being used, may as well take full advantage.
Suggested change:

```go
predefined = slices.CompactFunc(predefined, func(last, p *ipamutils.NetworkToSplit) bool {
	return last.Overlaps(p.Base)
})
```
`slices.CompactFunc` works a bit differently. It expects a strict equality: it doesn't compare to the last non-duplicate found, but to 'current-1'. If you have the following subnets:

- 10.0.0.0/8
- 10.0.0.0/16
- 10.10.0.0/16

it tries to compare s1 == s2, and then s2 == s3. That's not what we want.
```go
for i, allocated := range aSpace.allocated {
	if nw.Addr().Compare(allocated.Addr()) < 0 {
```
`aSpace.allocated` is a sorted slice, which means binary searching is possible. Turn that O(n) search into O(log n) time complexity!
```go
func (aSpace *addrSpace) allocatePool(nw netip.Prefix) error {
	n, _ := slices.BinarySearchFunc(aSpace.allocated, nw, func(allocated, nw netip.Prefix) int {
		return nw.Addr().Compare(allocated.Addr())
	})
	aSpace.allocated = slices.Insert(aSpace.allocated, n, nw)
	aSpace.subnets[nw] = newPoolData(nw)
	return nil
}
```
Also, are duplicate allocations allowed? It would be trivial to detect this situation and return an error instead of inserting the duplicate entry into the slice.
The new allocator should work fine with Swarm. The CNM network allocator replays allocations as static assignments if there is an existing allocation in the Swarm state.
moby/libnetwork/cnmallocator/networkallocator.go, lines 866 to 873 in 4554d87:

```go
// If there is non-nil IPAM state always prefer those subnet
// configs over Spec configs.
if n.IPAM != nil {
	ipamConfigs = n.IPAM.Configs
} else if n.Spec.IPAM != nil {
	ipamConfigs = make([]*api.IPAMConfig, len(n.Spec.IPAM.Configs))
	copy(ipamConfigs, n.Spec.IPAM.Configs)
}
```
force-pushed from 304e175 to 8f4adfe
The previous allocator was subnetting address pools eagerly when the daemon started, and would then just iterate over that list whenever RequestPool was called. This was leading to high memory usage whenever IPv6 pools were configured with a target subnet size too different from the pool's prefix size. For instance: pool = fd00::/8, target size = /64 -- 2^(64-8) subnets would be generated upfront. This would take approx. 9 * 10^18 bits -- way too much for any computer in 2024.

Another noteworthy issue: the previous implementation was allocating a subnet, and then in another layer was checking whether the allocation conflicted with some 'reserved networks'. If so, the allocation would be retried, etc. To make it worse, 'reserved networks' would be recomputed on every iteration. This is totally inefficient, as there could be 'reserved networks' that fully overlap a given address pool (or many!).

To fix this issue, a new field `Exclude` is added to `RequestPool`. It's up to each driver to take it into account. Since we don't know whether this retry loop is useful for some remote IPAM driver, it's reimplemented bug-for-bug directly in the remote driver.

The new allocator uses a linear-search algorithm. It takes advantage of all lists (predefined pools, allocated subnets and reserved networks) being sorted, and logically combines 'allocated' and 'reserved' through a 'double cursor' to iterate over both lists at the same time while preserving the total order. At the same time, it iterates over 'predefined' pools and looks for the first empty space that would be a good fit.

Currently, the size of the allocated subnet is still dictated by each 'predefined' pool. We should consider hardcoding that size instead, and let users specify what subnet size they want. This wasn't possible before as the subnets were generated upfront. This new allocator should be able to deal with this easily.

The method used for static allocation has been updated to make sure the ascending order of 'allocated' is preserved. It's bug-for-bug compatible with the previous implementation.

One consequence of this new algorithm is that we don't keep track of where the last allocation happened; we just allocate the first free subnet we find. Before: allocate 10.0.1.0/24 and 10.0.2.0/24, deallocate 10.0.1.0/24, and a third allocation would yield 10.0.3.0/24. Now, the third allocation would yield 10.0.1.0/24 once again. As it doesn't change the semantics of the allocator, there's no reason to worry about that.

Finally, about 'reserved networks': the heuristics we use are now properly documented. It was discovered that we don't check routes for IPv6 allocations -- this can't be changed because there's no such thing as on-link routes for IPv6.

(Kudos to Rob Murray for coming up with the linear-search idea.)

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
This normalization process does two things:

- Unmap IPv4-mapped IPv6 addrs. This ensures such address pools are part of the IPv4 address space.
- Mask the host ID. This was done by newAddrSpace, but `splitByIPFamily` is already validating / normalizing the address pools.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Nothing was validating whether address pools' `base` prefixes were larger than the target subnet `size` they're associated with. Since such invalid address pools would yield no subnets, the error could go unnoticed.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
force-pushed from 5cfd940 to b2fb88d
- What I did
The previous allocator was subnetting address pools eagerly when the daemon started, and would then just iterate over that list whenever RequestPool was called. This was leading to high memory usage whenever IPv6 pools were configured with a target subnet size too different from the pool's prefix size.
For instance: pool = fd00::/8, target size = /64 -- 2^(64-8) subnets would be generated upfront. This would take approx. 9 * 10^18 bits -- way too much for any computer in 2024.
Another noteworthy issue: the previous implementation was allocating a subnet, and then in another layer was checking whether the allocation conflicted with some 'reserved networks'. If so, the allocation would be retried, etc. To make it worse, 'reserved networks' would be recomputed on every iteration. This is totally inefficient, as there could be 'reserved networks' that fully overlap a given address pool (or many!).
To fix this issue, a new field `Exclude` is added to `RequestPool`. It's up to each driver to take it into account. Since we don't know whether this retry loop is useful for some remote IPAM driver, it's reimplemented bug-for-bug directly in the remote driver.

The new allocator uses a linear-search algorithm. It takes advantage of all lists (predefined pools, allocated subnets and reserved networks) being sorted, and logically combines 'allocated' and 'reserved' through a 'double cursor' to iterate over both lists at the same time while preserving the total order. At the same time, it iterates over 'predefined' pools and looks for the first empty space that would be a good fit.
Currently, the size of the allocated subnet is still dictated by each 'predefined' pool. We should consider hardcoding that size instead, and let users specify what subnet size they want. This wasn't possible before as the subnets were generated upfront. This new allocator should be able to deal with this easily.
The method used for static allocation has been updated to make sure the ascending order of 'allocated' is preserved. It's bug-for-bug compatible with the previous implementation.
One consequence of this new algorithm is that we don't keep track of where the last allocation happened; we just allocate the first free subnet we find.

Before:

- Allocate 10.0.1.0/24 and 10.0.2.0/24; deallocate 10.0.1.0/24; a third allocation would yield 10.0.3.0/24.

Now, the 3rd allocation would yield 10.0.1.0/24 once again.
As it doesn't change the semantics of the allocator, there's no reason to worry about that.
Finally, about 'reserved networks': the heuristics we use are now properly documented. It was discovered that we don't check routes for IPv6 allocations -- this can't be changed because there's no such thing as on-link routes for IPv6.
(Kudos to Rob Murray for coming up with the linear-search idea.)
- How to verify it
CI -- a bunch of tests have been added, some have been rewritten.
Or manually by creating, deleting and re-creating networks.
- Description for the changelog
- Introduce a new subnet allocator that can deal with IPv6 address pools of any size
- A picture of a cute animal (not mandatory but encouraged)