Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libnet: Revert "Only check if route overlaps routes with scope: LINK" #46630

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

akerouanton
Copy link
Member

@akerouanton akerouanton commented Oct 12, 2023

- What I did

Route scope is used by the kernel to choose what source IP address should be used when establishing an outbound connection. As such, filtering routes based on their scope doesn't make sense. It happened to work for the original contributor only by chance.

I didn't take much time to look into all the issues related to CheckRouteOverlaps but it'd be probably better to add an option to skip this check if the user wants it.

- How to verify it

See the reproducer in #46615

- A picture of a cute animal (not mandatory but encouraged)

This reverts commit ee9e526.

Route scope is used by the kernel to choose what source IP address
should be used when establishing an outbound connection. As such,
filtering routes based on their scope doesn't make sense.

Resolve moby#46615.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but could use eyes from some other people as well

Copy link
Contributor

@corhere corhere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Route scope is used by the kernel to choose what source IP address should be used when establishing an outbound connection. As such, filtering routes based on their scope doesn't make sense.

Checking whether a network prefix is available for use by inspecting the routing table doesn't make much sense in general. The vast majority of the IPv4 address space is going to be routable (to some interface) on most systems, simply by virtue of having a default route. As pointed out in #33925 (comment) a routing table with routes for 0.0.0.0/1 and 128.0.0.0/1 is functionally indistinguishable from a routing table with a default route, yet dockerd treats them differently. Whatever prefix is selected for a container network will practically always overlap with some route. I posit that it is not possible in general to determine whether an arbitrary network prefix is available for use by container networks by examining the host's state as the ultimate determination of whether a prefix is available is whether the user needs to communicate with any remote hosts assigned addresses within the prefix. Therefore the only way to be sure is to have the user to tell us which prefixes are available, by configuring the address pools. FindAvailableNetwork must therefore use heuristics to determine whether a network prefix is probably available, given incomplete information, and can never be free of false positives or false negatives. CheckRouteOverlaps is one of those heuristics.

#42598 tweaked the heuristics to be biased more towards prefixes being available when there is uncertainty. I think that is the right direction to go in. The workaround for a prefix being assigned to a container network which shadows addresses that need to be routed remotely is to remove the offending prefixes from the configured pools and to not assign the prefixes to networks manually. The workaround for the daemon refusing to assign a prefix to a container network because it falsely thinks the prefix is unavailable is...nothing. Someone is going to be angry whenever the daemon gets this wrong, whichever direction we bias the heuristics. In my opinion, having the daemon biased towards doing what the user says with a best-effort attempt to protect the user from misconfigurations (less magic) is more user-friendly than trying really hard to do what the user means and providing escape hatches for when things go wrong (more magic) because magical behaviour is difficult to explain or understand. "The daemon is acting this way because you configured it to act this way, fix it by configuring it to act the way you want" vs. "the daemon is blocking you from doing what you want because our heuristics didn't account for your situation, fix it with this magical escape hatch which may or may not have unintended side effects because magic."

I'm not saying the current behaviour is ideal; I just don't think reverting to the previous heuristic is the right move either.

@akerouanton
Copy link
Member Author

akerouanton commented Oct 12, 2023

I tend to agree with you, but I still disagree on a few points and on not reverting to the original heuristics.

The vast majority of the IPv4 address space is going to be routable (to some interface) on most systems, simply by virtue of having a default route. As pointed out in #33925 (comment) a routing table with routes for 0.0.0.0/1 and 128.0.0.0/1 is functionally indistinguishable from a routing table with a default route

They are functionally indistinguishable but they don't convey the same meaning as the default route is the fallback route used when the endpoint isn't aware of any specific routing rules for a given address. I highly doubt anyone or any networking software set routes for 0.0.0.0/1 and 128.0.0.0/1.

Therefore the only way to be sure is to have the user to tell us which prefixes are available, by configuring the address pools.

The original reporter of #46615 doesn't mention they configured the address pools. You will tell me they should, but I believe this is where these heuristics are useful. Otherwise, I agree with you: once a user configure the address pool, it's their responsibility to make sure it fits their needs.

FindAvailableNetwork must therefore use heuristics to determine whether a network prefix is probably available, given incomplete information, and can never be free of false positives or false negatives.
#42598 tweaked the heuristics to be biased more towards prefixes being available when there is uncertainty.

The heuristics used by CheckRouteOverlaps prior to ee9e526 were trying to avoid false negatives, and this is better IMO as it avoids shadowing an existing non-default route (or a subset of it). Moreover, the route scope doesn't indicate anything about the route itself, it only tells the kernel what source address it should pick. Said another way, #42598 is conflating the route scope with something it isn't. They could have picked any other property randomly to achieve the same result.

Someone is going to be angry whenever the daemon gets this wrong.

With ee9e526 or not, the heuristics are still wrong for routes with scope link. And the only way to get these heuristics right is actually to get rid of them. But I also think it's useful to heuristically determine whether a prefix is available on a best-effort basis (ie. by avoiding false positive).

Hence my proposal to add an option to enable/disable this check when it makes sense. Thinking more about this, and based on your comment, I think it should be automatically defined when the address pool is user-defined, and still make it available to users to ease integrators' job (eg. I'm pretty sure DD suffers from this check but I need to double-check tomorrow).

EDIT: Also I was considering backporting this fix to 24 since the link-scoped check doesn't make sense. But obviously the option I'm proposing would be made available only in a new major release.

@thaJeztah
Copy link
Member

IIRC, the issue this was trying to resolve was "could not find ANY available, non-overlapping port-range", which resulted in the daemon failing to start (if a VPN was used), causing lots of grieve. IF we are to decide making changes (this PR, or variant thereof), should we treat that as a non-fatal error (we tried, but failed, so let's assume we never tried?)

The attempt to find a non-overlapping range was to make things easier for the user (less chance on overlapping range), but if there's a risk of invalid / false-positive detection, it means we're actually making things work, and denying use of ranges that actually can be used (and now users have to work around invalid detection on the daemon side).

@corhere
Copy link
Contributor

corhere commented Oct 12, 2023

The scope of a route says a lot about the network topology in relation to the host. The prefixes of the link-scoped routes forms the set of prefixes which are directly reachable from the host. Filtering out prefixes which overlap the host's local subnets is an effective heuristic to avoid shadowing the LAN(s) the host is connected to. In contrast, we cannot safely assume any semantics for prefixes in wider-scoped routes because they could have come from anywhere and the user may or may not care about them, so filtering out prefixes which overlap wider-scoped route prefixes will tend to filter out more prefixes than necessary. This is exactly the rationale used when the solution was proposed.

I'm starting to warm up to the idea of applying magic filtering to the IPAM pools iff the IPAM pools are not explicitly configured. That would align the logic for explicitly-configured pools with an explicitly-configured network prefix: trust the user and skip overlap checks. We could then make the heuristics as aggressive as we want since only users who don't bother telling the daemon what they want will be impacted. Error messaging could be improved to instruct the users to configure the daemon's pools explicitly if all the prefixes in the default pool are filtered out by our heuristics.

@dtronche
Copy link

I'm the author of #46615 and as @akerouanton commented, we do not specify any address pools in the daemon.json. Our customers are not docker experts and the way it was behaving before was ideal for our needs. Now they cannot access to their application because they have configured routes but not the daemon.json since we have not documented it as, so far, there was no need for it.
If we could have an option to get back to the original behavior, or only apply the new one when the daemon.json is configured, that would be satisfying for us.

@akerouanton
Copy link
Member Author

akerouanton commented Oct 13, 2023

Admittedly my claim that link scope is only used to select source IP isn't totally true. Link scope carry some information about the route but that's mostly informational and I still believe it only provides very partial information about the overall LAN topology.

For instance, here's the routing table currently set inside DD's VM:

% docker run --privileged --net=host -it alpine ip route
127.0.0.0/8 dev lo scope host 
172.17.0.0/16 dev docker0 scope link  src 172.17.0.1 
172.18.0.0/16 dev br-c3269442db8e scope link  src 172.18.0.1 
192.168.64.0/24 dev eth1 scope link  src 192.168.64.134 
192.168.65.0/24 dev eth0 
192.168.65.7 dev services1 

Currently, the Engine would happily select 192.168.65.0/24 if users specify the default-address-pools parameter to something that overlaps (eg. they would do that if the default value conflicts with the host's networking setup).

At this point I think we shouldn't merge this PR alone. We need a proper way to let users bypass this check if they want to, and setting default-address-pools should be enough. And we need a way for integrators to forcefully make it sure this check is enabled. I'll draft a PR for that.

As for backporting this revert to 24.x, I now think it's a wrong idea as we'd introduce a regression with no workaround in a minor version.

@corhere
Copy link
Contributor

corhere commented Oct 13, 2023

Link scope carry some information about the route but [...] I still believe it only provides very partial information about the overall LAN topology.

I agree that it is only partial information. Even so, it's some of the highest-quality data about the network topology available to the daemon without user input. Perhaps in that follow-up PR we could also expose configuration knobs for systems integrators to tune the routing table scope heuristic?


For instance, here's the routing table currently set inside DD's VM:

Curious, as my DD VM's route table does not have a global-scope route for any RFC 1918 prefix. I wonder why.

Docker Desktop for Mac, Version 4.23.0 (120376)

$ docker run --rm -it --net=host alpine
/ # apk -U add iproute2
fetch https://dl-cdn.alpinelinux.org/alpine/v3.16/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.16/community/x86_64/APKINDEX.tar.gz
(1/12) Installing libcap (2.64-r1)
(2/12) Installing libbz2 (1.0.8-r1)
(3/12) Installing fts (1.2.7-r1)
(4/12) Installing xz-libs (5.2.5-r1)
(5/12) Installing libelf (0.186-r0)
(6/12) Installing libmnl (1.0.5-r0)
(7/12) Installing iproute2-minimal (5.17.0-r0)
(8/12) Installing libnftnl (1.2.3-r0)
(9/12) Installing iptables (1.8.8-r1)
(10/12) Installing iproute2-tc (5.17.0-r0)
(11/12) Installing iproute2-ss (5.17.0-r0)
(12/12) Installing iproute2 (5.17.0-r0)
Executing iproute2-5.17.0-r0.post-install
Executing busybox-1.35.0-r15.trigger
OK: 11 MiB in 26 packages
/ # ip -d route
unicast default via 192.168.65.5 dev eth0 proto boot scope global
unicast 172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
unicast 172.18.0.0/16 dev br-dbf277990947 proto kernel scope link src 172.18.0.1 linkdown
unicast 172.19.0.0/16 dev br-61d6db49b86b proto kernel scope link src 172.19.0.1 linkdown
unicast 172.20.0.0/16 dev br-18b4d4f6f194 proto kernel scope link src 172.20.0.1 linkdown
unicast 172.21.0.0/16 dev br-647a1326dd2b proto kernel scope link src 172.21.0.1 linkdown
unicast 172.22.0.0/16 dev br-b5fbb3e87037 proto kernel scope link src 172.22.0.1 linkdown
unicast 172.23.0.0/16 dev br-513961aae86f proto kernel scope link src 172.23.0.1 linkdown
unicast 172.25.0.0/16 dev br-2b31c1188926 proto kernel scope link src 172.25.0.1 linkdown
unicast 192.168.65.5 dev eth0 proto kernel scope link src 192.168.65.4

@bf
Copy link

bf commented Feb 20, 2024

I'm also affected by this, is there a specific reason why this isn't moving forward?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression: docker network conflicts with host routes when creating a new network
5 participants