Windows Containers route tables expected behaviour #386

Open
gillg opened this issue Jun 26, 2023 · 21 comments
Assignees
Labels
Networking Connectivity and network infrastructure question Further information is requested

Comments

@gillg

gillg commented Jun 26, 2023

Hello,

I have been facing this issue for a long time. We have workarounds and tricks in place, but I would like a more detailed answer about the expected behaviour and a possibly better approach, and to see what is doable as a cleaner solution.

On Linux, by default, when you create a container, the host's main route table is reusable. So for static routes (such as the one for the cloud provider's link-local metadata API, IMDS, at 169.254.169.254), your container automatically has the route at runtime.

Example on a vanilla Amazon Linux 2 AMI:

$ cat /etc/sysconfig/network-scripts/route-eth0
# Static route for metadata service
169.254.169.254 via 0.0.0.0 dev eth0

From inside a created container:

# ip route
default via 172.17.0.1 dev eth0
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2

# nc -vz 169.254.169.254 80
instance-data.us-west-2.compute.internal [169.254.169.254] 80 (http) open

On Windows, currently, static routes defined at the host level are inherited inside the container, but they do not seem to be applied as active routes, so they are useless and confusing. By the way, why is a route in the persistent store not always active?
Example on a vanilla AWS Windows Server AMI:

C:\Windows\system32> route print
===========================================================================
Interface List
  8...02 44 67 aa 56 b3 ......Amazon Elastic Network Adapter
  1...........................Software Loopback Interface 1
 11...00 15 5d 5f be 83 ......Hyper-V Virtual Ethernet Adapter
===========================================================================

IPv4 Route Table
===========================================================================
Active Routes:
Network Destination        Netmask          Gateway       Interface  Metric
          0.0.0.0          0.0.0.0     10.102.28.65     10.102.28.86     15
     10.102.28.64  255.255.255.192         On-link      10.102.28.86    271
     10.102.28.86  255.255.255.255         On-link      10.102.28.86    271
    10.102.28.127  255.255.255.255         On-link      10.102.28.86    271
        127.0.0.0        255.0.0.0         On-link         127.0.0.1    331
        127.0.0.1  255.255.255.255         On-link         127.0.0.1    331
  127.255.255.255  255.255.255.255         On-link         127.0.0.1    331
  169.254.169.123  255.255.255.255     10.102.28.65     10.102.28.86     30
  169.254.169.249  255.255.255.255     10.102.28.65     10.102.28.86     30
  169.254.169.250  255.255.255.255     10.102.28.65     10.102.28.86     30
  169.254.169.251  255.255.255.255     10.102.28.65     10.102.28.86     30
  169.254.169.253  255.255.255.255     10.102.28.65     10.102.28.86     30
  169.254.169.254  255.255.255.255     10.102.28.65     10.102.28.86     30
      172.30.42.0    255.255.255.0         On-link       172.30.42.1   5256
      172.30.42.1  255.255.255.255         On-link       172.30.42.1   5256
    172.30.42.255  255.255.255.255         On-link       172.30.42.1   5256
        224.0.0.0        240.0.0.0         On-link         127.0.0.1    331
        224.0.0.0        240.0.0.0         On-link      10.102.28.86    271
        224.0.0.0        240.0.0.0         On-link       172.30.42.1   5256
  255.255.255.255  255.255.255.255         On-link         127.0.0.1    331
  255.255.255.255  255.255.255.255         On-link      10.102.28.86    271
  255.255.255.255  255.255.255.255         On-link       172.30.42.1   5256
===========================================================================
Persistent Routes:
  Network Address          Netmask  Gateway Address  Metric
  169.254.169.254  255.255.255.255     10.102.28.65      15
  169.254.169.250  255.255.255.255     10.102.28.65      15
  169.254.169.251  255.255.255.255     10.102.28.65      15
  169.254.169.249  255.255.255.255     10.102.28.65      15
  169.254.169.123  255.255.255.255     10.102.28.65      15
  169.254.169.253  255.255.255.255     10.102.28.65      15
===========================================================================

IPv6 Route Table
===========================================================================
Active Routes:
 If Metric Network Destination      Gateway
  1    331 ::1/128                  On-link
  8    271 fe80::/64                On-link
 11   5256 fe80::/64                On-link
  8    271 fe80::7771:238a:685c:942d/128
                                    On-link
 11   5256 fe80::e0d0:4961:fc58:7f13/128
                                    On-link
  1    331 ff00::/8                 On-link
  8    271 ff00::/8                 On-link
 11   5256 ff00::/8                 On-link
===========================================================================
Persistent Routes:
  None

From inside a container:

> route print
===========================================================================
Interface List
 18...........................Software Loopback Interface 2
 19...00 15 5d 5f ba 23 ......Hyper-V Virtual Ethernet Adapter #2
===========================================================================

IPv4 Route Table
Active Routes:
Network Destination        Netmask          Gateway       Interface  Metric
          0.0.0.0          0.0.0.0      172.30.42.1    172.30.42.214   5256
        127.0.0.0        255.0.0.0         On-link         127.0.0.1    331
        127.0.0.1  255.255.255.255         On-link         127.0.0.1    331
  127.255.255.255  255.255.255.255         On-link         127.0.0.1    331
      172.30.42.0    255.255.255.0         On-link     172.30.42.214   5256
    172.30.42.214  255.255.255.255         On-link     172.30.42.214   5256
    172.30.42.255  255.255.255.255         On-link     172.30.42.214   5256
        224.0.0.0        240.0.0.0         On-link         127.0.0.1    331
        224.0.0.0        240.0.0.0         On-link     172.30.42.214   5256
  255.255.255.255  255.255.255.255         On-link         127.0.0.1    331
  255.255.255.255  255.255.255.255         On-link     172.30.42.214   5256
===========================================================================
Persistent Routes:
  Network Address          Netmask  Gateway Address  Metric
  169.254.169.254  255.255.255.255     10.102.28.65      15
  169.254.169.250  255.255.255.255     10.102.28.65      15
  169.254.169.251  255.255.255.255     10.102.28.65      15
  169.254.169.249  255.255.255.255     10.102.28.65      15
  169.254.169.123  255.255.255.255     10.102.28.65      15
  169.254.169.253  255.255.255.255     10.102.28.65      15
          0.0.0.0          0.0.0.0      172.30.42.1  Default
===========================================================================

IPv6 Route Table
===========================================================================
Active Routes:
 If Metric Network Destination      Gateway
 18    331 ::1/128                  On-link
 19   5256 fe80::/64                On-link
 19   5256 fe80::26bd:7b55:f3d2:51fb/128
                                    On-link
 18    331 ff00::/8                 On-link
 19   5256 ff00::/8                 On-link
===========================================================================
Persistent Routes:
  None

And if I try to reach 169.254.169.254 from inside the container, it fails:

> Test-NetConnection -ComputerName 169.254.169.254 -Port 80
WARNING: TCP connect to (169.254.169.254 : 80) failed
WARNING: Ping to 169.254.169.254 failed with status: DestinationHostUnreachable


ComputerName           : 169.254.169.254
RemoteAddress          : 169.254.169.254
RemotePort             : 80
InterfaceAlias         :
SourceAddress          :
PingSucceeded          : False
PingReplyDetails (RTT) : 0 ms
TcpTestSucceeded       : False

The obvious solution is to create the route manually, for example with the "route add" command, but what I don't understand is why it doesn't work when the persistent store already contains the routes.
Moreover, if the route is unknown, we should fall back to the default route (0.0.0.0 0.0.0.0 172.30.42.1 172.30.42.214 5256) and the host should route the request with NAT. So I don't really understand the current behaviour.
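
The fallback expectation above can be sketched as a plain longest-prefix-match lookup over the container's active IPv4 routes. This is only an illustration of the reasoning, assuming textbook route selection; the `lookup` helper and the trimmed route list are made up for this example, and the sketch does not model any special handling of link-local addresses by the Windows stack.

```python
import ipaddress

# Active IPv4 routes taken from the container's `route print` output above,
# as (destination/prefix, gateway) pairs. Multicast and broadcast rows are omitted.
ACTIVE_ROUTES = [
    ("0.0.0.0/0", "172.30.42.1"),
    ("127.0.0.0/8", "On-link"),
    ("172.30.42.0/24", "On-link"),
    ("172.30.42.214/32", "On-link"),
]


def lookup(destination, routes=ACTIVE_ROUTES):
    """Return (network, gateway) of the most specific matching route, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), gateway)
        for prefix, gateway in routes
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    # Longest prefix wins.
    return max(matches, key=lambda match: match[0].prefixlen)


print(lookup("169.254.169.254"))  # only the default route matches
print(lookup("172.30.42.7"))      # the on-link /24 beats the default route
```

Under this naive model the IMDS address would simply follow the default route to 172.30.42.1, which is the behaviour the question expects and the container does not show.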

Tweaking the route table at container startup does not seem like a clean or reliable solution, given that the same image could be hosted on different cloud providers and need different routes in different contexts. This should be managed at the container runtime level.

Any insights or opinions are welcome!

@gillg gillg added the question Further information is requested label Jun 26, 2023
@microsoft-github-policy-service microsoft-github-policy-service bot added the triage New and needs attention label Jun 26, 2023
@gillg gillg changed the title Windows Containers route tables expected bahaviour Windows Containers route tables expected behaviour Jun 26, 2023
@ntrappe-msft ntrappe-msft added the Networking Connectivity and network infrastructure label Jun 27, 2023
@fady-azmy-msft fady-azmy-msft removed the triage New and needs attention label Jun 28, 2023
@ol-etienne-saimond

Hello @gillg, I have the same issue.
I am forced to manually create this route in order to make it work.

@MikeZappa87

Are you able to share your set up? Are you using containerd as a container runtime or are you using docker?

"On linux, by default, when you create a container, the host main route table is reusable. So in case of static routes (like for cloud provider local link metadata API, IMDS at the ip 169.254.169.254) your container automatically has this route at runtime."

This is not default Linux behavior. However, depending on the answer to your container runtime, this is the behavior of the CNI plugin. If you run the following commands with iproute2 you can see the default behavior.

ip netns add cni-1234
ip netns exec cni-1234 ip route

You won't see any routes until you explicitly add them.

This appears to be a missing route in the root network namespace? What route are you adding? The route in the Linux container you shared, has a default so the root network namespace would need to be shared to see the behavior of that specific prefix.

@gillg
Author

gillg commented Jul 12, 2023

@MikeZappa87 my setup is as basic as possible.
For the Linux example, Docker (not containerd) is installed as a system package with yum on Amazon Linux 2.

ip netns list returns nothing, and if I add a custom namespace I agree, its route table is completely empty.
I also confirm that the container's route table doesn't contain the host's main route table, but as I said in my previous message the packets are still routed. That means the "host" IP on Linux acts as a NAT router, whereas on Windows it seems it does not.

As a reminder, on Linux from inside a container:

# ip route
default via 172.17.0.1 dev eth0
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2

# nc -vz 169.254.169.254 80
instance-data.us-west-2.compute.internal [169.254.169.254] 80 (http) open

On Windows I also use Docker (not containerd), so I expect consistent behaviour. It is installed from an official zip package extracted into Program Files, with the service installed with --register-service.
My current version is 20.10.9. I agree it's pretty old, but for another reason I was not able to automate the update until a few weeks ago. I could test with another version, but let's work through the theory first.
Inside a vanilla Windows container (servercore or nanoserver), the route table looks like this:

> route print
===========================================================================
Interface List
 18...........................Software Loopback Interface 2
 19...00 15 5d 5f ba 23 ......Hyper-V Virtual Ethernet Adapter #2
===========================================================================

IPv4 Route Table
Active Routes:
Network Destination        Netmask          Gateway       Interface  Metric
          0.0.0.0          0.0.0.0      172.30.42.1    172.30.42.214   5256
        127.0.0.0        255.0.0.0         On-link         127.0.0.1    331
        127.0.0.1  255.255.255.255         On-link         127.0.0.1    331
  127.255.255.255  255.255.255.255         On-link         127.0.0.1    331
      172.30.42.0    255.255.255.0         On-link     172.30.42.214   5256
    172.30.42.214  255.255.255.255         On-link     172.30.42.214   5256
    172.30.42.255  255.255.255.255         On-link     172.30.42.214   5256
        224.0.0.0        240.0.0.0         On-link         127.0.0.1    331
        224.0.0.0        240.0.0.0         On-link     172.30.42.214   5256
  255.255.255.255  255.255.255.255         On-link         127.0.0.1    331
  255.255.255.255  255.255.255.255         On-link     172.30.42.214   5256
===========================================================================
Persistent Routes:
  Network Address          Netmask  Gateway Address  Metric
  169.254.169.254  255.255.255.255     10.102.28.65      15
  169.254.169.250  255.255.255.255     10.102.28.65      15
  169.254.169.251  255.255.255.255     10.102.28.65      15
  169.254.169.249  255.255.255.255     10.102.28.65      15
  169.254.169.123  255.255.255.255     10.102.28.65      15
  169.254.169.253  255.255.255.255     10.102.28.65      15
          0.0.0.0          0.0.0.0      172.30.42.1  Default
===========================================================================

IPv6 Route Table
===========================================================================
Active Routes:
 If Metric Network Destination      Gateway
 18    331 ::1/128                  On-link
 19   5256 fe80::/64                On-link
 19   5256 fe80::26bd:7b55:f3d2:51fb/128
                                    On-link
 18    331 ff00::/8                 On-link
 19   5256 ff00::/8                 On-link
===========================================================================
Persistent Routes:
  None

So the main questions are:

Let me know if you need more detailed information, or anything else.

EDIT: Exactly the same behaviour with version 24.0.4 on Windows. The route table inside the container is the same, and Test-NetConnection -ComputerName 169.254.169.254 -Port 80 still fails.
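
The reachability checks used in this thread (`nc -vz` on Linux, `Test-NetConnection` on Windows) can be approximated portably. The following is a minimal sketch, assuming Python is available in the image; `tcp_reachable` is a hypothetical helper name, and it only tests whether a TCP connection succeeds, not whether the metadata service answers correctly.

```python
import socket


def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


if __name__ == "__main__":
    # The IMDS address and port discussed in this issue.
    print(tcp_reachable("169.254.169.254", 80))
```

Running the same check on the host and inside the container makes the difference between the two network views easy to compare across Docker versions.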

@gillg
Author

gillg commented Jul 12, 2023

Last note: the route I have to automate inside the container, because it's not working as expected, is this one:

# Get the host container runtime IP (172.x.x.1): next hop of the lowest-metric default route
$gateway = (Get-NetRoute | Where-Object { $_.DestinationPrefix -eq '0.0.0.0/0' } | Sort-Object RouteMetric | Select-Object -First 1).NextHop
# Fetch the interface index of the container's internal adapter
$ifIndex = (Get-NetAdapter -InterfaceDescription 'Hyper-V Virtual Ethernet*' | Sort-Object ifIndex | Select-Object -First 1).ifIndex
# Send IMDS traffic through the host gateway
New-NetRoute -DestinationPrefix 169.254.169.254/32 -InterfaceIndex $ifIndex -NextHop $gateway
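
For images where the NetTCPIP cmdlets are not available, the same logic can be sketched by parsing `route print` output. This is an illustration only: `default_gateway` and `route_add_command` are hypothetical helpers, the parser assumes the five-column "Active Routes" layout shown earlier in this issue, and actually running the resulting command is left to the caller.

```python
import re

# An active-route row has five columns: destination, netmask, gateway,
# interface address and metric. Persistent rows have only four columns,
# so they never match this pattern.
IPV4 = r"\d+\.\d+\.\d+\.\d+"
ROW = re.compile(rf"^\s*({IPV4})\s+({IPV4})\s+(\S+)\s+({IPV4})\s+(\d+)\s*$")


def default_gateway(route_print_output):
    """Return the gateway of the lowest-metric active 0.0.0.0/0 route, or None."""
    best = None
    for line in route_print_output.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        dest, mask, gateway, _interface, metric = match.groups()
        if dest == "0.0.0.0" and mask == "0.0.0.0":
            if best is None or int(metric) < best[0]:
                best = (int(metric), gateway)
    return best[1] if best else None


def route_add_command(target, gateway):
    """Build the argument list for a host route to `target` via `gateway`."""
    return ["route", "ADD", target, "MASK", "255.255.255.255", gateway]
```

On Windows the returned list could be passed to `subprocess.run` after capturing the output of `route print`; whether that is preferable to the PowerShell version depends on what the image ships.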

@MikeZappa87

MikeZappa87 commented Jul 12, 2023

Is the issue only with 169.254.169.254?

Does this happen with the others below?:
169.254.169.123
169.254.169.249
169.254.169.250
169.254.169.251
169.254.169.253

Do you have layer 2 connectivity for 10.102.28.65?

Are you running that powershell script inside the container?

@gillg
Author

gillg commented Jul 12, 2023

The problem affects all the persistent routes that are not active, but the others are not relevant to me.

And yes, I have full connectivity to 10.102.28.64/26, because I reach a database server in this range. It's hard to say exactly for the implicit AWS router at 10.102.28.65, because it's completely locked down and a black box, but I should; otherwise I would not reach my database.

Or... for an unknown reason (Windows firewall default rules?) I can reach it from the host but not from the container (though the container reaches the DB). Could that explain why the persistent routes are not added?

@microsoft-github-policy-service
Contributor

This issue has been open for 30 days with no updates.
@MikeZappa87, please provide an update or close this issue.

@Eishi2012

I am facing the same issue.

@sam-sla

sam-sla commented Sep 12, 2023

We also face this problem on GCE, and currently use @gillg's solution to force the route inside our Docker containers.
Strangely, we didn't notice this issue until recently, so it is possibly related to a recent change in the Windows images.

Host running Windows Server 2019 Datacenter build 1809, OS build 17763.4737.
Docker version 20.10.24

@TBBle

TBBle commented Oct 10, 2023

Just came across this issue from #420, which turns out to be GCE specific from August 2023, and hence is probably what @sam-sla was seeing and maybe @Eishi2012, but that is not the original problem.

It's odd to have routes to 169.254.0.0/16 addresses; those are "link-local" and should not be routed. You can see this in the Linux example: it has an explicit route via 0.0.0.0, which AFAIK actually means "link-local", and is just explicitly stating which link to use by default for such messages. You can see that there's no such route inside the container on Linux, i.e. "your container automatically has this route at runtime" is not actually what's happening here.

My guess for the original question is that although route print sees the persistent routes inside the container from the registry, they were not applied because the container network is not set up the same way as the host is, it's configured externally as part of the container creation process. The routes inside the container are just the default routes for the IP address it has been assigned (the on-link addresses) and its default gateway for everything else.

I actually don't know how those routes got into the container's registry. Did you already try to persist the routes inside the container? I don't expect it to be copying random stuff from the host registry into the container registry, but maybe the network management system does that here as a side effect, even though it doesn't use them; docker/for-win#12297 suggests they used to be both copied and applied, which is buggy because containers don't live on the same network as the host, so maybe they fixed that since. By default, you don't need routes for 169.254.0.0/16 on Windows; it appears to handle that network internally without the route. However, if you have more than one interface, such a route would tell Windows which link to use if you don't specify the interface when sending packets, e.g. with ping -S. The container here only has one non-loopback interface, so that's not the issue, but this is why the host needs those explicit routes, on both Linux and Windows.

Technically, I think the Linux example is doing the wrong thing, as it forwarded the 169.254.169.254 traffic to its gateway (172.17.0.1), which then forwarded it to the host's eth0 per its local config. It's doing what it's been told, because there isn't a link-local route set up inside the container, but per the RFC, the router at 172.17.0.1 should have dropped those packets, not forwarded them. Ignoring those rules happens to make AWS's link-local services (IMDS, DNS, etc.) work from inside containers somewhat by accident.

Windows doesn't ignore those rules, unless you explicitly tell it to by adding a route, so you can't reach 169.254.0.0/16 if it's not actually on the local link, and the local link here is a Hyper-V virtual network adapter. So 169.254.169.254 from inside the container should be trying to connect to a host on that virtual network, and of course the EC2 metadata service etc. are not present on that virtual network.
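
The behaviour described above can be captured in a toy model: an explicit, more specific route wins, but a link-local destination never falls through to the default route. This is a sketch of the explanation in this comment, not of the real Windows stack; the `next_hop` helper and its return values are invented for illustration.

```python
import ipaddress

LINK_LOCAL = ipaddress.ip_network("169.254.0.0/16")


def next_hop(destination, routes):
    """Toy route selection.

    `routes` is a list of (prefix, gateway) pairs. Longest prefix wins, except
    that a link-local destination matched only by the default route (or by
    nothing) stays on the local link instead of being sent to a gateway.
    """
    addr = ipaddress.ip_address(destination)
    candidates = [
        (ipaddress.ip_network(prefix), gateway)
        for prefix, gateway in routes
        if addr in ipaddress.ip_network(prefix)
    ]
    best = max(candidates, key=lambda c: c[0].prefixlen, default=None)
    if addr in LINK_LOCAL and (best is None or best[0].prefixlen == 0):
        return "on-link"
    return best[1] if best else None


container = [("0.0.0.0/0", "172.30.42.1"), ("172.30.42.0/24", "On-link")]
print(next_hop("169.254.169.254", container))
print(next_hop("169.254.169.254", container + [("169.254.169.254/32", "172.30.42.1")]))
```

In this model the first lookup stays on the Hyper-V virtual link, where no metadata service exists, while the second one reaches the gateway only because of the explicit /32 route, matching the workaround used in this thread.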

To some extent this is a legacy of AWS using link-local addresses in a world where hosts may have internal networks (as Google recently bounced off, seen in #420), and in IPv6 it's resolved by use of Unique Local Addresses (which are routable within a site, just not out into the world) which would do the right thing in this case.
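
The difference between the two address families can be checked with Python's standard `ipaddress` module. A small sketch follows; the `classify` helper is hypothetical, and fd00:ec2::254 is the AWS IPv6 metadata address mentioned later in this thread.

```python
import ipaddress

# Unique Local Addresses (RFC 4193) live in fc00::/7.
ULA = ipaddress.ip_network("fc00::/7")


def classify(address):
    """Label an address as link-local, unique-local, or other."""
    addr = ipaddress.ip_address(address)
    if addr.is_link_local:
        return "link-local"
    if addr.version == 6 and addr in ULA:
        return "unique-local"
    return "other"


for candidate in ("169.254.169.254", "fe80::1", "fd00:ec2::254", "10.102.28.65"):
    print(candidate, "->", classify(candidate))
```

Only the link-local results are addresses a router is expected not to forward; a unique-local address may be routed within a site.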

Assuming IPv6 isn't an option, you may want to consider hooking up your container to a custom transparent network. I haven't tested this myself, but I believe that should produce a network that can see the host's link-local peers, since they allow you to DHCP from the outside world, and that's approximately the same thing.

AWS's own suggestion is a PowerShell script that adds an explicit route for 169.254.169.254.


So I was thinking about this some more, and HCN's network-create API can include a list of routes for the network. So it's possible, but untested, that a Docker custom network (or equivalent in your container-hosting environment, e.g., a Kubernetes Pod's network) could be defined with an explicit route for 169.254.0.0/16 to the HCN's gateway so that individual containers on that network don't need to have the route added directly, and don't need to use transparent mode (which would conflict with Kubernetes Pod expected behaviour, for example).
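
As a rough idea of what such a network definition might look like, here is a sketch that builds the JSON document. It is untested against HCN: the field names (`Ipams`, `Subnets`, `IpAddressPrefix`, `Routes`, `NextHop`, `DestinationPrefix`, `SchemaVersion`) are assumed from the Go types in hcsshim's `hcn` package and should be verified against the HCN schema before use, and `nat_network_with_route` is a made-up helper.

```python
import json


def nat_network_with_route(name, subnet, gateway, extra_prefix):
    """Build a candidate HCN network document with one extra route (assumed schema)."""
    return {
        "Name": name,
        "Type": "NAT",
        "Ipams": [
            {
                "Type": "Static",
                "Subnets": [
                    {
                        "IpAddressPrefix": subnet,
                        "Routes": [
                            # The usual default route through the network's gateway.
                            {"NextHop": gateway, "DestinationPrefix": "0.0.0.0/0"},
                            # Hypothetical extra route for the metadata range.
                            {"NextHop": gateway, "DestinationPrefix": extra_prefix},
                        ],
                    }
                ],
            }
        ],
        "SchemaVersion": {"Major": 2, "Minor": 0},
    }


document = nat_network_with_route("imds-nat", "172.30.42.0/24", "172.30.42.1", "169.254.0.0/16")
print(json.dumps(document, indent=2))
```

Whether HCN applies such a route inside each endpoint, and whether Docker or a CNI plugin would expose it, is exactly the open question raised here.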

Of course, this would need to be implemented in the various HCN users (Docker, the CNI plugins, etc), so even if it is possible, it seems unlikely to happen any time soon, but someone who has this use-case could file a feature request with the appropriate runtime system for their use-case and see if anything comes of it.

@gillg
Author

gillg commented Nov 7, 2023

Many thanks @TBBle for this very long but precise and realistic analysis.
I'm not a fan of the current situation, but I 200% agree with you that using the link-local address family was fine in a "VM" context but is a design mistake in a container world.

While searching for workarounds, I also discovered that AWS introduced an IPv6 address, fd00:ec2::254, but I still have some problems...
[image]
The routes exist in the persistent table, but they seem "unroutable", even though they should follow the RFC for unique local unicast addresses. In fact, the whole IPv6 network seems broken inside the container. Any idea?

So my only remaining question is by what magic those routes are "persisted" in the container when we launch it, while Windows doesn't "apply" them as should be the default behaviour. It would seem more sensible not to have those routes persisted at all when the container starts.

Beyond that, I partially understand your proposal for moving forward in the right direction. I have to dig deeper into HCN to understand your idea.

@TBBle

TBBle commented Nov 9, 2023

I haven't had a chance to play with IPv6 in Windows containers, but a quick look at that route table shows no default gateway for IPv6, suggesting that IPv6 is not active on the virtual network's gateway router, as normally it'd advertise itself as a router and routes would appear in the IPv6 config by magic RADV. I'm not sure off-hand if there's more setup needed, or if it's just fully unsupported by some part of the stack you're using there. Either way, I guess the upshot is that your IPv6 persistent route entries are causing fd00:ec2::254-bound traffic to be sent to the listed gateway, but it's not configured to route packets onwards, so they're lost. (I assume you can ping /6 the container's link-local IPv6 address, and the router's link-local IPv6 address may respond, but nothing beyond that.)

https://docs.docker.com/engine/reference/commandline/network_create/ suggests that IPv6 is opt-in for Docker-created networks. https://learn.microsoft.com/en-au/virtualization/windowscontainers/container-networking/architecture#unsupported-features-and-network-options notes that IPv6 works with l2bridge but not NAT or overlay networks (they also list the --ip6 flag as unsupported, so maybe it's always enabled when supported...). They don't mention whether transparent or l2tunnel networks support IPv6. I would guess that transparent mode works, since once you're spoofing a MAC address, the IP version flowing over that doesn't matter. I assume l2tunnel IPv6 works as a variant of l2tunnel.

However, poking around suggests that Transparent mode doesn't work on EC2, presumably for similar reasons to why it doesn't work on Azure: the cloud networking infrastructure does not allow MAC spoofing.

So I fear that IPv6 in containers on AWS/EC2 is likely to be a dead-end if you cannot set up an L2bridge configuration. #230 suggests the same thing. https://techcommunity.microsoft.com/t5/networking-blog/l2bridge-container-networking/ba-p/1180923 (and maybe a simpler version in this comment) shows a way of using L2bridge with Docker, but doesn't touch on IPv6, so it may or may not result in working IPv6 if the host is IPv6-connected.

That said, I honestly haven't looked very closely at this, as my recent focus was on only getting enough networking going to run BuildKit, so I've only played with NAT via the Windows CNI plugin in particular lately. So I can't speak to the practicality of any of this right now, sadly.

Edit: This issue isn't really in a satisfactory place, so as a possible workaround, a host-based proxy for IMDS or equivalent on other cloud providers would also be possible. However, it must only be available to containers on that host. So having it running in a container and attached to the same virtual network would be a safer option; either way, there's still the challenge of telling your code or SDK that IMDS is actually found at a different IP address. So still not great, but if this turns out to be an absolute blocker, it's a workaround to evaluate.
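
A minimal sketch of the proxy idea, using only the Python standard library, is shown below. It assumes the proxy runs somewhere that can reach the real metadata endpoint and is bound to an address only the containers can reach (for example the NAT gateway address); `make_handler`, `serve`, and the header filter are choices made for this illustration, and nothing here addresses authentication or hardening.

```python
import http.server
import threading
import urllib.error
import urllib.request

# Assumption: the real metadata endpoint as seen from where the proxy runs.
UPSTREAM = "http://169.254.169.254"

# Never send metadata traffic through an environment-configured HTTP proxy.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def make_handler(upstream):
    """Build a handler class that relays GET and PUT requests to `upstream`."""

    class MetadataProxy(http.server.BaseHTTPRequestHandler):
        def _relay(self):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length) if length else None
            # Forward only the metadata-specific headers (used for session tokens).
            headers = {
                key: value
                for key, value in self.headers.items()
                if key.lower().startswith("x-aws-ec2-metadata")
            }
            request = urllib.request.Request(
                upstream + self.path, data=body, headers=headers, method=self.command
            )
            try:
                with _OPENER.open(request, timeout=5) as response:
                    status, payload = response.status, response.read()
            except urllib.error.HTTPError as err:
                status, payload = err.code, err.read()
            except OSError:
                status, payload = 502, b"upstream unreachable"
            self.send_response(status)
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        do_GET = _relay
        do_PUT = _relay

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet

    return MetadataProxy


def serve(bind_addr, port, upstream=UPSTREAM):
    """Start the proxy in a background thread and return the server object."""
    server = http.server.ThreadingHTTPServer((bind_addr, port), make_handler(upstream))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A call such as `serve("172.30.42.1", 8080)` would expose the relay on the gateway address from the route tables above; the remaining challenge noted in this comment, pointing the SDK at the new address, is unchanged.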

@MikeZappa87

Have you tried the latest AWS AMI? We had a conversation with AWS and they were changing this behavior.

@gillg
Author

gillg commented Nov 13, 2023

"Have you tried the latest AWS AMI? We had a conversation with AWS and they were changing this behavior."

Which behaviour do you mean? IPv6, or IPv4 with persistent routes?

When you say they made a change, was it in the very latest AMI? I would love to have more details about this potential change, because I will need to account for it to avoid problems ^^


Contributor

This issue has been open for 30 days with no updates.
@grcusanz, please provide an update or close this issue.


9 participants