IPv6 nightmare, can someone explain the magic numbers please? #37779

Open
markg85 opened this issue Sep 6, 2018 · 14 comments

@markg85

markg85 commented Sep 6, 2018

Hi,

I've literally spent days following all kinds of shady guides on enabling IPv6 in Docker and I just can't get it to work.

My hosting is at Vultr (DigitalOcean will also work), so I can make a new instance any moment and try out whatever you folks are going to suggest ;) They support IPv6. In fact, I can ping the IPv6 instance itself, so there isn't anything wrong on that side. Let's take Fedora (or CentOS) as example distributions.

The issue I have is understanding what is said here: https://docs.docker.com/v17.09/engine/userguide/networking/default_network/ipv6/ (or whatever the current page is).

That page - and literally a million others - talks about 2001:db8:1::/64. Where does that come from?
Other guides offer more information and show their ifconfig output. In my case that begins with "2001:19f0:"... Some of those guides then still use 2001:db8:1::/64 in their Docker config even though their own IPv6 address doesn't begin with that.

So, please, explain to me in clear, short instructions which IPv6 prefix should be filled in for the fixed-cidr-v6 config value. Some increment it by 1, some don't. But none explain why they do what they do.

It makes absolutely no sense to me why I should fill in "2001:db8:1::/64" and how that is supposed to work if my own host starts with something different. And yes, I tried. Not working for me. But then again, I haven't gotten any IPv6 to work from within the container to the outside world, or been able to ping the container from the outside.

Please, once we get this working, update the documentation to explain what is going on and why. IPv6 (even though it has been around for years by now) is still something very difficult to wrap your head around for basically everybody who needs to use it.

Cheers,
Mark

@DennisGlindhart

The 2001:db8::/32 subnet is reserved for examples/documentation purposes. https://tools.ietf.org/html/rfc3849

You should input the IPv6 prefix provided by your ISP/router. If you don't have one, you can use a ULA subnet/addresses ( https://en.wikipedia.org/wiki/Unique_local_address ), but then you will only be able to communicate locally.
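
For example, a minimal /etc/docker/daemon.json for the ULA case might look like this (the fd3b:... prefix is a made-up placeholder; generate your own random prefix out of fd00::/8):

  {
    "ipv6": true,
    "fixed-cidr-v6": "fd3b:5cee:4a2f:1::/64"
  }

Containers on the default bridge then get addresses from that /64, reachable from the host but not routable on the internet.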

@markg85

markg85 commented Sep 22, 2018

Hi @DennisGlindhart, that makes sense to me now. Thank you for that :)

But I still can't get Docker with IPv6 running on a Vultr server.
Could someone perhaps try it out and post the exact steps you did to get it working?

These are the steps I did:

  • Deploy a new Fedora 27 instance
  • Follow the IPv6 configuration guide here https://www.vultr.com/docs/configuring-ipv6-on-your-vps with the prefix from the instance (my test server begins with 2001:19f0:5001:2f9d::)
  • Install Docker (dnf install docker)
  • Configure the daemon:
    { "ipv6": true, "fixed-cidr-v6": "2001:19f0:5001:2f9d::/64" }
  • Start Docker (systemctl start docker)
  • Start a container with a shell (docker run -it alpine ash -c "/bin/sh")
  • Run ifconfig inside the container (I get 2001:19f0:5001:2f9d::242:ac11:2/64)
  • Ping the outside world (ping -6 www.google.com). This fails.
  • Ping the container from the outside (ping -6 2001:19f0:5001:2f9d::242:ac11:2). This also fails.

It must be something insanely simple that I keep missing somehow.
Or IPv6 is insanely complicated to get working...

@agowa

agowa commented Feb 17, 2019

An IPv6 setup is very simple if you know how routing works.
The problem is that the implementation within Docker is currently crappy and inconsistent.
Your failure is most likely about routing.

The Vultr router thinks it has a /64 on one side, with hosts inside it.
When you configure a smaller prefix on your box in order to be able to subnet it, your host thinks it has a /65 on the interface facing Vultr and another /65 on some other interface. Now a packet from a host in your newly created subnet tries to reach Google. What happens:

  1. It puts in its source IP and the destination IP of Google.
  2. The packet traverses your host; your host sees that it should go to Google, looks up the destination IP in its routing table and forwards it to its (default) gateway.
  3. The packet arrives at the Vultr router, which sees a packet coming from an IP within the /64, so everything is fine; it processes the packet just as your host did.
  4. This repeats until the packet reaches Google.
  5. Google creates a response and sends it back to your server, basically flipping the source and destination IPs in the packet.
  6. After traversing some routers, the packet reaches the Vultr router.
  7. The Vultr router looks into its routing table and thinks that everything within the /64 is directly on that interface. So it looks into its ARP table (actually it's NDP, but let's stick to IPv4 naming) and sees that there is no entry for that IP.
  8. The Vultr router sends out an ARP request.
  9. And now nothing happens. So the Vultr router thinks the packet is for a non-existing host and drops it, as it cannot fill in the MAC address of the destination.

What should happen instead of steps 7-9 should be obvious, but because you only got an on-link /64 instead of a routed prefix, you're kinda stuck for now.

In IPv6 a /64 is considered the same as a single IP in IPv4; many things rely upon one host having multiple addresses (2^64 IPs), for example SLAAC or the IPv6 privacy extension (even though those are usually not needed on servers). IPv6 was designed so that you never have to care about IP shortage ever again. As long as you don't burn bits of the address, that assumption holds true (story for another time).
You can cut additional networks out of a /64, but you really don't want to, as that causes problems and complicates routing: you often have to use NDP proxying to pretend that the host inside the subnet is your host itself (to stick with the above example, make your host, instead of the inner container, answer when that address is asked for). That can cause performance problems if you use many of these addresses, as you fill up the NDP table of the router, basically performing an unintended NDP exhaustion attack.
Ideally you want to have a /64 per container/service, as that is considered the smallest non-subnettable entity.
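
To make the "what should happen" concrete: the provider's router would need a route that sends the prefix via your host instead of treating it as on-link. A hedged sketch using documentation addresses (2001:db8:1::/64 standing in for your container subnet, 2001:db8::10 for your host's address):

  ip -6 route add 2001:db8:1::/64 via 2001:db8::10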

But it is not game over; there are four things you can do (in order of preference):

  1. Get Vultr to give you more than one /64, or switch to a different provider (you should contact them to let them know there is a need; if enough people do that, they are very likely to provide it in the future)
  2. Go to https://tunnelbroker.net and tunnel in a /48 prefix to assign to your containers.
  3. Put some placeholder prefix into the fixed-cidr-v6 config variable to make Docker start up, and run all of your containers that should have IPv6 access with --net=host
  4. You can subnet the /64, but stuff will break.

The last one is the same situation as when an ISP assigns a /64 to your home router: then you have an IPv6 address on your router, but no addresses for the home network behind it. For solutions you can refer to this question: https://serverfault.com/questions/714890/ipv6-subnetting-a-64-what-will-break-and-how-to-work-around-it

Someone on Reddit pointed me to the 2nd one and it works quite well. For that you need to:

  1. Register on tunnelbroker.net
  2. On the left, click on Create Regular Tunnel
  3. Enter the IPv4 address of your Vultr instance
  4. Click the assignment button at Routed /48
  5. Go to the Example Configurations tab and apply that config to your instance.
  6. If you need reverse DNS, enter the IP of the server where you want to manage your zone.
  7. Copy that /48 prefix into the fixed-cidr-v6 field, but replace the /48 with /64 or something like that (it does not really matter; for now it just needs to exist so that Docker is able to start), so that the default bridge gets some addresses your containers can pick from.
  8. Create custom networks with your actual subnets.

For your custom networks you could use these addresses if your assignment is 2001:db8::/48 (a sketch of the matching daemon.json and docker network create commands follows this list):

2001:db8:0::/64 (which is equivalent to 2001:db8::/64), for your default docker bridge
2001:db8:1::/64
2001:db8:2::/64
2001:db8:9::/64
...
2001:db8:a::/64
...
2001:db8:f::/64
2001:db8:10::/64
2001:db8:11::/64
...
2001:db8:ffff::/64
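
Putting steps 7 and 8 together, a rough sketch (2001:db8::/48 stands in for your routed /48; the network names frontend and backend are made up for illustration):

  /etc/docker/daemon.json:
  { "ipv6": true, "fixed-cidr-v6": "2001:db8:0::/64" }

  docker network create --ipv6 --subnet 2001:db8:1::/64 frontend
  docker network create --ipv6 --subnet 2001:db8:2::/64 backend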

If you have any questions regarding IPv6, don't hesitate to ask, so that I can improve the documentation later on.
I assumed that you know the basics of routing within IPv4; it should just be about switching habits and figuring out that everything is the same, only the notation changed from decimal to hexadecimal and from dots to colons.

@markg85

markg85 commented Feb 17, 2019

Hi @agowa338, that is quite a comprehensive write-up!
Thank you very much for that!

While I do have a little bit of a networking background, I apparently lack a lot of knowledge here :)
I think the solution in my case is your third option (--net=host), but I would like to know a little more in general about this and why it's so difficult to wrap my head around.

For instance, why is a /48 subnet required? Or anything bigger than /64? That /64 subnet already allows a whopping 18,446,744,073,709,551,616 IPs. To me it makes no sense to allocate that many IPs to a container that really only needs one. Sure, IPv6 has an insane number of IPs available so we can be generous when giving hosts a bucket of IPs, but this is more like giving the container an ocean full of IPs. Also, the concept of Docker is one service per container, so scoping it down to something tiny really seems OK in my opinion. And when you have that many IPs, it seems like total madness to me to use an external service for even more. But perhaps that's just my mindset ;)

Also, I still don't quite get (even though you did explain it) why passing traffic into a container from the host is this difficult in IPv6 when it's dead simple with IPv4. There is no such thing as NDP that needs to be tweaked for IPv4, yet it works just perfectly.

@agowa

agowa commented Feb 17, 2019

Hi @agowa338, that is quite a comprehensive write-up!
Thank you very much for that!

While I do have a little bit of a networking background, I apparently lack a lot of knowledge here :)
I think the solution in my case is your third option (--net=host), but I would like to know a little more in general about this and why it's so difficult to wrap my head around.

No worries. Can you please mark the sections where you think I'm too technical? It should be readable for everyone who knows how routing works with IPv4. If you don't mind, would you like to review the doc change? Just click review and mark everything that you think needs to be changed or further explained, and maybe also say why you have problems understanding a specific part.

For instance, why is a /48 subnet required? Or anything bigger than /64? That /64 subnet already allows a whopping 18,446,744,073,709,551,616 IPs. To me it makes no sense to allocate that many IPs to a container that really only needs one. Sure, IPv6 has an insane number of IPs available so we can be generous when giving hosts a bucket of IPs, but this is more like giving the container an ocean full of IPs.

IPv6 is designed to be wasteful with IPs. Therefore some bits have meaning.
The first bits (often a /32, like 2001:0db8::/32) are the prefix your provider gets assigned.
The next bits are what your provider can use to create independent subnets, like 2001:0db8:85a3::/48 or 2001:0db8:85a3:0800::/56.
The next bits are what the customer uses for their own network segregation, down to a /64 network like 2001:0db8:85a3:08d3::/64.
The last 64 bits are the interface identifier, i.e. the host part. So an address is just not meant to be subnetted further than 2001:0db8:85a3:08d3::/64.

Also, the concept of Docker is one service per container, so scoping it down to something tiny really seems OK in my opinion. And when you have that many IPs, it seems like total madness to me to use an external service for even more. But perhaps that's just my mindset ;)

In fact, for most applications you just need one IP per container. But for simplicity, and to avoid the routing issues you get when subnetting a 2001:0db8:85a3:08d3::/64 further, you want to have more than just one /64. If you want to deploy a web service, you need 3 networks. So being assigned a 2001:0db8:85a3:08d0::/62 would be enough, but providers should assign you a /56 or even a /48 without even thinking about it, as there are so many IPs that it just does not matter.
But back to the example.

For your web service you would divide the 2001:0db8:85a3:08d0::/62 as follows:
2001:0db8:85a3:08d0::/62 means the range from 2001:0db8:85a3:08d0:0:0:0:0 to 2001:0db8:85a3:08d3:ffff:ffff:ffff:ffff.
You first split it into 4 parts so you get:
2001:0db8:85a3:08d0::/64 => host
2001:0db8:85a3:08d1::/64 => default bridge
2001:0db8:85a3:08d2::/64 => custom subnet 1
2001:0db8:85a3:08d3::/64 => custom subnet 2
And assign them according to their use.

Also, I still don't quite get (even though you did explain it) why passing traffic into a container from the host is this difficult in IPv6 when it's dead simple with IPv4. There is no such thing as NDP that needs to be tweaked for IPv4, yet it works just perfectly.

In IPv4 you would tweak ARP instead of NDP. I wanted to add an image to make that clear, but couldn't figure out how to embed one. Do you know how to include an image in the docs?

The problem is that a /64 is not further subnettable, therefore you cannot assign anything smaller than a /64. That's like your provider assigning you a 192.0.2.0/29 (6 hosts) and you wanting to subnet that.
You end up with the same problem: as long as your provider does not change their routing table, you would have to ARP-spoof everything out on the WAN to make your provider think it is talking to a /29 instead of two /30s.
So basically IPv6 is much simpler than IPv4, as it just removes the NAT problems and the requirement to punch through NAT with UPnP, hole punching, STUN/TURN servers, ... (think about implementing a WebRTC service and how much easier that is with IPv6).
All that really changes is that the decimal numbers become hexadecimal and the dots become colons...

@DennisGlindhart

@agowa338 I don't understand why you need more subnets than a single /64 for your containers. That is not a pattern I've seen used anywhere else before, unless one has multiple "trust zones" like a DMZ etc. Can you explain why the web service in your example needs 3 /64 subnets?

The only place where multiple /64 subnets are needed is that you typically need one for the container network and another subnet for the Docker hosts (they might be on the same one, but it can potentially be a bit tricky).

@markg85 To answer your question on why IPv6 seems more difficult to get to work:
Basically, Docker with IPv4 uses NAT by default, meaning that all in- and outgoing traffic from your containers will be NATed to the host's IP. That means that no external setup is required outside the Docker host itself, and it just works out of the box.
IPv6 could in theory do the same, but one of the points of IPv6 was to not need NAT (at least in most common cases). Instead, IPv6 in Docker does networking the pure way and uses routing, so you need to set up some routes on your router.

You will need a /56 or /48 subnet from your ISP (a /64 might also be possible, but might be a bit more tricky).

Basically you have a LAN /64 IPv6 subnet where your Docker hosts are attached to your network - let's say 2001:db8:1::/64. Router address: 2001:db8:1::1, and your Docker host(s) will then have 2001:db8:1::2, 2001:db8:1::3, 2001:db8:1::4, etc.

For the container subnet we can then take 2001:db8:2::/64 and let each Docker host have a /80 out of it (this is the subnet that goes in fixed-cidr-v6):

Docker Host 1: IP: 2001:db8:1::2, Fixed-CIDR: 2001:db8:2:0:1::/80
Docker Host 2: IP: 2001:db8:1::3, Fixed-CIDR: 2001:db8:2:0:2::/80
Docker Host 3: IP: 2001:db8:1::4, Fixed-CIDR: 2001:db8:2:0:3::/80

On the router you will then need to add the following routes:

ip -6 route add 2001:db8:2:0:1::/80 via 2001:db8:1::2
ip -6 route add 2001:db8:2:0:2::/80 via 2001:db8:1::3
ip -6 route add 2001:db8:2:0:3::/80 via 2001:db8:1::4
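
On each Docker host, the daemon config would then carry the matching /80; a sketch for Docker Host 1 under the assumptions above:

  /etc/docker/daemon.json:
  { "ipv6": true, "fixed-cidr-v6": "2001:db8:2:0:1::/80" }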

@agowa

agowa commented Feb 18, 2019

@agowa338 I don't understand why you need more subnets than a single /64 for your containers. That is not a pattern I've seen used anywhere else before, unless one has multiple "trust zones" like a DMZ etc. Can you explain why the web service in your example needs 3 /64 subnets?

/64 is the smallest subnet; everything below that is just ugly and can cause problems, as outlined above. Also, most providers don't route the /64 to your host, but only configure it on the link.
In the above setup I used multiple Docker networks:
one for the frontend servers, one for the backend servers and one for the database servers.
If you only want to host a static webpage using nginx, a single /64 for your containers is enough.

The only place where multiple /64 subnets are needed is that you typically need one for the container network and another subnet for the Docker hosts (they might be on the same one, but it can potentially be a bit tricky).

@markg85 To answer your question on why IPv6 seems more difficult to get to work:
Basically, Docker with IPv4 uses NAT by default, meaning that all in- and outgoing traffic from your containers will be NATed to the host's IP. That means that no external setup is required outside the Docker host itself, and it just works out of the box.
IPv6 could in theory do the same, but one of the points of IPv6 was to not need NAT (at least in most common cases). Instead, IPv6 in Docker does networking the pure way and uses routing, so you need to set up some routes on your router.

The core problem is that Docker currently does not manage the IPv6 routes.

You will need a /56 or /48 subnet from your ISP (a /64 might also be possible, but might be a bit more tricky).

Basically you have a LAN /64 IPv6 subnet where your Docker hosts are attached to your network - let's say 2001:db8:1::/64. Router address: 2001:db8:1::1, and your Docker host(s) will then have 2001:db8:1::2, 2001:db8:1::3, 2001:db8:1::4, etc.

For the container subnet we can then take 2001:db8:2::/64 and let each Docker host have a /80 out of it (this is the subnet that goes in fixed-cidr-v6).

You really shouldn't go below a /64 per Docker network.

Also, there is no need for IPv4 in a cloud environment. For example, Facebook internally uses only IPv6; only their public-facing load balancers have IPv4.

@DennisGlindhart

@markg85

To make it short:

Your problem is that a packet arriving at Vultr's router from the outside with destination 2001:19f0:5001:2f9d::242:ac11:2 (your container IP) has no idea that it should go through your Docker host (let's say it's 2001:19f0:5001:2f9d::1) to reach the container (@agowa338 explains this very well in detail, but this is the ultra-short version).

So to make it work you need Vultr to add a route to their router like:

ip -6 route add 2001:19f0:5001:2f9d::242:ac11:2 via 2001:19f0:5001:2f9d::1

This should be "scaled up" so that a whole subnet (the subnet in fixed-cidr-v6) is routed (preferably a subnet different from the one the Docker host is in), so that not every container/IP needs its own separate routing rule.

But if you cannot make them do that (it sounds like you can't), you might have another option (besides --net=host & using Tunnelbroker):

I don't know how Vultr's network works, but if it is just a simple Layer 2 network attached to your Docker host, then maybe macvlan (or ipvlan) would work. Read more here: https://docs.docker.com/network/macvlan/ - That way you don't need configuration outside your Docker host. I have my doubts that it actually IS a clean Layer 2 network, but it might be worth a try.
Also be aware that you probably cannot make a connection between the Docker host and the containers.
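
If it does turn out to be a flat Layer 2 segment, a macvlan network could be sketched roughly like this (eth0 as the parent interface and the ::1 gateway are assumptions; check the interface name and gateway your instance actually uses, and keep container addresses clear of the host's own address):

  docker network create -d macvlan --ipv6 \
    -o parent=eth0 \
    --subnet 2001:19f0:5001:2f9d::/64 \
    --gateway 2001:19f0:5001:2f9d::1 \
    macvlan-v6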

@agowa338

In IPv6 a /64 is considered the same as a single IP in IPv4

IMHO I think this can very easily be misinterpreted. I think it only holds true if by IPv4 you mean a single external IPv4 address (with NATed clients behind it), and in that case a /48 or /56 would probably be more correct, as they are the most common allocations by ISPs.

In a typical network your clients/computers would all be in a single /64 on the same Layer 2 network, where SLAAC, NDP, mDNS etc. can be used. In companies you might create different segregation zones (DMZ, clients, internal servers etc.), but I would definitely expect more than a single device in a /64 subnet.

/64 is the smallest subnet; everything below that is just ugly and can cause problems, as outlined above. Also, most providers don't route the /64 to your host, but only configure it on the link.
In the above setup I used multiple Docker networks:
one for the frontend servers, one for the backend servers and one for the database servers.
If you only want to host a static webpage using nginx, a single /64 for your containers is enough.

I'm not sure we have the same understanding when using the word subnet. You mention Docker networks, and it seems, based on your example, that you create multiple networks with docker network create, each with its own /64 subnet?

2001:0db8:85a3:08d0::/64 => host
2001:0db8:85a3:08d1::/64 => default bridge
2001:0db8:85a3:08d2::/64 => custom subnet 1
2001:0db8:85a3:08d3::/64 => custom subnet 2

I'm curious - why can't the webserver & DB server in your example be in the same network/subnet (/64), i.e. be in the default bridge? Why the need for custom subnets 1 & 2?

If you use multiple/different /64 subnets you will have to go to Layer 3 (routing) to communicate between them, and thus the features you can only have on an L2 /64 are lost anyway (e.g. mDNS).

How will you have different containers in the same /64 spread across multiple Docker hosts in that setup without dividing it into smaller parts?

The way it is done in both Swarm & Kubernetes (all the "big" L3 CNI providers like Calico, Cilium, static routing, Kube-router etc.) is that part of a /64 subnet is split up into sizes between /80 and /122 and allocated to each node, and that is not bad practice in any way AFAIK. Of course this is a Layer 3 network, but that is also the case when you use different /64s in your Docker example setup.

@agowa

agowa commented Feb 18, 2019

@markg85
But if you cannot make them do that (it sounds like you can't), you might have another option (besides --net=host & using Tunnelbroker):

I don't know how Vultr's network works, but if it is just a simple Layer 2 network attached to your Docker host, then maybe macvlan (or ipvlan) would work. Read more here: https://docs.docker.com/network/macvlan/ - That way you don't need configuration outside your Docker host. I have my doubts that it actually IS a clean Layer 2 network, but it might be worth a try.
Also be aware that you probably cannot make a connection between the Docker host and the containers.

Or by using an NDP proxy:
https://deploy-preview-8292--docsdocker.netlify.com/network/ipv6/#ndp-proxy
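
In practice that approach boils down to commands along these lines on the Docker host (a sketch; eth0 and the container address 2001:db8::c009 are placeholders for your uplink interface and container IP):

  sysctl net.ipv6.conf.eth0.proxy_ndp=1
  ip -6 neigh add proxy 2001:db8::c009 dev eth0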

@agowa338

In IPv6 a /64 is considered the same as a single IP in IPv4

IMHO I think this can very easily be misinterpreted. I think it only holds true if by IPv4 you mean a single external IPv4 address (with NATed clients behind it), and in that case a /48 or /56 would probably be more correct, as they are the most common allocations by ISPs.

In a typical network your clients/computers would all be in a single /64 on the same Layer 2 network, where SLAAC, NDP, mDNS etc. can be used. In companies you might create different segregation zones (DMZ, clients, internal servers etc.), but I would definitely expect more than a single device in a /64 subnet.

Only if you also do DHCPv6-PD; otherwise, how would you do virtualization like VirtualBox/VMware Player/...? (OK, maybe not needed in your business, depending on what your needs are.) And having a /64 per host within your network is just how you future-proof yourself.
Or support the privacy extension? It is just much easier if the whole /64 is assigned to a single host (as you can uniquely identify the host even with the privacy extension enabled), even though that circumvents the privacy extension a bit...

/64 is the smallest subnet; everything below that is just ugly and can cause problems, as outlined above. Also, most providers don't route the /64 to your host, but only configure it on the link.
In the above setup I used multiple Docker networks:
one for the frontend servers, one for the backend servers and one for the database servers.
If you only want to host a static webpage using nginx, a single /64 for your containers is enough.

I'm not sure we have the same understanding when using the word subnet. You mention Docker networks, and it seems, based on your example, that you create multiple networks with docker network create, each with its own /64 subnet?

In the above example I subnetted the prefix to gain separate networks that I could assign to the Docker networks, to illustrate how one would use the prefix to create a separate IPv6 network for each Docker network.

2001:0db8:85a3:08d0::/64 => host
2001:0db8:85a3:08d1::/64 => default bridge
2001:0db8:85a3:08d2::/64 => custom subnet 1
2001:0db8:85a3:08d3::/64 => custom subnet 2

I'm curious - why can't the webserver & DB server in your example be in the same network/subnet (/64), i.e. be in the default bridge? Why the need for custom subnets 1 & 2?

They might as well be within the same /64, but if you want to use auto-scaling, having different networks can be handy for administrative purposes and readability of log files. It was just a simple example of where you would use different Docker networks. It might as well be two networks, one with your webservers and another with workers, or you might only have one with everything inside it. It is up to you how you design your application/network.

If you use multiple/different /64 subnets you will have to go to Layer 3 (routing) to communicate between them, and thus the features you can only have on an L2 /64 are lost anyway (e.g. mDNS).

Up to your design.

How will you have different containers in the same /64 spread across multiple Docker hosts in that setup without dividing it into smaller parts?

RFC 6275, but Docker currently does not implement that. So you'll have to do it yourself.

The way it is done in both Swarm & Kubernetes (all the "big" L3 CNI providers like Calico, Cilium, static routing, Kube-router etc.) is that part of a /64 subnet is split up into sizes between /80 and /122 and allocated to each node, and that is not bad practice in any way AFAIK. Of course this is a Layer 3 network, but that is also the case when you use different /64s in your Docker example setup.

Until now. But the IPv6 implementation in Docker is anything but usable without pain. If you host the Docker hosts yourself, I would currently recommend creating multiple VLANs and using macvlan (or its ipvlan option) so you can handle networking outside of Docker. This also allows you to create networks without an IPv6 gateway (non-routed networks that are only reachable from your Docker nodes) and to assign multiple IPv6 addresses to a single interface for redundant uplinks.

@markg85

markg85 commented Feb 18, 2019

Aargh, I'm going totally insane from this damned IPv6 crap. Really, it is not intuitive! It is not just replacing the dots with colons and the digits with hex. It's far more complicated. Each time I try it I spend hours and end up getting exactly nowhere.

Sorry if I sound a bit agitated; that's not because of you answering (I'm honestly happy you do!) but because, in my mind, it is just really overly and needlessly complex. I consider myself quite technologically advanced. I'm a programmer; I know my stuff, or know how to find the information to learn and understand it. IPv6 does not fall into that category.

So, what do I get so far?

  • IPv6 is apparently globally routed down to the /64 subnet. How this works or why that's the case is beyond me. I just take it as a fact.
  • The host (say, the Docker instance) gets the whole /64 subnet. Anything within that is my issue to route back and forth.
  • I need to tell the host how to route traffic from the outside into the Docker container and vice versa.

Now here's the real trouble. I - again - tried a Vultr instance with Docker and IPv6, following this guide nearly to the letter: https://docs.docker.com/v17.09/engine/userguide/networking/default_network/ipv6/

It specifically talks about the case where you get a /64 block and need to subnet it further. It also gives me the impression that I can make it work without any external hackery (like Tunnelbroker or asking for an even bigger IPv6 range from Vultr). The section that makes me think that:

When the router wants to send an IPv6 packet to the first container, it transmits a neighbor solicitation request, asking “Who has 2001:db8::c009?” However, no host on the subnet has the address; the container with the address is hidden behind the Docker host. The Docker host therefore must listen for neighbor solicitation requests and respond that it is the device with the address. This functionality is called the NDP Proxy and is handled by the kernel on the host machine.

When following that exactly (and adjusting the IPs to what I have) I just don't get it working. I might be doing the same thing wrong over and over. But if I do that, even after literally years (as I tried the same thing a couple of years ago), then the problem likely is the documentation.

Now that very documentation (https://docs.docker.com/v17.09/engine/userguide/networking/default_network/ipv6/) is very elaborate and probably works well for people who are deep into IPv6, who have gone through the insane pain of getting it working and now know how to interpret what's written there. It is not easy to follow if you just want to get it working and don't know the ins and outs of IPv6. I'm certainly not going to follow an IPv6 course just to be able to set up a Docker container; I'll just keep using IPv4 if v6 requires that amount of knowledge.

A very big improvement for that documentation page would be to not solely use the reserved IPv6 address all over the place. It needs to explain what 2001:db8:1:: is (I know, it's a reserved example address). It needs to show real examples so that users get a better feeling for how the addresses look. Right now everything is based on that example and my addresses just don't look like that. And because you can omit a lot in IPv6, it isn't clear to the reader and just seems like magic.

So again, it needs to be easy to understand, in bite-sized pieces of information. It doesn't need to explain the protocol in depth. It needs to explain enough to get it working and provide enough information for the user to build upon. If you need hours and hours of time (days even) where you would set up an IPv4 container in mere minutes, then the documentation is (very) wrong.

Cloud hosting
As DigitalOcean and Vultr have come up a couple of times: DigitalOcean is probably the most idiotic provider when it comes to IPv6. They literally give you about 15 IPs.
Vultr gives you a /64 block.
Hetzner gives a /64 block.

I'd recommend basing the documentation on a /64 where the host is in fact going to subnet further down. Bad practice or not. Then explain why you need an NDP proxy (that much is clear to me, even though I only get that I "need" it, not why IPv6 does not route below /64) and describe how to set it up in clear terms with real-world examples.

@agowa

agowa commented Feb 19, 2019

Aargh, I'm going totally insane from this damned IPv6 crap. Really, it is not intuitive! It is not just replacing the dots with colons and the digits with hex. It's far more complicated. Each time I try it I spend hours and end up getting exactly nowhere.

OK, maybe I'm just assuming too much. Have you ever done anything with routing? Do you know how routing tables work? Subnetting? Supernetting? CIDR? Within IPv4 you basically have two different concepts of how people understand networking to work. The first is one network internally and NAT towards the public side; it simply does not matter what you do internally, as you're hidden. The other is in enterprise environments, where routing, BGP, spanning tree, ... come into play. With IPv6 this all got way simpler, as the first one can be dropped entirely. And that is also a problem for people who don't know how routing works.

Sorry if I sound a bit agitated; that's not because of you answering (I'm honestly happy you do!) but because, in my mind, it is just really overly and needlessly complex. I consider myself quite technologically advanced. I'm a programmer; I know my stuff, or know how to find the information to learn and understand it. IPv6 does not fall into that category.

No problem, I see that all the time, and as soon as people understand IPv6 they're like: that is really not different at all, why was that so troublesome for me to understand?!?

As you're a programmer and not a devop/sysadmin/network admin: have you ever looked at WebRTC/P2P, NAT hole punching, UPnP? If so, then you know the limitations of IPv4 and what crude and error-prone workarounds and problems they cause. IPv6 fixes all of them.

So, what do I get so far?

  • IPv6 is apparently globally routed down to the /64 subnet. How this works or why that's the case is beyond me. I just take it as a fact.

You shouldn't, as routing internally works just the same. There is no "in public it's like A and internally it's like B". It's always A.

  • The host (say, the Docker instance) gets the whole /64 subnet. Anything within that is my issue to route back and forth.

Not really; your provider needs to route whatever prefix you want to have to your router (i.e. your Docker host), instead of just dropping it out on the interface.

A routing table of the router would look like:
Your docker network via your host
and not
Your /64 prefix via eth0

  • I need to tell the host how to route traffic from the outside into the Docker container and vice versa.

That will happen automatically. If you don't want to do fancy stuff, your host will start to route every network it knows into every other network after you set net.ipv6.conf.all.forwarding (IPv6) and net.ipv4.conf.all.forwarding (IPv4) to 1.
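
For example (one-off; persist it via /etc/sysctl.d/ once it does what you want):

  sysctl -w net.ipv6.conf.all.forwarding=1
  sysctl -w net.ipv4.conf.all.forwarding=1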

Now here's the real trouble. I - again - tried a Vultr instance with Docker and IPv6, following this guide nearly to the letter: https://docs.docker.com/v17.09/engine/userguide/networking/default_network/ipv6/

It specifically talks about the case where you get a /64 block and need to subnet it further. It also gives me the impression that I can make it work without any external hackery (like Tunnelbroker or asking for an even bigger IPv6 range from Vultr).

The problem is not that you need a bigger prefix; the problem is how Vultr routes you the /64. According to best practice and RIR assignment policies, they should just provide a routable prefix, either by assigning a larger prefix (which implies that they add a route via your instance IP to their routing table) or by using DHCPv6-PD. They could also just provide you another /64 via a route to your instance; then you would be able to assign that prefix behind your instance instead of the one you got on the WAN side (like your home router gets at least a /64 (or a /127) for its ISP-facing side and a /64 for your LAN).

Or, as I've seen from other providers, they do this within their routing table:

your instance IP (/128) via eth0, priority medium
your prefix via your instance IP, priority low

This is a crude hack that the provider could do to allow you to subnet the network as you like. So the whole problem is how the network is routed at the handover between Vultr and you.
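
Expressed as Linux-style routes, the idea looks roughly like this (just a sketch; 2001:db8:1::10 stands in for your instance IP and 2001:db8:1::/64 for the handed-over prefix, and real provider gear uses its own syntax):

  # the more specific /128 keeps the instance itself reachable on-link
  ip -6 route add 2001:db8:1::10/128 dev eth0
  # the rest of the /64 is routed via the instance, which can then subnet it
  ip -6 route add 2001:db8:1::/64 via 2001:db8:1::10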

The section that makes me think that:

When the router wants to send an IPv6 packet to the first container, it transmits a neighbor solicitation request, asking “Who has 2001:db8::c009?” However, no host on the subnet has the address; the container with the address is hidden behind the Docker host. The Docker host therefore must listen for neighbor solicitation requests and respond that it is the device with the address. This functionality is called the NDP Proxy and is handled by the kernel on the host machine.

That's exactly what we mentioned above as another workaround.

When following that exactly (and adjusting the IPs to what I have) I just don't get it working. I might be doing the same thing wrong over and over. But if I do that, even after literally years (as I tried the same thing a couple of years ago), then the problem likely is the documentation.

As I also said above, Vultr may just block NDP proxies, as that is basically the same as an NDP exhaustion attack... You may try macvlan instead of a bridge, as @DennisGlindhart outlined above.

Now that very documentation (https://docs.docker.com/v17.09/engine/userguide/networking/default_network/ipv6/) is very elaborate and probably works well for people who are deep into IPv6, who have gone through the insane pain of getting it working and now know how to interpret what's written there. It is not easy to follow if you just want to get it working and don't know the ins and outs of IPv6. I'm certainly not going to follow an IPv6 course just to be able to set up a Docker container; I'll just keep using IPv4 if v6 requires that amount of knowledge.

A very big improvement for that documentation page would be to not solely use the reserved IPv6 address all over the place. It needs to explain what 2001:db8:1:: is (I know, it's a reserved example address). It needs to show real examples so that users get a better feeling for how the addresses look. Right now everything is based on that example and my addresses just don't look like that. And because you can omit a lot in IPv6, it isn't clear to the reader and just seems like magic.

What does using another prefix change? That's like having a guide where 172.16.0.1 is the server instead of 10.0.0.1; can you explain your thought process there? I have problems understanding why that should make any difference. Or is it because of the abbreviation, with 0 instead of 0000, db8 instead of 0db8 and :: instead of multiple zero blocks?

So again, it needs to be easy to understand, in bite-sized pieces of information. It doesn't need to explain the protocol in depth. It needs to explain enough to get it working and provide enough information for the user to build upon. If you need hours and hours of time (days even) where you would set up an IPv4 container in mere minutes, then the documentation is (very) wrong.

That's what I tried above and in the linked doc change. But it is not that easy if you come from two different angles. I started in IT by learning how to subnet IPv4 and IPv6, how to write routing tables and firewall rules, why the different types of NAT suck, designing big and small networks, ... And you may have started by learning C++/Java/PHP/... instead. If I showed you how I struggled with programming language/API documentation, you would most likely think the same... E.g.: why the heck is there no example of how that interface should be implemented, or just where do I have to add a try/catch to prevent that specific error message from crashing that application inside that giant repository?

That's why I asked you to review my explanation of what you need to configure. And just as you cannot tell me how I would have to implement an interface for my specific use case, I cannot tell you how you need to design your network.

Cloud hosting
As DigitalOcean and Vultr have come up a couple of times: DigitalOcean is probably the most idiotic provider when it comes to IPv6. They literally give you about 15 IPs.
Vultr gives you a /64 block.
Hetzner gives a /64 block.

Hetzner also provides you an additional /48 for assignment behind or inside your host if you ask for it (at least for root servers; that's all I've used from them so far).

I'd recommend basing the documentation on a /64 where the host is in fact going to subnet further down. Bad practice or not. Then explain why you need an NDP proxy (that much is clear to me, even though I only get that I "need" it, not why IPv6 does not route below /64) and describe how to set it up in clear terms with real-world examples.

There are multiple reasons why it is bad practice, and that is because it can cause problems if one assumes you're following RIR assignment policies, RFC standards, common practices, ... And without knowing those, you're also unable to debug the resulting issues...
Therefore you need to know that, just as you need to know how NAT works within IPv4...
If you want to know all of the reasons why it is bad to subnet a /64 further, you have to know much more about IPv6 than you would ever need. So either take it for granted, or do your research, but you cannot have both.

By the way, that's also the reason you have a networking/infrastructure department within organisations: so that you as a developer can just request "Docker with IPv6 for containers" instead of having to think about how the whole infrastructure beneath your app should look.

P.S. Just to mention the one problem that could have the most impact after you've set up such a prefix and have your service up and running. Let's say your provider assigns you 16 IPs, as DigitalOcean does. The next 16 IPs are assigned to another customer. You host a mail/web server that sends emails from within your 16 IPs. The other customer uses his 16 IPs to send spam. The whole /64, or more depending on what the provider specified as the "customer allocation size" at the RIR, will be blacklisted. Therefore you'll be unable to send mail to anybody until the /64 is either removed from the blacklist again, which takes time, or you get another 16 IPs from a different /64 assigned, for which you have to change your whole crude setup.

@markg85

markg85 commented Feb 19, 2019

OK, maybe I'm just assuming too much. Have you ever done anything with routing? Do you know how routing tables work? Subnetting? Supernetting? CIDR? Within IPv4 you basically have two different concepts of how people understand networking to work. The first is one network internally and NAT towards the public side; it simply does not matter what you do internally, as you're hidden. The other is in enterprise environments, where routing, BGP, spanning tree, ... come into play. With IPv6 this all got way simpler, as the first one can be dropped entirely. And that is also a problem for people who don't know how routing works.

I've done subnetting and site-to-site connections where both sides use DHCP (that's more of a firewall issue though). I have tweaked routes before to isolate networks, but I admit I try to stay away from that. I know how the netmask works to scope your network into smaller chunks. I am at the level where I can easily set up networks, but when it comes to setting up VLANs I have some trouble. I'd say that is more than the average tech guy, but less than someone who has actually taken network management courses.

As you're a programmer and not a devop/sysadmin/network admin: have you ever looked at WebRTC/P2P, NAT hole punching, UPnP? If so, then you know the limitations of IPv4 and what crude and error-prone workarounds and problems they cause. IPv6 fixes all of them.

I know P2P and UPnP. The latter tells the router to punch a hole :)

As I also said above, Vultr may just block NDP proxies, as that is basically the same as an NDP exhaustion attack... You may try macvlan instead of a bridge, as @DennisGlindhart outlined above.

I doubt that, but then again I'm not getting it working, so it might just be the case.
Question: could you try it on Vultr? If you send me your mail I will spin up a new Vultr machine and send you the private key to connect to it. Or you can send me your public key and I'll add you to the authorized_keys. (Yes, I know, exchanging private keys is not the intention. I don't care for this test; this private key only lives for you and only for as long as I keep the instance alive.)

What does using another prefix change? That's like having a guide where 172.16.0.1 is the server instead of 10.0.0.1; can you explain your thought process there? I have problems understanding why that should make any difference. Or is it because of the abbreviation, with 0 instead of 0000, db8 instead of 0db8 and :: instead of multiple zero blocks?

Yes, sure.
When I see the example I see 2001:db8:1:: over and over again. Yet when I look at the host where I tested this I see 2001:19f0:5001:2482:: (an actual /64 range btw). That is just very different. I know (or suspect) this is due to the shortening going on, but that makes it very confusing when you're new to this concept. And mind you, you can assume people are new to this concept when reading the docs on getting an IPv6 container up. The examples should at the very least have an equal colon count. The core issue here is a different subnet size: the example is for a /48, the thing I see is a /64. You first have to teach people the IPv6 address structure before that makes sense. That education doesn't need to be there in great detail; the docs should just elaborate a little on the different common subnets and what they look like. For instance, I find this image quite clear: https://image.slidesharecdn.com/ipv6addressplanning-20130924-2-130925112810-phpapp02/95/ipv6-address-planning-11-638.jpg

That's what I tried above and in the linked doc change. But it is not that easy if you come from two different angles. I started in IT by learning how to subnet IPv4 and IPv6, how to write routing tables and firewall rules, why the different types of NAT suck, designing big and small networks, ... And you may have started by learning C++/Java/PHP/... instead. If I showed you how I struggled with programming language/API documentation, you would most likely think the same... E.g.: why the heck is there no example of how that interface should be implemented, or just where do I have to add a try/catch to prevent that specific error message from crashing that application inside that giant repository?

That's why I asked you to review my explanation of what you need to configure. And just as you cannot tell me how I would have to implement an interface for my specific use case, I cannot tell you how you need to design your network.

Hahaha, fair enough ;) (FYI, I started totally wrong in programming, beginning with PHP (where everything is allowed) and working up to C++ and staying there.)
Interfaces aren't that hard, hehe. But yeah, I get your point.

I did look over it and my first impression was "this is a lot better!". But yeah, I will give a detailed review with inline comments later today, or this week at the very latest.

P.S. Just to mention the one problem that could have the most impact after you've set up such a prefix and have your service up and running. Let's say your provider assigns you 16 IPs, as DigitalOcean does. The next 16 IPs are assigned to another customer. You host a mail/web server that sends emails from within your 16 IPs. The other customer uses his 16 IPs to send spam. The whole /64, or more depending on what the provider specified as the "customer allocation size" at the RIR, will be blacklisted. Therefore you'll be unable to send mail to anybody until the /64 is either removed from the blacklist again, which takes time, or you get another 16 IPs from a different /64 assigned, for which you have to change your whole crude setup.

I've seen that example a couple of times when looking into DigitalOcean, indeed. Seems like a /64 is the bare minimum a provider should assign :)

Am I correct in assuming that if Docker had implemented RFC 6275, this whole issue would not exist at all? I.e., Docker would then work as easily as IPv4 currently does? If the answer is "yes", then I'm curious to know about any Docker issues tracking progress on that RFC.

@agowa

agowa commented Feb 19, 2019

I know P2P and UPnP. The latter tells the router to punch a hole :)

Not only that; UPnP can do way more. For example, it tells you your public IP or helps you discover printers. But only if the vendor implemented it correctly, and nearly all programs or devices that claim to support UPnP fail at some point or another (KTorrent does not work with UBNT gateways, for example)...
Or you plug in device A that claims to support UPnP and device B can no longer be discovered using UPnP, for no apparent reason. In short, UPnP may look good on paper, but it is a mess.

As I also said above, Vultr may just block NDP proxies, as that is basically the same as an NDP exhaustion attack... You may try macvlan instead of a bridge, as @DennisGlindhart outlined above.

I doubt that, but then again I'm not getting it working, so it might just be the case.
Question: could you try it on Vultr?

Most likely not before Friday.

If you send me your mail I will spin up a new Vultr machine and send you the private key to connect to it. Or you can send me your public key and I'll add you to the authorized_keys. (Yes, I know, exchanging private keys is not the intention. I don't care for this test; this private key only lives for you and only for as long as I keep the instance alive.)

No worries, that should not be needed, as I have a few bugs on Vultr.

What does using another prefix change? That's like having a guide where 172.16.0.1 is the server instead of 10.0.0.1; can you explain your thought process there? I have problems understanding why that should make any difference. Or is it because of the abbreviation, with 0 instead of 0000, db8 instead of 0db8 and :: instead of multiple zero blocks?

Yes, sure.
When I see the example I see 2001:db8:1:: over and over again. Yet when I look at the host where I tested this I see 2001:19f0:5001:2482:: (an actual /64 range btw). That is just very different. I know (or suspect) this is due to the shortening going on, but that makes it very confusing when you're new to this concept. And mind you, you can assume people are new to this concept when reading the docs on getting an IPv6 container up. The examples should at the very least have an equal colon count. The core issue here is a different subnet size: the example is for a /48, the thing I see is a /64. You first have to teach people the IPv6 address structure before that makes sense. That education doesn't need to be there in great detail; the docs should just elaborate a little on the different common subnets and what they look like. For instance, I find this image quite clear: https://image.slidesharecdn.com/ipv6addressplanning-20130924-2-130925112810-phpapp02/95/ipv6-address-planning-11-638.jpg

Let me see if I can come up with a good graphic to illustrate that, and subnetting, visually...
Maybe a combination of your image and this one: https://goo.gl/images/MkRFMb

Hahaha, fair enough ;) (FYI, I started totally wrong in programming, beginning with PHP (where everything is allowed) and working up to C++ and staying there.)
Interfaces aren't that hard, hehe. But yeah, I get your point.

Networking is also not that hard ;-)
And if you started totally wrong with programming, I did worse: VB.NET, Java reflection (MC mods), Java, C# & PowerShell, Python, C++ (yes, I tried to do Java reflection before understanding Java ;-) ), but my favourites are C#, PowerShell and Java, because I like "managed code" and not having to care about memory management that much. And currently I'm looking at Go, Ruby on Rails and Haskell.

I did look over it and my first impression was "this is a lot better!". But yeah, I will give a detailed review with inline comments later today, or this week at the very latest.

👍

I've seen that example a couple of times when looking into DigitalOcean, indeed. Seems like a /64 is the bare minimum a provider should assign :)

No, they should assign multiple /64s to customers.
Refer to https://www.ripe.net/publications/docs/ripe-690 sections 4.2, 4.2.1, 4.2.2, 4.2.3:

However, /64 is not sustainable, it doesn't allow customer subnetting, and it doesn't follow IETF recommendations of “at least” multiple /64s per customer. Moreover, future work within the IETF and recommendations from RFC 7934 (section 6) allow the assignment of a /64 to a single interface (...)
Assigning a /64 or longer prefix does not conform to IPv6 standards and will break functionality in customer LANs. With a single /64, the end customer CPE will have just one possible network on the LAN side and it will not be possible to subnet, assign VLANs, alternative SSIDs, or have several chained routers in the same customer network, etc.

Am I correct in assuming that if Docker had implemented RFC 6275, this whole issue would not exist at all? I.e., Docker would then work as easily as IPv4 currently does? If the answer is "yes", then I'm curious to know about any Docker issues tracking progress on that RFC.

Yes and no. Yes, because you as a user would not notice it. No, because it would just be abstracted away from you and the problem would still occur under the hood. The container would have a "technical" care-of address assigned from each WAN link's subnet and a home address, which you add to your DNS and which others use to connect to your server. But for that you also need more than one /64 subnet (though it could be abstracted for single servers by using link-local addresses instead of GUAs)...
Here you go: #38748

@polarathene

Hi @agowa338 and @DennisGlindhart !

Since the issue remains open and you both seem quite knowledgeable on the subject, any help clarifying how subnetting with IPv6 works would be greatly appreciated 😀

Docker specific

NAT via ip6tables since Docker Engine 20.10.0 (2020Q4)

Since the last comment, I will point out that at the end of 2020 Docker 20.10.0 was released with experimental ip6tables support (still experimental today). That allows using a /48 of ULA as a private network with many /64 subnets, with Docker providing NAT to a public-facing IPv6 address.

That provides a similar experience to containers using the IPv4 private range for internal networks and only publishing the container ports that you want to be publicly accessible. I'm aware that IPv6 is meant to make NAT obsolete, and that you can restrict access by other means, but for this context it seems perfectly fine to use ULA with NAT. That is non-standard AFAIK, so any readers landing here should be aware that /etc/gai.conf defaults tend to use a priority of: IPv6 (public) > IPv4 (public) > IPv4 (private) > IPv6 (private).

A common issue I've seen for deployments with Docker was IPv6 connections being routed by default through NAT64 into container networks (where containers only have IPv4 assigned). That caused containers to see the source IP as the Docker network gateway IPv4 instead of the IPv6 client address, breaking some containerized software (Fail2Ban, mail servers, etc.) that would otherwise work fine with IPv4 connections. With ip6tables and ULA providing IPv6 addresses to the containers, the issue goes away and the client IPv6 address is preserved.
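
For reference, a sketch of the daemon.json that enables this mode (the ULA prefix is a placeholder to replace with your own, and ip6tables still sits behind the experimental flag in the engine versions discussed here):

  {
    "experimental": true,
    "ipv6": true,
    "ip6tables": true,
    "fixed-cidr-v6": "fd00:face:feed:cafe::/64"
  }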

There have been other improvements/fixes, and some bugs still remain with Docker's IPv6 support AFAIK.

Docker with /80 subnet to prevent NDP neighbour cache invalidation issues

This was originally detailed in the official Docker docs until a rewrite in Feb 2018. It remains cached by various other sites, which show up in search engine results for queries about Docker + IPv6.

For someone not too familiar with IPv6, I'm not sure if there's enough context to make sense of it, or if it's still relevant (since the official docs removed all mention of it), especially given the usual advice to not have subnets smaller than /64.

From the old Docker docs authored in Oct 2015:

Often servers or virtual machines get a /64 IPv6 subnet assigned (e.g. 2001:db8:23:42::/64).
In this case you can split it up further and provide Docker a /80 subnet while using a separate /80 subnet for other applications on the host.

...

Remember the subnet for Docker containers should at least have a size of /80.
This way an IPv6 address can end with the container’s MAC address and you prevent NDP neighbor cache invalidation issues in the Docker layer.

So if you have a /64 for your whole environment use /78 subnets for the hosts and /80 for the containers.
This way you can use 4096 hosts with 16 /80 subnets each.


I am curious whether that's still applicable or outdated, given this tidbit:

The original design of IPv6 allocation was for 80 bits of network addressing and 48 bits of host addressing.
After that was shared with the IEEE, they pointed out that future Ethernet-follow-on protocols would use a 64-bit EUI rather than a 48-bit MAC.
Thus the IPv6 network:host split was moved from 80:48 to 64:64.

That would seem to align with why a /80 prefix was advised in the older Docker docs? Today that would be /64?
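To make the old /80 advice concrete, here's an illustrative mapping (addresses are made up along the lines of the docs' example above; the MAC embedding matches what the default bridge still does, per the update below):

    host assignment:   2001:db8:23:42::/64
    Docker subnet:     2001:db8:23:42:1::/80   ("fixed-cidr-v6" in daemon.json)
    container MAC:     02:42:ac:11:00:02
    container IPv6:    2001:db8:23:42:1:242:ac11:2   (last 48 bits = the MAC)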


UPDATE: I investigated the behaviour with Docker Engine v23, observing whether the MAC is still used in assigned IPv6 container addresses:

When using:

  • A user-defined network: running a container attached to that network, the IPv6 address increments from the lowest available address (e.g. <network prefix>::2, with ::1 assigned as the IPv6 gateway).

  • The default docker0 bridge network (now considered legacy) behaves similarly, but instead of ::2 you'd get the container's MAC address embedded:

    docker inspect <container name>:

    "IPv6Gateway": "fd00:face:feed:cafe::1",
    "GlobalIPv6Address": "fd00:face:feed:cafe:0:242:ac10:2",
    "GlobalIPv6PrefixLen": 64,
    "MacAddress": "02:42:ac:10:00:02",

    docker network inspect bridge (with container running):

    "Subnet": "fd00:face:feed:cafe::/64"
    "MacAddress": "02:42:ac:10:00:02",
    "IPv6Address": "fd00:face:feed:cafe:0:242:ac10:2/64"

Presumably a network on bridge (aka docker0) would be limited to at least a /80-sized subnet due to that usage of the MAC? The concern with NDP cache invalidation would then be that smaller subnets like /96 truncate into the bits needed to represent the MAC in the address, preventing Docker from ensuring containers are uniquely addressed?

User-defined networks, with their incremental assignment, wouldn't have that issue. You'd have an address space that can actually be larger than /80, but it also wouldn't break when smaller than /80?

EDIT: In /etc/docker/daemon.json, as soon as the docker0 bridge CIDR prefix gets longer than /80 (e.g. "fixed-cidr-v6": "fd00:face:feed:cafe::/81"), it no longer includes the MAC in the assigned IPv6 address, and instead behaves like the user-defined network description above. Not sure what to make of that.
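For anyone wanting to reproduce that comparison, something like the following should pull out just the relevant fields (a sketch; the template fields correspond to the docker inspect output shown earlier):

    docker inspect -f '{{range .NetworkSettings.Networks}}{{.GlobalIPv6Address}} {{.MacAddress}} {{end}}' <container name>
    docker network inspect bridge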


IPv6 confusion

IPv6 subnets should always be /64?

With IPv6, for the GUA and ULA address formats:

  • The first 64 bits are the network prefix, with at most 16 bits of that used for addressing a block of /64 subnets?
  • The remaining 64 bits are the host/interface ID, i.e. the pool of IPv6 addresses within a /64 subnet?

As discussed earlier in this issue, using a /80 subnet would break various IPv6 features:

"Normal" subnets are never expected to be any narrower (longer prefix) than /64.
No subnets are ever expected to be wider (shorter prefix) than /64

Using a subnet prefix length other than a /64 will break many features of IPv6,
including Neighbor Discovery (ND), Secure Neighbor Discovery (SEND) [RFC3971], privacy extensions [RFC4941], parts of Mobile IPv6 [RFC4866], Protocol Independent Multicast - Sparse Mode (PIM-SM) with Embedded-RP [RFC3956], and Site Multihoming by IPv6 Intermediation (SHIM6) [SHIM6], among others.
A number of other features currently in development, or being proposed, also rely on /64 subnet prefixes.

And Wikipedia on the address formats:

  • The network prefix (the routing prefix combined with the subnet id) is contained in the most significant 64 bits of the address.
  • The size of the routing prefix may vary; a larger prefix size means a smaller subnet id size.
  • The bits of the subnet id field are available to the network administrator to define subnets within the given network.
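Putting those two quotes together, my understanding of the intended layout is roughly (a sketch, reusing the ULA prefix from the config examples in this comment):

    |<------- 64-bit network prefix ------->|<----- 64-bit interface ID ----->|
    |  routing prefix (/48)  |  subnet ID   |                                 |
       fd00:face:feed        :  cafe        :  0000:0000:0000:0001
    => address fd00:face:feed:cafe::1 inside subnet fd00:face:feed:cafe::/64
    => a /48 allocation gives 2^16 = 65536 /64 subnets, each with 2^64 addresses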

Cloud vendors providing instances with prefix lengths of /80, /96 or narrower

There is this Google blog post about using a ULA /48 prefix per VPC, which allows for 2^16 subnets of /64. They then depict each /64 subnet providing ~4 billion /96 ranges for VM interfaces, each with ~4 billion IPv6 addresses available:

[Image: Google IPv6 VPC assignment breakdown]

It's still a /64 subnet, while the /96 is bridged with DHCPv6?:

When you enable IPv6 on a VM, the VM is assigned a /96 range from the subnet that it is connected to.
The first IP address in that range is assigned to the primary interface using DHCPv6.

You don't configure whether a VM gets internal or external IPv6 addresses.
The VM inherits the IPv6 access type from the subnet that it is connected to.

If you use Docker on one of those VMs, then perhaps the networks it creates become "subnets" smaller than /96, but from what I've read the IPv6 issues with subnets smaller than /64 aren't as applicable when DHCPv6 is used and you have something like Docker managing its networks / IP assignments?

AWS offers something similar, with a /80 prefix per interface instead of /96.

I'm not too familiar with the network jargon here, but:

  • Is there a distinction between the /64 subnet itself vs the VM virtual interfaces that are only assigned a portion of the IP addresses from that /64 to manage? (Is that portion still a subnet?)
  • With Docker, if you create networks of /80, docker network inspect will describe them as /80 subnets; I assume that's similar to the VM interfaces? (Or does their usage of DHCPv6 change that?)
  • Are features like SLAAC broken due to the /80 or /96, or are Amazon / Google avoiding the issue here since they're in full control of the /48 issued to a VPC and its /64 subnets, each subdivided into /80 or /96? I'm curious how this compares to a VPS with a single /64, or a ULA /48 for a homelab (since GCP is using ULA for VPCs).
  • NDP using Router Advertisements has been mentioned on this topic. Are these interfaces with /80 and /96 prefixes likely to rely on that? The old Docker docs seemed to indicate something similar with NDP proxying to communicate outside of the /64 subnet?

Is a /64 subnet split into longer prefix nested subnets?

The way it is done in both Swarm & Kubernetes (all the "big" L3 CNI-providers like Calico, Cilium, Static-routing, Kube-router etc.) is that part of a /64-subnet is split up in sizes between /80 to /122 and allocated to each node and not bad practice in any way AFAIK.

  • Splitting into /80 or /122 is not considered creating smaller individual subnets? These slices of IP addresses are still part of the same /64 subnet?
  • If they're not subnets, what do you call them? I know the quote uses a CIDR prefix, but I'm not sure what an appropriate description is to differentiate them from the /64 (if that remains the actual subnet).

In /etc/docker/daemon.json, there is a default-address-pools array of objects, each with a base (an IPv4/IPv6 CIDR block) and a size (the longer prefix length that block is split into for subnets):

{
  "ipv6": true,
  "fixed-cidr-v6": "fd00:face:feed::/80",
  "ip6tables": true,
  "experimental": true,
  "default-address-pools": [
    {
      "base": "172.17.0.0/12",
      "size": 16
    },
    { 
      "base": "fd00:face:feed:cafe::/64",
      "size": 80
    }
  ]
}

The Docker default bridge docker0 would have a single /80 subnet from fd00:face:feed::/48, and any new networks created with IPv6 will take from the fd00:face:feed:cafe::/64 pool, splitting it into /80 subnets (at least that is what docker network inspect shows), unless an explicit subnet is provided with --subnet.
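With that config, creating a new IPv6-enabled network without an explicit subnet and then inspecting it should show a /80 carved from that pool (a sketch of what I'd expect based on the behaviour described above):

    docker network create --ipv6 test-net
    docker network inspect test-net
    # expect the IPAM config to contain something like "Subnet": "fd00:face:feed:cafe::/80"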

docker network create does have an --ip-range option too, but a subnet can only be used by a single network; it's not possible to create multiple networks that each use --ip-range to divvy up the same /64 into /80 slices.
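If you do want multiple /80 slices of the same /64 as separate networks, the only way I'm aware of is passing each slice explicitly (untested sketch; the ULA prefix is a placeholder chosen not to overlap the pools above):

    docker network create --ipv6 --subnet fd00:face:feed:beef:1::/80 net-a
    docker network create --ipv6 --subnet fd00:face:feed:beef:2::/80 net-b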

The old Docker docs do refer to subnets smaller than /64 however, and make a distinction between Docker subnets and the router subnet, as well as a /48 network. From Wikipedia and the Google GCP docs, it seems a subnet is always of size /64 and is part of the network prefix.
