[WIP] KEP Services ClusterIP Range and Allocation API #1881
Conversation
Ok, let's see if this time we can remove this limitation 😄 💪 ping @Queuecumber @uablrek @aramase @neolit123 I have one working implementation with the approach suggested in the KEP, so you can see it's not a big change. /assign @danwinship @dcbw @thockin @lavalamp @sftim
So just to give a one sentence summary to make sure I understand the solution, the idea is to allow the user to provide a /64 but only allocate out of the lowest /112 bits right?
<<[/UNRESOLVED]>>
- If the CIDR is larger than that fixed size, set bit (IP number) % (bit array size).
This feels like a lot of work to introduce a new failure mode. What's really motivating this change? This sort of behavior fails the "least surprise" test, IMO.
If there's REALLY a need to be able to allocate from anywhere in the whole /64 space, maybe we should focus on a better representation than a dumb bitmap?
This is a spinoff of kubernetes/kubernetes#90115 but I keep being unhappy with every alternative we come up with.
The motivation is that forcing people to use a non-/64 subnet length just feels wrong. It says "Yes, fellow "hackers", we support the Internet Protocol Version Six for sending your Transmission Control Protocol packages over". It's not entirely wrong, but it gives off strong "we don't really understand what we're talking about" vibes.
So the goal is to get rid of that limitation, without causing any other problems, but the "not causing any other problems" part keeps screwing things up.
(I am not sure I like this approach better than any of the previous suggestions.)
This feels like a lot of work to introduce a new failure mode. What's really motivating this change? This sort of behavior fails the "least surprise" test, IMO.
in my personal opinion /64 should be the only allowed prefix for the service subnet, there is a lot of literature about that, but that itself would fail the "least surprise" test, so I think that at the very least we should allow users to use /64 for the service subnet
If there's REALLY a need to be able to allocate from anywhere in the whole /64 space,
The IPv6 protocol was designed to be autoconfiguring; personally, I don't think there should be a need to assign addresses manually from the whole range. However, this really breaks the current behavior, where users can assign from the whole range.
maybe we should focus on a better representation than a dumb bitmap?
I've tried that with 2 different data structures, added in the alternatives section. I did it with roaring bitmaps and a map[int64]struct{}; it seems too disruptive and too much effort for something that, as Dan said:
So the goal is to get rid of that limitation, without causing any other problems, but the "not causing any other problems" > part keeps screwing things up.
The first proposal is to allow /64 subnets and to allow allocating from the whole range, with the caveat that there can be duplicates because we are only able to track 2^16 addresses. Random addresses will be allocated from the lowest /112 bits.
that was proposed by Dan Winship and the PR got some pushback, so I added it here in the alternatives section, but it is definitely the least disruptive and most straightforward solution
I find the failure mode of this modulus to be very surprising. """ I
asked for X and it failed because X is in use. I looked, but nothing
was using it, so I tried again. It failed the same way. """
Humans are not good at modulus math.
…On Wed, Jul 1, 2020 at 12:53 AM Antonio Ojea ***@***.***> wrote:
So just to give a one sentence summary to make sure I understand the solution, the idea is to allow the user to provide a /64 but only allocate out of the lowest /112 bits right?
The first proposal is to allow /64 subnets and to allow allocating from the whole range, with the caveat that there can be duplicates because we are only able to track 2^16 addresses. Random addresses will be allocated from the lowest /112 bits.
updated the KEP with Dan Winship's proposal; only 2 points to clarify and we should be good to go
The solution proposed consists of:

Relax only the `kube-apiserver` validation without any further changes in the ipallocator logic, so the user can provide a /64 prefix for IPv6 but only allocate out of the lowest 2^16 addresses.
Back-of-envelope math. If the user specifies a large CIDR, we could allow a larger subset (e.g. 2^22 bits, 512KiB), but that's pretty insignificant compared to 2^64.
If we thought that reserving specific IPs was important, we would change the structure to allow arbitrary specific values plus a bitmap of allocated values. E.g. 2^16 bits (8KiB) of allocation + list of up to 10K offsets from the CIDR (8 bytes each, 80 KiB). But as you say, that would require retooling the structure that is stored and careful rollout.
If we did that, we could also move to a model that stored multiple chunks of data and even allowed multiple CIDRs. It's a lot of work and not very high demand. So I guess I am OK with this proposal, and if we decide later to allow discrete allocations, we can work on converting the structures.
When you say lowest bits, I assume you mean leading 0s? Maybe an example.
When you say lowest bits, I assume you mean leading 0s? Maybe an example.
heh, didn't know asciiflow worked for markdown.
What about something like this?
```
+----------------------------------------+
| 2001:db8:1234:5678:0000:0000:0000:0000 |
+----------------------------------------+
 \__________________/               \___/
      /64 subnet             /112 ipallocator
        prefix                    (16 bits)
```

- subnet range:
  2001:db8:1234:5678:0000:0000:0000:0000 - 2001:db8:1234:5678:ffff:ffff:ffff:ffff
- kubernetes allocator range:
  2001:db8:1234:5678:0000:0000:0000:0000 - 2001:db8:1234:5678:0000:0000:0000:ffff
if we decide later to allow discrete allocations, we can work on converting the structures.
yeah, nowadays /64 is a fair request based on real facts and a few users are demanding it.
OTOH, discrete allocation in the whole range is a guess or a nice-to-have at this moment ...
Just want to bring up once again, I know it's not a popular suggestion, but this is part of the reason SLAAC and its associated duplicate address detection were invented for "traditional" networks. Keeping track of 2^64 addresses is no fun, so what is the argument against randomly allocating out of the entire /64, allowing manual specification, and actually trying to reach the addresses to see if they're in use, giving an error if something responds?
These are service IPs - they might be allocated and have no living backends, and the cost of being wrong would mean that we have to retool a lot of Kubernetes flows.

Another possible path would be to not store the allocated addresses, but rely on optimistic concurrency. Each apiserver builds an in-memory structure representing all allocated addresses. Incoming requests pick a random value and ensure that the in-memory representation says it is free. Then they make a mutation to a generation counter if-and-only-if they have the correct previous generation. In case of conflict, they have to refresh their in-memory structure and do it all over again.
There are a lot of corner cases - is that really worthwhile?
…On Thu, Jul 2, 2020 at 1:44 PM Max Ehrlich ***@***.***> wrote:
Just want to bring up once again, I know it's not a popular suggestion, but this is part of the reason SLAAC and associated duplicate address detection was invented for "traditional" networks. Keeping track of 2^64 addresses is no fun, what is the argument against randomly allocating out of the entire /64, allowing manual specification, and actually trying to reach the addresses to see if they're in use, giving an error if something responds?
Couldn't that part be solved by delegating a smaller prefix to each apiserver at cluster creation time? E.g. instead of having only a /112 that you allocate out of and store in etcd, you have 2001:db8:baad:f00d::/64 and the first apiserver gets 2001:db8:baad:f00d:1::/80, the second gets 2001:db8:baad:f00d:2::/80, something like that. You only need to detect duplicates if a user manually sets an IP, and for that you would need to do DAD, but the issue is that services might be allocated an IP and not respond to it, correct? Also, did anyone ever look into how Calico does this for pods?
It's really nice having feedback about this, but this is not really a networking problem. ClusterIPs are virtual and have to be "present" on all nodes; the IPs are not real, so DAD is not an option. As Tim explained, there are solutions to the problem using different data structures, ..., but the cost of solving this problem (which we don't really know is causing issues; everything here is theoretical) is very high right now. Bear in mind that Kubernetes has to allow upgrades and to run components with skewed versions.
Yes. Don't think of this as being about IP addresses. It's about assigning IDs to objects.
Assigning each apiserver a disjoint range from the larger set solves this problem for random assignment; this is how some CNI plugins handle pod IPs. How does the kube-proxy running on each node know where to route requests for the service IPs? Is it asking the apiserver for that information?
Of course, this needs to be fully thought out before anything is committed to which is why we're having a discussion about it.
IP addresses are (usually unique) IDs that are assigned to (usually) NICs (which are objects). I've been trying to pick through Calico to see how they handle this, since they must have run into the same problem, although it's possible they aren't storing it in etcd.
This was the main pain point I hoped to address. We have seen at least one case of a cluster outage because one of the masters could never catch up with this massively contended resource.
Yeah, in that case the steps you gave sound right, with one additional detail: it has to handle the case where it creates the IP object and then crashes or whatever. To handle that, the IP object needs to state which service it's created for, and give a TTL (or a creation time), so that an external observer can tell the difference between a failure and the service just not having been created yet.
We have to keep 1 ServiceIPRange (2 in dual-stack) immutable to maintain backward compatibility with the current flags, right?
Guys, I NEED to chime in here, coming from maintaining A LOT of IPv6 only clusters.
That said, the 16-bit limitation feels somewhat small (max of 65k service IPs per cluster), but that has nothing to do with IPv6 in the regard of "maybe that's a reasonable number of service IP limitation". If it were 32 bits ("almost inf for almost every use case") it would certainly look more useful. Also, if you allow using 32 bits, one can potentially map IPv4 addresses nicely to IPv6 service IPs using NAT64:
So if you want an easy solution, I recommend implementing it the following way:
Problem solved, nice and easy. And usable for the long-term future when k8s clusters stay IPv6 only, but still accept traffic from the IPv4 world. In case you have any doubt about anything above, feel free to discuss on https://IPv6.chat.
## Drawbacks

Compared to a bitmap, this design takes more space per reservation.
The real drawback is that the number of etcd objects required for this scales at O(N) where N is the number of services. That seems reasonable (the largest number of services per cluster I have seen peaks at about 15-30k). etcd memory use is proportional to number of keys, but since pod counts tend to dominate service counts I think this is a reasonable tradeoff. We still budget 1-2 million keys as a reasonable upper bound for number of etcd keys in a cluster, and this doesn't make that substantially worse (<10% impact). The objects described here are smaller than most other Kube objects (especially pods), so again the net effect on etcd storage size is still reasonable.
I'll echo Tim's last comment - if you're seeking to prevent duplicate allocation, a single etcd key per IP is the simplest approach for O(10000) services. If you're trying to track ranges, I agree that is a much more complex problem (but I don't see the justification clearly laid out in the KEP for that; maybe I missed it) that requires a more complex storage+admission approach. While I don't particularly think this is the right place to do it, it might be possible to leverage an etcd multi-key transaction to create both together, and to put in the minimal machinery to enable that in the etcd storage interface. However, I'd rather not do it without a second concrete use case, which we don't really have today in core (lots of third-party things for CRDs might benefit from it). Then you'd create both together or fail to create both together.
We do not describe how users can change a cluster from existing configuration-based settings to API-based settings (and fall back), which will impact feature progression through alpha/beta/stable, since during alpha/beta the feature can be turned on/off.
#### Story 3

As a Kubernetes user I want to reserve specific IP addresses from the Service subnet.
What does this mean? Reserved for what?
you can say, "I don't want the IP 10.96.0.100 to be randomly allocated"; right now every IP can be allocated randomly
This could be clearer - is the goal to reserve it for future use or to block it from use at all? Why is "create a service with no endpoints" not an answer?
#### Story 4

As a Kubernetes user I want to dynamically change the node-port range so I can allocate more node-ports.
I really, REALLY think we should keep the focus on service CIDRs. And then using that learning to cover node ports.
yeah
### Proposed Allocation model

https://github.com/kubernetes/enhancements/pull/1881#issuecomment-757003362
Can we expand it here?
Since the Range objects will be created by the bootstrap logic of the apiserver, the first apiserver to start will define the Range for the whole cluster.
The Range objects can not be deleted, but they can be updated as long as the allocated objects referencing them remain within the new range. For example, in a cluster with a Service range of 10.0.0.0/24, it will always be possible to update to a larger subnet such as 10.0.0.0/16. However, it will not be possible to shrink to a smaller subnet, e.g. 10.0.0.0/28, if there are IPs allocated outside that range, e.g. 10.0.0.244.

<<[/UNRESOLVED]>>
The way it is written, it does not help people who want to move away from an existing service CIDR. Think of folks who are restructuring their network address spaces. We should be able to replace the repair loop with something that detects a situation where ClusterIPs are no longer in range. Maybe with some flag on ranges that says drain/retire.
To totally change a range, we probably need a notion of multiple IPs on the service, and even then it's a perilous operation (e.g. DNS IPs are specified literally to kubelet).
there is no transition, it has to be the same for users
I think this can be cut, cut, cut. Simpler. Less is more.
Are we going to do the "reserved lower-half" on each range? Just some? A flag per range?
Do we need to define overlap semantics for this? E.g. I define 10.0.0.0/20 as "available for any use" and "10.0.0.0/25" as "for explicit use only" and define that the flag has precedence?
Seems like some opportunities for extending metadata over time.
The name of the IPAddress has to be unique, so we can use it to provide consistency and avoid duplicate allocations.
However, Kubernetes object names don't allow `:`, which IPv6 addresses use. In addition, [IPv6 addresses
Name format is up to each resource to decide.
```go
// IPRange defines a range of IPs using CIDR format (192.168.0.0/24 or 2001:db2::0/64).
type IPRange struct {
	// (fields elided in this diff hunk)
}
```
Need to talk about overlaps - are they allowed and what do they mean? As in my notes above, I think overlap is OK and just gets merged by anyone using this API
we can validate overlaps if we use status on the IPRange:
- Create IPRange (simple validation) -> status "notReady"
- Controller sees the new IPRange, performs validation against all the IPRanges
  - No overlapping: set status to Ready, it can start allocating
  - Overlapping: keep status as notReady or invalid, and add a condition message
### Upgrade / Downgrade Strategy

The service-subnet and node-port range will keep using the flags for the initial configuration, it can
We need more details here. In what cases will the flags be used vs range objects? Is it the case that as soon as one range exists (for a given family) then we start using that? Is that only evaluated at apiserver startup?
Any update on the allocated IPRange object will not be allowed if there are apiservers without the same version.

QUESTION: Is this even possible?
this needs to be worked through, for sure. Not sure it matters, as long as the conversion uses the same values...
I have to retake my notes and reread all the comments ... I will update the KEP once I'm able to remember 🙃
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
I have to retake this
following up on #3365. This issue will serve as a reference, but in its current state it is impossible to keep track of anything, so I'll close it.
This KEP solves current Kubernetes Services issues and constraints caused by the current allocation model,
offering a public API for Service resource allocation.
xref kubernetes/kubernetes#104088