Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-3070: Reserve Service IP Ranges For Dynamic and Static IP Allocation #3071

Merged
merged 1 commit into from
Jan 10, 2022

Conversation

aojea
Copy link
Member

@aojea aojea commented Dec 2, 2021

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 2, 2021
@aojea
Copy link
Member Author

aojea commented Dec 2, 2021

/assign @thockin @danwinship
/sig network

@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Dec 2, 2021
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Dec 2, 2021
keps/sig-network/3070-reserved-service-ip-range/README.md Outdated Show resolved Hide resolved
keps/sig-network/3070-reserved-service-ip-range/README.md Outdated Show resolved Hide resolved
Comment on lines 314 to 315
The proposal is to implement a new strategy on the Service IP allocation bitmap that avoids to use the first 32 values
if there are free values on the upper block of the range.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems weird. I would expect that if there was a reserved range, that it would be reserved unconditionally, not "reserved until we run out of unreserved addresses".

Also, it's weird to hardcode the reserved range to 32 addresses regardless of the size of the service CIDR...

Copy link
Member Author

@aojea aojea Dec 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems weird. I would expect that if there was a reserved range, that it would be reserved unconditionally, not "reserved until we run out of unreserved addresses".

this keeps the behavior backwards compatible, you can create services as long as there are free IPs, if we keep the lower range reserved only for static IPs we are changing the behavior

Also, it's weird to hardcode the reserved range to 32 addresses regardless of the size of the service CIDR...

@thockin I knew we should have a history about why 32 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh. We went back and forth on this a lot in discussing.

If your service CIDR is /27, you only get 32 service IPs (IIRC, it's 30 actually, we avoid the first and last). At that point we can debate whether it's OK to have so few and if, at that point, we can make many guarantees.

If your CIDR is bigger than that, do we need to reserve more than 32 IPs? If so, do we need a "cascade" of blocks?

e.g. something like min(32, cidrSize/2) is attractive, but only useful if we think there are a lot of users with really small ranges.

Something more dynamic like "always prefer the upper half unless it is full" is more general, I guess. Is it worth the extra effort?

E.g. assume a /22 service range (1024 IPs). First look at [512]-[1023]. If that's full, look at [256-511]. If that's full, look at [128-255]. It seems like that gives the caller influence where it isn't needed, which could be a vector for some sort of attack.

min(max(16, cidrSize/16), 256) ? At /28 it would be 16 (full range). At /27 - /24 it would be 16 (part of the range). At /23 it would be 32. /22 -> 64. /21 ->128. /20 and bigger -> 256.

Again, worth the effort?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about making it a fixed percentage? we subdivide in two ranges,






           ┌─────────────┬─────────────────────────────────────────────┐
           │             │                                             │
           │             │                                             │
           │             │                                             │
           │   static    │                    dynamic                  │
           │             │                                             │
           │             │                                             │
           │             │                                             │
           └─────────────┴─────────────────────────────────────────────┘

           ◄────────────► ◄────────────────────────────────────────────►

                15%                              85%

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"15%" must be a typo, since it is not a power of two. I think you meant to type "12.5%" :)

Seriously though, I think percents are OK for humans to get a big-picture, but harder to be precise about.

min(max($min, cidrSize/$step), $max) is easy to describe as "never less than $min or more than $max, with a graduated step function between them. We can argue about values for those variables (or even tweak them over time if we get feedback), but the logic still seems appropriate?

I proposed:

$min = 16
$max = 256
$step = 16

But I am fine to haggle :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with anything meanwhile users can understand it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated with the suggested formula

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FTR I meant that 32 seemed like too many reserved IPs for small service CIDRs, not that I thought we should reserve more for large CIDRs.

But I guess it's reasonable to assume that users with larger clusters are likely to have more use cases for fixed-IP services, so I guess giving them a larger range makes sense. And if we want the reserved size to (a) always be a power of 2, and (b) always include the .10 address, then I guess 16 has to be the minimum value. And given that the "reserved" range is more like a "preferred" range, then even 16 isn't too large for a /27

keps/sig-network/3070-reserved-service-ip-range/README.md Outdated Show resolved Hide resolved
keps/sig-network/3070-reserved-service-ip-range/README.md Outdated Show resolved Hide resolved
keps/sig-network/3070-reserved-service-ip-range/README.md Outdated Show resolved Hide resolved
keps/sig-network/3070-reserved-service-ip-range/README.md Outdated Show resolved Hide resolved
Comment on lines 314 to 315
The proposal is to implement a new strategy on the Service IP allocation bitmap that avoids to use the first 32 values
if there are free values on the upper block of the range.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh. We went back and forth on this a lot in discussing.

If your service CIDR is /27, you only get 32 service IPs (IIRC, it's 30 actually, we avoid the first and last). At that point we can debate whether it's OK to have so few and if, at that point, we can make many guarantees.

If your CIDR is bigger than that, do we need to reserve more than 32 IPs? If so, do we need a "cascade" of blocks?

e.g. something like min(32, cidrSize/2) is attractive, but only useful if we think there are a lot of users with really small ranges.

Something more dynamic like "always prefer the upper half unless it is full" is more general, I guess. Is it worth the extra effort?

E.g. assume a /22 service range (1024 IPs). First look at [512]-[1023]. If that's full, look at [256-511]. If that's full, look at [128-255]. It seems like that gives the caller influence where it isn't needed, which could be a vector for some sort of attack.

min(max(16, cidrSize/16), 256) ? At /28 it would be 16 (full range). At /27 - /24 it would be 16 (part of the range). At /23 it would be 32. /22 -> 64. /21 ->128. /20 and bigger -> 256.

Again, worth the effort?

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 21, 2021
@aojea aojea force-pushed the reserved_serviceip_allocation branch from 48bffff to 6debbd4 Compare December 21, 2021 12:51
@thockin
Copy link
Member

thockin commented Dec 21, 2021

After discussion a few hours ago - should we cover node ports in the same way?

@aojea
Copy link
Member Author

aojea commented Dec 21, 2021

After discussion a few hours ago - should we cover node ports in the same way?

implementation wise is simpler if we want to use the same reserved ranges sizes for each

@aojea
Copy link
Member Author

aojea commented Dec 22, 2021

however, the maths we are doing here doesn't apply, we are always assuming binary arithmetic for the range sizes, not a big deal if we assume this will always remain as is.
Now that I write it, I have the feeling that for the future it is better if we decouple NodePorts from ClusterIPs and break this dependency now that we can.

@aojea aojea force-pushed the reserved_serviceip_allocation branch 2 times, most recently from 9009c87 to 4c5ea45 Compare December 22, 2021 09:24
@thockin
Copy link
Member

thockin commented Dec 22, 2021

OK. In the future we can do a KEP for NodePorts, if needed. I bet the same formula applies, just with different coefficients.

This LGTM - @danwinship do you want a word on it? I'll approve but hold for now.

/approve
/lgtm
/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Dec 22, 2021
@thockin
Copy link
Member

thockin commented Dec 22, 2021

oh, needs PRR anyway. I did not read for that. Will do


#### GA

- 2 examples of real-world usage
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean? That 2 providers have switched to this mode as the default? That's unlikely while it is beta. Let's get specific here. I think time and lack of trouble is the only real signal.

When do we make this mode the default? Is there a reason not to do that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is interest in enabling it in Openshift by default to solve some races at installation time, I was expecting more installers will face similar issue, but as you say this is just more a matter of time and signal.

keps/sig-network/3070-reserved-service-ip-range/README.md Outdated Show resolved Hide resolved

###### What specific metrics should inform a rollback?

The allocation logic already has the following metrics:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want new metrics to indicate "forced to use reserved range" ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to add new metrics but its complicated, since this behavior is deep in the bitmap allocation and the metrics are one layer above on the ip allocator .... inheritance problems 🤷

I'll add a comment explaining it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to add new metrics but its complicated, since this behavior is deep in the bitmap allocation and the metrics are one layer above on the ip allocator .... inheritance problems

Hmm - can't we report them somewhat post-factum (when we're returning the actual result we already know what IP we allocated and if it was statically requested or not)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean something like randomly_allocated_ips gauge and randomly_allocation_total counter ?
that is easy to do, will that work for you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean something like randomly_allocated_ips gauge and randomly_allocation_total counter ?
that is easy to do, will that work for you?

Can you clarify? I think these can be interpreted differently.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have these metrics now

  - allocated_ips
  - available_ips
  - allocation_total
  - allocation_errors_total

we can add two new ones

randomly_allocated_ips
randomly_allocation_total

randomly_allocated_ips reports the number of IPs in the allocator that are allocated "randomly" i.e. with ClusterIP empty, it reports current value.
randomly_allocation_total is a counter, every time that you allocate a ClusterIP randomly it increases, the user can play with the current metrics and these new ones to obtain the IPs allocated statically

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifiction.

Yes - those metric will be very useful.

What I had on my mind (please let me know if this doesn't make any sense because of some reason) is actually an extension of that by:

  • adding a label to that metric (like "from-static" or sth) to show if those random IPs were allocated from static range

Does that make sense?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that is much simpler and easier, I will update the kep with that

keps/sig-network/3070-reserved-service-ip-range/kep.yaml Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 23, 2021
Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@wojtek-t wojtek-t self-assigned this Jan 4, 2022
Copy link
Contributor

@danwinship danwinship left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm except for the "reserved block" comment

I'd like to avoid using "Reserved" in the KEP title since neither range is actually reserved for a single use, but I can't think of any way to describe the behavior well in few enough words...


There are situation that users need to rely on well-known predefined IP addresses.

The best example is the Kubernetes DNS, the IP is hardcoded in the kubelet flag `--cluster-dns`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Well, no, the best example is the apiserver...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to race with the apiserver .1 address?
The loop to create the kubernetes.default runs at the beginning, I honestly don't know if at that point clients are able to create services, but is worth adding it

Comment on lines 267 to 269
// forward looking for the next available address. The big.Int range is subdivided so it will try
// to allocate first from the reserved block of addresses (it will wrap the reserved subrange if necessary).
// If there is no free address it will try to allocate one from the reserved block too.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"to allocate first from the unreserved block", right? (You're referring to both ranges as "the reserved block" here.)

and "(it will wrap around back to the start of the unreserved subrange if necessary)"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will word it as reserved for static and reserved for dynamic allocation, it sounds less ambiguous

keps/sig-network/3070-reserved-service-ip-range/README.md Outdated Show resolved Hide resolved
@thockin thockin added this to the v1.24 milestone Jan 6, 2022
@aojea aojea changed the title KEP-3070: Reserve Service IP Range For Dynamic and Static IP Allocation KEP-3070: Reserve Service IP Ranges For Dynamic and Static IP Allocation Jan 7, 2022
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 7, 2022
@aojea
Copy link
Member Author

aojea commented Jan 9, 2022

/hold cancel

comments addressed, ready for final review

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 9, 2022
Comment on lines 368 to 369
- allocated_ips
- available_ips
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wojtek-t these 2 metrics can not track the information of the type of allocation, because when an ip is released we don't have that information.
however, the counters can have it, because are only incremented

  - allocation_total
  - allocation_errors_total

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair - that sounds good enough to me.

Can you clarify that here?

Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple nits, but I'm overall fine with it. Will approve once addressed.

keps/sig-network/3070-reserved-service-ip-range/README.md Outdated Show resolved Hide resolved
keps/sig-network/3070-reserved-service-ip-range/README.md Outdated Show resolved Hide resolved
keps/sig-network/3070-reserved-service-ip-range/README.md Outdated Show resolved Hide resolved
Comment on lines 368 to 369
- allocated_ips
- available_ips
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair - that sounds good enough to me.

Can you clarify that here?

@wojtek-t
Copy link
Member

@aojea - can you please squash the commits?

@aojea aojea force-pushed the reserved_serviceip_allocation branch from ea0e123 to fa2606a Compare January 10, 2022 11:55
@wojtek-t
Copy link
Member

This is more than enough for Alpha from PRR perspective - thanks!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 10, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, thockin, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 10, 2022
@k8s-ci-robot k8s-ci-robot merged commit 6b90035 into kubernetes:master Jan 10, 2022
@wojtek-t
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/network Categorizes an issue or PR as relevant to SIG Network. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants