
KEP: EndpointSlice API #1086

Merged
merged 1 commit into kubernetes:master on Jul 26, 2019
Conversation

@freehan
Contributor

freehan commented Jun 4, 2019

The proposal doc has been circulating for quite a while. https://docs.google.com/document/d/1sLJfolOeEVzK5oOviRmtHOHmke8qtteljQPaDUEukxY/edit

Formally converting the proposal into a KEP

@k8s-ci-robot
Contributor

Welcome @freehan!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/network Categorizes an issue or PR as relevant to SIG Network. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 4, 2019
@k8s-ci-robot k8s-ci-robot requested review from dcbw and thockin June 4, 2019 17:02
@freehan
Contributor Author

freehan commented Jun 4, 2019

/assign thockin

@freehan
Contributor Author

freehan commented Jun 5, 2019

cc: @wojtek-t @bowei @johnbelamaric

type EndpointSlice struct {
metav1.TypeMeta
metav1.ObjectMeta
Spec EndpointSliceSpec
Member

no status?

Contributor Author

No status for now. There is no consumer/provider for an EndpointSlice status, unless we want to allow every kube-proxy to feed back into it.

Contributor

I feel like it's more "all status" than it is "all spec"...

Contributor Author

On second look, it actually feels like "all status". But I guess it depends on the perspective:

  1. From the Service's point of view, the endpoints are like its status.
  2. From the consumer's (kube-proxy) point of view, the endpoints are like a spec it needs to implement. An EndpointSlice status would then be more of a placeholder for consumer feedback (similar to how kubelet implements PodSpec and feeds back PodStatus).

Contributor Author

FYI @apelisse
We discussed this in sig-network and decided to keep it dumb by making it an object without spec and status, focusing only on carrying data.

}

type EndpointPort struct {
// Optional
Member

Should this be:

optional if there is only one Port.

Name string
// Required
Protocol Protocol
// Optional: If unspecified, port remapping is not implemented
Member

what does this mean? I see that this is a required field in the current EndpointPort

Contributor Author

will add a section describing this.

Ports []EndpointPort
}

type EndpointPort struct {
Member

isn't EndpointPort name already taken?

Member

also, should we make sure the names are less clashing with v1.Endpoint structs...

Contributor Author

I thought we were moving to the networking API group. Does it matter?

}

type Endpoint struct {
// Required: must contain at least one IP.
Member

Probably want to copy over the comment from the dual stack PR about the semantics of multiple IPs?

// Required: must contain at least one IP.
IPs []net.IP
// Optional
Hostname string
Member

If Hostname is optional, why is it not a pointer field?

Contributor Author

This derives from the hostname field in the pod spec. That field is optional but also not a pointer. Not sure of the reason at the time.

Member

It should be a pointer. Pod should probably be changed.

Contributor Author

Okay. Will change it to a pointer.

Member

If this is included in backends, why would it appear separately here?

TargetRef *ObjectReference
}

type EndpointCondition struct {
Member

EndpointConditions?

}

```

Member

Might be worthwhile to give an example section with the YAML output etc

Contributor Author

Ack will add
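For illustration, here is a minimal, self-contained Go sketch of the kind of example that section might show. The types below are simplified stand-ins for the draft API in this KEP (field names and values are invented for the example), not the real Kubernetes packages.

```
// Illustrative only: simplified mirrors of the draft types in this KEP,
// populated with invented values for a hypothetical Service "web".
package main

import "fmt"

type EndpointPort struct {
	Name     string
	Protocol string
	Port     int32
}

type Endpoint struct {
	IPs      []string
	Hostname string
	Ready    bool
}

type EndpointSlice struct {
	Name      string
	Ports     []EndpointPort
	Endpoints []Endpoint
}

func main() {
	slice := EndpointSlice{
		Name: "web-slice-1", // hypothetical generated name
		Ports: []EndpointPort{
			{Name: "http", Protocol: "TCP", Port: 8080},
		},
		Endpoints: []Endpoint{
			{IPs: []string{"10.0.1.5"}, Hostname: "web-0", Ready: true},
			{IPs: []string{"10.0.2.7"}, Hostname: "web-1", Ready: true},
		},
	}
	// Every endpoint in the slice is assumed to expose every port in Ports.
	fmt.Printf("%+v\n", slice)
}
```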

Member
@thockin left a comment

Let's talk about topology and Services.

When we talked about this before, we sort of ruled out having topology information in Endpoints because of bloat. If we are really trying to make EndpointSlice more general-purpose, I think we need to reconsider this. I'd love for systems like Istio to be able to glean topology information from kubernetes and "go watch all nodes" seems unfortunate. For other arbitrary consumers of endpoints it seems even worse.

So what would change if we wanted to include topology information (basically the labels of the node a pod is running on) inside EndpointSlice somewhere?

Would we encode it as a topology per slice? That stinks because of hostname. Maybe we special-case hostname? Any other unique-per-node label would kill us, too.

Would we encode it per Endpoint[] element? It would be bloaty and somewhat redundant, but correct.

Would we pre-cook node labels into a new resource (in the same apigroup as EndpointSlice) and simply reference that from each Endpoint[] ?

@wojtek-t @johnbelamaric @m1093782566
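For concreteness, a minimal sketch of the second option (topology encoded per Endpoint[] element), assuming topology is simply a copy of selected node labels; all type names and labels below are illustrative, not a decided design.

```
// Sketch of "topology per Endpoint[] element": each endpoint carries a copy of
// node labels, redundant across endpoints on the same node but self-contained
// for consumers. Names here are illustrative only.
package main

import "fmt"

type Endpoint struct {
	Addresses []string
	// Topology holds labels of the node the backend runs on.
	Topology map[string]string
}

func main() {
	ep := Endpoint{
		Addresses: []string{"10.0.1.5"},
		Topology: map[string]string{
			"kubernetes.io/hostname":      "node-a",     // unique per node (the bloaty case)
			"topology.kubernetes.io/zone": "us-east1-b", // shared by many nodes
		},
	}
	fmt.Println(ep)
}
```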

- Leave room for foreseeable extension:
- Support multiple IPs per pod
- More endpoint states than Ready/NotReady
- Dynamic endpoint subsetting
Member

Can we add a secondary goal of moving the API towards being more general-purpose and less locked into Kubernetes patterns?

Contributor Author

Ack


## Proposal

### EndpointSlice API
Member

Let's argue about names. Slice? Set? Something else more descriptive?

Contributor Author

Using Slice is mostly to avoid confusion with Subset.
Slice is sort of neutral to subsetting or overlapping.

Member

Slice implies a piece of a whole, rather than a whole (as in Set). It's not horrible, but I wanted an anchor point for any debate :)

Member

I'm curious how this specific set is selected? I'm not sure I understand what this means or how this works.

Member

It's just a chunk. A semi-random subset of a whole.

Member

You mentioned a cross-correlation between Endpoints and Ports. How are these separated in slices? I think the API is sort of broken if objects can't live independently. If we do accept that for trade-off reasons, then it needs to be very explicit. I like "chunk" more than "slice" because it conveys the fact that a chunk doesn't make sense on its own (all chunks are needed in order to use it).

Member

For a given slice the whole Ports[] block applies to all Endpoints[]. So pathologically, if every endpoint has a different port number, they would all get slices of length 1.
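To make that constraint concrete, here is a small, self-contained sketch (with an invented helper, not part of the proposal) showing that endpoints can only share a slice when they expose the same port set:

```
// Endpoints can only be packed into the same slice if they share a port set,
// because Ports applies to every entry in Endpoints. groupByPortSet is a
// hypothetical helper used only for illustration.
package main

import (
	"fmt"
	"sort"
	"strings"
)

type endpoint struct {
	IP    string
	Ports []int32
}

// portKey builds a canonical key for an endpoint's port set.
func portKey(ports []int32) string {
	sorted := append([]int32(nil), ports...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	parts := make([]string, len(sorted))
	for i, p := range sorted {
		parts[i] = fmt.Sprint(p)
	}
	return strings.Join(parts, ",")
}

func groupByPortSet(eps []endpoint) map[string][]endpoint {
	groups := map[string][]endpoint{}
	for _, ep := range eps {
		k := portKey(ep.Ports)
		groups[k] = append(groups[k], ep)
	}
	return groups
}

func main() {
	eps := []endpoint{
		{IP: "10.0.0.1", Ports: []int32{8080}},
		{IP: "10.0.0.2", Ports: []int32{8080}},
		{IP: "10.0.0.3", Ports: []int32{9090}}, // different port -> its own group/slice
	}
	for key, group := range groupByPortSet(eps) {
		fmt.Println("ports", key, "->", len(group), "endpoints")
	}
}
```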

Port *int32
}

type Endpoint struct {
Member

Comments. Something like:

Endpoint describes a single backend instance (e.g. Pod) of an abstract front-end (e.g. Service).

Contributor Author

Ack


type Endpoint struct {
// Required: must contain at least one IP.
IPs []net.IP
Member

I think we need to consider non-IP endpoints, if we want this API to be more general-purpose (and I think we do). Specifically, allowing hostnames here and maybe even URLs.

Granted, kube-proxy could not easily set up iptables or ipvs for k8s services that had hostnames as backends, but a) Services should have Endpoints generated by controllers and b) other consumers of endpoints could be smarter.

What if we named this "address" or "backend" or similar?

Member

Also, we don't generally use net.IP in API structs; use string.

Member

Also, we need to spec that all addresses here are assumed to be fungible.

Contributor Author

Ack. Changed it to Backends []string

For self managed EndpointSlice objects, this label is not required.

## Estimation
This section provides comparisons between Endpoints API and EndpointSlice API under 3 scenarios:
Member

I actually think we should have 2 tables - one for 20K endpoints (edge case) and one for 20 endpoints (average case).

We know that MOST services have less than 100 endpoints, so while 20K dominates the scalability considerations, <100 dominates the practicality considerations.

All this leads to answering the "why not 1 per slice?" question.

The point is to make it very clear.

Member

+1

Contributor Author

Added

On Service Create/Update/Delete:
- `syncService(svc)`

On Pod Create/Update/Delete:
Member

Should also describe the reconciliation (because of edge cases) - e.g. how do we detect and handle EndpointSlices whose Service has been removed (where we might not have been notified, e.g. we were down)? I suspect you'll need a label that just says "is managed by the service-endpoint controller" and then compare against known services and find leftovers. But be careful not to get this wrong and nuke real services!

Also, how do we handle labels changing?

Etc. - how do we make sure it all stays in sync?

Member

Can't we use OwnerReferences for this purpose?

  1. If we create an EndpointSlice based on the service, we add an owner reference, and when the service is deleted, GC will delete the EndpointSlice objects for us.
  2. If users create an EndpointSlice themselves, they don't set the owner reference, and then they fully own its lifecycle.

Contributor

Sounds feasible
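A minimal sketch of the owner-reference idea above, using the real metav1.OwnerReference type but with placeholder Service values; illustrative only, not the controller's actual implementation:

```
// Controller-managed EndpointSlices carry an OwnerReference to their Service so
// that garbage collection removes them when the Service is deleted; user-managed
// slices simply omit it. Service name/UID below are placeholders.
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

func ownerRefForService(name string, uid types.UID) metav1.OwnerReference {
	controller := true
	return metav1.OwnerReference{
		APIVersion: "v1",
		Kind:       "Service",
		Name:       name,
		UID:        uid,
		Controller: &controller, // marks the managing controller
	}
}

func main() {
	sliceMeta := metav1.ObjectMeta{
		Name:            "web-abc12", // hypothetical generated name
		OwnerReferences: []metav1.OwnerReference{ownerRefForService("web", "placeholder-uid")},
	}
	fmt.Println(sliceMeta.OwnerReferences[0].Kind, "owns", sliceMeta.Name)
}
```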

- If Pod needs to be removed:
- Remove the endpoint from EndpointSlice.
- If the # of endpoints in EndpointSlice is less than ¼ of threshold.
- Try to pack the endpoints into smaller number of EndpointSlice objects.
Member

Specifically, we should be really clear that we will not simply rewrite ALL of the EndpointSlices on a removal. The goal is stability, so maybe spell out a heuristic, e.g.:

if num-endpoints in slice S1 < min-endpoints (1/4 by flag) and num-slices > 1 {
    try to disperse endpoints from S1 into other slices
    if total number of writes needed > min(num-slices/4, 4), don't bother // all values pulled from thin air
}

You get the point - the idea is to balance compaction with not rewriting too many slices. I'd actually argue the values should be very lax to start, e.g. allow as few as 5% or 10% of a slice before forcing compaction, which obviously affects the other variables.

Also be careful not to let users get into unsolvable situations :) Not all of these variables are independent, I think.

Might be worth writing a simulation to see if there's a natural pattern - I feel like there is one, but I can't quite put my finger on it right now.
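A rough, self-contained Go sketch of that balance - compact a nearly-empty slice only while the number of slice writes stays small; every threshold below is a placeholder, exactly as in the comment above:

```
// shouldCompact reports whether a nearly-empty slice should be dispersed into
// the other slices. All thresholds are invented placeholders.
package main

import "fmt"

func shouldCompact(nEndpoints, maxPerSlice, numSlices, writesNeeded int) bool {
	minEndpoints := maxPerSlice / 10 // lax to start: only compact below ~10% full
	if nEndpoints >= minEndpoints || numSlices <= 1 {
		return false // slice is full enough, or it is the only slice
	}
	writeBudget := numSlices / 4
	if writeBudget > 4 {
		writeBudget = 4
	}
	return writesNeeded <= writeBudget // don't rewrite too many slices at once
}

func main() {
	fmt.Println(shouldCompact(3, 100, 20, 3)) // true: tiny slice, cheap to disperse
	fmt.Println(shouldCompact(3, 100, 20, 8)) // false: would rewrite too many slices
}
```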

Contributor Author

Will write down the detailed algorithm in this section. Added a requirements section above under Implementation.

Member

Consider also attempting to pack the noisiest pods together, so their updates can be combined.

Member

Consider also attempting to pack the noisiest pods together, so their updates can be combined.

That is an implementation detail and we can also change the packing policy later.

In order to ensure backward compatibility for external consumers of the core/v1 Endpoints API, the existing K8s endpoints controller will keep running until the API is EOL. The following limitations will apply:

- Starting from EndpointSlice beta: if the # of endpoints in one Endpoints object exceeds 100, generate a warning event on the object.
- Starting from EndpointSlice GA: only include up to 500 endpoints in one Endpoints object.
Member

500 seems low, I'd argue for 1000 unless the math is bad?


Based on the data collected from user clusters, the vast majority (> 99%) of k8s services have fewer than 100 endpoints. For small services, the EndpointSlice API will make no difference. If the MaxEndpointThreshold is too small (e.g. 1 endpoint per EndpointSlice), the controller loses the ability to batch updates, causing worse write amplification on service creation/deletion and scale up/down. Etcd write RPS is a significant limiting factor.

- Why do we have a status struct for each endpoint? Why not boolean state for readiness?
Member

we don't any more?

type EndpointPort struct {
// The name of this port (corresponds to ServicePort.Name).
// Must be a DNS_LABEL.
// Optional only if one port is defined.
Member

This conditional "requiredness" is a little confusing. Can we group the fields in a different way?

Member

Sadly, this is how the legacy API is written. But we should do better.

Contributor Author

Yeah, I agree it is confusing. But it is currently everywhere, from pod spec, service spec all the way to endpoints.

Member

Since this is a new resource we can change the rules. We should think of this from the POV of the api machinery and the use-cases. Do we want to allow people to patch these lists or do we see them as atomic? I think addresses certainly wants to be patched, but ports I am less 100% sure.

To be patchable we need a good merge key, so optional values are not great for that.

We could argue that the merge key is "port number + protocol", and name is just a decoration. Even then, sometimes optional is blech. Also, protocol is optional-with-default, which makes the machinery ugly.

We could argue that the merge key is name, and that "" is a valid name. We'd have to look at places that do named-port lookups against endpoints and make sure that "" is allowed.

That still leaves a bit of ugly in that we really should be able to use the same name across protocols (e.g. DNS has them as different fields in SRV) but that, again, is made ugly by protocol-defaulting.

So I think my position is that name is the (required) merge key and "" is a valid name (not omitempty). Looking at @apelisse for more formal guidance. It seems to me that sometimes-optional should be described somewhere as an anti-pattern.

Member

To be patchable we need a good merge key, so optional values are not great for that.

Yes, merge key(s) should not be optional.


type EndpointSliceSpec struct {
Endpoints []Endpoint
Ports []EndpointPort
Member

Are these two lists independent from each other, or is there some hidden implicit connection between their items?

Contributor Author

The hidden implicit connection is that each endpoint in []Endpoint exposes the ports in []EndpointPort.

This is mostly to avoid duplicating information per Endpoint.

The new EndpointSlice API aims to address existing problems as well as leaving room for future extension.


## Motivation
Contributor

Can you explain the use case for having multiple megabytes of endpoints for a single service in the first place?

Member

This is how people are still using k8s, and I've heard different use cases.
The last one I heard is that they have a single-threaded app and need thousands of instances to serve the traffic.

- increase the etcd size limits
- endpoints controller batches / rate limits changes
- apiserver batches / rate-limits watch notifications
- apimachinery to support object level pagination
Contributor

I think more discussion of the alternatives, and why you think EndpointSlice is the best one, would be useful. Are any other parts of kubernetes running into scaling issues that might be fixable with one of these more-generic fixes?

Another alternative would be to have kube-proxy just ignore Endpoints, and instead watch Pods and compute the endpoints itself. (This has similar properties to "1 Endpoint per EndpointSlice" in some ways but not in others.)

Contributor

^ Agreed, this is a brand new pattern, and would set a substantial precedent.

Member

None of the generic solutions would really solve the problem - they just help to some extent with what I've heard users asking for.

Watching all pods by all nodes would kill us :)

Contributor

Watching all pods by all nodes would kill us :)

well, note that some (many?) network plugins have to do that to implement NetworkPolicy currently, so a more generic fix for the Endpoints problem would help with NetworkPolicy scaling too.

Contributor Author
@freehan Jun 25, 2019

We have briefly explored the alternatives. The batch/rate-limiting mechanisms essentially trade supporting a large # of endpoints for something like endpoint update latency (we have heard complaints about some specific cases already). In order to achieve an optimal result (low latency + large scale), it can get quite complex when more dimensions are considered, like # of nodes etc. Hence, it is equal or more work for apimachinery to implement these features, with less performance gain.

In addition, slicing up endpoints aligns better with the established pattern for load balancer scalability: endpoint subsetting. This opens the door for future extension in this direction.

Contributor Author

I feel like scalability for network policy is very different from endpoints scalability. It heavily depends on the underlying implementation.

The worst case is "Watching all pods by all nodes".
The ideal case is like "Only watch pods on its own node".

So I do not think this particular proposal would help in that sense.

However, if one day we decided to go with service policy, then this proposal will automatically benefit the implementation.

Contributor

So we almost discussed this in the SIG meeting last night but were then running out of time. Anyway, I found EndpointSlice to be a fairly surprising proposal, and not how I would have guessed that people would have wanted to address this issue, so I wanted to see more discussion (in the KEP, documented for posterity) about why you ended up going with this approach rather than the alternatives.

Contributor Author

Added "Why not pursue the alternatives?" to the FAQ section.


### Kube-Proxy

Kube-proxy will be modified to consume EndpointSlice instances besides Endpoints resource. A flag will be added to kube-proxy to toggle the mode.
Member

Is there a way to autodetect this? eg start with Endpoints but the moment we see an EndpointSlice switch to watching those? Seems like a shame to add another flag.

Contributor Author

How about this?
The controller creates a config map with a specific name.
kube-proxy tries to get the config map on startup. If the config map exists, it uses EndpointSlice; if not, it falls back to Endpoints.

Contributor
@imroc Jul 21, 2019

If all goes well, I think EndpointSlice and TopologyAwareServiceRouting go Alpha together in 1.16, where EndpointSlice implements the controller only and TopologyAwareServiceRouting watches nodes for topology information. They then go Beta together in 1.17, where kube-proxy consumes EndpointSlice; TopologyAwareServiceRouting also relies on this. So what about consuming EndpointSlice when TopologyAwareServiceRouting is enabled? I can help implement this.

// Required: must contain at least one backend.
// This can be an IP, hostname or URL.
// The consumer (e.g. kube-proxy) decides how to handle the backends.
Backends []string
Member

This needs some explanation around how, e.g., kube-proxy would handle Backends of type URL/hostname. E.g. if a given service proxy can't handle one of the types, what is the expected behavior?

Contributor Author

We currently state that the consumer determines how it handles different types of backends.

Member

What is the use case for URL or hostname?

What is the use case for more than one entry?

If more than one, how are consumers supposed to choose which to use?

What if some of the backends are ready but not all?

Contributor Author
@freehan Jul 2, 2019

Hostname already exists in the v1 Endpoints API. It is used by KubeDNS to resolve pods by hostname.

Adding more than one entry was originally aimed at supporting dual stack, where each pod has one v4 and one v6 IP.

I think the Dual Stack KEP states that this effectively kicks it back to the consumer. For instance, kube-proxy can pick the 1st v4 IP and the 1st v6 IP to program.

As one pod might have multiple IPs, we only consider pod readiness. So if a pod is ready, then all of its backends are ready.

For URL, there is no consumer yet. But the goal is to evolve the K8s Endpoints API towards a generic service discovery API so that it can be consumed by 3rd-party solutions.

Member

I'd strongly consider splitting these into a more helpful system. E.g.:

Hostnames []string
IPAddresses []string
URLs []string
FQDNS []string
...

Alternatively, you could do a list with type and value fields. The way it is now, you can't really validate an IP address (maybe it's a funny looking DNS name with an in-house TLD).

As I understand it, these are different ways of reaching the exact same backend. I think "Aliases" or "Names" would be much less confusing names than "Backends".

Contributor Author
@freehan Jul 2, 2019

Yes. I am considering adding a field to describe the backend type, so that all Endpoints in an EndpointSlice have the same type of backends and API server validation can enforce that.

It will also be easier for consumers to determine whether they can handle the backend type.
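A small sketch of what such a per-slice backend type plus validation could look like; type names and values are illustrative only, not the final API:

```
// Every endpoint in a slice must match the slice's declared backend type, which
// API-server validation could enforce. Illustrative names and values only.
package main

import (
	"fmt"
	"net"
)

type BackendType string

const (
	BackendTypeIP       BackendType = "IP"
	BackendTypeHostname BackendType = "Hostname"
)

type endpointSlice struct {
	BackendType BackendType
	Backends    []string
}

// validate is roughly what API-server validation could do for IP-typed slices.
func validate(s endpointSlice) error {
	for _, b := range s.Backends {
		if s.BackendType == BackendTypeIP && net.ParseIP(b) == nil {
			return fmt.Errorf("backend %q is not a valid IP", b)
		}
	}
	return nil
}

func main() {
	ok := endpointSlice{BackendType: BackendTypeIP, Backends: []string{"10.0.0.1", "10.0.0.2"}}
	bad := endpointSlice{BackendType: BackendTypeIP, Backends: []string{"not-an-ip"}}
	fmt.Println(validate(ok), validate(bad))
}
```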

@freehan freehan force-pushed the master branch 3 times, most recently from d937a95 to ee7a871 Compare June 28, 2019 20:26


```

### Mapping
Member

How are endpoints mapped to slices?

Contributor Author

It depends on the algorithm in controller below.

| # of total watch event | O(N) | O(N) | O(N) |
| | 5000 | 5000 | 5000 |
| Total Bytes Transmitted | O(PN) | O(BN) | O(N) |
| | ~2.0MB * 5000 = 10GB | ~10k * 5000 = 50MB | ~1KB * 5000 = ~5MB |
Member

Not sure a single endpoint update is a good thing to think about, since in practice we should expect multiple endpoints to change within the time period in a huge service.

Member

I don't agree with that. We want to keep the "batching period" fairly small (less than 1s). Given that, (obviously assuming there isn't a rollout or sth happening now), it's very rare to have changes of more than 1 pod within that period. So I actually think that it is a metric that we should be looking at.

Member

Well, then it's not comparing apples to apples-- the current mechanism is like 10s worth of batching IIRC. So when the rate of pod changes is above .1 per second, the rate of updates is a constant .1 / second, and never goes higher.

There's some implicit assumption here about how frequently state-changing things happen to pods. I am not sure what a good assumption is. If we assume the "mean time to state change" is like 1 day (~86k seconds), then we'd expect a service with 20k pods to have about 1 change every 5 seconds, which implies two (small) updates in the new system vs one (big) update in the old system.

But the calculation is pretty sensitive to that time-- if my nodes are worse or my pods are buggier, and the MTtST is only 1 hour, then it's ~5.5 changes per second, and the old mechanism gives 1 big update every 10s (10GB total) and the new system gives ~55 (small) updates (50MB*55=2.7GB) in the same amount of time.

So, the new method is still much better on bandwidth, but it is by a factor of 4, not a factor of 200.

With watch events in that latter scenario, the new method causes 5000*55 = ~250k events, compared to 5k with the old method. So it is clearly much worse by this metric. I think at these numbers bandwidth is more important than events, but I'm not 100% sure.
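Reproducing that back-of-the-envelope arithmetic as a small self-contained calculation, under the same assumptions (20k pods, 1-hour mean time to state change, 5000 watchers, ~2.0 MB Endpoints object vs ~10 KB slice updates, a 10 s window):

```
// Back-of-the-envelope comparison under the assumptions stated in the comment
// above; all numbers are that comment's assumptions, not measurements.
package main

import "fmt"

func main() {
	const (
		pods       = 20000.0
		mttscSec   = 3600.0 // mean time to state change per pod
		watchers   = 5000.0
		windowSec  = 10.0
		oldBytes   = 2.0e6 // one whole Endpoints object per update
		sliceBytes = 10e3  // one EndpointSlice per update
	)

	changesPerSec := pods / mttscSec               // ~5.5 pod changes per second
	newUpdates := changesPerSec * windowSec        // ~55 slice updates in the window
	oldTotal := oldBytes * watchers                // old API: one big update to every watcher
	newTotal := newUpdates * sliceBytes * watchers // new API: many small updates

	fmt.Printf("changes/s=%.1f  old=%.1f GB  new=%.1f GB  events old=%.0f new=%.0f\n",
		changesPerSec, oldTotal/1e9, newTotal/1e9, watchers, newUpdates*watchers)
}
```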

Contributor Author

I think I understand what @lavalamp is saying.

Here is a worst case scenario: 10k endpoints, 1 big Endpoints object vs. 100 EndpointSlice Objects.
Within "batching period", let us say 100 endpoints got removed and they happen to distribute across all 100 EndpointSlice objects. The old API and the new API will perform the same in this case as all endpoints have to be redistributed.

For removal, we can not do much as the endpoints may be scattered.
For addition, the controller can try to pack the new endpoints together and add to one slice.
For noisy pods, the controller can try to concentrate them into a few EndpointSlice. (This seems complex)

Member

@lavalamp - I think there is one false assumption in your thing which is: "the current mechanism is like 10s worth of batching".
This is what isn't true today. In EndpointsController (which we're talking about here) we batch "everything that happened until now" - so if pods are changing very fast we will batch some stuff, but it's more like 0.1s rather than 10s.

To clarify a bit what is happening in EndpointsController is we have a queue from which workers are taking services to process, and when processing a given service we list pods from cache and compute the Endpoints object. There is no explicit batching - batching is only a result of lag in controller.
The 10s batching - I guess you were talking about batching on the kube-proxy side to update iptables.

And I think that explains our disagreement about this point - because we have different assumptions.
Because what I'm saying is that it is rare (if we exclude creation or deletion of the whole service) to have things batched by the endpoints controller.

Member

I must be misremembering something--I thought endpoints controller deliberately batched. IMO it should do that, it would be a massive bandwidth savings for high-churn services.

If it doesn't deliberately batch, it'll still be limited by how long it takes to write the endpoints object, but I agree that should be < 1 second even in pretty bad conditions.

Member

IMO it should do that, it would be a massive bandwidth savings for high-churn services.

There is a trade-off with how fast changes are reflected in the iptables (or whatever else is used) [e.g. it is on the path for autoscaling where speed is something that customers are complaining]

So in general, I agree we can be much smarter, but it's also not that obvious.

Conditions EndpointConditions
// Optional: Reference to object providing the endpoint.
TargetRef *v1.ObjectReference
}
Member

I think an endpoint needs a counter or some other mechanism to enable moving it between slices (e.g., if it appears in two slices, the entry with the higher counter is more correct--the other one hasn't propagated yet).

Contributor Author

I do not quite understand. If there are duplicates, the consumer will merge them, right?

Member

If there are duplicates which disagree on some point, how does the consumer know which is correct?

Contributor Author

I see. So it would need to add generation to each endpoint. Correct?

Member

Yeah, that's the first thing that comes to mind, but there might be other ways of solving the problem.

Member

I guess there are other ways, but generation is a frequent enough concept in k8s that I would suggest using it rather than coming up with something new.

Member

Do we have resolution to this point? At the very least, we need to document what happens if 2 slices have the same endpoint. What should consumers do? De-dup? Consider it as "weight"? If those endpoints are in conflict (e.g. ready = true and false), what should they assume?

- Skip
- If Pod needs to be added:
- If all EndpointSlice objects have reached the MaxEndpointThreshold, create a new EndpointSlice object.
- Find the EndpointSlice with the lowest number of endpoints and add the endpoint to it.
Member

"lowest number of endpoints" gives you a redistribution problem below. You want to completely fill up some and let others empty out as much as possible, which will make them cheaper to redistribute. So, I think you want to add to the one with the highest number of endpoints less than the maximum.

Member

+1

Contributor Author

I see. Okay. I will adjust the algorithm here.
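A small sketch of the adjusted placement rule (add to the fullest slice that still has room); the types are stand-ins, not the proposed API:

```
// pickSlice returns the index of the fullest slice below max, or -1 if a new
// slice must be created. Fuller slices keep filling so others can drain.
package main

import "fmt"

type slice struct {
	Name      string
	Endpoints int
}

func pickSlice(slices []slice, max int) int {
	best := -1
	for i, s := range slices {
		if s.Endpoints >= max {
			continue // already full
		}
		if best == -1 || s.Endpoints > slices[best].Endpoints {
			best = i
		}
	}
	return best
}

func main() {
	slices := []slice{{"a", 100}, {"b", 97}, {"c", 12}}
	fmt.Println(pickSlice(slices, 100)) // 1: slice "b" is fullest with room left
}
```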


| Total Bytes Transmitted | O(PN) | O(PN) | O(PN) |
| | 2.0MB * 5000 = 10GB | 10KB * 5000 * 200 = 10GB | ~10GB |

### Single Endpoint Update
Member

See my last comment-- I think it would be better to compare not a single update, but rather a rate of updates, since that is what will happen in practice. In practice the rate will likely depend on the quality of the pods, multiplied by the size of the service--as services get bigger, holding the quality of the pods constant, the rate will go up.

Considering this, I think some entries that are currently labeled O(N) actually in practice are going to represent O(N * rate) = O(N^2) total load on the system; granted, the constant factor will be favorable.

Contributor Author
@freehan Jul 2, 2019

Under the same scenario (e.g. size of the service), can we assume the rate of change for both APIs is the same? Since we cannot determine the rate of changes on pods, both APIs will respond to them. The main difference is that the cost of each change is smaller.

Contributor Author

Just read your response down below. I think I understand your point now. I will respond.

@vllry
Contributor

vllry commented Jul 12, 2019

/cc

@k8s-ci-robot k8s-ci-robot requested a review from vllry July 12, 2019 04:56
@m1093782566

After a quick look, I think the benefit of this proposal for topology-aware service routing is that kube-proxy may avoid watching nodes and can instead watch EndpointSlice to get topology/label information. cc @imroc for a deeper look.

@freehan
Contributor Author

freehan commented Jul 17, 2019

After a quick look, I think the benefit of this proposal for topology-aware service routing is that kube-proxy may avoid watching nodes and can instead watch EndpointSlice to get topology/label information. cc @imroc for a deeper look.

Yes. That is the goal. It would be great if you can check in the API change on Service first. Then EndpointSlice controller implementation can consume that once it is checked in.

@imroc
Contributor

imroc commented Jul 22, 2019

Would we pre-cook node labels into a new resource (in the same apigroup as EndpointSlice) and simply reference that from each Endpoint[] ?

Agree with this, I think this is a good idea: EndpointSlice holds an ObjectRef for each endpoint, pointing to a duck-typed "decorator" which is non-namespaced and auto-derived from Node.metadata (and updated when Nodes change)

cc @thockin @freehan

@freehan
Contributor Author

freehan commented Jul 23, 2019

Agree with this, I think this is a good idea: EndpointSlice holds an ObjectRef for each endpoint, pointing to a duck-typed "decorator" which is non-namespaced and auto-derived from Node.metadata (and updated when Nodes change)

The ObjectRef on each endpoint currently points to the Pod if the EndpointSlice is derived from a Service. With this approach, kube-proxy still needs to watch extra objects.

@imroc
Contributor

imroc commented Jul 24, 2019

The ObjectRef on each endpoint currently points to the Pod if the EndpointSlice is derived from a Service. With this approach, kube-proxy still needs to watch extra objects.

yeah, it seems we should go back to PodLocator API, or implement the metadata-only watch, then kube-proxy would just watch Node's metadata for topology information

@wojtek-t
Member

yeah, it seems we should go back to PodLocator API, or implement the metadata-only watch, then kube-proxy would just watch Node's metadata for topology information

meta-only watch has been implemented in 1.15: kubernetes/kubernetes#71548
Basically what you need to do is watch nodes as PartialObjectMetadata

That said, I would also like to see an analysis of performance impact of this - as nodes (even with node lease feature) are still being updated every 1m, so every node being watched by every node is a lot of data.

@m1093782566

m1093782566 commented Jul 24, 2019

Cool! Glad to know that!

@imroc
Contributor

imroc commented Jul 24, 2019

meta-only watch has been implemented in 1.15: kubernetes/kubernetes#71548
Basically what you need to do is watch nodes as PartialObjectMetadata

That said, I would also like to see an analysis of performance impact of this - as nodes (even with node lease feature) are still being updated every 1m, so every node being watched by every node is a lot of data.

Cool! I'll take a deeper look at this feature

@freehan freehan force-pushed the master branch 4 times, most recently from e29ffc5 to 5c8612b Compare July 26, 2019 00:36
@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jul 26, 2019
Member
@thockin left a comment

A bunch of API nits. This is not a full API review, and we can argue about names when that PR comes thru.

Merge and iterate

Thanks!

/lgtm
/approve

// +optional
metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`
// Metadata of the Endpoints in the EndpointSlice.
EndpointMeta `json:",inline"`
Member

Why is this an embedded sub-struct rather than inlined? AFAICT it is not reused anywhere?

The following new EndpointSlice API will be added to the `Discovery` API group.

```
type EndpointSlice struct {
Member

I still wonder about names - is there a reason EndpointSet is less good?

Member

EndpointShard?


type Endpoint struct {
// Required: Targets of the endpoint. Must contain at least one target.
// Different consumers (e.g. kube-proxy) handle different types of
Member

Yesterday (IRL) we nixed the "type" field. Given that this is alpha, that's probably fine, but I was thinking about this more. If we really want to be future-compatible and we really think there's utility for other data types here, maybe we should bring back the type field as optional with a single value "IP" and a default value of "IP" when not specified.

Then we can say something like:

Consumers should always check the type field before making assumptions about the contents of this field.

// targets in the context of its own capabilities.
// +listType=set
Targets []string `json:"targets,omitempty" protobuf:"bytes,1,opt,name=targets"`
// The Hostname of this endpoint.
Member

maybe more detail, e.g. "The stated hostname of this endpoint. This field may be used by consumers of endpoint to distinguish endpoints from each other (e.g. in DNS names). Multiple endpoints which use the same hostname should be considered fungible (e.g. multiple A values in DNS)."

@johnbelamaric I don't know if CoreDNS gets this aspect right (we probably did not spec it well). I know KubeDNS gets it wrong.

// endpoint.
// Key/value pairs contained in topology must conform with the label format.
// https://kubernetes.io/docs/concepts/overview/working-with-objects/labels
// Topology may contain the following well known keys:
Member

"may include, but is not limited to"


// +optional
Hostname *string `json:"hostname,omitempty" protobuf:"bytes,2,opt,name=hostname"`
// Required: the conditions of the endpoint.
Conditions EndpointConditions `json:"conditions,omitempty" protobuf:"bytes,3,opt,name=conditions"`
Member

We may want a better name, since this name has meaning. We can debate when the API review comes.

Topology map[string]string `json:"topology,omitempty" protobuf:"bytes,5,opt,name=topology"`
}

type EndpointConditions struct {
Member

We should document something about the polarity or meaning of values here. E.g. are these always "good" things ? I know we hypothesized other conditions - what do they have in common?

}

type EndpointMeta struct {
// This field specifies the list of ports associated with each endpoint in the EndpointSlice
Member

...If this list is empty or unspecified, no ports or protocols are defined and the endpoints are effectively unreachable.

Do we want to allow that, or demand that at least one ports block, even if empty ({}), be provided?

- "@bowei"
- "@thockin"
creation-date: 2019-06-01
last-updated: 2019-06-01
Member

stale

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 26, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: freehan, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 26, 2019
@k8s-ci-robot k8s-ci-robot merged commit abdc87e into kubernetes:master Jul 26, 2019