Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable EDS Patch on Pilot EDS incremental push #19287

Open
seflerZ opened this issue Nov 28, 2019 · 21 comments
Open

Enable EDS Patch on Pilot EDS incremental push #19287

seflerZ opened this issue Nov 28, 2019 · 21 comments

Comments

@seflerZ
Copy link
Contributor

@seflerZ seflerZ commented Nov 28, 2019

Describe the feature request
Currently, incremental EDS push has been implemented on both Envoy and Pilot side. Even with this, some large clusters(with 10000 endpoints for example) may still suffer from performance issues due to:

  1. Even with only 1 endpoints change within cluster would cause 10000 endpoints push. Most of the data is duplicated and wasted.
  2. Full EDS refresh will trigger Envoy to reload and recompute data which causes high CPU and memory usage constantly.
  3. Changes within large clusters will likely to be constantly which leads to push constantly.

The core idea is to make use of 'xDS delta' to only send endpoints changed. I'll write a google doc to describe it in details later.

Describe alternatives you've considered
Perhaps endpoint slices?

Additional context
Refer to this issue the EDS Patch on Envoy side is not ready yet. We will start some work at Envoy side in the meantime.

This issue is about patching of EDS on Envoy data pushing. There is another issue about the MCP registry data pushing. If both are ready, the performance and usability of Pilot will increase significantly, especially on networking.

@howardjohn

This comment has been minimized.

Copy link
Member

@howardjohn howardjohn commented Nov 28, 2019

Sounds great but as you mentioned needs envoy changes. My understanding is the delta xds protocol would not enable this, we need dinner grain patching

@seflerZ

This comment has been minimized.

Copy link
Contributor Author

@seflerZ seflerZ commented Nov 29, 2019

@howardjohn We can contribute to Envoy and benefit a lot from this. The primary objective is to make the patching protocol clear. Is there any suggestion you can give or any documentation/specification you've found?

Sounds I've mis-understood the xDS delta, I'll continue digging in.

@howardjohn

This comment has been minimized.

Copy link
Member

@howardjohn howardjohn commented Nov 29, 2019

Yes my understanding is that https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/discovery.proto#envoy-api-msg-deltadiscoveryrequest you still need to send a complete Cluster Load Assignment and therefore all the endpoints together

@seflerZ

This comment has been minimized.

Copy link
Contributor Author

@seflerZ seflerZ commented Nov 29, 2019

@howardjohn I got it. You'r right. So there should be a new way to do resource patching. @kyessenov I've read your sketch and think it helpful. Could you tell me the process of patching protocol you designed?

To my understanding, previous discussions focused on the following:
1、Should we deliver a comprehensive solution which supports EDS/CDS/xDS resources and complex path matching or make it dead simple to reach our goals only (Such as focusing on EDS patching first)?
2、Should we move this feature in UPDA and implement it directly instead of improving current xDS protocol?

@sudhi-vm

This comment has been minimized.

Copy link

@sudhi-vm sudhi-vm commented Dec 6, 2019

We (@ eBay) are also dependent on this feature. Please let us know how we can help.

@kyessenov

This comment has been minimized.

Copy link
Contributor

@kyessenov kyessenov commented Dec 6, 2019

Sorry, just saw this. You'd need to have this discussion in Envoy community. From what I understand, general patching is not agreed upon, but if you think you can make it work with endpoint groups, that's probably the way to go. UDPA is supposed to be a natural evolution of xDS, so if it doesn't really matter where you want to add it, it's the same people :)

@seflerZ

This comment has been minimized.

Copy link
Contributor Author

@seflerZ seflerZ commented Dec 8, 2019

Yes. We've made much discussion on Envoy side within this issue and many folks including me appreciate adding a new resource type to enable endpoint groups to support this feature. Although it's not a perfect solution, it is best practice at present.

@sudhi-vm Currently we need to reach a common solution by discussing with folks in Pilot and Envoy communities. We've made some process at present. Our organization also marks this feature urgent.

@howardjohn

This comment has been minimized.

Copy link
Member

@howardjohn howardjohn commented Dec 8, 2019

@seflerZ what is your organizations plans with regards to Istio versions? Say envoy support for egds was merged today - the earliest it would be shipped in an Istio release would be February (1.5 release date)

@seflerZ

This comment has been minimized.

Copy link
Contributor Author

@seflerZ seflerZ commented Dec 8, 2019

@howardjohn First, the date is acceptable. We would start to implement it immediately in our internal Istio branch after discussion with you folks and keep tack of community process before Istio 1.5.

Of course, we love to contribute. Our process and design will be made public in Istio community because we believe this feature can help most of us.

@seflerZ

This comment has been minimized.

Copy link
Contributor Author

@seflerZ seflerZ commented Dec 8, 2019

I've created a design doc here to summarize points we've got.

@seflerZ

This comment has been minimized.

Copy link
Contributor Author

@seflerZ seflerZ commented Dec 20, 2019

@howardjohn I'm wishing to push this feature in to Istio 1.6 and I plan to join in Istio Network Group discussion next Thursday (12-26 11am PST, and 12-17 3am Beijing Time). Yet I don't have access to the agenda document. Would you grant it for me?

@howardjohn

This comment has been minimized.

Copy link
Member

@howardjohn howardjohn commented Dec 20, 2019

@seflerZ

This comment has been minimized.

Copy link
Contributor Author

@seflerZ seflerZ commented Dec 20, 2019

@howardjohn Thanks, got it.

@seflerZ

This comment has been minimized.

Copy link
Contributor Author

@seflerZ seflerZ commented Jan 2, 2020

@howardjohn Is it necessary to make backward compatibility by disabling EGDS?

@howardjohn

This comment has been minimized.

Copy link
Member

@howardjohn howardjohn commented Jan 2, 2020

I am not sure what you mean by that? We need to support non-egds because new pilot versions need to work with old proxy versions

@seflerZ

This comment has been minimized.

Copy link
Contributor Author

@seflerZ seflerZ commented Jan 2, 2020

@howardjohn Thanks. Another question, do we need to switch EGDS feature on or off based on proxy versions? That is to say, with the same Pilot instance, old version proxy A got endpoints by old EDS way and new version proxy B got endpoints through EGDS. I've designed a switch in MeshConfig to enable or disable EGDS feature, but it is systemwide.

@seflerZ

This comment has been minimized.

Copy link
Contributor Author

@seflerZ seflerZ commented Jan 2, 2020

@howardjohn @lizan There is another issue. The EGDS feature touches the xDS API which can be only implemented on xDS v3alpha because the v2 seems to be stable, but Pilot itself doesn't have any code supporting xDS v3alpha yet. It'll touch many files if we make support for xDS API version v3alpha within this EGDS feature. Do you have any ideas to avoid that?

@howardjohn

This comment has been minimized.

Copy link
Member

@howardjohn howardjohn commented Jan 2, 2020

Yes it will need to be switched off per proxy. This is a very common pattern in pilot, look for functions like IsProxyVersionGE14 or similar.

As far as the v3 API, we will need to add support for that at some point. As far as I know there is no way around that, which would unfortunately mean that egds is dependent on us upgrading to v3. The v3 upgrade has the same problem with proxy compatibility which makes it a lot more painful than just renaming everything from v2 to v3 and updating the few things that changed

@seflerZ

This comment has been minimized.

Copy link
Contributor Author

@seflerZ seflerZ commented Jan 3, 2020

@howardjohn There is a debug interface called "/debug/edsz" which prints the EDS data returned to the proxy node. However, it simply picks the xDS connection of most recently(which may result in different outputs everytime) . If we aimed to supported new and old proxy versions, this debug interface should also be enhanced.

BTW. What's the plan of v3 upgrading work? When will it be completed?

@howardjohn

This comment has been minimized.

Copy link
Member

@howardjohn howardjohn commented Jan 3, 2020

@seflerZ

This comment has been minimized.

Copy link
Contributor Author

@seflerZ seflerZ commented Feb 6, 2020

Hi,
We've implemented the EGDS in our internal Pilot and benefited a lot from it. As you can see from the performance benchmark below, the EGDS can significantly reduce the network traffic with partial endpoint changes.

Before:
d1

After:
d2

The test was fired with 50 services witch each has 2000 endpoints. That is a total of 0.1 million endpoints. The test platform got 1 CPU 2 GB memory. Every 10s, 1% of the endpoints will be changed randomly.

Later we'll start a PR for reviewing, this work involves both the Envoy and Pilot side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants
You can’t perform that action at this time.