Compute Envoy config eagerly rather than on-demand #1639
Comments
Unfortunately I don't have permissions to add labels or update assignees. cc @rshriram @kyessenov

Something to keep in mind is that the cache has to keep track of old versions to be able to stage updates across xDS. We should remove stale versions only when no proxy is using them.

Before we make this optimization, we have to check whether it is needed: determine the breaking point, the number of pods, or the conditions that justify it.

To be clear, the optimization is in the caching. Regardless of caching/performance, we need to invert config generation so it's eager, so that we're ready for the push-based world of xDS v2.

To @andraxylia's point, we're stuck with v1 until we can prove v2 performs better. The work will have to happen in a separate envoy adapter module (…)

How are we sequencing things? I thought we had to move to v2 APIs to address the 404 issues, which means we need to land updated config handling in 0.3 as well.

My point was about avoiding premature optimization, relative to the caching/invalidation of the config. I would defer this to when/if needed. We still need to prepare the config for each Envoy instance. We are not stuck with v1: we want to move to v2, and we do not need to prove it performs better than v1, just that the pilot implementation of v2 is production ready. We should also have a way to switch between v1 and v2 and revert if needed, so @kyessenov's suggestion is relevant. Though ADS will solve the 404 issues, they can also be handled separately, in case v2 is not ready with production-grade quality for 0.3. Sequencing-wise, I thought #1257 would cover everything, but if we need multiple issues and multiple owners, let's use an epic.

The work has moved to a new envoy org project, as there is a lot of complexity in the push/streaming interface. It's definitely far from a single issue/epic.

@andraxylia how do you envision the 404s can be solved with v1? I would be interested in exploring that solution.
There are several ways to solve the 404. The most direct is to fix Envoy so it returns 503 as expected. The solution that I am proposing is to change the proxy agent and have it download the full config before …

Obviously, such a cache could directly expose a v2 API - converting from v1 into v2 is apparently what Envoy is doing internally as well, if I understood it correctly. And the code can be extremely simple - only …

It is also ok to start with a v1 cache in the agent - the only special behavior is to not return routes or listeners …

Even if envoy is fixed to not return 404 on missing config - it is likely we'll still need to properly deal with …

As a side note - the local cache + proxy can also greatly simplify push and reduce the complexity of …
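For concreteness, here is a minimal sketch of the agent-side cache idea (an illustration only: the Pilot address, the listen port, and keying on the v1 discovery paths are assumptions, not existing Istio code). The agent proxies Envoy's discovery requests to Pilot, remembers the last good answer per path, and keeps serving that answer if Pilot becomes unreachable.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"sync"
	"time"
)

// configCache keeps the last good Pilot response per discovery path so the
// local proxy can keep answering Envoy if Pilot becomes unreachable.
type configCache struct {
	mu   sync.RWMutex
	last map[string][]byte // keyed by request path, e.g. "/v1/routes/..."
}

func (c *configCache) get(path string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	b, ok := c.last[path]
	return b, ok
}

func (c *configCache) put(path string, body []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.last[path] = body
}

func main() {
	pilot := "http://istio-pilot:8080" // assumed Pilot v1 discovery address
	cache := &configCache{last: map[string][]byte{}}
	client := &http.Client{Timeout: 5 * time.Second}

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		resp, err := client.Get(pilot + r.URL.RequestURI())
		if err == nil {
			body, readErr := io.ReadAll(resp.Body)
			resp.Body.Close()
			if readErr == nil && resp.StatusCode == http.StatusOK {
				cache.put(r.URL.Path, body) // remember the last good answer
				w.Write(body)
				return
			}
		}
		// Pilot is unreachable or failing: serve the last known good config.
		if body, ok := cache.get(r.URL.Path); ok {
			w.Write(body)
			return
		}
		http.Error(w, "no cached config yet", http.StatusServiceUnavailable)
	})
	// Envoy would be pointed at this local port instead of Pilot directly.
	log.Fatal(http.ListenAndServe("127.0.0.1:15002", nil))
}
```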
I think using the agent to serve xDS is an interesting solution for stickiness of the updates. The agent can sequence the polled updates so that Envoy always has an internally consistent view of the config. The logic to sequence updates is required in the v2 implementation, so this is just moving it closer to the proxy. The downside is the increase in complexity in the agent; we've been fighting to simplify the agent since it's attached to every single pod. Another tricky bit is the lack of ACKs/NACKs in v1: the server does not receive confirmation that an update was applied.
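A sketch of the sequencing/consistency guard mentioned above, with hypothetical types rather than an existing Istio API: the agent only hands Envoy routes whose target clusters it has already served, so Envoy never sees a route that points at a cluster it does not yet know about; later polls pick up the withheld routes once CDS catches up.

```go
package xdscache

// Route is an illustrative stand-in for a v1 RDS route entry.
type Route struct {
	Prefix  string
	Cluster string
}

// FilterRoutes drops routes that reference clusters Envoy has not been given
// yet; they will be returned on a later poll, after CDS catches up. This keeps
// the config Envoy sees internally consistent even without ACKs in v1.
func FilterRoutes(routes []Route, knownClusters map[string]bool) []Route {
	out := make([]Route, 0, len(routes))
	for _, r := range routes {
		if knownClusters[r.Cluster] {
			out = append(out, r)
		}
	}
	return out
}
```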
It is also taking us back to the state 8 months ago when we did this in the agent. I don't see any advantage to double caching in the proxy agent and Envoy. Envoy automatically falls back to the cached version if pilot is unavailable; that has been our design since day 1, so the argument about helping in hybrid/on-prem doesn't seem convincing. I also don't see how it will reduce complexity (if any) in pilot, because we don't do any sequencing in pilot. Implementing the v2 API in the agent seems suboptimal, more like … I see no use for the v2 APIs in Envoy for large-scale installations. Finally, not everyone will use the agent, because they could be running Envoy in different formats (Cloud Foundry, for example) where there is no accompanying agent. This could be a temporary solution but not a permanent one. @mattklein123 / @htuch do you think @costinm's concerns are real - about envoys failing upon failure of management planes (like pilot, or the one at Lyft, or what you guys are building at Google)? Do you guys have double caching, with a local agent and a central agent? I don't find it necessary.
@rshriram no need for this local cache. As you correctly state, if a management update fails, the last valid config remains latched and in effect.

Agreed, there is no reason to cache anything on the node. re: 404 vs. 503, there are legitimate reasons why someone may want to return 404 in this scenario. Having an option to return 503 instead in the case the cluster is not found is trivial. Someone just needs to do the work (envoyproxy/envoy#1820).
Envoy returning 503 plus the 'liveness' check at startup will take care of the 404 bug, and I think it is the right solution.

Regarding 'cache on the node' - there are reasons beyond the 404, and it is not the same as what we had 8 months ago. Envoy does indeed gracefully deal with pilot unavailability - but in case of a DDoS it is likely that both pilot and some of the ingress and backends will flip and restart. I've seen this happen quite a few times, and avoiding cascading failures is not easy.

Envoy could save snapshots on disk, or maybe it already saves snapshots of the config in shared memory - that would also be a good solution, but I don't think we should dismiss this problem. DDoS attacks are common, and there are other scenarios in hybrid/on-prem where similar behavior may happen.

In addition, having an on-prem caching proxy for pilot will help in other use cases too - and likely we'll need to implement it anyway.

Either way - it's worth trying.
One extra note on the subject - in a hosted environment, with a very large number of domains (…

So long term I don't think the current design of keeping the entire mesh config cached in Pilot's memory …

The use case is something like an app that allows users to bring their own domains (something like …
I don't think

> Envoy returning 503 plus the 'liveness' check at startup will take care of the 404 bug, and I think it is the right solution.

takes care of 503s or 404s fully. Every time you change a route rule you will hit this, and the chances of hitting it depend on your polling interval. You can solve this by having a temporary local cache, but that's a short-term solution until the v2 API lands - at which point the agent becomes useless.

The Envoy crash happens if we restart it too frequently - which is caused by auth certificates changing rapidly. I think it's a question for the auth folks, and that too will become moot with the v2 API. My two cents: we could also totally rearchitect Pilot (in my mind this solution is basically what we had in Amalgam8 a year ago), finish this up, and call it production ready by the 0.3 time frame. Whether this is more stable than the v2 implementation is up for debate.
@mattklein123 would you agree that adding support in Envoy for snapshotting/checkpointing the current running config is the right thing to do to allow the data plane to function correctly if the control plane is partitioned and Envoy needs to restart or hot-restart on the same host? There would be no control plane to provide clusters and listeners, so traffic would be lost.
No. I think this is complexity that is not actually needed and is more likely to lead to an outage than the situation it is trying to cover for.
Shriram: the 503 on route rule change happens only when the cluster is missing, i.e. for a brand new service. AFAIK it doesn't happen if you change rules for an existing service/cluster.

Having a local cache is not useless - the main use case is hybrid and HA cases. In particular, since Matt indicated he doesn't believe a checkpoint in Envoy is the right solution, a cache is even more important to have, and will be needed with v2 as well. My proposal for the cache was to have it expose v2 from the start anyway (transcoding the pilot v1).

I'm not aware of any crash caused by cert change - sorry if I missed that bug.

But I really don't see how we can call anything brand new 'production ready' - the whole point is to have extensive and broad testing and proven stability; we can't 'rearchitect' pilot and immediately call it 'production ready'. And there is no need to do this in 0.3 - a liveness check at startup is still needed even with the v2 API (we don't want requests coming in before the v2 API got the config), and even with the v2 API we want 503 if a route is defined but not the cluster.
Also, Envoy may be deployed in a diskless scenario (or with read-only storage), so there inherently needs to be a story for configuration recovery that doesn't rely on this. I think a useful first step in this discussion is to get a clear picture (and consensus) of which failure scenarios Istio considers important. Various ones have been suggested; can we get a doc written?
I agree snapshot/cache is complexity that may not be needed in envoy, and is better handled by the DS infrastructure.

I've been involved in a few outages caused by this kind of cascading failure (and maybe caused a couple as well :-) - the goal is to gracefully deal with cases where a component is down. Mixer already handles this (fails open); envoy partially handles it by using the last known config - but only until restart.

The cache doesn't have to be implemented in the agent, and it doesn't need to store the cached config on local disk, but …

We can start a doc - the caching proxy can be an optional component as well; it will be more valuable to users who run hybrid clusters and are concerned about HA. There is no need to add it for all users, and at this point the liveness + 503 should be sufficient for 0.3 (IMHO). We'll document that after adding a service it may take up to refresh_interval until it becomes visible.
On Tue, Oct 31, 2017 at 1:29 PM Costin Manolache wrote:

> I agree snapshot/cache is complexity that may not be needed in envoy, and is better handled by the DS infrastructure. I've been involved in a few outages caused by this kind of cascading failure (and maybe caused a couple as well :-) - the goal is to gracefully deal with cases where a component is down. Mixer already handles this (fails open), envoy partially handles it by using last known config - but only until restart.

Yes, and that's why all the restarts are being eliminated. With v2, there is no restart.

> The cache doesn't have to be implemented in the agent, and it doesn't need to store the cached config on local disk, but …

Not in the agent or on disk. It means the DS server, and the server has the cache.

> We can start a doc - the caching proxy can be an optional component as well, will be more valuable to users who run hybrid clusters

Let's use the word carefully. Hybrid means on-prem plus cloud; I don't know what you mean here. And what outage scenario, as Harvey indicates? Having a crystal-clear fault model helps: what failures can be handled and what cannot.

> and are concerned about HA, there is no need to add it for all users and at this point the liveness + 503 should be sufficient for 0.3 (IMHO).

You get a 503 (or 404) anytime you create a new weighted routing rule and the receiving Envoy doesn't have the clusters associated with the routes. It's not just for brand new services. It also makes no sense to change the pilot architecture completely, do local fetch and the v2 API locally, and call this production ready. I would say that we live with high refresh rates in v1 and make it configurable in Envoy to return 404/503.

> We'll document that after adding a service it may take up to refresh_interval until it becomes visible.

Once again, this is not the only scenario. Set up a 60s refresh interval, create a weighted routing rule, and you will see the issue.
Will move the discussion on how to 'fail open', HA, and avoiding cascading failures to a separate document; it's a different topic.

On the 404: I think it is clear that 503-if-cluster-missing and liveness checks will mitigate some of the …

The main remaining issue is the case Shriram describes - which I don't understand and I was not …
Getting a 404 for a newly created service is acceptable. After all, the configuration has to propagate to all callers, and if external clients start accessing it before that's done, a 404 is what a server sends. In other words, end users are actually used to this scenario.

What they are not used to is getting a 404 when they split an existing service into multiple clusters for the purpose of version routing. This is the bigger issue that we need to solve; we would not be undertaking such a big effort just for the healthcheck or initial startup thing :). #926 is the first occurrence of this issue (this user is trying to deploy Istio to production :) ). The problem only compounds when we increase the refresh interval from 1s to 60s. While a partial solution is to have a high refresh rate, it burns a lot of CPU and tanks performance (evidence: #1237, where we found "First tested with Istio 0.1.6. Big performance jump when refresh interval was increased from 10ms to 60s.").
I see the comment that "Yes, a route rule can introduce a new cluster (for a special version, for example)" - …

It seems we agree that healthchecks / liveness checks are needed for both v1 and v2 - and having some …

A short refresh rate is not a solution - everything should work with 60s or more, with the understanding that …
New clusters are created because of route changes (version split). An existing service's cluster is split into multiple newer clusters, and traffic is split by weights. This involves changing the route config to use weighted clusters and pushing those new clusters to Envoy. This is where the 404s/503s are occurring, and that's the whole reason for the v2 push API.
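To illustrate the mechanism (the naming scheme below is made up for the example, not Pilot's actual cluster-naming convention): a weighted version split turns one service cluster into several per-version clusters, and the updated route references those new names, so there is a window where RDS mentions clusters that CDS has not yet delivered.

```go
package split

import "fmt"

// WeightedCluster is an illustrative stand-in for one target of a weighted route.
type WeightedCluster struct {
	Name   string
	Weight int // percent of traffic
}

// SplitByVersion derives the per-version cluster names and weights implied by a
// weighted route rule. Every entry is a cluster Envoy has never seen before the
// split, which is why the route update can race ahead of the cluster update.
func SplitByVersion(service string, weights map[string]int) []WeightedCluster {
	out := []WeightedCluster{}
	for version, w := range weights {
		out = append(out, WeightedCluster{
			Name:   fmt.Sprintf("out.%s|%s", service, version), // hypothetical name format
			Weight: w,
		})
	}
	return out
}
```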
We should probably consolidate the various open 404/503 issues and move this to istio/istio.
Thanks Shriram, I see the problem.

I'll start a doc - there are a few relatively simple options for how to push clusters before routes (or ensure cluster stability when route rules are changed), but they may require small extensions/adjustments to the API.

The current weight rules don't work on 'mesh expansion' (no labels) - I was not familiar with how it is implemented, but at least one of the options we were exploring for adding weight rules for 'raw VMs' will also solve this problem.
Why would you not have labels on VMs? k8s, consul, eureka, and whatnot all have labels on workloads. An API change is a drastic measure to fix this problem. The actual solution will happen in the v2 library under the envoy org, once we work out the ADS protocol. This is the solution that will scale, be backwards/future compatible, and fix all the other "inconsistency" issues that are possible with v1. Once it's ready we'll replace the pilot core with it, and only need to translate Istio config into data plane config to complete the picture.
On Thu, Nov 2, 2017 at 5:25 PM, Kuat wrote:

> Why would you *not* have labels on VMs?

Well - hard to answer this question; non-orchestrated workloads don't have a lot of other things, they're simply processes running on some machines. That's how things work. Even if they did have labels - it's not clear this model can work with non-flat networks, or if it can scale in the hybrid model, or if all orchestration systems use labels in the same way, or if keeping track of all global labels and pods in pilot's memory is a scalable architecture. The important feature is to allow users to control weighted routing.

> k8s, consul, eureka and whatnot all have labels on workloads. An API change is a drastic measure to fix this problem. The actual solution will happen in v2 library under envoy org, once we work out the ADS protocol. This is the solution that will scale, be backwards/future compatible, and fix all the other "inconsistency" issues that are possible with v1. Once it's ready we'll replace the pilot core with it, and only need to translate istio config into data plane config to complete the picture.

I'm not proposing any API change to fix this problem. I was saying that some of the API changes that we are planning for 'bring your name' and hybrid also happen to naturally not have this problem, and can be used as a different way to allow weighted routing until ADS is ready and tested.

In general - there are multiple solutions, each with its own benefits. ADS can fix the current 404 bug, and there are other ways as well - but we need to also consider whether the current design of weighted routing is the right one (regardless of the bug). Getting fixated on one solution or model and finding reasons to dismiss any other option usually does not work very well :-)
I'll be happy to hear concrete proposals :) in the meantime I'm investing in building the v2 implementation. It will be a separate decision how to reconcile the two approaches.
Issue moved to istio/istio #1370 via ZenHub
Today Pilot implements Envoy's v1 xDS APIs, which are pull-based: Envoy sends a request, we compute the config it needs (and cache it), then reply to Envoy. As a result of the current implementation, we don't keep track of which Istio config was used to construct each Envoy config. This means we have poor caching (we invalidate the entire config cache when any Istio config changes), and it means we don't know which Envoy configs need to be updated as Istio configs change.
In the world of v2 xDS APIs, Pilot will need to push config to Envoy, rather than send config as Envoy requests it. This means Pilot needs to understand which specific Istio config documents contribute to the config of each specific Envoy instance, so it can update and push config for specific Envoy instances as the underlying Istio configs change. As a side effect of tracking this usage, we can also improve our caching of config by invalidating only portions of the cache rather than the entire cache any time an Istio config changes.
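A minimal sketch of the dependency tracking described above, using illustrative types rather than Pilot's actual ones: record, while generating a proxy's Envoy config, which Istio config documents were consulted; when one of them changes, rebuild and push only the affected proxies instead of invalidating the whole cache.

```go
package eager

// ConfigKey identifies an Istio config document, e.g. "route-rule/default/reviews-v2".
type ConfigKey string

// ProxyID identifies a specific Envoy instance, e.g. "sidecar~10.0.0.1~reviews-xyz.default".
type ProxyID string

// Index records which proxies' generated Envoy configs depend on which Istio
// config documents, enabling targeted invalidation and eager pushes.
type Index struct {
	usedBy map[ConfigKey]map[ProxyID]bool
}

// Record is called during config generation for a proxy, once per Istio config
// document consulted while building that proxy's Envoy config.
func (ix *Index) Record(key ConfigKey, proxy ProxyID) {
	if ix.usedBy == nil {
		ix.usedBy = map[ConfigKey]map[ProxyID]bool{}
	}
	if ix.usedBy[key] == nil {
		ix.usedBy[key] = map[ProxyID]bool{}
	}
	ix.usedBy[key][proxy] = true
}

// Affected returns the proxies whose config must be rebuilt and pushed when the
// given Istio config document changes; everything else stays cached.
func (ix *Index) Affected(key ConfigKey) []ProxyID {
	out := []ProxyID{}
	for p := range ix.usedBy[key] {
		out = append(out, p)
	}
	return out
}
```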