
Compute Envoy config eagerly rather than on-demand #1639

Closed
ZackButcher opened this issue Oct 27, 2017 · 33 comments

@ZackButcher
Contributor

Today Pilot implements Envoy's v1 xDS APIs, which are pull-based: Envoy sends a request, we compute the config it needs (and cache it), then reply to Envoy. As a result of the current implementation, we don't keep track of which Istio config was used to construct Envoy's config. This means we have poor caching (we invalidate the entire config cache when any Istio config changes) and that we don't know which Envoy configs need to be updated as Istio configs change.

In the world of v2 xDS APIs, Pilot will need to push config to Envoy, rather than send config as Envoy requests it. This means Pilot needs to understand which specific Istio config documents contribute to the config of each specific Envoy instance, so it can update and push config for specific Envoy instances as the underlying Istio configs change. As a side effect of tracking this usage, we can also improve our caching of config by invalidating only portions of the cache rather than the entire cache any time an Istio config changes.
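
To make this concrete, here is a minimal Go sketch of the kind of dependency index this implies; all names (ConfigKey, DependencyIndex, Record, Affected) are illustrative, not Pilot's actual types.

```go
package eager

import "sync"

// ConfigKey identifies an Istio config document (illustrative, not Pilot's real type).
type ConfigKey struct {
	Type string // e.g. "route-rule", "destination-policy"
	Name string
}

// DependencyIndex remembers which Istio config documents were used to build
// each Envoy instance's config, so that a change can invalidate (and, under
// v2, push to) only the affected proxies instead of the whole cache.
type DependencyIndex struct {
	mu      sync.RWMutex
	byInput map[ConfigKey]map[string]struct{} // Istio config -> Envoy node IDs
}

func NewDependencyIndex() *DependencyIndex {
	return &DependencyIndex{byInput: make(map[ConfigKey]map[string]struct{})}
}

// Record notes that nodeID's config was generated from the given inputs.
func (d *DependencyIndex) Record(nodeID string, inputs []ConfigKey) {
	d.mu.Lock()
	defer d.mu.Unlock()
	for _, k := range inputs {
		if d.byInput[k] == nil {
			d.byInput[k] = make(map[string]struct{})
		}
		d.byInput[k][nodeID] = struct{}{}
	}
}

// Affected returns the Envoy instances whose config must be recomputed and
// pushed when the given Istio config document changes.
func (d *DependencyIndex) Affected(changed ConfigKey) []string {
	d.mu.RLock()
	defer d.mu.RUnlock()
	nodes := make([]string, 0, len(d.byInput[changed]))
	for id := range d.byInput[changed] {
		nodes = append(nodes, id)
	}
	return nodes
}
```

With such an index, a change to a single route rule invalidates (and, under v2, triggers a push for) only the proxies recorded against it, rather than the whole cache.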

@ZackButcher
Contributor Author

Unfortunately I don't have permissions to add labels or update assignees. cc @rshriram @kyessenov

@kyessenov
Contributor

Something to keep in mind is that the cache has to keep track of old versions to be able to stage updates across xDS. We should remove stale versions only when no proxy is still using them.
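
A rough sketch of that idea, with hypothetical names (versionedCache, Pin) rather than the real cache code: keep a reference count per config version and evict a version only when the count drops to zero.

```go
package eager

import "sync"

// versionedCache keeps every config version some proxy may still be using and
// drops a version only once its reference count reaches zero (hypothetical
// sketch, not the actual Pilot cache).
type versionedCache struct {
	mu        sync.Mutex
	snapshots map[string][]byte // version -> serialized config
	refs      map[string]int    // version -> number of proxies on that version
}

func newVersionedCache() *versionedCache {
	return &versionedCache{
		snapshots: make(map[string][]byte),
		refs:      make(map[string]int),
	}
}

// Pin records that a proxy moved from oldVersion to newVersion; the old
// version is evicted only when no proxy references it anymore.
func (c *versionedCache) Pin(oldVersion, newVersion string, config []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.snapshots[newVersion] = config
	c.refs[newVersion]++
	if oldVersion == "" {
		return // first config this proxy has seen
	}
	c.refs[oldVersion]--
	if c.refs[oldVersion] <= 0 {
		delete(c.refs, oldVersion)
		delete(c.snapshots, oldVersion) // stale and unused: safe to drop
	}
}
```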

@andraxylia
Contributor

Before we make this optimization, we have to check whether it is needed: determine the breaking point, i.e. the number of pods or the conditions that justify it.

@ZackButcher
Contributor Author

To be clear, the optimization is in the caching. Regardless of caching/performance, we need to invert config generation so it's eager, so that we're ready for the push-based world of xDS v2.

@kyessenov
Contributor

To @andraxylia's point, we're stuck with v1 until we can prove v2 performs better. The work will have to happen in a separate Envoy adapter module (proxy/envoy2), since this is a large effort.

@ZackButcher
Contributor Author

How are we sequencing things? I thought we had to move to v2 APIs to address the 404 issues, which means we need to land updated config handling in 0.3 as well.

@andraxylia
Contributor

My point was about avoiding premature optimization, and it was relative to the caching/invalidation of the config. I would defer this to when/if it is needed. We still need to prepare the config for each Envoy instance.

We are not stuck with v1: we want to move to v2, and we do not need to prove it performs better than v1, just that the Pilot implementation of v2 is production ready. We should also have a way to switch between v1 and v2 and revert if needed, so @kyessenov's suggestion is relevant.

Though ADS will solve the 404 issues, they can also be handled separately, in case v2 is not ready at production-grade quality for 0.3.

Sequencing-wise, I thought #1257 would cover everything, but if we need multiple issues and multiple owners, let's use an epic.

@kyessenov
Contributor

The work has moved to a new envoy-org project, as there is a lot of complexity in the push/streaming interface. It's definitely far from a single issue/epic.
I can't vouch that we can reach parity with v1, since there are some serious semantic differences (such as the deprecation of bind_to_port) that need to be reconciled.

@rshriram
Member

@andraxylia how do you envision the 404s being solved with v1? I would be interested in exploring that solution.

@costinm
Contributor

costinm commented Oct 28, 2017

There are several ways to solve the 404. The most direct is to fix Envoy so it returns 503 as expected. AFAIK there is no technical reason to return a 404 if a route exists but the cluster is not yet available.

The solution that I am proposing is to change the proxy agent to download the full config before starting Envoy and act as a local caching proxy. The primary goal of this is to handle cases where Pilot becomes unavailable, in particular for hybrid/on-prem cases.

Obviously, such a cache could directly expose a v2 API; converting from v1 into v2 is apparently what Envoy is doing internally as well, if I understood it correctly. And the code can be extremely simple: it only depends on gRPC, HTTP/JSON, and the interfaces defined by Envoy.

It is also OK to start with a v1 cache in the agent; the only special behavior is to not return routes or listeners until all dependent clusters/routes have been downloaded. On top of this it may allow some local overrides and/or templates.

Even if Envoy is fixed to not return 404 on missing config, it is likely we'll still need to properly deal with cases where Pilot becomes unavailable; cascading failures can be pretty bad.

As a side note, the local cache + proxy could also greatly simplify push and reduce the complexity of Pilot.
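
A sketch of the "do not return routes or listeners until their clusters have arrived" behavior described above; localCache and ServeRoutes are hypothetical names, not the actual agent code.

```go
package agent

import "sync"

// localCache sketches the agent-side cache described above: it stores the
// last config pulled from Pilot and refuses to serve a route config until
// every cluster it references has also been downloaded, so Envoy never sees
// a route that points at a cluster it does not know about.
type localCache struct {
	mu        sync.RWMutex
	clusters  map[string][]byte   // cluster name -> CDS payload
	routes    map[string][]byte   // route config name -> RDS payload
	routeDeps map[string][]string // route config name -> clusters it references
}

// ServeRoutes returns the cached route config only when all referenced
// clusters are present; otherwise it reports "not ready" and the caller keeps
// serving the previous, consistent view to Envoy.
func (c *localCache) ServeRoutes(name string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	for _, cluster := range c.routeDeps[name] {
		if _, ok := c.clusters[cluster]; !ok {
			return nil, false // a dependent cluster has not arrived yet
		}
	}
	payload, ok := c.routes[name]
	return payload, ok
}
```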

@kyessenov
Contributor

I think using the agent to serve xDS is an interesting solution for stickiness of the updates. The agent can sequence the polled updates so that Envoy always has an internally consistent view of the config. The logic to sequence updates is required in the v2 implementation anyway, so this just moves it closer to the proxy.

The downside is the increase in the complexity in the agent. We've been fighting for simplification of the agent since it's attached to every single pod.

Another tricky bit is the lack of ACKs/NACKs in v1: the server does not receive confirmation that an update was applied in the v1 API.
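
For background on the ACK/NACK point: in the v2 protocol each DiscoveryRequest carries the last accepted version, the nonce of the push it is answering, and an error detail on rejection, so the server learns whether an update was applied. A simplified Go sketch, using a local stand-in struct rather than the real generated proto types:

```go
package eager

// discoveryRequest mirrors the handful of v2 xDS request fields that matter
// for ACK/NACK handling (a simplified stand-in, not the real proto message).
type discoveryRequest struct {
	VersionInfo   string // last config version the proxy accepted
	ResponseNonce string // nonce of the response this request acknowledges
	ErrorDetail   string // non-empty when the proxy rejected the update (NACK)
}

// classify shows how a v2 server can tell whether its last push was applied,
// something the v1 polling API gives no signal for.
func classify(req discoveryRequest, lastSentNonce, lastSentVersion string) string {
	switch {
	case lastSentNonce == "":
		return "initial" // nothing has been pushed on this stream yet
	case req.ResponseNonce != lastSentNonce:
		return "stale" // answers an older push; ignore for ACK purposes
	case req.ErrorDetail != "":
		return "nack" // proxy rejected the config; keep the last ACKed version
	case req.VersionInfo == lastSentVersion:
		return "ack" // proxy applied the pushed version
	default:
		return "unknown"
	}
}
```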

@rshriram
Member

It also takes us back to the state 8 months ago, when we did this in the agent. I don't see any advantage in double caching in the proxy agent and Envoy. Envoy automatically falls back to the cached version if Pilot is unavailable; that has been our design since day 1, so the argument about helping in hybrid/on-prem cases doesn't seem convincing. I also don't see how it would reduce complexity (if any) in Pilot, because we don't do any sequencing in Pilot. Implementing the v2 API in the agent seems suboptimal; I see no use for it given the v2 APIs in Envoy for large-scale installations.

Finally, not everyone will use the agent, because they could be running Envoy in different setups (Cloud Foundry, for example) where there is no accompanying agent. This could be a temporary solution, but not a permanent one.

@mattklein123 / @htuch do you think @costinm's concerns are real, about Envoys failing upon failure of the management plane (like Pilot, or the one at Lyft, or what you are building at Google)? Do you have double caching, with a local agent and a central agent? I don't find it necessary.

@htuch

htuch commented Oct 29, 2017

@rshriram no need for this local cache. As you correctly state, if a management update fails, the last valid config remains latched and in effect.

@mattklein123

Agreed, there is no reason to cache anything on the node.

re: 404 vs. 503, there are legitimate reasons why someone may want to return 404 in this scenario. Having an option to return 503 instead when the cluster is not found is trivial; someone just needs to do the work (envoyproxy/envoy#1820).

@costinm
Contributor

costinm commented Oct 30, 2017 via email

@costinm
Contributor

costinm commented Oct 30, 2017

One extra note on the subject: in a hosted environment with a very large number of domains (like 'custom domains' in the case of a hosted app), keeping the entire routing/listener config in Pilot's memory becomes impossible, and it's not really needed, since Envoy already caches the stable config (and the agent combined with some memcache could also do it at a local level).

So long term I don't think the current design of keeping the entire mesh config cached in Pilot's memory is viable; it will need to be sharded and locally cached/distributed, and Envoy will probably need to support more 'on demand' loading of config and certs.

The use case is something like an app that allows users to bring their own domains (something like hosted Gmail, for example): a single backend, but with 1M+ routes and certs.

@rshriram
Member

I don't think Envoy returning 503, plus the 'liveness' check at startup, takes care of the 503s or 404s fully, even though it is the right kind of fix. Every time you change a route rule you will hit this, and the chances of hitting it depend on your polling interval. You can solve this by having a temporary local cache, but that's a short-term solution, until the v2 API lands - at which point the agent becomes useless.

The Envoy crash happens if we restart it too frequently, which is caused by auth certificates changing rapidly. I think it's a question for the auth folks, and that too will become moot with the v2 API. My two cents. We could also totally rearchitect Pilot (in my mind this solution is basically what we had in Amalgam8 a year ago), finish it up, and call it production ready by the 0.3 time frame. Whether this would be more stable than the v2 implementation is up for debate.

@andraxylia
Contributor

@mattklein123 would you agree that adding support in Envoy for snapshotting/checkpointing the current running config is the right thing to do, to allow the data plane to function correctly if the control plane is partitioned and Envoy needs to restart or hot-restart on the same host? There would be no control plane to provide clusters and listeners, so traffic would be lost.

@mattklein123

> @mattklein123 would you agree that adding support in Envoy for snapshotting/checkpointing the current running config is the right thing to do, to allow the data plane to function correctly if the control plane is partitioned and Envoy needs to restart or hot-restart on the same host? There would be no control plane to provide clusters and listeners, so traffic would be lost.

No. I think this is complexity that is not actually needed and is more likely to lead to an outage than the situation it is trying to cover for.

@costinm
Contributor

costinm commented Oct 31, 2017 via email

@htuch

htuch commented Oct 31, 2017

Also, Envoy may be deployed in a diskless scenario (or with read-only storage), so there inherently needs to be a story for configuration recovery that doesn't rely on this.

I think a useful first step in this discussion is to get a clear picture of (and consensus on) which failure scenarios Istio considers important. There have been various ones suggested; can we get a doc written?

@costinm
Contributor

costinm commented Oct 31, 2017 via email

@rshriram
Member

rshriram commented Nov 1, 2017 via email

@costinm
Contributor

costinm commented Nov 1, 2017

Will move the discussion on how to 'fail open', HA, and avoiding cascading failures to a separate document; it's a different topic.

On the 404: I think it is clear that 503-if-cluster-missing and liveness checks will mitigate some of the problems, and both will be needed even with v2. While v2 may return a full and consistent config, it doesn't avoid the interval between startup and getting the initial config, nor the case where a cluster is simply missing due to replication delays.

The main remaining issue is the case Shriram describes, which I don't understand and was not able to reproduce if the service is created in advance. Can you provide more details/an example? And why would the 404 happen on 'weighted routing' if all the services and clusters are created and propagated before the routing rule is applied? The bookinfo samples are targeted at demos and typically don't wait for service/cluster propagation before applying route rules, but in a production environment it is quite reasonable to not route traffic to a new service immediately.

@rshriram
Member

rshriram commented Nov 1, 2017

> While v2 may return a full and consistent config, it doesn't avoid the interval between startup and getting the initial config,

That needs to be solved with health checks.

> nor the case where a cluster is simply missing due to replication delays.

That's the whole point of ADS, where we get to sequence updates to Envoy over a single connection.

Getting a 404 for a newly created service is acceptable. After all, the configuration has to propagate to all callers, and if external clients start accessing it before that's done, a 404 is what a server sends; in other words, end users are actually used to this scenario. What they are not used to is getting a 404 when they split an existing service into multiple clusters for the purpose of version routing. That is the bigger issue that we need to solve; we would not be undertaking such a big effort just for the health-check or initial-startup thing :).

#926 is the first occurrence of this issue (that user is trying to deploy Istio to production :) ). The problem only compounds when we increase the refresh interval from 1s to 60s. While a partial solution is to have a high refresh rate, it burns a lot of CPU and tanks performance (evidence: #1237, where we found "First tested with Istio 0.1.6. Big performance jump when refresh interval was increased from 10ms to 60s.").

@costinm
Contributor

costinm commented Nov 1, 2017

I see the comment that "Yes, a route rule can introduce a new cluster (for a special version, for example)". I'm not familiar enough with the weighted routing rules; do you have any pointers on where the new cluster gets introduced? If we can decouple the creation of clusters/services from route changes, we would have a pretty good fix for this.

It seems we agree that health checks / liveness checks are needed for both v1 and v2, and that having some 503s when a new service is just added is OK.

A short refresh rate is not a solution; everything should work with 60s or more, with the understanding that all route changes must wait for refresh_interval to avoid 503s.

@rshriram
Member

rshriram commented Nov 1, 2017

New clusters are created because of route changes (version split). An existing service's cluster is split into multiple newer clusters, and traffic is split by weights. This involves changing the route config to use weighted clusters and pushing those new clusters to Envoy, and this is where the 404s/503s occur. That is the whole reason for the v2 push API.
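
A small Go sketch of the version-split mechanics just described; the types and the cluster naming scheme are illustrative only, but it shows that the weighted route references clusters that must reach Envoy before (or together with) the route itself:

```go
package eager

// cluster and weightedRoute are illustrative stand-ins for the Envoy config
// entries involved in a version split.
type cluster struct{ Name string }

type weightedRoute struct {
	Clusters map[string]int // cluster name -> traffic weight
}

// splitByVersion turns one service cluster into per-version clusters plus a
// weighted route over them. If the route reaches Envoy before the new
// clusters do (as can happen with independent v1 polls), requests hit an
// unknown cluster and surface as 404s/503s; pushing clusters before routes
// avoids this.
func splitByVersion(service string, weights map[string]int) ([]cluster, weightedRoute) {
	clusters := make([]cluster, 0, len(weights))
	route := weightedRoute{Clusters: make(map[string]int, len(weights))}
	for version, weight := range weights {
		name := service + "|" + version // e.g. "reviews|v2" (illustrative naming)
		clusters = append(clusters, cluster{Name: name})
		route.Clusters[name] = weight
	}
	return clusters, route
}
```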

@ldemailly
Contributor

We should probably consolidate the various open 404/503 issues and move this to istio/istio:

istio/istio#1038
istio/istio#1041
#1422

@costinm
Contributor

costinm commented Nov 3, 2017 via email

@kyessenov
Contributor

Why would you not have labels on VMs? k8s, Consul, Eureka, and whatnot all have labels on workloads. An API change is a drastic measure to fix this problem. The actual solution will happen in the v2 library under the envoy org, once we work out the ADS protocol. This is the solution that will scale, be backwards/future compatible, and fix all the other "inconsistency" issues that are possible with v1. Once it's ready, we'll replace the Pilot core with it and only need to translate Istio config into data-plane config to complete the picture.

@costinm
Contributor

costinm commented Nov 3, 2017 via email

@kyessenov
Contributor

kyessenov commented Nov 3, 2017 via email

@kyessenov
Contributor

Issue moved to istio/istio #1370 via ZenHub
