Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/{grpc,contour}: EDS always returns a value #1110

Merged
merged 7 commits into from
May 21, 2019

Conversation

davecheney
Copy link
Contributor

@davecheney davecheney commented May 21, 2019

Fixes #1091
Updates #420
Updates #439

Problems with Contour not returning a value to Envoy when asked specifically have occurred several times in the past. When they have occurred, because I didn't understand the underlying issue, they were fixed in a piecemeal fashion. This PR addresses the root cause of the issue and hopefully we'll never have to talk about this again.

What caused the issue?

Envoy can ask Contour two different questions; the first is "please tell me the contents of the xDS table", the second is "please tell me the contents of these specific named xDS entries". The former is performed when a xDS Discovery Request supplies no resource_name hints. The latter uses resource_name hints to say "I want to know about these specific resources".

Naively one might think that the second case is a subset of the first, send me what you know about these specific resources is the union of the contents of the xDS cache and the names supplied. The first case is even simpler, it's the union of the cache with itself. However it's not so simple.

The first question, "tell me everything you have it the cache" is simple; everything in the cache is turned into grpc and sent over the wire to Envoy. The non intersecting set of things which are not in the cache are not sent to Envoy because that makes NO SENSE. However in the second case, for some resource types, Envoy will ask for a resource by name and will not be happy if the resource is not provided. This was the cause of #439, and #1091.

In the latter case a Service had been scaled to zero. The service was present in the api server so a valid ingress document pointed to it. Thus Contour created a Cluster record and sent that to Envoy, which in turn asked EDS for the members of the cluster we just told it existed. However, because the service was scaled to zero, the endpoint object was blank, there were no valid endpoints, so nothing was entered into the EDS cache on Contour's side. Here's where things get really upsetting.

xDS does not provide any way for a server to reject a request. Well, you can close the connection, but that's not going to help, Envoy will just reconnect and ask again, and all the while, it will be blocked on startup. So, when Envoy asks for an EDS record we must provide it, even if we don't have any information about it. This is actually a more general problem than EDS, but EDS is more susceptible to this problem because of the background described in #1091 (comment).

How does this PR fix the issue?

This PR does two things. First, it recognises that the two questions Envoy are different--the DiscoveryRequest with resource_name hint is not a subset of tell me everything in the cache--this this PR removes the Values(...string) method on grpc.Resource and replaces it with Contents() and Query([]string). The former clearly returns the contents of the cache, the latter takes a set of resources to query.

Secondly, on a per resource basis, the Query implementation will return blank results for records it does not hold in its cache when safe to do so. It turns out that for CDS, LDS, and SDS, there is no safe way to construct a blank record, so we cannot provide one to Envoy if it asks for one. For EDS and RDS, there is a way to construct a blank record given the query name provided, which we do, and this satisfies Envoy, unblocking the startup process.

In addition to tests at the internal/contour and internal/e2e level I have also tested it in my cluster repeatedly scaling a service to zero, restarting Envoy, and validating that prior to this change Envoy would block on startup when a Service with zero active pods was referenced via an Ingress record. Post this change, Envoy starts immediately.

Signed-off-by: Dave Cheney <dave@cheney.net>
Signed-off-by: Dave Cheney <dave@cheney.net>
There is no valid response for an unknown secret so we must return
nothing in that case.

Signed-off-by: Dave Cheney <dave@cheney.net>
Unlike SecretCache, if there is no route registered we return a blank
RouteConfiguration entry for the name provided to Query. This is safe to
do because RouteConfigurations have a simple blank value, just their
name.

Signed-off-by: Dave Cheney <dave@cheney.net>
Similar to SecretsCache there is no valid blank valid for an unknown
listener. This is because each listener has a required address field.
Making up a value there could cause Envoy to open that port, or worse,
conflict with an already open port. Because there is no safe valid
value, when querying for an unknown listener, return nothing.

Signed-off-by: Dave Cheney <dave@cheney.net>
As with Listener and Secret, there is no valid blank value for a Cluster
that can be supplied knowing just its name. In this case, if something
is not registered with the cluster, Query will return nothing rather
than a blank value like it does for RouteCache.Query.

Signed-off-by: Dave Cheney <dave@cheney.net>
Like RouteCache, but unlike Secret, Cluster, or Listener Caches,
EndpointTranslator.Query will return on answer per name supplied. Where
an endpoint set is not registered, we return a blank
ClusterLoadAssignment value using the query name supplied.

Signed-off-by: Dave Cheney <dave@cheney.net>
Copy link
Member

@stevesloka stevesloka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Fantastic work @davecheney! 🎉

@davecheney davecheney merged commit 039de06 into projectcontour:master May 21, 2019
@davecheney
Copy link
Contributor Author

Thanks for the prompt review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Envoy hangs when starting after upgrading to Contour 0.12.0
2 participants