
Support Agent Caching for Service Discovery Results #4541

Merged: 8 commits merged into f-envoy from cache-sd, Sep 6, 2018
Conversation

@banks banks (Member) commented Aug 17, 2018

This PR ended up being larger than I imagined. I was originally going to limit it to the Catalog and Health changes, which needed very little change to the Cache package since they have similar semantics.

I then added the groundwork for Prepared Queries, which can't block, and ended up doing the whole thing as there was no good place to stop.

So this PR includes:

  • /catalog/services/:service now supports caching with ?cached param
  • /health/services/:service now supports caching with ?cached param
  • /query/:name/execute now supports caching with ?cached param
    • since prepared queries don't support blocking, this endpoint caches without blocking support which required a bunch of changes to the cache framework.
    • We also support a subset of Cache-Control headers to allow clients to manage the staleness of the cached response, since we need to trade off keeping cached results around for a long time for failure tolerance against not serving 3-day-old responses to prepared queries on the happy path. There are a bunch of tests and documentation for this.
  • documentation changes for every API endpoint to add cache support to the table of supported options.

I suggest reviewers first read the documentation on caching to understand the intended semantics of caching before getting too involved in the code changes.
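To make the intended client-side usage concrete, here is a rough sketch (mine, not part of the PR) of how a client might opt in via the new ?cached parameter and a Cache-Control request header. The agent address and the prepared query name "geo-db" are invented, and the X-Cache/Age response headers are the ones the caching documentation describes; treat the exact header values as assumptions.

	// Rough usage sketch (not from the PR): opting in to agent caching for a
	// prepared query execution. The agent address and the query name "geo-db"
	// are invented for illustration.
	package main

	import (
		"fmt"
		"log"
		"net/http"
	)

	func main() {
		req, err := http.NewRequest("GET", "http://127.0.0.1:8500/v1/query/geo-db/execute?cached", nil)
		if err != nil {
			log.Fatal(err)
		}
		// Accept a cached result up to 30 seconds old on the happy path; if the
		// servers are unreachable, keep serving the stale cached result for up
		// to an hour instead of returning an error.
		req.Header.Set("Cache-Control", "max-age=30, stale-if-error=3600")

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			log.Fatal(err)
		}
		defer resp.Body.Close()

		// The agent reports whether the cache was used and how old the entry is.
		fmt.Println("X-Cache:", resp.Header.Get("X-Cache"), "Age:", resp.Header.Get("Age"))
	}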

@mitchellh mitchellh (Contributor) left a comment

😅 Big PR, but I get why.

This looks awesome. I'm really glad to see the framework for generic agent caching (the ?cached param) start making it in, too. This is looking really good.

I left some inline comments, but nothing major stood out. The positive thing I saw is that you didn't have to alter any of the previous agent/cache tests, which I feel is really important here. That's a good sign, since a lot of those tests cover subtle behavior and bugs we found in the past.

The one meta-comment I'd make is: maybe we should consider adding information about agent caching in the general docs (not API docs) as well, since I think it's a big benefit. I was thinking perhaps 1) on the architecture page and 2) maybe even a page under "Agent", though I'm not sure since it'd probably just repeat a lot of the API. :)

@@ -84,6 +84,7 @@ type typeEntry struct {
type ResultMeta struct {
	// Return whether or not the request was a cache hit
	Hit bool
	Age time.Duration
@mitchellh (Contributor):

I would add docs to this. It may be as simple as "since the value was last loaded into the cache", but clarity that it's the age of the entry in the cache vs. when the value was set on the server is an important detail. If this is always 0 for background-refreshing ones, it's important to note that too.
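For illustration, a doc comment along the lines being asked for might read something like this (the wording is mine, not the PR's final text):

	// Age identifies how long ago this value was fetched from the servers and
	// stored in the local agent cache. It says nothing about when the data was
	// written on the servers, only how long it has been sitting in this cache.
	Age time.Duration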

@@ -631,12 +632,15 @@ func TestCacheGet_expire(t *testing.T) {
require.Equal(42, result)
require.False(meta.Hit)

time.Sleep(5 * time.Millisecond)
@mitchellh (Contributor):

Sleeps at this granularity will be flaky in CI, unfortunately. I usually only go in increments of 50ms. Since all the tests are set with t.Parallel it usually doesn't noticeably increase the time for tests since... Amdahl's law.

@banks banks (Member, Author) commented Aug 20, 2018:

Agreed, although I think it's OK here - notice that the only reason for the sleep is not to wait for any background work to complete, but just to ensure that we can measure at least 5ms of age between the fetch this thread previously invoked and the following one. So I don't think it will make things flaky; it's just sanity checking that the age actually increases by a meaningful amount between two serial calls.

I will add a comment if that isn't clear, though, as generally for "wait for background" purposes this kind of sleep is an unreliable fudge factor.

// MaxAge if set limits how stale a cache entry can be. If it is non-zero and
// there is an entry in cache that is older than specified, it is treated as a
// cache miss and re-fetched. It only really makes sense for cache types that
// have Refresh: false.
@mitchellh (Contributor):

"It only really makes sense" is unclear to me. Does it do anything otherwise? etc. Docs note.

q.QueryIDOrName,
q.Limit,
q.Connect,
}, nil)
@mitchellh (Contributor):

This probably doesn't matter, but this will imply ordering on the hash value. The case this might matter in theory is if we ever persist the cache (we do not now) to support "warm starts" of the client.

If you want to nip this in the bud, you can tag a field and hashstructure will do the right thing. You may also want to consider just a full anonymous struct with fields set as well.

hashstructure.Hash(struct {
  Fields []interface{} `hash:"set"`
}{
  Fields: []interface{}{ ... },
})

Might not matter now, but it's probably worth thinking about both ways.

@banks (Member, Author):

Interesting.

Yeah, I'd assumed that changing the order here would invalidate current caches - could certainly call that out.

In the case of a full anonymous struct, it seems that would not solve the problem - if we changed the order of the fields in the source, it would presumably change the order in which reflection iterates them, right? So changing source order in either case would invalidate current caches? At least that seems to be the case given the following, which may not be what you mean:

hashstructure.Hash(struct {
	QueryIDOrName string
	Limit         int
	Connect       bool
}{
	QueryIDOrName: q.QueryIDOrName,
	Limit:         q.Limit,
	Connect:       q.Connect,
})

And someone coming along later and adding a field not at the end, like this, would cause the same problem as with the plain list I have here, right?

hashstructure.Hash(struct {
	QueryIDOrName string
	Limit         int
	OtherThing    bool
	Connect       bool
}{
	QueryIDOrName: q.QueryIDOrName,
	Limit:         q.Limit,
	OtherThing:    q.OtherThing,
	Connect:       q.Connect,
})

I think for now, given that it doesn't matter, a comment about ordering will suffice, since if we later rely on order (e.g. for persistence) we could make the change then?

@banks banks (Member, Author) commented Aug 20, 2018

@mitchellh I added comments for all the noted places and a paragraph to the 10,000 foot view on the architecture page. I kinda think it's worth mentioning there as it is a bit of an architectural change for fault tolerance etc but I could easily be persuaded to move it somewhere else.

I kinda agree that a whole other section under Agent would largely repeat the API docs but open to suggestions there. @pearkes do you have an opinion?

The conflicts are due to another docs thing that touched an API endpoint that I just merged. I'll fix it later since I realised there is another change needed to clean up the merge there first.

banks added a commit that referenced this pull request Aug 21, 2018
Rename agent/proxy package to reflect that it is limited to managed proxy processes

Rationale: we have several other components of the agent that relate to Connect proxies, for example the ProxyConfigManager component needed for Envoy work. Those things are pretty separate from the focus of this package so far, which is only concerned with managing external proxy processes, so it's not a good fit to put code for that in here, yet there is a naming clash if we have other packages related to proxy functionality that are not in the `agent/proxy` package.

Happy to bikeshed the name. I started by calling it `managedproxy` but `managedproxy.Manager` is especially unpleasant. `proxyprocess` seems good in that it's more specific about purpose but less clearly connected with the concept of "managed proxies". The names in use are cleaner though e.g. `proxyprocess.Manager`.

This rename was completed automatically using golang.org/x/tools/cmd/gomvpkg.

Depends on #4541
@mkeeler mkeeler (Member) left a comment

In general this looks pretty good although I have a few questions and comments.

  1. Is there a reason not to use the same caching mechanisms for the service discovery aspects of DNS? It seems like we could use the same cache too.

  2. If we have a cluster with 30k nodes and each one uses the /v1/catalog/service endpoint for getting information about 3 different services does that mean that there will be 90k blocking queries running on the leader to do the background refresh? If so, this seems less than ideal.

  3. Many of the new cache types do the exact same thing: cast the cache request to a DC-specific request, update the min index, and issue the RPC. Could this be refactored a little so that we don't implement the same thing 4 times? I think the main issue is that for the RPC reply we need a real object to decode into. We would have to use factory functions to create the reply vars before handing them to the RPC call. Otherwise the only thing to track is just the name of the RPC endpoint.

  4. Uncached queries do not update the cache. So for example if you cached a prepared query execution and something else came along and did it uncached, my expectation would be that the next cached execution should just return the value that the uncached query did. As it is it will return potentially more stale data due to a disconnect between the caching mechanism and the uncached queries. This concern applies to all the caching. Would it be better to be able to request a direct fetch via the caching infrastructure and then it could do the fetch, cache the result and hand it back?

func (c *PreparedQuery) Fetch(opts cache.FetchOptions, req cache.Request) (cache.FetchResult, error) {
	var result cache.FetchResult

	// The request should be a DCSpecificRequest.
@mkeeler (Member):

nit: Comment needs updating

@@ -551,6 +616,11 @@ func (s *HTTPServer) parseConsistency(resp http.ResponseWriter, req *http.Reques
		fmt.Fprint(resp, "Cannot specify ?stale with ?consistent, conflicting semantics.")
		return true
	}
	if b.UseCache && b.RequireConsistent {
		resp.WriteHeader(http.StatusBadRequest)
@mkeeler (Member):

Should this function instead return an error and be refactored to return the BadRequestError?

Maybe not as this change would need to be propagated through a couple other places. Just a thought

@banks (Member, Author):

Could do, but I followed the precedent of all the other helpers in this file - I think if we want to clean that up we should do it all at once, separately.

@mkeeler (Member):

Yeah, that's probably best.


@banks banks (Member, Author) commented Aug 22, 2018

Hmm great questions @mkeeler. Answers inline:

  1. Is there a reason not to use the same caching mechanisms for the service discovery aspects of DNS? It seems like we could use the same cache too.

We could do that later. I mostly didn't consider it here because:

  1. this is already huge and it's not necessary for Envoy or Connect in general
  2. DNS has always had some level of caching support via TTL depending on the client
  3. It's not obvious how DNS queries would signal that caching is acceptable in the same way - I don't think we can use exactly the same mechanism transparently because the semantics of the configured DNS TTL would be silently discarded since the internal cache has a different TTL.
  2. If we have a cluster with 30k nodes and each one uses the /v1/catalog/service endpoint for getting information about 3 different services does that mean that there will be 90k blocking queries running on the leader to do the background refresh? If so, this seems less than ideal.

Yep. Note that it's exactly the same as now though without the cache. If anything it's possible we improve the situation because if two processes on a host are blocking on the same result and we switch them to ?cached, we now only make one shared blocking query from the cache rather than two separate identical ones. Also note that these are opt-in so just rolling this out will make zero change to current workloads.

It is theoretically possible that we could get more sophisticated in the future and implement something like consul-template's deduplication here, but I doubt we will because that is so complex and brittle; a better option is likely a more significant change to how blocking queries work to make them much more efficient.

  3. Many of the new cache types do the exact same thing: cast the cache request to a DC-specific request, update the min index, and issue the RPC. Could this be refactored a little so that we don't implement the same thing 4 times? I think the main issue is that for the RPC reply we need a real object to decode into. We would have to use factory functions to create the reply vars before handing them to the RPC call. Otherwise the only thing to track is just the name of the RPC endpoint.

We could maybe reduce duplication a bit here, but it's mostly, as you said, type marshalling, and the request and response types are unique in every case. I personally think it's easier to reason about with this small bit of boilerplate than trying to come up with an abstraction (which may end up being a bad one) just to save a few lines. In practice virtually the only line that is not type specific is the RPC call one, so it's hard to see how that could be factored out without something gross like generating code from some generic template!

Example, type specific lines annotated:

	var result cache.FetchResult

	// The request should be a PreparedQueryExecuteRequest.
	reqReal, ok := req.(*structs.PreparedQueryExecuteRequest) // request type specific
	if !ok {  // request type specific (error handling for the type assertion)
		return result, fmt.Errorf(
			"Internal cache failure: request wrong type: %T", req)
	}

	// Fetch
	var reply structs.PreparedQueryExecuteResponse  // response type specific
	if err := c.RPC.RPC("PreparedQuery.Execute", reqReal, &reply); err != nil {
		return result, err
	}

	result.Value = &reply
	result.Index = reply.QueryMeta.Index // response type specific

	return result, nil

Note that these are pretty minimal examples too - in other cases you might need to manipulate query params on request or response too which are also all type specific.

  4. Uncached queries do not update the cache. So for example if you cached a prepared query execution and something else came along and did it uncached, my expectation would be that the next cached execution should just return the value that the uncached query did. As it is it will return potentially more stale data due to a disconnect between the caching mechanism and the uncached queries. This concern applies to all the caching. Would it be better to be able to request a direct fetch via the caching infrastructure and then it could do the fetch, cache the result and hand it back?

That's a good point to clarify.

I've tried to make this entire mechanism opt-in. Caching everything and just not using the results by default could have a huge memory impact on existing workloads - imagine how large the service discovery responses are for some of our busiest users, suddenly all kept in memory on every agent permanently for no benefit on upgrade. That seems much worse than the issue here.

Even worse, users that consume service discovery APIs will have not just result memory consumed but background blocking query load added to their servers updating a result they never use.

Finally, the caching infrastructure as a read-through would break the current behaviour around consistency semantics - if you issue a request with ?stale and also have max_stale or discovery_max_stale configured then we do a stale read first, check its LastContact time from the leader, and potentially re-issue a non-stale read if it's too high. Trying to make that semantic stay the same while querying through a cache seems super unpleasant.

It would be possible to only store the "read-through" responses if there is already a staler version in the cache. But in general this problem only applies to non-background-refreshing cache entries, since background refresh ones will be as up to date as the uncached response anyway, barring small timing issues. The only non-background refresh endpoint currently is prepared queries. This PR in fact already implements some of the moving parts needed to support that case - calls without ?cached could just be translated into cache calls with MustRevalidate = true. But we'd also need to add support for a no-cache cache control directive or similar that tells the cache not to actually store the result unless it's replacing a staler one. Do you think that's important to add? It seems like an edge case that we can just document for now?

@mkeeler mkeeler (Member) commented Aug 22, 2018

DNS Caching

For DNS caching, I was thinking that clients wouldn’t do anything to opt-in but rather the agent would be configured to enable using the cache just like we have DNS configuration to enable stale data. Especially with the background blocking query mechanisms, it seems like the cache could be very helpful there as well. I understand not wanting to include that here as it would make the changes that much larger but it could be something we should consider for the future.

Blocking Query Performance

As for the performance of the blocking queries, I agree we are no worse off in that the overall number of queries (blocking and non-blocking) will be less. However, there is a potential for more blocking queries in flight, and with the current implementation that will turn into 1 goroutine per blocking query. You are right that it is always better if the HTTP clients are using blocking queries themselves, but if they are just periodically getting the latest data without blocking then we are worse off as far as long-running queries are concerned. Goroutines are nice but no scheduler has negligible overhead when dealing with 100K+ tasks. This isn't really a problem with your implementation but rather a deficiency in our current implementation of blocking queries. We really ought to have a worker model for blocking queries and be able to better control our goroutine usage. My real concern is that I have seen us continually add more things that are doing blocking queries behind the scenes, and that those operations, coupled with users' usage of consul-template or other clients utilizing blocking queries, are going to overload the leader servers like we have seen in a few select cases. The issues have cropped up a few times so far and are hard to work around without telling users to just not use the software like they are. My fear is that if we continue utilizing this more prior to actually fixing blocking query performance we are going to run into more and more issues. I don't think that is something you need to fix in this PR but we may want to prioritize updating the blocking query mechanism for the near future.

Cache Type Generalization

Here is a good example of what I was talking about:

func (c *CatalogServices) Fetch(opts cache.FetchOptions, req cache.Request) (cache.FetchResult, error) {
	var result cache.FetchResult
	// The request should be a DCSpecificRequest.
	reqReal, ok := req.(*structs.ServiceSpecificRequest)
	if !ok {
		return result, fmt.Errorf(
			"Internal cache failure: request wrong type: %T", req)
	}
	// Set the minimum query index to our current index so we block
	reqReal.QueryOptions.MinQueryIndex = opts.MinIndex
	reqReal.QueryOptions.MaxQueryTime = opts.Timeout
	// Fetch
	var reply structs.IndexedServiceNodes
	if err := c.RPC.RPC("Catalog.ServiceNodes", reqReal, &reply); err != nil {
		return result, err
	}
	result.Value = &reply
	result.Index = reply.QueryMeta.Index
	return result, nil
}
func (c *ConnectCARoot) Fetch(opts cache.FetchOptions, req cache.Request) (cache.FetchResult, error) {
	var result cache.FetchResult

	// The request should be a DCSpecificRequest.
	reqReal, ok := req.(*structs.DCSpecificRequest)
	if !ok {
		return result, fmt.Errorf(
			"Internal cache failure: request wrong type: %T", req)
	}

	// Set the minimum query index to our current index so we block
	reqReal.QueryOptions.MinQueryIndex = opts.MinIndex
	reqReal.QueryOptions.MaxQueryTime = opts.Timeout

	// Fetch
	var reply structs.IndexedCARoots
	if err := c.RPC.RPC("ConnectCA.Roots", reqReal, &reply); err != nil {
		return result, err
	}

	result.Value = &reply
	result.Index = reply.QueryMeta.Index
	return result, nil
}

The only difference between these is the function receiver, the reply type and the RPC endpoint. We could implement this in a generic way like:

package cachetype

import (
	"fmt"

	"github.com/hashicorp/consul/agent/cache"
	"github.com/hashicorp/consul/agent/structs"
)

type GenericBlockingQuery struct {
	NewReply func() interface{}
	Endpoint string
	RPC      RPC
}

func (c *GenericBlockingQuery) Fetch(opts cache.FetchOptions, req cache.Request) (cache.FetchResult, error) {
	var result cache.FetchResult

	// The request should be a DCSpecificRequest.
	reqReal, ok := req.(*structs.DCSpecificRequest)
	if !ok {
		return result, fmt.Errorf(
			"Internal cache failure: request wrong type: %T", req)
	}

	// Set the minimum query index to our current index so we block
	reqReal.QueryOptions.MinQueryIndex = opts.MinIndex
	reqReal.QueryOptions.MaxQueryTime = opts.Timeout

	// Fetch
	reply := c.NewReply()
	if err := c.RPC.RPC(c.Endpoint, reqReal, reply); err != nil {
		return result, err
	}

	result.Value = reply
	result.Index = reply.(*QueryMeta).Index
	return result, nil
}

func (c *GenericBlockingQuery) SupportsBlocking() bool {
	return true
}

Then during registration:

a.cache.RegisterType(cachetype.CatalogServicesName, &cachetype.GenericBlockingQuery{
	Endpoint: cachetype.CatalogServicesEndpoint,
	NewReply: func() interface{} { return &structs.IndexedServiceNodes{} },
	RPC:      a,
}, &cache.RegisterOptions{
	// Maintain a blocking query, retry dropped connections quickly
	Refresh:        true,
	RefreshTimer:   0 * time.Second,
	RefreshTimeout: 10 * time.Minute,
})

This would eliminate the specific implementations for catalog services, CA roots, intention matches and health services and replace them with a single common implementation. I am not advocating for generalizing all of them, as some, such as prepared queries and leaf certs, need special handling, but for those other 4 where the fetch functions are basically identical it seems like we could pull them into a common implementation and specialize only the types that actually need different handling. Personally I don't think the generic implementation is any harder to reason about, but I could be biased.

Uncached/Cached Query Interaction

As for uncached queries not updating the cache, I think your reply makes sense. If you aren't using cached queries then you shouldn't incur the extra overhead, in terms of memory and a little CPU for cache management, of pushing the replies of uncached queries into the cache. I think maybe we just need to document the behavior really well. Adding something that says "uncached queries will not update the cache even in the case where there are also cached queries executed for the same data" would suffice.

@banks banks (Member, Author) commented Aug 22, 2018

Cool @mkeeler, I think we agree on most of that then :) I agree blocking query usage is becoming more and more of a hot topic, which is why I want to work on optimizing it.

Cache Type Generalization

Hmm, your example misses that the request types are different too (partly because of my copy-paste comment typo, probably :D). That means all of this would need to be behind an interface function or something too:

	// The request should be a DCSpecificRequest.
	reqReal, ok := req.(*structs.DCSpecificRequest)
	if !ok {
		return result, fmt.Errorf(
			"Internal cache failure: request wrong type: %T", req)
	}

	// Set the minimum query index to our current index so we block
	reqReal.QueryOptions.MinQueryIndex = opts.MinIndex
	reqReal.QueryOptions.MaxQueryTime = opts.Timeout

Once you account for that too, I'm just not sold that the extra abstraction buys anything real - I don't anticipate this code being touched often once written, or changed in a way where doing a similar edit in 4 or 5 files is a real problem in the grand scheme of things. If we end up finding there is a whole other large bit of code we have to add to all of them then maybe that would be a good time? Or many cache types with identical request/response types getting added?
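To illustrate the point, something like the following (purely hypothetical, names invented) is the kind of extra interface each request type would have to implement before a generic Fetch could set its blocking options without knowing the concrete struct:

	package cachetype

	import (
		"time"

		"github.com/hashicorp/consul/agent/cache"
	)

	// blockingRequest is a hypothetical interface sketch: each concrete request
	// type (DCSpecificRequest, ServiceSpecificRequest, ...) would need a method
	// like this before GenericBlockingQuery could set blocking parameters on it.
	type blockingRequest interface {
		cache.Request

		// SetBlockingOptions applies the cache's current index and timeout so
		// the underlying RPC blocks waiting for changes.
		SetBlockingOptions(minIndex uint64, timeout time.Duration)
	}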

banks added a commit that referenced this pull request Aug 23, 2018
Rename agent/proxy package to reflect that it is limited to managed proxy processes (#4550)

* Rename agent/proxy package to reflect that it is limited to managed proxy processes

Rationale: we have several other components of the agent that relate to Connect proxies, for example the ProxyConfigManager component needed for Envoy work. Those things are pretty separate from the focus of this package so far, which is only concerned with managing external proxy processes, so it's not a good fit to put code for that in here, yet there is a naming clash if we have other packages related to proxy functionality that are not in the `agent/proxy` package.

Happy to bikeshed the name. I started by calling it `managedproxy` but `managedproxy.Manager` is especially unpleasant. `proxyprocess` seems good in that it's more specific about purpose but less clearly connected with the concept of "managed proxies". The names in use are cleaner though e.g. `proxyprocess.Manager`.

This rename was completed automatically using golang.org/x/tools/cmd/gomvpkg.

Depends on #4541

* Fix missed windows tagged files
@banks banks (Member, Author) commented Aug 23, 2018

Go routines are nice but no scheduler has negligible overhead when dealing with 100K+ tasks.

For the record, the problem is worse than you imply on the server already - each blocking watch, depending on the number of radix tree nodes it's watching, can end up with a whole tree of goroutines. JC and friends have noticed the effect of that in their performance analysis and it's yet another motivator (other than general efficiency of data transfer and encoding/decoding) to want to take a fundamental look at blocking.

So the 90k blocking queries in your example are already probably going to be closer to 500k-plus goroutines on the leader. If you use ?stale it gets a bit better, as the load can be split between other servers, including non-voting read replicas. There is a decent argument that all our background blocking should do that by default here, which is a sensible change I'll make.

@slackpad slackpad (Contributor) commented:
This would definitely be a separate change but some version of this for DNS could be really useful. A lot of client libraries don't respect DNS TTLs so I remember seeing sad setups with thousands of qps of RPCs coming from DNS lookups (often the same one over and over). For queries coming from DNS I wouldn't bother with any blocking query under the hood (just let those results sit in the cache for the TTL with no invalidation). You could even use the existing TTL in dns_config to set the cache TTL based on whether it's a node or service lookup (maybe use like half that value or similar since there'd be no active invalidation). This would probably reduce many of those ill-behaved clients down to 1 qps of actual RPCs.

The blocking query goroutine fanout because of the fine-grained radix tree stuff is pretty interesting to optimize. This was tuned with a benchmark but was still pretty arbitrary - https://github.com/hashicorp/go-memdb/blob/master/watch-gen/main.go#L12-L17. I'll be interested to see where things go with that one :-)

@mkeeler mkeeler (Member) commented Aug 23, 2018

@banks Not having the cache types abstraction is perfectly fine. I thought the request type was already being abstracted to a cache.Request and then coerced into a DCSpecificRequest, which all of those contain. If you think it would be better left alone, that's fine with me. I am not hard set on it. Given that, I think all my concerns have been addressed.

@banks banks (Member, Author) commented Aug 23, 2018

Hey @slackpad, I love your drive-by context sharing ❤️ .

To share a bit more of our motivation here: this is not really a performance optimisation so much as a way to have Consul agents able to manage external proxies which need to have their service discovery "pushed" in. We could solve that part alone with blocking queries and no caching, but then proxies like the built-in one that do resolve discovery at connection time break if the servers are unavailable, so this was a nice way to solve for that and get some performance benefit for other clients too if they want to use it.

But I totally agree DNS is a great feature to build on top of this cache later, now that we have it. We may also allow caching for KV eventually too.

@banks banks (Member, Author) commented Aug 23, 2018

@mkeeler thanks. I'm going to update this to fix the comment typo and also to use stale queries with a time bound by default for service discovery, to spread this blocking query load.

I think you're still "requesting changes" here...

@pierresouchay pierresouchay (Contributor) left a comment

First, congrats, because this is a very interesting move. I actually did start implementing something less well designed for DNS (but with only a basic Most Frequently Used cache) as we have a few operational issues on this side (especially regarding xDC calls).

I have a few remarks however:

From a theoretical POV, requesting stale reads is similar to using non-leader servers as a cache. From an API point of view, this is the exact same thing, so not having the same mechanism as ?stale&max_stale=5m for caching is IMHO a mistake:

when ?stale&max_stale=5m reaches a follower, if the data is not available on the follower side and the leader is not available, you get an error. It means that if you use ?cached&max_stale=5m in the same way, you should either get a cached result that is known to be fresher than 5m (whether or not a call has actually been made is pure detail), or get an error if that is not the case.

Cache invalidation is hard, but adding stale-if-error makes the whole thing complex to understand as an API, IMHO, without immediate benefit (or at least none that I understand).

In our shop, we set max_stale to 72h (for DNS for instance, but also with the discovery_max_stale parameter) and we implement monitoring to be notified when we have more than 30s of delay between servers. Having 72h is almost equivalent to stale-if-error=no, but the semantics are easier for the caller to grasp.

Furthermore, as a Consul administrator, I really would like to use the same default mechanism as discovery_max_stale (see #3920) to replace stale by cache by default. Having a single semantic (an error occurs if the cached data is older than max_stale) would really make it useful and easier for users to understand.

I also see the appeal of re-using existing HTTP cache mechanisms, but as sad as caching on the Internet is (basically, hard to get right), I am not convinced by the use of Max-Age, which is very often overused.

I also wonder how blocking queries work with this cache mechanism (I mean on the client side); IIUC, the result would be strictly equivalent, right?

Kind Regards

@@ -551,6 +616,11 @@ func (s *HTTPServer) parseConsistency(resp http.ResponseWriter, req *http.Reques
		fmt.Fprint(resp, "Cannot specify ?stale with ?consistent, conflicting semantics.")
		return true
	}
	if b.UseCache && b.RequireConsistent {
		resp.WriteHeader(http.StatusBadRequest)
@pierresouchay (Contributor):

I have a few remarks:

  • to the administrator, it would be cool to use the DiscoveryMaxStale mechanism for enabling caching by default
  • to the API user, having max-age as well as max_stale is confusing: as a non-expert Consul user, I would mix up both. max_stale can semantically be seen as max-age for a non-leader server (if you think of non-leader servers as a kind of caching mechanism). I wonder if it would be nice to merge the 2 concepts, because those are actually very close semantically for a developer

@@ -682,6 +752,9 @@ func (s *HTTPServer) parseInternal(resp http.ResponseWriter, req *http.Request,
	if s.parseConsistency(resp, req, b) {
		return true
	}
	if parseCacheControl(resp, req, b) {
@pierresouchay (Contributor):

Same remark here; it might be interesting to think about whether max_stale is equivalent to max-age...

	if err != nil {
		// Don't return error if StaleIfError is set and we are within it and had
		// a cached value.
		if raw != nil && m.Hit && args.QueryOptions.StaleIfError > m.Age {
@pierresouchay (Contributor):

I would not introduce StaleIfError, as it makes the whole thing VERY complicated to understand from a user's point of view.

@banks banks (Member, Author) commented Sep 4, 2018

@pierresouchay thanks for taking a look!

Some great points - most of your suggestions are things we considered in our internal RFC process and rejected for a few reasons.

I'll sum up quickly though:

  • The immediate goal here is:
    1. to make Connect work for proxies that need service discovery information "pushed" to them (e.g. Envoy),
    2. for the built-in proxy (and other SD clients) to ensure SD info is cached locally so that service discovery still works if there is a server outage for some time.
  • We want ?cached to be explicit since although ?stale provides no guarantee of freshness, it does imply a stronger guarantee than we can provide. Specifically:
    • In general, servers being behind the leader for more than a few milliseconds is considered a failure that is likely monitored for and fixed by operators before it becomes very significant. The same is likely not true for any agent, which means that in practice it would be much more likely to return very stale results from an agent cache, which is not the semantic clients expected when they opted to use ?stale.
    • ?stale currently has an API guarantee of both supporting max_stale= as you point out, and also returning the staleness as a time lag from follower to leader in raft (X-Consul-LastContact). Agent caches can't really match that API with identical semantics - the Age and similar headers introduced here are somewhat equivalent as you point out, but they are not exactly equivalent, and changing the semantics/definition of the existing API seems like surprising behaviour for upgrading users.

There are lots of future features and optimisations possible for other use cases as you suggest but they are out of scope for now as this is so huge already.

I really would like to use the same default mechanism as discovery_max_stale

Yeah I considered this. I think it certainly makes some sense but I want to leave it as a future enhancement as it's not needed for the original intent of this work.

I would not introduce StaleIfError this as it renders the whole stuff VERY complicated to understand on a user point of view

StaleIfError is needed to meet the second design goal I stated for this work - to allow "fail static" behaviour for the built-in proxy and other clients. In fact it's only needed for prepared queries in this case, since the blocking refresh endpoints support "fail static" already. Without it there is no way to have prepared queries that update frequently in the happy case when servers are up, but still return cached values for a much longer period if the servers are unreachable.

I do agree that there is a lot of extra complexity added here but it seems cleaner than "breaking" the existing stale semantics. I also only expect expert level users who are prepared to read through the details to want to use this. Caching is all about tradeoffs and you should really understand them in detail to make a good decision about whether they meet your needs.

I tried to stick with standard HTTP caching directives and semantics for the new features to make it more clear for people familiar with configuring HTTP caching in CDNs and similar and to make it clearer that this is a whole different mechanism than just the "stale" consistency optimisation which is a raft thing.

Does that make sense?

@banks banks (Member, Author) commented Sep 5, 2018

That commit:

  • Fixes comment typo
  • Documents the potential confusion that uncached requests don't update the cached value
  • Makes all these service discovery RPCs use AllowStale to spread their load, since no freshness guarantees apply when using the cache anyway.

I think this is still "approved" but any further feedback welcome.

We likely won't merge this directly, as with our refined release schedule it makes sense to move this work into a feature branch rather than add it piecemeal to master, so I'll do that and attempt to rebase this there before it's merged tomorrow.

TODO:
 - docs for all endpoints to include cache support info
…ests to AllowStale so we can spread service discovery load across servers.
@banks banks changed the base branch from master to f-envoy September 6, 2018 10:18
@banks banks (Member, Author) commented Sep 6, 2018

I rebased this onto a new feature branch off the current master so we can merge Envoy-supporting PRs without putting them on the release path for the 1.2.3 patch release until it's all done.

Given the approvals here, I'll merge it to f-envoy now. If there are any additional concerns please comment, or bring them up in a future review when the feature branch is merged.

@banks banks merged commit b53d0ea into f-envoy Sep 6, 2018
@banks banks deleted the cache-sd branch September 6, 2018 10:34
banks added a commit that referenced this pull request Sep 6, 2018
Rename agent/proxy package to reflect that it is limited to managed proxy processes
banks added a commit that referenced this pull request Sep 6, 2018
* Rename agent/proxy package to reflect that it is limited to managed proxy processes

* Fix missed windows tagged files
banks added a commit that referenced this pull request Oct 4, 2018
* Add cache types for catalog/services and health/services and basic test that caching works

* Support non-blocking cache types with Cache-Control semantics.

* Update API docs to include caching info for every endpoint.

* Comment updates per PR feedback.

* Add note on caching to the 10,000 foot view on the architecture page to make the new data path more clear.

* Document prepared query staleness quirk and force all background requests to AllowStale so we can spread service discovery load across servers.
banks added a commit that referenced this pull request Oct 4, 2018
* Rename agent/proxy package to reflect that it is limited to managed proxy processes

* Fix missed windows tagged files
banks added a commit that referenced this pull request Oct 9, 2018
* Add cache types for catalog/services and health/services and basic test that caching works
banks added a commit that referenced this pull request Oct 9, 2018
* Rename agent/proxy package to reflect that it is limited to managed proxy processes
banks added a commit that referenced this pull request Oct 10, 2018
* Add cache types for catalog/services and health/services and basic test that caching works
banks added a commit that referenced this pull request Oct 10, 2018
* Rename agent/proxy package to reflect that it is limited to managed proxy processes
banks added a commit that referenced this pull request Oct 10, 2018
* Add cache types for catalog/services and health/services and basic test that caching works
banks added a commit that referenced this pull request Oct 10, 2018
* Rename agent/proxy package to reflect that it is limited to managed proxy processes