
[Performance on large clusters] Share blocking queries between RPC requests #5050

Closed

Conversation

ShimmerGlass
Contributor

Only run one blocking query per unique set of request parameters and share the result with all matching RPC calls. This optimizes the blocking query process when many agents are watching the same service.

Implementation

When a blocking query RPC call is received, the server first checks whether a blocking query with the same parameters is already running. If so, the request simply waits for that query to complete (or for MaxQueryTime to elapse) without doing any other work. If no matching blocking query is running, one is started in the background and waited on.
The background blocking queries have no maximum duration; however, they are cancelled once no requests are watching them anymore.
The new args.CacheInfo().Key is used to differentiate one query from another.
A map keeps track of all running blocking queries.
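A minimal sketch of this mechanism, using hypothetical names (Registry, Do) and simplified error handling rather than the actual PR code:

package sharedquery

import (
	"context"
	"sync"
	"time"
)

// queryState is one background blocking query shared by all callers with the
// same key.
type queryState struct {
	cancel   context.CancelFunc
	done     chan struct{} // closed when the background query returns
	watchers int           // callers currently waiting on this query
	result   interface{}
	err      error
}

// Registry keeps track of all running blocking queries, keyed by
// args.CacheInfo().Key in the PR.
type Registry struct {
	mu      sync.Mutex
	running map[string]*queryState
}

func NewRegistry() *Registry {
	return &Registry{running: map[string]*queryState{}}
}

// Do waits up to maxQueryTime for the shared query identified by key,
// starting it in the background if no query with this key is running.
func (r *Registry) Do(key string, maxQueryTime time.Duration,
	run func(ctx context.Context) (interface{}, error)) (interface{}, error) {

	r.mu.Lock()
	qs, ok := r.running[key]
	if !ok {
		ctx, cancel := context.WithCancel(context.Background())
		qs = &queryState{cancel: cancel, done: make(chan struct{})}
		r.running[key] = qs
		go func() {
			// Single execution shared by every concurrent caller.
			qs.result, qs.err = run(ctx)
			close(qs.done)
			r.mu.Lock()
			delete(r.running, key)
			r.mu.Unlock()
		}()
	}
	qs.watchers++
	r.mu.Unlock()

	defer func() {
		r.mu.Lock()
		qs.watchers--
		if qs.watchers == 0 {
			// Nobody is watching any more: cancel the background query.
			qs.cancel()
		}
		r.mu.Unlock()
	}()

	select {
	case <-qs.done:
		return qs.result, qs.err
	case <-time.After(maxQueryTime):
		// In Consul a timed-out blocking query would return the current
		// result rather than an error; an error keeps the sketch short.
		return nil, context.DeadlineExceeded
	}
}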

Improvements

  • As seen in this profile: profile.zip (taken on a production server while a large service was deploying), the main CPU hogs are the blocking query (ServiceNodes) code path and the msgpack serialization. This change runs just one ServiceNodes call no matter how many watchers there are, reducing that hot path.
  • softWatchLimit (discussed in #4984, "[Performance on large clusters] Performance degrades on health blocking queries to more than 682 instances") aims at capping the number of goroutines used by blocking queries. This change keeps only one WatchSet per blocking query type instead of one per request, ultimately keeping the number of goroutines low.

Notes

This PR only uses shared blocking queries for /v1/health/service to keep it small and easy to review, and because this query consumes many goroutines per service instance. All blocking endpoints can be migrated to this system.

Some of our tests show CPU usage divided by two, as well as far lower memory and goroutine usage.

@banks
Member

banks commented Dec 4, 2018

This is really cool - thanks!

I have a few thoughts, but at this stage I want to consider all the possible options available to solve this, including others discussed in #4984, so let's do that over in the issue and come back here if we decide this option is worth taking forward.

@pierresouchay
Contributor

Deployment feedback of this patch

@banks You'll see the results are simply amazing with our usage of Consul

Preprod

Deployed across the whole preprod for 24h: significant decrease in CPU, write latency, and number of goroutines.
No functional changes.

Prod

Deployed on a single server (among 5) in each of our DCs for comparison

All the graphs below represent 2 servers in the same cluster in "Follower" mode. consul-05 is using the patch; the others are not.

The red line represents the start of deployment of the feature in all of our DCs.

CPU

CPU compared

System CPU usage went from 20% to 6%
User CPU usage went from 180% to 40%!

The changes are especially impressive when large services change a lot (see the peak at 11:40): in that case, CPU stays almost stable on the patched version while it shoots up on the non-patched version.

Memory

From 11G to 4G of memory, almost divided by 3!
Memory on patched server

FSM Apply latency

FSM latency divided by 2 on both the average and the 90th percentile!

FSM Latency on AM5

Num GoRoutines

Number of Goroutines compared

From 180k to 65k!

@banks
Member

banks commented Dec 7, 2018

Awesome graphs @pierresouchay, it's cool we can get such big wins for this or something like it. See the discussion in the issue for finding the sweet spot between minimal changes and maximum performance gain!

@pierresouchay
Contributor

pierresouchay commented Dec 7, 2018

@banks Sure... let's find a way to have gains in a way which is convenient for you.

We currently have an incident in one of our DCs (meaning much more discovery requests than usual), so here are some graphs comparing the 2 best-behaving servers during the incident, so I can share some real data on what happens when things go bad.

When things go bad

What is really impressive is that the patched server serves far more requests than the non-patched ones (1.6Gb/s vs. a max of 1.2Gb/s for the non-patched ones), meaning it consumes far less while being far busier!

Go routines compared

Goroutines stay stable and low on the patched server (red line):
[screenshot]

CPU compared

The other servers' CPU hits the limit (1944%, the machine is CPU bound), while our patched instance stays far below at 735% max:
[screenshot]

Note that the patched server also handles far more requests at the same time (so the CPU difference is even bigger than it looks): peaks around 1.6Gb/s on the patched server vs. a max of 1.2Gb/s for all the other ones.

Memory compared

Memory of non-patched server during incident:
[screenshot]

Memory of patched server during incident (and handling far more QPS):
[screenshot]

So memory stays almost stable (between 7.5 and 8.5Gb during very high load) while the non-patched server goes from 9 to 11Gb.

Warning: non-patched servers = vanilla + most of our perf patches + #4986 set to 8192

When I say non-patched servers, I am talking about servers already patched with #4986 and the watch soft limit set to 8192. Vanilla Consul servers cannot handle these loads.

Member

@banks banks left a comment

Again, I want to say that overall this PR is awesome @Aestek, but I just wanted to be a little more concrete about why I'm keen to understand how we can get some or all of the same performance win with a slightly different design.

I'll think some more and maybe have a more concrete suggestion that would compose better with #5081.

@@ -443,6 +445,157 @@ RUN_QUERY:
return err
}

type sharedQueryFn func(memdb.WatchSet, *state.Store) (uint64, func(uint64, interface{}) error, error)
Member

I think this function type sums up one of my big concerns here - it seems like a really unintuitive and fragile abstraction to build blocking queries on top of, with funcs returning funcs with special types, etc. It seems inevitable that we'll find we need to plumb something new through and have to change types all over the place to do it.

I could be wrong of course, but it feels like there must be a cleaner abstraction we can make here.

Contributor Author

I agree with you, I will think of a better way to do this.

}

return queryState.Apply.Load().(func(uint64, interface{}) error)(atomic.LoadUint64(&queryState.Index), res)
}
Member

Another minor concern is how much of this is duplicated from blockingQuery. I guess once we adopt this everywhere we can remove normal ones but that seems like a decent amount of work.

Contributor Author

Yes, the goal is to remove blockingQuery() and keep only the shared version. However, I kept both to keep the PR small.
If this gets merged (or if it is needed to get it merged), I can migrate the other endpoints to this system and remove the old code.

timeout = time.NewTimer(queryOpts.MaxQueryTime)
defer timeout.Stop()

cacheInfo := req.CacheInfo()
Member

I keep swaying backwards and forwards on whether the agentcache.Request interface is the right thing to use for deduplication here. On one level it's pretty similar purpose - uniquely identify this request - but on another there could be semantically different caching concerns. I fear that in practice we'll eventually hit a bug in one or the other place which needs a change to the CacheInfo key for a certain request, but it won't apply to the other place and we'll be hacking about.

This is more of a feeling of unease than a concrete issue though - I'm not totally sure what a better alternative would be. I'll think more because if we do find that de-duplicating the query processing as well as the watching is important then this would be something to solve however we do it.

Contributor Author

I share your feeling; I used this because it did the right thing for this use case and was ready to use.
This is a naming/namespacing issue to me: both the cache and this PR need a unique identifier for the request parameters. It could be moved out of the cache.Request interface into its own method, with both the cache and this PR depending on it.

@banks
Member

banks commented Dec 11, 2018

FWIW I think I have a very rough branch locally that, combined with #5081, gets all of the rest of the benefit here with minimal changes. It looks like this (although it should probably be cleaned up and abstracted more):

Basically the only changes are to add a *singleflight.Group to the server struct and then change the following in the Health endpoint blocking query function:

diff --git a/agent/consul/health_endpoint.go b/agent/consul/health_endpoint.go
index 103fb2faf..e625c16be 100644
--- a/agent/consul/health_endpoint.go
+++ b/agent/consul/health_endpoint.go
@@ -142,12 +142,31 @@ func (h *Health) ServiceNodes(args *structs.ServiceSpecificRequest, reply *struc
 		&args.QueryOptions,
 		&reply.QueryMeta,
 		func(ws memdb.WatchSet, state *state.Store) error {
-			index, nodes, err := f(ws, state, args)
+
+			// TODO(banks): using key here feels wrong. It has many gotchas like ACL token
+			// and DC not being included which is kinda what we want but only if we are
+			// extremely careful in what we do under this key...
+			sharedKey := args.CacheInfo().Key
+
+			type result struct {
+				ws    memdb.WatchSet
+				index uint64
+				nodes structs.CheckServiceNodes
+			}
+			v, err, _ := h.srv.sfGroup.Do(sharedKey, func() (interface{}, error) {
+				index, nodes, err := f(ws, state, args)
+				return result{ws, index, nodes}, err
+			})
 			if err != nil {
 				return err
 			}
+			// Replace our watchset with the one populated in the de-duped operation
+			// otherwise `ws` will be empty and we won't notice any further changes...
+			for ch := range v.(result).ws {
+				ws.Add(ch)
+			}
+			reply.Index, reply.Nodes = v.(result).index, v.(result).nodes

-			reply.Index, reply.Nodes = index, nodes

The whole diff is just a few other lines to add the watchpool to the struct and vendor a single-file lib.

I made a tool to simulate watchers on large service consistently but using direct RPC so we don't have the overhead of local agent http->RPC as well: https://gist.github.com/banks/f8429dfbf6b3c8c0145afeb9be16caa4

This is all just on my Mac so it's super unscientific, but I consistently see master using more CPU than #5081, which uses a bit more than with the diff above. Very roughly, with 500 blocking queries on 1000 service instances which change every 5 seconds, it seems to be about:

(measured using Instruments taking Consul's combined CPU %)

That is super unscientific though - I will try and reproduce on a dedicated server with better monitoring.

I'm optimistic about singleflight though as a much simpler mechanism to achieve the same thing as this PR - I think the Cache Key gotcha is no different, and the watchset copy thing is a bit subtle and I missed it initially, but I think we could make a helper for that in just a few lines of code that can be adopted into other blocking endpoints much more simply.
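A helper along those lines could be as small as this (sketch only; mergeWatchSet is an illustrative name, not something that exists in Consul):

package watchhelper

import memdb "github.com/hashicorp/go-memdb"

// mergeWatchSet copies every watch channel registered by the de-duplicated
// query into the caller's own WatchSet, so the caller still gets woken on
// subsequent changes. memdb.WatchSet is a map keyed by the watched channels,
// so ranging over it yields each channel exactly once.
func mergeWatchSet(dst, src memdb.WatchSet) {
	for ch := range src {
		dst.Add(ch)
	}
}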

What do you think?

@banks
Member

banks commented Dec 11, 2018

Here is the whole diff for the singleflight PoC branch (not ready for PR but sharing for clarity): https://github.com/hashicorp/consul/compare/x/watchpool-hash...x/singleflight?expand=1

@pierresouchay
Contributor

@banks you are on Fire! ;)

I read this from my phone across several tabs, but if I read correctly, the implementation does not address late readers, which do get optimized in our version. In your case, if I understand correctly, for a given service index of 42, a first request coming in at index 42 will block until the service index reaches at least 43.
But if another reader comes in after the first one with index 40, it will be blocked until index 43 fires, while in theory it should get the result immediately.

There might be several strategies :

  1. Do something similar to what we do: optimize for late readers.
  2. Include the requested index in the shared key (see the small sketch below). This is more generic and less error prone, but less optimized, because if m readers come in with m different values from currentIndexOfService down to currentIndexOfService - m, the parsing of nodes would be done m times instead of once (that is why we first call the func once in order to get the current index and cache the result).

This is a real-world scenario, as request jitter accounts for 1/16th of the max wait time we set to 10 min, and in some cases we have several changes/sec when the platform is unstable.
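For clarity, strategy 2 is essentially a change to the key used for de-duplication; a tiny sketch (indexAwareKey is a hypothetical helper, building on the sharedKey idea from the diff above):

package sharedkey

import "fmt"

// indexAwareKey derives a de-duplication key from the request's cache key and
// the minimum index it is blocking on, so readers at different indexes never
// share a result: simpler to reason about, but it de-duplicates less work.
func indexAwareKey(cacheKey string, minQueryIndex uint64) string {
	return fmt.Sprintf("%s@%d", cacheKey, minQueryIndex)
}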

@banks
Member

banks commented Dec 12, 2018

But if another reader comes in after the first one with index 40, it will be blocked until index 43 fires, while in theory it should get the result immediately.

Actually it's super subtle but I don't think that is the case currently. Note that the singleflight critical section is only in the state store query - requests that come in while others are waiting will perform the work again. That makes it less optimized than #5050 since we are not actually caching the "current" result.

But I don't think it would cause unnecessary waiting, even if two requests with different indexes race:

Consider request A with index 42 and request B with index 40. The current state is at index 42:

Time | A (index 42)                       | B (index 40)
t0   | singleflight.Do                    | singleflight.Do
t1   | gets the lock                      | doesn't get the lock, waits for A
t3   | returns current state @ index 42   | gets state @ index 42 from A
t4   | blocking loop sees 42 = 42, waits  | blocking loop sees 42 > 40, returns

If request B comes in after A has completed the singleflight.Do and is already blocking then it will do the query on its own and so will get the same result.

So I don't think there is ever any additional waiting this way, but you are right that in the case where later readers come in while other readers are already blocking, they will re-execute the memdb query instead of using the cached value as in #5050.

Would love to know how much that matters. Do you have metrics on how many "late readers" you have typically due to jitter? That would be super useful.

My rationale for thinking that this would get a significant part of the benefit, even when some readers are late is:

  • assuming later readers are only late a small amount of the time (when they happen to have returned while updates occurred), most readers most of the time will be waiting on the next update
  • all waiting readers will be blocking on the same watchset due to #5081 ("Adds a WatchPool to optimise blocking RPC handling in servers")
  • they all get "woken" at the same time and all enter the blocking query function and hit singleflight.Do pretty much concurrently
  • only one of them does the work and then they all process the result independently.
  • so most of the time, most queries will be de-duped.

If you actually have readers that are rate limited, then during a spike of updates that is faster than their rate limit they will end up being "late" pretty constantly, which is maybe the situation you have sometimes - in this case the cache would help too.

Adding a very simple index-aware cache on top of singleflight could, I think, be done easily and doesn't change the abstraction too much - the pattern of a reusable component inside the existing blocking query is, I think, the biggest benefit over the sharedBlockingQuery approach in #5050, which seems to add quite a bit of complexity to the blocking loop too. It would also be re-usable in other endpoints where it's worthwhile, usually with very little re-writing. That said, most other read endpoints are trivial in-memory reads from MemDB so they will probably benefit much less from de-duplicating that work.
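A very rough sketch of what that index-aware layer could look like (IndexCache, Get and fetch are hypothetical names, and this is not tied to Consul's actual types):

package indexcache

import (
	"sync"

	"golang.org/x/sync/singleflight"
)

type indexedResult struct {
	index uint64
	value interface{}
}

// IndexCache layers a per-key result cache, tagged with the index it was
// computed at, on top of singleflight de-duplication.
type IndexCache struct {
	mu    sync.RWMutex
	sf    singleflight.Group
	cache map[string]indexedResult
}

func New() *IndexCache {
	return &IndexCache{cache: make(map[string]indexedResult)}
}

// Get serves "late" readers (whose minIndex is already older than the cached
// result) straight from the cache, and de-duplicates the underlying fetch for
// everyone else.
func (c *IndexCache) Get(key string, minIndex uint64,
	fetch func() (uint64, interface{}, error)) (uint64, interface{}, error) {

	c.mu.RLock()
	cached, ok := c.cache[key]
	c.mu.RUnlock()
	if ok && cached.index > minIndex {
		return cached.index, cached.value, nil
	}

	v, err, _ := c.sf.Do(key, func() (interface{}, error) {
		idx, val, err := fetch()
		if err != nil {
			return nil, err
		}
		res := indexedResult{index: idx, value: val}
		c.mu.Lock()
		c.cache[key] = res
		c.mu.Unlock()
		return res, nil
	})
	if err != nil {
		return 0, nil, err
	}
	res := v.(indexedResult)
	return res.index, res.value, nil
}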

@ShimmerGlass
Contributor Author

ShimmerGlass commented Dec 18, 2018

@banks,

I ran some benchmarks on both solutions; here are the results:

No "late callers":

master:         CPU: 302.53, FPS: 49.19, QPS: 1027.47, Runtime (s): 60.00
x/singleflight: CPU: 252.08, FPS: 49.31, QPS: 1128.93, Runtime (s): 60.02
shared-bq:      CPU: 261.71, FPS: 49.25, QPS: 1138.70, Runtime (s): 60.02
                   
30% "late callers":
                   
master:         CPU: 320.13, FPS: 49.48, Late QPS: 334.32, QPS: 1124.22, Runtime (s): 60.00
x/singleflight: CPU: 277.67, FPS: 49.78, Late QPS: 319.32, QPS: 1067.78, Runtime (s): 60.00
shared-bq:      CPU: 266.36, FPS: 49.42, Late QPS: 367.66, QPS: 1218.81, Runtime (s): 60.00

All values are averages during the 1min bench.
CPU: consul agent CPU usage
FPS: flap/s
QPS: query/s (including late)
Late QPS: late query/s
Parameters: instances: 100, watchers: 100, flap: each instance every 2s (spread)

singleflight and shared-bq show similar results, with singleflight performing better in the standard "no late callers" case.
The changes you are proposing are simpler and less intrusive than this PR. Going with them until a long-term solution (such as materialized views or streaming, as you pointed out) is devised will be beneficial during high-load situations such as the ones we are experiencing.

What do you think is missing from the diffs you posted before you would consider them production ready (aside from some code cleanup)?

@banks
Member

banks commented Dec 18, 2018

@Aestek great news - thank you so much for putting that effort into testing. Can you share a little more about your methodology: was that using your real workload or a synthetic one? If synthetic, is there any way to share the code? How did you measure CPU usage - for the whole host or just the consul process?

I think the main thing the singleflight PR needs is some thought about testing, especially around correctness with callers at different indexes. I think we'd need plenty of warnings in the code too about the dangers of doing anything with ACL tokens inside the singleflighted function - health is OK right now, but we should probably not just rely on us remembering to never accidentally move a token check inside there.

My point yesterday - that we should probably optimise health queries for reads, since they are the most prevalent sort in Consul anyway and currently the most expensive - still holds. If we did do that, it seems to me the singleflight change wouldn't necessarily help any more and could be removed. That said, at least it stops short of caching results in a whole new place. The concerns about correctness for mixed clients, token availability, and re-using the agent cache key all remain somewhat.

The one other thing that might be worth considering before we merge that is the simplest path to denormalizing the health results - if we can do that with one new table and a few hooks in the state store to populate it on a write, then it might actually be a simpler thing to reason about, test, and merge, as there is no complex concurrent behaviour to model, and it would be the more ideal long-term solution. The downside is that you didn't benchmark that; I wonder if your benchmarks are easy to share/reproduce so we can quickly get a handle on how that possible solution fits. If it is just as good or close, it would be my preference.

@ShimmerGlass
Contributor Author

ShimmerGlass commented Dec 18, 2018

@banks sorry I wasn't precise enough, these results come from https://github.com/criteo/consul-bench/tree/next (careful, some flags aren't merged into master yet).
The CPU measurements are for consul only. The blocking queries are made using RPC directly, while the flapping is done via TTL checks through the agent (I need to work on that so the agent<>server sync does not impact the bench).
The platform is Debian on an Intel(R) Xeon(R) CPU E3-1270 v5 @ 3.60GHz.

Here are the full command lines used:

consul-bench -register 100 -flap-interval 2s -watchers 100 -rpc -monitor $(pgrep consul) -time 1m

and

consul-bench -register 100 -flap-interval 2s -watchers 100 -rpc -monitor $(pgrep consul) -time 1m -late-ratio 0.3

@banks
Member

banks commented Dec 18, 2018

Great thanks.

I looked quickly at the denormalization option. I think the goal would be to replace the checkServiceNodes helper in the state store with a single table fetch (also a single chan to watch for blocking!), which would make it way cheaper. The other good news is that its result type is structs.CheckServiceNodes, which, if you follow the types through, is a list of structs that contain only pointers to the actual nodes, checks, and service nodes. That means that storing one of these per service should not consume a significant amount of extra RAM!

The tradeoff is that every write touching anything in that response would potentially need to rebuild all of the index entries that might be affected:

  • ensureNodeTxn would need to iterate over all services on that node and possibly rebuild their entries (like it already does for updating service indexes from Pierre's optimisation). The good news is that the number of services on a node isn't likely to grow really huge, so the amount of work is bounded, and since it's all in-memory during a raft apply, I suspect it won't impact performance much since the network and disk IO will still be so much slower.
  • ensureServiceTxn would obviously need to rebuild the index
  • ensureCheckTxn would need to rebuild the index for a specific service if it's a service-level check, and possibly all services on the node if it's a node-level check, although we should be careful not to do that twice if we also update the node.
  • delete*Txn would need to rebuild and possibly delete.

Interestingly, all of these are the same places as those where Pierre added service-specific index updates for a good reason!

We can re-use most of the existing Fetch code to actually populate the results too.

The only other detail I spotted is that we probably need to do it twice for each service: once for service.foo and once for the connect.foo equivalent.

What do you think? In some ways adding a new table seems like a big change; in others this seems like the cleanest option and will likely achieve the best performance for this workload with none of the risks (and zero code change outside of the state store). It will also leave us in a decent place for the future with streaming, etc.

I might see if I can bash together a really simple version that's just enough to PoC and benchmark against. If there are other things I'm missing that would complicate this then maybe singleflight is good for now but I really like the idea of solving the root cause and benefitting everyone with simpler code in the end!
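To make the shape of that option concrete, a denormalized table in go-memdb could look roughly like this (a sketch only; the table name, entry struct, and single-index layout are illustrative assumptions, not Consul's actual schema):

package statesketch

import memdb "github.com/hashicorp/go-memdb"

// serviceHealthEntry is one precomputed health result per service. In Consul
// the Nodes field would hold the prebuilt structs.CheckServiceNodes.
type serviceHealthEntry struct {
	ServiceName string
	Index       uint64      // raft index the entry was last rebuilt at
	Nodes       interface{} // stands in for structs.CheckServiceNodes
}

// serviceHealthTableSchema gives a blocking query a single Get and a single
// watch channel per service instead of walking nodes, services and checks.
func serviceHealthTableSchema() *memdb.TableSchema {
	return &memdb.TableSchema{
		Name: "service_health",
		Indexes: map[string]*memdb.IndexSchema{
			"id": {
				Name:    "id",
				Unique:  true,
				Indexer: &memdb.StringFieldIndex{Field: "ServiceName", Lowercase: true},
			},
		},
	}
}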

@ShimmerGlass
Contributor Author

@banks This indeed sounds like a great idea.
Another thing that might get optimized by this is the removal of allocations in parseServiceNodes, as the result will be ready to send from the new table.
I suggested singleflight first because adding a new table/view seems more complicated than this cache change.
I am also looking at how this could be done (at least to understand the implications).

@banks
Member

banks commented Dec 18, 2018

I think the most complex part is deciding how to trigger the updates - if you actually do it at a low level in those methods I mentioned, then in the case of a typical registration you might end up rebuilding the index many times, once inside each lower-level method. I think it would need to be done as a separate pass and threaded through all of the high-level entry points like ensureRegistrationTxn, EnsureService, etc.

Some other good news though is that we get snapshot for free if we just make sure ensureRegistrationTxn always does the right thing!

@ShimmerGlass
Contributor Author

We did some benchmarking while upgrading from 1.3.1 to 1.4.4 in production.
We deployed vanilla Consul 1.4.4 on one server in one of our largest DCs and generated some load similar to a big app deployment:

[graph]
In this graph servers 04-07 are running 1.3.1 + shared blocking queries while 03 is running vanilla 1.4.4.
We see a significant CPU load increase for the server running 1.4.4.

We then deployed 1.4.4 on a second server, but with shared blocking queries this time:
[graph]
In this graph servers 05-07 are running 1.3.1 + shared blocking queries while 03 is running vanilla 1.4.4 and 04 is running 1.4.4 + shared blocking queries.
We can see that 04's load is on par with 05-07 with this patch.

We will continue to run this patch in production as we cannot afford to lose perf on our consul clusters.

@banks
Member

banks commented May 13, 2019 via email

@hanshasselberg
Member

Is this still an issue for you?

@pierresouchay
Contributor

@i0rek Yes, we have had our patched version running for a year for this reason.

hanshasselberg added a commit that referenced this pull request Feb 3, 2020
The previous value was too conservative and users with many instances
were having problems because of it. This change increases the limit to
8192 which reportedly fixed most of the issues with that.

Related: #4984, #4986, #5050.
hanshasselberg added a commit that referenced this pull request Feb 4, 2020
The previous value was too conservative and users with many instances
were having problems because of it. This change increases the limit to
8192 which reportedly fixed most of the issues with that.

Related: #4984, #4986, #5050.
@banks
Member

banks commented Jul 30, 2020

I need to tie up the loose end that's been left here for a long time!

Firstly, thank you @ShimmerGlass and @pierresouchay for both making the PR and all the work we've done together in #4984 and since on Consul blocking performance.

We know this PR is a big performance improvement that, for now, you are still relying on in production; however, for all the many reasons discussed in #4984 (comment) and later comments, we don't plan to merge this as it is, so we will finally close it!

A summary of the reasons not to merge:

  • We are concerned that it adds another layer of caching and complexity to blocking queries, which are already reasonably complicated: maintainability, subtle issues like lagging clients seeing the wrong cached state or missing updates, and interactions with other components. Our experience with the agent cache was that there is a huge amount of potential complexity here, so we'd prefer to avoid it if we can achieve the same or better performance another way.
  • In the issue linked above, we researched and prototyped 3 different approaches that had benefits over this one in terms of being simpler/less invasive (singleflight), more easily testable and deterministic (denormalising the results in memdb), or flat-out more efficient (streaming updates based on memdb change capture). All of those were viable and would be preferred for the reasons given.
  • Ultimately we realised that this optimisation saved some CPU cycles but was much less significant than the gains we could get from re-architecting Consul to allow for delta streaming instead of polling and sending full responses. This work has sadly taken longer than anticipated for a few reasons, but we are now on track and have a dedicated engineer working to get this shipped in the next major release.
  • Once streaming is available for an endpoint this optimization shouldn't be needed any more, so we'd rather focus on finishing that and getting it working everywhere than continuing to test and work on merging this PR.

Thanks again for all your hard work here!

@banks banks closed this Jul 30, 2020
@ShimmerGlass ShimmerGlass deleted the shared-blocking-queries branch July 30, 2020 16:41