[Performance on large clusters] Share blocking queries between RPC requests #5050
Conversation
Only run one blocking query per request and share the result with all RPC calls. This optimizes the blocking query process when many agents are watching the same service.
This is really cool - thanks! I have a few thoughts, but at this stage I want to consider all the possible options available to solve this, including others discussed in #4984, so let's do that over in the issue and come back here if we decide this option is worth taking forward.
Deployment feedback of this patch

@banks You'll see the results are simply amazing with our usage of Consul.

Preprod: deployed in the whole preprod for 24h; significant decrease of CPU, write latency and number of goroutines.

Prod: deployed on a single server (among 5) in each of our DCs for comparison.

All the graphs below represent 2 servers in the same cluster in "Follower" mode. The consul-05 servers are using the patch, the others are not. The red line represents the start of deployment of the feature in all of our DCs.

CPU: system CPU usage went from 20% to 6%. The changes are especially impressive when large services change a lot (see the peak at 11:40): in that case, the CPU stays almost stable on the patched version while it shoots up on the non-patched version.

Memory: from 11G to 4G of memory, almost divided by 3!

FSM apply latency: FSM latency divided by 2 on both the average and the 90th percentile!

Num goroutines: from 180k to 65k!
Awesome graphs @pierresouchay, it's cool we can get such big wins for this or something like it. See the discussion in the issue for finding the sweet spot between minimal changes and most performance gain!
@banks Sure... let's find a way to get these gains in a way which is convenient for you. We currently have an incident in one of our DCs (meaning far more discovery requests than usual), so here are some graphs comparing the 2 best-behaving servers during the incident; this lets me share some real data for when things go bad.

When things go bad: what is really impressive is that the patched server serves far more requests than the non-patched version (1.6Gb/s vs max 1.2Gb/s for non-patched ones), meaning that it consumes far less while being far busier!

Go routines compared: goroutines stay stable and low on the patched server (red line is the patched server).

CPU compared: the other servers' CPU hits the limit, while our patched instance stays far below (1944% - the machine is CPU bound - while the patched version stays at 735% max). Note the patched server can also handle far more requests at the same time (so the CPU difference is even worse than that): peaks around 1.6Gb/s on the patched server vs max 1.2Gb/s for all the others.

Memory compared: memory of the patched server stays almost stable during the incident (between 7.5 and 8.5Gb during very high loads, while handling far more QPS), while the non-patched one goes from 9 to 11Gb.

Warning: non-patched servers = Vanilla + most of our perf patches + #4986 set to 8192. When I say non-patched servers, I am talking about servers already patched with #4986 and the watch soft limit set to 8192. Vanilla Consul servers cannot handle these loads.
Again, I want to say that overall this PR is awesome @Aestek, but I just wanted to be a little more concrete about why I'm keen to understand how we can get some or all of the same performance win with a slightly different design.
I'll think some more and maybe have a more concrete suggestion that would compose better with #5081
```
@@ -443,6 +445,157 @@ RUN_QUERY:
	return err
}

type sharedQueryFn func(memdb.WatchSet, *state.Store) (uint64, func(uint64, interface{}) error, error)
```
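To make the shape of this type concrete, here is a toy, self-contained analog (the `watchSet`, `store` and `nodesReply` types below are simplified stand-ins, not the real memdb/state types): phase one runs the query once and returns an index plus an "apply" closure; each RPC caller then invokes that closure to copy the shared result into its own reply.

```go
package main

import "fmt"

// Simplified stand-ins for memdb.WatchSet and *state.Store, purely illustrative.
type watchSet map[chan struct{}]struct{}

type store struct {
	index uint64
	nodes []string
}

// A simplified analog of sharedQueryFn: run the query, return the index and a
// closure that writes the shared result into a caller-specific reply.
type sharedQueryFn func(ws watchSet, s *store) (uint64, func(uint64, interface{}) error, error)

type nodesReply struct {
	Index uint64
	Nodes []string
}

func serviceNodesQuery(ws watchSet, s *store) (uint64, func(uint64, interface{}) error, error) {
	nodes := append([]string(nil), s.nodes...) // snapshot taken once, shared by all callers
	apply := func(index uint64, reply interface{}) error {
		r, ok := reply.(*nodesReply)
		if !ok {
			return fmt.Errorf("unexpected reply type %T", reply)
		}
		r.Index, r.Nodes = index, nodes
		return nil
	}
	return s.index, apply, nil
}

func main() {
	s := &store{index: 42, nodes: []string{"node-a", "node-b"}}
	var q sharedQueryFn = serviceNodesQuery
	idx, apply, err := q(watchSet{}, s)
	if err != nil {
		panic(err)
	}
	var reply nodesReply
	if err := apply(idx, &reply); err != nil {
		panic(err)
	}
	fmt.Println(reply.Index, reply.Nodes) // 42 [node-a node-b]
}
```

The nesting of closures and `interface{}` replies is exactly what the review comment below calls fragile: any new parameter has to be threaded through both layers.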
I think this function type sums up one of my big concerns here - that seems like a really unintuitive and fragile abstraction to build blocking queries on top of, with funcs returning funcs with special types, etc. It seems inevitable that we'll find we need to plumb something new through and will have to change types all over the place to do it.
I could be wrong of course, but it feels like there must be a cleaner abstraction we can make here.
I agree with you, I will think of a better way to do this.
```
}

return queryState.Apply.Load().(func(uint64, interface{}) error)(atomic.LoadUint64(&queryState.Index), res)
}
```
Another minor concern is how much of this is duplicated from `blockingQuery`. I guess once we adopt this everywhere we can remove the normal ones, but that seems like a decent amount of work.
Yes, the goal is to remove blockingQuery() and just keep the shared version. However, I kept both to keep the PR small.
If this gets merged / if it's needed to get it merged, I can migrate the other endpoints to this system and remove the old code.
```
timeout = time.NewTimer(queryOpts.MaxQueryTime)
defer timeout.Stop()

cacheInfo := req.CacheInfo()
```
I keep swaying backwards and forwards on whether the agent `cache.Request` interface is the right thing to use for deduplication here. On one level it has a pretty similar purpose - uniquely identify this request - but on another there could be semantically different caching concerns. I fear that in practice we'll eventually hit a bug in one or the other place which needs a change to the CacheInfo key for a certain request, but it won't apply to the other place and we'll be hacking about.

This is more of a feeling of unease than a concrete issue though - I'm not totally sure what a better alternative would be. I'll think more, because if we do find that de-duplicating the query processing as well as the watching is important, then this would be something to solve however we do it.
I share your feeling; I used this because it did the right thing for this use case and was ready to use. This is a naming / namespacing issue to me: both the cache and this PR need a unique identifier for the request parameters. It could be moved out of the `cache.Request` interface into its own method, with both the cache and this PR depending on it.
FWIW I think I have a very rough branch locally that, combined with #5081, gets all of the rest of the benefit here with minimal changes. It looks like this (although it should probably be cleaned up and abstracted more). Basically the only changes are to add a

```diff
diff --git a/agent/consul/health_endpoint.go b/agent/consul/health_endpoint.go
index 103fb2faf..e625c16be 100644
--- a/agent/consul/health_endpoint.go
+++ b/agent/consul/health_endpoint.go
@@ -142,12 +142,31 @@ func (h *Health) ServiceNodes(args *structs.ServiceSpecificRequest, reply *struc
 		&args.QueryOptions,
 		&reply.QueryMeta,
 		func(ws memdb.WatchSet, state *state.Store) error {
-			index, nodes, err := f(ws, state, args)
+
+			// TODO(banks): using key here feels wrong. It has many gotchas like ACL token
+			// and DC not being included which is kinda what we want but only if we are
+			// extremely careful in what we do under this key...
+			sharedKey := args.CacheInfo().Key
+
+			type result struct {
+				ws    memdb.WatchSet
+				index uint64
+				nodes structs.CheckServiceNodes
+			}
+			v, err, _ := h.srv.sfGroup.Do(sharedKey, func() (interface{}, error) {
+				index, nodes, err := f(ws, state, args)
+				return result{ws, index, nodes}, err
+			})
 			if err != nil {
 				return err
 			}
+			// Replace our watchset with the one populated in the de-duped operation
+			// otherwise `ws` will be empty and we won't notice any further changes...
+			for ch := range v.(result).ws {
+				ws.Add(ch)
+			}
+			reply.Index, reply.Nodes = v.(result).index, v.(result).nodes
-			reply.Index, reply.Nodes = index, nodes
```

The whole diff is just a few other lines to add the watchpool to the struct and a single-file lib vendor.

I made a tool to simulate watchers on a large service consistently, but using direct RPC so we don't have the overhead of local agent HTTP->RPC as well: https://gist.github.com/banks/f8429dfbf6b3c8c0145afeb9be16caa4

This is all just on my Mac so super unscientific, but I consistently see master using more CPU than #5081, which uses a bit more than with the diff above. Very roughly, with 500 blocking queries on 1000 service instances which change every 5 seconds, it seems to be about:
(measured using Instruments taking Consul's combined CPU %). That is super unscientific though - I will try and reproduce on a dedicated server with better monitoring. I'm optimistic about this approach. What do you think?
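For readers unfamiliar with the `h.srv.sfGroup.Do` call in the diff above: it is singleflight-style request collapsing (the branch presumably vendors something like golang.org/x/sync/singleflight). Here is a minimal stdlib-only sketch of the primitive, not the vendored library itself: concurrent callers with the same key share one execution of `fn`.

```go
package main

import (
	"fmt"
	"sync"
)

// call tracks one in-flight execution of fn for a given key.
type call struct {
	wg  sync.WaitGroup
	val interface{}
	err error
}

// Group collapses duplicate concurrent calls, singleflight-style.
type Group struct {
	mu sync.Mutex
	m  map[string]*call
}

// Do runs fn once per key at a time; concurrent duplicate callers wait for and
// share the first caller's result. The third return reports whether the result
// was shared with another caller.
func (g *Group) Do(key string, fn func() (interface{}, error)) (interface{}, error, bool) {
	g.mu.Lock()
	if g.m == nil {
		g.m = make(map[string]*call)
	}
	if c, ok := g.m[key]; ok { // another caller is already running fn for this key
		g.mu.Unlock()
		c.wg.Wait()
		return c.val, c.err, true
	}
	c := new(call)
	c.wg.Add(1)
	g.m[key] = c
	g.mu.Unlock()

	c.val, c.err = fn() // the expensive query body runs exactly once here
	c.wg.Done()

	g.mu.Lock()
	delete(g.m, key)
	g.mu.Unlock()
	return c.val, c.err, false
}

func main() {
	var g Group
	execs := 0
	v, err, shared := g.Do("health/web:dc1", func() (interface{}, error) {
		execs++ // in the diff above, this is where f(ws, state, args) runs
		return "nodes@42", nil
	})
	fmt.Println(v, err, shared, execs) // nodes@42 <nil> false 1
}
```

Note the key subtlety discussed below: results are deduplicated only among callers that overlap in time; nothing is cached once the call completes.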
Here is the whole diff for the
@banks you are on fire! ;) I read this from my phone across several tabs, but if I read correctly, the implementation does not address late readers, which do get optimized in our version. If I understand correctly, for a given index 42 of the service, if a first request comes in at index 42, it will block until the index of the service gets to at least 43. There might be several strategies:

This is a real-world scenario, as the jitter of requests accounts for 1/16th of the max wait time we set to 10 min, and in some cases we have several changes/sec when the platform is unstable.
Actually it's super subtle, but I don't think that is the case currently. Note that the

But I don't think it would cause unnecessary waiting, even if two requests with different indexes race. Consider request A with index 42 and B with index 40, while the current state is at 42:
If request B comes in after A has completed the singleflight.Do and is already blocking, then it will do the query on its own and so will get the same result. So I don't think there is ever any additional waiting this way, but you are right that in the case where later readers come in while other readers are already blocking, they will re-execute the memdb query instead of using the cached value as in #5050. I'd love to know how much that matters. Do you have metrics on how many "late readers" you typically have due to jitter? That would be super useful. My rationale for thinking that this would get a significant part of the benefit, even when some readers are late, is:
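The blocking rule being discussed here (a caller only waits when its index is at or above the current state index, so a "late" reader returns immediately) can be sketched with a stdlib condition variable. The `State` and `BlockingQuery` names below are illustrative stand-ins, not Consul's actual types:

```go
package main

import (
	"fmt"
	"sync"
)

// State holds a monotonically increasing index, like the Raft/memdb index.
type State struct {
	mu    sync.Mutex
	cond  *sync.Cond
	index uint64
}

func NewState(index uint64) *State {
	s := &State{index: index}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// Bump simulates a write: it advances the index and wakes all blocked queries.
func (s *State) Bump() {
	s.mu.Lock()
	s.index++
	s.mu.Unlock()
	s.cond.Broadcast()
}

// BlockingQuery returns once the state index exceeds minIndex. A caller whose
// minIndex is already below the current index returns immediately.
func (s *State) BlockingQuery(minIndex uint64) uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.index <= minIndex {
		s.cond.Wait()
	}
	return s.index
}

func main() {
	s := NewState(42)
	fmt.Println(s.BlockingQuery(40)) // late reader B: returns 42 immediately

	done := make(chan uint64)
	go func() { done <- s.BlockingQuery(42) }() // up-to-date reader A: blocks
	s.Bump()                                    // a write at index 43 wakes A
	fmt.Println(<-done)                         // 43
}
```

This is why request B (index 40, state 42) never waits on A: it falls straight through the loop condition.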
If you actually have readers that are rate limited, then during a spike in updates which is faster than their rate limit they will end up being "late" pretty much constantly - which is maybe the situation you sometimes have - and in this case the cache would help too. Adding a very simple index-aware cache on top of
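A minimal version of that index-aware cache idea might look like the following. The `IndexCache` type and its methods are hypothetical, not from the PR or the branch: the cache remembers the last result per key along with its index, and serves any caller whose index is below the cached index without re-running the query.

```go
package main

import (
	"fmt"
	"sync"
)

type entry struct {
	index uint64
	value interface{}
}

// IndexCache serves "late readers" from the last known result when that
// result's index already satisfies the caller's minIndex.
type IndexCache struct {
	mu sync.Mutex
	m  map[string]entry
}

func NewIndexCache() *IndexCache { return &IndexCache{m: map[string]entry{}} }

// Get runs fetch only when the cached index can't satisfy minIndex.
// The third return reports a cache hit.
func (c *IndexCache) Get(key string, minIndex uint64, fetch func() (uint64, interface{})) (uint64, interface{}, bool) {
	c.mu.Lock()
	if e, ok := c.m[key]; ok && e.index > minIndex {
		c.mu.Unlock()
		return e.index, e.value, true // late reader: cached result is new enough
	}
	c.mu.Unlock()

	idx, val := fetch() // would be the (deduplicated) blocking query
	c.mu.Lock()
	if e := c.m[key]; idx > e.index {
		c.m[key] = entry{idx, val} // keep only the newest result per key
	}
	c.mu.Unlock()
	return idx, val, false
}

func main() {
	c := NewIndexCache()
	fetches := 0
	fetch := func() (uint64, interface{}) { fetches++; return 42, "nodes@42" }

	// First reader misses and fetches; a late reader (index 40) then hits.
	_, _, hit1 := c.Get("health/web", 40, fetch)
	idx, val, hit2 := c.Get("health/web", 40, fetch)
	fmt.Println(hit1, hit2, idx, val, fetches) // false true 42 nodes@42 1
}
```

Combined with singleflight for the concurrent case, this would cover the late-reader gap without a second full caching layer.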
I ran some benchmarks on both solutions; here are the results:

All values are averages during the 1-min bench. singleflight and shared show similar results, with singleflight performing better in the standard "no late callers" case. What do you think is missing from the diffs you posted before you would consider them production-ready (aside from some code cleanup)?
@Aestek great news - thank you so much for putting that effort in to test. Can you share a little more about your methodology - was that using your real workload or a synthetic one? If synthetic, is there any way to share the code? How did you measure CPU usage - for the whole host or just the consul process?

I think the main thing the singleflight PR needs is some thought about testing, especially around correctness with callers at different indexes. I think we'd need plenty of warnings in code too about the dangers of doing anything with ACL tokens inside the singleflighted function - health is OK right now, but we should probably not just rely on us remembering never to accidentally move a token check inside there somehow.

My point yesterday - that we should probably optimise health queries for reads, since they are the most prevalent sort in Consul anyway and currently the most expensive - still holds. If we did do that, it seems to me the singleflight change wouldn't necessarily help any more and could be removed. That said, at least it stops short of caching results in a whole new place. The concerns about correctness for mixed clients, token availability, and re-using the agent cache key all remain somewhat.

The one other thing it might be worth considering before we merge is what the simplest path for denormalizing the health results would be - if we can do that with one new table and a few hooks in the state store to populate it on a write, then it might actually be a simpler thing to reason about, test and merge, as there is no complex concurrent behaviour to model, and it would be the more ideal long-term solution. The downside is that you didn't benchmark it; I wonder if your benchmarks are easy to share/reproduce so we can quickly get a handle on how that possible solution compares. If it is just as good or close, it would be my preference.
@banks sorry I wasn't precise enough, these results come from https://github.com/criteo/consul-bench/tree/next (careful some flags aren't merged in master yet). Here are the full command lines used:
and
|
Great, thanks. I looked quickly at the denormalization option. I think the goal would be to replace

The tradeoff is that everything that is in that response would need to potentially rebuild all of the index entries that might be affected:
Interestingly, all of these are the same places where Pierre added service-specific index updates, for good reason! We can re-use most of the existing Fetch code to actually populate the results too. The only other detail I spotted is that we probably need to do it twice for each service: once for

What do you think? In some ways adding a new table seems like a big change; in others, this seems like the cleanest option and will likely achieve the best performance for this workload with none of the risks (and zero code change outside of the state store). And it will leave us in a decent place for the future with streaming etc.

I might see if I can bash together a really simple version that's just enough to PoC and benchmark against. If there are other things I'm missing that would complicate this, then maybe singleflight is good for now, but I really like the idea of solving the root cause and benefitting everyone with simpler code in the end!
@banks This indeed sounds like a great idea. |
I think the most complex part is deciding how to trigger the updates - if you actually do it low-level in those methods I mentioned, then in the case of a typical registration you might end up rebuilding the index many times inside each lower-level method. I think it would need to be done as a separate pass and threaded through all of the high-level entry points like

Some other good news though is that we get snapshotting for free if we just make sure
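To illustrate the denormalization idea under discussion, here is a hypothetical toy `Store` (not Consul's actual state store, and with none of its indexes or transactions): the write entry points rebuild a prebuilt per-service health slice, so the read path becomes a cheap map lookup instead of a join.

```go
package main

import "fmt"

// CheckServiceNode is a tiny stand-in for the real denormalized result row.
type CheckServiceNode struct {
	Node   string
	Status string
}

type Store struct {
	nodes     map[string][]string           // service -> node names (normalized)
	statuses  map[string]string             // node -> health status (normalized)
	byService map[string][]CheckServiceNode // denormalized, rebuilt on write
}

func NewStore() *Store {
	return &Store{
		nodes:     map[string][]string{},
		statuses:  map[string]string{},
		byService: map[string][]CheckServiceNode{},
	}
}

// rebuild recomputes the denormalized result for one service. Doing this as a
// single pass per write avoids rebuilding repeatedly inside low-level methods.
func (s *Store) rebuild(service string) {
	out := make([]CheckServiceNode, 0, len(s.nodes[service]))
	for _, n := range s.nodes[service] {
		out = append(out, CheckServiceNode{Node: n, Status: s.statuses[n]})
	}
	s.byService[service] = out
}

// Register and UpdateCheck are the high-level write entry points that trigger
// the rebuild (mirroring the places with service-specific index updates).
func (s *Store) Register(service, node string) {
	s.nodes[service] = append(s.nodes[service], node)
	if _, ok := s.statuses[node]; !ok {
		s.statuses[node] = "passing"
	}
	s.rebuild(service)
}

func (s *Store) UpdateCheck(node, status string, services ...string) {
	s.statuses[node] = status
	for _, svc := range services {
		s.rebuild(svc)
	}
}

// ServiceHealth is now just a lookup of the prebuilt slice.
func (s *Store) ServiceHealth(service string) []CheckServiceNode {
	return s.byService[service]
}

func main() {
	s := NewStore()
	s.Register("web", "node-a")
	s.Register("web", "node-b")
	s.UpdateCheck("node-b", "critical", "web")
	fmt.Println(s.ServiceHealth("web")) // [{node-a passing} {node-b critical}]
}
```

The cost moves to the write path (every registration or check update rebuilds the affected services), which is the tradeoff being weighed above.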
Thanks @aestek.
Understandable that the CPU usage is still higher as explained in the
issue. I hope the more complete solution outlined in the issue will be
enough to improve that over this PR eventually when done!
…On Mon, Apr 8, 2019 at 3:00 PM Aestek ***@***.***> wrote:
We did some benchmarking while upgrading from 1.3.1 to 1.4.4 in production.
We deployed Consul 1.4.4 vanilla on one server on one of our largest DCs
and generated some load similar to a big app deployment :
[image: image]
<https://user-images.githubusercontent.com/1712219/55729290-2818b900-5a16-11e9-8e34-51f319ab63ef.png>
In this graph servers 04-07 are running 1.3.1 + shared blocking queries
while 03 is running vanilla 1.4.4.
We see a significant CPU load increase for the server running 1.4.4.
We then deployed 1.4.4 on a second server but with shared blocking queries
this time :
[image: image]
<https://user-images.githubusercontent.com/1712219/55729576-ae34ff80-5a16-11e9-902e-b5c92184c6ab.png>
In this graph servers 05-07 are running 1.3.1 + shared blocking queries
while 03 is running vanilla 1.4.4 and 04 is running 1.4.4 + shared blocking
queries.
We can see that 04's load is on par with 05-07 with this patch.
We will continue to run this patch in production as we cannot afford to lose perf on our consul clusters.
Is this still an issue for you?
@i0rek Yes, we have had our patched version running for a year for this reason.
I need to tie up the loose end that's been left here for a long time! Firstly, thank you @ShimmerGlass and @pierresouchay, both for making the PR and for all the work we've done together in #4984 and since on Consul blocking performance. We know this PR is a big performance improvement that, for now, you are still relying on in production. However, for all the many reasons discussed in #4984 (comment) and later comments, we don't plan to merge it as it is, so we will finally close it! A summary of the reasons not to merge:
Thanks again for all your hard work here!
Only run one blocking query per request and share the result with all RPC calls. It optimizes the blocking query process when many agents are watching the same service.
Implementation
When a blocking query RPC call is received, the server first checks if a blocking query with the same parameters is already running. If so, it simply waits for that query to complete (or for `MaxWaitTime`) without doing any other work. If no blocking query is running, it runs one in the background and waits on it. The background blocking queries have no max time; however, they are cancelled once no more requests are watching them.
The new `args.CacheInfo().Key` is used to differentiate one query from another. A map keeps track of all running blocking queries.
Improvements

- `ServiceNodes` code path and msgpack serialization: this change runs just one `ServiceNodes` call no matter how many watchers there are, reducing that hot path.
- `softWatchLimit` (discussed here: [Performance on large clusters] Performance degrades on health blocking queries to more than 682 instances #4984) aims at capping the number of goroutines used by blocking queries. This change only keeps one `WatchSet` per blocking query type instead of one per request, ultimately keeping the number of goroutines low.

Notes

This PR only uses shared blocking queries for `/v1/health/service` to keep it small and easy to review, and because this query consumes many goroutines per service instance. All blocking endpoints can be migrated to this system.

Some of our tests show CPU usage divided by two, as well as far lower memory and goroutine usage.