Optimize State merge #126
Closed
Conversation
State.new/2 expects a second argument, which is the shard name.

* Extract was used wrongly: the same state object was passed for all three arguments. The first argument should be the local state and the other two should be the remote ones.
* State.replica_up/2 needs to be called, otherwise extract does not return a proper remote map.
* Added a couple of assert statements to ensure merging actually works and everything is set up correctly.
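A hedged sketch of the corrected setup described above — the variable names, struct fields, and the exact extract/3 call shape are assumptions rather than the actual benchmark code:

```elixir
# Sketch only: names and shapes are assumptions.
alias Phoenix.Tracker.State

local  = State.new(:replica1, :shard1)   # State.new/2 — shard name is the 2nd argument
remote = State.new(:replica2, :shard1)

# replica_up/2 must be called first, otherwise extract does not
# return a proper remote map.
local = State.replica_up(local, remote.replica)

# First argument is the local state; the other two describe the remote
# (previously the same state was passed in all three positions).
extracted = State.extract(local, remote.replica, remote.context)
# The extracted payload is what gets merged into the other replica's state.
```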
indrekj force-pushed the merge-perf-for-pr branch 3 times, most recently from 1b2a616 to 3fc1468 on July 6, 2019 20:41
* observe_removes

observe_removes was one of the slowest parts of the State merge function. Each time observe_removes was called, we did a full ets table scan on the values table, copied all elements except those from the local replica, and then filtered them one by one. This was especially slow when there were multiple nodes, each with many tracked processes. E.g. given 4 replicas, each with 25K records, we had to scan all 100K records and then copy 75K from ets for additional processing.

We had to go through that every time there was a non-empty delta, even if we didn't need to remove any records.

I created a new ets table that maps tags to value keys. This allows us to find out which records we want to query before even touching the values table. In the best case, when the delta only includes adds, we don't have to do any ets queries and the whole observe_removes is basically skipped. With removes we now only fetch what's needed. Also, because we have the tags table, we no longer have to do full table scans.

* compact

compact was another slow part. Most of the slowness came from the `Enum.sort(MapSet.to_list(cloud))` part. I don't really see why we have to sort this, so I'm now passing the cloud without modifications and changed the do_compact function to expect the cloud to be unordered.

= Benchmarks

== phoenix.pub_sub.bench

`mix phoenix.pub_sub.bench --size 100000 --delta-size 10000`

```
>> Merging delta with 10000 joins and leaves into 200000 element set...
= 132.18ms 0.13s (OLD)
= 49.9ms 0.05s (NEW)
```

Everything else stayed more or less the same.

== my custom benchmark

I wrote a benchmark that spawned 4 VMs and set up a tracker on each of them with a pool size of 1. I then added 10K records to each tracker, divided between 10 topics, and after 5 seconds I updated 2.5K records on each tracker. These are the merge calls:

Before:

```
[node1] >> Connected nodes: [:"node2@10.200.0.207", :"node3@10.200.0.207", :"node4@10.200.0.207"]
Starting pubsub with pool size of 1
[node1] >> Starting tracker with pool size of 1
[node1] >> MERGE 16.44ms
[node1] >> Tracked 10000 clients
[node1] >> MERGE 48.97ms
[node1] >> MERGE 38.71ms
[node1] >> MERGE 54.84ms
[node1] >> MERGE 41.26ms
[node1] >> MERGE 41.48ms
[node1] >> Updated 2.5e3 clients
[node1] >> MERGE 62.81ms
[node1] >> MERGE 68.38ms
[node1] >> MERGE 77.4ms
[node1] >> MERGE 37.67ms
[node1] >> MERGE 21.12ms
[node1] >> EXIT
```

After:

```
[node1] >> Connected nodes: [:"node4@10.200.0.207", :"node2@10.200.0.207", :"node3@10.200.0.207"]
Starting pubsub with pool size of 1
[node1] >> Starting tracker with pool size of 1
[node1] >> MERGE 18.82ms
[node1] >> Tracked 10000 clients 304.18ms
[node1] >> MERGE 19.07ms
[node1] >> MERGE 28.12ms
[node1] >> MERGE 21.37ms
[node1] >> MERGE 31.98ms
[node1] >> MERGE 22.54ms
[node1] >> Updated 2.5e3 clients 98.42ms
[node1] >> MERGE 36.29ms
[node1] >> MERGE 46.59ms
[node1] >> MERGE 43.54ms
[node1] >> MERGE 3.81ms
[node1] >> MERGE 2.02ms
[node1] >> EXIT
```
Using Map.update/4 has a few inefficiencies: both the initial-value argument and the update function are constructed on every call, which is unnecessary, and it also involves an anonymous function call for the update. Related: https://medium.com/learn-elixir/optimizing-work-with-nested-maps-in-elixir-3131da27a13e

I got about a 25% performance increase locally for the given reduce. In the big picture this is a very minor improvement, though.
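For illustration, here is the kind of rewrite this refers to — the reduce below is a made-up example, not the code from this PR:

```elixir
values = [a: 1, b: 2, a: 3]

# Before: Map.update/4 builds the initial value [v] and the update
# closure on every iteration, even when they end up unused.
grouped =
  Enum.reduce(values, %{}, fn {k, v}, acc ->
    Map.update(acc, k, [v], fn list -> [v | list] end)
  end)

# After: pattern match on the map directly; no throwaway arguments
# and no anonymous function call in the hot path.
grouped =
  Enum.reduce(values, %{}, fn {k, v}, acc ->
    case acc do
      %{^k => list} -> %{acc | k => [v | list]}
      _ -> Map.put(acc, k, [v])
    end
  end)
```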
indrekj force-pushed the merge-perf-for-pr branch from 3fc1468 to f11c404 on September 8, 2019 15:37
Closing this atm. I'll remove the compaction part, as I'm still not sure whether it is needed. I'll add the observe_removes part to our own managed branch and do some testing in beta and production.
NOTE: Merge PRs 136 and 127 before this one. This PR has more complex changes and needs a more in-depth review.
I was inspecting why Presence.list() was quite slow on our production servers. It took a couple of milliseconds when traffic was low, but over a few hundred milliseconds at peak. We have about 40K presence connections spread across 5 instances. So far it seems that processing heartbeats is the most expensive part, and it blocks those list calls.

During my investigation I discovered a few places that can be optimized. I've tried to ensure I didn't break anything as well as I could, but the codebase is complex and I cannot be sure that I didn't miss some edge case. All the tests pass, however.

All feedback is welcome, and if you have recommendations on what else to check or optimize, please let me know.
Optimizations
observe_removes
observe_removes was one of the slowest parts of the State merge function. Each time observe_removes was called, we did a full ets table scan on the values table, copied all elements except those from the local replica, and then filtered them one by one. This was especially slow when there were multiple nodes, each with many tracked processes. E.g. given 4 replicas, each with 25K records, we had to scan all 100K records and then copy 75K from ets for additional processing.

We had to go through that every time there was a non-empty delta, even if we didn't need to remove any records.

I created a new ets table that maps tags to value keys. This allows us to find out which records we want to query before even touching the values table. In the best case, when the delta only includes adds, we don't have to do any ets queries and the whole observe_removes is basically skipped. With removes we now only fetch what's needed. Also, because we have the tags table, we no longer have to do full table scans.
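A minimal sketch of the tag-index idea — table names, record shapes, and the delta handling here are illustrative assumptions, not the actual PR code:

```elixir
# Sketch only: table names and record shapes are assumed.
values = :ets.new(:values_table, [:set])  # {key, value, tag}
tags   = :ets.new(:tags_table, [:set])    # {tag, key} side index

# On insert, index the record's tag alongside the value.
tag = {:replica2, 7}
:ets.insert(values, {:some_key, :some_value, tag})
:ets.insert(tags, {tag, :some_key})

# On merge, only the tags the delta actually removes are looked up —
# no full scan and no bulk copy of the values table.
removed_tags = [{:replica2, 7}]

Enum.each(removed_tags, fn tag ->
  case :ets.lookup(tags, tag) do
    [{^tag, key}] ->
      :ets.delete(values, key)
      :ets.delete(tags, tag)

    [] ->
      :ok
  end
end)

# If the delta contains only adds, removed_tags is empty and this whole
# step is skipped without touching ets at all.
```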
compact
compact was another slow part. Most of the slowness came from the Enum.sort(MapSet.to_list(cloud)) part. I don't really see why we have to sort this, so I'm now passing the cloud without modifications and changed the do_compact function to expect the cloud to be unordered.
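To illustrate why the sort is unnecessary, here is a self-contained sketch of causal-context compaction over an unordered cloud — a generic reconstruction, not the actual do_compact from this PR:

```elixir
# Sketch: context is %{replica => max contiguous clock}; cloud is a
# MapSet of out-of-order {replica, clock} dots. Names are assumed.
defmodule CompactSketch do
  def compact(context, cloud) do
    # Repeatedly absorb any dot that directly follows a replica's clock
    # until a fixpoint is reached — no sorting required, since membership
    # and adjacency checks don't depend on iteration order.
    case Enum.find(cloud, fn {replica, clock} ->
           Map.get(context, replica, 0) + 1 == clock
         end) do
      nil ->
        {context, cloud}

      {replica, clock} = dot ->
        compact(Map.put(context, replica, clock), MapSet.delete(cloud, dot))
    end
  end
end

# The order of the cloud doesn't matter:
cloud = MapSet.new([{:r1, 3}, {:r1, 2}, {:r1, 1}])
CompactSketch.compact(%{}, cloud)
# => {%{r1: 3}, MapSet.new([])}
```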
Benchmarks
phoenix.pub_sub.bench
`mix phoenix.pub_sub.bench --size 100000 --delta-size 10000`
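```
>> Merging delta with 10000 joins and leaves into 200000 element set...
= 132.18ms 0.13s (OLD)
= 49.9ms 0.05s (NEW)
```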
Everything else stayed more or less the same.
my custom benchmark
I wrote a benchmark that spawned 4 VMs and set up a tracker on each of them with a pool size of 1. I then added 10K records to each tracker, divided between 10 topics, and after 5 seconds I updated 2.5K records on each tracker. These are the merge calls:
Before:
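```
[node1] >> Connected nodes: [:"node2@10.200.0.207", :"node3@10.200.0.207", :"node4@10.200.0.207"]
Starting pubsub with pool size of 1
[node1] >> Starting tracker with pool size of 1
[node1] >> MERGE 16.44ms
[node1] >> Tracked 10000 clients
[node1] >> MERGE 48.97ms
[node1] >> MERGE 38.71ms
[node1] >> MERGE 54.84ms
[node1] >> MERGE 41.26ms
[node1] >> MERGE 41.48ms
[node1] >> Updated 2.5e3 clients
[node1] >> MERGE 62.81ms
[node1] >> MERGE 68.38ms
[node1] >> MERGE 77.4ms
[node1] >> MERGE 37.67ms
[node1] >> MERGE 21.12ms
[node1] >> EXIT
```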
After:
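```
[node1] >> Connected nodes: [:"node4@10.200.0.207", :"node2@10.200.0.207", :"node3@10.200.0.207"]
Starting pubsub with pool size of 1
[node1] >> Starting tracker with pool size of 1
[node1] >> MERGE 18.82ms
[node1] >> Tracked 10000 clients 304.18ms
[node1] >> MERGE 19.07ms
[node1] >> MERGE 28.12ms
[node1] >> MERGE 21.37ms
[node1] >> MERGE 31.98ms
[node1] >> MERGE 22.54ms
[node1] >> Updated 2.5e3 clients 98.42ms
[node1] >> MERGE 36.29ms
[node1] >> MERGE 46.59ms
[node1] >> MERGE 43.54ms
[node1] >> MERGE 3.81ms
[node1] >> MERGE 2.02ms
[node1] >> EXIT
```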