Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Performance On Large clusters] Reduce updates on large services #4720

Conversation

Projects
None yet
5 participants
@pierresouchay
Copy link
Contributor

commented Sep 28, 2018

Checks do update services/nodes only when really modified to avoid too many updates on very large clusters.

In a large cluster, when having a few thousands of nodes, the anti-entropy mechanism performs lots of changes (several per seconds) while there is no real change. This patch wants to improve this in order to increase Consul scalability when using many blocking requests on health for instance.

For the record, we are having anti-entropy running every 6-7 minutes, just this change does around more than 15 changes in catalog/nodes, but in our large services (around 1k nodes, that's still a lot) even in stable services.
What it means is that our large services are constantly updated and all health watcher being notified.

This has huge impact on load on Consul server and completely break performance.

This optimization reduces the load on Consul servers since all watchers will not being notified for nothing when Anti-Antropy or any request not changing the content or a node/service/check is being performed.

We have lots of performance issues those days, we would really appreciate a quick review.

Thank you very much

pierresouchay added some commits Sep 28, 2018

[Performance On Large clusters] Checks do update services/nodes only …
…when really modified to avoid too many updates on very large clusters

In a large cluster, when having a few thousands of nodes, the anti-entropy
mechanism performs lots of changes (several per seconds) while
there is no real change. This patch wants to improve this in order
to increase Consul scalability when using many blocking requests on
health for instance.
@pierresouchay

This comment has been minimized.

Copy link
Contributor Author

commented Sep 28, 2018

@banks this modification might greatly enhance the performance of your cache implementation in progress for large clusters with services having many instances

@pierresouchay

This comment has been minimized.

Copy link
Contributor Author

commented Oct 3, 2018

@banks @mkeeler @pearkes We applied this patch on a middle size cluster of ours (3400 nodes, 1300 kind of services). It more than divided by 2 the network bandwidth needed on our servers!

We also tested it on our preprod clusters with nice changes, similar to this, but the improvement increase quite a lot with the number of nodes!

On a large cluster, this is a game changer, as index per service was a very important change for scalability.

This patch basically allow services not changing much to avoid changing for nothing and notifies clients only when real changes occur. Here are our result...

Rate of Changes with wait=10m on a stable Service

This shows the number of updates on a stable service with 32 instances (means that anti-entropy does not change the service unless really needed). It means that if your use wait=10m with correct index on a blocking query, on a stable service, you will get a result after 10m. On our side, we have more than 1300 real services (x2 because of proxies), it means all watcher will get far less notifications

One single small service (32 instances) watched by many apps, rate per 10m:

screen shot 2018-10-03 at 17 30 51

On all services at the same time, here is the result (rate per 10m)

screen shot 2018-10-03 at 17 27 52

Real Life impact on this Datacenter

Ok, this was theoretical, here are the raw results

99th percentile on Read Stale delay

While we used to have up to 1s of latency on this DC, the latency did drop to less than 256ms

screen shot 2018-10-03 at 16 47 33

Load Average / CPU Load

Decreasing really nicely on the servers
screen shot 2018-10-03 at 16 44 30

Network bandwidth on server

Divided by more than 2...
screen shot 2018-10-03 at 17 46 32

On 7 days (last bar on right is new version with this patch)

Req/s on a single service

Req/sec on server for services shown above (32 nodes). Some missing points before the MEP because of temp breakage of our metric system.

screen shot 2018-10-04 at 23 55 29

Various Consul Metrics for 7 days (last bar is the MEP with this patch)

screen shot 2018-10-04 at 23 59 40

Conclusion

This patch has a real impact on large clusters. I am also gonna test it tomorrow on larger DCs (up to 7k nodes), we expect even larger improvements.

With this patch, this is the first time in weeks we do succeed to be below the SLA we provide to our internal clients, so I really think this is a very important optimization (and not that big or intrusive).

Could you please have a look?

Kind regards

@pearkes

This comment has been minimized.

Copy link
Member

commented Oct 5, 2018

Thanks @pierresouchay this looks like some great investigative work and the benefits seem awesome 😄. Sorry we haven't had a chance to review in detail yet but we will likely come back around after a release we're leading up to in the next week or so (1.3).

@pierresouchay

This comment has been minimized.

Copy link
Contributor Author

commented Oct 5, 2018

Thank you very much @pearkes,

I get it, perfectly fine. However, would it be possible to have a look to #3551 which was kind of ready for a while with all fixes provided and that we need to backport, update all conflicts whith each release for almost 1 year.

This one is almost without any risk since it does not touch anything to servers nor modify any kind of API...

Kind regards

@mkeeler mkeeler requested a review from hashicorp/consul-core Oct 5, 2018

@mkeeler
Copy link
Member

left a comment

@pierresouchay This looks really good.

Is it safe to say the performance gains are due to not waking anything up blocking on the catalog due to not reinserting data when its the same. Or inversely the performance benefit is not mostly due to the overhead of reindexing within memdb.

Marking this as "Requesting Changes" because I have a few questions and requests for an additional code comment or 2 but the actual code looks ready to me.

entry.CreateIndex = serviceNode.CreateIndex
entry.ModifyIndex = serviceNode.ModifyIndex
if entry.IsSame(serviceNode) {
modified = false

This comment has been minimized.

Copy link
@mkeeler

mkeeler Oct 5, 2018

Member

I assume there is a reason why we cannot just return here like you did for the IsSame check on the node. If so it would be good to have a comment indicating why you can't, why the node needs to still be Inserted but why the "index" table doesn't require updating.

This comment has been minimized.

Copy link
@pierresouchay

pierresouchay Oct 5, 2018

Author Contributor

Not really, was just to avoid to change too many tests.
Another possible reason was adding into Consul some values in DB, for instance pre-filled values from ToServiceNodes(), for instance Weights.
While semantics to not change, we are sure we are keeping the same kind of data before and after the patch (while upgrade semantics stays untouched).

Added comments in next patch

This comment has been minimized.

Copy link
@mkeeler

mkeeler Oct 5, 2018

Member

As for the default values coming from ToServiceNode, that call happens prior to the IsSame check so if any of the defaults change then IsSame should return false and cause reinsertion.

As for the other reason of wanting to return ErrMissingNode, why can't we just move the node check block to before the ToServiceNode call?

This comment has been minimized.

Copy link
@pierresouchay

pierresouchay Oct 5, 2018

Author Contributor

You are right, I wanted to have a change as small as possible, but you are definitely right, changing that...

Show resolved Hide resolved agent/structs/structs.go Outdated
@pierresouchay

This comment has been minimized.

Copy link
Contributor Author

commented Oct 5, 2018

@mkeeler I added the comment and made contract for IsSame() a no-brainer for future user of function.

@mkeeler

This comment has been minimized.

Copy link
Member

commented Oct 5, 2018

@pierresouchay Looks like GitHub isn't showing the updated comment at the bottom of this page so I will ask it here again to make sure its easily visible.

I am not certain we can't return early when the IsSame check for services returns true. As for wanting to keep backwards compatibility with returning ErrMissingNode I think you are right. But the next question becomes why not move that whole check to before the ToServiceNode call.

As for the other reason of ensuring default values get inserted shouldn't the IsSame check pick up those changes and ensure that the reinsertion happens?

@pierresouchay

This comment has been minimized.

Copy link
Contributor Author

commented Oct 5, 2018

@mkeeler I am also a bit lost with thread handling on GitHub :-)

You are right, my last patch fixes it, I removed the modfied attribute, so I think you will be happy. I wanted to change as few lines as possible, but your approach is more reasonnable.

@mkeeler

mkeeler approved these changes Oct 8, 2018

Copy link
Member

left a comment

Looks good to me.

@banks banks added this to the 1.3.0 milestone Oct 9, 2018

@pierresouchay pierresouchay force-pushed the pierresouchay:checks_do_not_modify_services_when_not_needed branch from a1ed401 to aab2d16 Oct 9, 2018

Show resolved Hide resolved agent/structs/structs_test.go Outdated

pierresouchay added some commits Oct 10, 2018

@mkeeler

This comment has been minimized.

Copy link
Member

commented Oct 10, 2018

@pierresouchay That travis failure is legit. I think you need to update the ServiceNode.IsSame check. The f-envoy branch which just got merged adds a couple new fields to the ServiceNode which you will need to check for in that function as well.

@pierresouchay

This comment has been minimized.

Copy link
Contributor Author

commented Oct 10, 2018

@mkeeler @banks I have a very strange error.

On my Mac Laptop, I cannot reproduce the same error as Travis:

I have:

go test -timeout 30s github.com/hashicorp/consul/api -run '^TestAPI_CatalogConnect$'
...
--- FAIL: TestAPI_CatalogConnect (9.19s)
    retry.go:116: catalog_test.go:433: Unexpected response code: 500 (1 error(s) occurred:

        * ProxyDestination must be non-empty for Connect proxy services)

FAIL
FAIL	github.com/hashicorp/consul/api	9.206s

I have no clue about the reason why (spent 3 hours on it)...
My version is:

go version
go version go1.11 darwin/amd64

The weird thing is that I have the same exact failure on master with be52793aad664c771637689e8666d5bbf26ac028 :

git rev-parse --verify HEAD && go clean && go test -timeout 30s github.com/hashicorp/consul/api -run '^TestAPI_CatalogConnect$'
be52793aad664c771637689e8666d5bbf26ac028
2018/10/11 01:26:54 CONFIG JSON: {"node_name":"node-72de5520-ce8e-ccd0-4bfd-72a951eabe51","node_id":"72de5520-ce8e-ccd0-4bfd-72a951eabe51","performance":{"raft_multiplier":1},"bootstrap":true,"server":true,"data_dir":"/tmp/consul-test/TestAPI_CatalogConnect-consul163839338/data","segments":null,"disable_update_check":true,"log_level":"debug","bind_addr":"127.0.0.1","addresses":{},"ports":{"dns":29501,"http":29502,"https":29503,"serf_lan":29504,"serf_wan":29505,"server":29506},"acl_enforce_version_8":false,"connect":{"ca_config":{"cluster_id":"11111111-2222-3333-4444-555555555555"},"enabled":true,"proxy":{"allow_managed_api_registration":true}}}
bootstrap = true: do not enable unless necessary
==> Starting Consul agent...
[...]
--- FAIL: TestAPI_CatalogConnect (9.20s)
    retry.go:116: catalog_test.go:433: Unexpected response code: 500 (1 error(s) occurred:

        * ProxyDestination must be non-empty for Connect proxy services)

FAIL
FAIL	github.com/hashicorp/consul/api	9.214s

So either my version of Golang 1.11 (I also tried with same result on Go 1.10.3) is completely broken (I just installed it a few hours ago), either something really weird has changed in the tests.

I'll try tomorrow from a Linux machine, but I have no clue of the reason of this weirdness. Does any of you have an idea?

@banks

This comment has been minimized.

Copy link
Member

commented Oct 11, 2018

@pierresouchay I suspect it's because you are running API tests without using make file.

API tests run the consul binary from your $PATH as the server so you probably have an older version of consul built in your path.

This is really gross and we should be able to fix it, but you either need to make dev first or run the test using make test GOTEST_PKGS="./api".

The API test reliance on the consul binary in path is a sad thing that has caught most Consul devs out at least once...

@banks

This comment has been minimized.

Copy link
Member

commented Oct 11, 2018

I checked out your branch and it passes locally for me - CI is green too now 🎉 thanks for the time spent!

@banks banks merged commit 51b33ef into hashicorp:master Oct 11, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@pierresouchay

This comment has been minimized.

Copy link
Contributor Author

commented Oct 12, 2018

@pierresouchay I suspect it's because you are running API tests without using make file.

API tests run the consul binary from your $PATH as the server so you probably have an older version of consul built in your path.

This is really gross and we should be able to fix it, but you either need to make dev first or run the test using make test GOTEST_PKGS="./api".

The API test reliance on the consul binary in path is a sad thing that has caught most Consul devs out at least once...

Well spotted, I had the exact same issue while trying to implement api/agent in #3551 (I spent so much time trying to figure out why my changes were not taken into account)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.