Fix fine-grained blocking queries for Catalog.NodeServices #12399
Conversation
This patch significantly reduces overall load in our Consul cluster, which performs a heavy number of blocking queries.
Thanks for this PR! Trying to get another set of eyes on it from the team. There will be some work involved on our side to merge it to our enterprise side cleanly, but hoping to get back to you by the end of next week.
Nice, thank you for the PR! I think at a high level this is a good approach. I haven't looked to see if `catalogUpdateNodeIndexes` is called in all necessary places, but generally it seems right.
Left one comment about dealing with deleted nodes.
agent/consul/state/catalog.go (outdated)
```go
if err := catalogUpdateNodeIndexes(tx, nodeName, idx, entMeta); err != nil {
	return fmt.Errorf("failed updating node index: %s", err)
}
```
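To make the indexing scheme under discussion concrete, here is a minimal, self-contained sketch of the idea behind the PR, using a plain map to stand in for Consul's memdb-backed `index` table. The `node.<name>` key format is from this PR; the helper names and update logic here are simplified illustrations, not the real transaction code.

```go
package main

import "fmt"

// indexTable is a toy model of Consul's "index" table: it maps an
// index key to the last Raft index at which that key was modified.
type indexTable map[string]uint64

// updateNodeIndex sketches what catalogUpdateNodeIndexes does at a
// high level: bump both the coarse "nodes" index and the fine-grained
// per-node "node.<name>" entry.
func updateNodeIndex(tbl indexTable, node string, raftIdx uint64) {
	tbl["nodes"] = raftIdx
	tbl["node."+node] = raftIdx
}

// maxIndexForNode is what a blocking query on one node would consult:
// the fine-grained entry if present, falling back to the coarse one.
func maxIndexForNode(tbl indexTable, node string) uint64 {
	if idx, ok := tbl["node."+node]; ok {
		return idx
	}
	return tbl["nodes"]
}

func main() {
	tbl := indexTable{}
	updateNodeIndex(tbl, "web-1", 10)
	updateNodeIndex(tbl, "db-1", 11) // an unrelated node changes later

	// A watcher on web-1 still sees index 10, so its blocking query
	// keeps waiting instead of waking up for db-1's change.
	fmt.Println(maxIndexForNode(tbl, "web-1")) // 10
	fmt.Println(maxIndexForNode(tbl, "db-1"))  // 11
}
```

Before this PR, both watchers would have consulted the shared coarse index (11 here), so every node change woke every watcher.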
I think for deletes we probably need to do something like `catalogUpdateServiceExtinctionIndex`. If we track an index for every deleted node name, that set is unbounded and will grow forever.
I think that by using an extinction index for deleted items we only ever keep one entry covering every deleted node. It's slightly less optimal in terms of reducing blockingQuery churn, but it is much safer in terms of the space used by index tracking.
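The trade-off described above can be sketched as follows. This is a toy illustration of the extinction-index idea (the `node_extinction` key name is hypothetical, modeled on `catalogUpdateServiceExtinctionIndex`): deletion removes the per-node key and bumps a single shared key, so the index table stays bounded at the cost of occasionally waking unrelated watchers.

```go
package main

import "fmt"

// indexTable is the same toy model of Consul's "index" table.
type indexTable map[string]uint64

// deleteNode removes the per-node entry instead of bumping it forever,
// and records the deletion in one shared extinction entry.
func deleteNode(tbl indexTable, node string, raftIdx uint64) {
	delete(tbl, "node."+node)        // drop the per-node entry
	tbl["node_extinction"] = raftIdx // single entry for all deletions
}

// watchIndex: a blocking query must also consider the extinction
// index, or it would never observe the watched node's own deletion.
func watchIndex(tbl indexTable, node string) uint64 {
	idx := tbl["node."+node] // zero if the node was deleted
	if ext := tbl["node_extinction"]; ext > idx {
		return ext
	}
	return idx
}

func main() {
	tbl := indexTable{"node.web-1": 10, "node.db-1": 11}
	deleteNode(tbl, "db-1", 12)

	fmt.Println(watchIndex(tbl, "db-1"))  // 12: the db-1 watcher wakes
	fmt.Println(watchIndex(tbl, "web-1")) // 12: spurious wakeup (the churn cost)
	fmt.Println(len(tbl))                 // 2: the table did not grow
}
```

The spurious wakeup on `web-1` is the "slightly less optimal churn" mentioned above; the bounded table size is the safety win.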
Yes, I noticed the extra 'extinction index' used by the services query implementation for this purpose, and punted on it here since it didn't affect my use case (the nodes in our cluster are pretty stable). I agree it's necessary in general, though; I can work on adding it sometime soon.
OK, my first attempt at this is in 440db29, but it's largely a copy-paste of `catalogUpdateServiceExtinctionIndex` and related methods (this code is all unfamiliar territory to me). I haven't tested beyond the existing unit tests, and could use some guidance on any extra testing or refactoring that might be helpful for this part.
Any update on this being merged?
Hey @wjordan, as we stated earlier:
It turns out the work needed on the enterprise side will take more effort than we initially thought. Since nodes in our enterprise edition of Consul are partitioned, the new indexing changes will need to be rewritten in our enterprise fork to apply correctly in a per-partition context. Just like the RPC timeouts PR, we'll have to put this on hold until those enterprise changes are made. After that we can circle back on the testing and any other changes we might need. @NeckBeardPrince Please see the message above; we do plan on merging this, but right now I can't give an accurate timetable.
This pull request has been automatically flagged for inactivity because it has not been acted upon in the last 60 days. It will be closed if no new activity occurs in the next 30 days. Please feel free to re-open to resurrect the change if you feel this has happened by mistake. Thank you for your contributions.
This is still on the radar; it overlaps with some feature work we are doing, so there will be conflicts that we'll fix periodically until we are ready to merge this from the enterprise side as well.
The test change reflects the desired behavior: the index returned by the `NodeServices` query should match the requested node's latest service update.
@wjordan, are you okay with me force-pushing to your upstream? I've rebased from main and fixed a bunch of conflicts to prep for merge.
Force-pushed 440db29 to dbcf543.
Approved from enterprise side
Thank you for this PR and your patience. We've been working on our new Cluster Peering feature for the past few months which also touched a lot of the catalog bits, making it hard to find the right timing to get this merged. This is valuable for a lot of our users and we appreciate your continued contributions!
Fixes #12398 by adding fine-grained `node.[node]` entries to the `index` table, allowing blocking queries to return fine-grained indexes that prevent them from returning immediately when unrelated nodes/services are updated.
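From the client's perspective, the contract the description above improves can be sketched like this. A blocking query passes the last index it saw and the server responds only when the relevant index exceeds it; with a coarse index any node change satisfies that condition, while a per-node index does not. The names below are illustrative, not Consul API calls.

```go
package main

import "fmt"

// shouldWake models the core blocking-query condition: respond only
// when the watched index has moved past what the client last saw.
func shouldWake(lastSeen, current uint64) bool {
	return current > lastSeen
}

func main() {
	perNode := map[string]uint64{"web-1": 10, "db-1": 11}
	coarse := uint64(11) // the old shared index: max over all nodes

	lastSeen := uint64(10) // a watcher on web-1

	// Coarse index: db-1's update at 11 wakes the web-1 watcher even
	// though web-1 is unchanged (a spurious wakeup).
	fmt.Println(shouldWake(lastSeen, coarse)) // true

	// Fine-grained index: web-1 is unchanged, so the watcher blocks.
	fmt.Println(shouldWake(lastSeen, perNode["web-1"])) // false
}
```

At cluster scale, eliminating these spurious wakeups is what produces the load reduction reported at the top of this thread.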