Fix fine-grained blocking queries for Catalog.NodeServices #12399

Merged
merged 4 commits into hashicorp:main on Jun 23, 2022

Conversation

@wjordan (Contributor) commented Feb 19, 2022

Fixes #12398 by adding fine-grained node.<node> entries to the index table, allowing blocking queries to return fine-grained indexes so they no longer return immediately when unrelated nodes or services are updated.
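For illustration, here is a minimal, self-contained sketch of the idea, using a hypothetical indexTable map rather than Consul's actual state-store code: each node gets its own node.<name> index entry, so a blocking query watching a single node only wakes up when that node's index advances.

```go
// Hypothetical illustration of per-node index entries; not Consul's actual code.
package main

import "fmt"

// indexTable stands in for the state store's index table, mapping index keys
// to the Raft index of the last write that touched them.
type indexTable map[string]uint64

// updateNodeIndex records the index of the latest write touching nodeName
// under a fine-grained "node.<name>" key, alongside the coarse "nodes" key.
func (t indexTable) updateNodeIndex(nodeName string, idx uint64) {
	t["node."+nodeName] = idx
	t["nodes"] = idx // queries spanning all nodes still observe every change
}

// maxIndexForNode is what a blocking query on one node would report as its
// index; it only advances when that node changes (with a coarse fallback).
func (t indexTable) maxIndexForNode(nodeName string) uint64 {
	if idx, ok := t["node."+nodeName]; ok {
		return idx
	}
	return t["nodes"] // fall back to the coarse index for unknown nodes
}

func main() {
	t := indexTable{}
	t.updateNodeIndex("web-1", 10)
	t.updateNodeIndex("db-1", 11) // an unrelated node is updated
	// A blocking query on web-1 that last saw index 10 keeps waiting:
	fmt.Println(t.maxIndexForNode("web-1")) // still 10, no spurious wake-up
}
```

In the PR itself this is wired into the state store's catalog write path (the catalogUpdateNodeIndexes helper discussed in the review below) rather than a standalone map.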

@wjordan (Contributor, Author) commented Feb 19, 2022

This patch has a significant impact on overall load in our Consul cluster, which performs a large number of blocking queries to /catalog/node/:node:

[Screenshots: cluster load metrics before and after the patch]
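For context on what these queries look like, here is a minimal Go sketch of a blocking query against this endpoint (the agent address, node name, and index value are placeholders). Per the linked issue, before this patch such a request could return as soon as any node or service in the catalog changed, not just the watched node.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Long-poll the catalog for a single node: pass the last observed
	// X-Consul-Index as ?index= plus a maximum wait time. The request
	// returns early only when the reported index changes.
	lastIndex := "10" // placeholder: X-Consul-Index from a previous response
	url := "http://127.0.0.1:8500/v1/catalog/node/web-1?index=" + lastIndex + "&wait=5m"

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println("X-Consul-Index:", resp.Header.Get("X-Consul-Index"))
	fmt.Println(string(body))
}
```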

@kisunji (Contributor) commented Feb 25, 2022

Thanks for this PR! I'm trying to get another set of eyes on it from the team. There will be some work involved on our side to merge this into our enterprise codebase cleanly, but I hope to get back to you by the end of next week.

@dnephin (Contributor) left a comment

Nice, thank you for the PR! I think at a high level this is a good approach. I haven't looked to see if catalogUpdateNodeIndexes is called in all necessary places, but generally it seems right.

Left one comment about dealing with deleted nodes.

Comment on lines 679 to 776
```go
if err := catalogUpdateNodeIndexes(tx, nodeName, idx, entMeta); err != nil {
	return fmt.Errorf("failed updating node index: %s", err)
}
```
@dnephin (Contributor) commented:

I think for deletes we probably need to do something like catalogUpdateServiceExtinctionIndex. If we track an index for every deleted node, the index table will grow forever: we'd keep an entry for every deleted node name, which is unbounded.

By using an extinction index for deleted items, we only ever keep a single entry covering all deleted nodes. It's slightly less optimal in terms of reducing blockingQuery churn, but it is much safer in terms of the space used by index tracking.
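As a rough illustration of the extinction-index idea (again using a hypothetical indexTable map, not Consul's actual catalogUpdateServiceExtinctionIndex code): deleting a node removes its fine-grained entry and records the delete under a single shared key, so the index table stays bounded while queries on deleted nodes still see an index change.

```go
// Hypothetical illustration of a node extinction index; not Consul's actual code.
package main

import "fmt"

type indexTable map[string]uint64

// deleteNodeIndex drops the node's fine-grained entry and bumps one shared
// "node-extinction" key, so deleted nodes never accumulate index entries.
func (t indexTable) deleteNodeIndex(nodeName string, idx uint64) {
	delete(t, "node."+nodeName)
	t["node-extinction"] = idx
	t["nodes"] = idx
}

// maxIndexForNode falls back to the extinction index when the node has no
// entry, so a blocking query on a just-deleted node still unblocks.
func (t indexTable) maxIndexForNode(nodeName string) uint64 {
	if idx, ok := t["node."+nodeName]; ok {
		return idx // the node exists: use its fine-grained index
	}
	// One shared key is the "slightly less optimal" part: any deletion wakes
	// watchers of every deleted node, but the table stays bounded.
	return t["node-extinction"]
}

func main() {
	t := indexTable{"node.web-1": 10, "nodes": 10}
	t.deleteNodeIndex("web-1", 12)
	fmt.Println(t.maxIndexForNode("web-1")) // 12: the deletion is observable
	fmt.Println(len(t))                     // 2: no per-deleted-node entries remain
}
```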

@wjordan (Contributor, Author) replied:

Yes, I noticed the extra 'extinction index' used by the services query implementation for this purpose and punted on it here since it didn't affect my use case (the nodes in our cluster are pretty stable). I agree it's necessary in general, though; I can work on adding it sometime soon.

@wjordan (Contributor, Author) replied:

OK, my first attempt at this is in 440db29, but it's largely a copy-paste of catalogUpdateServiceExtinctionIndex and related methods (this code is all unfamiliar territory to me). I haven't tested beyond the existing unit tests, and could use some guidance on any extra testing/refactoring that might be helpful for this extra part.

@hashicorp-cla commented Mar 12, 2022

CLA assistant check
All committers have signed the CLA.

@NeckBeardPrince commented:

Any update on this being merged?

@Amier3 (Contributor) commented Mar 31, 2022

Hey @wjordan

As we stated earlier:

> There will be some work involved on our side to merge this into our enterprise codebase cleanly

It turns out the changes needed on the enterprise side will take more work than we initially thought. Since nodes in our enterprise edition of Consul are partitioned, the new indexing changes will need to be rewritten in our enterprise fork to apply correctly in a per-partition context.

Just like the RPC timeouts PR, we'll have to put this on hold until those enterprise changes are made. After that we can circle back on the testing and any other changes we might need.

We're trying to be as careful as possible with these PRs considering they have a big effect throughout Consul. We want to reiterate that we appreciate this PR and all your others, and we're working hard to get them merged 🧑‍🏭. We'll keep you updated when we get things finished on our side.

@NeckBeardPrince Please see the message above; we do plan on merging this, but right now I can't give an accurate timetable.

@github-actions (bot) commented:

This pull request has been automatically flagged for inactivity because it has not been acted upon in the last 60 days. It will be closed if no new activity occurs in the next 30 days. Please feel free to re-open to resurrect the change if you feel this has happened by mistake. Thank you for your contributions.

github-actions bot added the meta/stale label (automatically flagged for inactivity by stalebot) on May 31, 2022
kisunji removed the meta/stale label on May 31, 2022

@kisunji (Contributor) commented May 31, 2022

This is still on the radar. It overlaps with some feature work we have in progress, so there will be conflicts that we'll fix periodically until we are ready to merge this from the enterprise side as well.

wjordan and others added 3 commits on June 22, 2022 at 10:32

Test change reflects the desired behavior: the index returned by the NodeServices query should match the requested node's latest service update.

@kisunji (Contributor) commented Jun 22, 2022

@wjordan are you okay with me force-pushing to your upstream? I've rebased from main and fixed a bunch of conflicts to prep for merge

@kisunji (Contributor) left a comment

Approved from the enterprise side.

Thank you for this PR and your patience. We've been working on our new Cluster Peering feature for the past few months, which also touched a lot of the catalog code, making it hard to find the right time to get this merged. This is valuable for a lot of our users, and we appreciate your continued contributions!

kisunji merged commit 34ecbc1 into hashicorp:main on Jun 23, 2022
Merging this pull request closes #12398: Spurious blocking-query returns on catalog/node queries.

8 participants