Fix fine-grained blocking queries for Catalog.NodeServices #12399
Conversation
This patch significantly reduces overall load in our Consul cluster, which performs a heavy number of blocking queries.
Thanks for this PR! Trying to get another set of eyes on it from the team. There will be some work involved on our side to merge it to our enterprise side cleanly, but hoping to get back to you by the end of next week.
Nice, thank you for the PR! I think at a high level this is a good approach. I haven't looked to see if `catalogUpdateNodeIndexes` is called in all necessary places, but generally it seems right.
Left one comment about dealing with deleted nodes.
agent/consul/state/catalog.go (outdated)
```go
if err := catalogUpdateNodeIndexes(tx, nodeName, idx, entMeta); err != nil {
	return fmt.Errorf("failed updating node index: %s", err)
}
```
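To make the indexing scheme under discussion concrete, here is a minimal, self-contained sketch of the idea behind the PR, using a plain map to stand in for Consul's memdb-backed `index` table. The `node.<name>` key format is from this PR; the helper names and update logic here are simplified illustrations, not the real transaction code.

```go
package main

import "fmt"

// indexTable is a toy model of Consul's "index" table: it maps an
// index key to the last Raft index at which that key was modified.
type indexTable map[string]uint64

// updateNodeIndex sketches what catalogUpdateNodeIndexes does at a
// high level: bump both the coarse "nodes" index and the fine-grained
// per-node "node.<name>" entry.
func updateNodeIndex(tbl indexTable, node string, raftIdx uint64) {
	tbl["nodes"] = raftIdx
	tbl["node."+node] = raftIdx
}

// maxIndexForNode is what a blocking query on one node would consult:
// the fine-grained entry if present, falling back to the coarse one.
func maxIndexForNode(tbl indexTable, node string) uint64 {
	if idx, ok := tbl["node."+node]; ok {
		return idx
	}
	return tbl["nodes"]
}

func main() {
	tbl := indexTable{}
	updateNodeIndex(tbl, "web-1", 10)
	updateNodeIndex(tbl, "db-1", 11) // an unrelated node changes later

	// A watcher on web-1 still sees index 10, so its blocking query
	// keeps waiting instead of waking up for db-1's change.
	fmt.Println(maxIndexForNode(tbl, "web-1")) // 10
	fmt.Println(maxIndexForNode(tbl, "db-1"))  // 11
}
```

Before this PR, both watchers would have consulted the shared coarse index (11 here), so every node change woke every watcher.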
I think for deletes we probably need to do something like `catalogUpdateServiceExtinctionIndex`. If we track an index for every deleted node name, that set is unbounded and will grow forever.
I think that by using an extinction index for deleted items we only ever keep one entry covering every deleted node. It's slightly less optimal in terms of reducing blockingQuery churn, but it is much safer in terms of the space used by index tracking.
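The trade-off described above can be sketched as follows. This is a toy illustration of the extinction-index idea (the `node_extinction` key name is hypothetical, modeled on `catalogUpdateServiceExtinctionIndex`): deletion removes the per-node key and bumps a single shared key, so the index table stays bounded at the cost of occasionally waking unrelated watchers.

```go
package main

import "fmt"

// indexTable is the same toy model of Consul's "index" table.
type indexTable map[string]uint64

// deleteNode removes the per-node entry instead of bumping it forever,
// and records the deletion in one shared extinction entry.
func deleteNode(tbl indexTable, node string, raftIdx uint64) {
	delete(tbl, "node."+node)        // drop the per-node entry
	tbl["node_extinction"] = raftIdx // single entry for all deletions
}

// watchIndex: a blocking query must also consider the extinction
// index, or it would never observe the watched node's own deletion.
func watchIndex(tbl indexTable, node string) uint64 {
	idx := tbl["node."+node] // zero if the node was deleted
	if ext := tbl["node_extinction"]; ext > idx {
		return ext
	}
	return idx
}

func main() {
	tbl := indexTable{"node.web-1": 10, "node.db-1": 11}
	deleteNode(tbl, "db-1", 12)

	fmt.Println(watchIndex(tbl, "db-1"))  // 12: the db-1 watcher wakes
	fmt.Println(watchIndex(tbl, "web-1")) // 12: spurious wakeup (the churn cost)
	fmt.Println(len(tbl))                 // 2: the table did not grow
}
```

The spurious wakeup on `web-1` is the "slightly less optimal churn" mentioned above; the bounded table size is the safety win.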
Yes, I noticed the extra 'extinction index' used by the services query implementation for this purpose, and punted on it here since it didn't affect my use case (the nodes in our cluster are pretty stable). I agree it's necessary in general, though; I can work on adding it sometime soon.
OK, my first attempt at this is in 440db29, but it's largely a copy-paste of `catalogUpdateServiceExtinctionIndex` and related methods (this code is all unfamiliar territory to me). I haven't tested beyond the existing unit tests, and could use some guidance on any extra testing or refactoring that might be helpful for this part.
Any update on this being merged?
Hey @wjordan, as we stated earlier:
It turns out the work needed on the enterprise side will take more effort than we initially thought. Since nodes in our enterprise edition of Consul are partitioned, the new indexing changes will need to be rewritten in our enterprise fork to apply correctly in a per-partition context. Just like the RPC timeouts PR, we'll have to put this on hold until those enterprise changes are made. After that we can circle back on the testing and any other changes we might need. @NeckBeardPrince Please see the message above; we do plan on merging this, but right now I can't give an accurate timetable.
This pull request has been automatically flagged for inactivity because it has not been acted upon in the last 60 days. It will be closed if no new activity occurs in the next 30 days. Please feel free to re-open to resurrect the change if you feel this has happened by mistake. Thank you for your contributions.
This is still on the radar; it overlaps with some feature work we are doing, so there will be conflicts that we'll fix periodically until we are ready to merge this from the enterprise side as well.
The test change reflects the desired behavior: the index returned by the `NodeServices` query should match the requested node's latest service update.
@wjordan, are you okay with me force-pushing to your upstream? I've rebased from main and fixed a bunch of conflicts to prep for merge.
Force-pushed 440db29 to dbcf543.
Approved from enterprise side
Thank you for this PR and your patience. We've been working on our new Cluster Peering feature for the past few months which also touched a lot of the catalog bits, making it hard to find the right timing to get this merged. This is valuable for a lot of our users and we appreciate your continued contributions!
Fixes #12398 by adding fine-grained `node.[node]` entries to the `index` table, allowing blocking queries to return fine-grained indexes that prevent them from returning immediately when unrelated nodes/services are updated.
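From the client's perspective, the contract the description above improves can be sketched like this. A blocking query passes the last index it saw and the server responds only when the relevant index exceeds it; with a coarse index any node change satisfies that condition, while a per-node index does not. The names below are illustrative, not Consul API calls.

```go
package main

import "fmt"

// shouldWake models the core blocking-query condition: respond only
// when the watched index has moved past what the client last saw.
func shouldWake(lastSeen, current uint64) bool {
	return current > lastSeen
}

func main() {
	perNode := map[string]uint64{"web-1": 10, "db-1": 11}
	coarse := uint64(11) // the old shared index: max over all nodes

	lastSeen := uint64(10) // a watcher on web-1

	// Coarse index: db-1's update at 11 wakes the web-1 watcher even
	// though web-1 is unchanged (a spurious wakeup).
	fmt.Println(shouldWake(lastSeen, coarse)) // true

	// Fine-grained index: web-1 is unchanged, so the watcher blocks.
	fmt.Println(shouldWake(lastSeen, perNode["web-1"])) // false
}
```

At cluster scale, eliminating these spurious wakeups is what produces the load reduction reported at the top of this thread.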