Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds: Handle loops and ignore duplicates in aggregated cluster handling #5317

Merged
merged 6 commits into from
May 5, 2022

Conversation

zasweq
Copy link
Contributor

@zasweq zasweq commented Apr 21, 2022

This PR ignores duplicate clusters and also adds a maximum stack depth in aggregate cluster handling. This takes away the possibility of a stack overflow in aggregate cluster handling.

I'm still adding a few more test cases, but should be good to look at.

RELEASE NOTES: None

createdClusters map[string]*clusterNode

// To prevent duplicate clusters in a cluster update.
clustersSeenForUpdate map[string]bool // Switch this cluster handler set state -> function argument?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a field of the handler? I think you can just pass a new map as a parameter to the construct function each time you build the result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, good point, changed. This relates to Doug's preaching about only keeping state if absolutely necessary.

c.cancelFunc()
delete(c.clusterHandler.createdClusters, c.clusterUpdate.ClusterName)
for _, child := range c.children {
c.clusterHandler.createdClusters[child].delete()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In an extreme case, if a node is a child of itself, could this cause a nil panic, because the node was just deleted from the map in line 194?

If so, maybe add a nil check before calling delete()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the case that a node is a child of itself is interesting. Make sure there's a test covering it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it wouldn't delete in 194, since if a node is a child of itself, the ref count wouldn't be zero. However, I added the nil check just in case. Added a test for node being a child of itself. I thought a lot about what it should do if it builds an update without erroring (due to ignoring dups) and is an empty cluster update. I decided to ignore, you can see the behavior in my TestNodeChildOfItself test.

xds/internal/balancer/cdsbalancer/cluster_handler.go Outdated Show resolved Hide resolved
@menghanl menghanl assigned zasweq and unassigned menghanl Apr 22, 2022
Copy link
Contributor Author

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments and back and forth :D!

createdClusters map[string]*clusterNode

// To prevent duplicate clusters in a cluster update.
clustersSeenForUpdate map[string]bool // Switch this cluster handler set state -> function argument?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, good point, changed. This relates to Doug's preaching about only keeping state if absolutely necessary.

xds/internal/balancer/cdsbalancer/cluster_handler.go Outdated Show resolved Hide resolved
c.cancelFunc()
delete(c.clusterHandler.createdClusters, c.clusterUpdate.ClusterName)
for _, child := range c.children {
c.clusterHandler.createdClusters[child].delete()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it wouldn't delete in 194, since if a node is a child of itself, the ref count wouldn't be zero. However, I added the nil check just in case. Added a test for node being a child of itself. I thought a lot about what it should do if it builds an update without erroring (due to ignoring dups) and is an empty cluster update. I decided to ignore, you can see the behavior in my TestNodeChildOfItself test.

xds/internal/balancer/cdsbalancer/cluster_handler.go Outdated Show resolved Hide resolved
@menghanl menghanl assigned menghanl and unassigned zasweq Apr 26, 2022
if c.maxDepthErr != nil {
return nil, c.maxDepthErr
}
// Ignore duplicates.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some comment here, to explain why it's OK to ignore duplicates.
Use the example of [C,D,C] -> [C,D], where the second C doesn't matter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added - "Ignore duplicates. It's ok to ignore duplicates because the second occurrence of a cluster will never be used. I.e. in [C, D, C], the second C will never be used (the only way to fall back to lower priority D is if C is down, which means second C will never be chosen). Thus, [C, D, C] is logically equivalent to [C, D].

}

// Add and construct any new child nodes.
for child := range newChildren {
if _, inChildrenAlready := mapCurrentChildren[child]; !inChildrenAlready {
createdChild = true
mapCurrentChildren[child] = createClusterNode(child, c.clusterHandler.parent.xdsClient, c.clusterHandler)
if c.depth == maxDepth-1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, depth is not related to whether the children list is different, right? If this already exceeds the maxdepth, and receives another aggregated cluster update later, it's still the same error. Should we not send the second error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah what a catch! Moved to the loop in for _, childName := range clusterUpdate.PrioritizedClusterNames. Putting it inside makes it so empty child list doesn't trigger it. Yeah, I agree that another resp for the cluster that causes it to exceed depth should cause an error to be written to the buffer. However, I still think that another resp for another cluster as it builds cluster update shouldn't cause error to be written to buffer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also added assertion in test covering this.

@@ -298,6 +302,19 @@ func (c *clusterNode) handleResp(clusterUpdate xdsresource.ClusterUpdate, err er
// Aggregate cluster handling.
newChildren := make(map[string]bool)
for _, childName := range clusterUpdate.PrioritizedClusterNames {
if c.depth == maxDepth-1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why run this inside the loop? Wouldn't it be the same before the loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. If the update doesn't have Prioritized Cluster Names, it means that the aggregate cluster doesn't have children, so doesn't exceed max depth. Yes, this adds the overhead of this conditional every iteration through the for loop, but I think it's incorrect to put it outside of the loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed in QAZSE, moved outside the for loop and into an explicit if.

@menghanl menghanl assigned zasweq and unassigned menghanl Apr 29, 2022
@zasweq zasweq assigned menghanl and unassigned zasweq May 3, 2022
@menghanl menghanl assigned zasweq and unassigned menghanl May 5, 2022
@zasweq zasweq merged commit ee67b3d into grpc:master May 5, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 2, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants