
clusterresolver: fix deadlock when dns resolver responds inline with update or error at build time #6563

Merged · 4 commits · Aug 23, 2023
4 changes: 2 additions & 2 deletions xds/internal/balancer/clusterresolver/clusterresolver.go
@@ -280,7 +280,7 @@ func (b *clusterResolverBalancer) handleErrorFromUpdate(err error, fromParent bool) {
// EDS resource was removed. No action needs to be taken for this, and we
// should continue watching the same EDS resource.
if fromParent && xdsresource.ErrType(err) == xdsresource.ErrorTypeResourceNotFound {
-		b.resourceWatcher.stop()
+		b.resourceWatcher.stop(false)
Contributor:
I'm having a bit of trouble following this logic (false plumbed down). Is it to support the use case described in this comment:

// Save the previous childrenMap to stop the children outside the mutex,
(i.e., reusing the LB policy in the future, so the callback serializer that the resource_resolver holds onto is still active?)

Contributor Author:
It is basically to handle the comment here:

// stop() is called when the LB policy is closed or when the underlying

stop() is called when:

  • the CDS resource is deleted, or
  • the LB policy is being stopped

And in the first case, when the CDS resource is added back, the cluster_resolver will get a config update with new mechanisms and will need to process it.

I initially thought about cancelling the serializer in stop() unconditionally and recreating it in updateMechanisms(), instead of in newResourceResolver. But this was hard to do, because then you need to protect access to the serializer with a mutex, and once you do that, there is no way to guarantee that onUpdate won't block on the mutex.
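
To make the lifecycle concrete, here is a minimal runnable sketch of the idea, not the actual grpc-go code: the serializer type, the resourceResolver shape, and the method names below are simplified stand-ins for the real implementation. It shows how a boolean on stop() lets the serializer survive a resource deletion but still be torn down when the LB policy closes.

```go
package main

import (
	"context"
	"fmt"
)

// serializer is a simplified stand-in for the callback serializer the real
// resource resolver uses: it runs queued callbacks one at a time on a single
// goroutine, so callers never need a mutex around the state those callbacks
// touch.
type serializer struct {
	callbacks chan func()
	cancel    context.CancelFunc
	done      chan struct{}
}

func newSerializer() *serializer {
	ctx, cancel := context.WithCancel(context.Background())
	s := &serializer{
		callbacks: make(chan func(), 16),
		cancel:    cancel,
		done:      make(chan struct{}),
	}
	go func() {
		defer close(s.done)
		for {
			select {
			case cb := <-s.callbacks:
				cb()
			case <-ctx.Done():
				return
			}
		}
	}()
	return s
}

func (s *serializer) schedule(cb func()) { s.callbacks <- cb }

func (s *serializer) close() {
	s.cancel()
	<-s.done
}

// resourceResolver sketches only the part of the type that matters here.
type resourceResolver struct {
	serializer *serializer
}

// stop cancels the underlying watches (elided). Only when closing is true,
// i.e. the LB policy itself is shutting down, does it also tear down the
// serializer. On a resource-not-found error (closing == false) the
// serializer stays alive, so a later updateMechanisms() call can still
// schedule work on it without any locking.
func (rr *resourceResolver) stop(closing bool) {
	// ... cancel DNS/EDS watchers here ...
	if closing {
		rr.serializer.close()
	}
}

func main() {
	rr := &resourceResolver{serializer: newSerializer()}

	// Case 1: the CDS resource is removed. Watches stop, serializer survives.
	rr.stop(false)
	done := make(chan struct{})
	rr.serializer.schedule(func() {
		fmt.Println("config update after the resource is re-added still runs")
		close(done)
	})
	<-done

	// Case 2: the LB policy is closed. Everything is torn down.
	rr.stop(true)
}
```

Keeping the serializer's lifetime tied to the resourceResolver, rather than cancelling and recreating it, is what avoids the mutex described above: onUpdate can always schedule onto the same serializer without locking.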

Contributor:
Ok, sounds good, thanks for the explanation. It keeps its lifecycle coupled with this resource resolver type, which sounds fine since it's a buffer the resource resolver type uses.

}

if b.child != nil {
@@ -326,7 +326,7 @@ func (b *clusterResolverBalancer) run() {
// Close results in stopping the endpoint resolvers and closing the
// underlying child policy and is the only way to exit this goroutine.
case <-b.closed.Done():
-			b.resourceWatcher.stop()
+			b.resourceWatcher.stop(true)

if b.child != nil {
b.child.Close()