Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unregister prom gauges when recycling cluster watcher #11875

Merged
merged 3 commits into from Jan 6, 2024
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
15 changes: 9 additions & 6 deletions multicluster/service-mirror/cluster_watcher.go
Expand Up @@ -173,10 +173,6 @@ func NewRemoteClusterServiceWatcher(
if err != nil {
return nil, fmt.Errorf("cannot initialize api for target cluster %s: %w", clusterName, err)
}
_, err = remoteAPI.Client.Discovery().ServerVersion()
if err != nil {
return nil, fmt.Errorf("cannot connect to api for target cluster %s: %w", clusterName, err)
}

// Create k8s event recorder
eventBroadcaster := record.NewBroadcaster()
Expand All @@ -188,7 +184,7 @@ func NewRemoteClusterServiceWatcher(
})

stopper := make(chan struct{})
return &RemoteClusterServiceWatcher{
rcsw := RemoteClusterServiceWatcher{
serviceMirrorNamespace: serviceMirrorNamespace,
link: link,
remoteAPIClient: remoteAPI,
Expand All @@ -207,7 +203,14 @@ func NewRemoteClusterServiceWatcher(
headlessServicesEnabled: enableHeadlessSvc,
// always instantiate the gatewayAlive=true to prevent unexpected service fail fast
gatewayAlive: true,
}, nil
}

_, err = remoteAPI.Client.Discovery().ServerVersion()
if err != nil {
return &rcsw, fmt.Errorf("cannot connect to api for target cluster %s: %w", clusterName, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's generally not considered sound to handle a value when err != nil; so we probably ought to omit the value.

Suggested change
return &rcsw, fmt.Errorf("cannot connect to api for target cluster %s: %w", clusterName, err)
return nil, fmt.Errorf("cannot connect to api for target cluster %s: %w", clusterName, err)

However, your description seems to indicate that this is load-bearing:

This fix reorders NewRemoteClusterServiceWatcher so that an object is returned even when there's an error, so cleanup on that object can be performed.

But the return value is not used at the call-site when an error is returned:

clusterWatcher, err = servicemirror.NewRemoteClusterServiceWatcher(
ctx,
namespace,
controllerK8sAPI,
cfg,
&link,
requeueLimit,
repairPeriod,
ch,
enableHeadlessSvc,
)
if err != nil {
return fmt.Errorf("unable to create cluster watcher: %w", err)
}

So, how does this change fix the problem exactly? How do we avoid introducing another problem like this. Can you add a comment so we don't easily run into this problem again?

Copy link
Member

@olix0r olix0r Jan 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OH! this is setting a global value. I don't think this is a sound pattern. Instead, the caller should use:

 	 cw, err = servicemirror.NewRemoteClusterServiceWatcher( 
	 	ctx, 
	 	namespace, 
	 	controllerK8sAPI, 
	 	cfg, 
	 	&link, 
	 	requeueLimit, 
	 	repairPeriod, 
	 	ch, 
	 	enableHeadlessSvc, 
	 ) 
	 if err != nil { 
	 	return fmt.Errorf("unable to create cluster watcher: %w", err) 
	 }
     clusterWatcher = cw

This removes the need for the change to NewRemoteClusterServiceWatcher.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't modify NewRemoteClusterServiceWatcher to return rcsw on an error, the caller won't be able to perform the gauges cleanup. Actually I've just thought of something else; we should be able to perform the cleanup directly inside NewRemoteClusterServiceWatcher before returning the error. I've just pushed that, LMKWYT.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense. Let's add a comment above this line explaining that the remoteAPI registers gauges and they must be explicitly unregistered on error. https://github.com/linkerd/linkerd2/pull/11875/files#diff-58391f2b0ac5849326792fbaf12a8e4aa8b06886acbe9fda308357d131ed38dcR172

}

return &rcsw, nil
}

func (rcsw *RemoteClusterServiceWatcher) mirroredResourceName(remoteName string) string {
Expand Down