Unregister prom gauges when recycling cluster watcher #11875

Merged

olix0r merged 3 commits into main from alpeb/dupe-gauges on Jan 6, 2024

Conversation

@alpeb alpeb (Member) commented Jan 3, 2024

Fixes #11839

When `restartClusterWatcher` fails to connect to the target cluster for whatever reason, the function gets called again 10s later and tries to register the same Prometheus metrics without first unregistering them, which generates warnings.

The problem lies in `NewRemoteClusterServiceWatcher`, which instantiates the remote kube-api client and registers the metrics, returning a nil object if the client can't connect. `cleanupWorkers` at the beginning of `restartClusterWatcher` won't unregister those metrics because of that nil object.

This fix reorders `NewRemoteClusterServiceWatcher` so that an object is returned even when there's an error, so cleanup can be performed on that object.

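For context, the shape of the fix described above is the pattern of returning the partially constructed object alongside the error, so the caller can tear it down before retrying. Here is a minimal sketch of that pattern; `watcher`, `newWatcher`, `probeGauge`, and `Cleanup` are illustrative names, not the actual linkerd identifiers:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// watcher is a stand-in for RemoteClusterServiceWatcher.
type watcher struct {
	probeGauge prometheus.Gauge
}

// newWatcher registers its gauge before attempting to connect. The
// load-bearing detail: on a failed connect it returns the watcher
// alongside the error instead of nil, so the caller can still call
// Cleanup before the next retry.
func newWatcher(clusterName string, connect func() error) (*watcher, error) {
	w := &watcher{
		probeGauge: prometheus.NewGauge(prometheus.GaugeOpts{
			Name:        "probe_gauge",
			ConstLabels: prometheus.Labels{"cluster": clusterName},
		}),
	}
	prometheus.MustRegister(w.probeGauge)

	if err := connect(); err != nil {
		return w, fmt.Errorf("cannot connect to api for target cluster %s: %w", clusterName, err)
	}
	return w, nil
}

// Cleanup unregisters the gauge so a retry can re-register it without
// producing a "duplicate metrics" warning.
func (w *watcher) Cleanup() {
	prometheus.Unregister(w.probeGauge)
}

func main() {
	w, err := newWatcher("target", func() error { return fmt.Errorf("connection refused") })
	if err != nil && w != nil {
		w.Cleanup() // with a nil return, this cleanup could never run
	}
}
```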
@alpeb alpeb requested a review from a team as a code owner January 3, 2024 22:03
@mateiidavid mateiidavid (Member) left a comment

Looks good. So, to fix this, the change basically ensures that any error encountered after building the remote client still retains a reference to a cluster watcher, e.g.:

```go
err = restartClusterWatcher(ctx, link, *namespace, creds, controllerK8sAPI, *requeueLimit, *repairPeriod, metrics, *enableHeadlessSvc)
if err != nil {
	// failed to restart cluster watcher; give a bit of slack
	// and restart the link watch to give it another try
	log.Error(err)
	time.Sleep(linkWatchRestartAfter)
	linkWatch.Stop()
}
case watch.Deleted:
```
If we reach this point and we instantiated a client (and, as a result, registered the metrics), we'll clean up the reference in the next iteration of the link watch and ensure we don't register the same metrics again.
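In other words, the caller-side cleanup only fires when it has something to clean. A hedged sketch of that guard, reusing the illustrative `watcher`/`Cleanup` names from the sketch above (`cleanupWorkers` is the real function name from the description; its body here is a guess, not the actual linkerd code):

```go
var clusterWatcher *watcher // reference retained between retries

// cleanupWorkers runs at the top of restartClusterWatcher. Before this
// fix, a failed connect left clusterWatcher nil, the guard below
// short-circuited, and the 10s retry re-registered the same gauges.
func cleanupWorkers() {
	if clusterWatcher != nil {
		clusterWatcher.Cleanup() // unregister the prometheus gauges
		clusterWatcher = nil
	}
}
```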

Good way to fix it imo.


```go
_, err = remoteAPI.Client.Discovery().ServerVersion()
if err != nil {
	return &rcsw, fmt.Errorf("cannot connect to api for target cluster %s: %w", clusterName, err)
}
```
A Member commented:

It's generally not considered sound to handle a value when `err != nil`, so we probably ought to omit the value.

Suggested change:

```diff
-return &rcsw, fmt.Errorf("cannot connect to api for target cluster %s: %w", clusterName, err)
+return nil, fmt.Errorf("cannot connect to api for target cluster %s: %w", clusterName, err)
```

However, your description seems to indicate that this is load-bearing:

> This fix reorders `NewRemoteClusterServiceWatcher` so that an object is returned even when there's an error, so cleanup on that object can be performed.

But the return value is not used at the call-site when an error is returned:

```go
clusterWatcher, err = servicemirror.NewRemoteClusterServiceWatcher(
	ctx,
	namespace,
	controllerK8sAPI,
	cfg,
	&link,
	requeueLimit,
	repairPeriod,
	ch,
	enableHeadlessSvc,
)
if err != nil {
	return fmt.Errorf("unable to create cluster watcher: %w", err)
}
```

So, how does this change fix the problem, exactly? How do we avoid introducing another problem like this? Can you add a comment so we don't easily run into this problem again?

@olix0r olix0r (Member) commented Jan 5, 2024

OH! This is setting a global value. I don't think this is a sound pattern. Instead, the caller should use:

```go
cw, err := servicemirror.NewRemoteClusterServiceWatcher(
	ctx,
	namespace,
	controllerK8sAPI,
	cfg,
	&link,
	requeueLimit,
	repairPeriod,
	ch,
	enableHeadlessSvc,
)
if err != nil {
	return fmt.Errorf("unable to create cluster watcher: %w", err)
}
clusterWatcher = cw
```

This removes the need for the change to NewRemoteClusterServiceWatcher.

@alpeb alpeb (Member, Author) commented:

If we don't modify `NewRemoteClusterServiceWatcher` to return `rcsw` on an error, the caller won't be able to perform the gauges cleanup. Actually, I've just thought of something else: we should be able to perform the cleanup directly inside `NewRemoteClusterServiceWatcher` before returning the error. I've just pushed that, LMKWYT.
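Roughly, the pushed revision would look something like this at the failing connectivity check (a sketch only; `metrics.unregister()` is an illustrative stand-in for whatever teardown the real gauges need):

```go
_, err = remoteAPI.Client.Discovery().ServerVersion()
if err != nil {
	// The gauges were registered earlier in the constructor; unregister
	// them here so returning nil no longer leaks the registration.
	metrics.unregister() // illustrative name, not the actual call
	return nil, fmt.Errorf("cannot connect to api for target cluster %s: %w", clusterName, err)
}
```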

A Member commented:

I think this makes sense. Let's add a comment above this line explaining that the `remoteAPI` registers gauges and they must be explicitly unregistered on error. https://github.com/linkerd/linkerd2/pull/11875/files#diff-58391f2b0ac5849326792fbaf12a8e4aa8b06886acbe9fda308357d131ed38dcR172
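Sketched, the requested comment might read as follows (wording is a suggestion, not the merged text):

```go
// Note: constructing remoteAPI above registers prometheus gauges as a
// side effect. Every error return after this point must explicitly
// unregister them, or the next restartClusterWatcher retry will attempt
// a duplicate registration and log a warning.
```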

@olix0r olix0r merged commit 7b2b01d into main Jan 6, 2024
34 checks passed
@olix0r olix0r deleted the alpeb/dupe-gauges branch January 6, 2024 02:07
mateiidavid added a commit that referenced this pull request Jan 12, 2024
This edge release introduces a number of different fixes and improvements. More
notably, it introduces a new `cni-repair-controller` binary to the CNI plugin
image. The controller will automatically restart pods that have not received
their iptables configuration.

* Removed shortnames from Tap API resources to avoid colliding with existing
  Kubernetes resources ([#11816]; fixes [#11784])
* Introduced a new ExternalWorkload CRD to support upcoming mesh expansion
  feature ([#11805])
* Changed `MeshTLSAuthentication` resource validation to allow SPIFFE URI
  identities ([#11882])
* Introduced a new `cni-repair-controller` to the `linkerd-cni` DaemonSet to
  automatically restart misconfigured pods that are missing iptables rules
  ([#11699]; fixes [#11073])
* Fixed a `"duplicate metrics"` warning in the multicluster service-mirror
  component ([#11875]; fixes [#11839])
* Added metric labels and weights to `linkerd diagnostics endpoints` json
  output ([#11889])
* Changed how `Server` updates are handled in the destination service. The
  change will ensure that during a cluster resync, consumers won't be
  overloaded by redundant updates ([#11907])
* Changed `linkerd install` error output to add a newline when a Kubernetes
  client cannot be successfully initialised ([#11917])

[#11816]: #11816
[#11784]: #11784
[#11805]: #11805
[#11882]: #11882
[#11699]: #11699
[#11073]: #11073
[#11875]: #11875
[#11839]: #11839
[#11889]: #11889
[#11907]: #11907
[#11917]: #11917

Signed-off-by: Matei David <matei@buoyant.io>
@mateiidavid mateiidavid mentioned this pull request Jan 12, 2024
mateiidavid added a commit that referenced this pull request Jan 12, 2024

mateiidavid added a commit that referenced this pull request Jan 12, 2024
Successfully merging this pull request may close these issues: duplicate metrics warnings in service-mirror (#11839)

3 participants