Unregister prom gauges when recycling cluster watcher #11875

Merged

olix0r merged 3 commits into main from alpeb/dupe-gauges on Jan 6, 2024

Conversation

@alpeb alpeb (Member) commented Jan 3, 2024

Fixes #11839

When `restartClusterWatcher` fails to connect to the target cluster for whatever reason, the function gets called again 10s later and tries to register the same Prometheus metrics without first unregistering them, which generates warnings.

The problem lies in `NewRemoteClusterServiceWatcher`, which instantiates the remote kube-api client and registers the metrics, returning a nil object if the client can't connect. `cleanupWorkers` at the beginning of `restartClusterWatcher` won't unregister those metrics because of that nil object.

This fix reorders `NewRemoteClusterServiceWatcher` so that an object is returned even when there's an error, so cleanup can be performed on that object.

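For context, the shape of the fix described above is the pattern of returning the partially constructed object alongside the error, so the caller can tear it down before retrying. Here is a minimal sketch of that pattern; `watcher`, `newWatcher`, `probeGauge`, and `Cleanup` are illustrative names, not the actual linkerd identifiers:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// watcher is a stand-in for RemoteClusterServiceWatcher.
type watcher struct {
	probeGauge prometheus.Gauge
}

// newWatcher registers its gauge before attempting to connect. The
// load-bearing detail: on a failed connect it returns the watcher
// alongside the error instead of nil, so the caller can still call
// Cleanup before the next retry.
func newWatcher(clusterName string, connect func() error) (*watcher, error) {
	w := &watcher{
		probeGauge: prometheus.NewGauge(prometheus.GaugeOpts{
			Name:        "probe_gauge",
			ConstLabels: prometheus.Labels{"cluster": clusterName},
		}),
	}
	prometheus.MustRegister(w.probeGauge)

	if err := connect(); err != nil {
		return w, fmt.Errorf("cannot connect to api for target cluster %s: %w", clusterName, err)
	}
	return w, nil
}

// Cleanup unregisters the gauge so a retry can re-register it without
// producing a "duplicate metrics" warning.
func (w *watcher) Cleanup() {
	prometheus.Unregister(w.probeGauge)
}

func main() {
	w, err := newWatcher("target", func() error { return fmt.Errorf("connection refused") })
	if err != nil && w != nil {
		w.Cleanup() // with a nil return, this cleanup could never run
	}
}
```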
@alpeb alpeb requested a review from a team as a code owner January 3, 2024 22:03
@mateiidavid mateiidavid (Member) left a comment

Looks good. So, to fix this, the change basically ensures that any error encountered after building the remote client still retains a reference to a cluster watcher, e.g.:

```go
err = restartClusterWatcher(ctx, link, *namespace, creds, controllerK8sAPI, *requeueLimit, *repairPeriod, metrics, *enableHeadlessSvc)
if err != nil {
	// failed to restart cluster watcher; give a bit of slack
	// and restart the link watch to give it another try
	log.Error(err)
	time.Sleep(linkWatchRestartAfter)
	linkWatch.Stop()
}
case watch.Deleted:
```
If we reach this point and we instantiated a client (and, as a result, registered the metrics), we'll clean up the reference in the next iteration of the link watch and ensure we don't register the same metrics again.
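In other words, the caller-side cleanup only fires when it has something to clean. A hedged sketch of that guard, reusing the illustrative `watcher`/`Cleanup` names from the sketch above (`cleanupWorkers` is the real function name from the description; its body here is a guess, not the actual linkerd code):

```go
var clusterWatcher *watcher // reference retained between retries

// cleanupWorkers runs at the top of restartClusterWatcher. Before this
// fix, a failed connect left clusterWatcher nil, the guard below
// short-circuited, and the 10s retry re-registered the same gauges.
func cleanupWorkers() {
	if clusterWatcher != nil {
		clusterWatcher.Cleanup() // unregister the prometheus gauges
		clusterWatcher = nil
	}
}
```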

Good way to fix it imo.


```go
_, err = remoteAPI.Client.Discovery().ServerVersion()
if err != nil {
	return &rcsw, fmt.Errorf("cannot connect to api for target cluster %s: %w", clusterName, err)
}
```
A Member commented:

It's generally not considered sound to handle a value when `err != nil`, so we probably ought to omit the value.

Suggested change:

```diff
-return &rcsw, fmt.Errorf("cannot connect to api for target cluster %s: %w", clusterName, err)
+return nil, fmt.Errorf("cannot connect to api for target cluster %s: %w", clusterName, err)
```

However, your description seems to indicate that this is load-bearing:

> This fix reorders `NewRemoteClusterServiceWatcher` so that an object is returned even when there's an error, so cleanup on that object can be performed.

But the return value is not used at the call-site when an error is returned:

```go
clusterWatcher, err = servicemirror.NewRemoteClusterServiceWatcher(
	ctx,
	namespace,
	controllerK8sAPI,
	cfg,
	&link,
	requeueLimit,
	repairPeriod,
	ch,
	enableHeadlessSvc,
)
if err != nil {
	return fmt.Errorf("unable to create cluster watcher: %w", err)
}
```

So, how does this change fix the problem, exactly? How do we avoid introducing another problem like this? Can you add a comment so we don't easily run into this problem again?

@olix0r olix0r (Member) commented Jan 5, 2024

OH! This is setting a global value. I don't think this is a sound pattern. Instead, the caller should use:

```go
cw, err := servicemirror.NewRemoteClusterServiceWatcher(
	ctx,
	namespace,
	controllerK8sAPI,
	cfg,
	&link,
	requeueLimit,
	repairPeriod,
	ch,
	enableHeadlessSvc,
)
if err != nil {
	return fmt.Errorf("unable to create cluster watcher: %w", err)
}
clusterWatcher = cw
```

This removes the need for the change to NewRemoteClusterServiceWatcher.

@alpeb alpeb (Member, Author) commented:

If we don't modify `NewRemoteClusterServiceWatcher` to return `rcsw` on an error, the caller won't be able to perform the gauges cleanup. Actually, I've just thought of something else: we should be able to perform the cleanup directly inside `NewRemoteClusterServiceWatcher` before returning the error. I've just pushed that, LMKWYT.
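Roughly, the pushed revision would look something like this at the failing connectivity check (a sketch only; `metrics.unregister()` is an illustrative stand-in for whatever teardown the real gauges need):

```go
_, err = remoteAPI.Client.Discovery().ServerVersion()
if err != nil {
	// The gauges were registered earlier in the constructor; unregister
	// them here so returning nil no longer leaks the registration.
	metrics.unregister() // illustrative name, not the actual call
	return nil, fmt.Errorf("cannot connect to api for target cluster %s: %w", clusterName, err)
}
```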

A Member commented:

I think this makes sense. Let's add a comment above this line explaining that the `remoteAPI` registers gauges and they must be explicitly unregistered on error. https://github.com/linkerd/linkerd2/pull/11875/files#diff-58391f2b0ac5849326792fbaf12a8e4aa8b06886acbe9fda308357d131ed38dcR172
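Sketched, the requested comment might read as follows (wording is a suggestion, not the merged text):

```go
// Note: constructing remoteAPI above registers prometheus gauges as a
// side effect. Every error return after this point must explicitly
// unregister them, or the next restartClusterWatcher retry will attempt
// a duplicate registration and log a warning.
```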

@olix0r olix0r merged commit 7b2b01d into main Jan 6, 2024
34 checks passed
@olix0r olix0r deleted the alpeb/dupe-gauges branch January 6, 2024 02:07
mateiidavid added a commit that referenced this pull request Jan 12, 2024
This edge release introduces a number of different fixes and improvements. More
notably, it introduces a new `cni-repair-controller` binary to the CNI plugin
image. The controller will automatically restart pods that have not received
their iptables configuration.

* Removed shortnames from Tap API resources to avoid colliding with existing
  Kubernetes resources ([#11816]; fixes [#11784])
* Introduced a new ExternalWorkload CRD to support upcoming mesh expansion
  feature ([#11805])
* Changed `MeshTLSAuthentication` resource validation to allow SPIFFE URI
  identities ([#11882])
* Introduced a new `cni-repair-controller` to the `linkerd-cni` DaemonSet to
  automatically restart misconfigured pods that are missing iptables rules
  ([#11699]; fixes [#11073])
* Fixed a `"duplicate metrics"` warning in the multicluster service-mirror
  component ([#11875]; fixes [#11839])
* Added metric labels and weights to `linkerd diagnostics endpoints` json
  output ([#11889])
* Changed how `Server` updates are handled in the destination service. The
  change will ensure that during a cluster resync, consumers won't be
  overloaded by redundant updates ([#11907])
* Changed `linkerd install` error output to add a newline when a Kubernetes
  client cannot be successfully initialised ([#11917])

[#11816]: #11816
[#11784]: #11784
[#11805]: #11805
[#11882]: #11882
[#11699]: #11699
[#11073]: #11073
[#11875]: #11875
[#11839]: #11839
[#11889]: #11889
[#11907]: #11907
[#11917]: #11917

Signed-off-by: Matei David <matei@buoyant.io>
@mateiidavid mateiidavid mentioned this pull request Jan 12, 2024
mateiidavid added a commit that referenced this pull request Jan 12, 2024

mateiidavid added a commit that referenced this pull request Jan 12, 2024
Successfully merging this pull request may close these issues: duplicate metrics warnings in service-mirror (#11839)

3 participants