Fix memory leak in service mirror #10833

Merged
1 commit merged into main from alpeb/svc-mirror-memleak on May 2, 2023
Conversation

@alpeb (Member) commented on Apr 27, 2023

Fixes #10746

## Repro

By leaving two linked clusters alone and monitoring the service mirror pod's RSS (via the `container_memory_working_set_bytes` metric) and goroutine count (via the `go_goroutines` metric), one can see a small but consistent memory increase accompanied by periodic jumps in goroutines, in batches of 5.

![Screenshot from 2023-04-27 09-28-48](https://user-images.githubusercontent.com/554287/235007199-604dd517-47ae-4f95-a863-21ae16eca898.png)

## Diagnosis

These goroutine-count increases can be correlated in the controller's log with the probe worker and cluster watcher being recycled:

```
time="2023-04-27T10:06:14Z" level=info msg="Stopping probe worker" probe-key=target
time="2023-04-27T10:06:14Z" level=info msg="Starting probe worker" probe-key=target
time="2023-04-27T10:06:14Z" level=info msg="Received: Stop" apiAddress="https://192.168.32.4:6443" cluster=remote
```

With pprof enabled, the beginning of the goroutine stack dump looks like this:

```
goroutine profile: total 94
10 @ 0x43c976 0x406d3b 0x406878 0x175ee77 0xb2aefe 0xb2adb6 0xb2aca9 0x175edeb 0x175ed95 0xb2cfda 0x46de41
#	0x175ee76	k8s.io/client-go/tools/cache.(*processorListener).run.func1+0x56	/go/pkg/mod/k8s.io/client-go@v0.27.1/tools/cache/shared_informer.go:968
#	0xb2aefd	k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1+0x3d		/go/pkg/mod/k8s.io/apimachinery@v0.27.1/pkg/util/wait/backoff.go:226
#	0xb2adb5	k8s.io/apimachinery/pkg/util/wait.BackoffUntil+0xb5			/go/pkg/mod/k8s.io/apimachinery@v0.27.1/pkg/util/wait/backoff.go:227
#	0xb2aca8	k8s.io/apimachinery/pkg/util/wait.JitterUntil+0x88			/go/pkg/mod/k8s.io/apimachinery@v0.27.1/pkg/util/wait/backoff.go:204
#	0x175edea	k8s.io/apimachinery/pkg/util/wait.Until+0x6a				/go/pkg/mod/k8s.io/apimachinery@v0.27.1/pkg/util/wait/backoff.go:161
#	0x175ed94	k8s.io/client-go/tools/cache.(*processorListener).run+0x14		/go/pkg/mod/k8s.io/client-go@v0.27.1/tools/cache/shared_informer.go:967
#	0xb2cfd9	k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1+0x59		/go/pkg/mod/k8s.io/apimachinery@v0.27.1/pkg/util/wait/wait.go:72

10 @ 0x43c976 0x44c81c 0x175eb99 0xb2cfda 0x46de41
#	0x175eb98	k8s.io/client-go/tools/cache.(*processorListener).pop+0x118	/go/pkg/mod/k8s.io/client-go@v0.27.1/tools/cache/shared_informer.go:938
#	0xb2cfd9	k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1+0x59	/go/pkg/mod/k8s.io/apimachinery@v0.27.1/pkg/util/wait/wait.go:72

8 @ 0x43c976 0x406d3b 0x406878 0x19f35d3 0x46de41
#	0x19f35d2	k8s.io/client-go/tools/record.(*eventBroadcasterImpl).StartEventWatcher.func1+0x72	/go/pkg/mod/k8s.io/client-go@v0.27.1/tools/record/event.go:320

8 @ 0x43c976 0x406d3b 0x406878 0xa88825 0x46de41
#	0xa88824	k8s.io/apimachinery/pkg/watch.(*Broadcaster).loop+0x64	/go/pkg/mod/k8s.io/apimachinery@v0.27.1/pkg/watch/mux.go:268

8 @ 0x43c976 0x46abb5 0x1a2a7a5 0x46de41
#	0x46abb4	time.Sleep+0x134								/usr/local/go/src/runtime/time.go:195
#	0x1a2a7a4	github.com/linkerd/linkerd2/multicluster/service-mirror.(*Ticker).loop+0x64	/linkerd-build/multicluster/service-mirror/jittered_ticker.go:55

```

The number of goroutines shown here closely matches the number of goroutine-count jumps seen in the chart by the time this dump was taken, which suggests these are indeed the leaked goroutines.
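
As a reference, a dump in this format is Go's standard pprof goroutine profile (`?debug=1`). A minimal sketch of exposing it in a Go binary follows; the listen address is illustrative, not the service mirror's actual admin endpoint:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Once this is listening, a plain-text goroutine dump like the one above
	// can be fetched with:
	//   curl 'http://localhost:6060/debug/pprof/goroutine?debug=1'
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```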

The service mirror's main function contains a loop that restarts the cluster watcher (and with it its probe worker) whenever the main watch following the Link resources terminates (I don't know why that happens), and some resources weren't being cleaned up on restart, producing the leak. It would be worth investigating, as a follow-up, why that restart is required here but not in the other controllers.

## Resolution

Each of those entries in the dump maps to one of the following leaks:

### jittered_ticker.go

Each probe worker starts a jittered ticker whose `Stop()` method was never called. Fixed that with a `defer` statement.
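
A minimal sketch of the pattern (not the actual service-mirror code: the stdlib `time.Ticker` stands in for the jittered `Ticker` from `jittered_ticker.go`, and the function and channel names are illustrative):

```go
package main

import "time"

// runProbeWorker sketches the shape of the fix: the worker owns its ticker and
// defers Stop() so the ticker is released when the worker is recycled.
func runProbeWorker(stopCh <-chan struct{}) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop() // previously missing, so each recycled worker left its ticker behind

	for {
		select {
		case <-ticker.C:
			// probe the remote gateway here
		case <-stopCh:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	go runProbeWorker(stop)
	time.Sleep(time.Second)
	close(stop)
}
```

Deferring `Stop()` ties the ticker's lifetime to the worker's; in the real code this is what lets the `Ticker.loop` goroutine (the `time.Sleep` frames in the dump above) exit instead of accumulating with every probe-worker restart.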

### event.go

With every watcher restart a new Kubernetes event recorder was instantiated but never cleaned up. Added a `Shutdown()` call on the event broadcaster in the cluster watcher's `Stop()` method.
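
A hedged sketch of the pattern using client-go's `record` package; the struct, constructor, and component name below are illustrative, not the actual cluster-watcher code:

```go
package mirror

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// clusterWatcher is an illustrative stand-in for the service mirror's watcher.
type clusterWatcher struct {
	eventBroadcaster record.EventBroadcaster
	recorder         record.EventRecorder
}

func newClusterWatcher(client kubernetes.Interface) *clusterWatcher {
	// Each watcher instance gets its own broadcaster and recorder.
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
		Interface: client.CoreV1().Events(""),
	})
	return &clusterWatcher{
		eventBroadcaster: broadcaster,
		recorder: broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{
			Component: "linkerd-service-mirror", // illustrative component name
		}),
	}
}

// Stop releases the watcher's resources; shutting down the broadcaster stops
// the event-watching goroutines that otherwise pile up on every restart.
func (cw *clusterWatcher) Stop() {
	cw.eventBroadcaster.Shutdown()
}
```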

### mux.go, shared_informer.go and friends

The cluster watcher attaches event handlers to the informers for Services, Endpoints and Namespaces. Added `RemoveEventHandler()` calls for them in the cluster watcher's `Stop()` method.
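
Since client-go v0.26, `AddEventHandler` returns a registration handle that can later be passed to `RemoveEventHandler`. A hedged sketch of that pattern (struct and method names are illustrative; the real code does this for the Services, Endpoints and Namespaces informers):

```go
package mirror

import "k8s.io/client-go/tools/cache"

// informerHandlers is an illustrative stand-in for the cluster watcher's
// bookkeeping: it keeps the handles returned by AddEventHandler so they can
// be removed when the watcher is stopped.
type informerHandlers struct {
	informer      cache.SharedIndexInformer
	registrations []cache.ResourceEventHandlerRegistration
}

func (h *informerHandlers) addHandler(handler cache.ResourceEventHandler) error {
	reg, err := h.informer.AddEventHandler(handler)
	if err != nil {
		return err
	}
	h.registrations = append(h.registrations, reg)
	return nil
}

// Stop removes the handlers so the shared informer's processorListener
// goroutines (the shared_informer.go frames in the dump) don't accumulate
// across watcher restarts.
func (h *informerHandlers) Stop() error {
	for _, reg := range h.registrations {
		if err := h.informer.RemoveEventHandler(reg); err != nil {
			return err
		}
	}
	h.registrations = nil
	return nil
}
```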

## Result

After the fix, we can see that the goroutine count remains stable.

![image](https://user-images.githubusercontent.com/554287/235007727-3b3b6f31-901d-4c05-aedb-9fa3d3178ee1.png)

@alpeb requested a review from a team as a code owner on April 27, 2023 22:57
@adleong (Member) left a comment

Great find and fix! 🌮

@alpeb merged commit 1bc2fc2 into main on May 2, 2023
35 checks passed
@alpeb deleted the alpeb/svc-mirror-memleak branch on May 2, 2023 13:56
adleong pushed a commit that referenced this pull request Jun 20, 2023
alpeb added a commit that referenced this pull request Sep 6, 2023
mateiidavid pushed a commit that referenced this pull request Sep 8, 2023
* build(deps): NOT bump sigs.k8s.io/gateway-api from 0.5.1 to 0.6.0 (#10038)

* build(deps): bump sigs.k8s.io/gateway-api from 0.5.1 to 0.6.0

** NOTE **
This was cherry-picked from 62d6d7c in
order to acquire the `AddEventHandler` changes that went there. The
actual gateway-api bump was discarded.

Bumps [sigs.k8s.io/gateway-api](https://github.com/kubernetes-sigs/gateway-api) from 0.5.1 to 0.6.0.
- [Release notes](https://github.com/kubernetes-sigs/gateway-api/releases)
- [Changelog](https://github.com/kubernetes-sigs/gateway-api/blob/main/CHANGELOG.md)
- [Commits](kubernetes-sigs/gateway-api@v0.5.1...v0.6.0)

---
updated-dependencies:
- dependency-name: sigs.k8s.io/gateway-api
  dependency-type: direct:production
  update-type: version-update:semver-minor
...


* Account for possible errors returned from `AddEventHandler`

* client-go v0.26.0 removed the openstack plugin

* Fix memory leak in service mirror (#10833)


* Bump go to 1.19 in go.mod

---------

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: tomasz.ziolkowski <e.prace@gmail.com>
Signed-off-by: Alejandro Pedraza <alejandro@buoyant.io>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Oliver Gould <ver@buoyant.io>
Co-authored-by: ziollek <e.prace@gmail.com>
Successfully merging this pull request may close these issues.

Multicluster Service Mirror Mem Leak
3 participants