Memory leaks in nsmgr #675

Closed
ljkiraly opened this issue Apr 23, 2024 · 8 comments · Fixed by networkservicemesh/sdk#1616 · May be fixed by networkservicemesh/sdk#1617
Labels
bug Something isn't working

Comments

@ljkiraly
Contributor

ljkiraly commented Apr 23, 2024

Increased memory usage of the nsmgr pods can be detected.
In some environments the memory usage reaches its limit (400M) and leads to an OOMKill.

Reproduction:

With the following example I was able to reproduce a considerable memory increase by scaling out and scaling in the NSCs.

https://github.com/ljkiraly/deployments-k8s/blob/1ac366dcb8cbffe52cf5428fe9bb6aa79a9511bd/examples/use-cases/2registry3nseXnsc/README.md

The test ran on a kind cluster with 4 nodes, with NSM v1.13.0 deployed.

Sample output of the kubectl top command:

deployment.apps/nsc-kernel scaled
deployment.apps/nsc-kernel scaled
2024. 04. 23. 13:34:31 CEST
NAME          CPU(cores)   MEMORY(bytes)   
nsmgr-5s65v   102m         32Mi            
NAME          CPU(cores)   MEMORY(bytes)   
nsmgr-fms9q   16m          26Mi            
NAME          CPU(cores)   MEMORY(bytes)   
nsmgr-s7gz8   16m          27Mi            
NAME          CPU(cores)   MEMORY(bytes)   
nsmgr-vtd9q   33m          25Mi 

...

2024. 04. 23. 14:09:02 CEST
NAME          CPU(cores)   MEMORY(bytes)   
nsmgr-5s65v   391m         84Mi            
NAME          CPU(cores)   MEMORY(bytes)   
nsmgr-fms9q   36m          45Mi            
NAME          CPU(cores)   MEMORY(bytes)   
nsmgr-s7gz8   126m         65Mi            
NAME          CPU(cores)   MEMORY(bytes)   
nsmgr-vtd9q   145m         42Mi

@denis-tingaikin added the bug (Something isn't working) label on Apr 23, 2024
@szvincze

I tried the instructions that @ljkiraly attached above, but added scaling of NSEs too. From my test it seems that creating NSEs exhausts more of NSMgr's memory than scaling NSCs does.

@ljkiraly
Contributor Author

Description updated with version information. It might also be important that I tested on a kind cluster with 4 nodes.

@ljkiraly changed the title from "Possible memory leak in nsmgr or excluded-prefixes" to "Possible memory leak in nsmgr" on Apr 25, 2024
@ljkiraly
Contributor Author

Another important detail: I also tested with an nsmgr pod without the exclude-prefixes container and the same behavior can be seen; the memory increase is still present. Edited the issue title accordingly.

@NikitaSkrynnik
Contributor

Hello! I think I managed to reproduce the leak. I tried several setups:

NSEs with CIDR 172.16.0.0/16

  • kind cluster with 4 nodes
    • Scaling only NSCs (no leak)
    • Scaling NSEs and NSCs (no leak)
    • Scaling NSEs and NSCs with different number of k8s-registries (no leak)
  • kind cluster with 1 node
    • Scaling only NSCs (no leak)
    • Scaling NSEs and NSCs (no leak)
    • Scaling NSEs and NSCs with different number of k8s-registries (no leak)

NSEs with CIDR 172.16.0.0/30

  • kind cluster with 1 node
    • Scaling only NSCs (leak)
    • Scaling NSEs and NSCs (leak)

It looks like we have a leak when there are not enough NSEs for all NSCs. After scaling NSEs and NSCs 10 times, nsmgr consumes 116M of memory. After several hours it still consumes the same amount even though NSCs and NSEs were scaled to zero.

Profiles

goroutines.pdf
memory.pdf
block.pdf
mutex.pdf
threadcreate.pdf

The profiles don't show any leaks. The memory profile reports only about 4.5M of memory used by nsmgr. The number of goroutines is also reasonable: nsmgr usually has about 50 goroutines running when there are no clients and endpoints.
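
For reference, the profiles listed above (goroutine, heap, block, mutex, threadcreate) can be served directly from a Go process via net/http/pprof. A minimal sketch follows, assuming a debug build where an extra localhost listener is acceptable; this is not necessarily how the PDFs above were produced.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"runtime"
)

func main() {
	// block and mutex profiles stay empty unless sampling is enabled
	runtime.SetBlockProfileRate(1)
	runtime.SetMutexProfileFraction(1)

	go func() {
		// goroutine, heap, block, mutex and threadcreate profiles are
		// then available under http://localhost:6060/debug/pprof/
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the real nsmgr main loop
}

With this in place, go tool pprof -pdf http://localhost:6060/debug/pprof/heap > memory.pdf (graphviz required) renders a PDF like the ones attached above.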

Maybe there is a problem with the logs. Trying to check it now.

@szvincze

Hi Nikita,

Thanks for sharing your results.
Does "no leak" mean that the memory consumption of the nsmgr container goes back to the original value when you scale NSCs and/or NSEs to zero?
In my tests nsmgr consumes 14-15M of memory before I deploy the endpoint and client. The nsmgr container on the node where the NSE runs starts consuming 21-23M; on the other node, where the NSC runs, nsmgr consumes around 20M.
When I scale the NSE and NSC down to 0, the first nsmgr (nsmgr-kind-worker2) still shows 20M and the second (nsmgr-kind-worker) 17M, and it does not really change over time.

So I cannot really reproduce a situation where the memory consumption goes back down to, or at least near, the original level.

@szvincze

szvincze commented May 4, 2024

Hi,

I created a heap profile during a long-running test. After 10 hours this is the memory situation in nsmgr (see the attached screenshots overnight-test1 and overnight-test2).

I used tinden/cmd-nsmgr:v1.13.0-fix.5 and tinden/cmd-forwarder-vpp:v1.13.0-fix.5 images.
Right now one nsmgr uses 108M, the other 78M. The increase is much slower than before. It seems the runtime uses less than 30M in both cases.

It seems that metrics-server also counts when we are talking about the memory increase.

I haven't monitored the registry-k8s pod, but it was OOMKilled a few hours ago.

@szvincze

szvincze commented May 6, 2024

Here I add three heap profiles I created during my tests. The first one is from an idle state, the second from shortly after the scaling of NSCs and NSEs started, and the third from a later phase of the scaling.

@ljkiraly
Contributor Author

ljkiraly commented May 7, 2024

Hi @denis-tingaikin,

I created an nsmgr image based on the following commits:
eefee38ab907156eafc3d7f2a69552c4779af393 - tmp disable connectionmonitor authroization
7202075a97e5bf1874afec4268fb38d9f063f199 - fix linter
ce37208c0b9ea9bee8bf9b0cfe68752445288f86 - fix mem leak in authorize

ccf42a564dce826dd7e0b5647393c70037643447 - fix memory leaks

  • from PRs 1616 and 1617.
  • Contains a gRPC uplift to google.golang.org/grpc v1.63.2 based on github.com/szvincze/grpcfd v1.0.0
  • Also added code to produce a memory profile every hour (a minimal sketch of such an hourly profiler follows below).
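
For reference, a minimal sketch of such an hourly heap-profile dump is below. The function name, file naming and output directory are assumptions; the actual code in the test image may differ.

package main

import (
	"fmt"
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// startHourlyMemProfile writes one heap profile per hour into dir.
func startHourlyMemProfile(dir string) {
	go func() {
		for i := 0; ; i++ {
			time.Sleep(time.Hour)
			f, err := os.Create(fmt.Sprintf("%s/mem-%03d.pprof", dir, i))
			if err != nil {
				log.Printf("memprofile: %v", err)
				continue
			}
			runtime.GC() // collect garbage so the dump reflects live objects only
			if err := pprof.WriteHeapProfile(f); err != nil {
				log.Printf("memprofile: %v", err)
			}
			f.Close()
		}
	}()
}

func main() {
	startHourlyMemProfile(os.TempDir()) // output directory is an assumption
	select {}                           // stand-in for the real nsmgr main loop
}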

I asked for it to be tested in a customer-like environment with more than 80 endpoints and traffic running.
The result was better than before (with NSM v1.13).

Still, there was a memory increase, especially in one of the nsmgr containers:

nsmgr-lr96t-n5
==============
After install: 13.3 MB
UTC 11:38: 93.9MB
UTC 23:38: 97.2MB
UTC 03:38 (after traffic test): 109MB
UTC 05:38 (after uninstalling the application using NSM): 84.5 MB

Find the collected memprofiles attached.

nsmgr-lr96t-n5.tar.gz

As you can see in the slice from May 7, 2:38am (CEST), the profiling tool shows that the memory used by nsmgr was 38428.29kB (~40MB). It is strange that kubelet's metrics server was showing a higher RSS at that time (around 90MB-100MB).

File: nsmgr
Type: inuse_space
Time: May 7, 2024 at 2:38am (CEST)
Showing nodes accounting for 38428.29kB, 100% of 38428.29kB total
      flat  flat%   sum%        cum   cum%
 7922.50kB 20.62% 20.62%  7922.50kB 20.62%  bufio.NewReaderSize (inline)
 6866.17kB 17.87% 38.48%  6866.17kB 17.87%  google.golang.org/grpc/internal/transport.newBufWriter (inline)
 3076.16kB  8.00% 46.49%  3076.16kB  8.00%  fmt.Sprintf
 3073.31kB  8.00% 54.49%  3073.31kB  8.00%  runtime.malg
 1538.03kB  4.00% 58.49%  1538.03kB  4.00%  bytes.growSlice
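
A gap between the ~40MB inuse_space reported by pprof and the ~90-100MB RSS reported by kubelet is not unusual by itself: the heap profile only counts live heap objects, while RSS also includes heap pages the Go runtime holds but has not returned to the OS, goroutine stacks and other runtime overhead. One way to narrow this down would be to log a few runtime.MemStats fields next to each heap dump; a minimal sketch (function name and interval are assumptions):

package main

import (
	"log"
	"runtime"
	"time"
)

// logMemStats periodically prints where the process memory sits from the
// Go runtime's point of view, to compare against the kubelet RSS numbers.
func logMemStats() {
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("heapInuse=%dMiB heapIdleNotReleased=%dMiB stacks=%dMiB totalFromOS=%dMiB",
			m.HeapInuse>>20, (m.HeapIdle-m.HeapReleased)>>20, m.StackSys>>20, m.Sys>>20)
		time.Sleep(time.Minute)
	}
}

func main() {
	logMemStats() // in nsmgr this would run in a background goroutine
}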

Hope that helps.
