TAPA mem leak during periodic Stream Close-Open #500

Closed
zolug opened this issue Feb 13, 2024 · 2 comments
Labels
component/TAPA kind/bug Something isn't working

Comments


zolug commented Feb 13, 2024

Describe the bug
Consider the following resource settings for TAPA container:

    resources:
      limits:
        cpu:     64m
        memory:  48Mi
      requests:
        cpu:     32m
        memory:  24Mi

If the TAPA keeps closing and re-opening the single Stream it was connected to from the start, its memory footprint keeps increasing and eventually leads to an OOM kill.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy and set up Prometheus to monitor the memory usage of the example-target POD's TAPA container.
  2. Deploy 1 Trench with 1 Conduit, 1 Attractor, 1 Stream and 1 Flow.
  3. Deploy example-target with the resource settings above, and let it open the single Stream configured.
  4. Open a dashboard and pick an example-target POD instance to monitor its memory usage.
  5. With a simple script, open the Stream and close it 10 seconds later, in a loop, for the chosen example-target POD. E.g.:
while true; do
  date
  kc exec target-a-54df657d88-55h8m -n nvip -- ./target-client open -t trench-a -c load-balancer-a1 -s stream-a1
  sleep 10
  kc exec target-a-54df657d88-55h8m -n nvip -- ./target-client close -t trench-a -c load-balancer-a1 -s stream-a1
  echo ""
done

Expected behavior
The TAPA's memory usage should remain stable; it should not leak.

Context

  • Kubernetes: v1.26.6

  • Network Service Mesh: v1.12.0

  • Meridio: commit 7e2669c (HEAD -> master, origin/master, origin/HEAD)
    Author: Lugossy Zoltan zoltan.lugossy@est.tech
    Date: Tue Feb 6 15:40:29 2024 +0100

    metrics; fix collecting flow stats

Logs
NA

@zolug zolug added kind/bug Something isn't working component/TAPA labels Feb 13, 2024

zolug commented Feb 15, 2024

There seems to be a goroutine leak in interfacename.InterfaceNameCache:

  • ReleaseTrigger() usage is faulty: the started goroutine is not cancelled if the release gets aborted.
  • There is also no point in calling releaseTrigger() a second time in pendingRelease() if the interface name couldn't be cancelled immediately in Release().

This problem might arise if NSM is trying to heal a TAPA connection: the longer it takes to connect with a Proxy, the more goroutines are created. If I'm not mistaken, these goroutines should eventually exit (after 10 minutes or so), but they increase memory usage until then.
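
For illustration, here is a minimal Go sketch of the leak pattern described above. This is not the actual Meridio interfacename code; the names (leakyPendingRelease, pendingRelease) and the delay value are stand-ins only. The faulty variant starts a timer goroutine that cannot be stopped, while the fixed variant ties it to a cancellable context so an aborted release exits immediately:

    // Minimal sketch of the suspected leak pattern. Illustrative only: the
    // names below are stand-ins, not the actual Meridio interfacename code.
    package main

    import (
    	"context"
    	"fmt"
    	"time"
    )

    const releaseDelay = 2 * time.Second // stands in for the real (~10 min) trigger delay

    // leakyPendingRelease mirrors the faulty usage: once started, the goroutine
    // sleeps for the whole delay and cannot be cancelled, so an aborted release
    // still keeps it (and everything it references) alive until the timer fires.
    func leakyPendingRelease(release func()) {
    	go func() {
    		time.Sleep(releaseDelay)
    		release()
    	}()
    }

    // pendingRelease ties the goroutine to a context; calling the returned
    // CancelFunc aborts the pending release and lets the goroutine exit at once.
    func pendingRelease(release func()) context.CancelFunc {
    	ctx, cancel := context.WithCancel(context.Background())
    	go func() {
    		timer := time.NewTimer(releaseDelay)
    		defer timer.Stop()
    		select {
    		case <-ctx.Done():
    			return // release aborted: no lingering goroutine
    		case <-timer.C:
    			release()
    		}
    	}()
    	return cancel
    }

    func main() {
    	abort := pendingRelease(func() { fmt.Println("interface name released") })
    	abort() // simulate the release being aborted; the goroutine exits immediately
    	time.Sleep(100 * time.Millisecond)

    	leakyPendingRelease(func() {}) // this one would linger for the full delay
    }

The fix is essentially to keep the CancelFunc with the cache entry and call it whenever the pending release gets aborted.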

However, there is another, independent issue that leads to increased memory usage when the Conduit connection is periodically closed and re-opened, as described in the bug description.


zolug commented Feb 16, 2024

Another leak appears to be related to the recurring connectNSPService() call, which keeps creating new sources via workloadapi.NewX509Source() through credentials.GetClient(), while the new source never gets released. The meridio/pkg/security/credentials package is basically faulty and requires a redesign.
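
To illustrate the suspected pattern, here is a sketch only, assuming credentials.GetClient wraps go-spiffe's workloadapi.NewX509Source; the function names below are hypothetical, not Meridio's actual API:

    // Sketch of the suspected pattern; not the actual
    // meridio/pkg/security/credentials code. Assumes GetClient wraps go-spiffe's
    // workloadapi.NewX509Source; function names here are hypothetical.
    package credentialsketch

    import (
    	"context"

    	"github.com/spiffe/go-spiffe/v2/spiffegrpc/grpccredentials"
    	"github.com/spiffe/go-spiffe/v2/spiffetls/tlsconfig"
    	"github.com/spiffe/go-spiffe/v2/workloadapi"
    	grpccreds "google.golang.org/grpc/credentials"
    )

    // leakyGetClient mirrors the reported issue: a fresh X509Source is created on
    // every call and source.Close() is never invoked, so each reconnect leaks the
    // source's Workload API stream and watcher goroutines.
    func leakyGetClient(ctx context.Context) (grpccreds.TransportCredentials, error) {
    	source, err := workloadapi.NewX509Source(ctx)
    	if err != nil {
    		return nil, err
    	}
    	// Missing: source.Close() when the connection is torn down.
    	return grpccredentials.MTLSClientCredentials(source, source, tlsconfig.AuthorizeAny()), nil
    }

    // getClient hands the source's Close back to the caller so the watcher can be
    // released when the NSP connection is closed.
    func getClient(ctx context.Context) (grpccreds.TransportCredentials, func() error, error) {
    	source, err := workloadapi.NewX509Source(ctx)
    	if err != nil {
    		return nil, nil, err
    	}
    	creds := grpccredentials.MTLSClientCredentials(source, source, tlsconfig.AuthorizeAny())
    	return creds, source.Close, nil
    }

Alternatively, the source could be created once and reused across reconnects instead of being re-created in every connectNSPService() attempt.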

Note:
The Trench is cleaned up upon Stream close once all Conduits in the Trench are disconnected; it then gets re-created on a subsequent Stream open.

TODO: It would be worth tracking down all occurrences of workloadapi.NewX509Source() and checking them for possible resource leakage.
