TAPA mem leak during periodic Stream Close-Open #500

Closed
zolug opened this issue Feb 13, 2024 · 2 comments
Labels
component/TAPA kind/bug Something isn't working

Comments


zolug commented Feb 13, 2024

Describe the bug
Consider the following resource settings for TAPA container:

    resources:
      limits:
        cpu:     64m
        memory:  48Mi
      requests:
        cpu:     32m
        memory:  24Mi

If the TAPA keeps closing and re-opening the single Stream it was connected to from the start, its memory footprint keeps increasing and eventually leads to an OOM kill.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy and set up Prometheus to monitor the memory usage of the example-target POD's TAPA container.
  2. Deploy 1 Trench with 1 Conduit, 1 Attractor, 1 Stream and 1 Flow.
  3. Deploy example-target with the resource settings above, and let it open the single Stream configured.
  4. Open a dashboard and pick an example-target POD instance to monitor its memory usage.
  5. With a simple script, open the Stream and close it 10 seconds later, in a loop, for the chosen example-target POD. E.g.:
while true; do
  date
  kc exec target-a-54df657d88-55h8m -n nvip -- ./target-client open -t trench-a -c load-balancer-a1 -s stream-a1
  sleep 10
  kc exec target-a-54df657d88-55h8m -n nvip -- ./target-client close -t trench-a -c load-balancer-a1 -s stream-a1
  echo ""
done

Expected behavior
The TAPA's memory usage should remain stable; it should not leak.

Context

  • Kubernetes: v1.26.6

  • Network Service Mesh: v1.12.0

  • Meridio: commit 7e2669c (HEAD -> master, origin/master, origin/HEAD)
    Author: Lugossy Zoltan zoltan.lugossy@est.tech
    Date: Tue Feb 6 15:40:29 2024 +0100

    metrics; fix collecting flow stats

Logs
NA

@zolug zolug added kind/bug Something isn't working component/TAPA labels Feb 13, 2024

zolug commented Feb 15, 2024

There seems to be a goroutine leak in interfacename.InterfaceNameCache:

  • ReleaseTrigger() usage is faulty: the started goroutine is not cancelled if the release gets aborted.
  • There is also no point in calling releaseTrigger() a second time in pendingRelease() if the interface name couldn't be cancelled immediately in Release().

This problem might arise if NSM is trying to heal a TAPA connection: the longer it takes to connect with a Proxy, the more goroutines are created. If I'm not mistaken, these goroutines should eventually exit (after 10 minutes or so), but they increase memory usage until then.
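
For illustration, here is a minimal Go sketch of the leak pattern described above. This is not the actual Meridio interfacename code; the names (leakyPendingRelease, pendingRelease) and the delay value are stand-ins only. The faulty variant starts a timer goroutine that cannot be stopped, while the fixed variant ties it to a cancellable context so an aborted release exits immediately:

    // Minimal sketch of the suspected leak pattern. Illustrative only: the
    // names below are stand-ins, not the actual Meridio interfacename code.
    package main

    import (
    	"context"
    	"fmt"
    	"time"
    )

    const releaseDelay = 2 * time.Second // stands in for the real (~10 min) trigger delay

    // leakyPendingRelease mirrors the faulty usage: once started, the goroutine
    // sleeps for the whole delay and cannot be cancelled, so an aborted release
    // still keeps it (and everything it references) alive until the timer fires.
    func leakyPendingRelease(release func()) {
    	go func() {
    		time.Sleep(releaseDelay)
    		release()
    	}()
    }

    // pendingRelease ties the goroutine to a context; calling the returned
    // CancelFunc aborts the pending release and lets the goroutine exit at once.
    func pendingRelease(release func()) context.CancelFunc {
    	ctx, cancel := context.WithCancel(context.Background())
    	go func() {
    		timer := time.NewTimer(releaseDelay)
    		defer timer.Stop()
    		select {
    		case <-ctx.Done():
    			return // release aborted: no lingering goroutine
    		case <-timer.C:
    			release()
    		}
    	}()
    	return cancel
    }

    func main() {
    	abort := pendingRelease(func() { fmt.Println("interface name released") })
    	abort() // simulate the release being aborted; the goroutine exits immediately
    	time.Sleep(100 * time.Millisecond)

    	leakyPendingRelease(func() {}) // this one would linger for the full delay
    }

The fix is essentially to keep the CancelFunc with the cache entry and call it whenever the pending release gets aborted.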

However, there is another, independent issue that leads to increased memory usage when the Conduit connection is periodically closed and re-opened, as described in the bug description.


zolug commented Feb 16, 2024

Another leak appears to be related to the recurring connectNSPService() call, which keeps creating new sources via workloadapi.NewX509Source() through credentials.GetClient(), while the new source never gets released. The meridio/pkg/security/credentials package is basically faulty and requires a redesign.
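
To illustrate the suspected pattern, here is a sketch only, assuming credentials.GetClient wraps go-spiffe's workloadapi.NewX509Source; the function names below are hypothetical, not Meridio's actual API:

    // Sketch of the suspected pattern; not the actual
    // meridio/pkg/security/credentials code. Assumes GetClient wraps go-spiffe's
    // workloadapi.NewX509Source; function names here are hypothetical.
    package credentialsketch

    import (
    	"context"

    	"github.com/spiffe/go-spiffe/v2/spiffegrpc/grpccredentials"
    	"github.com/spiffe/go-spiffe/v2/spiffetls/tlsconfig"
    	"github.com/spiffe/go-spiffe/v2/workloadapi"
    	grpccreds "google.golang.org/grpc/credentials"
    )

    // leakyGetClient mirrors the reported issue: a fresh X509Source is created on
    // every call and source.Close() is never invoked, so each reconnect leaks the
    // source's Workload API stream and watcher goroutines.
    func leakyGetClient(ctx context.Context) (grpccreds.TransportCredentials, error) {
    	source, err := workloadapi.NewX509Source(ctx)
    	if err != nil {
    		return nil, err
    	}
    	// Missing: source.Close() when the connection is torn down.
    	return grpccredentials.MTLSClientCredentials(source, source, tlsconfig.AuthorizeAny()), nil
    }

    // getClient hands the source's Close back to the caller so the watcher can be
    // released when the NSP connection is closed.
    func getClient(ctx context.Context) (grpccreds.TransportCredentials, func() error, error) {
    	source, err := workloadapi.NewX509Source(ctx)
    	if err != nil {
    		return nil, nil, err
    	}
    	creds := grpccredentials.MTLSClientCredentials(source, source, tlsconfig.AuthorizeAny())
    	return creds, source.Close, nil
    }

Alternatively, the source could be created once and reused across reconnects instead of being re-created in every connectNSPService() attempt.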

Note:
The Trench is cleaned up upon Stream close once all Conduits in the Trench are disconnected; it then gets re-created on a subsequent Stream open.

TODO: It would be worth tracking down all occurrences of workloadapi.NewX509Source() and checking them for possible resource leakage.
