
[proposal] Koordlet frequently restarts due to high memory usage when CPI/PSI monitoring is enabled #1046

Closed
Re-Grh opened this issue Feb 22, 2023 · 7 comments
Labels: area/koordlet, kind/proposal, lifecycle/stale

Comments

Re-Grh (Contributor) commented Feb 22, 2023

What is your proposal:
When the PSI and CPI features are enabled, the koordlet container's memory usage hovers around 240 MB, which readily triggers the 256 MB limit and causes OOM kills and pod restarts, as shown in the figure below.

[image: koordlet container memory usage near the 256 MB limit]
Why is this needed:
To improve the stability of the koordlet and reduce the overhead of metric collection.

Is there a suggested solution, if so, please add it:
One option is to use Golang's memory profiling tools to locate the allocations, and to increase the frequency of garbage collection.
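
A minimal sketch of both suggestions (illustrative only, not koordlet's actual code; the listen address is an arbitrary choice): expose Go's pprof handlers so the heap can be profiled, and raise GC frequency with debug.SetGCPercent.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
	"runtime/debug"
)

func main() {
	// GOGC=100 is the default: GC runs when the heap doubles over the live
	// set. Lowering it to 50 collects more often, trading CPU for memory.
	debug.SetGCPercent(50)

	// With this endpoint up, `go tool pprof http://localhost:6060/debug/pprof/heap`
	// shows where Go-side memory is allocated.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the agent's real work
}
```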

Re-Grh added the kind/proposal label on Feb 22, 2023
Re-Grh (Contributor, Author) commented Feb 22, 2023

/area koordlet

maaoBit commented Mar 7, 2023

Enabled CPI and PSI, and tested in a one-node cluster:

 ~ kubectl get node -o wide
NAME       STATUS   ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
maao-dev   Ready    control-plane   11d   v1.24.2   192.168.2.2   <none>        Ubuntu 20.04.4 LTS   5.15.0-60-generic   containerd://1.6.18
 ~ kubectl get po -A | wc -l
104
~ kubectl get po -n koordinator-system koordlet-vq6bg -o yaml | grep CPI
    - -feature-gates=BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true,PSICollector=true,CPICollector=true

[image: Pasted image 20230306110542]

The container used about 260 MB of memory, but go_memstats_alloc_bytes stayed below 50 MB. It seems that most of the memory is used by sqlite3 (cgo). As per this, reducing GOGC does not reduce memory usage much.
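
For reference, a rough sketch (assuming Linux with /proc mounted; not part of koordlet) of how to confirm that the gap lies outside the Go heap: compare runtime.MemStats against the process RSS. cgo allocations such as sqlite3's are invisible to the Go runtime, so a large RSS next to a small HeapAlloc points at the native side.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strings"
)

// vmRSSKB reads the resident set size (in KiB) from /proc/self/status.
func vmRSSKB() (int64, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		if strings.HasPrefix(s.Text(), "VmRSS:") {
			var kb int64
			fmt.Sscanf(strings.TrimPrefix(s.Text(), "VmRSS:"), "%d", &kb)
			return kb, nil
		}
	}
	return 0, fmt.Errorf("VmRSS not found")
}

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	rss, err := vmRSSKB()
	if err != nil {
		panic(err)
	}
	// HeapAlloc is what go_memstats_alloc_bytes exports; the difference
	// between RSS and the Go runtime's footprint approximates cgo usage.
	fmt.Printf("Go HeapAlloc: %d MiB, process RSS: %d MiB\n",
		m.HeapAlloc>>20, rss>>10)
}
```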

When metric-expire-seconds is set to 900 (the default is 1800), container memory usage drops to around 160 MB.
[image: Pasted image 20230306164704]

Does modifying metric-expire-seconds affect other modules that use metric_cache? Also, does sqlite3 have memory leaks? There seem to be related issue reports.

Re-Grh (Contributor, Author) commented Mar 7, 2023

Thank you for your reply. The Go-side memory usage after enabling the feature is not large, so the OOM issue may be related to the memory used by cgo's sqlite3. Although OOMs are rare in other koordlet components, replacing the database with a TSDB is indeed under consideration, and with TSDB in use the memory issue reported here could be resolved. The issue for the earlier metric_cache refactoring plan is #586.

songtao98 (Contributor) commented


@maaoBit Thanks for your contribution in testing the possible memory leak in the CPI and PSI collectors! The observation about sqlite3 is quite useful, and we will pay more attention to it, for example by replacing it with a TSDB.

However, when this problem first occurred, I wondered whether it was an acceptable and reasonable memory-usage increase caused simply by the new collectors. So I ran some observations to see whether setting a higher memory limit would solve the OOM problem. In fact, over a longer period (3-4 days), the memory usage stays stable for a few hours, then starts to increase, and the cycle repeats. This really surprised me.
[image: koordlet memory usage over 3-4 days]

Could you keep observing for a longer period to see whether the same phenomenon occurs? I hope my evaluation method is wrong and there are no other problems in the source code that lead to a memory leak.
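
If it helps, a minimal sketch (purely illustrative, not koordlet code; the 10-minute interval is arbitrary) of automating that long-term observation: periodically log runtime.MemStats, so a flat Go heap sitting next to a climbing container RSS becomes visible in the logs.

```go
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	ticker := time.NewTicker(10 * time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		// If heapAlloc stays flat over days while the container RSS keeps
		// growing, the leak is on the cgo/native side, not in Go objects.
		log.Printf("heapAlloc=%dMiB heapSys=%dMiB numGC=%d",
			m.HeapAlloc>>20, m.HeapSys>>20, m.NumGC)
	}
}
```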

maaoBit commented Mar 7, 2023

OK, I'll keep watching for a few days.

stale bot commented Jun 7, 2023

This issue has been automatically marked as stale because it has not had recent activity.
This bot triages issues and PRs according to the following rules:

• After 90d of inactivity, lifecycle/stale is applied
• After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

• Mark this issue or PR as fresh with /remove-lifecycle stale
• Close this issue or PR with /close

Thank you for your contributions.

stale bot added the lifecycle/stale label on Jun 7, 2023
stale bot commented Jul 9, 2023

This issue has been automatically closed because it has not had recent activity.
This bot triages issues and PRs according to the following rules:

• After 90d of inactivity, lifecycle/stale is applied
• After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

• Reopen this PR with /reopen

Thank you for your contributions.

stale bot closed this as completed on Jul 9, 2023