[BUG] Longhorn manager pods in 1.5.1 consuming 20GB+ RAM and 3-4 vCPUs #6866
Comments
Hello @penoux, before diving into the issue, we need to clarify some questions.
Can you send us a support bundle to longhorn-support-bundle@suse.com for further analysis? Thank you. cc @DamiaSan (community coordinator), @shuo-wu @PhanLe1010 @ejweber @james-munson
Hello,
As you can see, almost all longhorn-manager pods use a lot of RAM/CPU, but only a few (2 or 3) instance-manager pods consume more RAM than average.
For the instance-manager pods using high memory or CPU, can you check whether they hit the known issues? A support bundle or longhorn-manager pod logs would be appreciated for debugging. Thanks.
The support bundle is coming, but it is taking a very long time to generate.
We don't hit any known issue on longhorn-manager specifically. Our issue is not on instance-manager, by the way. Support bundle generation was launched more than 4 hours ago and is still running; it has probably failed, no? The logs of the longhorn-manager pods are filled with memory-pressure messages from all nodes...
OK. Since you have already checked the instance-manager pods, let's skip the instance-manager pods with high memory and CPU usage.
Got it. Then, generating a support bundle might not be a good idea. You can provide the log of one longhorn-manager first and we'll see if we can find something in it.
Logs of a longhorn-manager pod: https://gist.github.com/penoux/cc3cb97c1c4d02c164aa6d442d9ec96c
I don't see any suspicious messages in the log file.
It is related to the metrics collection and should not introduce a memory leak. @PhanLe1010 @ejweber @shuo-wu @innobead, do you have any ideas for investigation?
Support bundle coming (10 GB).
Hello, this is the support bundle: https://drive.google.com/file/d/1XKoALFSeN3cIIAjpAAlRhYIRWfu2UDVT/view?usp=drive_link
We have observed (on one node among others) a large set of NFS mount processes that keeps increasing over hours (much more than the total number of volumes, on a single node).
Can you provide the 'ps aux' output you observed? I want to check whether a backup or share-manager pod causes the mounts. Thanks.
Sure: https://gist.github.com/penoux/af2435a2f80d415dce397d6f7e423400
They are RWX volumes. It looks like the volumes are somehow unable to mount the NFS share exported by the NFS server in the share-manager pods. Are the share-manager pods for those volumes still alive?
All share-manager pods are running.
We still don't understand why we have hundreds of mount processes per PVC...
I find that the CSI plugin pod is consuming 20 GB on this node, and longhorn-manager 7 GB.
The longhorn-csi-plugin seems to have a problem too. Reading the next message, could it be an NFS issue as well?
ps -efT | grep longhorn-manager | grep csi | wc -l
There are some findings:
Can you show
Nothing.
Then, I think the symptom described in #6866 (comment) is not related to the NFS mount issue.
After killing one thread (in #6866 (comment)), all other threads die and disappear, and the load average comes back to normal.
Basic info on this cluster:
First, we can say that the longhorn-manager high resource usage is probably not caused by either the stuck NFS backup target connection or the flood of snapshot/backup resource reconciliations. Then, I checked some pod logs:
Does killing process
It will also prevent high CPU usage :)
Good catch! I suspected there were other resources/workloads leading to the longhorn-manager high memory usage, but I never thought that the Secret resources could be a culprit.
@derekbit do we need to backport to 1.4.x? |
Yeah, we can backport to v1.4.x. |
Yes, let's backport it. |
I've filed a PR, longhorn/longhorn-manager#2243, to fix the cache issue by filtering the Kubernetes resources. The PR is currently under review. If you want to test it in your LAB or NOPROD cluster, you can apply the customized longhorn-manager image.
A quick summary of the longhorn-manager high memory and CPU consumption. Two culprits were found in the investigation, carried out in cooperation with @shuo-wu @penoux @amapi (a code sketch illustrating the second culprit follows below):
1. Multiple clients and caches
2. Caching all Kubernetes resources in ALL namespaces
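To illustrate the second culprit, here is a minimal client-go sketch (not the actual longhorn-manager code, and not the content of PR longhorn/longhorn-manager#2243) contrasting an unfiltered informer cache, which keeps every object of a kind from all namespaces in memory, with one restricted to a single namespace and a label selector; the label value used here is an assumption.

```go
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a local kubeconfig; a real controller would use in-cluster config.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Unfiltered: every Secret in ALL namespaces is listed, watched, and cached,
	// so memory grows with the size of the whole cluster.
	unfiltered := informers.NewSharedInformerFactory(client, 30*time.Second)
	_ = unfiltered.Core().V1().Secrets().Informer()

	// Filtered: only Secrets in the longhorn-system namespace that carry a
	// given label end up in the local cache.
	filtered := informers.NewSharedInformerFactoryWithOptions(
		client,
		30*time.Second,
		informers.WithNamespace("longhorn-system"),
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = "app.kubernetes.io/part-of=longhorn" // assumed label
		}),
	)
	_ = filtered.Core().V1().Secrets().Informer()

	stop := make(chan struct{})
	defer close(stop)
	unfiltered.Start(stop)
	filtered.Start(stop)
	unfiltered.WaitForCacheSync(stop)
	filtered.WaitForCacheSync(stop)
}
```

Restricting the list/watch this way keeps only the relevant objects in the informer cache, so memory no longer scales with the total number of Secrets, ConfigMaps, Pods, etc. in the cluster.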
Nice improvement again :) Comparing before your last fix and after your new images (showing just the first pods, same for the others): 50% less memory usage. If you have other improvements like these, we will be very happy to test them :)
@derekbit @ChanYiLin @innobead @PhanLe1010 thanks a lot for those prompt fixes! By the way, I don't know whether they are the last optimizations to be considered for LH 1.5.2. When is the 1.5.2 release planned (so we can roll out the important fixes to PROD)? [EDIT]: the RAM consumption now seems directly linked to the size of the cluster (pods/nodes/volumes). That is why on NOPROD we still have 2.2 GB RAM (few nodes/pods but 1046 volumes, of which 1001 are detached)... Still a bit worried about the target RAM on PROD:
Other related questions I ask myself (I'm discovering LH internals through this issue), considering the LH software architecture and separation of concerns:
The two PRs in #6866 (comment) and longhorn/longhorn-manager#2242 will be included in the upcoming v1.5.2. Barring any accident, the release date will be at the beginning of this November.
Agreed. We are also discussing a similar optimization, but we need to check whether it is feasible in the current implementation.
It is related to #6936 (comment).
You are absolutely right. Let me check how to filter these resources. Update: filtering StorageClass, PV, PVC, and CSI driver resources is infeasible because the
Could you provide us with the number of resources (Secret, ConfigMap, Pod, PVC, PV, StorageClass, ...) in your NOPROD cluster? Thank you.
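Those counts can be pulled with `kubectl get <resource> --all-namespaces --no-headers | wc -l` for each type, or programmatically; below is an illustrative client-go sketch (not taken from the thread) that counts a few of the kinds mentioned above, assuming a local kubeconfig. These are exactly the objects an unfiltered cluster-wide cache has to hold.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	ctx := context.Background()
	opts := metav1.ListOptions{}

	// An empty namespace ("") lists across all namespaces.
	secrets, err := client.CoreV1().Secrets("").List(ctx, opts)
	if err != nil {
		panic(err)
	}
	configMaps, err := client.CoreV1().ConfigMaps("").List(ctx, opts)
	if err != nil {
		panic(err)
	}
	pods, err := client.CoreV1().Pods("").List(ctx, opts)
	if err != nil {
		panic(err)
	}
	pvs, err := client.CoreV1().PersistentVolumes().List(ctx, opts)
	if err != nil {
		panic(err)
	}

	fmt.Println("secrets:   ", len(secrets.Items))
	fmt.Println("configmaps:", len(configMaps.Items))
	fmt.Println("pods:      ", len(pods.Items))
	fmt.Println("pvs:       ", len(pvs.Items))
}
```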
Full status of the NOPROD, PROD, and LAB clusters: https://gist.github.com/amapi/383e55b7c85676f350b5c040e01ada65
@derekbit: figures from all our clusters (PROD, NOPROD, LAB), covering the Kube resources you cache in your code (the non-filtered ones), categorized as LH / non-LH for the volume-related ones. PROD:
NOPROD:
LAB:
@derekbit Hello, I have deployed your image on the NOPROD cluster and got the memory usage before and after. Again, there is less memory usage. It is a small amount this time, because there are not many DaemonSets and Deployments, so they take less space in RAM. Values before:
Values after:
@derekbit I'm sure that we will observe a bigger impact of this optimization on the PROD cluster, where the number of deployments/pods is huge compared to the new NOPROD cluster. So you can keep this.
Yes!!! Thanks a lot.
Hello, we have installed the official release 1.5.2 on our LAB and NOPROD clusters; so far, no issues. The rollout to the biggest PROD cluster is planned for Nov 13th. Hopefully we will not observe much more RAM consumption than in NOPROD (2.3 GB).
Closing this issue since 1.5.2 has been released. Feel free to reopen it if needed.
Describe the bug (🐛 if you encounter this issue)
Since the migration to Longhorn 1.5.1 (from 1.4.1), longhorn-manager pods have been consuming 20 GB+ of RAM (at least 3x more than before) on most of our worker nodes, and 3 to 4 vCPUs, making our whole production cluster unstable.
To Reproduce
Create a cluster:
Migrate to Longhorn 1.5.1
Expected behavior
longhorn-manager pods should consume much less RAM (expected: less than 5 GB max, ideally under 1 GB).
For vCPUs, they should not consume more than 1 vCPU max.
Support bundle for troubleshooting
Environment
Additional context
We migrated from 1.4.1 to 1.5.1 three weeks ago, with lots of difficulties: longhorn-manager pods were constantly evicted during the migration due to:
After setting a priorityClass and 10 GB requests on the longhorn-manager DaemonSet, we mitigated the issues and completed the migration.
However, since then we have observed huge RAM consumption, over 20 GB on many of our nodes, leading to:
Also, we sometimes observe longhorn-manager pods consuming 3 to 4 vCPUs (50% of the nodes' CPU capacity), which is completely abnormal too, but not the main blocking point at the moment.
RCA
#6866 (comment)