[BUG] share-manager-pvc appears to be leaking memory #8394
Not sure, but I think the big spikes may correspond to attaching the volume to a different pod, given the timings.
supportbundle_5c8ad69a-4ef9-47a4-8c8c-6aaa06a38849_2024-04-19T06-15-54Z.zip
From running top in the share-manager-pvc pods, it looks like ganesha.nfsd is at fault (nfs-ganesha/nfs-ganesha#1105?).
Can you log in to the
@PhanLe1010 Have you ever observed the memory leak in the RWX performance investigation?
```
share-manager-pvc-0cb784d8-46f9-42e1-9476-5918989ec94f:/ # ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 1237096 13200 ? Ssl Apr14 2:23 /longhorn-share-manager --debug daemon --volume pvc-0cb784d8-46f9-42e1-9476-591
root 38 1.6 1.6 5015480 2145440 ? Sl Apr14 113:57 ganesha.nfsd -F -p /var/run/ganesha.pid -f /tmp/vfs.conf
root 610865 0.8 0.0 6956 4268 pts/0 Ss 07:30 0:00 bash
root 610893 133 0.0 13748 4048 pts/0 R+ 07:30 0:00 ps aux
share-manager-pvc-0cb784d8-46f9-42e1-9476-5918989ec94f:/ # cat /proc/38/status
Name: ganesha.nfsd
Umask: 0000
State: S (sleeping)
Tgid: 38
Ngid: 0
Pid: 38
PPid: 1
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 64
Groups: 0
NStgid: 38
NSpid: 38
NSpgid: 1
NSsid: 1
VmPeak: 5083072 kB
VmSize: 5015480 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 2145444 kB
VmRSS: 2145444 kB
RssAnon: 2134608 kB
RssFile: 10836 kB
RssShmem: 0 kB
VmData: 2303600 kB
VmStk: 136 kB
VmExe: 16 kB
VmLib: 18704 kB
VmPTE: 4784 kB
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 0
THP_enabled: 1
Threads: 20
SigQ: 1/514185
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000005001
SigIgn: 0000000000000000
SigCgt: 0000000180000000
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Seccomp_filters: 0
Speculation_Store_Bypass: thread vulnerable
SpeculationIndirectBranch: conditional enabled
Cpus_allowed: fff
Cpus_allowed_list: 0-11
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 56
nonvoluntary_ctxt_switches: 109
share-manager-pvc-0cb784d8-46f9-42e1-9476-5918989ec94f:/ # cat /proc/1/status
Name: longhorn-share-
Umask: 0022
State: S (sleeping)
Tgid: 1
Ngid: 0
Pid: 1
PPid: 0
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 64
Groups: 0
NStgid: 1
NSpid: 1
NSpgid: 1
NSsid: 1
VmPeak: 1237096 kB
VmSize: 1237096 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 15228 kB
VmRSS: 13200 kB
RssAnon: 4784 kB
RssFile: 8416 kB
RssShmem: 0 kB
VmData: 46880 kB
VmStk: 136 kB
VmExe: 4380 kB
VmLib: 8 kB
VmPTE: 120 kB
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 0
THP_enabled: 1
Threads: 18
SigQ: 1/514185
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: fffffffc3bba3a00
SigIgn: 0000000000000000
SigCgt: fffffffd7fc1feff
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Seccomp_filters: 0
Speculation_Store_Bypass: thread vulnerable
SpeculationIndirectBranch: conditional enabled
Cpus_allowed: fff
Cpus_allowed_list: 0-11
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 150
nonvoluntary_ctxt_switches: 59
share-manager-pvc-0cb784d8-46f9-42e1-9476-5918989ec94f:/ #
```
The VmRSS of the nfs-ganesha process is too high (2145444 kB). I did not observe such a high VmRSS value in my cluster, and the value went back down after an IO-intensive task. To verify whether it is caused by the upstream regression you mentioned, could you provide us with the steps to reproduce as well as the information
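For tracking this, a minimal sketch of a loop that samples the ganesha.nfsd VmRSS inside a share-manager pod (the pod name is the one from this thread; this assumes pidof is available in the share-manager image):

```bash
#!/bin/bash
# Sample the VmRSS of ganesha.nfsd inside a share-manager pod once a minute.
# The pod name below is the one from this thread; substitute your own.
POD="share-manager-pvc-0cb784d8-46f9-42e1-9476-5918989ec94f"
NAMESPACE="longhorn-system"

while true; do
  date -u +%Y-%m-%dT%H:%M:%SZ
  kubectl -n "$NAMESPACE" exec "$POD" -- \
    sh -c 'grep VmRSS "/proc/$(pidof ganesha.nfsd)/status"'
  sleep 60
done
```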
I'm not sure if it's caused by that regression; I was just looking for leaks upstream, given it's the NFS server using all the RAM. The workload is that maybe 10-15 times a day rsync is run over ~4000 files, and half a dozen files get copied to the NFS share (which in our case corresponds to releasing some code); then, a few hundred times a day, a few dozen of those files are read at a time. It's quite light, read-mostly use. By manifest, are you after a test workload?
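For what it's worth, a rough synthetic sketch of that workload (all paths, sizes, and counts below are illustrative assumptions, not taken from the report):

```bash
#!/bin/bash
# Rough approximation of the reported workload: periodic rsync of many small
# files onto the RWX mount, followed by read-mostly access.
# /mnt/rwx is a placeholder for wherever the Longhorn RWX volume is mounted.
SRC=/tmp/rwx-src
DST=/mnt/rwx/release

mkdir -p "$SRC" "$DST"
# Create ~4000 small files once, mirroring the "rsync on ~4000 files" pattern.
for i in $(seq 1 4000); do
  head -c 1024 /dev/urandom > "$SRC/file-$i"
done

while true; do
  rsync -a "$SRC/" "$DST/"              # the "release" step
  for i in $(seq 1 24); do              # read-mostly phase
    cat "$DST/file-$((RANDOM % 4000 + 1))" > /dev/null
  done
  sleep 300
done
```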
OK. I will try the steps in our lab. |
@derekbit I can't be sure, but v1.5.5 looks better; below is the memory usage with v1.6.1 jumping when the files are read by some nightly jobs. One thing I hadn't thought of is that those pods are all transient, and there are a few thousand of them, so this also corresponds to a lot of mounting and unmounting.
@blushingpenguin I can build a customized share-manager image with the fix for the memory leak. Can you help test whether it is the culprit in v1.6.1? What do you think?
@derekbit I'm happy to test a fixed build; however, I think it's probably worth giving v1.5.5 another 24h to confirm that the leak doesn't occur with our weekday usage pattern. I'll check it tomorrow and report back. Thanks, Mark
@blushingpenguin To replace the share-manager image, edit the longhorn-manager daemonset by
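A minimal sketch of that edit (the replacement image tag below is a hypothetical placeholder; use the tag provided for testing):

```bash
# Edit the longhorn-manager daemonset and point its --share-manager-image
# argument at the custom build (the tag below is a hypothetical placeholder):
kubectl -n longhorn-system edit daemonset longhorn-manager
# then change the container args, e.g.:
#   - --share-manager-image
#   - derekbit/longhorn-share-manager:fix-8394   # hypothetical tag
```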
Many thanks. |
@derekbit yes, it looks like that was the major cause. |
Thanks @blushingpenguin for the quick update. |
Pre Ready-For-Testing Checklist
longhorn/nfs-ganesha#13
This is a very, very, very bad situation! Will we have a release like... today?! In my cluster with just 5 nodes and 12 disks, Longhorn is using more than 100 GB of RAM across all nodes!!! The response time here will show us whether this is a reliable storage system; how can a critical bug like this happen in a release marked as stable?! Please, just fix it!
Can you elaborate more on your use case? I've provided a share-manager image to mitigate the issue before the v1.6.2 release. Please see #8394 (comment).
@brunosmartin It's upstream that's at fault here, and it's possibly related to usage patterns (I was seeing much higher RAM usage with volumes that were being mounted/unmounted a lot). This is a problem of modern software development, I suppose -- pretty much all software is built out of other components these days, and validating that they all work together is a pretty difficult task. I actually think the response has been excellent here; @derekbit was on the case immediately. You can just patch your deployment as outlined in #8394 (comment) until the next release -- this has mostly fixed the problem for me (it still looks a bit leaky, but nothing like before).
Maybe this is underestimated. My cluster doesn't make "tons of unmount and mount operations", but see the images above for how it behaved in the face of this bug: we upgraded on 04/23 from 1.5.4 to 1.6.1, and the memory leak clearly consumed lots of RAM. Then we applied this fix on 04/26, and it's stable now. Every peak in this image was a very painful, huge outage! Although we don't make "tons of unmount and mount operations", we have about five workloads where 3 pods use the same volume; maybe this is related somehow. I repeat: this critical bug is underestimated. I don't have any special use case, just a normal (small) Kubernetes cluster with Rancher, and it made my company's services burn for a week; it's very likely there are tons of users affected, and after one week we don't even have an RC build. @blushingpenguin, for modern software development problems we have modern management methods to deal with them. My point is that this critical bug shows it was a mistake to call version 1.6.1 stable. I also work on some open source projects, and my intention here is to help this project become more stable and reliable, as a storage system must be. Please tell me if I can provide any additional information on this issue.
@brunosmartin The high memory usage in your cluster sounds like it's triggered by the same nfs-ganesha bug.
@PhanLe1010 Just to clarify, I have 5 deployments with scale > 1, but I have many more RWX PVCs; I think about 50 RWX PVCs, most of them deployed with scale = 1.
Verified on master-head 20240507
The test steps:
deployment_rwx_test.sh:
```bash
#!/bin/bash

# Define the deployment name
DEPLOYMENT_NAME="rwx-test"
KUBECONFIG_PATH="/home/ryao/Desktop/note/longhorn-tool/ryao-161.yaml"

for ((i=1; i<=100; i++)); do
  # Scale the deployment up to 3 replicas
  kubectl --kubeconfig=$KUBECONFIG_PATH scale deployment $DEPLOYMENT_NAME --replicas=3

  # Wait for the deployment to have 3 ready replicas
  until [[ "$(kubectl --kubeconfig=$KUBECONFIG_PATH get deployment $DEPLOYMENT_NAME -o=jsonpath='{.status.readyReplicas}')" == "3" ]]; do
    ready_replicas=$(kubectl --kubeconfig=$KUBECONFIG_PATH get deployment $DEPLOYMENT_NAME -o=jsonpath='{.status.readyReplicas}')
    echo "Iteration #$i: $DEPLOYMENT_NAME has $ready_replicas ready replicas"
    sleep 1
  done

  # Check that all pods are in the "Running" state
  while [[ $(kubectl --kubeconfig=$KUBECONFIG_PATH get pods -l=app=$DEPLOYMENT_NAME -o=jsonpath='{.items[*].status.phase}') != "Running Running Running" ]]; do
    echo "Not all pods are in the 'Running' state. Waiting..."
    sleep 5
  done

  # Scale the deployment down to 1 replica
  kubectl --kubeconfig=$KUBECONFIG_PATH scale deployment $DEPLOYMENT_NAME --replicas=1

  # Wait for the deployment to have 1 ready replica
  until [[ "$(kubectl --kubeconfig=$KUBECONFIG_PATH get deployment $DEPLOYMENT_NAME -o=jsonpath='{.status.readyReplicas}')" == "1" ]]; do
    ready_replicas=$(kubectl --kubeconfig=$KUBECONFIG_PATH get deployment $DEPLOYMENT_NAME -o=jsonpath='{.status.readyReplicas}')
    echo "Iteration #$i: $DEPLOYMENT_NAME has $ready_replicas ready replicas"
    sleep 1
  done

  # Check that the remaining pod is in the "Running" state
  while [[ $(kubectl --kubeconfig=$KUBECONFIG_PATH get pods -l=app=$DEPLOYMENT_NAME -o=jsonpath='{.items[*].status.phase}') != "Running" ]]; do
    echo "Not all pods are in the 'Running' state. Waiting..."
    sleep 5
  done
done
```
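Note: the script assumes a deployment named rwx-test backed by a Longhorn RWX volume already exists, and the kubeconfig path must be adjusted for the local environment.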
Result: Passed
For the
When validating this issue, there is one thing to note: I originally followed the test plan steps below, but I couldn't reproduce the issue using that method.
The test steps:
pod_mount_vol-0.yaml:
```yaml
kind: Pod
apiVersion: v1
metadata:
  name: ubuntu-mountvol0
  namespace: default
spec:
  containers:
    - name: ubuntu
      image: ubuntu
      command: ["/bin/sleep", "3650d"]
      volumeMounts:
        - mountPath: "/data/"
          name: vol-0
  volumes:
    - name: vol-0
      persistentVolumeClaim:
        claimName: vol-0
```
2nd_pod_mount_rwx_vol.yaml:
```yaml
kind: Pod
apiVersion: v1
metadata:
  name: ubuntu-mount-rwx-vol
  namespace: default
spec:
  containers:
    - name: ubuntu
      image: ubuntu
      command: ["/bin/sleep", "3650d"]
      volumeMounts:
        - mountPath: "/data/"
          name: vol-0
  volumes:
    - name: vol-0
      persistentVolumeClaim:
        claimName: vol-0
```
attach_detach_pod_rwx.sh:
```bash
#!/bin/bash

# This script is from Derek in GitHub issue #6776.
# It assumes that pod_mount_rwx_vol.yaml has already been applied, so that the
# PVC referenced in pod_mount_rwx_vol.yaml will exist.

# Set the path to the pod_mount_rwx_vol.yaml file
POD_YAML_FILE="pod_mount_rwx_vol.yaml"

# Set the path to the kubeconfig file
KUBECONFIG_PATH="/home/ryao/Desktop/note/longhorn-tool/ryao-master.yaml"

# Set the number of iterations to perform
NUM_ITERATIONS=200

# Set timeout values for attachment and detachment
ATTACH_WAIT_SECONDS=120
DETACH_WAIT_SECONDS=300

for ((i=1; i<=$NUM_ITERATIONS; i++))
do
  echo "Iteration $i/$NUM_ITERATIONS"

  echo "Attaching"
  kubectl apply -f "$POD_YAML_FILE" --kubeconfig="$KUBECONFIG_PATH"
  c=0
  while [ $c -lt $ATTACH_WAIT_SECONDS ]
  do
    phase=$(kubectl --kubeconfig="$KUBECONFIG_PATH" get pod ubuntu-mount-rwx-vol -o=jsonpath="{['status.phase']}" 2>/dev/null)
    if [ x"$phase" == x"Running" ]; then
      break
    fi
    sleep 1
    c=$((c+1))
    if [ x"$c" = x"$ATTACH_WAIT_SECONDS" ]; then
      echo "Found error"
      echo "Error: Pod failed to reach Running state within $ATTACH_WAIT_SECONDS seconds"
      exit 1
    fi
  done

  echo "Detaching"
  kubectl delete -f "$POD_YAML_FILE" --kubeconfig="$KUBECONFIG_PATH"

  # Wait for a few seconds before the next iteration
  sleep 5
done
```
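Note: POD_YAML_FILE must point at the manifest for the ubuntu-mount-rwx-vol pod shown above; if the manifest is saved as 2nd_pod_mount_rwx_vol.yaml, adjust the variable or filename accordingly.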
cc @derekbit @longhorn/qa
Describe the bug
Over time, share-manager-pvc appears to leak memory, starting at ~100 MB and growing, as far as we've seen, to ~15 GB.
To Reproduce
Not sure; we just wait over time. It seems like a leak, as the graph has jumps in it, perhaps related to pod restarts/mounts?
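In the meantime, a quick sketch for capturing the growth over time, assuming metrics-server is installed so that kubectl top works:

```bash
# Log share-manager pod memory usage once a minute (requires metrics-server).
while true; do
  date -u +%Y-%m-%dT%H:%M:%SZ
  kubectl -n longhorn-system top pod | grep share-manager
  sleep 60
done
```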
Expected behavior
Stable memory usage
Support bundle for troubleshooting
Will do
Environment
Additional context