[BUG] share-manager-pvc appears to be leaking memory #8394
Not sure, but I think the big spikes may correspond to attaching the volume to a different pod, given the timings.
supportbundle_5c8ad69a-4ef9-47a4-8c8c-6aaa06a38849_2024-04-19T06-15-54Z.zip
From running top in the share-manager-pvc pods, it looks like ganesha.nfsd is at fault (nfs-ganesha/nfs-ganesha#1105?).
Can you log in to the
@PhanLe1010 Have you ever observed the memory leak in the RWX performance investigation?
```
share-manager-pvc-0cb784d8-46f9-42e1-9476-5918989ec94f:/ # ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 1237096 13200 ? Ssl Apr14 2:23 /longhorn-share-manager --debug daemon --volume pvc-0cb784d8-46f9-42e1-9476-591
root 38 1.6 1.6 5015480 2145440 ? Sl Apr14 113:57 ganesha.nfsd -F -p /var/run/ganesha.pid -f /tmp/vfs.conf
root 610865 0.8 0.0 6956 4268 pts/0 Ss 07:30 0:00 bash
root 610893 133 0.0 13748 4048 pts/0 R+ 07:30 0:00 ps aux
share-manager-pvc-0cb784d8-46f9-42e1-9476-5918989ec94f:/ # cat /proc/38/status
Name: ganesha.nfsd
Umask: 0000
State: S (sleeping)
Tgid: 38
Ngid: 0
Pid: 38
PPid: 1
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 64
Groups: 0
NStgid: 38
NSpid: 38
NSpgid: 1
NSsid: 1
VmPeak: 5083072 kB
VmSize: 5015480 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 2145444 kB
VmRSS: 2145444 kB
RssAnon: 2134608 kB
RssFile: 10836 kB
RssShmem: 0 kB
VmData: 2303600 kB
VmStk: 136 kB
VmExe: 16 kB
VmLib: 18704 kB
VmPTE: 4784 kB
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 0
THP_enabled: 1
Threads: 20
SigQ: 1/514185
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000005001
SigIgn: 0000000000000000
SigCgt: 0000000180000000
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Seccomp_filters: 0
Speculation_Store_Bypass: thread vulnerable
SpeculationIndirectBranch: conditional enabled
Cpus_allowed: fff
Cpus_allowed_list: 0-11
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 56
nonvoluntary_ctxt_switches: 109
share-manager-pvc-0cb784d8-46f9-42e1-9476-5918989ec94f:/ # cat /proc/1/status
Name: longhorn-share-
Umask: 0022
State: S (sleeping)
Tgid: 1
Ngid: 0
Pid: 1
PPid: 0
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 64
Groups: 0
NStgid: 1
NSpid: 1
NSpgid: 1
NSsid: 1
VmPeak: 1237096 kB
VmSize: 1237096 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 15228 kB
VmRSS: 13200 kB
RssAnon: 4784 kB
RssFile: 8416 kB
RssShmem: 0 kB
VmData: 46880 kB
VmStk: 136 kB
VmExe: 4380 kB
VmLib: 8 kB
VmPTE: 120 kB
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 0
THP_enabled: 1
Threads: 18
SigQ: 1/514185
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: fffffffc3bba3a00
SigIgn: 0000000000000000
SigCgt: fffffffd7fc1feff
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Seccomp_filters: 0
Speculation_Store_Bypass: thread vulnerable
SpeculationIndirectBranch: conditional enabled
Cpus_allowed: fff
Cpus_allowed_list: 0-11
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 150
nonvoluntary_ctxt_switches: 59
share-manager-pvc-0cb784d8-46f9-42e1-9476-5918989ec94f:/ #
```
The VmRSS of the nfs-ganesha process is too high (2145444 kB). I did not observe such a high VmRSS value in my cluster, and the value went back down after an IO-intensive task. To verify whether it is caused by the upstream regression you mentioned, could you provide us with the steps to reproduce as well as the information
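For tracking this, a minimal sketch of a loop that samples the ganesha.nfsd VmRSS inside a share-manager pod (the pod name is the one from this thread; this assumes pidof is available in the share-manager image):

```bash
#!/bin/bash
# Sample the VmRSS of ganesha.nfsd inside a share-manager pod once a minute.
# The pod name below is the one from this thread; substitute your own.
POD="share-manager-pvc-0cb784d8-46f9-42e1-9476-5918989ec94f"
NAMESPACE="longhorn-system"

while true; do
  date -u +%Y-%m-%dT%H:%M:%SZ
  kubectl -n "$NAMESPACE" exec "$POD" -- \
    sh -c 'grep VmRSS "/proc/$(pidof ganesha.nfsd)/status"'
  sleep 60
done
```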
I'm not sure if it's caused by that regression; I was just looking for leaks upstream, given it's the NFS server using all the RAM. The workload is that maybe 10-15 times a day rsync is run over ~4000 files, and half a dozen files get copied to the NFS share (which in our case corresponds to releasing some code); then, a few hundred times a day, a few dozen of those files are read at a time. It's quite light, read-mostly use. By manifest, are you after a test workload?
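For what it's worth, a rough synthetic sketch of that workload (all paths, sizes, and counts below are illustrative assumptions, not taken from the report):

```bash
#!/bin/bash
# Rough approximation of the reported workload: periodic rsync of many small
# files onto the RWX mount, followed by read-mostly access.
# /mnt/rwx is a placeholder for wherever the Longhorn RWX volume is mounted.
SRC=/tmp/rwx-src
DST=/mnt/rwx/release

mkdir -p "$SRC" "$DST"
# Create ~4000 small files once, mirroring the "rsync on ~4000 files" pattern.
for i in $(seq 1 4000); do
  head -c 1024 /dev/urandom > "$SRC/file-$i"
done

while true; do
  rsync -a "$SRC/" "$DST/"              # the "release" step
  for i in $(seq 1 24); do              # read-mostly phase
    cat "$DST/file-$((RANDOM % 4000 + 1))" > /dev/null
  done
  sleep 300
done
```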
OK. I will try the steps in our lab. |
@derekbit I can't be sure, but v1.5.5 looks better; below is the memory usage with v1.6.1 jumping when the files are read by some nightly jobs. One thing I hadn't thought of is that those pods are all transient, and there are a few thousand of them, so this also corresponds to a lot of mounting and unmounting.
@blushingpenguin I can build a customized share-manager image with the fix for the memory leak. Can you help test whether it is the culprit in v1.6.1? What do you think?
@derekbit I'm happy to test a fixed build; however, I think it's probably worth giving v1.5.5 another 24h to confirm that the leak doesn't occur with our weekday usage pattern. I'll check it tomorrow and report back. Thanks, Mark
@blushingpenguin To replace the share-manager image, edit the longhorn-manager daemonset by
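A minimal sketch of that edit (the replacement image tag below is a hypothetical placeholder; use the tag provided for testing):

```bash
# Edit the longhorn-manager daemonset and point its --share-manager-image
# argument at the custom build (the tag below is a hypothetical placeholder):
kubectl -n longhorn-system edit daemonset longhorn-manager
# then change the container args, e.g.:
#   - --share-manager-image
#   - derekbit/longhorn-share-manager:fix-8394   # hypothetical tag
```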
Many thanks. |
@derekbit yes, it looks like that was the major cause. |
Thanks @blushingpenguin for the quick update. |
Pre Ready-For-Testing Checklist
longhorn/nfs-ganesha#13
This is a very, very, very bad situation! Will we have a release like... today?! In my cluster with just 5 nodes and 12 disks, Longhorn is using more than 100 GB of RAM across all nodes!!! The response time here will show us whether this is a reliable storage system; how can a critical bug like this happen in a release marked as stable?! Please, just fix it!
Can you elaborate more on your use case? I've provided a share-manager image to mitigate the issue before the v1.6.2 release. Please see #8394 (comment).
@brunosmartin It's upstream that's at fault here, and it's possibly related to usage patterns (I was seeing much higher RAM usage with volumes that were being mounted/unmounted a lot). This is a problem of modern software development, I suppose -- pretty much all software is built out of other components these days, and validating that they all work together is a pretty difficult task. I actually think the response has been excellent here; @derekbit was on the case immediately. You can just patch your deployment as outlined in #8394 (comment) until the next release -- this has mostly fixed the problem for me (it still looks a bit leaky, but nothing like before).
Maybe this is underestimated. My cluster doesn't make "tons of unmount and mount operations", but see the images above for how it behaved in the face of this bug: we upgraded on 04/23 from 1.5.4 to 1.6.1, and the memory leak clearly consumed lots of RAM. Then we applied this fix on 04/26, and it's stable now. Every peak in this image was a very painful, huge outage! Although we don't make "tons of unmount and mount operations", we have about five workloads where 3 pods use the same volume; maybe this is related somehow. I repeat: this critical bug is underestimated. I don't have any special use case, just a normal (small) Kubernetes cluster with Rancher, and it made my company's services burn for a week; it's very likely there are tons of users affected, and after one week we don't even have an RC build. @blushingpenguin, for modern software development problems we have modern management methods to deal with them. My point is that this critical bug shows it was a mistake to call version 1.6.1 stable. I also work on some open source projects, and my intention here is to help this project become more stable and reliable, as a storage system must be. Please tell me if I can provide any additional information on this issue.
@brunosmartin The high memory usage in your cluster sounds like it's triggered by the same nfs-ganesha bug.
@PhanLe1010 Just to clarify, I have 5 deployments with scale > 1, but I have many more RWX PVCs; I think about 50 RWX PVCs, most of them deployed with scale = 1.
Verified on master-head 20240507
The test steps:
deployment_rwx_test.sh:
```bash
#!/bin/bash

# Define the deployment name
DEPLOYMENT_NAME="rwx-test"
KUBECONFIG_PATH="/home/ryao/Desktop/note/longhorn-tool/ryao-161.yaml"

for ((i=1; i<=100; i++)); do
  # Scale the deployment up to 3 replicas
  kubectl --kubeconfig=$KUBECONFIG_PATH scale deployment $DEPLOYMENT_NAME --replicas=3

  # Wait for the deployment to have 3 ready replicas
  until [[ "$(kubectl --kubeconfig=$KUBECONFIG_PATH get deployment $DEPLOYMENT_NAME -o=jsonpath='{.status.readyReplicas}')" == "3" ]]; do
    ready_replicas=$(kubectl --kubeconfig=$KUBECONFIG_PATH get deployment $DEPLOYMENT_NAME -o=jsonpath='{.status.readyReplicas}')
    echo "Iteration #$i: $DEPLOYMENT_NAME has $ready_replicas ready replicas"
    sleep 1
  done

  # Check that all pods are in the "Running" state
  while [[ $(kubectl --kubeconfig=$KUBECONFIG_PATH get pods -l=app=$DEPLOYMENT_NAME -o=jsonpath='{.items[*].status.phase}') != "Running Running Running" ]]; do
    echo "Not all pods are in the 'Running' state. Waiting..."
    sleep 5
  done

  # Scale the deployment down to 1 replica
  kubectl --kubeconfig=$KUBECONFIG_PATH scale deployment $DEPLOYMENT_NAME --replicas=1

  # Wait for the deployment to have 1 ready replica
  until [[ "$(kubectl --kubeconfig=$KUBECONFIG_PATH get deployment $DEPLOYMENT_NAME -o=jsonpath='{.status.readyReplicas}')" == "1" ]]; do
    ready_replicas=$(kubectl --kubeconfig=$KUBECONFIG_PATH get deployment $DEPLOYMENT_NAME -o=jsonpath='{.status.readyReplicas}')
    echo "Iteration #$i: $DEPLOYMENT_NAME has $ready_replicas ready replicas"
    sleep 1
  done

  # Check that the remaining pod is in the "Running" state
  while [[ $(kubectl --kubeconfig=$KUBECONFIG_PATH get pods -l=app=$DEPLOYMENT_NAME -o=jsonpath='{.items[*].status.phase}') != "Running" ]]; do
    echo "Not all pods are in the 'Running' state. Waiting..."
    sleep 5
  done
done
```
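Note: the script assumes a deployment named rwx-test backed by a Longhorn RWX volume already exists, and the kubeconfig path must be adjusted for the local environment.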
Result: Passed
For the
When validating this issue, there is one thing to note: I originally followed the test plan steps below, but I couldn't reproduce the issue using that method.
The test steps:
pod_mount_vol-0.yaml:
```yaml
kind: Pod
apiVersion: v1
metadata:
  name: ubuntu-mountvol0
  namespace: default
spec:
  containers:
    - name: ubuntu
      image: ubuntu
      command: ["/bin/sleep", "3650d"]
      volumeMounts:
        - mountPath: "/data/"
          name: vol-0
  volumes:
    - name: vol-0
      persistentVolumeClaim:
        claimName: vol-0
```
2nd_pod_mount_rwx_vol.yaml:
```yaml
kind: Pod
apiVersion: v1
metadata:
  name: ubuntu-mount-rwx-vol
  namespace: default
spec:
  containers:
    - name: ubuntu
      image: ubuntu
      command: ["/bin/sleep", "3650d"]
      volumeMounts:
        - mountPath: "/data/"
          name: vol-0
  volumes:
    - name: vol-0
      persistentVolumeClaim:
        claimName: vol-0
```
attach_detach_pod_rwx.sh:
```bash
#!/bin/bash

# This script is from Derek in GitHub issue #6776.
# It assumes that pod_mount_rwx_vol.yaml has already been applied, so that the
# PVC referenced in pod_mount_rwx_vol.yaml will exist.

# Set the path to the pod_mount_rwx_vol.yaml file
POD_YAML_FILE="pod_mount_rwx_vol.yaml"

# Set the path to the kubeconfig file
KUBECONFIG_PATH="/home/ryao/Desktop/note/longhorn-tool/ryao-master.yaml"

# Set the number of iterations to perform
NUM_ITERATIONS=200

# Set timeout values for attachment and detachment
ATTACH_WAIT_SECONDS=120
DETACH_WAIT_SECONDS=300

for ((i=1; i<=$NUM_ITERATIONS; i++))
do
  echo "Iteration $i/$NUM_ITERATIONS"

  echo "Attaching"
  kubectl apply -f "$POD_YAML_FILE" --kubeconfig="$KUBECONFIG_PATH"
  c=0
  while [ $c -lt $ATTACH_WAIT_SECONDS ]
  do
    phase=$(kubectl --kubeconfig="$KUBECONFIG_PATH" get pod ubuntu-mount-rwx-vol -o=jsonpath="{['status.phase']}" 2>/dev/null)
    if [ x"$phase" == x"Running" ]; then
      break
    fi
    sleep 1
    c=$((c+1))
    if [ x"$c" = x"$ATTACH_WAIT_SECONDS" ]; then
      echo "Found error"
      echo "Error: Pod failed to reach Running state within $ATTACH_WAIT_SECONDS seconds"
      exit 1
    fi
  done

  echo "Detaching"
  kubectl delete -f "$POD_YAML_FILE" --kubeconfig="$KUBECONFIG_PATH"

  # Wait for a few seconds before the next iteration
  sleep 5
done
```
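Note: POD_YAML_FILE must point at the manifest for the ubuntu-mount-rwx-vol pod shown above; if the manifest is saved as 2nd_pod_mount_rwx_vol.yaml, adjust the variable or filename accordingly.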
cc @derekbit @longhorn/qa
Describe the bug
Over time, share-manager-pvc appears to leak memory, starting at ~100 MB and growing, as far as we've seen, to ~15 GB.
To Reproduce
Not sure; we just wait over time. It seems like a leak, as the graph has jumps in it, perhaps related to pod restarts/mounts?
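In the meantime, a quick sketch for capturing the growth over time, assuming metrics-server is installed so that kubectl top works:

```bash
# Log share-manager pod memory usage once a minute (requires metrics-server).
while true; do
  date -u +%Y-%m-%dT%H:%M:%SZ
  kubectl -n longhorn-system top pod | grep share-manager
  sleep 60
done
```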
Expected behavior
Stable memory usage
Support bundle for troubleshooting
Will do
Environment
Additional context