[Bug] GpuBuffer memory leak when nvls enabled #470

@liangyuRain

Description

Hi, it seems that the device memory of a GpuBuffer is not released after the buffer is deleted when NVLS is supported. Is this by design, a setup issue on my end, or a bug in the code? Minimal reproduction:

(mscclpp) root@coriander:~/forestcoll-private# python
Python 3.9.21 (main, Dec 11 2024, 16:24:11) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from mscclpp.utils import GpuBuffer
>>> import cupy as cp
>>> cp.cuda.Device(0).use()
<CUDA Device 0>
>>> for i in range(100):
...     if i % 10 == 0:
...             print(f"{i=}", flush=True)
...     mem = GpuBuffer(2 ** 30, dtype=cp.int32)
...     del mem
... 
i=0
i=10
i=20
i=30
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/root/miniconda3/envs/mscclpp/lib/python3.9/site-packages/mscclpp/utils.py", line 155, in __new__
    buffer = RawGpuBuffer(np.prod(shape) * np.dtype(dtype).itemsize)
mscclpp._mscclpp.CuError: (2, 'Call to cuMemCreate(&memHandle, nbytes, &prop, 0 ) failed./root/mscclpp/src/gpu_utils.cc:113 (Cu failure: out of memory)')
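
For scale: each `GpuBuffer(2 ** 30, dtype=cp.int32)` requests 2^30 × 4 bytes = 4 GiB, so about 35 unreleased allocations would fill the roughly 140 GiB reported by `nvidia-smi` below. That is consistent with the failure falling between `i=30` and `i=40`. To confirm that free device memory drops on every iteration, something like the sketch below could help. The helper names `free_bytes_per_iteration` and `leaked_per_iteration` are hypothetical and not part of mscclpp; the allocator and the free-memory probe are passed in as callables so the helper itself has no GPU dependency.

```python
import gc
from typing import Callable, List


def free_bytes_per_iteration(
    allocate: Callable[[], object],
    free_bytes: Callable[[], int],
    iterations: int = 10,
) -> List[int]:
    """Allocate and drop a buffer repeatedly, sampling free memory after each drop."""
    samples = []
    for _ in range(iterations):
        buf = allocate()
        del buf          # drop the only reference to the buffer
        gc.collect()     # rule out delayed Python-side collection
        samples.append(free_bytes())
    return samples


def leaked_per_iteration(samples: List[int]) -> float:
    """Average drop in free bytes between consecutive samples (0.0 if fewer than two)."""
    if len(samples) < 2:
        return 0.0
    return (samples[0] - samples[-1]) / (len(samples) - 1)


# Assumed usage on the machine from this report (needs a GPU, cupy and mscclpp):
#
#   import cupy as cp
#   from mscclpp.utils import GpuBuffer
#
#   samples = free_bytes_per_iteration(
#       lambda: GpuBuffer(2 ** 30, dtype=cp.int32),
#       lambda: cp.cuda.runtime.memGetInfo()[0],  # (free, total) -> free
#   )
#   print(leaked_per_iteration(samples) / 2 ** 30, "GiB lost per iteration")
```

If the memory were being recycled, the samples should stay roughly flat; a steady drop of about 4 GiB per iteration would point at the buffer's physical allocation never being released.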

Environment:

(mscclpp) root@coriander:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
(mscclpp) root@coriander:~# nvidia-smi
Sun Feb 23 09:18:27 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15              Driver Version: 570.86.15      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    Off |   00000000:03:00.0 Off |                    0 |
| N/A   33C    P0             76W /  700W |       1MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
...
