Closed
Description
Hi, it seems that the device memory of GpuBuffer is not released when the buffer is deleted on a system where NVLS is supported. Is this by design, a setup issue on my end, or a bug in the code? Minimal reproduction:
(mscclpp) root@coriander:~/forestcoll-private# python
Python 3.9.21 (main, Dec 11 2024, 16:24:11)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from mscclpp.utils import GpuBuffer
>>> import cupy as cp
>>> cp.cuda.Device(0).use()
<CUDA Device 0>
>>> for i in range(100):
...     if i % 10 == 0:
...         print(f"{i=}", flush=True)
...     mem = GpuBuffer(2 ** 30, dtype=cp.int32)
...     del mem
...
i=0
i=10
i=20
i=30
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/root/miniconda3/envs/mscclpp/lib/python3.9/site-packages/mscclpp/utils.py", line 155, in __new__
    buffer = RawGpuBuffer(np.prod(shape) * np.dtype(dtype).itemsize)
mscclpp._mscclpp.CuError: (2, 'Call to cuMemCreate(&memHandle, nbytes, &prop, 0 ) failed./root/mscclpp/src/gpu_utils.cc:113 (Cu failure: out of memory)')
Environment:
(mscclpp) root@coriander:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
(mscclpp) root@coriander:~# nvidia-smi
Sun Feb 23 09:18:27 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15 Driver Version: 570.86.15 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H200 Off | 00000000:03:00.0 Off | 0 |
| N/A 33C P0 76W / 700W | 1MiB / 143771MiB | 0% Default |
| | | Disabled |
...
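Each `GpuBuffer(2 ** 30, dtype=cp.int32)` is 4 GiB, so running out of memory between `i=30` and `i=40` on a 143771 MiB device is consistent with none of the buffers being released. One way to confirm this without running until failure might be to sample the free device memory after each `del`. The following is only a sketch: `free_memory_trend`, `looks_leaky`, and `probe_gpu_buffer` are hypothetical helper names, not part of mscclpp, and it assumes CuPy's `cp.cuda.runtime.memGetInfo()` reports the memory held by these allocations.

```python
import gc


def free_memory_trend(alloc, probe_free, iterations=5):
    """Allocate and drop a buffer repeatedly; record free bytes after each drop."""
    samples = []
    for _ in range(iterations):
        buf = alloc()
        del buf          # drop the only Python reference
        gc.collect()     # rule out delayed garbage collection
        samples.append(probe_free())
    return samples


def looks_leaky(samples):
    """True if free memory strictly decreases from one iteration to the next."""
    return len(samples) > 1 and all(b < a for a, b in zip(samples, samples[1:]))


def probe_gpu_buffer(iterations=5):
    """Hypothetical GPU check; needs cupy and mscclpp, so imports are deferred."""
    import cupy as cp
    from mscclpp.utils import GpuBuffer

    cp.cuda.Device(0).use()
    return free_memory_trend(
        alloc=lambda: GpuBuffer(2 ** 30, dtype=cp.int32),
        probe_free=lambda: cp.cuda.runtime.memGetInfo()[0],  # (free, total)[0]
        iterations=iterations,
    )
```

On the affected machine, `looks_leaky(probe_gpu_buffer())` returning `True` would suggest that free memory shrinks by roughly one buffer per iteration, which would point at the allocation not being released rather than at fragmentation.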