Description
Software versions
Python : 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:19:12) [GCC 13.3.0]
Platform : Linux-6.8.0-1021-azure-x86_64-with-glibc2.35
Legion : 25.3.0 (commit: 04b7d5068c5e75f29684703e8a1b8568b3e59b9a)
Legate : 25.03.02
cuPynumeric : 25.03.02
Numpy : 1.26.4
Scipy : 1.15.2
Numba : (failed to detect)
CTK package : cuda-version-12.8-h5d125a7_3 (conda-forge)
GPU driver : 570.86.15
GPU devices :
GPU 0: NVIDIA H100 NVL
MIG 1g.12gb Device 0:
MIG 1g.12gb Device 1:
MIG 1g.12gb Device 2:
MIG 1g.12gb Device 3:
MIG 1g.12gb Device 4:
MIG 1g.12gb Device 5:
Jupyter notebook / Jupyter Lab version
Jupyter Lab
Expected behavior
When a program run from a Jupyter notebook fails (for example, due to an OOM error), the process will be orphaned and not show up in nvidia-smi.
Observed behavior
The program failed (as expected, due to an OOM error). The kernel continues to run and has to be killed. Even while the kernel is running and the program exists, there is no process ID in nvidia-smi.
Example code or instructions
Clone the GitHub repo and, from the examples folder, run the cg.py file as shown below:
!legate --gpus 4 --sysmem 40000 ./examples/cg.py --num 225 --check --time
Stack traceback or browser console output
Output from the Jupyter cell:
Generating 50625x50625 2-D adjacency system without corners...
[0 - 7026a1093740] 0.138879 {5}{legate.mapper}: Failed to allocate 5125781248 bytes on memory 1e00000000000004 (of kind GPU_FB_MEM) for region requirement(s) {1} of Task cupynumeric::BinaryOpTask[/dli/task/./examples/cg.py:49] (UID 48)
[0 - 7026a1093740] 0.138911 {5}{legate.mapper}: corresponding to a LogicalStore allocated at ["/dli/task/./examples/cg.py:49", {"file": "/dli/task/./examples/cg.py", "line": 49}]
[0 - 7026a1093740] 0.138922 {5}{legate.mapper}: Failed to allocate 5125578752 bytes on memory 1e00000000000006 (of kind GPU_FB_MEM) for region requirement(s) {1} of Task cupynumeric::BinaryOpTask[/dli/task/./examples/cg.py:49] (UID 49)
[0 - 7026a1093740] 0.138927 {5}{legate.mapper}: corresponding to a LogicalStore allocated at ["/dli/task/./examples/cg.py:49", {"file": "/dli/task/./examples/cg.py", "line": 49}]
[0 - 7026a109f740] 0.138951 {5}{legate.mapper}: Failed to allocate 5125983752 bytes on memory 1e00000000000003 (of kind GPU_FB_MEM) for region requirement(s) {1} of Task cupynumeric::BinaryOpTask[/dli/task/./examples/cg.py:49] (UID 46)
[0 - 7026a109f740] 0.138966 {5}{legate.mapper}: corresponding to a LogicalStore allocated at ["/dli/task/./examples/cg.py:49", {"file": "/dli/task/./examples/cg.py", "line": 49}]
[0 - 7026a1093740] 0.139013 {5}{legate.mapper}: Failed to allocate 5125781248 bytes on memory 1e00000000000005 (of kind GPU_FB_MEM) for region requirement(s) {1} of Task cupynumeric::BinaryOpTask[/dli/task/./examples/cg.py:49] (UID 47)
[0 - 7026a1093740] 0.139027 {5}{legate.mapper}: corresponding to a LogicalStore allocated at ["/dli/task/./examples/cg.py:49", {"file": "/dli/task/./examples/cg.py", "line": 49}]
[0 - 7026a1093740] 0.139039 {5}{legate.mapper}: There is not enough space because Legate is reserving 10251562496 of the available 10964959232 bytes for the following LogicalStores:
[0 - 7026a1093740] 0.139045 {5}{legate.mapper}: LogicalStore allocated at ["/dli/task/./examples/cg.py:52", {"file": "/dli/task/./examples/cg.py", "line": 52}]:
[0 - 7026a1093740] 0.139052 {5}{legate.mapper}: Instance 4000000001000004 of size 5125781248 covering elements <0,25313>..<25312,50624>
[0 - 7026a1093740] 0.139056 {5}{legate.mapper}: created for an operation launched at /dli/task/./examples/cg.py:49
[0 - 7026a1093740] 0.139059 {5}{legate.mapper}: LogicalStore allocated at ["/dli/task/./examples/cg.py:49", {"file": "/dli/task/./examples/cg.py", "line": 49}]:
[0 - 7026a1093740] 0.139063 {5}{legate.mapper}: Instance 4000000001000001 of size 5125781248 covering elements <0,25313>..<25312,50624>
[0 - 7026a1093740] 0.139066 {5}{legate.mapper}: created for an operation launched at /dli/task/./examples/cg.py:49
LEGATE ERROR: ================================================================================
LEGATE ERROR: System: Linux, 6.8.0-1021-azure, 09507c2f6697, #25-Ubuntu SMP Wed Jan 15 20:45:09 UTC 2025, x86_64
LEGATE ERROR: Legate version: 25.3.2 (75dc0a92bbd2dfb79b6b680a0f37cbd0370d0181)
LEGATE ERROR: Legion version: 25.3.0 (04b7d5068c5e75f29684703e8a1b8568b3e59b9a)
LEGATE ERROR: Configure options: --LEGATE_ARCH=arch-conda --with-python --with-cc=/tmp/conda-croot/legate/_build_env/bin/x86_64-conda-linux-gnu-cc --with-cxx=/tmp/conda-croot/legate/_build_env/bin/x86_64-conda-linux-gnu-c++ --build-march=x86-64 --legion-max-dim=6 --with-openmp --with-cuda --with-cal --build-type=release --with-ucx
LEGATE ERROR: Exception stack contains 1 exception(s) (bottom-most exception first):
LEGATE ERROR:
LEGATE ERROR: #0 Legate called abort at /tmp/conda-croot/legate/work/src/cpp/legate/mapping/detail/base_mapper.cc:1282 in report_failed_mapping_()
LEGATE ERROR: #0 Out of memory
LEGATE ERROR: Stack trace (most recent call first):
LEGATE ERROR: #0 0x00007026c277a4d7 at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../../liblegate.so.25.03.02
LEGATE ERROR: #1 0x00007026c277264a at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../../liblegate.so.25.03.02
LEGATE ERROR: #2 0x00007026c2773bac at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../../liblegate.so.25.03.02
LEGATE ERROR: #3 0x00007026c2773eed at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../../liblegate.so.25.03.02
LEGATE ERROR: #4 0x00007026c27746b0 at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../../liblegate.so.25.03.02
LEGATE ERROR: #5 0x00007026c27773fd at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../../liblegate.so.25.03.02
LEGATE ERROR: #6 0x00007026a9ac8843 at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././liblegion-legate.so.1
LEGATE ERROR: #7 0x00007026a99ffd5c at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././liblegion-legate.so.1
LEGATE ERROR: #8 0x00007026a9a00e99 at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././liblegion-legate.so.1
LEGATE ERROR: #9 0x00007026a9a0dbbb at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././liblegion-legate.so.1
LEGATE ERROR: #10 0x00007026a99ebed5 at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././liblegion-legate.so.1
LEGATE ERROR: #11 0x00007026a9bde87f at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././liblegion-legate.so.1
LEGATE ERROR: #12 0x00007026a6f03ab0 at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././librealm-legate.so.1
LEGATE ERROR: #13 0x00007026a6f03b45 at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././librealm-legate.so.1
LEGATE ERROR: #14 0x00007026a6f02089 at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././librealm-legate.so.1
LEGATE ERROR: #15 0x00007026a6f07cc6 at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././librealm-legate.so.1
LEGATE ERROR: #16 0x000070280782612f at /usr/lib/x86_64-linux-gnu/libc.so.6
LEGATE ERROR: ================================================================================
Legion process received signal 6: Aborted
Process 104 on node 09507c2f6697 is frozen!
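The sizes in the mapper log above can be cross-checked with a little arithmetic. The sketch below is only a reading of the log, not something Legate prints: it assumes 8-byte elements (inferred from the numbers, not stated in the log) and that the 50625x50625 array is split into a 2x2 grid of tiles across the 4 GPUs. Under those assumptions the tile covering <0,25313>..<25312,50624> comes to exactly the 5125781248 bytes reported, two such instances add up to the 10251562496 bytes already reserved, and the remaining space is far smaller than the third allocation being requested.

```python
# Values copied from the mapper log above.
AVAILABLE_BYTES = 10964959232   # "available" framebuffer on one MIG slice
RESERVED_BYTES = 10251562496    # already reserved by Legate
REQUESTED_BYTES = 5125781248    # failed allocation (UID 48)

ITEMSIZE = 8  # assumption: 8-byte (float64) elements, inferred from the sizes


def tile_bytes(lo, hi, itemsize=ITEMSIZE):
    """Size in bytes of a 2-D tile with inclusive corner points lo and hi."""
    rows = hi[0] - lo[0] + 1
    cols = hi[1] - lo[1] + 1
    return rows * cols * itemsize


# Instance bounds as printed in the log: <0,25313>..<25312,50624>
instance_bytes = tile_bytes((0, 25313), (25312, 50624))
free_bytes = AVAILABLE_BYTES - RESERVED_BYTES

print("bytes per instance:     ", instance_bytes)
print("two instances:          ", 2 * instance_bytes)
print("free after reservations:", free_bytes)
print("shortfall for request:  ", REQUESTED_BYTES - free_bytes)
```

If the assumptions hold, each MIG slice can hold two of these tiles but not three, which is consistent with the failure appearing at cg.py:49 once the stores from lines 49 and 52 are both resident.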
The output from nvidia-smi is:
`
Every 2.0s: nvidia-smi 50:00c2f6697: Wed Aug 6 22:29:18 2025
Wed Aug 6 22:30:05 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15 Driver Version: 570.86.15 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 NVL On | 00000001:00:00.0 Off | On |
| N/A 39C P0 92W / 400W | 42875MiB / 95830MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| 0 7 0 0 | 10711MiB / 11008MiB | 16 0 | 1 0 1 0 1 |
| | 2MiB / 16383MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 8 0 1 | 10711MiB / 11008MiB | 16 0 | 1 0 1 0 1 |
| | 2MiB / 16383MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 9 0 2 | 10711MiB / 11008MiB | 16 0 | 1 0 1 0 1 |
| | 2MiB / 16383MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 11 0 3 | 10711MiB / 11008MiB | 16 0 | 1 0 1 0 1 |
| | 2MiB / 16383MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 12 0 4 | 17MiB / 11008MiB | 16 0 | 1 0 1 0 1 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 13 0 5 | 17MiB / 11008MiB | 16 0 | 1 0 1 0 1 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
`