[BUG] Jupyter bang command creating orphaned python processes #1222

@Marcus-M1999

Software versions

Python : 3.10.17 | packaged by conda-forge | (main, Apr 10 2025, 22:19:12) [GCC 13.3.0]
Platform : Linux-6.8.0-1021-azure-x86_64-with-glibc2.35
Legion : 25.3.0 (commit: 04b7d5068c5e75f29684703e8a1b8568b3e59b9a)
Legate : 25.03.02
cuPynumeric : 25.03.02
Numpy : 1.26.4
Scipy : 1.15.2
Numba : (failed to detect)
CTK package : cuda-version-12.8-h5d125a7_3 (conda-forge)
GPU driver : 570.86.15
GPU devices :
GPU 0: NVIDIA H100 NVL
MIG 1g.12gb Device 0:
MIG 1g.12gb Device 1:
MIG 1g.12gb Device 2:
MIG 1g.12gb Device 3:
MIG 1g.12gb Device 4:
MIG 1g.12gb Device 5:

Jupyter notebook / Jupyter Lab version

Jupyter Lab

Expected behavior

When a program is run from a Jupyter notebook and fails (for example, due to an OOM error), the process is orphaned and does not show up in nvidia-smi.

Observed behavior

The program failed (as expected, due to the OOM error). The kernel continues to run and has to be killed. Even while the kernel is running and the program still exists, no process ID appears in nvidia-smi.

Example code or instructions

Clone the GitHub repo and run the cg.py file from the examples folder as shown below:
!legate --gpus 4 --sysmem 40000 ./examples/cg.py --num 225 --check --time
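
For anyone trying to reproduce this without leaving stray processes behind, one possible approach is to launch the command from Python in its own process group instead of through the `!` bang syntax, so the whole group can be killed if it hangs. This is only a sketch: `run_in_own_group` and its `timeout` parameter are hypothetical names, and it has not been tested against Legate.

```python
import os
import signal
import subprocess


def run_in_own_group(cmd, timeout=None):
    """Run cmd in a new session; on timeout, kill its whole process group.

    Hypothetical helper for illustration, not part of Legate or Jupyter.
    """
    # start_new_session=True makes the child the leader of a new session
    # and process group, so its PID is also the process group ID.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Kill the child and anything it spawned in the same group.
        os.killpg(proc.pid, signal.SIGKILL)
        return proc.wait()
```

In a notebook cell this could be called as `run_in_own_group(["legate", "--gpus", "4", "--sysmem", "40000", "./examples/cg.py", "--num", "225", "--check", "--time"], timeout=600)`, where the 600-second timeout is an arbitrary choice.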

Stack traceback or browser console output

Output from the Jupyter cell:

Generating 50625x50625 2-D adjacency system without corners...
[0 - 7026a1093740]    0.138879 {5}{legate.mapper}: Failed to allocate 5125781248 bytes on memory 1e00000000000004 (of kind GPU_FB_MEM) for region requirement(s) {1} of Task cupynumeric::BinaryOpTask[/dli/task/./examples/cg.py:49] (UID 48)
[0 - 7026a1093740]    0.138911 {5}{legate.mapper}:   corresponding to a LogicalStore allocated at ["/dli/task/./examples/cg.py:49", {"file": "/dli/task/./examples/cg.py", "line": 49}]
[0 - 7026a1093740]    0.138922 {5}{legate.mapper}: Failed to allocate 5125578752 bytes on memory 1e00000000000006 (of kind GPU_FB_MEM) for region requirement(s) {1} of Task cupynumeric::BinaryOpTask[/dli/task/./examples/cg.py:49] (UID 49)
[0 - 7026a1093740]    0.138927 {5}{legate.mapper}:   corresponding to a LogicalStore allocated at ["/dli/task/./examples/cg.py:49", {"file": "/dli/task/./examples/cg.py", "line": 49}]
[0 - 7026a109f740]    0.138951 {5}{legate.mapper}: Failed to allocate 5125983752 bytes on memory 1e00000000000003 (of kind GPU_FB_MEM) for region requirement(s) {1} of Task cupynumeric::BinaryOpTask[/dli/task/./examples/cg.py:49] (UID 46)
[0 - 7026a109f740]    0.138966 {5}{legate.mapper}:   corresponding to a LogicalStore allocated at ["/dli/task/./examples/cg.py:49", {"file": "/dli/task/./examples/cg.py", "line": 49}]
[0 - 7026a1093740]    0.139013 {5}{legate.mapper}: Failed to allocate 5125781248 bytes on memory 1e00000000000005 (of kind GPU_FB_MEM) for region requirement(s) {1} of Task cupynumeric::BinaryOpTask[/dli/task/./examples/cg.py:49] (UID 47)
[0 - 7026a1093740]    0.139027 {5}{legate.mapper}:   corresponding to a LogicalStore allocated at ["/dli/task/./examples/cg.py:49", {"file": "/dli/task/./examples/cg.py", "line": 49}]
[0 - 7026a1093740]    0.139039 {5}{legate.mapper}: There is not enough space because Legate is reserving 10251562496 of the available 10964959232 bytes for the following LogicalStores:
[0 - 7026a1093740]    0.139045 {5}{legate.mapper}: LogicalStore allocated at ["/dli/task/./examples/cg.py:52", {"file": "/dli/task/./examples/cg.py", "line": 52}]:
[0 - 7026a1093740]    0.139052 {5}{legate.mapper}:   Instance 4000000001000004 of size 5125781248 covering elements <0,25313>..<25312,50624> 
[0 - 7026a1093740]    0.139056 {5}{legate.mapper}:     created for an operation launched at /dli/task/./examples/cg.py:49
[0 - 7026a1093740]    0.139059 {5}{legate.mapper}: LogicalStore allocated at ["/dli/task/./examples/cg.py:49", {"file": "/dli/task/./examples/cg.py", "line": 49}]:
[0 - 7026a1093740]    0.139063 {5}{legate.mapper}:   Instance 4000000001000001 of size 5125781248 covering elements <0,25313>..<25312,50624> 
[0 - 7026a1093740]    0.139066 {5}{legate.mapper}:     created for an operation launched at /dli/task/./examples/cg.py:49
LEGATE ERROR: ================================================================================
LEGATE ERROR: System: Linux, 6.8.0-1021-azure, 09507c2f6697, #25-Ubuntu SMP Wed Jan 15 20:45:09 UTC 2025, x86_64
LEGATE ERROR: Legate version: 25.3.2 (75dc0a92bbd2dfb79b6b680a0f37cbd0370d0181)
LEGATE ERROR: Legion version: 25.3.0 (04b7d5068c5e75f29684703e8a1b8568b3e59b9a)
LEGATE ERROR: Configure options: --LEGATE_ARCH=arch-conda --with-python --with-cc=/tmp/conda-croot/legate/_build_env/bin/x86_64-conda-linux-gnu-cc --with-cxx=/tmp/conda-croot/legate/_build_env/bin/x86_64-conda-linux-gnu-c++ --build-march=x86-64 --legion-max-dim=6 --with-openmp --with-cuda --with-cal --build-type=release --with-ucx
LEGATE ERROR: Exception stack contains 1 exception(s) (bottom-most exception first):
LEGATE ERROR:
LEGATE ERROR: #0 Legate called abort at /tmp/conda-croot/legate/work/src/cpp/legate/mapping/detail/base_mapper.cc:1282 in report_failed_mapping_()
LEGATE ERROR: #0 Out of memory
LEGATE ERROR: Stack trace (most recent call first):
LEGATE ERROR: #0  0x00007026c277a4d7 at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../../liblegate.so.25.03.02
LEGATE ERROR: #1  0x00007026c277264a at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../../liblegate.so.25.03.02
LEGATE ERROR: #2  0x00007026c2773bac at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../../liblegate.so.25.03.02
LEGATE ERROR: #3  0x00007026c2773eed at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../../liblegate.so.25.03.02
LEGATE ERROR: #4  0x00007026c27746b0 at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../../liblegate.so.25.03.02
LEGATE ERROR: #5  0x00007026c27773fd at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../../liblegate.so.25.03.02
LEGATE ERROR: #6  0x00007026a9ac8843 at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././liblegion-legate.so.1
LEGATE ERROR: #7  0x00007026a99ffd5c at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././liblegion-legate.so.1
LEGATE ERROR: #8  0x00007026a9a00e99 at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././liblegion-legate.so.1
LEGATE ERROR: #9  0x00007026a9a0dbbb at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././liblegion-legate.so.1
LEGATE ERROR: #10 0x00007026a99ebed5 at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././liblegion-legate.so.1
LEGATE ERROR: #11 0x00007026a9bde87f at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././liblegion-legate.so.1
LEGATE ERROR: #12 0x00007026a6f03ab0 at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././librealm-legate.so.1
LEGATE ERROR: #13 0x00007026a6f03b45 at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././librealm-legate.so.1
LEGATE ERROR: #14 0x00007026a6f02089 at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././librealm-legate.so.1
LEGATE ERROR: #15 0x00007026a6f07cc6 at /opt/conda/envs/legate/lib/python3.10/site-packages/legate/core/_lib/data/../../../../../.././librealm-legate.so.1
LEGATE ERROR: #16 0x000070280782612f at /usr/lib/x86_64-linux-gnu/libc.so.6
LEGATE ERROR: ================================================================================

Legion process received signal 6: Aborted
Process 104 on node 09507c2f6697 is frozen!
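
The sizes in the mapper log above can be checked with a little arithmetic. The sketch below assumes 8-byte (float64) elements, which is not stated in the log but is consistent with the reported byte counts.

```python
ITEMSIZE = 8  # assumption: float64 elements

n = 225 ** 2                 # --num 225 gives a 50625 x 50625 system
cols = n // 2                # 25312 columns: 25313..50624
rows = n - n // 2            # 25313 rows: 0..25312

# One instance covering <0,25313>..<25312,50624>
tile_bytes = rows * cols * ITEMSIZE      # 5125781248, as in the log

# Two such instances are already resident on the device
reserved = 2 * tile_bytes                # 10251562496, as in the log
available = 10964959232                  # from the log
free = available - reserved              # 713396736

print(tile_bytes, reserved, free)
print("fits:", tile_bytes <= free)       # fits: False
```

With roughly 713 MB left after the two resident instances, a third allocation of about 5.1 GB cannot fit, which matches the reported failure.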

The output from nvidia-smi is:

`

Every 2.0s: nvidia-smi                                                                                                       50:00c2f6697: Wed Aug  6 22:29:18 2025
Wed Aug  6 22:30:05 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15              Driver Version: 570.86.15      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 NVL                On  |   00000001:00:00.0 Off |                   On |
| N/A   39C    P0             92W /  400W |   42875MiB /  95830MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  0    7   0   0  |           10711MiB / 11008MiB    | 16      0 |  1   0    1    0    1 |
|                  |                 2MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    8   0   1  |           10711MiB / 11008MiB    | 16      0 |  1   0    1    0    1 |
|                  |                 2MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0    9   0   2  |           10711MiB / 11008MiB    | 16      0 |  1   0    1    0    1 |
|                  |                 2MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   11   0   3  |           10711MiB / 11008MiB    | 16      0 |  1   0    1    0    1 |
|                  |                 2MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   12   0   4  |              17MiB / 11008MiB    | 16      0 |  1   0    1    0    1 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+
|  0   13   0   5  |              17MiB / 11008MiB    | 16      0 |  1   0    1    0    1 |
|                  |                 0MiB / 16383MiB  |           |                       |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

`
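
Since the Processes table above is empty, one way to look for leftover processes is to scan `/proc` for matching command lines. This is a minimal, Linux-only sketch; `find_processes` is a hypothetical helper, and searching for the string "legate" is an assumption about how the orphaned processes would appear.

```python
import os


def find_processes(needle):
    """Return (pid, cmdline) pairs whose command line contains needle."""
    matches = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited or is not readable
        # Arguments in /proc/<pid>/cmdline are NUL-separated.
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        if needle in cmdline:
            matches.append((int(entry), cmdline))
    return matches
```

For example, `find_processes("legate")` run from another notebook cell or terminal would list any processes still alive after the failure, assuming they are visible in the same PID namespace.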
