Our application XGC has conditional coding for GPU-aware MPI, which has been working correctly on some systems such as Perlmutter with NVIDIA A100 GPUs (cray-mpich/8.1.28) and Frontier with AMD MI250X GPUs (cray-mpich).
Testing this on the Sunspot testbed at Argonne using Intel PVC GPUs (Aurora MPICH: mpich/icc-all-pmix-gpu/52.2), I observe uncontrolled GPU memory growth, apparently stemming from an MPI_Alltoallv() with large message sizes (O(GB)). The Aurora MPICH developers at Intel asked me to create a ticket here and provide them the ticket number.
This output shows memory usage queries at various timesteps in the test run, eventually leading to running out of GPU memory:
Step 1:
CPU memory usage at the beginning of time step: Min/Avg/Max used = 47.39/47.39/47.39GB (1134.38GB total available), min=0, max=0
GPU memory usage at the beginning of time step: Min/Avg/Max used = 11.94/11.94/11.94GB (64.00GB total available), min=0, max=1
…
Step 5:
CPU memory usage at the beginning of time step: Min/Avg/Max used = 56.27/56.27/56.27GB (1134.38GB total available), min=0, max=0
GPU memory usage at the beginning of time step: Min/Avg/Max used = 35.79/35.79/35.79GB (64.00GB total available), min=0, max=1
…
Step 10:
CPU memory usage at the beginning of time step: Min/Avg/Max used = 56.24/56.24/56.24GB (1134.38GB total available), min=0, max=0
GPU memory usage at the beginning of time step: Min/Avg/Max used = 60.71/60.71/60.71GB (64.00GB total available), min=0, max=1
…
x1921c0s6b0n0.hostmgmt2000.cm.americas.sgi.com 1: terminate called after throwing an instance of 'std::runtime_error'
what(): Kokkos failed to allocate memory for label "sendbuf". Allocation using MemorySpace named "SYCLDeviceUSM" failed with the following error: Allocation of size 2.067 G failed because of an unknown error. (The allocation mechanism was sycl::malloc_device().)
@zippylab do you have a small reproducer you can share that mimics the workload and the described issue? We believe we understand the cause, but we need to be able to validate the solution.
@abrooks98 I don't have a small reproducer yet. Where we observed it is pretty deep down in XGC functionality, and involves a number of template instances as well as Kokkos views of more than one variety including unmanaged views. Constructing something simple to demonstrate it may take quite a bit of trial-and-error. I'll start working on it, but meanwhile it may be that @zhenggb72, one of the Intel people I've been working on this with, could help with validating the solution using XGC.