HYPRE+HIP Runtime Error #2910

Closed
wcdawn opened this issue Mar 22, 2022 · 21 comments

@wcdawn
Contributor

wcdawn commented Mar 22, 2022

I'm trying to compile with HIP & HYPRE. Compiling with just HIP works fine, but I'd like to use HypreBoomerAMG.
I get the following runtime error when running ex1p. Any help would be much appreciated.

Options used:
   --mesh ../data/star.mesh
   --order 1
   --no-static-condensation
   --no-partial-assembly
   --device cpu
   --visualization
Device configuration: cpu
Memory configuration: host-std
Number of finite element unknowns: 82561
Memory access fault by GPU node-1 (Agent handle: 0x24fa560) on address 0x7f3f67a0e000. Reason: Page not present or supervisor privilege.
[jezebel:1158855] *** Process received signal ***
[jezebel:1158855] Signal: Aborted (6)
[jezebel:1158855] Signal code:  (-6)
[jezebel:1158855] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x143c0)[0x7f4103a4b3c0]
[jezebel:1158855] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f40e7a7703b]
[jezebel:1158855] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f40e7a56859]
[jezebel:1158855] [ 3] /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1(+0x72847)[0x7f40e75d1847]
[jezebel:1158855] [ 4] /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1(+0x7545b)[0x7f40e75d445b]
[jezebel:1158855] [ 5] /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1(+0x18757)[0x7f40e7577757]
[jezebel:1158855] [ 6] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f4103a3f609]
[jezebel:1158855] [ 7] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f40e7b53163]
[jezebel:1158855] *** End of error message ***
Aborted (core dumped)

I'm using the master branch of MFEM & HYPRE v2.24.0.

HYPRE config

./configure \
  --with-hip \
  --disable-fortran \
  CC=/home/wcdawn/bin/openmpi-4.1.2/bin/mpicc \
  CXX=/home/wcdawn/bin/openmpi-4.1.2/bin/mpicxx \
  --with-MPI-include='/home/wcdawn/bin/openmpi-4.1.2/include' \
  --with-MPI-libs='mpi mpi_cxx' \
  --with-MPI-lib-dirs='/home/wcdawn/bin/openmpi-4.1.2/lib'

MFEM config

make config \
  MFEM_USE_HIP=YES HIP_ARCH=gfx1030 \
  MFEM_USE_MPI=YES \
  MPICC=/home/wcdawn/bin/openmpi-4.1.2/bin/mpicc \
  MPICXX=/home/wcdawn/bin/openmpi-4.1.2/bin/mpicxx \
  HYPRE_DIR=@MFEM_DIR@/../hypre-2.24.0/src/hypre \
  METIS_DIR=@MFEM_DIR@/../metis-5.1.0 MFEM_USE_METIS_5=YES
@tzanio
Member

tzanio commented Mar 22, 2022

@wcdawn, just to clarify -- you are using mfem:master as of today, i.e. mfem-4.4?

@wcdawn
Contributor Author

wcdawn commented Mar 22, 2022

@tzanio correct. MFEM v4.4. I just did a pull today.

@jandrej
Member

jandrej commented Mar 22, 2022

Device configuration: cpu
Memory configuration: host-std

looks suspicious?

@wcdawn
Contributor Author

wcdawn commented Mar 22, 2022

@jandrej I get the same message when passing -d hip. Here is the full command line that I ran.

$ /home/wcdawn/bin/openmpi-4.1.2/bin/mpirun -np 1 ./ex1p -d hip

Options used:
   --mesh ../data/star.mesh
   --order 1
   --no-static-condensation
   --no-partial-assembly
   --device hip
   --visualization
Device configuration: hip,cpu
Memory configuration: host-std,hip
Number of finite element unknowns: 82561
Memory access fault by GPU node-1 (Agent handle: 0x3c04670) on address 0x7f3c0fe04000. Reason: Page not present or supervisor privilege.
[jezebel:1158921] *** Process received signal ***
[jezebel:1158921] Signal: Aborted (6)
[jezebel:1158921] Signal code:  (-6)
[jezebel:1158921] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x143c0)[0x7f3d538003c0]
[jezebel:1158921] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f3d3782c03b]
[jezebel:1158921] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f3d3780b859]
[jezebel:1158921] [ 3] /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1(+0x72847)[0x7f3d37386847]
[jezebel:1158921] [ 4] /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1(+0x7545b)[0x7f3d3738945b]
[jezebel:1158921] [ 5] /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1(+0x18757)[0x7f3d3732c757]
[jezebel:1158921] [ 6] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f3d537f4609]
[jezebel:1158921] [ 7] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f3d37908163]
[jezebel:1158921] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node jezebel exited on signal 6 (Aborted).
--------------------------------------------------------------------------

@pazner
Member

pazner commented Mar 22, 2022

Hello @wcdawn,

Are you able to get a backtrace by running with gdb?

Also, we have tested with hypre version 2.23; do you encounter the same crashes with that version as well?
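If it helps, with a single rank you can usually run the example directly under gdb (OpenMPI supports singleton runs); a rough sketch:

gdb --args ./ex1p -d hip
(gdb) run
... wait for the crash ...
(gdb) bt

This is just a sketch -- a debug build of MFEM (MFEM_DEBUG=YES) will give more useful line information in the backtrace.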

@wcdawn
Contributor Author

wcdawn commented Mar 22, 2022

@pazner I rebuilt with HYPRE v2.23.0 and get the same error. I also rebuilt MFEM with MFEM_DEBUG=YES so I could get backtrace info. Here is the backtrace.

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fffdbf95859 in __GI_abort () at abort.c:79
#2  0x00007fffdbb10847 in rocr::core::Runtime::VMFaultHandler(long, void*) ()
   from /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1
#3  0x00007fffdbb1345b in rocr::core::Runtime::AsyncEventsLoop(void*) ()
   from /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1
#4  0x00007fffdbab6757 in rocr::os::ThreadTrampoline(void*) () from /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1
#5  0x00007ffff7f7e609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007fffdc092163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@v-dobrev
Member

It looks like this stack trace and crash happen on a separate thread, not on the main thread. Can you try switching to the main thread and getting the stack trace there? I suspect that the thread that crashes is created during some call to the HIP/ROCm runtime, but it is best to confirm that and find out exactly which call causes it.

@wcdawn
Contributor Author

wcdawn commented Mar 23, 2022

@v-dobrev Here is the output of info threads from gdb

  Id   Target Id                                  Frame
  1    Thread 0x7fffdb7f7980 (LWP 1183544) "ex1p" 0x00007fffdc00d231 in tcache_get (tc_idx=<optimized out>) at malloc.c:2937
  2    Thread 0x7fffdb704700 (LWP 1183548) "ex1p" 0x00007fffdc0859cf in __GI___poll (fds=0x7fffd4000b60, nfds=1, timeout=3599969) at ../sysdeps/unix/sysv/linux/poll.c:29
  3    Thread 0x7fffdad42700 (LWP 1183549) "ex1p" 0x00007fffdc09249e in epoll_wait (epfd=10, events=0x2250d20, maxevents=32, timeout=119968) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
* 4    Thread 0x7fffd99df700 (LWP 1183550) "ex1p" __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
  6    Thread 0x7fffd8f7f700 (LWP 1183552) "ex1p" 0x00007fffdc0873db in ioctl () at ../sysdeps/unix/syscall-template.S:78

The backtrace from thread 1 looks interesting.

(gdb) thread 1
[Switching to thread 1 (Thread 0x7fffdb7f7980 (LWP 1183544))]
#0  0x00007fffdc00d231 in tcache_get (tc_idx=<optimized out>) at malloc.c:2937
2937    malloc.c: No such file or directory.
(gdb) bt
#0  0x00007fffdc00d231 in tcache_get (tc_idx=<optimized out>) at malloc.c:2937
#1  __GI___libc_malloc (bytes=23) at malloc.c:3051
#2  0x00007fffdc20fb39 in operator new(unsigned long) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fffc96c891f in ?? () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#4  0x00007fffc5205994 in amd_comgr_get_metadata_string () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#5  0x00007ffff71eac24 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#6  0x00007ffff71ecc43 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#7  0x00007fffc5205c51 in amd_comgr_iterate_map_metadata () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#8  0x00007ffff71ee5f8 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#9  0x00007ffff71c1e7b in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#10 0x00007ffff7180653 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#11 0x00007ffff718108d in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#12 0x00007ffff71a9dec in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#13 0x00007ffff702265e in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#14 0x00007ffff702418e in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#15 0x00007ffff6fe5f7d in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#16 0x00007ffff71073bf in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#17 0x00007ffff70ea857 in hipLaunchKernel () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#18 0x0000000001d51808 in std::enable_if<rocprim::detail::default_scan_config<0u, int>::use_lookback, hipError_t>::type rocprim::detail::scan_impl<true, rocprim::detail::default_scan_config<0u, int>, int*, int*, int, thrust::plus<void> >(void*, unsigned long&, int*, int*,
int, unsigned long, thrust::plus<void>, ihipStream_t*, bool) ()
#19 0x0000000001d50b08 in thrust::hip_rocprim::exclusive_scan_n<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::hip_rocprim::execute_on_stream_base>, int*, long, int*, int, thrust::plus<void> >(thrust::hip_rocprim::execution_policy<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::hip_rocprim::execute_on_stream_base> >&, int*, long, int*, int, thrust::plus<void>)::workaround::par(thrust::hip_rocprim::execution_policy<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::hip_rocprim::execute_on_stream_base> >&, int*, long, int*, int, thrust::plus<void>) ()
#20 0x0000000001dc3311 in hypreDevice_IntegerExclusiveScan(int, int*) ()
#21 0x0000000001d73db2 in hypre_BoomerAMGCreateSDevice ()
#22 0x0000000001c4c027 in hypre_BoomerAMGCreateS ()
#23 0x0000000001bdd85f in hypre_BoomerAMGSetup ()
#24 0x0000000001428131 in mfem::HypreSolver::Mult (this=0x4b96d10, b=..., x=...) at linalg/hypre.cpp:3638
#25 0x0000000001428f53 in mfem::HypreSolver::Mult (this=0x4b96d10, b=..., x=...) at linalg/hypre.cpp:3721
#26 0x000000000143b5ad in mfem::CGSolver::Mult (this=0x7fffffffa238, b=..., x=...) at linalg/solvers.cpp:730
#27 0x0000000001397596 in main (argc=<optimized out>, argv=<optimized out>) at ex1p.cpp:255

I'm starting to suspect that it could be a problem with the HIP/ROCm runtime as well.

@v-dobrev
Member

v-dobrev commented Mar 24, 2022

I just noticed that you did not set the GPU arch in your hypre config command -- try adding --with-gpu-arch=gfx... -- I'm not sure what happens when it is not set.
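For example, your earlier configure line with the architecture added (gfx1030, to match the HIP_ARCH you pass to MFEM) would look roughly like:

./configure \
  --with-hip \
  --with-gpu-arch=gfx1030 \
  --disable-fortran \
  CC=/home/wcdawn/bin/openmpi-4.1.2/bin/mpicc \
  CXX=/home/wcdawn/bin/openmpi-4.1.2/bin/mpicxx \
  --with-MPI-include='/home/wcdawn/bin/openmpi-4.1.2/include' \
  --with-MPI-libs='mpi mpi_cxx' \
  --with-MPI-lib-dirs='/home/wcdawn/bin/openmpi-4.1.2/lib'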

@wcdawn
Contributor Author

wcdawn commented Mar 24, 2022

Thanks for catching that. It doesn't seem to have changed anything and the backtrace looks the same.

#0  0x00007fffc99b3b91 in ?? () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#1  0x00007fffc96c87ac in ?? () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#2  0x00007fffc5205994 in amd_comgr_get_metadata_string () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#3  0x00007ffff71eac24 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#4  0x00007ffff71eb738 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#5  0x00007fffc5205c51 in amd_comgr_iterate_map_metadata () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#6  0x00007ffff71ee131 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#7  0x00007ffff71ee595 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#8  0x00007ffff71c1e7b in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#9  0x00007ffff7180653 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#10 0x00007ffff718108d in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#11 0x00007ffff71a9dec in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#12 0x00007ffff702265e in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#13 0x00007ffff702418e in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#14 0x00007ffff6fe5f7d in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#15 0x00007ffff71073bf in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#16 0x00007ffff70ea857 in hipLaunchKernel () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#17 0x0000000001d51808 in std::enable_if<rocprim::detail::default_scan_config<0u, int>::use_lookback, hipError_t>::type rocprim::detail::scan_impl<true, rocprim::detail::default_scan_config<0u, int>, int*, int*, int, thrust::plus<void> >(void*, unsigned long&, int*, int*, int, unsigned long, thrust::plus<void>, ihipStream_t*, bool) ()
#18 0x0000000001d50b08 in thrust::hip_rocprim::exclusive_scan_n<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::hip_rocprim::execute_on_stream_base>, int*, long, int*, int, thrust::plus<void> >(thrust::hip_rocprim::execution_policy<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::hip_rocprim::execute_on_stream_base> >&, int*, long, int*, int, thrust::plus<void>)::workaround::par(thrust::hip_rocprim::execution_policy<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::hip_rocprim::execute_on_stream_base> >&, int*, long, int*, int, thrust::plus<void>) ()
#19 0x0000000001dc3311 in hypreDevice_IntegerExclusiveScan(int, int*) ()
#20 0x0000000001d73db2 in hypre_BoomerAMGCreateSDevice ()
#21 0x0000000001c4c027 in hypre_BoomerAMGCreateS ()
#22 0x0000000001bdd85f in hypre_BoomerAMGSetup ()
#23 0x0000000001428131 in mfem::HypreSolver::Mult (this=0x4b96d10, b=..., x=...) at linalg/hypre.cpp:3638
#24 0x0000000001428f53 in mfem::HypreSolver::Mult (this=0x4b96d10, b=..., x=...) at linalg/hypre.cpp:3721
#25 0x000000000143b5ad in mfem::CGSolver::Mult (this=0x7fffffffa248, b=..., x=...) at linalg/solvers.cpp:730
#26 0x0000000001397596 in main (argc=<optimized out>, argv=<optimized out>) at ex1p.cpp:255

@v-dobrev
Member

Another suggestion/question: were you able to run older MFEM versions on this machine, e.g. right after #2750 was merged?

Also, just to confirm, if you build MFEM with HIP and HYPRE without HIP, does this work?
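(For that test, a CPU-only hypre build should be enough -- roughly your configure line with the HIP option dropped, e.g.:

./configure \
  --disable-fortran \
  CC=/home/wcdawn/bin/openmpi-4.1.2/bin/mpicc \
  CXX=/home/wcdawn/bin/openmpi-4.1.2/bin/mpicxx \
  --with-MPI-include='/home/wcdawn/bin/openmpi-4.1.2/include' \
  --with-MPI-libs='mpi mpi_cxx' \
  --with-MPI-lib-dirs='/home/wcdawn/bin/openmpi-4.1.2/lib'

and then rebuild MFEM against that hypre.)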

@wcdawn
Contributor Author

wcdawn commented Mar 24, 2022

I did git checkout 4e6b6f7d472b65d5fd25855a309e1666df7209d1, which was the commit that merged #2750, and got the same error.

Building MFEM with HIP and HYPRE without HIP does work. Additionally, building both without HIP works.

Is there a HYPRE example I can use to test this? It seems like it could be something in HYPRE itself, or maybe something in the MFEM/HYPRE interface.

@v-dobrev
Member

Hi @wcdawn,

were you able to figure out what the problem is?

@liruipeng, we suspect that the above issue (see the backtrace here: #2910 (comment)) may be in hypre. What would be a good way for @wcdawn to test this in hypre itself, without mfem?
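(Perhaps hypre's own test driver in src/test would work for that, e.g. something like the following to exercise BoomerAMG directly on a built-in Laplacian -- assuming I remember the driver options correctly:

cd hypre-2.24.0/src/test
make ij
mpirun -np 1 ./ij -solver 0 -n 50 50 50

but @liruipeng will know the recommended way.)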

@wcdawn
Contributor Author

wcdawn commented Apr 15, 2022

@v-dobrev Unfortunately not. I think it could be something with HYPRE. I'm not sure if it has been tested with this particular GPU.

@v-dobrev
Member

cc: @noelchalmers

@noelchalmers

and @pbauman

@pbauman

pbauman commented Apr 29, 2022

Hi folks. There is certainly an issue with HYPRE at present that I will try to address when I can. The issue is that the Navi gaming cards (gfx1030 indicates an RDNA2 card, so something like a 6900 XT) run with a warp/wavefront size of 32, while HYPRE on AMD GPUs is currently set up for a warp/wavefront size of 64. I'll post a note here when we update HYPRE to support wavefront size 32 on AMD GPUs.
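As a quick check on your side, rocminfo should report the wavefront size of the card; on a gfx1030 part I would expect it to show 32:

rocminfo | grep -i wavefront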

@v-dobrev
Member

@pbauman, thank you for looking into this issue.

@stale

stale bot commented May 30, 2022

⚠️ This issue or PR has been automatically marked as stale because it has not had any activity in the last month. If no activity occurs in the next week, it will be automatically closed. Thank you for your contributions.

@stale stale bot added the stale label May 30, 2022
@wcdawn
Contributor Author

wcdawn commented May 30, 2022

@pbauman @v-dobrev any update here or a potential timeline?

@stale stale bot removed the stale label May 30, 2022
@stale

stale bot commented Jul 2, 2022

⚠️ This issue or PR has been automatically marked as stale because it has not had any activity in the last month. If no activity occurs in the next week, it will be automatically closed. Thank you for your contributions.

@stale stale bot added the stale label Jul 2, 2022
@stale stale bot closed this as completed Jul 10, 2022