Program hangs running on AMD Ryzen CPUs #536

Closed
L30nardoSV opened this issue Oct 10, 2017 · 10 comments

@L30nardoSV

L30nardoSV commented Oct 10, 2017

I am testing pocl 0.15-pre (compiled with LLVM 4.0.0) on AMD Ryzen 5 1600X CPUs.
The problem is that my application hangs at different points during execution. Here is a backtrace taken during one such hang:

Executing docking runs:
        20%        40%       60%       80%       100%
---------+---------+---------+---------+---------+
*****************************************^C

Thread 1 "ocladock_cpu_16" received signal SIGINT, Interrupt.
0x00007ffff6a84829 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

(gdb) bt
#0  0x00007ffff6a84829 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff7b791cc in pthread_scheduler_wait_cq () from /usr/local/lib64/libOpenCL.so.1
#2  0x00007ffff7b78830 in pocl_pthread_join () from /usr/local/lib64/libOpenCL.so.1
#3  0x00007ffff7b4a74a in clFinish () from /usr/local/lib64/libOpenCL.so.1
#4  0x00007ffff7b381d4 in clEnqueueMapBuffer () from /usr/local/lib64/libOpenCL.so.1
#5  0x00000000004028cb in memMap (cmd_queue=0x6c5220, dev_mem=0x897c10, flags=1, size=40)
    at ./wrapcl/src/BufferObjects.cpp:265
#6  0x000000000040c9cc in docking_with_gpu (mygrid=0x7fffffffde48, cpu_floatgrids=0x7ffff7f27010, 
    mypars=0x7ffffffd6f70, myligand_init=0x7ffffffd71a0, argc=0x7fffffffdf18, argv=0x7fffffffe008, 
    clock_start_program=15970) at ./host/src/performdocking.cpp:614
#7  0x000000000041790f in main (argc=7, argv=0x7fffffffe008) at ./host/src/main.cpp:78
(gdb)

Eventually the program is able to continue executing up to completion. A second problem, however, is that the computed results are not correct. I am using only single-precision FP, but I am not sure whether this issue is related to the first one.

Also, the temporary hang described above (lasting up to ~10 minutes) is unexpected: when I tried pocl 0.15-pre on an i5 CPU with LLVM 3.9.1, the program ran smoothly.

Any ideas on this? Thank you!

@pjaaskel
Member

Can you do 'thread apply all bt' to display where the other thread(s) are at? It sounds weird if this is CPU specific, but if it's a race, then it may just be good luck that it passes on the other platform.

@pjaaskel
Member

Also, is this a regression? Does it reproduce with 0.14?

@L30nardoSV
Author

L30nardoSV commented Oct 12, 2017

This is where the other threads are at:

Executing docking runs:
        20%        40%       60%       80%       100%
---------+---------+---------+---------+---------+
^C
Thread 1 "ocladock_cpu_16" received signal SIGINT, Interrupt.
0x00007ffff6a84829 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
Missing separate debuginfos, use: dnf debuginfo-install hwloc-libs-1.11.0-6.fc24.x86_64 libffi-3.1-9.fc24.x86_64 libgcc-6.4.1-1.fc25.x86_64 libstdc++-6.4.1-1.fc25.x86_64 libtool-ltdl-2.4.6-14.fc25.x86_64 ncurses-libs-6.0-6.20160709.fc25.x86_64 numactl-libs-2.0.11-2.fc24.x86_64 zlib-1.2.8-10.fc24.x86_64
(gdb) thread apply all bt

Thread 13 (Thread 0x7fffe92e0700 (LWP 14198)):
#0  0x00007ffff6a84829 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff7b79455 in pthread_scheduler_sleep () from /usr/local/lib64/libOpenCL.so.1
#2  0x00007ffff7b7a091 in pocl_pthread_driver_thread () from /usr/local/lib64/libOpenCL.so.1
#3  0x00007ffff6a7e73a in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff6fa6e7f in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7fffe9ae1700 (LWP 14197)):
#0  0x00007ffff6a84829 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff7b79455 in pthread_scheduler_sleep () from /usr/local/lib64/libOpenCL.so.1
#2  0x00007ffff7b7a091 in pocl_pthread_driver_thread () from /usr/local/lib64/libOpenCL.so.1
#3  0x00007ffff6a7e73a in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff6fa6e7f in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7fffea2e2700 (LWP 14196)):
#0  0x00007ffff6a84829 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff7b79455 in pthread_scheduler_sleep () from /usr/local/lib64/libOpenCL.so.1
#2  0x00007ffff7b7a091 in pocl_pthread_driver_thread () from /usr/local/lib64/libOpenCL.so.1
#3  0x00007ffff6a7e73a in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff6fa6e7f in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7fffeaae3700 (LWP 14195)):
#0  0x00007ffff6a84829 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff7b79455 in pthread_scheduler_sleep () from /usr/local/lib64/libOpenCL.so.1
#2  0x00007ffff7b7a091 in pocl_pthread_driver_thread () from /usr/local/lib64/libOpenCL.so.1
#3  0x00007ffff6a7e73a in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff6fa6e7f in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7fffeb2e4700 (LWP 14194)):
#0  0x00007ffff6a84829 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff7b79455 in pthread_scheduler_sleep () from /usr/local/lib64/libOpenCL.so.1
#2  0x00007ffff7b7a091 in pocl_pthread_driver_thread () from /usr/local/lib64/libOpenCL.so.1
#3  0x00007ffff6a7e73a in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff6fa6e7f in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7fffebae5700 (LWP 14193)):
#0  0x00007ffff6a84829 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff7b79455 in pthread_scheduler_sleep () from /usr/local/lib64/libOpenCL.so.1
#2  0x00007ffff7b7a091 in pocl_pthread_driver_thread () from /usr/local/lib64/libOpenCL.so.1
#3  0x00007ffff6a7e73a in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff6fa6e7f in clone () from /lib64/libc.so.6
---Type <return> to continue, or q <return> to quit---

Thread 7 (Thread 0x7fffec2e6700 (LWP 14192)):
#0  0x00007fffe813e329 in _pocl_launcher_perform_LS () from /home/wimi/lvs/.cache/pocl/kcache/LC/DJNEDINPAJINLNOOJPAOJNPAMEHHEPBHGJOMD/perform_LS/16-1-1/perform_LS.so
#1  0x00007fffe8144d45 in _pocl_launcher_perform_LS_workgroup ()
   from /home/wimi/lvs/.cache/pocl/kcache/LC/DJNEDINPAJINLNOOJPAOJNPAMEHHEPBHGJOMD/perform_LS/16-1-1/perform_LS.so
#2  0x0000000000000060 in ?? ()
#3  0x0000000000b2e700 in ?? ()
#4  0x00000000009ca380 in ?? ()
#5  0x00000000009cbb00 in ?? ()
#6  0x0000000000b8c380 in ?? ()
#7  0x0000000000000096 in ?? ()
#8  0x000000000000000b in ?? ()
#9  0x0000000000000009 in ?? ()
#10 0x0000000000000004 in ?? ()
#11 0x000000000000012c in ?? ()
#12 0x0000000000923a00 in ?? ()
#13 0x0000000000923d80 in ?? ()
#14 0x0000000000915580 in ?? ()
#15 0x000000000091b700 in ?? ()
#16 0x0000000000951d00 in ?? ()
#17 0x0000000000952200 in ?? ()
#18 0x0000000000952480 in ?? ()
#19 0x0000000000943600 in ?? ()
#20 0x0000000000898000 in ?? ()
#21 0x0000000000898300 in ?? ()
#22 0x0000000000898680 in ?? ()
#23 0x000000000088c480 in ?? ()
#24 0x000000000088c800 in ?? ()
#25 0x00000000009c6480 in ?? ()
#26 0x00007fffe0000900 in ?? ()
#27 0x00007fffe0000a80 in ?? ()
#28 0x00007fffe0000c00 in ?? ()
#29 0x00007fffe0000d80 in ?? ()
#30 0x00007fffe0000e00 in ?? ()
#31 0x00007fffe0001280 in ?? ()
#32 0x00007fffe0001080 in ?? ()
#33 0x00007fffe0001100 in ?? ()
#34 0x00007fffe0000e80 in ?? ()
#35 0x00007fffe0001300 in ?? ()
#36 0x00007fffe0000f80 in ?? ()
#37 0x00007fffe0001400 in ?? ()
#38 0x00007fffe0001500 in ?? ()
---Type <return> to continue, or q <return> to quit---
#39 0x00007fffe0001580 in ?? ()
#40 0x00007fffe0001780 in ?? ()
#41 0x00007fffe0001980 in ?? ()
#42 0x00007fffe0001b80 in ?? ()
#43 0x00007fffec2e5dd0 in ?? ()
#44 0x0000000000000090 in ?? ()
#45 0x00007fffe0001210 in ?? ()
#46 0x00007fffe00011f0 in ?? ()
#47 0x00007fffe00011d0 in ?? ()
#48 0x00007fffe0000de0 in ?? ()
#49 0x00007fffe0000cf0 in ?? ()
#50 0x00007fffe0000b80 in ?? ()
#51 0x00007fffe0000a10 in ?? ()
#52 0x00007fffe00012c0 in ?? ()
#53 0x000000000088cad8 in ?? ()
#54 0x00000000009ce6b0 in ?? ()
#55 0x000000000088c758 in ?? ()
#56 0x0000000000898938 in ?? ()
#57 0x00000000008985d8 in ?? ()
#58 0x0000000000897fe8 in ?? ()
#59 0x0000000000897fc8 in ?? ()
#60 0x000000000091bca8 in ?? ()
#61 0x00007fffec2e5dd0 in ?? ()
#62 0x00000000009523d8 in ?? ()
#63 0x00000000009521a8 in ?? ()
#64 0x000000000091bb68 in ?? ()
#65 0x000000000091b658 in ?? ()
#66 0x0000000000923f78 in ?? ()
#67 0x0000000000923d18 in ?? ()
#68 0x0000000000897df8 in ?? ()
#69 0x000000000000012c in ?? ()
#70 0x0000000000000004 in ?? ()
#71 0x0000000000948f28 in ?? ()
#72 0x000000000088d348 in ?? ()
#73 0x000000000088d208 in ?? ()
#74 0x0000000000000009 in ?? ()
#75 0x000000000088d0c8 in ?? ()
#76 0x000000000000000b in ?? ()
#77 0x0000000000000096 in ?? ()
#78 0x0000000000000060 in ?? ()
#79 0x000000000088cbe8 in ?? ()
#80 0x00007fffec2e5da0 in ?? ()
---Type <return> to continue, or q <return> to quit---
#81 0x0000000000000000 in ?? ()

Thread 6 (Thread 0x7fffecae7700 (LWP 14191)):
#0  0x00007ffff6a84829 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff7b79455 in pthread_scheduler_sleep () from /usr/local/lib64/libOpenCL.so.1
#2  0x00007ffff7b7a091 in pocl_pthread_driver_thread () from /usr/local/lib64/libOpenCL.so.1
#3  0x00007ffff6a7e73a in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff6fa6e7f in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fffed2e8700 (LWP 14190)):
#0  0x00007ffff6a84829 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff7b79455 in pthread_scheduler_sleep () from /usr/local/lib64/libOpenCL.so.1
#2  0x00007ffff7b7a091 in pocl_pthread_driver_thread () from /usr/local/lib64/libOpenCL.so.1
#3  0x00007ffff6a7e73a in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff6fa6e7f in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fffedae9700 (LWP 14189)):
#0  0x00007ffff6a84829 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff7b79455 in pthread_scheduler_sleep () from /usr/local/lib64/libOpenCL.so.1
#2  0x00007ffff7b7a091 in pocl_pthread_driver_thread () from /usr/local/lib64/libOpenCL.so.1
#3  0x00007ffff6a7e73a in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff6fa6e7f in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fffee2ea700 (LWP 14188)):
#0  0x00007ffff6a84829 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff7b79455 in pthread_scheduler_sleep () from /usr/local/lib64/libOpenCL.so.1
#2  0x00007ffff7b7a091 in pocl_pthread_driver_thread () from /usr/local/lib64/libOpenCL.so.1
#3  0x00007ffff6a7e73a in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff6fa6e7f in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fffeeaeb700 (LWP 14187)):
#0  0x00007ffff6a84829 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff7b79455 in pthread_scheduler_sleep () from /usr/local/lib64/libOpenCL.so.1
#2  0x00007ffff7b7a091 in pocl_pthread_driver_thread () from /usr/local/lib64/libOpenCL.so.1
#3  0x00007ffff6a7e73a in start_thread () from /lib64/libpthread.so.0
#4  0x00007ffff6fa6e7f in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7ffff7fbf780 (LWP 14183)):
#0  0x00007ffff6a84829 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff7b791cc in pthread_scheduler_wait_cq () from /usr/local/lib64/libOpenCL.so.1
#2  0x00007ffff7b78830 in pocl_pthread_join () from /usr/local/lib64/libOpenCL.so.1
#3  0x00007ffff7b4a74a in clFinish () from /usr/local/lib64/libOpenCL.so.1
---Type <return> to continue, or q <return> to quit---
#4  0x00007ffff7b381d4 in clEnqueueMapBuffer () from /usr/local/lib64/libOpenCL.so.1
#5  0x00000000004028cb in memMap (cmd_queue=0x6c5220, dev_mem=0x88d360, flags=1, size=40) at ./wrapcl/src/BufferObjects.cpp:265
#6  0x000000000040c9cc in docking_with_gpu (mygrid=0x7fffffffde48, cpu_floatgrids=0x7ffff7f27010, mypars=0x7ffffffd6f70, myligand_init=0x7ffffffd71a0, 
    argc=0x7fffffffdf18, argv=0x7fffffffe008, clock_start_program=17176) at ./host/src/performdocking.cpp:614
#7  0x000000000041790f in main (argc=7, argv=0x7fffffffe008) at ./host/src/main.cpp:78
(gdb) 

@franz
Contributor

franz commented Oct 12, 2017

Well, from a quick look, one (driver) thread is in the kernel (_pocl_launcher_perform_LS_workgroup), while another is waiting for command queue completion (pthread_scheduler_wait_cq), so it seems everything is working. There should be one core with high usage (the one running the kernel); can you check if that's the case?
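
The pattern described here (one thread busy in the kernel, the host thread parked in a timed condition wait until the queue drains) can be illustrated with a toy model. This is only a sketch of the general idea, not pocl's implementation: ToyQueue, run_kernel, wait_for_completion and launch_and_wait are hypothetical names, and the sleep merely stands in for kernel execution.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a command queue with one outstanding command.
struct ToyQueue {
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
};

// Worker ("driver thread"): runs the long kernel, then signals completion.
void run_kernel(ToyQueue& q, std::chrono::milliseconds kernel_time) {
    std::this_thread::sleep_for(kernel_time);  // stands in for kernel execution
    {
        std::lock_guard<std::mutex> lock(q.m);
        q.done = true;
    }
    q.cv.notify_all();
}

// Host side: sits in a timed wait until the command has finished.
// Returns how many times the wait timed out before completion.
int wait_for_completion(ToyQueue& q, std::chrono::milliseconds poll) {
    int timeouts = 0;
    std::unique_lock<std::mutex> lock(q.m);
    while (!q.done) {
        // wait_for returns false if the timeout expired and done is still false.
        if (!q.cv.wait_for(lock, poll, [&] { return q.done; })) ++timeouts;
    }
    return timeouts;
}

// Launches the worker, blocks the way a finish-style call would, then joins.
int launch_and_wait(std::chrono::milliseconds kernel_time,
                    std::chrono::milliseconds poll) {
    ToyQueue q;
    std::thread worker(run_kernel, std::ref(q), kernel_time);
    int timeouts = wait_for_completion(q, poll);
    assert(q.done);  // the host only proceeds once the command completed
    worker.join();
    return timeouts;
}
```

While the worker is still running, a debugger attached to such a program would show the host thread inside a condition-variable timed wait, which is what a slow (rather than deadlocked) kernel looks like from the outside.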

@franz
Contributor

franz commented Oct 12, 2017

Also, what are the global WG sizes (the global_work_size parameter to clEnqueueNDRangeKernel) that you usually use for this kernel?

@L30nardoSV
Author

@pjaaskel, with the pocl 0.14 release (LLVM 4.0.0) I do not experience this problem, i.e. the execution looks smooth and the numeric results are correct. Perhaps it is a regression?

@franz, the WG size is 144 for the perform_LS kernel.
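
For reference, the number of work-groups in a launch follows directly from the two sizes. A minimal sketch, assuming that 144 is the global work size and that the local size is 16 (inferred from the 16-1-1 component of the kernel cache path in the backtrace and the 16wi binary name; neither is confirmed in this thread). work_group_count is a hypothetical helper, not part of pocl or OpenCL:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of work-groups in one dimension of an NDRange.
// OpenCL 1.x requires the global size to be a multiple of the local size.
std::size_t work_group_count(std::size_t global_size, std::size_t local_size) {
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

Under those assumptions, work_group_count(144, 16) yields 9 work-groups.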

@pjaaskel
Member

Can you share the kernel?

@pjaaskel pjaaskel added this to the 1.0 milestone Oct 12, 2017
@L30nardoSV
Author

L30nardoSV commented Oct 12, 2017

Sure. The kernel belongs to this project I am currently working on:
https://git.esa.informatik.tu-darmstadt.de/docking/ocladock

Kernel source files are under "device" folder.

Compilation:
make DEVICE=CPU

Execution:
./bin/ocladock_cpu_16wi -ffile ./input/1stp/derived/1stp_protein.maps.fld -lfile ./input/1stp/derived/1stp_ligand.pdbqt -nrun 10

@franz franz self-assigned this Nov 16, 2017
@franz
Contributor

franz commented Nov 16, 2017

Tried on an AMD Ryzen; it seems to work:

$ ./bin/ocladock_cpu_16wi -ffile ./input/1stp/derived/1stp_protein.maps.fld -lfile ./input/1stp/derived/1stp_ligand.pdbqt -nrun 10

Kernel source used for development:      ./device/calcenergy.cl                  
Kernel string used for building:         ./host/inc/stringify.h                  
Kernel compilation flags:                 -I ./device -I ./common -DN16WI        

Executing docking runs:
        20%        40%       60%       80%       100%
---------+---------+---------+---------+---------+
**************************************************

Program run time 511.119 sec 

@L30nardoSV this seems like a hardware issue.

@pjaaskel
Member

...or just a hard-to-reproduce nondeterministic (race) issue.
