-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pciaccess not thread-safe, crashes during parallel topology loads #297
Comments
Hello Does this crash always happen? Or does it only crash from time to time, like a race condition? Same if you remove the pthread_barrier? I don't know if libpciaccess is thread-safe, we may have an issue in there. In the meantime:
I am digging inside the libpciaccess code now. I wouldn't be surprised if pci_device_get_device_name() loads the pci names in a global table shared by all threads :/ |
Hi,
Always with pthread_barrier. With out it crash from time to time.
It's work, thx! |
Indeed there's a global value describing the entire PCI hierarchy. And it can be modified dynamically when loading PCI ids. I think we'll need to serialize calls to pciaccess unfortunately :/ |
I added some locking around the pciaccess discovery, it seems to fix your issue. If you get a chance, can you try this tarball https://ci.inria.fr/hwloc/job/zbgoglin-0-tarball/lastSuccessfulBuild/artifact/hwloc-v1.11-20180216.1603.gitcc13c66.tar.gz ? Just configure --prefix=$PWD/install && make && make install and then set LD_LIBRARY_PATH to $PWD/install/lib before running your test program. |
It seems that everything is working properly. |
libpciaccess uses an internal global variable to describe the PCI hierarchy without internal refcounting. If multiple threads build a topology simultaneously, the first one will destroy the pciaccess stuff that others still need. Moreover, there are internal arrays that are dynamically reallocated and populated with vendor/device names according to caller's requests. Put a mutex around all of this discovery. In the mono-threaded case, the overhead of a single lock/unlock is negligible. In the multi-threaded case, pci discovery for individual topologies are serialized, which isn't very good, but people are encouraged not to build multiple topologies in multiple threads anyway. Closes #297 Finer grain serializing would require refcounting of users and careful review of pciaccess internal behavior with respect to dynamic realloc/populate of at least device/vendor names. Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr> (cherry picked from commit edabca0) Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
libpciaccess uses an internal global variable to describe the PCI hierarchy without internal refcounting. If multiple threads build a topology simultaneously, the first one will destroy the pciaccess stuff that others still need. Moreover, there are internal arrays that are dynamically reallocated and populated with vendor/device names according to caller's requests. Put a mutex around all of this discovery. In the mono-threaded case, the overhead of a single lock/unlock is negligible. In the multi-threaded case, pci discovery for individual topologies are serialized, which isn't very good, but people are encouraged not to build multiple topologies in multiple threads anyway. Closes #297 Finer grain serializing would require refcounting of users and careful review of pciaccess internal behavior with respect to dynamic realloc/populate of at least device/vendor names. Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr> (cherry picked from commit edabca0)
Thanks a lot for reporting the issue, for providing a simple test case, and for confirming that the patch helps. Things will be officially fixed in 1.11.10 and 2.0.1, to be released within a couple weeks. In the meantime, you can keep the nightly git snapshot (it's a stable branch, I don't expect many issues there). |
What version of hwloc are you using?
hwloc 1.11.9-1
linux-vdso.so.1 (0x00007ffde5d99000)
libhwloc.so.5 => /usr/lib/x86_64-linux-gnu/libhwloc.so.5 (0x00007f39e9f79000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f39e9c2e000)
libtinfo.so.5 => /lib/x86_64-linux-gnu/libtinfo.so.5 (0x00007f39e9a04000)
libcairo.so.2 => /usr/lib/x86_64-linux-gnu/libcairo.so.2 (0x00007f39e96e7000)
libSM.so.6 => /usr/lib/x86_64-linux-gnu/libSM.so.6 (0x00007f39e94df000)
libICE.so.6 => /usr/lib/x86_64-linux-gnu/libICE.so.6 (0x00007f39e92c2000)
libX11.so.6 => /usr/lib/x86_64-linux-gnu/libX11.so.6 (0x00007f39e8f82000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f39e8bcc000)
libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f39e89c1000)
libltdl.so.7 => /usr/lib/x86_64-linux-gnu/libltdl.so.7 (0x00007f39e87b7000)
/lib64/ld-linux-x86-64.so.2 (0x00007f39ea3d8000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f39e8599000)
libpixman-1.so.0 => /usr/lib/x86_64-linux-gnu/libpixman-1.so.0 (0x00007f39e82f3000)
libfontconfig.so.1 => /usr/lib/x86_64-linux-gnu/libfontconfig.so.1 (0x00007f39e80ae000)
libfreetype.so.6 => /usr/lib/x86_64-linux-gnu/libfreetype.so.6 (0x00007f39e7df9000)
libpng16.so.16 => /usr/lib/x86_64-linux-gnu/libpng16.so.16 (0x00007f39e7bc6000)
libxcb-shm.so.0 => /usr/lib/x86_64-linux-gnu/libxcb-shm.so.0 (0x00007f39e79c2000)
libxcb.so.1 => /usr/lib/x86_64-linux-gnu/libxcb.so.1 (0x00007f39e779a000)
libxcb-render.so.0 => /usr/lib/x86_64-linux-gnu/libxcb-render.so.0 (0x00007f39e758c000)
libXrender.so.1 => /usr/lib/x86_64-linux-gnu/libXrender.so.1 (0x00007f39e7382000)
libXext.so.6 => /usr/lib/x86_64-linux-gnu/libXext.so.6 (0x00007f39e7170000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f39e6f56000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f39e6d4e000)
libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x00007f39e6b49000)
libbsd.so.0 => /lib/x86_64-linux-gnu/libbsd.so.0 (0x00007f39e6934000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f39e6730000)
libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007f39e64fe000)
libXau.so.6 => /usr/lib/x86_64-linux-gnu/libXau.so.6 (0x00007f39e62fa000)
libXdmcp.so.6 => /usr/lib/x86_64-linux-gnu/libXdmcp.so.6 (0x00007f39e60f4000)
Which operating system and hardware are you running on?
Linux n2-1 4.14.0-3-amd64 #1 SMP Debian 4.14.13-1 (2018-01-14) x86_64 GNU/Linux
Machine (126GB total)
NUMANode L#0 (P#0 63GB)
Package L#0 + L3 L#0 (25MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#20)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#21)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#22)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#23)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#24)
L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#25)
L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#26)
L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#27)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
PU L#16 (P#8)
PU L#17 (P#28)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
PU L#18 (P#9)
PU L#19 (P#29)
HostBridge L#0
PCIBridge
PCI 9005:028c
Block(Disk) L#0 "sdh"
Block(Disk) L#1 "sdf"
Block(Disk) L#2 "sdd"
Block(Disk) L#3 "sdb"
Block(Disk) L#4 "sdi"
Block(Disk) L#5 "sdg"
Block(Disk) L#6 "sde"
Block(Disk) L#7 "sda"
Block(Disk) L#8 "sdc"
PCIBridge
PCI 10de:1023
GPU L#9 "renderD128"
GPU L#10 "card0"
PCIBridge
PCI 10de:1023
GPU L#11 "card1"
GPU L#12 "renderD129"
PCIBridge
PCI 8086:1d6b
PCIBridge
PCI 15b3:1003
Net L#13 "ib0"
OpenFabrics L#14 "mlx4_0"
PCIBridge
PCI 102b:0532
GPU L#15 "card4"
GPU L#16 "controlD68"
PCI 8086:1d02
NUMANode L#1 (P#1 63GB)
Package L#1 + L3 L#1 (25MB)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
PU L#20 (P#10)
PU L#21 (P#30)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
PU L#22 (P#11)
PU L#23 (P#31)
L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
PU L#24 (P#12)
PU L#25 (P#32)
L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
PU L#26 (P#13)
PU L#27 (P#33)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
PU L#28 (P#14)
PU L#29 (P#34)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
PU L#30 (P#15)
PU L#31 (P#35)
L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
PU L#32 (P#16)
PU L#33 (P#36)
L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
PU L#34 (P#17)
PU L#35 (P#37)
L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
PU L#36 (P#18)
PU L#37 (P#38)
L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
PU L#38 (P#19)
PU L#39 (P#39)
HostBridge L#7
PCIBridge
PCI 8086:1521
Net L#17 "enp129s0f0"
PCI 8086:1521
Net L#18 "enp129s0f1"
PCIBridge
PCI 10de:1023
GPU L#19 "card2"
GPU L#20 "renderD130"
PCIBridge
PCI 10de:1023
GPU L#21 "card3"
GPU L#22 "renderD131"
Details of the problem
In multithreaded applicaton (with pthread), synchronically call hwloc_topology_load with HWLOC_TOPOLOGY_FLAG_IO_DEVICES flag set from multiple threads results in a SEGSEGV or SIGABRT.
It's test for error:
#include <hwloc.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
int thr_num[] = {1,2};
pthread_t pthr[2];
unsigned int add_topo_flags;
pthread_barrier_t barrier;
void * pth_hwloc(void * thr_num){
hwloc_topology_t topo;
unsigned int topo_flag_loc;
}
void main(int argc, char * argv[]){
int thr_total = 2;
}
gdb backtrace:
Starting program: /home/worker/git/hwloc_issue/test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Load topology with out I/O devices. Threads - 2
[New Thread 0x7ffff6c61700 (LWP 2493)]
[New Thread 0x7ffff6460700 (LWP 2494)]
Thread - 1. All is OK!
[Thread 0x7ffff6c61700 (LWP 2493) exited]Thread - 2. All is OK!
Load topology with I/O devices. Threads - 2
[Thread 0x7ffff6460700 (LWP 2494) exited]
[New Thread 0x7ffff6460700 (LWP 2495)]
[New Thread 0x7ffff6c61700 (LWP 2496)]
Thread 5 "test" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6c61700 (LWP 2496)]
0x00007ffff5448f62 in ?? () from /usr/lib/x86_64-linux-gnu/libpciaccess.so.0
(gdb) bt
#0 0x00007ffff5448f62 in ?? () from /usr/lib/x86_64-linux-gnu/libpciaccess.so.0
#1 0x00007ffff5449142 in ?? () from /usr/lib/x86_64-linux-gnu/libpciaccess.so.0
#2 0x00007ffff54492e4 in pci_device_get_device_name () from /usr/lib/x86_64-linux-gnu/libpciaccess.so.0
#3 0x00007ffff56505af in ?? () from /usr/lib/x86_64-linux-gnu/hwloc/hwloc_pci.so
#4 0x00007ffff7ba65cc in hwloc_topology_load () from /usr/lib/x86_64-linux-gnu/libhwloc.so.5
#5 0x0000555555554b07 in pth_hwloc (thr_num=0x555555756084 <thr_num+4>) at hwloc_issue.c:25
#6 0x00007ffff798351a in start_thread (arg=0x7ffff6c61700) at pthread_create.c:465
#7 0x00007ffff76bb3ef in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
The text was updated successfully, but these errors were encountered: