x86: divide by zero because of invalid CPUID vendor on FreeBSD11.1 with Clang 4.0 #282
Comments
If cache->nbthreads_sharing is 0, we indeed have a problem. Can you take a git master tarball from https://ci.inria.fr/hwloc/job/master-0-tarball/ , build it, verify that its hwloc-info crashes too, run hwloc-gather-cpuid, and then send the "cpuid" directory that it should generate? That should give me everything I need to debug from here.
"cpuid_type=86212608" looks strange in your gdb output, and the code would indeed crash if we try to discover both AMD and Intel caches. I can work around that crash (first patch below), but there's something else going wrong with the vendor reported in cpuid (second patch below).
Oh we had a similar issue in OpenMPI on FreeBSD11.1 when using clang 4.0. No problem with GCC. |
I applied your patches to 1.11.7 and the crash is resolved. Is there anything else we should do to fully verify proper function? Thanks! Jason root@login:/usr/ports/wip/hwloc # hwloc-info |
The vendor string is correct so the compiler was indeed generating buggy code that failed to recognize that string. |
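For readers following along, here is a minimal sketch of how a 12-character CPUID vendor string can be assembled and compared. It is not hwloc's code: the `vendor` enum and the `vendor_string()` and `classify_vendor()` helpers are hypothetical names made up for illustration. The only facts assumed are that CPUID leaf 0 returns the vendor bytes in EBX, EDX, ECX order, and that Intel and AMD report "GenuineIntel" and "AuthenticAMD". The register values are passed in as plain integers so the sketch needs no inline assembly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical vendor tags for this sketch; not hwloc's own enum. */
enum vendor { VENDOR_UNKNOWN, VENDOR_INTEL, VENDOR_AMD };

/* Copy the four bytes of one register, least significant byte first. */
static void unpack_reg(uint32_t reg, char *dst)
{
    for (int i = 0; i < 4; i++)
        dst[i] = (char)((reg >> (8 * i)) & 0xffu);
}

/* Build the NUL-terminated vendor string from CPUID leaf 0 registers.
 * The bytes are laid out in EBX, EDX, ECX order. */
static void vendor_string(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13])
{
    assert(out != NULL);
    unpack_reg(ebx, out);
    unpack_reg(edx, out + 4);
    unpack_reg(ecx, out + 8);
    out[12] = '\0';
}

/* Map a vendor string to a tag; anything unrecognized stays UNKNOWN. */
static enum vendor classify_vendor(const char *s)
{
    if (strcmp(s, "GenuineIntel") == 0)
        return VENDOR_INTEL;
    if (strcmp(s, "AuthenticAMD") == 0)
        return VENDOR_AMD;
    return VENDOR_UNKNOWN;
}
```

Printing the assembled string next to the classification result is one way to tell a wrong string from a wrong comparison, which is the distinction being drawn in the comment above.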
The old code increased numcaches in the second case but it added additional caches at the beginning of the array. Uninitialized caches in the array caused a divide by zero (cache->nbthreads_sharing) when used later. Only occurs if the CPUID vendor isn't recognized (neither Intel, nor AMD, nor Zhaoxin) or in case of clang 4.0 bug on FreeBSD11.1 (#282). That's also why the code was crashing on Zhaoxin instead of just reporting wrong topology (#279). Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
The old code increased numcaches in the second case but it added additional caches at the beginning of the array. Uninitialized caches in the array caused a divide by zero (cache->nbthreads_sharing) when used later. Only occurs if the CPUID vendor isn't recognized (neither Intel, nor AMD, nor Zhaoxin) or in case of clang 4.0 bug on FreeBSD11.1 (#282). That's also why the code was crashing on Zhaoxin instead of just reporting wrong topology (#279). Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr> (cherry picked from commit a6f013c)
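The failure mode in the commit message (array entries that were counted but never filled in, then used as a divisor) can be illustrated with a small sketch. This is not the actual hwloc fix: `struct cache_info` and `assign_cache_ids()` are hypothetical, simplified stand-ins, and the sketch assumes that a zero `nbthreads_sharing` marks an entry that was never initialized.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified cache record; not hwloc's actual struct. */
struct cache_info {
    unsigned level;
    unsigned nbthreads_sharing; /* 0 is treated as "never filled in" */
    unsigned cacheid;
};

/* Compute cacheid = apicid / nbthreads_sharing for every usable entry.
 * Entries whose divisor is 0 are skipped instead of raising SIGFPE.
 * Returns the number of entries that received an ID. */
static size_t assign_cache_ids(struct cache_info *caches, size_t numcaches,
                               unsigned apicid)
{
    size_t assigned = 0;

    assert(caches != NULL || numcaches == 0);
    for (size_t i = 0; i < numcaches; i++) {
        if (caches[i].nbthreads_sharing == 0)
            continue; /* uninitialized slot: do not divide by it */
        caches[i].cacheid = apicid / caches[i].nbthreads_sharing;
        assigned++;
    }
    return assigned;
}
```

A guard like this only hides the symptom. The commit described above addresses the cause, which is that the count and the filled-in entries must stay in step.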
Actually, something else is going on here. If I remove the printf() patch, it crashes again. :-/ |
So in my experience, including much of my own code, problems like this almost never turn out to be a compiler bug. Instead, different platforms, compilers, optimization flags, or inputs expose subtle bugs in the code that go unnoticed elsewhere. For that reason I try to test my own code with a variety of tools. While there isn't a strong motivation to shore up support for 10-year-old processors, I think this is worth a little more examination as it might lead us to an undiscovered hwloc issue. I'll keep poking around for a bit in an effort to pinpoint this problem, as I seem to be the only one with the magic combination of hardware and software to trigger this. I'd appreciate some feedback along the way. Here's the latest backtrace with the patches (minus the printf) in place. If anything looks odd, please let me know and I'll try to trace it to the cause. Thanks, Jason (gdb) where |
You're not the only one: at least two people running OpenMPI jenkins platforms have seen similar issues without ever understanding what was going on. All of them came from Clang 4.0 on FreeBSD11.1, with the crash moving from one place to another depending on the debug printf that we added and on which hwloc release was used. I also had problems with that compiler on my hwloc jenkins 6 months ago, and I quickly switched to gcc, but I don't remember what the actual issue was. I am still convinced this is a compiler bug, possibly caused by the large amount of inline assembly used in that file (topology-x86.c). Anyway, the first thing to do is to pass --enable-debug on the configure command-line to get additional debug printf. Then, you should go back to the original report where cpuid_type was set to a crazy value (86212608) according to gdb. Adding some assert() or printf() around the first call to hwloc_x86_cpuid() inside hwloc_look_x86() to check the value of cpuid_type might be a good starting point. It should be unknown before and intel afterwards. Your backtrace isn't helpful because it crashes when calling a callback that is passed by the caller. However, that callback is a function that should be visible to gdb, so you may want to printf() the value of get_cpubind when it is set in hwloc_look_x86() and again before it is called in look_procs().
Oops, I'd already added the config flag but forgot to paste in the output along with the gdb trace: Let me know if anything looks funny here and I'll try to wrap my head around the code starting from there. FreeBSD login.wren bacon ~ 24: hwloc-info
cpu 0 (os 0) has cpuset 0x00000001 hwloc verbose debug enabled, may be disabled with HWLOC_DEBUG_VERBOSE=0 in the environment. |
Additional info: I'm able to reproduce the issue on newer hardware, including several Opteron systems and a Core i7. A simple example: int c, list[10]; If the compiler assigns c the address right after list, c will be corrupted by the loop. If c is placed before list, the bug will go unnoticed. Changing compilers or optimization flags could alter the memory map as the compiler attempts to pack the variables for minimum space and optimal alignment. I'll try inserting more debug output when I have some time. In the meantime, I'd appreciate any comments on the hwloc-info output above. Thanks. |
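The loop from the example above did not survive in this copy of the thread, so the following is only a sketch of the kind of off-by-one overrun being described, not the original snippet. To keep it well-defined C, the "stack frame" is modeled as one flat array in which `list` occupies slots 0 to 9 and `c` is the slot right after it. The helper `c_after_loop()` and the `demo()` function are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum { LIST_LEN = 10 };

/* Model a frame where c happens to sit right after list[10].
 * Everything lives in one array, so the overrun below stays in bounds
 * of the underlying object and is not undefined behavior. */
static int c_after_loop(int use_buggy_bound)
{
    int frame[LIST_LEN + 1];
    int *list = frame;          /* slots 0..9 */
    int *c = &frame[LIST_LEN];  /* slot 10, "the address right after list" */
    size_t end = use_buggy_bound ? LIST_LEN + 1 : LIST_LEN;

    *c = 42;
    for (size_t i = 0; i < end; i++)
        list[i] = 0;            /* the buggy bound also clears c */
    return *c;
}

/* Usage: the correct bound leaves c alone, the buggy one corrupts it. */
static void demo(void)
{
    assert(c_after_loop(0) == 42);
    assert(c_after_loop(1) == 0);
}
```

In a real program the neighbor of `list` is chosen by the compiler, which is why the same bug can be silent with one compiler or optimization level and fatal with another.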
There's nothing useful in the hwloc-info output, unfortunately :/ topology-freebsd.c is pretty much empty because the FreeBSD kernel doesn't report topology information. That's why hwloc falls back to topology-x86.c on this OS. There are only 7 commits in topology-x86.c between 1.11.2 and 1.11.3. I went on my FreeBSD11.1 jenkins VM and re-enabled clang 4.0. I reverted these one by one until the crash disappeared with 316a42f, but that doesn't make much sense:
Well, it's information anyway, that might help us narrow it down. |
Running |
@Jehops can you get a backtrace from gdb? |
@Jehops That's exactly the same issue as above. "cpuid_type=(unknown: 2784894432)" is just crazy. Try with another compiler until we understand what's going on with Clang. |
Details in this Github issue: open-mpi/hwloc#282 MFH: 2018Q1 git-svn-id: svn+ssh://svn.freebsd.org/ports/head@460636 35697150-7ecd-e111-bb59-0022644237b5
devel/hwloc: Fix segfaults on Intel CPUs Details in this Github issue: open-mpi/hwloc#282
A couple of suggested workarounds are in the ports PR: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225229 You can try them out using the WIP collection:
I think it appeared in 1.11.3, when I started having this problem, which seems related: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221946 |
Yes we confirmed above that it started in 1.11.3 with commit 316a42f but that didn't make any sense. See my comment from 17 days ago. |
I have a report from a developer at work that this bug was not appearing on an Intel CPU when the FreeBSD OS was running as a KVM VM. |
…D 11.1. This caused a crash when compiled with Clang due to a different stack layout compared to GCC. PR: 225229 See also: open-mpi/hwloc#282 git-svn-id: svn+ssh://svn.freebsd.org/ports/head@462617 35697150-7ecd-e111-bb59-0022644237b5
Discovered an issue with slurmd crashing on a few specific machines and traced it back to hwloc.
Running on an old Xeon processor, most hwloc programs produce a SEGV if built with -O2, and a floating exception when compiled with -g and no optimization flags.
root@login:/usr/ports/devel/hwloc # hwloc-info
Segmentation fault (core dumped)
root@login:/usr/ports/devel/hwloc # hwloc-info
Floating exception (core dumped)
The problem occurs with 1.11.7 and later. It does not occur with 1.11.1. I have not tried any versions in between (but I will if we need to pinpoint the commit where the problem appeared).
Compiling with -g and running under gdb shows the location of the problem. Does the output below indicate an obvious cause/solution? Do the arguments to look_proc() seem reasonable? I don't know what cpuid actually means, but this system has only 4 cores.
root@login:/usr/ports/devel/hwloc # gdb hwloc-info
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
(gdb) run
Starting program: /usr/local/bin/hwloc-info
Program received signal SIGFPE, Arithmetic exception.
0x000000080085c01e in look_proc (backend=0x8020380c0, infos=0x80203d000,
highest_cpuid=10, highest_ext_cpuid=2147483656, features=0x7fffffffe750,
cpuid_type=86212608) at topology-x86.c:482
482 cache->cacheid = infos->apicid / cache->nbthreads_sharing;
Current language: auto; currently minimal
Some relevant system info:
FreeBSD 11.1-RELEASE-p4 #0: Tue Nov 14 06:12:40 UTC 2017
root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 4.0.0 (tags/RELEASE_400/final 297347) (based on LLVM 4.0.0)
VT(vga): resolution 640x480
CPU: Intel(R) Xeon(R) CPU 5160 @ 3.00GHz (2992.56-MHz K8-class CPU)
Origin="GenuineIntel" Id=0x6f6 Family=0x6 Model=0xf Stepping=6
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Features2=0x4e3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,DCA>
AMD Features=0x20100800<SYSCALL,NX,LM>
AMD Features2=0x1
VT-x: (disabled in BIOS) HLT,PAUSE
TSC: P-state invariant, performance statistics
real memory = 8589934592 (8192 MB)
avail memory = 8261402624 (7878 MB)
Event timer "LAPIC" quality 100
ACPI APIC Table:
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 2 package(s) x 2 core(s)
I'm seeing the problem on 3 identical systems out of 4. All running FreeBSD 11.1 fully updated, same CPUs. Maybe some difference in BIOS settings is the trigger?
Probably related issue with another app:
fireice-uk/xmr-stak#817