3.0.0rc2: FreeBSD: divide-by-zero in hwloc #3992
If I configure using Alternatively, README could suggest installing/using an external hwloc. In case it matters, this is a KVM-based virtual machine.
Interesting; I can't replicate on EC2 with 11.0; wonder if something changed between 11.0 and 11.1 or if it's something in the QEMU processor emulation.
|
Is there any reason not to update to 1.11.7? I ask because there are some improvements and bug fixes in those latter releases that are good to have. I'm willing to roll the PR if you want it. |
@rhc54, I don't think so; just the usual wanting to get something out the door. But if that's the easiest way to get this bug removed, then let's do that. |
I've tried #4022, but the problem persists. Perhaps I was wrong when I concluded that no patches were applied to the hwloc build in the FreeBSD package repo. The fastest way to get 3.0 out the door seems to me to be to simply document that If one does want to continue chasing this bug, here is something new... it looks like the hwloc cache is all zeros at the time of the crash:
@bgoglin Any thoughts? |
I am not aware of any patch being applied to FreeBSD hwloc. @PHHargrove when cache->type == 0, we ignore the cache, you shouldn't ever use that cache in the code. I have a FreeBSD 11.1 running on KVM, no problem. But it may depend on the underlying hardware since we are in the cpuid/x86 code here. Can you run hwloc-gather-cpuid from hwloc git master and send the resulting cpuid directory? I can debug remotely from there. |
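To illustrate the "ignore the cache when type == 0" rule mentioned above, here is a minimal sketch in C. The struct and function names are hypothetical and simplified; this is not hwloc's actual data structure or code, and it makes no claim about where the reported crash really happens. It only shows how an all-zero cache entry could lead to a division by zero if it were used, and how a guard avoids that.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified cache record (NOT hwloc's real struct). */
struct cache_info {
    unsigned type;      /* 0 means "no cache described by this entry" */
    unsigned level;     /* 1, 2, 3, ... */
    unsigned linesize;  /* bytes per cache line */
    unsigned ways;      /* associativity */
    unsigned long size; /* total size in bytes */
};

/* Return the number of sets, or 0 when the entry must be ignored.
 * Without the guards, an all-zero entry would divide by zero. */
unsigned long cache_sets(const struct cache_info *c)
{
    assert(c != NULL);
    if (c->type == 0)
        return 0; /* unused entry: skip it */
    if (c->linesize == 0 || c->ways == 0)
        return 0; /* incomplete entry: skip it as well */
    return c->size / ((unsigned long)c->linesize * c->ways);
}

/* Count the entries that describe a usable cache. */
size_t count_valid_caches(const struct cache_info *caches, size_t n)
{
    size_t valid = 0;
    for (size_t i = 0; i < n; i++) {
        if (cache_sets(&caches[i]) != 0)
            valid++;
    }
    return valid;
}
```

With this guard, a zeroed entry is counted as "no cache" rather than being fed into the arithmetic, so a 32 KiB, 8-way cache with 64-byte lines yields 64 sets while an all-zero entry yields 0.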
@bgoglin
Ergl, it likely fails for similar reasons. Can you set HWLOC_COMPONENTS=-x86 in the environment before running? |
@bgoglin tarball of cpuid directory will be sent by email shortly. |
Thanks. One thing that seems strange is that the VM exposes an AMD with cache information in both AMD and Intel CPUID leaves. Our code is supposed to ignore Intel-specific caches on non-Intel but I am not sure we have ever tested that. Any chance you could give me access to the VM? Or is there any way to share the image of the VM on a file sharing website? If we have to stay in your own gdb, I would like to see cpuid_type, highest_cpuid, highest_ext_cpuid, cachenum, infos->numcaches and then infos->cache[i] with i from 0 to numcaches-1.
@bgoglin VM access is possible. Switching to private email. |
I am on the VM but I can't reproduce a single crash with 1.11.3, 1.11.7 or git master. Even hwloc-gather-cpuid works fine. Do you have anything in your environment and/or on your configure command-line? Note that I only tried lstopo from vanilla hwloc, nothing from OMPI's hwloc.
@bgoglin Output of "env" (with one redaction) is below. Another possible difference is that I am a member of the system group "wheel", but I've ruled that out by removing myself. My configure command line for master was empty on the first try, and I added --enable-debug after the first SEGV.
@jsquyres and @bwbarrett can have a nice chuckle at the fact that |
@bgoglin |
Ah! I had to install gcc on my FreeBSD11 VM too because cc/clang was causing strange segfaults that moved around when adding printf. |
@PHHargrove so does that mean that your system hwloc in /usr/local was built with gcc and not clang? It sounds like the problem is in the clang compiler for that VM. |
@rhc54 No, I've confirmed that the system hwloc-1.11.7 is built w/ CC=cc CXX=c++. Keep in mind that (like Apple), the FreeBSD folks have adopted Clang as their official compiler, and gcc is just an add-on. Their openmpi2 (2.0.2, FWIW) package is also built w/ Clang. So, that is the compiler I would prefer to test RCs against.
My hwloc 1.11.7 built with cc/c++ fails miserably. gdb shows things that make no sense. And adding debug printf makes the stack get crazy and segfaults move around. All these failures occur in topology-x86.c which is the only lib file that uses inline asm for cpuid. So maybe clang doesn't like our asm. And hwloc-gather-cpuid also uses that asm. Do you guys test things with clang/Linux? |
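As a possible starting point for a reduced test case, here is a minimal sketch of a cpuid wrapper using GCC/Clang-style inline asm. It is not the code from topology-x86.c; the function names `cpuid_query`, `cpu_vendor` and `cpuid_highest` are made up for illustration, and the sketch assumes an x86-64 target (it falls back to "unsupported" elsewhere). Treating leaf 0 and leaf 0x80000000 as the source of the highest_cpuid and highest_ext_cpuid values mentioned earlier is likewise an assumption about what those variables hold.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__)
/* Execute CPUID for the given leaf/subleaf; regs = {EAX, EBX, ECX, EDX}. */
void cpuid_query(unsigned leaf, unsigned subleaf, unsigned regs[4])
{
    __asm__ __volatile__("cpuid"
                         : "=a"(regs[0]), "=b"(regs[1]),
                           "=c"(regs[2]), "=d"(regs[3])
                         : "a"(leaf), "c"(subleaf));
}
#endif

/* Write the 12-character vendor string (NUL-terminated) into out.
 * Returns 1 on success, 0 if this build target is not x86-64. */
int cpu_vendor(char out[13])
{
    assert(out != NULL);
#if defined(__x86_64__)
    unsigned r[4];
    cpuid_query(0, 0, r);
    memcpy(out, &r[1], 4);     /* EBX */
    memcpy(out + 4, &r[3], 4); /* EDX */
    memcpy(out + 8, &r[2], 4); /* ECX */
    out[12] = '\0';
    return 1;
#else
    out[0] = '\0';
    return 0;
#endif
}

/* Report the highest basic and extended CPUID leaves.
 * Returns 1 on success, 0 if this build target is not x86-64. */
int cpuid_highest(unsigned *basic, unsigned *extended)
{
    assert(basic != NULL && extended != NULL);
#if defined(__x86_64__)
    unsigned r[4];
    cpuid_query(0x0, 0, r);
    *basic = r[0];
    cpuid_query(0x80000000u, 0, r);
    *extended = r[0];
    return 1;
#else
    *basic = 0;
    *extended = 0;
    return 0;
#endif
}
```

Calling these from a small `main` that prints the results, building it once with cc (clang) and once with gcc on the affected system, and comparing the output would be one way to check whether the compiler handles this asm differently.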
yes, the various CI and some MTT tests are done against it |
I'm just poking to understand what action needs to be done in response to all this info. I'm not sure I see it yet. |
@rhc54 |
Actually, I had a slightly different perspective in mind. I have installed clang 3.4.2 on my CentOS7 box and confirmed that we both build against it and can run without issue (not a surprise as our CI regularly tests against clang). My question, therefore, was whether the FreeBSD clang compiler on your VM might somehow be borked, and thus we should simply ignore that failure. |
@rhc54 One comment from @bgoglin seems to indicate clang is giving him problems on FreeBSD as well. So this is not isolated to "my" VM. This is the system compiler we are talking about here. I am not sure what the best path forward is for 3.0.0 on FreeBSD.
@PHHargrove if you're planning to report a bug against FreeBSD 11.1, I'd like to see it (in case I can generate a small testcase with the hwloc x86 cpuid code). |
@bgoglin I don't have current plans to file a bug report because (a) I don't have time and (b) without a reduced test case I don't think the bug would get any attention. If you can produce a reduced testcase, I can probably make time to collaborate on submitting the bug report. @rhc54 I think this is a documentation issue for both 3.0.0 and 2.1.2. My suggestion: The system compiler (clang-4.0) on FreeBSD-11.1/amd64 is believed to compile hwloc incorrectly.
@hppritcha @bwbarrett I concur with the documentation suggestion, but it's up to you folks |
I'll do a README update |
The clang 4.0 compiler that ships with FreeBSD 11.1 doesn't work well with hwloc (older than 1.11.7), so if the following conditions hold
- one is building a version of Open MPI to include the internal hwloc package
- and the version of Open MPI's internal hwloc package is older than 1.11.7
- using the clang 4.0 compiler that ships with this release of FreeBSD
then one may observe segfaults, floating point exceptions, etc. One workaround is to use the GNU compiler.
Related to open-mpi#3992.
[skip ci]
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
The clang 4.0 compiler that ships with FreeBSD 11.1 doesn't work well with hwloc packaged with OpenMPI. Workaround is to use a GNU compiler. Related to open-mpi#3992. [skip ci] Signed-off-by: Howard Pritchard <howardp@lanl.gov>
In testing 2.1.2rc3 I have again found that /usr/bin/cc (clang) on FreeBSD-11.1/amd64 leads to unexpected SEGVs. See https://www.mail-archive.com/devel@lists.open-mpi.org/msg20351.html
I believe it would be appropriate to remove the Blocker tag (and perhaps Bug, too) from this issue, since it seems to be entirely due to external problems, which Howard documented in 800a971.
Note that unless I missed it, something still needs to go in the 2.1.2 README.
The clang 4.0 compiler that ships with FreeBSD 11.1 doesn't work well with OpenMPI. Workaround is to use a GNU compiler. Related to open-mpi#3992. [skip ci] Signed-off-by: Howard Pritchard <howardp@lanl.gov>
this problem seems to be fairly generic across multiple releases, so adding more tags. |
The clang 4.0 compiler that ships with FreeBSD 11.1 doesn't work well with OpenMPI. Workaround is to use a GNU compiler. Related to open-mpi#3992. [skip ci] Signed-off-by: Howard Pritchard <howardp@lanl.gov> (cherry picked from commit 083e6e6)
The 3.0 NEWS blurb is updated. Moving the milestone to 2.1 to track that NEWS blurb. |
committed to all four branches. closing now. |
@PHHargrove reported a problem bubbling up from hwloc when testing 3.0.0rc2 on
FreeBSD/amd64:
reported on devel mail list:
https://www.mail-archive.com/devel@lists.open-mpi.org//msg20326.html