opal_lifo test fails on s390x #10988

Open
opoplawski opened this issue Oct 27, 2022 · 24 comments

@opoplawski
Contributor

Looking at updating the Fedora openmpi package to 5.0.0rc9. I'm getting the following test failure on s390x:

FAIL: opal_lifo
===============
 Failure :  lifo push/pop multi-threaded with atomics
 Failure :  list pop all items
SUPPORT: OMPI Test failed: opal_lifo_t (2 of 7 failed)
Single thread test. Time: 0 s 3501 us 3 nsec/poppush
Atomics thread finished. Time: 0 s 17674 us 17 nsec/poppush
Atomics thread finished. Time: 0 s 60205 us 60 nsec/poppush
Atomics thread finished. Time: 0 s 63509 us 63 nsec/poppush
Atomics thread finished. Time: 0 s 74591 us 74 nsec/poppush
Atomics thread finished. Time: 0 s 83370 us 83 nsec/poppush
Atomics thread finished. Time: 0 s 84206 us 84 nsec/poppush
Atomics thread finished. Time: 0 s 85807 us 85 nsec/poppush
Atomics thread finished. Time: 0 s 103993 us 103 nsec/poppush
Atomics thread finished. Time: 0 s 106321 us 106 nsec/poppush
All threads finished. Thread count: 8 Time: 0 s 106549 us 13 nsec/poppush
FAIL opal_lifo (exit status: 1)

Full log here (for a few days at least): https://kojipkgs.fedoraproject.org//work/tasks/3745/93473745/build.log

@jsquyres jsquyres added this to the v5.0.0 milestone Oct 27, 2022
@devreal
Contributor

devreal commented Oct 27, 2022

@opoplawski I cannot find the error in the build log. Could you please provide some more info on your environment? What compiler are you using?

@opoplawski
Contributor Author

Ah, shoot, I linked the wrong build. This is Fedora Rawhide with gcc 12.2.1:

https://kojipkgs.fedoraproject.org//work/tasks/4918/93474918/build.log

@devreal
Contributor

devreal commented Oct 27, 2022

Thanks! Any chance you can access test/class/test-suite.log from that run?

@opoplawski
Contributor Author

That's the output I pasted in the first comment.

@devreal
Contributor

devreal commented Oct 27, 2022

I see. The tests run successfully on Summit with GCC 12.1.0 (the latest GCC available on that machine), but I'm not even sure that's the same architecture. I don't have access to any other IBM machine.

@opoplawski
Contributor Author

Can you build in a Fedora Rawhide mock environment on that machine? Looks like I have access to some kind of s390x test machine if there are any particular things you'd like me to try.

vendor_id       : IBM/S390
# processors    : 4
bogomips per cpu: 3241.00
max thread id   : 0
features        : esan3 zarch stfle msa ldisp eimm dfp edat etf3eh highgprs te vx vxd vxe gs vxe2 vxp sort dflt sie 
facilities      : 0 1 2 3 4 6 7 8 9 10 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 30 31 32 33 34 35 36 37 38 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 57 58 59 60 61 73 74 75 76 77 80 81 82 128 129 130 131 133 134 135 146 147 148 150 151 152 155 156 168
cache0          : level=1 type=Data scope=Private size=128K line_size=256 associativity=8
cache1          : level=1 type=Instruction scope=Private size=128K line_size=256 associativity=8
cache2          : level=2 type=Data scope=Private size=4096K line_size=256 associativity=8
cache3          : level=2 type=Instruction scope=Private size=4096K line_size=256 associativity=8
cache4          : level=3 type=Unified scope=Shared size=262144K line_size=256 associativity=32
cache5          : level=4 type=Unified scope=Shared size=983040K line_size=256 associativity=60
processor 0: version = FF,  identification = 200000,  machine = 8561
processor 1: version = FF,  identification = 200001,  machine = 8561
processor 2: version = FF,  identification = 200002,  machine = 8561
processor 3: version = FF,  identification = 200003,  machine = 8561

@devreal
Contributor

devreal commented Oct 28, 2022

Any chance you could try a different compiler like LLVM?

Also, we recently switched from C11 atomics being the default to GCC builtin atomics. Could you try running with C11 atomics instead by passing --enable-c11-atomics to configure?

Summit has Power9 CPUs, so different architecture.
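
For readers unfamiliar with the two atomics flavours mentioned above, here is a minimal sketch (not Open MPI code) of the same strong compare-and-swap written once with C11 `<stdatomic.h>` and once with the GCC `__atomic` builtins. The function names `cas64_c11`, `cas64_builtin`, and `cas64_demo` are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* C11 <stdatomic.h> flavour: the object itself must be declared _Atomic. */
static inline bool cas64_c11(_Atomic int64_t *addr, int64_t *expected, int64_t desired)
{
    return atomic_compare_exchange_strong(addr, expected, desired);
}

/* GCC/clang __atomic builtin flavour: operates on a plain int64_t. */
static inline bool cas64_builtin(int64_t *addr, int64_t *expected, int64_t desired)
{
    return __atomic_compare_exchange_n(addr, expected, desired,
                                       false, /* strong, not weak */
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

/* Hypothetical self-check: both variants succeed only when *addr == *expected,
 * and on failure they write the observed value back into *expected. */
static inline void cas64_demo(void)
{
    _Atomic int64_t a = 1;
    int64_t b = 1;
    int64_t expected = 1;

    assert(cas64_c11(&a, &expected, 2));      /* 1 == 1, so a becomes 2 */
    expected = 1;
    assert(cas64_builtin(&b, &expected, 2));  /* 1 == 1, so b becomes 2 */

    expected = 1;                             /* stale: b is now 2 */
    assert(!cas64_builtin(&b, &expected, 3));
    assert(expected == 2);                    /* observed value reported back */
}
```

Both variants have the same semantics; which one a build uses is a configure-time choice, as described in the comment above.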

@opoplawski
Contributor Author

So, some data points:

  • --enable-c11-atomics doesn't work:
checking dependency style of gcc... gcc3
checking if gcc supports GCC inline assembly... no - architecture not supported
checking for 32-bit GCC built-in atomics... yes
checking for 64-bit GCC built-in atomics... yes
checking if 64-bit GCC built-in atomics are lock-free... yes
checking for __atomic_compare_exchange_n... no
checking for __atomic_compare_exchange_n with -mcx16... no
checking for __atomic_compare_exchange_n with -latomic... yes
checking if __atomic_compare_exchange_n() gives correct results... yes
checking if __int128 atomic compare-and-swap is always lock-free... yes
configure: Using GCC built-in style atomics
configure: WARNING: C11 atomics were requested but are not supported
configure: error: Cannot continue
  • The test succeeds when compiled with clang. I don't think this is going to be an option for us though as it currently breaks the FORTRAN build.
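
The configure output above ends with a check on whether `__int128` compare-and-swap is always lock-free. As a rough illustration only (this is not Open MPI's actual configure test), a compiler can be asked a similar question through the GCC/clang builtin `__atomic_always_lock_free`. The helper names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the compiler guarantees lock-free atomics for 4-byte objects. */
static inline bool atomics32_always_lock_free(void)
{
    return __atomic_always_lock_free(sizeof(int32_t), 0);
}

/* True if the compiler guarantees lock-free atomics for 16-byte objects.
 * Targets without an __int128 type simply report false. */
static inline bool atomics128_always_lock_free(void)
{
#ifdef __SIZEOF_INT128__
    return __atomic_always_lock_free(sizeof(__int128), 0);
#else
    return false;
#endif
}

/* Hypothetical self-check: 32-bit atomics are assumed lock-free on the host;
 * the 128-bit answer is deliberately not asserted because it varies. */
static inline void lock_free_demo(void)
{
    assert(atomics32_always_lock_free());
    (void) atomics128_always_lock_free();
}
```

The 128-bit answer can differ between compilers and target flags, so it is worth checking on the actual build host rather than assuming it.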

@opoplawski
Contributor Author

I'll also note that the test succeeds in 4.1.4 - but maybe the test has changed in 5.0.

@gpaulsen gpaulsen self-assigned this Oct 31, 2022
@opoplawski
Contributor Author

I've managed to reproduce it on our test machine in case there are any other local tests you would like me to run.

@opoplawski
Contributor Author

Still present in 5.0.0rc10

@gpaulsen gpaulsen removed their assignment Aug 22, 2023
@opoplawski
Contributor Author

Still present in 5.0.0 final

@jsquyres
Member

@opoplawski Realistically, I don't know who is going to fix this. I don't know if anyone has ever run an MPI job on an IBM s390 mainframe. I don't think anyone in the known community has the resources to fix this.

@jsquyres jsquyres modified the milestones: v5.0.0, v5.0.1 Oct 30, 2023
@devreal
Contributor

devreal commented Jan 2, 2024

I tried to reproduce this on current main in a docker image with s390x emulation but all tests succeed there.

@devreal
Contributor

devreal commented Jan 3, 2024

@opoplawski Is this still an issue in Fedora rawhide? I tried again with an emulated s390x rawhide docker container and couldn't reproduce this error.

@opoplawski
Contributor Author

Well, 5.0.1 still failed - see the s390x build.log from https://koji.fedoraproject.org/koji/buildinfo?buildID=2336590

   Open MPI 5.0.1: test/class/test-suite.log
===============================================
# TOTAL: 10
# PASS:  9
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0
.. contents:: :depth: 2
FAIL: opal_lifo
===============
 Failure :  lifo push/pop multi-threaded with atomics
 Failure :  list pop all items
SUPPORT: OMPI Test failed: opal_lifo_t (2 of 7 failed)

I can't manage to build the rpm from a github tarball. autogen fails with:

Running: autoreconf -ivf --warnings=all,no-obsolete,no-override -I config -I config/oac
autoreconf: export WARNINGS=all
autoreconf: Entering directory '.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal -I config -I config/oac --force -I config
autoreconf: configure.ac: tracing
autoreconf: running: libtoolize --copy --force
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, 'config'.
libtoolize: copying file 'config/ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'config'.
libtoolize: copying file 'config/libtool.m4'
libtoolize: copying file 'config/ltoptions.m4'
libtoolize: copying file 'config/ltsugar.m4'
libtoolize: copying file 'config/ltversion.m4'
libtoolize: copying file 'config/lt~obsolete.m4'
autoreconf: configure.ac: not using Intltool
autoreconf: configure.ac: not using Gtkdoc
autoreconf: running: aclocal -I config -I config/oac --force -I config
autoreconf: running: /usr/bin/autoconf --include=config --include=config/oac --force
configure.ac:87: error: possibly undefined macro: AC_MSG_ERROR
      If this token and others are legitimate, please use m4_pattern_allow.
      See the Autoconf documentation.
configure.ac:98: error: possibly undefined macro: AC_MSG_WARN
configure.ac:123: error: possibly undefined macro: AC_MSG_CHECKING
configure.ac:124: error: possibly undefined macro: AC_MSG_RESULT
configure.ac:388: error: possibly undefined macro: AC_COMPILE_IFELSE
configure.ac:389: error: possibly undefined macro: AC_LANG_SOURCE
configure.ac:533: error: possibly undefined macro: AC_LANG_PUSH
configure.ac:534: error: possibly undefined macro: AC_LANG_PROGRAM
configure.ac:901: error: possibly undefined macro: AC_LINK_IFELSE
configure:18889: error: possibly undefined macro: m4_if
configure:18894: error: possibly undefined macro: AS_VAR_SET
configure:18905: error: possibly undefined macro: AC_LANG_POP
configure:54493: error: possibly undefined macro: AS_VAR_COPY
autoreconf: error: /usr/bin/autoconf failed with exit status: 1

@rhc54
Contributor

rhc54 commented Jan 5, 2024

I can't manage to build the rpm from a github tarball.

I don't believe OMPI supports that approach, if you are talking about the GitHub tarballs they attach to the repo tags. I've had a rare request/discussion about that and believe it traces to the use of submodules, which leaves some dangling connections in the GitHub tarball (since it is literally just created using tar cf . at the top of the repo).

That said, this specific error is one I encountered elsewhere and resolved by executing libtoolize --force before doing anything else. You might give that a try to see if it helps. Would be interesting to know.

@opoplawski
Contributor Author

Since we already build with external libs, I'm getting around the submodule issue with:

./autogen.pl --force --no-3rdparty=hwloc,libevent,pmix,prrte

Thanks for the suggestion, but libtoolize --force didn't help.

@devreal
Contributor

devreal commented Jan 5, 2024

I got myself a free instance on the IBM community cloud but still had no luck reproducing this. Since only two cores are available and this seems to be a multi-threading/atomics issue, I might not be able to trigger it there.

vendor_id       : IBM/S390
# processors    : 2
bogomips per cpu: 24038.00
max thread id   : 0
features        : esan3 zarch stfle msa ldisp eimm dfp edat etf3eh highgprs te vx vxd vxe gs vxe2 vxp sort dflt sie 
facilities      : 0 1 2 3 4 6 7 8 9 10 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 30 31 32 33 34 35 36 37 38 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 57 58 59 60 61 69 70 71 72 73 74 75 76 77 80 81 82 128 129 130 131 133 134 135 146 147 148 150 151 152 155 156 168
cache0          : level=1 type=Data scope=Private size=128K line_size=256 associativity=8
cache1          : level=1 type=Instruction scope=Private size=128K line_size=256 associativity=8
cache2          : level=2 type=Data scope=Private size=4096K line_size=256 associativity=8
cache3          : level=2 type=Instruction scope=Private size=4096K line_size=256 associativity=8
cache4          : level=3 type=Unified scope=Shared size=262144K line_size=256 associativity=32
cache5          : level=4 type=Unified scope=Shared size=983040K line_size=256 associativity=60
processor 0: version = FF,  identification = 03DD98,  machine = 8561
processor 1: version = FF,  identification = 03DD98,  machine = 8561
# uname -a
Linux 81c891683033 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:57:26 UTC 2022 s390x GNU/Linux
# dnf info gcc
Last metadata expiration check: 1:20:43 ago on Fri Jan  5 14:12:51 2024.
Installed Packages
Name         : gcc
Version      : 13.2.1
Release      : 6.fc40
Architecture : s390x
Size         : 71 M
Source       : gcc-13.2.1-6.fc40.src.rpm
Repository   : @System
From repo    : rawhide
Summary      : Various compilers (C, C++, Objective-C, ...)
URL          : http://gcc.gnu.org
License      : GPLv3+ and GPLv3+ with exceptions and GPLv2+ with exceptions and LGPLv2+ and BSD
Description  : The gcc package contains the GNU Compiler Collection version 13.
             : You'll need this package in order to compile C code.

Interestingly, when building with clang 17 on that system, make check fails:

--> Testing atomic_cmpset
../../../config/test-driver: line 112: 756703 Illegal instruction     (core dumped) "$@" >> "$log_file" 2>&1
FAIL: atomic_cmpset
0x0000000001002c2c in opal_atomic_compare_exchange_strong_128 (addr=0x10050c8 <vol128>, oldval=0x10050d8 <old128>, newval=50) at ../../../opal/include/opal/sys/gcc_builtin/atomic.h:147
147         opal_int128_t prev = __sync_val_compare_and_swap(addr, *oldval, newval);
(gdb) display/ni $pc
1: x/i $pc
=> 0x1002c2c <opal_atomic_compare_exchange_strong_128+108>:     lgr     %r0,%r2
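
Line 147 in the backtrace uses `__sync_val_compare_and_swap`, which returns the value previously stored at `addr`. A compare-exchange that takes an `oldval` pointer can be layered on top of that roughly as shown below. This is a sketch on a 64-bit integer (so that trying it does not require 128-bit support), with hypothetical function names; it is not a copy of Open MPI's atomic.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static inline bool compare_exchange_strong_64(volatile int64_t *addr,
                                              int64_t *oldval, int64_t newval)
{
    /* Atomically: if *addr == *oldval, store newval.
     * Either way, the builtin returns what was in *addr beforehand. */
    int64_t prev = __sync_val_compare_and_swap(addr, *oldval, newval);
    bool swapped = (prev == *oldval);

    *oldval = prev;  /* report the observed value back to the caller */
    return swapped;
}

/* Hypothetical self-check of the success and failure paths. */
static inline void compare_exchange_demo(void)
{
    volatile int64_t vol = 42;
    int64_t old = 42;

    assert(compare_exchange_strong_64(&vol, &old, 50));   /* matches: swapped */
    assert(vol == 50);

    old = 42;                                             /* stale expectation */
    assert(!compare_exchange_strong_64(&vol, &old, 60));  /* no swap */
    assert(old == 50 && vol == 50);
}
```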

To summarize what I observed:

  1. With GCC, we do not detect 128-bit support, so we fall back to what seems to be a spinlock implementation in opal_lifo_pop_atomic.
  2. With clang, we do detect 128-bit CAS support, but then we get illegal instructions.
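
To make point 1 concrete: when no double-width CAS is available, a LIFO can fall back to guarding its head pointer with a spinlock. The sketch below shows that general technique using the GCC/clang `__atomic_test_and_set` and `__atomic_clear` builtins. All type and function names are hypothetical, and this is not the code in opal_lifo_pop_atomic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct lifo_item {
    struct lifo_item *next;
    int value;
} lifo_item_t;

typedef struct {
    lifo_item_t *head;
    bool lock;              /* false = free, true = held */
} spin_lifo_t;

static inline void spin_lifo_lock(spin_lifo_t *l)
{
    /* test-and-set returns the previous value: spin while it was already held */
    while (__atomic_test_and_set(&l->lock, __ATOMIC_ACQUIRE)) {
        /* busy-wait */
    }
}

static inline void spin_lifo_unlock(spin_lifo_t *l)
{
    __atomic_clear(&l->lock, __ATOMIC_RELEASE);
}

static inline void spin_lifo_push(spin_lifo_t *l, lifo_item_t *item)
{
    spin_lifo_lock(l);
    item->next = l->head;
    l->head = item;
    spin_lifo_unlock(l);
}

/* Returns NULL when the LIFO is empty. */
static inline lifo_item_t *spin_lifo_pop(spin_lifo_t *l)
{
    spin_lifo_lock(l);
    lifo_item_t *item = l->head;
    if (item != NULL) {
        l->head = item->next;
        item->next = NULL;
    }
    spin_lifo_unlock(l);
    return item;
}

/* Hypothetical single-threaded self-check of the last-in, first-out order. */
static inline void spin_lifo_demo(void)
{
    spin_lifo_t lifo = { NULL, false };
    lifo_item_t a = { NULL, 1 }, b = { NULL, 2 };

    spin_lifo_push(&lifo, &a);
    spin_lifo_push(&lifo, &b);
    assert(spin_lifo_pop(&lifo) == &b);
    assert(spin_lifo_pop(&lifo) == &a);
    assert(spin_lifo_pop(&lifo) == NULL);
}
```

A lock-free variant typically updates a pointer and a counter together in a single double-width CAS, which is why 128-bit CAS detection matters on a 64-bit target.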

I am running out of time to spend on this, unfortunately. And finding proper docs on this architecture is tedious. Either someone who cares about s390x (anyone from IBM?) will pick it up or OMPI stays broken on that arch. Sorry.

@rhc54
Contributor

rhc54 commented Jan 6, 2024

Since we already build with external libs, I'm getting around the submodule issue with:
./autogen.pl --force --no-3rdparty=hwloc,libevent,pmix,prrte

Sadly, that won't take care of it - the problem is that there is another submodule attached to the config/oac directory that contains some m4 code. Afraid you cannot turn that one off - it is required.

You should check to see if the GitHub tarball populates that directory. Pretty sure it doesn't, and that is why you are hitting all those errors.

@opoplawski
Contributor Author

I re-discovered the "nightly" tarballs - https://www.open-mpi.org/nightly/main/ and that works for me.

FWIW - build on s390x with failure: https://kojipkgs.fedoraproject.org//work/tasks/1335/111371335/build.log

@janjust janjust modified the milestones: v5.0.1, v5.0.2 Jan 8, 2024
@jsquyres
Member

jsquyres commented Jan 9, 2024

@joseemoreira Can you help here?

@joseemoreira

Hello. Sorry for my delay in responding. I was not aware of this issue until a colleague from IBM just pointed it out to me. I have to find the right person in our System z development team to address this. I will get back to you all soon.

PS: Do I need to do something so that issues like this show up in my Dashboard?

@jsquyres
Member

I think an @-mention will just send you an email (depending on what your github notification settings for this org are). I just assigned the issue to you, so perhaps it will show up in your dashboard now...?

@jsquyres jsquyres modified the milestones: v5.0.2, v5.0.3 Feb 13, 2024