OpenMPI 5.0.1-3 Installation Failure #12282

Open
vorlkets opened this issue Jan 27, 2024 · 4 comments

@vorlkets

vorlkets commented Jan 27, 2024

I’m installing OpenMPI 5.0.1-3 (https://aur.archlinux.org/packages/openmpi-ucx) on Arch Linux 6.7.1 (GCC 13.2.1-3, CUDA 11.1 11.1.1-3) on a Dell PowerEdge T620. Everything goes well until make check reaches the datatype tests, where most of them segfault (log below). The same package builds without issue on Arch Linux 6.7.0 (GCC 13.2.1-3, CUDA 12.3.1-2) on a Dell Precision M6800.

Making check in datatype
make[2]: Entering directory '/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/datatype'
make  opal_datatype_test unpack_hetero checksum position position_noncontig ddt_test ddt_raw ddt_raw2 unpack_ooo ddt_pack external32 large_data partial to_self reduce_local
make[3]: Entering directory '/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/datatype'
  CCLD     opal_datatype_test
  CCLD     unpack_hetero
  CCLD     checksum
  CCLD     position
  CCLD     position_noncontig
  CCLD     ddt_test
  CCLD     ddt_raw
  CCLD     ddt_raw2
  CCLD     unpack_ooo
  CCLD     ddt_pack
  CCLD     external32
  CCLD     large_data
  CCLD     partial
  CCLD     to_self
  CCLD     reduce_local
make[3]: Leaving directory '/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/datatype'
make  check-TESTS
make[3]: Entering directory '/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/datatype'
make[4]: Entering directory '/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/datatype'
../../config/test-driver: line 112: 1380808 Segmentation fault      (core dumped) "$@" >> "$log_file" 2>&1
FAIL: opal_datatype_test
PASS: unpack_hetero
../../config/test-driver: line 112: 1380857 Segmentation fault      (core dumped) "$@" >> "$log_file" 2>&1
FAIL: checksum
../../config/test-driver: line 112: 1380884 Segmentation fault      (core dumped) "$@" >> "$log_file" 2>&1
FAIL: position
../../config/test-driver: line 112: 1380916 Segmentation fault      (core dumped) "$@" >> "$log_file" 2>&1
FAIL: position_noncontig
../../config/test-driver: line 112: 1380944 Segmentation fault      (core dumped) "$@" >> "$log_file" 2>&1
FAIL: ddt_test
../../config/test-driver: line 112: 1380975 Segmentation fault      (core dumped) "$@" >> "$log_file" 2>&1
FAIL: ddt_raw
PASS: ddt_raw2
PASS: unpack_ooo
../../config/test-driver: line 112: 1381044 Segmentation fault      (core dumped) "$@" >> "$log_file" 2>&1
FAIL: ddt_pack
../../config/test-driver: line 112: 1381070 Segmentation fault      (core dumped) "$@" >> "$log_file" 2>&1
FAIL: external32
PASS: large_data
../../config/test-driver: line 112: 1381120 Segmentation fault      (core dumped) "$@" >> "$log_file" 2>&1
FAIL: partial
============================================================================
Testsuite summary for Open MPI 5.0.1
============================================================================
# TOTAL: 13
# PASS:  4
# SKIP:  0
# XFAIL: 0
# FAIL:  9
# XPASS: 0
# ERROR: 0
============================================================================
See test/datatype/test-suite.log
Please report to https://www.open-mpi.org/community/help/
============================================================================
make[4]: *** [Makefile:2012: test-suite.log] Error 1
make[4]: Leaving directory '/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/datatype'
make[3]: *** [Makefile:2120: check-TESTS] Error 2
make[3]: Leaving directory '/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/datatype'
make[2]: *** [Makefile:2277: check-am] Error 2
make[2]: Leaving directory '/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test/datatype'
make[1]: *** [Makefile:1416: check-recursive] Error 1
make[1]: Leaving directory '/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1/test'
make: *** [Makefile:1533: check-recursive] Error 1
make: Leaving directory '/home/vorlket/build/openmpi-ucx/src/openmpi-5.0.1'
==> ERROR: A failure occurred in check().
    Aborting...

@ggouaillardet
Contributor

Thanks for the report.

I am able to reproduce the issue on Arch Linux and will investigate it.

@ggouaillardet
Contributor

This is a bizarre error ...
make check uses some components from the installed Open MPI 4.1 and crashes on finalize.

A simple workaround is to run

pacman -R openmpi

before you run makepkg.
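To tell in advance whether this applies to a given machine, one option is to check for an existing Open MPI installation before building. The following is a minimal sketch, not part of the PKGBUILD or of Open MPI: the function name `warn_if_ompi_installed` is hypothetical, and it assumes that finding `ompi_info` on `PATH` is a good enough sign that an installed Open MPI could be picked up.

```shell
#!/bin/sh
# Hypothetical pre-build check (not part of the PKGBUILD or of Open MPI):
# report whether an Open MPI installation is already visible on PATH.
# Returns 1 when ompi_info is found, 0 otherwise.
warn_if_ompi_installed() {
    if command -v ompi_info >/dev/null 2>&1; then
        echo "installed Open MPI found: $(command -v ompi_info)"
        return 1
    fi
    echo "no installed Open MPI found on PATH"
    return 0
}

# Example call: only prints advice, never removes anything itself.
warn_if_ompi_installed ||
    echo "consider 'pacman -R openmpi' before running makepkg"
```

On Arch Linux, `pacman -Q openmpi` answers the same question at the package level.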

@vorlkets
Author

The workaround works. Thanks for the help.

@ggouaillardet
Contributor

A better fix (so you don't have to uninstall the openmpi package) is to patch your PKGBUILD as in the diff below.

I am reopening this issue since I believe Open MPI should do that automatically under the hood.

diff --git a/PKGBUILD b/PKGBUILD
index d7d8f1b..07d675f 100644
--- a/PKGBUILD
+++ b/PKGBUILD
@@ -86,7 +86,7 @@ build() {
 }
 
 check() {
-  make check -C $pkgname-$pkgver
+  env OMPI_MCA_mca_component_path= make check -C $pkgname-$pkgver
 }
 
 package() {
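The `env NAME= command` form used in the patch hands the command an empty value for that variable without changing the calling shell. A minimal sketch of that behavior follows; the starting value `/usr/lib/openmpi` is a placeholder chosen for illustration, not a path taken from this report.

```shell
#!/bin/sh
# Placeholder starting value; /usr/lib/openmpi is an assumption for illustration.
OMPI_MCA_mca_component_path=/usr/lib/openmpi
export OMPI_MCA_mca_component_path

# The child process started through env sees an empty value...
inner=$(env OMPI_MCA_mca_component_path= sh -c 'printf "[%s]" "$OMPI_MCA_mca_component_path"')
echo "inside env: $inner"

# ...while the calling shell keeps its own value.
echo "calling shell: [$OMPI_MCA_mca_component_path]"
```

In the patched check(), this means only the make check process and its children see the empty component path; the rest of the build environment is left as it was.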

@ggouaillardet ggouaillardet reopened this Jan 28, 2024
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Jan 28, 2024
By default, the tests will use the components in the build directory
but also in the installation directory if any. That can cause some
bizarre issues when they are not compatible.

Thanks Kook Jin Noh for the initial bug report.

Refs. open-mpi#12282

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Feb 2, 2024
When a framework has the MCA_BASE_FRAMEWORK_FLAG_NOCOMPONENT flag,
it will not try to register any components. That can prevent issues
at finalization when a component is registered but not properly
unregistered.

Thanks Kook Jin Noh for the bug report.

Refs. open-mpi#12282

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Feb 5, 2024
pmix is now a first class citizen and hence do not need
to be part of a MCA framework anymore

Refs. open-mpi#12282

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Feb 6, 2024
pmix is now a first class citizen and hence do not need
to be part of a MCA framework anymore

Refs. open-mpi#12282

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>