wrong attributes on MPI_win_create() on big-endian systems? #1446

Closed
amckinstry opened this issue Mar 10, 2016 · 14 comments
Comments

@amckinstry

I'm debugging a test issue with mpi4py (2.0.0) linked against libopenmpi 1.10.2 in Debian.
The test suite fails in the mpi_win_* units on some architectures, in particular s390x, sparc64, and ppc64, but it works on ppc64el.

This suggests a big-endian problem. Looking at mpi4py, it appears that certain attributes are being read back as 0:
the displacement unit (disp_unit) is 0, not 1, in the testAttributes() test (after MPI_Win_create with unit=1, AFAIK), and
testCreateFlavor() fails because MPI.WIN_CREATE_FLAVOR is 0, which is none of MPI.WIN_FLAVOR_CREATE, MPI.WIN_FLAVOR_ALLOCATE, MPI.WIN_FLAVOR_DYNAMIC, MPI.WIN_FLAVOR_SHARED
(or MPI.KEYVAL_INVALID).

I'm not sure at this stage whether the bug is in Open MPI or mpi4py, as I don't see any mpi_win_* tests being run at build time (the Debian build autodetects MPI and runs make check/test). What tests do you suggest to check the Open MPI 1.10.2 builds on these architectures?

@ggouaillardet
Contributor

Can you please confirm the failure occurs in the mpi4py test suite?

@amckinstry
Author

Correct. This is in the mpi4py test suite.

@ggouaillardet
Contributor

I will double-check that tomorrow.
I saw some places where an int (32 bits) is passed where an MPI_Aint (64 bits) is expected;
that is the kind of thing little endian is fine with, but big endian is not ...
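
As a generic illustration (not the actual Open MPI code), here is why that kind of size mismatch only bites on big endian:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustration only: a 32-bit value is written into storage that is
 * later read with a 64-bit width, the same kind of mismatch as an int
 * being passed where an MPI_Aint is expected. */
int main(void)
{
    int64_t storage = 0;   /* stands in for an MPI_Aint-sized slot */
    int32_t value = 1;     /* e.g. disp_unit = 1 */

    /* Only the first 4 bytes of 'storage' get filled in. */
    memcpy(&storage, &value, sizeof(value));

    /* Little endian: the low-order bytes come first, so this prints 1.
     * Big endian: the 1 lands in the high-order bytes, so the 64-bit
     * read gives 4294967296, and a 32-bit read of the low half gives 0. */
    printf("storage = %lld\n", (long long)storage);
    return 0;
}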

@ggouaillardet
Contributor

@amckinstry I was able to reproduce the issue with a C program, but I could not (yet) build mpi4py on a big-endian box.
Could you please apply https://github.com/ggouaillardet/ompi-release/commit/339a2893b4bee8fd67426fd42025634146112847.patch and test again?
@dalcinl All ompi versions are affected on big-endian systems, so you might want to exclude tests involving the MPI_WIN_DISP_UNIT, MPI_WIN_CREATE_FLAVOR and MPI_WIN_MODEL window attributes with openmpi up to 1.10.2.
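
For reference, a minimal reproducer along these lines (a sketch, not the exact program) just reads the window attributes back after MPI_Win_create:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int buf[4], *disp_unit, *flavor, flag;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Win_create(buf, sizeof(buf), 1 /* disp_unit */,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* The attribute value comes back as a pointer to int. */
    MPI_Win_get_attr(win, MPI_WIN_DISP_UNIT, &disp_unit, &flag);
    if (flag) printf("disp_unit = %d (expected 1)\n", *disp_unit);

    MPI_Win_get_attr(win, MPI_WIN_CREATE_FLAVOR, &flavor, &flag);
    if (flag) printf("flavor = %d (expected MPI_WIN_FLAVOR_CREATE = %d)\n",
                     *flavor, MPI_WIN_FLAVOR_CREATE);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

On the affected big-endian systems both attributes come back as 0.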

@amckinstry
Author

Thanks. Submitting an experimental version through our build infra now. Should have confirmation by Monday

@jsquyres
Member

@amckinstry Sweet. We're actually debating this fix over on open-mpi/ompi-release#1018 -- I see what @ggouaillardet did as a good code cleanup, but I confess to not understanding why it would have fixed your bug. If you can confirm / deny that it actually fixed the bug, that would be great. Thank you!

@dalcinl
Contributor

dalcinl commented Mar 13, 2016

@ggouaillardet Any chance you could send me the full list of failing tests?

@dalcinl
Contributor

dalcinl commented Mar 13, 2016

@ggouaillardet @jsquyres The patch is OK as long as sizeof(MPI_Fint)==sizeof(int); however, that assumption is already made while setting the predefined attributes of MPI_COMM_WORLD (helper function set_f() in ompi/attribute/attribute_predefined.c).

A better implementation of the attribute engine should also handle the case of predefined attributes whose value is a C int. I would re-implement all of that using a union type with int, MPI_Fint, and MPI_Aint slots.
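
Something along these lines (just a sketch of the idea; the type name is made up):

#include <mpi.h>

/* Hypothetical slot type for predefined attribute values, so the engine
 * can store and read a value with the width each binding expects instead
 * of punning between pointers to differently sized integers. */
typedef union {
    int      c_int;
    MPI_Fint f_int;
    MPI_Aint aint;
} ompi_attr_value_t;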

A quick workaround that would not require changing the codebase too much would be to assume that either sizeof(MPI_Fint)==sizeof(int), and then use ompi_attr_set_fortran_mpi1; or sizeof(MPI_Fint)==sizeof(MPI_Aint), and then use ompi_attr_set_fortran_mpi2; or otherwise #error out of the compilation. This logic should be implemented for setting all predefined attributes of MPI_COMM_WORLD (reimplementing the set_f() routine using conditional compilation with the preprocessor), and also for setting MPI_WIN_DISP_UNIT, MPI_WIN_CREATE_FLAVOR and MPI_WIN_MODEL (it would be a good idea to use a helper routine for setting these attributes), as sketched below.
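
To illustrate the selection only (written as a runtime check so the snippet is self-contained; in Open MPI this would be preprocessor #if/#elif/#error on configure-provided sizes, and the real ompi_attr_set_fortran_mpi1/ompi_attr_set_fortran_mpi2 signatures differ from these stand-ins):

#include <mpi.h>
#include <stdio.h>

/* Stand-ins for the two existing setters mentioned above. */
static void store_as_fint(MPI_Fint v) { printf("stored %d as MPI_Fint\n", (int)v); }
static void store_as_aint(MPI_Aint v) { printf("stored %ld as MPI_Aint\n", (long)v); }

/* Hypothetical helper: store a C int predefined attribute value
 * (disp_unit, create flavor, memory model) with whichever width the
 * attribute engine uses, so the value survives on both little- and
 * big-endian systems. */
static void set_predefined_int_attr(int value)
{
    if (sizeof(MPI_Fint) == sizeof(int)) {
        store_as_fint((MPI_Fint) value);     /* ompi_attr_set_fortran_mpi1 path */
    } else if (sizeof(MPI_Fint) == sizeof(MPI_Aint)) {
        store_as_aint((MPI_Aint) value);     /* ompi_attr_set_fortran_mpi2 path */
    } else {
        fprintf(stderr, "unsupported MPI_Fint size\n");  /* the #error case */
    }
}

int main(void) { set_predefined_int_attr(1); return 0; }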

@ggouaillardet I'm a bit too busy to write a full patch myself, but assuming I was clear enough above, you could write it in 5 minutes and I'll happily review it.

@dalcinl
Contributor

dalcinl commented Mar 13, 2016

@ggouaillardet Apply the following patch to the mpi4py sources from the release tarball; I expect it to work around the issue.
https://bitbucket.org/mpi4py/mpi4py/commits/fda0cd06aabc88799e378997a9852bde259aeb62
Once you confirm it works (I do not have easy access to any big-endian platform), I'll merge it into the maint and master branches.

PS: I wrote that workaround in mpi4py assuming that if Open MPI 1.10.3 is ever released, it will include a proper fix for this issue. Otherwise, the issue will pop up again and we will need to update the preprocessor version-check guard.
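
For context, that guard is just a preprocessor version check of this kind (a sketch, not the exact mpi4py code; the cutoff encodes the assumption that 1.10.3 ships the fix, and the macros are the ones Open MPI's mpi.h defines):

#include <mpi.h>

/* Compile the big-endian window-attribute workaround only for Open MPI
 * releases older than 1.10.3. */
#if defined(OPEN_MPI) && \
    (OMPI_MAJOR_VERSION * 10000 + \
     OMPI_MINOR_VERSION * 100 + \
     OMPI_RELEASE_VERSION) < 11003
#  define NEED_WIN_ATTR_WORKAROUND 1
#else
#  define NEED_WIN_ATTR_WORKAROUND 0
#endif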

@amckinstry
Author

I've tested the patch on s390x, and unfortunately it segfaults:

testIsThreadMain (test_threads.TestMPIThreads) ... ok
testThreadLevels (test_threads.TestMPIThreads) ... ok
testAttributes (test_win.TestWinAllocateSelf) ...
[zelenka:08216] *** Process received signal ***
[zelenka:08216] Signal: Segmentation fault (11)
[zelenka:08216] Associated errno: Unknown error 6932272 (6932272)
[zelenka:08216] Signal code: Address not mapped (1)
[zelenka:08216] Failing at address: 0xf1da23eadcf000
[zelenka:08216] [ 0] [0x3ffffb57250]
[zelenka:08216] [ 1] /home/mckinstry/mpi4py-2.0.0/build/lib.linux-s390x-2.7/mpi4py/MPI.so(+0x3a51a)[0x3fffc39d51a]

I'm still awaiting an update of openmpi (experimental) on the test system so that I can test the proposed fix for openmpi.


@dalcinl
Contributor

dalcinl commented Mar 14, 2016

@amckinstry Do you have shell access to the machine? Any chance you can run it under valgrind? Otherwise, could you add a quick printf() for the value of the pos variable in my patch?

@ggouaillardet
Contributor

@dalcinl I was unable to test that today.
To me, fixing ompi bugs in mpi4py is like opening Pandora's box ...

Anyway, what about something like this when reading the attribute value back (attrval being the void ** that receives the value; htonl() comes from <arpa/inet.h>)?

if (1 == htonl(1)) {
    /* hack for big-endian archs: the value is presumably stored in an
     * MPI_Aint-sized slot, so the meaningful low-order int sits in the
     * second half; advance the returned pointer by one int. */
    int *v = *(int **)attrval;
    *attrval = v + 1;
}

That being said, I have no idea what will happen if/when attrval is freed ...

@amckinstry
Author

Just to confirm: I got access to an s390x with the patched openmpi installed, and mpi4py builds and tests correctly, i.e. the patch works.


@ggouaillardet
Contributor

@amckinstry @dalcinl The fix has landed in the v1.10 and v2.x branches, which means it will be available when Open MPI 1.10.3 and 2.0.0 are released.

I will close this issue now.
