
Fatal error when running NormalModes under Open MPI #3

Closed
jsquyres opened this issue Oct 29, 2019 · 6 comments · Fixed by #4

Comments

jsquyres (Contributor) commented Oct 29, 2019

In attempting to run NormalModes under Open MPI, I noticed that a Fortran-to-C translation of an MPI_Comm handle is not happening correctly between NormalModes and ParMetis (http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview).

I filed a bug with ParMetis (see http://glaros.dtc.umn.edu/flyspray/task/174), but ParMetis appears to be abandonware (see the note below), so I am filing the bug here for you, the NormalModes maintainers. 😄

I represent a vendor partner mentoring one of the student teams for SC'19; they are attempting to run NormalModes with Open MPI. I also happen to be one of the core developers of Open MPI.

The problem is here:

  1. geometry_mod::pnm_apply_parmetis() is a Fortran subroutine that invokes ParMETIS_V3_PartKway().
  2. The last parameter to ParMETIS_V3_PartKway() is an MPI communicator handle.
  3. geometry_mod::pnm_apply_parmetis() passes in 0, which is in fact the correct Fortran INTEGER value for MPI_COMM_WORLD in Open MPI.
  4. ParMETIS_V3_PartKway() is a C function, and its last parameter is declared as type (MPI_Comm *). THIS IS WRONG.

MPI defines MPI handles as INTEGERs in the Fortran mpi module (which is what geometry_mod uses). The Fortran compiler/runtime will therefore pass that parameter as a pointer to an INTEGER. MPI defines that this type must be received on the C side as (MPI_Fint *).

Specifically, ParMETIS does two things wrong:

  1. It defines the MPI handle in the Fortran entry point function as type (MPI_Comm*). It needs to be (MPI_Fint*).
  2. It does not convert the Fortran handle to a C handle before using it with C MPI API functions. It needs to use MPI_Comm_f2c() on the (MPI_Fint) to convert it to an (MPI_Comm).

Unfortunately, ParMETIS does neither: it simply dereferences the (incorrectly typed) (MPI_Comm *) argument and uses the result in C MPI API calls.
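
As a rough illustration of the difference (this is not ParMETIS's actual source, and the function names here are made up), the two patterns look like this in C:

#include <mpi.h>

/* Illustrative sketch only; function names are hypothetical. */

/* What ParMETIS effectively does: it treats the address of the Fortran
   INTEGER handle as an (MPI_Comm *) and dereferences it directly. */
void broken_fortran_entry(MPI_Comm *comm)
{
    int rank;
    MPI_Comm_rank(*comm, &rank);  /* under Open MPI, *comm is not a valid C handle */
}

/* What a correct Fortran-facing entry point must do: receive an (MPI_Fint *)
   and convert it with MPI_Comm_f2c() before making any C MPI calls. */
void correct_fortran_entry(MPI_Fint *fcomm)
{
    int rank;
    MPI_Comm ccomm = MPI_Comm_f2c(*fcomm);  /* Fortran INTEGER -> C MPI_Comm */
    MPI_Comm_rank(ccomm, &rank);
}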

In MPICH-flavored MPI implementations, this just happens to work fine because MPICH MPI handles are integers in both C and Fortran.

Open MPI's C handles, however, are pointers (which has been a valid choice since MPI-1.0 in 1994). As such, when NormalModes effectively passes MPI_COMM_WORLD (whose Fortran handle value is the integer 0 in Open MPI) to the ParMETIS function, ParMETIS dereferences it and gets 0 back. When it then passes that 0 to Open MPI's C MPI API functions, they (correctly) interpret it as NULL and raise an MPI exception.
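
For reference, the two handle representations look roughly like this (an approximate sketch, not verbatim from either implementation's mpi.h; the typedefs are renamed here so both can appear side by side, since both are really named MPI_Comm):

/* MPICH family: C handles are plain integers, so dereferencing a Fortran
   INTEGER handle as if it were a C handle often still yields a usable value. */
typedef int mpich_style_MPI_Comm;

/* Open MPI: C handles are pointers to internal structures, so dereferencing
   the Fortran INTEGER value 0 yields NULL rather than a valid communicator. */
typedef struct ompi_communicator_t *ompi_style_MPI_Comm;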


NOTE: The last release of ParMetis was in 2013. From a quick look through their bug tracker, it appears the last time a maintainer assigned a bug was also back in 2013. I don't know for sure, but it looks like ParMetis is abandonware. 😢

If ParMETIS is abandonware and won't be fixed, I think the only real option you have is to write C wrappers around the ParMETIS calls: wrappers that invoke the MPI handle "f2c" conversion functions and then call the real underlying ParMETIS C API.
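
For example, here is a minimal sketch of such a wrapper for ParMETIS_V3_PartKway(), assuming ParMETIS 4.0.3's idx_t/real_t types; the wrapper name is hypothetical, and the NormalModes Fortran code would call it instead of calling ParMETIS_V3_PartKway() directly:

#include <mpi.h>
#include <parmetis.h>

/* Hypothetical wrapper, callable from Fortran. It receives the communicator
   as an (MPI_Fint *), converts it with MPI_Comm_f2c(), and forwards
   everything else to the real ParMETIS C entry point. */
int pnm_parmetis_v3_partkway_wrap(idx_t *vtxdist, idx_t *xadj, idx_t *adjncy,
                                  idx_t *vwgt, idx_t *adjwgt, idx_t *wgtflag,
                                  idx_t *numflag, idx_t *ncon, idx_t *nparts,
                                  real_t *tpwgts, real_t *ubvec, idx_t *options,
                                  idx_t *edgecut, idx_t *part, MPI_Fint *fcomm)
{
    MPI_Comm ccomm = MPI_Comm_f2c(*fcomm);  /* Fortran INTEGER -> C MPI_Comm */
    return ParMETIS_V3_PartKway(vtxdist, xadj, adjncy, vwgt, adjwgt, wgtflag,
                                numflag, ncon, nparts, tpwgts, ubvec, options,
                                edgecut, part, &ccomm);
}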

Icarusradio commented

There are maintained versions of Metis and ParMetis. Maybe you can try them out.

jsquyres (Contributor, Author) commented

Looks like they have the same code problem.

I'd be happy to report the issue there, but there does not appear to be a way to open an issue on that BitBucket repository...?

[Screenshot attached: Screen Shot 2019-10-29 at 9:18:45 PM]

Icarusradio commented

Sorry, I don't know. I thought they might have solved this problem. It's a pity they didn't.

jsquyres (Contributor, Author) commented

I think that this is going to be resolved as a bad user build of the overall stack (i.e., I think there might have been some remnants of a mixed Intel MPI + Open MPI build in there somewhere). Doing a completely clean, reproducible build from scratch seems to have fixed the MPI handle munging issue.

jsquyres reopened this on Oct 31, 2019
jsquyres (Contributor, Author) commented Oct 31, 2019

With a little more testing to convince myself, I'm re-opening this issue.

Although we were initially incorrect about the reason, the end effect is the same: ParMETIS's C code is being handed a Fortran MPI communicator handle, and Open MPI (rightfully) raises an MPI exception.

However, the mechanism by which this is happening is a little different from what we originally thought.


Specifically, ParMETIS does have Fortran API entry points that invoke the correct MPI "f2c" conversion of handles. However, NormalModes is somehow bypassing those ParMETIS Fortran API entry points and directly invoking the ParMETIS C API entry point. This leads to MPI handles not being converted from Fortran to C properly, and therefore Open MPI (rightfully) invokes an MPI exception (which, by default, aborts the job).

You can see the call stack in gdb, for example (I inserted a call to sleep(3) in the ParMETIS function ParMETIS_V3_PartKway() so that I could attach a debugger and see what was going on):

(gdb) bt
#0  0x00000034916acced in nanosleep () from /lib64/libc.so.6
#1  0x00000034916acb60 in sleep () from /lib64/libc.so.6
#2  0x00000000006e3d58 in ParMETIS_V3_PartKway (vtxdist=0x2aaac6adb600, xadj=0x2aaac6ab7ec0, adjncy=0x2aaac6ae4b80, vwgt=0x2aaac6adb740, adjwgt=0x0, wgtflag=0x7fffffffa900, numflag=0x7fffffffa8f0, ncon=0x7fffffffa920, nparts=0x7fffffffa930, tpwgts=0x2aaac6aa3f00, ubvec=0x2aaac6acbe20, options=0x2d32b20 <geometry_mod_mp_pnm_apply_parmetis_$OPTIONS>, edgecut=0x7fffffffa910, part=0x2aaac6ab7d80, comm=0x7fffffffa940) at /home/jsquyres/apps/parmetis/4.0.3/src/parmetis-4.0.3/libparmetis/kmetis.c:41
#3  0x00000000004c911a in geometry_mod::pnm_apply_parmetis () at mod_geometry.f90:757
#4  0x000000000048ccd9 in geometry_mod::build_geometry () at mod_geometry.f90:67
#5  0x00000000006e3b6b in cg_evsl () at mainnm.f90:31
#6  0x0000000000432482 in main ()
#7  0x000000349161ed1d in __libc_start_main () from /lib64/libc.so.6
#8  0x0000000000432329 in _start ()

You can see that geometry_mod::pnm_apply_parmetis() is directly invoking the ParMETIS C API entry point ParMETIS_V3_PartKway(), instead of going through any of ParMETIS's Fortran API entry points:

$ nm libparmetis.a | grep ParMETIS_V3_PartKway
0000000000000964 T PARMETIS_V3_PARTKWAY
                 U ParMETIS_V3_PartKway
0000000000000a44 T parmetis_v3_partkway
0000000000000b24 T parmetis_v3_partkway_
0000000000000c04 T parmetis_v3_partkway__
0000000000000000 T ParMETIS_V3_PartKway
                 U ParMETIS_V3_PartKway

You can see the usual convention of parmetis_v3_partkway[_[_]] and PARMETIS_V3_PARTKWAY Fortran entry points. These entry points (which are themselves C functions) all perform the MPI "f2c" handle conversion before calling the C ParMETIS_V3_PartKway() function.

I'm not enough of a Fortran expert to know why this is happening, but I suspect that NormalModes' use of ISO_C_BINDING has a role to play here (i.e., it might be bypassing the "usual convention" Fortran symbol munging and directly invoking the back-end symbol instead, which has the effect of invoking ParMETIS's C API entry point rather than its Fortran API entry points).

Even though this is quite definitely wrong, it just happens to work with MPICH for the reasons previously cited: MPICH's MPI handles are integers in both C and Fortran. Open MPI's handles are pointers in C; if a Fortran MPI handle is passed to an Open MPI C function, it will (rightfully) go "kaboom".

You can argue about who is at fault here (ParMETIS or NormalModes), but ParMETIS looks like it is abandonware, so the chances of anything being fixed there are pretty low.

Regardless, until something is changed, NormalModes will not work with any version of Open MPI because of this issue.

jsquyres (Contributor, Author) commented

PR #4 opened to address this issue.

js1019 closed this as completed in #4 on Oct 31, 2019