Fatal error when running NormalModes under Open MPI #3
Comments
Sorry, I don't know. I thought they might have solved this problem. It's a pity they didn't.
I think that this is going to be resolved as a bad user build of the overall stack (i.e., I think there might have been some remnants of a mixed Intel MPI + Open MPI build in there somewhere). Doing a completely clean, reproducible build from scratch seems to have fixed the MPI handle munging issue.
With a little more testing to convince myself, I'm re-opening this issue. Although we were initially incorrect about the reason, the end effect is the same: ParMETIS is getting a Fortran MPI communicator handle, and Open MPI (rightfully) invokes an MPI exception. The mechanism for how this is happening is a little different than we thought, however. Specifically, ParMETIS does have Fortran API entry points that invoke the correct MPI "f2c" conversion of handles. However, NormalModes is somehow bypassing those ParMETIS Fortran API entry points and directly invoking the ParMETIS C API entry point. This leads to MPI handles not being converted from Fortran to C properly, and therefore Open MPI (rightfully) invokes an MPI exception (which, by default, aborts the job). You can see the call stack in gdb, for example (I inserted a call to
You can see that
You can see the usual convention of
I'm not enough of a Fortran expert to know why this is happening, but I suspect that NormalModes' use of
Even though this is quite definitely wrong, it just happens to work with MPICH for the reasons previously cited: MPICH's MPI handles are integers in both C and Fortran. Open MPI's handles are pointers in C; if a Fortran MPI handle is passed to an Open MPI C function, it will (rightfully) go "kaboom".
You can argue who is at fault here (ParMETIS or NormalModes), but ParMETIS looks like it is abandonware, so the chances of something being fixed in it are pretty low. Regardless, until something is changed, NormalModes will not work with any version of Open MPI because of this issue.
PR #4 opened to address this issue.
In attempting to run NormalModes under Open MPI, I notice that there is a Fortran --> C translation of an MPI_Comm that is not happening correctly between NormalModes and ParMetis (http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview).
I filed a bug with ParMetis (see http://glaros.dtc.umn.edu/flyspray/task/174), but ParMetis appears to be abandonware (** see below), so I am filing the bug here for you, NormalModes maintainers. 😄
I represent a vendor partner mentoring one of the student teams for SC'19; they are attempting to run NormalModes with Open MPI. I also happen to be one of the core developers of Open MPI.
The problem is here:
1. geometry_mod::pnm_apply_parmetis() is a Fortran subroutine that invokes ParMETIS_V3_PartKway().
2. The last parameter to ParMETIS_V3_PartKway() is an MPI communicator handle.
3. geometry_mod::pnm_apply_parmetis() passes in 0, which is actually the correct Fortran integer value for MPI_COMM_WORLD in Open MPI.
4. ParMETIS_V3_PartKway() is a C function. The last parameter is of type (MPI_Comm*). THIS IS WRONG.
MPI defines MPI handles as INTEGERs in the Fortran mpi module (which is what geometry_mod uses). The Fortran compiler/runtime will therefore pass that parameter as a pointer to an integer. MPI defines that this type must be received on the C side as (MPI_Fint*).
Specifically, ParMETIS does two things wrong:
1. The last parameter is declared as (MPI_Comm*). It needs to be (MPI_Fint*).
2. It needs to call MPI_Comm_f2c() on the (MPI_Fint) to convert it to an (MPI_Comm).
Unfortunately, ParMETIS does not do this: it just dereferences the (incorrect) (MPI_Comm*) argument and tries to use it with C MPI API calls.
In MPICH-flavored MPI implementations, this just happens to work fine because MPICH MPI handles are integers in both C and Fortran.
Open MPI's C handles, however, are pointers (which has been valid since MPI-1.0 in 1994). As such, when NormalModes effectively passes MPI_COMM_WORLD (which is defined to be integer value 0) to the ParMETIS function, ParMETIS dereferences it and gets a 0 back. When it passes 0 to the Open MPI C MPI API functions, they (correctly) interpret that as NULL and throw an MPI exception.
NOTE: The last release from ParMetis was in 2013. With a quick look through their bug tracker, it looks like the last time a maintainer assigned a bug was also back in 2013. I don't know for sure, but it looks like ParMetis is abandonware. 😢
If ParMETIS is abandonware and won't be fixed, I think the only real option you have is to write C wrappers around the ParMETIS calls -- i.e., you invoke the MPI handle f2c functions and then invoke the real underlying ParMETIS C API calls.
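To make the wrapper idea concrete, here is a minimal, self-contained sketch of the pattern. It deliberately does not use the real MPI or ParMETIS headers: every name starting with model_ is a hypothetical stand-in that only mimics the shape described above (an integer handle on the Fortran side, a pointer handle on the C side, and a table lookup playing the role of MPI_Comm_f2c()). In a real wrapper the conversion would presumably be MPI_Comm_f2c() and the callee the ParMETIS C routine; treat this as an illustration of the failure and the fix, not as the code that any eventual fix actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in types (assumptions, NOT the real MPI definitions). */
typedef struct model_comm { int size; } *model_comm_t; /* pointer handle, like Open MPI's C MPI_Comm */
typedef int model_fint_t;                               /* integer handle, like MPI_Fint */

static struct model_comm model_world = { 4 };           /* pretend "world" communicator with 4 ranks */

/* Integer-to-pointer lookup, playing the role of MPI_Comm_f2c(). */
static model_comm_t model_comm_f2c(model_fint_t f)
{
    static const model_comm_t table[] = { &model_world }; /* index 0 = world */
    if (f < 0 || (size_t)f >= sizeof table / sizeof table[0])
        return NULL;
    return table[f];
}

/* Stand-in for a C library routine that takes a pointer to a C handle.
   Returns the communicator size, or -1 where a real MPI library would
   raise an error on an invalid handle. */
static int model_c_library(model_comm_t *comm)
{
    if (comm == NULL || *comm == NULL)
        return -1;
    return (*comm)->size;
}

/* The reported failure, modeled: the Fortran integer's value is used
   as if it were a C handle, so 0 ends up as a null pointer
   (on common platforms). */
int call_without_conversion(model_fint_t *fcomm)
{
    assert(fcomm != NULL);
    model_comm_t bogus = (model_comm_t)(intptr_t)(*fcomm);
    return model_c_library(&bogus);
}

/* The suggested wrapper pattern: convert the handle first, then call
   the underlying C routine with a genuine C handle. */
int call_with_conversion(model_fint_t *fcomm)
{
    assert(fcomm != NULL);
    model_comm_t comm = model_comm_f2c(*fcomm);
    return model_c_library(&comm);
}
```

With a Fortran-style handle of 0, call_without_conversion() hits the invalid-handle path, while call_with_conversion() reaches the library with a valid handle. The wrapper keeps the conversion in one place, so the Fortran caller never needs to know how the C side represents handles.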