A fortran MPI program titled with "rmdir/mkdir" hangs if it is compiled with IBM xlf compiler #4514

Closed
mathbird opened this issue Nov 17, 2017 · 15 comments

@mathbird

The following Fortran program, whose program name is "rmdir" or "mkdir", hangs when it is compiled with the IBM xlf compiler. With gfortran, the test passes with OMPI. Is there any special configure requirement for XLF?

program  rmdir
include 'mpif.h'
implicit none
integer ierr
call mpi_init(ierr)
print *, 'program mkdir success'
call mpi_finalize(ierr)
end program

(gdb) bt
#0 0x0000100000e01df8 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000100000dfb004 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2 0x00001000001b5ac0 in opal_mutex_lock (m=0x100000341878 <ompi_mpi_bootstrap_mutex>) at ../../opal/threads/mutex_unix.h:137
#3 0x00001000001b6334 in ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x3fffed8b5ce0) at ../../ompi/runtime/ompi_mpi_init.c:401
#4 0x0000100000232184 in PMPI_Init (argc=0x3fffed8b5d34, argv=0x3fffed8b5d38) at pinit.c:66
#5 0x00001000000ef51c in ompi_init_f (ierr=0x3fffed8b5dc0) at pinit_f.c:84
#6 0x0000000010000b3c in rmdir ()

@mathbird
Author

mathbird commented Nov 17, 2017

Changing rmdir to rmdir0 lets the test pass.

Also, the same test without MPI passes.

program  rmdir
implicit none
integer ierr
print *, 'program mkdir success'
end program

@mathbird changed the title on Nov 17, 2017
from: fortran program titled with "rmdir/mkdir" hang with IBM xlf compiler
to: A fortran MPI program titled with "rmdir/mkdir" hangs if it is compiled with IBM xlf compiler
@rhc54
Contributor

rhc54 commented Nov 17, 2017

It seems to me that naming any program the same as a standard libc function is begging for trouble.

@mathbird
Author

mathbird commented Nov 17, 2017

I also tried "rm", "ls", and "cd" as program names; all passed. And as I said, the test passes when built with the GNU gfortran compiler.

MPICH + XLF also passes the test.

@rhc54
Contributor

rhc54 commented Nov 17, 2017

There is no substitute for luck when you introduce symbol confusion. 😄

@mathbird
Author

mathbird commented Nov 18, 2017

Actually, this came from a user. The Fortran program name should be treated as just a string, even if it is the same as a Linux command.

@jsquyres
Member

This sounds like a race condition more than it sounds like an argv0 issue.

Is there any further info about that back trace? What version of Open MPI are we talking about? How was it configured / installed? Can you supply all the information that we asked for in the github issue template?

@mathbird
Author

I built the OMPI master branch with the following configure and make commands. I don't have any further backtrace information.

autogen.pl --force  ## for developer's master branch
rm -Rf build
mkdir  build
cd     build
../configure --prefix=${installdir} \
--enable-mem-debug \
--enable-debug \
--enable-orterun-prefix-by-default
make -j 4 all
make install

@ggouaillardet
Contributor

@mathbird can you please run nm ./rmdir | grep -i rmdir ?
as previously pointed out by @rhc54, the xlf compiler might generate an rmdir symbol that conflicts with glibc's rmdir, which is used internally by Open MPI.
note neither rm nor ls are glibc functions (they are binaries, the glibc functions are unlink and readdir and friends).

assuming the xlf compiler does generate the rmdir symbol, feel free to contact IBM and ask for a rationale and/or report an issue.
that being said, i agree with @rhc54, you are looking for trouble when naming your program after a glibc (or any other library used by your program) symbol, so simply do not do that.

@mathbird
Author

mathbird commented Nov 20, 2017

[c656f6n03:~/000/tests/175856] nm pm1 | grep -i mkdir
0000000010000d20 T mkdir

The test executable is named pm1, not mkdir.

If I change program mkdir to program mkdir0, it shows:

[c656f6n03:~/000/tests/175856] nm pm1 | grep -i mkdir
0000000010000d00 T mkdir0
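
The manual nm and grep check above can be automated. The sketch below is a hypothetical helper, not part of Open MPI or xlf: it scans nm-style output for symbols defined in the text section (type T) whose names match a small, illustrative list of libc function names. The list and the function name are assumptions made for this example.

```python
# Illustrative subset of libc function names; a real check would need a fuller list.
LIBC_NAMES = {"mkdir", "rmdir", "unlink", "readdir"}

def colliding_symbols(nm_output, reserved=LIBC_NAMES):
    """Return names of defined text symbols (type 'T') that match a reserved name."""
    hits = []
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols print as "<address> <type> <name>"; undefined ones
        # ("U name") have only two fields and are skipped.
        if len(parts) != 3:
            continue
        _address, kind, name = parts
        if kind == "T" and name in reserved:
            hits.append(name)
    return hits

if __name__ == "__main__":
    print(colliding_symbols("0000000010000d20 T mkdir"))
    print(colliding_symbols("0000000010000d00 T mkdir0"))
```

Fed the two nm lines from this comment, the helper flags mkdir and ignores mkdir0, matching the observation that renaming the program avoids the problem.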

@ggouaillardet
Contributor

thanks, so my intuition was right ... this is how the xlf compiler works.
as i wrote previously, feel free to contact IBM support if you are unhappy with this behavior.
meanwhile, just refrain from naming your program after a function from glibc or any other Open MPI dependency.
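
One way to read the backtrace at the top of this issue, offered as an assumption rather than something confirmed in this thread: if a call to mkdir() or rmdir() made during MPI_Init resolves to the Fortran main program's symbol, the main program runs again, calls mpi_init a second time, and blocks on the bootstrap mutex that the first call still holds. The Python sketch below models only that re-entry pattern with a non-recursive lock. All function names are hypothetical stand-ins, and a timeout replaces the real indefinite wait.

```python
import threading

# Stand-in for a plain (non-recursive) mutex like the one in the backtrace.
bootstrap_lock = threading.Lock()

def fake_mpi_init(make_dir, timeout=0.1):
    """Hypothetical init routine: take the lock, then call whatever 'mkdir' resolved to."""
    if not bootstrap_lock.acquire(timeout=timeout):
        return "blocked"  # the real program would wait here forever
    try:
        return make_dir()
    finally:
        bootstrap_lock.release()

def real_mkdir():
    """Models the intended library function: it simply does its job."""
    return "initialized"

def fortran_main_named_mkdir():
    """Models a main program reached through the name 'mkdir'."""
    # The main program calls init again while the first call holds the lock.
    return fake_mpi_init(real_mkdir)

if __name__ == "__main__":
    print(fake_mpi_init(real_mkdir))
    print(fake_mpi_init(fortran_main_named_mkdir))
```

In this model the first call completes normally, while the second reports that the nested init could not take the lock.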

@jsquyres
Member

If this is an xlf bug, then it's not our issue to fix. Please feel free to re-open if there's something we need to fix in Open MPI. Thanks!

@mathbird
Author

Thanks. One question: why does MPICH + IBM XLF let the test pass?

@ggouaillardet
Contributor

A possible explanation is that MPICH does not invoke mkdir() internally.

@mathbird
Author

Would you explain in more detail how and where OMPI invokes rmdir/mkdir internally?

@ggouaillardet
Contributor

Open MPI creates a session_dir to store per-node information so it can be easily removed.
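
The thread does not show the Open MPI code involved. As a rough illustration only, the lifecycle of a per-node scratch directory comes down to a mkdir-style call at startup and an rmdir-style call at shutdown, which is why a program-defined symbol with either name could get in the way. The helper name and directory layout below are hypothetical and do not reflect Open MPI's actual session directory structure.

```python
import os
import tempfile

def with_session_dir(base, job_id):
    """Create a per-job scratch directory, check it, then remove it again.

    Returns (existed_while_in_use, exists_after_cleanup).
    """
    path = os.path.join(base, f"session-{job_id}")
    os.mkdir(path, 0o700)      # startup: create the directory
    try:
        existed = os.path.isdir(path)
    finally:
        os.rmdir(path)         # shutdown: remove the (empty) directory
    return existed, os.path.exists(path)

if __name__ == "__main__":
    base = tempfile.mkdtemp()  # temporary parent directory for the demo
    try:
        print(with_session_dir(base, 1))
    finally:
        os.rmdir(base)
```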
